In this notebook, we look at the performance of the functions used in the notebook from this community post. The goal is to help you understand which operations are fast and which are slow, so you can optimize the way you work with that notebook.

In [1]:

```
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.alpha_vertex import precog_top_100
from quantopian.pipeline.data import EquityPricing, factset
from quantopian.pipeline.factors import Returns, SimpleBeta, SimpleMovingAverage
from quantopian.pipeline.filters import QTradableStocksUS
from quantopian.research import run_pipeline
import time
import pandas as pd
import numpy as np
START = pd.Timestamp("2010-01-05")
END = pd.Timestamp("2017-01-01")
```

The speed at which premium data is loaded can vary widely. Two factors affect load times:

- Load speed depends on the number of notebooks and algorithms accessing any premium dataset at the same time. Generally speaking, contest algorithms are run between 3:30 AM and 8:30 AM ET. Many contest algorithms use premium datasets, so loading premium data during this window is slower than normal.
- Loading data for the first time is slow. However, if you rerun a computation that you ran recently (that is, load data in the same way), the load time will be much faster.

In [2]:

```
universe = QTradableStocksUS()
```

In [3]:

```
pipe = Pipeline(columns={'alpha': precog_top_100.predicted_five_day_log_return.latest}, screen=universe)
starttime = time.time()
results_precog = run_pipeline(pipe, start_date=START, end_date=END).dropna()
print "Running a pipeline with precog data took %.2f seconds." % (time.time() - starttime)
```

If we run the exact same computation, the load time improves significantly. Restarting the notebook negates this effect.

In [4]:

```
starttime = time.time()
results_precog = run_pipeline(pipe, start_date=START, end_date=END).dropna()
print "Running a pipeline with precog data took %.2f seconds." % (time.time() - starttime)
```
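The first-run versus repeat-run comparison in the two cells above can be wrapped in a small helper so that the timings are collected side by side. This is only a minimal sketch: `time_repeated` and `fake_load` are hypothetical names introduced here for illustration, and `fake_load` is a stand-in for an expensive call such as `run_pipeline(pipe, start_date=START, end_date=END)`.

```python
import time


def time_repeated(fn, repeats=2):
    """Call fn() `repeats` times and return the wall-clock seconds of each call."""
    durations = []
    for _ in range(repeats):
        start = time.time()
        fn()
        durations.append(time.time() - start)
    return durations


def fake_load():
    # Hypothetical stand-in for an expensive call such as
    # run_pipeline(pipe, start_date=START, end_date=END).
    time.sleep(0.01)


durations = time_repeated(fake_load, repeats=2)
for i, seconds in enumerate(durations):
    print("Run %d took %.2f seconds." % (i + 1, seconds))
```

In the notebook itself, one could pass `lambda: run_pipeline(pipe, start_date=START, end_date=END)` instead of `fake_load`. The first entry of the returned list would then be the initial load and the later entries the repeated loads.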

Core datasets have a different backend implementation that supports much faster load times. The Fundamentals dataset is orders of magnitude larger than the `precog_top_100` dataset, but it loads much more quickly. We are looking to add new datasets from FactSet in a similar way to how we added the core datasets.

In [5]:

```
pipe = Pipeline(columns={'alpha': factset.Fundamentals.mkt_val.latest}, screen=universe)
starttime = time.time()
results_mcap = run_pipeline(pipe, start_date=START, end_date=END).dropna()
print "Running a pipeline with fundamental data took %.2f seconds." % (time.time() - starttime)
```

In [6]:

```
pipe = Pipeline(columns={'alpha': EquityPricing.close.latest}, screen=universe)
starttime = time.time()
results_price = run_pipeline(pipe, start_date=START, end_date=END).dropna()
print "Running a pipeline with pricing data took %.2f seconds." % (time.time() - starttime)
```

Getting factor loadings is usually quick.

In [7]:

```
from quantopian.research.experimental import get_factor_returns, get_factor_loadings
```

In [8]:

```
assets = results_precog.index.levels[1]
```

In [9]:

```
starttime = time.time()
# Load risk factor loadings and returns
factor_loadings = get_factor_loadings(assets, START, END + pd.Timedelta(days=30))
factor_returns = get_factor_returns(START, END + pd.Timedelta(days=30))
print "Getting factor loadings and returns took %.2f seconds." % (time.time() - starttime)
```

Getting pricing data from `get_pricing` is usually quick.

In [10]:

```
starttime = time.time()
pricing = get_pricing(assets, START, END + pd.Timedelta(days=30), fields="close_price")
print "Getting pricing data for alphalens took %.2f seconds." % (time.time() - starttime)
```

In [11]:

```
import alphalens as al
```

The `get_clean_factor_and_forward_returns` function in `alphalens.utils` appears to be the bottleneck in this notebook. It takes about 3 minutes to run. Further in the notebook, it gets called 5 times to generate a single plot.

In [12]:

```
starttime = time.time()
factor_data_total = al.utils.get_clean_factor_and_forward_returns(
    results_precog['alpha'],
    pricing,
    periods=range(1, 15))
print "get_clean_factor_and_forward_returns took %.2f seconds." % (time.time() - starttime)
```

In [13]:

```
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import empyrical as ep
import alphalens as al
import pyfolio as pf
from quantopian.research.experimental import get_factor_returns, get_factor_loadings


def compute_specific_returns(total_returns, factor_returns, factor_loadings):
    # Specific returns = total returns minus the returns explained by risk factors.
    factor_returns.index = factor_returns.index.set_names(['dt'])
    factor_loadings.index = factor_loadings.index.set_names(['dt', 'ticker'])
    common_returns = factor_loadings.mul(factor_returns).sum(axis='columns').unstack()
    specific_returns = total_returns - common_returns
    return specific_returns


def factor_portfolio_returns(factor, pricing, equal_weight=True, delay=0):
    if equal_weight:
        factor = np.sign(factor)
        bins = (-1, 0, 1)
        quantiles = None
        zero_aware = False
    else:
        bins = None
        quantiles = 5
        zero_aware = True

    pos = factor.unstack().fillna(0)
    pos = (pos / pos.abs().sum()).reindex(pricing.index).ffill().shift(delay)
    # Fully invested, shorts show up as cash
    pos['cash'] = pos[pos < 0].sum(axis='columns')

    # One call to the slow alphalens function per portfolio.
    factor_and_returns = al.utils.get_clean_factor_and_forward_returns(
        pos.stack().loc[lambda x: x != 0],
        pricing, periods=(1,), quantiles=quantiles, bins=bins,
        zero_aware=zero_aware)

    return al.performance.factor_returns(factor_and_returns)['1D'], pos


def plot_ic_over_time(factor_data, label='', ax=None):
    mic = al.performance.mean_information_coefficient(factor_data)
    # Turn period labels such as '5D' into integer day counts.
    mic.index = mic.index.map(lambda x: int(x[:-1]))
    ax = mic.plot(label=label, ax=ax)
    ax.set(xlabel='Days', ylabel='Mean IC')
    ax.legend()
    ax.axhline(0, ls='--', color='k')


def plot_cum_returns_delay(factor, pricing, delay=range(5), ax=None):
    if ax is None:
        fig, ax = plt.subplots()

    # One factor_portfolio_returns call (and so one alphalens call) per delay.
    for d in delay:
        portfolio_returns, _ = factor_portfolio_returns(factor, pricing, delay=d)
        ep.cum_returns(portfolio_returns).plot(ax=ax, label=d)
    ax.legend()
    ax.set(ylabel='Cumulative returns', title='Cumulative returns if factor is delayed')


def plot_exposures(risk_exposures, ax=None):
    rep = risk_exposures.stack().reset_index()
    rep.columns = ['dt', 'factor', 'exposure']
    sns.boxplot(x='exposure', y='factor', data=rep, orient='h', ax=ax,
                order=risk_exposures.columns[::-1])


def plot_overview_tear_sheet(factor, prices, factor_returns, factor_loadings, periods=range(1, 15)):
    # Use the `prices` argument throughout rather than the global `pricing`.
    stock_rets = prices.pct_change()
    stock_rets_specific = compute_specific_returns(stock_rets, factor_returns, factor_loadings)
    cr_specific = ep.cum_returns(stock_rets_specific, starting_value=1)

    factor_data_total = al.utils.get_clean_factor_and_forward_returns(
        factor,
        prices,
        periods=periods)

    factor_data_specific = al.utils.get_clean_factor_and_forward_returns(
        factor,
        cr_specific,
        periods=periods)

    portfolio_returns, portfolio_pos = factor_portfolio_returns(factor, prices)

    factor_loadings.index = factor_loadings.index.set_names(['dt', 'ticker'])
    portfolio_pos.index = portfolio_pos.index.set_names(['dt'])
    risk_exposures_portfolio, perf_attribution = pf.perf_attrib.perf_attrib(
        portfolio_returns,
        portfolio_pos,
        factor_returns,
        factor_loadings,
        pos_in_dollars=False)

    fig = plt.figure(figsize=(16, 16))
    gs = plt.GridSpec(4, 4)

    ax1 = plt.subplot(gs[0:2, 0:2])
    plot_ic_over_time(factor_data_total, label='Total returns', ax=ax1)
    plot_ic_over_time(factor_data_specific, label='Specific returns', ax=ax1)

    ax2 = plt.subplot(gs[0:2, 2:4])
    plot_cum_returns_delay(factor, prices, ax=ax2)

    ax3 = plt.subplot(gs[2:4, 0:2])
    plot_exposures(risk_exposures_portfolio.reindex(columns=perf_attribution.columns),
                   ax=ax3)

    ax4 = plt.subplot(gs[2:4, 2])
    ep.cum_returns_final(perf_attribution).plot.barh(ax=ax4)
    ax4.set(xlabel='Cumulative returns')

    ax5 = plt.subplot(gs[2:4, 3], sharey=ax4)
    perf_attribution.apply(ep.annual_volatility).plot.barh(ax=ax5, color='r')
    ax5.set(xlabel='Ann. volatility')

    gs.tight_layout(fig)
```

The second plot calls `get_clean_factor_and_forward_returns` 5 times. All in all, the function seems to get called 8 times (based on the number of warning messages). This is the primary reason why generating these plots takes so long.
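Rather than inferring the call count from warning messages, the calls could be counted directly by wrapping the function. The following is a rough sketch and not part of alphalens: `count_calls` and `slow_step` are hypothetical names introduced here, and `slow_step` merely stands in for an expensive function such as `al.utils.get_clean_factor_and_forward_returns`.

```python
import functools
import time


def count_calls(fn):
    """Wrap fn so that each call is counted and its wall-clock time accumulated."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return fn(*args, **kwargs)
        finally:
            # Record the call even if fn raised an exception.
            wrapper.calls += 1
            wrapper.seconds += time.time() - start
    wrapper.calls = 0
    wrapper.seconds = 0.0
    return wrapper


@count_calls
def slow_step(x):
    # Hypothetical stand-in for an expensive function.
    return x * 2


results = [slow_step(d) for d in range(5)]
print("slow_step was called %d times, %.2f seconds in total."
      % (slow_step.calls, slow_step.seconds))
```

In the notebook, assigning `al.utils.get_clean_factor_and_forward_returns = count_calls(al.utils.get_clean_factor_and_forward_returns)` before running the tear sheet should make the count available afterwards as `al.utils.get_clean_factor_and_forward_returns.calls`, assuming every caller looks the function up through `al.utils` as the code in this notebook does.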

In [14]:

```
starttime = time.time()
plot_overview_tear_sheet(results_precog['alpha'], pricing, factor_returns, factor_loadings)
print "plot_overview_tear_sheet took %.2f seconds." % (time.time() - starttime)
```

There are two operations that can take significant time:

- Loading a computation that depends on a "premium" dataset for the first time. Running the same computation in the same kernel (without restarting the notebook) will be much faster. This load time can also vary based on how many people are loading it at once. The busiest time is 3:30AM - 8:30AM ET, when contest backtests are run on a daily basis.
- `alphalens.utils.get_clean_factor_and_forward_returns` is quite slow. We will have to revisit this function and see if we can speed it up. Unfortunately, I'm not aware of a workaround right now.