
Tackling overfitting via cross-validation over quarters

by Thomas Wiecki, Quantopian Inc.

Overfitting is probably the biggest potential pitfall in algorithmic trading. You work on a factor or algorithm that looks pretty good, and you have some ideas to improve it so that it looks even better. Excitedly, you turn the algorithm on, only to be dismayed when the out-of-sample performance doesn't look nearly as good as your backtest. We at Quantopian certainly observe this pattern a lot when evaluating algorithms for inclusion in the fund. If you want your algorithm to do well in the contest, you need to be very careful in this regard.

Note that the context here is not the same as in machine learning, although it is definitely related. Here, we're talking about manual factor development. As such, how we do hold-out here is also a bit different than what you might want to do with a machine learning algorithm.

We could just split our time period in two, develop on the first section and, when we're done, check that the factor still works on, e.g., the last 6 months we haven't looked at. Unfortunately, your factor might only work in certain market regimes (bull/bear market, low/high vol regimes, etc.) or time periods. So when the hold-out result doesn't look great, it is quite common to explain it away by saying "well, the markets were really weird recently, so I won't take that negative result too seriously."

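Another approach is to develop and evaluate your factor only on even quarters and keep the odd quarters as the final test. Because both sets are interleaved across the whole period, each should see a similar mix of market regimes. If the factor then fails on the test set, you are almost out of excuses.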

In [2]:
from quantopian.research import run_pipeline
from quantopian.pipeline import Pipeline
from quantopian.pipeline.factors import Returns
from quantopian.pipeline.filters import QTradableStocksUS

import pandas as pd
import numpy as np

import alphalens as al
import pyfolio as pf
import matplotlib.pyplot as plt

Next we define a very simple momentum factor (the trailing 252-day return, ranked and z-scored) that we want to hold for 10 trading days, or two weeks.

In [3]:
universe = QTradableStocksUS()
simple_momentum = Returns(window_length=252).rank().zscore()
In [4]:
pipe = Pipeline(screen=universe, columns={'simple_momentum': simple_momentum})
In [60]:
start = pd.Timestamp("2010-01-05")
end = pd.Timestamp("2017-01-01")
results = run_pipeline(pipe, start_date=start, end_date=end)
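
To make the factor definition concrete, here is a rough pandas sketch of what `Returns(window_length=...).rank().zscore()` computes on each date. It is only an illustration on made-up prices, not the Pipeline implementation: the helper name `momentum_zscore`, the tickers and the prices are hypothetical, and it assumes that the return is the percent change between the first and last row of the window and that the z-score uses the population standard deviation.

```python
import numpy as np
import pandas as pd

def momentum_zscore(prices, window_length):
    """Trailing return over `window_length` rows, ranked and z-scored per date."""
    # Percent change between the first and last row of each trailing window
    # (assumed to mirror what the Returns factor does).
    trailing_returns = prices.pct_change(window_length - 1, fill_method=None)
    # Cross-sectional rank on each date (1 = lowest trailing return).
    ranks = trailing_returns.rank(axis=1)
    # Cross-sectional z-score of the ranks; ddof=0 is an assumption.
    demeaned = ranks.sub(ranks.mean(axis=1), axis=0)
    return demeaned.div(ranks.std(axis=1, ddof=0), axis=0)

# Hypothetical prices for four made-up tickers over six business days.
toy_prices = pd.DataFrame(
    {
        "AAA": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
        "BBB": [10.0, 10.0, 10.0, 10.0, 10.0, 10.0],
        "CCC": [10.0, 9.0, 8.0, 7.0, 6.0, 5.0],
        "DDD": [10.0, 10.5, 11.0, 11.5, 12.0, 12.5],
    },
    index=pd.bdate_range("2010-01-04", periods=6),
)

# A window of 3 rows keeps the toy example short; the notebook uses 252.
toy_factor = momentum_zscore(toy_prices, window_length=3)
print(toy_factor.dropna())
```

Because the z-score is taken over ranks rather than raw returns, the factor values are evenly spaced on every date, which is why the quantile ranges in the tear sheets further down look so regular.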

Subsample the factor to one date every two weeks, as that is how we would actually trade it.

In [62]:
dts = results.index.get_level_values(0).drop_duplicates()
dts_subsampled = dts.to_series().resample('2W-MON').first().tolist()
dts_subsampled[:5]
Out[62]:
[Timestamp('2010-01-05 00:00:00'),
 Timestamp('2010-01-12 00:00:00'),
 Timestamp('2010-01-26 00:00:00'),
 Timestamp('2010-02-09 00:00:00'),
 Timestamp('2010-02-23 00:00:00')]
In [63]:
results = results.loc[dts_subsampled]
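
The subsampling step can be tried in isolation. The following sketch applies the same `resample('2W-MON').first()` call to a plain business-day calendar, which is a hypothetical stand-in for the pipeline dates (market holidays are ignored, and the variable names are made up for illustration).

```python
import pandas as pd

# Hypothetical stand-in for the pipeline's daily dates: plain business days,
# ignoring market holidays.
daily_dates = pd.bdate_range("2010-01-05", "2010-03-05")

# Bin the dates into two-week windows that close on a Monday and keep the
# first available date of each window as the rebalance date.
rebalance_dates = daily_dates.to_series().resample("2W-MON").first().tolist()

for date in rebalance_dates:
    print(date.date(), date.day_name())
```

Since each window closes on a Monday, the first date in every full window is the following Tuesday, unless that day is missing from the calendar.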

Our split function returns the factor rows from either the even quarters (train) or the odd quarters (test).

In [73]:
def split_quarters(factor, train_test='train'):
    """Return rows from even quarters ('train') or odd quarters (any other value)."""
    odd_even = 0 if train_test == 'train' else 1
    return factor.iloc[factor.index.get_level_values(0).quarter % 2 == odd_even]

Split our factor results into the train component.

In [74]:
results_train = split_quarters(results, train_test='train')
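
To check which calendar months end up in each set, the split can be run on a toy frame. This sketch repeats the helper so that it runs on its own; the single asset "AAA" and the monthly dates are made up for illustration.

```python
import pandas as pd

def split_quarters(factor, train_test="train"):
    # Same logic as the notebook's helper: even quarters (Q2, Q4) form the
    # train set, odd quarters (Q1, Q3) form the hold-out set.
    odd_even = 0 if train_test == "train" else 1
    quarters = factor.index.get_level_values(0).quarter
    return factor.iloc[quarters % 2 == odd_even]

# Hypothetical factor values: one made-up asset, one date per month of 2016.
dates = pd.date_range("2016-01-01", periods=12, freq="MS")
index = pd.MultiIndex.from_product([dates, ["AAA"]], names=["date", "asset"])
toy = pd.DataFrame({"simple_momentum": range(12)}, index=index)

train = split_quarters(toy, "train")
test = split_quarters(toy, "test")

train_months = sorted(set(train.index.get_level_values(0).month))
test_months = sorted(set(test.index.get_level_values(0).month))
print("train months:", train_months)
print("test months: ", test_months)
```

The two sets are disjoint and together cover every row, so nothing seen during development leaks into the hold-out quarters.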

Analyze factor with alphalens

In [66]:
assets = results_train.index.levels[1]
pricing = get_pricing(assets, start, end + pd.Timedelta(days=30), fields="close_price")
In [67]:
factor_train = al.utils.get_clean_factor_and_forward_returns(results_train['simple_momentum'], pricing,
                                                             periods=[10])
Dropped 1.1% entries from factor data: 1.1% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
In [69]:
al.tears.create_full_tear_sheet(factor_train)
Quantiles Statistics
                       min        max       mean        std   count    count %
factor_quantile
1                -1.730268  -0.325803  -1.203210   0.267923   36123  20.019730
2                -1.025159   0.279893  -0.430024   0.295982   36072  19.991465
3                -0.415741   0.960416   0.336761   0.289696   36064  19.987031
4                 0.371067   1.353257   0.993832   0.176791   36072  19.991465
5                 1.063428   1.731804   1.474874   0.140226   36106  20.010308
Returns Analysis
                                                   10D
Ann. alpha                                       0.059
beta                                            -0.113
Mean Period Wise Return Top Quantile (bps)      15.273
Mean Period Wise Return Bottom Quantile (bps)  -29.399
Mean Period Wise Spread (bps)                   41.794
<matplotlib.figure.Figure at 0x7fb61109c850>
Information Analysis
                     10D
IC Mean            0.025
IC Std.            0.159
Risk-Adjusted IC   0.159
t-stat(IC)         1.538
p-value(IC)        0.128
IC Skew           -0.058
IC Kurtosis        0.044
Turnover Analysis
                          10D
Quantile 1 Mean Turnover  0.0
Quantile 2 Mean Turnover  0.0
Quantile 3 Mean Turnover  0.0
Quantile 4 Mean Turnover  0.0
Quantile 5 Mean Turnover  0.0
                                  10D
Mean Factor Rank Autocorrelation  1.0

As you can see, we only have data for even quarters. Alphalens fortunately handles this reasonably well and just shows straight lines in periods where no data is present.

A couple of indicators suggest that we might be on to something. The mean returns are nicely spread between the top and bottom quantiles (about 42 bps per period), and the mean IC of 0.025 is reasonably high, although it is not statistically significant (p = 0.128), which should make us suspicious.
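
To see why a mean IC of this size can still fail a significance test, it helps to compute the t-statistic by hand. The sketch below assumes the tear sheet's t-stat is a one-sample t-test of the per-period IC values against zero; that is an assumption about Alphalens, not something this notebook confirms. The helper name `ic_t_stat` and the randomly drawn IC values are hypothetical.

```python
import numpy as np

def ic_t_stat(ic_values):
    """One-sample t-statistic for the hypothesis that the mean IC is zero."""
    ic_values = np.asarray(ic_values, dtype=float)
    n = ic_values.size
    mean = ic_values.mean()
    std = ic_values.std(ddof=1)  # sample standard deviation
    # The standard error of the mean shrinks with the square root of n.
    return mean / (std / np.sqrt(n))

# Hypothetical per-period IC values, drawn at random for illustration only.
rng = np.random.RandomState(0)
toy_ic = rng.normal(loc=0.025, scale=0.16, size=90)
print("mean IC: %.3f, t-stat: %.2f" % (toy_ic.mean(), ic_t_stat(toy_ic)))
```

Under the same assumption, the t-statistic is roughly the risk-adjusted IC times the square root of the number of periods. With a mean of 0.025 and a standard deviation of 0.159, it would take on the order of 160 rebalance periods to reach a t-statistic of about 2, which is more than the train set offers.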

Usually at this stage you might get a bit excited and start improving the factor. Maybe a better way to compute momentum, maybe normalize across sectors, winsorize etc. This is totally fine, you want a factor that works well. What is not fine is not having hold-out data to see if in that process you overfit your factor.

Fortunately we were wise enough to split our data, so once we are really satisfied with our factor (but not sooner!) we can move on to our hold-out set. It is crucial to realize that you can only do this once, so practice self-discipline, lest you shoot yourself in the foot.

Evaluating the factor on hold-out data

In [71]:
results_test = split_quarters(results, train_test='test')
In [72]:
factor_test = al.utils.get_clean_factor_and_forward_returns(results_test['simple_momentum'], pricing,
                                                           periods=[10])
al.tears.create_full_tear_sheet(factor_test)
Dropped 1.2% entries from factor data: 1.2% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Quantiles Statistics
                       min        max       mean        std   count    count %
factor_quantile
1                -1.729265  -0.360678  -1.210830   0.260762   35387  20.023086
2                -1.059931   0.304455  -0.452029   0.292576   35328  19.989702
3                -0.439530   0.929758   0.318451   0.292419   35326  19.988570
4                 0.288471   1.349952   0.983764   0.190968   35328  19.989702
5                 1.074236   1.731799   1.477151   0.141219   35362  20.008940
Returns Analysis
                                                   10D
Ann. alpha                                      -0.008
beta                                            -0.081
Mean Period Wise Return Top Quantile (bps)      -8.704
Mean Period Wise Return Bottom Quantile (bps)   14.111
Mean Period Wise Spread (bps)                  -20.985
<matplotlib.figure.Figure at 0x7fb624023fd0>