Rainmaker Trades

Walk-Forward Validation: A Stress Test for Your Trading Strategy

Building Reliable Machine Learning Models on Rolling Windows of Data

Rainmaker
Dec 15, 2024
In this post I will be taking a look at a validation technique called walk-forward validation which is quite useful for testing time-series models. It allows us to simulate how the model would perform in a live environment, where it would be updated periodically with new data or after a set period of time.

In my previous posts I used a very simple method of validation for my strategies. The entire time-series would be divided into a training set (usually around 70-80% of the data) that would be used to train a model, which would then make predictions on the remaining out-of-sample part of the dataset. I would then judge the performance of the model on this part of the dataset and draw conclusions based on its performance metrics. You can see the visual representation of this approach on the chart below.

This out-of-sample testing is very important, since it attempts to simulate how a model would perform in conditions it has never seen before. However, this approach, while simple, has several limitations. For example, suppose a model trains on 16 years of historical data and we then test it on the next 4 years of out-of-sample data. We are happy with its performance on this test set, so we start trading. Say we trade this strategy for 1 year in the real market and it performs reasonably well. What do we do at this point? Do we just keep trading it indefinitely? Do we add more recent data to the training set and retrain the model? How do we know the model will still perform well if we do that? Walk-forward validation can help us answer these questions and give us an idea of how a model will perform across a longer timeframe.


The walk-forward technique in question utilizes a rolling window approach, where the model is trained and tested on consecutive periods. This updates the model continuously, mimicking the process we would realistically follow when trading the strategy. Additionally, if the strategy were overfit, it would eventually fall apart when forced to perform on out-of-sample data multiple times. It can be viewed as a sort of stress test of the strategy.

The chart below provides a good illustration of the process. In the example below we split the data we have into multiple train-test splits. The model will train on the first Train set 1, which spans 2000 and 2001, and then it will make predictions for Test set 1, which spans 1 year immediately after Train set 1 ends. After that we roll the train-test split window forward 1 year: the new Train set 2 will include 2001 and 2002 and the new Test set 2 will include 2003. We repeat this process multiple times until we arrive at present day. This allows us to end up with a model that was tested on many years of out-of-sample data, rather than testing the model on a single out-of-sample slice of data where its performance could be simply the result of luck.

Unanchored Walk-Forward Validation

The example above represents the unanchored approach to walk-forward validation. There is also an anchored approach, in which each training set grows as we progress through the iterations: every training period starts at the same point as the first one, and we only extend its end date by adding more recent data. Below is a visual representation of the anchored approach.

Anchored Walk-Forward Validation

For this post we will focus on the unanchored version and leave the anchored approach for a different day.
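To get a quick feel for the difference between the two variants, scikit-learn's `TimeSeriesSplit` can produce both on dummy data: by default the train set expands with every fold (anchored), while capping `max_train_size` turns it into a rolling window (unanchored). This is just a sketch for intuition; the post's own splitting code appears later.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 dummy time-ordered samples

anchored = TimeSeriesSplit(n_splits=4, test_size=4)                      # expanding train set
unanchored = TimeSeriesSplit(n_splits=4, test_size=4, max_train_size=4)  # rolling train set

for (tr_a, _), (tr_u, _) in zip(anchored.split(X), unanchored.split(X)):
    print(f"anchored train size: {len(tr_a):2d} | unanchored train size: {len(tr_u)}")
# anchored train sizes grow 4, 8, 12, 16; unanchored stay at 4
```

Both variants keep the test window strictly after the train window, which is the property that matters for time-series data.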

There are many considerations you need to take into account when trying this approach. What should be the size of the training set? How big should the test set and the window be? These are not trivial questions and the answers generally depend on what the use case is. If the window is too small, you will end up with way too many iterations. If your train set size is too small, your model might not have enough data to work with, etc. In the example below I will provide what I consider to be a good starting point, but you should experiment with it to test out different configurations.
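One way to sanity-check these trade-offs is to count how many iterations a given configuration produces. A rough back-of-the-envelope formula, assuming the test set and the rolling window are the same length as in this post (a partial final window still counts as a split):

```python
import math

def n_walk_forward_splits(total_years, train_years, test_years):
    # After the first train set, each iteration consumes test_years of new data;
    # a partial final test window still counts, hence the ceiling.
    return math.ceil((total_years - train_years) / test_years)

# SPY has roughly 32 years of history (listed in 1993)
print(n_walk_forward_splits(32, 5, 2))   # 14
print(n_walk_forward_splits(32, 10, 1))  # 22
```

Shrinking the window from 2 years to 1 nearly doubles the number of iterations, which is worth keeping in mind when each iteration involves retraining a model.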

As usual, in the code below we download the data and add some initial indicators that will be used as features for the model. In this example we use the S&P 500 ETF SPY.

import os
import random

import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
from matplotlib import rcParams

# Fix the random seeds for reproducibility
seed = 42
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)
random.seed(seed)

# Tweaking the fonts, etc.
rcParams['figure.figsize'] = (18, 8)
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False

# Recent yfinance versions adjust prices by default, so request
# unadjusted columns to keep 'Adj Close' available
df = yf.download('SPY', auto_adjust=False).reset_index()
# Today's close-to-close return
df['returns'] = df['Adj Close'].pct_change()
# Tomorrow's close-to-close return: the quantity we try to predict
df['pct_change_future'] = df['Adj Close'].pct_change().shift(-1)
df['target'] = np.where(df['pct_change_future'] > 0, 1, 0)
# Moving averages computed from the Open price
for m in [10, 20, 50, 100]:
    df[f'ma_{m}'] = df['Open'].rolling(m).mean()
    # Is today's Open above its m-day moving average?
    df[f'feat_open_above_ma_{m}'] = np.where(df['Open'] > df[f'ma_{m}'], 1, 0)

# Moving-average crossover features: is the faster MA above the slower one?
for fast, slow in [(10, 20), (10, 50), (10, 100), (20, 50), (20, 100), (50, 100)]:
    df[f'feat_ma_{fast}_above_ma_{slow}'] = np.where(df[f'ma_{fast}'] > df[f'ma_{slow}'], 1, 0)

You will notice that, unlike in my previous posts, I use Open prices to calculate all indicators. This is because in this example our theoretical workflow when trading the strategy will be the following: the market opens and we obtain today's Open price, calculate the indicators and obtain a prediction from the model. The model attempts to predict whether tomorrow's Close price will be higher than today's Close price. Based on that prediction, we open a long position at the market close on the same day and hold it at least until the next day's close. This way we avoid any look-ahead bias and give ourselves plenty of time to get the prediction and set up an order to open or close the position at the market close.

In my other examples the workflow was different: we would wait until 10 minutes before the close, get a prediction using the latest available price and open or close the position at the market close that same day. But that approach requires us to move quickly, and we would also need historical intraday prices to train the model properly without bias.

You should think about which approach suits your style better, as well as what data is available to you.
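To make the label alignment concrete, here is a toy check (with made-up prices) of the `shift(-1)` target defined earlier: row t is labeled by the close-to-close move from day t to day t+1, so nothing computed at the open of day t can see its own label.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, just to check the alignment
toy = pd.DataFrame({'Adj Close': [10.5, 11.5, 11.0, 13.5]})

# Same target construction as above
toy['pct_change_future'] = toy['Adj Close'].pct_change().shift(-1)
toy['target'] = np.where(toy['pct_change_future'] > 0, 1, 0)

print(toy['target'].tolist())  # [1, 0, 1, 0]
```

Row 1 is labeled 0 because the next day's close (11.0) is below 11.5. Note that the last row's label is meaningless (its future return is NaN), which is why NaN rows must be dropped before training.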

Next, we calculate the 14-day RSI indicator to use in the model and install the modelling library. In this example we will be using a CatBoost model, which I have used in previous posts.

df['diff'] = df['Open'].diff(1)
df['gain'] = df['diff'].clip(lower=0)
df['loss'] = -df['diff'].clip(upper=0)
df['avg_gain'] = df['gain'].rolling(window=14, min_periods=1).mean()
df['avg_loss'] = df['loss'].rolling(window=14, min_periods=1).mean()
df['rs'] = df['avg_gain'] / df['avg_loss']
df['rsi_14'] = 100 - (100 / (1 + df['rs']))

# Remove temporary columns
df.drop(['diff', 'gain', 'loss', 'avg_gain', 'avg_loss', 'rs'], axis=1, inplace=True)

# RSI above 14
df['feat_rsi_14_above_14'] = np.where(df['rsi_14'] > 14, 1, 0)

# RSI above 80
df['feat_rsi_14_above_80'] = np.where(df['rsi_14'] > 80, 1, 0)
df = df.dropna()
!pip install catboost==1.2.2

Now let's split our data into multiple train-test splits. Each train set will consist of 5 years of data. This seems like a decent choice, since the average economic cycle in the US lasts around 5.5 years. Our test set and rolling window will both be 2 years. This leaves us with 14 train-test sets, a manageable number.

from datetime import timedelta  # Import timedelta for date calculations

# Ensure the index is in datetime format
df.index = pd.to_datetime(df.Date)

# Initialize lists for train-test splits
train_test_splits = []

# Parameters
train_years = 5  # Number of years in the training set
test_years = 2   # Number of years in the test set

# Rolling split
start_date = df.index.min()  # Earliest date
end_date = df.index.max()    # Latest date

while True:
    train_end_date = start_date + timedelta(days=365 * train_years) - timedelta(days=1)
    test_end_date = train_end_date + timedelta(days=365 * test_years)

    # Ensure the final test set includes all remaining data
    if test_end_date > end_date:
        test_end_date = end_date  # Extend to the end of the dataset

    # Create train and test sets
    train_data = df[(df.index >= start_date) & (df.index <= train_end_date)]
    test_data = df[(df.index > train_end_date) & (df.index <= test_end_date)]

    train_test_splits.append((train_data, test_data))

    # Break if this is the final split
    if test_end_date == end_date:
        break

    # Roll the window forward by the length of the test set (2 years)
    start_date += timedelta(days=365 * test_years)

# Print the splits
for i, (train, test) in enumerate(train_test_splits):
    print(f"Split {i+1}")
    print(f"Train period: {train.index.min()} to {train.index.max()}")
    print(f"Test period: {test.index.min()} to {test.index.max()}\n")

You can see above that we have 14 splits, where each train set has 5 years of data and each test set is 2 years, except the last one, which ended up much smaller. Now let’s store the splits in dictionaries and also take a look at the size of each split.
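The rest of the modelling code follows in the paid section, but the shape of the walk-forward loop itself is simple: fit a fresh model on each train window and score it on the matching test window. Below is a minimal sketch, using logistic regression as a stand-in for the post's CatBoost model and assuming the `feat_` columns and `target` defined above.

```python
from sklearn.linear_model import LogisticRegression  # stand-in for CatBoostClassifier
from sklearn.metrics import accuracy_score

def walk_forward_scores(splits, feature_cols, target_col='target'):
    """Fit a fresh model on each train window and score it on its own test window."""
    scores = []
    for train, test in splits:
        model = LogisticRegression(max_iter=1000)
        model.fit(train[feature_cols], train[target_col])
        scores.append(accuracy_score(test[target_col], model.predict(test[feature_cols])))
    return scores
```

With the splits from above, `walk_forward_scores(train_test_splits, [c for c in df.columns if c.startswith('feat_')])` would return one out-of-sample accuracy per split; consistently weak scores across many windows are a strong hint the strategy would not survive live trading.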
