Furthermore, when working with financial data, we can bring practitioners’ knowledge of markets to bear on our validation procedures. Because markets are competitive, factors decay over time: signals that worked well in the past may no longer work today. For this reason, we should generally test and validate on the most recent data available, since performance on the recent past is arguably the most demanding test.
The design of a model may cause it to perform better or worse in different market regimes, so the most recent period may not be a regime in which the model would perform well. Even so, we generally prefer to test on the most recent data, because it shows whether the model works in the period most similar to the present. In practice, of course, before investing a lot of money in a strategy, we would let time elapse without changing the model and evaluate its performance on this truly out-of-sample data. This is known as “paper trading”.
In summary, the most common practice is to keep a block of data from the most recent time period as your test set.
The data are then split into train, validation, and test sets in chronological order, with the earliest block used for training, the next for validation, and the most recent for testing.
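As a rough illustration of keeping the most recent block as the test set, the sketch below splits a list of dates chronologically. The function name `chronological_split`, its parameters, and the ten sample days are assumptions made for this example; they do not come from the course material.

```python
from datetime import date, timedelta


def chronological_split(dates, n_valid, n_test):
    """Split dates into train/valid/test blocks, oldest first.

    The last n_test unique dates form the test set, the n_valid dates
    just before them form the validation set, and the rest is training.
    """
    unique_dates = sorted(set(dates))
    if n_valid + n_test >= len(unique_dates):
        raise ValueError("not enough dates left for a training set")
    test_start = len(unique_dates) - n_test
    valid_start = test_start - n_valid
    return (
        unique_dates[:valid_start],
        unique_dates[valid_start:test_start],
        unique_dates[test_start:],
    )


# Ten consecutive calendar days as stand-in data (hypothetical)
days = [date(2020, 1, 1) + timedelta(days=i) for i in range(10)]
train_days, valid_days, test_days = chronological_split(days, n_valid=2, n_test=2)
```

Here the sizes are given as a number of days rather than as fractions, which is just one possible design choice; the point is only that every training date precedes every validation date, which in turn precedes every test date.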
When working with data indexed by asset and day, it’s important not to split data for the same day across sets, even when the rows belong to different assets. Doing so introduces a subtle form of lookahead bias. For example, suppose data for Coca-Cola and Pepsi on the same day ended up in different sets. Since they are very similar companies, one might expect their share prices to be correlated. If the model were trained on data from one company and then validated on same-day data from the other, it might “learn” about a price movement that affects both companies and therefore show artificially inflated performance on the validation set.
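One way to avoid this is to assign whole days, rather than individual rows, to each set. The following is a minimal sketch under that assumption: `split_rows_by_date` is a hypothetical helper, the tickers stand in for the two companies, and the values are made-up placeholders rather than real returns.

```python
from datetime import date, timedelta


def split_rows_by_date(rows, n_valid_days, n_test_days):
    """Split (day, asset, value) rows so that each day lands in exactly one set."""
    unique_days = sorted({row[0] for row in rows})
    n = len(unique_days)
    if n_valid_days + n_test_days >= n:
        raise ValueError("not enough days left for a training set")
    # The most recent days go to test, the days just before them to validation.
    test_days = set(unique_days[n - n_test_days:])
    valid_days = set(unique_days[n - n_test_days - n_valid_days:n - n_test_days])
    train, valid, test = [], [], []
    for row in rows:
        day = row[0]
        if day in test_days:
            test.append(row)
        elif day in valid_days:
            valid.append(row)
        else:
            train.append(row)
    return train, valid, test


# Five days of placeholder values for two assets (made-up data)
rows = [
    (date(2020, 1, 1) + timedelta(days=i), ticker, 0.0)
    for i in range(5)
    for ticker in ("KO", "PEP")
]
train_rows, valid_rows, test_rows = split_rows_by_date(rows, n_valid_days=1, n_test_days=1)
```

Because the split is decided per day, both assets for any given day always end up in the same set, so same-day information cannot leak from training into validation or testing.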
Source: AI for Trading, Udacity