Methods for choosing training, validation, and test sets for time-series data work a little differently from the methods described so far. There are two main reasons we cannot use the previously described methods exactly as described:

First, we want our validation and testing procedure to mimic the way our model would work in production. In production, it is impossible to train on data from the future. Accordingly, training on data that occurred later in time than the validation or test data is problematic.

Second, time-series data can be autocorrelated: data from later times can depend on data from earlier times. Leaving out an observation therefore does not remove all of the information associated with it, because that information is still carried by correlated observations.

How do we modify cross-validation procedures to handle time-series data? A common method is to divide the data in the following manner:

[Figure: time-series-validation-2.png]

This way, each training set consists only of observations that occurred before the observations that form the validation set. Likewise, both the training and validation sets consist only of observations that occurred before the observations that form the test set. Thus, no future observations can be used in constructing the forecast.
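
The splitting scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `time_ordered_splits`, its parameters, the expanding training window, and the equal-sized validation blocks are all assumptions made for the example. It works on positional indices, so the data must already be sorted by time.

```python
def time_ordered_splits(n_obs, n_folds, test_size):
    """Return (folds, test) index ranges for time-ordered validation.

    Hypothetical helper for illustration. Assumes observations are
    sorted by time, so a smaller index means an earlier observation.
    """
    if n_folds < 1 or test_size < 1:
        raise ValueError("n_folds and test_size must be at least 1")

    # Hold out the most recent observations as the final test set.
    n_dev = n_obs - test_size

    # Split the remaining data into n_folds + 1 blocks: the first block
    # is training-only, each later block serves once as a validation set.
    block = n_dev // (n_folds + 1)
    if block < 1:
        raise ValueError("not enough observations for this many folds")

    # Any leftover observations go into the earliest training block.
    offset = n_dev - (n_folds + 1) * block

    folds = []
    for k in range(1, n_folds + 1):
        train_end = offset + k * block
        train = range(0, train_end)                 # everything before the validation block
        val = range(train_end, train_end + block)   # the next block in time
        folds.append((train, val))

    test = range(n_dev, n_obs)
    return folds, test


if __name__ == "__main__":
    folds, test = time_ordered_splits(n_obs=12, n_folds=3, test_size=2)
    for train, val in folds:
        print("train:", list(train), "validate:", list(val))
    print("test:", list(test))
```

Each training range ends exactly where its validation range begins, and the test range starts after the last validation range, so no fold ever trains on observations from its own future. A fixed-length (rolling) training window would be an equally reasonable choice; the expanding window is used here only because it is the simpler one to write down.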


Source: AI for Trading, Udacity