Tag: Data Processing

  • Note: Overlapping labels, AI for Trading

    Problem

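    Overlapping labels are a problem that arises when training predictive models on financial data. As the figure below shows, suppose we want to train a model that predicts the return over the coming week. In the simplest setup, the training label (the y) for a given day T is the cumulative return over the week following T, so every daily sample has a one-week-ahead label. But the labels of consecutive days are computed from largely the same days, and financial data are autocorrelated, so neighbouring labels are usually correlated with one another. This conflicts with the assumption behind most machine learning models that the input samples are independent and identically distributed (IID).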
    example-overlapping-labels.png
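
To make the overlap concrete, here is a minimal sketch that builds 5-day forward labels from simulated daily returns and measures how correlated neighbouring labels are. The simulated data, the choice of log returns (so a weekly return is a plain sum), and the helper names `forward_labels` and `autocorr` are assumptions made for illustration, not part of the course material.

```python
import random

def forward_labels(daily_returns, horizon=5):
    """Label for day t = sum of the next `horizon` daily log returns."""
    n = len(daily_returns)
    return [sum(daily_returns[t + 1 : t + 1 + horizon]) for t in range(n - horizon)]

def autocorr(xs, lag):
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / var

rng = random.Random(0)
# Assumption: IID Gaussian daily log returns, so any dependence between
# labels comes purely from the overlapping windows.
daily = [rng.gauss(0.0, 0.01) for _ in range(5000)]
labels = forward_labels(daily, horizon=5)

lag1 = autocorr(labels, 1)  # adjacent labels share 4 of their 5 days
lag5 = autocorr(labels, 5)  # labels 5 days apart share no days
print(f"lag-1 autocorrelation: {lag1:.2f}")
print(f"lag-5 autocorrelation: {lag5:.2f}")
```

Even with independent daily returns, labels one day apart share four of their five days, so their theoretical correlation is 4/5, while labels five days apart share nothing. Real returns add their own autocorrelation on top of this.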

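    Take a random forest as an example. If it is trained on the samples described above, the samples within a bag are likely to be correlated with one another, and the same goes for the out-of-bag samples. The resulting decision trees therefore tend to resemble each other, and since a forest relies on its trees being diverse, the error rate of the forest goes up.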

    Solution 1: Sub-sampling

    example-subsample.png
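    As shown above, if the target is the return over the coming week, we can sub-sample and keep only the one-week-ahead return computed each Friday. The drawback is obvious: a lot of training data is thrown away. If the target were the return over the coming month or year, hardly any training data would be left.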
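
A rough sketch of the sub-sampling step, assuming samples are indexed by consecutive trading day and a week is exactly 5 trading days (no holidays). The function name `non_overlapping_indices` and the sample counts are hypothetical.

```python
def non_overlapping_indices(n_labels, horizon, offset=0):
    """Indices of samples whose forward label windows do not overlap."""
    return list(range(offset, n_labels, horizon))

# 15 daily samples with 5-day forward labels: keep one per week.
kept = non_overlapping_indices(n_labels=15, horizon=5)
print(kept)            # [0, 5, 10]
print(len(kept) / 15)  # fraction of the data retained

# The data loss grows with the horizon: about 5 years of daily samples
# (1260) with a roughly one-month horizon (21 trading days).
monthly = non_overlapping_indices(n_labels=1260, horizon=21)
print(len(monthly))    # 60
```

Changing `offset` selects a different weekday as the sampling day under the same assumptions.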

    Solution 2: Adjust the random forest's bagging procedure

    Reduce the number of samples drawn for each bag, so that the samples within a bag are less correlated.
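
One way to see the effect is to count, for each sample in a bootstrap bag, how many other bagged samples have an overlapping label window. The sketch below does this for a full-size bag and for a bag one fifth the size. The sizes, the overlap measure, and the helper names are assumptions chosen for illustration.

```python
import random

def bootstrap_bag(n_samples, bag_size, rng):
    """Draw sample indices with replacement, as bagging does."""
    return [rng.randrange(n_samples) for _ in range(bag_size)]

def mean_overlapping_neighbours(bag, horizon):
    """Average number of other bagged samples whose label window overlaps."""
    total = 0
    for i, a in enumerate(bag):
        total += sum(1 for j, b in enumerate(bag) if i != j and abs(a - b) < horizon)
    return total / len(bag)

rng = random.Random(0)
n_samples, horizon = 500, 5

full_bag = bootstrap_bag(n_samples, bag_size=n_samples, rng=rng)
small_bag = bootstrap_bag(n_samples, bag_size=n_samples // horizon, rng=rng)

full_overlap = mean_overlapping_neighbours(full_bag, horizon)
small_overlap = mean_overlapping_neighbours(small_bag, horizon)
print(f"full bag:  {full_overlap:.2f} overlapping neighbours per sample")
print(f"small bag: {small_overlap:.2f} overlapping neighbours per sample")
```

In scikit-learn, the bag size of a random forest can be reduced through the `max_samples` parameter of `RandomForestClassifier` and `RandomForestRegressor` (with `bootstrap=True`). Whether the course uses that exact mechanism is not stated in these notes.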

    Solution 3: Rotate the data

    rf-subsample-and-ensemble.png
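    Building on Solution 1, and still assuming the target is the return over the coming week, we can train five different models, each on data sub-sampled on a different weekday (Monday, Tuesday, …, Friday), and then merge the five forests.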
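
The rotation can be sketched as follows. The model here is a deliberately trivial stand-in (it predicts the mean label of its subset) so the example stays self-contained; in practice each subset would train a forest. Combining the models by averaging their predictions is an assumption, since the notes only say the forests are merged, and all names are hypothetical.

```python
def rotating_subsets(n_labels, horizon):
    """One subset of non-overlapping sample indices per starting offset."""
    return [list(range(offset, n_labels, horizon)) for offset in range(horizon)]

def fit_mean_model(y_subset):
    """Stand-in for training one forest: always predicts the subset mean."""
    mean = sum(y_subset) / len(y_subset)
    return lambda: mean

def ensemble_predict(models):
    """Combine the per-offset models by averaging their predictions."""
    return sum(model() for model in models) / len(models)

y = [0.01 * (t % 7 - 3) for t in range(20)]  # toy labels for 20 days
subsets = rotating_subsets(len(y), horizon=5)
models = [fit_mean_model([y[t] for t in idx]) for idx in subsets]

print(subsets[0])  # [0, 5, 10, 15]
print(ensemble_predict(models))
```

Each model sees only non-overlapping labels, yet across the five models every sample is used exactly once, which addresses the data loss of Solution 1.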


  • Validation for Financial Data

    Furthermore, when working with financial data, we can bring practitioners' knowledge of markets and financial data to bear on our validation procedures. Because markets are competitive, factors decay over time: signals that worked well in the past may no longer work today. For this reason, we should generally test and validate on the most recent data available, since testing on the recent past can be considered the most demanding test.

    The design of a model may cause it to perform better or worse in different market regimes, so the most recent period may not be a regime in which the model performs well. Still, we generally prefer to test on the most recent data, to see whether the model works in the period most similar to the present. In practice, before investing a lot of money in a strategy, we would also let time elapse without changing the model and measure its performance on this truly out-of-sample data. This is known as "paper trading".

    In summary, the most common practice is to keep a block of data from the most recent time period as the test set.

    The data are then split into train, validation and test sets according to the following schematic:
    train-valid-test-time-2.png
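
A minimal sketch of such a chronological split, with the oldest data used for training and the most recent block held out for testing. The 60/20/20 proportions, the function name `chronological_split`, and the sample dates are assumptions; the schematic may use different sizes.

```python
def chronological_split(dates, valid_frac=0.2, test_frac=0.2):
    """Split unique dates into train / valid / test, oldest to newest."""
    ordered = sorted(set(dates))  # ISO date strings sort chronologically
    n = len(ordered)
    n_test = int(n * test_frac)
    n_valid = int(n * valid_frac)
    train_end = n - n_valid - n_test
    train = ordered[:train_end]
    valid = ordered[train_end : n - n_test]
    test = ordered[n - n_test :]
    return train, valid, test

dates = [f"2020-01-{day:02d}" for day in range(1, 11)]  # 10 hypothetical days
train, valid, test = chronological_split(dates)
print("train:", train[0], "to", train[-1])
print("valid:", valid)
print("test: ", test)
```

Because the split is made on sorted dates rather than on shuffled rows, every validation date is later than every training date, and every test date is later than both.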

    When working with data indexed by asset and day, it is important not to split the data for a single day across sets by asset: all assets' data for a given day should land in the same set. Otherwise the split introduces a subtle form of lookahead bias (translator's note: in a sense, it is like using future data). For example, say data from Coca-Cola and Pepsi for the same day ended up in different sets. Since they are very similar companies, one would expect their share price movements to be correlated. If the model were trained on data from one company and then validated on data from the other, it might "learn" about a price movement that affects both companies and therefore show artificially inflated performance on the validation set.
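
One simple way to respect this rule is to split on the date alone, so that the asset never influences which set a row goes to. The sketch below assumes rows stored as dictionaries with ISO date strings; the return values, the cutoff date, and the function name are made up for illustration.

```python
def split_rows_by_date(rows, cutoff_date):
    """Send every row of a given day to the same set, whatever the asset."""
    train = [row for row in rows if row["date"] < cutoff_date]
    valid = [row for row in rows if row["date"] >= cutoff_date]
    return train, valid

# Hypothetical panel data: two assets observed on the same four days.
rows = [
    {"date": "2020-01-02", "asset": "KO", "ret": 0.004},
    {"date": "2020-01-02", "asset": "PEP", "ret": 0.003},
    {"date": "2020-01-03", "asset": "KO", "ret": -0.002},
    {"date": "2020-01-03", "asset": "PEP", "ret": -0.001},
    {"date": "2020-01-06", "asset": "KO", "ret": 0.001},
    {"date": "2020-01-06", "asset": "PEP", "ret": 0.002},
    {"date": "2020-01-07", "asset": "KO", "ret": 0.005},
    {"date": "2020-01-07", "asset": "PEP", "ret": 0.006},
]

train, valid = split_rows_by_date(rows, cutoff_date="2020-01-06")
train_dates = {row["date"] for row in train}
valid_dates = {row["date"] for row in valid}
print(sorted(train_dates), sorted(valid_dates))
```

A random row-level shuffle, by contrast, could place the KO row for a given day in the training set and the PEP row for the same day in the validation set, which is exactly the leak described above.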


    Source: AI for Trading, Udacity

  • Cross Validation for Time Series

    Methods for choosing training, validation and test sets for time-series data work a little differently from the methods described so far. The main reasons we cannot use the previously described methods exactly as described are:

    We want our validation and testing procedure to mimic the way our model would work in production. In production, it is impossible to train on data from the future. Accordingly, training on data that occurred later in time than the validation or test data is problematic.
    Time series data can have the property that data from later times depend on data from earlier times (translator's note: autocorrelation?). Because of these correlations with other observations, leaving out an observation does not remove all of the information associated with it.
    How do we modify cross-validation procedures to handle time-series data? A common method is to divide the data in the following manner:
    time-series-validation-2.png

    This way, each training set consists only of observations that occurred before the observations in the corresponding validation set. Likewise, both the training and validation sets consist only of observations that occurred before the observations in the test set. Thus, no future observations can be used in constructing the forecast.
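
A small sketch of one such scheme, often called walk-forward or expanding-window validation: the final block is set aside as the test set, and each fold trains on everything before its validation block. This is an assumed reading of the figure, which may differ in detail, and the function name and sizes are hypothetical.

```python
def walk_forward_splits(n_obs, n_folds, valid_size):
    """Yield (train_idx, valid_idx) with training strictly before validation."""
    first_valid_start = n_obs - n_folds * valid_size
    if first_valid_start <= 0:
        raise ValueError("not enough observations for the requested folds")
    for fold in range(n_folds):
        start = first_valid_start + fold * valid_size
        yield list(range(start)), list(range(start, start + valid_size))

# Assumption: the last 2 of 12 observations are held back as the test set.
n_total, test_size = 12, 2
n_dev = n_total - test_size
splits = list(walk_forward_splits(n_dev, n_folds=3, valid_size=2))
test_idx = list(range(n_dev, n_total))

for train_idx, valid_idx in splits:
    print("train:", train_idx, "valid:", valid_idx)
print("test: ", test_idx)
```

scikit-learn's `TimeSeriesSplit` produces expanding-window splits of a similar kind and may be a more convenient choice in practice.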


    Source: AI for Trading, Udacity