Tag: Random Forest

  • Note: Overlapping labels, AI for Trading

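    Problem

    The overlapping-labels problem comes up when training predictive models on financial data. As the figure below shows, suppose we want to train a model that predicts the return over the coming week. In the simplest setup, we use the return over the week following a given day T as the training label (the y). Every daily sample then has a one-week-ahead label. But financial data are autocorrelated, and the label windows of neighbouring days share most of their days, so the labels of consecutive days are usually correlated with each other. This conflicts with the assumption behind most machine learning models, which is that the input samples are independent and identically distributed (IID).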
    example-overlapping-labels.png
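
    To make the overlap concrete, here is a minimal sketch on simulated data. The daily returns are hypothetical and drawn independently, and the 5-day horizon stands in for "one week"; neither comes from the course material. Even with independent inputs, labels one day apart share four of their five days and come out strongly correlated, while labels five days apart share none.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=500)
# Hypothetical daily log returns, drawn independently (no autocorrelation).
daily = pd.Series(rng.normal(0.0, 0.01, len(dates)), index=dates)

horizon = 5  # one trading week
# Label for day T = sum of the returns on days T+1 .. T+5.
label = daily.rolling(horizon).sum().shift(-horizon).dropna()

lag1 = label.autocorr(lag=1)        # windows share 4 of 5 days
lag5 = label.autocorr(lag=horizon)  # windows share no days
print(f"autocorrelation at lag 1: {lag1:.2f}, at lag 5: {lag5:.2f}")
```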

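    Take a Random Forest as an example. If we train on the samples described above, the samples within a bag are likely to be correlated with one another, and so are the out-of-bag samples. The resulting decision trees end up similar to each other, which raises the error rate of the forest: the trees are too alike.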

    Solution 1: Sub-sampling

    example-subsample.png
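    As shown above, if the target is the return over the coming week, we can sub-sample and keep only each Friday's one-week-ahead return. The drawback is obvious: we lose a lot of training data. Imagine the target were the return over the next month or year; hardly any training data would be left.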
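
    A minimal sketch of the Friday sub-sampling, assuming a pandas Series of one-week-ahead labels indexed by business day. The data and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.bdate_range("2020-01-06", periods=60)  # 12 weeks, starting on a Monday
daily = pd.Series(rng.normal(0.0, 0.01, len(dates)), index=dates)

horizon = 5
# One-week-ahead label for every day (overlapping windows).
label = daily.rolling(horizon).sum().shift(-horizon).dropna()

# Keep only the Friday samples (dayofweek == 4): their label windows
# cover the following Monday..Friday and no longer overlap.
weekly = label[label.index.dayofweek == 4]
print(len(label), "daily labels ->", len(weekly), "weekly labels")
```

    Four out of every five samples are discarded, which is the data loss the paragraph above describes.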

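    Solution 2: Adjust the random forest's bagging procedure

    Reduce the number of samples drawn for each bag, so that the samples within a bag are less correlated with each other.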
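
    In scikit-learn, one way to shrink the bag is the `max_samples` parameter of `RandomForestClassifier`. The sketch below uses synthetic data, and the value 0.2 (roughly one sample per 5-day label window) is an illustrative assumption, not a recommendation from the course.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # hypothetical features
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0).astype(int)

clf = RandomForestClassifier(
    n_estimators=50,
    bootstrap=True,    # max_samples only applies when bootstrapping
    max_samples=0.2,   # each tree draws 20% of the rows (assumed value)
    oob_score=True,
    random_state=0,
)
clf.fit(X, y)
print("out-of-bag accuracy:", clf.oob_score_)
```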

    Solution 3: Rotating the data

    rf-subsample-and-ensemble.png
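    Building on Solution 1, suppose we still want the one-week-ahead return. We can train five different models, sub-sampling the data from Mondays, Tuesdays, …, Fridays respectively, and finally combine the five forests.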
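
    The rotation could be sketched as follows. The post does not say how the five forests are combined; averaging their predicted probabilities is an assumption made here, and the data and the helper `predict_proba_combined` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-06", periods=250)
X = pd.DataFrame(rng.normal(size=(len(dates), 3)),
                 index=dates, columns=["f0", "f1", "f2"])
y = (X["f0"] + rng.normal(size=len(dates)) > 0).astype(int)

# One forest per weekday: each sees only non-overlapping weekly samples.
forests = []
for weekday in range(5):  # 0 = Monday, ..., 4 = Friday
    mask = X.index.dayofweek == weekday
    forest = RandomForestClassifier(n_estimators=20, random_state=weekday)
    forest.fit(X[mask], y[mask])
    forests.append(forest)

def predict_proba_combined(forests, X_new):
    """Average the class probabilities of all forests (assumed combination rule)."""
    return np.mean([f.predict_proba(X_new) for f in forests], axis=0)

proba = predict_proba_combined(forests, X)
```

    Unlike Solution 1, every day's sample is used by exactly one of the five forests, so no data is thrown away.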


  • Notes – Tree-based models with financial data, AI for Trading

    Importance of Random Column Selection

    Sometimes one feature will dominate in finance. If you don’t apply some type of random feature selection, then your trees will not be that different (i.e., they will be correlated), and that reduces the benefit of ensembling.
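
    A small sketch of the effect, on synthetic data where feature 0 is deliberately made dominant. It compares how often the root split uses that feature when every column is considered at each split (`max_features=None`) versus a random subset (`max_features="sqrt"`). The helper `root_feature_share` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
# Feature 0 dominates the target by construction (illustrative assumption).
y = (3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

all_cols = RandomForestClassifier(
    n_estimators=50, max_features=None, random_state=0).fit(X, y)
rand_cols = RandomForestClassifier(
    n_estimators=50, max_features="sqrt", random_state=0).fit(X, y)

def root_feature_share(forest, feature):
    """Fraction of trees whose root node splits on the given feature."""
    roots = np.array([tree.tree_.feature[0] for tree in forest.estimators_])
    return float(np.mean(roots == feature))

share_all = root_feature_share(all_cols, 0)
share_rand = root_feature_share(rand_cols, 0)
print(f"root splits on feature 0: all columns {share_all:.2f}, "
      f"random columns {share_rand:.2f}")
```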

    What features are typically dominant? Classical, price-driven factors, like mean reversion or momentum factors, often dominate. You may also see that features that define industry sectors or market “regimes” (periods defined, for example, by high or low market volatility or other market-wide trends) sit close to the root of the tree.

    Choosing Hyperparameter Values

    In machine learning outside finance and time series, setting a hyperparameter is fairly straightforward: you use grid search cross-validation to find the value that maximizes the model’s performance on validation data. With time-series data, you typically don’t use cross-validation, because you usually want a single validation dataset that is as close in time as possible to the present (translator's note: the reasons can be found here). If your problem has a high signal-to-noise ratio, you can try a bit of parameter tuning on that single validation set. In finance, though, you have time-series data and a low signal-to-noise ratio. With only one validation set, trying a bunch of parameter values on it would almost surely overfit. As such, you need to set the parameter with some judgement and minimal trials.
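
    A minimal sketch of that workflow, assuming synthetic data, an 80/20 chronological split, and `min_samples_leaf=20` as the single judgement-based setting. None of these specifics come from the course.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
dates = pd.bdate_range("2018-01-01", periods=600)
X = pd.DataFrame(rng.normal(size=(len(dates), 4)), index=dates,
                 columns=["f0", "f1", "f2", "f3"])
y = (0.3 * X["f0"] + rng.normal(size=len(dates)) > 0).astype(int)

# Single validation set: the most recent 20% of dates, no shuffling.
split = int(len(dates) * 0.8)
X_train, X_valid = X.iloc[:split], X.iloc[split:]
y_train, y_valid = y.iloc[:split], y.iloc[split:]

# One setting picked by judgement instead of a grid search (assumed value).
clf = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=20, random_state=0)
clf.fit(X_train, y_train)
valid_accuracy = clf.score(X_valid, y_valid)
print("validation accuracy:", valid_accuracy)
```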

    Random Forests for Alpha Combination

    rf-for-alpha-combination.png

    For this type of problem, we have data that look like the above. Each row is indexed by both date and asset. We typically have several alpha factors, and we then calculate “features”, which give the random forest model additional information. For example, we may calculate date features, which the algorithm could use to learn that certain factors are particularly predictive during certain periods.
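
    Such a table might be laid out in pandas as below. The tickers, factor names and values are all hypothetical; the point is only the (date, asset) index and the date features derived from it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=4)
assets = ["AAA", "BBB", "CCC"]  # hypothetical tickers
index = pd.MultiIndex.from_product([dates, assets], names=["date", "asset"])

# Hypothetical alpha factor values, one row per (date, asset).
data = pd.DataFrame({
    "alpha_momentum": rng.normal(size=len(index)),
    "alpha_mean_reversion": rng.normal(size=len(index)),
}, index=index)

# Date features the forest can split on.
date_values = data.index.get_level_values("date")
data["month"] = date_values.month
data["weekday"] = date_values.dayofweek
print(data.head())
```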
    example-finance-tree.png
    What are we trying to predict? We’re trying to predict asset returns, but not their decimal values! We rank them relative to each other into only two buckets, so that we essentially predict the winners and losers on a given day.
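
    One way to build such a two-bucket target is to compare each asset's forward return with that day's cross-sectional median. The median cut-off is an assumption, and the returns below are made-up numbers.

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-02", "2020-01-03"])
assets = ["AAA", "BBB", "CCC", "DDD"]  # hypothetical tickers
index = pd.MultiIndex.from_product([dates, assets], names=["date", "asset"])

# Hypothetical forward returns for each (date, asset).
fwd_ret = pd.Series(
    [0.02, -0.01, 0.005, -0.03, 0.01, 0.04, -0.02, 0.00], index=index)

# Per-date median, broadcast back to every row of that date.
median = fwd_ret.groupby(level="date").transform("median")
# 1 = winner (above the day's median), 0 = loser.
target = (fwd_ret > median).astype(int)
print(target)
```

    The classifier then learns which assets land in the top half on a given day, rather than the size of the return.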


    Source: AI for Trading, Udacity