MRF module

class MRF.MacroRandomForest(data, x_pos, oos_pos, S_pos='', y_pos=1, minsize=10, mtry_frac=0.3333333333333333, min_leaf_frac_of_x=1, VI=False, ERT=False, quantile_rate=None, S_priority_vec=None, random_x=False, trend_push=1, howmany_random_x=1, howmany_keep_best_VI=20, cheap_look_at_GTVPs=True, prior_var=[], prior_mean=[], subsampling_rate=0.75, rw_regul=0.75, keep_forest=False, block_size=12, fast_rw=True, ridge_lambda=0.1, HRW=0, B=50, resampling_opt=2, print_b=True, parallelise=True, n_cores=-1)

Bases: object

The main class to run MRF and generate predictions and generalized time-varying parameters.

Parameters

data (-) – Dataframe, including all potential columns (y, X, S) and rows (both training and testing data)
x_pos (-) – Column positions of variables selected to be time-varying. Should be a subset of exogenous variables
y_pos (-) – Column position of the prediction target
S_pos (-) – Column positions of variables entering the forest part (S_t in the paper)
oos_pos (-) – Row positions of test set/out-of-sample observations
minsize (-) – Minimal node size to attempt a split
mtry_frac (-) – Fraction of all features S_t to consider at each split. A lower value (like 0.15) helps speed things up and can be reasonable when S_t contains many correlated features
min_leaf_frac_of_x (-) – Minimal ratio of observations to regressors in a node. Given the ridge penalty and random-walk shrinkage, there is no problem in letting this be one. Suggested values are (1,1.5,2) but those usually have very little influence
VI (-) – Variable importance. Currently not supported
ERT (-) – ERT stands for “Extremely Randomized Tree”. Activating this means splitting points (but not splitting variables) are chosen randomly. This brings extra regularization and, most importantly, speed. The quantile_rate option determines how many random splits are considered: 1 means all of them, as in a usual RF; 0 is like pure ERT; all values in between are possible. This option is not used in the paper but can help exploratory work by speeding things up. It could also help in forecasting via extra regularization
quantile_rate (-) – This option has a different meaning if ERT is activated (see above). Otherwise, for early splits, it reduces the number of splitting points considered for each S. 1 means all splitting points are considered. A value between 0 and 1 means only a subset of quantiles of the splitting variable is considered. For instance, quantile_rate=0.3 means roughly one out of every three (ordered) values is considered for splitting. The aim of this option is to speed things up without sacrificing much predictive ability
S_priority_vec (-) – RF randomly selects potential splitting variables at each step. However, in large macro data sets, some types of variables could be over-represented and others under-represented. The user can specify weights for all members of S using this option, for example to down-weight an over-represented group of regressors, if that makes sense to do so
trend_push (-) – See above. Must be >=1. This option multiplies by “trend_push” the probability of the trend being included in the potential splitters set. 4 is a reasonable value with macro data. Note this can be used for anything (not necessarily a trend) that we may want to boost (in position trend_pos)
random_x (-) – Activating this lets the algorithm randomly select “howmany_random_x” regressors out of all those in “x_pos” for each tree. This is merely a predictive device, so GTVPs are not output in that case, and neither are VI measures
howmany_random_x (-) – See above. Must be between 1 and the length of “x_pos”
howmany_keep_best_VI (-) – Number of variables to keep according to the VI criterion. Currently not supported
prior_mean (-) – MRF implements a ridge prior. The user can specify a prior_mean vector of length len(x_pos)+1 that differs from a vector of zeros (e.g. [0,0,0,0]). For instance, this may help when a near-unit root is suspected. An easy (and good) data-driven prior mean vector consists of the OLS estimates from regressing y on the X’s
prior_var (-) – When using prior_mean, a prior variance vector must also be specified. Remember this alters the implicit value of “ridge_lambda”. Also, the intercept should always have a larger variance
subsampling_rate (-) – Subsampling rate for the ensemble. Controls the percentage of observations used to make each tree estimate
rw_regul (-) – Egalitarian Olympic podium random-walk shrinkage parameter (see paper). Should be between 0 (nothing) and 1
keep_forest (-) – Not currently supported. Saves all the tree structures. Switch to True if you plan to forecast using the external function “pred_given_mrf”
block_size (-) – Size of blocks for block sub-sampling (resampling_opt=2) and block bayesian bootstrap (resampling_opt=4)
fast_rw (-) – When True, “rw_regul” is only considered in the prediction step (and not in the search for splits). This speeds things up dramatically, often with little influence on results
ridge_lambda (-) – Ridge shrinkage parameter for the linear part
HRW (-) – Seldom used. See the paper’s appendix. Can be useful for very large “x_pos”
B (-) – The number of trees to include for the ensemble
resampling_opt (-) – 0 is no resampling. 1 is plain subsampling. 2 is block subsampling (recommended when looking at GTVPs). 3 is Bayesian Bootstrap. 4 is Block Bayesian Bootstrap (may do better for forecasting)
print_b (-) – If True, print which tree is currently being computed. Not supported if parallelise = True
parallelise (-) – If True, computation will run across multiple cores. Speeds up computation in (almost) precise proportion to “n_cores”
n_cores (-) – The default value of -1 specifies that all available cores will be used. Pass an integer greater than 1 to use fewer than all cores
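To make the interplay between “block_size” and “subsampling_rate” concrete, here is a minimal sketch of a generic block subsampling scheme. It is an illustration under stated assumptions, not the package's actual resampling code: the helper name block_subsample is hypothetical, and it simply keeps a fraction of consecutive, non-overlapping blocks of rows.

```python
import random


def block_subsample(n_obs, block_size=12, subsampling_rate=0.75, seed=0):
    """Return sorted row positions kept by a simple block subsampling scheme.

    Hypothetical helper for illustration only; not part of the MRF package.
    """
    rng = random.Random(seed)
    # Split positions 0..n_obs-1 into consecutive blocks of block_size rows
    # (the last block may be shorter if n_obs is not a multiple of block_size).
    blocks = [
        list(range(start, min(start + block_size, n_obs)))
        for start in range(0, n_obs, block_size)
    ]
    # Keep a fraction of the blocks, drawn without replacement, so that
    # neighbouring observations stay together within each kept block.
    n_keep = max(1, round(subsampling_rate * len(blocks)))
    kept_blocks = rng.sample(blocks, n_keep)
    return sorted(pos for block in kept_blocks for pos in block)


# 240 rows in blocks of 12 gives 20 blocks; 75% of them are kept per tree.
rows = block_subsample(n_obs=240, block_size=12, subsampling_rate=0.75)
print(len(rows))
```

Keeping whole blocks preserves short-run serial dependence inside each tree's sample, which is the usual motivation for block schemes with time-series data.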

Returns

dict (dict)
YandX (pd.DataFrame): DataFrame containing original data
pred_ensemble (pd.Series): Series containing predictions of ensembled trees
pred (stacked numpy.matrix): Matrix containing raw (non-ensembled) predictions
S_names (numpy.array): Column indices corresponding to state (exogenous) variables
betas (numpy.matrix): Matrix containing average of betas across trees.
beta_draws_raw (stacked numpy.matrix): Stacked matrix containing raw betas from individual trees.
model (dict): Dictionary containing model information. Keys are [“forest”, “data”, “regul_lambda”, “HRW”, “no_rw_trespassing”, “B”, “random_vecs”, “y_pos”, “S_pos”, “x_pos”]

financial_evaluation(close_prices, k=1)

Method for automatically generating signals and backtesting the financial performance of MRF.

Parameters

close_prices (-) – Close prices of the financial asset corresponding to the target variable of interest.
k (-) – Forecast horizon.

Returns

daily_profit (pd.Series): Series corresponding to daily profit values of MRF-guided trading strategy over OOS period
cumulative_profit (pd.Series): Series corresponding to cumulative profit values of MRF-guided trading strategy over OOS period
annualised_return (float): Yearly profit earned over OOS period
sharpe_ratio (float): Sharpe Ratio metric corresponding to OOS period
max_drawdown (float): Maximum Drawdown metric corresponding to OOS period
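For readers unfamiliar with these metrics, the sketch below shows one common way to derive them from a series of daily profits. It is not the package's implementation: the helper summarise_profits is hypothetical, and the 252-day annualisation factor, the zero risk-free rate, and the drawdown measured on cumulative profit are all assumptions made for illustration.

```python
import math
import statistics
from itertools import accumulate

TRADING_DAYS = 252  # assumed annualisation factor, not taken from the package


def summarise_profits(daily_profit):
    """Summarise a list of daily profits (hypothetical helper, illustration only)."""
    cumulative = list(accumulate(daily_profit))
    mean = statistics.fmean(daily_profit)
    sd = statistics.stdev(daily_profit)  # sample standard deviation
    # Sharpe ratio with an assumed risk-free rate of zero.
    sharpe = mean / sd * math.sqrt(TRADING_DAYS) if sd > 0 else float("nan")
    # Maximum drawdown: largest drop of cumulative profit below its running peak.
    peak = cumulative[0]
    max_drawdown = 0.0
    for value in cumulative:
        peak = max(peak, value)
        max_drawdown = max(max_drawdown, peak - value)
    return {
        "cumulative_profit": cumulative,
        "annualised_return": mean * TRADING_DAYS,
        "sharpe_ratio": sharpe,
        "max_drawdown": max_drawdown,
    }


summary = summarise_profits([0.01, -0.02, 0.03, -0.01, 0.02])
print(summary["max_drawdown"])
```

The values returned by financial_evaluation may follow different conventions, so this sketch is best used to interpret the metrics rather than to reproduce them exactly.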

statistical_evaluation()

Method for automatically generating statistical evaluation metrics for MRF predictions.

Returns

MAE (float): Mean Absolute Error over OOS period
MSE (float): Mean Squared Error over OOS period
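As a reference for what these two metrics measure, here is a minimal sketch using the standard definitions of mean absolute error and mean squared error. The helper mae_mse is hypothetical and only illustrates the formulas; it does not show how the method obtains its out-of-sample predictions.

```python
def mae_mse(y_true, y_pred):
    """Return (MAE, MSE) for two equal-length sequences (illustrative helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)  # average absolute error
    mse = sum(e * e for e in errors) / len(errors)   # average squared error
    return mae, mse


# Out-of-sample targets and predictions (made-up numbers).
mae, mse = mae_mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(mae, mse)
```

MSE penalises large misses more heavily than MAE, so comparing the two gives a rough sense of whether errors are driven by a few large outliers.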