MRF module

class MRF.MacroRandomForest(data, x_pos, oos_pos, S_pos='', y_pos=1, minsize=10, mtry_frac=0.3333333333333333, min_leaf_frac_of_x=1, VI=False, ERT=False, quantile_rate=None, S_priority_vec=None, random_x=False, trend_push=1, howmany_random_x=1, howmany_keep_best_VI=20, cheap_look_at_GTVPs=True, prior_var=[], prior_mean=[], subsampling_rate=0.75, rw_regul=0.75, keep_forest=False, block_size=12, fast_rw=True, ridge_lambda=0.1, HRW=0, B=50, resampling_opt=2, print_b=True, parallelise=True, n_cores=-1)

Bases: object

The main class to run MRF and generate predictions and generalized time-varying parameters.

Parameters
  • data (-) – Dataframe, including all potential columns (y, X, S) and rows (both training and testing data)

  • x_pos (-) – Column positions of variables selected to be time-varying. Should be a subset of exogenous variables

  • y_pos (-) – Column position of the prediction target

  • S_pos (-) – Column positions of variables entering the forest part (S_t in the paper)

  • oos_pos (-) – Row positions of test set/out-of-sample observations

  • minsize (-) – Minimal node size to attempt a split

  • mtry_frac (-) – Fraction of all features S_t to consider at each split. A lower value (like 0.15) helps speed things up and can be reasonable when S_t contains many correlated features

  • min_leaf_frac_of_x (-) – Minimal ratio of observations to regressors in a node. Given the ridge penalty and random-walk shrinkage, there is no problem in letting this be one. Suggested values are (1,1.5,2) but those usually have very little influence

  • VI (-) – Variable importance. Currently not supported

  • ERT (-) – ERT stands for “Extremely Randomized Tree”. Activating this means splitting points (but not splitting variables) are chosen randomly. This brings extra regularization and, most importantly, speed. The quantile_rate option determines how many random splits are considered: 1 means all of them, as in a usual RF, 0 is like pure ERT, and all values in between are possible. This option is not used in the paper but can help exploratory work by speeding things up. It could also help in forecasting via extra regularization

  • quantile_rate (-) – This option has a different meaning if ERT is activated (see above). Otherwise, for early splits, it reduces the number of splitting points considered for each S. 1 means all splitting points are considered. A value between 0 and 1 means only a subset of quantiles of the splitting variable is considered. For instance, quantile_rate=0.3 means roughly one out of every three (ordered) values is considered for splitting. The aim of this option is to speed things up without sacrificing much predictive ability

  • S_priority_vec (-) – RF randomly selects potential splitting variables at each step. However, in large macro data sets, some types of variables could be over-represented and others under-represented. The user can specify weights for all members of S using this option. Thus, one can down-weight an over-represented group of regressors, if that makes sense to do so

  • trend_push (-) – See above. Must be >=1. This option multiplies by “trend_push” the probability of the trend being included in the potential splitters set. 4 is a reasonable value with macro data. Note this can be used for anything (not necessarily a trend) that we may want to boost (in position trend_pos)

  • random_x (-) – Activating this lets the algorithm randomly select “howmany_random_x” regressors out of all those in “x_pos” for each tree. This is merely a predictive device, so GTVPs are not outputted in that case, and neither are VI measures

  • howmany_random_x (-) – See above. Must be between 1 and the length of “x_pos”

  • howmany_keep_best_VI (-) – How many variables should we keep by VI criteria. Currently not supported

  • prior_mean (-) – MRF implements a ridge prior. The user can specify a prior_mean vector of length len(x_pos)+1 that differs from the zero vector (e.g. [0,0,0,0]). For instance, this may help when a near-unit root is suspected. An easy (and good) data-driven prior mean vector consists of the OLS estimates from regressing y on the X’s

  • prior_var (-) – When using prior_mean, a prior variance vector must also be specified. Remember this alters the implicit value of “ridge_lambda”. Also, the intercept should always have a larger variance

  • subsampling_rate (-) – Subsampling rate for the ensemble. Controls the percentage of observations used to make each tree estimate

  • rw_regul (-) – Egalitarian Olympic podium random-walk shrinkage parameter (see paper). Should be between 0 (nothing) and 1

  • keep_forest (-) – Not currently supported. Saves all the tree structures. Switch to True if you plan to forecast using the external function “pred_given_mrf”

  • block_size (-) – Size of blocks for block subsampling (resampling_opt=2) and Block Bayesian Bootstrap (resampling_opt=4)

  • fast_rw (-) – When True, “rw_regul” is only considered in the prediction step (and not in the search for splits). This speeds things up dramatically with often little influence on results

  • ridge_lambda (-) – Ridge shrinkage parameter for the linear part

  • HRW (-) – Seldom used. See paper’s appendix. Can be useful for very large “x_pos”

  • B (-) – The number of trees to include for the ensemble

  • resampling_opt (-) – 0 is no resampling. 1 is plain subsampling. 2 is block subsampling (recommended when looking at GTVPs). 3 is Bayesian Bootstrap. 4 is Block Bayesian Bootstrap (may do better for forecasting)

  • print_b (-) – If True, print which tree is currently being computed. Not supported if parallelise = True

  • parallelise (-) – If True, computation will run across multiple cores. Speeds up computation in (almost) precise proportion to “n_cores”

  • n_cores (-) – Default value of -1 specifies that all available cores will be used. Pass an integer greater than 1 to use fewer than all available cores
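Several of the parameters above carry explicit constraints (trend_push >= 1, howmany_random_x between 1 and the length of x_pos, rw_regul between 0 and 1, resampling_opt in 0 to 4, prior_var required alongside prior_mean). The sketch below collects them into a pre-flight check that can be run before building the forest. The helper name check_mrf_args is hypothetical and not part of the MRF package; it only restates the documented constraints and does not reproduce any validation the package itself may perform.

```python
def check_mrf_args(x_pos, trend_push=1, howmany_random_x=1, rw_regul=0.75,
                   resampling_opt=2, prior_mean=None, prior_var=None):
    """Return messages for documented constraints that are violated.

    Hypothetical helper for illustration only; not part of the MRF package.
    """
    problems = []
    if trend_push < 1:
        problems.append("trend_push must be >= 1")
    if not 1 <= howmany_random_x <= len(x_pos):
        problems.append("howmany_random_x must be between 1 and len(x_pos)")
    if not 0 <= rw_regul <= 1:
        problems.append("rw_regul should be between 0 and 1")
    if resampling_opt not in (0, 1, 2, 3, 4):
        problems.append("resampling_opt must be one of 0, 1, 2, 3, 4")
    if prior_mean:
        # One entry per time-varying regressor, plus one for the intercept.
        if len(prior_mean) != len(x_pos) + 1:
            problems.append("prior_mean must have length len(x_pos) + 1")
        if not prior_var:
            problems.append("prior_var must be given when prior_mean is used")
    return problems


# An empty list means none of the documented constraints is violated.
print(check_mrf_args(x_pos=[1, 2, 3], trend_push=4))
print(check_mrf_args(x_pos=[1, 2, 3], howmany_random_x=5, rw_regul=1.5))
```
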

Returns

  • dict (dict)

  • YandX (pd.DataFrame): DataFrame containing original data

  • pred_ensemble (pd.Series): Series containing predictions of ensembled trees

  • pred (stacked numpy.matrix): Matrix containing raw (non-ensembled) predictions

  • S_names (numpy.array): Column indices corresponding to state (exogenous) variables

  • betas (numpy.matrix): Matrix containing average of betas across trees.

  • beta_draws_raw (stacked numpy.matrix): Stacked matrix containing raw betas from individual trees.

  • model (dict): Dictionary containing model information. Keys are [“forest”, “data”, “regul_lambda”, “HRW”, “no_rw_trespassing”, “B”, “random_vecs”, “y_pos”, “S_pos”, “x_pos”]
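The relationship between beta_draws_raw and betas described above (betas being the average of the per-tree betas) can be sketched for a single coefficient as follows. This is only an illustration on plain Python lists: the layout draws[b][t] (tree b, observation t) is an assumption, and the package's own stacked-matrix layout and handling of missing draws may differ.

```python
def average_across_trees(draws):
    """Average one coefficient across trees, observation by observation.

    draws[b][t] is the value from tree b at observation t (assumed layout).
    """
    n_trees = len(draws)
    n_obs = len(draws[0])
    return [sum(tree[t] for tree in draws) / n_trees for t in range(n_obs)]


# Two trees, three observations of a single time-varying coefficient.
raw_draws = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
print(average_across_trees(raw_draws))
```
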

financial_evaluation(close_prices, k=1)

Method for automatically generating signals and backtesting the financial performance of MRF.

Parameters
  • close_prices (-) – Close prices of the financial asset corresponding to the target variable of interest.

  • k (-) – Forecast horizon.

Returns

  • daily_profit (pd.Series): Series corresponding to daily profit values of MRF-guided trading strategy over OOS period

  • cumulative_profit (pd.Series): Series corresponding to cumulative profit values of MRF-guided trading strategy over OOS period

  • annualised_return (float): Yearly profit earned over OOS period

  • sharpe_ratio (float): Sharpe Ratio metric corresponding to OOS period

  • max_drawdown (float): Maximum Drawdown metric corresponding to OOS period
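To make the returned metrics concrete, the following sketch shows one common way to derive them from a series of daily profits. It is not the package's implementation: the helper evaluate_profits is hypothetical, and the conventions used (252 trading days per year, a zero risk-free rate, additive cumulative profit, drawdown measured in absolute profit units from a starting level of zero) are assumptions that may differ from what financial_evaluation does internally.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualisation factor, not taken from the package


def evaluate_profits(daily_profit):
    """Summarise a list of daily profits (hypothetical helper, illustration only)."""
    # Cumulative profit is the running sum of the daily profits.
    cumulative = []
    total = 0.0
    for p in daily_profit:
        total += p
        cumulative.append(total)

    mean_daily = statistics.mean(daily_profit)
    annualised_return = mean_daily * TRADING_DAYS

    # Sharpe ratio with a zero risk-free rate (an assumption).
    sd = statistics.stdev(daily_profit)
    sharpe = math.sqrt(TRADING_DAYS) * mean_daily / sd if sd > 0 else float("nan")

    # Maximum drawdown: largest fall from the running peak of cumulative profit,
    # with the peak starting at 0 (the level before any trade).
    peak = 0.0
    max_drawdown = 0.0
    for c in cumulative:
        peak = max(peak, c)
        max_drawdown = max(max_drawdown, peak - c)

    return {
        "cumulative_profit": cumulative,
        "annualised_return": annualised_return,
        "sharpe_ratio": sharpe,
        "max_drawdown": max_drawdown,
    }


metrics = evaluate_profits([1.0, -2.0, 3.0, -1.0])
print(metrics["cumulative_profit"], metrics["max_drawdown"])
```
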

statistical_evaluation()

Method for automatically generating statistical evaluation metrics for MRF predictions.

Returns

  • MAE (float): Mean Absolute Error over OOS period

  • MSE (float): Mean Squared Error over OOS period
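For reference, the two metrics follow their standard definitions: MAE is the mean of the absolute forecast errors and MSE is the mean of the squared forecast errors. A minimal sketch on plain lists, with the hypothetical helper mae_mse standing in for the method (which operates on the model's own out-of-sample predictions):

```python
def mae_mse(y_true, y_pred):
    """Return (MAE, MSE) for two equally long sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    return mae, mse


mae, mse = mae_mse([1.0, 2.0, 4.0], [1.5, 2.0, 2.0])
print(round(mae, 4), round(mse, 4))
```
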