macrosynergy.learning.sequential.beta_estimator

Class to estimate market betas and calculate out-of-sample hedged returns based on sequential learning.

class BetaEstimator(df, xcats, benchmark_return, cids=None, start=None, end=None)

Bases: BasePanelLearner

Class for sequential beta estimation by learning optimal regression coefficients. Out-of-sample hedged returns are additionally calculated and stored.

Parameters:
  • df (pd.DataFrame) – Daily quantamental dataframe in JPMaQS format containing a panel of features, as well as a panel of returns.

  • xcats (str or list) – Name of a market return category within the panel specified in df.

  • benchmark_return (str) – Name of the benchmark return ticker within the panel specified in df.

  • cids (list, optional) – List of cross-sections for which hedged returns are to be calculated. Default is None, which calculates hedged returns for all cross-sections in the return panel.

  • start (str, optional) – Start date of the data considered in subsequent analysis, in ISO 8601 format. Default is None, i.e. the earliest date in the dataframe.

  • end (str, optional) – End date of the data considered in subsequent analysis, in ISO 8601 format. Default is None, i.e. the latest date in the dataframe.

Notes

The BetaEstimator class is used to sequentially estimate macro betas based on a panel of contract returns (provided in xcats) and a benchmark return ticker (provided in benchmark_return). The initial conditions of the learning process are specified by the dimensions of an initial training set. An optimal model is selected out of a provided collection (with associated hyperparameters), a beta is extracted for each cross-section (subject to availability) and out-of-sample hedged returns are calculated for each cross-section with an estimated beta. The betas and hedged returns are stored, and the training set is expanded to include the now-realized returns. This process is repeated until the end of the dataset is reached.
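The loop described above can be illustrated with a minimal, standard-library sketch for a single cross-section. It is not the class's implementation: it assumes a fixed univariate OLS slope in place of model selection, and it assumes the conventional definition of a hedged return (the return minus beta times the benchmark return). The names `ols_beta` and `sequential_hedge` are hypothetical.

```python
from statistics import mean


def ols_beta(returns, benchmark):
    """Slope of a univariate OLS regression of returns on the benchmark."""
    r_bar, b_bar = mean(returns), mean(benchmark)
    cov = sum((r - r_bar) * (b - b_bar) for r, b in zip(returns, benchmark))
    var = sum((b - b_bar) ** 2 for b in benchmark)
    return cov / var


def sequential_hedge(returns, benchmark, min_periods=3, max_periods=None):
    """Re-estimate beta each period and hedge the next, unseen return."""
    betas, hedged = [], []
    for t in range(min_periods, len(returns)):
        # Expanding window by default; rolling window if max_periods is set.
        start = 0 if max_periods is None else max(0, t - max_periods)
        beta = ols_beta(returns[start:t], benchmark[start:t])
        betas.append(beta)
        # Out-of-sample: period t was not used to estimate this beta.
        hedged.append(returns[t] - beta * benchmark[t])
    return betas, hedged


# Toy cross-section whose return is exactly twice the benchmark return.
benchmark = [0.01, -0.02, 0.015, 0.005, -0.01, 0.02]
returns = [2 * b for b in benchmark]
betas, hedged = sequential_hedge(returns, benchmark)
```

In this toy case every estimated beta is 2 (up to floating-point error) and the hedged returns are numerically zero, because the contract return is an exact multiple of the benchmark.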

In addition to storing betas and hedged returns, this class also stores useful model selection information for analysis, such as the models selected at each point in time.

Model and hyperparameter selection is performed by cross-validation. Given a collection of models and associated hyperparameters to choose from, a hyperparameter optimization (HPO) is run to determine the optimal choice; currently only grid search and random search are supported. The search is configured by providing a collection of scikit-learn compatible scoring functions and a collection of scikit-learn compatible cross-validation splitters. At each point in time, the cross-validation folds are the union of the folds produced by each splitter provided. Each scorer is evaluated on each test fold and summarised across test folds by either a custom function provided by the user or a common string, e.g. ‘mean’.
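As a rough illustration of how folds from several splitters can be pooled and how per-fold scores can be summarised, consider the standard-library sketch below. The splitters `k_fold` and `expanding_split`, the scorer `neg_mae_of_mean_model` and the helper `summarise` are hypothetical stand-ins, not part of the package; the sketch only assumes that the "union of folds" amounts to pooling the (train, test) pairs from each splitter.

```python
from statistics import mean


def k_fold(n_samples, n_splits):
    """Contiguous, unshuffled K-fold split into (train, test) index lists."""
    size = n_samples // n_splits
    folds = []
    for k in range(n_splits):
        lo = k * size
        hi = n_samples if k == n_splits - 1 else lo + size
        train = [i for i in range(n_samples) if i < lo or i >= hi]
        folds.append((train, list(range(lo, hi))))
    return folds


def expanding_split(n_samples, n_splits):
    """Forward-chaining split: train on the past, test on the next block."""
    block = n_samples // (n_splits + 1)
    folds = []
    for k in range(1, n_splits + 1):
        hi = n_samples if k == n_splits else (k + 1) * block
        folds.append((list(range(k * block)), list(range(k * block, hi))))
    return folds


def neg_mae_of_mean_model(y, train, test):
    """Score (greater is better) of predicting the training mean."""
    prediction = mean(y[i] for i in train)
    return -mean(abs(y[i] - prediction) for i in test)


def summarise(fold_scores, cv_summary="mean"):
    """Combine per-fold scores with a named summary or a user callable."""
    if callable(cv_summary):
        return cv_summary(fold_scores)
    if cv_summary == "mean":
        return mean(fold_scores)
    raise ValueError(f"unsupported summary in this sketch: {cv_summary!r}")


y = [0.1, 0.3, -0.2, 0.4, 0.0, 0.2, -0.1, 0.3]
# Pool the folds of both splitters: 2 K-fold folds + 3 expanding folds.
all_folds = k_fold(len(y), 2) + expanding_split(len(y), 3)
fold_scores = [neg_mae_of_mean_model(y, tr, te) for tr, te in all_folds]
mean_score = summarise(fold_scores, "mean")
worst_score = summarise(fold_scores, min)  # custom callable
```

Passing `min` as the callable gives a pessimistic summary (the worst fold), which is one example of why a user-supplied function may be preferred over the plain mean.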

Consequently, each model and hyperparameter combination has an associated collection of scores, one per scorer, each in the units of that scorer. To form a composite score, the scores are first normalized across model/hyperparameter combinations. This makes scores from different scorers comparable, so that their average can be used as a meaningful estimate of each model’s generalization ability. The composite score for each model and hyperparameter combination is then the average of its adjusted scores across all scorers.

The optimal model is the one with the largest composite score.
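The composite-score selection can be sketched as follows. The normalization method is not specified in the text above, so this sketch assumes a z-score across candidates for each scorer, and it assumes every scorer follows the "greater is better" convention. The function `composite_scores` and the candidate names are hypothetical.

```python
from statistics import mean, pstdev


def composite_scores(scores):
    """scores: {candidate: {scorer_name: summarised CV score}}."""
    candidates = list(scores)
    scorer_names = list(next(iter(scores.values())))
    adjusted = {c: [] for c in candidates}
    for name in scorer_names:
        column = [scores[c][name] for c in candidates]
        mu, sigma = mean(column), pstdev(column)
        for c in candidates:
            # z-score across candidates; a constant column adds no information.
            z = 0.0 if sigma == 0 else (scores[c][name] - mu) / sigma
            adjusted[c].append(z)
    # Composite score: average of the adjusted scores across scorers.
    return {c: mean(adjusted[c]) for c in candidates}


scores = {
    "ols": {"r2": 0.10, "neg_mae": -0.020},
    "ridge_alpha_1": {"r2": 0.12, "neg_mae": -0.015},
    "ridge_alpha_10": {"r2": 0.05, "neg_mae": -0.030},
}
composite = composite_scores(scores)
best = max(composite, key=composite.get)
```

Here `best` is `"ridge_alpha_1"`, since that candidate has the highest score under both scorers. Without the normalization step, the scorer with the larger numerical range would dominate the average.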

estimate_beta(beta_xcat, hedged_return_xcat, models, hyperparameters, scorers, inner_splitters, search_type='grid', normalize_fold_results=False, cv_summary='mean', include_train_folds=False, min_cids=4, min_periods=36, est_freq='D', max_periods=None, split_functions=None, n_iter=None, n_jobs_outer=-1, n_jobs_inner=1)

Determines optimal model betas and associated out-of-sample hedged returns.

Parameters:
  • beta_xcat (str) – Category name for the panel of estimated betas.

  • hedged_return_xcat (str) – Category name for the panel of out-of-sample hedged returns.

  • models (dict) – Dictionary of models to choose from. The keys are model names and the values are scikit-learn compatible models.

  • hyperparameters (dict) – Dictionary of hyperparameters to choose from. The keys are model names and the values are hyperparameter dictionaries for the corresponding model. The keys must match with those provided in models.

  • scorers (dict) – Dictionary of scoring functions to use in the hyperparameter optimization process. The keys are scorer names and the values are scikit-learn compatible scoring functions.

  • inner_splitters (dict) – Dictionary of inner splitters to use in the hyperparameter optimization process. The keys are splitter names and the values are scikit-learn compatible cross-validator objects.

  • search_type (str) – Type of hyperparameter optimization to perform. Options are “grid” (grid search) and “prior” (random search). Default is “grid”.

  • normalize_fold_results (bool) – Whether to normalize the scores across folds before combining them. Default is False.

  • cv_summary (str or callable, optional) – Summary function to use to combine scores across cross-validation folds. Default is “mean”. Options are “mean”, “median”, “mean-std”, “mean/std”, “mean-std-ge” or a callable function.

  • include_train_folds (bool, optional) – Whether to calculate cross-validation statistics on the training folds in addition to the test folds. If True, the cross-validation estimator will be a function of both training data and test data. It is recommended to set cv_summary appropriately. Default is False.

  • min_cids (int) – Minimum number of cross-sections required for the initial training set. Default is 4.

  • min_periods (int) – Minimum number of periods required for the initial training set, in units of the frequency freq specified in the constructor. Default is 36.

  • est_freq (str) – Frequency at which models are refreshed. This corresponds to the forward frequency of the out-of-sample hedged returns and the frequency at which betas are estimated. Default is “D”.

  • max_periods (int) – Maximum length of each training set in units of the frequency freq specified in the constructor. Default is None, in which case the sequential optimization uses expanding training sets, as opposed to rolling windows.

  • split_functions (dict, optional) – Dict of callables for determining the number of cross-validation splits to add to the initial number as a function of the number of iterations passed in the sequential learning process. Default is None. The keys must correspond to the keys in inner_splitters and should be set to None for any splitters that do not require splitter adjustment.

  • n_iter (int, optional) – Number of iterations to run in random hyperparameter search. Default is None.

  • n_jobs_outer (int, optional) – Number of jobs to run in parallel for the outer sequential loop. Default is -1. It is advised for n_jobs_inner * n_jobs_outer (replacing -1 with the number of available cores) to be less than or equal to the number of available cores on the machine.

  • n_jobs_inner (int, optional) – Number of jobs to run in parallel for the inner loop. Default is 1. It is advised for n_jobs_inner * n_jobs_outer (replacing -1 with the number of available cores) to be less than or equal to the number of available cores on the machine.
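The advice on n_jobs_outer and n_jobs_inner can be checked before launching a run. The sketch below is a hypothetical helper, not part of the package; it simply replaces -1 with the core count and compares the product with the number of available cores.

```python
import os


def resolve_jobs(n_jobs, n_cores):
    """Replace -1 with the number of available cores."""
    return n_cores if n_jobs == -1 else n_jobs


def within_core_budget(n_jobs_outer, n_jobs_inner, n_cores=None):
    """True if n_jobs_outer * n_jobs_inner does not exceed the core count."""
    n_cores = n_cores or os.cpu_count() or 1
    total = resolve_jobs(n_jobs_outer, n_cores) * resolve_jobs(n_jobs_inner, n_cores)
    return total <= n_cores


# The core count is fixed at 8 here so the outcome does not depend on the machine.
defaults_ok = within_core_budget(-1, 1, n_cores=8)     # 8 * 1 <= 8
oversubscribed = within_core_budget(-1, 2, n_cores=8)  # 8 * 2 > 8
```

With the default arguments (n_jobs_outer=-1, n_jobs_inner=1) the product equals the core count, so the defaults satisfy the recommendation on any machine.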

store_split_data(pipeline_name, optimal_model, optimal_model_name, optimal_model_score, optimal_model_params, inner_splitters_adj, X_train, y_train, X_test, y_test, timestamp, adjusted_test_index)

Stores characteristics of the optimal model at each retraining date.

Parameters:
  • pipeline_name (str) – Name of the signal optimization process.

  • optimal_model (BaseRegressionSystem or VotingRegressor) – Optimal model selected at each retraining date.

  • optimal_model_name (str) – Name of the optimal model.

  • optimal_model_score (float) – Cross-validation score for the optimal model.

  • optimal_model_params (dict) – Chosen hyperparameters for the optimal model.

  • inner_splitters_adj (dict) – Dictionary of adjusted inner splitters.

  • X_train (pd.DataFrame) – Training feature matrix.

  • y_train (pd.Series) – Training response variable.

  • X_test (pd.DataFrame) – Test feature matrix.

  • y_test (pd.Series) – Test response variable.

  • timestamp (pd.Timestamp) – Timestamp of the retraining date.

  • adjusted_test_index (pd.MultiIndex) – Adjusted test index to account for lagged features.

Returns:

Dictionary containing the betas and hedged returns determined at the given retraining date.

Return type:

dict

evaluate_hedged_returns(hedged_return_xcat=None, cids=None, correlation_types='pearson', title=None, start=None, end=None, blacklist=None, freqs='M')

Method to determine and display a table of average absolute correlations between the benchmark return and the computed hedged returns within the class instance, over all cross-sections in the panel. Additionally, the correlation table displays the same results for the unhedged return specified in the class instance for comparison purposes.

The returned dataframe is multi-indexed by (benchmark return, return, frequency), and each column contains the average absolute correlation coefficients for one correlation type.

Parameters:
  • hedged_return_xcat (str or list, optional) – Hedged returns to be evaluated. Default is None, which evaluates all hedged returns within the class instance.

  • cids (str or list, optional) – Cross-sections for which evaluation of hedged returns takes place. Default is None, which evaluates all cross-sections within the class instance.

  • correlation_types (str or list, optional) – Types of correlations to calculate. Options are “pearson”, “spearman” and “kendall”. If None, all three are calculated. Default is “pearson”.

  • title (str, optional) – Title for the correlation table. If None, the default title is “Average absolute correlations between each return and the chosen benchmark”. Default is None.

  • start (str, optional) – Start date of the evaluation period, as a string in ISO format. Default is None.

  • end (str, optional) – End date of the evaluation period, as a string in ISO format. Default is None.

  • blacklist (dict, optional) – Dictionary of tuples of start and end dates to exclude from the evaluation. Default is None.

  • freqs (str or list, optional) – Letters denoting all frequencies at which the correlations may be calculated. This must be a selection of “D”, “W”, “M”, “Q” and “A”. Default is “M”. Each return series will always be summed over the sample period.

Returns:

A dataframe of average absolute correlations between the benchmark return and the computed hedged returns.

Return type:

pd.DataFrame
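The statistic reported by this method can be illustrated with a standard-library sketch: the Pearson correlation with the benchmark is computed per cross-section, and its absolute value is averaged over the panel. The data and betas below are made up, the helpers `pearson` and `mean_abs_corr` are hypothetical, and the sketch skips the summation of returns to the chosen frequency.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    ss_x = sum((a - x_bar) ** 2 for a in x)
    ss_y = sum((b - y_bar) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def mean_abs_corr(panel, benchmark):
    """Average |correlation| with the benchmark over all cross-sections."""
    return mean(abs(pearson(series, benchmark)) for series in panel.values())


benchmark = [0.01, -0.02, 0.015, 0.005, -0.01, 0.02]
noise = {
    "AUD": [0.004, 0.001, -0.003, 0.002, -0.001, -0.002],
    "CAD": [-0.002, 0.003, 0.001, -0.004, 0.002, 0.001],
}
betas = {"AUD": 1.5, "CAD": -0.8}  # assumed known, for illustration only

# Unhedged return = beta * benchmark + idiosyncratic noise.
unhedged = {c: [betas[c] * b + e for b, e in zip(benchmark, noise[c])] for c in betas}
# Hedged return = unhedged return minus the beta-weighted benchmark return.
hedged = {c: [r - betas[c] * b for r, b in zip(unhedged[c], benchmark)] for c in betas}

unhedged_corr = mean_abs_corr(unhedged, benchmark)
hedged_corr = mean_abs_corr(hedged, benchmark)
```

An effective hedge shows up as `hedged_corr` being well below `unhedged_corr`. Absolute values are used so that positive and negative betas (as for the two cross-sections here) do not cancel in the average.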

get_hedged_returns(hedged_return_xcat=None)

Returns a dataframe of out-of-sample hedged returns derived from beta estimation processes held within the class instance.

Parameters:

hedged_return_xcat (str or list, optional) – Category name or list of category names for the panel of derived hedged returns. If None, information from all beta estimation processes held within the class instance is returned. Default is None.

Returns:

A dataframe of out-of-sample hedged returns derived from beta estimation processes.

Return type:

pd.DataFrame

get_betas(beta_xcat=None)

Returns a dataframe of estimated betas derived from beta estimation processes held within the class instance.

Parameters:

beta_xcat (str or list, optional) – Category name or list of category names for the panel of estimated contract betas. If None, information from all beta estimation processes held within the class instance is returned. Default is None.

Returns:

A dataframe of estimated betas derived from beta estimation processes.

Return type:

pd.DataFrame