macrosynergy.learning.sequential#

class BasePanelLearner(df, xcats, cids=None, n_targets=1, start=None, end=None, blacklist=None, freq='M', lag=1, xcat_aggs=['last', 'sum'], generate_labels=None, skip_checks=False, drop_nas=True)[source]#

Bases: ABC

run(name, models, outer_splitter, inner_splitters=None, hyperparameters=None, scorers=None, search_type='grid', normalize_fold_results=False, cv_summary='mean', include_train_folds=False, n_iter=100, split_functions=None, store_additional_data=None, n_jobs_outer=-1, n_jobs_inner=1)[source]#

Run a learning process over a panel.

Parameters:

name (str) – Category name for the forecasted panel resulting from the learning process.
models (dict) – Dictionary of model names and compatible scikit-learn model objects.
outer_splitter (WalkForwardPanelSplit) – Outer splitter for the learning process.
inner_splitters (dict, optional) – Inner splitters for the learning process.
hyperparameters (dict, optional) – Dictionary of model names and hyperparameter grids.
scorers (dict, optional) – Dictionary of scikit-learn compatible scoring functions.
search_type (str) – Search type for hyperparameter optimization. Default is “grid”. Options are “grid”, “prior” and “bayes”. If no hyperparameter tuning is required, this parameter can be disregarded.
normalize_fold_results (bool) – Whether to normalize the scores across folds before combining them. Default is False. If no hyperparameter tuning is required, this parameter can be disregarded.
cv_summary (str or callable) – Summary function to use to combine scores across cross-validation folds. Default is “mean”. Options are “mean”, “median”, “mean-std”, “mean/std”, “mean-std-ge” or a callable function. If no hyperparameter tuning is required, this parameter can be disregarded.
include_train_folds (bool, optional) – Whether to calculate cross-validation statistics on the training folds in addition to the test folds. Default is False. If no hyperparameter tuning is required, this parameter can be disregarded.
n_iter (int) – Number of iterations for random or Bayesian hyperparameter optimization. Default is 100. If no hyperparameter tuning is required, this parameter can be disregarded.
split_functions (dict, optional) – Dictionary of callables for determining the number of cross-validation splits to add to the initial number, as a function of the number of iterations passed in the sequential learning process. If no hyperparameter tuning is required, this parameter can be disregarded.
store_additional_data (list, optional) – List of optimal model attributes to store from each optimal model at each retraining date. Default is None.
n_jobs_outer (int, optional) – Number of jobs to run in parallel for the outer loop. Default is -1.
n_jobs_inner (int, optional) – Number of jobs to run in parallel for the inner loop. Default is 1. If no hyperparameter tuning is required, this parameter can be disregarded.

Returns:

List of dictionaries containing the results of the learning process.

Return type:

list
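The string options for cv_summary can be read as simple reductions over the per-fold scores. The following is a minimal sketch, not the library's implementation: the helper name summarise_folds is hypothetical, the use of the population standard deviation is an assumption, and “mean-std-ge” is left out because its definition is not given here.

```python
import statistics


def summarise_folds(scores, cv_summary="mean"):
    """Combine per-fold scores into a single number (illustrative only)."""
    if callable(cv_summary):
        # A user-supplied callable receives the list of fold scores.
        return cv_summary(scores)
    mean = statistics.mean(scores)
    if cv_summary == "mean":
        return mean
    if cv_summary == "median":
        return statistics.median(scores)
    # Assumption: population standard deviation across folds.
    std = statistics.pstdev(scores)
    if cv_summary == "mean-std":
        return mean - std
    if cv_summary == "mean/std":
        # Assumes the fold scores are not all identical (std > 0).
        return mean / std
    raise ValueError(f"Unsupported cv_summary in this sketch: {cv_summary!r}")


fold_scores = [1.0, 2.0, 3.0]
summaries = {
    option: summarise_folds(fold_scores, option)
    for option in ("mean", "median", "mean-std", "mean/std")
}
```

Penalising the mean by the standard deviation favours models whose performance is stable across folds rather than high on average only.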

store_split_data(pipeline_name, optimal_model, optimal_model_name, optimal_model_score, optimal_model_params, optimal_model_additional_data, inner_splitters_adj, X_train, y_train, X_test, y_test, timestamp, adjusted_test_index)[source]#

Store predictive analytics for training set (X_train, y_train).

Parameters:

pipeline_name (str) – Name of the sequential optimization pipeline.
optimal_model (RegressorMixin or ClassifierMixin or Pipeline) – Optimal model selected for the training set.
optimal_model_name (str) – Name of the optimal model.
optimal_model_score (float) – Score of the optimal model.
optimal_model_params (dict) – Hyperparameters of the optimal model.
optimal_model_additional_data (dict) – Additional attributes of the optimal model to store.
inner_splitters_adj (dict) – Inner splitters for the learning process.
X_train (pd.DataFrame) – Training feature matrix.
y_train (pd.Series) – Training response variable.
X_test (pd.DataFrame) – Test feature matrix.
y_test (pd.Series) – Test response variable.
timestamp (pd.Timestamp) – Model retraining date.
adjusted_test_index (pd.MultiIndex) – Adjusted test index to account for lagged features.

Returns:

Dictionary containing predictive analytics.

Return type:

dict

get_optimal_models(name=None)[source]#

Returns the sequences of optimal models for one or more processes.

Parameters: name (str or list, optional) – Label of sequential optimization process. Default is all stored in the class instance.
Returns: Pandas dataframe of the optimal models and hyperparameters selected at each retraining date.
Return type: pd.DataFrame

models_heatmap(name, title=None, cap=5, figsize=(12, 8), title_fontsize=None, tick_fontsize=None)[source]#

Visualise the optimal models used for signal calculation.

Parameters:

name (str) – Name of the sequential optimization pipeline.
title (str, optional) – Title of the heatmap. Default is None. This creates a figure title of the form “Model Selection Heatmap for {name}”.
cap (int, optional) – Maximum number of models to display. Default (and limit) is 5. The chosen models are the ‘cap’ most frequently occurring in the pipeline.
figsize (tuple, optional) – Tuple of floats or ints denoting the figure size. Default is (12, 8).
title_fontsize (int, optional) – Font size for the title. Default is None.
tick_fontsize (int, optional) – Font size for the ticks. Default is None.

Notes

This method displays the models selected at each date in time over the span of the sequential learning process. A binary heatmap is used to visualise the model selection process.
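The data behind such a binary heatmap can be sketched as follows. This is an illustration under stated assumptions rather than the plotting code itself: the helper binary_selection_matrix and the model names are hypothetical, and the input is assumed to be a chronological list of (retraining date, selected model) pairs.

```python
from collections import Counter


def binary_selection_matrix(selections, cap=5):
    """Build one 0/1 row per model for the `cap` most frequently chosen models."""
    counts = Counter(model for _, model in selections)
    top_models = [model for model, _ in counts.most_common(cap)]
    dates = [date for date, _ in selections]
    matrix = {
        model: [1 if chosen == model else 0 for _, chosen in selections]
        for model in top_models
    }
    return dates, matrix


# Hypothetical selection history over six retraining dates.
selections = [
    ("2020-01-31", "ols"),
    ("2020-02-28", "ridge"),
    ("2020-03-31", "ols"),
    ("2020-04-30", "lasso"),
    ("2020-05-29", "ridge"),
    ("2020-06-30", "ols"),
]
dates, matrix = binary_selection_matrix(selections, cap=2)
```

Each row of the resulting matrix corresponds to one model and each column to one retraining date, with a 1 marking the date on which that model was selected.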

class SignalOptimizer(df, xcats, cids=None, n_targets=1, start=None, end=None, blacklist=None, freq='M', lag=1, xcat_aggs=['last', 'sum'], generate_labels=None, drop_nas=True)[source]#

Bases: BasePanelLearner

Class for sequential optimization of return forecasts based on panels of quantamental features.

Parameters:

df (pd.DataFrame) – Daily quantamental dataframe in JPMaQS format containing a panel of features, as well as a panel of returns.
xcats (list) – List comprising feature names, with the last n_targets elements being the response variable name(s). The features and the response variable(s) must be categories in the dataframe.
cids (list) – List of cross-section identifiers for consideration in the panel. Default is None, in which case all cross-sections in df are considered.
n_targets (int) – Number of response variables to consider. Default is 1.
start (str) – Start date for considered data in subsequent analysis in ISO 8601 format. Default is None i.e. the earliest date in the dataframe.
end (str) – End date for considered data in subsequent analysis in ISO 8601 format. Default is None i.e. the latest date in the dataframe.
blacklist (dict) – Blacklisting dictionary specifying date ranges for which cross-sectional information should be excluded. The keys are cross-sections and the values are tuples of start and end dates in ISO 8601 format. Default is None.
freq (str) – Frequency of the analysis. Default is “M” for monthly.
lag (int) – Number of periods to lag the response variable. Default is 1.
xcat_aggs (list) – List of aggregation functions to apply to the features, used when freq is not D. Default is [“last”, “sum”].
generate_labels (callable) – Function to transform the response variable into either alternative regression targets or classification labels. Default is None.
drop_nas (bool, str) – Strategy for dealing with NaNs in the data. Valid arguments are True, False, “X”, or “y”. If True, then all NaNs are removed from the data. “X” means only NaNs in independent variables are dropped. “y” means only NaNs in dependent variables are dropped. Default is True.
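The four drop_nas settings can be illustrated on a toy dataset. This is a minimal sketch of the described behaviour only: the helper drop_missing is hypothetical, rows are plain (features, target) tuples, and None stands in for NaN.

```python
def drop_missing(rows, drop_nas=True):
    """Filter (features, target) rows according to the drop_nas strategy."""
    kept = []
    for features, target in rows:
        x_missing = any(value is None for value in features)
        y_missing = target is None
        if drop_nas is True and (x_missing or y_missing):
            continue  # drop rows with any missing value
        if drop_nas == "X" and x_missing:
            continue  # drop rows with missing independent variables only
        if drop_nas == "y" and y_missing:
            continue  # drop rows with missing dependent variables only
        kept.append((features, target))
    return kept


rows = [
    ([1.0, 2.0], 0.5),     # complete
    ([None, 2.0], 0.1),    # missing feature
    ([1.0, 3.0], None),    # missing target
    ([None, None], None),  # missing everything
]
n_kept = {str(option): len(drop_missing(rows, option)) for option in (True, False, "X", "y")}
```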

Notes

The SignalOptimizer class is used to predict the response variable(s), usually a panel of asset class returns, based on a panel of features that are lagged by a specified number of periods. This is done in a sequential manner, by specifying the size of an initial training set, choosing an optimal model out of a provided collection (with associated hyperparameters), forecasting the return panel(s), and then expanding the training set to include the now-realized returns. The process continues until the end of the dataset is reached.

In addition to storing forecasts, this class also stores useful information for analysis such as the models selected at each point in time, the feature coefficients and intercepts (where relevant) of selected models, as well as the features selected by any feature selection modules.

Model and hyperparameter selection is performed by cross-validation. Given a collection of models and associated hyperparameters to choose from, a hyperparameter optimization (HPO) is run to determine the optimal choice; currently only grid search and random search are supported. This requires a collection of scikit-learn compatible scoring functions and a collection of scikit-learn compatible cross-validation splitters. At each point in time, the cross-validation folds are the union of the folds produced by each splitter provided. Each scorer is evaluated on each test fold and summarised across test folds by either a custom function provided by the user or a common string, e.g. ‘mean’.

Consequently, each model and hyperparameter combination has an associated collection of scores induced by different metrics, in units of those scorers. In order to form a composite score for each hyperparameter, the scores must be normalized across model/hyperparameter combinations. This makes scores across scorers comparable, so that the average score across adjusted scores can be used as a meaningful estimate of each model’s generalization ability. Finally, a composite score for each model and hyperparameter combination is calculated by averaging the adjusted scores across all scorers.

The optimal model is the one with the largest composite score.
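The composite-score logic described above can be sketched in a few lines. This is not the library's code: the helper composite_scores is hypothetical, the candidate and scorer names are made up, and z-scoring is assumed as the normalization, since the exact normalization is not specified here.

```python
import statistics


def composite_scores(raw):
    """raw maps scorer name -> {candidate name -> summarised CV score}."""
    candidates = list(next(iter(raw.values())))
    totals = {candidate: 0.0 for candidate in candidates}
    for by_candidate in raw.values():
        values = [by_candidate[c] for c in candidates]
        mean = statistics.mean(values)
        std = statistics.pstdev(values)
        for c in candidates:
            # Assumption: z-score across candidates so scorers become comparable.
            adjusted = (by_candidate[c] - mean) / std if std > 0 else 0.0
            totals[c] += adjusted / len(raw)  # average across scorers
    return totals


# Hypothetical scores for three candidates under two scorers.
raw = {
    "r2": {"ols": 0.10, "ridge": 0.20, "lasso": 0.15},
    "neg_mae": {"ols": -2.0, "ridge": -1.0, "lasso": -3.0},
}
totals = composite_scores(raw)
best = max(totals, key=totals.get)  # candidate with the largest composite score
```

Without the normalization step, a scorer measured in large units would dominate the average regardless of how informative it is.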

calculate_predictions(name, models, hyperparameters=None, scorers=None, inner_splitters=None, search_type='grid', normalize_fold_results=False, cv_summary='mean', multi_target_fill='zero', include_train_folds=False, min_cids=4, min_periods=36, min_xcats=1, test_size=1, max_periods=None, split_functions=None, store_additional_data=None, n_iter=None, n_jobs_outer=-1, n_jobs_inner=1, store_correlations=False)[source]#

Determine forecasts and store relevant quantities over time.

Parameters:

name (str) – Name of the signal optimization process.
models (dict) – Dictionary of models to choose from. The keys are model names and the values are scikit-learn compatible models.
hyperparameters (dict, optional) – Dictionary of hyperparameters to choose from. The keys are model names and the values are hyperparameter dictionaries for the corresponding model. The keys must match with those provided in models. If no hyperparameters are required to be tuned, this parameter can be None. Default is None.
scorers (dict, optional) – Dictionary of scoring functions to use in cross-validation if hyperparameters or models are needed to be selected. The keys are scorer names and the values are scikit-learn compatible scoring functions. If no cross-validation is required, this parameter can be None. Default is None.
inner_splitters (dict, optional) – Dictionary of inner splitters to use in cross-validation. The keys are splitter names and the values are scikit-learn compatible cross-validator objects. If no cross-validation is required, this parameter can be None. Default is None.
search_type (str, optional) – Type of hyperparameter optimization to perform. Default is “grid”. Options are “grid” and “prior”. If no hyperparameter tuning is required, this parameter can be disregarded.
normalize_fold_results (bool, optional) – Whether to normalize the scores across folds before combining them. Default is False. If no hyperparameter tuning is required, this parameter can be disregarded.
cv_summary (str or callable, optional) – Summary function to use to combine scores across cross-validation folds. Default is “mean”. Options are “mean”, “median”, “mean-std”, “mean/std”, “mean-std-ge” or a callable function. If no hyperparameter tuning is required, this parameter can be disregarded.
multi_target_fill (str, optional) – Method to use to fill in predictions for targets with poor availability in the case of multi-target models. Options are “zero” and “mean”. Default is “zero”. If no multi-target models are used, this parameter can be disregarded.
include_train_folds (bool, optional) – Whether to calculate cross-validation statistics on the training folds in addition to the test folds. If True, the cross-validation estimator will be a function of both training data and test data. It is recommended to set cv_summary appropriately. Default is False. If no hyperparameter tuning is required, this parameter can be disregarded.
min_cids (int, optional) – Minimum number of cross-sections required for the initial training set. Default is 4.
min_periods (int, optional) – Minimum number of periods required for the initial training set, in units of the frequency freq specified in the constructor. Default is 36.
min_xcats (int, optional) – Minimum number of xcats required for the initial training set. Default is 1.
test_size (int, optional) – Number of periods to pass before retraining a selected model. Default is 1.
max_periods (int, optional) – Maximum length of each training set in units of the frequency freq specified in the constructor. Default is None, in which case the sequential optimization uses expanding training sets, as opposed to rolling windows.
split_functions (dict, optional) – Dict of callables for determining the number of cross-validation splits to add to the initial number as a function of the number of iterations passed in the sequential learning process. The keys must correspond to the keys in inner_splitters and should be set to None for any splitters that do not require splitter adjustment. Default is None. If no hyperparameter tuning is required, this parameter can be disregarded.
store_additional_data (list, optional) – List of optimal model attributes to store from each optimal model at each retraining date. Default is None.
n_iter (int, optional) – Number of iterations to run in random hyperparameter search. Default is None. If no hyperparameter tuning is required, this parameter can be disregarded.
n_jobs_outer (int, optional) – Number of jobs to run in parallel for the outer sequential loop. Default is -1. It is advised for n_jobs_inner * n_jobs_outer (replacing -1 with the number of available cores) to be less than or equal to the number of available cores on the machine.
n_jobs_inner (int, optional) – Number of jobs to run in parallel for the inner loop. Default is 1. It is advised for n_jobs_inner * n_jobs_outer (replacing -1 with the number of available cores) to be less than or equal to the number of available cores on the machine. If no hyperparameter tuning is required, this parameter can be disregarded.
store_correlations (bool) – Whether to store the correlations between input pipeline features and input predictor features. Default is False.
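The interaction of min_periods, test_size and max_periods can be pictured as a walk-forward loop over time periods. The sketch below is a simplification under stated assumptions: the helper walk_forward_splits is hypothetical, periods are plain integer indices, and the cross-sectional requirements (min_cids, min_xcats) are ignored.

```python
def walk_forward_splits(n_periods, min_periods=36, test_size=1, max_periods=None):
    """Return (train, test) index ranges for a sequential learning process."""
    splits = []
    train_end = min_periods  # the initial training set covers the first min_periods
    while train_end < n_periods:
        if max_periods is None:
            train_start = 0  # expanding training set
        else:
            train_start = max(0, train_end - max_periods)  # rolling window
        test_end = min(train_end + test_size, n_periods)
        splits.append((range(train_start, train_end), range(train_end, test_end)))
        train_end += test_size  # retrain after test_size periods have passed
    return splits


expanding = walk_forward_splits(n_periods=40, min_periods=36, test_size=1)
rolling = walk_forward_splits(n_periods=40, min_periods=36, test_size=1, max_periods=12)
```

In the expanding case every training set starts at the first period, whereas with max_periods set the oldest periods fall out of the training window as it moves forward.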

store_split_data(pipeline_name, optimal_model, optimal_model_name, optimal_model_score, optimal_model_params, inner_splitters_adj, optimal_model_additional_data, X_train, y_train, X_test, y_test, timestamp, adjusted_test_index)[source]#

Stores characteristics of the optimal model at each retraining date.

Parameters:

pipeline_name (str) – Name of the signal optimization process.
optimal_model (RegressorMixin, ClassifierMixin or Pipeline) – Optimal model selected at each retraining date.
optimal_model_name (str) – Name of the optimal model.
optimal_model_score (float) – Cross-validation score for the optimal model.
optimal_model_params (dict) – Chosen hyperparameters for the optimal model.
optimal_model_additional_data (dict) – Additional attributes of the optimal model to store.
inner_splitters_adj (dict) – Dictionary of adjusted inner splitters.
X_train (pd.DataFrame) – Training feature matrix.
y_train (pd.Series) – Training response variable.
X_test (pd.DataFrame) – Test feature matrix.
y_test (pd.Series) – Test response variable.
timestamp (pd.Timestamp) – Timestamp of the retraining date.
adjusted_test_index (pd.MultiIndex) – Adjusted test index to account for lagged features.

Returns:

Dictionary containing feature importance scores, intercepts, selected features and correlations between inputs to pipelines and those entered into a final model.

Return type:

dict

get_optimized_signals(name=None)[source]#

Returns optimized signals for one or more processes.

Parameters: name (str or list, optional) – Label(s) of signal optimization process(es). Default is all stored in the class instance.
Returns: Pandas dataframe in JPMaQS format of working daily predictions.
Return type: pd.DataFrame

get_selected_features(name=None)[source]#

Returns the selected features over time for one or more processes.

Parameters: name (str or list, optional) – Label(s) of signal optimization process(es). Default is all stored in the class instance.
Returns: Pandas dataframe of the selected features at each retraining date.
Return type: pd.DataFrame

get_feature_importances(name=None)[source]#

Returns feature importances for a given pipeline.

Parameters: name (str or list, optional) – Label(s) of signal optimization process(es). Default is all stored in the class instance.
Returns: Pandas dataframe of the feature importances, if available, learnt at each retraining date for a given pipeline.
Return type: pd.DataFrame

Notes

Availability of feature importances is subject to the selected model having a feature_importances_ or coef_ attribute.

get_intercepts(name=None)[source]#

Returns intercepts for a given pipeline.

Parameters: name (str or list, optional) – Label(s) of signal optimization process(es). Default is all stored in the class instance.
Returns: Pandas dataframe of the intercepts, if available, learnt at each retraining date for a given pipeline.
Return type: pd.DataFrame

get_feature_correlations(name=None)[source]#

Returns a dataframe of feature correlations for one or more processes.

Parameters: name (str or list, optional) – Label(s) of signal optimization process(es). Default is all stored in the class instance.
Returns: Pandas dataframe of the correlations between the features passed into a model pipeline and the post-processed features inputted into the final model.
Return type: pd.DataFrame

feature_selection_heatmap(name, remove_blanks=True, title=None, cap=None, ftrs_renamed=None, figsize=(12, 8), tick_fontsize=None)[source]#

Visualise the features chosen by the final selector in a scikit-learn pipeline over time, for a given signal optimization process that has been run.

Parameters:

name (str) – Name of the previously run signal optimization process.
remove_blanks (bool, optional) – Whether to remove features from the heatmap that were never selected. Default is True.
title (str, optional) – Title of the heatmap. Default is None. This creates a figure title of the form “Model Selection Heatmap for {name}”.
cap (int, optional) – Maximum number of features to display. Default is None. The chosen features are the ‘cap’ most frequently occurring in the pipeline.
ftrs_renamed (dict, optional) – Dictionary to rename the feature names for visualisation in the plot axis. Default is None, which uses the original feature names.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (12, 8).
tick_fontsize (int, optional) – Font size of the ticks on the heatmap. Default is None.

Notes

This method displays the features selected by the final selector in a scikit-learn pipeline over time, for a given signal optimization process that has been run. This information is contained within a binary heatmap. This does not take into account inherent feature selection within the predictor.

available_cid_heatmap(title='Number of Available CIDs Heatmap', figsize=(12, 8), xcats=None, xcat_labels=None, start_date=None, tick_fontsize=None, title_fontsize=None)[source]#

Visualise the number of cids with data for each xcat at each date.

Parameters:

title (str) – Title of the heatmap. Default is “Number of Available CIDs Heatmap”.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (12, 8).
xcats (List[str], optional) – A list of xcats to include in the heatmap. Default is None.
xcat_labels (Dict[str, str], optional) – A dictionary which renames xcats for plotting. Default is None.
start_date (str or pd.Timestamp, optional) – Show data from this date onwards. Default is None.
tick_fontsize (int, optional) – Font size of the ticks on the heatmap. Default is None.
title_fontsize (int, optional) – Font size of the title of the heatmap. Default is None.

Return type:

None

correlations_heatmap(name, feature_name, title=None, cap=None, ftrs_renamed=None, figsize=(12, 8))[source]#

Method to visualise correlations between features entering a model, and those that entered a preprocessing pipeline.

Parameters:

name (str) – Name of the signal optimization process.
feature_name (str) – Name of the feature passed into the final predictor.
title (str, optional) – Title of the heatmap. Default is None. This creates a figure title of the form “Correlation Heatmap for feature {feature_name} and pipeline {name}”.
cap (int, optional) – Maximum number of correlations to display. Default is None. The chosen features are the ‘cap’ most highly correlated.
ftrs_renamed (dict, optional) – Dictionary to rename the feature names for visualisation in the plot axis. Default is None, which uses the original feature names.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (12, 8).

Notes

This method displays the correlation between a feature that is about to be entered into a final predictor and the cap most correlated features entered into the original pipeline. This information is contained within a heatmap.

In a given pipeline, the features that enter it can be transformed in any way. Sometimes the transformation is non-trivial, resulting in a feature space that is not easily interpretable. This method allows the user to see how the original features are correlated with the features that enter the final model, providing insight into the transformation process.

As an example, dimensionality reduction techniques such as PCA and LDA rotate the feature space, resulting in factors that can be hard to interpret. A neural network aims to learn a non-linear transformation of the feature space, which can also be hard to interpret. This method allows the user to see how the original features are correlated with the transformed features, providing insight into the transformation that took place.
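The underlying quantity is a correlation between each original feature and one transformed feature. The sketch below illustrates this on toy data under stated assumptions: the feature names, the transformed feature (a hypothetical weighted combination) and the ranking by absolute correlation are all illustrative, not taken from the library.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical features entering the pipeline.
original = {
    "growth": [1.0, 2.0, 3.0, 4.0],
    "credit": [1.0, 3.0, 2.0, 4.0],
    "inflation": [2.0, 1.0, 4.0, 3.0],
}
# Hypothetical transformed feature entering the final predictor.
factor = [0.8 * g + 0.2 * c for g, c in zip(original["growth"], original["credit"])]

correlations = {name: pearson(values, factor) for name, values in original.items()}
cap = 2
# Assumption: "most correlated" is ranked by absolute correlation.
top = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)[:cap]
```

Here the factor loads mostly on growth, which shows up as the highest correlation even though the factor itself carries no interpretable name.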

feature_importance_timeplot(name, ftrs=None, title=None, ftrs_renamed=None, figsize=(10, 6), title_fontsize=None, label_fontsize=None, tick_fontsize=None)[source]#

Visualise time series of feature importances for the final predictor in a given pipeline, when available.

Parameters:

name (str) – Name of the previously run signal optimization process.
ftrs (list, optional) – List of feature names to plot. Default is None.
title (str, optional) – Title of the plot. Default is None. This creates a figure title of the form “Feature importances for pipeline: {name}”.
ftrs_renamed (dict, optional) – Dictionary to rename the feature names for visualisation in the plot legend. Default is None, which uses the original feature names.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (10, 6).
title_fontsize (int, optional) – Font size for the title. Default is None.
label_fontsize (int, optional) – Font size for the axis labels. Default is None.
tick_fontsize (int, optional) – Font size for the axis ticks. Default is None.

Notes

This method displays the time series of feature importances for a given pipeline, when available. Availability depends on whether the final predictor in the pipeline has either a coef_ or feature_importances_ attribute. This information is contained within a line plot. By default, the feature importance columns are sorted in ascending order of the number of NAs, which accounts for a possible feature selection module in the pipeline, and the feature importances for the first 10 features in the sorted order are plotted. Specifying a ftrs list (of at most 10 elements) overrides this default behaviour.

By sorting by NAs, the plot displays the model feature importances for either the first 10 features in the dataframe or, when a feature selection module was present, the 10 most frequently selected features.
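The default feature choice can be sketched as a stable sort on NA counts. This is an illustration only: the helper default_features is hypothetical, importances are held in a plain dictionary of lists, and None stands in for NA (for example, a date on which a feature was not selected).

```python
def default_features(importances, max_features=10):
    """Order features by ascending NA count and keep the first max_features."""
    na_counts = {
        feature: sum(value is None for value in values)
        for feature, values in importances.items()
    }
    # sorted() is stable, so features with equal NA counts keep their original order.
    ordered = sorted(importances, key=lambda feature: na_counts[feature])
    return ordered[:max_features]


# Hypothetical importances over three retraining dates.
importances = {
    "f1": [0.1, None, None],
    "f2": [0.2, 0.3, 0.1],
    "f3": [None, 0.5, 0.4],
}
all_sorted = default_features(importances)
top_two = default_features(importances, max_features=2)
```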

intercepts_timeplot(name, title=None, figsize=(10, 6))[source]#

Visualise time series of intercepts for a given pipeline, when available.

Parameters:

name (str) – Name of the previously run signal optimization process.
title (str, optional) – Title of the plot. Default is None. This creates a figure title of the form “Intercepts for pipeline: {name}”.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (10, 6).

Notes

This method displays the time series of intercepts for a given pipeline, when available. This information is contained within a line plot.

coefs_stackedbarplot(name, ftrs=None, title=None, cap=None, ftrs_renamed=None, figsize=(10, 6), title_fontsize=None, label_fontsize=None, tick_fontsize=None)[source]#

Visualise feature coefficients for a given pipeline in a stacked bar plot.

Parameters:

name (str) – Name of the previously run signal optimization process.
ftrs (list, optional) – List of feature names to plot. Default is None.
title (str, optional) – Title of the plot. Default is None. This creates a figure title of the form “Stacked bar plot of model coefficients: {name}”.
cap (int, optional) – Maximum number of features to display. Default is None. The chosen features are the ‘cap’ most frequently occurring in the pipeline. This cannot exceed 10.
ftrs_renamed (dict, optional) – Dictionary to rename the feature names for visualisation in the plot legend. Default is None, which uses the original feature names.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (10, 6).
title_fontsize (int, optional) – Font size for the title. Default is None.
label_fontsize (int, optional) – Font size for the axis labels. Default is None.
tick_fontsize (int, optional) – Font size for the axis ticks. Default is None.

Notes

This method displays the average feature coefficients for a given pipeline in each calendar year, when available. This information is contained within a stacked bar plot. By default, the first 10 features are plotted, in the order specified during training. Specifying a ftrs list (of at most 10 elements) overrides this default behaviour.
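The calendar-year averaging behind the stacked bars can be sketched as a simple group-by. This is a minimal illustration, assuming coefficients are available as (ISO date, {feature: coefficient}) records; the helper yearly_average_coefs and the feature names are hypothetical.

```python
from collections import defaultdict


def yearly_average_coefs(records):
    """Average each feature's coefficient within every calendar year."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for date, coefs in records:
        year = int(date[:4])  # ISO 8601 dates start with the year
        for feature, value in coefs.items():
            sums[year][feature] += value
            counts[year][feature] += 1
    return {
        year: {f: sums[year][f] / counts[year][f] for f in sums[year]}
        for year in sorted(sums)
    }


# Hypothetical coefficients at three retraining dates.
records = [
    ("2020-06-30", {"growth": 0.2, "inflation": -0.1}),
    ("2020-12-31", {"growth": 0.4, "inflation": -0.3}),
    ("2021-06-30", {"growth": 0.5, "inflation": 0.1}),
]
averages = yearly_average_coefs(records)
```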

nsplits_timeplot(name, title=None, figsize=(10, 6), title_fontsize=None, label_fontsize=None, tick_fontsize=None)[source]#

Method to plot the time series for the number of cross-validation splits used by the signal optimizer.

Parameters:

name (str) – Name of the previously run signal optimization process.
title (str, optional) – Title of the plot. Default is None. This creates a figure title of the form “Stacked bar plot of model coefficients: {name}”.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (10, 6).
title_fontsize (int, optional) – Font size for the title. Default is None.
label_fontsize (int, optional) – Font size for the x and y axis labels. Default is None.
tick_fontsize (int, optional) – Font size for the x and y axis ticks. Default is None.
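The number of splits changes over time when split_functions is supplied to the optimizer. The sketch below shows one plausible reading of that parameter, under stated assumptions: the helper adjusted_n_splits, the splitter names and the specific rule (one extra split per 12 iterations) are hypothetical.

```python
def adjusted_n_splits(initial_splits, split_functions, iteration):
    """Add the splits requested by each split function to the initial number."""
    adjusted = {}
    for name, n_splits in initial_splits.items():
        fn = (split_functions or {}).get(name)
        # A value of None means the splitter needs no adjustment.
        extra = fn(iteration) if fn is not None else 0
        adjusted[name] = n_splits + extra
    return adjusted


initial_splits = {"expanding": 3, "kfold": 5}
split_functions = {"expanding": lambda n: n // 12, "kfold": None}

at_start = adjusted_n_splits(initial_splits, split_functions, iteration=0)
after_30 = adjusted_n_splits(initial_splits, split_functions, iteration=30)
```

Growing the number of splits with the iteration count lets the cross-validation use more folds as the training set becomes longer.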

class BetaEstimator(df, xcats, benchmark_return, cids=None, start=None, end=None)[source]#

Bases: BasePanelLearner

Class for sequential beta estimation by learning optimal regression coefficients. Out-of-sample hedged returns are additionally calculated and stored.

Parameters:

df (pd.DataFrame) – Daily quantamental dataframe in JPMaQS format containing a panel of features, as well as a panel of returns.
xcats (str or list) – Name of a market return category within the panel specified in df.
benchmark_return (str) – Name of the benchmark return ticker within the panel specified in df.
cids (list, optional) – List of cross-sections for which hedged returns are to be calculated. Default is None, which calculates hedged returns for all cross-sections in the return panel.
start (str, optional) – Start date for considered data in subsequent analysis in ISO 8601 format. Default is None i.e. the earliest date in the dataframe.
end (str, optional) – End date for considered data in subsequent analysis in ISO 8601 format. Default is None i.e. the latest date in the dataframe.

Notes

The BetaEstimator class is used to sequentially estimate macro betas based on a panel of contract returns (provided in xcats) and a benchmark return ticker (provided in benchmark_return). The initial conditions of the learning process are specified by the dimensions of an initial training set. An optimal model is selected out of a provided collection (with associated hyperparameters), a beta is extracted for each cross-section (subject to availability) and out-of-sample hedged returns are calculated for each cross-section with an estimated beta. The betas and hedged returns are stored, and the training set is expanded to include the now-realized returns. This process is repeated until the end of the dataset is reached.

In addition to storing betas and hedged returns, this class also stores useful model selection information for analysis, such as the models selected at each point in time.

The optimal model is the one with the largest composite score.
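The link between an estimated beta and an out-of-sample hedged return can be sketched with a univariate regression. This is an illustration under stated assumptions, not the class's implementation: a plain OLS slope stands in for whichever model is selected, the hedged return is assumed to be the contract return minus beta times the benchmark return, and all numbers are made up.

```python
def ols_beta(returns, benchmark):
    """OLS slope of contract returns on benchmark returns."""
    n = len(returns)
    mean_r, mean_b = sum(returns) / n, sum(benchmark) / n
    cov = sum((r - mean_r) * (b - mean_b) for r, b in zip(returns, benchmark))
    var = sum((b - mean_b) ** 2 for b in benchmark)
    return cov / var


def hedged_returns(returns, benchmark, beta):
    """Assumption: hedged return = contract return - beta * benchmark return."""
    return [r - beta * b for r, b in zip(returns, benchmark)]


# Training sample: the contract moves half as much as the benchmark.
benchmark_train = [0.01, -0.02, 0.03, 0.00]
contract_train = [0.5 * b + 0.001 for b in benchmark_train]
beta = ols_beta(contract_train, benchmark_train)

# The beta is then applied to returns realised after the training sample.
benchmark_test = [0.02, -0.01]
contract_test = [0.012, -0.004]
hedged = hedged_returns(contract_test, benchmark_test, beta)
```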

estimate_beta(beta_xcat, hedged_return_xcat, models, hyperparameters, scorers, inner_splitters, search_type='grid', normalize_fold_results=False, cv_summary='mean', include_train_folds=False, min_cids=4, min_periods=36, est_freq='D', max_periods=None, split_functions=None, n_iter=None, n_jobs_outer=-1, n_jobs_inner=1)[source]#

Determines optimal model betas and associated out-of-sample hedged returns.

Parameters:

beta_xcat (str) – Category name for the panel of estimated betas.
hedged_return_xcat (str) – Category name for the panel of out-of-sample hedged returns.
models (dict) – Dictionary of models to choose from. The keys are model names and the values are scikit-learn compatible models.
hyperparameters (dict) – Dictionary of hyperparameters to choose from. The keys are model names and the values are hyperparameter dictionaries for the corresponding model. The keys must match with those provided in models.
scorers (dict) – Dictionary of scoring functions to use in the hyperparameter optimization process. The keys are scorer names and the values are scikit-learn compatible scoring functions.
inner_splitters (dict) – Dictionary of inner splitters to use in the hyperparameter optimization process. The keys are splitter names and the values are scikit-learn compatible cross-validator objects.
search_type (str) – Type of hyperparameter optimization to perform. Default is “grid”. Options are “grid” and “prior”.
normalize_fold_results (bool) – Whether to normalize the scores across folds before combining them. Default is False.
cv_summary (str or callable, optional) – Summary function to use to combine scores across cross-validation folds. Default is “mean”. Options are “mean”, “median”, “mean-std”, “mean/std”, “mean-std-ge” or a callable function.
include_train_folds (bool, optional) – Whether to calculate cross-validation statistics on the training folds in addition to the test folds. If True, the cross-validation estimator will be a function of both training data and test data. It is recommended to set cv_summary appropriately. Default is False.
min_cids (int) – Minimum number of cross-sections required for the initial training set. Default is 4.
min_periods (int) – Minimum number of periods required for the initial training set, in units of the frequency freq specified in the constructor. Default is 36.
est_freq (str) – Frequency at which models are refreshed. This corresponds to the forward frequency of the out-of-sample hedged returns and the frequency at which betas are estimated. Default is “D”.
max_periods (int) – Maximum length of each training set in units of the frequency freq specified in the constructor. Default is None, in which case the sequential optimization uses expanding training sets, as opposed to rolling windows.
split_functions (dict, optional) – Dict of callables for determining the number of cross-validation splits to add to the initial number as a function of the number of iterations passed in the sequential learning process. Default is None. The keys must correspond to the keys in inner_splitters and should be set to None for any splitters that do not require splitter adjustment.
n_iter (int, optional) – Number of iterations to run in random hyperparameter search. Default is None.
n_jobs_outer (int, optional) – Number of jobs to run in parallel for the outer sequential loop. Default is -1. It is advised for n_jobs_inner * n_jobs_outer (replacing -1 with the number of available cores) to be less than or equal to the number of available cores on the machine.
n_jobs_inner (int, optional) – Number of jobs to run in parallel for the inner loop. Default is 1. It is advised for n_jobs_inner * n_jobs_outer (replacing -1 with the number of available cores) to be less than or equal to the number of available cores on the machine.
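
The interaction between min_periods and max_periods described above can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the library's splitter: the helper name training_window is hypothetical, periods are plain integer indices, and the min_cids requirement is ignored.

```python
def training_window(n_available, min_periods=36, max_periods=None):
    """Return (start, end) period indices of the training set, or None.

    Hypothetical helper: periods are numbered 0..n_available-1 and the
    window is half-open, i.e. it covers periods start..end-1.
    """
    if n_available < min_periods:
        # Not enough history yet to form an initial training set.
        return None
    if max_periods is None:
        # Expanding window: always start from the first period.
        return (0, n_available)
    # Rolling window: keep only the most recent max_periods periods.
    return (max(0, n_available - max_periods), n_available)


expanding = [training_window(n) for n in (35, 36, 60)]
rolling = [training_window(n, max_periods=48) for n in (35, 36, 60)]
```

With 60 periods of history, the expanding window covers all 60 periods, whereas the rolling window with max_periods=48 drops the first 12.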

Stores characteristics of the optimal model at each retraining date.

Parameters:

pipeline_name (str) – Name of the signal optimization process.
optimal_model (BaseRegressionSystem or VotingRegressor) – Optimal model selected at each retraining date.
optimal_model_name (str) – Name of the optimal model.
optimal_model_score (float) – Cross-validation score for the optimal model.
optimal_model_params (dict) – Chosen hyperparameters for the optimal model.
inner_splitters_adj (dict) – Dictionary of adjusted inner splitters.
X_train (pd.DataFrame) – Training feature matrix.
y_train (pd.Series) – Training response variable.
X_test (pd.DataFrame) – Test feature matrix.
y_test (pd.Series) – Test response variable.
timestamp (pd.Timestamp) – Timestamp of the retraining date.
adjusted_test_index (pd.MultiIndex) – Adjusted test index to account for lagged features.

Returns:

Dictionary containing the betas and hedged returns determined at the given retraining date.

Return type:

dict
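
As a rough illustration of what an out-of-sample hedged return means, the sketch below estimates a beta on a training window and applies it to later data. It assumes a univariate least-squares beta and a hedged return defined as the contract return minus beta times the benchmark return. The function names are hypothetical, and the class itself selects among the regression models passed to estimate_beta rather than using this fixed estimator.

```python
def ols_beta(returns, benchmark):
    """Slope of a univariate least-squares regression of returns on benchmark."""
    n = len(returns)
    mean_r = sum(returns) / n
    mean_b = sum(benchmark) / n
    cov = sum((r - mean_r) * (b - mean_b) for r, b in zip(returns, benchmark))
    var = sum((b - mean_b) ** 2 for b in benchmark)
    return cov / var


def hedged_returns(returns, benchmark, beta):
    """Return series net of the benchmark exposure implied by beta."""
    return [r - beta * b for r, b in zip(returns, benchmark)]


# Training window: used only to estimate the beta (toy values).
train_benchmark = [0.01, -0.02, 0.03, 0.00]
train_returns = [0.02, -0.04, 0.06, 0.00]  # exactly twice the benchmark

beta = ols_beta(train_returns, train_benchmark)

# Out-of-sample period: the beta is applied to data it was not fitted on.
test_benchmark = [0.01, -0.01]
test_returns = [0.025, -0.015]
hedged = hedged_returns(test_returns, test_benchmark, beta)
```

Because the beta is fixed before the test period begins, the hedged series contains no look-ahead information.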

evaluate_hedged_returns(hedged_return_xcat=None, cids=None, correlation_types='pearson', title=None, start=None, end=None, blacklist=None, freqs='M')[source]#

Method to determine and display a table of average absolute correlations between the benchmark return and the computed hedged returns within the class instance, over all cross-sections in the panel. Additionally, the correlation table displays the same results for the unhedged return specified in the class instance for comparison purposes.

The returned dataframe will be multi-indexed by (benchmark return, return, frequency) and will contain each computed absolute correlation coefficient on each column.

Parameters:

hedged_return_xcat (str or list, optional) – Hedged returns to be evaluated. Default is None, which evaluates all hedged returns within the class instance.
cids (str or list, optional) – Cross-sections for which evaluation of hedged returns takes place. Default is None, which evaluates all cross-sections within the class instance.
correlation_types (str or list, optional) – Types of correlations to calculate. Options are “pearson”, “spearman” and “kendall”. If None, all three are calculated. Default is “pearson”.
title (str, optional) – Title for the correlation table. If None, the default title is “Average absolute correlations between each return and the chosen benchmark”. Default is None.
start (str, optional) – String in ISO format. Default is None.
end (str, optional) – String in ISO format. Default is None.
blacklist (dict, optional) – Dictionary of tuples of start and end dates to exclude from the evaluation. Default is None.
freqs (str or list, optional) – Letters denoting all frequencies at which the correlations may be calculated. This must be a selection of “D”, “W”, “M”, “Q” and “A”. Default is “M”. Each return series will always be summed over the sample period.

Returns:

A dataframe of average absolute correlations between the benchmark return and the computed hedged returns.

Return type:

pd.DataFrame
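
The statistic reported by this method can be illustrated with a small standalone sketch. It assumes Pearson correlation and a simple unweighted average over cross-sections. The helper names and toy data are hypothetical, and the resampling to the frequencies in freqs is omitted.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_abs_correlation(benchmark, returns_by_cid):
    """Mean of |corr(benchmark, return)| over cross-sections."""
    corrs = [abs(pearson(benchmark, r)) for r in returns_by_cid.values()]
    return sum(corrs) / len(corrs)


benchmark = [1.0, 2.0, 3.0, 4.0]
unhedged = {
    "AUD": [2.0, 4.0, 6.0, 8.0],  # perfectly positively correlated
    "CAD": [4.0, 3.0, 2.0, 1.0],  # perfectly negatively correlated
}
hedged = {
    "AUD": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with the benchmark
    "CAD": [-1.0, 1.0, 1.0, -1.0],
}

avg_unhedged = average_abs_correlation(benchmark, unhedged)
avg_hedged = average_abs_correlation(benchmark, hedged)
```

Taking absolute values before averaging matters here: the signed correlations of the two unhedged series would cancel to zero and hide the benchmark exposure.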

get_hedged_returns(hedged_return_xcat=None)[source]#

Returns a dataframe of out-of-sample hedged returns derived from beta estimation processes held within the class instance.

Parameters:

hedged_return_xcat (str or list, optional) – Category name or list of category names for the panel of derived hedged returns. If None, information from all beta estimation processes held within the class instance is returned. Default is None.

Returns:

A dataframe of out-of-sample hedged returns derived from beta estimation processes.

Return type:

pd.DataFrame

get_betas(beta_xcat=None)[source]#

Returns a dataframe of estimated betas derived from beta estimation processes held within the class instance.

Parameters:

beta_xcat (str or list, optional) – Category name or list of category names for the panel of estimated contract betas. If None, information from all beta estimation processes held within the class instance is returned. Default is None.

Returns:

A dataframe of estimated betas derived from beta estimation processes.

Return type:

pd.DataFrame

class ReturnForecaster(df, xcats, real_date, cids=None, blacklist=None, freq='M', lag=1, xcat_aggs=['last', 'sum'], generate_labels=None)[source]#

Bases: BasePanelLearner

Class to produce return forecasts for a single forward frequency, based on the indicator states at a specific date.

Parameters:

df (pd.DataFrame) – Daily quantamental dataframe in JPMaQS format containing a panel of features, as well as a panel of returns.
xcats (list) – List comprising feature names, with the last element being the response variable name. The features and the response variable must be categories in the dataframe.
real_date (str) – Date in ISO 8601 format at which time a forward forecast is made based on the information states on that day.
cids (list, optional) – List of cross-section identifiers for consideration in the panel. Default is None, in which case all cross-sections in df are considered.
blacklist (dict, optional) – Blacklisting dictionary specifying date ranges for which cross-sectional information should be excluded. The keys are cross-sections and the values are tuples of start and end dates in ISO 8601 format. Default is None.
freq (str, optional) – Frequency of the analysis. Default is “M” for monthly.
lag (int, optional) – Number of periods to lag the response variable. Default is 1.
xcat_aggs (list, optional) – List of aggregation functions to apply to the features, used when freq is not D. Default is [“last”, “sum”].
generate_labels (callable, optional) – Function to transform the response variable into either alternative regression targets or classification labels. Default is None.

Notes

This class is a simple interface to produce a single-period forward forecast. The real_date parameter specifies the date of the information state used to generate the forecast. For example, if the provided date is “2025-03-01”, the frequency is monthly and the lag is 1, the information states on this date are set aside and the preceding data is downsampled to monthly frequency, with the features lagged by one period. Model selection and fitting are performed on this dataset, and the forecast is produced for the single out-of-sample period (March 2025).
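
The data preparation described in this note can be sketched for a single cross-section in plain Python. This is an illustration under assumptions, not the class's implementation: the helper to_monthly and the toy values are hypothetical, "last" is applied to the feature and "sum" to the return (mirroring the default xcat_aggs), and the lag is one period.

```python
def to_monthly(daily, agg):
    """Aggregate (iso_date, value) pairs into a {"YYYY-MM": value} dict."""
    buckets = {}
    for date, value in daily:
        buckets.setdefault(date[:7], []).append(value)
    if agg == "last":
        return {month: values[-1] for month, values in buckets.items()}
    if agg == "sum":
        return {month: sum(values) for month, values in buckets.items()}
    raise ValueError(f"unsupported aggregation: {agg}")


real_date = "2025-03-01"

# Toy daily data for one cross-section (hypothetical values).
feature_daily = [("2025-01-15", 0.1), ("2025-01-31", 0.2),
                 ("2025-02-14", 0.3), ("2025-02-28", 0.4),
                 ("2025-03-01", 0.5)]
return_daily = [("2025-01-15", 1.0), ("2025-01-31", 2.0),
                ("2025-02-14", 3.0), ("2025-02-28", 4.0)]

# Set aside the information state on real_date; train on earlier data only.
# ISO 8601 date strings sort chronologically, so string comparison is safe.
history = [(d, v) for d, v in feature_daily if d < real_date]
latest_state = [v for d, v in feature_daily if d == real_date][0]

features = to_monthly(history, "last")
returns = to_monthly(return_daily, "sum")

# Lag of 1: pair each month's return with the previous month's feature.
months = sorted(features)
training_pairs = [(features[prev], returns[curr])
                  for prev, curr in zip(months, months[1:])]
```

In this toy case the only training pair links the end-of-January feature to the February return, and latest_state is the input held back for the out-of-sample forecast.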

calculate_predictions(name, models, hyperparameters, scorers, inner_splitters, search_type='grid', normalize_fold_results=False, cv_summary='mean', include_train_folds=False, n_iter=None, n_jobs_cv=1, n_jobs_model=1, store_correlations=False)[source]#

Calculate predictions for the out-of-sample period.

Parameters:

name (str) – Name of the signal optimization process.
models (dict) – Dictionary of models to choose from. The keys are model names and the values are scikit-learn compatible models.
hyperparameters (dict) – Dictionary of hyperparameters to choose from. The keys are model names and the values are hyperparameter dictionaries for the corresponding model. The keys must match with those provided in models.
scorers (dict) – Dictionary of scoring functions to use in the hyperparameter optimization process. The keys are scorer names and the values are scikit-learn compatible scoring functions.
inner_splitters (dict) – Dictionary of inner splitters to use in the hyperparameter optimization process. The keys are splitter names and the values are scikit-learn compatible cross-validator objects.
search_type (str, optional) – Type of hyperparameter optimization to perform. Default is “grid”. Options are “grid” and “prior”.
normalize_fold_results (bool, optional) – Whether to normalize the scores across folds before combining them. Default is False.
cv_summary (str or callable, optional) – Summary function to use to combine scores across cross-validation folds. Default is “mean”. Options are “mean”, “median”, “mean-std”, “mean/std”, “mean-std-ge” or a callable function.
include_train_folds (bool, optional) – Whether to calculate cross-validation statistics on the training folds in addition to the test folds. If True, the cross-validation estimator will be a function of both training data and test data. It is recommended to set cv_summary appropriately. Default is False.
n_iter (int, optional) – Number of iterations to run in random hyperparameter search. Default is None.
n_jobs_cv (int, optional) – Number of parallel jobs to run the cross-validation process. Default is 1.
n_jobs_model (int, optional) – Number of parallel jobs to run the model fitting process (if relevant). Default is 1.
store_correlations (bool) – Whether to store the correlations between input pipeline features and input predictor features. Default is False.
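
To show how the choice of cv_summary can change which model is selected, the sketch below combines per-fold scores in a few of the listed ways. It is a standalone illustration: summarize_folds is a hypothetical helper, “mean-std” and “mean/std” are assumed to mean the fold mean minus, or divided by, the population standard deviation, and “mean-std-ge” is omitted because its definition is not given here.

```python
from statistics import mean, median, pstdev


def summarize_folds(scores, cv_summary="mean"):
    """Combine per-fold scores into a single number (hypothetical helper)."""
    if callable(cv_summary):
        return cv_summary(scores)
    if cv_summary == "mean":
        return mean(scores)
    if cv_summary == "median":
        return median(scores)
    if cv_summary == "mean-std":
        # Assumption: penalise dispersion by subtracting the std. deviation.
        return mean(scores) - pstdev(scores)
    if cv_summary == "mean/std":
        # Assumption: ratio of mean to std. deviation (undefined if std is 0).
        return mean(scores) / pstdev(scores)
    raise ValueError(f"unsupported cv_summary: {cv_summary}")


# Toy fold scores for two candidate models (higher is better).
fold_scores = {
    "steady": [0.4, 0.6, 0.4, 0.6],
    "volatile": [0.0, 1.2, 0.0, 1.2],
}

best_by_mean = max(
    fold_scores, key=lambda m: summarize_folds(fold_scores[m], "mean"))
best_by_mean_std = max(
    fold_scores, key=lambda m: summarize_folds(fold_scores[m], "mean-std"))
```

The model with the higher average score wins under “mean”, while the dispersion-penalised summary prefers the model whose performance is more stable across folds.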

Stores characteristics of the optimal model at each retraining date.

Parameters:

pipeline_name (str) – Name of the signal optimization process.
optimal_model (RegressorMixin, ClassifierMixin or Pipeline) – Optimal model selected at each retraining date.
optimal_model_name (str) – Name of the optimal model.
optimal_model_score (float) – Cross-validation score for the optimal model.
optimal_model_params (dict) – Chosen hyperparameters for the optimal model.
optimal_model_additional_data (dict) – Additional data returned by the hyperparameter optimization process.
inner_splitters_adj (dict) – Dictionary of adjusted inner splitters.
X_train (pd.DataFrame) – Training feature matrix.
y_train (pd.Series) – Training response variable.
X_test (pd.DataFrame) – Test feature matrix.
y_test (pd.Series) – Test response variable.
timestamp (pd.Timestamp) – Timestamp of the retraining date.
adjusted_test_index (pd.MultiIndex) – Adjusted test index to account for lagged features.

Returns:

Dictionary containing feature importance scores, intercepts, selected features and correlations between inputs to pipelines and those entered into a final model.

Return type:

dict

get_optimized_signals(name=None)[source]#

Returns forward forecasts for one or more pipelines.

Parameters:

name (str or list, optional) – Label(s) of forecast(s). Default is all stored in the class instance.

Returns:

Pandas dataframe in JPMaQS format of working daily predictions.

Return type:

pd.DataFrame

get_selected_features(name=None)[source]#

Returns the selected features for one or more pipelines.

Parameters:

name (str or list, optional) – Label(s) of pipeline(s). Default is all stored in the class instance.

Returns:

Pandas dataframe of the selected features at each retraining date.

Return type:

pd.DataFrame

get_feature_importances(name=None)[source]#

Returns feature importances for one or more pipelines.

Parameters:

name (str or list, optional) – Label(s) of pipeline(s). Default is all stored in the class instance.

Returns:

Pandas dataframe of the feature importances, if available, learnt at each retraining date for a given pipeline.

Return type:

pd.DataFrame

Notes

Availability of feature importances is subject to the selected model having a feature_importances_ or coef_ attribute.
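
The availability rule above amounts to an attribute check on the fitted model. The sketch below shows one way such a check could look. The helper extract_importances and the stand-in classes are hypothetical, and the order in which the two attributes are tried is an assumption.

```python
def extract_importances(model):
    """Return importances from a fitted model, or None if unavailable."""
    for attr in ("feature_importances_", "coef_"):
        if hasattr(model, attr):
            return list(getattr(model, attr))
    return None


# Minimal stand-ins for fitted estimators (not real scikit-learn models).
class TreeLike:
    feature_importances_ = [0.7, 0.3]


class LinearLike:
    coef_ = [1.5, -0.2]


class NeighboursLike:
    pass  # exposes neither attribute


tree_imp = extract_importances(TreeLike())
linear_imp = extract_importances(LinearLike())
missing_imp = extract_importances(NeighboursLike())
```

A model exposing neither attribute yields no importances for its retraining dates.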

get_intercepts(name=None)[source]#

Returns intercepts for one or more pipelines.

Parameters:

name (str or list, optional) – Label(s) of pipeline(s). Default is all stored in the class instance.

Returns:

Pandas dataframe of the intercepts, if available, learnt at each retraining date for a given pipeline.

Return type:

pd.DataFrame

get_feature_correlations(name=None)[source]#

Returns dataframe of feature correlations for one or more pipelines.

Parameters:

name (str or list, optional) – Label(s) of the pipeline(s). Default is all stored in the class instance.

Returns:

Pandas dataframe of the correlations between the features passed into a model pipeline and the post-processed features input to the final model.

Return type:

pd.DataFrame

macrosynergy.learning.sequential#

Submodules#