macrosynergy.learning.sequential.signal_optimizer

Class to determine and store sequentially-optimized panel forecasts based on statistical machine learning.

class SignalOptimizer(df, xcats, cids=None, n_targets=1, start=None, end=None, blacklist=None, freq='M', lag=1, xcat_aggs=['last', 'sum'], generate_labels=None, drop_nas=True)

Bases: BasePanelLearner

Class for sequential optimization of return forecasts based on panels of quantamental features.

Parameters:

df (pd.DataFrame) – Daily quantamental dataframe in JPMaQS format containing a panel of features, as well as a panel of returns.
xcats (list) – List comprising feature names, with the last n_targets elements being the response variable name(s). The features and the response variable(s) must be categories in the dataframe.
cids (list) – List of cross-section identifiers for consideration in the panel. Default is None, in which case all cross-sections in df are considered.
n_targets (int) – Number of response variables to consider. Default is 1.
start (str) – Start date for considered data in subsequent analysis in ISO 8601 format. Default is None i.e. the earliest date in the dataframe.
end (str) – End date for considered data in subsequent analysis in ISO 8601 format. Default is None i.e. the latest date in the dataframe.
blacklist (dict) – Blacklisting dictionary specifying date ranges for which cross-sectional information should be excluded. The keys are cross-sections and the values are tuples of start and end dates in ISO 8601 format. Default is None.
freq (str) – Frequency of the analysis. Default is “M” for monthly.
lag (int) – Number of periods to lag the response variable. Default is 1.
xcat_aggs (list) – List of aggregation functions to apply to the features, used when freq is not “D”. Default is [“last”, “sum”].
generate_labels (callable) – Function to transform the response variable into either alternative regression targets or classification labels. Default is None.
drop_nas (bool, str) – Strategy for dealing with NaNs in the data. Valid arguments are True, False, “X”, or “y”. If True, then all NaNs are removed from the data. “X” means only NaNs in independent variables are dropped. “y” means only NaNs in dependent variables are dropped. Default is True.
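As an illustration of the blacklist format described above, the following minimal sketch filters hypothetical long-format rows against a blacklist dictionary. The row layout, the helper name is_blacklisted and the inclusive treatment of both endpoints are assumptions made for the example, not the library's implementation.

```python
from datetime import date

# Hypothetical blacklist: cross-section -> (start, end) in ISO 8601 format.
blacklist = {"TRY": ("2020-01-01", "2021-12-31")}

# Hypothetical long-format rows: (cid, real_date, value).
rows = [
    ("TRY", "2019-12-31", 0.1),
    ("TRY", "2020-06-30", 0.2),
    ("USD", "2020-06-30", 0.3),
]


def is_blacklisted(cid, real_date, blacklist):
    """Return True if real_date lies in the excluded range for cid.

    Both endpoints are treated as inclusive (an assumption of this sketch).
    """
    if cid not in blacklist:
        return False
    start, end = (date.fromisoformat(d) for d in blacklist[cid])
    return start <= date.fromisoformat(real_date) <= end


# Keep only the rows that fall outside every blacklisted range.
kept = [row for row in rows if not is_blacklisted(row[0], row[1], blacklist)]
```

Only the TRY observation dated inside the excluded range is dropped; other cross-sections and dates are untouched.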

Notes

The SignalOptimizer class is used to predict the response variable(s), usually a panel of asset class returns, based on a panel of features that are lagged by a specified number of periods. This is done in a sequential manner, by specifying the size of an initial training set, choosing an optimal model out of a provided collection (with associated hyperparameters), forecasting the return panel(s), and then expanding the training set to include the now-realized returns. The process continues until the end of the dataset is reached.
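The sequential scheme described above can be sketched on plain period indices. This is a simplified illustration only: the function name sequential_splits is hypothetical, cross-sections are ignored, and the arguments merely mirror the min_periods, test_size and max_periods parameters of calculate_predictions.

```python
def sequential_splits(n_periods, min_periods=36, test_size=1, max_periods=None):
    """Yield (train, test) lists of period indices.

    The training set starts with min_periods periods and grows by
    test_size periods after each forecast. If max_periods is given,
    only the most recent max_periods periods are kept (rolling window).
    """
    train_end = min_periods
    while train_end < n_periods:
        train_start = 0 if max_periods is None else max(0, train_end - max_periods)
        test_end = min(train_end + test_size, n_periods)
        yield list(range(train_start, train_end)), list(range(train_end, test_end))
        # The periods just forecast are now realized and join the training set.
        train_end = test_end


# Expanding windows: 40 periods, first training set of 36, retrain every 2.
expanding = list(sequential_splits(n_periods=40, min_periods=36, test_size=2))

# Rolling windows: same setup, but each training set holds at most 12 periods.
rolling = list(sequential_splits(n_periods=40, min_periods=36, test_size=2, max_periods=12))
```

In the expanding case the second training set contains periods 0 to 37; in the rolling case the first training set contains only periods 24 to 35.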

In addition to storing forecasts, this class also stores useful information for analysis such as the models selected at each point in time, the feature coefficients and intercepts (where relevant) of selected models, as well as the features selected by any feature selection modules.

Model and hyperparameter selection is performed by cross-validation. Given a collection of models and associated hyperparameters to choose from, a hyperparameter optimization (HPO) is run to determine the optimal choice; currently only grid search and random search are supported. This requires a collection of scikit-learn compatible scoring functions and a collection of scikit-learn compatible cross-validation splitters. At each point in time, the cross-validation folds are the union of the folds produced by each splitter provided. Each scorer is evaluated on each test fold and summarised across test folds, either by a custom function provided by the user or by a common summary named by a string, e.g. ‘mean’.
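A minimal sketch of the fold summary step, using some of the cv_summary option names listed for calculate_predictions. The function name summarise, the use of the population standard deviation, and the example scores are assumptions; the “mean-std-ge” option is omitted because its definition is not given here.

```python
from statistics import mean, median, pstdev


def summarise(fold_scores, cv_summary="mean"):
    """Collapse one scorer's per-fold test scores into a single number."""
    if callable(cv_summary):
        return cv_summary(fold_scores)
    if cv_summary == "mean":
        return mean(fold_scores)
    if cv_summary == "median":
        return median(fold_scores)
    if cv_summary == "mean-std":
        return mean(fold_scores) - pstdev(fold_scores)
    if cv_summary == "mean/std":
        # Raises ZeroDivisionError if all fold scores are identical.
        return mean(fold_scores) / pstdev(fold_scores)
    raise ValueError(f"Unsupported cv_summary: {cv_summary!r}")


# Hypothetical test-fold scores of one scorer, grouped by splitter name.
folds_by_splitter = {"expanding": [1.0, 2.0], "kfold": [3.0]}

# The folds of all splitters are pooled (their union) before summarising.
all_scores = [s for scores in folds_by_splitter.values() for s in scores]

mean_score = summarise(all_scores, "mean")
penalised_score = summarise(all_scores, "mean-std")
worst_fold = summarise(all_scores, min)  # a user-supplied callable
```

Summaries such as “mean-std” penalise candidates whose performance varies strongly across folds, whereas “mean” rewards average performance only.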

Consequently, each model and hyperparameter combination has an associated collection of scores induced by different metrics, in units of those scorers. In order to form a composite score for each model and hyperparameter combination, the scores of each scorer must first be normalized across combinations. This makes scores from different scorers comparable, so that the average of the adjusted scores is a meaningful estimate of each model’s generalization ability. Finally, the composite score for each combination is calculated by averaging its adjusted scores across all scorers.

The optimal model is the one with the largest composite score.
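The selection rule can be illustrated with a small sketch. The normalization method is not specified above, so min-max scaling is used here purely as an assumption, as is the convention that a higher score is better for every scorer. The candidate names and scores are invented for the example.

```python
def composite_scores(raw):
    """Map {candidate: {scorer: score}} to {candidate: composite score}.

    Each scorer's scores are min-max normalized across candidates
    (an assumption of this sketch), then averaged across scorers.
    """
    scorers = sorted({name for scores in raw.values() for name in scores})
    composite = {candidate: 0.0 for candidate in raw}
    for scorer in scorers:
        values = [raw[candidate][scorer] for candidate in raw]
        low, span = min(values), max(values) - min(values)
        for candidate in raw:
            # If every candidate ties on this scorer, it contributes nothing.
            adjusted = 0.0 if span == 0 else (raw[candidate][scorer] - low) / span
            composite[candidate] += adjusted / len(scorers)
    return composite


# Hypothetical summarised cross-validation scores per candidate and scorer.
raw = {
    "ols": {"r2": 0.10, "sharpe": 0.5},
    "ridge(alpha=1)": {"r2": 0.20, "sharpe": 0.3},
    "ridge(alpha=10)": {"r2": 0.15, "sharpe": 0.7},
}

scores = composite_scores(raw)
best = max(scores, key=scores.get)  # candidate with the largest composite score
```

Here no candidate wins on both metrics, and the composite score favours the one that does reasonably well on each.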

calculate_predictions(name, models, hyperparameters=None, scorers=None, inner_splitters=None, search_type='grid', normalize_fold_results=False, cv_summary='mean', multi_target_fill='zero', include_train_folds=False, min_cids=4, min_periods=36, min_xcats=1, test_size=1, max_periods=None, split_functions=None, store_additional_data=None, n_iter=None, n_jobs_outer=-1, n_jobs_inner=1, store_correlations=False)

Determine forecasts and store relevant quantities over time.

Parameters:

name (str) – Name of the signal optimization process.
models (dict) – Dictionary of models to choose from. The keys are model names and the values are scikit-learn compatible models.
hyperparameters (dict, optional) – Dictionary of hyperparameters to choose from. The keys are model names and the values are hyperparameter dictionaries for the corresponding model. The keys must match with those provided in models. If no hyperparameters are required to be tuned, this parameter can be None. Default is None.
scorers (dict, optional) – Dictionary of scoring functions to use in cross-validation if hyperparameters or models need to be selected. The keys are scorer names and the values are scikit-learn compatible scoring functions. If no cross-validation is required, this parameter can be None. Default is None.
inner_splitters (dict, optional) – Dictionary of inner splitters to use in cross-validation. The keys are splitter names and the values are scikit-learn compatible cross-validator objects. If no cross-validation is required, this parameter can be None. Default is None.
search_type (str, optional) – Type of hyperparameter optimization to perform. Options are “grid” (grid search) and “prior” (random search). Default is “grid”. If no hyperparameter tuning is required, this parameter can be disregarded.
normalize_fold_results (bool, optional) – Whether to normalize the scores across folds before combining them. Default is False. If no hyperparameter tuning is required, this parameter can be disregarded.
cv_summary (str or callable, optional) – Summary function to use to combine scores across cross-validation folds. Default is “mean”. Options are “mean”, “median”, “mean-std”, “mean/std”, “mean-std-ge” or a callable function. If no hyperparameter tuning is required, this parameter can be disregarded.
multi_target_fill (str, optional) – Method to use to fill in predictions for targets with poor availability in the case of multi-target models. Options are “zero” and “mean”. Default is “zero”. If no multi-target models are used, this parameter can be disregarded.
include_train_folds (bool, optional) – Whether to calculate cross-validation statistics on the training folds in addition to the test folds. If True, the cross-validation estimator will be a function of both training data and test data. It is recommended to set cv_summary appropriately. Default is False. If no hyperparameter tuning is required, this parameter can be disregarded.
min_cids (int, optional) – Minimum number of cross-sections required for the initial training set. Default is 4.
min_periods (int, optional) – Minimum number of periods required for the initial training set, in units of the frequency freq specified in the constructor. Default is 36.
min_xcats (int, optional) – Minimum number of xcats required for the initial training set. Default is 1.
test_size (int, optional) – Number of periods to pass before retraining a selected model. Default is 1.
max_periods (int, optional) – Maximum length of each training set in units of the frequency freq specified in the constructor. Default is None, in which case the sequential optimization uses expanding training sets, as opposed to rolling windows.
split_functions (dict, optional) – Dict of callables for determining the number of cross-validation splits to add to the initial number as a function of the number of iterations passed in the sequential learning process. The keys must correspond to the keys in inner_splitters and should be set to None for any splitters that do not require splitter adjustment. Default is None. If no hyperparameter tuning is required, this parameter can be disregarded.
store_additional_data (list, optional) – List of optimal model attributes to store from each optimal model at each retraining date. Default is None.
n_iter (int, optional) – Number of iterations to run in random hyperparameter search. Default is None. If no hyperparameter tuning is required, this parameter can be disregarded.
n_jobs_outer (int, optional) – Number of jobs to run in parallel for the outer sequential loop. Default is -1. It is advised for n_jobs_inner * n_jobs_outer (replacing -1 with the number of available cores) to be less than or equal to the number of available cores on the machine.
n_jobs_inner (int, optional) – Number of jobs to run in parallel for the inner loop. Default is 1. It is advised for n_jobs_inner * n_jobs_outer (replacing -1 with the number of available cores) to be less than or equal to the number of available cores on the machine. If no hyperparameter tuning is required, this parameter can be disregarded.
store_correlations (bool) – Whether to store the correlations between input pipeline features and input predictor features. Default is False.
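To illustrate the split_functions parameter described above, the sketch below adds a number of splits to each splitter's initial count as a function of the iterations passed. The dictionaries, the helper adjusted_n_splits and the particular rule (one extra split every 12 iterations) are hypothetical and only show the expected shape of the argument.

```python
# Hypothetical initial number of splits for each inner splitter.
initial_n_splits = {"expanding": 5, "kfold": 3}

# Keys match those of the inner splitters; None means "no adjustment".
split_functions = {
    "expanding": lambda iteration: iteration // 12,  # one extra split per 12 iterations
    "kfold": None,
}


def adjusted_n_splits(iteration):
    """Return the number of splits per splitter after `iteration` iterations."""
    adjusted = {}
    for name, base in initial_n_splits.items():
        fn = split_functions.get(name)
        adjusted[name] = base if fn is None else base + int(fn(iteration))
    return adjusted


at_start = adjusted_n_splits(0)
after_30 = adjusted_n_splits(30)
```

Increasing the number of splits over time lets the cross-validation make use of the growing training set.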

store_split_data(pipeline_name, optimal_model, optimal_model_name, optimal_model_score, optimal_model_params, inner_splitters_adj, optimal_model_additional_data, X_train, y_train, X_test, y_test, timestamp, adjusted_test_index)

Stores characteristics of the optimal model at each retraining date.

Parameters:

pipeline_name (str) – Name of the signal optimization process.
optimal_model (RegressorMixin, ClassifierMixin or Pipeline) – Optimal model selected at each retraining date.
optimal_model_name (str) – Name of the optimal model.
optimal_model_score (float) – Cross-validation score for the optimal model.
optimal_model_params (dict) – Chosen hyperparameters for the optimal model.
inner_splitters_adj (dict) – Dictionary of adjusted inner splitters.
optimal_model_additional_data (dict) – Additional attributes of the optimal model to store.
X_train (pd.DataFrame) – Training feature matrix.
y_train (pd.Series) – Training response variable.
X_test (pd.DataFrame) – Test feature matrix.
y_test (pd.Series) – Test response variable.
timestamp (pd.Timestamp) – Timestamp of the retraining date.
adjusted_test_index (pd.MultiIndex) – Adjusted test index to account for lagged features.

Returns:

Dictionary containing feature importance scores, intercepts, selected features and correlations between inputs to pipelines and those entered into a final model.

Return type:

dict

get_optimized_signals(name=None)

Returns optimized signals for one or more processes.

Parameters:

name (str or list, optional) – Label(s) of signal optimization process(es). Default is None, in which case all processes stored in the class instance are used.

Returns:

Pandas dataframe in JPMaQS format of working daily predictions.

Return type:

pd.DataFrame

get_selected_features(name=None)

Returns the selected features over time for one or more processes.

Parameters:

name (str or list, optional) – Label(s) of signal optimization process(es). Default is None, in which case all processes stored in the class instance are used.

Returns:

Pandas dataframe of the selected features at each retraining date.

Return type:

pd.DataFrame

get_feature_importances(name=None)

Returns feature importances for a given pipeline.

Parameters:

name (str or list, optional) – Label(s) of signal optimization process(es). Default is None, in which case all processes stored in the class instance are used.

Returns:

Pandas dataframe of the feature importances, if available, learnt at each retraining date for a given pipeline.

Return type:

pd.DataFrame

Notes

Availability of feature importances is subject to the selected model having a feature_importances_ or coef_ attribute.
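The availability rule above amounts to an attribute check on the fitted model. A minimal sketch follows, using stand-in classes instead of real estimators; the helper name extract_importances and the order in which the two attributes are checked are assumptions.

```python
class TreeLike:
    """Stand-in for a fitted model exposing feature_importances_."""
    feature_importances_ = [0.7, 0.3]


class LinearLike:
    """Stand-in for a fitted model exposing coef_."""
    coef_ = [1.5, -0.2]


class NoImportances:
    """Stand-in for a fitted model exposing neither attribute."""


def extract_importances(model):
    """Return the model's importances as a list, or None if unavailable."""
    for attr in ("feature_importances_", "coef_"):
        if hasattr(model, attr):
            return list(getattr(model, attr))
    return None


tree_values = extract_importances(TreeLike())
linear_values = extract_importances(LinearLike())
missing = extract_importances(NoImportances())
```

Models exposing neither attribute yield no importances, which is why the returned dataframe may contain gaps for some retraining dates.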

get_intercepts(name=None)

Returns intercepts for a given pipeline.

Parameters:

name (str or list, optional) – Label(s) of signal optimization process(es). Default is None, in which case all processes stored in the class instance are used.

Returns:

Pandas dataframe of the intercepts, if available, learnt at each retraining date for a given pipeline.

Return type:

pd.DataFrame

get_feature_correlations(name=None)

Returns a dataframe of feature correlations for one or more processes.

Parameters:

name (str or list, optional) – Label(s) of signal optimization process(es). Default is None, in which case all processes stored in the class instance are used.

Returns:

Pandas dataframe of the correlations between the features passed into a model pipeline and the post-processed features inputted into the final model.

Return type:

pd.DataFrame

feature_selection_heatmap(name, remove_blanks=True, title=None, cap=None, ftrs_renamed=None, figsize=(12, 8), tick_fontsize=None)

Visualise the features chosen by the final selector in a scikit-learn pipeline over time, for a given signal optimization process that has been run.

Parameters:

name (str) – Name of the previously run signal optimization process.
remove_blanks (bool, optional) – Whether to remove features from the heatmap that were never selected. Default is True.
title (str, optional) – Title of the heatmap. Default is None. This creates a figure title of the form “Model Selection Heatmap for {name}”.
cap (int, optional) – Maximum number of features to display. Default is None. The chosen features are the ‘cap’ most frequently occurring in the pipeline.
ftrs_renamed (dict, optional) – Dictionary to rename the feature names for visualisation in the plot axis. Default is None, which uses the original feature names.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (12, 8).
tick_fontsize (int, optional) – Font size of the ticks on the heatmap. Default is None.

Notes

This method displays the features selected by the final selector in a scikit-learn pipeline over time, for a given signal optimization process that has been run. This information is contained within a binary heatmap. This does not take into account inherent feature selection within the predictor.
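The data behind such a binary heatmap can be sketched as follows. The input layout (a mapping from retraining date to selected feature names), the feature names and the helper selection_matrix are hypothetical; the remove_blanks and cap arguments mimic the parameters documented above.

```python
# Hypothetical record of the features selected at each retraining date.
selected = {
    "2020-01-31": ["CPI", "GDP"],
    "2020-02-29": ["CPI"],
    "2020-03-31": ["CPI", "GDP"],
}
all_features = ["CPI", "GDP", "PMI"]


def selection_matrix(selected, all_features, remove_blanks=True, cap=None):
    """Return {feature: [0/1 per date]} with dates in ascending order."""
    counts = {f: sum(f in chosen for chosen in selected.values()) for f in all_features}
    # Optionally drop features that were never selected.
    features = [f for f in all_features if counts[f] > 0 or not remove_blanks]
    # Most frequently selected first; ties keep their original order.
    features = sorted(features, key=lambda f: -counts[f])
    if cap is not None:
        features = features[:cap]
    dates = sorted(selected)  # ISO 8601 strings sort chronologically
    return {f: [int(f in selected[d]) for d in dates] for f in features}


matrix = selection_matrix(selected, all_features)
capped = selection_matrix(selected, all_features, cap=1)
with_blanks = selection_matrix(selected, all_features, remove_blanks=False)
```

Each row of the matrix corresponds to one feature and each column to one retraining date, with 1 marking a selected feature.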

available_cid_heatmap(title='Number of Available CIDs Heatmap', figsize=(12, 8), xcats=None, xcat_labels=None, start_date=None, tick_fontsize=None, title_fontsize=None)

Visualise the number of cids with data for each xcat at each date.

Parameters:

title (str) – Title of the heatmap. Default is “Number of Available CIDs Heatmap”.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (12, 8).
xcats (List[str], optional) – A list of xcats to include in the heatmap. Default is None.
xcat_labels (Dict[str, str], optional) – A dictionary which renames xcats for plotting. Default is None.
start_date (str or pd.Timestamp, optional) – Show data from this date onwards. Default is None.
tick_fontsize (int, optional) – Font size of the ticks on the heatmap. Default is None.
title_fontsize (int, optional) – Font size of the title of the heatmap. Default is None.

Return type:

None

correlations_heatmap(name, feature_name, title=None, cap=None, ftrs_renamed=None, figsize=(12, 8))

Method to visualise correlations between features entering a model, and those that entered a preprocessing pipeline.

Parameters:

name (str) – Name of the signal optimization process.
feature_name (str) – Name of the feature passed into the final predictor.
title (str, optional) – Title of the heatmap. Default is None. This creates a figure title of the form “Correlation Heatmap for feature {feature_name} and pipeline {name}”.
cap (int, optional) – Maximum number of correlations to display. Default is None. The chosen features are the ‘cap’ most highly correlated.
ftrs_renamed (dict, optional) – Dictionary to rename the feature names for visualisation in the plot axis. Default is None, which uses the original feature names.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (12, 8).

Notes

This method displays the correlation between a feature that is about to be entered into a final predictor and the cap most correlated features entered into the original pipeline. This information is contained within a heatmap.

In a given pipeline, the features that enter it can be transformed in any way. Sometimes the transformation is non-trivial, resulting in a feature space that is not easily interpretable. This method allows the user to see how the original features are correlated with the features that enter the final model, providing insight into the transformation process.

As an example, dimensionality reduction techniques such as PCA and LDA rotate the feature space, resulting in factors that can be hard to interpret. A neural network aims to learn a non-linear transformation of the feature space, which can also be hard to interpret. This method allows the user to see how the original features are correlated with the transformed features, providing insight into the transformation that took place.
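The idea can be shown with a small sketch that correlates one transformed feature with each original input and keeps the cap most correlated ones. The data, the feature names, the equal-weight "factor" and the ranking by absolute correlation are assumptions chosen for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical original features entering the pipeline.
inputs = {
    "CPI": [1.0, 2.0, 3.0, 4.0],
    "GDP": [2.0, 1.0, 4.0, 3.0],
    "PMI": [1.0, 3.0, 2.0, 4.0],
}

# Hypothetical transformed feature entering the final predictor:
# an equal-weight average of CPI and GDP.
factor = [(c + g) / 2 for c, g in zip(inputs["CPI"], inputs["GDP"])]

# Correlation of the transformed feature with every original feature.
corrs = {name: pearson(series, factor) for name, series in inputs.items()}

# Keep the `cap` original features most correlated in absolute value.
cap = 2
top = sorted(corrs, key=lambda name: abs(corrs[name]), reverse=True)[:cap]
```

The two inputs that were averaged to build the factor come out as the most correlated, which is the kind of insight the heatmap is meant to provide for less transparent transformations.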

feature_importance_timeplot(name, ftrs=None, title=None, ftrs_renamed=None, figsize=(10, 6), title_fontsize=None, label_fontsize=None, tick_fontsize=None)

Visualise time series of feature importances for the final predictor in a given pipeline, when available.

Parameters:

name (str) – Name of the previously run signal optimization process.
ftrs (list, optional) – List of feature names to plot. Default is None.
title (str, optional) – Title of the plot. Default is None. This creates a figure title of the form “Feature importances for pipeline: {name}”.
ftrs_renamed (dict, optional) – Dictionary to rename the feature names for visualisation in the plot legend. Default is None, which uses the original feature names.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (10, 6).
title_fontsize (int, optional) – Font size for the title. Default is None.
label_fontsize (int, optional) – Font size for the axis labels. Default is None.
tick_fontsize (int, optional) – Font size for the axis ticks. Default is None.

Notes

This method displays the time series of feature importances for a given pipeline, when available. Availability depends on whether the final predictor in the pipeline has either a coef_ or feature_importances_ attribute. This information is contained within a line plot. By default, the feature importance columns are sorted in ascending order of the number of NAs, which accounts for a possible feature selection module in the pipeline, and the importances of the first 10 features in that order are plotted. Specifying a ftrs list (of at most 10 elements) overrides this default.

By sorting by NAs, the plot displays the model feature importances for either the first 10 features in the dataframe or, when a feature selection module was present, the 10 most frequently selected features.
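The default ordering can be sketched as follows, with None standing in for an NA at a retraining date where a feature was not selected. The data and the helper default_features are hypothetical.

```python
# Hypothetical importances per feature at three retraining dates.
# None marks dates at which the feature was not selected.
importances = {
    "PMI": [None, None, 0.3],
    "CPI": [0.5, 0.4, 0.6],
    "GDP": [None, 0.2, 0.1],
}


def default_features(importances, limit=10):
    """Return up to `limit` feature names, fewest NAs first."""
    na_counts = {f: sum(v is None for v in vals) for f, vals in importances.items()}
    # sorted() is stable, so features with equal NA counts keep their order.
    return sorted(importances, key=na_counts.get)[:limit]


ordered = default_features(importances)
top_two = default_features(importances, limit=2)
```

Because a feature has an NA exactly when it was not selected, fewer NAs means more frequent selection.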

intercepts_timeplot(name, title=None, figsize=(10, 6))

Visualise time series of intercepts for a given pipeline, when available.

Parameters:

name (str) – Name of the previously run signal optimization process.
title (str, optional) – Title of the plot. Default is None. This creates a figure title of the form “Intercepts for pipeline: {name}”.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (10, 6).

Notes

This method displays the time series of intercepts for a given pipeline, when available. This information is contained within a line plot.

coefs_stackedbarplot(name, ftrs=None, title=None, cap=None, ftrs_renamed=None, figsize=(10, 6), title_fontsize=None, label_fontsize=None, tick_fontsize=None)

Visualise feature coefficients for a given pipeline in a stacked bar plot.

Parameters:

name (str) – Name of the previously run signal optimization process.
ftrs (list, optional) – List of feature names to plot. Default is None.
title (str, optional) – Title of the plot. Default is None. This creates a figure title of the form “Stacked bar plot of model coefficients: {name}”.
cap (int, optional) – Maximum number of features to display. Default is None. The chosen features are the ‘cap’ most frequently occurring in the pipeline. This cannot exceed 10.
ftrs_renamed (dict, optional) – Dictionary to rename the feature names for visualisation in the plot legend. Default is None, which uses the original feature names.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (10, 6).
title_fontsize (int, optional) – Font size for the title. Default is None.
label_fontsize (int, optional) – Font size for the axis labels. Default is None.
tick_fontsize (int, optional) – Font size for the axis ticks. Default is None.

Notes

This method displays the average feature coefficients for a given pipeline in each calendar year, when available. This information is contained within a stacked bar plot. By default, the first 10 features in the order specified during training are plotted, even when more than 10 features were involved in the learning procedure. Specifying a ftrs list (of at most 10 elements) overrides this default.
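The yearly averaging behind the stacked bars can be sketched in plain Python. The input layout (a list of retraining dates paired with coefficient dictionaries), the feature names and the helper yearly_average are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical coefficients stored at each retraining date (ISO 8601).
coefs = [
    ("2020-06-30", {"CPI": 0.2, "GDP": 0.4}),
    ("2020-12-31", {"CPI": 0.4, "GDP": 0.0}),
    ("2021-06-30", {"CPI": 0.1, "GDP": 0.3}),
]


def yearly_average(coefs):
    """Return {year: {feature: mean coefficient over that calendar year}}."""
    by_year = defaultdict(lambda: defaultdict(list))
    for real_date, values in coefs:
        year = int(real_date[:4])  # the year is the first ISO 8601 field
        for feature, value in values.items():
            by_year[year][feature].append(value)
    return {
        year: {feature: mean(vals) for feature, vals in features.items()}
        for year, features in by_year.items()
    }


averages = yearly_average(coefs)
```

Each calendar year then forms one bar, with one stacked segment per feature.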

nsplits_timeplot(name, title=None, figsize=(10, 6), title_fontsize=None, label_fontsize=None, tick_fontsize=None)

Method to plot the time series for the number of cross-validation splits used by the signal optimizer.

Parameters:

name (str) – Name of the previously run signal optimization process.
title (str, optional) – Title of the plot. Default is None, in which case a default title is generated.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (10, 6).
title_fontsize (int, optional) – Font size for the title. Default is None.
label_fontsize (int, optional) – Font size for the x and y axis labels. Default is None.
tick_fontsize (int, optional) – Font size for the x and y axis ticks. Default is None.