macrosynergy.learning.sequential.signal_optimizer
Class to determine and store sequentially-optimized panel forecasts based on statistical machine learning.
- class SignalOptimizer(df, xcats, cids=None, start=None, end=None, blacklist=None, freq='M', lag=1, xcat_aggs=['last', 'sum'], generate_labels=None)
Bases:
BasePanelLearner
Class for sequential optimization of return forecasts based on panels of quantamental features.
- Parameters:
df (pd.DataFrame) – Daily quantamental dataframe in JPMaQS format containing a panel of features, as well as a panel of returns.
xcats (list) – List comprising feature names, with the last element being the response variable name. The features and the response variable must be categories in the dataframe.
cids (list, optional) – List of cross-section identifiers for consideration in the panel. Default is None, in which case all cross-sections in df are considered.
start (str, optional) – Start date for considered data in subsequent analysis in ISO 8601 format. Default is None i.e. the earliest date in the dataframe.
end (str, optional) – End date for considered data in subsequent analysis in ISO 8601 format. Default is None i.e. the latest date in the dataframe.
blacklist (dict, optional) – Blacklisting dictionary specifying date ranges for which cross-sectional information should be excluded. The keys are cross-sections and the values are tuples of start and end dates in ISO 8601 format. Default is None.
freq (str, optional) – Frequency of the analysis. Default is “M” for monthly.
lag (int, optional) – Number of periods to lag the response variable. Default is 1.
xcat_aggs (list, optional) – List of aggregation functions to apply to the features, used when freq is not D. Default is [“last”, “sum”].
generate_labels (callable, optional) – Function to transform the response variable into either alternative regression targets or classification labels. Default is None.
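The shape of the blacklist argument can be hard to picture from the description alone. The following is a minimal sketch of a dictionary of that shape; the cross-section codes and dates are made up, the helper is_blacklisted is hypothetical and not part of the library, and treating both end dates as inclusive is an assumption.

```python
from datetime import date

# Hypothetical blacklist: keys are cross-sections, values are
# (start, end) tuples of ISO 8601 date strings.
blacklist = {
    "TRY": ("2020-01-01", "2021-12-31"),
    "RUB": ("2022-02-01", "2023-06-30"),
}


def is_blacklisted(cid, day, blacklist):
    """Return True if `day` (ISO 8601 string) lies in the excluded range of `cid`.

    Illustrative helper only; both range ends are assumed to be inclusive.
    """
    if cid not in blacklist:
        return False
    start, end = blacklist[cid]
    return date.fromisoformat(start) <= date.fromisoformat(day) <= date.fromisoformat(end)
```

A dictionary like this would be passed as the blacklist argument; cross-sections without an entry are never excluded.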
Notes
The SignalOptimizer class is used to predict the response variable, usually a panel of asset class returns, based on a panel of features that are lagged by a specified number of periods. This is done in a sequential manner, by specifying the size of an initial training set, choosing an optimal model out of a provided collection (with associated hyperparameters), forecasting the return panel, and then expanding the training set to include the now-realized returns. The process continues until the end of the dataset is reached.
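The sequential procedure described above can be sketched as a generator of training and test windows over period indices. This is a simplified illustration, not the library's implementation: it ignores the cross-sectional dimension and model selection, the function name sequential_windows is hypothetical, and its arguments merely mirror the min_periods, test_size and max_periods parameters of calculate_predictions.

```python
def sequential_windows(n_periods, min_periods=36, test_size=1, max_periods=None):
    """Yield (train, test) ranges of period indices 0..n_periods-1.

    Illustrative sketch only: expanding windows by default, rolling
    windows when max_periods is given.
    """
    train_end = min_periods  # exclusive end of the current training set
    while train_end < n_periods:
        train_start = 0
        if max_periods is not None:
            # Rolling window: keep only the most recent max_periods periods.
            train_start = max(0, train_end - max_periods)
        test_end = min(train_end + test_size, n_periods)
        yield range(train_start, train_end), range(train_end, test_end)
        # The forecast periods are now realized and join the training set.
        train_end = test_end


expanding = list(sequential_windows(40))
rolling = list(sequential_windows(40, max_periods=12))
```

With 40 periods and the defaults, the first training set covers periods 0 to 35 and each later one is a single period longer; with max_periods=12 every training set holds the 12 most recent periods.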
In addition to storing forecasts, this class also stores useful information for analysis such as the models selected at each point in time, the feature coefficients and intercepts (where relevant) of selected models, as well as the features selected by any feature selection modules.
Model and hyperparameter selection is performed by cross-validation. Given a collection of models and associated hyperparameters to choose from, a hyperparameter optimization (HPO) is run to determine the optimal choice; currently only grid search and random search are supported. This requires a collection of scikit-learn compatible scoring functions and a collection of scikit-learn compatible cross-validation splitters. At each point in time, the cross-validation folds are the union of the folds produced by each splitter provided. Each scorer is evaluated on each test fold and summarised across test folds by either a custom function provided by the user or a common string, e.g. ‘mean’.
Consequently, each model and hyperparameter combination has an associated collection of scores induced by different metrics, each in the units of its scorer. To form a composite score for each combination, the scores of each scorer are first normalized across model/hyperparameter combinations, which makes them comparable across scorers. The composite score of a combination is then the average of its adjusted scores across all scorers, and serves as an estimate of the generalization ability of that combination.
The optimal model is the one with the largest composite score.
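The scoring steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than the library's code: the text does not say which normalization is applied, so a z-score across candidates is assumed here, all scorers are assumed to follow the "higher is better" convention, and the function name composite_scores, the candidate names and the scores are made up.

```python
from statistics import mean, pstdev


def composite_scores(fold_scores, summary=mean):
    """fold_scores[candidate][scorer] is a list of per-fold test scores."""
    candidates = list(fold_scores)
    scorers = list(next(iter(fold_scores.values())))
    # Step 1: summarise each scorer across test folds.
    summarised = {
        c: {s: summary(fold_scores[c][s]) for s in scorers} for c in candidates
    }
    # Step 2: normalise each scorer across candidates (z-score is an assumption).
    adjusted = {c: {} for c in candidates}
    for s in scorers:
        values = [summarised[c][s] for c in candidates]
        mu, sigma = mean(values), pstdev(values)
        for c in candidates:
            adjusted[c][s] = (summarised[c][s] - mu) / sigma if sigma > 0 else 0.0
    # Step 3: average the adjusted scores across scorers.
    return {c: mean(adjusted[c].values()) for c in candidates}


# Hypothetical per-fold scores for three model/hyperparameter combinations.
fold_scores = {
    "ols": {"r2": [0.10, 0.20], "neg_mae": [-1.0, -1.2]},
    "ridge": {"r2": [0.30, 0.40], "neg_mae": [-0.8, -0.9]},
    "lasso": {"r2": [0.00, 0.10], "neg_mae": [-1.5, -1.4]},
}
scores = composite_scores(fold_scores)
best = max(scores, key=scores.get)  # candidate with the largest composite score
```

Without the normalization step, a scorer measured on a larger scale would dominate the average regardless of how well it discriminates between candidates.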
- calculate_predictions(name, models, hyperparameters, scorers, inner_splitters, search_type='grid', normalize_fold_results=False, cv_summary='mean', min_cids=4, min_periods=36, test_size=1, max_periods=None, split_functions=None, n_iter=None, n_jobs_outer=-1, n_jobs_inner=1, store_correlations=False)
Determine forecasts and store relevant quantities over time.
- Parameters:
name (str) – Name of the signal optimization process.
models (dict) – Dictionary of models to choose from. The keys are model names and the values are scikit-learn compatible models.
hyperparameters (dict) – Dictionary of hyperparameters to choose from. The keys are model names and the values are hyperparameter dictionaries for the corresponding model. The keys must match with those provided in models.
scorers (dict) – Dictionary of scoring functions to use in the hyperparameter optimization process. The keys are scorer names and the values are scikit-learn compatible scoring functions.
inner_splitters (dict) – Dictionary of inner splitters to use in the hyperparameter optimization process. The keys are splitter names and the values are scikit-learn compatible cross-validator objects.
search_type (str, optional) – Type of hyperparameter optimization to perform. Default is “grid”. Options are “grid” and “prior”.
normalize_fold_results (bool, optional) – Whether to normalize the scores across folds before combining them. Default is False.
cv_summary (str or callable, optional) – Summary function to use to combine scores across cross-validation folds. Default is “mean”. Options are “mean”, “median” or a callable function.
min_cids (int, optional) – Minimum number of cross-sections required for the initial training set. Default is 4.
min_periods (int, optional) – Minimum number of periods required for the initial training set, in units of the frequency freq specified in the constructor. Default is 36.
test_size (int, optional) – Number of periods to pass before retraining a selected model. Default is 1.
max_periods (int, optional) – Maximum length of each training set in units of the frequency freq specified in the constructor. Default is None, in which case the sequential optimization uses expanding training sets, as opposed to rolling windows.
split_functions (dict, optional) – Dict of callables for determining the number of cross-validation splits to add to the initial number as a function of the number of iterations passed in the sequential learning process. Default is None. The keys must correspond to the keys in inner_splitters and should be set to None for any splitters that do not require splitter adjustment.
n_iter (int, optional) – Number of iterations to run in random hyperparameter search. Default is None.
n_jobs_outer (int, optional) – Number of jobs to run in parallel for the outer sequential loop. Default is -1. It is advised for n_jobs_inner * n_jobs_outer (replacing -1 with the number of available cores) to be less than or equal to the number of available cores on the machine.
n_jobs_inner (int, optional) – Number of jobs to run in parallel for the inner loop. Default is 1. It is advised for n_jobs_inner * n_jobs_outer (replacing -1 with the number of available cores) to be less than or equal to the number of available cores on the machine.
store_correlations (bool, optional) – Whether to store the correlations between input pipeline features and input predictor features. Default is False.
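The advice on n_jobs_outer and n_jobs_inner amounts to a small piece of arithmetic, sketched below. The helpers effective_jobs and within_core_budget are hypothetical names used for illustration only; they are not part of the library.

```python
import os


def effective_jobs(n_jobs, n_cores):
    """Translate the -1 convention into a concrete worker count."""
    return n_cores if n_jobs == -1 else n_jobs


def within_core_budget(n_jobs_outer, n_jobs_inner, n_cores=None):
    """Check the advice n_jobs_inner * n_jobs_outer <= available cores."""
    if n_cores is None:
        n_cores = os.cpu_count() or 1  # cpu_count() may return None
    total = effective_jobs(n_jobs_outer, n_cores) * effective_jobs(n_jobs_inner, n_cores)
    return total <= n_cores
```

On an 8-core machine the defaults (n_jobs_outer=-1, n_jobs_inner=1) give 8 workers and stay within the budget, whereas n_jobs_outer=-1 with n_jobs_inner=2 would request 16.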
- store_split_data(pipeline_name, optimal_model, optimal_model_name, optimal_model_score, optimal_model_params, inner_splitters_adj, X_train, y_train, X_test, y_test, timestamp, adjusted_test_index)
Stores characteristics of the optimal model at each retraining date.
- Parameters:
pipeline_name (str) – Name of the signal optimization process.
optimal_model (RegressorMixin, ClassifierMixin or Pipeline) – Optimal model selected at each retraining date.
optimal_model_name (str) – Name of the optimal model.
optimal_model_score (float) – Cross-validation score for the optimal model.
optimal_model_params (dict) – Chosen hyperparameters for the optimal model.
inner_splitters_adj (dict) – Dictionary of adjusted inner splitters.
X_train (pd.DataFrame) – Training feature matrix.
y_train (pd.Series) – Training response variable.
X_test (pd.DataFrame) – Test feature matrix.
y_test (pd.Series) – Test response variable.
timestamp (pd.Timestamp) – Timestamp of the retraining date.
adjusted_test_index (pd.MultiIndex) – Adjusted test index to account for lagged features.
- Returns:
Dictionary containing feature importance scores, intercepts, selected features and correlations between inputs to pipelines and those entered into a final model.
- Return type:
dict
- get_selected_features(name=None)
Returns the selected features over time for one or more processes.
- get_feature_importances(name=None)
Returns feature importances for a given pipeline.
- Parameters:
name (str or list, optional) – Label(s) of signal optimization process(es). Default is all stored in the class instance.
- Returns:
Pandas dataframe of the feature importances, if available, learnt at each retraining date for a given pipeline.
- Return type:
pd.DataFrame
Notes
Availability of feature importances is subject to the selected model having a feature_importances_ or coef_ attribute.
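The availability rule above can be illustrated with a simple attribute check. This is a sketch, not the library's logic: the function extract_importances and the stub classes are hypothetical, and checking feature_importances_ before coef_ is an assumed order of precedence.

```python
def extract_importances(model):
    """Return importances from a fitted model, or None if unavailable.

    Illustrative only; the order of the two checks is an assumption.
    """
    if hasattr(model, "feature_importances_"):
        return list(model.feature_importances_)
    if hasattr(model, "coef_"):
        return list(model.coef_)
    return None


class FittedLinearStub:
    """Stand-in for a fitted linear model exposing coef_."""
    coef_ = [0.5, -0.2]


class FittedTreeStub:
    """Stand-in for a fitted tree ensemble exposing feature_importances_."""
    feature_importances_ = [0.7, 0.3]


class FittedNeighborsStub:
    """Stand-in for a fitted model exposing neither attribute."""
```

A model of the last kind yields no importances, so the corresponding entries would be unavailable in the returned dataframe.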
- get_feature_correlations(name=None)
Returns a dataframe of feature correlations for one or more processes.
- Parameters:
name (str or list, optional) – Label(s) of signal optimization process(es). Default is all stored in the class instance.
- Returns:
Pandas dataframe of the correlations between the features passed into a model pipeline and the post-processed features inputted into the final model.
- Return type:
pd.DataFrame
- feature_selection_heatmap(name, remove_blanks=True, title=None, cap=None, ftrs_renamed=None, figsize=(12, 8))
Visualise the features chosen by the final selector in a scikit-learn pipeline over time, for a given signal optimization process that has been run.
- Parameters:
name (str) – Name of the previously run signal optimization process.
remove_blanks (bool, optional) – Whether to remove features from the heatmap that were never selected. Default is True.
title (str, optional) – Title of the heatmap. Default is None. This creates a figure title of the form “Model Selection Heatmap for {name}”.
cap (int, optional) – Maximum number of features to display. Default is None. The chosen features are the ‘cap’ most frequently occurring in the pipeline.
ftrs_renamed (dict, optional) – Dictionary to rename the feature names for visualisation in the plot axis. Default is None, which uses the original feature names.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (12, 8).
Notes
This method displays the features selected by the final selector in a scikit-learn pipeline over time, for a given signal optimization process that has been run. This information is contained within a binary heatmap. This does not take into account inherent feature selection within the predictor.
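The binary heatmap described above is backed by a feature-by-date matrix of zeros and ones. The sketch below shows how such a matrix could be built and how remove_blanks and cap might act on it; the function selection_matrix, the feature names and the dates are hypothetical, and this is not the library's implementation.

```python
def selection_matrix(features, selected_by_date, remove_blanks=True, cap=None):
    """Build a binary matrix from {ISO date: [selected features]}.

    Returns the sorted dates and a dict mapping feature -> row of 0/1 flags.
    """
    dates = sorted(selected_by_date)  # ISO 8601 strings sort chronologically
    matrix = {
        f: [1 if f in selected_by_date[d] else 0 for d in dates] for f in features
    }
    if remove_blanks:
        # Drop features that were never selected.
        matrix = {f: row for f, row in matrix.items() if any(row)}
    if cap is not None:
        # Keep the `cap` most frequently selected features.
        top = sorted(matrix, key=lambda f: sum(matrix[f]), reverse=True)[:cap]
        matrix = {f: matrix[f] for f in top}
    return dates, matrix


features = ["GROWTH", "INFL", "CARRY", "MOM"]
selected_by_date = {
    "2020-01-31": ["GROWTH", "CARRY"],
    "2020-02-29": ["GROWTH"],
    "2020-03-31": ["GROWTH", "CARRY", "INFL"],
}
dates, matrix = selection_matrix(features, selected_by_date)
_, capped = selection_matrix(features, selected_by_date, cap=2)
```

Here MOM is never selected and is dropped by remove_blanks, while cap=2 keeps only GROWTH and CARRY.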
- correlations_heatmap(name, feature_name, title=None, cap=None, ftrs_renamed=None, figsize=(12, 8))
Method to visualise correlations between features entering a model, and those that entered a preprocessing pipeline.
- Parameters:
name (str) – Name of the signal optimization process.
feature_name (str) – Name of the feature passed into the final predictor.
title (str, optional) – Title of the heatmap. Default is None. This creates a figure title of the form “Correlation Heatmap for feature {feature_name} and pipeline {name}”.
cap (int, optional) – Maximum number of correlations to display. Default is None. The chosen features are the ‘cap’ most highly correlated.
ftrs_renamed (dict, optional) – Dictionary to rename the feature names for visualisation in the plot axis. Default is None, which uses the original feature names.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (12, 8).
Notes
This method displays the correlation between a feature that is about to be entered into a final predictor and the cap most correlated features entered into the original pipeline. This information is contained within a heatmap.
In a given pipeline, the features that enter it can be transformed in any way. Sometimes the transformation is non-trivial, resulting in a feature space that is not easily interpretable. This method allows the user to see how the original features are correlated with the features that enter the final model, providing insight into the transformation process.
As an example, dimensionality reduction techniques such as PCA and LDA rotate the feature space, resulting in factors that can be hard to interpret. A neural network aims to learn a non-linear transformation of the feature space, which can also be hard to interpret. This method allows the user to see how the original features are correlated with the transformed features, providing insight into the transformation that took place.
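To make the idea concrete, the sketch below correlates one transformed feature with each original input feature and keeps the cap strongest correlations. It is an illustration only: the helpers pearson and top_correlations are hypothetical, ranking by absolute correlation is an assumption, and the "factor" is simply the sum of two inputs standing in for a real transformation such as PCA.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def top_correlations(inputs, transformed, cap=None):
    """Correlate a transformed feature with every original input feature."""
    corrs = {name: pearson(col, transformed) for name, col in inputs.items()}
    # Rank by absolute correlation (an assumption about "most highly correlated").
    ranked = sorted(corrs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked if cap is None else ranked[:cap]


# Hypothetical original features and a factor built from two of them.
inputs = {
    "GROWTH": [1.0, 2.0, 3.0, 4.0],
    "INFL": [2.0, 1.0, 4.0, 3.0],
    "CARRY": [1.0, -1.0, 1.0, -1.0],
}
factor = [g + i for g, i in zip(inputs["GROWTH"], inputs["INFL"])]
top_two = top_correlations(inputs, factor, cap=2)
```

The two inputs that make up the factor come out on top, while CARRY is uncorrelated with it, which is the kind of insight the heatmap is meant to give for less transparent transformations.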
- feature_importance_timeplot(name, ftrs=None, title=None, ftrs_renamed=None, figsize=(10, 6))
Visualise time series of feature importances for the final predictor in a given pipeline, when available.
- Parameters:
name (str) – Name of the previously run signal optimization process.
ftrs (list, optional) – List of feature names to plot. Default is None.
title (str, optional) – Title of the plot. Default is None. This creates a figure title of the form “Feature importances for pipeline: {name}”.
ftrs_renamed (dict, optional) – Dictionary to rename the feature names for visualisation in the plot legend. Default is None, which uses the original feature names.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (10, 6).
Notes
This method displays the time series of feature importances for a given pipeline, when available. Availability depends on whether the final predictor in the pipeline has either a coef_ or feature_importances_ attribute. This information is contained within a line plot. By default, the feature importance columns are sorted in ascending order of the number of NAs, which accounts for a possible feature selection module in the pipeline, and the feature importances of the first 10 features in that order are plotted. Specifying a ftrs list (of at most 10 elements) overrides this default.
By sorting by NAs, the plot displays the model feature importances for either the first 10 features in the dataframe or, when a feature selection module was present, the 10 most frequently selected features.
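The default ordering can be sketched as follows. This is an illustration, not the library's code: missing importances are represented by None instead of NaN, the function default_features is a hypothetical name, and the feature names and values are made up.

```python
def default_features(importances, max_ftrs=10):
    """Pick the features to plot from {feature: [importance or None, ...]}.

    None marks a retraining date at which the feature was not selected.
    """
    na_counts = {f: sum(v is None for v in vals) for f, vals in importances.items()}
    # sorted() is stable, so ties keep the original column order.
    ordered = sorted(importances, key=lambda f: na_counts[f])
    return ordered[:max_ftrs]


importances = {
    "A": [0.1, None, None],  # selected once
    "B": [0.2, 0.3, 0.1],    # always selected
    "C": [None, 0.4, 0.2],   # selected twice
}
plotted = default_features(importances, max_ftrs=2)
```

Because the sort is stable, a dataframe without any NAs keeps its original column order, which is why the first 10 features are shown when no feature selection took place.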
- intercepts_timeplot(name, title=None, figsize=(10, 6))
Visualise time series of intercepts for a given pipeline, when available.
- Parameters:
name (str) – Name of the previously run signal optimization process.
title (str, optional) – Title of the plot. Default is None. This creates a figure title of the form “Intercepts for pipeline: {name}”.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (10, 6).
Notes
This method displays the time series of intercepts for a given pipeline, when available. This information is contained within a line plot.
- coefs_stackedbarplot(name, ftrs=None, title=None, cap=None, ftrs_renamed=None, figsize=(10, 6))
Visualise feature coefficients for a given pipeline in a stacked bar plot.
- Parameters:
name (str) – Name of the previously run signal optimization process.
ftrs (list, optional) – List of feature names to plot. Default is None.
title (str, optional) – Title of the plot. Default is None. This creates a figure title of the form “Stacked bar plot of model coefficients: {name}”.
cap (int, optional) – Maximum number of features to display. Default is None. The chosen features are the ‘cap’ most frequently occurring in the pipeline. This cannot exceed 10.
ftrs_renamed (dict, optional) – Dictionary to rename the feature names for visualisation in the plot legend. Default is None, which uses the original feature names.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (10, 6).
Notes
This method displays the average feature coefficients for a given pipeline in each calendar year, when available. This information is contained within a stacked bar plot. By default, the first 10 features, in the order specified during training, are plotted, even if more than 10 features were involved in the learning procedure. Specifying a ftrs list (of at most 10 elements) overrides this default.
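The calendar-year averaging behind the stacked bars can be sketched as below. This is a simplified illustration, not the library's implementation; the function yearly_average_coefs, the feature names and the coefficient values are all hypothetical.

```python
from collections import defaultdict


def yearly_average_coefs(coefs_by_date):
    """Average coefficients per calendar year.

    coefs_by_date maps an ISO 8601 retraining date to {feature: coefficient}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for day, coefs in coefs_by_date.items():
        year = int(day[:4])  # ISO 8601 dates start with the four-digit year
        for ftr, value in coefs.items():
            sums[year][ftr] += value
            counts[year][ftr] += 1
    return {
        year: {ftr: sums[year][ftr] / counts[year][ftr] for ftr in sums[year]}
        for year in sorted(sums)
    }


coefs_by_date = {
    "2020-01-31": {"GROWTH": 0.2, "INFL": -0.1},
    "2020-02-29": {"GROWTH": 0.4, "INFL": -0.3},
    "2021-01-29": {"GROWTH": 0.1, "INFL": 0.0},
}
yearly = yearly_average_coefs(coefs_by_date)
```

Each year then contributes one bar, with one stacked segment per feature equal to that feature's average coefficient over the year.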
- nsplits_timeplot(name, title=None, figsize=(10, 6))
Method to plot the time series for the number of cross-validation splits used by the signal optimizer.
- Parameters:
name (str) – Name of the previously run signal optimization process.
title (str, optional) – Title of the plot. Default is None, in which case a default title is used.
figsize (tuple of floats or ints, optional) – Tuple of floats or ints denoting the figure size. Default is (10, 6).
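The series shown by this plot follows from the split_functions argument of calculate_predictions, which maps the number of iterations passed to the number of splits added to a splitter's initial count. The sketch below is an assumption-laden illustration: nsplits_over_time and extra_splits are hypothetical names, the rule of one extra split per 12 iterations is made up, and counting iterations from zero is assumed.

```python
def extra_splits(iterations):
    """Hypothetical split function: one extra split per 12 iterations passed."""
    return iterations // 12


def nsplits_over_time(base_splits, split_function, n_retrainings):
    """Number of cross-validation splits at each retraining step.

    Illustrative only; the iteration counter is assumed to start at 0.
    """
    return [base_splits + split_function(i) for i in range(n_retrainings)]


series = nsplits_over_time(base_splits=5, split_function=extra_splits, n_retrainings=25)
```

With monthly retraining, this rule would add one split per year of new data, so the plotted series is a step function rising from 5 to 7 over 25 retraining dates.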