macrosynergy.learning.signal_optimizer#

Class to handle the calculation of quantamental predictions based on adaptive hyperparameter and model selection.


Classes#


SignalOptimizer#


SignalOptimizer.__init__()#

SignalOptimizer.__init__(self, inner_splitter, X, y, blacklist, initial_nsplits, threshold_ndates)

Class for sequential optimization of raw signals based on quantamental features. Optimization is performed through nested cross-validation, with the outer splitter an instance of ExpandingIncrementPanelSplit reflecting a pipeline through time simulating the experience of an investor. In each iteration of the outer splitter, a training and test set are created, and a grid search using the specified ‘inner_splitter’ is performed to determine an optimal model amongst a set of candidate models. Once this is selected, the chosen model is used to make the test set forecasts. Lastly, we cast these forecasts back by a frequency period to account for the lagged features, creating point-in-time signals.
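The outer loop described above can be pictured with a small, dependency-free sketch. The helper name expanding_increment_splits and its arguments are hypothetical and are not the library's ExpandingIncrementPanelSplit API; the sketch only illustrates how an expanding training window paired with the next period as a test set simulates an investor's experience through time.

```python
from typing import Iterator, List, Tuple


def expanding_increment_splits(
    dates: List[str], min_periods: int = 3, test_size: int = 1
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_dates, test_dates) pairs from an expanding window.

    Hypothetical helper: the training window starts with `min_periods`
    dates and grows by `test_size` dates in every iteration.
    """
    start = min_periods
    while start < len(dates):
        yield dates[:start], dates[start:start + test_size]
        start += test_size


months = ["2020-01", "2020-02", "2020-03", "2020-04", "2020-05", "2020-06"]
splits = list(expanding_increment_splits(months, min_periods=3))

for train, test in splits:
    # An inner cross-validation on `train` would pick the model here;
    # the chosen model would then forecast the dates in `test`.
    print(f"train through {train[-1]} -> forecast {test[0]}")
```

Each training set contains only dates that precede its test set, which is what keeps the resulting forecasts free of look-ahead.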

The features in the dataframe, X, are expected to be lagged quantamental indicators, at a single native frequency unit, with the targets, in y, being the cumulative returns at the native frequency. By providing a blacklisting dictionary, preferably through macrosynergy.management.make_blacklist, the user can specify time periods to ignore.
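As an illustration of the blacklist format, the sketch below filters a toy panel with a dictionary that maps a cross-section to an excluded date range. It makes several assumptions: datetime.date stands in for pd.Timestamp to avoid dependencies, the range boundaries are treated as inclusive, and the helper is_blacklisted is hypothetical rather than part of the package.

```python
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical blacklist: cross-section -> (start, end) of the excluded period.
# datetime.date stands in for pd.Timestamp to keep the sketch dependency-free.
blacklist: Dict[str, Tuple[date, date]] = {
    "BRL": (date(2012, 12, 3), date(2013, 9, 10)),
}

# Toy panel as (cross-section, date, value) rows.
rows: List[Tuple[str, date, float]] = [
    ("BRL", date(2012, 11, 30), 0.4),
    ("BRL", date(2013, 1, 31), -0.2),
    ("USD", date(2013, 1, 31), 0.1),
]


def is_blacklisted(cid: str, day: date) -> bool:
    """True if `day` lies inside the excluded range for `cid` (assumed inclusive)."""
    if cid not in blacklist:
        return False
    start, end = blacklist[cid]
    return start <= day <= end


# Keep only the rows that fall outside every blacklisted range.
kept = [row for row in rows if not is_blacklisted(row[0], row[1])]
```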

:param <BasePanelSplit> inner_splitter: Panel splitter that is used to split each training set into smaller (training, test) pairs for cross-validation. At present that splitter has to be an instance of RollingKFoldPanelSplit, ExpandingKFoldPanelSplit or ExpandingIncrementPanelSplit.

:param <pd.DataFrame> X: Wide pandas dataframe of features and date-time indexes that capture the periods for which the signals are to be calculated. Since signals must make time series predictions, the features in X must be lagged by one period, i.e., the values used for the current period must be those that were originally recorded for the previous period. The frequency of features (and targets) determines the frequency at which model predictions are made and evaluated. This means that if we have monthly data, the learning process uses the performance of monthly predictions.

:param <Union[pd.DataFrame,pd.Series]> y: Pandas dataframe or series of targets, with a time index matching that of the features in X.

:param <Dict[str, Tuple[pd.Timestamp, pd.Timestamp]]> blacklist: cross-sections with date ranges that should be excluded from the data frame.

:param <Optional[Union[int, np.int_]]> initial_nsplits: The number of splits to be used in the initial training set for cross-validation. If not None, this parameter overrides the number of splits in the inner splitter. By setting this value, the ensuing signal optimization process uses a changing number of cross-validation splits over time. The parameter “threshold_ndates”, which defines the dynamics of the adaptive number of splits, must be set in this case. Default is None.

:param <Optional[Union[int, np.int_]]> threshold_ndates: Number of unique dates, in units of the native dataset frequency, to be made available for the currently-used number of cross-validation splits to increase by one. If not None, the “initial_nsplits” parameter must be set. Default is None.
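One plausible reading of how initial_nsplits and threshold_ndates interact is sketched below. The function adaptive_n_splits and the exact counting rule (one extra split per full block of threshold_ndates newly available dates) are assumptions made for illustration; the package's internal bookkeeping may differ.

```python
def adaptive_n_splits(initial_nsplits: int, threshold_ndates: int, new_dates: int) -> int:
    """Number of CV splits once `new_dates` dates have been added to the data.

    Assumed rule: one extra split per full block of `threshold_ndates` dates.
    """
    if initial_nsplits < 1 or threshold_ndates < 1 or new_dates < 0:
        raise ValueError("initial_nsplits and threshold_ndates must be positive")
    return initial_nsplits + new_dates // threshold_ndates


# With monthly data, 5 initial splits and a threshold of 24 dates,
# the split count rises by one every two years.
schedule = [adaptive_n_splits(5, 24, n) for n in (0, 23, 24, 47, 48)]
print(schedule)  # [5, 5, 6, 6, 7]
```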

Note: Optimization is based on expanding time series panels and maximizes a defined criterion over a grid of sklearn pipelines and hyperparameters of the involved models. The frequency of the input data sets X and y determines the frequency at which the training set is expanded. The training set itself is split into various (training, test) pairs by the inner_splitter argument for cross-validation. Based on inner cross-validation, an optimal model is chosen and used for predicting the targets of the next period. A prediction for a particular cross-section and time period is made only if all required information has been available for that point. Optimized signals that are produced by the class are always stored for the end of the original data period that precedes the predicted period. For example, if the frequency of the input data set is monthly, signals for a month are recorded at the end of the previous month. If the frequency is daily (working days), signals for a day are recorded at the end of the previous business day. The date adjustment step ensures that the point-in-time principle is followed in the JPMaQS-format output of the class.
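The date adjustment described in the note can be sketched as follows. The helper to_point_in_time and the toy data are hypothetical; the sketch only shows the principle that a forecast for a period is recorded at the end of the preceding period.

```python
from typing import Dict, List

# Month-end business dates of a toy monthly data set.
period_ends: List[str] = ["2020-01-31", "2020-02-28", "2020-03-31", "2020-04-30"]

# Forecasts keyed by the period they predict (illustrative values).
forecasts: Dict[str, float] = {"2020-03-31": 0.12, "2020-04-30": -0.05}


def to_point_in_time(forecasts: Dict[str, float], period_ends: List[str]) -> Dict[str, float]:
    """Record each forecast at the end of the period preceding the predicted one."""
    position = {day: i for i, day in enumerate(period_ends)}
    signals: Dict[str, float] = {}
    for day, value in forecasts.items():
        i = position[day]
        if i == 0:
            continue  # no preceding period to attach the signal to
        signals[period_ends[i - 1]] = value
    return signals


signals = to_point_in_time(forecasts, period_ends)
print(signals)  # {'2020-02-28': 0.12, '2020-03-31': -0.05}
```

Read this way, the signal dated 2020-02-28 is the value an investor could have acted on for March.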

Example use:#

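from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

from macrosynergy.learning import RollingKFoldPanelSplit, SignalOptimizer

# Suppose X_train and y_train comprise monthly-frequency features and targets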
so = SignalOptimizer(
    inner_splitter=RollingKFoldPanelSplit(n_splits=5),
    X=X_train,
    y=y_train,
)

# (1) Linear Regression signal with no hyperparameter optimisation
so.calculate_predictions(
    name="OLS",
    models = {"linreg" : LinearRegression()},
    metric = make_scorer(mean_squared_error, greater_is_better=False),
    hparam_grid = {"linreg" : {}},
)
print(so.get_optimized_signals("OLS"))

# (2) KNN signal with adaptive hyperparameter optimisation
so.calculate_predictions(
    name="KNN",
    models = {"knn" : KNeighborsRegressor()},
    metric = make_scorer(mean_squared_error, greater_is_better=False),
    hparam_grid = {"knn" : {"n_neighbors" : [1, 2, 5]}},
)
print(so.get_optimized_signals("KNN"))

# (3) Linear regression & KNN mixture signal with adaptive hyperparameter
#     optimisation
so.calculate_predictions(
    name="MIX",
    models = {"linreg" : LinearRegression(), "knn" : KNeighborsRegressor()},
    metric = make_scorer(mean_squared_error, greater_is_better=False),
    hparam_grid = {"linreg" : {}, "knn" : {"n_neighbors" : [1, 2, 5]}},
)
print(so.get_optimized_signals("MIX"))

# (4) Visualise the models chosen by the adaptive signal algorithm for the
#     nearest neighbors and mixture signals.
so.models_heatmap(name="KNN")
so.models_heatmap(name="MIX")

SignalOptimizer._checks_init_params()#

SignalOptimizer._checks_init_params(self, inner_splitter, X, y, blacklist, initial_nsplits, threshold_ndates)

Private method to check the initialisation parameters of the class.


SignalOptimizer.calculate_predictions()#

SignalOptimizer.calculate_predictions(self, name, models, metric, hparam_grid, hparam_type, min_cids, min_periods, max_periods, n_iter, n_jobs)

Calculate, store and return sequentially optimized signals for a given process. This method implements the nested cross-validation and subsequent signal generation. The name of the process, together with models to fit, hyperparameters to search over and a metric to optimize, are provided as compulsory arguments.

:param <str> name: Label of signal optimization process.

:param <Dict[str, Union[BaseEstimator,Pipeline]]> models: dictionary of sklearn predictors or pipelines.

:param <Callable> metric: A sklearn scorer object that serves as the criterion for optimization.

:param <str> hparam_type: Hyperparameter search type. This must be either “grid”, “random” or “bayes”. Default is “grid”.

:param <Dict[str, Dict[str, List]]> hparam_grid: Nested dictionary defining the hyperparameters to consider for each model. The keys of the outer dictionary are model names and should match the keys of the models dictionary. The inner dictionary depends on the hyperparameter search type. If hparam_type is “grid”, the inner dictionary should have keys corresponding to the hyperparameter names and values equal to a list of hyperparameter values to search over. For example: hparam_grid = {"lasso": {"alpha": [1e-1, 1e-2, 1e-3]}, "knn": {"n_neighbors": [1, 2, 5]}}. If hparam_type is “random”, the inner dictionary needs keys corresponding to the hyperparameter names and values either equal to a distribution from which to sample or a list of them. For example: hparam_grid = {"lasso": {"alpha": scipy.stats.expon()}, "knn": {"n_neighbors": scipy.stats.randint(low=1, high=10)}}. Distributions must provide an rvs method for sampling (such as those from scipy.stats.distributions). See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html and https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html for more details.

:param <int> min_cids: Minimum number of cross-sections required for the initial training set. Default is 4.

:param <int> min_periods: minimum number of base periods of the input data frequency required for the initial training set. Default is 12.

:param <Optional[int]> max_periods: maximum length of each training set in units of the input data frequency. If this maximum is exceeded, the earliest periods are cut off. Default is None, which means that the full training history is considered in each iteration.

:param <int> n_iter: Number of iterations to run for random search. Default is 10.

:param <int> n_jobs: Number of jobs to run in parallel. Default is -1, which uses all available cores.
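To make the “grid” form of hparam_grid concrete, the sketch below expands a nested grid into the individual candidate settings that an exhaustive search would evaluate. The helper expand_grid is hypothetical and uses only the standard library; the "weights" entry is added purely to show a two-parameter grid.

```python
from itertools import product
from typing import Any, Dict, List

# Nested grid in the documented shape: model name -> {hyperparameter: values}.
hparam_grid: Dict[str, Dict[str, List[Any]]] = {
    "lasso": {"alpha": [1e-1, 1e-2, 1e-3]},
    "knn": {"n_neighbors": [1, 2, 5], "weights": ["uniform", "distance"]},
    "linreg": {},  # no hyperparameters: a single default candidate
}


def expand_grid(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return every combination of hyperparameter values as a list of dicts."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = {model: expand_grid(grid) for model, grid in hparam_grid.items()}

for model, settings in candidates.items():
    print(model, len(settings))
```

An empty inner dictionary still yields one candidate (the model with its default settings), which is why a model without tunable hyperparameters can take part in the comparison.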

Note: The method produces signals for financial contract positions. They are calculated sequentially at the frequency of the input data set. Sequentially here means that the training set is expanded by one base period of the frequency. Each time the training set itself is split into various (training, test) pairs by the inner_splitter argument. Based on inner cross-validation an optimal model is chosen and used for predicting the targets of the next period.
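The sequential procedure in the note can be illustrated with a toy, dependency-free sketch. Everything here is a stand-in: the two "models" are simple forecasting rules, inner_cv_score is a hypothetical one-step-ahead expanding validation scored by negative mean squared error (so that higher is better), and min_periods and max_periods only mimic the roles described above.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Tuple

Model = Callable[[List[float]], float]

# Toy candidates (hypothetical): each maps a training history to a forecast.
models: Dict[str, Model] = {
    "mean": lambda history: mean(history),
    "last": lambda history: history[-1],
}


def inner_cv_score(model: Model, train: List[float], n_splits: int = 3) -> float:
    """Negative MSE over the last `n_splits` one-step-ahead expanding folds."""
    if len(train) <= n_splits:
        raise ValueError("training set too short for the requested splits")
    errors = [
        (model(train[:cut]) - train[cut]) ** 2
        for cut in range(len(train) - n_splits, len(train))
    ]
    return -mean(errors)


def sequential_signals(
    y: List[float], min_periods: int = 4, max_periods: Optional[int] = None
) -> List[Tuple[int, str, float]]:
    """For each period t, pick the best model on y[:t] and forecast y[t]."""
    results = []
    for t in range(min_periods, len(y)):
        train = y[:t]
        if max_periods is not None:
            train = train[-max_periods:]  # drop the earliest periods
        scores = {name: inner_cv_score(m, train) for name, m in models.items()}
        best = max(scores, key=scores.get)
        results.append((t, best, models[best](train)))
    return results


trend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
results = sequential_signals(trend, min_periods=4, max_periods=5)
```

On a steadily trending series the "last value" rule wins every inner validation, so it is the rule used for each forecast.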


SignalOptimizer._checks_calcpred_params()#

SignalOptimizer._checks_calcpred_params(self, name, models, metric, hparam_grid, hparam_type, min_cids, min_periods, max_periods, n_iter, n_jobs)

Private method to check the calculate_predictions method parameters.


SignalOptimizer._worker()#

SignalOptimizer._worker(self, train_idx, test_idx, name, models, metric, original_date_levels, hparam_grid, n_iter, hparam_type, nsplits_add)

Private helper function to run the grid search for a single (train, test) pair and a collection of models. It is used to parallelise the pipeline.

:param <np.ndarray> train_idx: Array of indices corresponding to the training set.

:param <np.ndarray> test_idx: Array of indices corresponding to the test set.

:param <str> name: Name of the prediction model.

:param <Dict[str, Union[BaseEstimator,Pipeline]]> models: dictionary of sklearn predictors.

:param <Callable> metric: Sklearn scorer object.

:param <List[pd.Timestamp]> original_date_levels: List of dates corresponding to the original dataset.

:param <str> hparam_type: Hyperparameter search type. This must be either “grid”, “random” or “bayes”. Default is “grid”.

:param <Dict[str, Dict[str, List]]> hparam_grid: Nested dictionary denoting the hyperparameters to consider for each model. See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html and https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html for more details.

:param <int> n_iter: Number of iterations to run for random search. Default is 10.

:param <Optional[Union[int, np.int_]]> nsplits_add: Additional number of splits to add to the number of splits in the inner splitter. Default is None.
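Because each (train, test) pair can be processed independently, the per-split work lends itself to parallel execution. The sketch below uses a thread pool from the standard library purely for illustration; the backend the package actually uses is not stated here, and the worker function is a hypothetical stand-in that does no model fitting.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Split = Tuple[List[int], List[int]]


def worker(split: Split) -> Tuple[int, int]:
    """Stand-in for the per-split search: returns (test index, training size)."""
    train_idx, test_idx = split
    return test_idx[0], len(train_idx)


# Toy expanding splits: train on indices [0, n), test on index n.
splits: List[Split] = [(list(range(n)), [n]) for n in range(3, 7)]

# Map the worker over all splits; results come back in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(worker, splits))

print(results)  # [(3, 3), (4, 4), (5, 5), (6, 6)]
```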


SignalOptimizer.get_optimized_signals()#

SignalOptimizer.get_optimized_signals(self, name)

Returns optimized signals for one or more processes.

:param <Optional[Union[str, List]]> name: Label of signal optimization process. Default is all stored in the class instance.

:return <pd.DataFrame>: Pandas dataframe in JPMaQS format of working daily predictions based on sequentially optimized models.


SignalOptimizer.get_optimal_models()#

SignalOptimizer.get_optimal_models(self, name)

Returns the sequences of optimal models for one or more processes.

:param <str> name: Label of signal optimization process. Default is all stored in the class instance.

:return <pd.DataFrame>: Pandas dataframe of the optimal models or hyperparameters at the end of the base period in which they were determined (to be applied in the subsequent period).


SignalOptimizer.get_selected_features()#

SignalOptimizer.get_selected_features(self, name)

Returns the selected features over time for one or more processes.

:param <str> name: Label of signal optimization process. Default is all stored in the class instance.

:return <pd.DataFrame>: Pandas dataframe of the selected features over time at the end of the base period in which they were determined (to be applied in the subsequent period).


SignalOptimizer.feature_selection_heatmap()#

SignalOptimizer.feature_selection_heatmap(self, name, title, figsize)

Method to visualise the selected features in a scikit-learn pipeline.

:param <str> name: Name of the prediction model.

:param <Optional[str]> title: Title of the heatmap. Default is None. This creates a figure title of the form “Model Selection Heatmap for {name}”.

:param <Optional[Tuple[Union[int, float], Union[int, float]]]> figsize: Tuple of floats or ints denoting the figure size. Default is (12, 8).

Note: This method displays the times at which each feature was used in the learning process and used for signal generation, as a binary heatmap.


SignalOptimizer.models_heatmap()#

SignalOptimizer.models_heatmap(self, name, title, cap, figsize)

Visualize the optimal models used for signal calculation.

:param <str> name: Name of the prediction model.

:param <Optional[str]> title: Title of the heatmap. Default is None. This creates a figure title of the form “Model Selection Heatmap for {name}”.

:param <Optional[int]> cap: Maximum number of models to display. Default (and limit) is 5. The chosen models are the ‘cap’ most frequently occurring in the pipeline.

:param <Optional[Tuple[Union[int, float], Union[int, float]]]> figsize: Tuple of floats or ints denoting the figure size. Default is (12, 8).

Note: This method displays the times at which each model in a learning process has been optimal and used for signal generation, as a binary heatmap.
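The data behind such a binary heatmap can be sketched without any plotting library. The record of chosen models and the helper binary_selection_matrix are hypothetical; the sketch only shows how the cap most frequently chosen models could be turned into rows of 0/1 flags over time.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical record of the model chosen at each date.
chosen: Dict[str, str] = {
    "2020-01-31": "linreg",
    "2020-02-28": "knn",
    "2020-03-31": "knn",
    "2020-04-30": "knn",
    "2020-05-29": "lasso",
    "2020-06-30": "linreg",
}


def binary_selection_matrix(chosen: Dict[str, str], cap: int = 5) -> Dict[str, List[int]]:
    """Rows of 0/1 flags per model, limited to the `cap` most frequent models."""
    counts = Counter(chosen.values())
    top = [model for model, _ in counts.most_common(cap)]
    dates = sorted(chosen)
    return {model: [int(chosen[day] == model) for day in dates] for model in top}


matrix = binary_selection_matrix(chosen, cap=2)
for model, flags in matrix.items():
    print(f"{model:>7}: {flags}")
```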


SignalOptimizer.get_ftr_coefficients()#

SignalOptimizer.get_ftr_coefficients(self, name)

Method to return the feature coefficients for a given pipeline.

:param <str> name: Name of the pipeline.

:return <pd.DataFrame>: Pandas dataframe of the changing feature coefficients over time for the specified pipeline.


SignalOptimizer.get_intercepts()#

SignalOptimizer.get_intercepts(self, name)

Method to return the intercepts for a given pipeline.

:param <str> name: Name of the pipeline.

:return <pd.DataFrame>: Pandas dataframe of the changing intercepts over time for the specified pipeline.


SignalOptimizer.get_parameter_stats()#

SignalOptimizer.get_parameter_stats(self, name, include_intercept)

Function to return the means and standard deviations of linear model feature coefficients and intercepts (if available) for a given pipeline.

:param <str> name: Name of the pipeline.

:param <Optional[bool]> include_intercept: Whether to include the intercepts in the output. Default is False.

:return: Tuple of means and standard deviations of feature coefficients and intercepts (if chosen) for the specified pipeline.
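A minimal sketch of the kind of summary described above is given below. The input layout, the helper parameter_stats, the returned pair of dictionaries and the use of the sample standard deviation are all assumptions for illustration; the actual return layout of the method may differ.

```python
from statistics import mean, stdev
from typing import Dict, List, Optional, Tuple

# Hypothetical coefficient paths: one value per re-estimation date.
coefs_over_time: Dict[str, List[float]] = {
    "GROWTH": [0.50, 0.40, 0.60],
    "INFL": [-0.10, -0.20, -0.30],
}
intercepts: List[float] = [0.01, 0.03, 0.02]


def parameter_stats(
    coefs: Dict[str, List[float]], intercepts: Optional[List[float]] = None
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (means, standard deviations) per parameter over time."""
    series = dict(coefs)
    if intercepts is not None:
        series["intercept"] = list(intercepts)
    means = {name: mean(values) for name, values in series.items()}
    # Sample standard deviation is an assumption; a population version
    # would use statistics.pstdev instead.
    stds = {name: stdev(values) for name, values in series.items()}
    return means, stds


means, stds = parameter_stats(coefs_over_time, intercepts)
```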


SignalOptimizer.coefs_timeplot()#

SignalOptimizer.coefs_timeplot(self, name, ftrs, title, ftrs_renamed, figsize)

Function to plot the time series of feature coefficients for a given pipeline. At most, 10 feature coefficient paths can be plotted at once. If more than 10 features were involved in the learning procedure, the default is to plot the first 10 features in the order specified during training. By specifying a ftrs list (which can be no longer than 10 elements in length), this default behaviour can be overridden.
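The selection rule described above (first 10 features by default, or an explicit list of at most 10) can be sketched as follows. The helper select_features and its error messages are hypothetical and only mirror the documented behaviour.

```python
from typing import List, Optional

MAX_FTRS = 10  # documented limit on the number of plotted coefficient paths


def select_features(all_ftrs: List[str], ftrs: Optional[List[str]] = None) -> List[str]:
    """Pick the features to plot, following the documented default and limit."""
    if ftrs is None:
        # Default: the first 10 features in training order.
        return all_ftrs[:MAX_FTRS]
    if len(ftrs) > MAX_FTRS:
        raise ValueError(f"at most {MAX_FTRS} features can be plotted at once")
    unknown = [f for f in ftrs if f not in all_ftrs]
    if unknown:
        raise ValueError(f"unknown features: {unknown}")
    return list(ftrs)


trained = [f"FTR{i}" for i in range(12)]
default_selection = select_features(trained)
custom_selection = select_features(trained, ["FTR11", "FTR0"])
```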

:param <str> name: Name of the pipeline.

:param <Optional[List]> ftrs: List of feature names to plot. Default is None.

:param <Optional[str]> title: Title of the plot. Default is None. This creates a figure title of the form “Feature coefficients for pipeline: {name}”.

:param <Optional[dict]> ftrs_renamed: Dictionary to rename the feature names for visualisation in the plot legend. Default is None, which uses the original feature names.

:param <Tuple[Union[float, int], Union[float,int]]> figsize: Tuple of floats or ints denoting the figure size.

:return: Time series plot of feature coefficients for the given pipeline.


SignalOptimizer.intercepts_timeplot()#

SignalOptimizer.intercepts_timeplot(self, name, title, figsize)

Function to plot the time series of intercepts for a given pipeline.

:param <str> name: Name of the pipeline.

:param <Optional[str]> title: Title of the plot. Default is None. This creates a figure title of the form “Intercepts for pipeline: {name}”.

:param <Tuple[Union[float, int], Union[float,int]]> figsize: Tuple of floats or ints denoting the figure size.

:return: Time series plot of intercepts for the given pipeline.


SignalOptimizer.coefs_stackedbarplot()#

SignalOptimizer.coefs_stackedbarplot(self, name, ftrs, title, ftrs_renamed, figsize)

Function to create a stacked bar plot of feature coefficients for a given pipeline. At most, 10 feature coefficients can be considered in the plot. If more than 10 features were involved in the learning procedure, the default is to plot the first 10 features in the order specified during training. By specifying a ftrs list (which can be no longer than 10 elements in length), this default behaviour can be overridden.

:param <str> name: Name of the pipeline.

:param <Optional[List]> ftrs: List of feature names to plot. Default is None.

:param <Optional[str]> title: Title of the plot. Default is None. This creates a figure title of the form “Stacked bar plot of model coefficients: {name}”.

:param <Optional[dict]> ftrs_renamed: Dictionary to rename the feature names for visualisation in the plot legend. Default is None, which uses the original feature names.

:param <Tuple[int, int]> figsize: Tuple of floats or ints denoting the figure size.

:return: Stacked bar plot of feature coefficients for the given pipeline.


SignalOptimizer.nsplits_timeplot()#

SignalOptimizer.nsplits_timeplot(self, name, title, figsize)

Method to plot the time series for the number of cross-validation splits used by the signal optimizer.

:param <str> name: Name of the pipeline.

:param <Optional[str]> title: Title of the plot. Default is None. This creates a figure title of the form “Number of CV splits for pipeline: {name}”.

:param <Tuple[Union[float, int], Union[float,int]]> figsize: Tuple of floats or ints denoting the figure size.

:return: Time series plot of the number of cross-validation splits for the given pipeline.