`macrosynergy.learning.signal_optimizer`

Class to handle the calculation of quantamental predictions based on adaptive hyperparameter and model selection.

**Classes**

`SignalOptimizer`

`SignalOptimizer.__init__()`


`SignalOptimizer.__init__(self, inner_splitter, X, y, blacklist, initial_nsplits, threshold_ndates)`

Class for sequential optimization of raw signals based on quantamental features.
Optimization is performed through nested cross-validation, with the outer splitter an
instance of `ExpandingIncrementPanelSplit`, reflecting a pipeline through time that
simulates the experience of an investor. In each iteration of the outer splitter, a
training and test set are created, and a grid search using the specified `inner_splitter`
is performed to determine an optimal model amongst a set of candidate models. Once this is
selected, the chosen model is used to make the test set forecasts. Lastly, these forecasts
are cast back by one frequency period to account for the lagged features, creating
point-in-time signals.

The features in the dataframe, X, are expected to be lagged quantamental indicators, at a single native frequency unit, with the targets, in y, being the cumulative returns at the native frequency. By providing a blacklisting dictionary, preferably through macrosynergy.management.make_blacklist, the user can specify time periods to ignore.

`:param <BasePanelSplit> inner_splitter`: Panel splitter that is used to split each
training set into smaller (training, test) pairs for cross-validation. At present this
splitter must be an instance of `RollingKFoldPanelSplit`, `ExpandingKFoldPanelSplit` or
`ExpandingIncrementPanelSplit`.

`:param <pd.DataFrame> X`: Wide pandas dataframe of features and date-time indexes that
capture the periods for which the signals are to be calculated. Since signals must make
time series predictions, the features in `X` must be lagged by one period, i.e., the
values used for the current period must be those that were originally recorded for the
previous period. The frequency of features (and targets) determines the frequency at which
model predictions are made and evaluated. This means that if we have monthly data, the
learning process uses the performance of monthly predictions.
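The expected lagging of `X` can be sketched with pandas, using a hypothetical two-country monthly panel (the index levels and column names below are illustrative, not prescribed by the class):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly panel: MultiIndex of (cross-section, month-end date).
idx = pd.MultiIndex.from_product(
    [["AUD", "USD"], pd.date_range("2020-01-31", periods=4, freq="M")],
    names=["cid", "real_date"],
)
raw = pd.DataFrame(
    {"GROWTH": np.arange(8, dtype=float), "INFL": 2.0 * np.arange(8)},
    index=idx,
)

# Lag each feature by one period within each cross-section, so the value
# stored for month t is the one originally observed in month t-1.
X = raw.groupby(level="cid").shift(1).dropna()
```

The first month of each cross-section has no lagged value and is dropped, so the feature panel starts one period after the raw data.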

`:param <Union[pd.DataFrame,pd.Series]> y`: Pandas dataframe or series of targets
corresponding with a time index equal to the features in `X`.

`:param <Dict[str, Tuple[pd.Timestamp, pd.Timestamp]]> blacklist`: Cross-sections with
date ranges that should be excluded from the data frame.
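A blacklist of this shape might look as follows; the cross-section identifiers and date ranges are purely illustrative:

```python
import pandas as pd

# Hypothetical blacklist: each key is a cross-section identifier and each
# value is a (start, end) pair of timestamps to exclude from training.
blacklist = {
    "AUD": (pd.Timestamp("2020-03-02"), pd.Timestamp("2020-06-30")),
    "MXN": (pd.Timestamp("1993-01-01"), pd.Timestamp("1994-12-30")),
}
```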

`:param <Optional[Union[int, np.int_]]> initial_nsplits`: The number of splits to be used
in the initial training set for cross-validation. If not None, this parameter overrides
the number of splits in the inner splitter. By setting this value, the ensuing signal
optimization process uses a changing number of cross-validation splits over time. The
parameter “threshold_ndates”, which defines the dynamics of the adaptive number of splits,
must be set in this case. Default is None.

`:param <Optional[Union[int, np.int_]]> threshold_ndates`: Number of unique dates, in
units of the native dataset frequency, to be made available for the currently-used number
of cross-validation splits to increase by one. If not None, the “initial_nsplits”
parameter must be set. Default is None.
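The interplay of `initial_nsplits` and `threshold_ndates` can be illustrated with a toy calculation. The rule below, one extra split per `threshold_ndates` new unique dates, is a plausible sketch of the adaptive behaviour, not the class's verbatim implementation:

```python
def adaptive_nsplits(n_new_dates, initial_nsplits=5, threshold_ndates=12):
    """Hypothetical rule: one extra cross-validation split for every
    `threshold_ndates` new unique dates beyond the initial training set."""
    return initial_nsplits + n_new_dates // threshold_ndates

# With monthly data and threshold_ndates=12, each extra year of history
# would add one cross-validation split.
splits_after_two_years = adaptive_nsplits(24)
```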

Note: Optimization is based on expanding time series panels and maximizes a defined
criterion over a grid of sklearn pipelines and hyperparameters of the involved models. The
frequency of the input data sets `X` and `y` determines the frequency at which the
training set is expanded. The training set itself is split into various (training, test)
pairs by the `inner_splitter` argument for cross-validation. Based on inner
cross-validation an optimal model is chosen and used for predicting the targets of the
next period. A prediction for a particular cross-section and time period is made only if
all required information has been available for that point. Optimized signals that are
produced by the class are always stored for the end of the original data period that
precedes the predicted period. For example, if the frequency of the input data set is
monthly, signals for a month are recorded at the end of the previous month. If the
frequency is working daily, signals for a day are recorded at the end of the previous
business day. The date adjustment step ensures that the point-in-time principle is
followed in the JPMaQS-format output of the class.
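The date adjustment can be sketched with pandas offsets; the monthly forecast dates below are illustrative:

```python
import pandas as pd

# Hypothetical forecasts indexed by the month-end of the predicted period.
forecast_dates = pd.DatetimeIndex(["2021-02-28", "2021-03-31"])

# Re-date each signal to the end of the preceding month, i.e. the point in
# time at which it could first have been computed.
signal_dates = forecast_dates - pd.offsets.MonthEnd(1)
```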

**Example use**

```
# Suppose X_train and y_train comprise monthly-frequency features and targets
from macrosynergy.learning import SignalOptimizer, RollingKFoldPanelSplit
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import make_scorer, mean_squared_error

so = SignalOptimizer(
    inner_splitter=RollingKFoldPanelSplit(n_splits=5),
    X=X_train,
    y=y_train,
)
# (1) Linear Regression signal with no hyperparameter optimisation
so.calculate_predictions(
    name="OLS",
    models={"linreg": LinearRegression()},
    metric=make_scorer(mean_squared_error, greater_is_better=False),
    hparam_grid={"linreg": {}},
)
print(so.get_optimized_signals("OLS"))
# (2) KNN signal with adaptive hyperparameter optimisation
so.calculate_predictions(
    name="KNN",
    models={"knn": KNeighborsRegressor()},
    metric=make_scorer(mean_squared_error, greater_is_better=False),
    hparam_grid={"knn": {"n_neighbors": [1, 2, 5]}},
)
print(so.get_optimized_signals("KNN"))
# (3) Linear regression & KNN mixture signal with adaptive hyperparameter
# optimisation
so.calculate_predictions(
    name="MIX",
    models={"linreg": LinearRegression(), "knn": KNeighborsRegressor()},
    metric=make_scorer(mean_squared_error, greater_is_better=False),
    hparam_grid={"linreg": {}, "knn": {"n_neighbors": [1, 2, 5]}},
)
print(so.get_optimized_signals("MIX"))
# (4) Visualise the models chosen by the adaptive signal algorithm for the
# nearest neighbors and mixture signals.
so.models_heatmap(name="KNN")
so.models_heatmap(name="MIX")
```

`SignalOptimizer._checks_init_params()`


`SignalOptimizer._checks_init_params(self, inner_splitter, X, y, blacklist, initial_nsplits, threshold_ndates)`

Private method to check the initialisation parameters of the class.

`SignalOptimizer.calculate_predictions()`


`SignalOptimizer.calculate_predictions(self, name, models, metric, hparam_grid, hparam_type, min_cids, min_periods, max_periods, n_iter, n_jobs)`

Calculate, store and return sequentially optimized signals for a given process. This method implements the nested cross-validation and subsequent signal generation. The name of the process, together with models to fit, hyperparameters to search over and a metric to optimize, are provided as compulsory arguments.

`:param <str> name`: Label of signal optimization process.

`:param <Dict[str, Union[BaseEstimator,Pipeline]]> models`: Dictionary of sklearn
predictors or pipelines.

`:param <Callable> metric`: A sklearn scorer object that serves as the criterion for
optimization.

`:param <str> hparam_type`: Hyperparameter search type. This must be either “grid”,
“random” or “bayes”. Default is “grid”.

`:param <Dict[str, Dict[str, List]]> hparam_grid`: Nested dictionary defining the
hyperparameters to consider for each model. The outer dictionary needs keys representing
the model name and should match the keys in the `models` dictionary. The inner dictionary
depends on the hyperparameter search type. If hparam_type is “grid”, then the inner
dictionary should have keys corresponding to the hyperparameter names and values equal to
a list of hyperparameter values to search over. For example: hparam_grid = { “lasso” :
{“alpha” : [1e-1, 1e-2, 1e-3]}, “knn” : {“n_neighbors” : [1, 2, 5]} }. If hparam_type
is “random”, the inner dictionary needs keys corresponding to the hyperparameter names and
values either equal to a distribution from which to sample or a list of them. For example:
hparam_grid = { “lasso” : {“alpha” : scipy.stats.expon()}, “knn” : {“n_neighbors” :
scipy.stats.randint(low=1, high=10)} }. Distributions must provide an rvs method for
sampling (such as those from scipy.stats.distributions). See
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
and
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
for more details.
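The two inner-dictionary shapes described above can be written out as runnable code; the model names `lasso` and `knn` echo the examples in the text:

```python
from scipy import stats

# Grid search: explicit lists of candidate values per hyperparameter.
grid_style = {
    "lasso": {"alpha": [1e-1, 1e-2, 1e-3]},
    "knn": {"n_neighbors": [1, 2, 5]},
}

# Random search: distributions exposing an `rvs` sampling method.
random_style = {
    "lasso": {"alpha": stats.expon()},
    "knn": {"n_neighbors": stats.randint(low=1, high=10)},
}

# Each distribution can be sampled; randint(low=1, high=10) draws from [1, 10).
sample = random_style["knn"]["n_neighbors"].rvs(random_state=0)
```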

`:param <int> min_cids`: Minimum number of cross-sections required for the initial
training set. Default is 4.

`:param <int> min_periods`: Minimum number of base periods of the input data frequency
required for the initial training set. Default is 12.

`:param <Optional[int]> max_periods`: Maximum length of each training set in units of the
input data frequency. If this maximum is exceeded, the earliest periods are cut off.
Default is None, which means that the full training history is considered in each
iteration.

`:param <int> n_iter`: Number of iterations to run for random search. Default is 10.

`:param <int> n_jobs`: Number of jobs to run in parallel. Default is -1, which uses all
available cores.

Note: The method produces signals for financial contract positions. They are calculated
sequentially at the frequency of the input data set. Sequentially here means that the
training set is expanded by one base period of the frequency at each step. Each time, the
training set itself is split into various (training, test) pairs by the `inner_splitter`
argument. Based on inner cross-validation an optimal model is chosen and used for
predicting the targets of the next period.
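The expanding sequence of (training, test) pairs produced by the outer splitter can be sketched schematically; the schedule below is a toy illustration over six periods with three initial training periods, not the class's actual splitter:

```python
# Six consecutive base periods, labelled 0..5.
periods = list(range(6))

# Each entry pairs an expanding training window with the single next period
# as the test set, starting once three periods of history are available.
schedule = [
    (periods[: t + 1], [periods[t + 1]])  # (train periods, test period)
    for t in range(2, len(periods) - 1)
]
```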

`SignalOptimizer._checks_calcpred_params()`


`SignalOptimizer._checks_calcpred_params(self, name, models, metric, hparam_grid, hparam_type, min_cids, min_periods, max_periods, n_iter, n_jobs)`

Private method to check the calculate_predictions method parameters.

`SignalOptimizer._worker()`


`SignalOptimizer._worker(self, train_idx, test_idx, name, models, metric, original_date_levels, hparam_grid, n_iter, hparam_type, nsplits_add)`

Private helper function to run the grid search for a single (train, test) pair and a collection of models. It is used to parallelise the pipeline.

`:param <np.ndarray> train_idx`: Array of indices corresponding to the training set.

`:param <np.ndarray> test_idx`: Array of indices corresponding to the test set.

`:param <str> name`: Name of the prediction model.

`:param <Dict[str, Union[BaseEstimator,Pipeline]]> models`: Dictionary of sklearn
predictors.

`:param <Callable> metric`: Sklearn scorer object.

`:param <List[pd.Timestamp]> original_date_levels`: List of dates corresponding to the
original dataset.

`:param <str> hparam_type`: Hyperparameter search type. This must be either “grid”,
“random” or “bayes”. Default is “grid”.

`:param <Dict[str, Dict[str, List]]> hparam_grid`: Nested dictionary denoting the
hyperparameters to consider for each model. See
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
and
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
for more details.

`:param <int> n_iter`: Number of iterations to run for random search. Default is 10.

`:param <Optional[Union[int, np.int_]]> nsplits_add`: Additional number of splits to add
to the number of splits in the inner splitter. Default is None.

`SignalOptimizer.get_optimized_signals()`


`SignalOptimizer.get_optimized_signals(self, name)`

Returns optimized signals for one or more processes.

`:param <Optional[Union[str, List]]> name`: Label of signal optimization process. Default
is all stored in the class instance.

`:return <pd.DataFrame>`: Pandas dataframe in JPMaQS format of working daily predictions
based on sequentially optimized models.
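Assuming the usual JPMaQS long format of `cid`, `xcat`, `real_date` and `value` columns (an assumption about the exact column names, not guaranteed by this docstring), the returned dataframe resembles:

```python
import pandas as pd

# Illustrative JPMaQS-style long-format signals for one process ("OLS").
signals = pd.DataFrame(
    {
        "cid": ["AUD", "USD"],
        "xcat": ["OLS", "OLS"],
        "real_date": pd.to_datetime(["2021-01-29", "2021-01-29"]),
        "value": [0.12, -0.05],
    }
)
```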

`SignalOptimizer.get_optimal_models()`


`SignalOptimizer.get_optimal_models(self, name)`

Returns the sequences of optimal models for one or more processes.

`:param <str> name`: Label of signal optimization process. Default is all stored in the
class instance.

`:return <pd.DataFrame>`: Pandas dataframe of the optimal models or hyperparameters at the
end of the base period in which they were determined (to be applied in the subsequent
period).

`SignalOptimizer.get_selected_features()`


`SignalOptimizer.get_selected_features(self, name)`

Returns the selected features over time for one or more processes.

`:param <str> name`: Label of signal optimization process. Default is all stored in the
class instance.

`:return <pd.DataFrame>`: Pandas dataframe of the selected features over time at the end
of the base period in which they were determined (to be applied in the subsequent period).

`SignalOptimizer.feature_selection_heatmap()`


`SignalOptimizer.feature_selection_heatmap(self, name, title, figsize)`

Method to visualise the selected features in a scikit-learn pipeline.

`:param <str> name`: Name of the prediction model.

`:param <Optional[str]> title`: Title of the heatmap. Default is None. This creates a
figure title of the form “Model Selection Heatmap for {name}”.

`:param <Optional[Tuple[Union[int, float], Union[int, float]]]> figsize`: Tuple of floats
or ints denoting the figure size. Default is (12, 8).

Note: This method displays, as a binary heatmap, the times at which each feature was selected in the learning process and hence used for signal generation.

`SignalOptimizer.models_heatmap()`


`SignalOptimizer.models_heatmap(self, name, title, cap, figsize)`

Visualise the optimal models used for signal calculation.

`:param <str> name`: Name of the prediction model.

`:param <Optional[str]> title`: Title of the heatmap. Default is None. This creates a
figure title of the form “Model Selection Heatmap for {name}”.

`:param <Optional[int]> cap`: Maximum number of models to display. Default (and limit) is
5. The chosen models are the ‘cap’ most frequently occurring in the pipeline.

`:param <Optional[Tuple[Union[int, float], Union[int, float]]]> figsize`: Tuple of floats
or ints denoting the figure size. Default is (12, 8).

Note: This method displays the times at which each model in a learning process has been optimal and used for signal generation, as a binary heatmap.

`SignalOptimizer.get_ftr_coefficients()`


`SignalOptimizer.get_ftr_coefficients(self, name)`

Method to return the feature coefficients for a given pipeline.

`:param <str> name`: Name of the pipeline.

`:return <pd.DataFrame>`: Pandas dataframe of the changing feature coefficients over time
for the specified pipeline.

`SignalOptimizer.get_intercepts()`


`SignalOptimizer.get_intercepts(self, name)`

Method to return the intercepts for a given pipeline.

`:param <str> name`: Name of the pipeline.

`:return <pd.DataFrame>`: Pandas dataframe of the changing intercepts over time for the
specified pipeline.

`SignalOptimizer.get_parameter_stats()`


`SignalOptimizer.get_parameter_stats(self, name, include_intercept)`

Function to return the means and standard deviations of linear model feature coefficients and intercepts (if available) for a given pipeline.

`:param <str> name`: Name of the pipeline.

`:param <Optional[bool]> include_intercept`: Whether to include the intercepts in the
output. Default is False.

`:return`: Tuple of means and standard deviations of feature coefficients and intercepts
(if chosen) for the specified pipeline.

`SignalOptimizer.coefs_timeplot()`


`SignalOptimizer.coefs_timeplot(self, name, ftrs, title, ftrs_renamed, figsize)`

Function to plot the time series of feature coefficients for a given pipeline. At most, 10
feature coefficient paths can be plotted at once. If more than 10 features were involved
in the learning procedure, the default is to plot the first 10 features in the order
specified during training. By specifying a `ftrs` list (of at most 10 elements), this
default behaviour can be overridden.

`:param <str> name`: Name of the pipeline.

`:param <Optional[List]> ftrs`: List of feature names to plot. Default is None.

`:param <Optional[str]> title`: Title of the plot. Default is None. This creates a figure
title of the form “Feature coefficients for pipeline: {name}”.

`:param <Optional[dict]> ftrs_renamed`: Dictionary to rename the feature names for
visualisation in the plot legend. Default is None, which uses the original feature names.

`:param <Tuple[Union[float, int], Union[float,int]]> figsize`: Tuple of floats or ints
denoting the figure size.

`:return`: Time series plot of feature coefficients for the given pipeline.

`SignalOptimizer.intercepts_timeplot()`


`SignalOptimizer.intercepts_timeplot(self, name, title, figsize)`

Function to plot the time series of intercepts for a given pipeline.

`:param <str> name`: Name of the pipeline.

`:param <Optional[str]> title`: Title of the plot. Default is None. This creates a figure
title of the form “Intercepts for pipeline: {name}”.

`:param <Tuple[Union[float, int], Union[float,int]]> figsize`: Tuple of floats or ints
denoting the figure size.

`:return`: Time series plot of intercepts for the given pipeline.

`SignalOptimizer.coefs_stackedbarplot()`


`SignalOptimizer.coefs_stackedbarplot(self, name, ftrs, title, ftrs_renamed, figsize)`

Function to create a stacked bar plot of feature coefficients for a given pipeline. At
most, 10 feature coefficients can be considered in the plot. If more than 10 features were
involved in the learning procedure, the default is to plot the first 10 features in the
order specified during training. By specifying a `ftrs` list (of at most 10 elements),
this default behaviour can be overridden.

`:param <str> name`: Name of the pipeline.

`:param <Optional[List]> ftrs`: List of feature names to plot. Default is None.

`:param <Optional[str]> title`: Title of the plot. Default is None. This creates a figure
title of the form “Stacked bar plot of model coefficients: {name}”.

`:param <Optional[dict]> ftrs_renamed`: Dictionary to rename the feature names for
visualisation in the plot legend. Default is None, which uses the original feature names.

`:param <Tuple[int, int]> figsize`: Tuple of floats or ints denoting the figure size.

`:return`: Stacked bar plot of feature coefficients for the given pipeline.

`SignalOptimizer.nsplits_timeplot()`


`SignalOptimizer.nsplits_timeplot(self, name, title, figsize)`

Method to plot the time series for the number of cross-validation splits used by the signal optimizer.

`:param <str> name`: Name of the pipeline.

`:param <Optional[str]> title`: Title of the plot. Default is None. This creates a figure
title of the form “Number of CV splits for pipeline: {name}”.

`:param <Tuple[Union[float, int], Union[float,int]]> figsize`: Tuple of floats or ints
denoting the figure size.

`:return`: Time series plot of the number of cross-validation splits for the given
pipeline.