macrosynergy.learning.sequential.return_forecaster

Class to produce point forecasts of returns given knowledge of an indicator state on a specific date.

class ReturnForecaster(df, xcats, real_date, cids=None, blacklist=None, freq='M', lag=1, xcat_aggs=['last', 'sum'], generate_labels=None)

Bases: BasePanelLearner

Class to produce return forecasts for a single forward frequency, based on the indicator states at a specific date.

Parameters:
  • df (pd.DataFrame) – Daily quantamental dataframe in JPMaQS format containing a panel of features, as well as a panel of returns.

  • xcats (list) – List comprising feature names, with the last element being the response variable name. The features and the response variable must be categories in the dataframe.

  • real_date (str) – Date, in ISO 8601 format, on which the forward forecast is made, based on the information states available that day.

  • cids (list, optional) – List of cross-section identifiers for consideration in the panel. Default is None, in which case all cross-sections in df are considered.

  • blacklist (dict, optional) – Blacklisting dictionary specifying date ranges for which cross-sectional information should be excluded. The keys are cross-sections and the values are tuples of start and end dates in ISO 8601 format. Default is None.

  • freq (str, optional) – Frequency of the analysis. Default is “M” for monthly.

  • lag (int, optional) – Number of periods by which the features are lagged relative to the response variable. Default is 1.

  • xcat_aggs (list, optional) – List of aggregation functions to apply when downsampling the data, used when freq is not “D”. Default is [“last”, “sum”].

  • generate_labels (callable, optional) – Function to transform the response variable into either alternative regression targets or classification labels. Default is None.
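The shapes of the xcats, blacklist and generate_labels arguments can be sketched with plain Python objects. This is a minimal illustration only: the category names, cross-section codes and dates are hypothetical, the inclusive treatment of the range bounds is an assumption, and the label function assumes the callable receives one return value at a time (the text above does not say whether it is given scalars or a whole series).

```python
from datetime import date

# Hypothetical category names: features first, response variable last.
xcats = ["CPI_YOY", "GROWTH", "FXXR"]
features, response = xcats[:-1], xcats[-1]

# Hypothetical blacklist: cross-section -> (start, end) in ISO 8601 format.
blacklist = {
    "TRY": ("2020-01-01", "2021-06-30"),
    "RUB": ("2022-02-01", "2023-12-31"),
}


def is_blacklisted(cid: str, day: str) -> bool:
    """Return True if `day` lies in the excluded range for `cid`.

    Bounds are treated as inclusive here, which is an assumption.
    """
    if cid not in blacklist:
        return False
    start, end = (date.fromisoformat(d) for d in blacklist[cid])
    return start <= date.fromisoformat(day) <= end


def to_direction_label(ret: float) -> int:
    """Hypothetical generate_labels candidate: map a return to a class label."""
    return 1 if ret >= 0 else -1
```

A function such as to_direction_label turns the regression problem into a classification problem, which is one of the two uses the generate_labels description mentions.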

Notes

This class is a simple interface for producing a single-period forward forecast. The real_date parameter specifies the date of the information state used to generate the forecast. For example, if real_date is “2025-03-01”, freq is “M” and lag is 1, the information states on that date are set aside and the preceding data is downsampled to monthly frequency, with the features lagged by one period. Model selection and fitting are performed on this dataset, and the forecast is produced for the single out-of-sample period (March 2025).
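The data preparation described in this note can be sketched in plain Python. This is not the library's implementation, only an illustration of the three steps (set aside, downsample, lag) for one cross-section, one feature and one return series. The dates and values are hypothetical, and the “last”/“sum” aggregation mirrors the default xcat_aggs under the assumption that the first entry applies to features and the second to returns.

```python
from datetime import date

# Hypothetical daily observations: date -> (feature value, daily return).
daily = {
    date(2025, 1, 30): (0.5, 0.25),
    date(2025, 1, 31): (0.75, -0.5),
    date(2025, 2, 27): (1.0, 0.5),
    date(2025, 2, 28): (1.25, 0.125),
    date(2025, 3, 3): (1.5, None),  # return not yet known on the forecast date
}
real_date = date(2025, 3, 3)  # hypothetical forecast date

# 1. Set aside the information state observed on real_date.
forecast_features = daily[real_date][0]
history = {d: v for d, v in daily.items() if d < real_date}

# 2. Downsample to monthly: last value for the feature, sum for the return.
monthly = {}
for d in sorted(history):
    feat, ret = history[d]
    key = (d.year, d.month)
    _, ret_sum = monthly.get(key, (None, 0.0))
    monthly[key] = (feat, ret_sum + ret)

# 3. Lag the feature by one period: pair the feature of the previous month
#    with the return of the current month.
periods = sorted(monthly)
training_pairs = [
    (monthly[prev][0], monthly[curr][1])
    for prev, curr in zip(periods, periods[1:])
]
```

A model fitted on training_pairs would then be applied to forecast_features to produce the single out-of-sample forecast.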

calculate_predictions(name, models, hyperparameters, scorers, inner_splitters, search_type='grid', normalize_fold_results=False, cv_summary='mean', include_train_folds=False, n_iter=None, n_jobs_cv=1, n_jobs_model=1, store_correlations=False)

Calculate predictions for the out-of-sample period.

Parameters:
  • name (str) – Name of the signal optimization process.

  • models (dict) – Dictionary of models to choose from. The keys are model names and the values are scikit-learn compatible models.

  • hyperparameters (dict) – Dictionary of hyperparameters to choose from. The keys are model names and the values are hyperparameter dictionaries for the corresponding model. The keys must match with those provided in models.

  • scorers (dict) – Dictionary of scoring functions to use in the hyperparameter optimization process. The keys are scorer names and the values are scikit-learn compatible scoring functions.

  • inner_splitters (dict) – Dictionary of inner splitters to use in the hyperparameter optimization process. The keys are splitter names and the values are scikit-learn compatible cross-validator objects.

  • search_type (str, optional) – Type of hyperparameter optimization to perform. Default is “grid”. Options are “grid” and “prior”.

  • normalize_fold_results (bool, optional) – Whether to normalize the scores across folds before combining them. Default is False.

  • cv_summary (str or callable, optional) – Summary function to use to combine scores across cross-validation folds. Default is “mean”. Options are “mean”, “median”, “mean-std”, “mean/std”, “mean-std-ge” or a callable function.

  • include_train_folds (bool, optional) – Whether to calculate cross-validation statistics on the training folds in addition to the test folds. If True, the cross-validation estimator will be a function of both training data and test data. It is recommended to set cv_summary appropriately. Default is False.

  • n_iter (int, optional) – Number of iterations to run in random hyperparameter search. Default is None.

  • n_jobs_cv (int, optional) – Number of parallel jobs to run the cross-validation process. Default is 1.

  • n_jobs_model (int, optional) – Number of parallel jobs to run the model fitting process (if relevant). Default is 1.

  • store_correlations (bool, optional) – Whether to store the correlations between input pipeline features and input predictor features. Default is False.
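Two of the requirements above lend themselves to a short sketch: the keys of models and hyperparameters must match, and cv_summary reduces the per-fold scores to one number. The helper names check_keys_match and summarize_folds are hypothetical and not part of the library. The reading of “mean-std” as mean minus standard deviation and “mean/std” as their ratio, and the use of the population standard deviation, are assumptions. “mean-std-ge” is left out because its definition is not given in the text. The object() placeholders stand in for scikit-learn compatible estimators.

```python
from statistics import mean, median, pstdev


def check_keys_match(models: dict, hyperparameters: dict) -> None:
    """Raise ValueError unless both dictionaries use the same model names."""
    mismatched = set(models) ^ set(hyperparameters)
    if mismatched:
        raise ValueError(f"Mismatched model names: {sorted(mismatched)}")


def summarize_folds(scores, how="mean"):
    """Combine cross-validation fold scores into a single number."""
    if callable(how):
        return how(scores)
    if how == "mean":
        return mean(scores)
    if how == "median":
        return median(scores)
    if how == "mean-std":  # assumed: penalize dispersion across folds
        return mean(scores) - pstdev(scores)
    if how == "mean/std":  # assumed: reward stable scores; fails if std is zero
        return mean(scores) / pstdev(scores)
    raise ValueError(f"Unsupported summary: {how!r}")


# Placeholders stand in for scikit-learn compatible estimators.
models = {"ols": object(), "ridge": object()}
hyperparameters = {
    "ols": {"fit_intercept": [True, False]},
    "ridge": {"alpha": [0.1, 1.0, 10.0]},
}
check_keys_match(models, hyperparameters)

fold_scores = [0.25, 0.5, 0.75]
summary = summarize_folds(fold_scores, "mean-std")
```

Summaries that penalize dispersion favor models whose scores are stable across folds over models with a higher but more erratic average.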

store_split_data(pipeline_name, optimal_model, optimal_model_name, optimal_model_score, optimal_model_params, inner_splitters_adj, X_train, y_train, X_test, y_test, timestamp, adjusted_test_index)

Stores characteristics of the optimal model at each retraining date.

Parameters:
  • pipeline_name (str) – Name of the signal optimization process.

  • optimal_model (RegressorMixin, ClassifierMixin or Pipeline) – Optimal model selected at each retraining date.

  • optimal_model_name (str) – Name of the optimal model.

  • optimal_model_score (float) – Cross-validation score for the optimal model.

  • optimal_model_params (dict) – Chosen hyperparameters for the optimal model.

  • inner_splitters_adj (dict) – Dictionary of adjusted inner splitters.

  • X_train (pd.DataFrame) – Training feature matrix.

  • y_train (pd.Series) – Training response variable.

  • X_test (pd.DataFrame) – Test feature matrix.

  • y_test (pd.Series) – Test response variable.

  • timestamp (pd.Timestamp) – Timestamp of the retraining date.

  • adjusted_test_index (pd.MultiIndex) – Adjusted test index to account for lagged features.

Returns:

Dictionary containing feature importance scores, intercepts, selected features and correlations between inputs to pipelines and those entered into a final model.

Return type:

dict

get_optimized_signals(name=None)

Returns forward forecasts for one or more pipelines.

Parameters:

name (str or list, optional) – Label(s) of forecast(s). Default is all stored in the class instance.

Returns:

Pandas dataframe in JPMaQS format of working daily predictions.

Return type:

pd.DataFrame
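The calls documented so far can be chained into one workflow. The sketch below uses only the constructor and method signatures listed on this page, and everything else is an assumption: the category names and the pipeline name are hypothetical, df is assumed to be a daily JPMaQS-format dataframe containing those categories, and real_date is assumed to be a date present in df. KFold is used only because the text asks for a scikit-learn compatible cross-validator; a panel-aware splitter may be more appropriate for time-series data. Imports sit inside the function so that defining the sketch does not require the packages to be installed.

```python
def run_forecast_sketch(df, real_date):
    """Fit candidate models on data before `real_date` and return the forecasts.

    `df` is assumed to be a daily JPMaQS-format dataframe holding the
    hypothetical categories named below.
    """
    # Lazy imports: these packages are only needed when the function is called.
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.metrics import make_scorer, r2_score
    from sklearn.model_selection import KFold
    from macrosynergy.learning.sequential.return_forecaster import (
        ReturnForecaster,
    )

    forecaster = ReturnForecaster(
        df=df,
        xcats=["CPI_YOY", "GROWTH", "FXXR"],  # hypothetical; response is last
        real_date=real_date,
        freq="M",
        lag=1,
        xcat_aggs=["last", "sum"],
    )

    forecaster.calculate_predictions(
        name="SKETCH",  # hypothetical pipeline name
        models={"ols": LinearRegression(), "ridge": Ridge()},
        hyperparameters={
            "ols": {"fit_intercept": [True, False]},
            "ridge": {"alpha": [0.1, 1.0, 10.0]},
        },
        scorers={"r2": make_scorer(r2_score)},
        inner_splitters={"kfold": KFold(n_splits=3)},
        search_type="grid",
        cv_summary="mean",
    )

    # Forecasts stored under the pipeline name, in JPMaQS format.
    return forecaster.get_optimized_signals(name="SKETCH")
```

The same name passed to calculate_predictions is the label later used to retrieve results from the getter methods.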

get_selected_features(name=None)

Returns the selected features for one or more pipelines.

Parameters:

name (str or list, optional) – Label(s) of pipeline(s). Default is all stored in the class instance.

Returns:

Pandas dataframe of the selected features at each retraining date.

Return type:

pd.DataFrame

get_feature_importances(name=None)

Returns feature importances for one or more pipelines.

Parameters:

name (str or list, optional) – Label(s) of pipeline(s). Default is all stored in the class instance.

Returns:

Pandas dataframe of the feature importances, if available, learnt at each retraining date for a given pipeline.

Return type:

pd.DataFrame

Notes

Availability of feature importances is subject to the selected model having a feature_importances_ or coef_ attribute.
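The availability rule in this note amounts to an attribute check on the fitted model. A minimal sketch follows, with stand-in classes in place of real estimators and a hypothetical helper name. The order in which the two attributes are checked is an assumption, as the text does not say which takes precedence.

```python
class TreeLikeModel:
    """Stand-in for a fitted model exposing feature_importances_."""
    feature_importances_ = [0.7, 0.3]


class LinearLikeModel:
    """Stand-in for a fitted model exposing coef_."""
    coef_ = [1.5, -0.5]


class OpaqueModel:
    """Stand-in for a fitted model exposing neither attribute."""


def extract_importances(model):
    """Return importances as a list, or None if the model exposes none."""
    for attr in ("feature_importances_", "coef_"):  # assumed precedence
        if hasattr(model, attr):
            return list(getattr(model, attr))
    return None


tree_importances = extract_importances(TreeLikeModel())
linear_importances = extract_importances(LinearLikeModel())
missing_importances = extract_importances(OpaqueModel())
```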

get_intercepts(name=None)

Returns intercepts for one or more pipelines.

Parameters:

name (str or list, optional) – Label(s) of pipeline(s). Default is all stored in the class instance.

Returns:

Pandas dataframe of the intercepts, if available, learnt at each retraining date for a given pipeline.

Return type:

pd.DataFrame

get_feature_correlations(name=None)

Returns dataframe of feature correlations for one or more pipelines.

Parameters:

name (str or list, optional) – Label(s) of the pipeline(s). Default is all stored in the class instance.

Returns:

Pandas dataframe of the correlations between the features passed into a model pipeline and the post-processed features passed to the final model.

Return type:

pd.DataFrame
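The kind of quantity described here can be sketched as pairwise Pearson correlations between raw pipeline inputs and the features that reach the final model. This is an illustration, not the library's computation: the feature names, the values and the spread transformation are hypothetical, and the choice of Pearson correlation is an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical raw features passed into a pipeline.
raw_inputs = {"X1": [1.0, 2.0, 3.0, 4.0], "X2": [4.0, 3.0, 2.0, 1.0]}

# Hypothetical post-processing step: the final model sees the spread X1 - X2.
final_features = {
    "SPREAD": [a - b for a, b in zip(raw_inputs["X1"], raw_inputs["X2"])]
}

# One correlation per (final feature, raw input) pair.
correlations = {
    (final_name, raw_name): pearson(final_values, raw_values)
    for final_name, final_values in final_features.items()
    for raw_name, raw_values in raw_inputs.items()
}
```

Such a table helps interpret transformed features by showing which raw inputs each of them tracks most closely.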