macrosynergy.learning.splitters.base_splitters

Base classes for panel cross-validation splitters.

class BasePanelSplit

Bases: BaseCrossValidator, ABC

Generic cross-validation class for panel data.

Notes

This class provides a common interface for cross-validation on panel data. Most of the splitting logic can be written in subclasses, while this base class contains the code needed to visualise the splits for each cross-section in the panel.

get_n_splits(X=None, y=None, groups=None)

Returns the number of splits in the cross-validator.

Parameters:
  • X (pd.DataFrame, optional) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

  • y (Union[pd.Series, pd.DataFrame], optional) – Pandas dataframe or series of a target variable, multi-indexed by (cross-section, date). The dates must be in datetime format. If a dataframe is provided, the target variable must be the sole column.

  • groups (None) – Always ignored, exists for compatibility with scikit-learn.

Returns:

n_splits – Number of splits in the cross-validator.

Return type:

int
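The input layout described above can be sketched as follows. This is a minimal illustration that uses only pandas and NumPy; the cross-section codes, the index level names (`cid`, `real_date`) and the column names are assumptions chosen for the example, not values required by the library.

```python
import numpy as np
import pandas as pd

# Hypothetical cross-sections and business-day dates, for illustration only.
cids = ["AUD", "CAD", "GBP"]
dates = pd.bdate_range("2020-01-01", periods=5)

# Two-level index: (cross-section, date), with the dates stored as datetimes.
index = pd.MultiIndex.from_product([cids, dates], names=["cid", "real_date"])

rng = np.random.default_rng(0)

# Wide format: one column per feature.
X = pd.DataFrame(
    rng.normal(size=(len(index), 2)),
    index=index,
    columns=["feature_a", "feature_b"],
)

# Target as a series on the same index. A dataframe with the target as its
# sole column would also match the description of y.
y = pd.Series(rng.normal(size=len(index)), index=index, name="target")
```

A panel built this way has one row per (cross-section, date) pair, which is the shape the `X` and `y` parameters are documented to expect.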

visualise_splits(X, y, figsize=(20, 5), show_title=True, tick_fontsize=None, label_fontsize=None, subtitle_fontsize=None, drop_nas=True)

Visualise the cross-validation splits.

Parameters:
  • X (pd.DataFrame) – Pandas dataframe of features/quantamental indicators, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

  • y (pd.DataFrame) – Pandas dataframe of the target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.

  • figsize (Tuple[int, int]) – Tuple of integers specifying the size of the splitter visualisation figure. Default is (20, 5).

  • show_title (bool, optional) – Boolean specifying whether to show the title of the figure. Default is True.

  • tick_fontsize (int, optional) – Integer specifying the size of the x-axis tick labels. Default is None.

  • label_fontsize (int, optional) – Integer specifying the size of the y-axis labels. Default is None.

  • subtitle_fontsize (int, optional) – Integer specifying the size of the subplot titles. Default is None.

  • drop_nas (bool, optional) – Whether to drop rows with NaN values in the dataframe. Default is True. If False, only the rows with NaN values in the dependent variable are dropped.
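The two `drop_nas` modes can be illustrated with plain pandas. The sketch below only mirrors the documented behaviour; `filter_rows` is a hypothetical helper written for this example and is not part of the library, and the data values are made up.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["AUD", "CAD"], pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"])],
    names=["cid", "real_date"],
)
# One missing feature value (row 1) and one missing target value (row 2).
X = pd.DataFrame({"feature_a": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0]}, index=index)
y = pd.Series([0.1, 0.2, np.nan, 0.4, 0.5, 0.6], index=index, name="target")


def filter_rows(X, y, drop_nas=True):
    """Hypothetical helper mirroring the documented drop_nas behaviour."""
    if drop_nas:
        # Drop rows with a NaN in any feature or in the target.
        keep = X.notna().all(axis=1) & y.notna()
    else:
        # Drop only rows whose target is NaN; NaN features are kept.
        keep = y.notna()
    return X[keep], y[keep]


X_strict, y_strict = filter_rows(X, y, drop_nas=True)
X_loose, y_loose = filter_rows(X, y, drop_nas=False)
```

With `drop_nas=False`, rows with missing features remain in the panel, so any downstream model would need to handle them itself.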

class WalkForwardPanelSplit(min_cids, min_periods, start_date=None, max_periods=None, drop_nas=True)

Bases: BasePanelSplit, ABC

Generic walk-forward panel cross-validator.

Parameters:
  • min_cids (int) – Minimum number of cross-sections required for the first training set. Either start_date or (min_cids, min_periods) must be provided. If both are provided, start_date takes precedence.

  • min_periods (int) – Minimum number of time periods required for the first training set. Either start_date or (min_cids, min_periods) must be provided. If both are provided, start_date takes precedence.

  • start_date (str, optional) – The targeted final date of the initial training set, in ISO 8601 format. Default is None. Either start_date or (min_cids, min_periods) must be provided. If both are provided, start_date takes precedence.

  • max_periods (int, optional) – The maximum number of time periods in each training set. If the maximum is exceeded, the earliest periods are cut off. This effectively creates rolling training sets. Default is None.

  • drop_nas (bool, optional) – Whether to drop rows with NaN values in the dataframe. Default is True. If False, only the rows with NaN values in the dependent variable are dropped.

Notes

Provides train/test indices to split a panel into train/test sets. After the initial training set is constructed, a forward test set is created. The training and test set pair then evolves over time by walking forward through the panel.
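The walk-forward idea, including the effect of `max_periods`, can be sketched on bare period positions. This is an illustrative simplification, not the library's implementation: it ignores `min_cids` and `start_date`, and both the function name `walk_forward_periods` and the `test_size` argument are assumptions introduced for the example.

```python
from typing import Iterator, List, Optional, Tuple


def walk_forward_periods(
    n_periods: int,
    min_periods: int,
    test_size: int = 1,
    max_periods: Optional[int] = None,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) lists of period positions, walking forward in time."""
    train_end = min_periods
    while train_end < n_periods:
        train_start = 0
        if max_periods is not None:
            # Cut off the earliest periods once the cap is exceeded,
            # which turns the expanding window into a rolling one.
            train_start = max(0, train_end - max_periods)
        train = list(range(train_start, train_end))
        test = list(range(train_end, min(train_end + test_size, n_periods)))
        yield train, test
        train_end += test_size


splits = list(walk_forward_periods(n_periods=6, min_periods=3, max_periods=4))
for train, test in splits:
    print("train:", train, "test:", test)
```

In this sketch the training window expands until it holds four periods and then rolls forward, dropping its earliest period at each step.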

class KFoldPanelSplit(n_splits=5, min_n_splits=2)

Bases: BasePanelSplit, ABC

split(X, y, groups=None)

Generate indices to split data into training and test sets.

Parameters:
  • X (pd.DataFrame) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

  • y (Union[pd.DataFrame, pd.Series]) – Pandas dataframe or series of a target variable, multi-indexed by (cross-section, date). The dates must be in datetime format. If a dataframe is provided, the target variable must be the sole column.

  • groups (None) – Ignored. Exists for compatibility with scikit-learn.

Yields:
  • train (np.ndarray) – The training set indices for that split.

  • test (np.ndarray) – The testing set indices for that split.
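To show what a generator of `(train, test)` index arrays might look like for a panel, here is a hedged sketch that folds over contiguous blocks of unique dates and yields positional row indices. The function `kfold_by_date` is hypothetical and is not claimed to match how concrete subclasses of this class build their folds; the sample panel is made up for the example.

```python
import numpy as np
import pandas as pd


def kfold_by_date(X: pd.DataFrame, n_splits: int = 5):
    """Hypothetical generator: contiguous date folds, positional row indices."""
    dates = X.index.get_level_values(1)
    # Integer code per row, giving the rank of its date among the sorted unique dates.
    date_codes, unique_dates = dates.factorize(sort=True)
    positions = np.arange(len(X))
    for fold in np.array_split(np.arange(len(unique_dates)), n_splits):
        in_test = np.isin(date_codes, fold)
        # All cross-sections share the same test dates within a fold.
        yield positions[~in_test], positions[in_test]


index = pd.MultiIndex.from_product(
    [["AUD", "CAD"], pd.bdate_range("2020-01-01", periods=6)],
    names=["cid", "real_date"],
)
X = pd.DataFrame({"feature_a": np.arange(12.0)}, index=index)

folds = list(kfold_by_date(X, n_splits=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Splitting on dates rather than on rows keeps each test period aligned across cross-sections, which is the property a panel splitter typically needs.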