macrosynergy.learning.splitters.base_splitters

Base classes for panel cross-validation splitters.

class BasePanelSplit

Bases: BaseCrossValidator, ABC

Generic cross-validation class for panel data.

Notes

This class provides a common interface for cross-validation on panel data. Most of the splitting logic is left to subclasses; this base class contains the code needed to visualise the splits for each cross-section in the panel.

get_n_splits(X=None, y=None, groups=None)

Returns the number of splits in the cross-validator.

Parameters:

X (pd.DataFrame, optional) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.
y (Union[pd.Series, pd.DataFrame], optional) – Pandas dataframe or series of a target variable, multi-indexed by (cross-section, date). The dates must be in datetime format. If a dataframe is provided, the target variable must be the sole column.
groups (None) – Always ignored, exists for compatibility with scikit-learn.

Returns:

n_splits – Number of splits in the cross-validator.

Return type:

int
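The input layout described above can be illustrated with a small synthetic panel. This is a minimal sketch: the cross-section codes, feature names and the index level names ("cid", "real_date") are assumptions chosen for illustration and are not prescribed by this page.

```python
import numpy as np
import pandas as pd

# Hypothetical cross-sections and business-day dates, for illustration only.
cids = ["AUD", "CAD", "GBP"]
dates = pd.bdate_range("2020-01-01", periods=5)

# MultiIndex ordered as (cross-section, date); the date level is datetime-typed.
# The level names are an assumption, not a requirement stated in this page.
index = pd.MultiIndex.from_product([cids, dates], names=["cid", "real_date"])

rng = np.random.default_rng(0)

# Wide format: one column per feature.
X = pd.DataFrame(
    rng.normal(size=(len(index), 2)),
    index=index,
    columns=["FEATURE_A", "FEATURE_B"],
)

# Target variable as a series sharing the same (cross-section, date) index.
y = pd.Series(rng.normal(size=len(index)), index=index, name="TARGET")
```

A concrete splitter subclass would then receive X and y in this shape.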

visualise_splits(X, y, figsize=(20, 5), show_title=True, tick_fontsize=None, label_fontsize=None, subtitle_fontsize=None)

Visualise the cross-validation splits.

Parameters:

X (pd.DataFrame) – Pandas dataframe of features/quantamental indicators, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.
y (pd.DataFrame) – Pandas dataframe of the target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.
figsize (Tuple[int, int]) – Tuple of integers specifying the splitter visualisation figure size. Default is (20, 5).
show_title (bool, optional) – Boolean specifying whether to show the title of the figure. Default is True.
tick_fontsize (int, optional) – Integer specifying the size of the x-axis tick labels. Default is None.
label_fontsize (int, optional) – Integer specifying the size of the y-axis labels. Default is None.
subtitle_fontsize (int, optional) – Integer specifying the size of the subplot titles. Default is None.

class WalkForwardPanelSplit(min_cids, min_periods, min_xcats, start_date=None, max_periods=None)

Bases: BasePanelSplit, ABC

Generic walk-forward panel cross-validator.

Parameters:

min_cids (int) – Minimum number of cross-sections required for the first training set. Either start_date or (min_cids, min_periods, min_xcats) must be provided. If both are provided, start_date takes precedence.
min_periods (int) – Minimum number of time periods required for the first training set. Either start_date or (min_cids, min_periods, min_xcats) must be provided. If both are provided, start_date takes precedence.
min_xcats (int) – Minimum number of xcats required for the first training set. Either start_date or (min_cids, min_periods, min_xcats) must be provided. If both are provided, start_date takes precedence.
start_date (str, optional) – The targeted final date in the initial training set in ISO 8601 format. Default is None. Either start_date or (min_cids, min_periods, min_xcats) must be provided. If both are provided, start_date takes precedence.
max_periods (int, optional) – The maximum number of time periods in each training set. If the maximum is exceeded, the earliest periods are cut off. This effectively creates rolling training sets. Default is None.

Notes

Provides train/test indices to split a panel into train/test sets. Following an initial training set construction, a forward test set is created. The training and test set pair evolves over time by walking forward through the panel.
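The walk-forward idea, including the rolling effect of max_periods, can be sketched on a plain sequence of time periods. This is an illustrative simplification and not the library's implementation: the function name and the test_size argument are hypothetical, and the min_cids, min_xcats and start_date conditions are ignored.

```python
import numpy as np


def walk_forward_indices(n_periods, min_periods, test_size, max_periods=None):
    """Yield (train, test) arrays of period positions, walking forward in time.

    Hypothetical helper: `test_size` is not a parameter of WalkForwardPanelSplit.
    """
    train_end = min_periods  # the first training set covers periods [0, min_periods)
    while train_end < n_periods:
        train_start = 0
        if max_periods is not None:
            # Cut off the earliest periods once the training set grows too long.
            train_start = max(0, train_end - max_periods)
        train = np.arange(train_start, train_end)
        test = np.arange(train_end, min(train_end + test_size, n_periods))
        yield train, test
        train_end += test_size  # walk forward by one test block


splits = list(
    walk_forward_indices(n_periods=10, min_periods=4, test_size=2, max_periods=5)
)
for train, test in splits:
    print(train.tolist(), test.tolist())
```

With max_periods=None the training sets expand from the first period; with max_periods set, they roll forward at a capped length.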

class KFoldPanelSplit(n_splits=5, min_n_splits=2)

Bases: BasePanelSplit, ABC

split(X, y, groups=None)

Generate indices to split data into training and test sets.

Parameters:

X (pd.DataFrame) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.
y (Union[pd.DataFrame, pd.Series]) – Pandas dataframe or series of a target variable, multi-indexed by (cross-section, date). The dates must be in datetime format. If a dataframe is provided, the target variable must be the sole column.
groups (None) – Ignored. Exists for compatibility with scikit-learn.

Yields:

train (np.ndarray) – The training set indices for that split.
test (np.ndarray) – The testing set indices for that split.
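One way a panel split could yield train and test index arrays is sketched below. This page does not state how folds are formed, so the sketch assumes contiguous blocks of unique dates, with every cross-section sharing the same date boundaries, and it returns positional row indices. The function name kfold_by_date and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def kfold_by_date(X, n_splits=5):
    """Yield (train, test) positional indices, folding over unique dates.

    Hypothetical helper, assuming the second index level holds the dates.
    """
    dates = X.index.get_level_values(1)
    unique_dates = dates.unique().sort_values().to_numpy()
    # Partition the sorted dates into n_splits contiguous blocks.
    for block in np.array_split(unique_dates, n_splits):
        test_mask = dates.isin(block)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Small synthetic panel: 2 cross-sections x 6 business days.
cids = ["AUD", "CAD"]
days = pd.bdate_range("2021-01-04", periods=6)
index = pd.MultiIndex.from_product([cids, days], names=["cid", "real_date"])
X = pd.DataFrame({"FEATURE_A": np.arange(len(index), dtype=float)}, index=index)

folds = list(kfold_by_date(X, n_splits=3))
for train, test in folds:
    print(train.tolist(), test.tolist())
```

Because the arrays are positional, they could be passed to X.iloc to recover the rows of each fold.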