macrosynergy.learning.splitters

class BasePanelSplit

Bases: BaseCrossValidator, ABC

Generic cross-validation class for panel data.

Notes

This class is designed to provide a common interface for cross-validation on panel data. Much of the logic can be written in subclasses, but this base class contains the necessary code to visualise the splits for each cross-section in the panel.

get_n_splits(X=None, y=None, groups=None)

Returns the number of splits in the cross-validator.

Parameters:
  • X (pd.DataFrame, optional) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

  • y (Union[pd.Series, pd.DataFrame], optional) – Pandas dataframe or series of a target variable, multi-indexed by (cross-section, date). The dates must be in datetime format. If a dataframe is provided, the target variable must be the sole column.

  • groups (None) – Always ignored, exists for compatibility with scikit-learn.

Returns:

n_splits – Number of splits in the cross-validator.

Return type:

int

visualise_splits(X, y, figsize=(20, 5))

Visualise the cross-validation splits.

Parameters:
  • X (pd.DataFrame) – Pandas dataframe of features/quantamental indicators, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

  • y (pd.DataFrame) – Pandas dataframe of target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.

  • figsize (Tuple[int, int]) – Tuple of integers specifying the splitter visualisation figure size.
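
The input layout described above can be sketched as follows. This is a minimal illustration rather than code from the library: the cross-section codes, the feature and target names (`GROWTH`, `INFL`, `XR`), the index level names (`cid`, `real_date`) and the random values are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical cross-sections and business-day dates (assumed for illustration).
cids = ["AUD", "CAD", "GBP"]
dates = pd.bdate_range("2020-01-01", periods=5)

# Multi-index by (cross-section, date); the level names are an assumption.
index = pd.MultiIndex.from_product([cids, dates], names=["cid", "real_date"])

rng = np.random.default_rng(0)

# Wide format: one column per feature.
X = pd.DataFrame(
    rng.normal(size=(len(index), 2)),
    index=index,
    columns=["GROWTH", "INFL"],
)

# Target as a single-column dataframe sharing the same index.
y = pd.DataFrame(rng.normal(size=len(index)), index=index, columns=["XR"])
```

A panel built this way has one row per (cross-section, date) pair and a datetime-typed second index level, which is the shape the splitters on this page expect.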

class KFoldPanelSplit(n_splits=5, min_n_splits=2)

Bases: BasePanelSplit, ABC

split(X, y, groups=None)

Generate indices to split data into training and test sets.

Parameters:
  • X (pd.DataFrame) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

  • y (Union[pd.DataFrame, pd.Series]) – Pandas dataframe or series of a target variable, multi-indexed by (cross-section, date). The dates must be in datetime format. If a dataframe is provided, the target variable must be the sole column.

  • groups (None) – Ignored. Exists for compatibility with scikit-learn.

Yields:
  • train (np.ndarray) – The training set indices for that split.

  • test (np.ndarray) – The testing set indices for that split.
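
The sketch below shows how yielded index arrays are typically consumed. It assumes the arrays hold integer row positions into X, as in the scikit-learn convention. `toy_split` is a hypothetical stand-in generator written for this example, not a library function, and the panel contents are made up.

```python
import numpy as np
import pandas as pd

def toy_split(X):
    """Hypothetical stand-in for a splitter's split method: one date-based fold."""
    dates = X.index.get_level_values(1)
    unique_dates = dates.unique().sort_values()
    cutoff = unique_dates[len(unique_dates) // 2]
    # Integer row positions before and from the cutoff date, across all cross-sections.
    train = np.flatnonzero(dates < cutoff)
    test = np.flatnonzero(dates >= cutoff)
    yield train, test

# Small assumed panel: 2 cross-sections x 4 business days.
index = pd.MultiIndex.from_product(
    [["AUD", "CAD"], pd.bdate_range("2020-01-01", periods=4)],
    names=["cid", "real_date"],
)
X = pd.DataFrame({"F": np.arange(8.0)}, index=index)

folds = []
for train, test in toy_split(X):
    # Positional indices are applied with .iloc, not .loc.
    folds.append((X.iloc[train], X.iloc[test]))
```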

class WalkForwardPanelSplit(min_cids, min_periods, start_date=None, max_periods=None)

Bases: BasePanelSplit, ABC

Generic walk-forward panel cross-validator.

Parameters:
  • min_cids (int) – Minimum number of cross-sections required for the first training set. Either start_date or (min_cids, min_periods) must be provided. If both are provided, start_date takes precedence.

  • min_periods (int) – Minimum number of time periods required for the first training set. Either start_date or (min_cids, min_periods) must be provided. If both are provided, start_date takes precedence.

  • start_date (str, optional) – The targeted final date in the initial training set in ISO 8601 format. Default is None. Either start_date or (min_cids, min_periods) must be provided. If both are provided, start_date takes precedence.

  • max_periods (int, optional) – The maximum number of time periods in each training set. If the maximum is exceeded, the earliest periods are cut off. This effectively creates rolling training sets. Default is None.

Notes

Provides train/test indices to split a panel into train/test sets. Following an initial training set construction, a forward test set is created. The training and test set pair evolves over time by walking forward through the panel.
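
One plausible reading of the (min_cids, min_periods) rule is that the first training set ends on the earliest date by which at least min_cids cross-sections each have at least min_periods observations. The sketch below implements that reading only. The helper `first_training_end`, the level names and the sample panel are assumptions for illustration, and the library's actual rule may differ.

```python
import pandas as pd

def first_training_end(index, min_cids, min_periods):
    """Earliest date by which `min_cids` cross-sections each have `min_periods` rows.

    Hypothetical helper illustrating one interpretation of the parameters.
    """
    frame = index.to_frame(index=False)
    frame.columns = ["cid", "real_date"]
    frame = frame.sort_values(["cid", "real_date"])
    # Running observation count within each cross-section.
    frame["n_obs"] = frame.groupby("cid").cumcount() + 1
    # Date on which each cross-section reaches exactly `min_periods` observations.
    reached = frame.loc[frame["n_obs"] == min_periods, "real_date"].sort_values()
    if len(reached) < min_cids:
        raise ValueError("Panel too short for the requested minimums.")
    return reached.iloc[min_cids - 1]

# Assumed panel with staggered start dates per cross-section.
starts = [("AUD", "2020-01-01"), ("CAD", "2020-01-03"), ("GBP", "2020-01-08")]
index = pd.MultiIndex.from_tuples(
    [(cid, d) for cid, start in starts for d in pd.bdate_range(start, "2020-01-31")],
    names=["cid", "real_date"],
)

end = first_training_end(index, min_cids=2, min_periods=5)
```

With these staggered starts, the second cross-section to accumulate five business days determines the end date of the initial training set.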

class ExpandingKFoldPanelSplit(n_splits=5, min_n_splits=2)

Bases: KFoldPanelSplit

Time-respecting K-Fold cross-validator for panel data.

Parameters:

n_splits (int) – Number of folds, i.e. (training set, test set) pairs. Default is 5. Must be at least 2.

Notes

This splitter can be considered to be a panel data analogue to the TimeSeriesSplit splitter provided by scikit-learn.

Unique dates in the panel are divided into ‘n_splits + 1’ sequential and non-overlapping intervals, resulting in ‘n_splits’ pairs of training and test sets. The ‘i’th training set is the union of the first ‘i’ intervals, and the ‘i’th test set is the ‘i+1’th interval.
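
The interval logic above can be sketched on a plain array of unique dates. This is an illustration under assumptions: `expanding_kfold_dates` is a hypothetical helper, and `np.array_split` is used to form the intervals, so the way leftover dates are distributed may not match the library.

```python
import numpy as np

def expanding_kfold_dates(unique_dates, n_splits):
    """Yield (train_dates, test_dates) pairs; hypothetical helper for illustration."""
    # n_splits + 1 sequential, non-overlapping intervals of sorted unique dates.
    intervals = np.array_split(np.sort(unique_dates), n_splits + 1)
    for i in range(1, n_splits + 1):
        train_dates = np.concatenate(intervals[:i])  # union of the first i intervals
        test_dates = intervals[i]                    # the (i+1)th interval
        yield train_dates, test_dates

# Twelve consecutive daily dates, assumed for the example.
dates = np.arange("2020-01-01", "2020-01-13", dtype="datetime64[D]")
pairs = list(expanding_kfold_dates(dates, n_splits=3))
```

Every training set ends strictly before its test set begins, which is what makes the scheme time-respecting.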

class RollingKFoldPanelSplit(n_splits=5, min_n_splits=2)

Bases: KFoldPanelSplit

Unshuffled K-Fold cross-validator for panel data.

Parameters:

n_splits (int) – Number of folds. Default is 5. Must be at least 2.

Notes

This splitter can be considered to be a panel data analogue to the KFold splitter provided by scikit-learn, with shuffle=False and with splits determined on the time dimension.

Unique dates in the panel are divided into ‘n_splits’ sequential and non-overlapping intervals of equal size, resulting in ‘n_splits’ pairs of training and test sets. The ‘i’th test set is the ‘i’th interval, and the ‘i’th training set is all other intervals.
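
A minimal sketch of this scheme on an array of unique dates follows. `rolling_kfold_dates` is a hypothetical helper, and the use of `np.array_split` to form the intervals is an assumption; the library may place interval boundaries differently.

```python
import numpy as np

def rolling_kfold_dates(unique_dates, n_splits):
    """Yield (train_dates, test_dates) pairs; hypothetical helper for illustration."""
    intervals = np.array_split(np.sort(unique_dates), n_splits)
    for i in range(n_splits):
        test_dates = intervals[i]
        # Training set is every interval except the i-th, before and after it.
        train_dates = np.concatenate(intervals[:i] + intervals[i + 1:])
        yield train_dates, test_dates

# Twelve consecutive daily dates, assumed for the example.
dates = np.arange("2020-01-01", "2020-01-13", dtype="datetime64[D]")
pairs = list(rolling_kfold_dates(dates, n_splits=4))
```

Unlike the expanding variant, training sets here can contain dates later than the test set, so the split is unshuffled but not time-respecting.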

class RecencyKFoldPanelSplit(n_splits=5, n_periods=252)

Bases: KFoldPanelSplit

Time-respecting K-Fold panel cross-validator that creates training and test sets based on the most recent samples in the panel.

Parameters:
  • n_splits (int) – Number of folds, i.e. (training set, test set) pairs. Default is 5. Must be at least 1.

  • n_periods (int) – Number of time periods, in units of native dataset frequency, to comprise each test set. Default is 252 (1 year for daily data).

Notes

This splitter is similar to ExpandingKFoldPanelSplit, except that the sorted unique timestamps are not divided into equal intervals. Instead, the last n_periods * n_splits timestamps in the panel are divided into n_splits non-overlapping intervals, each of which is used as a test set. The corresponding training set comprises all samples with timestamps earlier than its test set. Consequently, this is a K-Fold walk-forward cross-validator, but with test folds concentrated on the most recent information.
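
The description above can be sketched on an array of unique dates as follows. `recency_kfold_dates` is a hypothetical helper, and raising an error when no dates are left for training is an assumption made for this example.

```python
import numpy as np

def recency_kfold_dates(unique_dates, n_splits, n_periods):
    """Yield (train_dates, test_dates) pairs; hypothetical helper for illustration."""
    dates = np.sort(unique_dates)
    n_test = n_splits * n_periods
    if n_test >= len(dates):
        raise ValueError("Not enough dates to leave a training set.")
    # Only the most recent n_splits * n_periods dates are used for testing.
    test_blocks = np.array_split(dates[-n_test:], n_splits)
    for block in test_blocks:
        # Training set: every date strictly earlier than the test block.
        train_dates = dates[dates < block[0]]
        yield train_dates, block

# Twenty consecutive daily dates, assumed for the example.
dates = np.arange("2020-01-01", "2020-01-21", dtype="datetime64[D]")
pairs = list(recency_kfold_dates(dates, n_splits=3, n_periods=4))
```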

class ExpandingIncrementPanelSplit(train_intervals=21, test_size=21, min_cids=4, min_periods=500, start_date=None, max_periods=None)

Bases: WalkForwardPanelSplit

Walk-forward cross-validator over a panel.

Provides train/test indices to split data into train/test sets. The dataset is split so that subsequent training sets are expanded by a fixed number of time periods to incorporate the latest available information. Each training set is followed by a test set of fixed length.

Parameters:
  • train_intervals (int) – The number of time periods by which the previous training set is expanded. Default is 21.

  • test_size (int) – The number of time periods forward of each training set to use in the associated test set. Default is 21.

  • min_cids (int) – The minimum number of cross-sections required for the first training set. Default is 4. Either start_date or (min_cids, min_periods) must be provided. If both are provided, start_date takes precedence.

  • min_periods (int) – The minimum number of time periods required for the first training set. Default is 500. Either start_date or (min_cids, min_periods) must be provided. If both are provided, start_date takes precedence.

  • start_date (Optional[str]) – The targeted final date in the initial training set in ISO 8601 format. Default is None. Either start_date or (min_cids, min_periods) must be provided. If both are provided, start_date takes precedence.

  • max_periods (Optional[int]) – The maximum number of time periods in each training set. If the maximum is exceeded, the earliest periods are cut off. This effectively creates rolling training sets. Default is None.

Notes

The first training set is determined by the specification of either start_date or by the parameters min_cids and min_periods. When start_date is provided, the initial training set comprises all available data before and including the start_date, unless max_periods is specified, in which case at most the last max_periods periods prior to the start_date are included.

If start_date is not provided, the first training set is determined by the parameters min_cids and min_periods. This set comprises at least min_periods time periods for at least min_cids cross-sections.
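
Once the first training set is fixed, the expansion scheme can be sketched on positions of the sorted unique dates. This is an illustration under assumptions: `expanding_increment_positions` and its `first_train_end` argument are hypothetical, and the treatment of a final, shorter test set is a guess rather than documented behaviour.

```python
import numpy as np

def expanding_increment_positions(
    n_dates, first_train_end, train_intervals, test_size, max_periods=None
):
    """Yield (train, test) arrays of date positions; hypothetical helper.

    `first_train_end` is the position of the last date in the first training set.
    """
    end = first_train_end + 1  # exclusive end of the current training window
    while end < n_dates:
        # With max_periods set, the earliest periods are cut off (rolling window).
        start = 0 if max_periods is None else max(0, end - max_periods)
        train = np.arange(start, end)
        # Assumption: the last test set is truncated at the end of the panel.
        test = np.arange(end, min(end + test_size, n_dates))
        yield train, test
        end += train_intervals

splits = list(
    expanding_increment_positions(
        n_dates=10, first_train_end=3, train_intervals=2, test_size=2, max_periods=5
    )
)
```

In a panel setting these date positions would then be mapped to the row indices of all samples, across cross-sections, that fall on the selected dates.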

split(X, y, groups=None)

Generate indices to split data into training and test sets.

Parameters:
  • X (pd.DataFrame) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

  • y (Union[pd.DataFrame, pd.Series]) – Pandas dataframe or series of a target variable, multi-indexed by (cross-section, date). The dates must be in datetime format. If a dataframe is provided, the target variable must be the sole column.

  • groups (None) – Ignored. Exists for compatibility with scikit-learn.

Yields:
  • train (np.ndarray) – The training set indices for that split.

  • test (np.ndarray) – The testing set indices for that split.

get_n_splits(X, y, groups=None)

Calculates and returns the number of splits.

Parameters:
  • X (pd.DataFrame) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

  • y (Union[pd.DataFrame, pd.Series]) – Pandas dataframe or series of a target variable, multi-indexed by (cross-section, date). The dates must be in datetime format. If a dataframe is provided, the target variable must be the sole column.

  • groups (None) – Ignored. Exists for compatibility with scikit-learn.

Returns:

n_splits – The number of splits.

Return type:

int

class ExpandingFrequencyPanelSplit(expansion_freq='D', test_freq='D', min_cids=4, min_periods=500, start_date=None, max_periods=None)

Bases: WalkForwardPanelSplit

Walk-forward cross-validator over a panel.

Provides train/test indices to split data into train/test sets. The dataset is split so that subsequent training sets are expanded by a user-specified frequency to incorporate the latest available information. Each training set is followed by a test set spanning a user-defined frequency.

Parameters:
  • expansion_freq (str) – Frequency of training set expansion. For a given native dataset frequency, the training sets expand by the smallest number of dates to cover this frequency. Default is “D”. Accepted values are “D”, “W”, “M”, “Q” and “Y”.

  • test_freq (str) – Frequency forward of each training set for the unique dates in each test set to cover. Default is “D”. Accepted values are “D”, “W”, “M”, “Q” and “Y”.

  • min_cids (int) – Minimum number of cross-sections required for the initial training set. Default is 4. Either start_date or (min_cids, min_periods) must be provided. If both are provided, start_date takes precedence.

  • min_periods (int) – Minimum number of time periods required for the initial training set. Default is 500. Either start_date or (min_cids, min_periods) must be provided. If both are provided, start_date takes precedence.

  • start_date (Optional[str]) – First rebalancing date in ISO 8601 format. This is the last date of the first training set. Default is None. Either start_date or (min_cids, min_periods) must be provided. If both are provided, start_date takes precedence.

  • max_periods (Optional[int]) – The maximum number of time periods in each training set. If the maximum is exceeded, the earliest periods are cut off. This effectively creates rolling training sets. Default is None.

Notes

The first training set is determined either by start_date or by the parameters min_cids and min_periods together. When start_date is provided, the initial training set comprises all available data up to and including the start_date, unless max_periods is specified, in which case at most the last max_periods periods up to the start_date are included.

If start_date is not provided, the first training set is determined by the parameters min_cids and min_periods. This set comprises at least min_periods time periods for at least min_cids cross-sections.

This initial training set is immediately adjusted according to the specified expansion frequency. For instance, if expansion_freq is “M”, the initial training set is further expanded so that all samples prior to the end of the month are included.

The associated test set immediately follows the adjusted initial training set and spans the specified test set frequency forward of its associated training set. For instance, if the test frequency is “Q”, the available dates that cover the subsequent quarter are grouped together to form the test set.

Subsequent training sets are created by expanding the previous training set by the smallest number of dates to cover the training frequency. As before, each test set immediately follows its associated training set and is determined in the same manner as the initial test set.
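
The frequency-based walk-forward logic above can be sketched with pandas periods. This is one possible reading and not the library's implementation: `expanding_frequency_dates` is a hypothetical helper, the first training set is taken to end with the calendar period containing start_date, and each test set is assumed to cover the calendar period containing the first date after the training set.

```python
import pandas as pd

def expanding_frequency_dates(unique_dates, start_date, expansion_freq="M", test_freq="M"):
    """Yield (train_dates, test_dates) pairs; hypothetical helper for illustration."""
    dates = pd.DatetimeIndex(unique_dates).sort_values()
    # Period (e.g. calendar month) containing the requested start date.
    train_period = pd.Period(start_date, freq=expansion_freq)
    while True:
        train_end = train_period.end_time
        train = dates[dates <= train_end]
        remaining = dates[dates > train_end]
        if len(remaining) == 0:
            break
        # Assumption: the test set spans the calendar period of the next available date.
        test_end = remaining[0].to_period(test_freq).end_time
        test = remaining[remaining <= test_end]
        yield train, test
        train_period += 1  # expand the training set by one period

# Business days over the first half of 2020, assumed for the example.
dates = pd.bdate_range("2020-01-01", "2020-06-30")
pairs = list(expanding_frequency_dates(dates, "2020-02-14", "M", "M"))
```

Here a start date in mid-February is extended to the end of February, the first test set is March, and each later split adds one more month of training data.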

split(X, y, groups=None)

Generate indices to split data into training and test sets.

Parameters:
  • X (pd.DataFrame) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

  • y (Union[pd.DataFrame, pd.Series]) – Pandas dataframe or series of a target variable, multi-indexed by (cross-section, date). The dates must be in datetime format. If a dataframe is provided, the target variable must be the sole column.

  • groups (None) – Ignored. Exists for compatibility with scikit-learn.

Yields:
  • train (np.ndarray) – The training set indices for that split.

  • test (np.ndarray) – The testing set indices for that split.

get_n_splits(X=None, y=None, groups=None)

Calculates and returns the number of splits.

Parameters:
  • X (pd.DataFrame, optional) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

  • y (pd.DataFrame, optional) – Pandas dataframe of the target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.

  • groups (None) – Always ignored, exists for compatibility with scikit-learn.

Returns:

n_splits – The number of splits.

Return type:

int
