macrosynergy.learning.splitters.walk_forward_splitters
Classes for incremental expanding panel cross-validators.
- class ExpandingIncrementPanelSplit(train_intervals=21, test_size=21, min_cids=4, min_periods=500, min_xcats=1, start_date=None, max_periods=None)
Bases: WalkForwardPanelSplit
Walk-forward cross-validator over a panel.
Provides train/test indices to split data into train/test sets. The dataset is split so that subsequent training sets are expanded by a fixed number of time periods to incorporate the latest available information. Each training set is followed by a test set of fixed length.
- Parameters:
train_intervals (int) – The number of time periods by which the previous training set is expanded. Default is 21.
test_size (int) – The number of time periods forward of each training set to use in the associated test set. Default is 21.
min_cids (int) – The minimum number of cross-sections required for the first training set. Default is 4. Either start_date or (min_cids, min_periods, min_xcats) must be provided. If both are provided, start_date takes precedence.
min_periods (int) – The minimum number of time periods required for the first training set. Default is 500. Either start_date or (min_cids, min_periods, min_xcats) must be provided. If both are provided, start_date takes precedence.
min_xcats (int) – The minimum number of xcats required for the first training set. Default is 1. Either start_date or (min_cids, min_periods, min_xcats) must be provided. If both are provided, start_date takes precedence.
start_date (Optional[str]) – The targeted final date in the initial training set in ISO 8601 format. Default is None. Either start_date or (min_cids, min_periods, min_xcats) must be provided. If both are provided, start_date takes precedence.
max_periods (Optional[int]) – The maximum number of time periods in each training set. If the maximum is exceeded, the earliest periods are cut off. This effectively creates rolling training sets. Default is None.
Notes
The first training set is determined either by start_date or by the parameters min_cids, min_periods, and min_xcats. When start_date is provided, the initial training set comprises all available data up to and including start_date, unless max_periods is specified, in which case at most the last max_periods periods up to and including start_date are included.
If start_date is not provided, the first training set is determined by the parameters min_cids, min_periods, and min_xcats. This set comprises at least min_xcats categories for at least min_periods time periods for at least min_cids cross-sections.
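The expansion scheme described above can be sketched on period positions alone. This is a minimal illustration, not the library's implementation: the helper name `expanding_increment_splits` and the argument `first_train_end` are hypothetical, and the handling of a final, shorter test set (truncated here at the end of the sample) is an assumption.

```python
from typing import Iterator, Optional, Tuple


def expanding_increment_splits(
    n_periods: int,
    first_train_end: int,
    train_intervals: int = 21,
    test_size: int = 21,
    max_periods: Optional[int] = None,
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) ranges of period positions.

    first_train_end is the exclusive end of the first training set,
    i.e. the number of periods it would hold without max_periods.
    """
    train_end = first_train_end
    while train_end < n_periods:
        # Rolling window: drop the earliest periods beyond max_periods.
        train_start = 0 if max_periods is None else max(0, train_end - max_periods)
        # Assumption: the last test set is truncated at the end of the sample.
        test_end = min(train_end + test_size, n_periods)
        yield range(train_start, train_end), range(train_end, test_end)
        # Each later training set gains train_intervals periods.
        train_end += train_intervals


splits = list(
    expanding_increment_splits(
        n_periods=100,
        first_train_end=40,
        train_intervals=20,
        test_size=10,
        max_periods=50,
    )
)
for train, test in splits:
    print(f"train [{train.start}, {train.stop})  test [{test.start}, {test.stop})")
# train [0, 40)  test [40, 50)
# train [10, 60)  test [60, 70)
# train [30, 80)  test [80, 90)
```

With max_periods=50 the second and third training sets lose their earliest periods, which is the rolling behaviour the max_periods parameter describes.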
- split(X, y, groups=None)
Generate indices to split data into training and test sets.
- Parameters:
X (pd.DataFrame) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.
y (Union[pd.DataFrame, pd.Series]) – Pandas dataframe or series of a target variable, multi-indexed by (cross-section, date). The dates must be in datetime format. If a dataframe is provided, the target variable must be the sole column.
groups (None) – Ignored. Exists for compatibility with scikit-learn.
- Yields:
train (np.ndarray) – The training set indices for that split.
test (np.ndarray) – The testing set indices for that split.
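To make the expected input format and the meaning of the yielded arrays concrete, the following sketch builds a toy panel and constructs one split by hand. It assumes, as in scikit-learn, that the indices are integer row positions usable with `.iloc`. The index level names (`cid`, `real_date`), the currency codes and the feature names are hypothetical, and the split is hand-made, not produced by the splitter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: 3 cross-sections, 6 business days, 2 features.
cids = ["AUD", "CAD", "GBP"]
dates = pd.bdate_range("2024-01-01", periods=6)
index = pd.MultiIndex.from_product([cids, dates], names=["cid", "real_date"])

rng = np.random.default_rng(0)
# Wide format: one column per feature, one row per (cross-section, date).
X = pd.DataFrame(rng.normal(size=(len(index), 2)), index=index, columns=["F1", "F2"])
y = pd.Series(rng.normal(size=len(index)), index=index, name="target")

# One hand-made split: train on the first four dates, test on the last two.
row_dates = X.index.get_level_values("real_date")
train = np.where(row_dates.isin(dates[:4]))[0]
test = np.where(row_dates.isin(dates[4:]))[0]

# Positional indices select rows across all cross-sections at once.
X_train, y_train = X.iloc[train], y.iloc[train]
X_test, y_test = X.iloc[test], y.iloc[test]
print(train)  # [ 0  1  2  3  6  7  8  9 12 13 14 15]
print(test)   # [ 4  5 10 11 16 17]
```

Because the rows are ordered by cross-section first, a split on dates yields non-contiguous positions, which is why the splitter returns index arrays and not slices.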
- get_n_splits(X, y, groups=None)
Calculates and returns the number of splits.
- Parameters:
X (pd.DataFrame) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.
y (Union[pd.DataFrame, pd.Series]) – Pandas dataframe or series of a target variable, multi-indexed by (cross-section, date). The dates must be in datetime format. If a dataframe is provided, the target variable must be the sole column.
groups (None) – Ignored. Exists for compatibility with scikit-learn.
- Returns:
n_splits – The number of splits.
- Return type:
int
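As a rough guide to the value returned, the number of splits for a fixed-increment expanding scheme can be estimated from the number of unique periods that follow the first training set. The helper `count_splits` below is hypothetical, and the formula is an assumption about the counting rule (every training set must be followed by at least one test period), not the library's code.

```python
import math


def count_splits(n_periods: int, first_train_end: int, train_intervals: int = 21) -> int:
    """Number of training sets when each one adds train_intervals periods.

    Hypothetical helper. n_periods is the number of unique dates and
    first_train_end is the number of dates in the first training set.
    """
    remaining = n_periods - first_train_end
    if remaining <= 0:
        # No data left for a test set, so no split can be formed.
        return 0
    return math.ceil(remaining / train_intervals)


n_splits = count_splits(n_periods=1000, first_train_end=500, train_intervals=21)
print(n_splits)  # 24
```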
- visualise_splits(X, y, figsize=(20, 5), show_title=True, tick_fontsize=None, label_fontsize=None, subtitle_fontsize=None)
Visualise the cross-validation splits.
- Parameters:
X (pd.DataFrame) – Pandas dataframe of features/quantamental indicators, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.
y (pd.DataFrame) – Pandas dataframe of target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.
figsize (Tuple[int, int]) – Tuple of integers specifying the splitter visualisation figure size.
show_title (bool, optional) – Boolean specifying whether to show the title of the figure. Default is True.
tick_fontsize (int, optional) – Integer specifying the size of the x-axis tick labels. Default is None.
label_fontsize (int, optional) – Integer specifying the size of the y-axis labels. Default is None.
subtitle_fontsize (int, optional) – Integer specifying the size of the subplot titles. Default is None.
- class ExpandingFrequencyPanelSplit(expansion_freq='D', test_freq='D', min_cids=4, min_periods=500, min_xcats=1, start_date=None, max_periods=None)
Bases: WalkForwardPanelSplit
Walk-forward cross-validator over a panel.
Provides train/test indices to split data into train/test sets. The dataset is split so that subsequent training sets are expanded by a user-specified frequency to incorporate the latest available information. Each training set is followed by a test set spanning a user-defined frequency.
- Parameters:
expansion_freq (str) – Frequency of training set expansion. For a given native dataset frequency, the training sets expand by the smallest number of dates to cover this frequency. Default is “D”. Accepted values are “D”, “W”, “M”, “Q” and “Y”.
test_freq (str) – Frequency forward of each training set for the unique dates in each test set to cover. Default is “D”. Accepted values are “D”, “W”, “M”, “Q” and “Y”.
min_cids (int) – Minimum number of cross-sections required for the initial training set. Default is 4. Either start_date or (min_cids, min_periods, min_xcats) must be provided. If both are provided, start_date takes precedence.
min_periods (int) – Minimum number of time periods required for the initial training set. Default is 500. Either start_date or (min_cids, min_periods, min_xcats) must be provided. If both are provided, start_date takes precedence.
min_xcats (int) – Minimum number of xcats required for the initial training set. Default is 1. Either start_date or (min_cids, min_periods, min_xcats) must be provided. If both are provided, start_date takes precedence.
start_date (Optional[str]) – First rebalancing date in ISO 8601 format. This is the last date of the first training set. Default is None. Either start_date or (min_cids, min_periods, min_xcats) must be provided. If both are provided, start_date takes precedence.
max_periods (Optional[int]) – The maximum number of time periods in each training set. If the maximum is exceeded, the earliest periods are cut off. This effectively creates rolling training sets. Default is None.
Notes
The first training set is determined either by start_date or by the parameters min_cids, min_periods, and min_xcats collectively. When start_date is provided, the initial training set comprises all available data up to and including start_date, unless max_periods is specified, in which case at most the last max_periods periods up to and including start_date are included.
If start_date is not provided, the first training set is determined by the parameters min_cids, min_periods, and min_xcats. This set comprises at least min_xcats categories for at least min_periods time periods for at least min_cids cross-sections.
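One reading of this rule is sketched below: the initial training set ends on the earliest date by which at least min_cids cross-sections each have at least min_periods observations. The helper `first_train_end_date` and the dictionary layout are hypothetical, and the min_xcats filter is assumed to have been applied upstream, so only dates with enough available categories are listed. The library may resolve the thresholds differently.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def first_train_end_date(
    obs_dates: Dict[str, List[date]],
    min_cids: int = 4,
    min_periods: int = 500,
) -> Optional[date]:
    """Earliest date by which min_cids cross-sections each have min_periods rows.

    obs_dates maps a cross-section to its sorted observation dates
    (assumed to be pre-filtered so each date has at least min_xcats features).
    """
    # Date on which each cross-section reaches its min_periods-th observation.
    ready = sorted(
        d[min_periods - 1] for d in obs_dates.values() if len(d) >= min_periods
    )
    # The min_cids-th earliest such date satisfies both thresholds.
    return ready[min_cids - 1] if len(ready) >= min_cids else None


def daily(start: date, n: int) -> List[date]:
    """n consecutive calendar days starting at start (toy data only)."""
    return [start + timedelta(days=i) for i in range(n)]


panel = {
    "AUD": daily(date(2020, 1, 1), 10),
    "CAD": daily(date(2020, 1, 3), 10),
    "GBP": daily(date(2020, 1, 6), 3),  # too short to count
}
cutoff = first_train_end_date(panel, min_cids=2, min_periods=5)
print(cutoff)  # 2020-01-07
```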
This initial training set is immediately adjusted depending on the specified training interval frequency. For instance, if the training frequency is “M”, the initial training set is further expanded so that all samples prior to the end of the month are included.
The associated test set immediately follows the adjusted initial training set and spans the specified test set frequency forward of its associated training set. For instance, if the test frequency is “Q”, the available dates that cover the subsequent quarter are grouped together to form the test set.
Subsequent training sets are created by expanding the previous training set by the smallest number of dates to cover the training frequency. As before, each test set immediately follows its associated training set and is determined in the same manner as the initial test set.
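The frequency-based procedure in these notes can be sketched with plain calendar arithmetic. This is an illustration under stated assumptions, not the library's implementation: the helper names are hypothetical, only the “M”, “Q” and “Y” frequencies are handled, the test window is taken to be the next one, three or twelve calendar months after the training set, and max_periods is ignored.

```python
from datetime import date, timedelta
from typing import Iterator, List, Tuple

MONTHS = {"M": 1, "Q": 3, "Y": 12}  # this sketch omits "D" and "W"


def month_index(d: date) -> int:
    """Running count of calendar months, used to bucket dates."""
    return d.year * 12 + d.month - 1


def frequency_splits(
    dates: List[date],
    start_date: date,
    expansion_freq: str = "M",
    test_freq: str = "M",
) -> Iterator[Tuple[List[date], List[date]]]:
    """Yield (train_dates, test_dates) from sorted unique dates."""
    step, horizon = MONTHS[expansion_freq], MONTHS[test_freq]
    # Extend the initial training set to the end of the calendar
    # month/quarter/year that contains start_date.
    train_last = (month_index(start_date) // step) * step + step - 1
    final = month_index(dates[-1])
    while train_last < final:
        train = [d for d in dates if month_index(d) <= train_last]
        # Test dates cover the following `horizon` months, where available.
        test = [d for d in dates if train_last < month_index(d) <= train_last + horizon]
        yield train, test
        # Expand the training set by one expansion period.
        train_last += step


dates = [date(2023, 1, 1) + timedelta(days=i) for i in range(181)]  # to 2023-06-30
splits = list(frequency_splits(dates, date(2023, 2, 10), "M", "Q"))
for train, test in splits:
    print(train[-1], "|", test[0], "to", test[-1])
# 2023-02-28 | 2023-03-01 to 2023-05-31
# 2023-03-31 | 2023-04-01 to 2023-06-30
# 2023-04-30 | 2023-05-01 to 2023-06-30
# 2023-05-31 | 2023-06-01 to 2023-06-30
```

Here a start_date of 2023-02-10 is pushed to the end of February before the first split is formed, and later test sets shrink as the sample runs out.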
- split(X, y, groups=None)
Generate indices to split data into training and test sets.
- Parameters:
X (pd.DataFrame) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.
y (Union[pd.DataFrame, pd.Series]) – Pandas dataframe or series of a target variable, multi-indexed by (cross-section, date). The dates must be in datetime format. If a dataframe is provided, the target variable must be the sole column.
groups (None) – Ignored. Exists for compatibility with scikit-learn.
- Yields:
train (np.ndarray) – The training set indices for that split.
test (np.ndarray) – The testing set indices for that split.
- get_n_splits(X=None, y=None, groups=None)
Calculates and returns the number of splits.
- Parameters:
X (pd.DataFrame) – Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.
y (pd.DataFrame) – Pandas dataframe of the target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.
groups (None) – Ignored. Exists for compatibility with scikit-learn.
- Returns:
n_splits – The number of splits.
- Return type:
int
- visualise_splits(X, y, figsize=(20, 5), show_title=True, tick_fontsize=None, label_fontsize=None, subtitle_fontsize=None)
Visualise the cross-validation splits.
- Parameters:
X (pd.DataFrame) – Pandas dataframe of features/quantamental indicators, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.
y (pd.DataFrame) – Pandas dataframe of target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.
figsize (Tuple[int, int]) – Tuple of integers specifying the splitter visualisation figure size.
show_title (bool, optional) – Boolean specifying whether to show the title of the figure. Default is True.
tick_fontsize (int, optional) – Integer specifying the size of the x-axis tick labels. Default is None.
label_fontsize (int, optional) – Integer specifying the size of the y-axis labels. Default is None.
subtitle_fontsize (int, optional) – Integer specifying the size of the subplot titles. Default is None.