macrosynergy.learning.panel_time_series_split#

Tools to produce, visualise and use walk-forward validation splits across panels.


Classes#


BasePanelSplit#

Base class for the production of paired training and test splits for panel data. All child classes provide the methods ‘get_n_splits’ and ‘visualise_splits’. The method ‘get_n_splits’ is required so that the panel splitters can inherit from sklearn’s BaseCrossValidator class, which lets them be used directly within sklearn’s API. The method ‘visualise_splits’ is a convenience method for plotting the splits produced by each child splitter, so that the user can check that the splits suit their use case.


BasePanelSplit.get_n_splits()#

BasePanelSplit.get_n_splits(self, X, y, groups)

Returns the number of splits in the cross-validator.

:param <pd.DataFrame> X: Always ignored, exists for compatibility with scikit-learn.

:param <pd.DataFrame> y: Always ignored, exists for compatibility with scikit-learn.

:param <pd.DataFrame> groups: Always ignored, exists for compatibility with scikit-learn.

:return <int> n_splits: Returns the number of splits.


BasePanelSplit._validate_Xy()#

BasePanelSplit._validate_Xy(self, X, y)

Private helper method to validate the input dataframes X and y.

:param <pd.DataFrame> X: Pandas dataframe of features/quantamental indicators, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

:param <pd.DataFrame> y: Pandas dataframe of target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.


BasePanelSplit._calculate_xranges()#

BasePanelSplit._calculate_xranges(self, cs_dates, real_dates, freq_offset)

Private helper method to determine the ranges of contiguous dates in each training and test set, for use in visualisation.

:param <pd.DatetimeIndex> cs_dates: DatetimeIndex of dates in a set for a given cross-section.

:param <pd.DatetimeIndex> real_dates: DatetimeIndex of all dates in the panel.

:param <pd.DateOffset> freq_offset: DateOffset object representing the frequency of the dates in the panel.

:return <List[Tuple[pd.Timestamp,pd.Timedelta]]> xranges: list of tuples of the form (start date, length of contiguous dates).
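As an illustration of the (start date, length) output, the following minimal sketch groups the dates of one cross-section into contiguous runs. It is not the library's implementation: `contiguous_ranges` is a hypothetical helper, plain `datetime.date` lists stand in for `pd.DatetimeIndex`, and a `timedelta` stands in for the `pd.DateOffset`. Two dates are assumed to be contiguous when they are adjacent in `real_dates`, and each run's length is assumed to extend one frequency step past its last date.

```python
from datetime import date, timedelta


def contiguous_ranges(cs_dates, real_dates, step):
    """Return (start, length) pairs for runs of dates adjacent in real_dates.

    Hypothetical helper: `step` plays the role of freq_offset and is added
    so that a run covering a single date still has a non-zero length.
    """
    # Position of every panel date, used to test adjacency
    position = {d: i for i, d in enumerate(real_dates)}
    ordered = sorted(cs_dates)
    if not ordered:
        return []

    xranges = []
    start = prev = ordered[0]
    for d in ordered[1:]:
        # A gap in panel positions closes the current run
        if position[d] != position[prev] + 1:
            xranges.append((start, prev - start + step))
            start = d
        prev = d
    xranges.append((start, prev - start + step))
    return xranges


# Ten consecutive panel dates; the cross-section is missing 4-5 January
panel_days = [date(2024, 1, d) for d in range(1, 11)]
cs_days = [panel_days[i] for i in (0, 1, 2, 5, 6)]
xranges = contiguous_ranges(cs_days, panel_days, timedelta(days=1))
# xranges holds two runs: 3 days from 1 January and 2 days from 6 January
```

Each tuple describes one horizontal bar segment, so a gap in a cross-section's history appears as a break between segments.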


BasePanelSplit.visualise_splits()#

BasePanelSplit.visualise_splits(self, X, y, figsize)

Method to visualise the splits created according to the parameters specified in the constructor.

:param <pd.DataFrame> X: Pandas dataframe of features/quantamental indicators, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

:param <pd.DataFrame> y: Pandas dataframe of target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.

:param <Tuple[int,int]> figsize: tuple of integers specifying the splitter visualisation figure size.

:return None:


ExpandingIncrementPanelSplit#

Class for the production of paired training and test splits, created over a panel of countries. ExpandingIncrementPanelSplit differs from ExpandingKFoldPanelSplit by specifying the structure of an initial training and test set, as well as the number of time periods to expand both the initial and subsequent training and test sets by. This is a flexible alternative to defining the number of splits to make.

The first training set is determined by the parameters ‘min_cids’ and ‘min_periods’, defined below. This set comprises at least ‘min_periods’ time periods for at least ‘min_cids’ cross-sections. Its associated test set immediately follows the training set, and is of length ‘test_size’. Subsequent training sets are created by expanding the previous training set by ‘train_intervals’ time periods, in the native frequency of the dataset. As before, each test set immediately follows its associated training set, and is of length ‘test_size’. We also provide a parameter ‘max_periods’, which allows the user to roll the training set forward as opposed to expanding it. If the number of time periods in the training set exceeds ‘max_periods’, the earliest time periods are truncated.

Beyond standard cross-validation, this splitter can be used to simulate how a pipeline would run through time in a real-world setting. This is especially the case when ‘test_size’ is set to 1.

:param <int> train_intervals: training interval length in time periods for sequential training. This is the number of periods by which the training set is expanded at each subsequent split. Default is 21.

:param <int> min_cids: minimum number of cross-sections required for the initial training set. Default is 4.

:param <int> min_periods: minimum number of time periods required for the initial training set. Default is 500.

:param <int> test_size: test set length for interval training. This is the number of periods to use for the test set subsequent to the training set. Default is 21.

:param <int> max_periods: maximum length of each training set in interval training. If the maximum is exceeded, the earliest periods are cut off. Default is None.
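The schedule implied by these parameters can be sketched on a single time axis. This is a simplified illustration rather than the library's code: `expanding_increment_splits` is a hypothetical function, integer positions stand in for the sorted unique dates of the panel, the ‘min_cids’ condition is ignored, and a final test window that runs past the last period is simply truncated (the text above does not state how the library treats that case).

```python
def expanding_increment_splits(
    n_periods, min_periods=500, train_intervals=21, test_size=21, max_periods=None
):
    """Return (train, test) ranges of period positions (hypothetical sketch)."""
    splits = []
    train_end = min_periods  # the first training set covers min_periods periods
    while train_end < n_periods:
        # With max_periods set, the earliest periods are cut off (rolling window)
        train_start = 0 if max_periods is None else max(0, train_end - max_periods)
        test_end = min(train_end + test_size, n_periods)
        splits.append((range(train_start, train_end), range(train_end, test_end)))
        train_end += train_intervals  # expand the training set for the next split
    return splits


# 12 periods, initial training set of 4, expanding by 3, testing on the next 2
expanding = expanding_increment_splits(
    12, min_periods=4, train_intervals=3, test_size=2
)
# Same schedule, but each training set is capped at the latest 5 periods
rolling = expanding_increment_splits(
    12, min_periods=4, train_intervals=3, test_size=2, max_periods=5
)
```

In the first call the training sets cover positions 0-3, 0-6 and 0-9. In the second they cover 0-3, 2-6 and 5-9, which shows how ‘max_periods’ turns the expanding window into a rolling one.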


ExpandingIncrementPanelSplit._determine_unique_time_splits()#

ExpandingIncrementPanelSplit._determine_unique_time_splits(self, X, y)

Private helper method to determine the unique dates in each training split. This method is called by self.split(). It also returns other variables needed by later steps of the split method.

:param <pd.DataFrame> X: Pandas dataframe of features multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

:param <pd.DataFrame> y: Pandas dataframe of the target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.

:return <Tuple[List[pd.DatetimeIndex],pd.DataFrame,int]>: (train_splits_basic, Xy, n_splits): Tuple comprising the unique dates in each training split, the concatenated dataframe of X and y, and the number of splits.


ExpandingIncrementPanelSplit.split()#

ExpandingIncrementPanelSplit.split(self, X, y, groups)

Method that produces pairs of training and test indices as intended by the ExpandingIncrementPanelSplit class. Wide format Pandas (panel) dataframes are expected, multi-indexed by cross-section and date. It is recommended that the features lag the associated targets by a single native frequency unit.

:param <pd.DataFrame> X: Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

:param <pd.DataFrame> y: Pandas dataframe of the target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.

:param <pd.DataFrame> groups: Always ignored, exists for compatibility with scikit-learn.

:return <Iterable[Tuple[np.ndarray[int],np.ndarray[int]]]> splits: Iterable of (train,test) indices.
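Because the splits are defined on dates while the returned indices refer to rows of the panel, each date set has to be mapped to integer row positions across all cross-sections. The sketch below shows one way this mapping could work. It rests on assumptions: `rows_for_dates` is a hypothetical helper, a list of (cross-section, date) tuples stands in for the dataframe's multi-index, and ISO date strings stand in for datetime values.

```python
def rows_for_dates(index, dates):
    """Return the row positions whose date belongs to `dates` (hypothetical helper)."""
    wanted = set(dates)
    return [i for i, (_, d) in enumerate(index) if d in wanted]


# Unbalanced panel: CAD has no observation on the first date
panel_index = [
    ("AUD", "2024-01-01"),
    ("AUD", "2024-01-02"),
    ("AUD", "2024-01-03"),
    ("CAD", "2024-01-02"),
    ("CAD", "2024-01-03"),
]

train_idx = rows_for_dates(panel_index, ["2024-01-01", "2024-01-02"])
test_idx = rows_for_dates(panel_index, ["2024-01-03"])
# train_idx == [0, 1, 3] and test_idx == [2, 4]
```

Every cross-section observed on a training date contributes its row to the training indices, so the same date boundary applies to the whole panel even when cross-sections start at different times.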


ExpandingIncrementPanelSplit.get_n_splits()#

ExpandingIncrementPanelSplit.get_n_splits(self, X, y, groups)

Calculates and returns the number of splits.

:param <pd.DataFrame> X: Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

:param <pd.DataFrame> y: Pandas dataframe of the target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.

:param <pd.DataFrame> groups: Always ignored, exists for compatibility with scikit-learn.

:return <int> n_splits: Returns the number of splits.


ExpandingKFoldPanelSplit#

Class for the production of paired training and test splits, created over a panel of countries. ExpandingKFoldPanelSplit operates similarly to sklearn’s TimeSeriesSplit class, but is designed to handle panels of data, as opposed to a single time series. To create the splits, the sorted, unique dates in the panel are divided into ‘n_splits + 1’ sequential and non-overlapping intervals. This results in ‘n_splits’ pairs of training and test sets, where the ‘i’th training set is the union of the first ‘i’ intervals, and the ‘i’th test set is the ‘i+1’th interval.

:param <int> n_splits: number of splits. Must be at least 2.
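The interval construction described above can be sketched at the level of dates as follows. This is an illustration under stated assumptions, not the library's implementation: `chunk` and `expanding_kfold_dates` are hypothetical names, integers stand in for the sorted unique dates, and uneven divisions are assumed to give the earlier intervals one extra date (the text does not say how the library sizes uneven intervals).

```python
def chunk(items, k):
    """Split items into k contiguous chunks whose sizes differ by at most one."""
    quotient, remainder = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + quotient + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def expanding_kfold_dates(unique_dates, n_splits):
    """Return (train_dates, test_dates) pairs (hypothetical sketch)."""
    intervals = chunk(sorted(unique_dates), n_splits + 1)
    splits = []
    for i in range(1, n_splits + 1):
        # Training set: union of the first i intervals
        train = [d for interval in intervals[:i] for d in interval]
        # Test set: the interval that immediately follows
        splits.append((train, intervals[i]))
    return splits


# Nine dates and n_splits=2 give three intervals of three dates each
kfold_splits = expanding_kfold_dates(range(9), n_splits=2)
# kfold_splits == [([0, 1, 2], [3, 4, 5]), ([0, 1, 2, 3, 4, 5], [6, 7, 8])]
```

Each test interval lies strictly after its training dates, which is what makes the scheme a walk-forward validation.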


ExpandingKFoldPanelSplit.split()#

ExpandingKFoldPanelSplit.split(self, X, y, groups)

Method that produces pairs of training and test indices as intended by the ExpandingKFoldPanelSplit class. Wide format Pandas (panel) dataframes are expected, multi-indexed by cross-section and date. It is recommended that the features lag the associated targets by a single native frequency unit.

:param <pd.DataFrame> X: Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

:param <pd.DataFrame> y: Pandas dataframe of the target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.

:param <pd.DataFrame> groups: Always ignored, exists for compatibility with scikit-learn.

:return <Iterable[Tuple[np.ndarray[int],np.ndarray[int]]]> splits: Iterable of (train,test) indices.


RollingKFoldPanelSplit#

Class for the production of paired training and test splits, created over a panel of countries. RollingKFoldPanelSplit operates similarly to sklearn’s KFold class (without shuffle enabled), but is designed to handle panels of data, as opposed to a single time series. To create the splits, the sorted, unique dates in the panel are divided into ‘n_splits’ sequential and non-overlapping intervals. This results in ‘n_splits’ pairs of training and test sets, where the ‘i’th test set is the ‘i’th interval, and the ‘i’th training set is the union of all other intervals. This gives the effect of the test set “rolling” forward in time.

:param <int> n_splits: number of splits. Must be at least 2.
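In sklearn’s KFold, each fold serves once as the test set while the remaining folds form the training set. Assuming RollingKFoldPanelSplit applies the same convention to intervals of dates, the splits can be sketched as below. The names `chunk` and `rolling_kfold_dates` are hypothetical, integers stand in for the sorted unique dates, and the near-equal interval sizing is an assumption.

```python
def chunk(items, k):
    """Split items into k contiguous chunks whose sizes differ by at most one."""
    quotient, remainder = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + quotient + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def rolling_kfold_dates(unique_dates, n_splits):
    """Return (train_dates, test_dates) pairs (hypothetical sketch)."""
    intervals = chunk(sorted(unique_dates), n_splits)
    splits = []
    for i, test in enumerate(intervals):
        # Training set: every interval except the one held out for testing
        train = [
            d for j, interval in enumerate(intervals) if j != i for d in interval
        ]
        splits.append((train, test))
    return splits


# Six dates and n_splits=3 give three intervals of two dates each
rolling_splits = rolling_kfold_dates(range(6), n_splits=3)
# Test sets, in order: [0, 1], then [2, 3], then [4, 5]
```

Unlike the expanding scheme, training dates can lie after the test interval, so this splitter does not respect the time ordering of the data.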


RollingKFoldPanelSplit.split()#

RollingKFoldPanelSplit.split(self, X, y, groups)

Method that produces pairs of training and test indices as intended by the RollingKFoldPanelSplit class. Wide format Pandas (panel) dataframes are expected, multi-indexed by cross-section and date. It is recommended that the features lag the associated targets by a single native frequency unit.

:param <pd.DataFrame> X: Pandas dataframe of features, multi-indexed by (cross-section, date). The dates must be in datetime format. The dataframe must be in wide format: each feature is a column.

:param <pd.DataFrame> y: Pandas dataframe of the target variable, multi-indexed by (cross-section, date). The dates must be in datetime format.

:param <pd.DataFrame> groups: Always ignored, exists for compatibility with scikit-learn.

:return <Iterable[Tuple[np.ndarray[int],np.ndarray[int]]]> splits: Iterable of (train,test) indices.