macrosynergy.learning.preprocessing

class BasePanelSelector

Bases: BaseEstimator, SelectorMixin, ABC

Base class for statistical feature selection over a panel.

fit(X, y=None)

Learn optimal features based on a training set pair (X, y).

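Parameters:
  • X (pandas.DataFrame) – The feature matrix.

  • y (pandas.Series or pandas.DataFrame, default=None) – The target vector.
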
abstract determine_features(X, y)

Determine mask of selected features based on a training set pair (X, y).

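Parameters:
  • X (pandas.DataFrame) – The feature matrix.

  • y (pandas.Series or pandas.DataFrame) – The target vector.
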
Returns:

mask – Boolean mask of selected features.

Return type:

np.ndarray

transform(X)

Transform method to return only the selected features of the dataframe.

Parameters:

X (pandas.DataFrame) – The feature matrix.

Returns:

X_transformed – The feature matrix with only the selected features.

Return type:

pandas.DataFrame

get_feature_names_out()

Method to mask feature names according to selected features.
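
The relationship between determine_features, transform and get_feature_names_out can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the panel, the feature names and the mask values are all hypothetical, and the mask is written by hand rather than learned.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: MultiIndex of (cross-section, date) with three features.
index = pd.MultiIndex.from_product(
    [["AUD", "CAD"], pd.date_range("2024-01-01", periods=3, freq="D")],
    names=["cid", "real_date"],
)
X = pd.DataFrame(
    np.arange(18, dtype=float).reshape(6, 3),
    index=index,
    columns=["CPI", "GROWTH", "RATES"],
)

# Stand-in for the boolean mask that determine_features would return.
mask = np.array([True, False, True])

# transform: keep only the selected columns; the MultiIndex is untouched.
X_selected = X.loc[:, mask]

# get_feature_names_out: apply the same mask to the column names.
selected_names = X.columns[mask].to_numpy()
```

Because the mask is applied column-wise, the row index of the panel passes through unchanged.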

class LarsSelector(n_factors=10, fit_intercept=False)

Bases: BasePanelSelector

determine_features(X, y)

Create feature mask based on the LARS algorithm.

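Parameters:
  • X (pandas.DataFrame) – The feature matrix.

  • y (pandas.Series or pandas.DataFrame) – The target vector.
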
Returns:

mask – Boolean mask of selected features.

Return type:

list
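
One way such a mask could be derived is sketched below with scikit-learn's Lars estimator. It is an assumption that n_factors corresponds to Lars's n_nonzero_coefs and that the mask marks the non-zero coefficients; the source does not confirm how LarsSelector is implemented. The data are synthetic.

```python
import numpy as np
from sklearn.linear_model import Lars

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Synthetic target that depends only on features 0 and 3.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

n_factors = 2  # assumed to cap the number of selected features

# Stop the LARS path once n_factors coefficients are active.
lars = Lars(fit_intercept=False, n_nonzero_coefs=n_factors).fit(X, y)

# Features with a non-zero coefficient are treated as selected.
mask = lars.coef_ != 0
```

The resulting mask has the same length as the number of columns, so it can be applied to the feature matrix directly.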

class LassoSelector(n_factors=10, positive=False)

Bases: BasePanelSelector

determine_features(X, y)

Create feature mask based on the LASSO-LARS algorithm.

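Parameters:
  • X (pandas.DataFrame) – The feature matrix.

  • y (pandas.Series or pandas.DataFrame) – The target vector.
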
Returns:

mask – Boolean mask of selected features.

Return type:

np.ndarray

class MapSelector(n_factors=None, significance_level=0.05, positive=False)

Bases: BasePanelSelector

determine_features(X, y)

Create feature mask based on the Macrosynergy panel test.

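Parameters:
  • X (pandas.DataFrame) – The feature matrix.

  • y (pandas.Series or pandas.DataFrame) – The target vector.
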
Returns:

mask – Boolean mask of selected features.

Return type:

np.ndarray

class BasePanelScaler(type='panel')

Bases: BaseEstimator, TransformerMixin, OneToOneFeatureMixin, ABC

Base class for scaling a panel of features in a learning pipeline.

Parameters:

type (str, default="panel") – The panel dimension over which the scaling is applied. Options are “panel” and “cross_section”.

Notes

Many learning algorithms benefit from scaling each feature to a similar range, so that no feature dominates model training purely because of its units or magnitude. Scaling can also speed up the convergence of an optimization algorithm.

fit(X, y=None)

Fit method to learn training set quantities for feature scaling.

Parameters:
  • X (pd.DataFrame) – The feature matrix.

  • y (pd.Series or pd.DataFrame, default=None) – The target vector.

Returns:

The fitted scaler.

Return type:

self

transform(X)

Transform method to scale the input data based on extracted training statistics.

Parameters:

X (pandas.DataFrame) – The feature matrix.

Returns:

X_transformed – The feature matrix with scaled features.

Return type:

pandas.DataFrame

abstract extract_statistics(X, feature)

Determine the relevant statistics for feature scaling.

abstract scale(X, feature, statistics)

Scale the input data based on the relevant statistics.

class PanelMinMaxScaler(type='panel')

Bases: BasePanelScaler

Scale and translate panel features to lie within the range [0,1].

Notes

This class is designed to replicate scikit-learn’s MinMaxScaler() class, with the additional option to scale within cross-sections. Unlike the MinMaxScaler() class, dataframes are always returned, preserving the multi-indexing of the inputs.

extract_statistics(X, feature)

Determine the minimum and maximum values of a feature in the input matrix.

Parameters:
  • X (pandas.DataFrame) – The feature matrix.

  • feature (str) – The feature to extract statistics for.

Returns:

statistics – List containing the minimum and maximum values of the feature.

Return type:

list

scale(X, feature, statistics)

Scale the ‘feature’ column in the design matrix ‘X’ based on the minimum and maximum values of the feature.

Parameters:
  • X (pandas.DataFrame) – The feature matrix.

  • feature (str) – The feature to scale.

  • statistics (list) – List containing the minimum and maximum values of the feature, in that order.

Returns:

X_transformed – The scaled feature.

Return type:

pandas.Series
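
The difference between the "panel" and "cross_section" options can be sketched with plain pandas. This is an illustration only: the index level names and the data are hypothetical, and the statistics are computed on the same data they scale, whereas the class learns them in fit and reuses them in transform.

```python
import pandas as pd

# Hypothetical panel: two cross-sections on very different scales.
index = pd.MultiIndex.from_product(
    [["AUD", "CAD"], pd.date_range("2024-01-01", periods=3, freq="D")],
    names=["cid", "real_date"],
)
X = pd.DataFrame({"CPI": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]}, index=index)


def min_max(series: pd.Series) -> pd.Series:
    """Map a series linearly onto [0, 1] using its own minimum and maximum."""
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo)


# "panel": a single minimum and maximum for the whole panel.
panel_scaled = min_max(X["CPI"])

# "cross_section": minimum and maximum computed within each cross-section.
cs_scaled = X["CPI"].groupby(level="cid").transform(min_max)
```

Under panel scaling the AUD values stay close to zero because CAD sets the maximum; under cross-section scaling each cross-section spans the full [0, 1] range. Both results keep the original MultiIndex.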

class PanelStandardScaler(type='panel', with_mean=True, with_std=True)

Bases: BasePanelScaler

Scale and translate panel features to have zero mean and unit variance.

Parameters:
  • type (str, default="panel") – The panel dimension over which the scaling is applied. Options are “panel” and “cross_section”.

  • with_mean (bool, default=True) – Whether to centre the data before scaling.

  • with_std (bool, default=True) – Whether to scale the data to unit variance.

Notes

This class is designed to replicate scikit-learn’s StandardScaler() class, with the additional option to scale within cross-sections. Unlike the StandardScaler() class, dataframes are always returned, preserving the multi-indexing of the inputs.

extract_statistics(X, feature)

Determine the mean and standard deviation of values of a feature in the input matrix.

Parameters:
  • X (pandas.DataFrame) – The feature matrix.

  • feature (str) – The feature to extract statistics for.

Returns:

statistics – List containing the mean and standard deviation of values of the feature.

Return type:

list

scale(X, feature, statistics)

Scale the ‘feature’ column in the design matrix ‘X’ based on the mean and standard deviation values of the feature.

Parameters:
  • X (pandas.DataFrame) – The feature matrix.

  • feature (str) – The feature to scale.

  • statistics (list) – List containing the mean and standard deviation of values of the feature, in that order.

Returns:

X_transformed – The scaled feature.

Return type:

pandas.Series
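
The split between extracting statistics and scaling can be sketched as two standalone functions. These are stand-ins that borrow the method names, not the library's code; the use of pandas' sample standard deviation (ddof=1), the helper make_panel and the data are all assumptions. The point shown is that statistics come from the training set and are then reused on new data.

```python
import pandas as pd


def extract_statistics(X: pd.DataFrame, feature: str) -> list:
    """Return [mean, standard deviation] of one feature."""
    # Assumption: pandas' default sample standard deviation (ddof=1).
    return [X[feature].mean(), X[feature].std()]


def scale(X, feature, statistics, with_mean=True, with_std=True) -> pd.Series:
    """Centre and/or rescale one feature using previously extracted statistics."""
    mean, std = statistics
    scaled = X[feature]
    if with_mean:
        scaled = scaled - mean
    if with_std:
        scaled = scaled / std
    return scaled


def make_panel(values, start):
    """Hypothetical helper: build a one-feature panel for one cross-section."""
    index = pd.MultiIndex.from_product(
        [["AUD"], pd.date_range(start, periods=len(values), freq="D")],
        names=["cid", "real_date"],
    )
    return pd.DataFrame({"CPI": values}, index=index)


X_train = make_panel([1.0, 2.0, 3.0, 4.0, 5.0], "2024-01-01")
X_test = make_panel([3.0, 6.0], "2024-01-06")

# Statistics are learned once on the training set ...
stats = extract_statistics(X_train, "CPI")

# ... and reused unchanged when scaling both training and unseen data.
train_scaled = scale(X_train, "CPI", stats)
test_scaled = scale(X_test, "CPI", stats)
```

Reusing the training statistics means the test set is not guaranteed to have zero mean or unit variance after scaling, which is the intended behaviour for out-of-sample data.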

class PanelPCA(n_components=None, kaiser_criterion=False, adjust_signs=False)

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)

Fit method to determine an eigenbasis for the PCA.

Parameters:
  • X (pd.DataFrame) – Input feature matrix.

  • y (pd.DataFrame, pd.Series or np.ndarray, default=None) – Target variable.

Notes

The target variable y is used only to adjust the signs of the eigenvectors, so that the signs stay consistent when the PCA is refit over time. It does not affect the decomposition itself.

transform(X)

Project input features onto the principal components.

Parameters:

X (pd.DataFrame) – Input feature matrix.

Returns:

Projected features.

Return type:

pd.DataFrame
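
The fit and transform steps can be sketched with NumPy as follows. Several points are assumptions rather than confirmed behaviour of PanelPCA: the eigenbasis is taken from the correlation matrix of standardised features, the Kaiser criterion is read in its usual sense of keeping eigenvalues above one, signs are flipped so that each projected component correlates positively with y, and the output column names are invented for the example. The data are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product(
    [["AUD", "CAD"], pd.date_range("2024-01-01", periods=50, freq="D")],
    names=["cid", "real_date"],
)

# Three features driven by one common factor, plus a related target.
common = rng.normal(size=len(index))
X = pd.DataFrame(
    {name: common + 0.1 * rng.normal(size=len(index)) for name in ["A", "B", "C"]},
    index=index,
)
y = pd.Series(common + 0.1 * rng.normal(size=len(index)), index=index)

# Eigenbasis of the correlation matrix, sorted by decreasing eigenvalue.
Z = (X - X.mean()) / X.std()
eigvals, eigvecs = np.linalg.eigh(Z.corr().to_numpy())
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Kaiser criterion (assumed form): keep components with eigenvalue above one.
keep = eigvals > 1.0
components = eigvecs[:, keep]

# Sign adjustment (assumed form): flip a component if its scores
# correlate negatively with the target.
scores = Z.to_numpy() @ components
for j in range(scores.shape[1]):
    if np.corrcoef(scores[:, j], y.to_numpy())[0, 1] < 0:
        components[:, j] *= -1

# transform: project onto the retained components, keeping the MultiIndex.
projected = pd.DataFrame(
    Z.to_numpy() @ components,
    index=X.index,
    columns=[f"PCA {j + 1}" for j in range(components.shape[1])],
)
```

Flipping the sign of an eigenvector leaves the explained variance unchanged, which is why the adjustment does not alter the decomposition.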

class ZnScoreAverager(neutral='zero', use_signs=False)

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)

Extract relevant standardisation/normalisation statistics.

Parameters:
  • X (pd.DataFrame) – Input feature matrix.

  • y (Any, default=None) – Placeholder for scikit-learn compatibility.

transform(X)

Create an out-of-sample (OOS) conceptual parity signal by averaging point-in-time (PiT) z-scores of the features.

Parameters:

X (pd.DataFrame) – Input feature matrix.

Returns:

Output signal.

Return type:

pd.DataFrame
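
The idea of averaging z-scores into a single signal can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's algorithm: with neutral='zero' the spread is assumed to be the root mean square around zero, statistics are taken from the training set only rather than updated point-in-time, use_signs is assumed to replace the signal by its sign, and the helper make_panel, the column name "signal" and the data are hypothetical.

```python
import numpy as np
import pandas as pd


def make_panel(data: dict, start: str) -> pd.DataFrame:
    """Hypothetical helper: build a small panel for one cross-section."""
    n_rows = len(next(iter(data.values())))
    index = pd.MultiIndex.from_product(
        [["AUD"], pd.date_range(start, periods=n_rows, freq="D")],
        names=["cid", "real_date"],
    )
    return pd.DataFrame(data, index=index)


X_train = make_panel(
    {"CPI": [1.0, -1.0, 1.0, -1.0], "GROWTH": [2.0, -2.0, 2.0, -2.0]},
    "2024-01-01",
)
X_test = make_panel({"CPI": [0.5, -2.0], "GROWTH": [4.0, -1.0]}, "2024-01-05")

# fit: with a zero neutral level, measure each feature's spread around zero.
spread = np.sqrt((X_train ** 2).mean())

# transform: z-score the unseen data with the training statistics ...
z_scores = X_test / spread

# ... and average across features to obtain one signal per row.
signal = z_scores.mean(axis=1).to_frame("signal")

# Assumed effect of use_signs=True: keep only the direction of the signal.
signal_signs = np.sign(signal)
```

Each feature contributes equally to the average once it is expressed in z-score units, which is what makes the combined signal a "conceptual parity" signal.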

Subpackages