macrosynergy.learning.preprocessing#

class BasePanelSelector[source]#

Bases: BaseEstimator, SelectorMixin, ABC

Base class for statistical feature selection over a panel.

fit(X, y=None)[source]#

Learn optimal features based on a training set pair (X, y).

Parameters:

X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame, optional) – The target vector.

abstract determine_features(X, y)[source]#

Determine mask of selected features based on a training set pair (X, y).

Parameters:

X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame) – The target vector.

Returns:

mask – Boolean mask of selected features.

Return type:

np.ndarray

transform(X)[source]#

Transform method to return only the selected features of the dataframe.

Parameters:: X (pandas.DataFrame) – The feature matrix.
Returns:: X_transformed – The feature matrix with only the selected features.
Return type:: pandas.DataFrame

get_feature_names_out()[source]#: Method to mask feature names according to selected features.
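
The fit / determine_features / transform contract described above can be sketched without the library. The class below is a hypothetical stand-in, not a subclass of BasePanelSelector: the name `VarianceMaskSelector`, the `min_variance` parameter, the `mask_` attribute and the variance rule are all assumptions chosen for illustration. It only shows how a boolean mask computed at fit time drives column selection at transform time while the panel index is preserved.

```python
import numpy as np
import pandas as pd


class VarianceMaskSelector:
    """Hypothetical selector mirroring the documented method layout."""

    def __init__(self, min_variance=0.5):
        self.min_variance = min_variance

    def determine_features(self, X, y=None):
        # Boolean mask with one entry per column of X.
        return (X.var(axis=0) > self.min_variance).to_numpy()

    def fit(self, X, y=None):
        self.feature_names_in_ = np.array(X.columns)
        self.mask_ = self.determine_features(X, y)
        return self

    def transform(self, X):
        # Keep only the columns selected at fit time.
        return X.loc[:, self.mask_]

    def get_feature_names_out(self):
        # Mask the input feature names with the fitted selection.
        return self.feature_names_in_[self.mask_]


index = pd.MultiIndex.from_product(
    [["AUD", "CAD"], pd.date_range("2020-01-31", periods=3, freq="D")],
    names=["cid", "real_date"],
)
X = pd.DataFrame(
    {"GROWTH": [1.0, -2.0, 3.0, -1.0, 2.0, -3.0], "FLAT": [0.1] * 6},
    index=index,
)
selector = VarianceMaskSelector().fit(X)
X_selected = selector.transform(X)
```

Here only the GROWTH column survives, and `X_selected` keeps the (cid, real_date) multi-index of the input.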

class LarsSelector(n_factors=10, fit_intercept=False)[source]#

Bases: BasePanelSelector

determine_features(X, y)[source]#

Create feature mask based on the LARS algorithm.

Parameters:

X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame) – The target vector.

Returns:

mask – Boolean mask of selected features.

Return type:

np.ndarray

class LassoSelector(n_factors=10, positive=False)[source]#

Bases: BasePanelSelector

determine_features(X, y)[source]#

Create feature mask based on the LASSO-LARS algorithm.

Parameters:

X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame) – The target vector.

Returns:

mask – Boolean mask of selected features.

Return type:

np.ndarray

class MapSelector(n_factors=None, significance_level=0.05, positive=False)[source]#

Bases: BasePanelSelector

determine_features(X, y)[source]#

Create feature mask based on the Macrosynergy panel test.

Parameters:

X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame) – The target vector.

Returns:

mask – Boolean mask of selected features.

Return type:

np.ndarray

class KendallSignificanceSelector(alpha=0.05)[source]#

Bases: BasePanelSelector

Univariate statistical feature selection using Kendall correlation tests.

Future enhancements will include Bonferroni corrections for multiple testing.

Parameters:: alpha (float, default=0.05) – Significance level.

determine_features(X, y)[source]#

Create feature mask based on univariate Kendall correlation tests at significance level alpha.

Parameters:

X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame) – The target vector.

Returns:

mask – Boolean mask of selected features.

Return type:

np.ndarray
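
A univariate Kendall screen of this kind can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the helper `kendall_tau_pvalue` is hypothetical, it computes tau-a with a normal approximation for the two-sided p-value, and it assumes no ties. A feature is selected when its p-value falls below alpha.

```python
import math

import pandas as pd


def kendall_tau_pvalue(x, y):
    """Kendall tau-a and a two-sided normal-approximation p-value (no ties assumed)."""
    x, y = list(x), list(y)
    n = len(x)
    score = 0  # concordant pairs minus discordant pairs
    for i in range(n - 1):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += int(product > 0) - int(product < 0)
    tau = score / (n * (n - 1) / 2)
    # Standard deviation of tau under the null hypothesis of no association.
    std_under_null = math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
    p_value = math.erfc(abs(tau) / std_under_null / math.sqrt(2))
    return tau, p_value


y = pd.Series(range(1, 11), dtype=float)
X = pd.DataFrame(
    {
        "SIGNAL": y * 2.0,                         # perfectly concordant with y
        "NOISE": [1, 10, 2, 9, 3, 8, 4, 7, 5, 6],  # weak rank association
    }
)

alpha = 0.05
mask = [kendall_tau_pvalue(X[col], y)[1] < alpha for col in X.columns]
```

Because every feature is tested separately at the same alpha, the chance of at least one false selection grows with the number of features, which is the motivation for the multiple-testing correction mentioned in the class description.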

class FactorAvailabilitySelector(min_cids=2, min_periods=36)[source]#

Bases: BasePanelSelector

determine_features(X, y)[source]#

Determine mask of selected features based on a training set pair (X, y).

Parameters:

X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame) – The target vector.

Returns:

mask – Boolean mask of selected features.

Return type:

np.ndarray

class BasePanelScaler(type='panel')[source]#

Bases: BaseEstimator, TransformerMixin, OneToOneFeatureMixin, ABC

Base class for scaling a panel of features in a learning pipeline.

Parameters:: type (str, default="panel") – The panel dimension over which the scaling is applied. Options are “panel” and “cross_section”.

Notes

Learning algorithms can benefit from scaling each feature to a similar range. This ensures they consider each feature equally in the model training process. It can also encourage faster convergence of an optimization algorithm.

fit(X, y=None)[source]#

Fit method to learn training set quantities for feature scaling.

Parameters:

X (pd.DataFrame) – The feature matrix.
y (pd.Series or pd.DataFrame, default=None) – The target vector.

Returns:

The fitted scaler.

Return type:

self

transform(X)[source]#

Transform method to scale the input data based on extracted training statistics.

Parameters:: X (pandas.DataFrame) – The feature matrix.
Returns:: X_transformed – The feature matrix with scaled features.
Return type:: pandas.DataFrame

abstract extract_statistics(X, feature)[source]#: Determine the relevant statistics for feature scaling.

abstract scale(X, feature, statistics)[source]#: Scale the input data based on the relevant statistics.

class PanelMinMaxScaler(type='panel')[source]#

Bases: BasePanelScaler

Scale and translate panel features to lie within the range [0,1].

Notes

This class is designed to replicate scikit-learn’s MinMaxScaler() class, with the additional option to scale within cross-sections. Unlike the MinMaxScaler() class, dataframes are always returned, preserving the multi-indexing of the inputs.

extract_statistics(X, feature)[source]#

Determine the minimum and maximum values of a feature in the input matrix.

Parameters:

X (pandas.DataFrame) – The feature matrix.
feature (str) – The feature to extract statistics for.

Returns:

statistics – List containing the minimum and maximum values of the feature.

Return type:

list

scale(X, feature, statistics)[source]#

Scale the ‘feature’ column in the design matrix ‘X’ based on the minimum and maximum values of the feature.

Parameters:

X (pandas.DataFrame) – The feature matrix.
feature (str) – The feature to scale.
statistics (list) – List containing the minimum and maximum values of the feature, in that order.

Returns:

X_transformed – The scaled feature.

Return type:

pandas.Series
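
The difference between the “panel” and “cross_section” options can be illustrated with plain pandas. The function `min_max_scale` and its `scaling_type` argument (mirroring the documented `type` parameter) are hypothetical, and for brevity the statistics are computed on the same data that is scaled, whereas the class learns them in fit and applies them in transform.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["AUD", "CAD"], pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"])],
    names=["cid", "real_date"],
)
X = pd.DataFrame({"CPI": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]}, index=index)


def min_max_scale(X, feature, scaling_type="panel"):
    """Scale one column to [0, 1], over the whole panel or within each cid."""
    column = X[feature]
    if scaling_type == "panel":
        low, high = column.min(), column.max()
    else:  # "cross_section": minimum and maximum are computed per cid
        grouped = column.groupby(level="cid")
        low, high = grouped.transform("min"), grouped.transform("max")
    return (column - low) / (high - low)


panel_scaled = min_max_scale(X, "CPI", scaling_type="panel")
cid_scaled = min_max_scale(X, "CPI", scaling_type="cross_section")
```

With panel scaling the AUD values are squeezed near zero because CAD dominates the range; with cross-sectional scaling each cid spans the full [0, 1] interval.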

class PanelStandardScaler(type='panel', with_mean=True, with_std=True)[source]#

Bases: BasePanelScaler

Scale and translate panel features to have zero mean and unit variance.

Parameters:

type (str, default="panel") – The panel dimension over which the scaling is applied. Options are “panel” and “cross_section”.
with_mean (bool, default=True) – Whether to centre the data before scaling.
with_std (bool, default=True) – Whether to scale the data to unit variance.

Notes

This class is designed to replicate scikit-learn’s StandardScaler() class, with the additional option to scale within cross-sections. Unlike the StandardScaler() class, dataframes are always returned, preserving the multi-indexing of the inputs.

extract_statistics(X, feature)[source]#

Determine the mean and standard deviation of values of a feature in the input matrix.

Parameters:

X (pandas.DataFrame) – The feature matrix.
feature (str) – The feature to extract statistics for.

Returns:

statistics – List containing the mean and standard deviation of values of the feature.

Return type:

list

scale(X, feature, statistics)[source]#

Scale the ‘feature’ column in the design matrix ‘X’ based on the mean and standard deviation values of the feature.

Parameters:

X (pandas.DataFrame) – The feature matrix.
feature (str) – The feature to scale.
statistics (list) – List containing the mean and standard deviation of values of the feature, in that order.

Returns:

X_transformed – The scaled feature.

Return type:

pandas.Series
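
The split between extract_statistics and scale can be sketched with two free functions. They are hypothetical stand-ins that reuse the documented names and the [mean, standard deviation] ordering. The use of pandas' default sample standard deviation (ddof=1) is an assumption. The point of the sketch is that statistics come from the training set only and are then reused on unseen data.

```python
import pandas as pd


def extract_statistics(X, feature):
    # Mean and standard deviation, in that order.
    return [X[feature].mean(), X[feature].std()]


def scale(X, feature, statistics, with_mean=True, with_std=True):
    mean, std = statistics
    values = X[feature]
    if with_mean:
        values = values - mean  # centre on the training mean
    if with_std:
        values = values / std   # scale by the training standard deviation
    return values


train = pd.DataFrame({"XR": [1.0, 2.0, 3.0, 4.0, 5.0]})
test = pd.DataFrame({"XR": [3.0, 6.0]})

stats = extract_statistics(train, "XR")  # learned on the training set only
test_scaled = scale(test, "XR", stats)
```

Reusing training statistics means the scaled test values are not guaranteed to have zero mean or unit variance themselves, which is the intended behaviour in an out-of-sample pipeline.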

class PanelPCA(n_components=None, kaiser_criterion=False, adjust_signs=False)[source]#

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)[source]#

Fit method to determine an eigenbasis for the PCA.

Parameters:

X (pd.DataFrame) – Input feature matrix.
y (pd.DataFrame, pd.Series or np.ndarray, default=None) – Target variable.

Notes

The target variable y is only ever used to adjust the signs of the eigenvectors to ensure consistency of eigenvector signs when retrained over time. This does not affect the PCA itself.

transform(X)[source]#

Project input features onto the principal components.

Parameters:: X (pd.DataFrame) – Input feature matrix.
Returns:: Projected features.
Return type:: pd.DataFrame

get_feature_names_out(input_features=None)[source]#

Get output feature names produced by the transformation.

Parameters:: input_features (None) – This parameter has no effect and is included for compatibility with the scikit-learn API.
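
The role of kaiser_criterion and adjust_signs can be illustrated with a small numpy sketch. Several choices here are assumptions rather than documented behaviour: the eigenbasis is taken from the correlation matrix of standardised features, the Kaiser criterion is read in its usual form (keep components with eigenvalue above 1), and the sign rule flips each eigenvector so that its projection correlates positively with y.

```python
import numpy as np

rng = np.random.default_rng(0)
factor = rng.normal(size=200)
X = np.column_stack(
    [
        factor + 0.1 * rng.normal(size=200),
        factor + 0.1 * rng.normal(size=200),
        0.5 * factor + rng.normal(size=200),
    ]
)
y = -factor  # target negatively related to the common factor

# Standardise, then diagonalise the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]  # largest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

keep = eigenvalues > 1.0  # Kaiser criterion
components = eigenvectors[:, keep]

scores = Z @ components
for k in range(scores.shape[1]):
    # Assumed sign rule: make each component correlate positively with y.
    if np.corrcoef(scores[:, k], y)[0, 1] < 0:
        components[:, k] *= -1
scores = Z @ components
```

Flipping the sign of an eigenvector leaves the explained variance unchanged, so the adjustment only stabilises the interpretation of the components across refits.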

class ZnScoreAverager(neutral='zero', use_signs=False)[source]#

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)[source]#

Extract relevant standardisation/normalisation statistics.

Parameters:

X (pd.DataFrame) – Input feature matrix.
y (Any, default=None) – Placeholder for scikit-learn compatibility.

transform(X)[source]#

Create an out-of-sample (OOS) conceptual parity signal by averaging point-in-time (PiT) z-scores of features.

Parameters:: X (pd.DataFrame) – Input feature matrix.
Returns:: Output signal.
Return type:: pd.DataFrame

class BaseImputer(missing_values=nan, nan_threshold=1.0)[source]#

Bases: BaseEstimator, TransformerMixin, ABC

Base class for imputers operating on panel data.

Parameters:

missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.

feature_names_in_#

Names of features seen during fit

Type:: ndarray of shape (n_features_in_,)

missing_fraction_by_col_#

Fraction of missing values for each column

Type:: pd.Series

missing_fraction_by_cid_and_col_#

Fraction of missing values for each column split by cid

Type:: pd.DataFrame

dropped_features_#

Names of features to be dropped from the data

Type:: list

kept_features_#

Names of features that are not dropped

Type:: list

n_features_out_#

Number of features left after transforming

Type:: Integral

fit(X, y=None)[source]#

transform(X)[source]#

get_feature_names_out(input_features=None)[source]#

Return type:: ndarray
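
The bookkeeping behind nan_threshold and the fitted attributes can be sketched in pandas. The variable names below reuse the documented attribute names without the trailing underscore. The data, the threshold of 0.5 and the final constant fill (analogous to a fill_value of 0) are assumptions for illustration, and the library's internals may differ.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["AUD", "CAD"], pd.to_datetime(["2020-01-31", "2020-02-29"])],
    names=["cid", "real_date"],
)
X = pd.DataFrame(
    {
        "A": [1.0, np.nan, 3.0, 4.0],
        "B": [np.nan, np.nan, np.nan, 4.0],
        "C": [1.0, 2.0, 3.0, 4.0],
    },
    index=index,
)
nan_threshold = 0.5

# Fraction of missing values per column, overall and split by cid.
missing_fraction_by_col = X.isna().mean()
missing_fraction_by_cid_and_col = X.isna().groupby(level="cid").mean()

# Columns whose missing fraction exceeds the threshold are dropped.
too_sparse = missing_fraction_by_col[missing_fraction_by_col > nan_threshold]
dropped_features = too_sparse.index.tolist()
kept_features = [name for name in X.columns if name not in dropped_features]

# Constant fill of the surviving columns.
X_out = X[kept_features].fillna(0)
```

With the default nan_threshold of 1.0 no column can exceed the threshold, so nothing is dropped and every column is imputed.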

class ConstantImputer(fill_value=0, nan_threshold=1.0, missing_values=nan)[source]#

Bases: BaseImputer

Class for imputing missing values with a constant.

Parameters:

fill_value – Value to replace missing values with. Default is 0.
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.

feature_names_in_#

Names of features seen during fit

Type:: ndarray of shape (n_features_in_,)

missing_fraction_by_col_#

Fraction of missing values for each column

Type:: pd.Series

missing_fraction_by_cid_and_col_#

Fraction of missing values for each column split by cid

Type:: pd.DataFrame

dropped_features_#

Names of features to be dropped from the data

Type:: list

kept_features_#

Names of features that are not dropped

Type:: list

n_features_out_#

Number of features left after transforming

Type:: Integral

class CrossSectionalImputer(peer_map=None, default_peers='all', fallback='mean', missing_values=nan, nan_threshold=1.0)[source]#

Bases: BaseImputer

Impute missing values using the cross-sectional mean across configured peers at the same real_date (per feature).

Parameters:

peer_map (dict[str, list[str]] or None) –
Mapping from target cid -> list of peer cids to use for imputation. Example:

{“CAD”: [“USD”, “GBP”, “EUR”], “USD”: [“CAD”, “GBP”, “EUR”]}

If None, peers default to “all other cids” (unless default_peers=“none”).
default_peers ({"all", "none"}, default="all") –
Behaviour for cids not present in peer_map:
- “all”: use all other cids as peers
- “none”: do not impute for that cid (unless fallback kicks in)
fallback ({"none", "zero", "mean"}, default="mean") – Strategy for values still missing after peer-based imputation:
- “mean”: fill with the global mean per feature computed at fit time
- “zero”: fill with 0
- “none”: leave remaining NaNs in place
missing_values (scalar, default=np.nan) – Value to treat as missing (converted to np.nan internally).
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.

feature_names_in_#

Names of features seen during fit

Type:: ndarray of shape (n_features_in_,)

missing_fraction_by_col_#

Fraction of missing values for each column

Type:: pd.Series

missing_fraction_by_cid_and_col_#

Fraction of missing values for each column split by cid

Type:: pd.DataFrame

dropped_features_#

Names of features to be dropped from the data

Type:: list

kept_features_#

Names of features that are not dropped

Type:: list

n_features_out_#

Number of features left after transforming

Type:: Integral
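
Peer-based imputation at a given real_date can be sketched with pandas as follows. The function `impute_from_peers` is a hypothetical helper, not part of the library. It assumes a (cid, real_date) multi-index, treats cids missing from peer_map as having all other cids as peers, and omits the fallback step.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2020-01-31", "2020-02-29"])
index = pd.MultiIndex.from_product(
    [["CAD", "GBP", "USD"], dates], names=["cid", "real_date"]
)
X = pd.DataFrame({"GDP": [np.nan, 2.0, 1.0, 4.0, 3.0, np.nan]}, index=index)

peer_map = {"CAD": ["USD", "GBP"]}  # cids not listed use all other cids


def impute_from_peers(X, feature, peer_map):
    wide = X[feature].unstack("cid")  # rows: real_date, columns: cid
    filled = wide.copy()
    for cid in wide.columns:
        peers = peer_map.get(cid, [c for c in wide.columns if c != cid])
        # Mean across peers on the same date; missing peers are skipped.
        peer_mean = wide[peers].mean(axis=1)
        filled[cid] = wide[cid].fillna(peer_mean)
    return filled.stack().reorder_levels(["cid", "real_date"]).sort_index()


imputed = impute_from_peers(X, "GDP", peer_map)
```

Peer means are always taken from the original values in `wide`, so a value imputed for one cid never feeds into the imputation of another.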

class EstimatorImputer(estimator=None, fallback='mean', missing_values=nan, nan_threshold=1.0, complete_rows_only=True, predictor_fill_value='mean')[source]#

Bases: BaseImputer

Impute missing values using a per-feature sklearn-compatible estimator trained on the remaining features at fit time.

For each feature with missing values, a clone of the provided estimator is trained using all other features as predictors, on rows where the target feature is observed. At transform time the learned model fills in missing values in that feature.

Parameters:

estimator (BaseEstimator or None, default=None) – Any sklearn-compatible estimator (e.g. RandomForestRegressor, LinearRegression, Pipeline). If None, defaults to RandomForestRegressor().
fallback ({"mean", "zero", "none"}, default="mean") – Strategy for handling values still missing after model-based imputation:
- “mean”: fill with column means
- “zero”: fill with zeros
- “none”: leave remaining NaNs in place
missing_values (scalar, default=np.nan) – Value to treat as missing (converted to np.nan internally).
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.
complete_rows_only (bool, default=True) – If True, each per-feature model is trained only on rows where all predictor columns are also non-NaN. This allows any sklearn estimator to be used, not just those that handle NaN natively.
predictor_fill_value (str, float, int, or None, default="mean") – How to handle NaN predictor values at transform time. “mean” fills with per-column means from fit time, a numeric scalar fills with that constant, “skip” skips prediction for rows with NaN predictors (leaving them for the fallback to handle), and None applies no fill.

feature_names_in_#

Names of features seen during fit.

Type:: ndarray of shape (n_features_in_,)

missing_fraction_by_col_#

Fraction of missing values for each column.

Type:: pd.Series

missing_fraction_by_cid_and_col_#

Fraction of missing values for each column split by cid.

Type:: pd.DataFrame

dropped_features_#

Names of features dropped due to exceeding nan_threshold.

Type:: list

kept_features_#

Names of features retained after thresholding.

Type:: list

n_features_out_#

Number of features remaining after transform.

Type:: int

models_#

Mapping from feature name -> fitted estimator. Only populated for features that had at least one missing value during fit and had enough observed rows to train a model.

Type:: dict[str, Predictor]

predictor_means_#

Column means of the kept features (computed on training data, used to fill missing predictor values before prediction).

Type:: pd.Series
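
The per-feature modelling idea can be sketched with numpy alone. In this illustration an ordinary least-squares fit stands in for the sklearn-compatible estimator. The helpers `fit_linear` and `predict_linear` are hypothetical, and the `models` dictionary plays the role of the documented models_ mapping. Training uses complete rows only, as with complete_rows_only=True.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0, 5.0],
        "B": [2.0, 4.0, 6.0, np.nan, 10.0],  # B = 2 * A where observed
    }
)


def fit_linear(predictors, target):
    # Least-squares fit with an intercept; stand-in for estimator.fit.
    design = np.column_stack([np.ones(len(predictors)), predictors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_linear(coef, predictors):
    # Stand-in for estimator.predict.
    return np.column_stack([np.ones(len(predictors)), predictors]) @ coef


# Fit one model per feature that has missing values, on complete rows only.
models = {}
complete = X.dropna()
for feature in X.columns[X.isna().any().to_numpy()]:
    others = X.columns.drop(feature)
    models[feature] = fit_linear(
        complete[others].to_numpy(), complete[feature].to_numpy()
    )

# Fill the missing entries of each modelled feature from the other features.
X_imputed = X.copy()
for feature, coef in models.items():
    missing = X[feature].isna()
    others = X.columns.drop(feature)
    X_imputed.loc[missing, feature] = predict_linear(
        coef, X.loc[missing, others].to_numpy()
    )
```

The missing value of B at A = 4 is filled with approximately 8, the value implied by the fitted linear relationship. No model is fitted for A because it has no missing values.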

class GaussianConditionalImputer(fallback='mean', missing_values=nan, nan_threshold=1.0)[source]#

Bases: BaseImputer

Impute missing values using the closed-form Gaussian conditional mean.

For each row with missing values, the imputer partitions the feature vector into observed (o) and missing (m) components and computes:

mu_{m|o} = mu_m + Sigma_{mo} @ Sigma_{oo}^{-1} @ (x_o - mu_o)

A single global Gaussian (mean + Ledoit-Wolf covariance) is fitted on all complete rows across all cross-section identifiers.

Parameters:

fallback ({"mean", "zero", "none"}, default="mean") – Strategy for any values still missing after conditional imputation:
- “mean”: fill with column means
- “zero”: fill with zeros
- “none”: leave remaining NaNs in place
missing_values (scalar, default=np.nan) – Value to treat as missing (converted to np.nan internally).
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.
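
The conditional-mean formula above can be written directly in numpy. This is a minimal sketch: `conditional_mean_impute` is a hypothetical helper, and the mean vector and covariance matrix are supplied by hand here, whereas the class estimates them from complete rows with a Ledoit-Wolf covariance.

```python
import numpy as np


def conditional_mean_impute(row, mu, sigma):
    """Fill NaNs in one row with mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o)."""
    m = np.isnan(row)  # missing components
    o = ~m             # observed components
    if not m.any() or not o.any():
        return row.copy()  # nothing to impute, or nothing to condition on
    sigma_mo = sigma[np.ix_(m, o)]
    sigma_oo = sigma[np.ix_(o, o)]
    filled = row.copy()
    # Solve the linear system instead of forming the inverse explicitly.
    filled[m] = mu[m] + sigma_mo @ np.linalg.solve(sigma_oo, row[o] - mu[o])
    return filled


mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.8], [0.8, 2.0]])
row = np.array([2.0, np.nan])

filled = conditional_mean_impute(row, mu, sigma)
```

In this example the observed component is 2 units above its mean and the covariance is 0.8, so the imputed value is 1 + 0.8 * 2 = 2.6. Rows with no observed components are returned unchanged and would be left to the fallback.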
