macrosynergy.learning.preprocessing#
- class BasePanelSelector[source]#
Bases:
BaseEstimator, SelectorMixin, ABC
Base class for statistical feature selection over a panel.
- fit(X, y=None)[source]#
Learn optimal features based on a training set pair (X, y).
- Parameters:
X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame, optional) – The target vector.
- abstract determine_features(X, y)[source]#
Determine mask of selected features based on a training set pair (X, y).
- Parameters:
X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame) – The target vector.
- Returns:
mask – Boolean mask of selected features.
- Return type:
np.ndarray
- transform(X)[source]#
Transform method to return only the selected features of the dataframe.
- Parameters:
X (pandas.DataFrame) – The feature matrix.
- Returns:
X_transformed – The feature matrix with only the selected features.
- Return type:
pandas.DataFrame
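The fit / determine_features / transform contract described above can be sketched with a small stand-in class. This is not the library's implementation: `ToySelector`, `VarianceToySelector`, the `mask_` attribute and the variance rule are all hypothetical, and the sketch assumes that `fit` stores the mask returned by `determine_features` and that `transform` applies it column-wise. The index level names `cid` and `real_date` are also assumptions.

```python
from abc import ABC, abstractmethod

import numpy as np
import pandas as pd


class ToySelector(ABC):
    """Hypothetical stand-in mirroring the documented method names."""

    def fit(self, X, y=None):
        # Assumption: fit stores the boolean mask from determine_features.
        self.mask_ = np.asarray(self.determine_features(X, y), dtype=bool)
        return self

    @abstractmethod
    def determine_features(self, X, y):
        """Return a boolean mask with one entry per column of X."""

    def transform(self, X):
        # Keep only the columns flagged True in the mask.
        return X.loc[:, self.mask_]


class VarianceToySelector(ToySelector):
    """Illustrative rule: keep features whose variance exceeds a threshold."""

    def __init__(self, min_var=0.5):
        self.min_var = min_var

    def determine_features(self, X, y):
        return (X.var() > self.min_var).to_numpy()


index = pd.MultiIndex.from_product(
    [["AUD", "CAD"], pd.to_datetime(["2020-01-31", "2020-02-29"])],
    names=["cid", "real_date"],
)
X = pd.DataFrame(
    {"GROWTH": [1.0, -1.0, 2.0, -2.0], "FLAT": [0.1, 0.1, 0.1, 0.1]},
    index=index,
)
selected = VarianceToySelector().fit(X).transform(X)
```

Here `selected` keeps the multi-index of `X` and contains only the `GROWTH` column, since `FLAT` has no variance.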
- class LarsSelector(n_factors=10, fit_intercept=False)[source]#
Bases:
BasePanelSelector
- determine_features(X, y)[source]#
Create feature mask based on the LARS algorithm.
- Parameters:
X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame) – The target vector.
- Returns:
mask – Boolean mask of selected features.
- Return type:
np.ndarray
- class LassoSelector(n_factors=10, positive=False)[source]#
Bases:
BasePanelSelector
- determine_features(X, y)[source]#
Create feature mask based on the LASSO-LARS algorithm.
- Parameters:
X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame) – The target vector.
- Returns:
mask – Boolean mask of selected features.
- Return type:
np.ndarray
- class MapSelector(n_factors=None, significance_level=0.05, positive=False)[source]#
Bases:
BasePanelSelector
- determine_features(X, y)[source]#
Create feature mask based on the Macrosynergy panel test.
- Parameters:
X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame) – The target vector.
- Returns:
mask – Boolean mask of selected features.
- Return type:
np.ndarray
- class KendallSignificanceSelector(alpha=0.05)[source]#
Bases:
BasePanelSelector
Univariate statistical feature selection using Kendall correlation tests.
Future enhancements will include Bonferroni corrections for multiple testing.
- Parameters:
alpha (float, default=0.05) – Significance level.
- determine_features(X, y)[source]#
Create feature mask based on Kendall correlation significance tests.
- Parameters:
X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame) – The target vector.
- Returns:
mask – Boolean mask of selected features.
- Return type:
np.ndarray
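A minimal sketch of univariate selection by Kendall correlation tests, assuming each feature is tested against the target with `scipy.stats.kendalltau` and kept when its p-value falls below `alpha`. The helper `kendall_mask` is hypothetical, and the library's own test may treat the panel structure differently.

```python
import numpy as np
import pandas as pd
from scipy.stats import kendalltau


def kendall_mask(X, y, alpha=0.05):
    """Hypothetical helper: True where the Kendall test p-value is below alpha."""
    pvalues = [kendalltau(X[col], y).pvalue for col in X.columns]
    return np.array(pvalues) < alpha


n = 30
y = pd.Series(np.arange(n, dtype=float))
X = pd.DataFrame(
    {
        "MONOTONE": y ** 3,  # perfectly rank-correlated with y
        "ALTERNATING": np.tile([1.0, -1.0], n // 2),  # flips sign, no trend
    }
)
mask = kendall_mask(X, y)
```

No multiple-testing correction is applied here, which matches the class description noting that Bonferroni corrections are a future enhancement.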
- class FactorAvailabilitySelector(min_cids=2, min_periods=36)[source]#
Bases:
BasePanelSelector
- determine_features(X, y)[source]#
Determine mask of selected features based on a training set pair (X, y).
- Parameters:
X (pandas.DataFrame) – The feature matrix.
y (pandas.Series or pandas.DataFrame) – The target vector.
- Returns:
mask – Boolean mask of selected features.
- Return type:
np.ndarray
- class BasePanelScaler(type='panel')[source]#
Bases:
BaseEstimator, TransformerMixin, OneToOneFeatureMixin, ABC
Base class for scaling a panel of features in a learning pipeline.
- Parameters:
type (str, default="panel") – The panel dimension over which the scaling is applied. Options are “panel” and “cross_section”.
Notes
Learning algorithms can benefit from scaling each feature to a similar range. This ensures they consider each feature equally in the model training process. It can also encourage faster convergence of an optimization algorithm.
- fit(X, y=None)[source]#
Fit method to learn training set quantities for feature scaling.
- Parameters:
X (pd.DataFrame) – The feature matrix.
y (pd.Series or pd.DataFrame, default=None) – The target vector.
- Returns:
The fitted scaler.
- Return type:
self
- transform(X)[source]#
Transform method to scale the input data based on extracted training statistics.
- Parameters:
X (pandas.DataFrame) – The feature matrix.
- Returns:
X_transformed – The feature matrix with scaled features.
- Return type:
pandas.DataFrame
- class PanelMinMaxScaler(type='panel')[source]#
Bases:
BasePanelScaler
Scale and translate panel features to lie within the range [0,1].
Notes
This class is designed to replicate scikit-learn’s MinMaxScaler() class, with the additional option to scale within cross-sections. Unlike the MinMaxScaler() class, dataframes are always returned, preserving the multi-indexing of the inputs.
- extract_statistics(X, feature)[source]#
Determine the minimum and maximum values of a feature in the input matrix.
- Parameters:
X (pandas.DataFrame) – The feature matrix.
feature (str) – The feature to extract statistics for.
- Returns:
statistics – List containing the minimum and maximum values of the feature.
- Return type:
list
- scale(X, feature, statistics)[source]#
Scale the ‘feature’ column in the design matrix ‘X’ based on the minimum and maximum values of the feature.
- Parameters:
X (pandas.DataFrame) – The feature matrix.
feature (str) – The feature to scale.
statistics (list) – List containing the minimum and maximum values of the feature, in that order.
- Returns:
X_transformed – The scaled feature.
- Return type:
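The difference between type="panel" and type="cross_section" can be illustrated with plain pandas. This sketch assumes that "panel" pools all cross-sections when computing the minimum and maximum, while "cross_section" computes them separately for each cid. The helper `min_max_scale` and the index level names `cid` and `real_date` are assumptions made for illustration, not the library's code.

```python
import pandas as pd


def min_max_scale(X, feature, by_cross_section=False):
    """Hypothetical helper: scale one column to [0, 1], pooled or per cid."""
    col = X[feature]
    if by_cross_section:
        grouped = col.groupby(level="cid")
        lo, hi = grouped.transform("min"), grouped.transform("max")
    else:
        lo, hi = col.min(), col.max()
    return (col - lo) / (hi - lo)


index = pd.MultiIndex.from_product(
    [["AUD", "CAD"], pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"])],
    names=["cid", "real_date"],
)
X = pd.DataFrame({"CPI": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]}, index=index)

pooled = min_max_scale(X, "CPI")
per_cid = min_max_scale(X, "CPI", by_cross_section=True)
```

With pooled statistics the AUD values are squeezed near 0 because CAD dominates the range. With per-cid statistics each cross-section spans the full [0, 1] interval. Both results keep the multi-index of the input.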
- class PanelStandardScaler(type='panel', with_mean=True, with_std=True)[source]#
Bases:
BasePanelScaler
Scale and translate panel features to have zero mean and unit variance.
- Parameters:
Notes
This class is designed to replicate scikit-learn’s StandardScaler() class, with the additional option to scale within cross-sections. Unlike the StandardScaler() class, dataframes are always returned, preserving the multi-indexing of the inputs.
- extract_statistics(X, feature)[source]#
Determine the mean and standard deviation of values of a feature in the input matrix.
- Parameters:
X (pandas.DataFrame) – The feature matrix.
feature (str) – The feature to extract statistics for.
- Returns:
statistics – List containing the mean and standard deviation of values of the feature.
- Return type:
list
- scale(X, feature, statistics)[source]#
Scale the ‘feature’ column in the design matrix ‘X’ based on the mean and standard deviation values of the feature.
- Parameters:
X (pandas.DataFrame) – The feature matrix.
feature (str) – The feature to scale.
statistics (list) – List containing the mean and standard deviation of values of the feature, in that order.
- Returns:
X_transformed – The scaled feature.
- Return type:
- class PanelPCA(n_components=None, kaiser_criterion=False, adjust_signs=False)[source]#
Bases:
BaseEstimator, TransformerMixin
- fit(X, y=None)[source]#
Fit method to determine an eigenbasis for the PCA.
- Parameters:
X (pd.DataFrame) – Input feature matrix.
y (pd.DataFrame, pd.Series or np.ndarray, default=None) – Target variable.
Notes
The target variable y is used only to adjust the signs of the eigenvectors, so that they stay consistent when the PCA is retrained over time. This does not affect the PCA itself.
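The source does not state which sign convention is applied, so the following is only a sketch under an assumed rule: each eigenvector is flipped, if needed, so that its component correlates non-negatively with y. The function `pca_with_sign_adjustment` is hypothetical. Flipping a sign changes neither the eigenvalues nor the subspace spanned by the components.

```python
import numpy as np


def pca_with_sign_adjustment(X, y):
    """Hypothetical sketch: eigendecompose the covariance, then fix signs via y."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]  # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs
    for j in range(scores.shape[1]):
        # Assumed convention: each component correlates non-negatively with y.
        if np.corrcoef(scores[:, j], y)[0, 1] < 0:
            eigvecs[:, j] *= -1
            scores[:, j] *= -1
    return eigvals, eigvecs, scores


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
eigvals, eigvecs, scores = pca_with_sign_adjustment(X, y)
```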
- class ZnScoreAverager(neutral='zero', use_signs=False)[source]#
Bases:
BaseEstimator, TransformerMixin
- class BaseImputer(missing_values=nan, nan_threshold=1.0)[source]#
Bases:
BaseEstimator, TransformerMixin, ABC
Base class for imputers operating on panel data.
- Parameters:
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.
- feature_names_in_#
Names of features seen during fit
- Type:
ndarray of shape (n_features_in_,)
- missing_fraction_by_col_#
Fraction of missing values for each column
- Type:
pd.Series
- missing_fraction_by_cid_and_col_#
Fraction of missing values for each column split by cid
- Type:
pd.DataFrame
- n_features_out_#
Number of features left after transforming
- Type:
Integral
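The nan_threshold rule and the two missing-fraction attributes can be approximated with plain pandas. This is a sketch, assuming the fractions are the share of NaN rows per column (overall, and within each cid) and that the index has a level named `cid`. The variable names are illustrative and not taken from the library.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["AUD", "CAD"], pd.to_datetime(["2020-01-31", "2020-02-29"])],
    names=["cid", "real_date"],
)
X = pd.DataFrame(
    {
        "GROWTH": [1.0, 2.0, 3.0, 4.0],
        "INFL": [np.nan, np.nan, np.nan, 0.5],
    },
    index=index,
)

nan_threshold = 0.5  # illustrative value; the documented default is 1.0

# Analogue of missing_fraction_by_col_: share of NaNs per column.
missing_by_col = X.isna().mean()
# Analogue of missing_fraction_by_cid_and_col_: the same share, split by cid.
missing_by_cid_and_col = X.isna().groupby(level="cid").mean()

# Keep only columns whose missing share does not exceed the threshold.
kept = X.loc[:, missing_by_col <= nan_threshold]
```

With the default threshold of 1.0 no column can exceed the limit, so nothing would be dropped under this reading.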
- class ConstantImputer(fill_value=0, nan_threshold=1.0, missing_values=nan)[source]#
Bases:
BaseImputer
Class for imputing missing values with a constant.
- Parameters:
fill_value – Value to replace missing values with. Default is 0.
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.
- feature_names_in_#
Names of features seen during fit
- Type:
ndarray of shape (n_features_in_,)
- missing_fraction_by_col_#
Fraction of missing values for each column
- Type:
pd.Series
- missing_fraction_by_cid_and_col_#
Fraction of missing values for each column split by cid
- Type:
pd.DataFrame
- n_features_out_#
Number of features left after transforming
- Type:
Integral
- class CrossSectionalImputer(peer_map=None, default_peers='all', fallback='mean', missing_values=nan, nan_threshold=1.0)[source]#
Bases:
BaseImputer
Impute missing values using the cross-sectional mean across configured peers at the same real_date (per feature).
- Parameters:
peer_map (dict[str, list[str]] or None) – Mapping from target cid -> list of peer cids to use for imputation. Example: {"CAD": ["USD", "GBP", "EUR"], "USD": ["CAD", "GBP", "EUR"]}. If None, peers default to “all other cids” (unless default_peers=“none”).
default_peers ({"all", "none"}) – Behaviour for cids not present in peer_map:
- “all”: use all other cids as peers
- “none”: do not impute for that cid (unless fallback kicks in)
fallback ({"none", "zero", "mean"}) – Strategy for values still missing after peer-based imputation:
- “mean”: fill with the global mean per feature computed at fit time
- “zero”: fill with 0
- “none”: leave remaining NaNs in place
missing_values (scalar) – Value to treat as missing (converted to np.nan internally).
- feature_names_in_#
Names of features seen during fit
- Type:
ndarray of shape (n_features_in_,)
- missing_fraction_by_col_#
Fraction of missing values for each column
- Type:
pd.Series
- missing_fraction_by_cid_and_col_#
Fraction of missing values for each column split by cid
- Type:
pd.DataFrame
- n_features_out_#
Number of features left after transforming
- Type:
Integral
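Peer-based imputation as described for this class can be sketched in pandas: for each target cid, missing values are replaced by the mean of the peers' values on the same real_date. The helper `impute_from_peers` is hypothetical, the fallback step is left out, and the index level names `cid` and `real_date` are assumed.

```python
import numpy as np
import pandas as pd


def impute_from_peers(X, peer_map):
    """Hypothetical helper: fill NaNs with the same-date mean over peer cids."""
    out = X.copy()
    cids = X.index.get_level_values("cid")
    for cid, peers in peer_map.items():
        # Mean of each feature across the peers, one row per real_date.
        peer_mean = X.loc[cids.isin(peers)].groupby(level="real_date").mean()
        rows = cids == cid
        block = out.loc[rows]
        # Align peer means to the target cid's dates, then fill only the gaps.
        aligned = peer_mean.reindex(block.index.get_level_values("real_date"))
        aligned.index = block.index
        out.loc[rows] = block.fillna(aligned)
    return out


dates = pd.to_datetime(["2020-01-31", "2020-02-29"])
index = pd.MultiIndex.from_product(
    [["CAD", "USD", "EUR"], dates], names=["cid", "real_date"]
)
X = pd.DataFrame({"GROWTH": [np.nan, 1.0, 2.0, 3.0, 4.0, 5.0]}, index=index)

filled = impute_from_peers(X, {"CAD": ["USD", "EUR"]})
```

The missing CAD value for January is filled with the average of the USD and EUR January values. A date on which every peer is also missing would stay NaN and be left for the fallback.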
- class EstimatorImputer(estimator=None, fallback='mean', missing_values=nan, nan_threshold=1.0, complete_rows_only=True, predictor_fill_value='mean')[source]#
Bases:
BaseImputer
Impute missing values using a per-feature sklearn-compatible estimator trained on the remaining features at fit time.
For each feature with missing values, a clone of the provided estimator is trained using all other features as predictors, on rows where the target feature is observed. At transform time the learned model fills in missing values in that feature.
- Parameters:
estimator (BaseEstimator or None, default=None) – Any sklearn-compatible estimator (e.g. RandomForestRegressor, LinearRegression, Pipeline). If None, defaults to RandomForestRegressor().
fallback (str, default="mean") – Strategy for handling values still missing after model-based imputation:
- “mean”: fill with column means
- “zero”: fill with zeros
- “none”: leave remaining NaNs in place
missing_values (scalar, default=np.nan) – Value to treat as missing (converted to np.nan internally).
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.
complete_rows_only (bool, default=True) – If True, each per-feature model is trained only on rows where all predictor columns are also non-NaN. This allows any sklearn estimator to be used, not just those that handle NaN natively.
predictor_fill_value (str, float, int, or None, default="mean") – How to handle NaN predictor values at transform time. “mean” fills with per-column means from fit time, a numeric scalar fills with that constant, “skip” skips prediction for rows with NaN predictors (leaving them for the fallback to handle), and None applies no fill.
- feature_names_in_#
Names of features seen during fit.
- Type:
ndarray of shape (n_features_in_,)
- missing_fraction_by_col_#
Fraction of missing values for each column.
- Type:
pd.Series
- missing_fraction_by_cid_and_col_#
Fraction of missing values for each column split by cid.
- Type:
pd.DataFrame
- models_#
Mapping from feature name -> fitted estimator. Only populated for features that had at least one missing value during fit and had enough observed rows to train a model.
- predictor_means_#
Column means of the kept features (computed on training data, used to fill missing predictor values before prediction).
- Type:
pd.Series
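The per-feature procedure described for this class can be sketched with scikit-learn's `clone` and a `LinearRegression`, used here instead of the documented default RandomForestRegressor so that the result is deterministic. The helpers `fit_feature_models` and `impute_with_models` are hypothetical. Training uses complete rows only, and prediction skips rows with missing predictors, which is one of several behaviours the parameters above allow.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression


def fit_feature_models(X, estimator):
    """Hypothetical helper: one cloned model per feature with missing values."""
    models = {}
    for target in X.columns[X.isna().any().to_numpy()]:
        predictors = X.columns.drop(target)
        # Train on rows where the target and all predictors are observed.
        complete = X[target].notna() & X[predictors].notna().all(axis=1)
        if complete.sum() > 0:
            models[target] = clone(estimator).fit(
                X.loc[complete, predictors], X.loc[complete, target]
            )
    return models


def impute_with_models(X, models):
    """Fill missing targets where every predictor in that row is observed."""
    out = X.copy()
    for target, model in models.items():
        predictors = X.columns.drop(target)
        rows = X[target].isna() & X[predictors].notna().all(axis=1)
        if rows.any():
            out.loc[rows, target] = model.predict(X.loc[rows, predictors])
    return out


X = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": [2.0, 4.0, np.nan, 8.0]})
models = fit_feature_models(X, LinearRegression())
filled = impute_with_models(X, models)
```

Only `B` receives a model, mirroring the statement that models_ is populated only for features with missing values during fit.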
- class GaussianConditionalImputer(fallback='mean', missing_values=nan, nan_threshold=1.0)[source]#
Bases:
BaseImputer
Impute missing values using the closed-form Gaussian conditional mean.
For each row with missing values, the imputer partitions the feature vector into observed (o) and missing (m) components and computes:
mu_{m|o} = mu_m + Sigma_{mo} @ Sigma_{oo}^{-1} @ (x_o - mu_o)
A single global Gaussian (mean + Ledoit-Wolf covariance) is fitted on all complete rows across all cross-section identifiers.
- Parameters:
fallback ({"mean", "zero", "none"}, default="mean") – Strategy for any values still missing after conditional imputation:
- “mean”: fill with column means
- “zero”: fill with zeros
- “none”: leave remaining NaNs in place
missing_values (scalar, default=np.nan) – Value to treat as missing (converted to np.nan internally).
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.
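The conditional-mean formula above can be checked on a single row with NumPy. In this sketch the mean vector and covariance matrix are supplied directly instead of being estimated, so the Ledoit-Wolf step is omitted. The function `conditional_mean` is a hypothetical helper, not part of the library.

```python
import numpy as np


def conditional_mean(x, mu, sigma):
    """Fill NaNs in x with mu_m + Sigma_mo @ Sigma_oo^{-1} @ (x_o - mu_o)."""
    m = np.isnan(x)
    if not m.any() or m.all():
        return x.copy()  # nothing to impute, or nothing to condition on
    o = ~m
    sigma_mo = sigma[np.ix_(m, o)]
    sigma_oo = sigma[np.ix_(o, o)]
    filled = x.copy()
    # Solve the linear system rather than forming the explicit inverse.
    filled[m] = mu[m] + sigma_mo @ np.linalg.solve(sigma_oo, x[o] - mu[o])
    return filled


mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.8], [0.8, 2.0]])
row = np.array([2.0, np.nan])
filled = conditional_mean(row, mu, sigma)
```

For this row the missing entry becomes 1 + 0.8 * (2 - 0) / 1 = 2.6. A row with no observed entries is returned unchanged, which is the kind of case the fallback parameter is meant to handle.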
Subpackages#
- macrosynergy.learning.preprocessing.imputers
- macrosynergy.learning.preprocessing.panel_selectors
- macrosynergy.learning.preprocessing.scalers
- macrosynergy.learning.preprocessing.transformers