macrosynergy.learning.preprocessing.imputers#

class BaseImputer(missing_values=nan, nan_threshold=1.0)[source]#

Bases: BaseEstimator, TransformerMixin, ABC

Base class for imputers operating on panel data

Parameters:

missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for missing values. All occurrences of missing_values will be imputed. For pandas DataFrames with nullable integer dtypes, missing_values can be set to either np.nan or pd.NA.
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column is greater than this threshold, the column is dropped.

feature_names_in_#

Names of features seen during fit

Type:: ndarray of shape (n_features_in_,)

missing_fraction_by_col_#

Fraction of missing values for each column

Type:: pd.Series

missing_fraction_by_cid_and_col_#

Fraction of missing values for each column split by cid

Type:: pd.DataFrame

dropped_features_#

Names of features to be dropped from the data

Type:: list

kept_features_#

Names of features that are not dropped

Type:: list

n_features_out_#

Number of features left after transforming

Type:: Integral

fit(X, y=None)[source]#

transform(X)[source]#

get_feature_names_out(input_features=None)[source]#

Return type:: ndarray
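
The missing-fraction attributes and the nan_threshold rule can be pictured with plain pandas. The following is only a sketch of the idea and does not use the library itself. It assumes a panel indexed by (cid, real_date) with one column per feature; the data, the feature names and the threshold of 0.5 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: rows indexed by (cid, real_date), one column per feature.
index = pd.MultiIndex.from_product(
    [["USD", "CAD"], pd.date_range("2024-01-01", periods=4, freq="D")],
    names=["cid", "real_date"],
)
X = pd.DataFrame(
    {
        "GROWTH": [1.0, np.nan, 3.0, 4.0, np.nan, np.nan, 7.0, 8.0],
        "INFL": [np.nan] * 7 + [2.0],
    },
    index=index,
)

nan_threshold = 0.5  # illustrative value; the documented default is 1.0

# Fraction of missing values per column, and per column within each cid.
missing_fraction_by_col = X.isna().mean()
missing_fraction_by_cid_and_col = X.isna().groupby(level="cid").mean()

# Columns whose missing fraction is greater than the threshold are dropped.
too_sparse = missing_fraction_by_col > nan_threshold
dropped_features = list(missing_fraction_by_col[too_sparse].index)
kept_features = [c for c in X.columns if c not in dropped_features]
X_kept = X[kept_features]
```

With the default nan_threshold=1.0 no column can exceed the threshold, so nothing would be dropped in this sketch.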

class ConstantImputer(fill_value=0, nan_threshold=1.0, missing_values=nan)[source]#

Bases: BaseImputer

Class for imputing missing values with a constant

Parameters:

fill_value – Value to replace missing values with. Default is 0.
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for missing values. All occurrences of missing_values will be imputed. For pandas DataFrames with nullable integer dtypes, missing_values can be set to either np.nan or pd.NA.
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column is greater than this threshold, the column is dropped.

feature_names_in_#

Names of features seen during fit

Type:: ndarray of shape (n_features_in_,)

missing_fraction_by_col_#

Fraction of missing values for each column

Type:: pd.Series

missing_fraction_by_cid_and_col_#

Fraction of missing values for each column split by cid

Type:: pd.DataFrame

dropped_features_#

Names of features to be dropped from the data

Type:: list

kept_features_#

Names of features that are not dropped

Type:: list

n_features_out_#

Number of features left after transforming

Type:: Integral
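
Constant imputation amounts to replacing every missing entry with the same value. The sketch below shows that operation with pandas alone, for the default case where missing values are NaN; it is not the class's implementation, and the small frame is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame with NaN marking the missing entries.
X = pd.DataFrame(
    {"GROWTH": [1.0, np.nan, 3.0], "INFL": [np.nan, 2.0, np.nan]}
)
fill_value = 0  # the documented default

# Every NaN is replaced by the same constant; observed values are untouched.
X_imputed = X.fillna(fill_value)
```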

class CrossSectionalImputer(peer_map=None, default_peers='all', fallback='mean', missing_values=nan, nan_threshold=1.0)[source]#

Bases: BaseImputer

Impute missing values using the cross-sectional mean across configured peers at the same real_date (per feature).

Parameters:

peer_map (dict[str, list[str]] or None) –
Mapping from target cid -> list of peer cids to use for imputation. Example:

{"CAD": ["USD", "GBP", "EUR"], "USD": ["CAD", "GBP", "EUR"]}

If None, each cid uses all other cids as peers (unless default_peers="none").
default_peers ({"all", "none"}) –
Behaviour for cids not present in peer_map:
- "all": use all other cids as peers
- "none": do not impute for that cid (unless fallback kicks in)
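fallback ({"none", "zero", "mean"}, default="mean") – Strategy for values still missing after peer-based imputation. If "mean", they are filled with the global mean per feature computed at fit time. If "zero", they are filled with 0. If "none", they are left as NaN.
missing_values (scalar, default=np.nan) – Value to treat as missing (converted to np.nan internally).
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column is greater than this threshold, the column is dropped.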

feature_names_in_#

Names of features seen during fit

Type:: ndarray of shape (n_features_in_,)

missing_fraction_by_col_#

Fraction of missing values for each column

Type:: pd.Series

missing_fraction_by_cid_and_col_#

Fraction of missing values for each column split by cid

Type:: pd.DataFrame

dropped_features_#

Names of features to be dropped from the data

Type:: list

kept_features_#

Names of features that are not dropped

Type:: list

n_features_out_#

Number of features left after transforming

Type:: Integral
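
The peer-mean rule described above can be sketched in pandas as follows. This is an illustration under stated assumptions rather than the class's code: the helper name peer_mean_impute is hypothetical, the panel is assumed to be indexed by (cid, real_date), peer means are taken from the original (un-imputed) values, and the fallback step is not shown.

```python
import numpy as np
import pandas as pd


def peer_mean_impute(X, peer_map=None, default_peers="all"):
    """Fill NaNs with the mean of peer cids on the same real_date, per feature."""
    peer_map = peer_map or {}
    cids = X.index.get_level_values("cid")
    dates = X.index.get_level_values("real_date")
    all_cids = list(cids.unique())
    out = X.copy()
    for cid in all_cids:
        if cid in peer_map:
            peers = peer_map[cid]
        elif default_peers == "all":
            peers = [c for c in all_cids if c != cid]
        else:  # "none": leave this cid's gaps untouched
            continue
        own_mask = cids == cid
        # Mean over peers for each date; pandas skips NaN peer values.
        peer_means = X[cids.isin(peers)].groupby(level="real_date").mean()
        # Align the per-date peer means with this cid's rows.
        aligned = peer_means.reindex(dates[own_mask]).to_numpy()
        own = X.loc[own_mask].to_numpy()
        out.loc[own_mask] = np.where(np.isnan(own), aligned, own)
    return out


# Hypothetical panel: four cids, two dates, one feature.
index = pd.MultiIndex.from_product(
    [["CAD", "USD", "GBP", "EUR"], pd.to_datetime(["2024-01-01", "2024-01-02"])],
    names=["cid", "real_date"],
)
X = pd.DataFrame(
    {"GROWTH": [np.nan, 2.0, 1.0, 2.0, 3.0, np.nan, 5.0, 6.0]}, index=index
)

# CAD is imputed from USD and GBP only; GBP falls back to all other cids.
result = peer_mean_impute(X, peer_map={"CAD": ["USD", "GBP"]})
```

In this example the CAD gap on the first date becomes the mean of the USD and GBP values on that date, while the GBP gap on the second date is filled from CAD, USD and EUR because GBP has no entry in peer_map.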

class EstimatorImputer(estimator=None, fallback='mean', missing_values=nan, nan_threshold=1.0, complete_rows_only=True, predictor_fill_value='mean')[source]#

Bases: BaseImputer

Impute missing values using a per-feature sklearn-compatible estimator trained on the remaining features at fit time.

For each feature with missing values, a clone of the provided estimator is trained using all other features as predictors, on rows where the target feature is observed. At transform time the learned model fills in missing values in that feature.

Parameters:

estimator (BaseEstimator or None, default=None) – Any sklearn-compatible estimator (e.g. RandomForestRegressor, LinearRegression, Pipeline). If None, defaults to RandomForestRegressor().
fallback (str, default="mean") –
Strategy for handling values still missing after model-based imputation:
- "mean": fill with column means
- "zero": fill with zeros
- "none": leave remaining NaNs in place
missing_values (scalar, default=np.nan) – Value to treat as missing (converted to np.nan internally).
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.
complete_rows_only (bool, default=True) – If True, each per-feature model is trained only on rows where all predictor columns are also non-NaN. This allows any sklearn estimator to be used, not just those that handle NaN natively.
predictor_fill_value (str, float, int, or None, default="mean") – How to handle NaN predictor values at transform time. “mean” fills with per-column means from fit time, a numeric scalar fills with that constant, “skip” skips prediction for rows with NaN predictors (leaving them for the fallback to handle), and None applies no fill.

feature_names_in_#

Names of features seen during fit.

Type:: ndarray of shape (n_features_in_,)

missing_fraction_by_col_#

Fraction of missing values for each column.

Type:: pd.Series

missing_fraction_by_cid_and_col_#

Fraction of missing values for each column split by cid.

Type:: pd.DataFrame

dropped_features_#

Names of features dropped due to exceeding nan_threshold.

Type:: list

kept_features_#

Names of features retained after thresholding.

Type:: list

n_features_out_#

Number of features remaining after transform.

Type:: int

models_#

Mapping from feature name -> fitted estimator. Only populated for features that had at least one missing value during fit and had enough observed rows to train a model.

Type:: dict[str, Predictor]

predictor_means_#

Column means of the kept features (computed on training data, used to fill missing predictor values before prediction).

Type:: pd.Series
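
The per-feature scheme can be sketched with scikit-learn directly. The code below is a simplified illustration, not the class's implementation: it uses LinearRegression in place of the default RandomForestRegressor so the result is easy to check, mimics complete_rows_only=True and predictor_fill_value="mean", and runs on synthetic data in which feature B is an exact linear function of A.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only): B = 2 * A + 1, with 20 values of B removed.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
truth = pd.DataFrame({"A": a, "B": 2.0 * a + 1.0})
X = truth.copy()
X.loc[X.index[:20], "B"] = np.nan

estimator = LinearRegression()
predictor_means = X.mean()  # per-column means from the training data
models = {}

# Fit: one cloned model per feature that has missing values.
for target in X.columns[X.isna().any()]:
    predictors = [c for c in X.columns if c != target]
    # Train only on rows where the target and all predictors are observed.
    rows = X[target].notna() & X[predictors].notna().all(axis=1)
    models[target] = clone(estimator).fit(X.loc[rows, predictors], X.loc[rows, target])

# Transform: predict the gaps, filling NaN predictors with training means first.
X_imputed = X.copy()
for target, model in models.items():
    predictors = [c for c in X.columns if c != target]
    gaps = X[target].isna()
    if gaps.any():
        inputs = X.loc[gaps, predictors].fillna(predictor_means[predictors])
        X_imputed.loc[gaps, target] = model.predict(inputs)
```

Only B gets a model here, since A has no missing values, which mirrors the description of models_ above.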

class GaussianConditionalImputer(fallback='mean', missing_values=nan, nan_threshold=1.0)[source]#

Bases: BaseImputer

Impute missing values using the closed-form Gaussian conditional mean.

For each row with missing values, the imputer partitions the feature vector into observed (o) and missing (m) components and computes:

mu_{m|o} = mu_m + Sigma_{mo} @ Sigma_{oo}^{-1} @ (x_o - mu_o)

A single global Gaussian (mean + Ledoit-Wolf covariance) is fitted on all complete rows across all cross-section identifiers.

Parameters:

fallback ({"mean", "zero", "none"}, default="mean") –
Strategy for any values still missing after conditional imputation:
- "mean": fill with column means
- "zero": fill with zeros
- "none": leave remaining NaNs in place
missing_values (scalar, default=np.nan) – Value to treat as missing (converted to np.nan internally).
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.
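
The conditional-mean formula above can be checked on a single row with NumPy. This sketch takes the mean vector and covariance matrix as given inputs, so the Ledoit-Wolf estimation step is not shown; the function name conditional_mean_impute and the numbers are hypothetical.

```python
import numpy as np


def conditional_mean_impute(x, mu, sigma):
    """Fill NaNs in one row with the Gaussian conditional mean given the observed entries."""
    x = np.asarray(x, dtype=float)
    m = np.isnan(x)  # missing components
    o = ~m           # observed components
    if not m.any() or not o.any():
        # Nothing to impute, or nothing to condition on.
        return x.copy()
    sigma_mo = sigma[np.ix_(m, o)]
    sigma_oo = sigma[np.ix_(o, o)]
    # Solve Sigma_oo z = (x_o - mu_o) rather than forming the inverse explicitly.
    z = np.linalg.solve(sigma_oo, x[o] - mu[o])
    out = x.copy()
    out[m] = mu[m] + sigma_mo @ z
    return out


# Two features with zero mean, unit variance and correlation 0.5.
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])

row = np.array([2.0, np.nan])
imputed = conditional_mean_impute(row, mu, sigma)
```

Here the observed value 2.0 sits two units above its mean, so the missing entry is imputed as 0 + 0.5 * 2.0 = 1.0. A row with no observed entries is returned unchanged in this sketch, which is the kind of case the fallback option is described as handling.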

Submodules#

macrosynergy.learning.preprocessing.imputers.imputers