macrosynergy.learning.preprocessing.imputers#
- class BaseImputer(missing_values=nan, nan_threshold=1.0)[source]#
Bases:
BaseEstimator, TransformerMixin, ABC
Base class for imputers operating on panel data.
- Parameters:
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.
- feature_names_in_#
Names of features seen during fit
- Type:
ndarray of shape (n_features_in_,)
- missing_fraction_by_col_#
Fraction of missing values for each column
- Type:
pd.Series
- missing_fraction_by_cid_and_col_#
Fraction of missing values for each column split by cid
- Type:
pd.DataFrame
- n_features_out_#
Number of features left after transforming
- Type:
Integral
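The missing-value bookkeeping described by these attributes can be sketched in plain pandas. This is only an illustration, not the class's implementation: it assumes a panel indexed by (cid, real_date), and the toy data, the feature names GROWTH and INFL, and the threshold of 0.5 are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy panel indexed by (cid, real_date); all values are made up for illustration.
idx = pd.MultiIndex.from_product(
    [["USD", "EUR"], pd.to_datetime(["2024-01-01", "2024-01-02"])],
    names=["cid", "real_date"],
)
X = pd.DataFrame(
    {
        "GROWTH": [1.0, np.nan, 2.0, 3.0],
        "INFL": [np.nan, np.nan, np.nan, 0.5],
    },
    index=idx,
)

nan_threshold = 0.5  # hypothetical value; the documented default is 1.0

# Fraction of missing values per column, overall and split by cid.
missing_fraction_by_col = X.isna().mean()
missing_fraction_by_cid_and_col = X.isna().groupby(level="cid").mean()

# Keep only columns whose missing fraction does not exceed the threshold.
kept = missing_fraction_by_col[missing_fraction_by_col <= nan_threshold].index
X_kept = X[kept]
n_features_out = X_kept.shape[1]
```

Here INFL is missing in three of four rows, so it is dropped and only GROWTH survives.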
- class ConstantImputer(fill_value=0, nan_threshold=1.0, missing_values=nan)[source]#
Bases:
BaseImputer
Class for imputing missing values with a constant.
- Parameters:
fill_value – Value to replace missing values with. Default is 0.
missing_values (int, float, str, np.nan, None or pandas.NA, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values can be set to either np.nan or pd.NA.
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.
- feature_names_in_#
Names of features seen during fit
- Type:
ndarray of shape (n_features_in_,)
- missing_fraction_by_col_#
Fraction of missing values for each column
- Type:
pd.Series
- missing_fraction_by_cid_and_col_#
Fraction of missing values for each column split by cid
- Type:
pd.DataFrame
- n_features_out_#
Number of features left after transforming
- Type:
Integral
- class CrossSectionalImputer(peer_map=None, default_peers='all', fallback='mean', missing_values=nan, nan_threshold=1.0)[source]#
Bases:
BaseImputer
Impute missing values using the cross-sectional mean across configured peers at the same real_date (per feature).
- Parameters:
peer_map (dict[str, list[str]] or None) –
Mapping from target cid -> list of peer cids to use for imputation. Example:
{"CAD": ["USD", "GBP", "EUR"], "USD": ["CAD", "GBP", "EUR"]}
If None, peers default to all other cids (unless default_peers="none").
default_peers ({"all", "none"}) –
Behaviour for cids not present in peer_map:
“all”: use all other cids as peers
“none”: do not impute for that cid (unless fallback kicks in)
fallback ({"none", "zero", "mean"}, default="mean") – Strategy for values still missing after peer-based imputation. If “mean”, they are filled with the global mean per feature computed at fit time. If “zero”, they are filled with 0.
missing_values (scalar) – Value to treat as missing (converted to np.nan internally).
- feature_names_in_#
Names of features seen during fit
- Type:
ndarray of shape (n_features_in_,)
- missing_fraction_by_col_#
Fraction of missing values for each column
- Type:
pd.Series
- missing_fraction_by_cid_and_col_#
Fraction of missing values for each column split by cid
- Type:
pd.DataFrame
- n_features_out_#
Number of features left after transforming
- Type:
Integral
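As a rough illustration of peer-based imputation, the sketch below fills each missing value with the mean of the peers' values at the same real_date. It is not the library's code: the helper peer_mean_impute and the toy data are hypothetical, the fallback mean is computed on the same frame rather than at a separate fit step, and treating fallback="none" as "leave NaNs in place" is an assumption.

```python
import numpy as np
import pandas as pd


def peer_mean_impute(X, peer_map=None, default_peers="all", fallback="mean"):
    """Hypothetical helper: fill NaNs with the same-date mean across peer cids."""
    peer_map = peer_map or {}
    cid_level = X.index.get_level_values("cid")
    all_cids = list(cid_level.unique())
    pieces = []
    for cid, block in X.groupby(level="cid", sort=False):
        if cid in peer_map:
            peers = peer_map[cid]
        elif default_peers == "all":
            peers = [c for c in all_cids if c != cid]
        else:
            peers = []  # default_peers="none": no peer-based imputation
        peer_rows = X[cid_level.isin(peers)]
        if peer_rows.empty:
            pieces.append(block)
            continue
        # Mean across peers per real_date and feature (NaNs are skipped).
        peer_mean = peer_rows.groupby(level="real_date").mean()
        fill = peer_mean.reindex(block.index.get_level_values("real_date"))
        fill.index = block.index
        pieces.append(block.fillna(fill))
    out = pd.concat(pieces).reindex(X.index)
    if fallback == "mean":
        out = out.fillna(X.mean())  # global per-feature mean
    elif fallback == "zero":
        out = out.fillna(0.0)
    return out  # any other fallback value leaves remaining NaNs (assumption)


dates = pd.to_datetime(["2024-01-01", "2024-01-02"])
idx = pd.MultiIndex.from_product(
    [["CAD", "GBP", "USD"], dates], names=["cid", "real_date"]
)
X = pd.DataFrame({"X1": [np.nan, np.nan, 3.0, np.nan, 1.0, 4.0]}, index=idx)
peer_map = {"CAD": ["USD", "GBP"]}

filled_all = peer_mean_impute(X, peer_map, default_peers="all")
filled_none = peer_mean_impute(X, peer_map, default_peers="none", fallback="mean")
unfilled = peer_mean_impute(X, peer_map, default_peers="none", fallback="none")
```

With default_peers="none", GBP is not in peer_map and so receives no peer-based fill; its missing value is then handled only by the fallback.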
- class EstimatorImputer(estimator=None, fallback='mean', missing_values=nan, nan_threshold=1.0, complete_rows_only=True, predictor_fill_value='mean')[source]#
Bases:
BaseImputer
Impute missing values using a per-feature sklearn-compatible estimator trained on the remaining features at fit time.
For each feature with missing values, a clone of the provided estimator is trained using all other features as predictors, on rows where the target feature is observed. At transform time the learned model fills in missing values in that feature.
- Parameters:
estimator (BaseEstimator or None, default=None) – Any sklearn-compatible estimator (e.g. RandomForestRegressor, LinearRegression, Pipeline). If None, defaults to RandomForestRegressor().
fallback ({"mean", "zero", "none"}, default="mean") – Strategy for handling values still missing after model-based imputation: - “mean”: fill with column means - “zero”: fill with zeros - “none”: leave remaining NaNs in place
missing_values (scalar, default=np.nan) – Value to treat as missing (converted to np.nan internally).
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.
complete_rows_only (bool, default=True) – If True, each per-feature model is trained only on rows where all predictor columns are also non-NaN. This allows any sklearn estimator to be used, not just those that handle NaN natively.
predictor_fill_value (str, float, int, or None, default="mean") – How to handle NaN predictor values at transform time. “mean” fills with per-column means from fit time, a numeric scalar fills with that constant, “skip” skips prediction for rows with NaN predictors (leaving them for the fallback to handle), and None applies no fill.
- feature_names_in_#
Names of features seen during fit.
- Type:
ndarray of shape (n_features_in_,)
- missing_fraction_by_col_#
Fraction of missing values for each column.
- Type:
pd.Series
- missing_fraction_by_cid_and_col_#
Fraction of missing values for each column split by cid.
- Type:
pd.DataFrame
- models_#
Mapping from feature name -> fitted estimator. Only populated for features that had at least one missing value during fit and had enough observed rows to train a model.
- predictor_means_#
Column means of the kept features (computed on training data, used to fill missing predictor values before prediction).
- Type:
pd.Series
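The per-feature procedure described above can be sketched with scikit-learn as follows. This is a simplified illustration, not the class's code: the helpers fit_feature_models and impute are hypothetical, LinearRegression stands in for the default RandomForestRegressor so the result is deterministic, and only the complete_rows_only=True and predictor_fill_value="mean" behaviours are mimicked.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression


def fit_feature_models(X, estimator):
    """Hypothetical helper: one cloned model per feature that has missing values."""
    models = {}
    for target in X.columns:
        if not X[target].isna().any():
            continue
        predictors = [c for c in X.columns if c != target]
        # Train only on rows where the target and all predictors are observed.
        complete = X.dropna()
        if len(complete) < 2:
            continue  # not enough observed rows to train a model
        models[target] = clone(estimator).fit(complete[predictors], complete[target])
    return models, X.mean()  # column means are reused to fill NaN predictors


def impute(X, models, predictor_means):
    """Hypothetical helper: predict missing targets from the other features."""
    out = X.copy()
    for target, model in models.items():
        missing = X[target].isna()
        if not missing.any():
            continue
        predictors = [c for c in X.columns if c != target]
        P = X.loc[missing, predictors].fillna(predictor_means[predictors])
        out.loc[missing, target] = model.predict(P)
    return out


# Toy data where b = 2 * a + 1 exactly, so the linear model recovers the gap.
X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [3.0, 5.0, np.nan, 9.0]})
models, predictor_means = fit_feature_models(X, LinearRegression())
X_imputed = impute(X, models, predictor_means)
```

Only b gets a model here, because a has no missing values; the missing entry of b is predicted from a = 3.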
- class GaussianConditionalImputer(fallback='mean', missing_values=nan, nan_threshold=1.0)[source]#
Bases:
BaseImputer
Impute missing values using the closed-form Gaussian conditional mean.
For each row with missing values, the imputer partitions the feature vector into observed (o) and missing (m) components and computes:
mu_{m|o} = mu_m + Sigma_{mo} @ Sigma_{oo}^{-1} @ (x_o - mu_o)
A single global Gaussian (mean + Ledoit-Wolf covariance) is fitted on all complete rows across all cross-section identifiers.
- Parameters:
fallback ({"mean", "zero", "none"}, default="mean") – Strategy for any values still missing after conditional imputation: - “mean”: fill with column means - “zero”: fill with zeros - “none”: leave remaining NaNs in place
missing_values (scalar, default=np.nan) – Value to treat as missing (converted to np.nan internally).
nan_threshold (float, default=1.0) – If the proportion of NaNs in a column exceeds this threshold, the column is dropped entirely.
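The conditional-mean formula above can be checked numerically with a small NumPy sketch. The helper conditional_mean_impute is hypothetical, and mu and Sigma are supplied by hand here purely for illustration, whereas the class estimates them (mean plus Ledoit-Wolf covariance) from the complete rows.

```python
import numpy as np


def conditional_mean_impute(X, mu, Sigma):
    """Hypothetical helper: fill NaNs row by row with mu_{m|o}."""
    X = np.array(X, dtype=float)  # work on a copy
    for i in range(X.shape[0]):
        m = np.isnan(X[i])  # missing components
        if not m.any() or m.all():
            continue  # nothing to impute, or nothing observed to condition on
        o = ~m  # observed components
        S_oo = Sigma[np.ix_(o, o)]
        S_mo = Sigma[np.ix_(m, o)]
        # mu_m + Sigma_mo @ Sigma_oo^{-1} @ (x_o - mu_o), via a linear solve
        X[i, m] = mu[m] + S_mo @ np.linalg.solve(S_oo, X[i, o] - mu[o])
    return X


# Hand-picked parameters for a two-feature Gaussian (illustrative only).
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])

X = np.array([[2.0, np.nan], [np.nan, 3.0], [1.0, 1.0]])
X_imputed = conditional_mean_impute(X, mu, Sigma)
```

For the first row, the missing second feature becomes 0 + 0.5 / 1.0 * 2.0 = 1.0; for the second row, the missing first feature becomes 0 + 0.5 / 2.0 * 3.0 = 0.75. Rows with no observed component are left untouched, which is where a fallback strategy would apply.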