macrosynergy.learning.forecasting.nn#

class MLPRegressor(n_latent=32, fit_encoder_intercept=True, fit_head_intercept=True, encoder_activation='relu', head_activation='identity', dropout_p=0, torch_model=None, loss_func=MSELoss(), optimizer='AdamW', scheduler=None, batch_size=32, learning_rate=0.0003, weight_decay=0.0001, reg_turnover=0, use_ts_sampler=True, aggregate_last=True, drop_last=False, epochs=10000, patience=1000, train_pct=0.7, x_scaler=StandardScaler(with_mean=False), y_scaler=StandardScaler(with_mean=False), verbose=False, random_state=42, inverse_transform_preds=False, min_samples=36)[source]#

Bases: BaseEstimator, RegressorMixin

Scikit-learn compatible multi-layer perceptron, implemented in PyTorch.

Parameters:
  • n_latent (Union[int, List[int]], optional) – Number of hidden units in the latent layer(s) of the MLP. If an integer is provided, the MLP will have a single hidden layer with n_latent units. If a list of integers is provided, the MLP will have multiple hidden layers with the number of units in each layer specified by the corresponding element in the list. If provided, all (n_latent, fit_encoder_intercept, fit_head_intercept, encoder_activation, head_activation) must be specified and torch_model must be None. Default is 32.

  • fit_encoder_intercept (bool, optional) – Whether to include an intercept (bias term) in the encoder layers of the MLP. If provided, all (n_latent, fit_encoder_intercept, fit_head_intercept, encoder_activation, head_activation) must be specified and torch_model must be None. Default is True.

  • fit_head_intercept (bool, optional) – Whether to include an intercept (bias term) in the output layer of the MLP. If provided, all (n_latent, fit_encoder_intercept, fit_head_intercept, encoder_activation, head_activation) must be specified and torch_model must be None. Default is True.

  • encoder_activation (str, optional) – Activation function for the encoder (hidden) component of the network. If provided, all (n_latent, fit_encoder_intercept, fit_head_intercept, encoder_activation, head_activation) must be specified and torch_model must be None. Default is “relu”. Must be one of “tanh”, “relu”, or “sigmoid”.

  • head_activation (str, optional) – Activation function for the head (output) component of the network. If provided, all (n_latent, fit_encoder_intercept, fit_head_intercept, encoder_activation, head_activation) must be specified and torch_model must be None. Default is “identity”. Must be one of “tanh”, “relu”, “sigmoid”, or “identity”.

  • torch_model (Intersection[torch.nn.Module, BaseEstimator], optional) – Custom PyTorch model to use instead of the default MLP. Must be a subclass of both torch.nn.Module and sklearn.base.BaseEstimator. If torch_model is provided, all parameters (n_latent, fit_encoder_intercept, fit_head_intercept, encoder_activation, head_activation) must be None. Default is None.

  • loss_func (torch.nn.Module, optional) – Loss function used during training. Must be a subclass of torch.nn.Module. Default is nn.MSELoss().

  • optimizer (Union[str, List[str]], optional) – Optimizer(s) used during training. If a single string is provided, it specifies the optimizer used in backpropagation. If a list of strings is provided, each string specifies an optimizer to be used in a separate training run, forming a neural network ensemble. Currently supported optimizers are “AdamW”, “SGD”, and “SGD+mom”. Default is “AdamW”.

  • scheduler (Optional[str], optional) – Learning rate scheduler used during training. Currently supported schedulers are “OneCycleLR” and None. Default is None.

  • batch_size (int, optional) – Batch size used during training. Default is 32.

  • learning_rate (float, optional) – Learning rate used by the optimizer. Default is 3e-4.

  • weight_decay (float, optional) – Weight decay used by the optimizer. Default is 1e-4.

  • reg_turnover (float, optional) – L2 regularization strength for the turnover in model outputs. Default is 0.

  • use_ts_sampler (bool, optional) – Whether to use a time-series aware batch sampler during training. Default is True.

  • aggregate_last (bool, optional) – When using time-series batch sampling, whether or not to aggregate the last batch into the previous batch if it is smaller than the specified batch size. When True, drop_last must be False. Default is True.

  • drop_last (bool, optional) – When using time-series batch sampling, whether or not to drop the last batch if it is smaller than the specified batch size. When True, aggregate_last must be False. Default is False.

  • epochs (int, optional) – Maximum number of training epochs. Default is 10000.

  • patience (int, optional) – Number of epochs to wait for improvement before early stopping. Default is 1000.

  • train_pct (float, optional) – Fraction of samples used for training (remainder used for validation). This is needed for the early stopping process. Default is 0.7.

  • x_scaler (Optional[TransformerMixin], optional) – Scaler for the input features. Must be a subclass of sklearn’s TransformerMixin. This can also be set to None. Default is StandardScaler(with_mean=False).

  • y_scaler (Optional[TransformerMixin], optional) – Scaler for the target values. Must be a subclass of sklearn’s TransformerMixin. This can also be set to None. Default is StandardScaler(with_mean=False).

  • verbose (bool, optional) – Whether to print training diagnostics during training. Default is False.

  • random_state (Union[int, List[int]], optional) – Random seed(s) used for PyTorch initialization and training. If multiple seeds are specified, then a neural network ensemble will be trained with each seed in the list. Default is 42.

  • inverse_transform_preds (bool, optional) – Whether to inverse-transform predictions back to the original target scale using the fitted target scaler. Default is False.

  • min_samples (int, optional) – Minimum number of samples for an asset to have a head in the neural network. Default is 36.

Notes

A neural network is a parametric model that, given a collection of input features, learns a mapping to target values by passing the feature set through “neurons”, which are themselves the composition of a linear transformation and a non-linear ‘activation function’. The output of these neurons should be interpreted as latent factors. These neuron outputs can then be passed through further neurons, and so on, until the final ‘layer’ of neurons that produces the model predictions. The parameters of the linear transformations are learned during training. This is the basic structure of a neural network, with other types of neural network building upon this to handle sequential data/images/videos more efficiently.

When the input dataset is tabular, with each sample consisting of a set of features and a target value, and each treated as independent, then the model defined by mapping the input features to a layer of latent factors via neurons, then (possibly) to another layer of latent factors, and so on, until the final layer of neurons that maps to the target value(s), is called a multi-layer perceptron (MLP).
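The layered structure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the PyTorch model that MLPRegressor builds: the helper names (dense, mlp_forward), the Gaussian initialization and the layer sizes are all assumptions made for the example.

```python
import random


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def dense(inputs, weights, biases, activation):
    """One layer of neurons: a linear transformation followed by an activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(x, layers):
    """Pass the features through each (weights, biases, activation) layer in turn."""
    out = x
    for weights, biases, activation in layers:
        out = dense(out, weights, biases, activation)
    return out


# Hypothetical sizes: 3 input features, one hidden layer of 4 latent factors.
rng = random.Random(42)
n_inputs, n_latent = 3, 4
encoder = (
    [[rng.gauss(0.0, 0.5) for _ in range(n_inputs)] for _ in range(n_latent)],
    [0.0] * n_latent,
    relu,
)
head = ([[rng.gauss(0.0, 0.5) for _ in range(n_latent)]], [0.0], identity)

prediction = mlp_forward([0.1, -0.2, 0.3], [encoder, head])
print(prediction)  # a single (untrained) prediction
```

Adding further (weights, biases, activation) tuples between the encoder and the head gives a deeper network, which is the analogue of passing a list of integers as n_latent.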

Learning corresponds to estimating the optimal parameters of the neural network. Optimality refers to the suitability of the parameters for the forecasting task at hand, which is quantified by a loss function. MLPRegressor expects a PyTorch-compatible loss function to be provided, which inherits from torch.nn.Module and has a forward method that takes in the model predictions and the true target values and outputs a scalar loss value. The default loss function is mean squared error. Practically optimizing the parameters of this network is not trivial, because unlike an OLS model (which optimizes mean squared error) the activation functions introduce non-linearity in the model, which (firstly) means that no closed-form solution exists for optimal parameters, and (secondly) means that the loss landscape is non-convex with many local minima, saddle points and generically complicated geometry. The algorithm used to train such a neural network is called ‘backpropagation’, which involves:

  1. Randomly initializing the parameters of the network

  2. Passing the input features through the network to get (initially rubbish) predictions

  3. Calculating the loss of the predictions with respect to the true target values using the specified loss function

  4. Calculating the derivative of the loss with respect to each parameter in the network, based on the data

  5. Updating the parameters in the direction that reduces the loss, with the step size determined by the learning rate and the optimizer

  6. Iterating until convergence
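The six steps can be traced in a small, self-contained sketch. It trains a one-hidden-layer tanh network on a toy target with hand-written derivatives and full-batch gradient descent. Everything here (function names, the toy data, the fixed step budget used in place of a convergence test) is an assumption for illustration; MLPRegressor itself relies on PyTorch's automatic differentiation and optimizers.

```python
import math
import random


def forward(params, x):
    """Step 2: pass one input through a tanh hidden layer and a linear head."""
    w, b, v, c = params
    hidden = [math.tanh(wj * x + bj) for wj, bj in zip(w, b)]
    return hidden, sum(vj * hj for vj, hj in zip(v, hidden)) + c


def mse(params, xs, ys):
    """Step 3: mean squared error of the predictions."""
    return sum((forward(params, x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(params, xs, ys):
    """Step 4: derivative of the loss with respect to every parameter."""
    w, b, v, c = params
    n, k = len(xs), len(w)
    gw, gb, gv, gc = [0.0] * k, [0.0] * k, [0.0] * k, 0.0
    for x, y in zip(xs, ys):
        hidden, pred = forward(params, x)
        d_pred = 2.0 * (pred - y) / n  # dLoss/dPrediction for this sample
        gc += d_pred
        for j in range(k):
            # Chain rule through tanh: tanh'(z) = 1 - tanh(z)^2
            d_pre = d_pred * v[j] * (1.0 - hidden[j] ** 2)
            gv[j] += d_pred * hidden[j]
            gw[j] += d_pre * x
            gb[j] += d_pre
    return gw, gb, gv, gc


def train(xs, ys, n_hidden=4, learning_rate=0.02, n_steps=500, seed=42):
    rng = random.Random(seed)
    # Step 1: random initialization of the weights.
    w = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    b = [0.0] * n_hidden
    v = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    c = 0.0
    history = [mse((w, b, v, c), xs, ys)]
    for _ in range(n_steps):  # Step 6: iterate (fixed budget, no convergence test)
        gw, gb, gv, gc = gradients((w, b, v, c), xs, ys)
        # Step 5: move each parameter against its gradient.
        w = [p - learning_rate * g for p, g in zip(w, gw)]
        b = [p - learning_rate * g for p, g in zip(b, gb)]
        v = [p - learning_rate * g for p, g in zip(v, gv)]
        c = c - learning_rate * gc
        history.append(mse((w, b, v, c), xs, ys))
    return (w, b, v, c), history


xs = [i / 10 for i in range(-10, 11)]
ys = [x ** 2 for x in xs]
params, history = train(xs, ys)
print(f"loss before: {history[0]:.4f}, after: {history[-1]:.4f}")
```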

Traditionally, the optimizer used in step 5 was stochastic gradient descent (SGD), which simply updates the parameters in the direction of the negative gradient of the loss. If one imagines a ball rolling down a hill, to get to the bottom the ball has to move in the direction of the steepest descent, which is the negative gradient. The ‘stochastic’ part means that data is provided to the network in batches, meaning that the gradient calculation is noisy. This noise is helpful for optimization because it prevents convergence to a poor minimum in the loss surface. In particular, SGD tends to converge to flatter minima in the loss surface, which are associated with better generalization performance. SGD, however, can be slow and other optimizers have been developed that can converge faster, such as SGD + momentum, or AdamW.
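As a rough sketch of the difference between the plain SGD update and SGD with momentum, the two update rules can be written as below. The function names are hypothetical, and the momentum form (velocity = momentum * velocity + gradient) follows the common convention without dampening or Nesterov correction; it is not taken from this module's source.

```python
def sgd_step(params, grads, learning_rate):
    """Plain SGD: step against the gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


def sgd_momentum_step(params, grads, velocity, learning_rate, momentum=0.9):
    """SGD with momentum: accumulate past gradients into a velocity, then step."""
    velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    params = [p - learning_rate * v for p, v in zip(params, velocity)]
    return params, velocity


# Toy loss 0.5 * p^2, whose gradient is simply p.
plain = [1.0]
heavy, velocity = [1.0], [0.0]
for _ in range(2):
    plain = sgd_step(plain, plain, learning_rate=0.1)
    heavy, velocity = sgd_momentum_step(heavy, heavy, velocity, learning_rate=0.1)

print(plain, heavy)  # momentum has travelled further after the same number of steps
```

After two steps on this toy loss the momentum variant is closer to the minimum at zero, which is the sense in which it can converge faster than plain SGD.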

The previous paragraph touches on the importance of the geometry of the loss surface for optimization and generalization. For those who are new to the world of neural networks, it likely seems that the goal is to optimize the parameters to achieve the global minimum in the loss surface. This, however, is a bad idea. The global minimum is very likely to memorise the training data and consequently generalise poorly. This is because the neural network typically has a vast number of parameters. This means that it is in fact preferable to converge to a local minimum, particularly if we can characterise certain local minima as being better than others. Indeed, we can; we prefer flatter minima rather than steep minima. Intuitively, if we converge to a steep minimum, then a small change in the underlying data leads us out of the minimum, indicating that the model is unstable and likely to generalise poorly. On the other hand, small changes in the data do not lead us out of a flat minimum, indicating that the model is stable and likely to generalise better. Certain techniques can be employed to encourage convergence to a flatter minimum, such as using a learning rate scheduler that forces a large learning rate at periods of training, allowing the model to escape steep minima, and reducing the learning rate when a favourable region of the parameter space is being explored. Small batch sizes also encourage convergence to a flatter minimum.
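A learning rate schedule of the kind described, rising to a large rate and then decaying, might look like the following. This is a simplified one-cycle-style schedule written for illustration; the function name and default factors are assumptions and it is not a reimplementation of the “OneCycleLR” scheduler that this class supports.

```python
import math


def cosine_interp(start, end, pct):
    """Move smoothly from start (pct=0) to end (pct=1) along a half cosine."""
    return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))


def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at a given step (0-based) of a simplified one-cycle schedule."""
    initial_lr = max_lr / div_factor
    final_lr = initial_lr / final_div_factor
    peak = pct_start * (total_steps - 1)  # step at which the rate is largest
    if step <= peak:
        # Warm-up: increase the rate so the optimizer can leave steep minima.
        return cosine_interp(initial_lr, max_lr, step / peak)
    # Annealing: shrink the rate to settle into the region being explored.
    return cosine_interp(max_lr, final_lr, (step - peak) / (total_steps - 1 - peak))


schedule = [one_cycle_lr(s, total_steps=101, max_lr=3e-3) for s in range(101)]
print(schedule[0], max(schedule), schedule[-1])
```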

Convergence is also complicated by the fact that indefinite training of the network leads to overfitting. Early stopping is a common regularization strategy for neural network training. The idea is to split a training set into a smaller training subset and a validation subset. The model is trained on the training subset, but at the end of each epoch (each complete pass of the training subset), it is evaluated against the validation subset. If the validation loss does not improve for a certain number of epochs, then training is stopped and the parameters from the epoch with the best validation loss are returned.
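The patience logic can be sketched on a pre-computed sequence of validation losses. The function below is a hypothetical stand-in for the bookkeeping only: in a real training loop each loss would come from evaluating the model after an epoch, and the best model state would be saved alongside the best loss.

```python
def early_stopping(valid_losses, patience):
    """Return (best_epoch, best_loss, last_epoch_run) for a stream of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    last_epoch = None
    counter = 0  # epochs since the last improvement
    for epoch, loss in enumerate(valid_losses):
        last_epoch = epoch
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, best_loss, last_epoch


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.6]
print(early_stopping(losses, patience=3))   # stops before seeing the final 0.6
print(early_stopping(losses, patience=10))  # runs every epoch
```

The example shows the trade-off behind the patience setting: a small value stops at epoch 5 and keeps the epoch-2 parameters, while a larger value would have reached the later, lower validation loss.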

In this implementation of a multilayer perceptron, the structure of the model is determined either by setting (n_latent, fit_encoder_intercept, fit_head_intercept, encoder_activation, head_activation) jointly or by providing a custom torch_model. The loss function is determined by the loss_func parameter, and the training dynamics are determined by the optimizer, scheduler, batch_size, learning_rate, weight_decay, and reg_turnover parameters. Weight decay is a regularization strategy that penalizes large weights in the network, whilst reg_turnover penalizes large changes in model outputs from one time period to the next, which is useful information when transaction cost data is incorporated in the loss function.
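One way a turnover penalty could enter the loss is as an L2 term on the change in consecutive predictions. The sketch below is an assumption about the general idea only: the exact form used by reg_turnover is not specified in this documentation, and the function name is hypothetical.

```python
def mse_with_turnover_penalty(preds, targets, reg_turnover):
    """Mean squared error plus an L2 penalty on period-to-period prediction changes.

    Assumes preds are ordered in time for a single asset.
    """
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
    changes = [(later - earlier) ** 2 for earlier, later in zip(preds, preds[1:])]
    turnover = sum(changes) / len(changes) if changes else 0.0
    return mse + reg_turnover * turnover


# Perfect predictions still incur a cost when they move a lot between periods.
print(mse_with_turnover_penalty([1.0, 2.0, 4.0], [1.0, 2.0, 4.0], reg_turnover=0.5))
```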

The usual theory for neural network training is centred around each sample within a batch being independent and identically distributed, implying that the random variables corresponding to the derivative of the loss, for a fixed set of parameters, evaluated at each sample are independent and identically distributed. This means that the average derivative over a batch is a consistent, unbiased estimate of the true gradient of the loss with respect to the parameters. On time series data, however, mixing samples from different time periods can lead to biased gradient estimates due to the presence of different regimes within a single batch, violating the assumption of samples coming from the same distribution. This confuses the learning process because the model is pulled in conflicting directions by samples drawn from different regimes, resulting in a poorly performing learning algorithm. To remedy this, we have provided the option to use a time series-aware batch sampler that ensures that each batch is comprised of samples from contiguous time periods. This should help convergence. This can be toggled on/off with the use_ts_sampler parameter.
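A contiguous batching scheme, together with the aggregate_last and drop_last behaviour described in the parameter list, can be sketched as follows. The function name is hypothetical and the sketch assumes the samples are already sorted by time; the sampler used by this class may differ in detail (for example in how batches are ordered or how panels of several assets are handled).

```python
def contiguous_batches(n_samples, batch_size, aggregate_last=True, drop_last=False):
    """Split time-ordered sample indices into batches of consecutive periods."""
    if aggregate_last and drop_last:
        raise ValueError("aggregate_last and drop_last cannot both be True")
    indices = list(range(n_samples))
    batches = [indices[i:i + batch_size] for i in range(0, n_samples, batch_size)]
    if len(batches) > 1 and len(batches[-1]) < batch_size:
        if aggregate_last:
            # Merge the short final batch into the one before it.
            short = batches.pop()
            batches[-1].extend(short)
        elif drop_last:
            # Discard the short final batch.
            batches.pop()
    return batches


print(contiguous_batches(10, 4))
print(contiguous_batches(10, 4, aggregate_last=False, drop_last=True))
print(contiguous_batches(10, 4, aggregate_last=False, drop_last=False))
```

With 10 samples and a batch size of 4, the first call merges the two leftover samples into the second batch, the second call discards them, and the third keeps them as a short batch of their own.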

Further work#

  • Implement turnover regularization

  • Custom optimizer and scheduler

  • LARS and ReduceLROnPlateau

  • Optional retraining after early stopping to avoid data waste

fit(X, y, sample_weight=None)[source]#
predict(X)[source]#
initialize_model(n_inputs, n_latent, n_outputs, encoder_activation, head_activation, fit_encoder_intercept, fit_head_intercept, dropout_p)[source]#
create_train_valid_splits(X, y, train_pct)[source]#
scale_data(X_train, X_valid, y_train, y_valid, x_scaler, y_scaler)[source]#
make_tensor_datasets(X_train_s, X_valid_s, y_train_s, y_valid_s, sample_weight)[source]#
make_dataloaders(train_dataset, valid_dataset, batch_size, use_ts_sampler, aggregate_last, drop_last)[source]#

TODO: run through aggregate last and drop last logic

make_optimizer(model, optimizer_name, learning_rate, weight_decay)[source]#
make_scheduler(optimizer, scheduler_name, epochs, steps_per_epoch)[source]#
train_model(model, train_loader, train_loader_eval, valid_loader, optimizer, scheduler, loss_func, sample_weight, sample_weight_strategy, patience, verbose)[source]#
update_es_stats(model, train_loss, valid_loss, best_score, best_state, counter, patience)[source]#
set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') MLPRegressor#

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') MLPRegressor#

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

Submodules#