macrosynergy.learning.signal_optimizer#

Class to handle the calculation of quantamental predictions based on adaptive hyperparameter and model selection.

class SignalOptimizer(inner_splitter, X, y, blacklist=None, initial_nsplits=None, threshold_ndates=None, lagged_features=True)[source]#

Bases: object

calculate_predictions(name, models, metric, hparam_grid, hparam_type='grid', min_cids=4, min_periods=36, test_size=1, max_periods=None, n_iter=10, n_jobs=-1)[source]#

Calculate, store and return sequentially optimized signals for a given process. This method implements the nested cross-validation and subsequent signal generation. The name of the process, together with models to fit, hyperparameters to search over and a metric to optimize, are provided as compulsory arguments.

Parameters:
  • name (str) – Label of signal optimization process.

  • models (Dict[str, Union[BaseEstimator, Pipeline]]) – Dictionary of sklearn predictors or pipelines.

  • metric (Callable) – A sklearn scorer object that serves as the criterion for optimization.

  • hparam_type (str) – Hyperparameter search type. This must be either “grid”, “random” or “bayes”. Default is “grid”.

  • hparam_grid (Dict[str, Dict[str, List]]) –

    Nested dictionary defining the hyperparameters to consider for each model. The outer dictionary needs keys representing the model names, which should match the keys in the models dictionary. The inner dictionary depends on the hyperparameter search type.

    If hparam_type is “grid”, the inner dictionary should have keys corresponding to the hyperparameter names and values equal to a list of hyperparameter values to search over. For example: hparam_grid = {"lasso": {"alpha": [1e-1, 1e-2, 1e-3]}, "knn": {"n_neighbors": [1, 2, 5]}}.

    If hparam_type is “random”, the inner dictionary needs keys corresponding to the hyperparameter names and values either equal to a distribution from which to sample or a list of them. For example: hparam_grid = {"lasso": {"alpha": scipy.stats.expon()}, "knn": {"n_neighbors": scipy.stats.randint(low=1, high=10)}}. Distributions must provide an rvs method for sampling (such as those from scipy.stats.distributions).

    See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html and https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html for more details.

  • min_cids (int) – Minimum number of cross-sections required for the initial training set. Default is 4.

  • min_periods (int) – Minimum number of base periods of the input data frequency required for the initial training set. Default is 36.

  • max_periods (Optional[int]) – Maximum length of each training set in units of the input data frequency. If this maximum is exceeded, the earliest periods are cut off. Default is None, which means that the full training history is considered in each iteration.

  • n_iter (Optional[int]) – Number of iterations to run for random search. Default is 10.

  • n_jobs (Optional[int]) – Number of jobs to run in parallel. Default is -1, which uses all available cores.
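The two inner-dictionary shapes described for hparam_grid can be illustrated with the standard library alone. This is only a sketch of the idea, not the library's implementation: check_keys, expand_grid, sample_random and UniformInt are hypothetical names, UniformInt merely stands in for a scipy.stats distribution by exposing an rvs method, and sampling uniformly from a list is an assumption borrowed from scikit-learn's RandomizedSearchCV.

```python
import itertools
import random


class UniformInt:
    """Hypothetical stand-in for a scipy.stats distribution: it only needs .rvs()."""

    def __init__(self, low, high, seed=0):
        self.low, self.high = low, high
        self._rng = random.Random(seed)

    def rvs(self):
        # Draw one integer in [low, high), like scipy.stats.randint(low, high).
        return self._rng.randrange(self.low, self.high)


def check_keys(models, hparam_grid):
    """The outer keys of hparam_grid should match the keys of models."""
    if set(models) != set(hparam_grid):
        raise ValueError("hparam_grid keys must match the keys of models")


def expand_grid(inner):
    """'grid' search: every combination of the listed values."""
    names = sorted(inner)
    combos = itertools.product(*(inner[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def sample_random(inner, n_iter, seed=0):
    """'random' search: n_iter draws; each value is a list or an object with .rvs()."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_iter):
        draw = {}
        for name, spec in inner.items():
            draw[name] = spec.rvs() if hasattr(spec, "rvs") else rng.choice(spec)
        draws.append(draw)
    return draws


# Placeholder strings stand in for sklearn estimators or pipelines.
models = {"lasso": "placeholder estimator", "knn": "placeholder estimator"}
grid = {"lasso": {"alpha": [1e-1, 1e-2, 1e-3]}, "knn": {"n_neighbors": [1, 2, 5]}}
check_keys(models, grid)
grid_candidates = {name: expand_grid(inner) for name, inner in grid.items()}

random_grid = {"lasso": {"alpha": [1e-1, 1e-2]}, "knn": {"n_neighbors": UniformInt(1, 10)}}
check_keys(models, random_grid)
random_candidates = {
    name: sample_random(inner, n_iter=10) for name, inner in random_grid.items()
}
```

With “grid”, the number of candidates per model is the product of the list lengths; with “random”, it is fixed by n_iter regardless of how many values or distributions are supplied.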

Note: The method produces signals for financial contract positions. They are calculated sequentially at the frequency of the input data set. Sequentially here means that the training set is expanded or rolled forward by one base period of that frequency at each step. At each step, the training set is itself split into (training, test) pairs by the splitter passed as inner_splitter. Based on this inner cross-validation, an optimal model is chosen and used to predict the targets of the next period.

Return type:

None
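The sequential expansion or rolling of the training set described in the note can be sketched over plain period indices. This is an illustrative assumption about how min_periods, test_size and max_periods interact, not the library's code: walk_forward_windows is a hypothetical helper, it ignores min_cids and the inner cross-validation, and advancing by test_size periods per step is a guess for values other than 1.

```python
def walk_forward_windows(n_periods, min_periods=36, test_size=1, max_periods=None):
    """Yield (train_indices, test_indices) over period indices 0..n_periods-1."""
    train_end = min_periods  # exclusive end of the first training window
    while train_end + test_size <= n_periods:
        train_start = 0
        if max_periods is not None:
            # Rolling window: cut off the earliest periods beyond max_periods.
            train_start = max(0, train_end - max_periods)
        train = list(range(train_start, train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test
        train_end += test_size  # move forward by the periods just predicted


# Expanding window: the full history is used at every step.
expanding = list(walk_forward_windows(n_periods=40, min_periods=36))

# Rolling window: the training set never exceeds 37 periods.
rolling = list(walk_forward_windows(n_periods=40, min_periods=36, max_periods=37))
```

In the expanding case the last training set holds 39 periods; in the rolling case it is capped at 37 and starts at index 2.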

get_optimized_signals(name=None)[source]#

Returns optimized signals for one or more processes.

Parameters:

name (Union[str, List, None]) – Label of signal optimization process. Default is all stored in the class instance.

Return <pd.DataFrame>:

Pandas dataframe in JPMaQS format of working daily predictions based on sequentially optimized models.

Return type:

DataFrame

get_optimal_models(name=None)[source]#

Returns the sequences of optimal models for one or more processes.

Parameters:

name (Union[str, List, None]) – Label of signal optimization process. Default is all stored in the class instance.

Return <pd.DataFrame>:

Pandas dataframe of the optimal models or hyperparameters at the end of the base period in which they were determined (to be applied in the subsequent period).

Return type:

DataFrame

get_selected_features(name=None)[source]#

Returns the selected features over time for one or more processes.

Parameters:

name (Union[str, List, None]) – Label of signal optimization process. Default is all stored in the class instance.

Return <pd.DataFrame>:

Pandas dataframe of the selected features over time at the end of the base period in which they were determined (to be applied in the subsequent period).

Return type:

DataFrame

feature_selection_heatmap(name, title=None, figsize=(12, 8))[source]#

Method to visualise the selected features in a scikit-learn pipeline.

Parameters:
  • name (str) – Name of the prediction model.

  • title (Optional[str]) – Title of the heatmap. Default is None. This creates a figure title of the form “Model Selection Heatmap for {name}”.

  • figsize (Optional[Tuple[Union[int, float], Union[int, float]]]) – Tuple of floats or ints denoting the figure size. Default is (12, 8).

Note: This method displays the times at which each feature was used in the learning process and used for signal generation, as a binary heatmap.
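The data behind such a binary heatmap can be sketched as a 0/1 matrix. The helper binary_selection_matrix, the feature names and the dates below are hypothetical, and placing features on rows and dates on columns is an assumption about the layout rather than something the source states.

```python
def binary_selection_matrix(selected_by_date, all_features):
    """Rows are features, columns are dates; 1 means the feature was selected."""
    dates = sorted(selected_by_date)
    matrix = [
        [1 if ftr in selected_by_date[d] else 0 for d in dates]
        for ftr in all_features
    ]
    return dates, matrix


# Hypothetical record of selected features, keyed by ISO date strings.
selected_by_date = {
    "2020-01-31": {"GROWTH", "INFL"},
    "2020-02-28": {"GROWTH"},
    "2020-03-31": {"GROWTH", "RATES"},
}
features = ["GROWTH", "INFL", "RATES"]
dates, matrix = binary_selection_matrix(selected_by_date, features)
```

Each row can then be read as the selection history of one feature across the retraining dates.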

models_heatmap(name, title=None, cap=5, figsize=(12, 8))[source]#

Visualize the optimal models used for signal calculation.

Parameters:
  • name (str) – Name of the prediction model.

  • title (Optional[str]) – Title of the heatmap. Default is None. This creates a figure title of the form “Model Selection Heatmap for {name}”.

  • cap (Optional[int]) – Maximum number of models to display. Default (and limit) is 5. The chosen models are the ‘cap’ most frequently occurring in the pipeline.

  • figsize (Optional[Tuple[Union[int, float], Union[int, float]]]) – Tuple of floats or ints denoting the figure size. Default is (12, 8).

Note: This method displays the times at which each model in a learning process has been optimal and used for signal generation, as a binary heatmap.
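The effect of cap can be illustrated with a frequency count followed by a binary matrix for the retained models. This is a sketch under assumptions: most_frequent_models and binary_model_matrix are hypothetical helpers, the model labels and dates are invented, and how the library breaks ties between equally frequent models is not stated in the source.

```python
from collections import Counter


def most_frequent_models(chosen_by_date, cap=5):
    """Return the `cap` model names that were optimal most often."""
    counts = Counter(chosen_by_date.values())
    return [name for name, _ in counts.most_common(cap)]


def binary_model_matrix(chosen_by_date, model_names):
    """Rows are models, columns are dates; 1 marks the model chosen on that date."""
    dates = sorted(chosen_by_date)
    return [
        [1 if chosen_by_date[d] == model else 0 for d in dates]
        for model in model_names
    ]


# Hypothetical record of which model won the inner cross-validation on each date.
chosen_by_date = {
    "2020-01-31": "lasso",
    "2020-02-28": "knn",
    "2020-03-31": "lasso",
    "2020-04-30": "ols",
    "2020-05-29": "lasso",
    "2020-06-30": "knn",
}
top_models = most_frequent_models(chosen_by_date, cap=2)
heatmap_rows = binary_model_matrix(chosen_by_date, top_models)
```

With cap=2, the model that was optimal only once ("ols") is dropped, so its date shows no selected model in the matrix.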

get_ftr_coefficients(name)[source]#

Method to return the feature coefficients for a given pipeline.

Parameters:

name – Name of the pipeline.

Return <pd.DataFrame>:

Pandas dataframe of the changing feature coefficients over time for the specified pipeline.

get_intercepts(name)[source]#

Method to return the intercepts for a given pipeline.

Parameters:

name – Name of the pipeline.

Return <pd.DataFrame>:

Pandas dataframe of the changing intercepts over time for the specified pipeline.

get_parameter_stats(name, include_intercept=False)[source]#

Function to return the means and standard deviations of linear model feature coefficients and intercepts (if available) for a given pipeline.

Parameters:
  • name – Name of the pipeline.

  • include_intercept – Whether to include the intercepts in the output. Default is False.

Returns:

Tuple of means and standard deviations of feature coefficients and intercepts (if chosen) for the specified pipeline.
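The summary statistics described for get_parameter_stats can be sketched over plain lists of coefficient values. The function parameter_stats and its dictionary-based return shape are hypothetical, and using the sample standard deviation is an assumption, since the source does not say which estimator is applied.

```python
import statistics


def parameter_stats(coefs, intercepts=None, include_intercept=False):
    """Return (means, stds) dictionaries keyed by feature name."""
    means = {ftr: statistics.mean(path) for ftr, path in coefs.items()}
    # Sample standard deviation is an assumption; the source does not specify it.
    stds = {ftr: statistics.stdev(path) for ftr, path in coefs.items()}
    if include_intercept and intercepts is not None:
        means["intercept"] = statistics.mean(intercepts)
        stds["intercept"] = statistics.stdev(intercepts)
    return means, stds


# Hypothetical coefficient paths over three retraining dates.
coefs = {"GROWTH": [0.5, 0.7, 0.9], "INFL": [-0.2, -0.2, -0.2]}
intercepts = [0.1, 0.2, 0.3]
means, stds = parameter_stats(coefs, intercepts, include_intercept=True)
```

A standard deviation near zero, as for "INFL" here, indicates a coefficient that barely changed across retraining dates.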

coefs_timeplot(name, ftrs=None, title=None, ftrs_renamed=None, figsize=(10, 6))[source]#

Function to plot the time series of feature coefficients for a given pipeline. At most, 10 feature coefficient paths can be plotted at once. If more than 10 features were involved in the learning procedure, the default is to plot the first 10 features in the order specified during training. By specifying a ftrs list (which can be no longer than 10 elements in length), this default behaviour can be overridden.

Parameters:
  • name (str) – Name of the pipeline.

  • ftrs (List[str]) – List of feature names to plot. Default is None.

  • title (str) – Title of the plot. Default is None. This creates a figure title of the form “Feature coefficients for pipeline: {name}”.

  • ftrs_renamed (dict) – Dictionary to rename the feature names for visualisation in the plot legend. Default is None, which uses the original feature names.

  • figsize (Tuple[Union[int, float], Union[int, float]]) – Tuple of floats or ints denoting the figure size.

Returns:

Time series plot of feature coefficients for the given pipeline.
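The feature-selection rule described for this plot (first 10 features by default, an optional ftrs override of at most 10 names, and optional renaming for the legend) can be sketched as follows. The helper features_to_plot is hypothetical, and raising a ValueError for an over-long ftrs list is an assumption about how the limit is enforced.

```python
MAX_FEATURES = 10  # limit stated in the method description


def features_to_plot(trained_features, ftrs=None, ftrs_renamed=None):
    """Return (feature, legend_label) pairs for the coefficient paths to draw."""
    if ftrs is None:
        # Default: the first 10 features in the order used during training.
        chosen = list(trained_features[:MAX_FEATURES])
    else:
        if len(ftrs) > MAX_FEATURES:
            raise ValueError(f"ftrs can hold at most {MAX_FEATURES} features")
        chosen = list(ftrs)
    renamed = ftrs_renamed or {}
    # Fall back to the original name when no replacement label is given.
    return [(ftr, renamed.get(ftr, ftr)) for ftr in chosen]


trained = [f"F{i}" for i in range(12)]  # hypothetical feature names
default_choice = features_to_plot(trained)
custom_choice = features_to_plot(
    trained, ftrs=["F11", "F3"], ftrs_renamed={"F11": "Inflation"}
)
```

Passing ftrs is the only way to reach features beyond the tenth, such as "F11" in this example.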

intercepts_timeplot(name, title=None, figsize=(10, 6))[source]#

Function to plot the time series of intercepts for a given pipeline.

Parameters:
  • name – Name of the pipeline.

  • title – Title of the plot. Default is None. This creates a figure title of the form “Intercepts for pipeline: {name}”.

  • figsize – Tuple of floats or ints denoting the figure size.

Returns:

Time series plot of intercepts for the given pipeline.

coefs_stackedbarplot(name, ftrs=None, title=None, ftrs_renamed=None, figsize=(10, 6))[source]#

Function to create a stacked bar plot of feature coefficients for a given pipeline. At most, 10 feature coefficients can be considered in the plot. If more than 10 features were involved in the learning procedure, the default is to plot the first 10 features in the order specified during training. By specifying a ftrs list (which can be no longer than 10 elements in length), this default behaviour can be overridden.

Parameters:
  • name (str) – Name of the pipeline.

  • ftrs (List[str]) – List of feature names to plot. Default is None.

  • title (str) – Title of the plot. Default is None. This creates a figure title of the form “Stacked bar plot of model coefficients: {name}”.

  • ftrs_renamed (dict) – Dictionary to rename the feature names for visualisation in the plot legend. Default is None, which uses the original feature names.

  • figsize – Tuple of floats or ints denoting the figure size.

Returns:

Stacked bar plot of feature coefficients for the given pipeline.

nsplits_timeplot(name, title=None, figsize=(10, 6))[source]#

Method to plot the time series for the number of cross-validation splits used by the signal optimizer.

Parameters:
  • name – Name of the pipeline.

  • title – Title of the plot. Default is None. This creates a figure title of the form “Number of CV splits for pipeline: {name}”.

  • figsize – Tuple of floats or ints denoting the figure size.

Returns:

Time series plot of the number of cross-validation splits for the given pipeline.