Regressor

Regressor(dataset[, outputs, seed])

Surface learning and prediction.

Methods

Regressor.append_categorical_points(...)

Appends coordinates for the supplied categorical dim-level pairs to a tall array of continuous coordinates.

Regressor.build_model(*args, **kwargs)

Defined by subclass

Regressor.cross_validate([unit, n_train, ...])

Fits model on random subset of tidy and evaluates accuracy of predictions on remaining observations.

Regressor.fit(*args, **kwargs)

Defined by subclass

Regressor.get_conditional_prediction(...)

The conditional prediction at the given values of the specified dimensions over the remaining dimension(s).

Regressor.get_filtered_data([standardized, ...])

The portion of the dataset under consideration

Regressor.get_shaped_data([metric, dropna])

Formats input data and observations as plain numpy arrays

Regressor.get_structured_data([metric])

Formats input data and observations as parrays

Regressor.marginal_grids(*dims)

Get grids corresponding to only specified dimensions

Regressor.mvuparray(*uparrays, cor, **kwargs)

Creates an mvuparray with the current instance's stdzr attached

Regressor.parray(**kwargs)

Creates a parray with the current instance's stdzr attached

Regressor.predict(points_array[, with_noise])

Defined by subclass.

Regressor.predict_grid([output, ...])

Make predictions and reshape into grid.

Regressor.predict_points(points[, output, ...])

Make predictions at supplied points

Regressor.prepare_grid([limits, at, resolution])

Prepare unobserved input coordinates for specified continuous dimensions.

Regressor.propose(target[, acquisition])

Bayesian Optimization with Expected Improvement acquisition function

Regressor.specify_model([outputs, ...])

Checks for consistency among dimensions and levels and formats appropriately.

Regressor.uparray(name, μ, σ2, **kwargs)

Creates a uparray with the current instance's stdzr attached

Attributes

Regressor.coords

Dictionary of numerical coordinates of each level within each dimension as {dim: {level: coord}}

Regressor.dims

List of all dimensions under consideration

Regressor.levels

Dictionary of values considered within each dimension as {dim: [level1, level2]}

class gumbi.regression.Regressor(dataset: DataSet, outputs=None, seed=2021)

Bases: ABC

Surface learning and prediction.

A Regressor is built from a dataframe in the form of a DataSet object. This is stored as tidy. The model inputs are constructed by filtering this dataframe, extracting column values, and converting these to numerical input coordinates. Each subclass defines at least build_model, fit, and predict_points methods in addition to subclass-specific methods.

Dimensions fall into several categories:

  • Filter dimensions, those with only one level, are used to subset the dataframe but are not included as explicit inputs to the model. These are not specified explicitly, but rather any continuous or categorical dimension with only one level is treated as a filter dimension.

  • Continuous dimensions are treated as explicit coordinates and given a Radial Basis Function kernel

    • Linear dimensions (which must be a subset of continuous_dims) have an additional linear kernel.

  • Coregion dimensions imply a distinct but correlated output for each level

    • If more than one output is specified, self.out_col is treated as a categorical dim.
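The level-to-coordinate bookkeeping described above can be sketched in plain Python (dimension and level names here are hypothetical, and this is an illustration of the `{dim: {level: coord}}` layout, not gumbi's internal code):

```python
# Hypothetical categorical dimensions and their levels.
levels = {"Strain": ["WT", "KO"], "Media": ["LB", "M9"]}

# Assign each level within a dimension an integer coordinate,
# yielding the {dim: {level: coord}} structure used throughout.
coords = {dim: {level: i for i, level in enumerate(lvls)}
          for dim, lvls in levels.items()}

print(coords)
# e.g. {'Strain': {'WT': 0, 'KO': 1}, 'Media': {'LB': 0, 'M9': 1}}
```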

Parameters:
  • dataset (DataSet) – Data for fitting.

  • outputs (str or list of str, default None) – Name(s) of output(s) to learn. If None, uses all values from outputs attribute of dataset.

  • seed (int) – Random seed

data

Data for fitting.

Type:

DataSet

outputs

Name(s) of output(s) to learn.

Type:

list of str, optional

seed

Random seed

Type:

int

continuous_dims

Columns of dataframe used as continuous dimensions

Type:

list of str

linear_dims

Subset of continuous dimensions to apply an additional linear kernel.

Type:

list of str

continuous_levels

Values considered within each continuous column as {dim: [level1, level2]}

Type:

dict

continuous_coords

Numerical coordinates of each continuous level within each continuous dimension as {dim: {level: coord}}

Type:

dict

categorical_dims

Columns of dataframe used as categorical dimensions

Type:

list of str

categorical_levels

Values considered within each categorical column as {dim: [level1, level2]}

Type:

dict

categorical_coords

Numerical coordinates of each categorical level within each categorical dimension as {dim: {level: coord}}

Type:

dict

additive

Whether to treat categorical dimensions as additive or joint

Type:

bool

filter_dims

Dictionary of column-value pairs used to filter dataset before fitting

Type:

dict

X

A 2D tall array of input coordinates.

Type:

array

y

A 1D vector of observations

Type:

array

append_categorical_points(continuous_parray, categorical_levels)

Appends coordinates for the supplied categorical dim-level pairs to a tall array of continuous coordinates.

Parameters:
  • continuous_parray (ParameterArray) – Tall ParameterArray of coordinates, one layer per continuous dimension

  • categorical_levels (dict) – Single level for each categorical_dims at which to make prediction

Returns:

points – Tall ParameterArray of coordinates, one layer per continuous and categorical dimension

Return type:

ParameterArray
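The underlying idea can be sketched with plain NumPy (a stand-in for gumbi's ParameterArray; the dimension names are hypothetical): each categorical dim-level pair contributes one constant coordinate, broadcast down the tall axis and appended as a column.

```python
import numpy as np

# Tall (n, 1) array of continuous coordinates.
continuous = np.linspace(0.0, 1.0, 5).reshape(-1, 1)

# One numerical coordinate per categorical dimension (hypothetical levels).
categorical_coords = {"Strain": 1.0, "Media": 0.0}

# Repeat each categorical coordinate down the tall axis, append as columns.
cat_cols = np.column_stack([np.full(len(continuous), c)
                            for c in categorical_coords.values()])
points = np.hstack([continuous, cat_cols])
print(points.shape)  # (5, 3): one column per continuous + categorical dim
```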

abstract build_model(*args, **kwargs)

Defined by subclass

See also

GP.build_model()

GLM.build_model()

property coords: dict

Dictionary of numerical coordinates of each level within each dimension as {dim: {level: coord}}

cross_validate(unit=None, *, n_train=None, pct_train=None, train_only=None, warm_start=True, seed=None, errors='natural', **MAP_kws)

Fits model on random subset of tidy and evaluates accuracy of predictions on remaining observations.

This method finds unique combinations of values in the columns specified by dims, takes a random subset of these for training, and evaluates the predictions made for the remaining observations.

Notes

cross_validate() is reproducibly random by default. In order to evaluate different test/train subsets of the same size, you will need to set the seed explicitly.

Specifying unit changes the interpretation of n_train and pct_train: rather than the number or fraction of all individual observations to be included in the training set, these now represent the number of distinct entities in the unit column from the wide-form dataset.

Criteria in train_only are enforced before grouping observations by unit. If train_only and unit are both specified, but the train_only criteria encompass only some observations of a given entity in unit, this could lead to unexpected behavior.

Similarly, if warm_start and unit are both specified, but a given entity appears in multiple categories from any of the categorical_dims, this could lead to unexpected behavior. It is recommended to set warm_start to False if this is the case.

Parameters:
  • unit (list of str) – Columns from which to take unique combinations as training and testing sets. This could be useful when the data contains multiple (noisy) observations for each of several distinct entities.

  • n_train (int, optional) – Number of training points to use. Exactly one of n_train and pct_train must be specified.

  • pct_train (float, optional) – Percent of training points to use. Exactly one of n_train and pct_train must be specified.

  • train_only (dict, optional) – Specifications for observations to be always included in the training set. This will select all rows of the wide-form dataset which exactly match all criteria.

  • warm_start (bool, default True) – Whether to include a minimum of one observation for each level in each categorical_dim in the training set.

  • seed (int, optional) – Random seed

  • errors ({'natural', 'standardized', 'transformed'}) – “Space” in which to return prediction errors

  • **MAP_kws – Additional keyword arguments for MAP estimation

Returns:

Dictionary with nested dictionaries ‘train’ and ‘test’, both containing fields ‘data’, ‘NLPDs’, and ‘errors’. These fields contain the relevant subset of observations as a DataSet, an array of the negative log posterior densities of observations given the predictions, and an array of the natural-space difference between observations and prediction means, respectively.

Return type:

dict
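The train/test split strategy described above can be sketched with NumPy (an illustration of the idea under stated assumptions, not gumbi's implementation): enumerate the unique combinations, then draw a reproducible random subset for training and hold out the rest.

```python
import numpy as np

# Reproducible by default: a fixed seed yields the same split every run.
rng = np.random.default_rng(seed=2021)

combos = np.arange(20)   # stand-ins for unique dim-value combinations
n_train = 15             # number of combinations to train on

# Shuffle once, then partition into train and held-out test sets.
shuffled = rng.permutation(combos)
train, test = shuffled[:n_train], shuffled[n_train:]
print(len(train), len(test))  # 15 5
```

Changing the seed is what produces a different test/train subset of the same size, as the Notes above point out.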

property dims: list

List of all dimensions under consideration

abstract fit(*args, **kwargs)

Defined by subclass

See also

GP.fit()

GLM.fit()

get_conditional_prediction(**dim_values)

The conditional prediction at the given values of the specified dimensions over the remaining dimension(s).

Conditioning the prediction on specific values of m dimensions can be thought of as taking a “slice” along the remaining n dimensions.

Performs (m+n)-dimensional interpolation over the entire prediction grid for each of the mean and variance separately, then returns the interpolation evaluated at the specified values for the provided dimensions and the original values for the remaining dimensions.

Parameters:

dim_values – Keyword arguments specifying value for each dimension at which to return the conditional prediction of the remaining dimensions.

Returns:

  • conditional_grid (ParameterArray) – n-dimensional grid with n parameters (layers) at which the conditional prediction is evaluated

  • conditional_prediction (UncertainParameterArray) – n-dimensional grid of predictions conditional on the given values of the m specified dimensions
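The "slice by interpolation" idea can be sketched on a toy 2-D mean grid with plain NumPy (gumbi interpolates mean and variance separately over the full prediction grid; this sketch conditions on one hypothetical dimension via linear interpolation along its axis):

```python
import numpy as np

# Toy prediction grid over two continuous dimensions.
x = np.linspace(0, 1, 11)
y = np.linspace(0, 1, 5)
mean_grid = np.add.outer(x, y)   # (11, 5) toy prediction mean: x + y

# Condition on a specific x value by interpolating between the two
# bracketing grid rows, leaving a prediction over the remaining y axis.
x_value = 0.25
i = np.searchsorted(x, x_value)
w = (x_value - x[i - 1]) / (x[i] - x[i - 1])
conditional = (1 - w) * mean_grid[i - 1] + w * mean_grid[i]
print(conditional.shape)  # (5,): prediction over the remaining dimension
```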

get_filtered_data(standardized=False, metric='mean')

The portion of the dataset under consideration

A filter is built by comparing the values in the unstandardized dataframe with those in filter_dims, categorical_levels, and continuous_levels, then the filter is applied to the standardized or unstandardized dataframe as indicated by the standardized input argument.

Parameters:
  • standardized (bool, default False) – Whether to return a subset of the raw tidy or the centered and scaled tidy

  • metric (str, default 'mean') – Which summary statistic to return (must be a value in the Metric column)

Returns:

tidy

Return type:

pd.DataFrame

get_shaped_data(metric='mean', dropna=True)

Formats input data and observations as plain numpy arrays

Parameters:

metric (str, default 'mean') – Which summary statistic to return (must be a value in the Metric column)

Returns:

  • X (np.ndarray) – A tall matrix of input coordinates with shape (n_obs, n_dims).

  • y (np.ndarray) – A (1D) vector of observations

get_structured_data(metric='mean')

Formats input data and observations as parrays

Parameters:

metric (str, default 'mean') – Which summary statistic to return (must be a value in the Metric column)

Returns:

  • X (parray) – A multilayered column vector of input coordinates.

  • y (parray) – A multilayered (1D) vector of observations

property levels: dict

Dictionary of values considered within each dimension as {dim: [level1, level2]}

marginal_grids(*dims)

Get grids corresponding to only specified dimensions

Parameters:

*dims (str) – Named dimensions along which to extract marginal grid. Must be a subset of prediction_dims.

Returns:

*grids – Grid for each named dimension specified, in the order supplied, each with len(dims) dimensions.

Return type:

ParameterArray

mvuparray(*uparrays, cor, **kwargs) MVUncertainParameterArray

Creates an mvuparray with the current instance’s stdzr attached

parray(**kwargs) ParameterArray

Creates a parray with the current instance’s stdzr attached

abstract predict(points_array, with_noise=True, **kwargs)

Defined by subclass.

It is not recommended to call predict() directly, since it requires very specifically formatted inputs: a tall array of standardized coordinates in the same order as dims. Instead, use one of the convenience functions predict_points() or predict_grid(); these have a more intuitive input structure and format the data appropriately before calling predict().

See also

GP.predict()

GLM.predict()

Returns:

prediction_mean, prediction_var – Mean and variance of predictions at each supplied point

Return type:

list of np.ndarray
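To make the "tall array of standardized coordinates" concrete, here is a NumPy sketch (hypothetical columns; in practice gumbi's stdzr handles the standardization, and predict_points()/predict_grid() build this array for you):

```python
import numpy as np

# Tall array: one row per prediction point, one column per dimension,
# in the same order as dims. Column 0 is continuous, column 1 categorical.
raw = np.array([[10.0, 0.0],
                [20.0, 1.0],
                [30.0, 0.0]])

# Center and scale the continuous column (a stand-in for the stdzr).
mu, sigma = raw[:, 0].mean(), raw[:, 0].std()
standardized = raw.copy()
standardized[:, 0] = (raw[:, 0] - mu) / sigma
print(standardized[:, 0].round(2))  # [-1.22  0.    1.22]
```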

predict_grid(output=None, categorical_levels=None, with_noise=True, **kwargs)

Make predictions and reshape into grid.

If the model has categorical_dims, a specific level for each dimension must be specified as key-value pairs in categorical_levels.

Parameters:
  • output (str or list of str, optional) – Variable(s) for which to make predictions

  • categorical_levels (dict, optional) – Level for each categorical_dims at which to make prediction

  • with_noise (bool, default True) – Whether to incorporate aleatoric uncertainty into prediction error

Returns:

prediction – Predictions as a grid with len(continuous_dims) dimensions

Return type:

UncertainParameterArray

predict_points(points, output=None, with_noise=True, **kwargs)

Make predictions at supplied points

Parameters:
  • points (ParameterArray) – 1-D ParameterArray vector of coordinates for prediction, must have one layer per self.dims

  • output (str or list of str, optional) – Variable for which to make predictions

  • with_noise (bool, default True) – Whether to incorporate aleatoric uncertainty into prediction error

  • **kwargs – Additional keyword arguments passed to subclass-specific predict() method

Returns:

prediction – Predictions as a uparray

Return type:

UncertainParameterArray

prepare_grid(limits=None, at=None, resolution=100)

Prepare unobserved input coordinates for specified continuous dimensions.

Parameters:
  • limits (ParameterArray) – List of min/max values as a single parray with one layer for each of a subset of continuous_dims.

  • at (ParameterArray) – A single parray of length 1 with one layer for each remaining continuous_dims by name.

  • resolution (dict or int, default 100) – Number of points along each dimension, either as a dictionary of per-dimension values or a single value applied to all dimensions
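The grid construction can be sketched with NumPy (dimension names and limits are hypothetical; gumbi returns a parray-based grid rather than a raw array):

```python
import numpy as np

# Hypothetical min/max limits and per-dimension resolutions.
limits = {"X": (0.0, 1.0), "Y": (-1.0, 1.0)}
resolution = {"X": 5, "Y": 3}

# One evenly spaced axis per continuous dimension, combined into a grid
# with one array axis per dimension plus a trailing coordinate layer.
axes = [np.linspace(lo, hi, resolution[dim])
        for dim, (lo, hi) in limits.items()]
grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
print(grid.shape)  # (5, 3, 2)
```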

propose(target, acquisition='EI')

Bayesian Optimization with Expected Improvement acquisition function
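The Expected Improvement acquisition can be sketched as follows (the standard EI formula for maximization, not necessarily gumbi's exact implementation; the exploration parameter xi is an assumption):

```python
import math
import numpy as np

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI = (mu - best - xi) * Phi(z) + sigma * phi(z), z = (mu - best - xi) / sigma."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    improve = mu - best - xi
    # Guard against zero predictive uncertainty before dividing.
    z = np.where(sigma > 0, improve / np.where(sigma > 0, sigma, 1.0), 0.0)
    phi = np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)           # std normal pdf
    Phi = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))   # std normal cdf
    ei = improve * Phi + sigma * phi
    return np.where(sigma > 0, ei, 0.0)

# Two candidate points with equal uncertainty; the one whose predicted
# mean exceeds the incumbent best scores much higher.
ei = expected_improvement(mu=[0.5, 1.2], sigma=[0.3, 0.3], best=1.0)
print(ei.argmax())
```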

specify_model(outputs=None, linear_dims=None, continuous_dims=None, continuous_levels=None, continuous_coords=None, categorical_dims=None, categorical_levels=None, additive=False)

Checks for consistency among dimensions and levels and formats appropriately.

Parameters:
  • outputs (str or list of str, default None) – Name(s) of output(s) to learn. If None, self.outputs is used.

  • linear_dims (str or list of str, optional) – Subset of continuous dimensions to apply an additional linear kernel. If None, defaults to ['Y','X'].

  • continuous_dims (str or list of str, optional) – Columns of dataframe used as continuous dimensions

  • continuous_levels (str, list, or dict, optional) – Values considered within each continuous column as {dim: [level1, level2]}

  • continuous_coords (list or dict, optional) – Numerical coordinates of each continuous level within each continuous dimension as {dim: {level: coord}}

  • categorical_dims (str or list of str, optional) – Columns of dataframe used as categorical dimensions

  • categorical_levels (str, list, or dict, optional) – Values considered within each categorical column as {dim: [level1, level2]}

  • additive (bool, default False) – Whether to treat categorical_dims as additive or joint (default)

Returns:

self

Return type:

GP

uparray(name: str, μ: ndarray, σ2: ndarray, **kwargs) UncertainParameterArray

Creates a uparray with the current instance’s stdzr attached