Regressor

Regressor(dataset[, outputs, seed])

Surface learning and prediction.

Methods

Regressor.append_categorical_points(...)

Appends coordinates for the supplied categorical dim-level pairs to a tall array of continuous coordinates.

Regressor.build_model(*args, **kwargs)

Defined by subclass

Regressor.cross_validate([unit, n_train, ...])

Fits model on random subset of tidy and evaluates accuracy of predictions on remaining observations.

Regressor.fit(*args, **kwargs)

Defined by subclass

Regressor.get_conditional_prediction(...)

The conditional prediction at the given values of the specified dimensions over the remaining dimension(s).

Regressor.get_filtered_data([standardized, ...])

The portion of the dataset under consideration

Regressor.get_shaped_data([metric, dropna])

Formats input data and observations as plain numpy arrays

Regressor.get_structured_data([metric])

Formats input data and observations as parrays

Regressor.marginal_grids(*dims)

Get grids corresponding to only specified dimensions

Regressor.mvuparray(*uparrays, cor, **kwargs)

Creates an mvuparray with the current instance's stdzr attached

Regressor.parray(**kwargs)

Creates a parray with the current instance's stdzr attached

Regressor.predict(points_array[, with_noise])

Defined by subclass.

Regressor.predict_grid([output, ...])

Make predictions and reshape into grid.

Regressor.predict_points(points[, output, ...])

Make predictions at supplied points

Regressor.prepare_grid([limits, at, resolution])

Prepare unobserved input coordinates for specified continuous dimensions.

Regressor.propose(target[, acquisition])

Bayesian Optimization with Expected Improvement acquisition function

Regressor.specify_model([outputs, ...])

Checks for consistency among dimensions and levels and formats appropriately.

Regressor.uparray(name, μ, σ2, **kwargs)

Creates a uparray with the current instance's stdzr attached

Attributes

Regressor.coords

Dictionary of numerical coordinates of each level within each dimension as {dim: {level: coord}}

Regressor.dims

List of all dimensions under consideration

Regressor.levels

Dictionary of values considered within each dimension as {dim: [level1, level2]}

class gumbi.regression.Regressor(dataset: DataSet, outputs=None, seed=2021)

Bases: ABC

Surface learning and prediction.

A Regressor is built from a dataframe in the form of a DataSet object. This is stored as tidy. The model inputs are constructed by filtering this dataframe, extracting column values, and converting these to numerical input coordinates. Each subclass defines at least build_model, fit, and predict_points methods in addition to subclass-specific methods.

Dimensions fall into several categories:

  • Filter dimensions, those with only one level, are used to subset the dataframe but are not included as explicit inputs to the model. These are not specified explicitly, but rather any continuous or categorical dimension with only one level is treated as a filter dimension.

  • Continuous dimensions are treated as explicit coordinates and given a Radial Basis Function kernel

    • Linear dimensions (which must be a subset of continuous_dims) have an additional linear kernel.

  • Coregion dimensions imply a distinct but correlated output for each level

    • If more than one output is specified, self.out_col is treated as a categorical dim.
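The level-to-coordinate bookkeeping described above can be sketched in plain Python (dimension and level names here are hypothetical, and this is an illustration of the `{dim: {level: coord}}` layout, not gumbi's internal code):

```python
# Hypothetical categorical dimensions and their levels.
levels = {"Strain": ["WT", "KO"], "Media": ["LB", "M9"]}

# Assign each level within a dimension an integer coordinate,
# yielding the {dim: {level: coord}} structure used throughout.
coords = {dim: {level: i for i, level in enumerate(lvls)}
          for dim, lvls in levels.items()}

print(coords)
# e.g. {'Strain': {'WT': 0, 'KO': 1}, 'Media': {'LB': 0, 'M9': 1}}
```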

Parameters:
  • dataset (DataSet) – Data for fitting.

  • outputs (str or list of str, default None) – Name(s) of output(s) to learn. If None, uses all values from outputs attribute of dataset.

  • seed (int) – Random seed

data

Data for fitting.

Type:

DataSet

outputs

Name(s) of output(s) to learn.

Type:

list of str, optional

seed

Random seed

Type:

int

continuous_dims

Columns of dataframe used as continuous dimensions

Type:

list of str

linear_dims

Subset of continuous dimensions to apply an additional linear kernel.

Type:

list of str

continuous_levels

Values considered within each continuous column as {dim: [level1, level2]}

Type:

dict

continuous_coords

Numerical coordinates of each continuous level within each continuous dimension as {dim: {level: coord}}

Type:

dict

categorical_dims

Columns of dataframe used as categorical dimensions

Type:

list of str

categorical_levels

Values considered within each categorical column as {dim: [level1, level2]}

Type:

dict

categorical_coords

Numerical coordinates of each categorical level within each categorical dimension as {dim: {level: coord}}

Type:

dict

additive

Whether to treat categorical dimensions as additive or joint

Type:

bool

filter_dims

Dictionary of column-value pairs used to filter dataset before fitting

Type:

dict

X

A 2D tall array of input coordinates.

Type:

array

y

A 1D vector of observations

Type:

array

append_categorical_points(continuous_parray, categorical_levels)

Appends coordinates for the supplied categorical dim-level pairs to a tall array of continuous coordinates.

Parameters:
  • continuous_parray (ParameterArray) – Tall ParameterArray of coordinates, one layer per continuous dimension

  • categorical_levels (dict) – Single level for each categorical_dims at which to make prediction

Returns:

points – Tall ParameterArray of coordinates, one layer per continuous and categorical dimension

Return type:

ParameterArray
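The underlying idea can be sketched with plain NumPy (a stand-in for gumbi's ParameterArray; the dimension names are hypothetical): each categorical dim-level pair contributes one constant coordinate, broadcast down the tall axis and appended as a column.

```python
import numpy as np

# Tall (n, 1) array of continuous coordinates.
continuous = np.linspace(0.0, 1.0, 5).reshape(-1, 1)

# One numerical coordinate per categorical dimension (hypothetical levels).
categorical_coords = {"Strain": 1.0, "Media": 0.0}

# Repeat each categorical coordinate down the tall axis, append as columns.
cat_cols = np.column_stack([np.full(len(continuous), c)
                            for c in categorical_coords.values()])
points = np.hstack([continuous, cat_cols])
print(points.shape)  # (5, 3): one column per continuous + categorical dim
```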

abstract build_model(*args, **kwargs)

Defined by subclass

See also

GP.build_model()

GLM.build_model()

property coords: dict

Dictionary of numerical coordinates of each level within each dimension as {dim: {level: coord}}

cross_validate(unit=None, *, n_train=None, pct_train=None, train_only=None, warm_start=True, seed=None, errors='natural', **MAP_kws)

Fits model on random subset of tidy and evaluates accuracy of predictions on remaining observations.

This method finds unique combinations of values in the columns specified by dims, takes a random subset of these for training, and evaluates the predictions made for the remaining observations.

Notes

cross_validate() is reproducibly random by default. In order to evaluate different test/train subsets of the same size, you will need to set the seed explicitly.

Specifying unit changes the interpretation of n_train and pct_train: rather than the number or fraction of all individual observations to be included in the training set, these now represent the number of distinct entities in the unit column from the wide-form dataset.

Criteria in train_only are enforced before grouping observations by unit. If train_only and unit are both specified, but the train_only criteria encompass only some observations of a given entity in unit, this could lead to unexpected behavior.

Similarly, if warm_start and unit are both specified, but a given entity appears in multiple categories from any of the categorical_dims, this could lead to unexpected behavior. It is recommended to set warm_start to False if this is the case.

Parameters:
  • unit (list of str) – Columns from which to take unique combinations as training and testing sets. This could be useful when the data contains multiple (noisy) observations for each of several distinct entities.

  • n_train (int, optional) – Number of training points to use. Exactly one of n_train and pct_train must be specified.

  • pct_train (float, optional) – Percent of training points to use. Exactly one of n_train and pct_train must be specified.

  • train_only (dict, optional) – Specifications for observations to be always included in the training set. This will select all rows of the wide-form dataset which exactly match all criteria.

  • warm_start (bool, default True) – Whether to include a minimum of one observation for each level in each categorical_dim in the training set.

  • seed (int, optional) – Random seed

  • errors ({'natural', 'standardized', 'transformed'}) – “Space” in which to return prediction errors

  • **MAP_kws – Additional keyword arguments for MAP estimation

Returns:

Dictionary with nested dictionaries ‘train’ and ‘test’, both containing fields ‘data’, ‘NLPDs’, and ‘errors’. These fields contain the relevant subset of observations as a DataSet, an array of the negative log posterior densities of observations given the predictions, and an array of the natural-space difference between observations and prediction means, respectively.

Return type:

dict
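The train/test split strategy described above can be sketched with NumPy (an illustration of the idea under stated assumptions, not gumbi's implementation): enumerate the unique combinations, then draw a reproducible random subset for training and hold out the rest.

```python
import numpy as np

# Reproducible by default: a fixed seed yields the same split every run.
rng = np.random.default_rng(seed=2021)

combos = np.arange(20)   # stand-ins for unique dim-value combinations
n_train = 15             # number of combinations to train on

# Shuffle once, then partition into train and held-out test sets.
shuffled = rng.permutation(combos)
train, test = shuffled[:n_train], shuffled[n_train:]
print(len(train), len(test))  # 15 5
```

Changing the seed is what produces a different test/train subset of the same size, as the Notes above point out.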

property dims: list

List of all dimensions under consideration

abstract fit(*args, **kwargs)

Defined by subclass

See also

GP.fit()

GLM.fit()

get_conditional_prediction(**dim_values)

The conditional prediction at the given values of the specified dimensions over the remaining dimension(s).

Conditioning the prediction on specific values of m dimensions can be thought of as taking a “slice” along the remaining n dimensions.

Performs (m+n)-dimensional interpolation over the entire prediction grid for each of the mean and variance separately, then returns the interpolation evaluated at the specified values for the provided dimensions and the original values for the remaining dimensions.

Parameters:

dim_values – Keyword arguments specifying value for each dimension at which to return the conditional prediction of the remaining dimensions.

Returns:

  • conditional_grid (ParameterArray) – n-dimensional grid with n parameters (layers) at which the conditional prediction is evaluated

  • conditional_prediction (UncertainParameterArray) – n-dimensional grid of predictions conditional on the given values of the m specified dimensions
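The "slice by interpolation" idea can be sketched on a toy 2-D mean grid with plain NumPy (gumbi interpolates mean and variance separately over the full prediction grid; this sketch conditions on one hypothetical dimension via linear interpolation along its axis):

```python
import numpy as np

# Toy prediction grid over two continuous dimensions.
x = np.linspace(0, 1, 11)
y = np.linspace(0, 1, 5)
mean_grid = np.add.outer(x, y)   # (11, 5) toy prediction mean: x + y

# Condition on a specific x value by interpolating between the two
# bracketing grid rows, leaving a prediction over the remaining y axis.
x_value = 0.25
i = np.searchsorted(x, x_value)
w = (x_value - x[i - 1]) / (x[i] - x[i - 1])
conditional = (1 - w) * mean_grid[i - 1] + w * mean_grid[i]
print(conditional.shape)  # (5,): prediction over the remaining dimension
```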

get_filtered_data(standardized=False, metric='mean')

The portion of the dataset under consideration

A filter is built by comparing the values in the unstandardized dataframe with those in filter_dims, categorical_levels, and continuous_levels, then the filter is applied to the standardized or unstandardized dataframe as indicated by the standardized input argument.

Parameters:
  • standardized (bool, default False) – Whether to return a subset of the raw tidy or the centered and scaled tidy

  • metric (str, default 'mean') – Which summary statistic to return (must be a value in the Metric column)

Returns:

tidy

Return type:

pd.DataFrame

get_shaped_data(metric='mean', dropna=True)

Formats input data and observations as plain numpy arrays

Parameters:

metric (str, default 'mean') – Which summary statistic to return (must be a value in the Metric column)

Returns:

  • X (np.ndarray) – A tall matrix of input coordinates with shape (n_obs, n_dims).

  • y (np.ndarray) – A (1D) vector of observations

get_structured_data(metric='mean')

Formats input data and observations as parrays

Parameters:

metric (str, default 'mean') – Which summary statistic to return (must be a value in the Metric column)

Returns:

  • X (parray) – A multilayered column vector of input coordinates.

  • y (parray) – A multilayered (1D) vector of observations

property levels: dict

Dictionary of values considered within each dimension as {dim: [level1, level2]}

marginal_grids(*dims)

Get grids corresponding to only specified dimensions

Parameters:

*dims (str) – Named dimensions along which to extract marginal grid. Must be a subset of prediction_dims.

Returns:

*grids – Grid for each named dimension specified, in the order supplied, each with len(dims) dimensions.

Return type:

ParameterArray

mvuparray(*uparrays, cor, **kwargs) MVUncertainParameterArray

Creates an mvuparray with the current instance’s stdzr attached

parray(**kwargs) ParameterArray

Creates a parray with the current instance’s stdzr attached

abstract predict(points_array, with_noise=True, **kwargs)

Defined by subclass.

It is not recommended to call predict() directly, since it requires very specifically formatted inputs: a tall array of standardized coordinates in the same order as dims. Instead, use one of the convenience functions predict_points() or predict_grid(); these have a more intuitive input structure and format the data appropriately before calling predict().

See also

GP.predict()

GLM.predict()

Returns:

prediction_mean, prediction_var – Mean and variance of predictions at each supplied point

Return type:

list of np.ndarray
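To make the "tall array of standardized coordinates" concrete, here is a NumPy sketch (hypothetical columns; in practice gumbi's stdzr handles the standardization, and predict_points()/predict_grid() build this array for you):

```python
import numpy as np

# Tall array: one row per prediction point, one column per dimension,
# in the same order as dims. Column 0 is continuous, column 1 categorical.
raw = np.array([[10.0, 0.0],
                [20.0, 1.0],
                [30.0, 0.0]])

# Center and scale the continuous column (a stand-in for the stdzr).
mu, sigma = raw[:, 0].mean(), raw[:, 0].std()
standardized = raw.copy()
standardized[:, 0] = (raw[:, 0] - mu) / sigma
print(standardized[:, 0].round(2))  # [-1.22  0.    1.22]
```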

predict_grid(output=None, categorical_levels=None, with_noise=True, **kwargs)

Make predictions and reshape into grid.

If the model has categorical_dims, a specific level for each dimension must be specified as key-value pairs in categorical_levels.

Parameters:
  • output (str or list of str, optional) – Variable(s) for which to make predictions

  • categorical_levels (dict, optional) – Level for each categorical_dims at which to make prediction

  • with_noise (bool, default True) – Whether to incorporate aleatoric uncertainty into prediction error

Returns:

prediction – Predictions as a grid with len(continuous_dims) dimensions

Return type:

UncertainParameterArray

predict_points(points, output=None, with_noise=True, **kwargs)

Make predictions at supplied points

Parameters:
  • points (ParameterArray) – 1-D ParameterArray vector of coordinates for prediction, must have one layer per self.dims

  • output (str or list of str, optional) – Variable for which to make predictions

  • with_noise (bool, default True) – Whether to incorporate aleatoric uncertainty into prediction error

  • **kwargs – Additional keyword arguments passed to subclass-specific predict() method

Returns:

prediction – Predictions as a uparray

Return type:

UncertainParameterArray

prepare_grid(limits=None, at=None, resolution=100)

Prepare unobserved input coordinates for specified continuous dimensions.

Parameters:
  • limits (ParameterArray) – List of min/max values as a single parray with one layer for each of a subset of continuous_dims.

  • at (ParameterArray) – A single parray of length 1 with one layer for each remaining continuous_dims by name.

  • resolution (dict or int, default 100) – Number of points along each dimension, either as a dictionary of per-dimension values or a single value applied to all dimensions
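The grid construction can be sketched with NumPy (dimension names and limits are hypothetical; gumbi returns a parray-based grid rather than a raw array):

```python
import numpy as np

# Hypothetical min/max limits and per-dimension resolutions.
limits = {"X": (0.0, 1.0), "Y": (-1.0, 1.0)}
resolution = {"X": 5, "Y": 3}

# One evenly spaced axis per continuous dimension, combined into a grid
# with one array axis per dimension plus a trailing coordinate layer.
axes = [np.linspace(lo, hi, resolution[dim])
        for dim, (lo, hi) in limits.items()]
grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
print(grid.shape)  # (5, 3, 2)
```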

propose(target, acquisition='EI')

Bayesian Optimization with Expected Improvement acquisition function
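The Expected Improvement acquisition can be sketched as follows (the standard EI formula for maximization, not necessarily gumbi's exact implementation; the exploration parameter xi is an assumption):

```python
import math
import numpy as np

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI = (mu - best - xi) * Phi(z) + sigma * phi(z), z = (mu - best - xi) / sigma."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    improve = mu - best - xi
    # Guard against zero predictive uncertainty before dividing.
    z = np.where(sigma > 0, improve / np.where(sigma > 0, sigma, 1.0), 0.0)
    phi = np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)           # std normal pdf
    Phi = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))   # std normal cdf
    ei = improve * Phi + sigma * phi
    return np.where(sigma > 0, ei, 0.0)

# Two candidate points with equal uncertainty; the one whose predicted
# mean exceeds the incumbent best scores much higher.
ei = expected_improvement(mu=[0.5, 1.2], sigma=[0.3, 0.3], best=1.0)
print(ei.argmax())
```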

specify_model(outputs=None, linear_dims=None, continuous_dims=None, continuous_levels=None, continuous_coords=None, categorical_dims=None, categorical_levels=None, additive=False)

Checks for consistency among dimensions and levels and formats appropriately.

Parameters:
  • outputs (str or list of str, default None) – Name(s) of output(s) to learn. If None, self.outputs is used.

  • linear_dims (str or list of str, optional) – Subset of continuous dimensions to apply an additional linear kernel. If None, defaults to ['Y','X'].

  • continuous_dims (str or list of str, optional) – Columns of dataframe used as continuous dimensions

  • continuous_levels (str, list, or dict, optional) – Values considered within each continuous column as {dim: [level1, level2]}

  • continuous_coords (list or dict, optional) – Numerical coordinates of each continuous level within each continuous dimension as {dim: {level: coord}}

  • categorical_dims (str or list of str, optional) – Columns of dataframe used as categorical dimensions

  • categorical_levels (str, list, or dict, optional) – Values considered within each categorical column as {dim: [level1, level2]}

  • additive (bool, default False) – Whether to treat categorical_dims as additive or joint (default)

Returns:

self

Return type:

GP

uparray(name: str, μ: ndarray, σ2: ndarray, **kwargs) UncertainParameterArray

Creates a uparray with the current instance’s stdzr attached