Regressor

Surface learning and prediction.

Methods

- append_categorical_points — Appends coordinates for the supplied categorical dim-level pairs to a tall array of continuous coordinates.
- build_model — Defined by subclass.
- cross_validate — Fits model on a random subset of tidy and evaluates accuracy of predictions on remaining observations.
- fit — Defined by subclass.
- get_conditional_prediction — The conditional prediction at the given values of the specified dimensions over the remaining dimension(s).
- get_filtered_data — The portion of the dataset under consideration.
- get_shaped_data — Formats input data and observations as plain numpy arrays.
- get_structured_data — Formats input data and observations as parrays.
- marginal_grids — Get grids corresponding to only the specified dimensions.
- mvuparray — Creates an mvuparray with the current instance's stdzr attached.
- parray — Creates a parray with the current instance's stdzr attached.
- predict — Defined by subclass.
- predict_grid — Make predictions and reshape into a grid.
- predict_points — Make predictions at supplied points.
- prepare_grid — Prepare unobserved input coordinates for specified continuous dimensions.
- propose — Bayesian Optimization with Expected Improvement acquisition function.
- specify_model — Checks for consistency among dimensions and levels and formats appropriately.
- uparray — Creates a uparray with the current instance's stdzr attached.

Attributes

- coords — Dictionary of numerical coordinates of each level within each dimension as {dim: {level: coord}}
- dims — List of all dimensions under consideration
- levels — Dictionary of values considered within each dimension as {dim: [level1, level2]}
- class gumbi.regression.Regressor(dataset: DataSet, outputs=None, seed=2021)
Bases:
ABC
Surface learning and prediction.
A Regressor is built from a dataframe in the form of a
DataSet
object. This is stored as tidy. The model inputs are constructed by filtering this dataframe, extracting column values, and converting these to numerical input coordinates. Each subclass defines at least build_model, fit, and predict_points methods in addition to subclass-specific methods.
Dimensions fall into several categories:
- Filter dimensions, those with only one level, are used to subset the dataframe but are not included as explicit inputs to the model. These are not specified explicitly; rather, any continuous or categorical dimension with only one level is treated as a filter dimension.
- Continuous dimensions are treated as explicit coordinates and given a Radial Basis Function kernel.
- Linear dimensions (which must be a subset of continuous_dims) have an additional linear kernel.
- Coregion dimensions imply a distinct but correlated output for each level.
If more than one output is specified,
self.out_col
is treated as a categorical dim.
- Parameters:
dataset (DataSet) – Data for fitting.
outputs (str or list of str, default None) – Name(s) of output(s) to learn. If None, uses all values from the outputs attribute of dataset.
seed (int) – Random seed
- outputs
Name(s) of output(s) to learn.
- Type:
list of str, optional
- seed
Random seed
- Type:
int
- continuous_dims
Columns of dataframe used as continuous dimensions
- Type:
list of str
- linear_dims
Subset of continuous dimensions to apply an additional linear kernel.
- Type:
list of str
- continuous_levels
Values considered within each continuous column as
{dim: [level1, level2]}
- Type:
dict
- continuous_coords
Numerical coordinates of each continuous level within each continuous dimension as
{dim: {level: coord}}
- Type:
dict
- categorical_dims
Columns of dataframe used as categorical dimensions
- Type:
list of str
- categorical_levels
Values considered within each categorical column as
{dim: [level1, level2]}
- Type:
dict
- categorical_coords
Numerical coordinates of each categorical level within each categorical dimension as
{dim: {level: coord}}
- Type:
dict
- additive
Whether to treat categorical dimensions as additive or joint
- Type:
bool
- filter_dims
Dictionary of column-value pairs used to filter dataset before fitting
- Type:
dict
- X
A 2D tall array of input coordinates.
- Type:
array
- y
A 1D vector of observations
- Type:
array
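The mapping from levels to numerical coordinates described by the attributes above can be sketched in plain Python. This is only an illustration of the {dim: {level: coord}} layout, not gumbi's internal implementation; the dimension and level names are made up:

```python
# Illustrative only: assign each level of each categorical dimension an
# integer coordinate, mirroring the {dim: {level: coord}} layout of
# `categorical_coords` described above.
levels = {"Strain": ["WT", "KO"], "Media": ["LB", "M9", "TB"]}

coords = {
    dim: {level: i for i, level in enumerate(dim_levels)}
    for dim, dim_levels in levels.items()
}
# coords == {"Strain": {"WT": 0, "KO": 1}, "Media": {"LB": 0, "M9": 1, "TB": 2}}
```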
- append_categorical_points(continuous_parray, categorical_levels)
Appends coordinates for the supplied categorical dim-level pairs to tall array of continuous coordinates.
- Parameters:
continuous_points (ParameterArray) – Tall ParameterArray of coordinates, one layer per continuous dimension
categorical_levels (dict) – Single level for each of the categorical_dims at which to make predictions
- Returns:
points – Tall ParameterArray of coordinates, one layer per continuous and categorical dimension
- Return type:
ParameterArray
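The effect of appending categorical coordinates can be sketched with plain Python tuples standing in for ParameterArray layers. All names here are illustrative, not gumbi's internals:

```python
# Illustrative sketch: append a constant categorical coordinate to every row
# of a tall array of continuous coordinates, analogous in spirit to
# append_categorical_points (gumbi itself operates on ParameterArrays).
continuous_points = [(0.0,), (0.5,), (1.0,)]       # one continuous layer, e.g. "X"
categorical_levels = {"Strain": "KO"}              # single level per categorical dim
categorical_coords = {"Strain": {"WT": 0, "KO": 1}}

points = [
    row + tuple(categorical_coords[dim][level]
                for dim, level in categorical_levels.items())
    for row in continuous_points
]
# points == [(0.0, 1), (0.5, 1), (1.0, 1)]
```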
- abstract build_model(*args, **kwargs)
Defined by subclass
See also
GP.build_model()
GLM.build_model()
- property coords: dict
Dictionary of numerical coordinates of each level within each dimension as
{dim: {level: coord}}
- cross_validate(unit=None, *, n_train=None, pct_train=None, train_only=None, warm_start=True, seed=None, errors='natural', **MAP_kws)
Fits model on random subset of tidy and evaluates accuracy of predictions on remaining observations.
This method finds unique combinations of values in the columns specified by
dims
, takes a random subset of these for training, and evaluates the predictions made for the remaining tidy.
Notes
cross_validate() is reproducibly random by default. In order to evaluate different test/train subsets of the same size, you will need to set the seed explicitly.
Specifying unit changes the interpretation of n_train and pct_train: rather than the number or fraction of all individual observations to be included in the training set, these now represent the number of distinct entities in the unit column from the wide-form dataset.
Criteria in train_only are enforced before grouping observations by unit. If train_only and unit are both specified, but the train_only criteria encompass only some observations of a given entity in unit, this could lead to unexpected behavior.
Similarly, if warm_start and unit are both specified, but a given entity appears in multiple categories from any of the categorical_dims, this could lead to unexpected behavior. It is recommended to set warm_start to False if this is the case.
- Parameters:
unit (list of str) – Columns from which to take unique combinations as training and testing sets. This could be useful when the data contains multiple (noisy) observations for each of several distinct entities.
n_train (int, optional) – Number of training points to use. Exactly one of n_train and pct_train must be specified.
pct_train (float, optional) – Percent of training points to use. Exactly one of n_train and pct_train must be specified.
train_only (dict, optional) – Specifications for observations to be always included in the training set. This will select all rows of the wide-form dataset which exactly match all criteria.
warm_start (bool, default True) – Whether to include a minimum of one observation for each level in each categorical_dim in the training set.
seed (int, optional) – Random seed
errors ({'natural', 'standardized', 'transformed'}) – “Space” in which to return prediction errors
**MAP_kws – Additional keyword arguments
- Returns:
Dictionary with nested dictionaries ‘train’ and ‘test’, both containing fields ‘data’, ‘NLPDs’, and ‘errors’. These fields contain the relevant subset of observations as a DataSet, an array of the negative log posterior densities of observations given the predictions, and an array of the natural-space difference between observations and prediction means, respectively.
- Return type:
dict
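The unit-wise train/test split described above can be sketched with the standard library. This is only an illustration of the sampling idea; cross_validate itself operates on the wide-form dataset and also handles train_only and warm_start:

```python
import random

# Illustrative sketch: sample distinct entities for training and hold out
# the rest, with a fixed seed so the split is reproducible (as noted above,
# change the seed to evaluate a different subset of the same size).
units = ["A", "B", "C", "D", "E"]
rng = random.Random(2021)
n_train = 3

train_units = set(rng.sample(units, n_train))
test_units = [u for u in units if u not in train_units]

assert len(train_units) == n_train and len(test_units) == len(units) - n_train
```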
- property dims: list
List of all dimensions under consideration
- get_conditional_prediction(**dim_values)
The conditional prediction at the given values of the specified dimensions over the remaining dimension(s).
Conditioning the prediction on specific values of m dimensions can be thought of as taking a “slice” along the remaining n dimensions.
Performs (m+n)-dimensional interpolation over the entire prediction grid for each of the mean and variance separately, then returns the interpolation evaluated at the specified values for the provided dimensions and the original values for the remaining dimensions.
- Parameters:
dim_values – Keyword arguments specifying value for each dimension at which to return the conditional prediction of the remaining dimensions.
- Returns:
conditional_grid (ParameterArray) – n-dimensional grid with n parameters (layers) at which the conditional prediction is evaluated
conditional_prediction (UncertainParameterArray) – n-dimensional grid of predictions conditional on the given values of the m specified dimensions
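The "slice" interpretation can be illustrated with a minimal 2D example: fixing one dimension at a value between grid ticks and linearly interpolating leaves a 1D profile over the remaining dimension. This is a plain-Python sketch of the idea, not gumbi's interpolation routine (which handles mean and variance over the full prediction grid):

```python
# Illustrative: condition a 2D grid of predicted means on x = 0.5 by
# interpolating between the two bracketing rows, leaving a profile over y.
xs = [0.0, 1.0, 2.0]
grid = [[0.0, 1.0, 2.0],   # rows indexed by x, columns by y
        [1.0, 2.0, 3.0],
        [2.0, 3.0, 4.0]]

def slice_at(x, xs, grid):
    # locate the bracketing x ticks and interpolate each column linearly
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return [a + t * (b - a) for a, b in zip(grid[i], grid[i + 1])]
    raise ValueError("x outside grid")

profile = slice_at(0.5, xs, grid)
# profile == [0.5, 1.5, 2.5]
```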
- get_filtered_data(standardized=False, metric='mean')
The portion of the dataset under consideration
A filter is built by comparing the values in the unstandardized dataframe with those in filter_dims, categorical_levels, and continuous_levels, then the filter is applied to the standardized or unstandardized dataframe as indicated by the standardized input argument.
- Parameters:
standardized (bool, default False) – Whether to return a subset of the raw tidy or the centered and scaled tidy
metric (str, default 'mean') – Which summary statistic to return (must be a value in the Metric column)
- Returns:
tidy
- Return type:
pd.DataFrame
- get_shaped_data(metric='mean', dropna=True)
Formats input data and observations as plain numpy arrays
- Parameters:
metric (str, default 'mean') – Which summary statistic to return (must be a value in the Metric column)
- Returns:
X (np.ndarray) – A tall matrix of input coordinates with shape (n_obs, n_dims).
y (np.ndarray) – A (1D) vector of observations
- get_structured_data(metric='mean')
Formats input data and observations as parrays
- Parameters:
metric (str, default 'mean') – Which summary statistic to return (must be a value in the Metric column)
- Returns:
X (parray) – A multilayered column vector of input coordinates.
y (parray) – A multilayered (1D) vector of observations
- property levels: dict
Dictionary of values considered within each dimension as
{dim: [level1, level2]}
- marginal_grids(*dims)
Get grids corresponding to only specified dimensions
- Parameters:
*dims (str) – Named dimensions along which to extract the marginal grid. Must be a subset of prediction_dims.
- Returns:
*grids – Grid for each named dimension specified, in the order supplied, each with len(dims) dimensions.
- Return type:
- mvuparray(*uparrays, cor, **kwargs) MVUncertainParameterArray
Creates an mvuparray with the current instance’s stdzr attached
- parray(**kwargs) ParameterArray
Creates a parray with the current instance’s stdzr attached
- abstract predict(points_array, with_noise=True, **kwargs)
Defined by subclass.
It is not recommended to call predict() directly, since it requires a very specific formatting for inputs, specifically a tall array of standardized coordinates in the same order as dims. Rather, one of the convenience functions predict_points() or predict_grid() should be used, as these have a more intuitive input structure and format the tidy appropriately prior to calling predict().
See also
GP.predict()
GLM.predict()
- Returns:
prediction_mean, prediction_var – Mean and variance of predictions at each supplied points
- Return type:
list of np.ndarray
- predict_grid(output=None, categorical_levels=None, with_noise=True, **kwargs)
Make predictions and reshape into grid.
If the model has categorical_dims, a specific level for each dimension must be specified as key-value pairs in categorical_levels.
- Parameters:
output (str or list of str, optional) – Variable(s) for which to make predictions
categorical_levels (dict, optional) – Level for each of the categorical_dims at which to make predictions
with_noise (bool, default True) – Whether to incorporate aleatoric uncertainty into prediction error
- Returns:
prediction – Predictions as a grid with len(continuous_dims) dimensions
- Return type:
- predict_points(points, output=None, with_noise=True, **kwargs)
Make predictions at supplied points
- Parameters:
points (ParameterArray) – 1-D ParameterArray vector of coordinates for prediction; must have one layer per self.dims
output (str or list of str, optional) – Variable for which to make predictions
with_noise (bool, default True) – Whether to incorporate aleatoric uncertainty into prediction error
**kwargs – Additional keyword arguments passed to the subclass-specific predict() method
- Returns:
prediction – Predictions as a uparray
- Return type:
UncertainParameterArray
- prepare_grid(limits=None, at=None, resolution=100)
Prepare unobserved input coordinates for specified continuous dimensions.
- Parameters:
limits (ParameterArray) – List of min/max values as a single parray with one layer for each of a subset of continuous_dims.
at (ParameterArray) – A single parray of length 1 with one layer for each remaining continuous_dims by name.
resolution (dict or int, default 100) – Number of points along each dimension, either as a dictionary or one value applied to all dimensions
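The evenly spaced ticks implied by a min/max pair and a resolution can be sketched as follows (illustrative only; prepare_grid operates on ParameterArrays and handles multiple dimensions at once):

```python
# Illustrative: generate `resolution` evenly spaced coordinates between the
# limits of one continuous dimension, as prepare_grid does per dimension.
def linspace(lo, hi, resolution):
    step = (hi - lo) / (resolution - 1)
    return [lo + i * step for i in range(resolution)]

ticks = linspace(0.0, 10.0, 5)
# ticks == [0.0, 2.5, 5.0, 7.5, 10.0]
```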
- propose(target, acquisition='EI')
Bayesian Optimization with Expected Improvement acquisition function
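The Expected Improvement acquisition function has a standard closed form; a generic sketch for a maximization problem (not gumbi's internal implementation) is:

```python
import math

# Standard closed-form Expected Improvement for maximization:
# EI = (mu - best) * Phi(z) + sigma * phi(z), with z = (mu - best) / sigma,
# where phi and Phi are the standard normal pdf and cdf.
def expected_improvement(mu, sigma, best):
    if sigma == 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf

# Candidate with predicted mean 1.0 and sd 0.5, current best observation 0.8:
ei = expected_improvement(mu=1.0, sigma=0.5, best=0.8)
```

Candidates with higher predicted mean or higher predictive uncertainty both score higher, which is the exploration/exploitation trade-off the acquisition function encodes.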
- specify_model(outputs=None, linear_dims=None, continuous_dims=None, continuous_levels=None, continuous_coords=None, categorical_dims=None, categorical_levels=None, additive=False)
Checks for consistency among dimensions and levels and formats appropriately.
- Parameters:
outputs (str or list of str, default None) – Name(s) of output(s) to learn. If None, outputs is used.
linear_dims (str or list of str, optional) – Subset of continuous dimensions to apply an additional linear kernel. If None, defaults to ['Y','X'].
continuous_dims (str or list of str, optional) – Columns of dataframe used as continuous dimensions
continuous_levels (str, list, or dict, optional) – Values considered within each continuous column as {dim: [level1, level2]}
continuous_coords (list or dict, optional) – Numerical coordinates of each continuous level within each continuous dimension as {dim: {level: coord}}
categorical_dims (str or list of str, optional) – Columns of dataframe used as categorical dimensions
categorical_levels (str, list, or dict, optional) – Values considered within each categorical column as {dim: [level1, level2]}
additive (bool, default False) – Whether to treat categorical_dims as additive or joint (default)
- Returns:
self
- Return type:
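One such consistency check, that linear dimensions must be a subset of the continuous dimensions, can be sketched as follows (illustrative only; the dimension names are made up and specify_model performs further checks on levels and coordinates):

```python
# Illustrative consistency check: every linear dimension must also be a
# continuous dimension, since linear kernels are added on top of the RBF
# kernel over continuous coordinates.
continuous_dims = ["X", "Y", "Temperature"]
linear_dims = ["X", "Y"]

unknown = [d for d in linear_dims if d not in continuous_dims]
if unknown:
    raise ValueError(f"linear_dims not in continuous_dims: {unknown}")
```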
- uparray(name: str, μ: ndarray, σ2: ndarray, **kwargs) UncertainParameterArray
Creates a uparray with the current instance’s stdzr attached