GP

Gaussian Process surface learning and prediction.

Methods

- build_model() – Compile a marginalized pymc model for the GP.
- draw_grid_samples() – Draw posterior samples at points defined by prepare_grid().
- draw_point_samples() – Draw posterior samples at supplied points.
- find_MAP() – Finds maximum a posteriori value for hyperparameters in model.
- fit() – Fits a GP surface.
- predict() – Make predictions at supplied points using specified gp.
- sample() – Draws samples from the posterior for the hyperparameters in model.
- class gumbi.regression.GP(dataset: DataSet, outputs=None, seed=2021)
Bases: Regressor
Gaussian Process surface learning and prediction.
Notes
The GP class is built from a dataframe in the form of a DataSet object. The output(s) are taken from DataSet.outputs, and the corresponding column in the tidy data frame is taken from DataSet.names_column. This column will be generically referred to as the output_column in this documentation, but it can take any value specified when the DataSet is constructed. The model inputs are constructed by filtering this dataframe, extracting column values, and converting these to numerical input coordinates. The main entry point is fit(), which parses the dimensions of the model with specify_model(), extracts numerical input coordinates with get_structured_data(), compiles the pymc model with build_model(), and finally learns the hyperparameters with find_MAP().

Dimensions fall into several categories:
- Filter dimensions, those with only one level, are used to subset the dataframe but are not included as explicit inputs to the model. These are not specified explicitly; rather, any continuous or categorical dimension with only one level is treated as a filter dimension.
- Continuous dimensions are treated as explicit coordinates and given a Radial Basis Function kernel.
- Linear dimensions (which must be a subset of continuous_dims) have an additional linear kernel.
- Coregion dimensions imply a distinct but correlated output for each level.
- If more than one output is specified, the output_column is treated as a categorical dimension.
A non-additive model has the form:
\[\begin{split}
y &\sim \text{Normal} \left( \mu, \sigma \right) \\
\mu &\sim \mathcal{GP} \left( K \right) \\
K &= \left( K^\text{cont} + K^\text{lin} \right) K^\text{coreg}_\text{outputs} \prod_{n} K^\text{coreg}_{n} \\
K^\text{cont} &= \text{RBF} \left( \ell_{i}, \eta \right) \\
K^\text{lin} &= \text{LIN} \left( c_{j}, \tau \right) \\
K^\text{coreg} &= \text{Coreg} \left( \boldsymbol{W}, \kappa \right) \\
\sigma &\sim \text{Exponential} \left( 1 \right)
\end{split}\]

where \(i\) denotes a continuous dimension, \(j\) denotes a linear dimension, and \(n\) denotes a categorical dimension (excluding the output_column). \(K^\text{cont}\) and \(K^\text{lin}\) each consist of a joint kernel encompassing all continuous and linear dimensions, respectively, whereas \(K^\text{coreg}_{n}\) is a distinct kernel for a given categorical dimension.
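As an illustration only, the following sketch builds this covariance structure directly in pymc for one continuous dimension and one three-level categorical dimension (single output, so \(K^\text{coreg}_\text{outputs}\) is omitted). It is not gumbi's internal implementation; all names, priors, and data are hypothetical.

>>> import numpy as np
>>> import pymc as pm
>>> # Toy tall inputs: column 0 is continuous, column 1 is a categorical index
>>> X = np.column_stack([np.linspace(0, 1, 30), np.tile([0.0, 1.0, 2.0], 10)])
>>> y = np.sin(6 * X[:, 0]) + 0.1 * np.random.default_rng(0).standard_normal(30)
>>> with pm.Model():
...     # K_cont: RBF on the continuous column with an InverseGamma lengthscale
...     ls = pm.InverseGamma('ls', alpha=5, beta=5)
...     eta_sq = pm.Gamma('eta_sq', alpha=2, beta=1)
...     K_cont = eta_sq * pm.gp.cov.ExpQuad(2, ls=ls, active_dims=[0])
...     # K_lin: linear kernel on the same column
...     tau_sq = pm.Gamma('tau_sq', alpha=2, beta=1)
...     K_lin = tau_sq * pm.gp.cov.Linear(2, c=pm.Normal('c', 0, 10), active_dims=[0])
...     # K_coreg: correlated levels of the categorical column
...     W = pm.Normal('W', mu=0, sigma=3, shape=(3, 2))
...     kappa = pm.Gamma('kappa', alpha=1.5, beta=1, shape=3)
...     K_coreg = pm.gp.cov.Coregion(2, W=W, kappa=kappa, active_dims=[1])
...     # K = (K_cont + K_lin) * K_coreg, matching the equation above
...     gp = pm.gp.Marginal(cov_func=(K_cont + K_lin) * K_coreg)
...     sigma = pm.Exponential('sigma', lam=1)
...     y_ = gp.marginal_likelihood('y_obs', X=X, y=y, sigma=sigma)
...     mp = pm.find_MAP()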
The additive model has the form:
\[\begin{split}
\mu &\sim \mathcal{GP}\left( K^\text{global} \right) + \sum_{n} \mathcal{GP}\left( K_{n} \right) \\
K^\text{global} &= \left( K^\text{cont} + K^\text{lin} \right) K^\text{coreg}_\text{outputs} \\
K_{n} &= \left( K^\text{cont}_{n} + K^\text{lin}_{n} \right) K^\text{coreg}_\text{outputs} K^\text{coreg}_{n}
\end{split}\]

Note that, in the additive model, \(K^\text{cont}_{n}\) and \(K^\text{lin}_{n}\) still consist of only the continuous and linear dimensions, respectively, but have unique priors corresponding to each categorical dimension. However, there is only one \(K^\text{coreg}_\text{outputs}\) kernel.
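The additive variant can be sketched similarly; pymc Marginal GPs compose with +. Again, this is a hypothetical illustration continuing the sketch above (reusing X and y), not gumbi's implementation.

>>> with pm.Model():
...     # Global GP: shared RBF over the continuous column
...     ls_g = pm.InverseGamma('ls_g', alpha=5, beta=5)
...     K_global = pm.Gamma('eta_g', alpha=2, beta=1) * pm.gp.cov.ExpQuad(2, ls=ls_g, active_dims=[0])
...     # Category-specific GP: unique continuous priors times a coregion kernel
...     ls_n = pm.InverseGamma('ls_n', alpha=5, beta=5)
...     W = pm.Normal('W', mu=0, sigma=3, shape=(3, 2))
...     kappa = pm.Gamma('kappa', alpha=1.5, beta=1, shape=3)
...     K_n = (pm.Gamma('eta_n', alpha=2, beta=1)
...            * pm.gp.cov.ExpQuad(2, ls=ls_n, active_dims=[0])
...            * pm.gp.cov.Coregion(2, W=W, kappa=kappa, active_dims=[1]))
...     # Marginal GPs add: mu ~ GP(K_global) + GP(K_n)
...     gp_total = pm.gp.Marginal(cov_func=K_global) + pm.gp.Marginal(cov_func=K_n)
...     sigma = pm.Exponential('sigma', lam=1)
...     y_ = gp_total.marginal_likelihood('y_obs', X=X, y=y, sigma=sigma)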
- Parameters:
  - dataset (DataSet) – Data for fitting.
  - outputs (str or list of str, default None) – Name(s) of output(s) to learn. If None, uses all values from the outputs attribute of dataset.
  - seed (int) – Random seed.
Examples
A GP object is created from a DataSet and can be fit immediately with the default dimension configuration (regressing 'd' with RBF + linear kernels for X and Y):

>>> import pandas as pd
>>> import gumbi as gmb
>>> df = pd.read_pickle(gmb.data.example_dataset)
>>> outputs = ['a', 'b', 'c', 'd', 'e', 'f']
>>> ds = gmb.DataSet(df, outputs=outputs, log_vars=['Y', 'b', 'c', 'd', 'f'], logit_vars=['X', 'e'])
>>> gp = gmb.GP(ds, outputs='d').fit()
Note that the last line is equivalent to:

>>> gp = gmb.GP(ds, outputs='d')
>>> gp.specify_model()
>>> gp.build_model()
>>> gp.find_MAP()
The model can be specified with various continuous, linear, and categorical dimensions. X and Y are always included in both continuous_dims and linear_dims.

>>> gp.specify_model(continuous_dims='lg10_Z', linear_dims='lg10_Z', categorical_dims='Pair')
>>> gmb.GP(ds).fit(continuous_dims='lg10_Z', linear_dims='lg10_Z', categorical_dims='Pair')  # equivalent
After the model is fit, define a grid of points at which to make predictions. The result is a ParameterArray:

>>> gp.prepare_grid()
>>> gp.grid_points
('X', 'Y'): [(0.075     ,  10.) (0.08358586,  10.) (0.09217172,  10.) ...
 (0.90782828, 800.) (0.91641414, 800.) (0.925     , 800.)]
Make predictions, returning an UncertainParameterArray:

>>> gp.predict_grid()
>>> gp.predictions
d['μ', 'σ2']: [[(0.70728056, 0.16073197) (0.70728172, 0.16073197)
  (0.70728502, 0.16073197) ... (0.70727954, 0.16073197)
  (0.7072809 , 0.16073197) (0.70728058, 0.16073197)]
 ...
 [(0.70749247, 0.1607318 ) (0.70773573, 0.16073116)
  (0.70806603, 0.16072949) ... (0.70728449, 0.16073197)
  (0.70728194, 0.16073197) (0.7072807 , 0.16073197)]]
The uparray makes it easy to calculate standard statistics in natural or transformed/standardized space while maintaining the original shape of the array:
>>> gp.predictions.z.dist.ppf(0.025)
array([[-3.1887916 , -3.18878491, -3.18876601, ..., -3.18879744,
        -3.18878966, -3.18879146],
       ...,
       [-3.1875742 , -3.18617286, -3.18426272, ..., -3.18876906,
        -3.18878366, -3.18879081]])
Finally, plot the results:
>>> import matplotlib.pyplot as plt
>>>
>>> plt.style.use(str(gmb.style.futura))
>>> x_pa = gp.predictions_X
>>> y_upa = gp.predictions
>>> gmb.ParrayPlotter(x_pa, y_upa).plot()
Plot a slice down the center of the prediction along the X axis, holding Y fixed at its central value:

>>> x_pa, y_upa = gp.get_conditional_prediction(Y=88)
>>>
>>> ax = gmb.ParrayPlotter(x_pa, y_upa).plot()
>>> ax.set_xticklabels([int(float(txt.get_text())*100) for txt in ax.get_xticklabels()]);
Then plot a slice along the Y axis, holding X fixed at its central value:

>>> x_pa, y_upa = gp.get_conditional_prediction(X=0.5)
>>>
>>> ax = gmb.ParrayPlotter(x_pa, y_upa).plot()
>>> ax.set_xticklabels([int(float(txt.get_text())*100) for txt in ax.get_xticklabels()]);
Attributes

- outputs (list of str) – Name(s) of output(s) to learn.
- seed (int) – Random seed.
- continuous_dims (list of str) – Columns of dataframe used as continuous dimensions.
- linear_dims (list of str) – Subset of continuous dimensions to apply an additional linear kernel.
- continuous_levels (dict) – Values considered within each continuous column as {dim: [level1, level2]}.
- continuous_coords (dict) – Numerical coordinates of each continuous level within each continuous dimension as {dim: {level: coord}}.
- categorical_dims (list of str) – Columns of dataframe used as categorical dimensions.
- categorical_levels (dict) – Values considered within each categorical column as {dim: [level1, level2]}.
- categorical_coords (dict) – Numerical coordinates of each categorical level within each categorical dimension as {dim: {level: coord}}.
- additive (bool) – Whether to treat categorical dimensions as additive or joint.
- filter_dims (dict) – Dictionary of column-value pairs used to filter dataset before fitting.
- X (array) – A 2D tall array of input coordinates.
- y (array) – A 1D vector of observations.
- model (pymc.model.Model) – Compiled pymc model.
- gp_dict (dict) – Dictionary of model GP objects. Contains at least 'total'.
- build_model(seed=None, continuous_kernel='ExpQuad', period=None, heteroskedastic_inputs=False, heteroskedastic_outputs=True, sparse=False, n_u=100, ARD=True)
Compile a marginalized pymc model for the GP.
Each dimension in continuous_dims is combined in an ExpQuad kernel with a principled \(\text{InverseGamma}\) prior for each lengthscale (as suggested by Michael Betancourt) and a \(\text{Gamma}\left(2, 1\right)\) prior for variance.

- Parameters:
  - seed (int, optional) – Random seed. If None, seed is used.
  - continuous_kernel ({'ExpQuad', 'Matern32', 'Matern52', 'Exponential', 'Cosine', or 'Periodic'}) – Covariance function to use for continuous dimensions. See pymc docs for more details.
  - period (ParameterArray, optional) – A single parray of length 1 with one layer for each of the continuous_dims (by name) containing the period of the kernel, if a periodic-like kernel is used.
  - heteroskedastic_inputs (bool, default False) – Whether to allow heteroskedasticity along continuous dimensions (input-dependent noise).
  - heteroskedastic_outputs (bool, default True) – Whether to allow heteroskedasticity between multiple outputs (output-dependent noise). Not yet implemented.
  - sparse (bool, default False) – Whether to use a sparse approximation to the GP.
  - n_u (int, default 100) – Number of inducing points to use for the sparse approximation, if required.
  - ARD (bool, default True) – Whether to use "Automatic Relevance Determination" in the continuous kernel. If True, each continuous dimension receives its own lengthscale; otherwise a single lengthscale is used for all continuous dimensions.
- Returns:
  self
- Return type:
  GP
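A hedged usage sketch, assuming a DataSet ds as constructed in the Examples above:

>>> gp = gmb.GP(ds, outputs='d')
>>> gp.specify_model(continuous_dims='lg10_Z')
>>> gp.build_model(continuous_kernel='Matern52', sparse=True, n_u=50)
>>> gp.find_MAP()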
- draw_grid_samples(*args, source=None, output=None, categorical_levels=None, var_name='posterior_samples', additive_level='total', increment_var=True, **kwargs)
Draw posterior samples at points defined by prepare_grid().

- Parameters:
  - source ({None, dict, az.data.inference_data.InferenceData}) – GP parameters for which to draw samples. Should be the result of find_MAP(), sample(), or None.
  - output (str or list of str, optional) – Variable(s) for which to make predictions.
  - categorical_levels (dict, optional) – Level for each categorical_dims at which to make prediction.
  - var_name (str, default "posterior_samples") – Name to assign new variable to contain conditional predictions.
  - additive_level (str, default "total") – Level of additive GP at which to make predictions.
  - increment_var (bool, default True) – Whether to append '_' to the end of var_name if it already exists in model.
- Returns:
  samples – Samples as a parray reshaped into a grid with len(continuous_dims) dimensions.
- Return type:
  parray
- draw_point_samples(points, *args, source=None, output=None, var_name='posterior_samples', additive_level='total', increment_var=True, **kwargs)
Draw posterior samples at supplied points.

- Parameters:
  - points (ParameterArray) – 1-D ParameterArray vector of coordinates for prediction; must have one layer per self.dims.
  - output (str or list of str, optional) – Variable for which to make predictions.
  - source ({None, dict, az.data.inference_data.InferenceData}) – GP parameters for which to draw samples. Should be the result of find_MAP(), sample(), or None.
  - var_name (str, default "posterior_samples") – Name to assign new variable to contain conditional predictions.
  - additive_level (str, default "total") – Level of additive GP at which to make predictions.
  - increment_var (bool, default True) – Whether to append '_' to the end of var_name if it already exists in model.
- Returns:
  samples – Samples as a parray.
- Return type:
  parray
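A usage sketch; taking points from a previously prepared grid guarantees one layer per self.dims:

>>> gp.prepare_grid()
>>> points = gp.grid_points[::50]  # tall ParameterArray subset (illustrative stride)
>>> samples = gp.draw_point_samples(points)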
- find_MAP(*args, **kwargs)
Finds maximum a posteriori value for hyperparameters in model.
- Parameters:
  - *args – Positional arguments passed to pm.find_MAP().
  - **kwargs – Keyword arguments passed to pm.find_MAP().
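Because arguments are forwarded verbatim, any pm.find_MAP() option applies, e.g.:

>>> gp.find_MAP(method='L-BFGS-B', maxeval=5000)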
- fit(outputs=None, linear_dims=None, continuous_dims=None, continuous_levels=None, continuous_coords=None, categorical_dims=None, categorical_levels=None, additive=False, seed=None, continuous_kernel='ExpQuad', period=None, heteroskedastic_inputs=False, heteroskedastic_outputs=True, sparse=False, n_u=100, ARD=True, spec_kwargs=None, build_kwargs=None, MAP_kwargs=None)
Fits a GP surface
Parses inputs, compiles a pymc model, then finds the MAP value for the hyperparameters. The {}_dims arguments indicate the columns of the dataframe to be included in the model, with {}_levels indicating which values of those columns are to be included (None implies all values).

If additive==True, the model is constructed as the sum of a global GP and a distinct GP for each categorical dimension. Each of these GPs, including the global GP, consists of an RBF+linear kernel multiplied by a coregion kernel for the output_column if necessary. Although the same continuous kernel structure is used for each GP in this model, unique priors are assigned to each distinct kernel. However, there is always only one coregion kernel for the output_column. The kernel for each dimension-specific GP is further multiplied by a coregion kernel that provides an output for each level in that dimension.
- Parameters:
  - outputs (str or list of str, default None) – Name(s) of output(s) to learn. If None, outputs is used.
  - linear_dims (str or list of str, optional) – Subset of continuous dimensions to apply an additional linear kernel. If None, defaults to ['Y', 'X'].
  - continuous_dims (str or list of str, optional) – Columns of dataframe used as continuous dimensions.
  - continuous_levels (str, list, or dict, optional) – Values considered within each continuous column as {dim: [level1, level2]}.
  - continuous_coords (list or dict, optional) – Numerical coordinates of each continuous level within each continuous dimension as {dim: {level: coord}}.
  - categorical_dims (str or list of str, optional) – Columns of dataframe used as categorical dimensions.
  - categorical_levels (str, list, or dict, optional) – Values considered within each categorical column as {dim: [level1, level2]}.
  - additive (bool, default False) – Whether to treat categorical_dims as additive or joint (default).
  - seed (int, optional) – Random seed for model instantiation. If None, seed is used.
  - continuous_kernel ({'ExpQuad', 'Matern32', 'Matern52', 'Exponential', or 'Cosine'}) – Covariance function to use for continuous dimensions. See pymc docs for more details.
  - period (ParameterArray, optional) – A single parray of length 1 with one layer for each of the continuous_dims (by name) containing the period of the kernel, if a periodic-like kernel is used.
  - heteroskedastic_inputs (bool, default False) – Whether to allow heteroskedasticity along continuous dimensions (input-dependent noise).
  - heteroskedastic_outputs (bool, default True) – Whether to allow heteroskedasticity between multiple outputs (output-dependent noise). Not yet implemented.
  - sparse (bool, default False) – Whether to use a sparse approximation to the GP.
  - n_u (int, default 100) – Number of inducing points to use for the sparse approximation, if required.
  - ARD (bool, default True) – Whether to use "Automatic Relevance Determination" in the continuous kernel. If True, each continuous dimension receives its own lengthscale; otherwise a single lengthscale is used for all continuous dimensions.
  - spec_kwargs (dict, optional) – Additional keyword arguments passed to specify_model().
  - build_kwargs (dict, optional) – Additional keyword arguments passed to build_model().
  - MAP_kwargs (dict, optional) – Additional keyword arguments passed to pm.find_MAP().
- Returns:
  self
- Return type:
  GP
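For example, an additive fit over a categorical dimension, reusing the names from the Examples above:

>>> gp = gmb.GP(ds, outputs='d').fit(continuous_dims='lg10_Z', categorical_dims='Pair', additive=True)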
- predict(points_array, with_noise=True, additive_level='total', **kwargs)
Make predictions at supplied points using the specified GP.

- Parameters:
  - points (ParameterArray) – Tall ParameterArray vector of coordinates for prediction; must have one layer per self.dims.
  - with_noise (bool, default True) – Whether to incorporate aleatoric uncertainty into prediction error.
- Returns:
  prediction – Predictions as a uparray.
- Return type:
  UncertainParameterArray
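A usage sketch; predict_grid() shown in the Examples is the typical entry point, but predict() accepts any tall ParameterArray with one layer per self.dims, such as the prepared grid points:

>>> gp.prepare_grid()
>>> pred = gp.predict(gp.grid_points, with_noise=False)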
- sample(*args, **kwargs)
Draws samples from the posterior for the hyperparameters in model.
- Parameters:
  - *args – Positional arguments passed to pm.sample().
  - **kwargs – Keyword arguments passed to pm.sample().
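Because arguments are forwarded verbatim, any pm.sample() option applies, e.g.:

>>> gp.sample(draws=1000, tune=1000, chains=4)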