GP

Gaussian Process surface learning and prediction.

Methods

- build_model() – Compile a marginalized pymc model for the GP.
- draw_grid_samples() – Draw posterior samples at points defined by prepare_grid().
- draw_point_samples() – Draw posterior samples at supplied points.
- find_MAP() – Finds maximum a posteriori value for hyperparameters in model.
- fit() – Fits a GP surface.
- predict() – Make predictions at supplied points using specified gp.
- sample() – Draws samples from the posterior for the hyperparameters in model.
- class gumbi.regression.GP(dataset: DataSet, outputs=None, seed=2021)
Bases: Regressor
Gaussian Process surface learning and prediction.
Notes
The GP class is built from a dataframe in the form of a DataSet object. The output(s) are taken from DataSet.outputs, and the corresponding column in the tidy data frame is taken from DataSet.names_column. This column will be generically referred to as the output_column in this documentation, but it can take any value specified when the DataSet is constructed. The model inputs are constructed by filtering this dataframe, extracting column values, and converting these to numerical input coordinates. The main entry point is fit(), which parses the dimensions of the model with specify_model(), extracts numerical input coordinates with get_structured_data(), compiles the pymc model with build_model(), and finally learns the hyperparameters with find_MAP().

Dimensions fall into several categories:
- Filter dimensions, those with only one level, are used to subset the dataframe but are not included as explicit inputs to the model. These are not specified explicitly; rather, any continuous or categorical dimension with only one level is treated as a filter dimension.
- Continuous dimensions are treated as explicit coordinates and given a Radial Basis Function kernel.
- Linear dimensions (which must be a subset of continuous_dims) have an additional linear kernel.
- Coregion dimensions imply a distinct but correlated output for each level.
- If more than one output is specified, the output_column is treated as a categorical dimension.
A non-additive model has the form:
\[\begin{split}
y &\sim \text{Normal} \left( \mu, \sigma \right) \\
\mu &\sim \mathcal{GP} \left( K \right) \\
K &= \left( K^\text{cont} + K^\text{lin} \right) K^\text{coreg}_\text{outputs} \prod_{n} K^\text{coreg}_{n} \\
K^\text{cont} &= \text{RBF} \left( \ell_{i}, \eta \right) \\
K^\text{lin} &= \text{LIN} \left( c_{j}, \tau \right) \\
K^\text{coreg} &= \text{Coreg} \left( \boldsymbol{W}, \kappa \right) \\
\sigma &\sim \text{Exponential} \left( 1 \right)
\end{split}\]

where \(i\) denotes a continuous dimension, \(j\) denotes a linear dimension, and \(n\) denotes a categorical dimension (excluding the output_column). \(K^\text{cont}\) and \(K^\text{lin}\) each consist of a joint kernel encompassing all continuous and linear dimensions, respectively, whereas \(K^\text{coreg}_{n}\) is a distinct kernel for a given categorical dimension.
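As an illustration only, the following sketch builds this covariance structure directly in pymc for one continuous dimension and one three-level categorical dimension (single output, so \(K^\text{coreg}_\text{outputs}\) is omitted). It is not gumbi's internal implementation; all names, priors, and data are hypothetical.

>>> import numpy as np
>>> import pymc as pm
>>> # Toy tall inputs: column 0 is continuous, column 1 is a categorical index
>>> X = np.column_stack([np.linspace(0, 1, 30), np.tile([0.0, 1.0, 2.0], 10)])
>>> y = np.sin(6 * X[:, 0]) + 0.1 * np.random.default_rng(0).standard_normal(30)
>>> with pm.Model():
...     # K_cont: RBF on the continuous column with an InverseGamma lengthscale
...     ls = pm.InverseGamma('ls', alpha=5, beta=5)
...     eta_sq = pm.Gamma('eta_sq', alpha=2, beta=1)
...     K_cont = eta_sq * pm.gp.cov.ExpQuad(2, ls=ls, active_dims=[0])
...     # K_lin: linear kernel on the same column
...     tau_sq = pm.Gamma('tau_sq', alpha=2, beta=1)
...     K_lin = tau_sq * pm.gp.cov.Linear(2, c=pm.Normal('c', 0, 10), active_dims=[0])
...     # K_coreg: correlated levels of the categorical column
...     W = pm.Normal('W', mu=0, sigma=3, shape=(3, 2))
...     kappa = pm.Gamma('kappa', alpha=1.5, beta=1, shape=3)
...     K_coreg = pm.gp.cov.Coregion(2, W=W, kappa=kappa, active_dims=[1])
...     # K = (K_cont + K_lin) * K_coreg, matching the equation above
...     gp = pm.gp.Marginal(cov_func=(K_cont + K_lin) * K_coreg)
...     sigma = pm.Exponential('sigma', lam=1)
...     y_ = gp.marginal_likelihood('y_obs', X=X, y=y, sigma=sigma)
...     mp = pm.find_MAP()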
The additive model has the form:
\[\begin{split}
\mu &\sim \mathcal{GP}\left( K^\text{global} \right) + \sum_{n} \mathcal{GP}\left( K_{n} \right) \\
K^\text{global} &= \left( K^\text{cont} + K^\text{lin} \right) K^\text{coreg}_\text{outputs} \\
K_{n} &= \left( K^\text{cont}_{n} + K^\text{lin}_{n} \right) K^\text{coreg}_\text{outputs} K^\text{coreg}_{n}
\end{split}\]

Note that, in the additive model, \(K^\text{cont}_{n}\) and \(K^\text{lin}_{n}\) still consist of only the continuous and linear dimensions, respectively, but have unique priors corresponding to each categorical dimension. However, there is only one \(K^\text{coreg}_\text{outputs}\) kernel.
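The additive variant can be sketched similarly; pymc Marginal GPs compose with +. Again, this is a hypothetical illustration continuing the sketch above (reusing X and y), not gumbi's implementation.

>>> with pm.Model():
...     # Global GP: shared RBF over the continuous column
...     ls_g = pm.InverseGamma('ls_g', alpha=5, beta=5)
...     K_global = pm.Gamma('eta_g', alpha=2, beta=1) * pm.gp.cov.ExpQuad(2, ls=ls_g, active_dims=[0])
...     # Category-specific GP: unique continuous priors times a coregion kernel
...     ls_n = pm.InverseGamma('ls_n', alpha=5, beta=5)
...     W = pm.Normal('W', mu=0, sigma=3, shape=(3, 2))
...     kappa = pm.Gamma('kappa', alpha=1.5, beta=1, shape=3)
...     K_n = (pm.Gamma('eta_n', alpha=2, beta=1)
...            * pm.gp.cov.ExpQuad(2, ls=ls_n, active_dims=[0])
...            * pm.gp.cov.Coregion(2, W=W, kappa=kappa, active_dims=[1]))
...     # Marginal GPs add: mu ~ GP(K_global) + GP(K_n)
...     gp_total = pm.gp.Marginal(cov_func=K_global) + pm.gp.Marginal(cov_func=K_n)
...     sigma = pm.Exponential('sigma', lam=1)
...     y_ = gp_total.marginal_likelihood('y_obs', X=X, y=y, sigma=sigma)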
- Parameters:
  - dataset (DataSet) – Data for fitting.
  - outputs (str or list of str, default None) – Name(s) of output(s) to learn. If None, uses all values from the outputs attribute of dataset.
  - seed (int) – Random seed.
Examples
A GP object is created from a DataSet and can be fit immediately with the default dimension configuration (regressing 'd' with RBF + linear kernels for X and Y):

>>> import pandas as pd
>>> import gumbi as gmb
>>> df = pd.read_pickle(gmb.data.example_dataset)
>>> outputs = ['a', 'b', 'c', 'd', 'e', 'f']
>>> ds = gmb.DataSet(df, outputs=outputs, log_vars=['Y', 'b', 'c', 'd', 'f'], logit_vars=['X', 'e'])
>>> gp = gmb.GP(ds, outputs='d').fit()
Note that the last line is equivalent to:

>>> gp = gmb.GP(ds, outputs='d')
>>> gp.specify_model()
>>> gp.build_model()
>>> gp.find_MAP()
The model can be specified with various continuous, linear, and categorical dimensions. X and Y are always included in both continuous_dims and linear_dims.

>>> gp.specify_model(continuous_dims='lg10_Z', linear_dims='lg10_Z', categorical_dims='Pair')
>>> gmb.GP(ds).fit(continuous_dims='lg10_Z', linear_dims='lg10_Z', categorical_dims='Pair')  # equivalent
After the model is fit, define a grid of points at which to make predictions. The result is a ParameterArray:

>>> gp.prepare_grid()
>>> gp.grid_points
('X', 'Y'): [(0.075     ,  10.) (0.08358586,  10.) (0.09217172,  10.) ...
 (0.90782828, 800.) (0.91641414, 800.) (0.925     , 800.)]
Make predictions, returning an UncertainParameterArray:

>>> gp.predict_grid()
>>> gp.predictions
d['μ', 'σ2']: [[(0.70728056, 0.16073197) (0.70728172, 0.16073197)
  (0.70728502, 0.16073197) ... (0.70727954, 0.16073197)
  (0.7072809 , 0.16073197) (0.70728058, 0.16073197)]
 ...
 [(0.70749247, 0.1607318 ) (0.70773573, 0.16073116)
  (0.70806603, 0.16072949) ... (0.70728449, 0.16073197)
  (0.70728194, 0.16073197) (0.7072807 , 0.16073197)]]
The uparray makes it easy to calculate standard statistics in natural or transformed/standardized space while maintaining the original shape of the array:
>>> gp.predictions.z.dist.ppf(0.025)
array([[-3.1887916 , -3.18878491, -3.18876601, ..., -3.18879744,
        -3.18878966, -3.18879146],
       ...,
       [-3.1875742 , -3.18617286, -3.18426272, ..., -3.18876906,
        -3.18878366, -3.18879081]])
Finally, plot the results:
>>> import matplotlib.pyplot as plt
>>>
>>> plt.style.use(str(gmb.style.futura))
>>> x_pa = gp.predictions_X
>>> y_upa = gp.predictions
>>> gmb.ParrayPlotter(x_pa, y_upa).plot()
Plot a slice down the center of the prediction along the X axis, holding Y fixed at its central value:

>>> x_pa, y_upa = gp.get_conditional_prediction(Y=88)
>>>
>>> ax = gmb.ParrayPlotter(x_pa, y_upa).plot()
>>> ax.set_xticklabels([int(float(txt.get_text())*100) for txt in ax.get_xticklabels()]);
Then plot a slice along the Y axis, holding X fixed at its central value:

>>> x_pa, y_upa = gp.get_conditional_prediction(X=0.5)
>>>
>>> ax = gmb.ParrayPlotter(x_pa, y_upa).plot()
>>> ax.set_xticklabels([int(float(txt.get_text())*100) for txt in ax.get_xticklabels()]);
Attributes

- outputs (list of str) – Name(s) of output(s) to learn.
- seed (int) – Random seed.
- continuous_dims (list of str) – Columns of dataframe used as continuous dimensions.
- linear_dims (list of str) – Subset of continuous dimensions to apply an additional linear kernel.
- continuous_levels (dict) – Values considered within each continuous column as {dim: [level1, level2]}.
- continuous_coords (dict) – Numerical coordinates of each continuous level within each continuous dimension as {dim: {level: coord}}.
- categorical_dims (list of str) – Columns of dataframe used as categorical dimensions.
- categorical_levels (dict) – Values considered within each categorical column as {dim: [level1, level2]}.
- categorical_coords (dict) – Numerical coordinates of each categorical level within each categorical dimension as {dim: {level: coord}}.
- additive (bool) – Whether to treat categorical dimensions as additive or joint.
- filter_dims (dict) – Dictionary of column-value pairs used to filter dataset before fitting.
- X (array) – A 2D tall array of input coordinates.
- y (array) – A 1D vector of observations.
- model (pymc.model.Model) – Compiled pymc model.
- gp_dict (dict) – Dictionary of model GP objects. Contains at least 'total'.
- build_model(seed=None, continuous_kernel='ExpQuad', period=None, heteroskedastic_inputs=False, heteroskedastic_outputs=True, sparse=False, n_u=100, ARD=True)
Compile a marginalized pymc model for the GP.
Each dimension in continuous_dims is combined in an ExpQuad kernel with a principled \(\text{InverseGamma}\) prior for each lengthscale (as suggested by Michael Betancourt) and a \(\text{Gamma}\left(2, 1\right)\) prior for variance.

- Parameters:
  - seed (int, optional) – Random seed. If None, seed is used.
  - continuous_kernel ({'ExpQuad', 'Matern32', 'Matern52', 'Exponential', 'Cosine', or 'Periodic'}) – Covariance function to use for continuous dimensions. See pymc docs for more details.
  - period (ParameterArray, optional) – A single parray of length 1 with one layer for each of the continuous_dims (by name) containing the period of the kernel, if a periodic-like kernel is used.
  - heteroskedastic_inputs (bool, default False) – Whether to allow heteroskedasticity along continuous dimensions (input-dependent noise).
  - heteroskedastic_outputs (bool, default True) – Whether to allow heteroskedasticity between multiple outputs (output-dependent noise). Not yet implemented.
  - sparse (bool, default False) – Whether to use a sparse approximation to the GP.
  - n_u (int, default 100) – Number of inducing points to use for the sparse approximation, if required.
  - ARD (bool, default True) – Whether to use "Automatic Relevance Determination" in the continuous kernel. If True, each continuous dimension receives its own lengthscale; otherwise a single lengthscale is used for all continuous dimensions.
- Returns:
  self
- Return type:
  GP
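A hedged usage sketch, assuming a DataSet ds as constructed in the Examples above:

>>> gp = gmb.GP(ds, outputs='d')
>>> gp.specify_model(continuous_dims='lg10_Z')
>>> gp.build_model(continuous_kernel='Matern52', sparse=True, n_u=50)
>>> gp.find_MAP()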
- draw_grid_samples(*args, source=None, output=None, categorical_levels=None, var_name='posterior_samples', additive_level='total', increment_var=True, **kwargs)
Draw posterior samples at points defined by prepare_grid().

- Parameters:
  - source ({None, dict, az.data.inference_data.InferenceData}) – GP parameters for which to draw samples. Should be the result of find_MAP(), sample(), or None.
  - output (str or list of str, optional) – Variable(s) for which to make predictions.
  - categorical_levels (dict, optional) – Level for each categorical_dims at which to make prediction.
  - var_name (str, default "posterior_samples") – Name to assign new variable to contain conditional predictions.
  - additive_level (str, default "total") – Level of additive GP at which to make predictions.
  - increment_var (bool, default True) – Whether to append '_' to the end of var_name if it already exists in model.
- Returns:
  samples – Samples as a parray reshaped into a grid with len(continuous_dims) dimensions.
- Return type:
  parray
- draw_point_samples(points, *args, source=None, output=None, var_name='posterior_samples', additive_level='total', increment_var=True, **kwargs)
Draw posterior samples at supplied points.

- Parameters:
  - points (ParameterArray) – 1-D ParameterArray vector of coordinates for prediction; must have one layer per self.dims.
  - output (str or list of str, optional) – Variable for which to make predictions.
  - source ({None, dict, az.data.inference_data.InferenceData}) – GP parameters for which to draw samples. Should be the result of find_MAP(), sample(), or None.
  - var_name (str, default "posterior_samples") – Name to assign new variable to contain conditional predictions.
  - additive_level (str, default "total") – Level of additive GP at which to make predictions.
  - increment_var (bool, default True) – Whether to append '_' to the end of var_name if it already exists in model.
- Returns:
  samples – Samples as a parray.
- Return type:
  parray
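A usage sketch; taking points from a previously prepared grid guarantees one layer per self.dims:

>>> gp.prepare_grid()
>>> points = gp.grid_points[::50]  # tall ParameterArray subset (illustrative stride)
>>> samples = gp.draw_point_samples(points)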
- find_MAP(*args, **kwargs)
Finds maximum a posteriori value for hyperparameters in model.
- Parameters:
  - *args – Positional arguments passed to pm.find_MAP().
  - **kwargs – Keyword arguments passed to pm.find_MAP().
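Because arguments are forwarded verbatim, any pm.find_MAP() option applies, e.g.:

>>> gp.find_MAP(method='L-BFGS-B', maxeval=5000)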
- fit(outputs=None, linear_dims=None, continuous_dims=None, continuous_levels=None, continuous_coords=None, categorical_dims=None, categorical_levels=None, additive=False, seed=None, continuous_kernel='ExpQuad', period=None, heteroskedastic_inputs=False, heteroskedastic_outputs=True, sparse=False, n_u=100, ARD=True, spec_kwargs=None, build_kwargs=None, MAP_kwargs=None)
Fits a GP surface
Parses inputs, compiles a pymc model, then finds the MAP value for the hyperparameters. The {}_dims arguments indicate the columns of the dataframe to be included in the model, with {}_levels indicating which values of those columns are to be included (None implies all values).

If additive==True, the model is constructed as the sum of a global GP and a distinct GP for each categorical dimension. Each of these GPs, including the global GP, consists of an RBF+linear kernel multiplied by a coregion kernel for the output_column if necessary. Although the same continuous kernel structure is used for each GP in this model, unique priors are assigned to each distinct kernel. However, there is always only one coregion kernel for the output_column. The kernel for each dimension-specific GP is further multiplied by a coregion kernel that provides an output for each level in that dimension.
- Parameters:
  - outputs (str or list of str, default None) – Name(s) of output(s) to learn. If None, outputs is used.
  - linear_dims (str or list of str, optional) – Subset of continuous dimensions to apply an additional linear kernel. If None, defaults to ['Y', 'X'].
  - continuous_dims (str or list of str, optional) – Columns of dataframe used as continuous dimensions.
  - continuous_levels (str, list, or dict, optional) – Values considered within each continuous column as {dim: [level1, level2]}.
  - continuous_coords (list or dict, optional) – Numerical coordinates of each continuous level within each continuous dimension as {dim: {level: coord}}.
  - categorical_dims (str or list of str, optional) – Columns of dataframe used as categorical dimensions.
  - categorical_levels (str, list, or dict, optional) – Values considered within each categorical column as {dim: [level1, level2]}.
  - additive (bool, default False) – Whether to treat categorical_dims as additive or joint (default).
  - seed (int, optional) – Random seed for model instantiation. If None, seed is used.
  - continuous_kernel ({'ExpQuad', 'Matern32', 'Matern52', 'Exponential', or 'Cosine'}) – Covariance function to use for continuous dimensions. See pymc docs for more details.
  - period (ParameterArray, optional) – A single parray of length 1 with one layer for each of the continuous_dims (by name) containing the period of the kernel, if a periodic-like kernel is used.
  - heteroskedastic_inputs (bool, default False) – Whether to allow heteroskedasticity along continuous dimensions (input-dependent noise).
  - heteroskedastic_outputs (bool, default True) – Whether to allow heteroskedasticity between multiple outputs (output-dependent noise). Not yet implemented.
  - sparse (bool, default False) – Whether to use a sparse approximation to the GP.
  - n_u (int, default 100) – Number of inducing points to use for the sparse approximation, if required.
  - ARD (bool, default True) – Whether to use "Automatic Relevance Determination" in the continuous kernel. If True, each continuous dimension receives its own lengthscale; otherwise a single lengthscale is used for all continuous dimensions.
  - spec_kwargs (dict, optional) – Additional keyword arguments passed to specify_model().
  - build_kwargs (dict, optional) – Additional keyword arguments passed to build_model().
  - MAP_kwargs (dict, optional) – Additional keyword arguments passed to pm.find_MAP().
- Returns:
  self
- Return type:
  GP
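For example, an additive fit over a categorical dimension, reusing the names from the Examples above:

>>> gp = gmb.GP(ds, outputs='d').fit(continuous_dims='lg10_Z', categorical_dims='Pair', additive=True)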
- predict(points_array, with_noise=True, additive_level='total', **kwargs)
Make predictions at supplied points using the specified GP.

- Parameters:
  - points (ParameterArray) – Tall ParameterArray vector of coordinates for prediction; must have one layer per self.dims.
  - with_noise (bool, default True) – Whether to incorporate aleatoric uncertainty into prediction error.
- Returns:
  prediction – Predictions as a uparray.
- Return type:
  UncertainParameterArray
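A usage sketch; predict_grid() shown in the Examples is the typical entry point, but predict() accepts any tall ParameterArray with one layer per self.dims, such as the prepared grid points:

>>> gp.prepare_grid()
>>> pred = gp.predict(gp.grid_points, with_noise=False)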
- sample(*args, **kwargs)
Draws samples from the posterior for the hyperparameters in model.
- Parameters:
  - *args – Positional arguments passed to pm.sample().
  - **kwargs – Keyword arguments passed to pm.sample().
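Because arguments are forwarded verbatim, any pm.sample() option applies, e.g.:

>>> gp.sample(draws=1000, tune=1000, chains=4)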