Standardizer

Standardizer([log_vars, logit_vars])

Container for dict of mean (μ) and variance (σ2) for every parameter.

Methods

Standardizer.from_DataFrame(df[, log_vars, ...])

Construct from wide-form DataFrame

Standardizer.stdz(name, μ=None, σ2=None)

Transforms, mean-centers, and scales a parameter, distribution, or Series

Standardizer.transform(name, μ=None, σ2=None)

Transforms a parameter, distribution, or Series

Standardizer.unstdz(name, μ=None, σ2=None)

Untransforms, un-centers, and un-scales a parameter, distribution, or Series

Standardizer.untransform(name, μ=None, σ2=None)

Untransforms a parameter, distribution, or Series

Standardizer.validate(dct)

Ensures provided dictionary has all required attributes

Attributes

Standardizer.log_vars

List of log-normal variables

Standardizer.logit_vars

List of logit-normal variables

Standardizer.mean_transforms

Function that transforms the mean of a distribution.

Standardizer.transforms

Collection of forward and reverse transform functions for each variable

Standardizer.var_transforms

Function that transforms the variance of a distribution.

class gumbi.aggregation.Standardizer(log_vars=None, logit_vars=None, **kwargs)

Bases: dict

Container for dict of mean (μ) and variance (σ2) for every parameter.

Standardizer objects allow transformation and normalization of datasets. The main methods are stdz(), which attempts to coerce the values of a given variable to a standard normal distribution (z-scores), and its complement unstdz(). The steps are

\[\mathbf{\text{tidy}} \rightarrow \text{transform} \rightarrow \text{mean-center} \rightarrow \text{scale} \rightarrow \mathbf{\text{tidy.z}}\]

For example, reaction rate must clearly be strictly positive, so we use a log transformation so that it behaves as a normally-distributed random variable. We then mean-center and scale this transformed value to obtain z-scores indicating how similar a given estimate is to all the other estimates we’ve observed. Standardizer stores the transforms and population mean and variance for every parameter, allowing us to convert back and forth between natural space (\(rate\)), transformed space (\(\text{ln}\; rate\)), and standardized space (\(\left( \text{ln}\; rate - \mu_{\text{ln}\; rate} \right)/\sigma_{\text{ln}\; rate}\)).
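The three steps above can be sketched with the standard library alone. This is a minimal illustration of the arithmetic for a log-normal variable, not gumbi's implementation; the function names `stdz_log` and `unstdz_log` and the population statistics are hypothetical.

```python
import math

# Hypothetical population statistics of ln(rate); for illustration only.
MU_LN_RATE = 0.0   # mean of ln(rate)
VAR_LN_RATE = 0.1  # variance of ln(rate)


def stdz_log(value, mu=MU_LN_RATE, var=VAR_LN_RATE):
    """Natural space -> log space -> z-score."""
    transformed = math.log(value)      # transform
    centered = transformed - mu        # mean-center
    return centered / math.sqrt(var)   # scale


def unstdz_log(z, mu=MU_LN_RATE, var=VAR_LN_RATE):
    """z-score -> log space -> natural space (inverse of stdz_log)."""
    transformed = z * math.sqrt(var) + mu  # un-scale, then un-center
    return math.exp(transformed)           # untransform


z = stdz_log(2.0)        # z-score of rate = 2
rate = unstdz_log(z)     # round trip back to natural space
```

Applying `unstdz_log` to the output of `stdz_log` recovers the original rate up to floating-point error, which is the round trip between natural and standardized space described above.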

Typically, a Standardizer will be constructed from a DataFrame (from_DataFrame()), but the individual means and variances can be provided at instantiation as well. Note, however, that these should be the mean and variance of the transformed variable. For example, if d should be treated as log-normal with a natural-space mean of 1 and variance of 0.1 (the variance is already expressed in log space; see var_transforms), the right way to instantiate the class would be Standardizer(d={'μ': 0, 'σ2': 0.1}, log_vars=['d']).
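As a small worked sketch of that conversion, the snippet below derives the dictionary that would be passed for a log-normal variable. It assumes the 0.1 is already the variance in log space, as the var_transforms description suggests; the variable names are hypothetical and gumbi itself is not imported.

```python
import math

natural_center = 1.0  # value of d in natural space
log_variance = 0.1    # variance of ln(d); assumed to be given in log space

# The mean must be expressed in transformed (log) space: ln(1) = 0.
params = {"μ": math.log(natural_center), "σ2": log_variance}
log_vars = ["d"]

# These could then be supplied as Standardizer(d=params, log_vars=log_vars).
```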

Notes

Standardizer is just a dictionary with some extra methods and defaults, so standard dictionary methods like dict.update() still work.

Parameters:
  • log_vars (list, optional) – List of input and output variables to be treated as log-normal.

  • logit_vars (list, optional) – List of input and output variables to be treated as logit-normal.

  • **kwargs – Mean and variance of each variable as a dictionary, e.g. d={'μ': 0, 'σ2': 0.1}

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from gumbi import Standardizer
>>> stdzr = Standardizer(x={'μ': 1, 'σ2': 0.1}, d={'μ': 0, 'σ2': 0.1}, log_vars=['d'])

Transforming and standardizing a single parameter:

>>> stdzr.transform('x', μ=1)
1
>>> stdzr.stdz('x', 1)
0.0
>>> stdzr.unstdz('x', 0)
1.0
>>> stdzr.stdz('x', 1+0.1**0.5)
1.0  # approximately
>>> stdzr.unstdz('x', 1)
1.316227766016838
>>> stdzr.stdz('d', 1)
0.0
>>> stdzr.stdz('d', np.exp(0.1**0.5))
1.0  # approximately

Transforming and standardizing a distribution:

>>> stdzr.transform('x', μ=1., σ2=0.1)
(1, 0.1)
>>> stdzr.stdz('x', 1, 0.1)
(0.0, 1.0)
>>> stdzr.stdz('d', 1, 0.1)
(0.0, 1.0)
>>> stdzr.transform('d', 1, 0.1)
(0.0, 0.1)

Standardizing a series:

>>> x_series = pd.Series(np.arange(1,5), name='x')
>>> stdzr.stdz(x_series)
0    0.000000
1    3.162278
2    6.324555
3    9.486833
Name: x, dtype: float64
>>> r_series = pd.Series(np.arange(1,5), name='d')
>>> stdzr.stdz(r_series)
0    0.000000
1    2.191924
2    3.474117
3    4.383848
Name: d, dtype: float64

classmethod from_DataFrame(df: DataFrame, log_vars=None, logit_vars=None)

Construct from wide-form DataFrame

property log_vars: list[str]

List of log-normal variables

property logit_vars: list[str]

List of logit-normal variables

property mean_transforms

Function that transforms the mean of a distribution.

These transforms should follow scipy’s conventions such that a distribution can be defined in the given space by passing (loc=μ, scale=σ2**0.5). For a log-normal variable, an RV defined as lognorm(loc=μ, scale=σ2**0.5) in “natural” space is equivalent to norm(loc=np.log(μ), scale=σ2**0.5) in log space, so this transform should return np.log(μ) when converting from natural to log space, and np.exp(μ) when converting from log to natural space. Similarly, for a logit-normal variable, an RV defined as logitnorm(loc=μ, scale=σ2**0.5) in natural space is equivalent to norm(loc=logit(μ), scale=σ2**0.5) in logit space, so this transform should return logit(μ) when converting from natural to logit space, and expit(μ) when converting from logit to natural space.
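The forward and reverse mean transforms described above might be organized as in the following sketch. The lookup table `MEAN_TRANSFORMS` and its keys are hypothetical and do not reflect how gumbi stores its transforms; `logit` and `expit` are written out with the standard library so the block is self-contained.

```python
import math


def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of logit: map a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical table of (forward, reverse) mean transforms per variable kind.
MEAN_TRANSFORMS = {
    "normal": (lambda m: m, lambda m: m),
    "log": (math.log, math.exp),
    "logit": (logit, expit),
}

forward, reverse = MEAN_TRANSFORMS["logit"]
mu_logit = forward(0.5)          # natural -> logit space
mu_natural = reverse(mu_logit)   # logit -> natural space
```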

stdz(name: str | pd.Series, μ: float = None, σ2: float = None) → float | tuple | pd.Series

Transforms, mean-centers, and scales a parameter, distribution, or Series

Parameters:
  • name (str or pd.Series) – Name of parameter. If a Series is supplied, the name of the series must be the parameter name.

  • μ (float, optional) – Value of parameter or mean of parameter distribution. Only optional if first argument is a Series.

  • σ2 (float, optional) – Variance of parameter distribution.

Returns:

Standardized parameter, (mean, variance) of standardized distribution, or standardized Series

Return type:

float, tuple, or pd.Series
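When both μ and σ2 are supplied, the documented examples are consistent with the arithmetic sketched below: the mean is transformed, centered, and scaled, while the variance is divided by the population variance. This is an inference from the example outputs, not gumbi's source; `stdz_distribution` and its parameters are hypothetical names.

```python
import math


def stdz_distribution(mu, var, pop_mu, pop_var, transform=None):
    """Standardize a (mean, variance) pair.

    pop_mu and pop_var are the population mean and variance in transformed
    space. The variance is assumed to be defined in transformed space
    already, so only the mean passes through `transform`.
    """
    t_mu = transform(mu) if transform is not None else mu
    z_mu = (t_mu - pop_mu) / math.sqrt(pop_var)  # center, then scale
    z_var = var / pop_var                        # variance scales by 1/σ2
    return z_mu, z_var


# Mirrors the documented 'x' (normal) and 'd' (log-normal) examples.
x_z = stdz_distribution(1.0, 0.1, pop_mu=1.0, pop_var=0.1)
d_z = stdz_distribution(1.0, 0.1, pop_mu=0.0, pop_var=0.1, transform=math.log)
```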

transform(name: str | pd.Series, μ: float = None, σ2: float = None) → float | tuple | pd.Series

Transforms a parameter, distribution, or Series

Parameters:
  • name (str or pd.Series) – Name of parameter. If a Series is supplied, the name of the series must be the parameter name.

  • μ (float, optional) – Value of parameter or mean of parameter distribution. Only optional if first argument is a Series.

  • σ2 (float, optional) – Variance of parameter distribution.

Returns:

Transformed parameter, (mean, variance) of transformed distribution, or transformed Series

Return type:

float, tuple, or pd.Series

property transforms: dict

Collection of forward and reverse transform functions for each variable

unstdz(name: str | pd.Series, μ: float = None, σ2: float = None) → float | tuple | pd.Series

Untransforms, un-centers, and un-scales a parameter, distribution, or Series

Parameters:
  • name (str or pd.Series) – Name of parameter. If a Series is supplied, the name of the series must be the parameter name.

  • μ (float, optional) – Value of parameter or mean of parameter distribution. Only optional if first argument is a Series.

  • σ2 (float, optional) – Variance of parameter distribution.

Returns:

Unstandardized parameter, (mean, variance) of unstandardized distribution, or unstandardized Series

Return type:

float, tuple, or pd.Series

untransform(name: str | pd.Series, μ: float = None, σ2: float = None) → float | tuple | pd.Series

Untransforms a parameter, distribution, or Series

Parameters:
  • name (str or pd.Series) – Name of parameter. If a Series is supplied, the name of the series must be the parameter name.

  • μ (float, optional) – Value of parameter or mean of parameter distribution. Only optional if first argument is a Series.

  • σ2 (float, optional) – Variance of parameter distribution.

Returns:

Untransformed parameter, (mean, variance) of untransformed distribution, or untransformed Series

Return type:

float, tuple, or pd.Series

classmethod validate(dct: dict)

Ensures provided dictionary has all required attributes

property var_transforms

Function that transforms the variance of a distribution.

These transforms should follow scipy’s conventions such that a distribution can be defined in the given space by passing (loc=μ, scale=σ2**0.5). Accordingly, since both log-normal and logit-normal variables are defined in terms of the scale (standard deviation) in their respective transformed spaces, this function simply returns the variance unchanged in these cases.
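A minimal sketch of what this means for a log-normal variable: the mean passes through the log transform while the variance is returned unchanged. The helper `transform_distribution` is a hypothetical name used only for illustration and is not part of gumbi.

```python
import math


def transform_distribution(mu, var, mean_transform, var_transform=lambda v: v):
    """Transform a (mean, variance) pair; the variance transform defaults
    to the identity, as described for log-normal and logit-normal variables."""
    return mean_transform(mu), var_transform(var)


# Matches the documented result of transforming 'd' with μ=1, σ2=0.1.
t_mu, t_var = transform_distribution(1.0, 0.1, math.log)
```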