DataSet

DataSet(data, outputs[, names_column, ...])

Container for tabular data, allowing simple access to standardized values and wide or tidy dataframe formats.

Methods

`DataSet.from_tidy`(tidy[, outputs, ...])	Constructs a DataSet from a tidy-form dataframe.
`DataSet.from_wide`(wide[, outputs, ...])	Constructs a DataSet from a wide-form dataframe.
`DataSet.update_stdzr`()	Updates internal `Standardizer` with current data, `log_vars`, and `logit_vars`.

Attributes

`DataSet.float_inputs`	Columns of dataframe with "float64" dtype.
`DataSet.inputs`	Columns of dataframe not contained in `outputs`.
`DataSet.isotropic_vars`
`DataSet.log_vars`
`DataSet.logit_vars`
`DataSet.names_column`
`DataSet.specs`	Provides keyword arguments for easy instantiation of a similar `DataSet`.
`DataSet.stdzr`
`DataSet.tidy`	Tidy-form view of data
`DataSet.values_column`
`DataSet.wide`	Wide-form view of data
`DataSet.data`
`DataSet.outputs`

class gumbi.aggregation.DataSet(data: DataFrame, outputs: list, names_column: str = 'Variable', values_column: str = 'Value', log_vars: list = None, logit_vars: list = None, isotropic_vars: list = None, stdzr: Standardizer = None)

Bases: object

Container for tabular data, allowing simple access to standardized values and wide or tidy dataframe formats.

DataSet is instantiated with a wide-form dataframe, with all outputs of a given observation in a single row, but allows easy access to the corresponding tidy dataframe, with each output in a separate row (the class method from_tidy() also allows construction from tidy data). The titles of the tidy-form columns for the output names and their values are supplied at instantiation, defaulting to “Variable” and “Value”.

For example, say we have an observation at position (x,y) with measurements of i, j, and k. The wide-form dataframe would have one column for each of x, y, i, j, and k, while the tidy-form dataframe would have a column for each of x and y, a “Variable” column where each row contains either “i”, “j”, or “k” as strings, and a “Value” column containing the corresponding measurement. Wide data is more space-efficient and perhaps more intuitive to construct and inspect, while tidy data more clearly distinguishes inputs and outputs. These views are accessible through the wide and tidy attributes as instances of WideData and TidyData, respectively.
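The (x, y) / (i, j, k) example above can be sketched in plain Python. This is only an illustration of the two layouts, not gumbi's implementation: the helper `wide_to_tidy` and the sample numbers are hypothetical, and rows are modeled as dictionaries rather than dataframes.

```python
# Plain-Python illustration of wide vs. tidy layouts (not gumbi's code).
# Sample values are made up for the example.
wide_rows = [
    {"x": 0.0, "y": 1.0, "i": 10.0, "j": 20.0, "k": 30.0},
    {"x": 0.5, "y": 2.0, "i": 11.0, "j": 21.0, "k": 31.0},
]
inputs = ["x", "y"]
outputs = ["i", "j", "k"]


def wide_to_tidy(rows, inputs, outputs,
                 names_column="Variable", values_column="Value"):
    """Hypothetical helper: emit one tidy row per (observation, output) pair."""
    tidy = []
    for row in rows:
        for name in outputs:
            record = {col: row[col] for col in inputs}  # inputs repeat on every row
            record[names_column] = name                 # output name as a string
            record[values_column] = row[name]           # corresponding measurement
            tidy.append(record)
    return tidy


tidy_rows = wide_to_tidy(wide_rows, inputs, outputs)
for record in tidy_rows:
    print(record)
```

Two wide rows with three outputs each yield six tidy rows, which matches the relationship between the wide and tidy row counts shown in the Examples section.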

As a container for WideData and TidyData, this class also provides simple access to standardized values of the data through wide.z and tidy.z, or to transformed values through wide.t and tidy.t. A Standardizer instance can be supplied as a keyword argument; otherwise, one will be constructed automatically from the supplied dataframe with the supplied values of log_vars and logit_vars. Unlike standalone WideData and TidyData instances, the wide and tidy attributes of a DataSet can be altered and sliced while retaining their functionality, subject to a cursory integrity check. The Standardizer instance can be updated with update_stdzr(), for example following manipulation of the data or alteration of log_vars and logit_vars.
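To make the distinction between transformed (.t) and standardized (.z) values concrete, here is a minimal sketch under stated assumptions: that a log-normal variable is transformed with a natural log, a logit-normal variable with a logit, and that standardization then subtracts the mean and divides by the sample standard deviation. The source does not spell out these details, so the functions `log_standardize` and `logit` below are hypothetical and do not reproduce the Standardizer API.

```python
import math
import statistics


def log_standardize(values):
    """Hypothetical sketch: log-transform, then z-score the transformed values.

    Assumes natural log and the sample standard deviation; the actual
    Standardizer may differ in both respects.
    """
    t = [math.log(v) for v in values]        # analogous to the ".t" view
    mu = statistics.mean(t)
    sigma = statistics.stdev(t)
    z = [(ti - mu) / sigma for ti in t]      # analogous to the ".z" view
    return t, z


def logit(p):
    """Logit transform for a value strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


t, z = log_standardize([1.0, 10.0, 100.0])
print(t)           # log-transformed values, evenly spaced
print(z)           # approximately [-1.0, 0.0, 1.0]
print(logit(0.5))  # 0.0
```

Values that are evenly spaced on a log scale become evenly spaced z-scores, which is the practical reason for declaring a variable in log_vars before standardizing.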

Parameters:

data (pd.DataFrame) – A wide-form dataframe. See class method from_tidy() for instantiation from tidy data.
outputs (list) – Columns of data to be treated as outputs.
names_column (str, default 'Variable') – Name to be used in tidy view for column containing output names.
values_column (str, default 'Value') – Name to be used in tidy view for column containing output values.
log_vars (list, optional) – List of input and output variables to be treated as log-normal. Ignored if stdzr is supplied.
logit_vars (list, optional) – List of input and output variables to be treated as logit-normal. Ignored if stdzr is supplied.
stdzr (Standardizer, optional) – A Standardizer instance. If not supplied, one will be created automatically.

Examples

>>> df = pd.read_pickle(test_data / 'estimates_test_data.pkl')
>>> ds = DataSet.from_tidy(df, names_column='Parameter', log_vars=['Y', 'c', 'b'], logit_vars=['X', 'e'])
>>> ds
DataSet:
    wide: [66 rows x 13 columns]
    tidy: [396 rows x 9 columns]
    outputs: ['e', 'f', 'b', 'c', 'a', 'd']
    inputs: ['Code', 'Target', 'Y', 'X', 'Reaction', 'lg10_Z', 'Metric']

>>> ds.wide = ds.wide.drop(range(0,42,2))
>>> ds
DataSet:
    wide: [45 rows x 13 columns]
    tidy: [270 rows x 9 columns]
    outputs: ['e', 'f', 'b', 'c', 'a', 'd']
    inputs: ['Code', 'Target', 'Y', 'X', 'Reaction', 'lg10_Z', 'Metric']

>>> ds.tidy.z  # tidy-form dataframe with standardized values
>>> ds.wide.z  # wide-form dataframe with standardized values

property float_inputs: Columns of dataframe with “float64” dtype.

classmethod from_tidy(tidy, outputs=None, names_column='Variable', values_column='Value', stdzr=None, log_vars=None, logit_vars=None): Constructs a DataSet from a tidy-form dataframe. See DataSet for explanation of arguments.
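Conceptually, building a wide view from tidy data means grouping tidy rows that share the same input values and spreading the names column into one column per output. The sketch below shows that pivot in plain Python; `tidy_to_wide` is a hypothetical helper for illustration and is not how from_tidy() is implemented. It uses 'Parameter' as the names column, as in the Examples section, with made-up values.

```python
def tidy_to_wide(tidy_rows, names_column="Variable", values_column="Value"):
    """Hypothetical helper: pivot tidy records into one wide row per observation."""
    wide = {}
    for record in tidy_rows:
        # Everything except the names/values columns identifies the observation.
        key = tuple(
            (col, val) for col, val in record.items()
            if col not in (names_column, values_column)
        )
        # Start the wide row from the identifying columns, then add this output.
        row = wide.setdefault(key, dict(key))
        row[record[names_column]] = record[values_column]
    return list(wide.values())


# Made-up tidy data with a custom names column.
tidy_rows = [
    {"x": 0.0, "y": 1.0, "Parameter": "i", "Value": 10.0},
    {"x": 0.0, "y": 1.0, "Parameter": "j", "Value": 20.0},
    {"x": 0.5, "y": 2.0, "Parameter": "i", "Value": 11.0},
    {"x": 0.5, "y": 2.0, "Parameter": "j", "Value": 21.0},
]

wide_rows = tidy_to_wide(tidy_rows, names_column="Parameter")
for row in wide_rows:
    print(row)
```

In this sketch the output names are inferred from the distinct entries of the names column, which is one plausible reading of why outputs is optional in from_tidy(); the source does not confirm this behavior.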

classmethod from_wide(wide, outputs=None, names_column='Variable', values_column='Value', stdzr=None, log_vars=None, logit_vars=None): Constructs a DataSet from a wide-form dataframe. See DataSet for explanation of arguments.

property inputs: Columns of dataframe not contained in outputs.

property specs: Provides keyword arguments for easy instantiation of a similar DataSet.

property tidy: TidyData: Tidy-form view of data

update_stdzr(): Updates internal Standardizer with current data, log_vars, and logit_vars.

property wide: WideData: Wide-form view of data