xomics.PreProcess

class xomics.PreProcess(col_id='protein_id', col_name='gene_name', str_quant='log2_lfq')[source]

Bases: object

Pre-processing class for quantifications of omics data.

Parameters:

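col_id (str) – Name of column with identifiers in DataFrame.
col_name (str) – Name of column with sample names in DataFrame.
str_quant (str) – Identifier for the LFQ columns in the DataFrame.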

__init__(col_id='protein_id', col_name='gene_name', str_quant='log2_lfq')[source]

Parameters:

col_id (str) – Name of column with identifiers in DataFrame.
col_name (str) – Name of column with sample names in DataFrame.
str_quant (str) – Identifier for the LFQ columns in the DataFrame.

Methods

`__init__`([col_id, col_name, str_quant])	Initialize the pre-processing object with column names and the quantification identifier.
`add_ids`([df, list_ids])	Add column with protein ids to DataFrame.
`add_significance`([df, col_fc, col_pval, ...])	Add a column indicating significance based on thresholds for fold change and p-value.
`apply_exp`([df, cols, base, neg])	Apply an exponential transformation to specified columns of a DataFrame.
`apply_log`([df, cols, log2, neg])	Apply a logarithmic transformation to specified columns of a DataFrame.
`filter_duplicated_names`([df, col, ...])	Filter for duplicated items in columns (e.g., names).
`filter_groups`([df, groups, min_pct])	Remove samples with missing values unless one group has at least `min_pct` non-missing values.
`filter_nan`([df, groups, cols])	Filter missing values based on provided `groups` or columns (`cols`).
`get_dict_group_qcols`([df, groups])	Create a dictionary mapping groups from df to their corresponding quantification columns.
`get_dict_qcol_group`([df, groups])	Create a dictionary mapping quantification columns to the group they belong to.
`get_qcols`([df, groups])	Create a list of quantification columns from df based on str_quant and the given groups.
`run`([df, groups, groups_ctrl, ...])	Perform pairwise t-tests for groups to obtain -log10 p-values and log2 fold changes, with optional p-value correction, nan policy, and log-scale output.

get_qcols(df=None, groups=None)[source]

Create a list of quantification columns from df based on str_quant and the given groups.

Parameters:

df (Optional[DataFrame]) – DataFrame with quantifications. Rows typically correspond to proteins and columns to conditions.
groups (Optional[list]) – List with names grouping conditions from df columns.

Returns:

List with all quantification columns across all groups

Return type:

cols_quant
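
The selection described above can be sketched in plain Python. This is a minimal illustration only: it assumes a column counts as a quantification column when its name contains both str_quant and one of the group names. That matching rule, the helper name select_qcols, and the example column names are hypothetical and not taken from the library.

```python
def select_qcols(columns, groups, str_quant="log2_lfq"):
    """Return columns containing str_quant and any group name (assumed rule)."""
    return [c for c in columns
            if str_quant in c and any(g in c for g in groups)]

# Hypothetical column names of a quantification table
columns = ["protein_id", "gene_name",
           "log2_lfq_ctrl_1", "log2_lfq_ctrl_2",
           "log2_lfq_treat_1", "log2_lfq_treat_2"]

cols_quant = select_qcols(columns, groups=["ctrl", "treat"])
print(cols_quant)
```

Identifier and name columns are skipped because they do not contain str_quant.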

get_dict_qcol_group(df=None, groups=None)[source]

Create a dictionary mapping quantification columns to the group they belong to.

Parameters:

df (Optional[DataFrame]) – DataFrame with quantifications. Rows typically correspond to proteins and columns to conditions.
groups (Optional[list]) – List with names grouping conditions from df columns.

Returns:

Dictionary assigning names of columns with quantifications (keys) to group names (values)

Return type:

dict_qcol_group

get_dict_group_qcols(df=None, groups=None)[source]

Create a dictionary mapping groups from df to their corresponding quantification columns.

Parameters:

df (Optional[DataFrame]) – DataFrame with quantifications. Rows typically correspond to proteins and columns to conditions.
groups (Optional[list]) – List with names grouping conditions from df columns.

Returns:

Dictionary assigning group names (keys) to list of columns with quantifications (values)

Return type:

dict_group_qcols
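
The two mappings returned by get_dict_qcol_group and get_dict_group_qcols are inverses of each other. The sketch below shows their shape with plain dictionaries. It assumes a column is assigned to the first group whose name occurs in the column name; this rule and the helper name build_group_maps are hypothetical.

```python
def build_group_maps(cols_quant, groups):
    """Build column-to-group and group-to-columns mappings (assumed matching rule)."""
    dict_qcol_group = {}
    dict_group_qcols = {g: [] for g in groups}
    for col in cols_quant:
        for g in groups:
            if g in col:  # assumption: group name is a substring of the column name
                dict_qcol_group[col] = g
                dict_group_qcols[g].append(col)
                break
    return dict_qcol_group, dict_group_qcols

cols_quant = ["log2_lfq_ctrl_1", "log2_lfq_ctrl_2", "log2_lfq_treat_1"]
dict_qcol_group, dict_group_qcols = build_group_maps(cols_quant, ["ctrl", "treat"])
print(dict_qcol_group)
print(dict_group_qcols)
```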

filter_nan(df=None, groups=None, cols=None)[source]

Filter missing values based on provided groups or columns (cols).

Parameters:

df (Optional[DataFrame]) – DataFrame with quantifications. Rows typically correspond to proteins and columns to conditions.
groups (Optional[list]) – List with names grouping conditions from df columns.
cols (Optional[list]) – List of columns from df to consider for filtering.

Returns:

The filtered DataFrame.

Return type:

df

Notes

Two options for selecting filtering columns are provided (groups and cols) because removing samples with any missing value is a very strict filtering step.

filter_groups(df=None, groups=None, min_pct=0.8)[source]

Remove samples with missing values unless one group has at least min_pct non-missing values.

Parameters:

df (Optional[DataFrame]) – DataFrame with quantifications. Rows typically correspond to proteins and columns to conditions.
groups (Optional[list]) – List with names grouping conditions from df columns.
min_pct (float) – Minimum percentage threshold of non-missing values in at least one group.

Returns:

The filtered DataFrame.

Return type:

df
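
One way to read the min_pct rule is sketched below with plain dictionaries instead of a DataFrame. It assumes a row is kept when at least one group has a fraction of non-missing values greater than or equal to min_pct. The helper name keep_row, the example data, and the inclusive comparison are assumptions for illustration.

```python
import math

def keep_row(row, dict_group_qcols, min_pct=0.8):
    """Keep a row if any group reaches min_pct non-missing values (assumed rule)."""
    for cols in dict_group_qcols.values():
        if not cols:
            continue
        n_valid = sum(not math.isnan(row[c]) for c in cols)
        if n_valid / len(cols) >= min_pct:
            return True
    return False

nan = float("nan")
dict_group_qcols = {"a": ["a_1", "a_2"], "b": ["b_1", "b_2"]}
rows = {
    "P1": {"a_1": 1.0, "a_2": 2.0, "b_1": nan, "b_2": nan},  # group a is complete
    "P2": {"a_1": nan, "a_2": 2.0, "b_1": nan, "b_2": 1.0},  # both groups at 50%
}
kept = [name for name, row in rows.items() if keep_row(row, dict_group_qcols)]
print(kept)
```

Under this reading, rows without any missing values always pass, since every group is then complete.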

static filter_duplicated_names(df=None, col=None, str_split=';', split_names=False)[source]

Filter for duplicated items in columns (e.g., names). Items can be split by str_split.

Parameters:

df (Optional[DataFrame]) – The DataFrame with quantifications to filter. Rows typically correspond to proteins and columns to conditions.
col (Optional[str]) – Column from df in which to perform string splitting and filtering.
str_split (str) – The string character(s) to use for splitting string values in col.
split_names (bool) – Whether to split names using str_split in the specified col.

Returns:

The modified and filtered DataFrame.

Return type:

df

static apply_log(df=None, cols=None, log2=True, neg=False)[source]

Apply a logarithmic transformation to specified columns of a DataFrame.

Parameters:

df (Optional[DataFrame]) – DataFrame containing data to transform.
cols (Optional[list]) – Names of columns to apply the logarithmic transformation to.
log2 (bool) – If True, apply a log2 transformation. Otherwise, apply a log10 transformation.
neg (bool) – If True, multiply the logarithmic result by -1.

Returns:

DataFrame with specified columns log-transformed.

Return type:

df

Notes

Make sure all values in the specified columns of df are > 0 before applying this function.
NaN values will remain NaN after the transformation.
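
The effect of the log2 and neg parameters can be illustrated on a plain list of values. This is a sketch of the described behaviour using the standard library, not the library's implementation; the helper name log_transform is hypothetical.

```python
import math

def log_transform(values, log2=True, neg=False):
    """Apply log2 or log10 to positive values, optionally negated; NaN stays NaN."""
    log = math.log2 if log2 else math.log10
    sign = -1.0 if neg else 1.0
    return [v if math.isnan(v) else sign * log(v) for v in values]

# Typical use case: converting p-values to the -log10 scale
pvals = [0.01, 0.1, float("nan")]
neg_log10 = log_transform(pvals, log2=False, neg=True)
print(neg_log10)
```

Values that are zero or negative would raise an error in this sketch, which is why they should be checked beforehand.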

static apply_exp(df=None, cols=None, base=2.0, neg=False)[source]

Apply an exponential transformation to specified columns of a DataFrame.

Parameters:

df (Optional[DataFrame]) – DataFrame containing data to transform.
cols (Optional[list]) – Names of columns to apply the exponential transformation to.
base (int) – The base of the exponential function. With base=2, a 2**x transformation is applied; with base=10, a 10**x transformation is applied.
neg (bool) – If True, multiply the exponential result by -1.

Returns:

DataFrame with specified columns exponentially transformed.

Return type:

df

Notes

NaN values will remain NaN after the transformation.
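
The exponential transformation reverses a logarithmic one with the same base. The sketch below illustrates this on a plain list, assuming that neg multiplies the result by -1 as described above; the helper name exp_transform is hypothetical.

```python
import math

def exp_transform(values, base=2.0, neg=False):
    """Apply base**x to each value, optionally negated; NaN stays NaN."""
    sign = -1.0 if neg else 1.0
    return [v if math.isnan(v) else sign * base ** v for v in values]

# Convert log2-scaled intensities back to the linear scale
log2_lfq = [3.0, 10.0, float("nan")]
raw = exp_transform(log2_lfq, base=2.0)
print(raw)
```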

add_ids(df=None, list_ids=None)[source]

Add column with protein ids to DataFrame.

Parameters:

df (Optional[DataFrame]) – DataFrame containing fold-change and p-values.
list_ids (Optional[ArrayLike1D]) – List or array of protein/gene identifiers.

Returns:

DataFrame with added column of protein ids.

Return type:

df

static add_significance(df=None, col_fc=None, col_pval=None, th_fc=0.5, th_pval=0.05)[source]

Add a column indicating significance based on thresholds for fold change and p-value.

Three types of significance classes are defined:

Up: Significant hits that are ‘up-regulated’ (i.e., right quadrant of volcano plot).
Down: Significant hits that are ‘down-regulated’ (i.e., left quadrant of volcano plot).
Not Sig.: Hits that are not significant.

Parameters:

df (Optional[DataFrame]) – DataFrame containing fold-change and p-values.
col_fc (Optional[str]) – Column name containing fold change values.
col_pval (Optional[str]) – Column name containing p-values.
th_fc (float) – Threshold for fold-change, applied for negative and positive values.
th_pval (float) – Threshold for p-value, which is -log10 transformed before being applied.

Returns:

DataFrame with added significance column.

Return type:

df
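
The three significance classes can be sketched as a simple decision rule. The example assumes that the p-value column already holds -log10 p-values, that th_pval is given on the raw scale and converted with -log10, and that both comparisons are inclusive. These details and the helper name classify are assumptions made for illustration.

```python
import math

def classify(fc, neg_log10_pval, th_fc=0.5, th_pval=0.05):
    """Assign 'Up', 'Down', or 'Not Sig.' to one hit (assumed decision rule)."""
    th = -math.log10(th_pval)  # threshold on the -log10 scale, about 1.3 for 0.05
    significant = neg_log10_pval >= th
    if significant and fc >= th_fc:
        return "Up"
    if significant and fc <= -th_fc:
        return "Down"
    return "Not Sig."

# (log2 fold change, -log10 p-value) pairs for four hypothetical hits
hits = [(1.2, 2.0), (-0.8, 3.0), (2.0, 0.5), (0.1, 4.0)]
labels = [classify(fc, p) for fc, p in hits]
print(labels)
```

A hit needs to pass both thresholds to be labelled Up or Down, which corresponds to the upper right and upper left regions of a volcano plot.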

run(df=None, groups=None, groups_ctrl=None, pvals_correction=None, pvals_neg_log10=True)[source]

Perform pairwise t-tests for groups to obtain -log10 p-values and log2 fold changes, with optional p-value correction, nan policy, and log-scale output.

Parameters:

df (Optional[DataFrame]) – DataFrame with quantifications. Rows typically correspond to proteins and columns to conditions.
groups (Optional[list]) – List with names grouping conditions from df columns.
groups_ctrl (Optional[list]) – List with names of control groups grouping conditions from df columns.
pvals_correction (Optional[str]) – Correction method for t-tests {“bonferroni”, “sidak”, “holm”, “hommel”, “fdr_bh”}.
pvals_neg_log10 (bool) – Whether to return p-values in -log10 scale.

Returns:

DataFrame with p-values and log2 fold changes for each group comparison.

Return type:

df_fc

Notes

Fold changes (FC) and p-values are computed for each group in groups compared against each group in groups_ctrl (group/groups_ctrl); self-comparisons are omitted.
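
The pairing of groups against control groups can be sketched as follows. The example only covers which comparisons are formed and how a log2 fold change could be derived; the t-test itself is left out. It assumes the quantifications are already log2-scaled, so that the fold change is the difference of group means. The helper names group_pairs and log2_fc are hypothetical.

```python
from itertools import product
from statistics import mean

def group_pairs(groups, groups_ctrl):
    """All (group, control) comparisons, omitting self-comparisons."""
    return [(g, c) for g, c in product(groups, groups_ctrl) if g != c]

def log2_fc(values_group, values_ctrl):
    """Log2 fold change as difference of mean log2 values (assumed definition)."""
    return mean(values_group) - mean(values_ctrl)

pairs = group_pairs(["a", "b", "ctrl"], ["ctrl"])
print(pairs)

# Hypothetical log2 intensities of one protein in group "a" and in "ctrl"
fc = log2_fc([12.0, 12.5], [10.0, 10.5])
print(fc)
```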