xomics.PreProcess

class xomics.PreProcess(col_id='protein_id', col_name='gene_name', str_quant='log2_lfq')[source]

Bases: object

Pre-processing class for quantifications of omics data.

Parameters:
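  • col_id (str) – Name of column with identifiers in DataFrame.

  • col_name (str) – Name of column with sample names in DataFrame.

  • str_quant (str) – Identifier for the LFQ columns in the DataFrame.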

__init__(col_id='protein_id', col_name='gene_name', str_quant='log2_lfq')[source]
Parameters:
  • col_id (str) – Name of column with identifiers in DataFrame.

  • col_name (str) – Name of column with sample names in DataFrame.

  • str_quant (str) – Identifier for the LFQ columns in the DataFrame.

Methods

__init__([col_id, col_name, str_quant])

add_ids([df, list_ids])

Add column with protein ids to DataFrame.

add_significance([df, col_fc, col_pval, ...])

Add a column indicating significance based on thresholds for fold change and p-value.

apply_exp([df, cols, base, neg])

Apply an exponential transformation to specified columns of a DataFrame.

apply_log([df, cols, log2, neg])

Apply a logarithmic transformation to specified columns of a DataFrame.

filter_duplicated_names([df, col, ...])

Filter for duplicated items in columns (e.g., names).

filter_groups([df, groups, min_pct])

Remove samples with missing values unless one group has at least min_pct non-missing values.

filter_nan([df, groups, cols])

Filter missing values based on provided groups or columns (cols).

get_dict_group_qcols([df, groups])

Create a dictionary mapping groups from df to their corresponding quantification columns.

get_dict_qcol_group([df, groups])

Create a dictionary mapping quantification columns to the group they belong to.

get_qcols([df, groups])

Create a list of quantification columns from df based on str_quant and the given groups.

run([df, groups, groups_ctrl, ...])

Perform pairwise t-tests for groups to obtain -log10 p-values and log2 fold changes, with optional p-value correction, nan policy, and log-scale output.

get_qcols(df=None, groups=None)[source]

Create a list of quantification columns from df based on str_quant and the given groups.

Parameters:
  • df (Optional[DataFrame]) – DataFrame with quantifications. Rows typically correspond to proteins and columns to conditions.

  • groups (Optional[list]) – List with names grouping conditions from df columns.

Returns:

List with all quantification columns across all groups

Return type:

cols_quant
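
As a rough illustration of the selection described above, the sketch below collects column names that contain both str_quant and one of the group names. The substring matching rule, the helper select_qcols, and the column names are assumptions made for this example; this is not the xomics implementation.

```python
def select_qcols(columns, groups, str_quant="log2_lfq"):
    """Return columns containing str_quant and any group name (assumed rule)."""
    return [c for c in columns
            if str_quant in c and any(g in c for g in groups)]


# Hypothetical column names: two groups with two replicates each
columns = ["protein_id", "gene_name",
           "log2_lfq_ctrl_1", "log2_lfq_ctrl_2",
           "log2_lfq_treat_1", "log2_lfq_treat_2"]

cols_quant = select_qcols(columns, groups=["ctrl", "treat"])
print(cols_quant)
```

Identifier columns such as protein_id are skipped because they do not contain str_quant.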

get_dict_qcol_group(df=None, groups=None)[source]

Create a dictionary mapping quantification columns to the group they belong to.

Parameters:
  • df (Optional[DataFrame]) – DataFrame with quantifications. Rows typically correspond to proteins and columns to conditions.

  • groups (Optional[list]) – List with names grouping conditions from df columns.

Returns:

Dictionary assigning names of columns with quantifications (keys) to group names (values)

Return type:

dict_qcol_group

get_dict_group_qcols(df=None, groups=None)[source]

Create a dictionary mapping groups from df to their corresponding quantification columns.

Parameters:
  • df (Optional[DataFrame]) – DataFrame with quantifications. Rows typically correspond to proteins and columns to conditions.

  • groups (Optional[list]) – List with names grouping conditions from df columns.

Returns:

Dictionary assigning group names (keys) to list of columns with quantifications (values)

Return type:

dict_group_qcols
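
The two dictionaries are inverse views of the same grouping. The sketch below builds both with plain Python, assuming (hypothetically) that a column belongs to a group when the group name occurs in the column name; the actual matching rule used by the class may differ.

```python
# Hypothetical quantification columns; the substring rule is an assumption
cols_quant = ["log2_lfq_ctrl_1", "log2_lfq_ctrl_2",
              "log2_lfq_treat_1", "log2_lfq_treat_2"]
groups = ["ctrl", "treat"]

# Group name -> list of its quantification columns
dict_group_qcols = {g: [c for c in cols_quant if g in c] for g in groups}

# Quantification column -> group name (inverse mapping)
dict_qcol_group = {c: g for g, cols in dict_group_qcols.items() for c in cols}

print(dict_group_qcols)
print(dict_qcol_group)
```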

filter_nan(df=None, groups=None, cols=None)[source]

Filter missing values based on provided groups or columns (cols).

Parameters:
  • df (Optional[DataFrame]) – DataFrame with quantifications. Rows typically correspond to proteins and columns to conditions.

  • groups (Optional[list]) – List with names grouping conditions from df columns.

  • cols (Optional[list]) – List of columns from df to consider for filtering.

Returns:

The filtered DataFrame.

Return type:

df

Notes

Two options for selecting filtering columns are provided (groups and cols) because removing samples with any missing value is a very strict filtering step.
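
To see why restricting the filtering columns matters, the sketch below uses plain pandas dropna as a stand-in for the filtering step. The DataFrame and its column names are hypothetical, and the code is not the method's actual implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three proteins, one control and one treatment column
df = pd.DataFrame({
    "protein_id": ["P1", "P2", "P3"],
    "log2_lfq_ctrl_1": [20.1, np.nan, 19.5],
    "log2_lfq_treat_1": [21.0, 22.3, np.nan],
})

# Strict: drop every row with a missing value in any quantification column
df_strict = df.dropna(subset=["log2_lfq_ctrl_1", "log2_lfq_treat_1"])

# Narrower: only require complete values in the control column
df_ctrl = df.dropna(subset=["log2_lfq_ctrl_1"])

print(df_strict["protein_id"].tolist())
print(df_ctrl["protein_id"].tolist())
```

The strict variant keeps a single protein, while the narrower one keeps two.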

filter_groups(df=None, groups=None, min_pct=0.8)[source]

Remove samples with missing values unless one group has at least min_pct non-missing values.

Parameters:
  • df (Optional[DataFrame]) – DataFrame with quantifications. Rows typically correspond to proteins and columns to conditions.

  • groups (Optional[list]) – List with names grouping conditions from df columns.

  • min_pct (float) – Minimum fraction of non-missing values (between 0 and 1) required in at least one group.

Returns:

The filtered DataFrame.

Return type:

df
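
One way to read the rule above is that a row is kept when at least one group reaches the min_pct share of non-missing values. The pandas sketch below follows that reading; the data, the group-to-column mapping, and the use of >= are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups with two replicates each
df = pd.DataFrame({
    "ctrl_1":  [1.0, np.nan, np.nan],
    "ctrl_2":  [2.0, np.nan, 1.5],
    "treat_1": [3.0, 2.0,    np.nan],
    "treat_2": [4.0, 2.5,    np.nan],
})
dict_group_qcols = {"ctrl": ["ctrl_1", "ctrl_2"],
                    "treat": ["treat_1", "treat_2"]}
min_pct = 0.8

# Fraction of non-missing values per row, computed separately for each group
frac = pd.DataFrame({g: df[cols].notna().mean(axis=1)
                     for g, cols in dict_group_qcols.items()})

# Keep a row if at least one group reaches the threshold
df_filtered = df[(frac >= min_pct).any(axis=1)]
print(df_filtered.index.tolist())
```

The last row is dropped because neither group reaches the threshold (0.5 for ctrl, 0.0 for treat).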

static filter_duplicated_names(df=None, col=None, str_split=';', split_names=False)[source]

Filter for duplicated items in columns (e.g., names). Items can be split by str_split.

Parameters:
  • df (Optional[DataFrame]) – The DataFrame with quantifications to filter. Rows typically correspond to proteins and columns to conditions.

  • col (Optional[str]) – Column from df in which to perform string splitting and filtering.

  • str_split (str) – The string character(s) to use for splitting string values in col.

  • split_names (bool) – Whether to split names using str_split in the specified col.

Returns:

The modified and filtered DataFrame.

Return type:

df

static apply_log(df=None, cols=None, log2=True, neg=False)[source]

Apply a logarithmic transformation to specified columns of a DataFrame.

Parameters:
  • df (Optional[DataFrame]) – DataFrame containing data to transform.

  • cols (Optional[list]) – Names of columns to apply the logarithmic transformation to.

  • log2 (bool) – If True, apply a log2 transformation. Otherwise, apply a log10 transformation.

  • neg (bool) – If True, multiply the logarithmic result by -1.

Returns:

DataFrame with specified columns log-transformed.

Return type:

df

Notes

  • Make sure all values in the specified columns of df are > 0 before applying this function.

  • NaN values will remain NaN after the transformation.
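
The effect of the log2 and neg options can be pictured with NumPy directly. The sketch below computes a -log10 transformation (corresponding to log2=False, neg=True) on a hypothetical p-value column and checks the positivity precondition first; it is an illustration, not the library code.

```python
import numpy as np
import pandas as pd

# Hypothetical p-values, including one missing entry
df = pd.DataFrame({"pval": [0.01, 0.1, np.nan]})
cols = ["pval"]

# Precondition from the notes: all non-missing values must be > 0
all_positive = bool((df[cols].dropna() > 0).all().all())

# -log10 transformation; NaN entries stay NaN
df["neg_log10_pval"] = -np.log10(df["pval"])
print(all_positive)
print(df)
```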

static apply_exp(df=None, cols=None, base=2.0, neg=False)[source]

Apply an exponential transformation to specified columns of a DataFrame.

Parameters:
  • df (Optional[DataFrame]) – DataFrame containing data to transform.

  • cols (Optional[list]) – Names of columns to apply the exponential transformation to.

  • base (int) – The base of the exponential function. If base=2, a 2**x transformation is applied; if base=10, a 10**x transformation is applied.

  • neg (bool) – If True, multiply the exponential result by -1.

Returns:

DataFrame with specified columns exponentially transformed.

Return type:

df

Notes

  • NaN values will remain NaN after the transformation.
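
A typical use of the exponential transformation is to bring log2 values back to the linear scale. The sketch below does this with NumPy for base=2; the column names are hypothetical and the code only mirrors the documented behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical log2-transformed intensities, including one missing entry
df = pd.DataFrame({"log2_lfq": [10.0, 12.0, np.nan]})

# Back-transform to the linear scale (base=2 corresponds to 2**x)
df["lfq"] = np.power(2.0, df["log2_lfq"])
print(df)
```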

add_ids(df=None, list_ids=None)[source]

Add column with protein ids to DataFrame.

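Parameters:
  • df (Optional[DataFrame]) – DataFrame to which the column with protein ids is added.

  • list_ids (Optional[list]) – List of protein ids to add.
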
Returns:

DataFrame with added column of protein ids.

Return type:

df

static add_significance(df=None, col_fc=None, col_pval=None, th_fc=0.5, th_pval=0.05)[source]

Add a column indicating significance based on thresholds for fold change and p-value.

Three types of significance classes are defined:

  • Up: Significant hits that are ‘up-regulated’ (i.e., right quadrant of volcano plot)

  • Down: Significant hits that are ‘down-regulated’ (i.e., left quadrant of volcano plot)

  • Not Sig.: Hits that are not significant.

Parameters:
  • df (Optional[DataFrame]) – DataFrame containing fold-change and p-values.

  • col_fc (Optional[str]) – Column name containing fold change values.

  • col_pval (Optional[str]) – Column name containing p-values.

  • th_fc (float) – Threshold for fold-change, applied for negative and positive values.

  • th_pval (float) – Threshold for p-value; it is -log10 transformed before being applied.

Returns:

DataFrame with added significance column.

Return type:

df
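
The three classes can be sketched with NumPy as follows. This example assumes that the p-value column already holds -log10 p-values, that thresholds are inclusive, and that the result goes into a column named sig_class; all three are assumptions for illustration, not confirmed details of the method.

```python
import numpy as np
import pandas as pd

# Hypothetical volcano-plot data: log2 fold changes and -log10 p-values
df = pd.DataFrame({
    "log2_fc": [1.2, -0.9, 0.1, 2.0],
    "neg_log10_pval": [3.0, 2.0, 4.0, 0.5],
})
th_fc, th_pval = 0.5, 0.05

# Bring the p-value threshold to the -log10 scale (about 1.30 for 0.05)
th_pval_log = -np.log10(th_pval)

significant = df["neg_log10_pval"] >= th_pval_log
up = significant & (df["log2_fc"] >= th_fc)
down = significant & (df["log2_fc"] <= -th_fc)

# Rows matching neither condition fall back to "Not Sig."
df["sig_class"] = np.select([up, down], ["Up", "Down"], default="Not Sig.")
print(df["sig_class"].tolist())
```

The third row has a strong p-value but a fold change below th_fc, and the fourth has a large fold change but a weak p-value, so both end up as "Not Sig.".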

run(df=None, groups=None, groups_ctrl=None, pvals_correction=None, pvals_neg_log10=True)[source]

Perform pairwise t-tests for groups to obtain -log10 p-values and log2 fold changes, with optional p-value correction, nan policy, and log-scale output.

Parameters:
  • df (Optional[DataFrame]) – DataFrame with quantifications. Rows typically correspond to proteins and columns to conditions.

  • groups (Optional[list]) – List with names grouping conditions from df columns.

  • groups_ctrl (Optional[list]) – List with names of control groups grouping conditions from df columns.

  • pvals_correction (Optional[str]) – Correction method for t-tests {“bonferroni”, “sidak”, “holm”, “hommel”, “fdr_bh”}.

  • pvals_neg_log10 (bool) – Whether to return p-values in -log10 scale.

Returns:

DataFrame with p-values and log2 fold changes for each group comparison.

Return type:

df_fc

Notes

Fold changes (FC) and p-values will be computed for each group in groups compared against each group in groups_ctrl (group/groups_ctrl), where self-comparison is omitted.
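
The comparison scheme from the note can be enumerated with the standard library. The sketch below lists every group against every control group while skipping self-comparisons, and then computes a log2 fold change as the difference of group means on log2 data. The group names, the values, and the difference-of-means definition are assumptions for this example; the t-test itself is not shown.

```python
from itertools import product
from statistics import mean

groups = ["ctrl", "treat_a", "treat_b"]
groups_ctrl = ["ctrl"]

# Every group against every control group, skipping self-comparisons
comparisons = [(g, c) for g, c in product(groups, groups_ctrl) if g != c]
print(comparisons)

# Hypothetical log2 intensities of one protein, two replicates per group
values = {"ctrl": [20.0, 20.4],
          "treat_a": [21.0, 21.4],
          "treat_b": [19.0, 19.4]}

# On log2 data, the log2 fold change is taken here as a difference of means
log2_fc = {f"{g}/{c}": mean(values[g]) - mean(values[c])
           for g, c in comparisons}
print(log2_fc)
```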