xomics.cImpute

class xomics.cImpute(col_id='protein_id', col_name='gene_name', str_quant='log2_lfq')

Bases: object

Transparent hybrid data imputation method.

cImpute (conditional Imputation) is a transparent hybrid imputation algorithm designed to address missing values (MVs) in (prote)omics data. The types of missing values can be broadly categorized into three groups based on their nature and the reasons behind their occurrence, as detailed in [Lazar16] and [Wei18]:

  • Missing Completely At Random (MCAR): MVs resulting from random errors in data acquisition. Due to the inherent randomness of MCAR MVs, they cannot be explained purely by measured intensities and are uniformly distributed.

  • Missing At Random (MAR): MVs due to data processing flaws and variable dependencies. While MAR includes all MCAR MVs, its distribution is speculative and varies significantly across different experiments.

  • Missing Not At Random (MNAR): MVs caused by experimental biases like detection limits in mass-spectrometry. They often follow a left-censored Gaussian distribution, indicating truncation at lower abundances.
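To make the difference between MCAR and MNAR concrete, here is a minimal, dependency-free simulation. It is only an illustration and not part of xomics: the function name `simulate_missing`, the detection limit, and the drop rate are all hypothetical, and the intensities are synthetic.

```python
import random


def simulate_missing(values, detection_limit, mcar_rate, seed=0):
    """Return a copy of `values` with None marking simulated missing values.

    MNAR: every value below `detection_limit` is dropped (left-censoring).
    MCAR: each remaining value is dropped with probability `mcar_rate`,
    independent of its intensity.
    """
    rng = random.Random(seed)
    out = []
    for v in values:
        if v < detection_limit:
            out.append(None)  # MNAR: missing because abundance is too low
        elif rng.random() < mcar_rate:
            out.append(None)  # MCAR: missing regardless of abundance
        else:
            out.append(v)
    return out


# Synthetic log2 intensities (assumed values, for illustration only)
rng = random.Random(42)
intensities = [rng.gauss(25.0, 2.0) for _ in range(1000)]

observed = simulate_missing(intensities, detection_limit=22.0, mcar_rate=0.05)
detected = [v for v in observed if v is not None]

print(f"missing: {observed.count(None)} of {len(observed)}")
print(f"lowest detected value: {min(detected):.2f}")
```

After censoring, no detected value lies below the detection limit, which is the left-truncated shape described for MNAR, while the MCAR gaps are spread over the whole intensity range.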

Notes

cImpute imputes only those MVs that meet well-defined confidence criteria. The approach comprises four main steps:

  1. Establishing the upper bound for MNAR MVs to distinguish between MNAR and MCAR.

  2. Categorizing MVs for detected proteins within a specific experimental group.

  3. Calculating a confidence score (CS) for each protein.

  4. Performing group-wise imputation for proteins whose CS exceeds a predefined threshold.
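Step 2 can be sketched as follows. The exact categorization rule is not given here, so the rule below is an assumption made for illustration: a group whose detected values all lie at or below up_mnar (or that has no detected values) is treated as MNAR, a group whose detected values all lie above up_mnar as MCAR, and anything in between as MAR. The function name `classify_group_mvs` is hypothetical, and the confidence score of step 3 is not reproduced because its formula is not stated.

```python
def classify_group_mvs(values, up_mnar):
    """Classify the missing values of one protein within one group.

    `values` holds the intensities of the group's samples, with None for MVs.
    Returns "MNAR", "MCAR", "MAR", or None if the group has no MVs.
    NOTE: this rule is an illustrative assumption, not the library's code.
    """
    detected = [v for v in values if v is not None]
    if len(detected) == len(values):
        return None  # nothing is missing in this group
    if not detected or max(detected) <= up_mnar:
        return "MNAR"  # only low (or no) intensities detected
    if min(detected) > up_mnar:
        return "MCAR"  # all detected intensities are well above the bound
    return "MAR"  # detected intensities straddle the bound


up_mnar = 22.5  # assumed upper bound for MNAR MVs
for group in ([None, None, None], [25.0, None, 26.0], [21.0, None, 26.0]):
    print(group, "->", classify_group_mvs(group, up_mnar))
```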

Parameters:
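  • col_id (str) – Name of column with identifiers in DataFrame.

  • col_name (str) – Name of column with sample names in DataFrame.

  • str_quant (str) – Identifier for the LFQ columns in the DataFrame.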

__init__(col_id='protein_id', col_name='gene_name', str_quant='log2_lfq')
Parameters:
  • col_id (str) – Name of column with identifiers in DataFrame.

  • col_name (str) – Name of column with sample names in DataFrame.

  • str_quant (str) – Identifier for the LFQ columns in the DataFrame.

Methods

__init__([col_id, col_name, str_quant])

get_limits([df, groups, loc_pct_upmnar, ...])

Get minimum of detected values (d_min, i.e., detection limit), upper bound of MNAR MVs (up_mnar), and maximum of detected values (d_max).

run([df, groups, loc_pct_upmnar, min_cs, ...])

Run cImpute algorithm.

get_limits(df=None, groups=None, loc_pct_upmnar=0.25, cols_quant=None)

Get minimum of detected values (d_min, i.e., detection limit), upper bound of MNAR MVs (up_mnar), and maximum of detected values (d_max).

Parameters:
  • df (Optional[DataFrame]) – DataFrame containing quantified values with MVs. Rows typically correspond to proteins and columns to conditions.

  • groups (Optional[ArrayLike1D]) – List of quantification groups (substrings of column names in df).

  • loc_pct_upmnar (float) – Location factor [0-1] for the upper MNAR limit (up_mnar), given as a proportion of the detection range.

  • cols_quant (Optional[ArrayLike1D]) – Names of the columns in df that contain quantification data.

Returns:

  • d_min – Minimum of detected values

  • up_mnar – Upper bound of MNAR MVs

  • d_max – Maximum of detected values
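The relationship between the three limits can be sketched in plain Python. This assumes that the location factor is applied linearly to the detection range, i.e. up_mnar = d_min + loc_pct_upmnar * (d_max - d_min). That reading follows from the parameter description but is not confirmed against the implementation. The function name `get_limits_sketch` and the nested-list input are hypothetical stand-ins for the DataFrame-based method.

```python
def get_limits_sketch(rows, loc_pct_upmnar=0.25):
    """Return (d_min, up_mnar, d_max) for a table of intensities.

    `rows` is a list of rows (proteins), each a list of values with None
    marking missing values. Assumes a linear location factor over the
    detection range [d_min, d_max].
    """
    detected = [v for row in rows for v in row if v is not None]
    d_min = min(detected)  # detection limit
    d_max = max(detected)
    up_mnar = d_min + loc_pct_upmnar * (d_max - d_min)
    return d_min, up_mnar, d_max


rows = [
    [20.0, None, 24.0],
    [None, 30.0, 26.0],
]
print(get_limits_sketch(rows, loc_pct_upmnar=0.25))  # → (20.0, 22.5, 30.0)
```

With the default factor of 0.25, the upper MNAR bound sits a quarter of the way from the lowest to the highest detected value.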

run(df=None, groups=None, loc_pct_upmnar=0.25, min_cs=0.5, n_neighbors=5)

Run cImpute algorithm.

cImpute (conditional imputation) is a hybrid method for imputing omics data: it uses MinProb for MNAR (Missing Not At Random) missing values and KNN imputation for MCAR (Missing Completely At Random) missing values.

Parameters:
  • df (Optional[DataFrame]) – DataFrame containing quantified values with MVs. Rows typically correspond to proteins and columns to conditions.

  • groups (Optional[ArrayLike1D]) – List of quantification groups (substrings of column names in df).

  • loc_pct_upmnar (float) – Location factor [0-1] for the upper MNAR limit (up_mnar), given as a proportion of the detection range.

  • min_cs (float) – Minimum confidence score [0-1] that a protein must reach in a group for its MVs in that group to be imputed.

  • n_neighbors (int) – Number of neighboring samples to use for MCAR imputation by KNN.

Returns:

df_imp – DataFrame with (a) imputed intensity values and (b) group-wise confidence scores and NaN classification.

Return type:

DataFrame

Notes

  • MAR MVs are imputed only if min_cs=0, in which case the imputation for MCAR is applied to them.
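The two imputation strategies named above can be illustrated with a small stdlib-only sketch. It shows the general ideas (MinProb: draw from a Gaussian centred on a low quantile of the detected values; KNN: average the values of the most similar rows) and is not the cImpute implementation. The function names, the quantile `q`, and the scaling factor `sd_scale` are assumptions chosen for illustration.

```python
import math
import random
import statistics


def minprob_draw(detected, n, q=0.01, sd_scale=0.3, seed=0):
    """MinProb-style draw: n values from a Gaussian centred on a low
    quantile of `detected`, with a shrunken standard deviation.
    Needs at least two detected values."""
    rng = random.Random(seed)
    ordered = sorted(detected)
    center = ordered[int(q * (len(ordered) - 1))]
    sd = sd_scale * statistics.stdev(ordered)
    return [rng.gauss(center, sd) for _ in range(n)]


def knn_impute_row(target, others, n_neighbors=5):
    """KNN-style fill: replace each None in `target` with the mean of that
    column over the nearest rows in `others` that have a value there."""

    def distance(a, b):
        # Root-mean-square difference over columns observed in both rows
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    ranked = sorted(others, key=lambda row: distance(target, row))
    filled = list(target)
    for j, v in enumerate(target):
        if v is None:
            donors = [row[j] for row in ranked if row[j] is not None][:n_neighbors]
            if donors:
                filled[j] = sum(donors) / len(donors)
    return filled


# MNAR-like case: impute near the low end of the detected intensities
detected = [20.0, 22.0, 24.0, 26.0, 28.0]
low_values = minprob_draw(detected, n=3)

# MCAR-like case: borrow the missing value from the two most similar rows
target = [25.0, None, 26.0]
others = [[25.1, 25.5, 26.1], [24.9, 25.3, 25.9], [30.0, 31.0, 29.0]]
filled = knn_impute_row(target, others, n_neighbors=2)
print(low_values)
print(filled)
```

The MinProb-style values cluster around the lowest detected intensity, which suits left-censored MNAR MVs, whereas the KNN-style value is taken from rows with a similar profile, which suits MVs that are unrelated to abundance.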