xomics.cImpute
- class xomics.cImpute(col_id='protein_id', col_name='gene_name', str_quant='log2_lfq')
Bases: object
Transparent hybrid data imputation method.
cImpute (conditional Imputation) is a transparent hybrid imputation algorithm designed to address missing values (MVs) in (prote)omics data. The types of missing values can be broadly categorized into three groups based on their nature and the reasons behind their occurrence, as detailed in [Lazar16] and [Wei18]:
Missing Completely At Random (MCAR): MVs resulting from random errors in data acquisition. Due to the inherent randomness of MCAR MVs, they cannot be explained purely by measured intensities and are uniformly distributed.
Missing At Random (MAR): MVs due to data processing flaws and variable dependencies. While MAR includes all MCAR MVs, its distribution is speculative and varies significantly across different experiments.
Missing Not At Random (MNAR): MVs caused by experimental biases like detection limits in mass-spectrometry. They often follow a left-censored Gaussian distribution, indicating truncation at lower abundances.
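The difference between MCAR and MNAR can be made concrete with a small simulation. This is only an illustrative sketch: the intensity distribution, the 10% drop rate, and the detection limit are made-up values and are not taken from xomics or from the cited references.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated log2 intensities for 10,000 hypothetical measurements.
intensities = rng.normal(loc=25.0, scale=2.0, size=10_000)

# MCAR: every value has the same 10% chance of being lost,
# independent of its intensity.
mcar_mask = rng.random(intensities.size) < 0.10

# MNAR: values below an assumed detection limit are never observed,
# which left-censors the distribution.
detection_limit = 23.0
mnar_mask = intensities < detection_limit

observed_mcar = intensities[~mcar_mask]
observed_mnar = intensities[~mnar_mask]

print(f"all values: mean={intensities.mean():.2f}")
print(f"after MCAR: mean={observed_mcar.mean():.2f}")
print(f"after MNAR: mean={observed_mnar.mean():.2f}, min={observed_mnar.min():.2f}")
```

Under MCAR the observed mean stays close to the true mean, whereas under MNAR the observed values are shifted upward because the low-abundance tail is cut off. MAR is not simulated here because, as noted above, its distribution depends on the experiment.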
Notes
cImpute imputes only those MVs that meet well-defined confidence criteria. The approach comprises four main steps:
Establishing the upper bound for MNAR MVs to distinguish between MNAR and MCAR.
Categorizing MVs for detected proteins within a specific experimental group.
Calculating a confidence score (CS) for each protein.
Performing group-wise imputation for proteins whose CS exceeds a predefined threshold.
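Steps 2 and 4 can be pictured with a short sketch. The classification rule below is a hypothetical simplification chosen for illustration and is not the rule implemented in xomics; classify_group and passes_threshold are made-up helper names, and the comparison cs >= min_cs is an assumption.

```python
import math


def classify_group(values, up_mnar):
    """Label the MVs of one protein within one group (hypothetical rule)."""
    detected = [v for v in values if not math.isnan(v)]
    if len(detected) == len(values):
        return "no MVs"
    if not detected or all(v <= up_mnar for v in detected):
        # Nothing detected, or only low-abundance signals: the gaps
        # are plausibly caused by the detection limit.
        return "MNAR"
    if all(v > up_mnar for v in detected):
        # Clearly detectable protein: the gaps look random.
        return "MCAR"
    # Detected values on both sides of up_mnar: ambiguous evidence.
    return "MAR"


def passes_threshold(cs, min_cs=0.5):
    """Assumed gate for step 4: impute only if the CS reaches min_cs."""
    return cs >= min_cs


nan = float("nan")
print(classify_group([20.0, nan, 21.0], up_mnar=22.5))
print(classify_group([27.0, nan, 28.5], up_mnar=22.5))
print(classify_group([21.0, nan, 28.5], up_mnar=22.5))
```

In this sketch the three calls yield MNAR, MCAR, and MAR in turn; how the actual confidence score is computed from such labels is not covered here.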
Methods
__init__([col_id, col_name, str_quant])
get_limits([df, groups, loc_pct_upmnar, ...]) – Get minimum of detected values (d_min, i.e., detection limit), upper bound of MNAR MVs (up_mnar), and maximum of detected values (d_max).
run([df, groups, loc_pct_upmnar, min_cs, ...]) – Run cImpute algorithm.
- get_limits(df=None, groups=None, loc_pct_upmnar=0.25, cols_quant=None)
Get minimum of detected values (d_min, i.e., detection limit), upper bound of MNAR MVs (up_mnar), and maximum of detected values (d_max).
- Parameters:
df (Optional[DataFrame]) – DataFrame containing quantified values with MVs. Rows typically correspond to proteins and columns to conditions.
groups (Optional[NewType()(ArrayLike1D, Union[Sequence[Union[int, float]], ndarray, Series])]) – List of quantification groups (substrings of columns in df).
loc_pct_upmnar (float) – Location factor [0-1] for the upper MNAR limit (upMNAR), given as a relative proportion of the detection range (e.g., 0.25 for 25%).
cols_quant (Optional[NewType()(ArrayLike1D, Union[Sequence[Union[int, float]], ndarray, Series])]) – Column names with quantification data in df.
- Returns:
d_min – Minimum of detected values.
up_mnar – Upper bound of MNAR MVs.
d_max – Maximum of detected values.
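How the three limits relate can be sketched in a few lines of NumPy. This assumes that up_mnar is placed at d_min plus loc_pct_upmnar times the detection range, which is one reading of the parameter description; sketch_limits is a hypothetical helper, not the xomics implementation, and it ignores the groups and cols_quant arguments.

```python
import numpy as np


def sketch_limits(values, loc_pct_upmnar=0.25):
    """Return (d_min, up_mnar, d_max) for an array that may contain NaNs."""
    values = np.asarray(values, dtype=float)
    d_min = float(np.nanmin(values))  # lowest detected value
    d_max = float(np.nanmax(values))  # highest detected value
    # Assumed placement: a fixed fraction of the detection range above d_min.
    up_mnar = d_min + loc_pct_upmnar * (d_max - d_min)
    return d_min, up_mnar, d_max


# Toy matrix: rows are proteins, columns are samples, NaN marks an MV.
quant = np.array([
    [20.0, np.nan, 22.0],
    [np.nan, 28.0, 30.0],
    [24.0, 25.0, np.nan],
])
d_min, up_mnar, d_max = sketch_limits(quant)
print(d_min, up_mnar, d_max)  # 20.0 22.5 30.0
```

With the default loc_pct_upmnar=0.25, the upper MNAR bound sits a quarter of the way from the lowest to the highest detected value.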
- run(df=None, groups=None, loc_pct_upmnar=0.25, min_cs=0.5, n_neighbors=5)
Run cImpute algorithm.
Conditional imputation (cImpute) is a hybrid method for imputing omics data: MNAR (Missing Not At Random) MVs are imputed with MinProb, and MCAR (Missing Completely At Random) MVs are imputed with KNN.
- Parameters:
df (Optional[DataFrame]) – DataFrame containing quantified values with MVs. Rows typically correspond to proteins and columns to conditions.
groups (Optional[NewType()(ArrayLike1D, Union[Sequence[Union[int, float]], ndarray, Series])]) – List of quantification groups (substrings of columns in df).
loc_pct_upmnar (float) – Location factor [0-1] for the upper MNAR limit (upMNAR), given as a relative proportion of the detection range (e.g., 0.25 for 25%).
min_cs (float) – Minimum confidence score [0-1] that a protein must reach within a group for its MVs to be imputed.
n_neighbors (int) – Number of neighboring samples to use for MCAR imputation by KNN.
- Returns:
df_imp – DataFrame with (a) imputed intensity values and (b) group-wise confidence scores and NaN classification.
Notes
MAR MVs are only imputed if min_cs=0, in which case the imputation for MCAR (KNN) is used.
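The hybrid strategy described for run (MinProb for MNAR, KNN for MCAR) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: minprob_draws and knn_impute_row are hypothetical helpers, the quantile q and the scale factor are example settings, and neighbors are taken to be rows (proteins) of the matrix. It is not the xomics implementation.

```python
import numpy as np


def minprob_draws(X, n, rng, q=0.01, scale=1.0):
    """Draw n low-abundance values, MinProb-style (parameters assumed)."""
    center = np.nanquantile(X, q)  # a low quantile of the detected values
    spread = np.nanmedian(np.nanstd(X, axis=1)) * scale  # typical row spread
    return rng.normal(center, spread, size=n)


def knn_impute_row(X, i, n_neighbors=5):
    """Fill the NaNs of row i with the mean of its nearest rows."""
    row = X[i]
    missing = np.isnan(row)
    dists = []
    for j in range(X.shape[0]):
        if j == i:
            continue
        shared = ~missing & ~np.isnan(X[j])
        # A neighbor must overlap with row i and cover all of its gaps.
        if not shared.any() or np.isnan(X[j][missing]).any():
            continue
        rmsd = np.sqrt(np.mean((row[shared] - X[j][shared]) ** 2))
        dists.append((rmsd, j))
    nearest = [j for _, j in sorted(dists)[:n_neighbors]]
    filled = row.copy()
    if nearest:
        filled[missing] = X[nearest][:, missing].mean(axis=0)
    return filled


rng = np.random.default_rng(0)
X = np.array([
    [25.0, 25.4, np.nan, 25.2],    # high abundance, one gap: treated as MCAR
    [25.1, 25.3, 25.6, 25.0],
    [24.8, 25.5, 25.2, 25.1],
    [19.0, np.nan, np.nan, 19.4],  # low abundance: treated as MNAR
])

row0 = knn_impute_row(X, 0, n_neighbors=2)  # KNN for the MCAR gap

gaps = np.isnan(X[3])
row3 = X[3].copy()
row3[gaps] = minprob_draws(X, gaps.sum(), rng)  # MinProb for the MNAR gaps

print(row0)
print(row3)
```

The MCAR gap is filled from similar, fully quantified rows, so it stays in the high-abundance range, while the MNAR gaps receive random values near the lower end of the detected intensities.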