Multi-omics data integration
We will perform a short integration of a proteomics and lipidomics dataset obtained for a demylination (i.e., damage to the myelin sheath surrounding neurons) experiment with four time points given by days (0, 4, 7, and 14, abbreviated by ‘d00’, ‘d04’, ‘d07’, and ‘d14’). See details of this study in [Penkert21].
We can download both datasets which are provided by xOmics as use-case datasets as follows:
import xomics as xo
df_prot = xo.load_dataset(name="PROT_DEMYLINATION")
df_lip = xo.load_dataset(name="LIPID_DEMYLINATION")
Focusing on a comparison of day 7 against day 0, we can first define all constants for our analysis:
groups = ["d00", "d03", "d07", "d14"]
args_col = dict(col_fc="log2_fc_(d07/d00)", col_pval="-log10_p-value_(d07/d00)")
# Lipidomics data were measured in pikomol using an external standard
args_lip = dict(col_id="lipid_id", col_name="lipid_name", str_quant="pmol")
# Proteomics data were measured using label free quantification (LFQ)
args_prot = dict(col_id="protein_id", col_name="gene_name", str_quant="log2_lfq")
We first perform some filtering steps and compute the fold changes for our proteomics data:
# Pre-processing
pp = xo.PreProcess(**args_prot)
pr = xo.pRank(**args_prot)
df_lfq = pp.filter_groups(df=df_prot, groups=groups)
df_lfq = pp.filter_duplicated_names(df=df_lfq, col="gene_name")
# Compute fold change (FC) and p-values
df_fc = pp.run(df=df_lfq, groups=groups, groups_ctrl=["d00"])
df_fc = pp.filter_nan(df=df_fc, cols=["log2_fc_(d07/d00)", "-log10_p-value_(d07/d00)"])
Now the threshold based significance and P-scores can be computed:
df_fc = pp.add_significance(df=df_fc, **args_col)
df_fc = pr.p_score(df_fc=df_fc, **args_col)
df_fc.head(5)
| protein_id | gene_name | log2_fc_(d03/d00) | log2_fc_(d07/d00) | log2_fc_(d14/d00) | -log10_p-value_(d03/d00) | -log10_p-value_(d07/d00) | -log10_p-value_(d14/d00) | sig_class | p_score | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q8K4G5 | Ablim1 | -0.528 | -0.6825 | -0.220 | 4.230306 | 5.312642 | 3.140084 | Down | 0.340022 |
| 1 | Q91ZJ5 | Ugp2 | -0.036 | -0.2850 | -0.518 | 0.248030 | 4.640008 | 6.791086 | Unchanged | 0.246418 |
| 2 | P53986 | Slc16a1 | -0.084 | -0.4410 | -0.806 | 0.317540 | 3.634130 | 6.823463 | Unchanged | 0.228351 |
| 3 | P24270 | Cat | -0.186 | -0.7560 | -0.384 | 1.586811 | 5.169079 | 4.096977 | Down | 0.345767 |
| 4 | Q99KC8 | Vwa5a | -0.352 | -0.7130 | -1.056 | 1.420013 | 4.742200 | 7.742758 | Down | 0.320359 |
We use this to highlight the proteins with the strongest proteomic signal in a volcano plot
import matplotlib.pyplot as plt
df_fc = df_fc.sort_values(by="p_score", ascending=False)
top_genes_up = df_fc[df_fc["sig_class"] == "Up"]["gene_name"].to_list()[0:10]
top_genes_down = df_fc[df_fc["sig_class"] == "Down"]["gene_name"].to_list()[0:10]
top_genes = top_genes_down + top_genes_up
xo.plot_settings()
xo.plot_volcano(df=df_fc, **args_col, names_to_annotate=top_genes, col_names="gene_name")
plt.tight_layout()
plt.show()
We can now obtain the significantly altered proteins and perform a gene
set enrichment analysis using
g:Profiler <https://biit.cs.ut.ee/gprofiler/gost>_ with default
settings and using all Gene Ontology and biological pathways data
sources. A list of ‘target_genes’ has to be provided, and it is highly
recommended to use a customized ‘background’ by setting in ‘Advanced
options’ the ‘Statistical domain scope’ to ‘Custom’ and copy-pasting
your entire list of detected proteins.
background = df_fc["gene_name"].to_list()
target_genes = df_fc[df_fc["sig_class"].isin(["Up", "Down"])]["gene_name"].to_list()
print(f"Target list comprises {len(target_genes)} genes, which is a subset of the complete background list including {len(background)} genes.")
Target list comprises 644 genes, which is a subset of the complete background list including 4526 genes.
Run g:Profiler und download the ‘Detailed Results’ as GEM and convert
the .txt file into a .tsv file. We provide the results of this
enrichment analysis as use-case by using xo.load_dataset and setting
name='gProfiler_DEMYLINATION':
df_gprof = xo.load_dataset(name="gProfiler_DEMYLINATION")
df_gporf = pp.apply_log(df=df_gprof, cols=["p.val"], log2=False, neg=True)
df_gprof.head(5)
| go.id | description | p.val | fdr | phenotype | genes | |
|---|---|---|---|---|---|---|
| 0 | GO:0003735 | structural constituent of ribosome | 2.875376e-09 | 2.875376e-09 | 1 | RPL18,RPL27,RPL5,RPL27A,RPS28,RPL29,RPS14,MRPL... |
| 1 | GO:0003779 | actin binding | 2.727871e-04 | 2.727871e-04 | 1 | DAAM2,ANLN,MRTFB,ADCY8,PHACTR1,MAP1A,ABLIM1,CO... |
| 2 | GO:0005198 | structural molecule activity | 4.333399e-02 | 4.333399e-02 | 1 | CRYAB,PLLP,MAP1A,RPL18,RPL27,H4C1,CD2AP,RPL5,R... |
| 3 | GO:0002181 | cytoplasmic translation | 9.007092e-05 | 9.007092e-05 | 1 | RPL18,RPL27,HNRNPU,RPL5,YBX1,RPL27A,RPS28,NMNA... |
| 4 | GO:0042274 | ribosomal small subunit biogenesis | 6.388875e-04 | 6.388875e-04 | 1 | RPS28,RPS14,RPS8,RPS13,RPS24,RPS3A,RPS5,RPS12,... |
These enrichment results can be used to obtain a protein-centric enrichment-based ranking score called E-score:
# TODO
To perform an enrichment analysis for the lipidomics data, we obtain as
well all significantly changed lipids and use this time
LION/web server <http://www.lipidontology.com/>. The results are
provided by xOmics by using xo.load_dataset with
name='LION_DEMYLINATION':