OpDEA

OpDEA: Optimizing Proteomics Data Differential Expression Analysis via High-Performing Rules and Ensemble Inference

Abstract

Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthew's correlation coefficients surpassing 0.84. We introduce an ensemble inference to integrate results from individual top-performing workflows for expanding differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.

Workflow of a DEA process for label-free proteomics data and available tools for each step in the workflow

M-VIDIA: A Multi-View representation to Increase modality Depth using Integrative AI

Abstract

Artificial intelligence (AI) models are powerful tools for addressing data challenges such as complexity, sparsity, and noise. Multi-view learning (MVL), which leverages multiple data representations (“views”), holds great potential by enhancing signal quality and improving task performance. However, its application in biomedical research remains largely underexplored. To address this gap, we introduce Multi-View representation to Increase Modality Depth using Integrative AI (M-VIDIA), a novel MVL framework. M-VIDIA integrates diverse views to boost performance across various tasks, including differential expression analysis, patient diagnosis, and cell clustering. Our findings demonstrate that M-VIDIA consistently outperforms other approaches, significantly improving sensitivity and signal quality. This represents a major advancement in data-driven AI, highlighting the crucial role of high-quality data representation in producing reproducible and interpretable results in biomedical research. M-VIDIA is available for use at http://www.ai4pro.tech:3838/.

Overall pipeline of M-VIDIA and its performance on DEA

Acknowledgement and Citation

Acknowledgement

This research/project is supported by the National Research Foundation, Singapore under its Industry Alignment Fund – Prepositioning (IAF-PP) Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.

This work was partly supported by the National Innovation Fellow Program of the MOST of China (J.L., Grant No. E327130001).

WWBG also acknowledges support from an MOE Tier 1 award (RS08/21).

Publication

Please cite the following papers:

Hui Peng, He Wang, Weijia Kong, Jinyan Li*, Wilson Wen Bin Goh*. (2024). Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference. Nat Commun 15, 3922 (2024). https://doi.org/10.1038/s41467-024-47899-w

Hui Peng, Wilson Wen Bin Goh*. (2024). M-VIDIA: A Multi-View representation to Increase modality Depth using Integrative AI.

Available Resources

Raw data links for DEA workflow benchmarking

label-free DDA data

HYE5600735_LFQ: PXD028735

HYE6600735_LFQ: PXD028735

HYEqe735_LFQ: PXD028735

HYEtims735_LFQ: PXD028735

HYtims134_LFQ: PXD036134

HEtims425_LFQ: PXD021425

YUltq006_LFQ: PDC000006

YUltq099_LFQ: PXD002099

YUltq819_LFQ: PXD001819

HEqe408_LFQ: PXD018408

HYqfl683_LFQ: PXD007683

HYEtims777_LFQ: PXD014777

label-free DIA data

HYEtims735_DIA: PXD028735

MYtims709_DIA: PXD034709

HEqe408_DIA: PXD018408

HEof_n600_DIA: PXD026600

HEof_w600_DIA: PXD026600

HYtims134_DIA: PXD036134

HEqe777_DIA: PXD019777

HEqe777_DIA:

TMT data

HEqe277_TMT10: PXD013277

HYqfl683_TMT11: PXD007683

HYms2faims815_TMT16: PXD020815

HYsps2815_TMT16: PXD020815

HYms2815_TMT16: PXD020815

Quantification data for DEA workflow benchmarking

Please download from Zenodo via the following link

Zenodo: Raw quantification results from FragPipe, MaxQuant, DIA-NN and Spectronaut

Or download from Google Drive via the following link

Google Drive: Raw quantification results of DDA data; DIA data; TMT data

DEA workflow benchmarking result data

Expression matrices are available at:

Download from Zenodo via the following links

Zenodo: DDA_expression_matrices

Zenodo: DIA_expression_matrices

Zenodo: TMT_expression_matrices

Performance metrics are available at:

Download from Zenodo via the following links

Zenodo: DDA_performance_metrics

Zenodo: DIA_performance_metrics

Zenodo: TMT_performance_metrics

Workflow ranks are available at:

Download from Zenodo via the following links

Zenodo: DDA_workflow_ranks

Zenodo: DIA_workflow_ranks

Zenodo: TMT_workflow_ranks

View data for M-VIDIA case studies

View data for the Thyroid Nodule Classification (TNC) dataset are available at:

Download from Zenodo via the following links

Zenodo: discovery set

Zenodo: retrospective test set

Single Mouse Oocytes (SMO) data are available at:

Download from Zenodo via the following links

Zenodo: proteomics

Zenodo: transcriptomics

Pick-up Single-cell Proteomic Analysis (PiSPA) data are available at:

Download from Zenodo via the following links

Zenodo: single cell DDA

Zenodo: single cell DIA

Introduction of the webserver

Introduction of the webserver:

This toolkit includes six main function panels: Introduction, Benchmarking, Suggestion & DEA, MVIDIA, Data and Help. The [Introduction] panel first presents the abstract of this work, followed by an overview of the differential expression analysis workflow for proteomics data and the options available in each workflow step. It then presents the abstract of our newly designed multi-view learning framework for proteomics data analysis, together with the MVIDIA pipeline. Finally, the Acknowledgement and Citation information is shown.

In the [Benchmarking] panel,

users can view our benchmarking results, including the rank position and performance metric values of each workflow tested on our benchmarking datasets.

In the [Suggestion & DEA] panel,

we provide tools for suggesting optimal workflows and for conducting differential expression analysis directly with the suggested workflow.

In the [MVIDIA] panel,

we provide three functional modules for proteomics data analysis tasks: DEA, patient diagnosis and cell clustering.

In the [Data] panel,

users can find links to the raw proteomics data. The raw quantification results of the raw data, the extracted expression matrices and our benchmarking results can be downloaded.

The data used in the case studies of our MVIDIA paper are also provided.

We also provide a link for downloading our offline toolkit, which has the same functions as the webserver.

In the [Help] panel,

we introduce the webserver and show what users can do with our toolkit.

View benchmarking results

View workflow benchmarking results:

Step 1: Choose a setting of interest, e.g., for label-free DDA data quantified with FragPipe, click the item DDA_LFQ-FragPipe.

There are 7852, 7852, 6284, 6284, 4720 and 1568 workflows under the settings LFQ_DDA-FragPipe, LFQ_DDA-Maxquant, LFQ_DIA-DIANN, LFQ_DIA-Spectronaut, TMT-FragPipe and TMT-Maxquant, respectively.

Step 2: Filter workflows, e.g., check the options you are interested in under the 4 checkboxes, or use a keyword.

Drag the scroll bar below the table to check the ranking under a specific metric. Workflows are ranked by their average rank across the five performance metrics (the avg_rank_mean column).
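The average-rank ordering described in Step 2 can be sketched as follows. This is only an illustration: the workflow names and scores are made up, the sketch assumes that higher is better for every metric and that ties share the average rank, and the page does not say how the server handles ties or aggregates across datasets.

```python
from statistics import mean

def rank_desc(values):
    """Rank values so the largest gets rank 1; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions in the tie group
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def average_rank(scores):
    """scores: {workflow: [metric values]}; assumes higher is better everywhere."""
    names = list(scores)
    n_metrics = len(next(iter(scores.values())))
    per_metric = [rank_desc([scores[w][m] for w in names]) for m in range(n_metrics)]
    return {w: mean(r[i] for r in per_metric) for i, w in enumerate(names)}

# Hypothetical scores for three workflows on five metrics (values are made up).
scores = {
    "wf_a": [0.90, 0.80, 0.85, 0.70, 0.95],
    "wf_b": [0.85, 0.90, 0.80, 0.75, 0.90],
    "wf_c": [0.60, 0.70, 0.65, 0.60, 0.70],
}
avg_rank = average_rank(scores)
best = min(avg_rank, key=avg_rank.get)  # lowest average rank is ranked first
```

Averaging ranks rather than raw metric values keeps metrics on different scales from dominating the ordering.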

Step 3: Click a row of the table to check the performance distributions across the different datasets.

12 DDA datasets, 7 DIA datasets and 5 TMT datasets were used to evaluate workflow performance; the boxplot shows the performance distributions for the 5 metrics.

Only one row can be selected at a time; clicking the same row again cancels the selection.

Step 4: View the details of the performance distributions of the five metrics in the bottom-right figure.

Optimal workflow recommendation & DEA

Workflow suggestion and testing:

Step 1: Choose the type of data you have.

Label-free DDA (DDA), label-free DIA (DIA) and TMT data are supported. Choose the one corresponding to your data.

Step 2: Choose a quantification platform.

For DDA data, FragPipe and MaxQuant are supported. For DIA data, DIA-NN and Spectronaut are supported. For TMT data, FragPipe and MaxQuant are again supported. Choose the one you used to quantify your proteomics data.

Step 3: Choose to get a single suggested workflow or to try the ensemble inference.

If a single workflow is preferred, indicate whether you have preferred options for any workflow step. If so, check Yes and select the options in the checkboxes below; otherwise, select No.

For DDA and DIA data, our server also supports ensemble inference, where multiple workflows are integrated by a p-value integration method. The ens_multi-quant approach is used by default.

The imputation methods missForest and MLE are quite time-consuming; if they are suggested, we replace them with MinProb.
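To illustrate what integrating p-values across workflows means in Step 3, here is a minimal sketch using Fisher's method. This page does not say which integration method the server or the ens_multi-quant approach uses, so Fisher's method is a stand-in chosen for illustration. It also assumes independent p-values, which results from several workflows run on the same data generally are not. The protein IDs and p-values are hypothetical.

```python
import math

def fisher_combine(pvalues):
    """Combine p-values with Fisher's method (assumes independence).

    The statistic X = -2 * sum(ln p) is chi-square distributed with 2k degrees
    of freedom under the null. For even degrees of freedom the survival
    function has the closed form exp(-X/2) * sum_{j<k} (X/2)^j / j!.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    # Clamp tiny p-values to avoid log(0).
    half_x = -sum(math.log(max(p, 1e-300)) for p in pvalues)  # this is X / 2
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_x / j
        total += term
    return min(1.0, math.exp(-half_x) * total)

# Hypothetical per-protein p-values reported by three workflows.
workflow_pvalues = {
    "P1": [0.01, 0.03, 0.02],
    "P2": [0.40, 0.75, 0.55],
}
combined = {prot: fisher_combine(ps) for prot, ps in workflow_pvalues.items()}
```

A protein that is consistently significant across workflows (P1) ends up with a smaller combined p-value than one that is not (P2).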

Step 4: Click the suggest workflow button or the Try ensemble inference button.

Our server will suggest the top-ranked workflow after filtering the workflows by the user-specified option preferences. Potential alternative options will also be shown according to our option comparison results.

Step 5: Upload raw quantification result data, specify thresholds and submit the DEA task.

Upload your raw quantification result data for DEA, e.g., the combined_protein.tsv file from FragPipe. The design file giving the sample condition and group information must be uploaded at the same time. Read the file requirements carefully before uploading your files. Specify the log2FC threshold and the adj.pvalue threshold (p-values are adjusted with the BH method). Finally, click the DEA button to submit the task.
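The two thresholds in Step 5 can be read as follows. The sketch applies the Benjamini-Hochberg (BH) adjustment to raw p-values, then keeps proteins that pass both cutoffs. The cutoff values, the direction of the inequalities, the function names and the example numbers are all assumptions for illustration, not the server's settings.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def call_deps(proteins, log2fc, pvalues, fc_cut=1.0, adj_p_cut=0.05):
    """Return proteins with |log2FC| >= fc_cut and BH-adjusted p <= adj_p_cut."""
    adj = bh_adjust(pvalues)
    return [p for p, fc, q in zip(proteins, log2fc, adj)
            if abs(fc) >= fc_cut and q <= adj_p_cut]

# Hypothetical DEA output for four proteins.
proteins = ["P1", "P2", "P3", "P4"]
log2fc = [2.0, 0.2, -1.5, 1.2]
pvals = [0.01, 0.04, 0.03, 0.005]
deps = call_deps(proteins, log2fc, pvals)
```

In this example P2 is excluded because its fold change is small, even though its adjusted p-value is below the cutoff.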

Step 6: Download DEA results.

After submitting the DEA task, wait for it to complete. A volcano plot will be generated, and a link for downloading the DEA results will appear below the volcano plot.

MVIDIA DEA

Applying MVIDIA to DEA tasks:

Step 1: Choose the task "DEA" under MVIDIA in the left menu.

Step 2: Prepare your input data.

We provide a module that helps extract views from your proteomics data for use as inputs to MVIDIA; see the "View Extraction" page.

Step 3: Upload view data.

After extracting the view data, compress the view files (e.g., dlfq.tsv, maxlfq.tsv, etc.) and the design file (design.tsv) into a zip file named "views.zip". Then upload the zip file to the server.
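The packaging in Step 3 can be scripted. The sketch below uses Python's standard zipfile module and places the files at the root of the archive. Whether the server requires a flat archive is an assumption, and the helper name pack_views and the folder name in the usage comment are hypothetical.

```python
import zipfile
from pathlib import Path

def pack_views(view_dir, zip_path, files):
    """Write the listed files from view_dir into a flat zip archive."""
    view_dir = Path(view_dir)
    missing = [f for f in files if not (view_dir / f).is_file()]
    if missing:
        raise FileNotFoundError(f"missing input files: {missing}")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in files:
            # arcname keeps each file at the archive root (no folder prefix).
            zf.write(view_dir / name, arcname=name)
    return Path(zip_path)

# Example call (assumes the files exist in a local folder named "my_views"):
# pack_views("my_views", "views.zip", ["dlfq.tsv", "maxlfq.tsv", "design.tsv"])
```

Checking for missing files up front avoids uploading an archive that lacks design.tsv or one of the selected views.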

Step 4: Set the parameters.

Please specify the parameters including:

A.) Quantification platform, e.g., FragPipe for DDA data, DIA-NN for DIA data, etc.

B.) Proteomics data acquisition mode, DDA or DIA.

C.) Views used. 3 views are suggested, but please select at least 2 views, e.g., dlfq and maxlfq.

D.) The thresholds for defining differentially expressed proteins (DEPs), including the logFC and the adj.pvalue.

E.) Whether the data should be normalized and whether the missing values should be imputed. For the available options, please see the pages in Benchmarking.

F.) The group labels that you want to compare, e.g., A, B.

G.) The multi-view DEA method: MVIDIA, Concat or MLE.

Step 5: After specifying the parameters, please click the MVIDIA-DEA button to submit the task.

Step 6: Check DEA results.

After submitting the DEA task, wait for it to complete. A volcano plot and a Venn plot showing the overlaps between the DEPs obtained by the single-view methods and by the multi-view method will be generated. The result file can be downloaded through the blue link.

MVIDIA patient diagnosis

Applying MVIDIA to patient diagnosis (classification) tasks:

Step 1: Choose the task "Patient Diagnosis" under MVIDIA in the left menu.

Step 2: Prepare your input data.

We provide a module that helps extract views from your proteomics data for use as inputs to MVIDIA; see the View Extraction page.

Step 3: Upload view data.

After extracting the view data for both the training data and the test data, compress the view files (e.g., dlfq.tsv, maxlfq.tsv, etc.) and the design file (design.tsv) into two zip files named "train.zip" and "test.zip". Then upload the zip files to the server.

Step 4: Set the parameters.

Please specify the parameters including:

A.) Quantification platform, e.g., FragPipe for DDA data, DIA-NN for DIA data, etc.

B.) Proteomics data acquisition mode, DDA or DIA.

C.) Views used. 3 views are suggested, but please select at least 2 views, e.g., dlfq and maxlfq.

D.) The signatures used for patient diagnosis, i.e., a set of proteins, which can be the DEPs detected by our MVIDIA-DEA.

Note: please make sure the proteins are in quotes and separated by ",", e.g., "sp|A0A075B6S2|KVD29_HUMAN,sp|A0A075B6S5|KV127_HUMAN"

E.) The multi-view DEA method: MVIDIA, Concat or MLE.

F.) The label of the positive class.
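The signature string for item D can be assembled programmatically to avoid formatting mistakes. The sketch follows the example given above, where the whole comma-separated list sits inside one pair of double quotes. That reading of the requirement is an assumption, and format_signature is a hypothetical helper.

```python
def format_signature(protein_ids):
    """Join protein IDs with commas and wrap the whole list in double quotes."""
    cleaned = [p.strip() for p in protein_ids if p.strip()]
    for p in cleaned:
        if "," in p or '"' in p:
            raise ValueError(f"ID contains a separator or quote: {p!r}")
    # Drop duplicates while preserving the original order.
    unique = list(dict.fromkeys(cleaned))
    return '"' + ",".join(unique) + '"'

signature = format_signature([
    "sp|A0A075B6S2|KVD29_HUMAN",
    "sp|A0A075B6S5|KV127_HUMAN",
])
```

The resulting string can be pasted into the signature field; IDs that contain a comma or quote are rejected because they would break the separator convention.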

Step 5: After specifying the parameters, please click the "Patient Diagnosis" button to submit the task.

Step 6: Check diagnosis results.

After submitting the patient diagnosis task, wait for it to complete. A notification showing the running time will appear once the task has completed. The result file can be downloaded through the blue link.

MVIDIA cell clustering

Applying MVIDIA to cell clustering tasks:

Step 1: Choose the task "Cell Clustering" under MVIDIA in the left menu.

Step 2: Prepare your input data.

We provide a module that helps extract views from your proteomics data for use as inputs to MVIDIA; see the View Extraction page.

Step 3: Upload view data.

After extracting the view data, compress the view files (e.g., dlfq.tsv, maxlfq.tsv, etc.) and the design file (design.tsv) into a zip file named "sc.zip". Then upload the zip file to the server.

Step 4: Set the parameters.

Please specify the parameters including:

A.) Quantification platform, e.g., FragPipe for DDA data, DIA-NN for DIA data, etc.

B.) Proteomics data acquisition mode, DDA or DIA.

C.) Views used. 3 views are suggested, but please select at least 2 views, e.g., dlfq and maxlfq.

D.) Cell filtering threshold, based on the missing rate.

E.) The number of clusters.

F.) The hyperparameters for GCCA, which is used for cell embedding.
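The cell filtering threshold in item D can be pictured as follows: compute each cell's fraction of missing protein values and keep only the cells at or below the cutoff. The direction of the comparison, the cutoff value and the toy matrix are assumptions for illustration, since this page does not state the exact rule.

```python
import math

def missing_rate(values):
    """Fraction of entries that are None or NaN."""
    if not values:
        raise ValueError("cell has no measurements")
    n_missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return n_missing / len(values)

def filter_cells(matrix, max_missing_rate):
    """matrix: {cell_id: [protein intensities]}; keep cells at or below the cutoff."""
    return {
        cell: vals for cell, vals in matrix.items()
        if missing_rate(vals) <= max_missing_rate
    }

nan = float("nan")
# Hypothetical single-cell intensities for four proteins.
cells = {
    "cell_1": [10.2, 11.0, nan, 9.8],    # 1 of 4 values missing
    "cell_2": [nan, nan, nan, 8.1],      # 3 of 4 values missing
    "cell_3": [9.9, 10.4, 10.1, 9.7],    # no missing values
}
kept = filter_cells(cells, max_missing_rate=0.5)
```

With a cutoff of 0.5, cell_2 is dropped before clustering because most of its values are missing.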

Step 5: After specifying the parameters, please click the "Cell-Clustering" button to submit the task.

Step 6: Check clustering results.

After submitting the cell clustering task, wait for it to complete. A notification showing the running time will appear once the task has completed. The result file can be downloaded through the blue link.

Contact Info

Contact:

OpDEA

For any problem with OpDEA, please contact Hui Peng:

Email: hui.peng@ntu.edu.sg

or cdph2009@163.com

Or email the corresponding authors:

Jinyan Li: jinyan.li@siat.ac.cn

Wilson Wen Bin Goh: wilsongoh@ntu.edu.sg

M-VIDIA

For any problem with MVIDIA, please contact Hui Peng:

Email: hui.peng@ntu.edu.sg

or cdph2009@163.com

Or email the corresponding author:

Wilson Wen Bin Goh: wilsongoh@ntu.edu.sg

