OpDEA: Optimizing Proteomics Data Differential Expression Analysis via High-Performing Rules and Ensemble Inference


Abstract

Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce an ensemble inference approach that integrates results from individual top-performing workflows to expand differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.
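For readers unfamiliar with the metrics named in the abstract, the following minimal sketch computes F1, the Matthews correlation coefficient and G-mean from confusion-matrix counts. It assumes the usual definition of G-mean as the geometric mean of sensitivity and specificity; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Return F1, MCC and G-mean from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # Assumption: G-mean = sqrt(sensitivity * specificity)
    g_mean = math.sqrt(recall * specificity)
    return {"f1": f1, "mcc": mcc, "g_mean": g_mean}

# Hypothetical counts for one workflow on one spike-in dataset
metrics = classification_metrics(tp=80, fp=10, tn=890, fn=20)
```

In a spike-in dataset the truly differential proteins are known by design, which is what makes counts such as these available.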

Workflow of a DEA process for label-free proteomics data and available tools for each step in the workflow

Good Generalizability confirmed by Leave-One-Dataset-Out Cross-Validation


Workflow performance levels are predictable


Acknowledgement and Citation

Acknowledgement

This research/project is supported by the National Research Foundation, Singapore under its Industry Alignment Fund – Prepositioning (IAF-PP) Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.

This work was partly supported by the National Innovation Fellow Program of the MOST of China (J.L., Grant No. E327130001).

WWBG also acknowledges support from an MOE Tier 1 award (RS08/21).


Publication

Please cite the following paper:

Hui Peng, He Wang, Weijia Kong, Jinyan Li*, Wilson Wen Bin Goh*. (2024). Optimizing Proteomics Data Differential Expression Analysis via Unveiling High-Performing Rules and Ensemble Inference

FragPipe workflow benchmarking results with label-free DDA data


workflow benchmarking
performance distributions

MaxQuant workflow benchmarking results with label-free DDA data


workflow benchmarking
performance distributions

DIA-NN workflow benchmarking results with label-free DIA


workflow benchmarking
performance distributions

Spectronaut workflow benchmarking results with label-free DIA


workflow benchmarking
performance distributions

FragPipe workflow benchmarking results with TMT data


workflow benchmarking
performance distributions

MaxQuant workflow benchmarking results with TMT data


workflow benchmarking
performance distributions

Recommend optimal Workflow for DDA label-free data

We highly recommend using our offline toolkit or R package instead for testing DEA workflows (see the following links):

offline OpDEA toolkit: OpDEA toolkit

R package: R package



We suggest applying the FragPipe-specific top-ranked workflow:
DEqMS|FragPipe|dlfq|missForest|blank: expression matrix: dlfq; normalization: None; MVI: missForest; DEA tool: DEqMS
or you can apply one of the workflows that include the following choices:
expression matrix: directLFQ intensity; normalization: None; MVI: SeqKNN; DEA tool: limma, ROTS
You can view the details of your selected workflow on the benchmarking-DDA_LFQ-FragPipe page.
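
The pipe-delimited workflow identifiers shown in these suggestions can be split into their steps programmatically. The sketch below is illustrative only: the field order (DEA tool, platform, expression matrix, MVI, normalization) and the reading of "blank" as no normalization are inferred from the suggestion line above, and the names Workflow and parse_workflow_id are hypothetical, not part of the OpDEA toolkit or R package.

```python
from typing import NamedTuple, Optional

class Workflow(NamedTuple):
    dea_tool: str
    platform: str
    expression_matrix: str
    mvi: str
    normalization: Optional[str]

def parse_workflow_id(workflow_id: str) -> Workflow:
    """Split an identifier such as 'DEqMS|FragPipe|dlfq|missForest|blank'."""
    parts = workflow_id.strip().rstrip(":").split("|")
    if len(parts) != 5:
        raise ValueError(f"expected 5 '|'-separated fields, got {len(parts)}")
    dea_tool, platform, matrix, mvi, norm = parts
    # Assumption: 'blank' in the last field means no normalization is applied
    return Workflow(dea_tool, platform, matrix, mvi,
                    None if norm == "blank" else norm)

wf = parse_workflow_id("DEqMS|FragPipe|dlfq|missForest|blank")
```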


We suggest applying the MaxQuant-specific top-ranked workflow:
DEqMS|Maxquant|dlfq|Impseq|blank: expression matrix: dlfq; normalization: None; MVI: Impseq; DEA tool: DEqMS
or you can apply one of the workflows that include the following choices:
expression matrix: directLFQ intensity; normalization: None; MVI: Impseq; DEA tool: limma
You can view the details of your selected workflow on the benchmarking-DDA_LFQ-Maxquant page.


Recommend optimal Workflow for DIA label-free data


We suggest applying the DIA-NN-specific top-ranked workflow:
limma|DIANN|dlfq|MinDet|blank: expression matrix: dlfq; normalization: None; MVI: MinDet; DEA tool: limma
or you can apply one of the workflows that include the following choices:
expression matrix: directLFQ intensity; normalization: None; MVI: MinDet; DEA tool: limma
You can view the details of your selected workflow on the benchmarking-DIA_LFQ-DIANN page.


We suggest applying the Spectronaut-specific top-ranked workflow:
limma|DIANN|dlfq|MinDet|blank: expression matrix: dlfq; normalization: None; MVI: MinDet; DEA tool: limma
or you can apply one of the workflows that include the following choices:
expression matrix: directLFQ intensity; normalization: None; MVI: Impseq; DEA tool: ROTS, limma
You can view the details of your selected workflow on the benchmarking-DIA_LFQ-Spectronaut page.


Recommend optimal Workflow for TMT data


We suggest applying the TMT-FragPipe-specific top-ranked workflow:
limma|FragPipe|abd|SeqKNN|blank: expression matrix: abd; normalization: None; MVI: SeqKNN; DEA tool: limma
or you can apply one of the workflows that include the following choices:
expression matrix: TMT-Integrator abundance; normalization: None; MVI: SeqKNN; DEA tool: limma
You can view the details of your selected workflow on the benchmarking-TMT-FragPipe page.


We suggest applying the TMT-Maxquant-specific top-ranked workflow:
proDA|Maxquant|intensity|bpca|blank: expression matrix: intensity; normalization: None; MVI: bpca; DEA tool: proDA
or you can apply one of the workflows that include the following choices:
expression matrix: Reporter intensity; normalization: None; MVI: bpca; DEA tool: limma, ROTS
You can view the details of your selected workflow on the benchmarking-TMT-Maxquant page.


Available Resources


Resources

offline OpDEA toolkit:
OpDEA toolkit
R package:
R package

Raw data links

label-free DDA data

HYE5600735_LFQ: PXD028735

HYE6600735_LFQ: PXD028735

HYEqe735_LFQ: PXD028735

HYEtims735_LFQ: PXD028735

HYtims134_LFQ: PXD036134

HEtims425_LFQ: PXD021425

YUltq006_LFQ: PDC000006

YUltq099_LFQ: PXD002099

YUltq819_LFQ: PXD001819

HEqe408_LFQ: PXD018408

HYqfl683_LFQ: PXD007683

HYEtims777_LFQ: PXD014777


label-free DIA data

HYEtims735_DIA: PXD028735

MYtims709_DIA: PXD034709

HEqe408_DIA: PXD018408

HEof_n600_DIA: PXD026600

HEof_w600_DIA: PXD026600

HYtims134_DIA: PXD036134

HEqe777_DIA: PXD019777



TMT data

HEqe277_TMT10: PXD013277

HYqfl683_TMT11: PXD007683

HYms2faims815_TMT16: PXD020815

HYsps2815_TMT16: PXD020815

HYms2815_TMT16: PXD020815

Quantification data

Please download from Zenodo via the following link:

Zenodo: Raw quantification results from FragPipe, MaxQuant, DIA-NN and Spectronaut

Or download from Google Drive via the following links:

Google Drive: Raw quantification results of DDA data; DIA data; TMT data

Benchmarking result data

Expression matrices are available from Zenodo via the following links:

Zenodo: DDA_expression_matrices
Zenodo: DIA_expression_matrices
Zenodo: TMT_expression_matrices

Performance metrics are available from Zenodo via the following links:

Zenodo: DDA_performance_metrics
Zenodo: DIA_performance_metrics
Zenodo: TMT_performance_metrics

Workflow ranks are available from Zenodo via the following links:

Zenodo: DDA_workflow_ranks
Zenodo: DIA_workflow_ranks
Zenodo: TMT_workflow_ranks

Introduction of the webserver


Introduction of the webserver:

This webserver includes 5 main function panels: Introduction, Benchmarking, Suggestion & DEA, Data and Help. In the [Introduction] panel, we first present the abstract of this work, then show an overview of the proteomics data differential expression analysis workflow and the options available in each workflow step. We then use the leave-one-dataset-out cross-validation results to show the good generalizability of our benchmarking, which supports the reliability of our webserver in suggesting optimal workflows for new data. We also use 10-fold cross-validation to show that the performance level of a workflow can be predicted with a CatBoost classifier, which lays the foundation for the webserver's ability to recommend optimal workflows. Finally, the Acknowledgement and Citation information are shown.


In the [Benchmarking] panel, users can view our benchmarking results, including the rank position and performance metric values of a workflow tested on our benchmarking datasets.


In the [Suggestion & DEA] panel, we provide tools for suggesting optimal workflows and for conducting differential expression analysis directly with the suggested workflow.


In the [Data] panel, users can find links to the raw proteomics data. The raw quantification results, the extracted expression matrices and our benchmarking results can be downloaded.

We also provide a link for downloading our offline toolkit, which has the same functions as the webserver.


In the [Help] panel, we introduce the webserver and show what users can do with it.
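
As an illustration of the leave-one-dataset-out cross-validation mentioned for the [Introduction] panel, the sketch below shows only how the splits are formed: every record from one dataset is held out for testing while the remaining datasets are used for training. The record fields, workflow names and performance labels are hypothetical; only the dataset names come from the data list of this webserver. Model training, such as fitting a classifier, is deliberately left out.

```python
def leave_one_dataset_out(records):
    """Yield (held_out, train, test) once per dataset found in records."""
    datasets = sorted({r["dataset"] for r in records})
    for held_out in datasets:
        train = [r for r in records if r["dataset"] != held_out]
        test = [r for r in records if r["dataset"] == held_out]
        yield held_out, train, test

# Hypothetical workflow records; the 'level' labels are made up
records = [
    {"dataset": "HYEtims735_LFQ", "workflow": "wf_a", "level": "high"},
    {"dataset": "HYEtims735_LFQ", "workflow": "wf_b", "level": "low"},
    {"dataset": "HEqe408_LFQ", "workflow": "wf_a", "level": "high"},
    {"dataset": "HEqe408_LFQ", "workflow": "wf_b", "level": "medium"},
    {"dataset": "YUltq099_LFQ", "workflow": "wf_a", "level": "medium"},
    {"dataset": "YUltq099_LFQ", "workflow": "wf_b", "level": "low"},
]
splits = list(leave_one_dataset_out(records))
```

Because no record from the held-out dataset is seen during training, the test score reflects how well conclusions transfer to an unseen dataset rather than to unseen workflows on familiar data.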

View benchmarking results


View workflow benchmarking results:
Step 1: Choose a setting of interest, e.g., for label-free DDA data quantified with FragPipe, click the DDA_LFQ-FragPipe item.

There are 7852, 7852, 6284, 6284, 4720 and 1568 workflows under the settings DDA_LFQ-FragPipe, DDA_LFQ-Maxquant, DIA_LFQ-DIANN, DIA_LFQ-Spectronaut, TMT-FragPipe and TMT-Maxquant, respectively.


Step 2: Filter workflows, e.g., by checking the options you are interested in across the 4 checkboxes, or by using a keyword.

Drag the scroll bar below the table to check the ranking under a specific metric. Workflows are ranked by their average rank across the five performance metrics (the avg_rank_mean column).
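
The average-rank idea behind the avg_rank_mean column can be sketched as follows. This is a simplified illustration, not the webserver's code: it ranks workflows within a single table, assumes higher metric values are better, and gives tied workflows the average of their positions. The workflow names and scores are hypothetical, and only two metrics are used to keep the example short.

```python
def rank_values(values, higher_is_better=True):
    """Rank values from 1 (best); tied values share their average position."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks

def average_rank(scores):
    """scores: {workflow: {metric: value}} -> {workflow: mean rank}."""
    workflows = list(scores)
    metric_names = list(next(iter(scores.values())))
    per_metric = {m: rank_values([scores[w][m] for w in workflows])
                  for m in metric_names}
    return {w: sum(per_metric[m][i] for m in metric_names) / len(metric_names)
            for i, w in enumerate(workflows)}

# Hypothetical scores for three workflows
scores = {
    "wf_a": {"pAUC": 0.80, "G-mean": 0.70},
    "wf_b": {"pAUC": 0.90, "G-mean": 0.60},
    "wf_c": {"pAUC": 0.70, "G-mean": 0.50},
}
avg_rank = average_rank(scores)
```

A lower average rank means a workflow performs well across metrics rather than excelling on only one of them.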


Step 3: Click a row of the table to check the performance distributions across the different datasets.

12 DDA datasets, 7 DIA datasets and 5 TMT datasets were used to evaluate workflow performance; the boxplots show the distributions of the 5 metrics.

Only one row can be selected at a time; click the same row again to cancel the selection.


Step 4: View the details of the performance distributions of the five metrics in the bottom-right figure.

Optimal workflow recommendation & DEA


Workflow suggestion and testing:
Step 1: Choose the type of data you have.

Label-free DDA (DDA), label-free DIA (DIA) and TMT data are supported. Choose the one corresponding to your data.


Step 2: Choose a quantification platform.

For DDA data, FragPipe and MaxQuant are supported. For DIA data, DIA-NN and Spectronaut are supported. For TMT data, FragPipe and MaxQuant are again supported. Choose the platform that you used to quantify your proteomics data.


Step 3: Choose between a single-workflow suggestion and ensemble inference.

If a single workflow is preferred, indicate whether you have preferred options for any workflow step. If so, check Yes and select the options in the checkboxes below; otherwise select No.

For DDA and DIA data, our server also supports suggesting ensemble inference, in which multiple workflows are integrated by a p-value integration method. The ens_multi-quant approach is used by default.

The imputation methods missForest and MLE are quite time-consuming; if they are suggested, we replace them with MinProb.
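
To give a feel for what a p-value integration step does, here is a minimal sketch using Fisher's method. This is an assumption made for illustration: the text above does not state which integration method the server applies, and Fisher's method presumes independent p-values, which is unlikely to hold for workflows run on the same data. The function names are hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Chi-square survival function, closed form valid for even df."""
    k = df // 2
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

def fisher_combine(p_values):
    """Combine the p-values of one protein across workflows (Fisher's method)."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    floor = 1e-300  # guards against log(0)
    stat = -2.0 * sum(math.log(max(p, floor)) for p in p_values)
    return min(1.0, chi2_sf_even_df(stat, 2 * len(p_values)))

# Hypothetical p-values for one protein from two workflows
combined = fisher_combine([0.05, 0.05])
```

Two borderline p-values combine into a smaller one, which shows how integrating several workflows can recover proteins that any single workflow would miss at a fixed threshold.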


Step 4: Click the suggest workflow button or the Try ensemble inference button.

Our server suggests the top-ranked workflow after filtering the workflows by the user-specified option preferences. Potential alternative options are also shown, based on our option comparison results.
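
The filter-then-pick-the-top logic of this step can be sketched as below. The table rows, field names and avg_rank_mean values are hypothetical, and the function is an illustration rather than the server's implementation. It also applies the replacement of time-consuming imputation methods with MinProb described in Step 3; exactly when the server performs that replacement is an assumption here.

```python
SLOW_MVI = {"missForest", "MLE"}  # named as time-consuming in Step 3

def suggest_workflow(ranked_workflows, preferences=None):
    """Return the best-ranked workflow matching the preferences, or None.

    preferences maps a step name to the set of options the user accepts.
    """
    preferences = preferences or {}
    candidates = [
        wf for wf in ranked_workflows
        if all(wf[step] in allowed for step, allowed in preferences.items())
    ]
    if not candidates:
        return None
    best = dict(min(candidates, key=lambda wf: wf["avg_rank_mean"]))
    if best["mvi"] in SLOW_MVI:
        best["mvi"] = "MinProb"  # assumed point of replacement
    return best

# Hypothetical rows; the avg_rank_mean values are made up
ranked_workflows = [
    {"matrix": "dlfq", "normalization": "None", "mvi": "missForest",
     "dea_tool": "DEqMS", "avg_rank_mean": 12.4},
    {"matrix": "dlfq", "normalization": "None", "mvi": "SeqKNN",
     "dea_tool": "limma", "avg_rank_mean": 15.0},
    {"matrix": "dlfq", "normalization": "None", "mvi": "SeqKNN",
     "dea_tool": "ROTS", "avg_rank_mean": 18.2},
]
top = suggest_workflow(ranked_workflows)
preferred = suggest_workflow(ranked_workflows, {"dea_tool": {"limma", "ROTS"}})
```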


Step 5: Upload raw quantification result data, specify thresholds and submit the DEA task.

The user should upload the raw quantification results for DEA, e.g., the combined_protein.tsv file from FragPipe. The design file describing the sample condition and group information must be uploaded at the same time. Read the file requirements carefully before uploading your files. The log2FC threshold and the adj.pvalue threshold (p-values are adjusted with the BH method) should be specified. Finally, click the DEA button to submit the task.
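
How the two thresholds interact can be sketched as follows: p-values are first adjusted with the Benjamini-Hochberg (BH) procedure, and a protein is then called differentially expressed when both its absolute log2FC and its adjusted p-value pass the cut-offs. The default threshold values, the use of non-strict inequalities and the function names are assumptions made for illustration; on the server the thresholds are whatever the user specifies.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def call_de(log2fc, p_values, log2fc_threshold=1.0, adj_p_threshold=0.05):
    """Flag proteins that pass both thresholds (hypothetical defaults)."""
    adjusted = bh_adjust(p_values)
    return [abs(fc) >= log2fc_threshold and q <= adj_p_threshold
            for fc, q in zip(log2fc, adjusted)]

# Hypothetical results for four proteins
p_values = [0.01, 0.04, 0.03, 0.20]
log2fc = [2.0, -1.5, 0.3, 3.0]
adjusted = bh_adjust(p_values)
calls = call_de(log2fc, p_values)
```

In this example the second protein has a raw p-value below 0.05 and a large fold change, yet it is not called because its adjusted p-value exceeds the threshold.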


Step 6: Download DEA results.

After submitting the DEA task, please wait until it has completed. A volcano plot will be generated, and a link for downloading the DEA results appears below the volcano plot.

Contact Info


Contact:

For any problems, please contact Hui Peng:

Email: hui.peng@ntu.edu.sg


Alternatively, email the corresponding authors:

Jinyan Li: jinyan.li@siat.ac.cn; Wilson Wen Bin Goh: wilsongoh@ntu.edu.sg