OpDEA: Optimizing Proteomics Data Differential Expression Analysis via High-Performing Rules and Ensemble Inference
Abstract
Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce an ensemble inference to integrate results from individual top-performing workflows to expand differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.
Workflow of a DEA process for label-free proteomics data and available tools for each step in the workflow
Good Generalizability confirmed by Leave-One-Dataset-Out Cross-Validation
Workflow performance levels are predictable
Acknowledgement and Citation
Acknowledgement
This research/project is supported by the National Research Foundation, Singapore under its Industry Alignment Fund – Prepositioning (IAF-PP) Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.
This work was partly supported by the National Innovation Fellow Program of the MOST of China (J.L., Grant No. E327130001).
WWBG also acknowledges support from an MOE Tier 1 award (RS08/21).
Publication
Please cite the following paper:
Hui Peng, He Wang, Weijia Kong, Jinyan Li*, Wilson Wen Bin Goh*. (2024). Optimizing Proteomics Data Differential Expression Analysis via Unveiling High-Performing Rules and Ensemble Inference
FragPipe workflow benchmarking results with label-free DDA data
Maxquant workflow benchmarking results with label-free DDA data
DIA-NN workflow benchmarking results with label-free DIA data
Spectronaut workflow benchmarking results with label-free DIA data
FragPipe workflow benchmarking results with TMT data
Maxquant workflow benchmarking results with TMT data
Recommend optimal Workflow for DDA label-free data
We highly recommend using our offline tool or R package instead for testing DEA workflows (see the following links):
offline OpDEA toolkit: OpDEA toolkit
R package: R package
Do you have any preferred selections?
Click the button again to hide.
Ensemble inference may help improve the true positive rate. Would you like to try it?
Click the button again to hide.
We suggest applying the FragPipe-specific top-ranked workflow:
DEqMS|FragPipe|dlfq|missForest|blank: expression matrix: dlfq; normalization: None; MVI: missForest; DEA tool: DEqMS
or you can apply one of the workflows including the following choices:
expression matrix: directLFQ intensity; normalization: None; MVI: SeqKNN; DEA tool: limma, ROTS
You can view the details of your selected workflow in benchmarking-DDA_LFQ-FragPipe page
thresholds:
Volcano plot of the differential expression analysis (DEA)
Per your info, our suggestions are as follows:
You can view the details of this workflow in benchmarking-DDA_LFQ-FragPipe page
thresholds:
Volcano plot of the differential expression analysis (DEA)
We suggest using the following workflows to conduct ensemble inference:
thresholds:
Volcano plot of the differential expression analysis (DEA)
please check in which step:
expression matrix types
Normalization methods
MVI algorithms
DEA tools
Do you have any preferred selections?
Click the button again to hide.
Ensemble inference may help improve the true positive rate. Would you like to try it?
Click the button again to hide.
We suggest applying the Maxquant-specific top-ranked workflow:
DEqMS|Maxquant|dlfq|Impseq|blank: expression matrix: dlfq; normalization: None; MVI: Impseq; DEA tool: DEqMS
or you can apply one of the workflows including the following choices:
expression matrix: directLFQ intensity; normalization: None; MVI: Impseq; DEA tool: limma
You can view the details of your selected workflow in benchmarking-DDA_LFQ-Maxquant page
thresholds:
Volcano plot of the differential expression analysis (DEA)
Per your info, our suggestions are as follows:
You can view the details of this workflow in benchmarking-DDA_LFQ-Maxquant page
thresholds:
Volcano plot of the differential expression analysis (DEA)
We suggest using the following workflows to conduct ensemble inference:
thresholds:
Volcano plot of the differential expression analysis (DEA)
please check in which step:
expression matrix types
Normalization methods
MVI algorithms
DEA tools
Recommend optimal Workflow for DIA label-free data
We highly recommend using our offline tool or R package instead for testing DEA workflows (see the following links):
offline OpDEA toolkit: OpDEA toolkit
R package: R package
Do you have any preferred selections?
Click the button again to hide.
Ensemble inference may help improve the true positive rate. Would you like to try it?
Click the button again to hide.
We suggest applying the DIA-NN-specific top-ranked workflow:
limma|DIANN|dlfq|MinDet|blank: expression matrix: dlfq; normalization: None; MVI: MinDet; DEA tool: limma
or you can apply one of the workflows including the following choices:
expression matrix: directLFQ intensity; normalization: None; MVI: MinDet; DEA tool: limma
You can view the details of your selected workflow in benchmarking-DIA_LFQ-DIANN page
thresholds:
Volcano plot of the differential expression analysis (DEA)
Per your info, our suggestions are as follows:
You can view the details of this workflow in benchmarking-DIA_LFQ-DIANN page
thresholds:
Volcano plot of the differential expression analysis (DEA)
We suggest using the following workflows to conduct ensemble inference:
thresholds:
Volcano plot of the differential expression analysis (DEA)
please check in which step:
expression matrix types
Normalization methods
MVI algorithms
DEA tools
Do you have any preferred selections?
Click the button again to hide.
Ensemble inference may help improve the true positive rate. Would you like to try it?
Click the button again to hide.
We suggest applying the Spectronaut-specific top-ranked workflow:
limma|DIANN|dlfq|MinDet|blank: expression matrix: dlfq; normalization: None; MVI: MinDet; DEA tool: limma
or you can apply one of the workflows including the following choices:
expression matrix: directLFQ intensity; normalization: None; MVI: Impseq; DEA tool: ROTS, limma
You can view the details of your selected workflow in benchmarking-DIA_LFQ-Spectronaut page
thresholds:
Volcano plot of the differential expression analysis (DEA)
Per your info, our suggestions are as follows:
You can view the details of this workflow in benchmarking-DIA_LFQ-Spectronaut page
thresholds:
Volcano plot of the differential expression analysis (DEA)
We suggest using the following workflows to conduct ensemble inference:
thresholds:
Volcano plot of the differential expression analysis (DEA)
please check in which step:
expression matrix types
Normalization methods
MVI algorithms
DEA tools
Recommend optimal Workflow for TMT data
We highly recommend using our offline tool or R package instead for testing DEA workflows (see the following links):
offline OpDEA toolkit: OpDEA toolkit
R package: R package
Do you have any preferred selections?
Click the button again to hide.
We suggest applying the TMT-FragPipe-specific top-ranked workflow:
limma|FragPipe|abd|SeqKNN|blank: expression matrix: abd; normalization: None; MVI: SeqKNN; DEA tool: limma
or you can apply one of the workflows including the following choices:
expression matrix: TMT-Integrator abundance; normalization: None; MVI: SeqKNN; DEA tool: limma
You can view the details of your selected workflow in benchmarking-TMT-FragPipe page
thresholds:
Volcano plot of the differential expression analysis (DEA)
Per your info, our suggestions are as follows:
You can view the details of your selected workflow in benchmarking-TMT-FragPipe page
thresholds:
Volcano plot of the differential expression analysis (DEA)
please check in which step:
expression matrix types
Normalization methods
MVI algorithms
DEA tools
Do you have any preferred selections?
Click the button again to hide.
We suggest applying the TMT-Maxquant-specific top-ranked workflow:
proDA|Maxquant|intensity|bpca|blank: expression matrix: intensity; normalization: None; MVI: bpca; DEA tool: proDA
or you can apply one of the workflows including the following choices:
expression matrix: Reporter intensity; normalization: None; MVI: bpca; DEA tool: limma, ROTS
You can view the details of your selected workflow in benchmarking-TMT-FragPipe page
thresholds:
Volcano plot of the differential expression analysis (DEA)
Per your info, our suggestions are as follows:
You can view the details of this workflow in benchmarking-TMT-FragPipe page
thresholds:
Volcano plot of the differential expression analysis (DEA)
please check in which step:
Normalization methods
MVI algorithms
DEA tools
Available Resources
Resources
offline OpDEA toolkit: OpDEA toolkit
R package: R package
Raw data links
label-free DDA data
HYE5600735_LFQ: PXD028735
HYE6600735_LFQ: PXD028735
HYEqe735_LFQ: PXD028735
HYEtims735_LFQ: PXD028735
HYtims134_LFQ: PXD036134
HEtims425_LFQ: PXD021425
YUltq006_LFQ: PDC000006
YUltq099_LFQ: PXD002099
YUltq819_LFQ: PXD001819
HEqe408_LFQ: PXD018408
HYqfl683_LFQ: PXD007683
HYEtims777_LFQ: PXD014777
label-free DIA data
HYEtims735_DIA: PXD028735
MYtims709_DIA: PXD034709
HEqe408_DIA: PXD018408
HEof_n600_DIA: PXD026600
HEof_w600_DIA: PXD026600
HYtims134_DIA: PXD036134
HEqe777_DIA: PXD019777
TMT data
HEqe277_TMT10: PXD013277
HYqfl683_TMT11: PXD007683
HYms2faims815_TMT16: PXD020815
HYsps2815_TMT16: PXD020815
HYms2815_TMT16: PXD020815
Quantification data
Please download from Zenodo via the following link
Zenodo: Raw quantification results from FragPipe, Maxquant, DIA-NN and Spectronaut
Or download from Google Drive via the following link
Google Drive: Raw quantification results of DDA data; DIA data; TMT data
Benchmarking result data
Expression matrices are available at:
Zenodo: DDA_expression_matrices; DIA_expression_matrices; TMT_expression_matrices
Performance metrics are available at:
Zenodo: DDA_performance_metrics; DIA_performance_metrics; TMT_performance_metrics
Workflow ranks are available at:
Zenodo: DDA_workflow_ranks; DIA_workflow_ranks; TMT_workflow_ranks
Introduction of the webserver
This webserver includes five main function panels: Introduction, Benchmarking, Suggestion & DEA, Data and Help. In the [Introduction] panel, we first present the abstract of this work, then give an overview of the proteomics data differential expression analysis workflow and the options available in each workflow step. We then use leave-one-dataset-out cross-validation results to show the good generalizability of our benchmarking, which supports the reliability of our webserver in suggesting optimal workflows for new data. We also use 10-fold cross-validation to show that the performance level of a workflow can be predicted with a CatBoost classifier, which underpins the webserver's ability to recommend optimal workflows. Finally, the Acknowledgement and Citation information is shown.
In the [Benchmarking] panel,
users can view our benchmarking results, including the rank position and performance metric values of each workflow tested on our benchmarking datasets.
In the [Suggestion & DEA] panel,
we provide tools for suggesting optimal workflows and for conducting differential expression analysis directly with the suggested workflow.
In the [Data] panel,
users can find links to the raw proteomics data. The quantification results of the raw data, the extracted expression matrices and our benchmarking results can be downloaded.
We also provide a link for downloading our offline toolkit, which has the same functions as the webserver.
In the [Help] panel,
we introduce the webserver and show what users can do with it.
View benchmarking results
Step 1: Choose a setting of interest. For example, for label-free DDA data quantified with FragPipe, click the DDA_LFQ-FragPipe item.
There are 7852, 7852, 6284, 6284, 4720 and 1568 workflows under the settings LFQ_DDA-FragPipe, LFQ_DDA-Maxquant, LFQ_DIA-DIANN, LFQ_DIA-Spectronaut, TMT-FragPipe and TMT-Maxquant, respectively.
Step 2: Filter workflows, e.g., by checking the options of interest in the 4 checkboxes or by entering a keyword.
Drag the scroll bar below the table to check the ranking under a specific metric. Workflows are ranked by their average rank across the five performance metrics (the avg_rank_mean column).
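The average-rank idea behind the avg_rank_mean column can be sketched as follows. This is a minimal illustration, not the server's code: the workflow names, metric names and scores are placeholders, it assumes higher metric values are better, and it ignores how ranks are aggregated across datasets.

```python
from statistics import mean

def rank_descending(values):
    """Return 1-based ranks (1 = largest value); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the tie group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions in the group
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def average_rank(scores):
    """scores: {workflow: {metric: value}} -> {workflow: mean rank over metrics}."""
    workflows = list(scores)
    metrics = list(next(iter(scores.values())))
    per_metric = {m: rank_descending([scores[w][m] for w in workflows]) for m in metrics}
    return {w: mean(per_metric[m][i] for m in metrics) for i, w in enumerate(workflows)}

# Toy scores for three hypothetical workflows on three placeholder metrics.
toy = {
    "wf_A": {"metric_1": 0.90, "metric_2": 0.80, "metric_3": 0.70},
    "wf_B": {"metric_1": 0.85, "metric_2": 0.95, "metric_3": 0.60},
    "wf_C": {"metric_1": 0.50, "metric_2": 0.60, "metric_3": 0.65},
}
avg_rank_mean = average_rank(toy)
best_first = sorted(avg_rank_mean, key=avg_rank_mean.get)  # lowest mean rank first
```

Averaging ranks rather than raw scores keeps metrics with different scales from dominating the ordering.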
Step 3: Click a row of the table to check the performance distributions across different datasets.
12 DDA datasets, 7 DIA datasets and 5 TMT datasets were used to evaluate workflow performance; the boxplots show the distributions of the 5 performance metrics.
Only one row can be selected at a time; click the same row again to cancel the selection.
Step 4: View the details of the performance distributions of the five metrics in the bottom-right figure.
Optimal workflow recommendation & DEA
Step 1: Choose your data type.
Label-free DDA (DDA), label-free DIA (DIA) and TMT data are supported. Choose the one corresponding to the data you have.
Step 2: Choose a quantification platform.
For DDA data, FragPipe and Maxquant are supported. For DIA data, DIA-NN and Spectronaut are supported. For TMT data, again, FragPipe and Maxquant are supported. Just choose the one that you used to quantify your proteomics data.
Step 3: Choose between a single-workflow suggestion and ensemble inference.
If a single workflow is preferred, indicate whether you have preferred options for any workflow step. If so, check Yes and select the options in the checkboxes below; otherwise, select No.
For DDA and DIA data, our server also supports ensemble inference, in which multiple workflows are integrated by a p-value integration method. The ens_multi-quant approach is used by default.
The imputation methods missForest and MLE are quite time-consuming; if they are suggested, we replace them with MinProb.
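To give a feel for what p-value integration across workflows means, here is a minimal sketch using Fisher's method. Fisher's method is an assumption chosen for illustration only; this page does not state which integration method ens_multi-quant uses. The protein names and p-values are made up.

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function of a chi-square variable with even degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    # Closed form for even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def fisher_combine(pvalues):
    """Combine p-values with Fisher's method (illustrative stand-in only)."""
    # Clip to avoid log(0) when a p-value is reported as exactly zero.
    clipped = [min(max(p, 1e-300), 1.0) for p in pvalues]
    statistic = -2.0 * sum(math.log(p) for p in clipped)
    return chi2_sf_even_df(statistic, 2 * len(clipped))

# Hypothetical per-protein p-values reported by three workflows.
per_workflow = {
    "P1": [0.01, 0.03, 0.20],
    "P2": [0.60, 0.45, 0.80],
}
combined = {protein: fisher_combine(ps) for protein, ps in per_workflow.items()}
```

Fisher's method assumes independent p-values, which workflows run on the same samples do not strictly provide, so treat the combined values here as a conceptual illustration.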
Step 4: Click the suggest workflow button or the Try ensemble inference button.
Our server will suggest the top-ranked workflow after filtering the workflows by the user-specified option preferences. Potential alternative options are also shown according to our option comparison results.
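The filter-then-pick-best logic described in this step can be sketched as below. The `Workflow` record, the `suggest` function and the rank values are hypothetical and are not the server's actual data model. The MinProb substitution follows the note in Step 3.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Workflow:
    matrix: str
    normalization: str
    mvi: str
    dea_tool: str
    avg_rank_mean: float  # lower is better

# Imputation methods the help text describes as time-consuming.
SLOW_MVI = {"missForest", "MLE"}

def suggest(workflows, preferences):
    """Return the best-ranked workflow matching every stated preference, or None."""
    def matches(wf):
        return all(getattr(wf, step) in allowed for step, allowed in preferences.items())

    candidates = [wf for wf in workflows if matches(wf)]
    if not candidates:
        return None
    best = min(candidates, key=lambda wf: wf.avg_rank_mean)
    if best.mvi in SLOW_MVI:
        # Swap in MinProb; the stored rank still refers to the original workflow.
        best = replace(best, mvi="MinProb")
    return best

# Made-up rank values, for illustration only.
table = [
    Workflow("dlfq", "None", "missForest", "DEqMS", 10.0),
    Workflow("dlfq", "None", "SeqKNN", "limma", 12.5),
    Workflow("dlfq", "None", "SeqKNN", "ROTS", 14.0),
]
no_pref = suggest(table, {})
limma_only = suggest(table, {"dea_tool": {"limma"}})
```

An empty preference dictionary corresponds to selecting No in Step 3, so the overall top-ranked workflow is returned.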
Step 5: Upload raw quantification result data, specify thresholds and submit the DEA task.
Upload your raw quantification results for DEA, e.g., the combined_protein.tsv file from FragPipe. The design file describing the sample condition and group information must be uploaded at the same time. Read the file requirements carefully before uploading your files. Specify the log2FC threshold and the adj.pvalue threshold (p-values are adjusted with the BH method). Finally, click the DEA button to submit the task.
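How the two thresholds act on DEA output can be sketched as follows: p-values are adjusted with the Benjamini-Hochberg (BH) procedure, and a protein is called differentially expressed when it passes both cutoffs. The default cutoffs and the example proteins are assumptions for illustration; on the server the thresholds are whatever you specify.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def call_de(results, log2fc_cutoff=1.0, adj_p_cutoff=0.05):
    """results: list of (protein, log2FC, p-value) -> proteins passing both cutoffs."""
    adj = bh_adjust([p for _, _, p in results])
    return [
        name
        for (name, lfc, _), q in zip(results, adj)
        if abs(lfc) >= log2fc_cutoff and q <= adj_p_cutoff
    ]

# Hypothetical DEA output: (protein, log2 fold change, raw p-value).
example = [
    ("P1", 2.1, 0.005),
    ("P2", 0.3, 0.010),
    ("P3", -1.5, 0.030),
    ("P4", 1.2, 0.600),
]
significant = call_de(example)
```

In this toy input, P2 is dropped because its fold change is too small and P4 because its adjusted p-value is too large. A volcano plot shows the same two cutoffs as vertical and horizontal lines.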
Step 6: Download DEA results.
After submitting the DEA task, wait until the task is completed. A volcano plot will be generated, and a link for downloading the DEA results will appear below the volcano plot.
Contact Info
Contact:
For any problems, please contact Hui Peng:
Email: hui.peng@ntu.edu.sg
Alternatively, email the corresponding authors:
Jinyan Li: jinyan.li@siat.ac.cn; Wilson Wen Bin Goh: wilsongoh@ntu.edu.sg