Federated deep learning enables cancer subtyping by proteomics
This talk highlights our work on developing a federated deep learning framework for privacy-preserving cancer subtyping using mass spectrometry-based proteomic data.
Invited
Invited, oral, and poster presentations at international and national conferences.
Invited
An overview of our recent work in applying machine learning and deep learning to large-scale cancer datasets, including the ProCan proteomic map and multi-omic integration methods.
Oral
Introduction: Proteomic data provide unique insights into the molecular behaviour of cells in both healthy and disease contexts. Proteomics can reveal novel associations between genotype and phenotype, beyond what is apparent from genomics or transcriptomics alone. However, a lack of large proteomic datasets across a range of cancer types has limited our understanding of proteome network organisation and regulation.
Methods: We produced a pan-cancer proteomic map derived from 949 human cancer cell lines, named ProCan-DepMapSanger. The map encompasses more than 40 cancer types derived from over 28 distinct human tissues. The samples were processed with a clinically relevant workflow involving rapid and minimally complex sample preparation. The raw proteomic data were acquired by data-independent acquisition mass spectrometry (DIA-MS) at ProCan® in Australia. The processed data were analysed with a bespoke deep learning-based pipeline (DeeProM) that integrates multi-omics, drug responses and CRISPR-Cas9 gene essentiality information produced at the Wellcome Sanger Institute.
Preliminary Data: Raw DIA-MS data were processed with DIA-NN and MaxLFQ, quantifying 8,498 proteins. The ProCan-DepMapSanger dataset significantly expands the existing molecular characterisations of this broad range of cancer cell line models. High correlations were observed between replicates of each cell line, yielding a sample-wise median Pearson’s correlation coefficient (Pearson’s r) of 0.92. Correlations between unmatched samples from the same instrument or batch were similar to those between randomly paired samples (median Pearson’s r = 0.75). We also confirmed that our dataset is consistent with other independent, previously published proteomic datasets that comprise smaller subsets of the same cell lines. Nonlinear dimensionality reduction using Uniform Manifold Approximation and Projection (UMAP) showed no evidence of instrument or batch effects.
Next, we defined a stringent set of protein quantifications that were supported by measuring more than one peptide (n = 6,692 human proteins). Visualisation of these protein intensities via UMAP showed groupings by cell type of origin. Hematopoietic and lymphoid cells showed the most distinct clustering away from other cell types, and these could be further segregated into different cell lineages. This high-level dimensionality reduction suggested a profile of protein expression that relates to cell type of origin. Furthermore, DeeProM enabled the full integration of proteomic data with drug responses and CRISPR-Cas9 gene essentiality screens to build a comprehensive map of protein-specific biomarkers of cancer vulnerabilities that are essential for cancer cell survival and growth. Notably, to the best of our knowledge, this is the first comprehensive demonstration that proteomic data spanning a broad range of cancer cell types and molecular backgrounds have significant utility for predicting cancer cell vulnerabilities.
Novel Aspect: ProCan-DepMapSanger provides a definitive map of cancer dependencies, and the deep learning-based method DeeProM enables protein biomarker discovery.
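The replicate check reported in the abstract above (a median Pearson's r across replicates of each cell line) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the dictionary layout and the toy intensities are all hypothetical, and the sketch assumes complete profiles with no missing values, which real DIA-MS data would not guarantee.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def median_replicate_r(replicates_by_line):
    """Median Pearson's r over all within-cell-line replicate pairs."""
    rs = []
    for profiles in replicates_by_line.values():
        # Compare every pair of replicates that belong to the same cell line.
        for a, b in combinations(profiles, 2):
            rs.append(pearson_r(a, b))
    return median(rs)


# Hypothetical toy data: log-intensities of four proteins for two cell lines.
toy = {
    "line_A": [[10.0, 12.0, 9.0, 14.0], [10.2, 11.9, 9.1, 13.8]],
    "line_B": [[8.0, 15.0, 11.0, 7.0], [8.1, 14.7, 11.2, 7.3], [7.9, 15.1, 10.8, 7.1]],
}
print(round(median_replicate_r(toy), 3))
```

Running the same function on pairs of unmatched samples from one instrument or batch would give the comparison value against which the replicate correlation is judged.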
Oral
Omics data analysis, powered by machine learning, has significantly improved cancer diagnosis and prognosis. However, most machine learning methods treat each gene as an independent feature and fail to integrate experimentally acquired gene regulation and pathway information. The benefit of using this information increases in the era of multi-omics, because gene regulation is the key mechanism that links different omic layers together. Here, we present an interpretable deep learning model, DeePathNet, which uses cancer-specific pathway information for both single and multi-omics data analysis. DeePathNet leverages the Transformer, a deep learning architecture that originated in natural language processing, to model complex interactions between pathways from omics data. The computation of self-attention in the Transformer module allows DeePathNet to learn the encoding of pathways to achieve superior predictive performance and interpretability. Techniques such as dropout layers are also integrated into DeePathNet to maximise its generalisability to unseen data. Moreover, DeePathNet supports any number of omics layers and can handle missing values. Using multiple evaluation metrics, we demonstrate that DeePathNet robustly outperforms traditional methods for predicting drug response and cancer type on four publicly available datasets, namely COSMIC Cell Lines, Genomics of Drug Sensitivity in Cancer (GDSC), Cancer Cell Line Encyclopedia (CCLE) and Cancer Therapeutics Response Portal (CTRP). DeePathNet also provides reliable model interpretation, potentially enabling biomarker discoveries at the pathway level. Using the Transformer, DeePathNet is the first method that supports multi-omics data analysis, integrates cancer pathway knowledge into modelling, and provides pathway-level model explanation.
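The self-attention step mentioned in the abstract above can be illustrated with a minimal single-head sketch in which each token stands for one pathway. This is not DeePathNet's implementation: how the model actually encodes pathways is not described here, so the token embeddings, the identity projection matrices and the function names below are all assumptions chosen to keep the arithmetic easy to follow.

```python
from math import exp, sqrt


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens holds one embedding row per pathway (n_pathways x d_model).
    Returns (outputs, weights); weights[i][j] is how strongly pathway i
    attends to pathway j, and each row of weights sums to 1.
    """
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = sqrt(len(q[0]))
    weights = [
        softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        for qi in q
    ]
    return matmul(weights, v), weights


# Hypothetical toy input: three pathway tokens with 2-dimensional embeddings.
# Identity projections stand in for weights that would normally be learned.
identity = [[1.0, 0.0], [0.0, 1.0]]
pathway_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(pathway_tokens, identity, identity, identity)
for name, row in zip(["pathway_1", "pathway_2", "pathway_3"], weights):
    print(name, [round(w, 3) for w in row])
```

The attention weight matrix is the quantity usually inspected for interpretation, since each row shows which other pathways contributed to a given pathway's updated representation.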
Oral
Proteomic data can reveal novel associations between genotype and phenotype, beyond what is apparent from genomics or transcriptomics alone. However, a lack of large proteomic datasets across a range of cancer types has limited our understanding of proteome network organisation and regulation. We produced a pan-cancer proteomic map derived from 949 human cancer cell lines. The map encompasses more than 40 cancer types derived from over 28 distinct human tissues. The samples were processed with a clinically relevant workflow involving rapid and minimally complex sample preparation. The raw proteomic data were acquired by data-independent acquisition mass spectrometry (DIA-MS) at ProCan® in Australia, quantifying 8,500 proteins. The processed data were analysed with a bespoke deep learning-based pipeline (DeeProM) that integrates multi-omics, CRISPR-Cas9 gene essentiality and drug sensitivity information produced at the Wellcome Sanger Institute. First, our findings reveal pervasive post-transcriptional modification and thousands of putative protein biomarkers of cancer vulnerabilities. Second, DeeProM statistics show that a fraction of the proteome can confer similar predictive power to the entire transcriptome. This has key implications for the clinical application of proteomics in drug response prediction. Third, we demonstrate that a random subset of the identified proteins can provide robust predictions of cancer cell phenotypes, underpinning the concept of pervasive co-regulation of protein networks. This pan-cancer cell line proteomic map is a comprehensive resource that expands our understanding of cancer proteomes. These data reveal principles of cancer cell phenotypes, including genetic vulnerabilities and drug sensitivities, that are important for developing novel targeted anticancer therapies.
Oral