Zhaoxiang Cai, Emma L Boys, Zaynab Noor, …, Roger R Reddel
The proteome provides unique insights into disease biology beyond the genome and transcriptome. However, the sharing of raw proteomic data across institutions is hindered by privacy concerns and data volume. Here, we present a federated deep learning framework for cancer subtyping using mass spectrometry-based proteomic data. By training on distributed datasets without centralized data sharing, our approach achieves performance comparable to centralized training. We demonstrate the utility of this framework by classifying 14 cancer subtypes across 7,500 cancer proteomes from multiple centers. This work introduces the first application of federated deep learning to cancer proteomics, enabling collaborative research while preserving data privacy.
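As a rough illustration of training on distributed datasets without centralized data sharing, the sketch below shows one aggregation round in which only model parameters, never raw proteomic data, leave each center. It assumes sample-size-weighted federated averaging (FedAvg); the abstract does not state which aggregation rule the framework uses, and the function name and flat parameter-list representation are hypothetical.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Combine per-center model parameters into one global parameter vector.

    Assumption: sample-size-weighted averaging (FedAvg). Each center sends
    only its locally trained parameters and its sample count.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one parameter vector and one size per center")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    if any(len(w) != dim for w in client_weights):
        raise ValueError("all parameter vectors must have the same length")

    averaged = [0.0] * dim
    for weights, n in zip(client_weights, client_sizes):
        share = n / total  # this center's contribution to the global model
        for i, w in enumerate(weights):
            averaged[i] += w * share
    return averaged


# Two hypothetical centers with 1 and 3 samples respectively.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In a full training loop, the averaged parameters would be sent back to every center as the starting point for the next round of local training.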
Yangxiu Wu, Zhaoxiang Cai, Daniel Cross, …, Karen L MacKenzie
Telomere maintenance is a hallmark of cancer. Here, we present a large-scale analysis of telomere maintenance mechanisms (TMMs) in 976 cancer cell lines. Integrating proteomic, genomic, and transcriptomic data with drug sensitivity and CRISPR-Cas9 gene essentiality screens, we identify molecular features associated with telomerase activity and Alternative Lengthening of Telomeres (ALT). We discover broad heterogeneity in telomere biology beyond the binary TMM classification and develop multi-omic predictors for TMM status. Our findings reveal potential therapeutic vulnerabilities linked to specific TMMs, providing a resource for developing telomere-targeted cancer therapies.
Ana R Baião, Zhaoxiang Cai, Rebecca C Poulos, …, Emanuel Gonçalves
The integration of multi-omics data is essential for understanding complex biological systems. This review provides a comprehensive overview of multi-omics data integration methods, ranging from classical statistical approaches to state-of-the-art deep generative models. We discuss the challenges associated with high dimensionality, heterogeneity, and missing data, and highlight the potential of Variational Autoencoders (VAEs) and other deep learning techniques for data imputation, augmentation, and joint embedding. The review also covers emerging trends such as foundation models and contrastive learning in the context of multi-omics integration.
Pierre Osteil, Sarah Withey, Nicole Santucci, …, Patrick P L Tam
MIXL1 plays a critical role in endoderm differentiation. Here, we demonstrate that MIXL1 activation is a key determinant of the efficiency of definitive endoderm generation from human induced pluripotent stem cells (hiPSCs). By modulating MIXL1 expression, we show that lineage propensity can be re-wired, enhancing the differentiation potential toward endoderm lineages. This work provides insights for optimizing stem cell differentiation protocols for regenerative medicine applications.
Zhaoxiang Cai, Rebecca C Poulos, Jianmin Liu, Qing Zhong
DeePathNet is a transformer-based deep learning model that integrates multi-omic data with biological pathway information. By embedding pathway knowledge directly into the model architecture, DeePathNet improves the interpretability and performance of cancer subtype classification and drug response prediction. We demonstrate the utility of DeePathNet on large-scale datasets, highlighting its ability to identify pathway-level biomarkers and mechanisms of action.
Zhaoxiang Cai, Samuel Apolinário, Ana R Baião, …, Emanuel Gonçalves
We introduce MOSA (Multi-Omic Synthetic Augmentation), an unsupervised deep learning model for integrating and augmenting cancer multi-omics data. By leveraging variational autoencoders, MOSA generates synthetic multi-omic profiles that expand the effective sample size of cancer datasets, enabling the discovery of new biomarkers and drug targets. We demonstrate that MOSA-augmented data improves the power of association studies and clustering analyses, providing a valuable resource for the cancer research community.
Zhaoxiang Cai, Emanuel Gonçalves, Rebecca C Poulos, …, Roger R Reddel
The proteome provides unique insights into disease biology beyond the genome and transcriptome. A lack of large proteomic datasets has restricted the identification of new cancer biomarkers. Here, proteomes of 949 cancer cell lines across 28 tissue types are analyzed by mass spectrometry. Deploying a workflow to quantify 8,498 proteins, these data capture evidence of cell-type and post-transcriptional modifications. Integrating multi-omics, drug response, and CRISPR-Cas9 gene essentiality screens with a deep learning-based pipeline reveals thousands of protein biomarkers of cancer vulnerabilities that are not significant at the transcript level. The power of the proteome to predict drug response is very similar to that of the transcriptome. Further, random downsampling to only 1,500 proteins has limited impact on predictive power, consistent with protein networks being highly connected and co-regulated. This pan-cancer proteomic map (ProCan-DepMapSanger) is a comprehensive resource available at https://cellmodelpassports.sanger.ac.uk.
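The random downsampling experiment mentioned above can be pictured as drawing a random subset of protein identifiers before refitting a predictor. The following is a minimal sketch of that selection step only; the function name, the placeholder identifiers, and the seed handling are assumptions for illustration and are not taken from the ProCan-DepMapSanger pipeline.

```python
import random
from typing import List, Sequence


def random_protein_subset(
    protein_ids: Sequence[str], k: int, seed: int = 0
) -> List[str]:
    """Return k protein identifiers drawn uniformly without replacement.

    Hypothetical helper: a fixed seed makes the draw reproducible so that
    repeated downsampling runs can be compared.
    """
    if k > len(protein_ids):
        raise ValueError("cannot sample more proteins than are available")
    rng = random.Random(seed)  # local generator, leaves global state alone
    return rng.sample(list(protein_ids), k)


# Placeholder identifiers standing in for the 8,498 quantified proteins.
all_proteins = [f"protein_{i}" for i in range(8498)]
subset = random_protein_subset(all_proteins, k=1500, seed=42)
```

Repeating the draw with different seeds and comparing predictive power on each subset would give a sense of how robust the result is to the particular proteins retained.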
Zhaoxiang Cai, Rebecca C Poulos, Jia Liu, Qing Zhong
Multi-omics data analysis is an important aspect of cancer molecular biology studies and has led to ground-breaking discoveries. Many efforts have been made to develop machine learning methods that automatically integrate omics data. Here, we review machine learning tools categorised as either general-purpose or task-specific, covering both supervised and unsupervised learning for integrative analysis of multi-omics data. We benchmark the performance of five machine learning approaches using data from the Cancer Cell Line Encyclopedia, reporting prediction accuracy on cancer type prediction and mean absolute error on drug response prediction, and evaluating runtime efficiency. This review provides recommendations to researchers regarding suitable machine learning method selection for their specific applications. It should also promote the development of novel machine learning methodologies for data integration, which will be essential for drug discovery, clinical trial design and personalised treatments.
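The two benchmark metrics named above have simple definitions: prediction accuracy is the fraction of samples whose cancer type is predicted correctly, and mean absolute error is the average absolute difference between predicted and measured drug response. A minimal sketch follows; the function names and the toy labels and values are illustrative assumptions, not data from the review.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between predicted and observed values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy cancer type labels and drug response values, for illustration only.
type_accuracy = accuracy(
    ["breast", "lung", "skin", "lung"],
    ["breast", "lung", "lung", "lung"],
)
response_mae = mean_absolute_error([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
```

Higher accuracy is better for the classification task, whereas lower mean absolute error is better for the regression task, so the two numbers are not directly comparable across tasks.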
Rebecca C Poulos, Zhaoxiang Cai, Phillip J Robinson, Roger R Reddel, Qing Zhong
Proteomic data are a uniquely valuable resource for drug response prediction and biomarker discovery because most drugs interact directly with proteins in target cells rather than with DNA or RNA. Recent advances in mass spectrometry and associated processing methods have enabled the generation of large-scale proteomic datasets. Here we review the significant opportunities that currently exist to combine large-scale proteomic data with drug-related research, a field termed pharmacoproteomics. We describe successful applications of drug response prediction using molecular data, with an emphasis on oncology. We focus on technical advances in data-independent acquisition mass spectrometry (DIA-MS) that can facilitate the discovery of protein biomarkers for drug responses, alongside the increased availability of big biomedical data. We spotlight new opportunities for machine learning in pharmacoproteomics, driven by the combination of these large datasets and improved high-performance computing. Finally, we explore the value of pre-clinical models for pharmacoproteomic studies and the accompanying challenges of clinical validation. We propose that pharmacoproteomics offers the potential for novel discovery and innovation within the cancer landscape.
Gholamreza Haffari, Zhaoxiang Cai, Mohammad S Rahman, Ann E Nicholson
Cancer arises from successive rounds of mutations, which generate tumor cells with different genomic variations, i.e., clones. For drug responsiveness and therapeutics, it is necessary to identify the clones in a tumor sample accurately. Many methods have been developed to infer tumor heterogeneity, either by computing cellular prevalence and tumor phylogeny or by predicting the genotypes of mutations. These methods suffer from several problems, e.g., inaccurate computation of clonal frequencies and discarding of clone-specific genotypes. In this paper, we propose a method called HetFHMM to infer tumor heterogeneity by predicting clone-specific genotypes and cellular prevalence. To infer clone-specific genotypes, we consider the presence of multiple mutations at any genomic location. We also tested our model on different simulated data. The results show that HetFHMM outperforms recent methods for inferring tumor heterogeneity. HetFHMM is therefore a novel approach in tumor heterogeneity research.
This study introduced a barcode-like design into a paper-based blood typing device by integrating it with smartphone-based technology. Presenting a paper-based blood typing assay in a barcode-like pattern significantly enhanced the adaptability of the assay to smartphone technology. The device was fabricated using a printing technique to define hydrophilic bar channels, which were treated with Anti-A, Anti-B, and Anti-D antibodies, respectively. Blood typing assays were then performed by introducing a blood sample into these channels. Blood type can be visually identified from the eluting lengths in the bar channels. A smartphone-based analytical application was designed to read the bar channels, analogous to scanning a barcode, interpret this information, and then report the results to users. The proposed paper-based blood typing device is rapidly read by smartphones and easy for the user to operate. We envisage that the adaptation of paper-based devices to widely accepted smartphone technology will increase the capability of paper-based diagnostics through rapid interpretation, storage, and transmission of assay results.