Zhaoxiang (Simon) Cai

Senior Data Scientist

The University of Sydney

Children's Medical Research Institute

Biography

I am a dynamic researcher and engineer with a robust background in both academic and industrial settings. My journey includes developing large-scale systems at Goldman Sachs, where I successfully managed complex projects under tight deadlines. My PhD in Cancer Data Science at the Children’s Medical Research Institute, affiliated with the University of Sydney, marked a significant turn in my career towards integrating machine learning with oncological studies. My passion lies in harnessing the power of artificial intelligence to revolutionise healthcare, particularly in understanding and treating cancer. I am committed to pioneering advancements that will transform oncology, speeding the development of innovative therapies and improving patient outcomes. Beyond my professional pursuits, I am an avid enthusiast of the piano, skiing, and badminton.

Download my resumé.

Interests

Artificial Intelligence
Cancer Research
Big Data
Computer Vision

Education

PhD in Machine Learning and Medical Research, 2023
The University of Sydney
Master in Business Analytics, 2018
The University of Melbourne
BSc (Hons) in Computer Science, 2014
Monash University

Skills

Technical Skills

Python

Data Science

Linux

Hobbies

Skiing/Snowboarding

Badminton

Reading

Experience

Conjoint Associate Lecturer

University of Sydney

June 2023 – Present, Sydney

Senior Data Scientist

Children’s Medical Research Institute

February 2023 – Present, Westmead

Developed a new deep learning-based approach that incorporates human knowledge into multi-omic data integration
Designed and built multi-view VAE models customised for multi-omic data integration
Performed end-to-end whole exome/genome sequencing data analyses for germline/somatic mutations, copy number variations and structural variants
Performed end-to-end proteomic data analyses, including data QC, peptide-to-protein rollup, pre-processing, differential expression analysis, pathway analysis and survival analysis
Worked on integrating histopathological images with proteomic data to improve diagnosis

Lecturer

Australia Education Management Group

March 2023 – Present, Melbourne

Fundamentals of Programming
Software Engineering with Java

Data Scientist

Children’s Medical Research Institute

January 2019 – February 2020, Westmead

Built pipelines using existing models for single-cell RNA-seq analysis in mouse developmental biology
Built deep learning models for live-cell imaging data analysis

Analyst Programmer

Goldman Sachs

November 2014 – December 2017, Melbourne

Liaised with business stakeholders on project scope and provided ongoing progress updates
Designed, developed, tested and deployed system solutions for the Goldman Sachs Electronic Trading (GSET) business flow
Provided production support and maintained the health of the testing environment

Featured Publications

Zhaoxiang (Simon) Cai, Emanuel Gonçalves, Rebecca C Poulos, Syd Barthorpe, Srikanth S Manda, Natasha Lucas, Alexandra Beck, Daniel Bucio-Noble, Michael Dausmann, Caitlin Hall, Michael Hecker, Jennifer Koh, Sadia Mahboob, Iman Mali, James Morris, Laura Richardson, Akila J Seneviratne, Erin Sykes, Frances Thomas, Sara Valentini, Steven G Williams, Yangxiu Wu, Dylan Xavier, Karen L MacKenzie, Peter G Hains, Brett Tully, Phillip J Robinson, Qing Zhong, Mathew J Garnett, Roger R Reddel

July, 2022 In Cancer Cell

Pan-cancer proteomic map of 949 human cell lines reveals principles of cancer vulnerabilities

The proteome provides unique insights into disease biology beyond the genome and transcriptome. A lack of large proteomic datasets has restricted the identification of new cancer biomarkers. Here, proteomes of 949 cancer cell lines across 28 tissue types are analyzed by mass spectrometry. Deploying a workflow to quantify 8,498 proteins, these data capture evidence of cell-type and post-transcriptional modifications. Integrating multi-omics, drug response, and CRISPR-Cas9 gene essentiality screens with a deep learning-based pipeline reveals thousands of protein biomarkers of cancer vulnerabilities that are not significant at the transcript level. The power of the proteome to predict drug response is very similar to that of the transcriptome. Further, random downsampling to only 1,500 proteins has limited impact on predictive power, consistent with protein networks being highly connected and co-regulated. This pan-cancer proteomic map (ProCan-DepMapSanger) is a comprehensive resource available at https://cellmodelpassports.sanger.ac.uk.

Recent Publications


Zhaoxiang (Simon) Cai, Rebecca C Poulos, Jia Liu, Qing Zhong (2022). Machine learning for multi-omics data integration in cancer. In iScience.


Rebecca C Poulos, Zhaoxiang (Simon) Cai, Phillip J Robinson, Roger R Reddel, Qing Zhong (2022). Opportunities for pharmacoproteomics in biomarker discovery. In Proteomics.


Gholamreza Haffari, Zhaoxiang (Simon) Cai, Mohammad S Rahman, Ann E Nicholson (2015). HetFHMM: A novel approach to infer tumor heterogeneity using factorial Hidden Markov model. In arXiv.


Liyun Guan, Junfei Tian, Rong Cao, Miaosi Li, Zhaoxiang (Simon) Cai, Wei Shen (2014). Barcode-Like Paper Sensor for Smartphone Diagnostics: An Application of Blood Typing. In Anal. Chem.


Projects

Synthetic augmentation of cancer cell line multi-omic datasets using unsupervised deep learning

Integrative analysis of multi-omic datasets remains a challenge due to gaps and heterogeneity. We present a bespoke unsupervised deep learning model that generates synthetic multi-omic data for 1,523 cancer cell lines, completing the gaps and increasing the number of molecular and phenotypic profiles by 32.

Machine learning of cancer type and tissue of origin from proteomes of 1,277 human tissue samples and 975 cancer cell lines

Cancer type is determined via assessment of tumour morphology, aided by immunohistochemical staining patterns. The development of machine learning (ML) models using histology slides has powered the image-based prediction of the site of origin in cancer of unknown primary (CUP).

Transformer-based deep learning integrates multi-omic data with cancer pathways

Multi-omic data analysis incorporating machine learning has the potential to significantly improve cancer diagnosis and prognosis. Traditional machine learning methods are usually limited to omic measurements, omitting existing domain knowledge, such as the biological networks that link molecular entities in various omic data types.

Pan-cancer proteomic map of 949 human cell lines reveals principles of cancer vulnerabilities

Proteomic data can reveal novel associations between genotype and phenotype, beyond what is apparent from genomics or transcriptomics alone. However, a lack of large proteomic datasets across a range of cancer types has limited our understanding of proteome network organisation and regulation.

Vaccine misinformation identification (1st prize winner)

The spread of misinformation can undermine confidence in vaccination. The capacity for social media to quickly and effectively spread information or misinformation is a pressing question for governments and global agencies.

Ocular Disease Intelligent Recognition (3rd prize winner)

In ophthalmology, fundus screening is an economical and effective way to prevent, as early as possible, blindness caused by diabetes, glaucoma, cataract, age-related macular degeneration (AMD) and many other conditions.

Differentiation Scoring of Embryonic Cells via a Single-cell based Reference Lineage Tree

I designed a new method that embryologists can use to determine embryo differentiation propensity from bulk RNA-seq or qPCR data. The method is novel in combining different single-cell datasets into a single reference, then mapping the bulk RNA data onto the lineage tree inferred from that single-cell data.

KPMG-MBS Natural Language Processing Challenge (Prize Winner)

KPMG and Melbourne Business School held a data science challenge with a topic in Natural Language Processing (NLP). The task was to explore complaints data from banks and analyse how NLP techniques can be leveraged.

Prioritization Algorithm Enhancement and Simplification in Job Recommendation

During a five-week internship at SEEK, I worked with three other group members and a mentor from SEEK to develop a model that prioritises the relevant jobs selected by existing algorithms.
