Zhaoxiang (Simon) Cai

Senior Data Scientist

The University of Sydney

Children's Medical Research Institute

Biography

I am a dynamic researcher and engineer with a robust background in both academic and industrial settings. My journey includes developing large-scale systems at Goldman Sachs, where I successfully managed complex projects under tight deadlines. My PhD in Cancer Data Science at the Children’s Medical Research Institute, affiliated with the University of Sydney, marked a significant turn in my career towards integrating machine learning with oncological studies. My passion lies in harnessing the power of artificial intelligence to revolutionise healthcare, particularly in understanding and treating cancer. I am committed to pioneering advancements that will transform oncology, speeding the development of innovative therapies and improving patient outcomes. Beyond my professional pursuits, I am an avid enthusiast of the piano, skiing, and badminton.

Interests
  • Artificial Intelligence
  • Cancer Research
  • Big Data
  • Computer Vision
Education
  • PhD in Machine Learning and Medical Research, 2023

    The University of Sydney

  • Master in Business Analytics, 2018

    The University of Melbourne

  • BSc (Hons) in Computer Science, 2014

    Monash University

Skills

Technical Skills
Python
Data Science
R
Linux
Hobbies
Skiing/Snowboarding
Badminton
Reading

Experience

University of Sydney
Conjoint Associate Lecturer
June 2023 – Present, Sydney

Children's Medical Research Institute
Senior Data Scientist
February 2023 – Present, Westmead
  • Developed a new deep learning-based approach that incorporates human knowledge into multi-omic data integration
  • Designed and built multi-view VAE models customised for multi-omic data integration
  • Performed end-to-end whole exome/genome sequencing data analyses for germline/somatic mutations, copy number variations and structural variants
  • Performed end-to-end proteomic data analyses, including data QC, peptide-to-protein rollup, pre-processing, differential expression analysis, pathway analysis and survival analysis
  • Worked on integrating histopathological images with proteomic data to improve diagnosis

Australia Education Management Group
Lecturer
March 2023 – Present, Melbourne
  • Fundamentals of Programming
  • Software Engineering with Java

Children's Medical Research Institute
Data Scientist
January 2019 – February 2020, Westmead
  • Built pipelines using existing models for single-cell RNA-seq analysis in mouse developmental biology
  • Built deep learning models for live-cell imaging data analysis

Goldman Sachs
Analyst Programmer
November 2014 – December 2017, Melbourne
  • Liaised with business stakeholders on project scope and provided ongoing updates
  • Designed, developed, tested and deployed system solutions for the Goldman Sachs Electronic Trading (GSET) business flow
  • Provided production support and maintained the health of the testing environment

Recent Publications

(2022). Machine learning for multi-omics data integration in cancer. In iScience.

(2022). Opportunities for pharmacoproteomics in biomarker discovery. In Proteomics.

(2015). HetFHMM: A novel approach to infer tumor heterogeneity using factorial Hidden Markov model. In arXiv.

(2014). Barcode-Like Paper Sensor for Smartphone Diagnostics: An Application of Blood Typing. In Anal. Chem.

Projects

Synthetic augmentation of cancer cell line multi-omic datasets using unsupervised deep learning
Integrative analysis of multi-omic datasets remains a challenge due to gaps and heterogeneity. We present a bespoke unsupervised deep learning model that generates synthetic multi-omic data for 1,523 cancer cell lines, completing the gaps and increasing the number of molecular and phenotypic profiles by 32.
Machine learning of cancer type and tissue of origin from proteomes of 1,277 human tissue samples and 975 cancer cell lines
Cancer type is determined via assessment of tumour morphology, aided by immunohistochemical staining patterns. The development of machine learning (ML) models using histology slides has powered the image-based prediction of the site of origin in cancer of unknown primary (CUP).
Transformer-based deep learning integrates multi-omic data with cancer pathways
Multi-omic data analysis incorporating machine learning has the potential to significantly improve cancer diagnosis and prognosis. Traditional machine learning methods are usually limited to omic measurements, omitting existing domain knowledge, such as the biological networks that link molecular entities in various omic data types.
Pan-cancer proteomic map of 949 human cell lines reveals principles of cancer vulnerabilities
Proteomic data can reveal novel associations between genotype and phenotype, beyond what is apparent from genomics or transcriptomics alone. However, a lack of large proteomic datasets across a range of cancer types has limited our understanding of proteome network organisation and regulation.
Vaccine misinformation identification (1st prize winner)
The spread of misinformation can undermine confidence in vaccination. The capacity for social media to quickly and effectively spread information or misinformation is a pressing question for governments and global agencies.
Ocular Disease Intelligent Recognition (3rd prize winner)
In ophthalmology, fundus screening is an economical and effective way to prevent, as early as possible, blindness caused by diabetes, glaucoma, cataract, age-related macular degeneration (AMD) and many other conditions.
Differentiation Scoring of Embryonic Cells via a Single-cell based Reference Lineage Tree
I designed a new method that embryologists can use to determine embryo differentiation propensity from bulk RNA-seq or qPCR data. The method is novel in combining different single-cell datasets into a reference, then mapping the bulk RNA data onto the lineage tree inferred from the single-cell data.
KPMG-MBS Natural Language Processing Challenge (Prize Winner)
KPMG and Melbourne Business School held a data science challenge with a topic in Natural Language Processing (NLP). The task was to explore complaints data from banks and analyse how NLP techniques can be leveraged.
Prioritization Algorithm Enhancement and Simplification in Job Recommendation
During a five-week internship at SEEK, three other group members and I, together with a mentor from SEEK, developed a model to prioritise the relevant jobs selected by existing algorithms.