A Deep Dive into Single-Cell RNA Sequencing Foundation Models

Abstract

Large-scale foundation models, which are pre-trained on massive, unlabeled datasets and subsequently fine-tuned on specific tasks, have recently achieved unparalleled success across a wide array of applications, including in healthcare and biology. In this paper, we explore two foundation models recently developed for single-cell RNA sequencing data, scBERT and scGPT. Focusing on the fine-tuning task of cell type annotation, we compare the performance of the pre-trained models against a simple baseline, L1-regularized logistic regression, including in the few-shot setting. We perform ablation studies to determine whether pre-training improves model performance and to better understand the difficulty of the pre-training task in scBERT. Finally, using scBERT as an example, we demonstrate the potential sensitivity of fine-tuning to hyperparameter settings and parameter initializations. Taken together, our results highlight the importance of rigorously testing foundation models against well-established baselines, establishing challenging fine-tuning tasks on which to benchmark foundation models, and performing deep introspection into the embeddings learned by the model in order to more effectively harness these models to transform single-cell data analysis. Code is available at https://github.com/clinicalml/sc-foundation-eval.
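
To make the baseline named in the abstract concrete, the following is a minimal sketch of L1-regularized logistic regression for cell type annotation, evaluated at several few-shot training sizes. It is not the paper's code (see the repository linked above). The synthetic expression matrix, the marker-gene structure, the value of `C`, the choice of accuracy and macro F1 as metrics, and the helper names `simulate_cells`, `few_shot_subset`, and `fit_l1_logreg` are all assumptions made for illustration, and scikit-learn is assumed to be installed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a cells x genes expression matrix.
# Sizes and marker-gene structure are illustrative assumptions.
n_types, n_genes, n_markers = 4, 200, 5


def simulate_cells(n_per_type):
    """Simulate log-transformed counts with a few marker genes per cell type."""
    X = rng.poisson(1.0, size=(n_types * n_per_type, n_genes)).astype(float)
    y = np.repeat(np.arange(n_types), n_per_type)
    for t in range(n_types):
        # Each cell type over-expresses its own block of marker genes.
        cols = slice(t * n_markers, (t + 1) * n_markers)
        X[y == t, cols] += rng.poisson(5.0, size=(n_per_type, n_markers))
    return np.log1p(X), y


def few_shot_subset(X, y, k):
    """Keep k randomly chosen training cells per cell type."""
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == t), size=k, replace=False)
        for t in np.unique(y)
    ])
    return X[idx], y[idx]


def fit_l1_logreg(X, y, C=1.0):
    """Fit logistic regression with an L1 penalty (C is an assumed value)."""
    clf = LogisticRegression(penalty="l1", solver="saga", C=C, max_iter=2000)
    return clf.fit(X, y)


X_train, y_train = simulate_cells(100)
X_test, y_test = simulate_cells(50)

results = {}
for k in (5, 20, 100):  # training cells per type; 100 uses all of them
    X_k, y_k = few_shot_subset(X_train, y_train, k)
    clf = fit_l1_logreg(X_k, y_k)
    pred = clf.predict(X_test)
    results[k] = (accuracy_score(y_test, pred),
                  f1_score(y_test, pred, average="macro"))
    print(f"k={k:3d}  accuracy={results[k][0]:.3f}  macro-F1={results[k][1]:.3f}")
```

On real data, the same loop would take a normalized expression matrix and annotated cell type labels in place of the simulated arrays, and `C` would typically be chosen by cross-validation rather than fixed.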

Publication
bioRxiv
Rebecca (Peyser) Boiarsky
PhD student

Rebecca’s research interests include developing methods to learn disease progression models and discover new biological insights for precision medicine applications. She works on machine learning algorithms that use clinical and genomic data for this purpose, with a particular focus on single-cell RNA sequencing data and cancer.

Alejandro Buendia
Research Engineer
David Sontag
Professor of EECS

My research focuses on advancing machine learning and artificial intelligence, and using these to transform health care.
