By: Martin Hemberg and Vladimir Kiselev
Ever since scientists first used a microscope to inspect cells, it has been recognized that cells can be grouped into distinct cell-types based on their morphology. The differences between cell-types, in terms of both form and function, can be striking, even though all somatic cells in an organism share the same DNA. These differences can be attributed to the fact that each cell-type expresses only ~10,000 of the ~20,000 genes that are present in our genomes.
Traditionally, cell-types are defined based on morphology – shape. However, recent technological advances have made it possible to measure the expression level of all approximately 20,000 genes in individual cells. The technology is known as single-cell RNA-seq (scRNA-seq) and it builds upon the powerful methods that were initially developed as part of the Human Genome Project.
To carry out a scRNA-seq experiment, the biological sample provided (e.g. some blood, a piece of skin or a biopsy from an organ) is dissociated and the cells are isolated individually. A set number of cells are then randomly selected to have their mRNA extracted and profiled. Using computational analysis methods, cells with similar profiles are grouped together, making it possible to identify cell-types based on which genes are expressed.
In the fall of 2016, the Human Cell Atlas (HCA), a hugely ambitious international project to “generate a comprehensive map of all 37 trillion cells in the human body”, was launched. The HCA uses scRNA-seq to profile cells from the human body, and one of the goals is to define cell-types based on mRNA profiles. Most likely, the first release of the HCA will contain more than 100 million cells that have been profiled using scRNA-seq.
One of the key challenges will be to make sure that the HCA reference can be queried in a way that supports the questions that are likely to be asked most frequently, such as comparing cells from a new sample to the reference. This could be important for example in a clinical setting, where a doctor would be able to compare a patient sample (e.g. from an unhealthy liver) to the reference. Such a query would allow the doctor to determine if there is a major imbalance in the composition of cells, or even if there are cells that have acquired a disease state (e.g. cancer) that is not present in healthy individuals.
To support such queries, we have developed a novel computational method called scmap, which takes a query and a reference scRNA-seq dataset as the input. For each cell in the query, scmap can identify both the cell-type and the individual cell from the reference that provides the best match, as in the Figure above.
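The cell-type part of such a query can be illustrated with a minimal Python sketch. It is not scmap's actual implementation: it assumes each reference cell-type has been summarised as one average expression profile (a centroid), and it uses cosine similarity with a hypothetical cut-off of 0.7, below which a query cell is left "unassigned". The function name and parameters are illustrative only.

```python
import numpy as np

def assign_cell_types(query, centroids, labels, threshold=0.7):
    """Assign each query cell to the most similar reference cell-type.

    query:     (n_cells, n_genes) expression matrix of the new sample
    centroids: (n_types, n_genes) one summary profile per reference cell-type
    labels:    list of n_types cell-type names
    threshold: hypothetical similarity cut-off (an assumption of this sketch)
    """
    query = np.asarray(query, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    eps = 1e-12  # avoids division by zero for all-zero profiles

    # Normalise rows to unit length so the dot product is the cosine similarity.
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + eps)
    c = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + eps)
    similarity = q @ c.T  # shape (n_cells, n_types)

    best = similarity.argmax(axis=1)
    best_similarity = similarity[np.arange(len(q)), best]

    # Cells that do not resemble any reference cell-type stay unassigned.
    return [
        labels[i] if s >= threshold else "unassigned"
        for i, s in zip(best, best_similarity)
    ]
```

Leaving poorly matching cells unassigned matters in the clinical scenario described earlier: cells in a disease state that is absent from the healthy reference should not be forced into the nearest healthy cell-type.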
Comparing scRNA-seq profiles is challenging, mainly for two reasons: the data is high-dimensional (approximately 20,000 genes) and it is noisy.
Scmap is based on a recently developed feature selection algorithm for scRNA-seq data from the Hemberg lab. The algorithm is able to identify the subset of genes that are most informative for clustering in an unsupervised manner, and it uses state-of-the-art machine learning methods to achieve high specificity and sensitivity. Moreover, scmap is very fast, which means that it can be used for real-time searches of very large references.
Another key feature is that scmap’s internal representation of the reference is greatly compressed, which means that it can be run on an ordinary workstation. Finally, scmap is modular, which means that a new dataset can be added to the reference without having to re-compute previously added datasets.
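The ideas of a compressed and modular reference can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of scmap's internals: the hypothetical `ReferenceIndex` class stores only one median profile per cell-type and dataset rather than every cell, and each dataset's entries are computed independently of the others.

```python
import numpy as np

class ReferenceIndex:
    """Toy reference that keeps one summary profile per (dataset, cell-type)."""

    def __init__(self):
        # Maps (dataset name, cell-type name) -> median expression profile.
        self.centroids = {}

    def add_dataset(self, name, expression, cell_types):
        """Add a dataset without touching entries from earlier datasets.

        expression: (n_cells, n_genes) matrix
        cell_types: n_cells cell-type labels, one per row of `expression`
        """
        expression = np.asarray(expression, dtype=float)
        cell_types = np.asarray(cell_types)
        for cell_type in np.unique(cell_types):
            rows = expression[cell_types == cell_type]
            # Compression step: many cells collapse into a single profile.
            self.centroids[(name, str(cell_type))] = np.median(rows, axis=0)

    def size(self):
        """Number of stored profiles (far fewer than the number of cells)."""
        return len(self.centroids)
```

In this sketch, the storage cost grows with the number of cell-types rather than the number of cells, and adding a second dataset only appends new entries.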
Even though the HCA is years from completion, there are already large collections of scRNA-seq datasets available. In addition to the HCA, researchers are also building cell atlases for many of the model organisms that are widely used in biomedical research. The most impressive results to date are two large collections of reference data for the mouse. Researchers have already used scmap to compare these two mouse datasets, and thereby the different methodologies used to collect the data, providing an excellent demonstration of how scmap can help in analysing large datasets.
Since scmap carries out a simple yet fundamental operation – comparison of cells from two datasets – we anticipate that it will become an integral part of many scRNA-seq analysis pipelines, and that other, more complex tasks will come to rely on it. In particular, we believe that the speed and compression afforded by scmap will ensure that the HCA becomes an accessible and easy-to-use reference for the community.
About the authors:
Dr Martin Hemberg is a Group Leader at the Wellcome Sanger Institute, interested in quantitative models of gene expression.
Dr Vladimir Kiselev is currently the Head of the Cellular Genetics Informatics group at the Wellcome Sanger Institute and was previously a postdoctoral researcher in Dr Martin Hemberg’s group.
Kiselev VY, Yiu A and Hemberg M. (2018). Scmap – projection of single-cell RNA-seq data across datasets. Nature Methods. DOI: 10.1038/nmeth.4644