Tag: Human Cell Atlas

25 years of pushing the scientific boundaries

By: Alison Cranage
Date: 01.10.18

The Sanger Institute was set up to uncover the code of life – the human genome. We opened our doors 25 years ago and became the largest single contributor to the Human Genome Project. The principles that sat behind those endeavours are still fundamental – tackling the biggest challenges, openness and collaboration. Those principles have also helped to make Sanger one of the world’s leaders in genomics and biodata.

The Human Genome Project transformed science. The seemingly simple order of four letters of DNA changes how we understand life. Vast new areas of research have opened up, impacting biology, medicine, agriculture, the environment, businesses and governments.

Alongside our sequencing facilities, our activities and research have grown to utilise genomic knowledge. Now we are using genomics to give us an unprecedented understanding of human health, disease and life on earth.


Sequencing at scale

From the completion of the first human genome in 2003, we moved to the 1,000 and 10,000 genomes projects. Being able to compare sequences between individuals enables the understanding of diversity, evolution and the genetic basis of disease.

One of our latest projects is to work with UK Biobank to sequence the genomes of 50,000 individuals. Participants have already provided a wealth of data about their health and their lives – from blood samples to details of their diet. Linking this information to sequence data means we can understand more than ever before about the connections between our genomes and our health.

Kamilah the gorilla. Image courtesy of San Diego Zoo.

Across a wide range of species

As well as human genomes, Sanger researchers sequence the genomes of pathogens and other organisms. We have published the genomes of thousands of species – from deadly bacteria to worms to the gorilla. This enables research into evolution, infections, drug resistance, outbreaks, symbiosis, biology and host-parasite interactions.


The cumulative amount of DNA the Sanger Institute has read over time

At increasing speed and accuracy

Our sequencing teams, led by Dr Cordelia Langford, are constantly developing the technology to improve both accuracy and speed. In early 2018, we celebrated sequencing over five petabases of DNA (if you typed it all out, it would take 23 million years). The first petabase took just over five years to produce. The fifth took just 169 days. The amount of genomic data now rivals that of the biggest data sources in the world – YouTube, Twitter and astrophysics.
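To put those figures in perspective, here is a small back-of-the-envelope calculation. It is only a sketch: the variable names are made up for illustration, and it assumes an average year of 365.25 days and treats "just over five years" as exactly five.

```python
# Back-of-the-envelope check of the sequencing figures quoted above.
PETABASE = 10**15                       # bases
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumes an average year of 365.25 days

# Typing five petabases in 23 million years implies this many letters per second.
typing_rate = 5 * PETABASE / (23e6 * SECONDS_PER_YEAR)

# Average daily output for the first petabase (taken as exactly five years,
# although the text says "just over") and for the fifth (169 days).
first_rate = PETABASE / (5 * 365.25)
fifth_rate = PETABASE / 169
speed_up = fifth_rate / first_rate

print(f"typing rate: {typing_rate:.1f} letters per second")
print(f"speed-up: {speed_up:.1f}x")
```

Under these assumptions the typing comparison works out at roughly seven letters every second without a break, and the fifth petabase was produced at least ten times faster than the first.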


Supported by Europe’s largest life sciences data centre

The Sanger Institute is not only developing sequencing technology but also leading research in computational science, IT and bioinformatics, developing new ways to store and analyse petabytes of genomic and bio-data.

From sequence to clinic

How genome sequencing, or the sequence of any given individual, can be used hasn’t always been clear. But in the case of rare genetic diseases, it can change lives.


Giving families an answer

The Deciphering Developmental Disorders (DDD) study started 8 years ago, led by Dr Matt Hurles at the Sanger Institute. Over 13,600 children with rare developmental conditions, but without a diagnosis, joined the study. Sanger researchers, working together with clinical geneticists, have used genome sequencing to diagnose their conditions. Forty per cent of the children now have a diagnosis – giving the families some of the answers they were searching for. Knowing the genetic cause of a condition can help doctors manage it, and can help families connect with others and plan for the future.

Stopping outbreaks in hospitals

The ability of researchers to rapidly sequence and analyse bacterial genomes is also leading to advances for patients.

Dr Julian Parkhill and colleagues showed it was possible to track an MRSA outbreak in a neonatal ward in real-time. By sequencing MRSA isolates from patients and staff, they could track the outbreak, following its path from person to person. This enables clinicians to prevent further transmission and bring the outbreak under control.

Now, it is UK policy to sequence the genomes of pathogens in an outbreak.

Fighting epidemics at a global scale

But disease knows no borders. Pathogens can easily spread around the globe. Professor David Aanensen, group leader at the Sanger Institute, is also Director of the recently established Centre for Genomic Pathogen Surveillance. The centre co-ordinates global surveillance of pathogens (such as MRSA and the flu virus) using whole genome sequencing. The data is openly available. Countries around the world can monitor the rise and spread of pathogens as well as their growing resistance to antibiotics. This enables swift action – with the aim of stopping transmission and saving lives.

The forefront of human genomics

The rapid development of technology means researchers can now sequence the DNA, or RNA, from a single cell. Previously, much larger quantities of material were needed. Single cell RNA sequencing is a powerful tool. It allows the study of an individual cell’s activity, functions and composition. And high throughput machines mean hundreds of thousands of cells can be analysed at once.

Capturing every type of cell in the human body, one at a time

The Human Cell Atlas is capitalising on these advances. The international collaboration, launched in 2016, is co-led by Dr Sarah Teichmann at the Sanger Institute. Scientists are using next-generation sequencing to sequence 30-100 million single cells from the human body – out of a total of roughly 37 trillion. The aim is to create a comprehensive, 3D reference map of all human cells. This will lead to a deeper understanding of cells as the building blocks of life. It will form a new basis for understanding human health and diagnosing, monitoring, and treating disease.

Like the Human Genome Project before it, this huge project will disrupt science and human biology. And like the Human Genome Project, it will drive the development of the technology needed to make it possible.

The diversity of life

Beyond human health, genome sequence data allows the study of evolution, biology and biodiversity.


25 Genomes for 25 years

For our 25th anniversary we have sequenced a more diverse range of species than ever before: 25 different species that represent biodiversity in the UK, from the golden eagle to the humble blackberry. Sequencing new species will push development of our technologies as each presents unique challenges. The sequences themselves will aid research into population genetics, evolution, biodiversity management, conservation and climate change.

But 25 species is just the beginning. Every single living thing has a genome, made up of exactly the same molecules of DNA or RNA. We want to uncover how the order of those molecules leads to the diversity of life on earth.


It took 13 years to sequence the first human genome. When the project began, no-one knew where it would lead. Now we sequence the equivalent of one gold-standard (30x) human genome in 24 minutes – faster and deeper genomic insights are enabling discoveries that improve health and our understanding of biology. These insights are happening right now, and they will lead to unimagined benefits for future generations – all possible from a sequence of four letters of DNA code.

About the author:

Alison Cranage is a science writer for the Wellcome Sanger Institute.


A trusty guide for exploring the complexity of cells

By: Martin Hemberg and Vladimir Kiselev
Date: 14.05.18

Scmap can map individual cells from a query sample to cell types or individual cells in a reference. Previously identified cell types are coloured, unknown types are grey.

Ever since scientists first used a microscope to inspect cells, it has been recognized that they can be grouped into distinct cell-types based on their morphology. The difference between cell-types, both in terms of form and function, can be striking, even though all somatic cells in an organism share the same DNA. These differences arise because each cell-type expresses only ~10,000 of the ~20,000 genes that are present in our genomes, and the set of expressed genes differs between cell-types.

Traditionally, cell-types are defined based on morphology – shape. However, recent technological advances have made it possible to measure the expression level of all approximately 20,000 genes in individual cells. The technology is known as single-cell RNA-seq (scRNA-seq) and it builds upon the powerful methods that were initially developed as part of the Human Genome Project.

To carry out a scRNA-seq experiment, the biological sample provided (e.g. some blood, a piece of skin or a biopsy from an organ) is dissociated and the cells are isolated individually. A set number of cells are then randomly selected to have their mRNA extracted and profiled. Using computational analysis methods, cells with similar profiles are grouped together, making it possible to identify cell-types based on which genes are expressed.

In the fall of 2016, the Human Cell Atlas (HCA), a hugely ambitious international project to “generate a comprehensive map of all 37 trillion cells in the human body” was launched. The HCA uses scRNA-seq to profile cells from the human body and one of the goals is to define cell-types based on mRNA profiles. Most likely, the first release of the HCA will contain more than 100 million cells that have been profiled using scRNA-seq.

One of the key challenges will be to make sure that the HCA reference can be queried in a way that supports the questions that are likely to be asked most frequently, such as comparing cells from a new sample to the reference. This could be important for example in a clinical setting, where a doctor would be able to compare a patient sample (e.g. from an unhealthy liver) to the reference. Such a query would allow the doctor to determine if there is a major imbalance in the composition of cells, or even if there are cells that have acquired a disease state (e.g. cancer) that is not present in healthy individuals.

To support such queries, we have developed a novel computational method called scmap, which takes a query and a reference scRNA-seq dataset as the input. For each cell in the query, scmap can identify both the cell-type and the individual cell from the reference that provides the best match, as in the Figure above.

Comparing scRNA-seq profiles is challenging, mainly for two reasons: the data is high-dimensional (approximately 20,000 genes) and it is noisy.

Scmap is based on a recently developed feature selection algorithm for scRNA-seq data from the Hemberg lab. The algorithm is able to identify the subset of genes that are most informative for clustering in an unsupervised manner, and it uses state-of-the-art machine learning methods to achieve high specificity and sensitivity. Moreover, scmap is very fast, which means that it can be used for real-time searches of very large references.

Another key feature is that scmap’s internal representation of the reference is greatly compressed which means that it can be run on an ordinary workstation. Finally, scmap is modular which means that a new dataset can be added to the reference without having to re-compute previously added datasets.
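The ideas in the last few paragraphs (a compressed reference, mapping each query cell to its best match, and adding datasets without recomputing old ones) can be illustrated with a toy sketch. This is not the scmap implementation: the class name `CentroidReference`, the use of per-cell-type median profiles, cosine similarity and the 0.7 cut-off are all assumptions chosen for illustration, and feature selection is left out entirely.

```python
import numpy as np


class CentroidReference:
    """Hypothetical compressed reference: one median profile per cell type."""

    def __init__(self):
        self.labels = []     # (dataset name, cell type) for each stored centroid
        self.centroids = []  # one expression vector per stored centroid

    def add_dataset(self, name, expression, cell_types):
        """Add a dataset; centroids that are already stored are left untouched.

        expression: array-like of shape (n_cells, n_genes)
        cell_types: one label per cell
        """
        expression = np.asarray(expression, dtype=float)
        cell_types = np.asarray(cell_types)
        for cell_type in np.unique(cell_types):
            members = expression[cell_types == cell_type]
            self.labels.append((name, str(cell_type)))
            self.centroids.append(np.median(members, axis=0))

    def query(self, cells, threshold=0.7):
        """Return the best-matching label per query cell, or "unassigned"."""
        cells = np.asarray(cells, dtype=float)
        centroids = np.vstack(self.centroids)
        # Normalise rows (guarding against all-zero profiles), then take dot
        # products to get the cosine similarity of every cell to every centroid.
        cell_norms = np.linalg.norm(cells, axis=1, keepdims=True)
        centroid_norms = np.linalg.norm(centroids, axis=1, keepdims=True)
        cells_unit = cells / np.maximum(cell_norms, 1e-12)
        centroids_unit = centroids / np.maximum(centroid_norms, 1e-12)
        similarity = cells_unit @ centroids_unit.T
        best = similarity.argmax(axis=1)
        return [
            self.labels[j] if similarity[i, j] >= threshold else "unassigned"
            for i, j in enumerate(best)
        ]


# Toy data with four "genes"; all values are invented for illustration.
reference = CentroidReference()
reference.add_dataset(
    "atlas_a",
    [[10, 0, 1, 0], [9, 1, 0, 0], [0, 8, 7, 0], [1, 9, 8, 0]],
    ["T cell", "T cell", "B cell", "B cell"],
)
# A second dataset is appended without recomputing the first.
reference.add_dataset("atlas_b", [[5, 5, 0, 0], [6, 4, 0, 0]], ["hepatocyte", "hepatocyte"])

matches = reference.query([[8, 0, 1, 0], [0, 7, 7, 0], [0, 0, 0, 5]])
print(matches)
```

In this toy example the first two query cells map to the T cell and B cell profiles from `atlas_a`, while the third, which expresses only a gene that no reference type uses, falls below the cut-off and is left unassigned, much like the grey cells in the figure. Storing one profile per cell type rather than every cell is what keeps the reference small here.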

Even though the HCA is years from completion, there are already large collections of scRNA-seq datasets available. In addition to the HCA, researchers are also building cell atlases for many of the model organisms that are widely used in biomedical research. The most impressive results to date are two large collections of reference data for the mouse. Researchers have already used scmap to compare the two mouse datasets, and with them the different methodologies for collecting the data, providing an excellent demonstration of how scmap can help to analyse large datasets.

Since scmap carries out a simple yet fundamental operation –  comparison of cells from two datasets – we anticipate that it will become an integral part of many scRNA-seq analysis pipelines, and that other, more complex tasks will come to rely on it. In particular, we believe that the speed and compression afforded by scmap will ensure that the HCA becomes an accessible and easy to use reference for the community.

About the authors:

Dr Martin Hemberg is a Group Leader at the Wellcome Sanger Institute, interested in quantitative models of gene expression.

Dr Vladimir Kiselev is currently the Head of the Cellular Genetics Informatics group at the Wellcome Sanger Institute and used to be a postdoctoral researcher in Dr Martin Hemberg’s group.

Related publication:
Kiselev VY, Yiu A and Hemberg M. (2018). Scmap – projection of single-cell RNA-seq data across datasets. Nature Methods. DOI: 10.1038/nmeth.4644

New computational method reveals where genes are expressed

By: Valentine Svensson
Date: 06.04.18

SpatialDE automatically identifies sub-structures (middle), and links these to genes that depend on spatial location (right) in mouse olfactory bulb data from Stahl et al 2016.

In the body, cells are often considered the fundamental units, much like atoms. In a similar way to how atoms are structurally joined to form molecules, cells form tissues. The organization of these tissues lets different cell types work together, to enable organs in the body to perform their functions. These structures have been studied and catalogued for hundreds of years in the field of histology, using microscopes.

During the 20th century, molecular techniques enabled researchers to investigate how different genes and proteins are used in different parts of tissues, to understand how cell types collaborate in tissues. Large scale projects such as the Protein Atlas or the Allen Brain Atlas have been systematically performing molecular measurements of individual genes and proteins in tissues.

In the last decade, tremendous advancements in the scale and cost effectiveness of molecular measurements have been made. This has led to the analysis of single cell gene expression, i.e. which genes are switched on in a cell. This lets researchers define cell types from molecular data. Similarly, spatially defined molecular measurements of gene expression can now be made on thousands of genes at single cell resolution. Projects that would previously have taken hundreds of people and long time schedules can now be done by individual labs, meaning more types of tissues in more conditions can be investigated.

The most powerful new high throughput methods generate measurements of expression levels for tens of thousands of genes. At this scale, inspecting every gene by eye is not feasible. Typically these sorts of data have been analysed by only looking at a handful of known marker genes.

We have now developed a method that tells us if there is a relationship between genes expressed in cells, and where those cells are located.

Our SpatialDE method filters and sorts all the genes according to how certain we are that cell location matters for the expression levels. In the main data we analysed for our paper, out of close to 12,000 genes measured only 67 genes were filtered as “spatial”. By focusing on this shortlist of genes, researchers can quickly discover genes previously unknown to be related to tissue structure.
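To give a feel for what ranking genes by spatial dependence means, here is a toy sketch. It does not reproduce SpatialDE: it swaps in Moran's I, a classical spatial autocorrelation statistic, and the function names, the Gaussian distance weights and the toy data are all assumptions made for illustration.

```python
import numpy as np


def morans_i(coords, values, bandwidth=1.0):
    """Moran's I of one gene, using Gaussian weights on pairwise distances."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    np.fill_diagonal(weights, 0.0)  # a location is not its own neighbour
    centred = values - values.mean()
    denom = (centred ** 2).sum()
    if denom == 0.0:
        return 0.0  # constant expression carries no spatial signal
    n = len(values)
    return float((n / weights.sum()) * (centred @ weights @ centred) / denom)


def rank_spatial_genes(coords, expression, gene_names, bandwidth=1.0):
    """Sort genes from most to least spatially structured (highest I first)."""
    expression = np.asarray(expression, dtype=float)
    scores = {
        gene: morans_i(coords, expression[:, k], bandwidth)
        for k, gene in enumerate(gene_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy tissue: a 6 x 6 grid of locations and three invented "genes".
xs, ys = np.meshgrid(np.arange(6), np.arange(6))
coords = np.column_stack([xs.ravel(), ys.ravel()])
expression = np.column_stack([
    coords[:, 0],                       # "gradient": rises smoothly along x
    (coords[:, 0] + coords[:, 1]) % 2,  # "checker": alternates between neighbours
    np.ones(len(coords)),               # "flat": identical everywhere
])
ranking = rank_spatial_genes(coords, expression, ["gradient", "checker", "flat"])
for gene, score in ranking:
    print(gene, round(score, 3))
```

The smoothly varying gene comes out on top because neighbouring locations have similar expression, which is the kind of gene a shortlist of "spatial" genes is meant to surface. A real analysis would also need a significance test, which this sketch omits.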

Tissues are often divided into sub-structures, based on visual appearance, or by expression of particular proteins indicating a specific function of that sub-structure. The brain, for example, has different layers, as does skin; the thymus, on the other hand, consists of connected lobules with medullas inside.

The sub-structures are defined by different cell type compositions. For cells to have major functional differences they need to express many genes together that are specific to the function, which will be reflected on a whole tissue level. We created a second method which uses this property to automatically define tissue substructures. In one go, researchers obtain the genes defining the regions, as well as labels for the regions themselves.

This allows researchers to zoom into the structures of the tissue. The markers allow design of downstream functional experiments to investigate which genes cause the structure and which are a consequence of the structure. The spatial labels then allow researchers to investigate the interaction between structures, the development of the structures, and how the tissue performs its function.

Relating cell types to their spatial structure and organization in tissues is a major component in the ongoing Human Cell Atlas project. But the technologies for spatial gene expression measurements are also feasible for individual labs that want to study their tissue of interest on a genomic level. With our methods, researchers can answer new questions about the relation between genes and tissue structure that could not be answered before, which we demonstrate in our paper.

In the long term, genomic and quantitative spatial gene expression measurements, captured and analysed by methods such as SpatialDE, may form the basis of histology and pathology in the clinic. This would allow this area of medical diagnostics to become even more powerful and personalized.

About the author:
Dr Valentine Svensson was an EMBL PhD student supervised by Sarah Teichmann at the Wellcome Sanger Institute, collaborating with Oliver Stegle at the EMBL-EBI when this work was done.  He is now a postdoctoral scholar in the Division of Biology and Biological Engineering at Caltech, working with Lior Pachter on statistics for omics based cell biology.

Related publication:
Valentine Svensson, Sarah A Teichmann and Oliver Stegle. (2018). SpatialDE: identification of spatially variable genes. Nature Methods. DOI: 10.1038/nmeth.4636
