Sanger’s super-sized sequencing scales new heights

We’re celebrating: we’ve just read the same amount of DNA in one year as we achieved in the previous 25 years combined. This dizzying speed offers unprecedented possibilities to unlock new understanding in health and disease.

By: Ali Cranage, science writer at the Wellcome Sanger Institute


Super-fast, super-accurate sequencing

Time flies when you work in genomics. What used to take years now takes minutes. Technology has moved rapidly over the last 25 years, and the boundaries of what is possible are constantly being pushed. It took 13 years to determine the order of all three billion letters (or bases) of DNA in the first human genome. Today at the Wellcome Sanger Institute, we are sequencing DNA at a rate equivalent to one human genome every 18 minutes.

The Sanger Institute was founded in 1993, as part of the effort to sequence the first human genome. In 2018, the year of our 25th anniversary, we celebrated sequencing a total of 5 Petabases (5×10¹⁵) of DNA.

We’ve not only sequenced human genomes, but hundreds of other species’ DNA too – mammals, bacteria, viruses and protozoa. Now, just 13 months after reaching a milestone of 5 Pb, we’ve hit 10 Pb.
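As a rough back-of-envelope check on that acceleration, the two milestones can be compared directly. This sketch assumes the first 5 Pb was spread evenly over the full 25 years and the second over 13 months; the variable names are illustrative only.

```python
# Back-of-envelope comparison of average sequencing rates,
# using only the figures quoted in the article.
PETABASE = 10**15  # bases

first_total = 5 * PETABASE    # bases read between 1993 and 2018
first_months = 25 * 12        # assumption: averaged over the full 25 years

second_total = 5 * PETABASE   # bases read since the 5 Pb milestone
second_months = 13

first_rate = first_total / first_months      # bases per month
second_rate = second_total / second_months   # bases per month
speedup = second_rate / first_rate

print(f"Average rate, first 25 years: {first_rate:.2e} bases/month")
print(f"Average rate, last 13 months: {second_rate:.2e} bases/month")
print(f"Speed-up: about {speedup:.0f}x")  # about 23x
```

These are averages only; the real rate rose gradually over the 25 years rather than jumping in one step.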

The massive acceleration in our data production is down to one project and the technology we’re using to deliver it: UK Biobank and Illumina’s NovaSeq machines.

UK Biobank – Defeating disease through data, genomics, and volunteers

Half a million people across Britain signed up for the UK Biobank project between 2006 and 2010. They donated blood and urine samples, provided information about their lifestyles and other biometric data, and agreed to regular health checks throughout the project.

The aim is to aid research into diseases like cancer, arthritis, depression and dementia – to name just a few. The huge wealth of data enables researchers to look for patterns and trends over time, and search for ways to prevent and treat disease.

Now, genome sequence data is being added, too. The Sanger Institute is sequencing the genomes of 50,000 of the participants in UK Biobank. Like the existing data, the genome data will be available for scientists to analyse. It will aid understanding of the genetic causes and risk of diseases. Researchers will be able to compare genome sequences and characteristics between thousands of people, advancing our understanding of genetics in health and disease.

UK Biobank is a powerful and rich resource.

How the Sanger Institute reads the genomes of people to discover new insights into health and disease.

Sequencing technology: automation allows 24/7 DNA reading

The sequencing research and development (R&D) team at the Sanger Institute worked with scientists at Illumina for three months to optimise the new fleet of 10 NovaSeq machines, before starting the DNA sequencing for UK Biobank. This included testing all the parameters to get consistent performance, refining the requirements for the DNA preparation for input, and defining thresholds for quality control checks.

Data is now continuously flowing from the machines.

The UK Biobank samples are processed by the DNA Pipelines team, who manage the sequencing day to day. They operate the machines at high throughput, running nearly 24/7, with approximately 5,000 genomes sequenced each month.

Genomes Assemble: sticking the code back together

The ‘raw’ genome sequence data, straight from the machines, isn’t useful on its own, because it is in pieces. The sequencing technology relies on a genome being chopped into billions of overlapping chunks to be sequenced. Those pieces need reassembling at the other end of the process, to get back the whole genome sequence.

To add to the complexity, each machine processes up to 56 individuals’ genomes at a time.
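The idea of sticking overlapping pieces back together can be illustrated with a toy example. This is a minimal sketch of a greedy overlap merge on a made-up 14-letter sequence, not the method used in production: as described later in the article, real pipelines align reads with dedicated software. The function names and the example reads are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left overlaps; leave the pieces separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping chunks of the made-up sequence ATGCGTACGTTAGC, out of order.
chunks = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(chunks))  # ['ATGCGTACGTTAGC']
```

A greedy merge like this can be fooled by repeated stretches of sequence, which is one reason real genomes need far more careful methods.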

Digital alchemy: converting matter into ones and zeroes

Another big part of the UK Biobank project set-up was undertaken by IT and bioinformatics teams, to deal with all the data. While our genomes are, in reality, microscopic (there is a copy inside almost every one of our 37 trillion cells), when we analyse them we need to consider the huge number of bytes – the digital space that the information gleaned takes up.

One NovaSeq produces about 2.4 Terabytes (TB) of data every two days. Analysing and processing that data takes an additional 5.6 TB of computing space, temporarily. This processing includes alignment (putting the billions of pieces together to get the complete genome sequence) and compression of the data into file formats suitable for transfer to researchers. Processing uses our on-site data centre, which runs an OpenStack system and provides a secure, flexible and local cloud computing environment.
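To get a feel for what those figures mean across the whole fleet, here is a rough calculation. It assumes all 10 NovaSeq machines mentioned earlier run back to back with no downtime, which is an idealisation; the constant names are illustrative.

```python
# Rough data-volume arithmetic from the figures quoted in the article.
MACHINES = 10              # NovaSeq fleet size quoted earlier
OUTPUT_TB_PER_RUN = 2.4    # data produced by one machine in one run
RUN_DAYS = 2               # one run takes about two days
SCRATCH_TB_PER_RUN = 5.6   # extra temporary space needed during processing

# Peak footprint of a single run while it is being processed.
peak_tb_per_run = OUTPUT_TB_PER_RUN + SCRATCH_TB_PER_RUN

# Assumption: every machine runs continuously, one run after another.
fleet_tb_per_day = MACHINES * OUTPUT_TB_PER_RUN / RUN_DAYS
fleet_tb_per_30_days = fleet_tb_per_day * 30

print(f"Peak space per run:     {peak_tb_per_run:.1f} TB")
print(f"Fleet output per day:   {fleet_tb_per_day:.1f} TB")
print(f"Fleet output, 30 days:  {fleet_tb_per_30_days:.0f} TB")
```

Under these assumptions the fleet produces about 12 TB of raw data a day, before compression.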

NovaSeq DNA sequencing machine being loaded with 56 people’s genomes

The data centre’s capacity was recently doubled, and now includes over 32,000 computing cores, with a total of 55 Petabytes of storage.

For UK Biobank sequencing, the whole process takes about five days, from loading DNA onto the machines, through genome alignment and quality control, to returning the sequence to UK Biobank. The processes are mostly automated.

The DNA Pipelines team have sequenced 21,000 of the UK Biobank samples since the project began less than a year ago, and are on track to finish the 50,000 by the end of the year.
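A quick sanity check on that timeline, assuming the rate of roughly 5,000 genomes a month quoted earlier holds steady (an assumption; actual throughput will vary from month to month):

```python
# Simple projection of the remaining UK Biobank sequencing work.
TARGET = 50_000      # genomes to be sequenced in total
DONE = 21_000        # genomes sequenced so far
PER_MONTH = 5_000    # assumption: the approximate monthly rate stays constant

remaining = TARGET - DONE
months_needed = remaining / PER_MONTH

print(f"Genomes remaining: {remaining:,}")
print(f"Months needed at {PER_MONTH:,} per month: {months_needed:.1f}")
```

At that pace the remaining 29,000 genomes would take just under six months.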

Dr Louise Aigrain, together with Di Rajan, both in the sequencing R&D team at the Sanger Institute, set up the NovaSeq machines for the UK Biobank project. Louise told us why she got involved in genomics. “I’ve always been interested in methods – in how things work. I love being at the cutting edge of technology. It’s an exciting time.”

Beyond UK Biobank: from human nature to all of nature

Cordelia has the Darwin Tree of Life (UK) project firmly in her sights. The challenge of preparing, reading and assembling the genomes of the UK’s 66,000 species is one she and her Scientific Operations teams are relishing.
The Darwin Tree of Life (UK) project will require the Sanger Institute’s teams to help prepare, read and assemble the genomes of the UK’s 66,000 species

We use a range of sequencing technologies at the Sanger Institute. Illumina, PacBio and Oxford Nanopore technologies are each used for specific research needs, and often in combination to get the best results. Our research and other projects into cancer, ageing, cellular functions, human genetics, malaria, bacteria and the tree of life all utilise DNA sequencing, and are all adding to the Petabase count.

Dr Cordelia Langford, Director of Scientific Operations at the Sanger Institute, gave us her thoughts on the future of genomics.

“With more data, we can reach even larger scales of inquiry. The depth of data, like in UK Biobank, means researchers can study the differences and similarities between people, and uncover new insights into health and disease.

“And projects like the Darwin Tree of Life, where we will sequence the genomes of all plants, animals, protozoa and fungi in the UK, will generate a huge breadth of data. Researchers will be able to ask fundamental questions about biology and evolution.

“The scale of our sequencing endeavours will deliver new insights into human health, disease and all life on Earth.”
