How do you sequence over 240,000 whole human genomes?
By Alison Cranage, Science Writer at the Wellcome Sanger Institute
The world’s largest human genome sequencing project has been carried out for UK Biobank, a large-scale biomedical database.
The Sanger Institute has sequenced 243,633 human genomes in a record 3.5 years. Each genome is 3.05 billion pairs of DNA ‘letters’, and each genome was sequenced, on average, 30 times, as standard.
In total, the order of 21 quadrillion (21×10¹⁵) letters of DNA has been determined by Sanger’s sequencing teams. The data are now in the UK Biobank resource and will soon be available to researchers worldwide, as they look for links and patterns in the data that may correlate with disease, or tell us about how our bodies function.
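The headline figure can be sanity-checked from the numbers quoted above. The sketch below simply multiplies the number of genomes by the genome length and the average number of times each was sequenced; the function and variable names are illustrative, not part of any Sanger pipeline. The product comes out at roughly 2.2×10¹⁶, the same order as the quoted 21 quadrillion (the inputs are rounded and the 30-fold depth is an average, so an exact match is not expected).

```python
# Back-of-envelope check using only figures quoted in the article.
GENOMES = 243_633                   # genomes sequenced for UK Biobank
BASES_PER_GENOME = 3_050_000_000    # 3.05 billion letters per genome
MEAN_COVERAGE = 30                  # each genome read about 30 times on average


def total_bases_sequenced(genomes: int, bases_per_genome: int, coverage: int) -> int:
    """Total number of DNA letters read across all genomes."""
    return genomes * bases_per_genome * coverage


total = total_bases_sequenced(GENOMES, BASES_PER_GENOME, MEAN_COVERAGE)
print(f"{total:.2e} letters, i.e. about {total / 1e15:.1f} quadrillion")
```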
The process of getting to these numbers has involved multiple teams who worked on the project between 2018 and 2022. The journey began when shipments of DNA from UK Biobank arrived at the Institute.
Sample reception and preparation
Danni Weldon leads the team responsible for receiving the sample deliveries, unpacking them, scanning them into the Laboratory Information Management System (LIMS) and undertaking quality control checks on the plates, tubes and DNA.
“It was like a factory line, handling such large volumes all at once, and then quantifying the DNA in each sample, and cherry-picking the samples to adjust the concentrations – all in a very short space of time.”
“It’s amazing how much you can do when you have to. And it raises the question of, well, why weren’t we doing this before?”
Danni Weldon

Library preparation
The next step is library preparation, where the DNA is readied for the sequencing machines. Jamie Lovell started at the Sanger Institute as a ‘finisher’ on the Human Genome Project. He now leads the library preparation team in DNA Pipelines.
“Working on the Human Genome Project was amazing, everyone was all in the same boat; we were all looking for the same outcome. It was just such a social time. Plus, I was a lot younger then.”
“Obviously the biggest change since then is the sequencing machines, and I’m sure they will change again. They are so hungry for DNA. From the library preparation side, we are only just keeping up. But as more robotics are brought in – and I’ve half a mind to develop some systems myself – that will also change.”
“I think we proved that we can do hundreds of thousands of human genomes quite easily.”
Jamie Lovell

Sample tracking
Tom Whiteley manages the Laboratory Information Management System (LIMS) team that tracks the samples through the Institute.
“The samples go through a complex and varied workflow, from reception, refrigerated storage, quality checks, to stamping robots, until eventually the samples are put into a sequencing machine. Depending on the results the whole process may have to be repeated to reach the gold standard sequence quality we require. We have to be able to track and report back on all of these lab processes to UK Biobank, and combine it with the final results. The reporting is a lot more rigorous than we’d done before. And definitely not something you can do manually when there are 240,000 samples.”
“The mindset is more like manufacturing. We have to do a set of operations consistently and repeatedly. It’s a slightly different approach to life, compared to research where you are trying to prove a hypothesis.”
“We’re an agile team, and use the scrum process to develop software. It’s people over processes, constantly looking for improvements and we’ve worked extremely well with the lab teams, I think. Going forward, we’re looking at taking some of these ways of working to our other projects.”
Tom Whiteley

DNA sequencing
The next laboratory step sees the DNA libraries loaded onto the Illumina NovaSeq machines, overseen by Tristram Bellerby.
“I’m there to make sure we maximise the instrument’s capacity. I work to make sure they run, meeting the specification, and we get as much data as possible, in as fast a turnaround time as possible.”
Tristram Bellerby
For UK Biobank, the team was able to reach multiplexing, or pooling, of 32 individual human genomes in a single flow cell, with two flow cells on each of the 17 machines.
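To put those numbers together: at full capacity, 32 genomes per flow cell, two flow cells per machine and 17 machines means over a thousand genomes can be on the instruments at once. The sketch below works this out, along with a lower bound on how many flow cells the project would need. It assumes every flow cell ran at the full 32-plex and that nothing had to be repeated, which is an idealisation; the variable names are illustrative.

```python
import math

# Figures quoted in the article.
GENOMES_PER_FLOW_CELL = 32
FLOW_CELLS_PER_MACHINE = 2
MACHINES = 17
TOTAL_GENOMES = 243_633

# Genomes that can be sequenced simultaneously across the whole fleet.
GENOMES_IN_FLIGHT = GENOMES_PER_FLOW_CELL * FLOW_CELLS_PER_MACHINE * MACHINES

# Idealised lower bounds: assumes full pools and no repeat runs.
flow_cells_needed = math.ceil(TOTAL_GENOMES / GENOMES_PER_FLOW_CELL)
fleet_rounds = math.ceil(TOTAL_GENOMES / GENOMES_IN_FLIGHT)

print(f"Genomes in flight at once: {GENOMES_IN_FLIGHT}")
print(f"Flow cells needed (minimum): {flow_cells_needed}")
print(f"Full-fleet rounds (minimum): {fleet_rounds}")
```

In practice the real counts would be higher, since some samples had to go through the process again to reach the required quality.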
“I liaise closely with the Sequencing Research and Development team, and Illumina, to make sure everything is optimised. For UK Biobank, we adjusted our loading concentrations to get the best results and the most data per run. We improved our processes over the course of the project.”
“It brought a great sense of achievement because when we consistently hit the numbers we wanted, we knew the data would pass quality control, it would go out the door to be analysed.”
Tristram Bellerby

Data processing and analysis
Each NovaSeq produces 1.8 terabytes (TB) of data every day, as it turns the chemical structure of DNA into digital code. There is a huge computational task to turn the raw data from the machines into data that represents whole genome sequences. This includes running algorithms to de-multiplex the pooled data, and to join the billions of fragments of code together into contiguous pieces that represent a person’s genome.
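The idea behind de-multiplexing can be shown with a toy example. When genomes are pooled, each person's DNA fragments carry a short index sequence (a barcode) that identifies the sample, and the software sorts the mixed reads back into per-sample groups. The sketch below is a simplified illustration, not the Sanger pipeline: the barcodes, sample names and reads are invented, and it only accepts exact barcode matches, whereas production tools typically also tolerate sequencing errors in the index.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group (barcode, sequence) reads by sample.

    Reads whose barcode is not in the sample sheet are collected
    under the key 'undetermined'.
    """
    by_sample = defaultdict(list)
    for barcode, sequence in reads:
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Hypothetical sample sheet and pooled reads, for illustration only.
SAMPLE_SHEET = {"ACGT": "sample_A", "TTAG": "sample_B"}
POOLED_READS = [
    ("ACGT", "GATTACA"),
    ("TTAG", "CCGGA"),
    ("ACGT", "TTTGC"),
    ("NNNN", "AAAAA"),  # unreadable barcode, cannot be assigned
]

result = demultiplex(POOLED_READS, SAMPLE_SHEET)
for sample, sequences in result.items():
    print(sample, sequences)
```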
Much of the process has been automated over the years, and software engineers, data managers and informaticians are essential to oversee and develop the data flows.
The data processing and analysis for all of the Illumina sequencing at the Sanger Institute are overseen by David Jackson and the Sequencing Informatics team.
“There were several challenges, both in set up and as we progressed. The main one was scale; this project was on another level to anything we’d done previously. Our initial estimates for the Central Processing Units (CPUs) required for the Vanguard were low, so we needed to have more installed.”
David Jackson
For the main phase of the project, the team also constructed a data pipeline to transfer raw data to their analysis partner, Seven Bridges Genomics, and pull key metrics back, whilst the sequence data products were submitted to UK Biobank.
“Quality control was an important issue – there was more reporting than we’d done before. But this close monitoring drove improvements in the laboratory steps of the process, as we could quickly see any issues in the data, and feed these back, working together to find what the cause was and how to resolve it.”

Powering research around the world
After the data are transferred to UK Biobank, they are made available to researchers undertaking vital research into the most common and life-threatening diseases.
Professor Nicole Soranzo, Senior Group Leader at the Wellcome Sanger Institute, said: “UK Biobank data are so vast, and so detailed. It has changed the way we do research in human genetics.”
“We are beginning not only to understand the complex genetic basis of a whole variety of devastating human diseases, but also how to better use this genetic information to understand how to predict and treat these diseases.”
Professor Nicole Soranzo

Find out more
- Project: UK Biobank Whole Genome Sequencing on the Sanger Institute website
- News Story: 500,000 whole human genomes will be a game-changer for research into human diseases
- News Story: Whole Genome Sequencing data on 200,000 UK Biobank participants are made widely available for research through unique public-private partnership