Our UK Biobank Journey: 3 years and over 240,000 human genomes
By Alison Cranage, Science Writer at the Wellcome Sanger Institute
In 2019, the Sanger Institute started work on the most ambitious human genome sequencing project in the world. Three years later, the Institute has delivered nearly 250,000 whole human genome sequences and over 20 petabytes (PB) of data for the UK Biobank project, to aid research into health and disease.
Sanger Institute and the UK Biobank project in numbers
The Sanger Institute was founded in 1992 to sequence the very first human genome. The international project, lasting 13 years and costing $2.7 billion, was a landmark achievement in science. The knowledge of the DNA sequence that codes for a human has underpinned advances in medicine and driven biological research over the last 20 years.
The Institute now has one of the largest genome sequencing facilities in the world, and can sequence DNA at a rate equivalent to a human genome every 3.2 minutes.
UK Biobank is a large-scale biomedical database and research resource, containing in-depth genetic and health information from half a million UK participants. Over 28,000 researchers from 98 countries use UK Biobank data, which is so comprehensive that it enables researchers to study the complete picture of disease risk, taking into account lifestyle, genetics and health factors.
In 2018, the Institute was commissioned to sequence the genomes of 50,000 people from the UK Biobank cohort.
Following that successful pilot, the Institute was then contracted, together with deCODE in Iceland, to deliver sequences for the rest of the Biobank cohort – the largest human genome sequencing project ever undertaken.
Here, some of the key people at the Sanger Institute reflect on what it took to deliver sequencing at such scale.
Ian Johnston is the Head of Sequencing Operations at the Sanger Institute, and responsible for overseeing the UK Biobank sequencing project. For the initial 50,000 sequences – known as the Vanguard project – this meant purchasing and installing 10 of Illumina’s NovaSeq machines, increasing liquid handling capabilities, scaling up laboratory processes for sample preparation, and hiring new staff. He also worked closely with colleagues in informatics teams who processed and analysed the vast amounts of data produced by the sequencing.
At the same time, the team was starting to plan how they would sequence the genomes of 200,000 more Biobank participants. The scale of the project was unprecedented.
“Our experience of the Vanguard helped us to define the quality standards we’d be able to deliver, at scale, for the next phase of the project. We knew that we could reach gold-standard metrics – for example in terms of coverage – for the hundreds of thousands of genomes.”
Cordelia Langford, Director of Scientific Operations, emphasised the huge team effort involved. “As well as laboratory and informatics teams, our stores team, health and safety, and our finance and procurement teams were a crucial part of the work. We needed to store huge volumes of liquid reagents every week, just for the sequencing machines, for example.”
The sequencing Research and Development (R&D) team worked to refine the laboratory processes to ensure that yields were good enough, that coverage of the genome was consistent, and that all the quality metric targets were met.
“One of the key challenges for R&D over the course of UK Biobank was to try and maximize the number of samples we can multiplex per run on the NovaSeqs. At the start of the project, we were loading 21 human samples on a flow cell. By the end, we were able to reliably produce sufficient data for 32 samples in a single flow cell. Throughout the project, we’ve been constantly improving the process and looking at every possible avenue to optimize.”
Diana Rajan, Senior Staff Scientist in DNA Pipelines R&D
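To put the multiplexing figures in perspective, the short sketch below works out what moving from 21 to 32 samples per flow cell means for a batch of 200,000 genomes. It is a back-of-the-envelope illustration only: the function name is hypothetical, and it assumes every sample needs exactly one slot on one flow cell, with no repeats or failed runs, which is not necessarily how the Institute planned its runs.

```python
import math


def flow_cells_needed(samples: int, samples_per_flow_cell: int) -> int:
    """Flow cells required if each sample takes one slot (assumes no reruns)."""
    return math.ceil(samples / samples_per_flow_cell)


START_PLEX = 21     # samples per flow cell at the start of the project
END_PLEX = 32       # samples per flow cell by the end of the project
GENOMES = 200_000   # main-phase genome count, used here for illustration

cells_at_start = flow_cells_needed(GENOMES, START_PLEX)
cells_at_end = flow_cells_needed(GENOMES, END_PLEX)
gain = END_PLEX / START_PLEX - 1  # fractional increase in samples per flow cell

print(f"Flow cells at 21-plex: {cells_at_start}")
print(f"Flow cells at 32-plex: {cells_at_end}")
print(f"Flow cells saved:      {cells_at_start - cells_at_end}")
print(f"Samples per flow cell: +{gain:.0%}")
```

Under these simplified assumptions, the higher plex level increases the number of samples per flow cell by roughly half, and cuts the number of flow cells needed for the same batch by about a third.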
The team also transitioned the library construction pipeline (processing DNA for sequencing) from a 96-well plate format to a 384-well format. This enabled them to reach a higher capacity with the same fleet of liquid handling systems and platforms.
“It was the sheer speed of our set-up that I was amazed by,” says Ian.
David Jackson, Sequencing Informatics Team Leader, oversaw the data pipelines for UK Biobank. His team develops and runs software to support the high-throughput data production by Sanger’s sequencing machines. This includes providing the tracking, quality control, and analysis systems to process the output from the NovaSeqs.
At the time of Vanguard, the Sanger Institute had just installed a new ‘flexible compute environment’, or FCE – essentially a private, in-house cloud computing system. David liaised closely with ICT teams as its use was adapted for the initial 50,000 and then the further 200,000 sequences.
Most of Sanger’s analysis for the main phase of the project was undertaken by Seven Bridges Genomics. This included variant calling, where the genome sequence is compared to the reference human genome to identify differences. This arrangement enabled the team to deliver a full service in the time available without impacting the compute services available to Sanger Institute researchers.
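As a rough illustration of the idea behind variant calling, the toy sketch below compares a sample sequence to a reference and reports the positions where they differ. It is not the pipeline used for UK Biobank: real variant callers work from millions of aligned sequencing reads and use statistical models to handle insertions, deletions and sequencing errors. The function name and the two short sequences are made up for the example.

```python
from typing import List, Tuple


def call_substitutions(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Toy assumption: both sequences are already aligned and of equal length,
    so only single-base substitutions can be detected.
    """
    if len(reference) != len(sample):
        raise ValueError("toy example expects sequences of equal length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Hypothetical sequences, invented for illustration only.
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"

variants = call_substitutions(reference, sample)
for pos, ref_base, alt_base in variants:
    print(f"position {pos}: {ref_base} -> {alt_base}")
```

Here the sample differs from the reference at two positions, which is the kind of difference researchers can then link to health information.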
“This was human genomes in numbers we’d never dealt with before, the data coming from machines we’d never used before, in a new compute environment.”
When the COVID-19 pandemic hit, the human genome sequencing work, which was ahead of schedule at the time, was put on hold. Cordelia proposed that the Institute should contribute to the pandemic response by sequencing SARS-CoV-2 genomes, and she was instrumental in making it happen. Sanger co-funded and co-founded the COVID-19 Genomics UK consortium (COG-UK) and, using much of the knowledge and capacity built for Biobank – including the NovaSeqs – began sequencing viral genomes.
“The process of establishing high throughput sequencing for UK Biobank connects intrinsically to our success with sequencing COVID. The scale, speed and agility we developed in setting things up quickly for UK Biobank, together with the sample logistics expertise, and the capability to pummel the sequencing machines, meant we could pivot our processes and embark on sequencing SARS-CoV-2.”
By early 2022, the Institute was sequencing 64,000 viral genomes a week for the UK Health Security Agency (UKHSA), and it has been responsible for about one quarter of the world’s COVID sequencing output.
As the pandemic continued, separate capacity was set up for COVID-19, and Biobank sequencing resumed. The team has delivered the 240,000 whole human genome sequences on time and on budget.
Powering global research
Whole genome data are now in the UK Biobank resource, with each sequence linked to anonymised medical information. Researchers can look for links between the genetic code and health using data that didn’t exist before – including in non-coding regions of the genome.
200,000 sequences are already available to approved researchers, and this data sharing has enabled a wide range of studies and new discoveries, including in cancer, diabetes, and heart disease.
Together with the sequences from project partners, the resource is the largest human genome sequence database in the world. Genome data from all of the participants in UK Biobank will soon be available for approved researchers, enabling the next set of advances in human genomic research.
Funding for the project comes from the government’s research and innovation agency, UK Research and Innovation (UKRI), with £50m through the Industrial Strategy Challenge Fund; from Wellcome, with £50m; and from Amgen, AstraZeneca, GlaxoSmithKline (GSK) and Johnson & Johnson, with a further £100m in total.