Image credit: Dan Ross

Processing COVID-19 samples at the Sanger Institute
Categories: COVID-19, Sanger Science
3 November 2020
7.5 min read

Sequencing COVID – what now?

The leftovers from 9 million coronavirus swab tests are currently sitting in temporary -20°C freezers in car parks at the Wellcome Genome Campus. Twenty more boxes arrive every evening, couriered from the Lighthouse Laboratories undertaking COVID tests in communities across the UK.

With one of the largest genome sequencing facilities in the world, much of our capacity has been turned over to sequencing the virus, as part of the COVID-19 Genomics UK consortium (COG-UK). Together, the consortium has now sequenced more than 90,000 virus genomes from the UK – both from the Lighthouse Laboratories and NHS settings.

Virologists around the globe are using the data to understand the spread and the biology of the SARS-CoV-2 virus. The focus of researchers in COG-UK is to use the virus genome data from local outbreaks to support public health officials. This crucial work is helping to inform infection control procedures in hospitals and other locations, as well as identify previously hidden routes of transmission.

Sanger researchers are now planning to use the genomic sequence data from the hundreds of thousands of tests from the Lighthouse Labs for monitoring outbreaks too. It is becoming possible to generate enough genomic data, fast enough, to spot emerging outbreaks anywhere in the country.

We spoke to Dr Jeffrey Barrett, Lead COVID-19 Statistical Geneticist at the Sanger Institute, about the plans to achieve it, and why he’s in it for the long-term.

Molecular clocks

Tracking the virus within a hospital, town, country or across the world is possible because genomes mutate. Letters in the genome sequence change as organisms replicate. Virus genomes usually mutate at a steady rate – HIV extremely rapidly, influenza more slowly, and coronavirus more slowly still. Researchers can use the mutation rate as a molecular clock. The number of genetic differences between two viruses is roughly proportional to the time since they last shared a common ancestor. The individual virus sequences can be placed on a phylogenetic tree, much like a family tree, which shows how closely related two or more SARS-CoV-2 viruses are. The slow mutation rate of SARS-CoV-2, together with the fact that the virus appeared relatively recently, in December 2019, means that there is limited genomic diversity in the circulating viruses so far; there is only one strain. Yet it has been possible to trace the virus’s history, from the centre of the original outbreak in China to all the corners of the world. Researchers are constantly refining and updating the picture as more evidence becomes available.
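For readers who like to see the arithmetic, the molecular clock idea can be sketched in a few lines of Python. This is a simplified illustration, not the method used by COG-UK or Sanger: the function names, the toy sequences, the genome length of 30,000 letters and the mutation rate of 0.001 changes per letter per year are all assumptions chosen to keep the numbers round. Real analyses build phylogenetic trees with dedicated statistical software.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def years_since_common_ancestor(
    differences: int,
    genome_length: int,
    rate_per_site_per_year: float,
) -> float:
    """Estimate the time since two viruses shared an ancestor.

    Both lineages pick up mutations independently after they split,
    so the expected number of differences is
    2 * rate * genome_length * time. Rearranging gives the time.
    """
    return differences / (2 * rate_per_site_per_year * genome_length)


# Toy example with hypothetical 8-letter sequences (real genomes are far longer).
toy_differences = count_differences("ACGTACGT", "ACGTTCGA")

# Hypothetical inputs: 6 differences, 30,000 letters, 0.001 changes per letter per year.
estimate = years_since_common_ancestor(6, 30_000, 1e-3)
print(toy_differences, round(estimate, 3))
```

With these assumed values, six differences between two genomes would point to a common ancestor roughly a tenth of a year ago. Because the virus mutates slowly, even a handful of differences carries information about timing.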


So far, a huge amount of work from laboratory, technical, software, logistic, and scientific teams has gone into sequencing SARS-CoV-2 at Sanger. Now, the focus is shifting to the long-term. The teams aim to use genomics to benefit the national public health response to coronavirus into 2021 and beyond.

“We’re currently in a bad second peak, but at some point we hope to get back into low levels of community transmission. Then the top priority will be to try to find local fast-spreading outbreaks as quickly as possible and intervene to stop a third wave,” explains Jeff.

He says that there is going to be a certain amount of virus around in communities at any given time. This ‘background’ will have specific genetic features.

“If the virus is consistently genetically monitored, and we then suddenly saw, for example, virus from a particular area all had the same genetic sequence, that could indicate there was an outbreak there,” he says.
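The kind of signal Jeff describes can be illustrated with a toy Python sketch. It is only a sketch under stated assumptions, not the surveillance method being developed at Sanger: the function name, the input format (pairs of area and sequence label) and both thresholds are hypothetical. It flags an area when a single sequence accounts for most of the recent samples from that area.

```python
from collections import Counter, defaultdict


def flag_candidate_clusters(samples, min_samples=5, min_share=0.8):
    """Flag areas where one sequence dominates the recent samples.

    samples: iterable of (area, sequence_label) pairs.
    min_samples and min_share are illustrative thresholds only.
    Returns {area: (dominant_sequence, dominant_count, total_samples)}.
    """
    counts_by_area = defaultdict(Counter)
    for area, sequence in samples:
        counts_by_area[area][sequence] += 1

    flagged = {}
    for area, counts in counts_by_area.items():
        total = sum(counts.values())
        sequence, top_count = counts.most_common(1)[0]
        # Require enough samples, and a large share carrying the same sequence.
        if total >= min_samples and top_count / total >= min_share:
            flagged[area] = (sequence, top_count, total)
    return flagged


# Hypothetical data: area "A" is dominated by one sequence, area "B" is mixed.
example = [("A", "seq1")] * 5 + [("A", "seq2")]
example += [("B", label) for label in ("s1", "s2", "s3", "s4", "s5")]
print(flag_candidate_clusters(example))
```

A flag from a rule like this would only be a prompt for further investigation. Identical sequences can also turn up by chance, so any real system needs a careful sampling strategy and proper statistical methods behind it.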

Forward and backward tracing

As well as helping to spot local outbreaks, genomics can also assist in ‘backward’ tracing. This process differs from standard ‘forward’ contact tracing, which aims to work out who someone with the virus has infected, so they can quarantine to prevent any onward transmission.

Backward tracing tries to work out where an infected person got the virus in the first place – usually several days before symptoms appeared. There is now evidence that the coronavirus spreads in clusters, and that most transmission is caused by just 20 per cent of people who get it, in so-called ‘super-spreading’ events. If such a location can be identified, then others who were at the same place, at the same time, can be traced too. This backward tracing approach is being used by Japan and South Korea, and the UK is planning to use it too. Genomics can help those efforts by identifying if two cases are linked or not, by determining how genetically similar the virus is.


To be useful, genomic analysis must be done quickly. Public health officials are always working with data that is hours, perhaps a day or two, behind real-life events. Results from genome sequencing need to be available as close to real time as possible. An outbreak caused by a super-spreading event or in a particular environment needs to be swiftly identified, and then contact tracing, additional testing and other interventions immediately deployed to contain it.

It is a huge logistical challenge to sequence and analyse thousands of genomes that quickly, every day. In each box in the Wellcome Genome Campus car park freezers, there will be anywhere between 80 and 800 positive samples and up to 7,600 negative ones. Automating the picking of positive samples out of the boxes was an initial hurdle to overcome – not least to prevent frostbite – and something that Sanger software development teams worked on back in March. They have now handled the 9 million samples, and are sequencing 3,500 virus samples a week.

“We are just about on the edge of this process being fast enough now,” says Jeff. A new £2 million robotic system is currently being installed at Sanger to speed things up even further. Robots will pick and process the positive samples from the boxes, reducing the time it takes to get them onto the sequencing machines.


Using genomic surveillance on such a massive scale has only recently become possible. The sequencing capacity simply didn’t exist before.

Standard epidemiology is like detective work; cases are hunted, movements tracked, clues amassed. Genomics can be an important piece of that puzzle – helping to determine if cases are linked or not.

“One nice thing about this new approach is that we need very little data about the sample: just the date it was taken, and the area where it came from (anonymised, not the full address details). This means we don’t need access to highly sensitive data,” adds Jeff.

But there are always limitations when interpreting any data – genomics is no exception. Sequences from two people could be the same through chance rather than because they are part of an outbreak. And because the virus has spread so widely, identical virus genomes can be seen in different locations, even if there aren’t any direct links between them.

“Whilst we would love to be able to sequence every single virus sample in the UK, it isn’t feasible. We are pushing the technology, the logistics, everything we can push, to the absolute limits. So the samples we do sequence are important – they need to be representative of the virus population in an area, as well as the country as a whole. We are developing a detailed sampling strategy, and statistical methods to overcome as many of these issues as possible. We need to make sure the results are meaningful,” says Jeff.


Jeff re-joined the Sanger Institute in July this year, to lead the coronavirus analysis work. He had previously worked in the Human Genetics department for 10 years. He was then the founding director of Open Targets, before moving to become Chief Scientific Officer at Genomics PLC in 2017. I asked what made him come back.

“I wanted to put large-scale genomics to use in the pandemic,” he says. “I wanted to help. It’s all anyone wants to do, isn’t it.”

Teams working to process and sequence coronavirus samples at the Sanger Institute

Links and further information


To view global data for SARS-CoV-2 to date, visit