Hide and seq: tracking down viruses

16 September 2014
By Velislava Petrova

We identified genomic regions (shown in red) that mutate at a greater rate than the rest of the norovirus genome and could be important targets of the host immune response. Credit: DOI:10.1128/JVI.01333-14


How can a genetic code almost a million times shorter than the human genome contain enough information to evade our immune system and cause disease?

Understanding the basis of this phenomenon is a major focus of research in the Virus Genomics team at the Wellcome Trust Sanger Institute.

As an undergraduate student who joined the team for a placement year in 2011, I thought I knew a lot about viruses but certainly nothing about the science behind this apparent size paradox. Under the supervision of Matthew Cotten, I began what then seemed to be a rather straightforward scientific journey: sequencing virus genomes. I certainly considered this one of the easiest projects one could start in a place with such enormous sequencing capacity as the Sanger Institute. Yet, three years later I know that I could not have been more wrong.

Virus genomes are present at very low quantities in clinical samples relative to the millions of human and bacterial nucleotide sequences (strings of the single-letter bases that make up a genome). Reading a message from such a blurry picture requires a powerful magnifier.

In the context of molecular biology, this is achieved by the use of polymerase chain reaction (PCR), a commonly used technique that recognises and copies a particular nucleotide sequence multiple times in order to make it readable for downstream analysis. Although very simple, PCR is only successful if an appropriate bait (a primer) is used to capture the viral genome among the multitude of contaminating non-virus sequences.
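To give a sense of how strong this magnifier is: in an idealised PCR, every cycle doubles the number of copies of the target, so the count grows exponentially with the number of cycles. The short sketch below illustrates that arithmetic. The function name and the `efficiency` parameter are illustrative assumptions, and real reactions fall short of perfect doubling.

```python
def copies_after(cycles: int, start_copies: int = 1, efficiency: float = 1.0) -> float:
    """Estimate the copy number after a given number of PCR cycles.

    Assumes an idealised reaction: each cycle multiplies the number of
    copies by (1 + efficiency), where efficiency = 1.0 means perfect doubling.
    """
    return start_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # Compare perfect doubling with a hypothetical 90% per-cycle efficiency.
    for n in (10, 20, 30):
        print(n, copies_after(n), copies_after(n, efficiency=0.9))
```

Under these assumptions a single starting molecule yields over a billion copies after 30 cycles, which is why even a very rare viral genome can be made readable.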

This is where virus genomes pose such a great challenge: not only are they small and difficult to read, but they are also extremely changeable and thus difficult to catch. An RNA virus mutates faster than a DNA virus, replicating its genome with a substitution rate orders of magnitude higher than that estimated for the human genome.

During the course of an infection and transmission this creates a population of virus variants, each of which has the potential to be better adapted to the host immune response. Capturing each one of them is important because even a single nucleotide change, especially if it changes the encoded protein, could play a crucial role in disease progression or efficacy of treatment.

To accommodate these features of virus genomes, we developed a computational tool that uses all currently available sequence information for a given virus to design not just one but multiple baits. Together, these baits capture the full sequence diversity of the virus of interest, helping us to understand its role in disease.
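The idea of using several baits to cover a diverse set of genomes can be sketched as a small covering problem. The example below is not the tool described above; it is a minimal illustration that assumes a simple greedy strategy: repeatedly pick the short sequence (k-mer) found in the largest number of genomes not yet covered. The function names, the choice of k and the toy sequences are all hypothetical. Real primer design also has to consider forward and reverse primer pairs, melting temperature and amplicon size, which are ignored here.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_primer_set(genomes: Dict[str, str], k: int = 20) -> List[str]:
    """Pick k-mers greedily until every genome contains at least one of them."""
    # Map each candidate k-mer to the names of the genomes it occurs in.
    coverage: Dict[str, Set[str]] = {}
    for name, seq in genomes.items():
        for kmer in kmers(seq.upper(), k):
            coverage.setdefault(kmer, set()).add(name)

    # Sort once so that ties are always broken alphabetically.
    candidates = sorted(coverage)
    # Genomes shorter than k cannot contain any k-mer, so they are skipped.
    uncovered = {name for name, seq in genomes.items() if len(seq) >= k}
    chosen: List[str] = []
    while uncovered:
        # Take the k-mer that matches the most still-uncovered genomes.
        best = max(candidates, key=lambda km: len(coverage[km] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    toy_genomes = {
        "variant_a": "ACGTACGTAA",
        "variant_b": "TTACGTACGG",
        "variant_c": "GGGGGCCCCC",
    }
    print(greedy_primer_set(toy_genomes, k=5))
```

In this toy case two of the variants share a stretch of sequence and can be caught with one bait, while the third is too different and needs its own. That is the situation the multiple-bait approach is meant to handle.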

We first tested our strategy on a range of subtypes of norovirus, a highly contagious pathogen that has an RNA genome of only around 7,500 nucleotides yet is able to cause recurrent outbreaks of winter vomiting disease around the globe.

After successful amplification and sequencing of more than 100 patient samples from Ho Chi Minh City, Vietnam, we described how the virus evolves within the local community, how its genome changes to reflect the selective pressure applied by our immune system and which regions of the genome are important to consider for vaccine development. These questions would have been difficult to answer without a successful primer design strategy able to capture the complete virus diversity within a patient.

Just as viruses evolve rapidly, our methodology can be easily adapted to any viral pathogen of interest to allow successful sequencing of its complete genome, as our team later showed for MERS.

In this host-pathogen battle, is power a matter of size or number? Viruses have developed ways to use the optimal combination of both in their favour. Because their genomes exist in various shapes and forms (DNA, RNA, single-stranded, double-stranded, circular, linear, segmented) viruses exhibit a rather intimidating genetic diversity and pose a real research challenge.

However, as long as they use the same genetic alphabet as us, and with such a potent magnification tool as PCR and an efficient bait design strategy available, we can be confident that we will be able to catch and read their genomes, hopefully just in time for the next virus outbreak.

As for me, three years later, I am at the start of another straightforward PCR journey, this time fishing for immune cells.

Velislava Petrova is currently a first year PhD student jointly supervised by Professor Paul Kellam and Dr Carl Anderson. Her work in the Virus Genomics Team is focused on the characterisation of B cell immune repertoires in response to measles infection in humans and in cynomolgus macaques.


  • Cotten, M et al (2014). Deep sequencing of norovirus genomes defines evolutionary patterns in an urban tropical setting. Journal of Virology. DOI:10.1128/JVI.01333-14
  • Cotten, M et al (2013). Transmission and evolution of the Middle East respiratory syndrome coronavirus in Saudi Arabia: a descriptive genomic study. Lancet. DOI:10.1016/S0140-6736(13)61887-5
