Categories: Sanger Science | 10 May 2023 | 8 min read

A representative reference genome

The human reference genome is the foundation of modern human genetics and genomics, underpinning most research into human health and disease. Knowledge of the sequence has powered medical discoveries, diagnoses and new treatments, and enables research into how our bodies function and how our species evolved. Its completion 20 years ago set the bar for open-access resources and was a pivotal moment in the history of science.

Despite this, it has limitations. The three billion letters of DNA code that make up the human reference genome were put together from the data of 20 individuals from the same location in North America, and seventy per cent of the sequence comes from a single individual. It is a mosaic: a single version that smushes together what in nature is diploid, since humans have two versions of every gene and chromosome, one from each parent. There are also gaps in the sequence. Until very recently, nothing was known about the sequence in areas of the genome that are tightly wound up or highly repetitive.

Researchers have been working to address these biases in the data since the Human Genome Project started, and the reference sequence has been continually improved over the years (it’s now on version 38). Today, the Human Pangenome Reference Consortium has published its first human pangenome reference, which aims to better capture global genomic diversity.

Rather than a linear sequence, the pangenome is more like a web. There are multiple sequences, from different individuals, and so there are different paths through the web. Some areas have many routes - representing the regions of the genome where sequence diverges between people. Others have one path, where sequence is conserved.
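To make the web picture concrete, here is a minimal sketch of a sequence graph in Python. The segment names, the DNA letters and the three haplotype paths are all invented for illustration; this is not the consortium's data format or tooling, only a toy showing how shared stretches give one route and divergent stretches give several.

```python
from collections import defaultdict

# Hypothetical toy data: short stretches of sequence ("segments") and the
# route each haplotype takes through them. All names and letters are invented.
segments = {"s1": "ACGT", "s2": "G", "s3": "T", "s4": "CCA"}
paths = {
    "person_A_hap1": ["s1", "s2", "s4"],
    "person_A_hap2": ["s1", "s3", "s4"],
    "person_B_hap1": ["s1", "s2", "s4"],
}


def spell(path):
    """Read out the DNA letters along one path through the graph."""
    return "".join(segments[name] for name in path)


def build_edges(all_paths):
    """Record which segment follows which, across every path."""
    edges = defaultdict(set)
    for path in all_paths.values():
        for here, nxt in zip(path, path[1:]):
            edges[here].add(nxt)
    return edges


edges = build_edges(paths)

# Segments after which the routes split: sequence diverges between people.
branch_points = sorted(name for name, nxt in edges.items() if len(nxt) > 1)

# Segments that every path passes through: conserved in this toy set.
conserved = sorted(set.intersection(*(set(p) for p in paths.values())))

print("branch points:", branch_points)
print("conserved:", conserved)
for name, path in paths.items():
    print(name, spell(path))
```

In this toy, the routes split after `s1` and rejoin at `s4`, so the two haplotypes of person A differ by a single letter in the middle.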

The work is possible because of the latest advances in sequencing technology, which now enable researchers to sequence an entire diploid human genome from scratch in a matter of weeks. Until about two years ago, it would have taken 10 years.

Figure: diagram representing a pangenome map

Flip of a coin

Kerstin Howe is Head of Production Genomics in the Tree of Life Programme at the Wellcome Sanger Institute, and is a member of the Human Pangenome Reference Consortium (HPRC). She started at the Institute towards the end of the Human Genome Project in 2000, but thanks to the flip of a coin with a colleague, began working on the zebrafish genome – an important organism for research.

“It took us 12 years to sequence and assemble the zebrafish genome. We ran a new zebrafish sample through our latest pipeline the other week. It took only a few days and it is better than the original.”

“I’m not bitter!” she laughs. “It was really, really difficult. The genome was far more complex than we originally thought.”

Her team’s work over those years paved the way for genome assembly and curation today. They developed methods and procedures to assess and further improve the genome sequence assembly of an organism. They were involved in founding the Genome Reference Consortium to share best practice and to maintain and improve the reference genomes of human and important model organisms. Her work was also crucial for the Vertebrate Genomes Project to sequence all vertebrates and the Darwin Tree of Life project to sequence all species in Britain and Ireland, and it is now part of the Earth BioGenome Project to sequence all complex life on Earth.

“We got inquiries like, ‘we want to do lots of genomes, can you tell us how good they look and what we can do better?’”

And it’s led her back to working on the genome sequence of humans, the species whose DNA has been the most studied.

Shortcomings

The vast majority of whole genome sequencing of people done since the Human Genome Project is analysed in reference to the original. DNA is chopped up into fragments to be ‘read’, and the pieces are then aligned back to the reference sequence to determine their place and the overall sequence.

If the fragments don’t map back, they are either put in their ‘second best place’, which is likely incorrect, or discarded.
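The effect can be sketched with a deliberately simple toy in Python. The reference string, the reads and the mismatch thresholds below are all invented, and the brute-force, gap-free comparison is an assumption made for brevity; real read aligners use indexed, gapped alignment and are far more sophisticated.

```python
def best_placement(read, reference):
    """Return (position, mismatches) for the gap-free placement of the read
    with the fewest mismatches; the earliest position wins ties."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    return best


def map_read(read, reference, max_mismatches=1):
    """Place a read on the reference, or return None (discard it) when even
    the best placement has too many mismatches."""
    placement = best_placement(read, reference)
    if placement is None or placement[1] > max_mismatches:
        return None
    return placement


reference = "ACGTTAGCCGATAGGCTA"  # invented toy reference

print(map_read("AGCCGA", reference))  # read that matches the reference exactly
print(map_read("AGCTGA", reference))  # same place, one letter different
print(map_read("TTTTTT", reference))  # sequence the reference lacks: discarded
print(map_read("TTTTTT", reference, max_mismatches=4))  # forced to a poor fit
```

The last two calls show both outcomes for a read whose sequence is not in the reference: with a strict threshold it is thrown away, and with a lenient one it lands at a position where most of its letters disagree with the reference.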

“There are places where the current reference genome is not enough. This is in regions of the genome where there is large scale, structural variation between individuals.”

These regions have been difficult to resolve for a reference genome because they are often repetitive and structurally complex, so they are often represented by gaps in the sequence. The last of these gaps were untangled just last year, thanks to the international T2T (telomere-to-telomere, or end-to-end) consortium, which produced a complete, haploid sequence of the human genome.

Figure: diagram representing sequencing human genomes, using a reference genome

In October 2022, the newly founded Human Pangenome Reference Consortium published their first paper - the results of a ‘bake-off’. An international competition of sorts, the aim was to determine the best combination of genome sequencing and assembly approaches to get the most complete and accurate diploid genome, with minimal manual intervention. Kerstin coordinated the assessment and identified the winner.

They showed that creating a high-quality, near complete, diploid reference genome was possible using mostly automated methods. It was the foundation for assembling many more, with the aim of capturing global genetic variation - from single letter DNA changes, to structural rearrangements.

“You may see lots of duplications or multiple occurrences of something here, with a totally different number of occurrences over there. There are consequences for that in terms of how our genes function. But you can’t capture it if you don’t see it.”

Dr Kerstin Howe,
Head of Production Genomics in the Tree of Life Programme, Wellcome Sanger Institute

Figure: illustration showing some of the types of structural complexity on chromosomes

The first draft pangenome, published today, includes 47 diploid genomes from diverse individuals. The sequences are more than 99 per cent accurate and 99 per cent complete. The work adds new sequence compared to the current reference, and subsequent analyses are more accurate as a result. The final pangenome will include at least 700 sequences, from 350 individuals.

The future of sequencing

“What we develop for human genomics will spread through to other species,” says Kerstin, who leads genome assembly production in the Tree of Life Programme here at Sanger. There are questions now about whether newly sequenced species should be assembled T2T (end-to-end), represented as pangenomes, or both. “It depends on who you ask,” she adds. “There is only so much time in the day, or so much money. Or even more crucially – so much sample.” For single-celled creatures, there is not going to be enough DNA from an individual to do all this sequencing. For other species the challenges lie in complexity: the team has recently completed the mistletoe genome, which is one of the largest of any species, around 30 times larger than the human genome.

Genomic diversity between species (approximate numbers):

Mistletoe (Viscum album): 90 billion base pairs of DNA (bp), 10 chromosome pairs, 39,000 genes

Axolotl (Ambystoma mexicanum): 32 billion base pairs, 14 chromosome pairs

Human (Homo sapiens): 3.2 billion base pairs, 23 chromosome pairs, 21,000 genes

Horseshoe bat (Rhinolophus ferrumequinum): 2.1 billion base pairs, 19,000 genes

Chicken (Gallus gallus): 1.2 billion base pairs, 39 chromosome pairs (most of which are 'microchromosomes'), 20,000 genes

The Caenorhabditis elegans worm: 100 million base pairs, 6 chromosome pairs, 20,500 genes

The bacterium Mycoplasma genitalium: 580,000 base pairs, 1 chromosome, 470 genes
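As a quick check on the scale involved, the approximate figures listed above can be compared with the human genome in a few lines of Python. The numbers are the rounded values from the list, so the ratios are rough; the short species labels are chosen here for convenience.

```python
# Approximate genome sizes in base pairs, taken from the list above.
genome_sizes_bp = {
    "Mistletoe": 90_000_000_000,
    "Axolotl": 32_000_000_000,
    "Human": 3_200_000_000,
    "Horseshoe bat": 2_100_000_000,
    "Chicken": 1_200_000_000,
    "C. elegans": 100_000_000,
    "Mycoplasma genitalium": 580_000,
}

human_bp = genome_sizes_bp["Human"]

# Size of each genome relative to the human genome.
relative_to_human = {
    species: size / human_bp for species, size in genome_sizes_bp.items()
}

for species, ratio in relative_to_human.items():
    print(f"{species}: {ratio:.4g} x human")
```

On these rounded figures mistletoe comes out at about 28 times the human genome, consistent with the "30 times larger" quoted for the mistletoe genome.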

Kerstin’s reflections on her work chime with the principles of the Human Genome Project on which the Sanger Institute was founded: openness and science for all.

“I like solving problems. I like shuffling data around. And I really like it when you solve problems not just for your own benefit and your own paper, but you solve problems that make someone else happy, because they can work with it.”

Dr Kerstin Howe,
Head of Production Genomics in the Tree of Life Programme, Wellcome Sanger Institute
