Categories: Sanger Science | 31 March 2022 | 5.1 min read

Setting the bar higher – the complete human genome sequence

By Alison Cranage, Science Writer at the Wellcome Sanger Institute

This week, the Telomere to Telomere (T2T) consortium has published a ‘complete’ human genome sequence, filling in gaps that had stubbornly persisted for over 20 years since the original Human Genome Project sequence was published.

The newly determined sequence covers previously inaccessible regions of chromosomes, including all of the repeat-heavy centres and short ‘arms’. For the first time, scientists will be able to delve into the functions and variations of all 3.055 billion letters of DNA that code for a human.

The team behind the research describes a ‘new era for genomics, where no region of the genome is beyond reach.’ The sequence contains previously undiscovered genes, and previously inaccessible repetitive regions can now be detangled. The work opens the door to complete, end-to-end genome sequencing being possible for all species on Earth.

The Human Genome Project

The Sanger Institute was founded in 1993 to contribute to the Human Genome Project, the 13-year-long mission to map our species’ DNA. The international consortium announced the ‘complete’ sequence in 2003, and the Sanger was responsible for sequencing a third of the genome – the largest single contribution. It was a monumental landmark for science, providing the foundations for research into biology, evolution and medicine.

The sequence formed the basis of the reference genome - an open-access resource used by the scientific community worldwide as the foundation of nearly all genomics applications in research and clinical settings.

The Genome Reference Consortium (GRC), including scientists at the Sanger Institute, has been maintaining and updating the reference human genome sequence since 2007. They have been chipping away at the sequence, adding to it and correcting errors. The current version, number 38, still has about 8 per cent of the sequence missing, beyond the reach of previous sequencing approaches.

The missing millions of letters of DNA are mostly in repeat-dense regions of the genome. At the time of the Human Genome Project, there was no way to determine the order of these letters - mostly because only short fragments of the genome could be sequenced at the time. So for regions full of repeats, it was impossible to fit the puzzle together; all the pieces looked the same.

In 2018, the T2T consortium was formed to get to those uncharted regions. Led by researchers in the USA, they utilised advances in sequencing technology from two companies, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Their long-read technologies can determine the order of many thousands of DNA letters in a row – much more than was previously possible. The pieces of the puzzle got bigger and so easier to fit together.
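The puzzle analogy can be illustrated with a toy sketch. It assumes an idealised, error-free sequencer that reads every position, and two made-up mini-genomes that differ only in the order of the unique segments sitting between copies of a repeat. The sequences, the function name `all_reads` and the read lengths are all hypothetical; this is not the consortium's method, only a demonstration of why read length matters.

```python
from collections import Counter


def all_reads(genome: str, read_length: int) -> Counter:
    """Count every error-free read of a fixed length (an idealised sequencer)."""
    return Counter(
        genome[i:i + read_length]
        for i in range(len(genome) - read_length + 1)
    )


# Made-up 8-letter repeat, present three times in each toy genome.
REPEAT = "ATATATAT"

# Two made-up genomes: only the order of the two middle unique segments differs.
genome_1 = "GGC" + REPEAT + "CCA" + REPEAT + "TTG" + REPEAT + "AAC"
genome_2 = "GGC" + REPEAT + "TTG" + REPEAT + "CCA" + REPEAT + "AAC"

# Reads shorter than the repeat never span it, so both genomes yield the
# same pile of pieces and the true order cannot be recovered from them.
short_reads_identical = all_reads(genome_1, 6) == all_reads(genome_2, 6)

# Reads longer than the repeat link the unique sequence on either side of it,
# so the two genomes now produce different pieces.
long_reads_identical = all_reads(genome_1, 12) == all_reads(genome_2, 12)

print(short_reads_identical, long_reads_identical)  # True False
```

With 6-letter reads the two genomes are indistinguishable; with 12-letter reads, which span the whole 8-letter repeat, they are not. Real repeats run to thousands of letters or more, which is why reads of many thousands of letters were needed.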

Over 100 scientists working over several years have developed new algorithms, combining sequence data from all of the available technologies to get to the finished sequence.

Dr Kerstin Howe, Head of Production Genomics at the Sanger Institute and a member of the T2T consortium, describes the implications: “For the first time ever we have insight into what previously escaped our sequencing and assembling technologies despite our best efforts: the ‘genomic dark matter’ of highly repetitive regions like centromeres, and expanded gene families.”

Diverse genomes

Another issue with the current reference sequence is that it is a single, linear genome: a composite sequence originally derived from a handful of different people. This creates biases and errors and, importantly, it does not represent human genetic diversity. And while subsequent studies have sequenced more and more individuals, these are mostly people of European ancestry, creating inequity in genomic research.

Kerstin is also involved in the Human Pangenome Reference Consortium. Their aim is to create a complete reference of human genomic diversity. This pangenome will be created from telomere-to-telomere (end-to-end) sequences of 350 individuals of diverse ancestry. Kerstin describes it as a ‘web’: “The pangenome will move away from a linear genome to one which branches out where there is variation, and the branches come back together when there is none. We expect the new pangenome reference to better capture human genetic diversity and improve gene-disease association studies among as yet underrepresented populations.”
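The ‘web’ Kerstin describes can be pictured as a small graph. The sketch below is a toy illustration only: the segments, the node numbers and the names `person_a` and `person_b` are invented, and it is not the data model or software used by the Human Pangenome Reference Consortium. Each node holds a stretch of DNA, and each individual's sequence is a path through the nodes.

```python
# Hypothetical toy graph: node id -> stretch of DNA held by that node.
segments = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}

# Each path lists the nodes one (made-up) individual's sequence passes through.
paths = {
    "person_a": [1, 2, 4],
    "person_b": [1, 3, 4],
}


def spell(path: list[int]) -> str:
    """Join the segments along a path to recover the linear sequence."""
    return "".join(segments[node] for node in path)


def shared_nodes(all_paths: dict[str, list[int]]) -> set[int]:
    """Nodes every individual passes through: places with no variation."""
    return set.intersection(*(set(p) for p in all_paths.values()))


def variable_nodes(all_paths: dict[str, list[int]]) -> set[int]:
    """Nodes used by only some individuals: the branches of the graph."""
    every_node = set().union(*(set(p) for p in all_paths.values()))
    return every_node - shared_nodes(all_paths)


for name, path in paths.items():
    print(name, spell(path))
print("shared:", sorted(shared_nodes(paths)))
print("variable:", sorted(variable_nodes(paths)))
```

Here the two paths share nodes 1 and 4 and split at nodes 2 and 3, where the individuals differ by a single letter. A linear reference would have to pick one of the two branches and treat the other as a deviation from it.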

The Human Pangenome Reference Consortium is also developing methods, software, tools, and data systems to visualise, use and disseminate the sequence.

New standards

It will be a while before T2T sequencing is the norm. The next hurdle to overcome is the diploid genome present in normal human cells, which carry two copies of each chromosome. The T2T consortium worked on a haploid human cell line, with only one copy of each chromosome. But for Kerstin, T2T represents what will be achievable.

“We have started a massive project to sequence the genomes of all the species on Earth, and the bar is now higher. We know what’s possible. What we are talking about when we say ‘reference genome’ just got pushed up a notch.”

Kerstin reflects on what’s next. “There are some chromosomes that are only around during development in certain cell types, and then they disappear in adult cells. They are going to be of interest. And then of course you have somatic mutation, how our genomes change as we age, and that’s going to be looked at in the context of T2T. Single-cell T2T sequencing might be the next thing.”

The Human Genome Project transformed biology. As the data is used to inform the next generation of research into biology, evolution and personalised medicine, the more accurate, and representative, the reference sequence is, the better.
