The Sanger Institute was founded in 1993 to contribute to the Human Genome Project, the 13-year-long mission to map our species’ DNA. The international consortium announced the ‘complete’ sequence in 2003 and the Sanger was responsible for sequencing a third of the genome – the largest single contribution. It was a monumental landmark for science, providing the foundations for research into biology, evolution and medicine.
The sequence formed the basis of the reference genome – an open-access resource used by the scientific community worldwide, which underpins nearly all genomics applications in research and clinical settings.
The Genome Reference Consortium (GRC), including scientists at the Sanger Institute, has been maintaining and updating the reference human genome sequence since 2007. They have been chipping away at the sequence, adding to it and correcting errors. The current version, number 38, still has about eight per cent of the sequence missing, beyond the reach of previous sequencing approaches.
The missing millions of letters of DNA are mostly in repeat-dense regions of the genome. At the time of the Human Genome Project, there was no way to determine the order of these letters, mostly because only short fragments of the genome could be sequenced. So for regions full of repeats, it was impossible to fit the puzzle together; all the pieces looked the same.
In 2018, the T2T consortium was formed to get to those uncharted regions. Led by researchers in the USA, they utilised advances in sequencing technology from two companies, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Their long-read technologies can determine the order of many thousands of DNA letters in a row – much more than was previously possible. The pieces of the puzzle got bigger and so easier to fit together.
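The puzzle analogy can be made concrete with a toy calculation. The sketch below is purely illustrative and is not the consortium's method: the "genome" is an invented string with a made-up repeat, the read lengths are arbitrary, and the helper `unique_read_fraction` is a hypothetical name. It counts how many error-free reads of a given length occur at exactly one position, which is what lets a piece be placed unambiguously.

```python
from collections import Counter


def unique_read_fraction(genome: str, read_length: int) -> float:
    """Fraction of all error-free reads of this length that occur exactly once."""
    reads = [
        genome[i:i + read_length]
        for i in range(len(genome) - read_length + 1)
    ]
    counts = Counter(reads)
    return sum(1 for read in reads if counts[read] == 1) / len(reads)


# Invented toy genome: two unique flanks around a tandem repeat.
left_flank = "ACGTTGCAAGGCTA"
tandem_repeat = "GATTACA" * 20  # 140 letters of the same 7-letter unit
right_flank = "TCCGATAGGCATTC"
genome = left_flank + tandem_repeat + right_flank

# Short reads fall entirely inside the repeat and look identical to each other.
short_fraction = unique_read_fraction(genome, read_length=5)

# Reads longer than the repeat always reach into a unique flank.
long_fraction = unique_read_fraction(genome, read_length=150)

print(f"unique 5-letter reads:   {short_fraction:.2f}")
print(f"unique 150-letter reads: {long_fraction:.2f}")
```

In this toy case most short reads match many positions, while every read long enough to span the repeat is unique. Real assemblies also have to cope with sequencing errors, two copies of each chromosome and repeats far longer than any single read, none of which is modelled here.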
More than 100 scientists, working over several years, have developed new algorithms that combine sequence data from all of the available technologies to get to the finished sequence.
Dr Kerstin Howe, a Head of Production Genomics at the Sanger Institute, and member of the T2T consortium, describes the implications: “For the first time ever we have insight into what previously escaped our sequencing and assembling technologies despite our best efforts: the ‘genomic dark matter’ of highly repetitive regions like centromeres, and expanded gene families.”