DNA sequencing centre, Wellcome Sanger Institute, 1990s

Categories: Sanger Science | 9 August 2022 | 8.1 min read

Finishing genome puzzles

By Alison Cranage, Science Writer at the Wellcome Sanger Institute

In the early days of the Sanger Institute, several teams of ‘finishers’ worked on the Human Genome Project. As the name suggests, their role was at the very end of the process; they were responsible for checking and correcting errors in the sequence of letters that represents the genetic code of a human. The teams made sure that the millions of pieces of data churned out by the DNA sequencing machines were correctly assembled in the right order, to represent the final, whole, 3.3 billion pairs of letters in the human genome.

As sequencing technology changed, the role has changed too. But many finishers who started at the Institute still work here 25 years on, using their skills on a variety of other projects and other genomes. We spoke to staff about their experiences, and how genome assembly has evolved over the last two decades.

Alan Tracey started at the Sanger in 1998, just a few years after the Institute opened. He quickly picked up the job as a finisher, even without any previous experience in either science or computing. Soon, it was taking him a week or two to complete a 150,000-letter stretch of the genome. “I have no idea why I was able to churn through the work so quickly to be honest. I wasn’t working long hours. I just enjoyed the work I suppose, and it was a natural fit. I got the bit between my teeth and I wanted to make an impact, to do something good,” Alan says. In 2001 he was recognised as the individual who had contributed the most finished sequence to the entire Human Genome Project, though he stresses that the work was one part of a multi-stage process and a huge team effort.

In essence, the finishers checked that fragments of code, representing DNA sequence, were in the right order, the right orientation and the right place, much like a jigsaw puzzle. “A lot of the time, you’d have generic parts, where it could be assembled in different ways. If you use the puzzle analogy, it’s like if you had a lot of sky. You need to put the context together around it to get everything in the right place. At the same time you’re trying to resolve single letter errors, which were common. So it was very detailed work,” says Alan.
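To make the jigsaw analogy concrete, here is a minimal sketch in Python of the kind of check involved: deciding whether one fragment joins onto another, and in which orientation. It is a toy illustration only, not how GAP or any finishing software actually works; the function names and the `min_overlap` parameter are hypothetical, and it assumes error-free fragments with exact overlaps.

```python
# Toy illustration only: real assemblers tolerate sequencing errors and
# handle millions of reads. All names here are hypothetical.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence as read from the opposite DNA strand."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def join_fragments(left, right, min_overlap=3):
    """Join `right` onto the end of `left`, trying both orientations of `right`.

    Returns the merged sequence, or None if neither orientation overlaps.
    """
    best = None
    for candidate in (right, reverse_complement(right)):
        size = overlap_length(left, candidate, min_overlap)
        if size and (best is None or size > best[0]):
            best = (size, left + candidate[size:])
    return best[1] if best else None


# The second fragment was read from the opposite strand, so it only fits
# once it has been flipped into the right orientation.
print(join_fragments("GATTACAGG", "GAACCTG"))
```

The ‘sky’ problem Alan describes arises when many different fragments overlap equally well, so a simple rule like this cannot choose between them without extra context.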

This task was made possible by software such as the Genome Assembly Program (GAP), created by James Bonfield, Kathryn Beal and colleagues at Rodger Staden's lab at the MRC-LMB, which allowed the finishers to visualise and manipulate the data, to help solve and check genome assemblies.

The process was labour-intensive: 40 or so finishers were employed to work on the Human Genome Project. But the result was a high-quality genome sequence, and the first draft of the human genome was published in 2001, revolutionising biology. As the Human Genome Project was coming to a close, some finishers left the Institute, but many went on to work on other organisms.

Viewing genome sequence data, late 1990s

The zebrafish

Riding on the success of the human genome, confidence was high starting out on the zebrafish. However, it soon became obvious that one genome is not like another. Problems appeared that had never been encountered before. One of the many issues the teams faced was the amount of DNA: there was much less raw material to work with than in the Human Genome Project. One solution, Alan says, was to combine material from 40 different fish. The result was enough DNA to prepare for the sequencing machines, but because it came from 40 individuals, the data that came out the other end were hard to interpret. It was difficult to find the consensus. Other issues were overcome with different laboratory preparation techniques and chemistries.
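A rough sketch of why pooled DNA makes the consensus hard to find: if reads from different individuals are lined up, a simple majority vote works where the fish agree but gives no clear answer where they differ. The Python below is an illustrative assumption, not the method the zebrafish teams used; the `consensus` function, the `min_fraction` threshold and the example reads are all invented for the purpose.

```python
from collections import Counter


def consensus(aligned_reads, min_fraction=0.6):
    """Majority-vote consensus over equal-length, already-aligned reads.

    A position is reported as 'N' (unknown) when no single letter reaches
    `min_fraction` of the reads covering it.
    """
    result = []
    for column in zip(*aligned_reads):
        base, count = Counter(column).most_common(1)[0]
        result.append(base if count / len(column) >= min_fraction else "N")
    return "".join(result)


# Four hypothetical reads from different individuals. They split evenly
# at the third position, so no confident call can be made there.
reads = ["ACGT", "ACGT", "ACTT", "AGTT"]
print(consensus(reads))
```

With a single individual, disagreements between reads mostly point to sequencing errors. With 40 individuals, they may equally reflect real genetic differences, which is what made the pooled data so difficult to interpret.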

The zebrafish genome was also riddled with short repetitive sequences – most commonly repeats of ‘TA’. The sequencing wasn’t able to get through these; it would run into them and ‘just kind of tail off’. Other regions were impossible to sequence due to their tightly wound structure. These issues could not be resolved with the technology available at the time.
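As a small illustration of what such a repeat looks like in the data, the Python sketch below scans a sequence for runs of ‘TA’. It is a hypothetical helper written for this article, not a tool used on the zebrafish project; the function name, the `min_units` threshold and the example sequence are assumptions.

```python
import re


def find_ta_repeats(sequence, min_units=5):
    """Find runs of at least `min_units` consecutive 'TA' units.

    Returns a list of (start, end, units) tuples, where start and end are
    0-based positions in the sequence (end is exclusive).
    """
    pattern = re.compile(r"(?:TA){%d,}" % min_units)
    return [
        (match.start(), match.end(), (match.end() - match.start()) // 2)
        for match in pattern.finditer(sequence)
    ]


# A made-up sequence with one long TA run and one short one.
example = "GGC" + "TA" * 6 + "CCG" + "TATA" + "A"
print(find_ta_repeats(example))
```

Finding a repeat in a finished sequence is easy; the difficulty the finishers faced was that the sequencing reads themselves faded out inside these runs, so the true length of the repeat could not be read at all.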

“I remember saying to one of the BoM [Board of Management] members back then - the only way that you're going to sort the worst regions of the human genome is to have really long reads of very high quality. It was obvious what the answer was, but that technology wasn't around for more than 10 years after we were talking about it,” says Alan.

Technology was changing, though. ‘Next-generation sequencing’, from Illumina, was introduced to the Sanger Institute in around 2008. This technology produced very short, but almost error-free, sequence reads about 35 letters long. The ‘capillary reads’ derived from Sanger sequencing that had been used up until that point were about 1,000 letters long.

“At first they were too short to be useful,” Alan recalls. It was difficult to assemble and finish a genome sequence, and resulting genomes had many gaps and mistakes. But as the length of reads increased, to 100 base pairs, and 150, “Suddenly, we were able to use that technology, in combination with capillary sequencing,” Alan says.

Shorter reads also meant the software had to be updated to deal with them, and the much, much larger numbers of them. Alan worked with James Bonfield, Robert Davies and Andrew Whitwham, who wrote the next version of the Genome Assembly Program, GAP 5. “I helped with bug fixing, and just tested it to destruction really,” says Alan. “When I needed to do something and it didn't exist as a feature, then I'd be having a conversation with them and they would often find a way to help me do what I needed to do.”

Long-read

The Sanger was increasingly approaching genomics at scale to sequence more human genomes, plus other organisms including pathogens, for the first time. Alan moved into the parasite genomics group where he set to work assembling helminth (parasitic worm) genomes.

Long-read technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, became commercially available about 15 years ago, and were quickly introduced to the Sanger Institute. Though these platforms could sequence much longer stretches of DNA at a time, the results were initially relatively error-prone. But, as the technology progressively improved, Alan and his colleagues were able to put these data to use, too.

“When I joined the parasite genomics group, we had kind of a joke. Like… ‘Have you got the genome of Haemonchus contortus into chromosomes yet?’ And I’m looking at it and it's about 9000 pieces. And they're all from different haplotypes. I was coming into work, trying to do something that was impossible, for several years,” says Alan. But after a few years, using creative approaches and combining the technologies, they found a way to complete the puzzle.

“It’s rewarding when you've spent so long trying to do something to finally achieve it, and then to be able to see the research that can then be done off the back of that.”

Alan also learnt to code, and recently moved into the Sanger Institute’s Tree of Life programme, which aims to sequence the genomes of 70,000 species for the first time. “It was going back to what I knew – curating genomes,” says Alan. “It was a natural fit.”

Alan (back row, third from right) with the parasite genomics group

Curating the tree of life

The Tree of Life curation team, headed by Jo Wood, complete the final step in the process of producing a new genome sequence, much like the original finishers on the Human Genome Project. They take an assembled genome, check for any gaps and resolve any issues. But now, the data come from a whole range of DNA sequencing technologies. As well as highly accurate long and short reads, the team also have data from Hi-C sequencing, which provides long-range information about which sequences are near each other physically, for example on the same chromosome.

Much of what they do is automated, enabled by software developed both in-house and around the world, which is freely available for others to adapt and use. In the 18 months that Alan has been in the Tree of Life team, he’s curated over 300 genomes.

“The way we curate genomes is super-efficient, to the point where we can do a few in a week. Curating a genome now, it's quite a lot of fun. And very easy. What we can do now was impossible a few years ago.”

Hi-C maps of the zebrafish genome, before and after curation. Curation of the zebrafish genome using the latest technologies and techniques took two days.

Some elements of the role remain the same as 20 years ago, Alan says. Back then, finishers would look at a restriction digest ‘map’ of the capillary sequences, to see if there were any mismatches or unclear areas, or at an electropherogram to work out if an individual letter was, say, a G or a C. Now, curators might look at Hi-C maps, where it is not always immediately clear which sequences should be placed where. They are looking for subtle relationships between things.

“I think eventually you'll be able to press a button, and the whole thing will just come out and it will be right. That's my assumption,” says Alan.

But it is likely that manual intervention, and the skilful art of interpreting genome sequence data, will be needed for a few years to come, especially for the more ‘tricky’ species.

Find out more

Sanger Institute Tree of Life Programme

Sanger Institute Parasites and Microbes Programme