Riding on the success of the human genome project, the teams started out on the zebrafish with high confidence. However, it soon became obvious that one genome is not like another, and problems appeared that had never been encountered before. One of the many issues the teams faced was the amount of DNA: there was much less raw material to work with than in the human genome project. One solution, Alan says, was to combine material from 40 different fish. The result was enough DNA to prepare for the sequencing machines, but because it came from 40 individuals, the data that came out the other end were hard to interpret. It was difficult to find the consensus. Other issues were overcome with different laboratory preparation techniques and chemistries.
The zebrafish genome was also riddled with short repetitive sequences, most commonly repeats of ‘TA’. The sequencing wasn’t able to get through these; it would run into them and ‘just kind of tail off’. Other regions were impossible to sequence due to their tightly wound structure. These issues could not be resolved with the technology available at the time.
“I remember saying to one of the BoM [Board of Management] members back then - the only way that you're going to sort the worst regions of the human genome is to have really long reads of very high quality. It was obvious what the answer was, but that technology wasn't around for more than 10 years after we were talking about it,” says Alan.
Technology was changing, though. ‘Next-generation sequencing’, from Illumina, was introduced to the Sanger Institute in around 2008. This technology produced very short, but almost error-free, sequence reads about 35 letters long. By contrast, the ‘capillary reads’ derived from Sanger sequencing, which had been used up until that point, were about 1,000 letters long.
“At first they were too short to be useful,” Alan recalls. It was difficult to assemble and finish a genome sequence, and the resulting genomes had many gaps and mistakes. But the length of reads increased, to 100 base pairs and then 150. “Suddenly, we were able to use that technology, in combination with capillary sequencing,” Alan says.
Shorter reads also meant the software had to be updated to deal with them, and the much, much larger numbers of them. Alan worked with James Bonfield, Robert Davies and Andrew Whitwham, who wrote the next version of the Genome Assembly Program, GAP 5. “I helped with bug fixing, and just tested it to destruction really,” says Alan. “When I needed to do something and it didn't exist as a feature, then I'd be having a conversation with them and they would often find a way to help me do what I needed to do.”
The Sanger Institute was increasingly approaching genomics at scale, sequencing more human genomes as well as other organisms, including pathogens, for the first time. Alan moved into the parasite genomics group, where he set to work assembling helminth (parasitic worm) genomes.
Long-read technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, became commercially available about 15 years ago, and were quickly introduced to the Sanger Institute. Though these platforms could sequence much longer stretches of DNA at a time, the results were initially relatively error-prone. But as the technology progressively improved, Alan and his colleagues were able to put these data to use, too.
“When I joined the parasite genomics group, we had kind of a joke. Like… ‘Have you got the genome of Haemonchus contortus into chromosomes yet?’ And I’m looking at it and it's about 9,000 pieces. And they're all from different haplotypes. I was coming into work, trying to do something that was impossible, for several years,” says Alan. But after a few years, using creative approaches and combining the technologies, they found a way to complete the puzzle.
“It’s rewarding when you've spent so long trying to do something to finally achieve it, and then to be able to see the research that can then be done off the back of that.”
Alan also learnt to code, and recently moved into the Sanger Institute’s Tree of Life programme, which aims to sequence the genomes of 70,000 species for the first time. “It was going back to what I knew – curating genomes,” says Alan. “It was a natural fit.”
Alan (back row, third from right) with the parasite genomics group