Software to sort samples and sequence species at scale
8 December 2022
By Alison Cranage, Science Writer at the Sanger Institute
As part of the global Earth BioGenome Project, the Sanger Institute’s Tree of Life Programme is sequencing the genomes of tens of thousands of species for the first time – from moss to badgers to mackerel.
Working in partnership with organisations across Britain and Ireland, the Sanger Institute will read the DNA of some 70,000 species over the next 10 years. Currently, less than 1 per cent of the world’s animals, plants, fungi and protists have had their genome sequenced. The aim of the Earth BioGenome Project is to sequence everything – knowledge of the genome sequences of all species will be a resource for the future, underpinning research into biodiversity, conservation and evolution.
This is the first time that sequencing such a diversity of life, at such scale, has been attempted. There are many challenges, from creating laboratory processes that enable DNA to be extracted from species with sticky mucus, to getting at the tiny amounts of DNA in single-celled protists. As well as technical and laboratory challenges, there are computational problems to solve, as on the order of 5 quadrillion ‘letters’ of DNA need to be determined.
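To get a feel for where a number like that comes from, here is a rough back-of-envelope calculation (the average genome size and sequencing depth are illustrative assumptions, not project figures):

```python
# Rough scale estimate - the genome size and coverage are illustrative
# assumptions, not project specifications.
species = 70_000       # target number of species
genome_size = 2.4e9    # assumed average genome length, in bases
coverage = 30          # assumed sequencing depth per genome

raw_bases = species * genome_size * coverage
print(f"{raw_bases:.2e} bases to read")  # ~5.04e+15, about 5 quadrillion
```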
Keeping track of samples from around 70,000 species is not something that can be done with spreadsheets. Processing and checking quality metrics for a species’ DNA sequence can’t be done by hand. Analysing thousands and thousands of genome sequences requires bespoke, specialist software. Software developers, informaticians and bioinformaticians are crucial to sequencing, assembling and making available the genetic codes of all life on Earth.
Software comes into play even before a species sample arrives at the Sanger Institute, be that a tiny piece of frozen leaf, tissue or blood. Sample collectors input data about the species, including where it was found and who identified it, into a system called COPO. Samples arrive at the Sanger in barcoded tubes, which are scanned in and managed using the aptly named ‘Sample Tracking System’ or ‘STS’. Because STS links to COPO, the team can use the system to match what has arrived against what was expected. This link also allows legal and compliance teams to complete and record their checks before a sample is even shipped.
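That reconciliation step can be sketched in a few lines of Python. Everything here – the field names, the manifest structure, the `reconcile` function – is hypothetical, for illustration only:

```python
# Hypothetical sketch of matching scanned tubes against a COPO manifest.
from dataclasses import dataclass

@dataclass
class ManifestEntry:
    barcode: str
    species: str
    identified_by: str  # who identified the specimen
    location: str       # where it was found

def reconcile(scanned: list[str], manifest: dict[str, ManifestEntry]):
    """Split a delivery into matched samples, unexpected tubes and no-shows."""
    matched = [manifest[b] for b in scanned if b in manifest]
    unexpected = [b for b in scanned if b not in manifest]
    missing = [b for b in manifest if b not in set(scanned)]
    return matched, unexpected, missing
```

Anything in the ‘unexpected’ or ‘missing’ lists can then be flagged for the laboratory and compliance teams to follow up.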
Andrew Varley leads a team of software developers who develop and maintain bespoke programs, including STS, which stores all of these metadata with the aim of making things as easy and quick as possible for the scientists scanning in the samples. STS interfaces with a range of systems, including commercial laboratory software and bespoke genome analysis programs.
The team works using the Scrum methodology, in sprints: two-week cycles of development, with an updated version of the software released at the end of each one. Working like this keeps the team ‘agile’ and close to the needs of the scientists, partners and users of the software, who together define and prioritise what is needed for each sprint. Andrew’s team then decides how it can be delivered. The result is an IT system that is constantly improving, and has already enabled laboratory staff to record and process thousands of samples.
“For STS we were building systems from the ground up, so there was the opportunity to do it any way we wanted. We could choose the technologies to use, we weren’t tied to a 20-year-old legacy system,” says Andrew.
“We’ve been able to deploy new things very quickly. The best thing is the opportunity to innovate.”
Andrew Varley
The software development team is able to plug into the Institute’s infrastructure, including an in-house compute farm and private cloud. Edward Moulsdale recently worked to migrate the team’s infrastructure from Docker Compose to Kubernetes – meaning their software can be run on multiple clouds or multiple nodes, enabling further automation.
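As a flavour of what that migration involves, the sketch below defines a Kubernetes Deployment with the official Python client – roughly the object a single Docker Compose service maps onto. The image name, namespace and replica count are invented for illustration:

```python
# Hypothetical sketch: a Docker Compose service re-expressed as a Kubernetes
# Deployment, using the official 'kubernetes' Python client.
from kubernetes import client, config

def sts_web_deployment() -> client.V1Deployment:
    container = client.V1Container(
        name="sts-web",
        image="registry.example.org/sts-web:latest",  # invented image name
        ports=[client.V1ContainerPort(container_port=8000)],
    )
    return client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="sts-web"),
        spec=client.V1DeploymentSpec(
            replicas=3,  # scheduled across nodes by Kubernetes
            selector=client.V1LabelSelector(match_labels={"app": "sts-web"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "sts-web"}),
                spec=client.V1PodSpec(containers=[container]),
            ),
        ),
    )

config.load_kube_config()  # reads the local kubeconfig
client.AppsV1Api().create_namespaced_deployment(
    namespace="tree-of-life",  # invented namespace
    body=sts_web_deployment(),
)
```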
“I like the Campus and I like people. I like actually what we’re doing as well. I find it really interesting reading some of the help tickets we get, and sometimes some of the manifests [descriptions] of samples were really, really interesting,” says Edward.
Ashish Mittal, who is responsible for software testing in the team, agrees. “What people are doing is good for the community or humanity, and we are part of it.”
There are challenges for a team that is setting up new systems too though. “Some of the processes are not settled, so you just need to dive in, swim, and learn those processes, so we can set them. We are making life easy for the people who will come in later on,” says Ashish.
They are also able to collaborate with software development teams in other areas, or with different expertise. STS is now being developed for other projects at the Institute as well, including for cancer researchers who receive hundreds of human tissue and biopsy samples.
The team has written more than 250,000 lines of code so far, over a wide range of technologies and frameworks, now deployed on dozens of servers and continuously processing data and metadata.
“The challenge,” Andrew said, “is to deal with the intrinsic complexity and variability of biology. As much as programmers would like to code their way out of problems with logic and method, they continuously need to adapt to the joys and wonders of what comes through the ‘genome engine.’”
“What people are doing is good for the community or humanity, and we are part of it.”
Ashish Mittal
The genome after party
Once a sample is received, logged and checked, its DNA is extracted, then split to be purified and prepared for the DNA sequencing machines. The physical structure of DNA turns into digital information, which is processed through a ‘pipeline’ of analysis – a series of mostly automated steps, led by different teams within the Institute. This starts with an initial contamination check, to confirm that the DNA comes from the species it is expected to come from. Contamination is especially likely with microscopic insects, parasites or other organisms that are hard to identify.
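The production pipelines use dedicated tools and curated reference databases for this, but the logic can be sketched simply: give each sequence a taxon – say, from the top hit of a search against a reference database – and flag confident hits to the wrong organism. The input format and identity threshold below are assumptions:

```python
# Illustrative contamination screen, not the production tooling: flag
# sequences whose best database hit is not the expected taxon.
import csv

def flag_contaminants(hits_tsv: str, expected_taxon: str,
                      min_identity: float = 90.0) -> set[str]:
    """hits_tsv: tab-separated rows of (sequence_id, hit_taxon, pct_identity),
    e.g. the top database hit per sequence."""
    suspects = set()
    with open(hits_tsv) as fh:
        for seq_id, hit_taxon, pct_identity in csv.reader(fh, delimiter="\t"):
            if float(pct_identity) >= min_identity and hit_taxon != expected_taxon:
                suspects.add(seq_id)  # a strong hit to the wrong organism
    return suspects
```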
The next process is genome assembly – where the DNA sequence data from the machines, which comes out as long fragments, is stitched together into whole chromosomes. The genome is then ‘polished’ to correct errors in the assembly, and other bioinformaticians check the genome structure. Curation and annotation add context to the data – the locations of genes within the genome, for example, and their potential function – by pulling in data from other sources and databases about the functions of genes.
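That ‘pulling in’ step is essentially a join between the predicted genes and external knowledge bases. A toy illustration, with invented data shapes (the Pfam accessions are real gene-family identifiers):

```python
# Toy annotation step: attach known functions to predicted genes by joining
# on gene-family identifiers. The data shapes here are invented.
predicted_genes = [
    {"gene_id": "g1", "chrom": "1", "start": 12_000, "end": 15_500, "family": "PF00069"},
    {"gene_id": "g2", "chrom": "2", "start": 48_100, "end": 50_900, "family": "PF00067"},
]
family_functions = {              # e.g. fetched from an external database
    "PF00069": "protein kinase domain",
    "PF00067": "cytochrome P450",
}

for gene in predicted_genes:
    gene["function"] = family_functions.get(gene["family"], "unknown")
    print(gene["gene_id"], gene["function"])
```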
Metrics are produced to show genome size, measures of sequence quality, analysis of repeated stretches of DNA, and other features of the genome. The genome data, metadata and initial analysis are uploaded to public servers, where they are checked and then downloaded for a series of further analyses – culminating in the publication of a ‘Genome Note’.
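One of the standard assembly metrics is the N50 – the length such that contigs of that length or longer make up at least half of the assembly. A minimal sketch:

```python
# Minimal sketch of the N50 assembly metric.
def n50(contig_lengths: list[int]) -> int:
    """Length at which contigs of that size or longer cover half the assembly."""
    half = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0

print(n50([100, 80, 60, 40, 20]))  # total 300; 100 + 80 = 180 >= 150, so 80
```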
Priyanka Surana oversees the development of the pipeline that stitches all these bioinformatics processes together.
“We call it the genome after party,” she says.
As it is a relatively new process, the team is also creating policies on how the software should be written, what it should look like, naming standards – all to make sure everything is reproducible, standardised and accessible. They are currently working to convert some of the pipelines already in use into Nextflow – which is software that enables scalable and reproducible computational pipelines.
“With the easier genomes, many parts of our pipelines can be automated, more or less. But with more difficult genomes, that need more bandwidth – they just break the software and the pipelines. Mistletoe has a huge genome, and it’s caused us a lot of issues.”
Most of the software was originally designed to analyse the human genome – which isn’t big or complex in comparison.
“The more we do, the more robust our pipelines become, because every time we get an ‘edge case’, we can account for that in our pipeline. The next time, it is more automated.”
“Working on the Tree of Life is getting to work with everything. This type of data is very rare to get your hands on, and it’s a very unique challenge. Usually, as a scientist you’d work on one thing for a few years. But now I get to see the huge diversity of genomics. And you get to test your pipelines against that diversity.”
“Working on the Tree of Life is getting to work with everything. This type of data is very rare to get your hands on, and it’s a very unique challenge.”
Priyanka Surana
Automating Genome Notes
The final outputs of the whole process include the genome sequence, which is deposited in public databases for anyone to use. There is also a ‘Genome Note’ – an open-access publication in the journal Wellcome Open Research that describes the processes used to generate the data, presents standardised analysis of the genome sequence, and pulls in supporting data from multiple sources.
With around 70,000 Genome Notes to publish, this, once again, isn’t something that can be done by hand like a standard scientific paper.
Andrew Varley’s team is part of the effort to design and create software that automates the publishing process.
“If we can automate producing a Genome Note, then we can save a lot of time. Currently, whoever’s writing it up will collate all the information – they have to physically go and get all the data from different websites, and put them into one place. Whereas the idea is that we retrieve data from various sources and then put them all in automatically, so there is a block of writing that’s mostly pre-generated, but the figures can be dropped into it,” says Kiernan Harding, software developer in the Enabling Platforms Team.
“If we can automate producing a Genome Note, then we can save a lot of time… the idea is that we can retrieve data from various sources and then put them all in automatically.”
Kiernan Harding
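A minimal sketch of that idea – pre-generated text with the figures dropped in. The template wording, field names and values are all invented:

```python
# Invented sketch of drafting a Genome Note from collated data: a block of
# pre-generated text with the figures dropped in.
TEMPLATE = (
    "The genome of {species} was assembled into {n_chromosomes} chromosomal "
    "pseudomolecules, with a total length of {assembly_mb:,.1f} Mb and a "
    "scaffold N50 of {n50_mb:.1f} Mb."
)

def draft_genome_note(record: dict) -> str:
    """'record' collects values retrieved from the various upstream sources."""
    return TEMPLATE.format(**record)

print(draft_genome_note({
    "species": "Lasius niger",  # example species; all values are placeholders
    "n_chromosomes": 15,
    "assembly_mb": 317.6,
    "n50_mb": 20.4,
}))
```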
FAIR (Findable, Accessible, Interoperable, and Reusable)
Matthieu Muffato is the Informatics Infrastructure Team Lead on the genome sequencing work. His team coordinates informatics work for the whole process, ensuring that it follows the FAIR principles – Findability, Accessibility, Interoperability, and Reusability – values that underpin the whole project. They deposit their code and pipelines on GitHub and Zenodo – open-access code and data platforms.
“The whole world is watching us pioneering. Our duty is to make our software reusable by the hundreds of other countries and projects that will follow our steps,” he says.
But they’re far from being done. “You don’t employ the same methods to process 100, 1,000, or 10,000 species,” Matthieu explains. “As we scale up, we’ll hit new hurdles, and our systems will need to adapt. We’re living a paradigm shift for both science and informatics. The 70,000 species we’re doing now is a big number, but the Earth BioGenome Project is even bigger – there are possibly billions of species out there, and we will scale up to the challenge.”
“As we scale up, we’ll hit new hurdles, and our systems will need to adapt. We’re living a paradigm shift for both science and informatics.”
Matthieu Muffato