Genomics is set to become the biggest source of data on the planet, overtaking the current leading heavyweights – astronomy, YouTube and Twitter. Genome sequencing currently produces a staggering 25 petabytes of digital information per year. A petabyte is 10^15 bytes, or about 1,000 times the average storage on a personal computer. And there is no sign of a slowdown.
The amount of DNA sequencing data produced around the world is doubling approximately every seven months. Computing power is also increasing, though more slowly. It doubles in accordance with Moore’s Law – about every 18 months.
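To see how quickly those two doubling times pull apart, here is a rough back-of-the-envelope sketch. It assumes both trends continue as smooth exponentials, which is a simplification; the function name `growth_factor` is ours, and only the 7-month and 18-month figures come from the text.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Factor by which a quantity grows after `months`, given its doubling time."""
    return 2 ** (months / doubling_months)


DATA_DOUBLING_MONTHS = 7      # sequencing data (figure from the text)
COMPUTE_DOUBLING_MONTHS = 18  # Moore's Law (figure from the text)

for years in (1, 5, 10):
    months = 12 * years
    data = growth_factor(months, DATA_DOUBLING_MONTHS)
    compute = growth_factor(months, COMPUTE_DOUBLING_MONTHS)
    # The "gap" is how much faster data has grown than computing power.
    print(f"{years:>2} years: data x{data:,.0f}, "
          f"compute x{compute:,.0f}, gap x{data / compute:,.0f}")
```

Over 126 months (ten and a half years), for example, data would double 18 times and computing power only 7 times: a 262,144-fold increase against a 128-fold one.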
Computing needs to keep ahead of the coming challenges in genomics, and not just so that we can store and process the vast quantities of data. The data will also become increasingly diverse, as efforts to sequence the genomes of all life on Earth are underway. Then there is the desire to analyse data quickly. For example, in hospitals and clinics, rapid analysis of a genome can aid diagnosis and decisions about an individual’s treatment. More widely, ‘real time’ genome analysis of bacteria and viruses can help to track outbreaks of infectious diseases as they happen, across the globe.
Machine learning is one of the tools biologists are using to meet those challenges and get the most value from those petabytes of genomic data.
How to train your algorithm
Artificial Intelligence (AI) is all around us, every day. Any form of voice recognition system, for example those on our smartphones, uses AI. So do facial recognition systems, aircraft, Formula 1 cars, and our email spam filters. These examples use a particular form of AI – machine learning – that is also now being utilised in genomics.
Machine learning is about training an algorithm to recognise patterns in data. Different types of algorithms need different training data. For example, deep learning algorithms need a lot of training data. An image recognition algorithm, a common application of deep learning, needs to be fed thousands of images and be told what they are – ‘these are butterflies’. It also needs examples of images of other things - ‘these are not butterflies’.
Digital images are made of millions of pixels in unique patterns. Spotting similarities between images is no easy task. If the pictures used to train the algorithm were all of red butterflies, it might reason that anything red is a butterfly, completely ignoring shape, size or position. If the butterflies in the training pictures all have their wings open, the algorithm has little chance of recognising a picture of a butterfly with its wings folded. So the more input, and the more varied, the better. That is why you are asked to flag spam emails: with every example you flag, the algorithm improves.
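The red-butterfly trap can be shown with a deliberately tiny sketch. This is not how a real image recognition system works: each hypothetical "image" is reduced to two yes/no features, and the made-up learner simply picks whichever single feature best matches the labels it was shown. All data and names here are invented for illustration.

```python
# Each toy "image" is (is_red, antennae_visible, is_butterfly).
FEATURES = ("is_red", "antennae_visible")

BIASED_SET = [
    (True, True, True),     # red butterfly
    (True, True, True),     # red butterfly
    (True, False, True),    # red butterfly, antennae hidden
    (False, False, False),  # green leaf
    (False, False, False),  # blue car
]

# The same pictures plus more varied examples.
DIVERSE_SET = BIASED_SET + [
    (False, True, True),    # blue butterfly
    (False, True, True),    # yellow butterfly
    (True, False, False),   # red apple
    (True, False, False),   # red bus
]


def train(examples):
    """Return the index of the single feature that best predicts the label."""
    def accuracy(i):
        return sum(ex[i] == ex[2] for ex in examples) / len(examples)
    return max(range(len(FEATURES)), key=accuracy)


def predict(feature_index, image):
    """Call it a butterfly whenever the learned feature is present."""
    return image[feature_index]


blue_butterfly = (False, True)
for name, data in (("biased", BIASED_SET), ("diverse", DIVERSE_SET)):
    learned = train(data)
    print(f"{name}: relies on {FEATURES[learned]}, "
          f"blue butterfly recognised: {predict(learned, blue_butterfly)}")
```

Trained only on red butterflies, the learner latches on to colour and misses the blue butterfly. Given more varied examples, colour stops being a reliable shortcut and it switches to the more meaningful feature.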
Machine learning in genomics is the same. Algorithms are designed to spot patterns in genomic data sets, only instead of enhancing our lives, the technology has the potential to enhance our health.
Machine learning in genomics
Though the concept of machine learning has been around since the 1960s, it’s only in the last 10 years that it’s really been applied in genomics. Three factors have converged to allow its potential to be realised: the algorithms are sophisticated enough, the data sets needed to train them now exist, and so does the computing power to train them.
Dr Nicole Wheeler, a researcher at the Centre for Genomic Pathogen Surveillance at the Sanger Institute, has used supervised machine learning to train an algorithm to spot genome sequences in Salmonella bacteria that are associated with a deadly bloodstream infection, as opposed to mild food poisoning. She trained the algorithm using genome sequences of a range of Salmonella strains. Information about each strain’s ability to cause different types of infection was also fed in.
The algorithm was able to identify approximately 200 genes in the Salmonella genome that were associated with an ability to cause severe infection. Once the algorithm was trained, she tested it on strains it hadn’t seen before. It was able to correctly identify dangerous strains of the bug circulating at the time in Sub-Saharan Africa.
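The general shape of supervised learning (labelled examples in, a predictor for unseen examples out) can be sketched in a few lines. To be clear, this is not the method used in the Salmonella study: the gene names, strains and labels below are invented, and the scoring rule (how much more often a gene appears in invasive strains than in mild ones) is a simple stand-in chosen for readability.

```python
# Toy training data: the set of hypothetical genes each strain carries,
# paired with the type of infection it was observed to cause.
TRAINING = [
    ({"geneA", "geneB", "geneC"}, "invasive"),
    ({"geneA", "geneB"}, "invasive"),
    ({"geneA", "geneD"}, "invasive"),
    ({"geneC", "geneD"}, "mild"),
    ({"geneD"}, "mild"),
    ({"geneB", "geneD"}, "mild"),
]


def gene_weights(training):
    """Weight = frequency in invasive strains minus frequency in mild strains."""
    invasive = [genes for genes, label in training if label == "invasive"]
    mild = [genes for genes, label in training if label == "mild"]
    all_genes = set().union(*(genes for genes, _ in training))
    return {
        gene: sum(gene in s for s in invasive) / len(invasive)
        - sum(gene in s for s in mild) / len(mild)
        for gene in all_genes
    }


def classify(strain, weights):
    """Sum the weights of the genes present; genes never seen count as zero."""
    score = sum(weights.get(gene, 0.0) for gene in strain)
    return "invasive" if score > 0 else "mild"


weights = gene_weights(TRAINING)

# Genes ranked by how strongly they are associated with invasive disease.
for gene, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{gene}: {weight:+.2f}")

# Two strains that were not part of the training data.
print(classify({"geneA", "geneC"}, weights))
print(classify({"geneC", "geneD", "geneE"}, weights))
```

The ranked weights play the role of the list of associated genes, and `classify` plays the role of testing on strains the algorithm has not seen before. A real analysis would use far more strains, far more features and a properly validated model.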
As bacteria are constantly evolving, the algorithm gives researchers a powerful tool to characterise new strains as soon as they emerge. The algorithms can tackle massive genomic data sets – which are now routinely produced for infectious diseases – and get results in seconds. This will help researchers assess whether a bacterium is likely to cause a deadly disease outbreak. In the past, this has sometimes taken decades, by which time the bacteria had spread across the world before any attempt was made to control or eliminate them. The approach should work for a range of bacterial species, and Nicole’s colleagues aim to use a version of it to identify genome sequences that cause resistance to antibiotics too.
This link between cause and effect is at the heart of much of genome biology – researchers want to know what, in the genome, is causing a particular behaviour. It can then be investigated, understood and targeted to change its effects, if so desired.
Computer game technology powers machine learning
To train machine learning algorithms, the Sanger Institute is using newly acquired kit designed specifically for the task.
The ‘supercomputer’, a DGX-1, is the single most powerful unit in the datacentre. The DGX-1 contains eight linked graphical processing units (GPUs) and is one of the most powerful GPU based systems currently on the market. Originally put to use in the 3D graphics of computer games, GPUs are used in your laptop to display images and videos to the screen. But they are also a powerful computational device in their own right, and can massively accelerate computational workloads. They are optimised for taking huge batches of data and performing the same task over and over again very quickly - perfect for training deep learning algorithms.
Fittingly, the unit is coloured sparkly gold. Though as time moves on, it’s quite likely that this kind of power will become part of standard computers of the future – just like smartphones today are the equivalent of supercomputers of the 1980s.
Declutter your data
Data is flying out from genome sequencers so fast now that in some cases it is difficult to keep up. A single human genome equates to about 200GB of raw data. At the moment, the Sanger Institute is capable of producing data equivalent to a human genome every 17 minutes.
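Those two figures imply a sizeable daily and yearly total. The arithmetic below is a rough sketch that assumes sequencing runs around the clock at the stated rate and uses decimal units (1 TB = 1,000 GB), so it is best read as an upper bound rather than a measured output.

```python
GENOME_GB = 200          # raw data per human genome (figure from the text)
MINUTES_PER_GENOME = 17  # production rate (figure from the text)

# Assumption: continuous operation, 24 hours a day, 365 days a year.
genomes_per_day = 24 * 60 / MINUTES_PER_GENOME
tb_per_day = genomes_per_day * GENOME_GB / 1000
pb_per_year = tb_per_day * 365 / 1000

print(f"{genomes_per_day:.0f} genome-equivalents per day")
print(f"{tb_per_day:.1f} TB per day")
print(f"{pb_per_year:.1f} PB per year")
```

Under those assumptions, that works out at roughly 85 genome-equivalents and 17 TB a day, or about 6 PB a year.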
An algorithm built into sequencing machines, which could instantly analyse the data and decide which bits of information to keep and which to bin, would be of huge value.
It’s an approach already used in astrophysics and particle physics. But biologists have yet to decide which bits of genomic information are OK to throw away. There may be nuggets of information hidden in those sequences, so we’re keeping it all – for now.