Image credit: Phil Mynott / Wellcome Sanger Institute

Categories: Sanger Science · 5 February 2020 · 7.7 min read

Genomics in the cloud

The huge, international Pan-Cancer project is the first large-scale use of distributed cloud computing in genomics. As genomics becomes a big data science, it is likely to be the first of many.

The aim of the project was to describe the full range of genomic changes that occur in cancer: not only the well-studied changes in genes and regulatory regions of DNA, but also changes outside of these regions, across the whole genome. Mapping all of the mutations, from small-scale single-letter substitutions to large chromosomal rearrangements, gives researchers the foundations for understanding what causes cancer.

With data from over 2,600 whole cancer genome sequences spanning 33 countries, the decade-long Pan-Cancer project faced several challenges. The challenges of scale – of huge datasets – and of analysing that data consistently were met with cloud computing.

What is a petabyte?

A petabyte (PB) of data is a unit of information equal to one thousand million million (10¹⁵) bytes. The 10 billion photos on Facebook are about 1.5 PB.

Stepping up the scale

As genome sequencing technology improves, the amount of data produced is rapidly expanding. It took 13 years to sequence the first human genome; the Wellcome Sanger Institute’s current sequencing rate is equivalent to one gold-standard human genome every 3.5 minutes. The amount of raw genomic data being produced around the world is doubling every seven months.

“In the International Cancer Genome Consortium (ICGC), researchers amassed a data set in excess of two petabytes — roughly 500,000 DVDs-worth — in just five years. Using a typical university internet connection, it would take more than 15 months to move two petabytes from its repository into a researcher's local network of connected computers. And the hardware needed to store, let alone process the data, would cost around US$1 million,” says Dr Peter Campbell, Head of the Cancer, Ageing and Somatic Mutation programme at the Sanger Institute, and a co-founder of the Pan-Cancer study.
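The arithmetic behind that 15-month figure is simply data volume divided by sustained network throughput. A minimal sketch, assuming an effective throughput of around 400 megabits per second for a typical university connection (the throughput figure is an assumption for illustration):

```python
# Back-of-the-envelope transfer time for a large genomic data set.
# The throughput figure is an assumption, not a measured value.

data_petabytes = 2                       # ICGC data set size quoted above
throughput_mbit_s = 400                  # assumed sustained university link speed

data_bits = data_petabytes * 1e15 * 8    # petabytes -> bits
seconds = data_bits / (throughput_mbit_s * 1e6)
months = seconds / (60 * 60 * 24 * 30)

print(f"Moving {data_petabytes} PB at {throughput_mbit_s} Mbit/s takes ~{months:.0f} months")
# -> roughly 15 months
```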

For the Pan-Cancer study, rather than move vast amounts of data, researchers moved the analysis. The teams used Docker technology, packaging their software into ‘containers’ and porting them to run where the data was stored – in 13 data centres on three continents. The data centres included a mixture of commercial clouds, infrastructure-as-a-service providers, academic cloud compute and traditional academic high-performance computing clusters.
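In practice, “moving the analysis to the data” means pulling the same container image at each data centre and running it against locally stored files. A minimal sketch of that pattern, assuming Docker is installed; the image name, file paths and the container’s command-line options here are hypothetical, not the actual Pan-Cancer pipelines:

```python
import subprocess

# Hypothetical image and paths -- the real pipelines and data layout differ.
IMAGE = "example/variant-caller:1.0.0"   # version pinned for reproducibility
LOCAL_DATA = "/data/donor_0001"          # sequencing data stays in the local data centre
LOCAL_RESULTS = "/results/donor_0001"

# The same command can be run unchanged at any participating data centre.
subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", f"{LOCAL_DATA}:/input:ro",    # mount local data read-only
        "-v", f"{LOCAL_RESULTS}:/output",   # write results locally too
        IMAGE,
        "--input", "/input/tumour.bam",     # hypothetical container arguments
        "--output", "/output/calls.vcf",
    ],
    check=True,
)
```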

What is cloud computing?

Cloud computing is the delivery of computing services – including servers, data storage, databases, networking and software – over the Internet (“the cloud”). Typically, users only pay for the cloud services they use. Benefits include flexibility and economies of scale. Clouds can be public and open to all, private and only open within an organisation, or a mixture of the two.

The Sanger Institute operates a data centre with approximately 45,000 compute cores and 65 PB of usable storage. Just under half of the cores operate as a private cloud, allowing computing resources to be allocated on demand to researchers.

Uniform analysis

For comparisons between different patients to be meaningful, the analysis of data from each genome needed to be the same. Algorithms used to interpret each individual’s 3 billion base pairs of DNA had to be gold-standard, benchmarked and version-controlled.

The first task was to take the millions of pieces of raw sequencing data from an individual and align them to the reference human genome, in order to work out where in the genome each piece belongs. The researchers used a standardised algorithm for the task.
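As an illustration of what such an alignment step looks like, the sketch below pipes reads through BWA-MEM and samtools, two widely used open-source tools for this job. The choice of tools, reference build, thread count and file names is illustrative, and not a description of the exact Pan-Cancer pipeline:

```python
import subprocess

# Illustrative alignment of paired-end reads to the reference genome.
# File names are hypothetical; bwa and samtools must be installed.
reference = "GRCh37.fa"
reads_1, reads_2 = "donor_R1.fastq.gz", "donor_R2.fastq.gz"

# bwa mem writes alignments to stdout; samtools sorts them into a BAM file.
bwa = subprocess.Popen(
    ["bwa", "mem", "-t", "8", reference, reads_1, reads_2],
    stdout=subprocess.PIPE,
)
subprocess.run(
    ["samtools", "sort", "-o", "donor.sorted.bam", "-"],
    stdin=bwa.stdout,
    check=True,
)
bwa.stdout.close()
bwa.wait()
```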

The teams then created three pipelines of bespoke software to pinpoint the differences between an individual’s genome and the reference sequence. These ‘variant calling’ pipelines consisted of multiple software packages.

The differences they searched for included changes of just a single letter of DNA, as well as larger structural alterations where regions of a genome had been deleted, added or moved. Additional algorithms were used to improve accuracy and to look for specific mutation types.

The results were validated and merged to create the final set of mutations for each patient. The raw sequencing reads amounted to over 650 terabytes (TB) of data – about the size of an HD movie playing for 30 years. Running the alignment followed by the variant calling pipelines, one after the other, would have taken 19 days per donor on a single computer, or 145 years for the whole project.
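The consortium’s actual merging rules are more involved, but a common way to build a consensus call set is to keep only the variants reported by a minimum number of independent callers. A minimal sketch of that idea, using made-up variant records:

```python
from collections import Counter

def consensus_calls(call_sets, min_support=2):
    """Keep variants reported by at least `min_support` independent callers.

    Each call set is a collection of (chromosome, position, ref, alt) tuples.
    A simplified illustration, not the consortium's actual merging logic.
    """
    counts = Counter(variant for calls in call_sets for variant in set(calls))
    return {variant for variant, n in counts.items() if n >= min_support}

# Toy example: three callers, two of which agree on one substitution.
caller_a = [("chr1", 12345, "A", "T"), ("chr2", 500, "G", "C")]
caller_b = [("chr1", 12345, "A", "T")]
caller_c = [("chr3", 999, "T", "G")]

print(consensus_calls([caller_a, caller_b, caller_c]))
# -> {('chr1', 12345, 'A', 'T')}
```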

“Cloud computing provides 'elasticity', meaning that a researcher can use as many computers as needed to complete an analysis quickly. Several researchers can work in parallel, sharing their data and methods with ease by performing their analyses within cloud-based virtual computers that they control from their desktops. So the analysis of a big genome data set that might have previously taken months can be executed in days or weeks,” says Peter.

Altogether, the analysis used 10 million CPU-core hours and took around 23 months, a period that included software development and other set-up work. The researchers estimate that, using 200 virtual machines on a cloud compute system, the same task today would take 8 months.
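The relationship between core-hours and wall-clock time is simple division: the same 10 million core-hours finish sooner when spread over more cores. A quick sketch, assuming (purely for illustration) 8 cores per virtual machine:

```python
# Wall-clock time = total core-hours / number of cores running in parallel.
# The cores-per-VM figure is an assumption for illustration only.

total_core_hours = 10_000_000
virtual_machines = 200
cores_per_vm = 8                      # assumed

wall_clock_hours = total_core_hours / (virtual_machines * cores_per_vm)
wall_clock_months = wall_clock_hours / (24 * 30)

print(f"~{wall_clock_months:.0f} months with {virtual_machines} VMs")
# -> roughly nine months, in line with the estimate above
```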

Data Centre photo by Phil Mynott / Wellcome Sanger Institute


Reproducibility

An advantage of using software containers is that experiments are reproducible. Anyone can execute the software under the same conditions that its original developers used. There is no need to worry about different versions of add-on tools or reference data libraries.

Containers can be made freely available for anyone to download and run, saving huge amounts of time. It might take a researcher a year or more to develop bespoke software for genome analysis. Once it is containerised, it takes another laboratory just a few days to install and run it.

Disease discoveries

The 700+ Pan-Cancer researchers have now analysed the data and catalogued the biggest set of cancer genome mutations to date: over 43 million single base changes across the 2,600 patients’ genomes, more than 2 million sequence insertions or deletions, and 280,000 structural variations. The team has analysed the mutation rates and types between and within the different types of tumours. Twenty-three papers describing the work are published in Nature today. Their analysis, software and the data itself are available on cloud-computing platforms for researchers worldwide to access.

Computational and statistical methods designed to find changes in tumour genome sequences have led to important biological insights over the past 20 years – into the causes of cancer, as well as ways it can be treated. The Pan-Cancer data means even more is possible. The researchers have already uncovered previously unknown causes of cancer, new ways to trace the origins of cancer, and new ways to classify tumours based on their patterns of genetic change.

“The Pan-Cancer resource provides a vital foundation for cancer genome research. We are close to cataloguing all of the biological pathways involved in cancer. The cancer genome, whilst incredibly complex, is finite,” says Peter.

The promise of precision medicine

The aim of precision oncology is to match patients to targeted therapies, using genomics. A major barrier to getting precision medicine to the clinic is the huge variability of cancer, between tumour types, patients, and individual cells.

“We need knowledge banks with genome data and clinical data from tens of thousands of patients,” says Peter. “The next steps will see the genomics community working together with healthcare providers, pharmaceutical companies, data science and clinical trials groups to build comprehensive knowledge banks that will make precision medicine possible. We must now translate this knowledge into sustainable, meaningful impacts for patients.”

Data security

It’s likely that more and more genomics projects will use cloud computing. Data could come from any country and could be stored anywhere across the globe. Good data governance, alongside patient privacy, is vital.

The Global Alliance for Genomics and Health (GA4GH) is helping to ensure genomic data are responsibly shared.

The Pan-Cancer project moved personal health information across multiple legal jurisdictions. The data were accessed and used by hundreds of international researchers. Donor consents were written to explicitly allow for broad research use and for international data sharing. Data were encrypted, securely stored, and only accessed for appropriate purposes by approved and verified researchers.

Find out more