Image credit: Phil Mynott / Wellcome Sanger Institute


The huge, international Pan-Cancer project is the first large-scale use of distributed cloud computing in genomics. As genomics becomes a big data science, it is likely to be the first of many.
The aim of the project was to describe the full range of genomic changes that occur in cancer: not only the well-studied changes in genes and regulatory regions of DNA, but also changes outside of these regions, across the whole genome. Mapping all of the mutations, from small-scale single-letter substitutions to large chromosomal rearrangements, gives researchers the foundations for understanding what causes cancer.
With data from over 2,600 whole cancer genome sequences spanning 33 countries, the decade-long Pan-Cancer project faced several challenges. The challenge of scale – of huge datasets – and the challenge of analysing that data consistently were both met with cloud computing.
Stepping up the scale
As genome sequencing technology improves, the amount of data produced is expanding rapidly. It took 13 years to sequence the first human genome; the Wellcome Sanger Institute now sequences the equivalent of one gold-standard human genome every 3.5 minutes, and the amount of raw genomic data being produced around the world is doubling every seven months.
“In the International Cancer Genome Consortium (ICGC), researchers amassed a data set in excess of two petabytes — roughly 500,000 DVDs-worth — in just five years. Using a typical university internet connection, it would take more than 15 months to move two petabytes from its repository into a researcher's local network of connected computers. And the hardware needed to store, let alone process the data, would cost around US$1 million,” says Dr Peter Campbell, Head of the Cancer, Ageing and Somatic Mutation programme at the Sanger Institute, and a co-founder of the Pan-Cancer study.
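To put those figures in context, here is a minimal back-of-the-envelope sketch in Python. The DVD capacity and network speed used below are illustrative assumptions, not project measurements, but they reproduce the orders of magnitude Peter describes.

```python
# Back-of-the-envelope check of the data-scale figures quoted above.
# The DVD capacity and link speed are illustrative assumptions.
dataset_bytes = 2e15                  # ~2 petabytes of ICGC data
dvd_bytes = 4.7e9                     # single-layer DVD capacity
print(f"DVDs needed: {dataset_bytes / dvd_bytes:,.0f}")           # ~425,000 – roughly half a million

link_bits_per_second = 400e6          # assume a ~400 Mbit/s effective university link
seconds = dataset_bytes * 8 / link_bits_per_second
print(f"Transfer time: {seconds / (3600 * 24 * 30):.1f} months")  # ~15 months

# And output doubling every seven months implies ~380-fold growth in five years.
print(f"Five-year growth: {2 ** (60 / 7):,.0f}x")
```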
For the Pan-Cancer study, rather than move vast amounts of data, researchers moved the analysis. The teams used Docker technology, packaging their software into ‘containers’ and porting them to run where the data was stored – in 13 data centres on three continents. The data centres included a mixture of commercial clouds, infrastructure-as-a-service, academic cloud compute, and traditional academic high-performance computing clusters.
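As an illustration of that 'move the analysis, not the data' pattern, the sketch below uses the Docker SDK for Python to launch a containerised pipeline against data mounted from a data centre's local storage. The image name, command and paths are hypothetical placeholders, not the project's actual workflow names.

```python
# Sketch: run a versioned, containerised pipeline where the data already lives.
# Image name, command and paths are hypothetical placeholders.
import docker

client = docker.from_env()

container = client.containers.run(
    image="example.org/pancancer/variant-caller:1.0",        # version-pinned pipeline image
    command=["run-pipeline",
             "--input", "/data/donor_0001.bam",
             "--output", "/results/donor_0001.vcf"],
    volumes={
        "/local/storage/bams": {"bind": "/data", "mode": "ro"},      # the data never leaves the centre
        "/local/storage/results": {"bind": "/results", "mode": "rw"},
    },
    detach=True,
)
container.wait()                     # block until the pipeline finishes
print(container.logs().decode())     # inspect the pipeline's output
```

Because the same container image runs in the same way on a commercial cloud, an academic cloud or a local cluster, each data centre could execute the pipeline without the data ever crossing the internet.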
Uniform analysis
For comparisons between different patients to be meaningful, the analysis of data from each genome needed to be the same. Algorithms used to interpret each individual’s 3 billion base pairs of DNA had to be gold-standard, benchmarked and version-controlled.
The first task was to take the millions of short reads of raw sequencing data from an individual and align them to the reference human genome, establishing where in the genome each read belongs. The researchers used a single, standardised algorithm for this step.
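As a sketch of that alignment step, the commands below run the widely used BWA-MEM aligner and samtools from Python; the file names are placeholders, and the exact aligner, versions and parameters the project standardised on may differ.

```python
# Sketch of aligning raw sequencing reads to the reference human genome.
# File names are placeholders; the project's exact tools and parameters may differ.
import subprocess

reference = "GRCh37.fa"                                        # reference genome, pre-indexed with `bwa index`
reads_1, reads_2 = "donor_R1.fastq.gz", "donor_R2.fastq.gz"    # paired-end reads for one donor
aligned_bam = "donor.aligned.sorted.bam"

# Align with BWA-MEM, then coordinate-sort the output into a BAM file.
subprocess.run(
    f"bwa mem -t 8 {reference} {reads_1} {reads_2} "
    f"| samtools sort -@ 4 -o {aligned_bam} -",
    shell=True, check=True,
)
subprocess.run(["samtools", "index", aligned_bam], check=True)   # index for fast random access
```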
The teams then created three pipelines of bespoke software to pinpoint the differences between an individual’s genome and the reference sequence. These ‘variant calling’ pipelines each consisted of multiple software packages.
The differences they searched for ranged from a change of just a single letter of DNA to larger structural alterations where regions of a genome had been deleted, added or moved. Additional algorithms were used to improve accuracy and to look for specific mutation types.
The results were validated and merged to create the final set of mutations for each patient. The raw sequencing reads amounted to over 650 terabytes (TB) of data – about the size of an HD movie playing for 30 years. Running the alignment followed by the variant calling pipelines, one after the other on a single computer, would have taken 19 days per donor – around 145 years for the whole project.
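The merging step works like a consensus vote across the independent callers. The sketch below keeps variants reported by at least two of three pipelines; the two-of-three rule and the simple (chromosome, position, ref, alt) representation are simplifying assumptions of this illustration, not the project's published consensus procedure.

```python
# Illustrative consensus merge: keep variants reported by at least two of
# three variant-calling pipelines. The threshold and data layout are
# assumptions of this sketch, not the project's published method.
from collections import Counter

# Each pipeline's calls, as (chromosome, position, reference allele, alternate allele).
pipeline_a = {("chr1", 12345, "A", "T"), ("chr2", 67890, "G", "C")}
pipeline_b = {("chr1", 12345, "A", "T"), ("chr3", 11111, "C", "G")}
pipeline_c = {("chr1", 12345, "A", "T"), ("chr2", 67890, "G", "C")}

votes = Counter()
for calls in (pipeline_a, pipeline_b, pipeline_c):
    votes.update(calls)

consensus = {variant for variant, n_callers in votes.items() if n_callers >= 2}
print(sorted(consensus))   # [('chr1', 12345, 'A', 'T'), ('chr2', 67890, 'G', 'C')]
```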
“Cloud computing provides 'elasticity', meaning that a researcher can use as many computers as needed to complete an analysis quickly. Several researchers can work in parallel, sharing their data and methods with ease by performing their analyses within cloud-based virtual computers that they control from their desktops. So the analysis of a big genome data set that might have previously taken months can be executed in days or weeks,” says Peter.
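In that spirit, the sketch below fans the per-donor pipeline out across a pool of workers. In the project the workers were virtual machines in different clouds rather than processes on one computer, and the `run_pipeline_for_donor` command is a placeholder.

```python
# Sketch of cloud 'elasticity': the same per-donor job fanned out across many
# workers. Here a process pool stands in for a fleet of cloud virtual machines;
# the pipeline command and donor IDs are placeholders.
from concurrent.futures import ProcessPoolExecutor
import subprocess

def run_pipeline_for_donor(donor_id: str) -> str:
    """Run the containerised pipeline for one donor (placeholder command)."""
    subprocess.run(["run-pipeline", "--donor", donor_id], check=True)
    return donor_id

donors = [f"DO{n:05d}" for n in range(1, 2601)]    # ~2,600 donors

# Doubling max_workers (i.e. renting more machines) roughly halves the wall-clock time.
with ProcessPoolExecutor(max_workers=200) as pool:
    for finished in pool.map(run_pipeline_for_donor, donors):
        print(f"{finished} complete")
```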
Altogether, the analysis used 10 million CPU-core hours and took around 23 months, a figure that includes software development and other set-up work. The teams estimate that, using 200 virtual machines on a cloud compute system, the same task today would take around 8 months.
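A rough sanity check of those figures, assuming (purely for illustration) around eight CPU cores per virtual machine:

```python
# Rough check of the compute figures above; the core count per VM is an
# illustrative assumption, not a project specification.
core_hours = 10_000_000
vms, cores_per_vm = 200, 8

wall_clock_hours = core_hours / (vms * cores_per_vm)
print(f"{wall_clock_hours / (24 * 30):.1f} months")   # roughly 8–9 months
```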
Disease discoveries
The 700+ Pan-Cancer researchers have now analysed the data and catalogued the biggest set of cancer genome mutations to date. Across the 2,600 patients’ genomes they found over 43 million single base changes, more than 2 million insertions or deletions of sequence, and 280,000 structural variations. The team has analysed mutation rates and types between and within the different tumour types. 23 papers describing the work are published in Nature today. Their analysis, software and the data itself are available on cloud-computing platforms for researchers worldwide to access.
Computational and statistical methods designed to find changes in tumour genome sequences have led to important biological insights over the past 20 years – into the causes of cancer, as well as ways it can be treated. The Pan-Cancer data means even more is possible. The researchers have already uncovered previously unknown causes of cancer, new ways to trace the origins of cancer, and new ways to classify tumours based on their patterns of genetic change.
“The Pan-Cancer resource provides a vital foundation for cancer genome research. We are close to cataloguing all of the biological pathways involved in cancer. The cancer genome, whilst incredibly complex, is finite,” says Peter.
The promise of precision medicine
The aim of precision oncology is to match patients to targeted therapies, using genomics. A major barrier to getting precision medicine to the clinic is the huge variability of cancer, between tumour types, patients, and individual cells.
“We need knowledge banks with genome data and clinical data from tens of thousands of patients,” says Peter. “The next steps will see the genomics community working together with healthcare providers, pharmaceutical companies, data science and clinical trials groups to build comprehensive knowledge banks that will make precision medicine possible. We must now translate this knowledge into sustainable, meaningful impacts for patients.”