
A molecular archaeologist’s toolkit

14 January 2015
By Moritz Gerstung

Normal cells (blue) are gradually transformed into cancerous cells (red) through the acquisition of mutations.


When studying cancer, researchers are in a similar situation to archaeologists looking at an ancient site. We are both presented with the shards of a past catastrophe, hoping to find out what has happened and why.

Cancer arises when cells accumulate errors in their genomes, thereby changing the way they multiply and rendering them overly proliferative. This process typically occurs over a period of 10-20 years until a cancer is detected. Modern sequencing technologies have enabled us to study the genomes of cancer and healthy cells in unprecedented detail.

There are, however, two challenges:

1. We only observe the endpoint of the transforming process, but we would also like to infer the history of the tumour.

2. Our data is fragmented: instead of obtaining the complete sequence of the genome in each cancer cell, we get hundreds of millions of short pieces of DNA.

The data sets that sequencing gives us are huge (a single genome is about 200 gigabytes) and imperfect, so one may compare the situation to a vast archaeological site where we need to find a needle in a haystack. Luckily, we can use powerful computers and algorithms to do the virtual digging through the data.

Reconstructing the history of a tumour can occur at two levels:

1. Each cancer sample consists of cells at different stages of transformation. Hence the composition of an individual sample will give us some clues about what happened in the past of that particular patient (see the image above). This process is similar to digging: finding the informative genomic shards among the myriad DNA fragments and attempting to reconstruct the full picture of that individual tumour.

2. Comparing multiple tumours reveals overarching patterns underlying the development of tumours. This process is more about understanding the history of the ancient population by combining the (imperfect) information from different archaeological sites. This usually requires some interpretation and further assumptions about the possible ways in which a tumour cell population can evolve.
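To make the first level a little more concrete, here is a minimal sketch of how the composition of a single sample can hint at the order of events. It is an illustration under strong simplifying assumptions, not one of the methods covered in the review: the sample is assumed to contain only tumour cells, every mutation is assumed to sit on one of two copies of the genome, and the mutation names and read counts are invented. Under those assumptions, a mutation found in a larger fraction of cells is likely to have arisen earlier.

```python
from dataclasses import dataclass


@dataclass
class Mutation:
    name: str         # hypothetical label, for illustration only
    alt_reads: int    # short reads that carry the mutation
    total_reads: int  # all short reads covering that position


def cell_fraction(m: Mutation) -> float:
    """Rough fraction of cells carrying the mutation.

    Assumes a pure tumour sample and a mutation on one of two
    genome copies, so the cell fraction is about twice the
    fraction of reads showing the mutation (capped at 1).
    """
    read_fraction = m.alt_reads / m.total_reads
    return min(1.0, 2 * read_fraction)


def order_by_age(mutations: list[Mutation]) -> list[Mutation]:
    """Sort mutations from likely earliest to likely latest."""
    return sorted(mutations, key=cell_fraction, reverse=True)


# Invented read counts for three mutations in one sample.
sample = [
    Mutation("late", alt_reads=10, total_reads=100),
    Mutation("early", alt_reads=50, total_reads=100),
    Mutation("middle", alt_reads=25, total_reads=100),
]

for m in order_by_age(sample):
    print(f"{m.name}: in about {cell_fraction(m):.0%} of cells")
```

Real samples also contain healthy cells and regions of the genome with more or fewer than two copies, which is one reason the actual tools need a statistical model rather than this simple rule.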

These approaches are similar to the study of Darwinian evolution, because a population of tumour cells behaves in some ways like a species that continuously mutates and adapts to its environment.

Because we are analysing and comparing large quantities of data, the language we use to describe the dynamics of cancer development is mathematical and our toolbox contains a series of different algorithms, each tailored to a specific purpose.

In recent years there has been rapid progress on the data production side, for example as part of the International Cancer Genome Consortium, in which 2,500 cancer genomes have been sequenced. The sudden availability of large data sets also fuelled the development of many novel statistical tools and algorithms.

To keep track of these recent developments, we have reviewed some of the approaches in a recent publication entitled Cancer evolution: mathematical models and computational inference. This acts as an overview of the area of research and as a reference to find the right tool when needed.

Moritz Gerstung is a Postdoctoral Fellow in the Cancer Genome Project at the Wellcome Trust Sanger Institute. With Peter Campbell he works on bioinformatics algorithms for analysing and understanding sequencing data from cancer patients.


  • Beerenwinkel N, et al (2015). Cancer evolution: mathematical models and computational inference. Systematic Biology. DOI: 10.1093/sysbio/syu081
