Creating a gold-standard, not a rotten, tomato genome

Credit: Luc Viatour /

Recently, the full reference genome of the tomato (Solanum lycopersicum) was published in Nature (31 May 2012). Here at the Wellcome Trust Sanger Institute, some of our sequencing people took part in the international collaboration of 10 countries that developed the DNA sequence. Each research group was tasked with working on a different chromosome, and we sequenced Chromosome 4. Being part of the project allowed us to share our experience and knowledge from producing animal reference genomes, helping the plant genome research teams to work together to deliver high-quality, standardised data.

When the tomato genome sequencing project began, the teams estimated that the genome was 950 million base pairs (Mb) in size, split across 12 chromosomes. This was no small undertaking: it is roughly one-third the size of the human genome, whose sequencing had taken a worldwide collaboration 10 years to deliver. In addition, the project had limited funding, meaning that the work needed to be as tightly focused and efficient as possible.

Fortunately, only 25 per cent of the tomato genome consists of gene-rich areas, so the project teams agreed that capturing and sequencing only these areas would provide the most valuable information in the most effective way. To achieve this, we used mapping techniques to identify the gene-rich areas and clone-by-clone sequencing to sequence them fully in the fewest sequencing runs.
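To give a feel for the scale, the figures quoted above can be turned into a rough back-of-the-envelope calculation. This is only an illustrative sketch: the genome size and the gene-rich fraction come from the text, while the human genome size of about 3,100 Mb is an assumed approximate value added for comparison, and the function name is hypothetical.

```python
GENOME_SIZE_MB = 950        # estimated tomato genome size, from the text
GENE_RICH_FRACTION = 0.25   # share of the genome in gene-rich areas, from the text
HUMAN_GENOME_MB = 3100      # assumption: approximate human genome size, not from the text


def target_size_mb(genome_mb: float, fraction: float) -> float:
    """Return the amount of sequence (in Mb) that falls in the targeted fraction."""
    return genome_mb * fraction


target = target_size_mb(GENOME_SIZE_MB, GENE_RICH_FRACTION)
ratio = GENOME_SIZE_MB / HUMAN_GENOME_MB

print(f"Gene-rich target: {target:.1f} Mb")
print(f"Tomato genome relative to human genome: {ratio:.2f}")
```

On these numbers, restricting the work to the gene-rich areas cuts the amount of sequence to be finished to about a quarter of the estimated genome.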

Clone-by-clone sequencing

We took clones from existing libraries and digested them with restriction enzymes, producing a fingerprint signature for each. We processed these fingerprint signatures in a database known as FPC (Fingerprint Contigs). Sections of signature shared between two clones indicate an overlap, and these overlaps can often be verified if known markers can be placed in them. Knowing where each clone belonged on the chromosome, we were able to select a minimal set of clones to cover the area of interest. We made the FPC database for all the chromosomes publicly available to the research community.
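The two ideas in this paragraph, comparing fingerprints and picking a minimal set of clones, can be illustrated with a small sketch. It is not the scoring or selection method that FPC itself uses. It assumes each fingerprint is simply a list of fragment sizes and that each clone already has a start and end position on the map. The names `Clone`, `shared_fraction` and `minimal_tiling_path`, the tolerance value and the example data are all hypothetical.

```python
from typing import List, NamedTuple


class Clone(NamedTuple):
    name: str
    start: int  # assumed map position where the clone begins (arbitrary units)
    end: int    # assumed map position where the clone ends


def shared_fraction(bands_a: List[int], bands_b: List[int], tolerance: int = 2) -> float:
    """Fraction of the smaller fingerprint's bands that have a match in the other.

    Two bands "match" if their sizes differ by at most `tolerance`.
    This is a deliberately simple stand-in for a real overlap score.
    """
    small, large = sorted((bands_a, bands_b), key=len)
    if not small:
        return 0.0
    matched = sum(any(abs(x - y) <= tolerance for y in large) for x in small)
    return matched / len(small)


def minimal_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[Clone]:
    """Greedy interval cover: repeatedly take the clone that reaches furthest
    among those starting at or before the point covered so far."""
    ordered = sorted(clones, key=lambda c: c.start)
    path: List[Clone] = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        while i < len(ordered) and ordered[i].start <= covered_to:
            if best is None or ordered[i].end > best.end:
                best = ordered[i]
            i += 1
        if best is None or best.end <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best)
        covered_to = best.end
    return path


# Made-up example data, for illustration only.
overlap = shared_fraction([10, 20, 30, 40], [20, 31, 40, 55, 70])
clones = [
    Clone("A", 0, 120),
    Clone("B", 40, 150),
    Clone("C", 100, 260),
    Clone("D", 140, 300),
    Clone("E", 250, 400),
]
path = minimal_tiling_path(clones, 0, 400)
print(overlap, [c.name for c in path])
```

In this toy example, clones B and D are redundant because their neighbours already cover the same stretch, so only three of the five clones would need to be sequenced.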

Fig 1. Screenshot showing the Fingerprint Contigs database. Clones highlighted in red and grey show the minimal tiling path selected for the sequencing project.

Using this approach, we mapped, sequenced and finished the gene-rich clones of Chromosome 4, which was estimated to be roughly 19 Mb long. The UK team was led by Principal Investigators Gerard Bishop from Imperial College London, Graham Seymour from Nottingham University, Glenn Bryan from Scottish Crop Research Institute, and Jane Rogers from the Sanger Institute.

Finishing the genome

However, mapping and sequencing are not the whole story when producing a high-quality reference genome: the sequences need to be pieced together and inconsistencies resolved. In other words, the sequences need to be finished. This can be a time-consuming process, especially if a project involves differing standards and approaches. Fortunately, we have long experience in finishing DNA sequencing data from our work on the human, mouse and zebrafish genome projects. So, to enable the other international teams to draw on our experience and to develop the common standards needed for efficient finishing, we organised two International Finishing Workshops.

In these, representatives of the different research groups from across the world met and discussed the various challenges of working with the sequencing data. It was a chance to pool experience and look at efficient ways to progress the data set for each of the chromosomes. Our discussions centred on techniques for improving the data for the clones, as well as ensuring that the metrics all the teams used to assess the quality of each clone were comparable.

Through meeting together and talking through the issues, the teams ensured that the resulting genomic sequence from all the laboratories involved met the same standard. This data was then annotated and made publicly available for the wider Solanaceae research community.

Another useful contribution we were able to make was guiding the project teams through the challenges of incorporating new-technology sequencing data, which the project went on to adopt.

Funding bodies: BBSRC, EU-SOL, DEFRA and the Wellcome Trust