30 July 2012
Written by Simon White
Studying zebrafish is a vitally important way to discover what role genes play in human health and disease. By linking human genes to their closest comparators in the zebrafish genome (orthologs) we are able to investigate what happens when a gene is lost or isn’t working the way it should.
However, to be able to do this, we need as complete a list of zebrafish genes as possible. I work in a team that uses RNA-Seq technology to automatically annotate genes and to determine the different ways genes work in different tissues (tissue-specific splice variation). RNA-Seq looks for the molecules (transcripts) that genes produce to make the proteins that control how a cell works. This information is invaluable in helping us to construct a complete catalogue of transcripts from the zebrafish genome.
The potential biological insights offered by RNA-Seq are considerable, but they come at a price: it takes substantial effort to make sense of the large volume of information it produces. Because this information is so useful, we set ourselves the goals of constructing a zebrafish gene-set based on RNA-Seq alone and then identifying the highest quality models. These models would then be added to the core Ensembl gene-set (the reference zebrafish genome used by researchers around the world) in such a way as to increase the quantity and quality of the gene-set without adding artifacts or pseudogenes.
The results of our efforts were published recently in Genome Research, in a paper entitled "Incorporating RNA-seq data into the Zebrafish Ensembl Gene Build".
The task of assembling RNA-Seq short reads into gene models is not trivial; in particular, we needed to overcome the issues of transcript contiguity and fragmentation. To achieve this, we employed the following approach.
We used Illumina paired-end sequencing to deep sequence a range of developmental stages and adult tissues, providing near-complete coverage of the zebrafish transcriptome. We also performed an RNA-Seq three prime pull-down experiment that allowed us to identify the precise three prime ends of the gene models.
In order to use these data sets to build gene models we created an analysis pipeline consisting of five steps:
- alignment to the genome
- processing alignments to construct basic transcript models
- re-alignment of reads to basic transcripts to identify splice sites
- refining basic transcripts using splice data to produce final transcripts
- using pull-down data to modify the three prime ends of the transcripts.
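To give a feel for the second step, here is a minimal Python sketch of how overlapping read alignments might be collapsed into exon-like blocks. It is a toy illustration only, not the code used in the Ensembl pipeline: the function name `merge_alignments`, the 1-based inclusive coordinate convention and the example read positions are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical convention: (start, end), 1-based and inclusive.
Interval = Tuple[int, int]


def merge_alignments(alignments: List[Interval]) -> List[Interval]:
    """Collapse overlapping or directly adjacent alignments into blocks."""
    blocks: List[Interval] = []
    for start, end in sorted(alignments):
        if blocks and start <= blocks[-1][1] + 1:
            # This read overlaps or abuts the previous block, so extend it.
            prev_start, prev_end = blocks[-1]
            blocks[-1] = (prev_start, max(prev_end, end))
        else:
            # A gap with no read coverage starts a new block.
            blocks.append((start, end))
    return blocks


# Made-up read positions on a single chromosome.
reads = [(100, 175), (150, 225), (226, 300), (900, 975), (950, 1025)]
print(merge_alignments(reads))
```

Real data would also need strand information and spliced reads to decide which blocks belong to the same transcript; that is what the later re-alignment and refinement steps are for.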
In total, we compared a sample set of 8,822 cDNAs to the RNA-Seq models and found that 95 per cent of the cDNA introns were present in our RNA-Seq set. We also found that 83 per cent of the cDNAs were reproduced perfectly in the RNA-Seq set at the level of the coding sequence of the transcript (the bit that actually codes for protein).
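As a rough illustration of the intron comparison, the sketch below derives introns from the exon coordinates of each model and reports what fraction of the cDNA introns also appear in the RNA-Seq set. The data layout, the function names `introns_of` and `intron_recovery`, and the example coordinates are assumptions made for this example; strand is ignored for simplicity, and this is not the comparison code used for the paper.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]           # (start, end), 1-based inclusive
Intron = Tuple[str, int, int]    # (chromosome, start, end)
Model = Tuple[str, List[Exon]]   # (chromosome, exons of one transcript)


def introns_of(chrom: str, exons: List[Exon]) -> Set[Intron]:
    """Return the introns lying between consecutive exons of one model."""
    ordered = sorted(exons)
    return {
        (chrom, left[1] + 1, right[0] - 1)
        for left, right in zip(ordered, ordered[1:])
    }


def intron_recovery(cdna_models: List[Model], rnaseq_models: List[Model]) -> float:
    """Fraction of cDNA introns with an exact match in the RNA-Seq models."""
    cdna = set().union(*(introns_of(c, e) for c, e in cdna_models))
    rnaseq = set().union(*(introns_of(c, e) for c, e in rnaseq_models))
    if not cdna:
        return 0.0
    return len(cdna & rnaseq) / len(cdna)


# Made-up models: two of the three cDNA introns are reproduced exactly.
cdna_models = [
    ("chr1", [(100, 200), (300, 400), (500, 600)]),
    ("chr2", [(10, 50), (80, 120)]),
]
rnaseq_models = [
    ("chr1", [(90, 200), (300, 400), (500, 650)]),
    ("chr2", [(10, 50), (90, 120)]),
]
print(f"{intron_recovery(cdna_models, rnaseq_models):.0%} of cDNA introns recovered")
```

Matching introns exactly, rather than whole transcripts, means a model with slightly different ends still counts as long as its splice sites agree.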
Many of the RNA-Seq generated models appeared to be fragments, and we needed to filter them out before we could include our findings in the Ensembl gene-set. Despite these fragments, we were able to create a significant number of full-length transcript models, and 8,374 of these were added to the core Ensembl gene-set. In addition, we generated a wealth of tissue-specific splice variation data.
By improving the quality and coverage of the zebrafish gene annotation we have provided a useful resource for researchers who wish to verify the activity of genes implicated in human disease. In addition, we have generated a high-quality RNA-Seq gene annotation pipeline that is now routinely used in Ensembl annotation and is proving particularly useful for species with very little protein or cDNA evidence.
Furthermore, the significant number (>1,000) of novel models that came from RNA-Seq but were absent from the zebrafish cDNAs suggests that the deep sequencing offered by RNA-Seq can be used to expand the gene annotation of even well-studied model organisms.
We hope that our approach will help refine the use of RNA-Seq in the Ensembl gene build process for new species. It also gives us the opportunity to rapidly update old gene-sets for which there is unlikely to be a full gene-build in the foreseeable future.