Sanger Science

Seq and ye shall find…

25 July 2012

Written by Lia Chappell

I’m studying for my PhD at the Sanger Institute and my interest is in understanding how the parasites responsible for malaria are able to adapt to live in both people and mosquitoes. I’m always looking for more effective, accurate and cost-efficient ways for us to see what is happening within the parasites’ cells.

Malaria parasites have a complicated life cycle, moving through different parts of the mosquito and human host, changing the shape of their single cell drastically in the process. This is impressive for an organism with about the same number of genes as a yeast cell that just floats around in its environment! To understand what happens when they change from one form to the next we can use a technology called RNA-seq. We can use RNA-seq to detect and count the RNA molecules present in a parasite (these are encoded in the genome, are made when genes are switched on, and control the amount of proteins being made). From this we can work out which genes and biological pathways are responsible for the parasite’s adaptability, which might help to identify targets for drug treatment.

RNA-seq is very useful for looking at how much a gene is switched on in many types of living things, but the unusual nature of the malaria parasite’s genome means it’s more challenging than most.

Recently I wrote a review called ‘Looking for a needle in a haystack’ in Nature Reviews Microbiology. The study I reviewed was particularly helpful because the researchers had taken the time to look carefully at a technological challenge. This problem can prevent many researchers from answering their questions or can make them blow their entire research budget on looking at molecules that aren’t of interest. They conducted a methodical comparison between technologies and manufacturers that many small laboratories would find too expensive to be able to carry out for themselves. I wanted to bring their work to the attention of a wider audience (who might not read as many method papers as me!) to highlight how important these details can be.

The authors of the study found that there are significant differences between processes and manufacturers in their ability to remove unwanted RNA molecules and increase the proportion of useful data produced. For example, one technology (Ribo-Zero) enriched RNA transcripts by up to 40-fold and raised the useful data to as much as 98 per cent of the information sequenced. This technology also preserved the relative abundances of molecules seen in the untreated controls. Others were less effective or, even worse, distorted the counts of different molecules (something you want to avoid when you are trying to compare the differing levels of gene expression).
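As a rough illustration of what these measures mean, the Python sketch below computes the fraction of useful reads, the fold enrichment relative to an untreated control, and a simple check for distorted relative abundances. The function names, the example read counts and the choice of a Pearson correlation on log-transformed counts are assumptions made for illustration; they are not the method or the figures from the study.

```python
import math


def useful_fraction(useful_reads, total_reads):
    """Fraction of sequenced reads that come from transcripts of interest."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return useful_reads / total_reads


def fold_enrichment(treated_fraction, untreated_fraction):
    """How many times larger the useful fraction is after treatment."""
    if untreated_fraction <= 0:
        raise ValueError("untreated_fraction must be positive")
    return treated_fraction / untreated_fraction


def log_correlation(treated_counts, untreated_counts):
    """Pearson correlation of log2(count + 1) across genes.

    A value close to 1 suggests the treatment kept the relative
    abundances of the molecules roughly intact.
    """
    xs = [math.log2(c + 1) for c in treated_counts]
    ys = [math.log2(c + 1) for c in untreated_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up example: 2 useful reads per 100 before treatment, 80 per 100 after.
before = useful_fraction(2, 100)
after = useful_fraction(80, 100)
print(f"fold enrichment: {fold_enrichment(after, before):.1f}")
```

In practice the per-gene counts would come from aligned reads rather than hand-typed lists, and a rank-based correlation may be a more robust choice when a few genes dominate the counts.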

I hope that my review encourages scientists to think carefully about the protocols that they use when using this technology to explore how genes work. It’s often hard to know which details you should focus on and spend your time and budget optimising. If the review helps my colleagues to spot potential biases and informs their choice of approach, then I will be very happy.

Lia Chappell is a PhD student in the Parasite Genomics team, studying gene expression in malaria… more

Knowing zebrafish, knowing you – understanding zebrafish genomes, unlocking human health

Zebrafish are an ideal model organism for modelling the effects of genes on human health and disease. Credit: Genome Research Limited

30 July 2012

Written by Simon White

Studying zebrafish is a vitally important way to discover what role genes play in human health and disease. By linking human genes to their closest comparators in the zebrafish genome (orthologs) we are able to investigate what happens when a gene is lost or isn’t working the way it should.

However, to be able to do this, we need to have as complete a list of zebrafish genes as possible. I work in a team that uses RNA-Seq technology to automatically annotate genes and to determine the different ways genes work in different tissues (tissue-specific splice variation). RNA-Seq looks for the molecules (transcripts) that genes produce to make the proteins that control how a cell works. This information is invaluable in helping us to construct a complete catalogue of transcripts from the zebrafish genome.

The potential biological insights offered by RNA-Seq are considerable, but they come at a high cost in terms of the effort needed to understand the large volume of information it produces. Because of the usefulness of this information, we set ourselves the goals of constructing a zebrafish gene set based on RNA-Seq alone and then identifying the highest quality models. This information would then be added into the core Ensembl gene-set (the reference zebrafish genome used by researchers around the world) in such a way as to increase the quantity and quality of the gene-set without adding artifacts or pseudogenes.

The results of our efforts were published recently in Genome Research, in a paper entitled ‘Incorporating RNA-seq data into the Zebrafish Ensembl Gene Build’.

The task of assembling RNA-Seq short reads into gene models is not trivial; in particular, we needed to overcome the issues of transcript contiguity and fragmentation. To achieve this, we employed the following approach.

We used Illumina paired-end sequencing to deep sequence a range of developmental stages and adult tissues, providing near complete coverage of the zebrafish transcriptome. We also performed an RNA-Seq three prime pull-down experiment that allowed us to identify the precise three prime ends of models.

In order to use these data sets to build gene models we created an analysis pipeline consisting of five steps:

  • alignment to the genome
  • processing alignments to construct basic transcript models
  • re-alignment of reads to basic transcripts to identify splice sites
  • refining basic transcripts using splice data to produce final transcripts
  • using pull-down data to modify the three prime ends of the transcripts.
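To give a feel for what the middle steps of such a pipeline involve, here is a minimal Python sketch of two ideas: merging overlapping aligned blocks into candidate exons, and keeping only the introns that are supported by several spliced reads. This is a toy illustration under stated assumptions, not the Ensembl pipeline code: the function names, the 1-based inclusive coordinates and the `min_reads` threshold are all hypothetical.

```python
from collections import Counter


def merge_blocks(blocks):
    """Merge overlapping or directly adjacent (start, end) blocks.

    Each block is one aligned stretch of a read on the genome
    (1-based, inclusive). The merged blocks are candidate exons.
    """
    merged = []
    for start, end in sorted(blocks):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def supported_introns(spliced_reads, min_reads=2):
    """Return introns seen in at least min_reads spliced reads.

    Each read is a list of (start, end) blocks in genomic order; the
    gap between two consecutive blocks is treated as an intron.
    """
    counts = Counter()
    for read_blocks in spliced_reads:
        for (_, left_end), (right_start, _) in zip(read_blocks, read_blocks[1:]):
            counts[(left_end + 1, right_start - 1)] += 1
    return {intron for intron, n in counts.items() if n >= min_reads}


# Made-up alignments for a single locus.
reads = [
    [(100, 200), (300, 350)],
    [(120, 200), (300, 340)],
    [(100, 180), (310, 350)],  # disagrees with the other two reads
]
all_blocks = [block for read in reads for block in read]
print(merge_blocks(all_blocks))
print(supported_introns(reads))
```

Requiring more than one read per intron is one simple way to stop a single misaligned read from creating a false splice site; a real pipeline would also have to handle strand, chromosome and alignment quality.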

In total, we compared a sample set of 8,822 cDNAs to the RNA-Seq models and found that 95 per cent of the cDNA introns were present in our RNA-Seq set. We also found that 83 per cent of the cDNAs were reproduced perfectly in the RNA-Seq set at the level of the coding sequence of the transcript (the bit that actually codes for protein).
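The two percentages above are, in essence, set comparisons. The Python sketch below shows one way such figures could be computed, assuming introns are represented as (chromosome, start, end, strand) tuples and coding sequences as ordered lists of exon coordinates. These representations, the function names and the example data are illustrative assumptions, not the code or data used in the paper.

```python
def intron_recovery(cdna_introns, rnaseq_introns):
    """Fraction of distinct cDNA introns also present in the RNA-Seq set."""
    cdna = set(cdna_introns)
    if not cdna:
        raise ValueError("no cDNA introns supplied")
    return len(cdna & set(rnaseq_introns)) / len(cdna)


def perfect_cds_fraction(cdna_cds, rnaseq_cds):
    """Fraction of cDNAs whose coding exon structure is reproduced exactly.

    cdna_cds maps a cDNA name to its ordered list of coding exons;
    rnaseq_cds is a list of such exon lists, one per RNA-Seq model.
    """
    if not cdna_cds:
        raise ValueError("no cDNAs supplied")
    models = {tuple(exons) for exons in rnaseq_cds}
    matched = sum(1 for exons in cdna_cds.values() if tuple(exons) in models)
    return matched / len(cdna_cds)


# Made-up example data.
cdna_introns = [("1", 201, 299, "+"), ("1", 401, 499, "+"),
                ("2", 51, 99, "-"), ("2", 151, 199, "-")]
rnaseq_introns = [("1", 201, 299, "+"), ("1", 401, 499, "+"),
                  ("2", 51, 99, "-"), ("3", 10, 20, "+")]
print(f"introns recovered: {intron_recovery(cdna_introns, rnaseq_introns):.0%}")
```

Matching on exact coordinates is deliberately strict: a model that differs from the cDNA by a single base at one exon boundary counts as a miss.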

Many of the RNA-Seq generated models appeared to be fragments and we needed to remove them by filtering our results before we could include our findings in the Ensembl gene set. However, despite these fragments, we were able to create a significant number of full-length transcript models and 8,374 of these were added to the core Ensembl gene-set. In addition to this we generated a wealth of tissue-specific splice variation data.

By improving the quality and coverage of the zebrafish gene annotation we have provided a useful resource for researchers who wish to verify the activity of genes implicated in human disease. In addition, we have generated a high-quality RNA-Seq gene annotation pipeline that is now routinely used in Ensembl annotation and is proving particularly useful for species with very little protein or cDNA evidence.

Furthermore, the significant number (>1,000) of novel models that came from RNA-Seq but were absent from the zebrafish cDNAs suggests that the deep sequencing offered by RNA-Seq can be used to expand the gene annotation of even well-studied model organisms.

We hope that our approach will help refine the use of RNA-Seq in the Ensembl gene build process for new species. It also gives us the opportunity to rapidly update old gene-sets for which there is unlikely to be a full gene-build in the foreseeable future.

Simon White works in the Ensembl Genebuild team at the Institute, where he develops and runs pipelines for automated genome annotation… more
