Understanding the known unknowns of human gene annotation

25th November 2013
by Jonathan Mudge

The GENCODE annotation of BRCA1. The circled exon contains two known SNPs; one causes a codon substitution and the other disrupts a splice site. This exon is only found in one transcript. Is it functional or simply transcriptional noise? Credit: Ensembl

The Human Genome Project was a starting point, not a finishing line. These three billion DNA letters will only be useful once we understand what they are telling us.

The Human and Vertebrate Analysis and Annotation (HAVANA) team at the Wellcome Trust Sanger Institute have spent ten years describing the gene content of the genome and last year we completed our first pass chromosome-by-chromosome annotation of all human chromosomes. Together with Ensembl, we contributed the GENCODE gene annotation set to the human ENCODE project. Nonetheless, our work continues. Indeed, there are times when it seems we are climbing an endless mountain, with its summit hidden behind the clouds.

There are two reasons why. Firstly, one bitter-sweet reward of next-generation sequencing is that the number of human RNAs (or ‘transcripts’) identified continues to increase, making the genome something of a moving target from an annotator’s perspective.

When we overlay the GENCODE annotation onto the genome, we see that around five per cent of its sequence is covered by our exons (DNA sequences that are processed into mature transcripts). However, recent RNAseq studies indicate that over 60 per cent of the genome is transcribed. We now know that at least some of this additional transcription represents long non-coding RNA, and, currently, less than half of our 57,000 human genes are classed as protein-coding. Nonetheless, the true fraction of this transcription that can be converted into meaningful annotation remains to be seen.
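For readers who want to see how a coverage figure of this kind is arrived at, here is a minimal Python sketch. It is an illustration only, not the GENCODE pipeline: the coordinates, the genome length and the function names are all made up for the example. The key step is merging overlapping exons first, so that sequence shared by several transcripts is counted once.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_fraction(exons, genome_length):
    """Fraction of a sequence covered by at least one exon."""
    covered = sum(end - start for start, end in merge_intervals(exons))
    return covered / genome_length


# Hypothetical toy data: four exons on a 5,000-base sequence.
toy_exons = [(100, 200), (150, 250), (400, 450), (900, 1000)]
toy_genome_length = 5000

fraction = covered_fraction(toy_exons, toy_genome_length)
print(f"{fraction:.1%} of the toy sequence is exonic")
```

In practice the exon coordinates would come from an annotation file and the calculation would be run per chromosome, but the principle is the same.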

In addition, the number of transcripts associated with protein-coding genes is on the increase and we now know that most protein-coding genes generate several distinct transcripts from the use of alternative splicing.

However, the goal of our annotation is not simply to capture transcripts. We also want to understand the contribution each makes to human biology; its function, in other words. Here we get to the second complication, and it’s a big one: some transcripts may not actually do anything. Perhaps this seems like a heretical suggestion? Nonetheless, it must be true to some extent, given that the molecular processes that generate RNA – in particular transcription and splicing – are known to be both error-prone and rather promiscuous.

What proportion, then, of the transcriptome is truly functional? The truth is we don’t know, and it’s rather sobering to consider that only a small minority of the 200,000 transcripts in GENCODE have been studied in the laboratory. For this reason, much of the functional annotation that currently exists in GENCODE should be regarded as putative.

The implications of this ambiguity are profound. Consider the BRCA1 gene (which I daresay many of you frequently do, due to its association with breast cancer). We have annotated 28 transcripts within this gene, each of which is constructed from a subset of 33 distinct exons. Meanwhile, hundreds of genomic variants have been identified within this region.

On one hand, if certain exons annotated within BRCA1 are actually non-functional, then it follows that variants found in such exons may be of little consequence. On the other hand, if we can establish exactly which BRCA1 transcripts are functional, then we can interpret the function of any associated variants with confidence. In short, we need this information, and we need it now. (See image above.)
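To make the problem concrete, the sketch below counts how many transcript models include each exon and then reports, for each variant, the support of the exon it falls in. Everything here is hypothetical: the transcript names, coordinates and variants are invented toy data, not the real BRCA1 annotation. Note also that the number of supporting transcripts is not evidence of function; it only flags the cases, such as an exon seen in a single transcript, where interpretation is most uncertain.

```python
from collections import Counter

# Hypothetical toy gene: transcript name -> list of half-open (start, end) exons.
transcripts = {
    "tx1": [(100, 200), (300, 400), (500, 600)],
    "tx2": [(100, 200), (500, 600)],
    "tx3": [(100, 200), (250, 280), (500, 600)],
}

# Hypothetical variants: name -> position.
variants = {"varA": 150, "varB": 260, "varC": 450}


def exon_support(transcripts):
    """Count how many transcripts contain each distinct exon."""
    counts = Counter()
    for exons in transcripts.values():
        counts.update(set(exons))  # set() guards against duplicate entries
    return counts


def variant_support(variants, transcripts):
    """For each variant, the highest support of any exon overlapping it (0 if none)."""
    support = exon_support(transcripts)
    result = {}
    for name, pos in variants.items():
        hits = [n for (start, end), n in support.items() if start <= pos < end]
        result[name] = max(hits) if hits else 0
    return result


report = variant_support(variants, transcripts)
for name, n in report.items():
    print(f"{name}: overlapping exon found in {n} transcript(s)")
```

Here the variant in the single-transcript exon is exactly the kind of case the circled BRCA1 exon represents: annotated, but of unknown significance.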

All of this raises the question: where do we go from here? We start by plugging our new publication in Genome Research (see link below), where we discuss the relative merits of a variety of strategies for both transcript discovery and the functional annotation of transcript models.

In this paper, we propose an integrated strategy using all of the resources at our disposal, including a variety of next-generation sequencing methods, modern proteomics and the comparative analysis of genomes from other species. This will not be a short-term endeavour.

In the meantime, here is the take-home message for researchers: realise that your favourite gene is complex, accept that it is currently difficult to untangle the meaning of this complexity, and, above all, appreciate how this uncertainty may affect your work.

Jonathan is a Senior Computer Biologist at the Wellcome Trust Sanger Institute.

References