Sequences of ambiguous coding status are attracting growing attention across diverse biological systems, under names such as the “ghost proteome”(10), the “dark proteome”(11) or the “noncanonical translatome”(12–14). In the human genome, researchers have shown that these sequences play a role in cancer(15–18). Relatively little work has been done in bacteria(19,20), where my work is focused.
A related area of research investigates what these non-canonical proteins, and other poorly understood genes, actually do in the cell. What, if anything, is their function? This question can increasingly be addressed with the kinds of high-throughput biological data that the Sanger Institute excels at producing. With the aid of new methods in artificial intelligence, the massive datasets available can be leveraged for new functional insight. This includes studying bacterial proteins, where often a quarter or more of the genes in a genome have no known function(21).
The apparently very basic questions “what is a gene?” and “what does it do?” remain under active investigation, and their answers will have implications for some of the most important topics in biological research.