Categories: Sanger Science
5 December 2013

Finding functions for DUF proteins

5th December
by Penny Coggill

Representative domain-architectures of proteins containing the YARHG domain

In the spring of 2013 a small group of structural biologists gathered at the Sanford-Burnham Institute in San Diego for a week of concentrated examination of the structures of proteins designated ‘Domain of Unknown Function’, or ‘DUF’. This was a small part of a large effort to make sense of some of the millions of proteins found in the human gut that have been classified but not yet characterised (reference 1).

New genomes are being sequenced daily, and this is depositing very large numbers of new proteins into protein databases. A number of groups attempt to assign meaning to these proteins and genomes by classifying the proteins in varying ways. Pfam is a database held at the Wellcome Trust Sanger Institute that attempts to build families of the functional units of proteins through their evolutionary relationships to one another, that is, their homology.

We, the annotators at Pfam, have built over 16,000 families of homologous regions of proteins, to the extent that approximately 80 per cent of all the proteins in UniProtKB – the protein repository held at the European Bioinformatics Institute (EMBL-EBI) – have been associated with a family. However, since so many of the 43 million proteins in UniProtKB come from bacteria and other micro-organisms that are uncharacterised, nothing meaningful can be said about at least a quarter of the families in Pfam, and to these we have given the name ‘DUF’.

Our consortium of scientists, including members of Pfam and specialists from the Joint Center for Structural Genomics (JCSG), decided to begin the gargantuan task of finding functions for these DUF proteins by identifying 12 candidate structures and analysing them in depth during the week-long jamboree. Our aim for each of these 12 structures was to propose a possible function and then to publish.

I chose to analyse a Protein Data Bank structure, 4g2a, from Legionella pneumophila subsp. pneumophila str. Philadelphia 1. This structure had been solved by the JCSG for a protein that carries two separately folding globular domains, and so belongs to two Pfam families. We had already published a paper on one of these domains, the YARHG domain (reference 2); the other domain had been classified as a DUF.

In our previous paper, we had called the family the YARHG domain because of a highly conserved sequence motif of five amino acid residues, YARHG, that we had speculatively suggested might bind a small, possibly hydrophobic, molecule associated with bacterial cell walls. With the structure of a member of both these families now at our disposal, the group collectively examined all aspects of the structure and the sequences.

The combined brains of the group came up with the hypothesis that the DUF4424 domain might be playing a role in forming large, multi-component enzyme complexes, and that the YARHG domain might be binding a chemical group associated with bacterial cell-wall outer layers, such as a hydrophobic outer membrane lipid or lipopolysaccharide that is to be acted upon by the complex.

The importance of a balanced, personalised gut flora is now well established. We hope, ultimately, that the outcomes of this project will help towards a better understanding of the many biochemical pathways and bacterial species involved in the human gut microbiome. This will allow scientists to better appreciate the relevance of these systems to human health and well-being.

So far, two further papers from this jamboree have been published (references 3 and 4) and more will follow!

Penny is a computational biologist working with the Pfam database under Alex Bateman.


  • 1. Ellrott K, Jaroszewski L, Li W, Wooley JC, Godzik A. (2010) Expansion of the protein repertoire in newly explored environments: human gut microbiome specific protein families. PLoS Computational Biology. doi: 10.1371/journal.pcbi.1000798
  • 2. Coggill P, Bateman A. (2012) The YARHG domain: an extracellular domain in search of a function. PLoS One. doi: 10.1371/journal.pone.0035575
  • 3. Hwang WC, Bakolitsa C, Punta M, Coggill PC, Bateman A, Axelrod HL, Rawlings ND, Sedova M, Peterson SN, Eberhardt RY, Aravind L, Pascual J, Godzik A. (2013) LUD, a new protein domain associated with lactate utilization. BMC Bioinformatics. doi: 10.1186/1471-2105-14-341
  • 4. Eberhardt RY, Chang Y, Bateman A, Murzin AG, Axelrod HL, Hwang WC, Aravind L. (2013) Filling out the structural map of the NTF2-like superfamily. BMC Bioinformatics. doi: 10.1186/1471-2105-14-327
