15 April 2015
By Anika Oellrich
While human readers best understand data that is provided as free text, computers can better analyze data that is represented with ontologies. In order for computers to learn how free text corresponds to ontologies, they need a text corpus (such as the HPO corpus) from which to learn this relationship. Credit: Tudor Groza
Every one of us has to visit our GP from time to time. Ideally, we describe our symptoms and are prescribed medicines corresponding to the diagnosis.
Unfortunately, it's not always possible to identify the cause of a disease and all a GP can do is try to relieve the symptoms.
The symptoms and underlying causes of rarely occurring hereditary diseases are particularly hard to define, which prevents successful treatment or prevention.
To help with the identification of causes for these diseases, numerous biological as well as computational efforts are ongoing. While in a biological experiment the scope is very narrow and precise, focusing perhaps on one gene or set of symptoms, computational projects analyse large amounts of data from several sources. The sheer amount of information can mean that computational projects miss smaller details.
Some of the computational algorithms either work on assumptions derived from biological experiments or take the output of these projects into consideration. The lab robots Adam and Eve, developed by researchers at Aberystwyth University, are an example of a computational approach used to derive a biological hypothesis.
In order for Adam and Eve to be able to execute their algorithms, knowledge needs to be prepared in a format that can be understood by computers, which is not necessarily readable for humans. For this reason, the meanings of things, such as the symptoms of diseases, are gathered into groups according to the context in which they are used.
These groupings are called domain-specific ontologies. Probably the best known example of a domain-specific ontology is the Gene Ontology, a collaborative effort by the computational biology community to create a standardised vocabulary for annotating gene function.
The Human Phenotype Ontology is an ontology that encodes human-specific symptoms of diseases. The encoding of each symptom includes a textual description as well as a logical representation that computers can use to perform further deductions. For a computer to be able to read records written by a GP and draw conclusions from them, it would need not only the ontology but also knowledge of how symptoms are expressed in text.
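To make the idea of a logical representation a little more concrete, here is a minimal sketch in Python. The hierarchy below is a toy, heavily simplified set of "is-a" relations invented for illustration; it is not the actual structure of the Human Phenotype Ontology, and the names `IS_A`, `ancestors` and `is_kind_of` are hypothetical.

```python
# Toy, simplified "is-a" relations between symptom terms.
# Assumption: this hierarchy is illustrative only and does not
# reproduce the real Human Phenotype Ontology.
IS_A = {
    "Seizure": "Abnormality of the nervous system",
    "Microcephaly": "Abnormality of the head",
    "Abnormality of the nervous system": "Phenotypic abnormality",
    "Abnormality of the head": "Phenotypic abnormality",
}


def ancestors(term):
    """Return all more general terms, from most specific to most general."""
    result = []
    while term in IS_A:
        term = IS_A[term]
        result.append(term)
    return result


def is_kind_of(term, ancestor):
    """Deduce whether `term` is a special case of `ancestor`."""
    return ancestor in ancestors(term)


print(ancestors("Seizure"))
print(is_kind_of("Seizure", "Abnormality of the head"))
```

With relations like these, a computer that finds "Seizure" in a record can deduce that the patient has some abnormality of the nervous system, even though the record never says so in those words.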
One way of teaching a computer the language of specific entities, in our case symptoms, is to generate a corpus. In this context, a text corpus is nothing more than a collection of texts in which the entities of interest have been highlighted.
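As a rough illustration, a highlighted sentence can be stored as the raw text plus a list of character spans, each linked to an ontology identifier. The sketch below is an assumption about how such a record might look, not the format of the actual HPO corpus. The `Annotation` class and `annotate` helper are hypothetical, and the identifiers are given for illustration and should be checked against the current ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int        # offset of the first highlighted character
    end: int          # offset just past the last highlighted character
    concept_id: str   # ontology identifier attached to the highlight


def annotate(text, phrase, concept_id):
    """Highlight the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Annotation(start, start + len(phrase), concept_id)


text = "The patient presented with seizures and microcephaly."

# One corpus entry: the text plus its highlights.
# Assumption: identifiers are illustrative only.
corpus_entry = {
    "text": text,
    "annotations": [
        annotate(text, "seizures", "HP:0001250"),
        annotate(text, "microcephaly", "HP:0000252"),
    ],
}

for a in corpus_entry["annotations"]:
    print(text[a.start:a.end], "->", a.concept_id)
```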
In our study, we generated such a corpus for the Human Phenotype Ontology by highlighting symptoms in a number of medical research papers. This corpus can then be used to assess how good an algorithm is at recognising symptoms. However, this usually provides only a quantified assessment of performance.
Thus, we also provide a variety of test cases that can elucidate the shortcomings of an algorithm, i.e. which bits of natural language cause the algorithm to struggle to identify a symptom.
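A quantified assessment of this kind is commonly expressed as precision, recall and F1 score, computed by comparing the highlights an algorithm produces with those in the corpus. The sketch below assumes exact matching on start offset, end offset and identifier; this is one common convention, not necessarily the protocol used in our study. The `evaluate` function and the example data are hypothetical.

```python
def evaluate(gold, predicted):
    """Compare predicted highlights with corpus highlights (exact match)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Highlights as (start, end, identifier) for the sentence
# "The patient presented with seizures and microcephaly."
# Assumption: identifiers are illustrative only.
gold = {(27, 35, "HP:0001250"), (40, 52, "HP:0000252")}

# A hypothetical algorithm finds "microcephaly" correctly but
# highlights "with seizures" instead of "seizures".
predicted = {(22, 35, "HP:0001250"), (40, 52, "HP:0000252")}

precision, recall, f1 = evaluate(gold, predicted)
print(precision, recall, f1)
```

The scores alone only say that one of the two highlights was missed. Looking at the mismatching span shows why: the algorithm included the preceding word, which is the sort of shortcoming that targeted test cases are meant to expose.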
Once an algorithm is sufficiently trained, it can then be applied to read the millions of abstracts and full-text papers contained in PubMed and provide summaries for diseases, genes, drugs or any other entity relevant to symptoms of diseases. These summaries have a wide range of applications such as the prediction of disease gene candidates, drug repurposing, and many more.
Anika Oellrich is a Senior Bioinformatician working as part of the International Mouse Phenotyping Consortium (IMPC) project. Her research focuses on aspects of phenotype mining in large data sets as well as in scientific literature. Having investigated the different representations of phenotypes, she applies this knowledge to data integration and human genetic disorders with the aim of improving the understanding of the molecular mechanisms underlying human diseases.
- Groza T, et al (2015). Automatic concept recognition using the Human Phenotype Ontology reference and test suite corpora. Database: The Journal of Biological Databases and Curation. DOI:10.1093/database/bav005
- Ashburner M, et al (2000). Gene Ontology: tool for the unification of biology. Nature Genetics. DOI:10.1038/75556
- Hoehndorf R, et al (2012). Linking PHARMGKB to phenotype studies and animal models of disease for drug repurposing. Pacific Symposium on Biocomputing 2012. DOI:10.1142/9789814366496_0038