Leila Ranandeh Kalankesh, Robert Stevens, Andy Brass.
Abstract
BACKGROUND: Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf's law, the exponent of which can provide useful information on the nature of the language being used. We have therefore explored the hypothesis that collections of GO annotations will show similar statistical behaviours to natural language.
Year: 2012 PMID: 22676436 PMCID: PMC3473240 DOI: 10.1186/1471-2105-13-127
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Total number of annotations and the number of distinct GO identifiers for each of the data sets used in the study in terms of three separate sub-ontologies
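| Species | Sub-ontology | Total annotations | Distinct GO identifiers |
|---|---|---|---|
| Hs | CC | 51,640 | 889 |
| Hs | MF | 55,781 | 2,844 |
| Hs | BP | 58,320 | 5,259 |
| Mm | CC | 45,933 | 641 |
| Mm | MF | 60,919 | 2,318 |
| Mm | BP | 59,133 | 4,239 |
| Dr | CC | 23,179 | 304 |
| Dr | MF | 47,651 | 1,187 |
| Dr | BP | 34,158 | 1,513 |
| Sc | CC | 29,563 | 626 |
| Sc | MF | 26,292 | 1,611 |
| Sc | BP | 31,797 | 1,963 |
| Rn | CC | 53,342 | 50 |
| Rn | MF | 63,050 | 2,776 |
| Rn | BP | 74,943 | 5,411 |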
CC: Cellular Component sub-ontology; MF: Molecular Function sub-ontology; BP: Biological Process sub-ontology. Homo sapiens (Hs), Mus musculus (Mm), Danio rerio (Dr), Saccharomyces cerevisiae (Sc), Rattus norvegicus (Rn).
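The two quantities tabulated above can be computed directly from a flat list of annotations. The following is a minimal sketch, assuming the annotations are available as (gene, GO identifier) pairs; the function name `summarize_annotations` and the toy data are hypothetical and not taken from the study.

```python
from collections import Counter


def summarize_annotations(annotations):
    """Return (total annotations, distinct GO identifiers, per-term counts).

    `annotations` is an iterable of (gene_id, go_id) pairs. This input
    layout is an assumption made for illustration only.
    """
    go_counts = Counter(go_id for _gene_id, go_id in annotations)
    total = sum(go_counts.values())
    return total, len(go_counts), go_counts


# Hypothetical toy annotations (not data from the study).
toy_annotations = [
    ("geneA", "GO:0005634"),
    ("geneB", "GO:0005634"),
    ("geneC", "GO:0005737"),
    ("geneA", "GO:0005737"),
    ("geneD", "GO:0005886"),
]

total, distinct, counts = summarize_annotations(toy_annotations)
print(total, distinct)
```

The per-term counts returned here are the term frequencies on which a Zipf-style analysis would operate.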
The total number of annotations and the number of distinct GO identifiers of each of the Homo sapiens (Hs) and Mus musculus (Mm) data sets in terms of the three separate sub-ontologies by evidence code
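| Species | Sub-ontology | Distinct GO identifiers | Total annotations | Distinct GO identifiers | Total annotations |
|---|---|---|---|---|---|
| Hs | CC | 642 | 16,744 | 572 | 31,164 |
| Hs | MF | 1,974 | 20,250 | 1,735 | 31,709 |
| Hs | BP | 3,172 | 18,594 | 3,642 | 33,820 |
| Mm | CC | 487 | 11,784 | 232 | 28,918 |
| Mm | MF | 1,364 | 10,467 | 1,320 | 45,185 |
| Mm | BP | 3,846 | 26,478 | 731 | 26,473 |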
Figure 1. The cumulative distribution function Pr(x) plotted as a function of frequency (x) for GO gene annotations contained within Human GOA. The straight line shows the region of each plot for which a power law was found to provide a good model of the data [25]. 1(a) Annotation from the biological process sub-ontology; 1(b) annotation from the molecular function sub-ontology; 1(c) annotation from the cellular component sub-ontology. The measured power law exponents, β, were 2.04, 1.83, and 1.73, respectively. For all graphs p-value > 0.55, suggesting that the power law does provide a plausible model of the data.
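To illustrate how an exponent such as β can be estimated from term frequencies, the sketch below computes an empirical distribution and a maximum likelihood estimate of the exponent. It makes two assumptions that the caption does not confirm: that Pr(x) denotes the probability of a frequency of at least x, and that the exponent is estimated with the common discrete approximation β ≈ 1 + n / Σ ln(x_i / (x_min − 0.5)). The function names, the choice of `x_min`, and the toy frequencies are illustrative only.

```python
import bisect
import math


def ccdf(values):
    """Return [(x, Pr(X >= x))] for each distinct x in `values`."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for x in sorted(set(ordered)):
        # Index of the first element >= x; everything from there on counts.
        first = bisect.bisect_left(ordered, x)
        points.append((x, (n - first) / n))
    return points


def fit_exponent(values, x_min):
    """Approximate maximum likelihood exponent for integer data >= x_min.

    Assumes x_min >= 1, so that x_min - 0.5 is positive.
    """
    tail = [v for v in values if v >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(v / (x_min - 0.5)) for v in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical term frequencies (how often each GO term is used).
toy_frequencies = [1, 1, 1, 1, 2, 2, 3, 10]

for x, prob in ccdf(toy_frequencies):
    print(x, prob)
print(round(fit_exponent(toy_frequencies, x_min=1), 2))
```

On a log-log plot of these points, a power law region appears as an approximately straight line, which is the behaviour the figure describes.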
Results obtained from the power law analysis of each of the data sets characterized in Table 2
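| Species | Sub-ontology | β | P-value |
|---|---|---|---|
| Hs | CC | 1.73 | 0.63 |
| Hs | MF | 1.83 | 0.55 |
| Hs | BP | 2.04 | 0.65 |
| Mm | CC | 1.69 | 0.74 |
| Mm | MF | 1.76 | 0.36 |
| Mm | BP | 2.08 | 0.97 |
| Dr | CC | 1.62 | 0.74 |
| Dr | MF | 1.69 | 0.91 |
| Dr | BP | 1.88 | 0.11 |
| Sc | CC | 1.86 | 0.29 |
| Sc | MF | 1.88 | 0.78 |
| Sc | BP | 2.27 | 0.42 |
| Rn | CC | 1.68 | 0.24 |
| Rn | MF | 1.91 | 0.85 |
| Rn | BP | 2.38 | 0.76 |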
β is the power law exponent and the P-value is a statistic that indicates how well the power law models the data. If P > 0.1, the power law can be taken as a good description of the data. H. sapiens (Hs), M. musculus (Mm), D. rerio (Dr), S. cerevisiae (Sc), R. norvegicus (Rn).
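A goodness-of-fit P-value of this kind is often obtained by a bootstrap test: fit the power law, measure the Kolmogorov-Smirnov (KS) distance between data and model, then check how often synthetic power law samples fit worse than the observed data. The sketch below is a simplified version under stated assumptions: it treats the data as continuous, takes `x_min` as given, and resamples only the tail. It is not claimed to be the procedure used in the study, and all function names are hypothetical.

```python
import math
import random


def fit_beta(tail, x_min):
    """Continuous maximum likelihood exponent for values >= x_min."""
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum <= 0:
        raise ValueError("need at least one value above x_min")
    return 1.0 + len(tail) / log_sum


def ks_distance(tail, x_min, beta):
    """Largest gap between the empirical CDF and the power law CDF."""
    ordered = sorted(tail)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        model = 1.0 - (x / x_min) ** (1.0 - beta)
        # Compare the model with the empirical step just before and after x.
        worst = max(worst, abs((i + 1) / n - model), abs(i / n - model))
    return worst


def sample_power_law(n, x_min, beta, rng):
    """Draw n values from a continuous power law by inverse transform."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (beta - 1.0))
            for _ in range(n)]


def bootstrap_p_value(tail, x_min, n_boot=200, seed=0):
    """Return (beta, P), where P is the share of synthetic samples whose
    KS distance is at least as large as the observed one."""
    rng = random.Random(seed)
    beta = fit_beta(tail, x_min)
    observed = ks_distance(tail, x_min, beta)
    worse = 0
    for _ in range(n_boot):
        synthetic = sample_power_law(len(tail), x_min, beta, rng)
        synthetic_beta = fit_beta(synthetic, x_min)
        if ks_distance(synthetic, x_min, synthetic_beta) >= observed:
            worse += 1
    return beta, worse / n_boot


# Hypothetical usage on synthetic data drawn from a known power law.
demo = sample_power_law(300, 1.0, 2.0, random.Random(42))
beta_hat, p_value = bootstrap_p_value(demo, x_min=1.0)
print(round(beta_hat, 2), p_value)
```

Under this scheme a large P means the observed deviation from the power law is typical of genuine power law samples, which matches the reading of P > 0.1 given above.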
Results obtained from power law analysis of each of the data sets characterized in Table 2
| Species | Sub-ontology | | | | |
|---|---|---|---|---|---|
| Hs | CC | 1.88 | 1.62 | | |
| Hs | MF | 2.05 | 1.75 | | |
| Hs | BP | 2.12 | 2.04 | | |
| Mm | CC | 1.9 | 1.5 | | |
| Mm | MF | 2.15 | 1.67 | 0.03 | |
| Mm | BP | 2.6 | 1.62 | 0.00 | |
β is the power law exponent and the P-value is a statistic that indicates how well the power law models the data. Statistically significant values are denoted in bold. The GO evidence codes used to define the high confidence (HC) and low confidence (LC) data sets are described in the Materials and Methods.
Figure 2. The power law exponent, β, as a function of the total number of distinct GO identifiers in each of the GO sub-ontologies referenced in Table 4, as well as a number of other species data sets taken from Ensembl.