Raul Torrieri, Francislon S Oliveira, Guilherme Oliveira, Roney S Coimbra.
Abstract
In recent years, there has been an exponential increase in the number of publicly available genomes. Once finished, most genome projects lack financial support to review annotations. A few of these gene annotations are based on a combination of bioinformatics evidence; in most cases, however, annotations are based solely on sequence similarity to a previously known gene, which was most probably annotated in the same way. As a result, a large number of predicted genes remain unassigned to any functional category even though there is enough evidence in the literature to predict their function. We developed a classifier trained with term-frequency vectors automatically extracted from text corpora of an ensemble of genes representative of each functional category of the J. Craig Venter Institute Comprehensive Microbial Resource (JCVI-CMR) ontology. The classifier achieved up to 84% precision with 68% recall (for confidence≥0.4), F-measure 0.76 (recall and precision equally weighted), in an independent set of 2,220 genes from 13 bacterial species, previously classified by JCVI-CMR into unambiguous categories of its ontology. Finally, the classifier assigned to functional categories (confidence≥0.7) a total of 5,235 of the ∼24 thousand genes previously in the categories "Unknown function" or "Unclassified" for which there is literature in MEDLINE. Two biologists reviewed the literature of 100 of these genes, randomly picked, and assigned them to the same functional categories predicted by the automatic classifier. Our results confirmed the hypothesis that it is possible to confidently assign genes of a real-world repository to functional categories based exclusively on the automatic profiling of their associated literature. The LitProf - Gene Classifier web server is accessible at: www.cebio.org/litprofGC.
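The relationship between the reported precision, recall and F-measure can be checked with a few lines of arithmetic. The sketch below assumes the balanced F-measure (harmonic mean of precision and recall), which is the usual reading of "recall and precision equally weighted"; the function name `f_measure` is illustrative and is not part of LitProf. With the rounded inputs 0.84 and 0.68 the result is approximately 0.75, close to the reported 0.76; the small gap presumably reflects rounding of the published values.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 weights precision and recall equally."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded values taken from the abstract (confidence >= 0.4).
score = f_measure(0.84, 0.68)
print(round(score, 3))
```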
Year: 2012 PMID: 23077617 PMCID: PMC3471813 DOI: 10.1371/journal.pone.0047436
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Taxonomy distribution of genes in the training dataset.
| Phylum | # of species |
| Acidobacteria | 1 |
| Actinobacteria | 5 |
| Aquificae | 1 |
| Bacteroidetes | 3 |
| Chlamydiae | 1 |
| Chlorobi | 1 |
| Chloroflexi | 4 |
| Crenarchaeota | 3 |
| Cyanobacteria | 9 |
| Deinococcus-Thermus | 2 |
| Euryarchaeota | 10 |
| Fibrobacteres | 1 |
| Firmicutes | 15 |
| Fusobacteria | 1 |
| Nanoarchaeota | 1 |
| Planctomycetes | 1 |
| Proteobacteria | 49 |
| Spirochaetes | 2 |
| Tenericutes | 3 |
| Thermotogae | 1 |
| Virus | 3 |
Gene distribution in the functional categories of the JCVI-CMR ontology.
| Functional category | Original dataset (%) | Training dataset (%) | Classified dataset (%) |
| Amino acid biosynthesis | 4102 (2.53) | 90 (2.54) | 115 (2.20) |
| Biosynthesis of cofactors, prosthetic groups, and carriers | 5482 (3.39) | 148 (4.18) | 179 (3.42) |
| Cell envelope | 227 (1.40) | 47 (1.33) | 41 (0.78) |
| Cellular processes | 17778 (10.99) | 353 (9.97) | 283 (5.41) |
| Central intermediary metabolism | 2517 (1.56) | 78 (2.20) | 92 (1.76) |
| DNA metabolism | 9238 (5.71) | 287 (8.10) | 344 (6.57) |
| Energy metabolism | 27132 (16.77) | 744 (21.01) | 1363 (26.04) |
| Fatty acid and phospholipid metabolism | 4823 (2.98) | 118 (3.33) | 253 (4.83) |
| Mobile and extrachromosomal element functions | 7716 (4.77) | 123 (3.47) | 119 (2.27) |
| Protein fate | 11611 (7.17) | 359 (10.14) | 550 (10.51) |
| Protein synthesis | 9044 (5.59) | 140 (3.95) | 149 (2.85) |
| Purines, pyrimidines, nucleosides, and nucleotides | 2678 (1.65) | 62 (1.75) | 77 (1.47) |
| Regulatory functions | 18817 (11.63) | 247 (6.97) | 327 (6.25) |
| Transcription | 2816 (1.74) | 86 (2.43) | 50 (0.96) |
| Transport and binding proteins | 25368 (15.68) | 363 (10.25) | 757 (14.46) |
| Mix category | 8297 (5.13) | 297 (8.39) | 536 (10.24) |
Only categories used to train the classifier are shown. The Mix category groups the noisy subcategories. The Original column refers to the complete J. Craig Venter Institute Comprehensive Microbial Resource (JCVI-CMR). The Training column refers to the dataset used to train the classifier. The Classified column refers to the “Unknown function” and “Unclassified” genes that were classified by LitProf - Gene Classifier with confidence≥0.7. There is no significant difference between the original and training datasets (p>0.05 in a paired t-test; 95% confidence level).
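The paired comparison mentioned in the note can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions, not the authors' procedure: it pairs the per-category percentages of the Original and Training columns (transcribed by hand from the table), computes the paired t statistic, and compares it with the two-tailed critical value of Student's t for 15 degrees of freedom at α = 0.05 (2.131). The helper name `paired_t` is hypothetical. Because each column of percentages sums to roughly 100, the mean difference is close to zero by construction, so the statistic lands well inside the non-significant range.

```python
import math
import statistics

# Percentages transcribed from the Original and Training columns of the table.
original = [2.53, 3.39, 1.40, 10.99, 1.56, 5.71, 16.77, 2.98,
            4.77, 7.17, 5.59, 1.65, 11.63, 1.74, 15.68, 5.13]
training = [2.54, 4.18, 1.33, 9.97, 2.20, 8.10, 21.01, 3.33,
            3.47, 10.14, 3.95, 1.75, 6.97, 2.43, 10.25, 8.39]


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


t_stat, df = paired_t(original, training)
T_CRIT = 2.131  # two-tailed critical value, Student's t, df = 15, alpha = 0.05
print(f"t = {t_stat:.3f}, df = {df}, significant = {abs(t_stat) > T_CRIT}")
```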
Figure 1. Recall vs. precision of the classifier.
The red line represents the average performance of the initial classifier trained with the original categories of the JCVI-CMR ontology. The blue line represents the average performance of the final classifier trained with a rearranged version of the ontology in which noisy subcategories were merged to create the Mix category. For the red and blue lines, the average was calculated from 100 replicates of 10-fold cross-validation. The green line represents the performance of the final classifier on an independent gene set. Horizontal bars represent the standard deviations of recall. The dashed lines represent the standard deviation of precision for the blue curve.
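The "100 replicates of 10-fold cross validation" scheme in the caption can be sketched as a generator of train/test index splits. This is an illustrative outline only: the function name `repeated_kfold`, the seed, and the unstratified shuffling are assumptions, and the record does not say how the authors partitioned their data. The dataset size used below, 3,542, is simply the sum of the gene counts in the Training column of the category table.

```python
import random


def repeated_kfold(n_items, k=10, repeats=100, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n_items))
        rng.shuffle(idx)  # a fresh random partition for every replicate
        folds = [idx[i::k] for i in range(k)]
        for f, test in enumerate(folds):
            train = [j for g, fold in enumerate(folds) if g != f for j in fold]
            yield r, f, train, test


# Two replicates are enough to show the shape of the output.
splits = list(repeated_kfold(n_items=3542, k=10, repeats=2))
print(len(splits))  # 2 replicates x 10 folds = 20
```

Per-threshold precision and recall would then be averaged over all splits, with the spread across replicates giving the standard deviations drawn in the figure.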
Summary of the classification of genes previously assigned to categories “Unknown function” and “Unclassified” of the JCVI-CMR ontology.
| Filters | # of genes |
| Total number of “Unknown function” and “Unclassified” genes | 69,088 |
| Genes that have a name | 34,033 |
| Genes with at least five abstracts in MEDLINE | 23,973 |
| Classified genes (confidence threshold≥0.7) | 5,235 |
Of the total number of “Unknown function” and “Unclassified” genes, nearly 50% have a name, which is crucial for text corpora retrieval. Of those, ∼70% have enough literature (min = five abstracts; max = 50) for classification, and in this group ∼22% could be assigned by LitProf - Gene Classifier to a functional category with high confidence.
JCVI-CMR = J. Craig Venter Institute Comprehensive Microbial Resource.
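The percentages quoted in the note follow directly from the counts in the table. A short sketch of that arithmetic, with the counts copied from the table and an illustrative helper name (`retention`):

```python
funnel = [
    ("Unknown function / Unclassified genes", 69088),
    ("Genes that have a name", 34033),
    ("Genes with at least five abstracts in MEDLINE", 23973),
    ("Classified genes (confidence >= 0.7)", 5235),
]


def retention(steps):
    """Return (label, count, percent of the previous step) for each filter."""
    return [(label, n, 100.0 * n / prev_n)
            for (_, prev_n), (label, n) in zip(steps, steps[1:])]


for label, n, pct in retention(funnel):
    print(f"{label}: {n} ({pct:.1f}% of previous step)")
```

The three ratios come out at about 49.3%, 70.4% and 21.8%, matching the "nearly 50%", "∼70%" and "∼22%" stated above.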
Examples of genes classified by LitProf - Gene Classifier and further validated by manually reviewing their literature.
| Name | JCVI-CMR Accession | Species | Predicted category | Confidence | PubmedIDs | GO Biological process (species with GO annotated ortholog) * |
| ArsR protein | NT01MC4786 | | Regulatory functions | 0.98 | 20724137; 20586430 | GO:0006355: regulation of transcription, DNA-dependent ( |
| Phosphatidylserine decarboxylase, putative | GSU_1908 | | Fatty acid and phospholipid metabolism | 0.96 | 14651609; 16667073 | GO:0006660: phosphatidylserine catabolic process; GO:0004609: phosphatidylserine decarboxylase ( |
| UmuD protein [Contains: UmuD protein] | NT03PS1033 | | DNA metabolism | 0.97 | 14651609; 16667073 | GO:0009432: SOS response; GO:0009650: UV protection; GO:0008236: serine-type peptidase activity ( |
| phage portal protein | NT03SP0558 | | Mobile and extrachromosomal element functions | 0.95 | 20467052; 19947526 | GO:0019068: virion assembly; GO:0019012: virion; GO:0005198: structural molecule activity ( |
| Putative metalloprotease | pc0037 | | Protein fate | 0.94 | 20838651; 20812964 | GO:0006508: proteolysis; GO:0008233: peptidase activity ( |
| Lambda Kil | ECH74115_3562 | | Mobile and extrachromosomal element functions | 0.98 | 12441108; 11470529 | - |
| bacteriophage tail fiber assembly protein | NT06EC2684 | | Mobile and extrachromosomal element functions | 0.95 | 20531477; 10051617 | - |
| staphylococcal respiratory response protein, SrrB | SAUSA300_1441 | | Regulatory functions | 0.92 | 17697253; 17198402 | - |
| Clp amino terminal domain protein | NT01NFA0344 | | Protein fate | 0.99 | 20014030; 19843523 | - |
| Putative malate dehydrogenase | nfa36620 | | Energy metabolism | 0.94 | 20127467; 19405028 | GO:0006108: malate metabolic process; GO:0016615: malate dehydrogenase activity ( |
| (R)-2-hydroxyglutaryl-CoA dehydratase activator | NT01CA2639 | | Regulatory functions | 0.76 | 11106419; 15374661 | GO:0006520: cellular amino acid metabolic process; GO:0008047: enzyme activator activity ( |
| putative beta-lactamase II | NT05LB0990 | | Cellular processes | 0.58 | 19407375; 16452624 | GO:0017001: antibiotic catabolic process; GO:0008800: beta-lactamase activity ( |
| Carbohydrate binding protein, cbp35C | CJA_0494 | | Transport and binding proteins | 0.80 | 20816499; 20713592 | - |
| Serine acetyltransferase, putative | GFRORF1528 | | Cellular processes | 0.58 | 20830571; 20189106 | GO:0006535: cysteine biosynthetic process from serine; GO:0009001: serine O-acetyltransferase ( |
| CoA ligase Family protein | NT01BT3039 | | Fatty acid and phospholipid metabolism | 0.81 | 20545743; 20534558 | GO:0008150: biological process; GO:0016878: acid-thiol ligase activity; GO:0016208: AMP binding ( |
| Modification methylase SalI | NT09RC1177 | | DNA metabolism | 0.81 | 9628360; 9130589 | - |
The GO terms from the Biological Process, Molecular Function and Cellular Component ontologies associated with each gene in Table 4 (or, in most cases, their prokaryotic orthologs) were retrieved from AmiGO (http://amigo.geneontology.org) by querying the database with their canonical gene names. In most cases the GO terms retrieved supported the functional categorization predicted by LitProf - Gene Classifier, although there is no exact correspondence between the GO and JCVI-CMR ontologies. Six of the 16 gene names tested had no match in AmiGO.
Number of genes assigned to functional categories with different confidence thresholds.
| Confidence threshold | # of genes |
| ≥0.9 | 1804 |
| ≥0.8 | 3479 |
| ≥0.7 | 5235 |
| ≥0.6 | 7085 |
| ≥0.5 | 9055 |
| ≥0.4 | 11797 |
| ≥0.3 | 15584 |
| ≥0.2 | 23397 |
| ≥0.1 | 23973 |
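Because each threshold includes every gene counted at the stricter thresholds above it, the table is cumulative. The sketch below, using only the counts from the table, derives how many genes fall into each confidence band; the function name `per_band` is illustrative.

```python
# Cumulative counts copied from the table: genes with confidence >= threshold.
cumulative = {0.9: 1804, 0.8: 3479, 0.7: 5235, 0.6: 7085, 0.5: 9055,
              0.4: 11797, 0.3: 15584, 0.2: 23397, 0.1: 23973}


def per_band(cum):
    """Genes that first qualify at each threshold, from strictest to loosest."""
    bands = {}
    prev = 0
    for t in sorted(cum, reverse=True):
        bands[t] = cum[t] - prev
        prev = cum[t]
    return bands


bands = per_band(cumulative)
for t, n in bands.items():
    print(f"first included at confidence >= {t:.1f}: {n} genes")
```

The band counts add up to 23,973, the number of genes with at least five abstracts in MEDLINE, so every gene with enough literature received a prediction with confidence of at least 0.1.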