Mikhail A Golyshev, Eugene V Korotkov.
Abstract
In recent years, a great number of bacterial genomes have been sequenced. One of the most important challenges of computational genomics is now the functional annotation of nucleic acid sequences. In this study, we present a computational method and an annotation system for predicting biological functions using phylogenetic profiles. The phylogenetic profile of a gene was created by searching for similarities between the nucleotide sequence of the gene and 1204 reference genomes and then estimating the statistical significance of the similarities found. The profiles of genes with known functions were used to predict possible functions and functional groups for new genes. We performed functional annotation of genes from 104 bacterial genomes and compared the functions predicted by our system with the already known functions. For genes that had already been annotated, the known function matched the predicted function 63% of the time, and 86% of the time the known function was found within the top five predicted functions. In addition, our system increased the share of annotated genes by 19%. The developed system may be used as an alternative or a complement to current annotation systems.
Year: 2015 PMID: 26770195 PMCID: PMC4684837 DOI: 10.1155/2015/635437
Source DB: PubMed Journal: Adv Bioinformatics ISSN: 1687-8027
Figure 1. Creation of the phylogenetic profiles database for genes with known functions. Function prediction for a new gene.
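
The workflow described in the abstract (build profiles for genes with known functions, then rank candidate functions for a new gene) can be illustrated with a minimal sketch. Several things here are assumptions that do not come from the source: a profile is represented as the set of reference-genome indices with a statistically significant similarity, profiles are compared with Jaccard similarity, and the function labels and the names `jaccard` and `predict_functions` are hypothetical. The authors' actual scoring procedure is not specified in this record.

```python
from collections import defaultdict


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two sets of genome indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def predict_functions(new_profile, known_profiles, top_n=5):
    """Rank candidate functions for a new gene (illustrative only).

    new_profile: set of reference-genome indices with a significant hit.
    known_profiles: list of (function_label, profile_set) pairs for
        genes whose function is already known.
    Returns up to top_n (function_label, score) pairs, best first.
    """
    best = defaultdict(float)
    for function, profile in known_profiles:
        score = jaccard(frozenset(new_profile), frozenset(profile))
        # Keep the best score seen for each function label.
        best[function] = max(best[function], score)
    # Sort by descending score, then by label for a stable order.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Toy data: genome indices and function labels are hypothetical.
known = [
    ("function_A", {0, 1, 2, 3, 5}),
    ("function_B", {1, 4, 6}),
    ("function_A", {0, 1, 2, 3}),
]
ranking = predict_functions({0, 1, 2, 3}, known)
print(ranking)
```

Returning a ranked list rather than a single label mirrors the evaluation reported in the abstract, where the known function is looked up both at the first position and within the top five predictions.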
Bacterial genomes grouped by annotation method.
| Annotation methods | Number of genomes | Group ID |
|---|---|---|
| NCBI, UniProt, TIGRfam, Pfam, PRIAM, KEGG, COG, InterPro, IMG-ER | 38 | GRP_1 |
| BLAST, homology | 28 | GRP_2 |
| GenDB, BLAST, COG, COGnitor | 7 | GRP_3 |
| InterPro(Scan) | 5 | GRP_4 |
| Total | 104 | GRP_ALL |
Figure 2. Subsets of genes under study.
Subsets of genes used to compare the current and previous annotations.
| Name | Description |
|---|---|
| | All genes under study |
| | The subset of genes from the |
| | The subset of genes from the |
| | The subset of genes from the |
| | The subset of genes from the |
| | The subset of genes from the |
| | The subset of genes from the |
| | The subset of genes from the |
Shares of the C1–C5 subsets in the total number of genes (N0). N1/N0 is the share of genes from the C1 set; N2/N0 is the share of genes from the C2 set; N3/N0 is the share of genes from the C3 set; N4/N0 is the share of genes from the C4 set; N5/N0 is the share of genes from the C5 set.
| Group ID | N0 | N1/N0 | N2/N0 | N3/N0 | N4/N0 | N5/N0 |
|---|---|---|---|---|---|---|
| GRP_1 | 144157 | 0.551 | 0.613 | 0.169 | 0.113 | 0.444 |
| GRP_2 | 82170 | 0.573 | 0.668 | 0.186 | 0.091 | 0.482 |
| GRP_3 | 24657 | 0.568 | 0.714 | 0.214 | 0.068 | 0.500 |
| GRP_4 | 20592 | 0.549 | 0.663 | 0.194 | 0.080 | 0.469 |
| GRP_ALL | 375151 | 0.563 | 0.658 | 0.186 | 0.091 | 0.472 |
Comparison of original and predicted functions. N5/N0 is the share of genes from the C5 set (these genes have both known and predicted functions), N6/N0 is the share of genes from the C6 set (genes from C5 for which the known function and the predicted function are equal), and N7/N0 is the share of genes from the C7 set (genes from C5 for which the known function differs from the predicted function).
| Group ID | N5/N0 | Perfect match, N6/N0 | Perfect match, N7/N0 | Fuzzy match, N6/N0 | Fuzzy match, N7/N0 |
|---|---|---|---|---|---|
| GRP_1 | 0.444 | 0.377 | 0.067 | 0.401 | 0.043 |
| GRP_2 | 0.482 | 0.420 | 0.062 | 0.444 | 0.038 |
| GRP_3 | 0.500 | 0.432 | 0.068 | 0.460 | 0.040 |
| GRP_4 | 0.469 | 0.384 | 0.085 | 0.419 | 0.050 |
| GRP_ALL | 0.472 | 0.407 | 0.065 | 0.432 | 0.040 |
Figure 3. Visualization of annotation results.
Distribution of the positions in the list of predicted functions at which the known function was found.
| Position of the known function in the list | Cumulative percentage of genes | Percentage of genes | Number of genes |
|---|---|---|---|
| 1 | 63.23 | 63.23 | 108806 |
| 2 | 77.12 | 13.89 | 23894 |
| 3 | 82.06 | 4.94 | 8498 |
| 4 | 84.64 | 2.58 | 4446 |
| 5 | 86.23 | 1.59 | 2743 |
| 6 | 87.36 | 1.13 | 1949 |
| 7 | 88.19 | 0.83 | 1433 |
| 8 | 88.84 | 0.65 | 1127 |
| 9 | 89.40 | 0.56 | 962 |
| 10 | 89.87 | 0.47 | 783 |
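
The cumulative column in the table above is the running sum of the per-position percentages. The short sketch below recomputes it from the "Percentage of genes" column; the values are copied from the table, while the variable and function names are illustrative choices.

```python
from itertools import accumulate

# Percentage of genes whose known function sits at list positions 1..10
# (values copied from the table above).
share_by_position = [63.23, 13.89, 4.94, 2.58, 1.59,
                     1.13, 0.83, 0.65, 0.56, 0.47]


def cumulative_shares(shares):
    """Running totals, rounded to two decimals to hide float noise."""
    return [round(total, 2) for total in accumulate(shares)]


cumulative = cumulative_shares(share_by_position)
top1 = cumulative[0]   # share of genes with the known function ranked first
top5 = cumulative[4]   # share with the known function in the top five
print(top1, top5)      # 63.23 86.23
```

The first and fifth entries correspond to the 63% and 86% figures quoted in the abstract.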
The average share of functions from a metabolic pathway of length Q in the lists of the top Q and top 2Q predicted functions.
| Approach | Size of predicted functions list | Metabolic pathway length range (Q): 1–4 | 5–9 | 10–14 | 15–19 | 20–24 | 25–30 |
|---|---|---|---|---|---|---|---|
| Perfect match | Q | 0.381 | 0.198 | 0.162 | 0.115 | 0.097 | 0.090 |
| Perfect match | 2Q | 0.415 | 0.235 | 0.196 | 0.144 | 0.125 | 0.115 |
| Fuzzy match | Q | 0.549 | 0.462 | 0.495 | 0.440 | 0.427 | 0.453 |
| Fuzzy match | 2Q | 0.588 | 0.530 | 0.572 | 0.538 | 0.543 | 0.578 |
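
One possible reading of the quantity in this table is: for a pathway containing Q functions, take the top Q (or top 2Q) entries of a ranked prediction list and report the fraction of the pathway's functions found there. The sketch below implements that reading only. The exact definition used by the authors is not given in this record, and the function name `pathway_share`, the `multiplier` parameter, and the toy labels are hypothetical.

```python
def pathway_share(pathway_functions, ranked_predictions, multiplier=1):
    """Fraction of a pathway's functions found in the top (multiplier * Q)
    predictions, where Q is the number of functions in the pathway."""
    q = len(pathway_functions)
    if q == 0:
        raise ValueError("pathway must contain at least one function")
    top = set(ranked_predictions[: multiplier * q])
    return len(top & set(pathway_functions)) / q


# Toy example: a pathway of length Q = 4 and a ranked prediction list.
pathway = ["f1", "f2", "f3", "f4"]
predictions = ["f1", "x1", "f3", "x2", "f2", "x3", "x4", "x5", "f4"]

share_q = pathway_share(pathway, predictions, multiplier=1)    # top 4 entries
share_2q = pathway_share(pathway, predictions, multiplier=2)   # top 8 entries
print(share_q, share_2q)  # 0.5 0.75
```

Averaging this value over all pathways whose length falls in a given range would give one cell of the table under this interpretation.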
Use of RAST and InterProScan to annotate genes from 10 bacterial genomes. The gene subsets C3, C4, and C5 are shown (Table 2).
| | | | |
|---|---|---|---|
| RAST | 17 | 48 | 9 |
| InterProScan | 12 | 44 | 13 |