Bradford C Powell, Clyde A Hutchison.
Abstract
BACKGROUND: Experimental verification of gene products has not kept pace with the rapid growth of microbial sequence information. However, existing annotations of gene locations contain sufficient information to screen for probable errors. Furthermore, comparisons among genomes become more informative as more genomes are examined. We studied all open reading frames (ORFs) of at least 30 codons from the genomes of 27 sequenced bacterial strains. We grouped the potential peptide sequences encoded by the ORFs by forming Clusters of Orthologous Groups (COGs). We used this grouping to find homologous relationships that would not be distinguishable from noise in simple BLAST searches. Although COG analysis was initially developed to group annotated genes, we applied it to the task of grouping anonymous DNA sequences that may encode proteins.
Year: 2006 PMID: 16423288 PMCID: PMC1386717 DOI: 10.1186/1471-2105-7-31
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Genomes included in this study
| Accession (a) | Name | Length (nt) | # of genes annotated (a) | # of ORFs (b) > 30 aa |
| | | 4202352 | 4066 | 73839 |
| | | 4214630 | 4106 | 75310 |
| | | 910724 | 850 | 10756 |
| | | 1042519 | 894 | 17211 |
| | | 1230230 | 1052 | 19259 |
| | | 3940880 | 3672 | 48244 |
| | | 3031430 | 2660 | 31417 |
| | | 4639221 | 4289 | 86919 |
| | | 5528970 | 5349 | 102747 |
| | Haemophilus influenzae Rd KW20 | 1830138 | 1709 | 27756 |
| | | 1643831 | 1491 | 21997 |
| | | 2944528 | 2855 | 45146 |
| | | 996422 | 726 | 13506 |
| | | 580074 | 480 | 8058 |
| | | 777079 | 633 | 10241 |
| | | 1211703 | 1016 | 14127 |
| | | 1358633 | 1037 | 17111 |
| | | 816394 | 688 | 13868 |
| MPUABCTIP | | 963879 | 782 | 13324 |
| | | 2272351 | 2025 | 42660 |
| | | 6264403 | 5566 | 92461 |
| RPXX | | 1111523 | 834 | 12029 |
| STYPHCT18 | | 4809037 | 4600 | 90974 |
| | | 2038615 | 2043 | 31733 |
| | | 1138011 | 1031 | 21937 |
| | | 751719 | 611 | 9173 |
| | | 2961149 | 2736 | 53378 |
| | | 1072315 | 1092 | 19506 |
(a) Accessions and annotated genes refer to Genome Reviews version 25.0.
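The ORF counts in the last column come from enumerating stop-to-stop reading frames on both strands. The following is a minimal sketch of that kind of enumeration, not the authors' implementation. The function names, the linear (non-circular) treatment of the sequence, and the default stop-codon set are assumptions; the mycoplasma genomes in the table would need a set without TGA, since TGA encodes tryptophan there.

```python
# Sketch: enumerate stop-to-stop ORFs of at least `min_codons` codons in all
# six reading frames. All names here are hypothetical, not from the paper.
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def _open_runs(seq, min_codons, stops):
    """Yield 0-based, end-exclusive (start, end) stop-free runs on one strand."""
    n = len(seq)
    for frame in range(3):
        run_start = frame
        last = frame + 3 * ((n - frame) // 3)  # end of the last complete codon
        for i in range(frame, last, 3):
            if seq[i:i + 3] in stops:
                if (i - run_start) // 3 >= min_codons:
                    yield run_start, i
                run_start = i + 3
        # Assumption: a run that reaches the end of a linear sequence counts.
        if (last - run_start) // 3 >= min_codons:
            yield run_start, last


def find_orfs(seq, min_codons=30, stops=STANDARD_STOPS):
    """Return 1-based inclusive (first, second) coordinates of each ORF.

    Minus-strand ORFs are reported with first > second, the convention used
    in the coordinate columns of the tables in this document.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = [(s + 1, e) for s, e in _open_runs(seq, min_codons, stops)]
    for s, e in _open_runs(reverse_complement(seq), min_codons, stops):
        orfs.append((n - s, n - e + 1))
    return orfs
```

Passing `stops=frozenset({"TAA", "TAG"})` would be one way to handle genomes in which TGA is read as a sense codon.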
Figure 1. Gene prediction based on sequence conservation. (A) and (B) show receiver operating characteristic (ROC) curves summarizing the sensitivity and specificity of gene prediction based on COG membership when compared to the current gene annotations. In (A), an ORF is classified as a gene if it is conserved in a COG at a certain stringency; in (B), the ORF must be in a COG that contains at least one annotated gene from another species. Curves are produced by examining COGs at different stringencies. At stringency 2, tests are very sensitive but not very specific (points at upper right of each panel). As stringency increases, specificity increases and sensitivity decreases (indicated by arrow). For clarity, full ROC curves are shown for only seven of the organisms studied and for the pooled result among all of the organisms studied. The plotting symbols and colors used in (A) and (B) are next to the organism names in (C). (C) shows the areas under the curves in (A) (black bars) and (B) (grey bars). The ROC curve of a perfect test would enclose an area of 1; for a completely arbitrary test the area would be 0.5. The organisms in (C) are ordered by the area under the ROC curve in (B).
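The way a stringency sweep traces out an ROC curve can be illustrated with a short sketch. It assumes each ORF is summarized by the highest stringency at which it still belongs to a COG (0 if it never does) and by whether it matches an annotated gene; this representation, the function names, and the trapezoid-rule area are illustrative assumptions rather than the procedure used for the figure.

```python
# Sketch: ROC points from a stringency sweep, plus trapezoid-rule area.
# Input format and names are hypothetical.

def roc_points(orfs, stringencies):
    """orfs: list of (max_stringency, is_annotated) pairs.

    An ORF is predicted to be a gene at stringency s if max_stringency >= s.
    Returns sorted (1 - specificity, sensitivity) points, including the
    trivial end points (0, 0) and (1, 1). Requires at least one annotated
    and one unannotated ORF (otherwise a rate would divide by zero).
    """
    positives = sum(1 for _, annotated in orfs if annotated)
    negatives = len(orfs) - positives
    points = {(0.0, 0.0), (1.0, 1.0)}
    for s in stringencies:
        tp = sum(1 for m, annotated in orfs if annotated and m >= s)
        fp = sum(1 for m, annotated in orfs if not annotated and m >= s)
        points.add((fp / negatives, tp / positives))
    return sorted(points)


def auc(points):
    """Area under a piecewise-linear curve through the sorted points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    toy = [(6, True), (5, True), (2, True), (3, False), (0, False)]
    print(auc(roc_points(toy, range(2, 7))))
```

Low stringencies contribute the points toward the upper right (high sensitivity, low specificity), and higher stringencies move the points toward the lower left, matching the direction of the arrow described in the caption.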
Figure 2. Presence of annotated genes in COGs of ORFs. Open reading frames (ORFs) of at least 90 nucleotides between stop codons were used to construct COGs at varying stringencies as described in the methods. Each COG was assigned to one of three groups: "All members match annotated genes" (contains only ORFs that correspond to annotated genes), "No members match annotated genes" (contains no ORFs that match annotated genes), or "Mixed" (contains some ORFs that correspond to annotated genes and some that do not). The numbers of ORFs in COGs of each of these classes are plotted along the y-axis on a logarithmic scale.
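The three-way grouping in this caption amounts to a simple membership test. A minimal sketch follows, assuming COGs are given as a mapping from a COG identifier to a list of ORF identifiers and that the ORFs matching annotated genes are known as a set; both data structures and all names are hypothetical.

```python
# Sketch: classify COGs as "all", "none", or "mixed" and count ORFs per class.
from collections import Counter


def classify_cog(members, annotated):
    """members: list of ORF ids in one COG; annotated: set of ORF ids that
    correspond to annotated genes."""
    hits = sum(1 for orf in members if orf in annotated)
    if hits == len(members):
        return "all"
    if hits == 0:
        return "none"
    return "mixed"


def orfs_per_class(cogs, annotated):
    """Count the ORFs (not the COGs) that fall in each class, which is the
    quantity plotted on the y-axis of the figure."""
    counts = Counter()
    for members in cogs.values():
        counts[classify_cog(members, annotated)] += len(members)
    return counts


if __name__ == "__main__":
    cogs = {1: ["a", "b", "c"], 2: ["d", "e"], 3: ["f", "g", "h"]}
    print(orfs_per_class(cogs, annotated={"a", "b", "c", "f"}))
```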
ORFs in majority-annotated mixed COGs of stringency 6 that may represent missed genes
| ORF COG id (a) | Organism | Genomic coordinates (b) | Annotated gene(s) present in COG (c) | ORF COG id (a) | Organism | Genomic coordinates (b) | Annotated gene(s) present in COG (c) |
| Potential genes missed in current annotations | | | | Potential genes missed in current annotations (continued) | | | |
| 678 | Bbur | 117772-116825 | | 397 | Nmen | 340008-339358 | |
| 314 | Bhal | 1503738-1503905 | | 871 | Nmen | 554238-552676 | |
| 314 | Bsub | 2477091-2476963 | | 723 | Nmen | 666433-665363 | |
| 2346 | Bsub | 4202360-4202148 | | 119 | Nmen | 690163-687386 | |
| 1717 | Cace | 243535-242696 | | 1382 | Nmen | 1056138-1057340 | |
| 1908 | Cace | 1395172-1395522 | | 464 | Nmen | 1147918-1149261 | |
| 2064 | Cace | 2284461-2283778 | | 464 | Nmen | 1179954-1181297 | |
| 148 | Cace | 3287735-3286509 | | 2743 | Nmen | 1400226-1401977 | |
| 1840 | Cace | 3650828-3649308 | | 978 | Nmen | 1484110-1486353 | |
| 659 | Cace | 3842459-3840768 | | 635 | Nmen | 1527781-1528521 | |
| 1551 | EcoK12 | 311756-311598 | | 1248 | Nmen | 1629570-1628017 | |
| 148 | EcoK12 | 3469408-3468167 | | 2793 | Nmen | 1749455-1752016 | |
| 1551 | EcoO157 | 344941-344783 | | 618 | Nmen | 2119341-2120882 | |
| 2748 | EcoO157 | 4240898-4240665 | | 618 | Nmen | 2124720-2128169 | |
| 2531 | Hinf | 131970-132959 | | 788 | Nmen | 2199859-2200686 | |
| 2319 | Hinf | 170676-169396 | | 2519 | Paer | 224101-225219 | |
| 2432 | Hinf | 235913-238519 | | 1385 | Paer | 434829-433933 | |
| 2947 | Hinf | 370735-372912 | | 38 | Paer | 4143744-4142569 | |
| 1098 | Hpyl | 315887-316504 | | 2748 | Sent | 4247574-4247864 | |
| 309 | Lmon | 640139-639558 | | 192 | Tpal | 213049-213270 | |
| 2023 | Mgen | 180733-181020 | | 653 | Tpal | 624206-625738 | |
| 994 | Mmob | 102995-102588 | | 890 | Tpal | 946250-944889 | |
| 3131 | Mmob | 201807-201646 | | 946 | Tpal | 1032059-1031772 | |
| 3175 | Mmob | 317659-317411 | | 39 | Upar | 3002-3886 | |
| 3186 | Mmob | 449811-451241 | | 142 | Upar | 3861-4427 | |
| 3000 | Mmyc | 441031-441783 | | 3131 | Upar | 725869-726024 | |
| 542 | Mmyc | 441031-441783 | | 38 | VchoI | 709524-710558 | |
| 199 | Mmyc | 830915-830742 | | 2932 | VchoI | 1045279-1044317 | |
| 73 | Mmyc | 831148-830924 | | 2947 | VchoI | 1627856-1625871 | |
| 182 | Mmyc | 836915-836712 | | 1246 | VchoI | 2869620-2871836 | |
| 3175 | Mmyc | 973088-973423 | | 2793 | VchoII | 295059-292882 | |
| 3131 | Mmyc | 1089962-1090141 | | 2621 | VchoII | 299032-300000 | |
| 314 | Mmyc | 1089962-1090141 | | 2699 | VchoII | 406033-405167 | |
| 1670 | Mpen | 2755-3009 | | 2573 | VchoII | 987698-986424 | |
| 3131 | Mpen | 1191375-1191163 | | 2340 | VchoII | 1026697-1023563 | |
| 879 | Mpen | 1226934-1226722 | | | | | |
| 199 | Mpen | 1317088-1316960 | | Gene annotated in different frame (d) | | | |
| 166 | Mpen | 1327926-1326898 | | 1769 | Bhal | 251734-251429 | |
| 2023 | Mpne | 207436-207717 | | 3183 | Mpul | 130854-130480 | |
| 2090 | Nmen | 70930-70358 | | 3175 | Mpul | 412829-413074 | |
| 148 | Nmen | 149590-150777 | | 946 | Rpro | 433751-433479 | |
| 2564 | Nmen | 238562-237666 | | 363 | Tpal | 262583-262897 | |
| 2572 | Nmen | 299359-298070 | | | | | |
(a) The identifiers for COGs are local to this study. They do not correspond to numbers in the NCBI COG database.
(b) Coordinates in which the first number is greater than the second indicate that the ORF is on the minus strand.
(c) A named annotated putative ortholog in another organism, or paralog within the same organism, of the ORF listed.
(d) These COGs may indicate both that the ORF listed is a missed gene and that the annotated
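The strand convention in the coordinate footnote can be made explicit with a small parser. This is a sketch under the assumption that both coordinates are 1-based and inclusive; whether the listed spans include a stop codon is not stated here, so the length is only the span in nucleotides. The function name is hypothetical.

```python
# Sketch: interpret a "first-second" coordinate string from the tables.

def parse_coordinates(text):
    """Return (start, end, strand, length_nt) with start <= end.

    A first number greater than the second marks the minus strand.
    Assumes 1-based inclusive coordinates, so length = end - start + 1.
    """
    first, second = (int(part) for part in text.split("-"))
    strand = "-" if first > second else "+"
    start, end = min(first, second), max(first, second)
    return start, end, strand, end - start + 1


if __name__ == "__main__":
    print(parse_coordinates("117772-116825"))    # minus strand
    print(parse_coordinates("1503738-1503905"))  # plus strand
```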
ORFs in majority-annotated mixed COGs that do not appear to represent missed genes
| ORF COG id (a) | Organism | Genomic coordinates (b) | Annotated gene(s) present in COG (c) | ORF COG id (a) | Organism | Genomic coordinates (b) | Annotated gene(s) present in COG (c) |
| Existing annotation of pseudogene | | | | Frameshift 3' fragment (c) | | | |
| 876 | EcoK12 | 1488620-1487985 | | 1036 | Bbur | 21098-20445 | |
| 2340 | Sent | 4738725-4740071 | | 1750 | Bhal | 984866-983856 | |
| 2433 | Sent | 4745051-4743573 | | 1188 | Bhal | 1359362-1360555 | |
| 1895 | Sent | 3243737-3244861 | | 2257 | Bhal | 3182850-3181696 | |
| 1399 | Sent | 461578-461874 | | 88 | Bsub | 3671944-3672555 | |
| 653 | Sent | 2505700-2506824 | | 641 | Hinf | 1525427-1524561 | |
| 1058 | Sent | 3413535-3416306 | | 2031 | Hinf | 1719924-1718821 | |
| 3088 | Sent | 4084807-4083605 | | 2473 | Mgal | 431452-431778 | |
| 815 | Sent | 1360931-1362226 | | 2309 | Mgen | 416785-416336 | |
| 3104 | Sent | 4009730-4009993 | | 975 | Mmyc | 57011-56760 | |
| 569 | Sent | 1969437-1970648 | | 686 | Mmyc | 690895-690356 | |
| | | | | 1319 | Nmen | 107757-109406 | |
| Annotated in GenomeReviews but with different stop | | | | 842 | Nmen | 1995043-1994876 | |
| 928 | Bsub | 2500726-2499347 | | 556 | VchoI | 553588-552383 | |
| 157 | Cper | 2751593-2751051 | | 745 | VchoI | 555313-556182 | |
| 999 | EcoO157 | 3613249-3610595 | | 106 | VchoI | 1087924-1089819 | |
| 107 | Hinf | 655042-654365 | | 2435 | VchoI | 2612949-2611972 | |
| 589 | Mpne | 329463-331229 | | 2807 | VchoII | 1060889-1060107 | |
| 210 | Mpul | 150772-151668 | | | | | |
| 743 | Sent | 2492196-2490763 | | Frameshift 5' fragment (c) | | | |
| 534 | Tpal | 478406-478777 | | 697 | Bhal | 3580443-3579682 | |
| 166 | Upar | 279005-279949 | | 2029 | Bsub | 2304436-2305248 | |
| | | | | 2049 | Bsub | 3032201-3032512 | |
| Fragments around stop codons (nonsense) (c) | | | | 2 | Cpne | 383405-384037 | |
| 928 | Bsub | 2500726-2499347 | | 462 | Cpne | 1088259-1088711 | |
| 157 | Cper | 2751593-2751051 | | 2769 | EcoK12 | 3814680-3813886 | |
| 999 | EcoO157 | 3613249-3610595 | | 2257 | EcoK12 | 3948538-3949566 | |
| 107 | Hinf | 655042-654365 | | 2433 | Hinf | 232074-232991 | |
| 589 | Mpne | 329463-331229 | | 3066 | Hinf | 1377365-1378063 | |
| 210 | Mpul | 150772-151668 | | 1075 | Hinf | 1477189-1476557 | |
| 743 | Sent | 2492196-2490763 | | 641 | Hinf | 1526028-1525285 | |
| | | | | 2571 | Nmen | 292645-294051 | |
| | | | | 220 | Tpal | 220772-221749 | |
| | | | | 556 | VchoI | 554244-553561 | |
| | | | | 1826 | VchoI | 637551-638246 | |
| | | | | 42 | VchoI | 851189-849954 | |
| | | | | 1082 | VchoII | 690599-690273 | |
(a) The identifiers for COGs are local to this study. They do not correspond to numbers in the NCBI COG database.
(b) Coordinates in which the first number is greater than the second indicate that the ORF is on the minus strand.
(c) A named annotated putative ortholog in another organism, or paralog within the same organism, of the ORF listed.
(d) These categories represent probable pseudogenes or sequencing errors.
Minority-annotated mixed COGs of stringency 6
| ORF COG id (a) | Organism | Genomic coordinates (b) | Annotated locus tag | Explanation for similarity |
| 458 | Bhal | 2607307-2607975 | | ambiguous-- |
| 2939 | Lmon | 2784312-2784674 | | opposite strand |
| 3041 | Mgen | 400107-399841 | | opposite strand tRNA cluster |
| 715 | Mmob | 474080-474634 | | opposite strand tRNA cluster |
| 1171 | Mmyc | 315687-315178 | | opposite strand annotated gene |
| 1172 | Mmyc | 315687-315178 | | opposite strand annotated gene |
| 3172 | Mpul | 547792-547565 | | opposite strand RNA-gene ( |
| 169 | Mpul | 703396-704043 | | ribosomal protein in opposite strand |
| 1148 | Mpul | 706478-707455 | | opposite strand ribosomal protein |
| 1436 | Spne | 199207-198743 | | opposite strand ribosomal protein |
| 400 | Tpal | 321084-317926 | | region upstream of gene is opposite |
| 625 | Tpal | 580802-581407 | | opposite strand |
(a) COG identifiers are local to this study.
(b) Coordinates in which the first number is greater than the second indicate that the ORF is on the minus strand.
Mixed COGs containing ORFs from Mycoplasma genitalium that do not correspond to annotated genes
| Coordinates (a) | Strand | COG id (b) | Description |
| 180733-181020 | + | 4-3347 | Homologous to genes in 12 other organisms, some annotated as N-utilization substance |
| 237114-237299 | - | 4-1487 | Deletion of 'C' at 237175 joins this to the gene (MG199) annotated at 236591-237084. Together the joined fragments are similar to ribonuclease genes. [GenBank: |
| 416336-416785 | - | 4-3943 | Deletion of 'G' at 416710 joins this to fragment at 416661-416939. Together the joined fragments are similar to acyl carrier protein diesterases. [GenBank: |
| 290638-291003 | + | 4-8314 | Insertion of 'T' at 290983 joins this fragment to the gene (MG243) annotated at 290922-291326. Together the joined fragments are similar to hypothetical genes in |
(a) Coordinates and insertions/deletions refer to [GenBank:NC_000908.1].
(b) COG identifiers are local to this study.
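Several rows of this table describe a single-base deletion or insertion that joins two fragments into one longer reading frame. The toy sketch below illustrates the idea on a made-up sequence, not on the M. genitalium genome. The helper names are hypothetical, the coordinate semantics (1-based; insertion placed before the given position) are assumptions, and the default stop set omits TGA because mycoplasmas read TGA as tryptophan.

```python
# Sketch: show how removing one base restores a longer open reading frame.
MYCOPLASMA_STOPS = frozenset({"TAA", "TAG"})  # assumption: TGA is a sense codon


def delete_base(seq, pos):
    """Delete the base at 1-based position `pos`."""
    return seq[:pos - 1] + seq[pos:]


def insert_base(seq, pos, base):
    """Insert `base` so that it occupies 1-based position `pos`."""
    return seq[:pos - 1] + base + seq[pos - 1:]


def codons_until_stop(seq, start=0, stops=MYCOPLASMA_STOPS):
    """Count codons read from 0-based `start` before the first in-frame stop
    (or before the sequence runs out of complete codons)."""
    count = 0
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            break
        count += 1
    return count


if __name__ == "__main__":
    # Toy sequence with one extra 'T' at position 13 that shifts the frame
    # and exposes a premature TAA.
    shifted = "ATG" + "GCA" * 3 + "T" + "GCT" + "AAA" + "GCA" * 3 + "TAA"
    repaired = delete_base(shifted, 13)
    print(codons_until_stop(shifted), codons_until_stop(repaired))
```

In the toy case the reading frame is interrupted after five codons before the edit and runs for nine codons after it, which is the kind of joining the table rows describe.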
Figure 3. COGs at varying stringencies. Stringency sets a minimum level of interconnectedness among the elements of a COG. As stringency increases, COGs may split into smaller COGs and less-connected nodes are dropped. Each vertex represents a gene (as used in the initial definition of COGs) or an ORF (as used in this study). Edges represent bidirectional best-hit pairs. Dashed lines enclose elements of a single COG. Grayed vertices and edges do not participate in a COG at the given stringency. There is a single COG of stringency (2) containing all of the vertices in this graph because they are all transitively connected. Stringency (3) COGs are as described by Tatusov et al. [21]. An orthologous group of stringency (3) forms a triangle (such as {i, j, k}); orthologous groups of stringency (3) are clustered if they share two vertices (equivalently, if they share an edge). Orthologous groups of stringency (4) are clustered if they share three vertices. The orthologous groups {j, k, l, m} and {l, m, n, o} share only two vertices, so they form two separate COGs. At stringency (5) only one orthologous group, and thus only one COG, remains.
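The clustering rule in this caption can be sketched as follows, under the assumption that an orthologous group of stringency k is a set of k mutually connected vertices (a k-clique) and that groups sharing k - 1 vertices are merged. The caption states this for k = 3 and implies it for k = 4; the generalization, the brute-force clique search, and all names are assumptions, and the sketch ignores any requirement that members come from different genomes.

```python
# Sketch: COGs at stringency k as merged k-cliques of the best-hit graph.
# Brute-force enumeration; only suitable for small illustrative graphs.
from itertools import combinations


def k_cliques(edges, k):
    """Return all sets of k vertices in which every pair is joined by an edge."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return [frozenset(group)
            for group in combinations(sorted(adjacency), k)
            if all(b in adjacency[a] for a, b in combinations(group, 2))]


def cogs_at_stringency(edges, k):
    """Merge k-cliques that share at least k - 1 vertices (union-find) and
    return the resulting COGs as sorted lists of vertices."""
    cliques = k_cliques(edges, k)
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    merged = {}
    for i, clique in enumerate(cliques):
        merged.setdefault(find(i), set()).update(clique)
    return sorted(sorted(group) for group in merged.values())


if __name__ == "__main__":
    # Toy graph loosely modeled on the caption, not the actual figure:
    # triangle {i, j, k} plus two 4-cliques that share only {l, m}.
    edges = [("i", "j"), ("i", "k"), ("j", "k"),
             ("j", "l"), ("j", "m"), ("k", "l"), ("k", "m"), ("l", "m"),
             ("l", "n"), ("l", "o"), ("m", "n"), ("m", "o"), ("n", "o")]
    for k in (2, 3, 4, 5):
        print(k, cogs_at_stringency(edges, k))
```

On this toy graph, stringencies 2 and 3 each give one COG containing every vertex, stringency 4 splits it into {j, k, l, m} and {l, m, n, o} and drops vertex i, and stringency 5 leaves nothing because the toy graph has no fully connected set of five vertices.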
Figure 4. SPROCKET. The SPROCKET program was developed to facilitate the analysis performed in this study. For the members of a COG, a user can view the peptide sequence alignment (using CLUSTALW), a graph of the best-hit relationships, and the genomic context.