Feng-Biao Guo, Lifeng Xiong, Jade L L Teng, Kwok-Yung Yuen, Susanna K P Lau, Patrick C Y Woo.
Abstract
In this paper, we performed a comprehensive re-annotation of protein-coding genes in 10 complete bacterial genomes of the family Neisseriaceae, using a systematic method that combines composition- and similarity-based approaches. First, 418 hypothetical genes were predicted to be non-coding by the composition-based method, and 413 of them were eliminated from the gene list. Both the scatter plot and the cluster of orthologous groups (COG) fraction analyses supported this result. Second, between 20 and 400 hypothetical proteins were assigned functions in each of the 10 strains on the basis of homology searches. Among the newly assigned functions, 397 are detailed enough to carry definite gene names. Third, 106 genes missed by the original annotations were recovered by an ab initio gene finder combined with similarity alignment. Transcriptional experiments validated the effectiveness of this method in Laribacter hongkongensis and Chromobacterium violaceum. Some of the 106 newly found genes deserve particular interest. For example, 27 transposases were newly found in Neisseria meningitidis alpha14. In Neisseria gonorrhoeae NCCP11945, four new genes with putative functions and definite names (nusG, rpsN, rpmD and infA) were found; their homologues are usually essential for survival in bacteria. The updated annotations for the 10 Neisseriaceae genomes provide more accurate predictions of protein-coding genes and more detailed functional information for hypothetical proteins. They will benefit research into the lifestyle, metabolism, environmental adaptation and pathogenicity of the Neisseriaceae species. The re-annotation procedure could be used directly, or after adapting the detailed methods, to check the annotations of other bacterial or archaeal genomes.
Keywords: eliminated non-coding ORFs; newly assigned functions; newly found genes; re-annotation; the Neisseriaceae family
Year: 2013 PMID: 23571676 PMCID: PMC3686433 DOI: 10.1093/dnares/dst009
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Figure 1. Distribution of GC2 versus GC3 for four types of sequences. The x-axis indicates the value of GC2 and the y-axis denotes the value of GC3. (A) For 969 function-known genes in L. hongkongensis; (B) for 86 predicted non-coding genes in L. hongkongensis; (C) for 915 retained hypothetical genes in L. hongkongensis and (D) for 20 horizontally transferred genes in P. aeruginosa.
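GC2 and GC3, the two axes of Figure 1, are the G + C fractions at the second and third codon positions of a sequence. The following is a minimal Python sketch of that calculation. The function name `gc_by_codon_position` and the toy sequence are illustrative assumptions, not code or data from the study.

```python
def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3): the G+C fraction at each codon position.

    Hypothetical helper for illustration; trailing bases that do not
    complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # keep whole codons only
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    fractions = []
    for offset in range(3):
        # Every third base starting at this codon position.
        bases = cds[offset:usable:3]
        gc_count = sum(base in "GC" for base in bases)
        fractions.append(gc_count / len(bases))
    return tuple(fractions)


# Toy input with two codons, ATG and GCC (not a real gene).
gc1, gc2, gc3 = gc_by_codon_position("ATGGCC")
print(f"GC2={gc2:.2f} GC3={gc3:.2f}")  # prints GC2=0.50 GC3=1.00
```

Plotting GC3 against GC2 for each annotated ORF would then give a scatter of the kind the caption describes.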
Statistical information on the genomes of the 10 Neisseriaceae strains
| Strains | Published date (a) | Gene density (genes/kb) | Genome size (bp) | Gene number | G + C content (%) | First class gene (ratio) | Third class gene (ratio) |
|---|---|---|---|---|---|---|---|
| | 18 September 2003 | 0.927 | 4 751 080 | 4405 | 64.8 | 1494 (33.9%) | 1714 (38.9%) |
| | 14 April 2009 | 1.021 | 3 169 329 | 3235 | 62.4 | 969 (30.0%) | 1001 (30.1%) |
| | 15 February 2005 | 0.930 | 2 153 922 | 2002 | 52.7 | 266 (13.3%) | 800 (40.0%) |
| | 9 July 2008 | 1.195 | 2 232 025 | 2668 | 52.4 | 244 (9.1%) | 996 (37.3%) |
| | 16 December 2010 | 0.888 | 2 220 606 | 1972 | 52.3 | 744 (37.7%) | 657 (33.3%) |
| | 3 December 2007 | 0.938 | 2 153 416 | 2020 | 51.7 | 875 (43.3%) | 645 (31.9%) |
| | 24 July 2009 | 0.872 | 2 145 295 | 1872 | 52.0 | 865 (46.2%) | 467 (25.0%) |
| | 10 March 2000 | 0.908 | 2 272 360 | 2063 | 51.5 | 806 (39.1%) | 810 (39.3%) |
| | 30 March 2000 | 0.874 | 2 184 406 | 1909 | 51.8 | 459 (24.0%) | 669 (35.0%) |
| | 8 September 2011 | 0.926 | 4 332 995 | 4012 | 64.4 | 704 (17.5%) | 831 (20.7%) |
(a) Information on publication dates was extracted from http://www.genomesonline.org/.
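The gene density column appears to be the gene number divided by the genome size in kilobases, and the class ratios appear to be class counts divided by the gene number. The short Python sketch below reproduces the first row under that assumption. The function names are hypothetical and the interpretation is inferred from the table values, not stated in the source.

```python
def gene_density_per_kb(gene_number, genome_size_bp):
    """Genes per kilobase of genome (assumed definition of gene density)."""
    return gene_number / (genome_size_bp / 1000)


def class_ratio_percent(class_count, gene_number):
    """Share of all annotated genes that fall in one class, in percent."""
    return 100 * class_count / gene_number


# Values taken from the first row of the table.
density = gene_density_per_kb(4405, 4_751_080)
first_class = class_ratio_percent(1494, 4405)
third_class = class_ratio_percent(1714, 4405)
print(round(density, 3), round(first_class, 1), round(third_class, 1))
# prints 0.927 33.9 38.9
```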
Accuracy based on 5-fold cross-validation and the COG ratio in each strain
| Strain | Accuracy of the method (%) | Class 1 with COG (% ratio) | Predicted non-coding ORFs | Predicted non-coding with COG (% ratio) |
|---|---|---|---|---|
| | 100 | 1457 (97.5) | 88 | 0 (0) |
| | 99.90 | 954 (98.7) | 86 | 1 (1.2) |
| | 100 | 263 (98.9) | 24 | 1 (4.2) |
| | 100 | 241 (98.9) | 56 | 0 (0) |
| | 99.87 | 725 (97.4) | 8 | 0 (0) |
| | 99.77 | 851 (97.3) | 37 | 1 (2.7) |
| | 99.77 | 840 (97.1) | 25 | 1 (4.0) |
| | 99.75 | 790 (98.0) | 48 | 0 (0) |
| | 100 | 456 (99.3) | 12 | 0 (0) |
| | 99.86 | 683 (97.0) | 34 | 1 (2.9) |
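The COG fraction argument rests on the contrast between the two groups: almost all class 1 genes have a COG assignment, while almost none of the predicted non-coding ORFs do. The Python sketch below pools the per-strain counts of predicted non-coding ORFs from the table. The variable names are illustrative, and the pooled figure is a derived summary that the source does not report.

```python
# Per-strain counts copied from the table, in row order.
predicted_noncoding = [88, 86, 24, 56, 8, 37, 25, 48, 12, 34]
noncoding_with_cog = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1]

total_noncoding = sum(predicted_noncoding)
total_with_cog = sum(noncoding_with_cog)
pooled_percent = 100 * total_with_cog / total_noncoding

print(total_noncoding, total_with_cog, round(pooled_percent, 1))
# prints 418 5 1.2
```

The pooled total of 418 matches the number of predicted non-coding genes given in the abstract.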
Numbers of revised genes, comprising newly found genes, hypothetical genes with newly assigned functions, eliminated ORFs and disrupted ORFs
| Strains | Newly found genes | Newly assigned functions | Eliminated ORFs | Disrupted ORFs |
|---|---|---|---|---|
| | 10 | 120 | 88 | 0 |
| | 8 | 20 | 85 | 2 |
| | 18 | 400 | 23 | 23 |
| | 9 | 207 | 56 | 0 |
| | 5 | 218 | 8 | 8 |
| | 6 | 214 | 36 | 19 |
| | 30 | 214 | 24 | 14 |
| | 11 | 331 | 48 | 27 |
| | 8 | 299 | 12 | 20 |
| | 1 | 46 | 33 | 0 |
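The per-strain counts in this table can be cross-checked against the totals quoted in the abstract (106 newly found genes, 413 eliminated ORFs, and 20 to 400 newly assigned functions per strain). A small Python sketch of that check follows; the variable names are illustrative.

```python
# Columns copied from the table, in row order.
newly_found = [10, 8, 18, 9, 5, 6, 30, 11, 8, 1]
newly_assigned = [120, 20, 400, 207, 218, 214, 214, 331, 299, 46]
eliminated = [88, 85, 23, 56, 8, 36, 24, 48, 12, 33]

print(sum(newly_found))                          # prints 106
print(sum(eliminated))                           # prints 413
print(min(newly_assigned), max(newly_assigned))  # prints 20 400
```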
Details of the eight newly found genes in the genome of L. hongkongensis
| ID | Position | COG | Coverage, E-value, identity | Potential function |
|---|---|---|---|---|
| LHK_A0001 | 476554–477024 (+) | COG0560E | 92%, 1e−49, 55% | Phosphoserine phosphatase |
| LHK_A0002 | 880771–881358 (+) | COG1961L | 97%, 2e−72, 60% | Site-specific recombinases |
| LHK_A0003 | 1391850–1392488 (−) | COG2869C | 98%, 6e−66, 51% | Na+-transporting nicotinamide adenine dinucleotide |
| LHK_A0004 | 1570970–1571566 (+) | COG2864C | 100%, 1e−110, 81% | Thiosulphate reductase cytochrome subunit B |
| LHK_A0005 | 1848723–1850306 (−) | COG0733R | 63%, 3e−156, 73% | Na+-dependent transporters of the sodium: neurotransmitter symporter family |
| LHK_A0006 | 2282651–2283334 (+) | COG0778C | 77%, 3e−73, 66% | Putative Cob(II)yrinic acid a,c-diamide reductase (BluB) |
| LHK_A0007 | 2660234–2660422 (−) | COG0103J | 96%, 2e−28, 88% | Ribosomal protein S9 |
| LHK_A0008 | 2875641–2876057 (−) | COG0824R | 91%, 7e−55, 64% | Predicted thioesterase |
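Each Position entry gives start and end coordinates plus the strand. One quick sanity check for a protein-coding gene is that its length is a whole number of codons. The Python sketch below parses the coordinate strings as printed in the table, including the en dash and the typographic minus sign. The helper `parse_position` is hypothetical and assumes 1-based, inclusive coordinates, which the source does not state explicitly.

```python
import re

# Accepts an en dash or hyphen between coordinates, and "+", "−" or "-" as strand.
POSITION = re.compile(r"(\d+)\s*[–-]\s*(\d+)\s*\(([+−-])\)")


def parse_position(text):
    """Return (start, end, strand, length_bp) for a 'start–end (strand)' string."""
    match = POSITION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised position: {text!r}")
    start, end = int(match.group(1)), int(match.group(2))
    strand = "+" if match.group(3) == "+" else "-"
    length_bp = end - start + 1  # assumes inclusive coordinates
    return start, end, strand, length_bp


# LHK_A0007, the ribosomal protein S9 candidate from the table.
start, end, strand, length_bp = parse_position("2660234–2660422 (−)")
print(strand, length_bp, length_bp % 3)  # prints - 189 0
```

Under the inclusive-coordinate assumption, all eight entries in the table have lengths divisible by three.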
Figure 2. RT–PCR confirmations of eight newly found genes in L. hongkongensis. mRNAs corresponding to candidate genes were evaluated by RT–PCR (RT). A sample without reverse transcriptase served as the negative control (NRT), and PCR with genomic DNA served as the positive control (PC).
Details of the 10 newly found genes in the genome of C. violaceum
| ID | Position | COG | Coverage, E-value, identity | Potential function |
|---|---|---|---|---|
| CV_A0001 | 486035–487360 (+) | COG0845Q | 100%, 0, 72% | Membrane-fusion protein |
| CV_A0002 | 487503–489461 (+) | COG0750M | 100%, 0, 71% | Membrane-associated Zn-dependent proteases |
| CV_A0003 | 1025430–1025810 (+) | COG3536S | 89%, 4e−67, 83% | Uncharacterized bacterial conserved region (BCR) |
| CV_A0004 | 1026544–1027152 (+) | COG3165S | 95%, 6e−88, 66% | Uncharacterized BCR |
| CV_A0005 | 1956990–1958684 (+) | COG0654HC | 88%, 3e−177, 53% | 2-polyprenyl-6-methoxyphenol hydroxylase |
| CV_A0006 | 2305072–2305416 (−) | COG3628R | 99%, 4e−53, 71% | Phage baseplate assembly protein |
| CV_A0007 | 2362585–2363220 (−) | COG0693R | 92%, 1e−70, 54% | Intracellular protease/amidase |
| CV_A0008 | 4207595–4208386 (+) | COG0614P | 85%, 6e−94, 67% | ABC-type cobalamin/Fe3+-siderophores transport systems |
| CV_A0009 | 4462117–4463199 (−) | COG0438M | 96%, 5e−137, 62% | Predicted glycosyltransferases |
| CV_A0010 | 4588140–4588436 (−) | COG1872S | 98%, 3e−47, 75% | Uncharacterized ancient conserved region |
Figure 3. RT–PCR confirmations of newly found potential genes in C. violaceum.
PCR primers and annealing temperature for the eight newly found genes in L. hongkongensis
| ID | Primer pair | Primer sequence | Annealing temperature (°C) |
|---|---|---|---|
| LHK_A0001 | LPW20036 | GCATGCCGAATTCCTCGAAG | 60 |
| LHK_A0002 | LPW20038 | ACGCGCTTTGATTCGGGAAC | 60 |
| LHK_A0003 | LPW20046 | TGGCCAATCCGATCGTGAC | 55 |
| LHK_A0004 | LPW19946 | ATTCATCCGTCGTGGCTAAG | 65 |
| LHK_A0005 | LPW19950 | TGGGCGCCATGATCACCTAC | 65 |
| LHK_A0006 | LPW19952 | TGGCGCTTCATCCGCATCAC | 65 |
| LHK_A0007 | LPW20545 | CATCACCCGTGCCCTGAT | 60 |
| LHK_A0008 | LPW20048 | CTGACACCCGGTGCAGTTTC | 60 |
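The table does not say how the annealing temperatures were chosen. As rough orientation only, the Python sketch below computes the GC fraction of a primer and a melting temperature estimate from the Wallace rule, Tm = 2(A + T) + 4(G + C), a common approximation for short oligonucleotides. Applying this rule here is an assumption for illustration, not the authors' procedure, and the function names are hypothetical.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in a primer sequence."""
    primer = primer.upper()
    return sum(base in "GC" for base in primer) / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature in °C by the Wallace rule."""
    primer = primer.upper()
    gc = sum(base in "GC" for base in primer)
    at = sum(base in "AT" for base in primer)
    return 2 * at + 4 * gc


# Primer LPW20036 from the table (listed annealing temperature: 60).
primer = "GCATGCCGAATTCCTCGAAG"
print(len(primer), round(gc_fraction(primer), 2), wallace_tm(primer))
# prints 20 0.55 62
```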
Figure 4. Matching relationship of the amino acid sequences encoded by LHK_02777 and LHK_A0007 with the RpsI protein in the genome of Pseudogulbenkiania. The plot is adapted from the result generated by the NCBI BLAST application. In the searches, LHK_02777 and LHK_A0007 were each used as the query, and the RpsI protein was the subject.
PCR primers and annealing temperature for the 10 newly found potential genes in C. violaceum
| ID | Primer pair | Primer sequence | Annealing temperature (°C) |
|---|---|---|---|
| 1 | LPW21817 / LPW21818 | CTGGCATTGACCGATGAC / CGAAGCGTTGGGATACAG | 55 |
| 2 | LPW21819 / LPW21820 | CTGTATCGCCTGGTGTTG / GCCCTTGCTCTGCAAATC | 45 |
| 3 | LPW21821 / LPW21822 | TGCCCATGTCAGGACTTG / GAGCTTGTCCAGGTATTG | 55 |
| 4 | LPW21823 / LPW21824 | GATTTGTCGCGGGTGTTC / GAAGCGTTGAACCAGATG | 60 |
| 5 | LPW21825 / LPW21826 | CATGAGGTTAGCCCTTTC / GCCATCGACGAAATACAG | 55 |
| 6 | LPW21827 / LPW21828 | CAGTGCATCCGCATCATC / GCTCCCATTGCCGAATAG | 60 |
| 7 | LPW21829 / LPW21830 | CAGGAAGACCTGTCTTAC / CTGGCAAAGTCCTCTTCC | 55 |
| 8 | LPW21835 / LPW21836 | CGCAGCTGAAGCAGCTGAAG / CGGCTTGAAACCGTTGAG | 48 |
| 9 | LPW21837 / LPW21838 | TTGAGCTACGGCATAGAC / GCCAGCCGTTTCAGATTC | 55 |
| 10 | LPW21839 / LPW21840 | CGTCTGACGCTGCATGTG / GTCGCCGGACAACAATTC | 55 |
Among hypothetical genes with newly assigned functions, the numbers of genes with definite names and of genes encoding membrane proteins, lipoproteins and periplasmic proteins
| Strains | With definite name | Membrane protein | Lipoprotein | Periplasmic protein |
|---|---|---|---|---|
| C. | 29 | 5 | 1 | 1 |
| | 9 | 3 | 0 | 0 |
| | 105 | 72 | 25 | 31 |
| | 34 | 25 | 6 | 8 |
| | 39 | 25 | 6 | 21 |
| | 30 | 21 | 4 | 19 |
| | 19 | 67 | 16 | 26 |
| | 47 | 77 | 26 | 17 |
| | 78 | 33 | 8 | 18 |
| | 7 | 9 | 5 | 0 |