Tsute Chen, Huma Siddiqui, Ingar Olsen.
Abstract
Currently, genome sequences of a total of 19 Porphyromonas gingivalis strains are available, including eight completed genomes (strains W83, ATCC 33277, TDC60, HG66, A7436, AJW4, 381, and A7A1-28) and 11 high-coverage draft sequences (JCVI SC001, F0185, F0566, F0568, F0569, F0570, SJD2, W4087, W50, Ando, and MP4-504) that are assembled into fewer than 300 contigs. The objective was to compare these genomes at both nucleotide and protein sequence levels in order to understand their phylogenetic and functional relatedness. Four copies of 16S rRNA gene sequences were identified in each of the eight complete genomes and one in the other 11 unfinished genomes. These 43 16S rRNA sequences represent only 24 unique sequences and the derived phylogenetic tree suggests a possible evolutionary history for these strains. Phylogenomic comparison based on shared proteins and whole genome nucleotide sequences consistently showed two groups with closely related members: one consisted of ATCC 33277, 381, and HG66, another of W83, W50, and A7436. At least 1,037 core/shared proteins were identified in the 19 P. gingivalis genomes based on the most stringent detecting parameters. Comparative functional genomics based on genome-wide comparisons between NCBI and RAST annotations, as well as additional approaches, revealed functions that are unique or missing in individual P. gingivalis strains, or species-specific in all P. gingivalis strains, when compared to a neighboring species P. asaccharolytica. All the comparative results of this study are available online for download at ftp://www.homd.org/publication_data/20160425/.
Keywords: Porphyromonas gingivalis; comparative genomics; phylogenetics; phylogenomics
Year: 2017 PMID: 28261563 PMCID: PMC5306136 DOI: 10.3389/fcimb.2017.00028
Source DB: PubMed Journal: Front Cell Infect Microbiol ISSN: 2235-2988 Impact factor: 5.293
Summary of all the P. gingivalis genomes.
| Strain | Release date | Total size (bp) | No. of contigs | BioProject | BioSample | Sequencing center |
|---|---|---|---|---|---|---|
| W83 | 2003-09-02 | 2,343,476 | 1 | PRJNA48 | SAMN02603720 | |
| ATCC_33277 | 2008-05-20 | 2,354,886 | 1 | PRJDA19051 | | Kitasato Univ. |
| TDC60 | 2011-05-23 | 2,339,898 | 1 | PRJDA66755 | | Tokyo Medical and Dental Univ. |
| W50 | 2012-06-25 | 2,242,062 | 104 | PRJNA78905 | SAMN00792205 | J. Craig Venter Institute |
| JCVI_SC001 | 2013-04-24 | 2,426,396 | 1,284 | PRJNA167667 | SAMN02436407 | J. Craig Venter Institute |
| F0568 | 2013-09-16 | 2,334,744 | 154 | PRJNA173937 | SAMN02436723 | Washington Univ. |
| F0569 | 2013-09-16 | 2,249,227 | 111 | PRJNA173938 | SAMN02436724 | Washington Univ. |
| F0570 | 2013-09-16 | 2,282,791 | 117 | PRJNA173939 | SAMN02436747 | Washington Univ. |
| F0185 | 2013-09-16 | 2,246,368 | 113 | PRJNA198891 | SAMN02436815 | Washington Univ. |
| F0566 | 2013-09-16 | 2,306,092 | 192 | PRJNA198892 | SAMN02436881 | Washington Univ. |
| W4087 | 2013-09-16 | 2,216,597 | 114 | PRJNA198893 | SAMN02436749 | Washington Univ. |
| SJD2 | 2013-12-04 | 2,329,548 | 117 | PRJNA205615 | SAMN02470968 | Shanghai Jiao Tong Univ. School of Medicine |
| HG66 | 2014-08-14 | 2,441,780 | 1 | PRJNA245225 | SAMN02732406 | Univ. of Louisville |
| A7436 | 2015-08-11 | 2,367,029 | 1 | PRJNA276132 | SAMN03366764 | Univ. of Florida |
| AJW4 | 2015-08-26 | 2,372,492 | 1 | PRJNA276132 | SAMN03372093 | Univ. of Florida |
| Ando | 2015-09-17 | 2,229,994 | 112 | PRJDB4201 | SAMD00040429 | Lab. of Plant Genomics and Genetics, Dept. of Plant Genome Research, Kazusa DNA Research Institute |
| 381 | 2015-10-14 | 2,378,872 | 1 | PRJNA276132 | SAMN03656156 | Univ. of Florida |
| A7A1-28 | 2015-11-17 | 2,249,024 | 1 | PRJNA276132 | SAMN03653671 | Univ. of Florida |
| MP4-504 | 2016-02-09 | 2,373,453 | 92 | PRJNA305025 | SAMN04309157 | Univ. of Washington |
For a more detailed list of this table please follow this web link: .
Genomes of this table are sorted by the original sequence release date.
The unassembled raw sequence reads from which each assembly was generated can be traced via the BioSample ID, if available.
This Genbank number shows the sequence as “circular,” however it is a single pseudo-contig with many Ns filling the gaps. Thus, it should not be considered as a complete genome.
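Effective (non-Ns) sizes of the genomes.
| Strain | No. of contigs | Total size (bp) | Non-N size (bp) | No. of Ns | N-run length (no. of runs) |
|---|---|---|---|---|---|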
| HG66 | 1 | 2,441,780 | 2,441,680 | 100 | 100 (1) |
| JCVI_SC001 | 1 | 2,426,396 | 2,398,196 | 28,200 | 100 (282) |
| 381 | 1 | 2,378,872 | 2,378,872 | 0 | None |
| MP4-504 | 92 | 2,373,453 | 2,373,453 | 0 | None |
| AJW4 | 1 | 2,372,492 | 2,372,492 | 0 | None |
| A7436 | 1 | 2,367,029 | 2,367,029 | 0 | None |
| ATCC_33277 | 1 | 2,354,886 | 2,354,886 | 0 | None |
| W83 | 1 | 2,343,476 | 2,343,476 | 0 | None |
| TDC60 | 1 | 2,339,898 | 2,339,897 | 1 | 1 (1) |
| SJD2 | 117 | 2,329,548 | 2,328,850 | 698 | 4–256 (23) |
| F0568 | 154 | 2,334,744 | 2,328,244 | 6,500 | 100 (65) |
| F0566 | 192 | 2,306,092 | 2,300,992 | 5,100 | 100 (51) |
| F0570 | 117 | 2,282,791 | 2,278,391 | 4,400 | 100 (44) |
| A7A1-28 | 1 | 2,249,024 | 2,249,024 | 0 | None |
| W50 | 104 | 2,242,062 | 2,242,060 | 2 | 1 (2) |
| F0569 | 111 | 2,249,227 | 2,242,027 | 7,200 | 100 (72) |
| F0185 | 113 | 2,246,368 | 2,240,268 | 6,100 | 100 (61) |
| Ando | 112 | 2,229,994 | 2,227,972 | 2,022 | 10–100 (61) |
| W4087 | 114 | 2,216,597 | 2,212,597 | 4,000 | 100 (40) |
Genomes are ordered based on the non-N size.
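The effective sizes in this table follow from counting the N characters in each assembly. The following is a minimal sketch of that bookkeeping, not the authors' script; the function name `n_run_summary` and the toy sequence are assumptions made for illustration.

```python
import re
from collections import Counter


def n_run_summary(seq):
    """Summarize the N content of one genome sequence.

    Returns (total_length, non_n_length, n_count, runs), where `runs`
    maps each N-run length to the number of runs of that length.
    Hypothetical helper, not taken from the paper.
    """
    # Each match is one uninterrupted stretch of N (upper or lower case).
    run_lengths = [len(m.group()) for m in re.finditer(r"[Nn]+", seq)]
    n_count = sum(run_lengths)
    return len(seq), len(seq) - n_count, n_count, Counter(run_lengths)


# Toy sequence: two gaps, each filled with 100 Ns.
toy = "ACGT" + "N" * 100 + "GGCC" + "N" * 100 + "TTAA"
total, non_n, n_count, runs = n_run_summary(toy)
print(total, non_n, n_count, dict(runs))
```

For the toy sequence the total length is 212, the non-N size is 12, and the run summary is one run length (100) seen twice, which corresponds to the "100 (2)" notation used in the last column.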
Summary of the NCBI annotation.
| W83 | 1,909 | 53 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 41 | 2014-01-31 |
| ATCC_33277 | 2,090 | 53 | 12 | 0 | 210 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2011-11-26 |
| TDC60 | 2,220 | 53 | 12 | 1 | 380 | 7 | 0 | 0 | 0 | 0 | 1 | 34 | 2011-08-17 |
| W50 | 2,016 | 48 | 3 | 1 | 0 | 8 | 0 | 1 | 1 | 0 | 0 | 0 | 2012-06-25 |
| JCVI_SC001 | 2,354 | 45 | 3 | 1 | 0 | 8 | 0 | 1 | 1 | 0 | 0 | 0 | 2013-04-23 |
| F0568 | 2,410 | 46 | 3 | 1 | 0 | 7 | 0 | 0 | 1 | 0 | 0 | 0 | 2013-09-16 |
| F0569 | 2,297 | 46 | 3 | 1 | 0 | 7 | 0 | 0 | 1 | 1 | 0 | 0 | 2013-09-16 |
| F0570 | 2,315 | 44 | 3 | 1 | 0 | 7 | 0 | 0 | 1 | 1 | 0 | 0 | 2013-09-16 |
| F0185 | 2,233 | 45 | 3 | 1 | 0 | 7 | 0 | 0 | 1 | 0 | 0 | 0 | 2013-09-16 |
| F0566 | 2,392 | 45 | 3 | 1 | 0 | 7 | 0 | 0 | 1 | 1 | 0 | 0 | 2013-09-16 |
| W4087 | 2,202 | 45 | 3 | 1 | 0 | 7 | 0 | 0 | 1 | 1 | 0 | 0 | 2013-09-16 |
| SJD2 | 2,012 | 48 | 3 | 0 | 3 | 0 | 62 | 0 | 0 | 0 | 0 | 0 | 2013-12-04 |
| HG66 | 1,958 | 53 | 12 | 0 | 3 | 5 | 38 | 0 | 1 | 0 | 0 | 0 | 2014-10-22 |
| A7436 | 2,004 | 53 | 12 | 1 | 4 | 0 | 3 | 0 | 1 | 0 | 0 | 0 | 2015-08-11 |
| AJW4 | 2,002 | 53 | 12 | 1 | 2 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 2015-08-26 |
| Ando | 1,770 | 47 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2015-11-27 |
| 381 | 1,968 | 53 | 12 | 1 | 3 | 0 | 9 | 0 | 1 | 1 | 0 | 0 | 2015-10-14 |
| A7A1-28 | 1,841 | 53 | 12 | 1 | 5 | 0 | 37 | 0 | 1 | 0 | 0 | 0 | 2015-11-17 |
| MP4-504 | 1,889 | 47 | 3 | 0 | 3 | 2 | 99 | 0 | 1 | 0 | 0 | 0 | 2016-02-09 |
Data were analyzed based on the GFF files of each genome generated by the NCBI annotation pipeline.
Detailed information provided by NCBI can also be downloaded from .
Non-coding RNA.
Trans-messenger RNA: a bacterial RNA molecule with dual tRNA-like and mRNA-like properties.
NCBI annotation release dates were based on the dates reported in the protein.gbff file in the above FTP link.
Comparison of NCBI and RAST genome annotations.
| W83 | 1,909 | 2,163 | 1784/80/334 | 4 | 4 | 4 | 53 |
| ATCC_33277 | 2,090 | 2,092 | 1911/154/144 | 4 | 4 | 4 | 53 |
| TDC60 | 2,220 | 2,090 | 1880/286/167 | 4 | 4 | 4 | 53 |
| W50 | 2,016 | 2,036 | 1887/102/123 | 1 | 1 | 1 | 48 |
| JCVI_SC001 | 2,354 | 2,136 | 2030/276/78 | 1 | 1 | 1 | 45/42 |
| F0568 | 2,417 | 2,096 | 1939/403/111 | 1 | 1 | 1 | 46 |
| F0569 | 2,297 | 1,982 | 1845/377/92 | 1 | 1 | 1 | 46 |
| F0570 | 2,316 | 2,063 | 1912/338/107 | 1 | 1 | 1 | 44 |
| F0185 | 2,236 | 2,005 | 1862/319/107 | 1 | 1 | 1 | 45 |
| F0566 | 2,395 | 2,044 | 1885/428/112 | 1 | 1 | 1 | 45 |
| W4087 | 2,204 | 1,973 | 1850/303/92 | 1 | 1 | 1 | 45 |
| SJD2 | 2,020 | 2,166 | 1845/136/271 | 1 | 1 | 1 | 48/47 |
| HG66 | 1,958 | 2,215 | 1881/58/298 | 4 | 4 | 4 | 53 |
| A7436 | 2,004 | 2,173 | 1898/84/239 | 4 | 4 | 4 | 53 |
| AJW4 | 2,002 | 2,139 | 1884/104/226 | 4 | 4 | 4 | 53 |
| Ando | 1,788 | 1,989 | 1674/76/275 | 2 | 1 | 1 | 47 |
| 381 | 1,968 | 2,108 | 1853/91/221 | 4 | 4 | 4 | 53 |
| A7A1-28 | 1,841 | 2,039 | 1736/89/269 | 4 | 4 | 4 | 53 |
| MP4-504 | 1,891 | 2,181 | 1806/68/347 | 1 | 1 | 1 | 47 |
Only protein-coding, rRNA and tRNA genes were compared since these are the only types of genes annotated by RAST.
The three numbers shown (X/Y/Z) are: X, common genes, i.e., genes whose annotated start and end positions overlap by ≥ 80%; Y, RAST-unique genes, i.e., genes annotated by RAST that do not overlap any NCBI gene; Z, NCBI-unique genes, i.e., genes annotated by NCBI that do not overlap any RAST gene. Genes that partially overlap each other over < 80% of their length are not included.
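The X/Y/Z classification can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: genes are given as inclusive (start, end) coordinates on one replicon, strand is ignored, and the overlap fraction is measured against the longer of the two genes. The function names are hypothetical.

```python
def overlap_len(a, b):
    """Number of shared positions between two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def compare_annotations(ncbi, rast, min_frac=0.8):
    """Return (common, rast_unique, ncbi_unique) gene counts.

    Assumption: two genes are "common" when their overlap covers at least
    `min_frac` of the longer gene. Partially overlapping genes that fall
    below the threshold are counted in none of the three classes.
    """
    common = 0
    ncbi_unique = 0
    for g in ncbi:
        g_len = g[1] - g[0] + 1
        overlaps = [overlap_len(g, r) for r in rast]
        fracs = [
            ov / max(g_len, r[1] - r[0] + 1) for ov, r in zip(overlaps, rast)
        ]
        if any(f >= min_frac for f in fracs):
            common += 1
        elif not any(overlaps):
            ncbi_unique += 1
    rast_unique = sum(
        1 for r in rast if all(overlap_len(r, g) == 0 for g in ncbi)
    )
    return common, rast_unique, ncbi_unique


# Toy coordinates (made up): one shared gene, one partial overlap,
# and one gene found by each pipeline only.
ncbi_genes = [(1, 100), (200, 300), (500, 600)]
rast_genes = [(1, 95), (250, 400), (700, 800)]
print(compare_annotations(ncbi_genes, rast_genes))
```

In the toy input the pair (200, 300) and (250, 400) overlaps by only 51 positions, so it is left out of all three counts, which mirrors the last sentence of the footnote.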
Unique P. gingivalis 16S rRNA gene sequences.
| Unique sequence | Unique trimmed sequence | No. of copies | Length (nt) | Strain (copies) |
|---|---|---|---|---|
| Unique Seq 1 | Unique Trimmed Seq 1 | 4 | 1,422 | 381 (4) |
| Unique Seq 2 | Unique Trimmed Seq 1 | 4 | 1,475 | ATCC33277 (4) |
| Unique Seq 3 | Unique Trimmed Seq 1 | 3 | 1,538 | HG66 (3) |
| Unique Seq 4 | Unique Trimmed Seq 2 | 1 | 1,538 | HG66 |
| Unique Seq 5 | Unique Trimmed Seq 3 | 3 | 1,422 | A7436 (3) |
| Unique Seq 6 | Unique Trimmed Seq 3 | 5 | 1,475 | W50; W83 (4) |
| Unique Seq 7 | Unique Trimmed Seq 4 | 1 | 1,422 | A7436 |
| Unique Seq 8 | Unique Trimmed Seq 5 | 4 | 1,422 | A7A1-28 (4) |
| Unique Seq 9 | Unique Trimmed Seq 6 | 3 | 1,422 | AJW4 (3) |
| Unique Seq 10 | Unique Trimmed Seq 7 | 1 | 1,422 | AJW4 |
| Unique Seq 11 | Unique Trimmed Seq 8 | 1 | 1,521 | TDC60 |
| Unique Seq 12 | Unique Trimmed Seq 8 | 1 | 1,520 | TDC60 |
| Unique Seq 13 | Unique Trimmed Seq 9 | 1 | 1,522 | TDC60 |
| Unique Seq 14 | Unique Trimmed Seq 10 | 1 | 1,520 | TDC60 |
| Unique Seq 15 | Unique Trimmed Seq 11 | 1 | 1,475 | JCVI SC001 |
| Unique Seq 16 | Unique Trimmed Seq 11 | 1 | 1,538 | SJD2 |
| Unique Seq 17 | Unique Trimmed Seq 12 | 1 | 1,475 | Ando |
| Unique Seq 18 | Unique Trimmed Seq 13 | 1 | 1,520 | W4087 |
| Unique Seq 19 | Unique Trimmed Seq 14 | 1 | 1,520 | F0569 |
| Unique Seq 20 | Unique Trimmed Seq 15 | 1 | 1,520 | F0568 |
| Unique Seq 21 | Unique Trimmed Seq 16 | 1 | 1,520 | F0185 |
| Unique Seq 22 | Unique Trimmed Seq 17 | 1 | 1,520 | F0566 |
| Unique Seq 23 | Unique Trimmed Seq 18 | 1 | 1,542 | MP4-504 |
| Unique Seq 24 | Unique Trimmed Seq 19 | 1 | 1,520 | F0570 |
| Unique Seq 25 | Unique Trimmed Seq 20 | 2 | 1,517 | PaDSM20707 (2) |
Sequences were pre-aligned with the software MAFFT v6.935b (2012/08/21) (Katoh and Standley, 2013).
If multiple copies of identical sequences are present, the copy number is indicated in parentheses.
Sequence of P. asaccharolytica strain DSM 20707 (from Genbank ID: .
Figure 1. Phylogenetic tree of P. gingivalis 16S rRNA gene sequences. A total of 24 unique 16S rRNA gene sequences were extracted from the genomes of 19 P. gingivalis strains annotated by NCBI. Sequences were pre-aligned with MAFFT v6.935b (2012/08/21) (Katoh and Standley, 2013), and leading and trailing sequences not present in all sequences were trimmed. The trimmed aligned sequences represent 20 unique sequences and were subjected to QuickTree V 1.1 (Howe et al., 2002) using the “-kimura” option to calculate the substitution rate. The sequence of P. asaccharolytica strain DSM 20707 (PaDSM20707) was used as the out-group. The branch length of the out-group was truncated to fit the tree in the figure, and its substitution rate is indicated by the blue number. The red numbers next to the branching points are bootstrap values based on 100 iterations. Sequences of different strains are separated by semicolons, and the number of sequences is indicated in parentheses in the format (x–y/z), where x and y are the start and end IDs and z is the total number in the strain.
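The trimming step explains why distinct full-length sequences can collapse to fewer unique trimmed sequences. The sketch below illustrates the idea on an already aligned toy input; it is not the authors' pipeline, and the helper names and the three toy sequences are assumptions.

```python
from collections import defaultdict


def trim_to_shared_span(aligned):
    """Keep only the alignment columns spanned by every sequence.

    `aligned` maps a name to a gapped sequence ('-' for gaps); all
    sequences must have the same aligned length. Leading and trailing
    columns where at least one sequence has not started or has already
    ended are removed.
    """
    starts, ends = [], []
    for seq in aligned.values():
        residues = [i for i, ch in enumerate(seq) if ch != "-"]
        starts.append(residues[0])
        ends.append(residues[-1])
    lo, hi = max(starts), min(ends)
    return {name: seq[lo:hi + 1] for name, seq in aligned.items()}


def collapse_identical(seqs):
    """Group sequence names by identical sequence string."""
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[seq].append(name)
    return dict(groups)


# Toy alignment (made up): "a" and "b" differ only in their overhangs.
aligned = {
    "a": "--ACGTAC--",
    "b": "TTACGTACGG",
    "c": "-TACGTTCG-",
}
trimmed = trim_to_shared_span(aligned)
print(collapse_identical(trimmed))
```

Here three distinct untrimmed sequences become two unique trimmed sequences, because "a" and "b" are identical over the shared columns while "c" carries an internal substitution.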
Figure 2. Core and unique genes in P. gingivalis genomes. Of the 39,926 NCBI annotated P. gingivalis proteins, 37,667 are ≥ 50 amino acids in length and were searched for homologous clusters using the “blastclust” software V.2.2.25 (http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html). Various sequence identity cutoffs ranging from 10 to 95% and two minimal alignment length cutoffs, 50 and 90%, were used as the program parameters to identify the protein clusters in three categories: (A) clusters containing proteins from all 19 genomes; (B) clusters containing proteins from 2 to 18 genomes; and (C) clusters with proteins from only 1 genome.
Figure 3. Unique proteins in 19 P. gingivalis genomes. Of the 39,926 NCBI annotated P. gingivalis proteins, 37,667 are ≥ 50 amino acids in length and were searched for homologous clusters using the “blastclust” software V.2.2.25 (http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html). Unique proteins of each of the 19 P. gingivalis genomes were identified as proteins found in only one genome without any similar counterpart in any other. The total number of clusters that contain only unique proteins was plotted for each genome. Various sequence identity cutoffs ranging from 10 to 95% (dots with varying grayscale color intensity) and two minimal alignment length cutoffs, 50% (A) and 90% (B), were used as the program parameters.
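Once protein clusters are available, sorting them into the categories used in Figures 2 and 3 depends only on how many genomes each cluster touches. The sketch below assumes clusters have already been parsed into lists of (genome, protein ID) pairs; the function name, the category labels, and the toy protein IDs are hypothetical.

```python
def classify_clusters(clusters, n_genomes):
    """Count clusters by the number of distinct genomes they contain.

    core:   proteins from all `n_genomes` genomes
    shared: proteins from at least 2 but not all genomes
    unique: proteins from exactly 1 genome
    """
    counts = {"core": 0, "shared": 0, "unique": 0}
    for members in clusters:
        genomes = {genome for genome, _protein in members}
        if len(genomes) == n_genomes:
            counts["core"] += 1
        elif len(genomes) == 1:
            counts["unique"] += 1
        else:
            counts["shared"] += 1
    return counts


# Toy clusters over three genomes (protein IDs are made up).
clusters = [
    [("W83", "p1"), ("381", "p2"), ("TDC60", "p3")],  # all three genomes
    [("W83", "p4"), ("381", "p5")],                   # two genomes
    [("TDC60", "p6")],                                # singleton
    [("TDC60", "p7"), ("TDC60", "p8")],               # paralogs, one genome
]
print(classify_clusters(clusters, n_genomes=3))
```

Note that a cluster of paralogs from a single genome is counted as unique here, which matches the definition of clusters "with proteins from only 1 genome".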
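Percent hypothetical proteins in 19 P. gingivalis genomes.
| Strain | No. of proteins | Hypothetical proteins (%) | No. of unique proteins | Hypothetical among unique proteins (%) |
|---|---|---|---|---|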
| HG66 | 1,958 | 28 | 53 | 81 |
| 381 | 1,968 | 27 | 13 | 85 |
| ATCC_33277 | 2,090 | 42 | 14 | 79 |
| A7A1-28 | 1,841 | 28 | 46 | 78 |
| MP4-504 | 1,891 | 27 | 34 | 85 |
| Ando | 1,788 | 29 | 61 | 70 |
| F0568 | 2,417 | 46 | 114 | 88 |
| F0569 | 2,297 | 45 | 125 | 86 |
| W4087 | 2,204 | 43 | 94 | 78 |
| F0185 | 2,236 | 43 | 72 | 88 |
| F0570 | 2,316 | 44 | 96 | 90 |
| JCVI_SC001 | 2,354 | 30 | 172 | 72 |
| SJD2 | 2,020 | 35 | 79 | 82 |
| AJW4 | 2,002 | 29 | 45 | 76 |
| A7436 | 2,004 | 28 | 25 | 80 |
| W50 | 2,016 | 26 | 34 | 91 |
| W83 | 1,909 | 35 | 13 | 100 |
| F0566 | 2,395 | 46 | 161 | 86 |
| TDC60 | 2,220 | 41 | 78 | 68 |
The strains are ordered approximately according to the 16S rRNA phylogenetic tree shown in Figure 1.
The unique proteins were identified by the “blastclust” program with 80% sequence identity and 50% alignment length as parameters.
Non-hypothetical unique proteins.
| Strain | Non-hypothetical unique proteins |
|---|---|
| HG66 | Glyoxalase |
| A7A1-28 | Beta-galactosidase; putative hydrolase or acyltransferase of alpha/beta superfamily |
| Ando | DNA polymerase III subunits gamma and tau, partial; external scaffolding protein D; replication-associated protein A; major spike protein G |
| F0568 | DGQHR domain protein |
| F0569 | Toxin-antitoxin system, toxin component, Fic domain protein |
| W4087 | CAAX amino terminal protease family protein; phage portal protein, SPP1 family; phage uncharacterized protein |
| F0185 | Peptidase S24-like protein |
| JCVI_SC001 | Thioesterase family protein, partial; starch-binding protein, SusD-like domain protein, partial; spermine/spermidine synthase, partial; phage portal protein, lambda family, partial; head-to-tail joining protein W; serine carboxypeptidase domain protein, partial; NYN domain protein; imidazoleglycerol-phosphate dehydratase domain protein, partial; carbohydrate kinase, PfkB domain protein; PF13785 domain protein, partial; DNA-binding helix-turn-helix protein |
| SJD2 | Transposase ISPsy14 |
| AJW4 | Geranylgeranyl pyrophosphate synthase; T5orf172 domain-containing protein |
| A7436 | Transposase |
| W50 | Transposase, mutator-like family protein |
| TDC60 | Terminase |
These proteins were searched against all the proteins in the 19 genomes and matched none but itself at the default BLASTP 2.2.25 parameter (i.e., with expected e value ≤ 10) (Altschul et al., .
Figure 4. Of the 39,926 NCBI annotated P. gingivalis proteins, 37,667 are ≥ 50 amino acids in length and were searched for homologous clusters using the “blastclust” software V.2.2.25 (http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html). (A) Unrooted tree based on the 1,045 shared proteins identified by “blastclust” with 60% sequence identity and 90% alignment length as cutoffs; the alignment generated a total of 17,389 effective (non-identical) protein sequence positions across all 19 genomes, and the tree was constructed based on these positions; (B) rooted tree based on 436 proteins (out of 1,045) that are also found in P. asaccharolytica strain DSM 20707 (PaDSM20707) with ≥ 50% sequence identity and ≥ 90% alignment length; the alignment generated 4,771 effective protein sequence positions; (C) rooted tree based on 36 proteins shared among 20 genomes with ≥ 80% sequence identity and ≥ 90% alignment length. Proteins were aligned with MAFFT v6.935b (2012/08/21) (Katoh and Standley, 2013) and poorly aligned regions were filtered by Gblocks 0.91b (Talavera and Castresana, 2007). Trees were constructed with FastTree 2.1.9 (Price et al., 2010) using the JTT protein mutation model (Jones et al., 1992) and the CAT + “-gamma” options to account for the different rates of evolution at different sites. The reliability of tree splits is reported as “local support values” based on the Shimodaira-Hasegawa test (Shimodaira and Hasegawa, 2001), printed in blue on the splits. The branch length (substitution rate) of the outgroup PaDSM20707 was truncated and its length is printed in black (B,C); (D) rooted tree constructed using PhyloPhlAn (Segata et al., 2013) by directly subjecting all NCBI annotated proteins of the 20 genomes to the software, resulting in 840 effective protein positions from 225 aligned proteins.
Figure 5. DNA-DNA sequence alignment between P. gingivalis genomes. Genomic sequence alignments between several pairs of P. gingivalis strains were plotted using NUCmer (NUCleotide MUMmer) version 3.1 (Delcher et al., 2002). The percent identities of the detected homologous fragments are plotted in gradient colors based on the percentage. The axes are the nucleotide coordinates in the genomes. The contigs of the unfinished genomes were reordered based on the reference genome (the genome on the X-axis). (A) strain 381 vs. ATCC 33277; (B) HG66 vs. ATCC 33277; (C) strain 381 vs. HG66; (D) W50 vs. W83; (E) A7436 vs. W83; (F) AJW4 vs. A7436; (G) TDC60 vs. JCVI SC001; and (H) TDC60 vs. JCVI SC001 showing only the regions with percent identity ≥ 99%.
Figure 6. Genomic DNA similarity of 19 P. gingivalis genomes. All possible 20-mer sequences present in all genomes, including that of P. asaccharolytica strain DSM 20707 (PaDSM20707) used as an out-group, were categorized, and the number of genomes in which each 20-mer is present was recorded. (A) was generated by first calculating the average number of genomes for all the 20-mers present in each 500-nucleotide window across the entire genome and then coloring each window based on the genome frequency (minimum 1 in yellow and maximum 20 in black). (B) is similar to (A), but the non-coding regions were masked in light blue to highlight the oligonucleotide frequencies for the areas that correspond to both forward (upper) and reverse-complement (lower) protein-coding sequences. The unfinished genomic contigs are arranged in the same order as they appear in the sequences downloaded from NCBI. The genomes in the plot are ordered based on the 16S rRNA phylogenetic tree (Figure 1), with a dendrogram derived from the same tree to show the relatedness.
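The k-mer frequency calculation behind panel (A) can be sketched in two steps: count in how many genomes each k-mer occurs, then average those counts over fixed windows of one genome. The sketch below is an illustration only. It assumes that only the forward strand is scanned, that a k-mer belongs to the window containing its start position, and that windows do not overlap; the function names are hypothetical, and the toy example uses k = 4 and a window of 2 instead of 20 and 500.

```python
from collections import defaultdict


def kmer_genome_counts(genomes, k):
    """Map each k-mer to the number of genomes in which it occurs."""
    seen = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            seen[seq[i:i + k]].add(name)
    return {kmer: len(names) for kmer, names in seen.items()}


def window_means(seq, counts, k, window):
    """Average genome count of the k-mers starting in each window of `seq`."""
    n_starts = len(seq) - k + 1  # number of k-mer start positions
    means = []
    for start in range(0, n_starts, window):
        stop = min(start + window, n_starts)
        values = [counts[seq[i:i + k]] for i in range(start, stop)]
        means.append(sum(values) / len(values))
    return means


# Toy genomes (made up); "ACGT" is the only 4-mer present in all three.
genomes = {"g1": "ACGTACGT", "g2": "ACGTTTTT", "g3": "GGGGACGT"}
counts = kmer_genome_counts(genomes, k=4)
print(window_means(genomes["g1"], counts, k=4, window=2))
```

A window mean close to the number of genomes marks a region conserved across strains, while a mean near 1 marks strain-specific sequence, which is what the yellow-to-black color scale in the figure encodes.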
Comparative functional genomics of P. gingivalis.
| NCBI | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 5 | 0 |
| RAST | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| BLAST | ||||||||||||||||||||
| NCBI | 7 | 6 | 5 | 1 | 8 | 4 | 4 | 1 | 12 | 3 | 1 | 1 | 6 | 1 | 1 | 10 | 3 | 1 | 5 | 0 |
| RAST | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| BLAST | ||||||||||||||||||||
| NCBI | 4 | 1 | 2 | 4 | 4 | 1 | 2 | 3 | 3 | 1 | 4 | 4 | 4 | 4 | 5 | 1 | 2 | 4 | 4 | 1 |
| RAST | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 2 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 4 |
| BLAST | ||||||||||||||||||||
| NCBI | 118 | 68 | 94 | 73 | 20 | 98 | 73 | 25 | 30 | 13 | 35 | 23 | 14 | 26 | 14 | 26 | 25 | 45 | 65 | 32 |
| RAST | 46 | 50 | 56 | 48 | 35 | 64 | 69 | 42 | 57 | 48 | 57 | 38 | 28 | 40 | 27 | 87 | 46 | 61 | 51 | 24 |
| BLAST | ||||||||||||||||||||
| KEGG Orthology | 47 | 45 | 45 | 13 | 3 | 27 | 16 | 0 | 1 | 1 | 1 | 0 | 3 | 2 | 2 | 2 | 14 | 1 | 22 | 0 |
| NCBI | 2 | 3 | 3 | 3 | 1 | 4 | 4 | 1 | 1 | 3 | 1 | 1 | 3 | 1 | 1 | 4 | 3 | 1 | 2 | 1 |
| RAST | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 3 | 3 | 2 | 2 | 2 | 3 | 3 | 2 | 3 | 3 | 3 | 0 |
| BLAST | ||||||||||||||||||||
| NCBI | 4 | 12 | 11 | 11 | 15 | 15 | 1 | 5 | 0 | 0 | 5 | 12 | 2 | 5 | 5 | 6 | 14 | 10 | 11 | 7 |
| RAST | 12 | 12 | 12 | 12 | 12 | 12 | 0 | 6 | 0 | 3 | 5 | 12 | 3 | 5 | 5 | 5 | 11 | 8 | 12 | 7 |
| BLAST | ||||||||||||||||||||
| CRISPR arrays | 3 | 3 | 3 | 4 | 5 | 5 | 2 | 3 | 3 | 3 | 7 | 22 | 3 | 15 | 4 | 3 | 4 | 7 | 5 | 2 |
| NCBI | 1 | 1 | 3 | 1 | 6 | 1 | 2 | 10 | 13 | 1 | 9 | 5 | 8 | 8 | 12 | 3 | 1 | 6 | 3 | 1 |
| RAST | 3 | 2 | 3 | 4 | 4 | 4 | 4 | 2 | 6 | 5 | 2 | 2 | 6 | 4 | 7 | 3 | 2 | 1 | 1 | 2 |
| BLAST | ||||||||||||||||||||
Results were compiled based on the NCBI or RAST genome annotations. The total number of proteins containing any of the keywords shown in each category was recorded for each genome, separately for the NCBI and RAST annotations. The detailed results are provided in the Supplemental Files available from the FTP site: ftp://bioinformatics.forsyth.org/publication_data/20160425/
The keyword search was case-insensitive and allowed partial-word matches.
The genomes are ordered approximately as in the 16S rRNA phylogenetic tree.
BLAST: all the proteins identified by NCBI and RAST were collected and their sequences were searched against all the proteins of all 20 genomes using BLASTP. The numbers (in bold) indicated for each genome are the numbers of proteins with ≥ 95% sequence identity and ≥ 95% coverage of the query sequences. The numbers were calculated separately for NCBI- and RAST-annotated proteins, and the larger of the two is shown in this table.
The number of proteins related to the IS5 transposase family was identified by the BlastKOALA program (Kanehisa et al., .
The number of CRISPR arrays detected by the online software CRISPRFinder (.
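The keyword search described in these footnotes (case-insensitive, partial-word matching, one count per functional category) can be sketched as below. The category names and keywords are placeholders chosen for illustration, not the keyword lists used in the study, and `count_by_category` is a hypothetical helper.

```python
def count_by_category(annotations, categories):
    """Count annotations matching any keyword of each category.

    Matching is case-insensitive and accepts partial words, so the
    keyword "transpos" matches "Transposase" and "transposon".
    """
    counts = {}
    for category, keywords in categories.items():
        lowered = [kw.lower() for kw in keywords]
        counts[category] = sum(
            1
            for text in annotations
            if any(kw in text.lower() for kw in lowered)
        )
    return counts


# Product names taken from the unique-protein table of this article;
# the two categories and their keywords are assumptions.
annotations = [
    "Transposase ISPsy14",
    "Transposase, mutator-like family protein",
    "Terminase",
    "phage portal protein, SPP1 family",
    "Glyoxalase",
]
categories = {
    "transposase": ["TRANSPOS"],
    "phage": ["phage", "terminase"],
}
print(count_by_category(annotations, categories))
```

Each protein is counted at most once per category even if several keywords of that category match, which is one reasonable reading of "proteins containing any of the keywords".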
Genome statistics of selected species.
| Statistic | PG | TF | TD | AA | BU | BF |
|---|---|---|---|---|---|---|
| Number of genomes | 19 | 7 | 17 | 38 | 16 | 103 |
| Number of genomes with single contig | 9 | 3 | 6 | 8 | 0 | 4 |
| Number of contigs (genomes with > 1 contig) | 92–192 | 71–141 | 2–12 | 4–787 | 8–343 | 2–2,566 |
| Genome sizes | 2,216–2,441 kb | 3,233–3,405 kb | 2,742–2,990 kb | 1,860–2,382 kb | 4,270–5,047 kb | 4,457–8,029 kb |
| Mean genome size | 2,320 kb | 3,312 kb | 2,838 kb | 2,155 kb | 4,771 kb | 5,416 kb |
| Number of ORFs | 1,788–2,417 | 2,492–3,001 | 2,520–2,793 | 1,829–2,364 | 3,254–4,663 | 3,593–8,060 |
| Mean ORF number | 2,101 | 2,740 | 2,634 | 2,046 | 4,045 | 4,760 |
| Number of core proteins | 1,037 | 1,560 | 1,129 | 424 | 1,191 | NA |
| Number of unique proteins | 1,044 | 801 | 692 | 1,233 | 4,040 | NA |
| Non-hypothetical proteins | 1,206–1,637 (54.16–74.26%) | 1,573–1,800 (58.77–64.65%) | 685–1,618 (25.17–58.47%) | 1,127–2,035 (48.43–89.75%) | 1,038–3,156 (24.84–79.48%) | 677–5,732 (13.99–71.91%) |
| Non-hypothetical proteins: Mean (Percentage) | 1,346 (64.60%) | 1,702 (62.18%) | 801 (30.38%) | 1,656 (81.19%) | 2,450 (60.71%) | 2,993 (62.63%) |
| Hypothetical proteins | 515–1,108 (25.74–45.84%) | 919–1,201 (35.35–41.23%) | 1,149–2,090 (41.53–74.83%) | 211–1,200 (10.25–51.57%) | 784–3,149 (20.52–75.16%) | 1,193–4,162 (28.09–86.01%) |
| Hypothetical proteins: Mean (Percentage) | 755 (35.40%) | 1,038 (37.82%) | 1,832 (69.62%) | 390 (18.81%) | 1,594 (39.29%) | 1,767 (37.37%) |
| Level 1 | 1 | NA | ||||
| Level 2 | 1 | 1 | 6 | 1 | NA | |
| Level 3 | 4 | 4 | 15 | 37 | 5 | NA |
| Level 4 | 25 | 28 | 141 | 397 | 84 | NA |
| Level 3 | 1 | NA | ||||
| Level 4 | 11 | NA | ||||
PG, Porphyromonas gingivalis; TF, Tannerella forsythia; TD, Treponema denticola; AA, Aggregatibacter actinomycetemcomitans; BU, Bacteroides uniformis; BF, B. fragilis.
Genomes were downloaded from the NCBI FTP site; only those genomes with annotation were analyzed.
Number of NCBI predicted ORFs based on the number of proteins found in the “.faa” file for each genome.
Number of core proteins are those present in all genomes of the species, identified with the “blastclust” software (.
Number of unique proteins are those present in a single genome of the species, identified with the “blastclust” software using 60% sequence identity and 50% length coverage as the parameters.
Non-hypothetical proteins predicted by NCBI are proteins whose annotation contains neither of the keywords “hypothetic” and “uncharacterized”; hypothetical proteins are those whose annotation contains either keyword.
The 4 levels of subsystems were defined by the RAST (Aziz et al., .
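The hypothetical versus non-hypothetical split used in this table follows directly from the two keywords named in the footnote. A minimal sketch is given below; it assumes the match is case-insensitive (the footnote does not say), and the function name and the toy product names are illustrative only.

```python
HYPOTHETICAL_KEYWORDS = ("hypothetic", "uncharacterized")


def hypothetical_stats(annotations):
    """Return (n_hypothetical, n_non_hypothetical, percent_hypothetical).

    A protein is treated as hypothetical when its annotation contains
    either keyword; case-insensitive matching is an assumption here.
    """
    if not annotations:
        raise ValueError("no annotations given")
    n_hyp = sum(
        1
        for text in annotations
        if any(kw in text.lower() for kw in HYPOTHETICAL_KEYWORDS)
    )
    total = len(annotations)
    return n_hyp, total - n_hyp, 100.0 * n_hyp / total


# Toy product names for one genome (made up for illustration).
products = [
    "hypothetical protein",
    "Glyoxalase",
    "Uncharacterized protein",
    "Terminase",
]
print(hypothetical_stats(products))
```

Applied to the product names of every protein in a genome's ".faa" file, the third value corresponds to the percentage reported in the hypothetical-protein rows of the table.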