William F Anjos, Gabriel C Lanes, Vasco A Azevedo, Anderson R Santos.
Abstract
BACKGROUND: Bacterial genomes are being deposited into online databases at an increasing rate. Genome annotation represents one of the first efforts to understand organisms and their diseases. Some evolutionary relationships that can be annotated from genomes alone are conserved gene neighbourhoods (CNs), phylogenetic profiles (PPs), and gene fusions. At present, there is no standalone software that creates networks of interactions among proteins from these three evolutionary characteristics with efficient and effective results.
Keywords: Bacteria; Interaction; Network; Protein; Software; Standalone
Year: 2021 PMID: 34915867 PMCID: PMC8680239 DOI: 10.1186/s12859-021-04501-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Sample of model organisms obtained from the STRING database according to seven network metrics
| Organism | STRING nomenclature | Vertices | Average degree | Density | Edges | Maximum degree |
|---|---|---|---|---|---|---|
| | ATCC 8739 | 4190 | 195.34 | 0.047 | 409,238 | 1697 |
| | subsp. subtilis | 4181 | 244.388 | 0.058 | 510,983 | 2014 |
| | CB15 | 3721 | 208.14 | 0.056 | 387,245 | 1393 |
| | ATCC 33530 | 474 | 128.35 | 0.271 | 30,419 | 318 |
| | sp. ATCC 27150 | 4124 | 215.33 | 0.052 | 444,011 | 1548 |
| | NCIMB 11764 | 6384 | 247.91 | 0.039 | 791,330 | 2158 |
| | DJ | 4955 | 233.558 | 0.047 | 578,640 | 2047 |
| | A3(2) | 7741 | 357.142 | 0.046 | 1,382,317 | 3576 |
We consider these networks appropriate for inferring centrality measures. Most networks in this sample have densities below 0.100. The average degree, density, and number of edges were 228, 0.077, and 567 thousand, respectively.
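The degree and density columns follow from the vertex and edge counts if the networks are treated as simple undirected graphs. The sketch below is only an illustration under that assumption; the function names are ours and are not part of GENPPI or STRING.

```python
def average_degree(vertices: int, edges: int) -> float:
    """Mean number of edges per vertex in an undirected graph: 2E / V."""
    return 2 * edges / vertices


def density(vertices: int, edges: int) -> float:
    """Fraction of possible undirected edges that are present: 2E / (V * (V - 1))."""
    return 2 * edges / (vertices * (vertices - 1))


# Row "ATCC 8739" from the table above: 4190 vertices and 409,238 edges.
v, e = 4190, 409_238
print(round(average_degree(v, e), 2))  # 195.34
print(round(density(v, e), 3))         # 0.047
```

Both values match the corresponding table row, which suggests the table reports undirected-graph metrics.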
Comparison of our heuristic to find high similarity pairs of proteins (HistoFasta) to the exact algorithm Needleman–Wunsch
| AA limit (-aadifflimit) | Check limit (-aacheckminlimit) | Number of similar proteins | Mean identity | Median identity | Min identity |
|---|---|---|---|---|---|
| 0 | 26 | 336 | 100.00 | 100.00 | 100.00 |
| 0 | 25 | 336 | 100.00 | 100.00 | 100.00 |
| 0 | 24 | 360 | 99.95 | 100.00 | 97.96 |
| 0 | 23 | 366 | 99.91 | 100.00 | 96.94 |
| 0 | 22 | 368 | 99.90 | 100.00 | 96.94 |
| 0 | 21 | 370 | 99.87 | 100.00 | 94.68 |
| 0 | 20 | 372 | 99.83 | 100.00 | 91.75 |
| 0 | 19 | 382 | 99.60 | 100.00 | 85.57 |
| 1 | 26 | 360 | 99.95 | 100.00 | 97.87 |
| 1 | 25 | 370 | 99.84 | 100.00 | 92.55 |
| 1 | 24 | 390 | 99.07 | 100.00 | 29.21 |
| 1 | 23 | 428 | 96.21 | 100.00 | 29.21 |
| 1 | 22 | 500 | 89.38 | 100.00 | 17.33 |
| 1 | 21 | 784 | 71.68 | 97.70 | 17.33 |
| 1 | 20 | 2164 | 52.27 | 39.60 | 17.33 |
| 1 | 19 | 6120 | 43.26 | 36.36 | 15.00 |
For the creation of the core pangenome, we need only the high-identity matches.
With the aa-limit and check-limit parameters set to 1 and 25, respectively, our heuristic guarantees a minimum identity of 92.55% for pairs of proteins classified as similar (Table 3).
| Amino acids | A | R | N | D | C | E | Q | G | H | I | L | K | M | F | P | S | T | W | Y | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A histogram | 12 | 2 | 8 | 6 | 4 | 10 | 1 | 9 | 2 | 11 | 6 | 13 | 3 | 2 | 4 | 14 | 5 | 1 | 5 | 10 |
| B histogram | 11 | 3 | 8 | 6 | 4 | 11 | 1 | 9 | 2 | 11 | 8 | 12 | 3 | 2 | 4 | 13 | 4 | 1 | 4 | 11 |
| abs(A-B): | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |
According to the GENPPI heuristic, proteins A and B are similar because, in the difference of their amino acid histograms, at least 25 of the 26 possible types show frequency differences less than or equal to 1. For the sake of illustration, this table shows only the 20 principal amino acids. Proteins A and B, given in FASTA format below, have 94.5% identity (96.9% similarity) according to the Needleman–Wunsch algorithm. Amino acids in bold are those that differ between the A and B sequences.
>A Protein
MAYSKKVMDHYENPRNVGSFSNSDNNVGSGLVGAPACGDVMKLQIKVNEKGIIEDACFKTYGCGS
AIASSSLVTEWVKGKSITEAESIRNTTIVEELELPPVKIHCSILAEDAIKAAIADYKSKKYSN
>B Protein
MAYSKKVMDHYENPRNVGSFSNSDLNVGSGLVGAPACGDVMKLQIKVNEEGIIEDACFKTYGCGS
AIASSSLVTEWVKGKSIVEAESIRNTTIVEELELPPVKIHCSILAEDAIKAAISDYKRKKNLN
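The histogram comparison described for proteins A and B can be sketched as follows. This is a minimal illustration, not the GENPPI implementation: the function names are hypothetical, and we assume that the "26 possible types" are the uppercase letters A to Z and that the two thresholds play the roles of the AA limit and check limit parameters.

```python
from collections import Counter
from string import ascii_uppercase  # assumption: the 26 types are the letters A-Z

PROTEIN_A = (
    "MAYSKKVMDHYENPRNVGSFSNSDNNVGSGLVGAPACGDVMKLQIKVNEKGIIEDACFKTYGCGS"
    "AIASSSLVTEWVKGKSITEAESIRNTTIVEELELPPVKIHCSILAEDAIKAAIADYKSKKYSN"
)
PROTEIN_B = (
    "MAYSKKVMDHYENPRNVGSFSNSDLNVGSGLVGAPACGDVMKLQIKVNEEGIIEDACFKTYGCGS"
    "AIASSSLVTEWVKGKSIVEAESIRNTTIVEELELPPVKIHCSILAEDAIKAAISDYKRKKNLN"
)


def types_within_limit(seq_a: str, seq_b: str, aa_diff_limit: int) -> int:
    """Count residue types whose frequencies differ by at most aa_diff_limit."""
    hist_a, hist_b = Counter(seq_a), Counter(seq_b)
    return sum(
        1 for aa in ascii_uppercase
        if abs(hist_a[aa] - hist_b[aa]) <= aa_diff_limit
    )


def is_similar(seq_a: str, seq_b: str,
               aa_diff_limit: int = 1, check_min_limit: int = 25) -> bool:
    """Call two proteins similar when enough residue types pass the limit."""
    return types_within_limit(seq_a, seq_b, aa_diff_limit) >= check_min_limit


print(types_within_limit(PROTEIN_A, PROTEIN_B, 1))  # 25 (only L differs by 2)
print(is_similar(PROTEIN_A, PROTEIN_B))             # True
```

The comparison needs only one pass over each sequence and a fixed 26-slot comparison, which is why a histogram filter of this kind is cheap relative to a full Needleman–Wunsch alignment.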
Fig. 1 Scheme of an arrangement of the results, suggesting that GENPPI produces trustworthy results.
GENPPI command lines used to generate interaction networks, keeping fixed expansions for conserved neighbourhoods while varying the parameters that control the number of phylogenetic profile interactions
| Id | Parameters |
|---|---|
| f1 | genppi -expt fixed -w1 10 -cw1 3 -ppiterlimit 1000000 -ppdifftolerated 3 -ppaadifflimit 0 |
| f2 | genppi -expt fixed -w1 10 -cw1 4 -trim 20000 |
| f3 | genppi -expt fixed -w1 10 -cw1 1 -ppiterlimit 500000 |
| f4 | genppi -expt fixed -w1 10 -cw1 1 -ppcomplete -aadifflimit 0 -aachecklimit 24 |
| d5 | genppi -expt dynamic -ws 3 -ppcomplete -ppdifftolerated 1 -pphistofilter |
We omitted the folder parameter (-dir) because it does not change the number of nodes or edges in the resulting networks. Execution d5 is a dynamic expansion for the conserved neighbourhood and has no fixed-retraction counterpart. It is kept in this table solely to group the documentation of the commands used.
Metric values obtained for interaction networks by CN and PP
| Id | CN expansion | Nodes | Average degree | Density | Edges | Maximum degree |
|---|---|---|---|---|---|---|
| STRING | – | 2213 | 180.83 | 0.082 | 200,088 | 901 |
| f1 | Fixed | 2149 | 840.228 | 0.391 | 902,825 | 1316 |
| f2 | Fixed | 2050 | 69.748 | 0.034 | 71,492 | 688 |
| f3 | Fixed | 2057 | 270.628 | 0.132 | 278,341 | 689 |
| f4 | Fixed | 1984 | 121.620 | 0.061 | 120,647 | 385 |
| d1 | Dynamic | 2141 | 853.973 | 0.399 | 914,178 | 1355 |
| d2 | Dynamic | 2045 | 91.318 | 0.045 | 93,373 | 705 |
| d3 | Dynamic | 2045 | 289.11 | 0.141 | 295,615 | 772 |
| d4 | Dynamic | 1976 | 132.947 | 0.067 | 131,356 | 469 |
| d5 | Dynamic | 2058 | 272.208 | 0.132 | 280,102 | 713 |
Fig. 2 Average nucleotide identity for 50 genomes of the genus Corynebacterium. The largest grayish square represents the clusters of C. pseudotuberculosis and Corynebacterium diphtheriae. Both groupings have units that are almost black due to the high score of DNA similarities.
Fig. 3 Pangenome similarity profile for the same 50 genomes of the genus Corynebacterium depicted in Fig. 2. The clusters of C. pseudotuberculosis and Corynebacterium diphtheriae are the most grayish. The remaining units are whitish due to the low protein similarities of their phylogenetic profiles.
Fig. 4 Profile differences between Figs. 2 and 3, showing where ANI and the pangenome computed by GENPPI diverge. Black cells represent the maximum difference, white cells represent the smallest differences, and grayish cells represent intermediate differences. Most comparisons are white because ANI and GENPPI agree on the low similarity of the compared genomes, which yields small differences.
Fig. 5 Each genome has a box plot summarising statistics for its conserved phylogenetic profiles. The width of a box plot is proportional to the number of PPs found. Genomes from C. pseudotuberculosis (left) and C. diphtheriae (leftmost) stand out numerically. For the former, the PP median enables separation of the biovars ovis and equi (highest medians).
Fig. 6 Best possible separation of the genus Corynebacterium achieved using the conserved neighbourhood. We achieved the split by expanding a window in steps of three. The stopping criterion was a reduction by more than half in the number of conserved loci for a given window size. In this query, we did not obtain a full split of the C. pseudotuberculosis biovars ovis and equi, but we did separate the main represented species.
Fig. 7 Number of proteins shared by each pair of runs in Table 5 among the first 100 proteins ordered by the Bridging Centrality metric. The fn and dn labels come from the executions in Table 5. White cells have the highest percentage of intersection, followed by yellow cells, and red cells have the lowest, for each gene neighbourhood parameter setting (fixed retraction or dynamic expansion).
Fig. 8 We query the interactions of the networks listed in columns against the subjects in rows. The result compares the interactions shared between different GENPPI parameter settings and STRING. The largest GENPPI intersection recovered 46% of the STRING interactions (Ids f1 and d1), even though the networks were built from 0.009% of the genomes used by the STRING database. Conversely, the largest share of our networks recovered by the STRING network was 14%.
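A query-against-subject comparison of this kind can be sketched as a set intersection over undirected edges. The code below is an assumption-laden illustration rather than the procedure used for the figure: the function names and protein identifiers are hypothetical, and it presumes both networks use the same identifiers for the same proteins.

```python
def normalise(edges):
    """Turn (a, b) pairs into an undirected edge set, dropping self-loops."""
    return {frozenset(pair) for pair in edges if pair[0] != pair[1]}


def shared_percentage(query_edges, subject_edges) -> float:
    """Percentage of the subject's interactions that also occur in the query."""
    query, subject = normalise(query_edges), normalise(subject_edges)
    if not subject:
        return 0.0
    return 100 * len(query & subject) / len(subject)


# Toy networks with made-up protein identifiers.
query = [("p1", "p2"), ("p2", "p3"), ("p3", "p4")]
subject = [("p2", "p1"), ("p3", "p4"), ("p4", "p5"), ("p5", "p6")]

print(shared_percentage(query, subject))  # 50.0
```

The measure is asymmetric: swapping query and subject changes the denominator, which is consistent with the caption reporting different percentages for each direction.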