Vincenzo Bonnici, Rosalba Giugno, Vincenzo Manca.
Abstract
BACKGROUND: Pan-genome approaches afford the discovery of homology relations in a set of genomes by determining how some gene families are distributed among a given set of genomes. The retrieval of a complete gene distribution among a class of genomes is an NP-hard problem because computational costs increase with the number of analyzed genomes; in fact, all-against-all gene comparisons are required to completely solve the problem. In the presence of phylogenetically distant genomes, the variability introduced in gene duplication and transmission makes the task of recognizing homologous genes even more difficult. A challenge in this field is designing fast and adaptive similarity measures in order to find a suitable pan-genome structure of homology relations.
Keywords: Distant genomes; Pan-genome; k-mer dictionary
Year: 2018 PMID: 30497358 PMCID: PMC6266927 DOI: 10.1186/s12859-018-2417-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Overview of PanDelos. Pan-genome computation of three genomes (represented as blue, red, and violet). Genomes are taken as input as lists of genetic sequences (represented as colored rectangles). The homology detection schema is divided into 5 steps. First, PanDelos chooses an optimal word length that is used to compare dictionaries of genetic sequences. Then the 1-vs-1 genome comparisons are performed. An initial selection of candidate gene pairs is obtained by applying a minimum percentage threshold on the dictionary intersection. Then, PanDelos computes generalized Jaccard similarities among genes (shown in the bottom left matrix). Only pairs of genes that passed the threshold applied on dictionary percentages are taken into consideration for the similarity computation. Pairs that did not pass the threshold are represented by gray tiles. Next, PanDelos computes bidirectional best hits (BBH), here represented with green borders. On the bottom right, a similarity network made of reciprocal best hits is shown. Border colors represent the genomes to which genes belong. A final computational step discards edges in inconsistent components of the network and returns the final list of gene families. A component is inconsistent if it contains two genes belonging to the same genome that are not accounted as paralogs. A family may contain orthologous as well as paralogous genes, such as the yellow/brown ones. Families are finally classified as singletons, dispensable or core depending on their presence among genomes (borders of the rectangles represent the genomes the genes belong to)
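The similarity and best-hit steps described in the caption can be illustrated with a minimal Python sketch. It is not the PanDelos implementation: the function names, the toy sequences, and the fixed word length `k` are all assumptions made for illustration, the percentage threshold on the dictionary intersection is omitted, and ties between best hits are resolved arbitrarily.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Multiset (dictionary) of the k-mers of one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def generalized_jaccard(a, b):
    """Sum of minimum multiplicities over sum of maximum multiplicities."""
    keys = a.keys() | b.keys()
    num = sum(min(a[x], b[x]) for x in keys)  # Counter gives 0 for absent k-mers
    den = sum(max(a[x], b[x]) for x in keys)
    return num / den if den else 0.0


def bidirectional_best_hits(genome_a, genome_b, k):
    """Return gene pairs (x, y) that are each other's best hit.

    genome_a and genome_b are hypothetical dicts: gene name -> sequence.
    """
    ca = {name: kmer_counts(seq, k) for name, seq in genome_a.items()}
    cb = {name: kmer_counts(seq, k) for name, seq in genome_b.items()}
    sim = {(x, y): generalized_jaccard(ca[x], cb[y]) for x in ca for y in cb}
    best_a = {x: max(cb, key=lambda y: sim[(x, y)]) for x in ca}
    best_b = {y: max(ca, key=lambda x: sim[(x, y)]) for y in cb}
    # Keep only reciprocal pairs with non-zero similarity.
    return [(x, y) for x, y in best_a.items()
            if best_b[y] == x and sim[(x, y)] > 0]


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    genome_a = {"a1": "MKVLAAGIV", "a2": "WLLPPPWLL"}
    genome_b = {"b1": "MKVLAAGIA", "b2": "WLLPPPWLV"}
    print(bidirectional_best_hits(genome_a, genome_b, k=3))
```

The reciprocal pairs returned by such a procedure would form the edges of the similarity network; detecting inconsistent components is a separate step that this sketch does not cover.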
Fig. 2 Examples of indexing structures. In the top part of the image (a), an example of the indexing structure ESA+N is shown for the string WLLPPP. The string is indexed by lexicographically sorting its suffixes. The arrays SA, LCP and N are computed according to the ordering. The indexing structure is composed of the three arrays, and the other columns shown in the image are virtually extracted. The SA array stores the start positions of suffixes and is used to keep track of the lexicographic ordering. Values along the LCP and N arrays are used to identify intervals that correspond to specific k-mers [30]. The 1-mer L is identified by an interval that covers the first two positions of the structure, while the 1-mer P covers three positions and the 1-mer W covers one position. Thus, the multiplicities of L, P and W are respectively 2, 3, and 1. 2-mer intervals are shown in the second column from the left. Note that the third position is not covered when 2-mer intervals are extracted because it cannot identify the start of any 2-mer. The second section of the image (b, c, d, e, f) shows the extension of the indexing structure to manage sets of strings. Four input strings, s1, s2, t1 and t2, are indexed. First, a global string is built by concatenating the four strings and by putting a special symbol (represented as N) on the concatenation joints. Then, similarly to the single-string case, suffixes of the global sequence are sorted in lexicographic order. The sorting procedure defines the content of the SA array, and the LCP, N and SID arrays are computed in accordance with it. The SID array records, for each suffix, the sequence from which it originates. The indexing structure helps in extracting the information, namely the multiplicities of 2-mers in every sequence, that is ideally represented in the matrix (b). The illustrations d, e and f show the final values that the matrices M, P1 and P−1 take after every 2-mer of the global sequence has been taken into account
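The single-string case in panel (a) can be reproduced with a small Python sketch. It builds the SA and LCP arrays naively (quadratic sorting, not an efficient construction) and reads k-mer multiplicities off the LCP intervals. The N array is not built here; a plain suffix-length check stands in for it, which is a simplifying assumption, and all function names are hypothetical.

```python
def suffix_array(s):
    """Start positions of the suffixes of s in lexicographic order (naive)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """lcp[r] = length of the common prefix of the suffixes at ranks r-1 and r."""
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b = s[sa[r - 1]:], s[sa[r]:]
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        lcp[r] = n
    return lcp


def kmer_multiplicities(s, k):
    """Count each k-mer of s by scanning intervals of the sorted suffixes."""
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    counts = {}
    current = None
    for r, start in enumerate(sa):
        if len(s) - start < k:
            # Suffix too short to start a k-mer: the position is not covered.
            current = None
            continue
        if current is not None and lcp[r] >= k:
            counts[current] += 1          # same interval as the previous rank
        else:
            current = s[start:start + k]  # a new interval begins here
            counts[current] = counts.get(current, 0) + 1
    return counts


if __name__ == "__main__":
    print(suffix_array("WLLPPP"))
    print(kmer_multiplicities("WLLPPP", 1))
    print(kmer_multiplicities("WLLPPP", 2))
```

For WLLPPP and k = 2, the suffix at the third rank is the single character P, so it is skipped, which matches the caption's remark that the third position is not covered by any 2-mer interval.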
Phylogenetic distances (average and standard deviation) for the four real datasets
| Species | Distance |
|---|---|
| | 0.28 (0.13) |
| | 0.37 (0.34) |
| | 0.69 (0.25) |
| | 0.92 (0.21) |
Number of genomes per gene family in 7 genomes of serotype Typhi of the Salmonella enterica species
| Genome count | PanDelos | EDGAR | Roary |
|---|---|---|---|
| 1 | 241 | 219 | 236 |
| 2 | 74 | 72 | 87 |
| 3 | 27 | 28 | 35 |
| 4 | 93 | 42 | 108 |
| 5 | 213 | 246 | 213 |
| 6 | 469 | 491 | 464 |
| 7 | 3748 | 3749 | 3751 |
| Total | 4865 | 4847 | 4894 |
The table reports the count of gene families found in a given number (from 1 to 7) of genomes, for each of the tested algorithms. Families found in only 1 genome are the singletons, whereas families found in all 7 genomes are the core families. The whole dataset consists of 31,311 gene sequences (CDS) that were clustered into more than 4800 gene families by the three approaches
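The classification mentioned in the note can be sketched in Python as follows. The helper names are hypothetical, and treating every family present in more than one but not all genomes as dispensable is an assumption based on the usual pan-genome terminology. The histogram is the PanDelos column of the table above.

```python
def classify_family(genome_count, total_genomes):
    """Label a family by the number of genomes in which it is present."""
    if genome_count == 1:
        return "singleton"
    if genome_count == total_genomes:
        return "core"
    return "dispensable"


def summarize(histogram, total_genomes):
    """Aggregate a {genome count: number of families} histogram by class."""
    summary = {"singleton": 0, "dispensable": 0, "core": 0}
    for genome_count, families in histogram.items():
        summary[classify_family(genome_count, total_genomes)] += families
    return summary


if __name__ == "__main__":
    # PanDelos column of the serotype Typhi table (7 genomes).
    pandelos_typhi = {1: 241, 2: 74, 3: 27, 4: 93, 5: 213, 6: 469, 7: 3748}
    print(summarize(pandelos_typhi, total_genomes=7))
```

Under these assumptions, the PanDelos column splits into 241 singleton, 876 dispensable and 3748 core families, which sum to the reported total of 4865.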
Number of genomes per gene family in 14 Xanthomonas campestris species
| Genome count | PanDelos | EDGAR | Roary |
|---|---|---|---|
| 1 | 3050 | 2572 | 7143 |
| 2 | 743 | 854 | 1864 |
| 3 | 797 | 873 | 2112 |
| 4 | 585 | 600 | 3811 |
| 5 | 233 | 249 | 1086 |
| 6 | 159 | 201 | 86 |
| 7 | 110 | 111 | 40 |
| 8 | 128 | 143 | 104 |
| 9 | 400 | 431 | 797 |
| 10 | 196 | 222 | 12 |
| 11 | 107 | 98 | 13 |
| 12 | 203 | 181 | 54 |
| 13 | 715 | 630 | 642 |
| 14 | 1742 | 1829 | 50 |
| Total | 9168 | 8994 | 17814 |
The table reports the count of gene families found in a given number (from 1 to 14) of genomes, for each of the tested algorithms. The dataset consists of a total of 56,759 input gene sequences
Number of genomes per gene family in 10 Escherichia coli isolates
| Genome count | PanDelos | EDGAR | Roary |
|---|---|---|---|
| 1 | 1819 | 1593 | 2589 |
| 2 | 740 | 781 | 1083 |
| 3 | 916 | 990 | 1270 |
| 4 | 463 | 527 | 523 |
| 5 | 287 | 301 | 265 |
| 6 | 322 | 332 | 290 |
| 7 | 201 | 223 | 172 |
| 8 | 228 | 224 | 145 |
| 9 | 354 | 338 | 312 |
| 10 | 3075 | 3084 | 2951 |
| Total | 8405 | 8443 | 9600 |
The dataset consists of a total of 48,980 input gene sequences
Number of genomes per gene family in 64 genomes of the Mycoplasma genus
| Genome count | PanDelos | EDGAR | Roary | Roary-65 |
|---|---|---|---|---|
| 1-10 | 12,218 | 11,180 | 21,140 | 15,240 |
| 11-20 | 676 | 825 | 604 | 737 |
| 21-30 | 156 | 166 | 0 | 59 |
| 31-40 | 31 | 54 | 0 | 4 |
| 41-50 | 38 | 51 | 0 | 6 |
| 51-60 | 27 | 40 | 0 | 5 |
| 61-64 | 35 | 28 | 0 | 5 |
| Total | 13,181 | 12,344 | 21,744 | 16,056 |
The table reports the count of gene families found in a given number (from 1 to 64) of genomes, for each of the tested algorithms. The dataset consists of a total of 47,385 input gene sequences
Phylogenetic distances (average and standard deviation) for the four extracted synthetic sub-populations
| Genome | Extraction type | Variation 0.5% | Variation 1% |
|---|---|---|---|
| G37 | Roots | 0.17 (0.6) | 0.24 (0.07) |
| G37 | Leaves | 0.55 (0.17) | 0.68 (0.17) |
| M129 | Roots | 0.15 (0.05) | 0.22 (0.07) |
| M129 | Leaves | 0.55 (0.14) | 0.64 (0.19) |
Performances of PanDelos, EDGAR and Roary on the synthetic datasets
| Genome | Variation | Extraction type | TP | FP | FN | TN | f-measure | CDiff |
|---|---|---|---|---|---|---|---|---|
| PanDelos | ||||||||
| G37 | 0.5% | Roots | 1,263,632 | 0 | 2324 | 1,386,060,388 | 0.9991 | 1 |
| M129 | 0.5% | Roots | 1,689,082 | 0 | 4060 | 2,460,563,516 | 0.9988 | 2 |
| G37 | 1% | Roots | 1,259,344 | 0 | 5310 | 1,385,219,936 | 0.9979 | 0 |
| M129 | 1% | Roots | 1,689,682 | 0 | 3024 | 2,467,589,288 | 0.9991 | 0 |
| G37 | 0.5% | Leaves | 1,278,188 | 0 | 25,374 | 1,756,053,000 | 0.9902 | 24 |
| M129 | 0.5% | Leaves | 1,695,228 | 0 | 47,376 | 3,086,359,452 | 0.9862 | 57 |
| G37 | 1% | Leaves | 1,270,658 | 0 | 45,042 | 1,735,332,524 | 0.9826 | 64 |
| M129 | 1% | Leaves | 1,773,110 | 196 | 57,320 | 3,144,153,456 | 0.9840 | 210 |
| EDGAR | ||||||||
| G37 | 0.5% | Roots | 1,258,382 | 0 | 7574 | 1,386,060,388 | 0.9970 | 34 |
| M129 | 0.5% | Roots | 1,663,846 | 0 | 29,296 | 2,460,563,516 | 0.9913 | 139 |
| G37 | 1% | Roots | 1,253,564 | 0 | 11,090 | 1,385,219,936 | 0.9956 | 48 |
| M129 | 1% | Roots | 1,665,186 | 0 | 27,520 | 2,467,589,288 | 0.9918 | 132 |
| G37 | 0.5% | Leaves | 1,269,670 | 0 | 33,892 | 1,756,053,000 | 0.9868 | 154 |
| M129 | 0.5% | Leaves | 1,671,400 | 0 | 71,204 | 3,086,359,452 | 0.9791 | 319 |
| G37 | 1% | Leaves | 1,269,724 | 0 | 45,976 | 1,735,332,524 | 0.9822 | 197 |
| M129 | 1% | Leaves | 1,753,318 | 98 | 77,112 | 3,144,153,554 | 0.9785 | 267 |
| Roary | ||||||||
| G37 | 0.5% | Roots | 1,212,344 | 0 | 53,612 | 1,386,060,388 | 0.9784 | 179 |
| M129 | 0.5% | Roots | 1,598,840 | 856 | 94,302 | 2,460,562,660 | 0.9711 | 247 |
| G37 | 1% | Roots | 1,166,946 | 0 | 97,708 | 1,385,219,936 | 0.9598 | 383 |
| M129 | 1% | Roots | 1,541,422 | 1244 | 151,284 | 2,467,588,044 | 0.9529 | 537 |
| G37 | 0.5% | Leaves | 348,356 | 112 | 955,206 | 1,756,052,888 | 0.4217 | 3520 |
| M129 | 0.5% | Leaves | 423,836 | 154 | 1,318,768 | 3,086,359,298 | 0.3912 | 5619 |
| G37 | 1% | Leaves | 97,710 | 24 | 1,217,990 | 1,735,332,500 | 0.1383 | 6302 |
| M129 | 1% | Leaves | 468,466 | 64 | 1,361,964 | 3,144,153,588 | 0.4075 | 4674 |
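The f-measure column appears consistent with the standard F1 score computed from the TP, FP and FN counts. The Python sketch below assumes that definition (the table itself does not state the formula) and recomputes a few rows for illustration; the function name is hypothetical.

```python
def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall, 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


if __name__ == "__main__":
    # (label, TP, FP, FN) copied from the table above.
    rows = [
        ("PanDelos G37 0.5% Roots", 1_263_632, 0, 2_324),
        ("PanDelos M129 1% Leaves", 1_773_110, 196, 57_320),
        ("Roary G37 1% Leaves", 97_710, 24, 1_217_990),
    ]
    for label, tp, fp, fn in rows:
        print(f"{label}: {f_measure(tp, fp, fn):.4f}")
```

TN does not enter the formula, which is why the very large TN counts in the table have no effect on the f-measure values.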
Fig. 3 Execution times of PanDelos and Roary over the four synthetic datasets extracted from the two populations generated from the Mycoplasma genitalium G37 genome. Time requirements have been measured by taking into account five different numbers of analyzed genomes, from 10 to 50. Execution times are reported in seconds