Fabián Reyes-Prieto, Adda J García-Chéquer, Hueman Jaimes-Díaz, Janet Casique-Almazán, Juana M Espinosa-Lara, Rosaura Palma-Orozco, Alfonso Méndez-Tenorio, Rogelio Maldonado-Rodríguez, Kenneth L Beattie.
Abstract
PURPOSE: Here we describe LifePrint, a sequence alignment-independent k-tuple distance method to estimate relatedness between complete genomes.
Keywords: phylogeny; sequence alignment; similarity search; tuple; viroid
Year: 2011 PMID: 21918634 PMCID: PMC3169951 DOI: 10.2147/AABC.S15021
Source DB: PubMed Journal: Adv Appl Bioinform Chem ISSN: 1178-6949
Figure 1. Distribution of the LifePrint set of 9-tuples (LPS9) inside the complete set of 9-tuples.
Note: In total, 1878 tuples of the LPS9 are plotted (blue points) at their positions in the original list of all possible 9-tuples (262,144). Every line represents 10,000 tuples.
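The figure of 262,144 is the number of distinct 9-tuples over a four-letter alphabet (4^9), so the 1878 tuples of the LPS9 make up well under 1% of the complete set. The sketch below reproduces that arithmetic and shows one way a tuple could be mapped to a position in the full list. The lexicographic `ACGT` ordering and the names `all_k_tuples` and `tuple_index` are assumptions for illustration; the record does not state how the original list was ordered.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed base order; the source does not specify the ordering


def all_k_tuples(k):
    """Yield every k-tuple over ALPHABET in lexicographic order."""
    for letters in product(ALPHABET, repeat=k):
        yield "".join(letters)


def tuple_index(tup):
    """Return the 0-based position of a tuple in the lexicographic list."""
    index = 0
    for base in tup:
        index = index * len(ALPHABET) + ALPHABET.index(base)
    return index


TOTAL_9_TUPLES = len(ALPHABET) ** 9        # 262144
LPS9_FRACTION = 1878 / TOTAL_9_TUPLES      # share of the complete set covered by LPS9

if __name__ == "__main__":
    print(TOTAL_9_TUPLES)
    print(f"{100 * LPS9_FRACTION:.2f}% of all 9-tuples")
```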
National Center for Biotechnology Information accession numbers of 36 real viroid genomes.
| Viroid genome | Accession number |
|---|---|
| Apple dimple fruit viroid, complete genome | NC_003463 |
| Apple fruit crinkle viroid, complete genome | NC_003777 |
| Apple scar skin viroid, complete genome | NC_001340 |
| Australian grapevine viroid, complete genome | NC_003553 |
| Avocado sunblotch viroid, complete genome | NC_001410 |
| Chrysanthemum chlorotic mottle viroid, complete genome | NC_003540 |
| Chrysanthemum stunt viroid, complete genome | NC_002015 |
| Citrus bent leaf viroid, complete genome | NC_001651 |
| Citrus dwarf viroid, complete genome | NC_005821 |
| Citrus exocortis viroid, complete genome | NC_001464 |
| Citrus viroid Ia, complete genome | NC_001907 |
| Citrus viroid II, complete genome | NC_003881 |
| Citrus viroid III, complete genome | NC_003264 |
| Citrus viroid IV, complete genome | NC_003539 |
| Citrus viroid OS, complete genome | NC_004359 |
| Citrus viroid-I-LSS, complete genome | NC_004358 |
| Coconut cadang-cadang viroid, complete genome | NC_001462 |
| Coconut tinangaja viroid, complete genome | NC_001471 |
| Coleus blumei viroid 1, complete genome | NC_003681 |
| Coleus blumei viroid 2, complete genome | NC_003682 |
| Coleus blumei viroid 3, complete genome | NC_003683 |
| Coleus blumei viroid, complete genome | NC_003882 |
| Columnea latent viroid, complete genome | NC_003538 |
| Eggplant latent viroid, complete genome | NC_004728 |
| Grapevine yellow speckle viroid 1, complete genome | NC_001920 |
| Grapevine yellow speckle viroid 2, complete genome | NC_003612 |
| Hop latent viroid, complete genome | NC_003611 |
| Hop stunt viroid, complete genome | NC_001351 |
| Iresine viroid, complete genome | NC_003613 |
| Mexican papita viroid, complete genome | NC_003637 |
| Peach latent mosaic viroid, complete genome | NC_003636 |
| Pear blister canker viroid PBCVd, complete genome | NC_001830 |
| Potato spindle tuber viroid, complete genome | NC_002030 |
| Tomato apical stunt viroid, complete genome | NC_001553 |
| Tomato chlorotic dwarf viroid, complete genome | NC_000885 |
| Tomato planta macho viroid, complete genome | NC_001558 |
Figure 2. True tree.
Note: The true tree was constructed manually from the simulated evolution of 32 viroid genomes derived from the Citrus viroid II genome. Nucleotide substitutions were simulated following a 5-generation pattern under an evolutionary model with a transition/transversion ratio of 2 (Kimura 2-parameter).
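As a rough illustration of the kind of simulation described in this note, the sketch below applies random substitutions in which transitions are twice as likely as transversions. It is not the authors' simulator. The function name `substitute`, the reading of "ratio of 2" as a 2/3 probability that any given substitution is a transition, and the choice of sites with replacement are all assumptions.

```python
import random

# Transitions swap purine with purine (A<->G) or pyrimidine with pyrimidine (C<->T).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
# Transversions swap a purine for a pyrimidine or vice versa (two choices each).
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def substitute(seq, n_events, ts_fraction=2 / 3, rng=None):
    """Apply n_events substitutions at random sites (chosen with replacement).

    ts_fraction is the assumed probability that an event is a transition;
    2/3 corresponds to transitions occurring twice as often as transversions.
    """
    rng = rng or random.Random()
    bases = list(seq)
    for _ in range(n_events):
        pos = rng.randrange(len(bases))
        old = bases[pos]
        if rng.random() < ts_fraction:
            bases[pos] = TRANSITION[old]
        else:
            bases[pos] = rng.choice(TRANSVERSIONS[old])
    return "".join(bases)


if __name__ == "__main__":
    parent = "ACGT" * 10  # toy sequence, not a real viroid genome
    child = substitute(parent, 5, rng=random.Random(1))
    print(sum(a != b for a, b in zip(parent, child)), "observed differences")
```

Because sites are drawn with replacement, the same position can be hit more than once, so the number of observed differences can be smaller than the number of substitution events.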
Figure 3. Genomic coverage.
Note: According to their corresponding matching position, we gathered all tuples of the LifePrint set of 9-tuples (LPS9) that detected identity and/or similarity in the first 80 nucleotides (5′ end) of the Hop stunt viroid genome. The coincidences between the most frequent nucleotides and their respective genomic positions are indicated in each column. Every five nucleotides, a green mark is placed as the nucleotide number reference. Identities and differences appear in capital and lower case letters, respectively. The tuples that found identities or sites with one difference are marked in yellow and gray, respectively. Six subsequences that were not detected directly by any tuple (beginning at nucleotide numbers 7, 27, 36, 39, 42, and 60) are underlined. Three of them (beginning at nucleotide numbers 36, 39, and 42) are located in an adenine-rich region, which is marked in blue.
Figure 4. LifePrint detection of single base repeats.
Note: Using a model of 36 nucleotides comprising four different single nucleotide repeats, we carried out a similarity search. We tiled 9-tuples according to their matching position along the model. Sequence identities and differences appear in capital and lower case letters, respectively. Tuples able to identify a repeat, and the respective direct repeat in the genome, are highlighted in blue.
Figure 5. General bootstrapping scheme for k-tuple distance and tree construction in LifePrint.
Note: The Virtual Hybridization program generates a matrix of identity/similarity for each tuple (rows 1 to 7) against each genome sequence (columns A to E). Entire rows from the original matrix are then randomly sampled with replacement to produce a new bootstrap matrix with the same number of rows as the original matrix. A distance table is calculated for each bootstrap sample matrix and used to estimate a phylogenetic tree. Finally, a consensus tree is calculated from all the bootstrap trees. The numbers in the consensus tree show the percentage of abundance of the groups in the bootstrap samples.
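The row-resampling step of this scheme can be sketched as follows. This is a minimal illustration, not LifePrint's code: the toy 7 x 5 matrix is invented, the function names are hypothetical, and `column_distance` is a simple stand-in (fraction of tuples on which two genomes disagree) rather than any of the k-tuple distances used in the paper. Tree estimation and consensus building are left out.

```python
import random


def bootstrap_rows(matrix, rng):
    """Resample whole rows with replacement, keeping the original row count."""
    n_rows = len(matrix)
    return [matrix[rng.randrange(n_rows)] for _ in range(n_rows)]


def column_distance(matrix, i, j):
    """Stand-in distance (assumption): fraction of rows where genomes i and j differ."""
    differing = sum(1 for row in matrix if row[i] != row[j])
    return differing / len(matrix)


def distance_table(matrix):
    """Pairwise distance table between the genome columns of one matrix."""
    n_genomes = len(matrix[0])
    return [[column_distance(matrix, i, j) for j in range(n_genomes)]
            for i in range(n_genomes)]


def bootstrap_distance_tables(matrix, replicates, seed=0):
    """One distance table per bootstrap sample; each would feed a tree builder."""
    rng = random.Random(seed)
    return [distance_table(bootstrap_rows(matrix, rng)) for _ in range(replicates)]


if __name__ == "__main__":
    # Invented data: 7 tuples (rows) scored 0/1 against 5 genomes (columns A to E).
    toy = [
        [1, 1, 0, 0, 0],
        [1, 0, 0, 1, 0],
        [0, 1, 1, 0, 0],
        [1, 1, 1, 0, 1],
        [0, 0, 1, 1, 1],
        [1, 0, 0, 0, 1],
        [0, 1, 0, 1, 1],
    ]
    tables = bootstrap_distance_tables(toy, replicates=1000)
    print(len(tables), "bootstrap distance tables")
```

Sampling rows rather than individual cells keeps each tuple's full pattern across genomes intact, which is what makes the resampled matrices comparable to the original.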
Figure 6. LifePrint set of 9-tuples (LPS9) bootstrap consensus tree from 36 real viroid genomes (k-tuple distance based on Pearson’s correlation coefficient, 1000 replicates).
Note: Families were assigned according to the International Committee on Taxonomy of Viruses classification. Numbers represent bootstrap confidence values for the sequence groups. The black circles correspond to unclassified viroid genomes. The black triangle corresponds to a viroid that should properly be grouped in the subfamily Pospiviroid.
Figure 7. The 5-tuple method bootstrap consensus tree from 36 real viroid genomes (k-tuple distance based on Pearson’s correlation coefficient, 1000 replicates).
Note: Families were assigned according to the International Committee on Taxonomy of Viruses classification. Numbers represent bootstrap confidence values for the sequence groups. The black circles correspond to unclassified viroid genomes.
Appendix II. For the LPS9 and 5-tuple neighbor-joining trees constructed with dPear (Characters original file) and the true tree, we compared topologies using the Phylocomparison program. The first figure compares the topologies of the true tree (Tree A) and the LPS9 NJ-dPear tree (Tree B). The second figure compares the topologies of the true tree (Tree A) and the 5-tuple NJ-dPear tree (Tree B). In both figures, thicker lines indicate a poorer match: the topologic score is proportional to line thickness, ie, the thicker the line, the larger the difference in that clade. The overall topologic scores appear in the lower part of each image. Judging by the figures and the overall topologic scores, the topologic differences between the true tree and the LPS9 NJ-dPear tree are minor.
Number of LifePrint sets of identical and/or similar 9-tuples (LPS9) under four different similarity search schemes
| Conditions | Allowed differences between sequences | Average number of identical and/or similar LPS9 tuples in sequences | Percentage in relation to the number of tuples of the LPS9 |
|---|---|---|---|
| A | 0 | 3 | 0.2 |
| B | 0 and 1 | 64 | 3.4 |
| C | 0, 1, and 2 | 605 | 32.2 |
| D | 0, 1, 2, and 3 | 1705 | 90.8 |
Note: We carried out a similarity search between LPS9 and 36 viroid genomes, allowing different numbers of differences between the sequences, and calculated the average number of identical and/or similar 9-tuples found under each of four conditions (A, B, C, and D).
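A mismatch-tolerant search of this kind can be sketched as below. This is an illustration under stated assumptions, not the Virtual Hybridization program: "differences" are treated as mismatches at aligned positions (Hamming distance), the genome is scanned as a linear string on one strand only, and the names `min_mismatches`, `count_hits`, and `percent_of_set` are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def min_mismatches(probe, genome):
    """Smallest number of mismatches between the probe and any genome window."""
    k = len(probe)
    return min(hamming(probe, genome[i:i + k])
               for i in range(len(genome) - k + 1))


def count_hits(probes, genome, max_diff):
    """Count probes that match somewhere in the genome with at most max_diff mismatches."""
    return sum(1 for p in probes if min_mismatches(p, genome) <= max_diff)


def percent_of_set(n_tuples, set_size=1878):
    """Express a tuple count as a percentage of the LPS9 size (1878 tuples)."""
    return 100.0 * n_tuples / set_size


if __name__ == "__main__":
    toy_genome = "ACGTACGTAC"           # invented sequence
    toy_probes = ["ACGT", "AAAA", "TTTT"]
    for allowed in range(4):
        print(allowed, count_hits(toy_probes, toy_genome, allowed))
    # Percentages in the table follow from the averages and the LPS9 size.
    print(round(percent_of_set(605), 1))
```

Each extra allowed difference widens the set of matching tuples sharply, which is the trend the table reports from condition A to condition D.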
k-tuple distance values for single-substitution variants
| Value | Independent | Successive |
|---|---|---|
| Minimum | 0.000000 | 0.00040 |
| Maximum | 0.005894 | 0.00580 |
| Average | 0.003780 | 0.00390 |
Note: We calculated the minimum, maximum, and average values of the k-tuple distances for variants of the Citrus viroid II genome using the independent and successive approaches described in the Methods section.
k-tuple distance values for variants with single substitutions or eliminations located at the ends of sequences
| Position from end | Single substitutions, 5′ end | Single substitutions, 3′ end | Successive eliminations, 3′ end | Deleted nucleotides (n) |
|---|---|---|---|---|
| 1 | 0.00000 | 0.00069 | 0.000000 | 0 |
| 2 | 0.00069 | 0.00117 | 0.000380 | 1 |
| 3 | 0.00082 | 0.00179 | 0.000580 | 2 |
| 4 | 0.00137 | 0.00248 | 0.000580 | 3 |
| 5 | 0.00145 | 0.00331 | 0.000960 | 4 |
| 6 | 0.00255 | 0.00324 | 0.000962 | 5 |
| 7 | 0.00234 | 0.00441 | 0.001349 | 6 |
| 8 | 0.00381 | 0.00531 | 0.001543 | 7 |
| 9 | 0.00426 | 0.00552 | 0.002128 | 8 |
Note: We calculated the k-tuple distance between every variant and the Citrus viroid II genome. The second and third columns give the results for three possible single substitutions at the 5′ or 3′ ends. The fourth column gives the combined effect of successive eliminations at the 3′ end and the resulting substitutions in the new ends.
Figure 8. Differential detection of variants with a single substitution that implies an average k-tuple distance.
Note: We calculated the k-tuple distance between simulated variant 144 A→G and the Citrus viroid II genome. Tuples of the LifePrint set of 9-tuples (LPS9) that found identity or similarity in both sequences in this region of interest are placed between the two sequences (in bold type), and the distinctive tuples are placed above or below the respective sequence. The substitution is marked in yellow when there is identity with A and in blue when the identity is with G. Other positions where the same tuples find identity or similarity are highlighted in green. These tuples were not considered to be distinctive.
Ability of LifePrint to distinguish between sequences with different degrees of relatedness
| Real substitutions (number) | Observed substitutions (number) | Observed substitutions (percentage) | Minimum k-tuple distance | Maximum k-tuple distance | Average k-tuple distance | Standard deviation |
|---|---|---|---|---|---|---|
| 1 | 1.00 | 0.334 | 0.00134 | 0.00589 | 0.00382 | 0.00098828 |
| 3 | 2.97 | 0.993 | 0.00423 | 0.01575 | 0.01150 | 0.00187987 |
| 6 | 5.96 | 1.993 | 0.01437 | 0.02847 | 0.02208 | 0.00287570 |
| 9 | 8.89 | 2.973 | 0.02312 | 0.04112 | 0.03246 | 0.00361394 |
| 12 | 11.82 | 3.953 | 0.03117 | 0.04995 | 0.04205 | 0.00401709 |
| 24 | 22.82 | 7.632 | 0.06276 | 0.09471 | 0.07808 | 0.00704353 |
| 36 | 32.91 | 11.006 | 0.09334 | 0.13755 | 0.11449 | 0.01101853 |
| 48 | 42.99 | 14.378 | 0.11396 | 0.21124 | 0.16224 | 0.01946713 |
| 60 | 51.61 | 17.261 | 0.14709 | 0.29580 | 0.21528 | 0.03098441 |
| 72 | 60.87 | 20.358 | 0.17289 | 0.42443 | 0.28682 | 0.04982284 |
| 84 | 68.55 | 22.927 | 0.22703 | 0.76171 | 0.36894 | 0.08921467 |
| 96 | 75.29 | 25.181 | 0.26546 | 0.76253 | 0.45189 | 0.12901015 |
| 120 | 87.86 | 29.385 | 0.30701 | 0.76664 | 0.61653 | 0.14854929 |
| 150 | 99.86 | 33.398 | 0.35974 | 0.76852 | 0.72943 | 0.07186317 |
| 200 | 116.55 | 38.980 | 0.39897 | 0.77006 | 0.76087 | 0.01373281 |
Note: Fifteen groups of 100 Citrus viroid II virtual variants containing an average of 1–116 observed substitutions were created. The minimum, maximum, average, and standard deviation of the k-tuple distance between each variant and the original viroid were determined. Column 3 (percentage) is column 2 (number) divided by 299 and expressed as a percentage.
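The percentage column can be checked directly from the note's rule. The short sketch below does that and also shows how observed substitutions could be counted between an original sequence and a variant. The divisor 299 comes from the note. The function names are hypothetical, and counting observed substitutions as mismatching aligned positions is an assumption.

```python
GENOME_LENGTH = 299  # divisor stated in the table note


def observed_differences(original, variant):
    """Count aligned positions where the variant differs from the original."""
    return sum(a != b for a, b in zip(original, variant))


def percent_of_genome(n_differences, genome_length=GENOME_LENGTH):
    """Express a number of observed substitutions as a percentage of the genome."""
    return 100.0 * n_differences / genome_length


if __name__ == "__main__":
    # Column 2 values from the first and last rows of the table.
    for observed in (1.00, 116.55):
        print(observed, round(percent_of_genome(observed), 3))
```

Counting this way, a site that is substituted more than once contributes at most one observed difference, which is one plausible reason the observed numbers fall further below the real numbers as substitutions accumulate.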
Symmetric difference values between the true tree and neighbor-joining trees constructed from k-tuple distances based on three different distance metrics
| Distance metrics | LPS9 (from global binary table) | LPS9 (from global frequency table) | 5-tuple (from global binary table) | 5-tuple (from global frequency table) |
|---|---|---|---|---|
| dLog | 10 | 10 | 18 | 18 |
| dPear | 14 | 6 | 18 | 26 |
| dk | 14 | 8 | 18 | 26 |
Note: We measured the accuracy of the LifePrint set of 9-tuples (LPS9) and complete set of 5-tuples methods by comparing each neighbor-joining tree with the true tree using the symmetric difference.
Abbreviations: dLog, k-tuple distance based on the Jaccard index; dPear, k-tuple distance based on Pearson’s correlation coefficient; dk, typical k-tuple distance.
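The symmetric difference between two trees counts the groupings present in one tree but not the other, so 0 means identical topologies and larger values mean more disagreement. A minimal sketch is given below, assuming trees are written as nested tuples of leaf names and compared as unrooted trees through their non-trivial bipartitions. The representation and the function names are assumptions for illustration; the record does not say which program computed the values in the table.

```python
def clades(tree):
    """Return (leaf set of this subtree, list of leaf sets of all internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaf_set = frozenset()
    found = []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaf_set |= child_leaves
        found.extend(child_found)
    found.append(leaf_set)
    return leaf_set, found


def splits(tree):
    """Non-trivial bipartitions of the leaf set induced by the tree's internal edges."""
    all_leaves, found = clades(tree)
    result = set()
    for clade in found:
        other = all_leaves - clade
        if len(clade) > 1 and len(other) > 1:
            result.add(frozenset([clade, other]))
    return result


def symmetric_difference(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


if __name__ == "__main__":
    # Invented five-leaf trees that differ in one grouping.
    tree_a = ((("A", "B"), "C"), ("D", "E"))
    tree_b = ((("A", "C"), "B"), ("D", "E"))
    print(symmetric_difference(tree_a, tree_b))
```

For two fully resolved trees on the same leaves, each grouping missing from one tree is matched by a different grouping in the other, so the value is always even, as are all entries in the table.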