Abstract
BACKGROUND: Enormous volumes of short read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly. Thus alignment-free methods are needed for the comparison of NGS samples of short reads.
Year: 2014 PMID: 24886411 PMCID: PMC4057587 DOI: 10.1186/1756-0500-7-320
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Comparison of the alignment-free distances and the MSA distance for NGS short reads of the mtDNA sequences
| 454 | 14 | 16 | |||
| Exact | |||||
| Empirical | 8 | 8 | 12 | ||
| Sanger | 14 | 14 | |||
| 454 | 0.68 | 0.66 | 0.68 | ||
| Exact | 0.69 | 0.68 | 0.68 | ||
| Empirical | 0.66 | ||||
| Sanger | 0.67 | 0.67 | 0.65 |
The short reads were simulated from the mtDNA sequences at 5× sampling depth using the four error models of the tool MetaSim: 454, Exact, Empirical, and Sanger. The two smallest tree symmetric differences and the two highest distance correlation coefficients for each error model are highlighted in boldface. Similar results for 1×, 10×, and 30× sampling depths can be found in Additional file 1: Table S2.
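The distance correlation reported in these tables compares an alignment-free distance matrix with a reference distance matrix. The text here does not say which correlation coefficient is used, so the following is only a minimal sketch that assumes a Pearson correlation over the off-diagonal pairs of two symmetric matrices. The function names and the toy matrices are illustrative and do not come from the paper.

```python
import math


def upper_triangle(matrix):
    """Flatten the strictly upper-triangular entries of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def distance_correlation(d1, d2):
    """Pearson correlation between the pairwise entries of two distance matrices.

    Assumption: both matrices are symmetric and list the taxa in the same order.
    """
    x, y = upper_triangle(d1), upper_triangle(d2)
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("matrices must have the same size and at least 3 taxa")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / math.sqrt(var_x * var_y)


# Toy matrices (made up for illustration): `scaled` is `reference` times 2.
reference = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
scaled = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
r = distance_correlation(reference, scaled)
```

Because a Pearson coefficient is invariant to rescaling, two distances that rank and space the pairs identically score 1 even when their absolute values differ.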
Figure 1. Phylogenetic trees reconstructed from NGS short reads of 29 mtDNA sequences using: (A) , (B) (= 10), (C) . The short reads were simulated with the tool MetaSim using the Empirical model and 5× sampling depth. The group of three species platypus, opossum, and wallaroo was used as the outgroup to root the tree.
Comparison of the phylogenetic trees reconstructed from the mtDNA sequences and from their NGS short reads
| 454 | 10 | 14 | |||
| Exact | 4 | ||||
| Empirical | 10 | ||||
| Sanger | 12 |
The short reads were simulated from the mtDNA sequences at 5× sampling depth using the four error models of the tool MetaSim: 454, Exact, Empirical, and Sanger. The two smallest tree symmetric differences for each error model are highlighted in boldface. Similar results for 1×, 10×, and 30× sampling depths can be found in Additional file 1: Table S2.
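The tree symmetric difference counts the groupings that appear in one tree but not in the other. The sketch below assumes rooted trees (the trees in this section are rooted with an outgroup) written as nested tuples of leaf names, and it compares their sets of clades. The representation, the function names, and the five-leaf example are assumptions made for illustration; the paper's own tooling is not described in this section.

```python
def clades(tree):
    """Return (all leaves, set of non-trivial clades) of a rooted tree.

    A tree is either a leaf name (str) or a tuple of subtrees.
    The clade holding every leaf is dropped because all trees share it.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset()
        for child in node:
            leaves |= walk(child)
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)
    return all_leaves, found


def symmetric_difference(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("both trees must have the same leaf set")
    return len(clades_a ^ clades_b)


# Made-up example: the trees disagree only on how A, B and C are grouped.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
diff = symmetric_difference(t1, t2)
```

Under this rooted definition, two fully resolved trees on n leaves can differ by at most 2(n - 2) clades, and a value of 0 means the topologies are identical.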
Comparison of the alignment-free distances and the benchmark distance for 29 genomes
| Symmetric difference | 16 | 20 | 20 | 16 | 24 | ||
| Distance correlation | 0.95 | 0.80 | 0.79 | 0.80 | 0.20 |
The two smallest tree symmetric differences and the two highest correlation coefficients are highlighted in boldface.
Figure 2. Phylogenetic trees reconstructed from 29 genomes using (A) , (B), and from NGS short reads using (C) . The short reads were simulated with the tool MetaSim using the Exact model and 1× sampling depth. Escherichia fergusonii was used as the outgroup to root the tree.
Comparison of the alignment-free distances and the benchmark distance for NGS short reads of 29 genomes
| 454 | 16 | 22 | 22 | 54 | 50 | ||
| Exact | 22 | 24 | 36 | 50 | |||
| Empirical | 16 | 24 | 18 | 42 | 54 | ||
| Sanger | 28 | 24 | 40 | 50 | |||
| 454 | 0.74 | 0.87 | 0.31 | 0.04 | |||
| Exact | 0.82 | 0.82 | 0.61 | 0.05 | |||
| Empirical | 0.77 | 0.85 | 0.38 | 0.01 | |||
| Sanger | 0.78 | 0.86 | 0.36 | 0.04 |
The short reads were simulated from the Escherichia/Shigella genomes at 1× sampling depth using the four error models of the tool MetaSim: 454, Exact, Empirical, and Sanger. The two smallest tree symmetric differences and the two highest correlation coefficients for each error model are highlighted in boldface. Similar results for 5× sampling depth can be found in Additional file 1: Table S3.
Comparison of the alignment-free distances and the benchmark MSA distance for 70 genomes
| | parsimony score | 17 | 25 | ||||
| 16S rRNA sequences | tree symmetric difference | 62 | 108 | ||||
| | distance correlation | 0.90 | 0.65 | ||||
| | parsimony score | 31 | 26 | ||||
| Genome sequences | tree symmetric difference | 80 | 84 | 110 | 110 | ||
| | distance correlation | 0.47 | 0.46 | 0.47 | 0.45 | ||
| | parsimony score | 23 | 24 | 32 | 28 | ||
| NGS short reads | tree symmetric difference | 90 | 88 | 114 | 116 | ||
| | distance correlation | 0.58 | 0.53 | 0.48 | 0.42 |
The NGS short reads were simulated from the whole genome sequences using the Exact model of MetaSim at 1× sampling depth. The two smallest parsimony scores, the two smallest tree symmetric differences and the two highest correlation coefficients are highlighted in boldface. For CVTree, we used k = 7 for the 16S rRNA data set and k = 12 for the whole genome and NGS data sets. For d2S, we used k = 6 for the 16S rRNA data set and k = 8 for the whole genome and NGS data sets.
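The parameter k above is the word length used by the k-mer based methods. As a rough illustration of what that parameter controls, the sketch below counts k-mers across a set of short reads and normalizes them to frequencies. It shows only the shared counting step, not the CVTree or d2S statistics themselves. Skipping words with ambiguous bases and ignoring reverse complements are simplifying assumptions, and the function names and reads are made up.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_counts(reads, k):
    """Count every k-mer over all reads, skipping words with ambiguous bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    return counts


def kmer_frequencies(reads, k):
    """Normalize k-mer counts so that the frequencies sum to 1."""
    counts = kmer_counts(reads, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: count / total for kmer, count in counts.items()}


# Made-up reads for illustration; "NAC" is skipped because of the N.
reads = ["ACGTAC", "NACG"]
counts = kmer_counts(reads, 3)
freqs = kmer_frequencies(reads, 3)
```

Larger k gives more specific words but sparser counts (there are 4^k possible words), which is one reason a different k can suit short 16S rRNA sequences and whole genomes.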
Figure 3. Clustering tree reconstructed from 16S rRNA sequences of 70 genomes using the distance. Those 70 genomes belong to 15 orders which are indicated by different colors in the figure.
Parsimony score for the classification of 39 metagenomic samples using the alignment-free distances
| Sub-data set (omnivore samples excluded) | 6 | 7 | |||
| Full data set | 12 | 12 |
For CVTree, we used k = 6 for the full data set and k = 4 for the sub-data set in which the omnivore samples were excluded. For d, we used k = 7 for the full data set and k = 5 for the sub-data set. The two smallest parsimony scores for each data set are highlighted in boldface.
Figure 4. Clustering tree reconstructed from 39 metagenomic samples using the distance. The host species’ colors indicate their diet and gut physiology: foregut-fermenting herbivores (green), hindgut-fermenting herbivores (yellow), carnivores (red) and omnivores (blue).