Saulo Alves Aflitos, Edouard Severing, Gabino Sanchez-Perez, Sander Peters, Hans de Jong, Dick de Ridder.
Abstract
BACKGROUND: Identification of biological specimens is a requirement for a range of applications. Reference-free methods analyse unprocessed sequencing data without relying on prior knowledge, but generally do not scale to arbitrarily large genomes and arbitrarily large phylogenetic distances.
Year: 2015 PMID: 26525298 PMCID: PMC4630969 DOI: 10.1186/s12859-015-0806-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Cnidaria analysis summary. The JELLYFISH software reads each of the source sequence files (in Fasta or Fastq format), extracts their k-mers (k = 3 in this example), canonizes them (by generating the reverse complement of each k-mer and storing only whichever of the two appears first lexicographically), orders them according to a deterministic hashing algorithm (in this example, alphabetically) and then saves each dataset in a separate database file (.jf). CNIDARIA subsequently reads these databases and compares them side by side, counting the total number of k-mers (white circles), the number of valid k-mers (k-mers shared by at least two samples, black circles) and, as a matrix, the number of shared k-mers for each pair of samples. These values are exported to a Cnidaria Summary Database (CSD, a .json file), which is then used to construct a matrix of distances between the samples (Jaccard by default; Formula 1). This dissimilarity matrix is then used for Neighbour-Joining clustering and exported as a NEWICK tree. Alternatively, Cnidaria can export a Cnidaria Complete Database (CCD, a .cne file) containing all k-mers and a linked list describing their presence or absence in the samples. This second database can be used as an input dataset, together with other .cne or .jf files, for new analyses.
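The steps in this caption can be illustrated with a small in-memory sketch. It is not the JELLYFISH or CNIDARIA implementation: it works on plain Python strings instead of .jf databases, the function names (`canonical`, `canonical_kmers`, `summarize`) are hypothetical, and because Formula 1 is not reproduced here, the standard Jaccard distance 1 - |A ∩ B| / |A ∪ B| is assumed.

```python
from collections import Counter
from itertools import combinations

# Translation table for complementing a DNA string.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(seq: str, k: int) -> set[str]:
    """Collect the distinct canonical k-mers of a sequence, skipping non-ACGT windows."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            kmers.add(canonical(window))
    return kmers


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Standard Jaccard distance (assumed here; Formula 1 is not shown in this text)."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def summarize(samples: dict[str, str], k: int):
    """Per-sample totals, per-sample valid k-mer counts, pairwise shared counts and distances."""
    sets = {name: canonical_kmers(seq, k) for name, seq in samples.items()}
    totals = {name: len(s) for name, s in sets.items()}
    # A k-mer is "valid" when it occurs in at least two samples.
    occurrences = Counter(kmer for s in sets.values() for kmer in s)
    valid = {name: sum(1 for kmer in s if occurrences[kmer] >= 2)
             for name, s in sets.items()}
    shared = {(a, b): len(sets[a] & sets[b]) for a, b in combinations(sets, 2)}
    distances = {(a, b): jaccard_distance(sets[a], sets[b])
                 for a, b in combinations(sets, 2)}
    return totals, valid, shared, distances


if __name__ == "__main__":
    toy = {"s1": "AAAC", "s2": "AAAG", "s3": "CCCC"}
    totals, valid, shared, distances = summarize(toy, k=3)
    print(totals, valid, shared, distances)
```

Holding whole k-mer sets in memory only works for toy inputs; the sorted on-disk databases described in the caption are what allow a side-by-side comparison at genome scale.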
Fig. 2 1-nearest-neighbour analysis at the species and supra-species levels, for each taxonomic level, for CNIDARIA and REFERENCEFREE using 21-mers and Jaccard distance. The supra-species level analysis contains 30 samples (Additional file 5: Table S5) from 8 genera, 7 families, 7 orders, 4 phyla and 3 kingdoms. The species level analysis contains 33 samples (Additional file 5: Table S5) from 11 species of the Solanum clade. Classification reports the Leave-One-Out Cross-Validation (LOOCV) error estimate for 21-mers. Error bars indicate the minimum and maximum performance found across the 71 distance metrics tested.
Summary of search space per k-mer size and number of k-mers found in datasets
| k-mer size | # Canonical k-mers possible | % of possible k-mers found (median) | % of possible k-mers found (MAD) | % of valid k-mers (median) | % of valid k-mers (MAD) |
|---|---|---|---|---|---|
| 11-mer | $2.10 \times 10^{6}$ | 100.00 % | 1.58 % | 100.00 % | 0.00 % |
| 15-mer | $5.40 \times 10^{8}$ | 53.59 % | 17.07 % | 100.00 % | 0.00 % |
| 17-mer | $8.60 \times 10^{9}$ | 8.90 % | 4.03 % | 98.37 % | 0.99 % |
| 21-mer | $2.20 \times 10^{12}$ | 0.05 % | 0.03 % | 81.45 % | 20.55 % |
| 31-mer | $2.30 \times 10^{18}$ | 0.000000061 % | 0.000000032 % | 67.05 % | 24.14 % |
The second column contains the total number of possible k-mers, calculated as $4^k/2$, where the division by two is due to canonization. The third column gives the median and the Median Absolute Deviation (MAD) of the total number of k-mers found in the samples (Additional file 3: Table S3) divided by the number of possible k-mers, showing the percentage of combinations actually found and, consequently, the saturation of the search space. The fourth column gives the median and MAD of the percentage of valid k-mers (k-mers shared between at least two samples; Additional file 3: Table S3).
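The search-space arithmetic in this note can be checked with a few lines of Python. The sketch assumes the size of the canonical k-mer space is 4^k / 2, which is exact for odd k (every k in the table is odd) because no odd-length k-mer equals its own reverse complement; for even k the sketch adds the standard correction for reverse-complement palindromes. The function names are hypothetical.

```python
def canonical_space(k: int) -> int:
    """Number of distinct canonical k-mers over the alphabet A, C, G, T."""
    if k % 2 == 1:
        # Odd k: k-mers pair up perfectly with their reverse complements.
        return 4 ** k // 2
    # Even k: 4**(k/2) k-mers are their own reverse complement.
    return (4 ** k + 4 ** (k // 2)) // 2


def saturation_percent(found: int, k: int) -> float:
    """Percentage of the canonical k-mer space covered by `found` distinct k-mers."""
    return 100.0 * found / canonical_space(k)


if __name__ == "__main__":
    for k in (11, 15, 17, 21, 31):
        print(f"{k}-mer: {canonical_space(k):.2e} possible canonical k-mers")
```

The printed values agree with the second column of the table to about two significant figures, and they show why saturation collapses as k grows: each extra base multiplies the space by four.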
Fig. 3 Results for the 21-mer dataset of 169 individuals using the Jaccard distance and Neighbour-Joining. The phylogenetic tree shows the clustering of the samples without displaying branch lengths (plotted using iTOL [85]). RNAseq samples are highlighted with a * in the outer rim of the tree.