Andrzej Zielezinski, Susana Vinga, Jonas Almeida, Wojciech M Karlowski.
Abstract
Alignment-free sequence analyses have been applied to problems ranging from whole-genome phylogeny to the classification of protein families, identification of horizontally transferred genes, and detection of recombined sequences. The strength of these methods makes them particularly useful for next-generation sequencing data processing and analysis. However, many researchers are unclear about how these methods work, how they compare to alignment-based methods, and what potential they hold for their own research. We address these questions and provide a guide to the currently available alignment-free sequence analysis tools.
Year: 2017 PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Alignment-free calculation of the word-based distance between two sample DNA sequences ATGTGTG and CATGTG using the Euclidean distance
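The calculation named in the Fig. 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word length `k = 3` and the function names `word_counts` and `euclidean_word_distance` are assumptions, since the caption does not state them. Each sequence is turned into a vector of overlapping word counts, and the Euclidean distance is taken over the union of words seen in either sequence.

```python
from collections import Counter
from math import sqrt


def word_counts(seq, k):
    """Count every overlapping word (k-mer) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean_word_distance(x, y, k=3):
    """Euclidean distance between the word-count vectors of x and y.

    k=3 is an assumed word length; words absent from one sequence
    contribute a count of zero (Counter returns 0 for missing keys).
    """
    cx, cy = word_counts(x, k), word_counts(y, k)
    words = set(cx) | set(cy)
    return sqrt(sum((cx[w] - cy[w]) ** 2 for w in words))


distance = euclidean_word_distance("ATGTGTG", "CATGTG", k=3)
print(round(distance, 3))  # 1.732, i.e. sqrt(3), for k = 3
```

With `k = 3`, ATGTGTG yields the counts ATG 1, TGT 2, GTG 2 and CATGTG yields CAT 1, ATG 1, TGT 1, GTG 1, so the squared differences sum to 3.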
Fig. 2 Alignment-free calculation of the normalized compression distance using the Lempel–Ziv complexity estimation algorithm. Lempel–Ziv complexity counts the number of different words in a sequence when it is scanned from left to right (e.g., for s = ATGTGTG, the Lempel–Ziv complexity is 4: A|T|G|TG). Compression algorithms in alignment-free analysis have been reviewed extensively [63]
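A minimal sketch of the idea in the Fig. 2 caption follows. It assumes a simple dictionary-style left-to-right parse that reproduces the caption's A|T|G|TG example; the exact Lempel–Ziv variant used in the figure may differ. The normalized compression distance is taken in its commonly used form, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), with the complexity estimate standing in for the compressed size C. The function names are illustrative.

```python
def lz_complexity(seq):
    """Number of different words found by a left-to-right scan.

    The current word is extended one character at a time until it has
    not been seen before; it is then recorded and a new word is started.
    A trailing word that was already seen adds nothing to the count.
    """
    seen = set()
    word = ""
    for ch in seq:
        word += ch
        if word not in seen:
            seen.add(word)
            word = ""
    return len(seen)


def ncd(x, y, complexity=lz_complexity):
    """Normalized compression distance with a pluggable complexity estimate."""
    cx, cy, cxy = complexity(x), complexity(y), complexity(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


print(lz_complexity("ATGTGTG"))   # 4, the words A, T, G, TG
print(ncd("ATGTGTG", "CATGTG"))   # 0.6 under this particular parse
```

Under this parse, CATGTG has complexity 5 and the concatenation ATGTGTGCATGTG has complexity 7, giving (7 − 4) / 5 = 0.6. A different parsing rule or a real compressor would give a different value.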
Alignment-free sequence comparison tools available for next-generation sequencing data analysis
| Category | Analysis | Tool | Primary features | Implementation | Reference | URL |
|---|---|---|---|---|---|---|
| Mapping | Transcript quantification | kallisto | Transcript abundance quantification from RNA-seq data (uses pseudoalignment for rapid determination of read compatibility with targets) | Software (C++) | | |
| | | Sailfish | Estimation of isoform abundances from reference sequences and RNA-seq data ( | Software (C++) | | |
| | | Salmon | Quantification of the expression of transcripts using RNA-seq data (uses | | | |
| | | RNA-Skim | RNA-seq quantification at transcript-level (partitions the transcriptome into disjoint transcript clusters; uses | Software (C++) | | |
| | Variant calling | ChimeRScope | Fusion transcript prediction using gene | Software (Java) | | |
| | | FastGT | Genotyping of known SNV/SNP variants directly from raw NGS sequence reads by counting unique | Software (C) | | |
| | | Phy-Mer | Reference-independent mitochondrial haplogroup classifier from NGS data ( | Software (Python) | | |
| | | LAVA | Genotyping of known SNPs (dbSNP and Affymetrix's Genome-Wide Human SNP Array) from raw NGS reads ( | Software (C) | | |
| | | MICADo | Detection of mutations in targeted third-generation NGS data (can distinguish patients’ specific mutations; algorithm uses | Software (Python) | | |
| | General mapper | Minimap | Lightweight and fast read mapper and read overlap detector (uses the concept of “minimizers”, a special type of | Software (C) | | |
| Assembly | De novo genome assembly | MHAP | Produces highly continuous assembly (fully resolved chromosome arms) from third-generation long and noisy reads (10 kbp) using a dimensionality reduction technique MinHash | Software (Java) | | |
| | | Miniasm | Assembler of long noisy reads (SMRT, ONT) using the Overlap-Layout-Consensus (OLC) approach without the necessity of an error correction stage (uses minimap) | Software (C) | | |
| | | LINKS | Scaffolding genome assembly with error-containing long sequence (e.g., ONT or PacBio reads, draft genomes) | Software (Perl) | | |
| | Read clustering | afcluster | Clustering of reads from different genes and different species based on | Software (C++) | | |
| | | QCluster | Clustering of reads with alignment-free measures ( | Software (C++) | | |
| | Reads error correction | Lighter | Correction of sequencing errors in raw, whole genome sequencing reads ( | Software (C++) | | |
| | | QuorUM | Error corrector for Illumina reads using k-mers | Software (C++) | | |
| | | Trowel | | Software (C++) | | |
| Metagenomics | Assembly-free phylogenomics | AAF | Phylogeny reconstruction directly from unassembled raw sequence data from whole genome sequencing projects; provides bootstrap support to assess uncertainty in the tree topology ( | Software (Python) | | |
| | | kSNP v3 | Reference-free SNP identification and estimation of phylogenetic trees using SNPs (based on | Software (C) | | |
| | | NGS-MC | Phylogeny of species based on NGS reads using alignment-free sequence dissimilarity measures d2* and d2S under different Markov chain models (using | R package | | |
| | Species identification/taxonomic profiling | CLARK | Taxonomic classification of metagenomic reads to known bacterial genomes using | Software (C++) | | |
| | | FOCUS | Reports organisms present in metagenomic samples and profiles their abundances (uses composition-based approach and non-negative least squares for prediction) | Web service, Software (Python) | | |
| | | GSM | Estimation of abundances of microbial genomes in metagenomic samples ( | Software (Go) | | |
| | | Mash | Species identification using assembled or unassembled Illumina, PacBio, and ONT data (based on MinHash dimensionality-reduction technique) | Software (C++) | | |
| | | Kraken | Taxonomic assignment in metagenome analysis by exact | Software (C++) | | |
| | | LMAT | Assignment of taxonomic labels to reads by | Software (C++/Python) | | |
| | | stringMLST | | Software (Python) | | |
| | | Taxonomer | | Web service | | |
| | Other | d2-tools | Word-based ( | Software (Python/R) | | |
| | | VirHostMatcher | Prediction of hosts from metagenomic viral sequences based on ONF using various distance measures (e.g., d2) | Software (C++) | | |
| | | MetaFast | Statistics calculation of metagenome sequences and the distances between them based on assembly using de Bruijn graphs and Bray–Curtis dissimilarity measure | Software (Java) | | |

The up-to-date list of currently available programs can be found at http://www.combio.pl/alfree/tools/. Accessed 23 August 2017
LCA, lowest common ancestor; NGS, next-generation sequencing; SNP, single-nucleotide polymorphism; SNV, single-nucleotide variant
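Several tools in the table above (MHAP, Mash) are described as relying on the MinHash dimensionality-reduction technique. The sketch below illustrates the general idea with a bottom-s variant: each sequence is reduced to the `size` smallest hash values of its words, and the Jaccard similarity of the two word sets is estimated from the sketches alone. It is an illustration under stated assumptions, not the implementation of any listed tool: the SHA-1-based hash, the word length, the sketch size, and all function names are chosen here for reproducibility only.

```python
import hashlib


def kmers(seq, k):
    """Set of distinct overlapping words of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_word(word):
    """Deterministic 64-bit hash (assumed choice: first 8 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(word.encode()).digest()[:8], "big")


def bottom_sketch(seq, k, size):
    """Keep only the `size` smallest hash values of the sequence's words."""
    return sorted(hash_word(w) for w in kmers(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for value in merged if value in shared) / len(merged)


a = bottom_sketch("ATGTGTG", k=3, size=10)
b = bottom_sketch("CATGTG", k=3, size=10)
print(jaccard_estimate(a, b, size=10))
```

When the sketch size is at least the number of distinct words, as in this toy call, the estimate coincides with the exact Jaccard similarity (3 shared words out of 4). On genome-scale input the sketch is far smaller than the word set and the value is only an approximation.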
Alignment-free sequence comparison tools available for research purposes
| Category | Name | Features | Implementation | Reference | URL |
|---|---|---|---|---|---|
| Pairwise and multiple sequence comparison | ALF | Calculation of pairwise similarity scores (using N2 measure) for sequences in fasta file | Software (C++) | | |
| | Alfree | 25 word-based measures, 8 IT-based measures, 3 graph-based measures, W-metric | Web service, Software (Python) | This article | |
| | decaf + py | 13 word-based measures, Lempel–Ziv complexity-based measure, average common substring distance, W-metric | Software (Python) | | |
| | multiAlignFree | Multiple alignment-free sequence comparison using five word-based statistics | R package | | |
| | NASC | Non-aligned sequence comparison: four word-based measures and two IT-based measures | Matlab framework | | |
| Whole-genome phylogeny | ALFRED, ALFRED-G | Phylogenetic tree reconstruction based on the average common substring approach | Software (C++) | | |
| | andi | Computation of evolutionary distances between closely related genomes by approximation of local alignments ( | Software (C) | | |
| | CAFE | Alignment-free analysis platform for studying the relationships among genomes and metagenomes (offers 28 word-based dissimilarity measures) | Software (C) | | |
| | CVTree3 | Phylogeny reconstruction from whole genome sequences based on word composition | Web service | | |
| | DLTree | Automated whole genome/proteome-based phylogenetic analysis based on alignment-free dynamical language method | Web service | | |
| | FFP | Feature frequency profile-based measures for whole genome/proteome comparisons (from viral to mammalian scale) | Software (C/Perl) | | |
| | jD2Stat (JIWA) | Generation of the distance matrix using | Software (Java) | | |
| | kr | Efficient word-based estimation of mutation distances from unaligned genomes | Software (C) | | |
| | FSWM/kmacs/Spaced | Three tools for alignment-free sequence comparison based on inexact word matches | Software (C++), Web service | | Software currently unavailable |
| | SlopeTree | Whole genome phylogeny that corrects for HGT | Software (C++) | | |
| | Underlying Approach | Phylogeny of whole genomes using composition of subwords | Software (Java) | | |
| Sequence similarity search tool | RAFTS3 | Searches of similar protein sequences against a protein database (>300 times faster than BLAST) | Matlab | | |
| Annotation of long non-coding RNA | FEELnc | Prediction of lncRNAs from RNA-seq samples based on word frequencies and relaxed open reading frames | Software (Perl/R) | | |
| | lncScore | Identification of long non-coding RNA from assembled novel transcripts | Software (Python) | | |
| Horizontal gene transfer | alfy | Alignment-free local homology calculation for detecting horizontal gene transfer | Software (C) | | |
| | rush | Detection of recombination between two unaligned DNA sequences | Software (C) | | |
| | Smash | Identification and visualization of DNA rearrangements between pairs of sequences | Software (C) | | |
| | TF-IDF | Detection of HGT regions and the transfer direction in nucleotide/protein sequences | Software (C++) | | |
| Regulatory elements | D2Z | Identification of functionally related homologous regulatory elements | Software (Perl) | | |
| | MatrixREDUCE | Prediction of functional regulatory targets of TFs by predicting the total affinity of each promoter and orthologous promoters | Software (Python) | | |
| | RRS | Detection of functionally similar groups of enhancers and their regions | Software (Perl/C) | | |
| Sequence clustering | d2_cluster | Word-based clustering of EST and full-length cDNA sequences | Software (C) | | |
| | d2-vlmc | Word-based clustering of metatranscriptomic samples using variable length Markov chains | Software (Python) | | |
| | mBKM | Clustering of DNA sequences using Shannon entropy and Euclidean distance | Software (Java) | | |
| | kClust | Large-scale clustering of protein sequences (down to 20–30% sequence identity) | Software (C++) | | |
| Other | COMET | Rapid classification of HIV-1 nucleotide sequences into subtypes based on prediction by partial matching compression | Web service | | |
| | PPI | Identification of protein–protein interaction by coevolution analysis using discrete Fourier transform | Software (Python) | | |
| | VaxiJen | Antigen prediction based on uniform vectors of principal amino acid properties | Web service | | |

The up-to-date list of currently available programs can be found at http://www.combio.pl/alfree/tools/. Accessed 23 August 2017
HGT, horizontal gene transfer; IT, information theory
Fig. 3 Snapshot of the results returned by the alignment-free web tool (Alfree) for “example 1”: HIV viral sequences obtained from dental patients in Florida [186]. Briefly, in the late 1980s some patients of an HIV-positive dentist in Florida were diagnosed as infected with HIV. An investigation by the Centers for Disease Control and Prevention did not uncover any hygiene lapses that could have resulted in infection of patients. However, sequence comparison of the gene encoding gp120 isolated from HIV strains from the dentist, his patients, and other individuals revealed that PATIENT_A, PATIENT_B, PATIENT_C, PATIENT_E, and PATIENT_G became infected while receiving dental care [183]. The phylogeny shown is based on the gp120 viral protein sequences from the dentist, the dentist’s wife (DENTIST WIFE), eight patients (PATIENT_A to PATIENT_H), and five individuals who never had contact with the accused (CONTROL 1, 2, 3, 4, and 5). The phylogram was obtained as a majority-rule consensus tree that summarizes the agreement across 15 alignment-free methods (support values on a scale from 0 to 1 are shown for every node of the tree). The web interface of the Alfree portal also provides an example case of phylogenetic reconstruction of the mitochondrial genomes of 12 primates. Several additional options are available to explore and visualize the sequence comparison results, including selecting an individual method, re-rooting trees, changing tree layouts, and collapsing or expanding different parts of the tree
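
The node support values described in the Fig. 3 caption can be illustrated with a small sketch of majority-rule consensus. It is not the Alfree implementation: each input tree is represented, as an assumption, by the set of clades (groups of leaf labels) it contains, and the leaf labels A to D and the three toy trees are hypothetical. The support of a clade is the fraction of input trees that contain it, and the consensus keeps clades found in more than half of the trees.

```python
from collections import Counter


def clade_support(trees):
    """Fraction of trees containing each clade (a frozenset of leaf labels)."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: count / len(trees) for clade, count in counts.items()}


def majority_rule(trees, threshold=0.5):
    """Clades present in more than `threshold` of the trees, with support."""
    return {
        clade: support
        for clade, support in clade_support(trees).items()
        if support > threshold
    }


# Hypothetical clade sets, one per method; not data from the article.
trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AC"), frozenset("ABC")},
]

for clade, support in majority_rule(trees).items():
    print(sorted(clade), round(support, 2))
```

In this toy input the clades {A, B} and {A, B, C} each occur in two of three trees and are retained with support 0.67, while {A, B, D} and {A, C} occur once and are dropped.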