Interpreting alignment-free sequence comparison: what makes a score a good score?

Martin T Swain, Martin Vickers.

Abstract

Alignment-free methods are alternatives to alignment-based methods when searching sequence data sets. The output from an alignment-free sequence comparison is a similarity score, the interpretation of which is not straightforward. We propose objective functions to interpret and calibrate outputs from alignment-free searches, noting that different objective functions are necessary for different biological contexts. One advantage of this approach is that visualising and comparing score distributions, including those from true positives, may be a relatively simple way to gain insight into the performance of different metrics. Using an empirical approach with both DNA and protein sequences, we characterise the similarity score distributions generated under different parameters. In particular, we demonstrate how sequence length can affect the scores. We show that scores of true positive sequence pairs may correlate significantly with their mean length, and that even if the correlation is weak, the relative difference in length of the sequence pair may significantly reduce the effectiveness of alignment-free metrics. Importantly, we show how objective functions can be used with test data to accurately estimate the probability of true positives, which can significantly increase the utility of alignment-free approaches. Finally, we have developed a general-purpose software tool called KAST for use in high-throughput workflows on Linux clusters.
© The Author(s) 2022. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
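
The abstract's point that sequence length can affect a similarity score can be illustrated with a minimal sketch. The abstract does not say which metrics were evaluated, so the example below assumes one common alignment-free approach: a Euclidean distance between k-mer count vectors. The function names (kmer_counts, euclidean_distance), the choice k=3 and the toy sequences are hypothetical and are not taken from the paper or from KAST.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean_distance(a, b, k=3, normalise=False):
    """Euclidean distance between the k-mer profiles of a and b.

    Illustrative assumption only: with normalise=False raw counts are
    compared; with normalise=True counts are converted to frequencies.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    if normalise:
        ta, tb = sum(ca.values()), sum(cb.values())
        fa = {m: c / ta for m, c in ca.items()}
        fb = {m: c / tb for m, c in cb.items()}
    else:
        fa, fb = ca, cb
    # Compare over the union of k-mers; a missing k-mer counts as zero.
    kmers = set(fa) | set(fb)
    return sqrt(sum((fa.get(m, 0) - fb.get(m, 0)) ** 2 for m in kmers))


query = "ACGTACGTAC"
longer = query * 3  # same motif, three times the length

raw = euclidean_distance(query, longer)
norm = euclidean_distance(query, longer, normalise=True)
print(f"raw counts: {raw:.3f}  frequencies: {norm:.3f}")
```

In this toy case the raw-count distance is inflated by the length difference alone, while the frequency-based distance is much smaller. That is one reason a score is hard to interpret without calibrating it against pairs of known relationship.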

Year: 2022    PMID: 36071721    PMCID: PMC9442500    DOI: 10.1093/nargab/lqac062

Source DB: PubMed    Journal: NAR Genom Bioinform    ISSN: 2631-9268


