
Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models.

Hani Z Girgis, Benjamin T James, Brian B Luczak.

Abstract

Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms take quadratic time, which makes them slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is the identity score (the percentage of identical nucleotides in an optimal alignment, including gaps, of two sequences); there is no need to visualize how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being 2-80 times faster than BLAST, Mash, MUMmer4 and USEARCH. Identity was the best-performing tool when searching for low-identity matches. When phylogenetic trees were constructed from about 6000 transcripts, the tree built from the scores reported by Identity was the closest to the reference tree (in contrast to the trees built from the scores of andi, FSWM and Mash). Identity is capable of producing pairwise identity scores of millions-of-nucleotides-long bacterial genomes; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
© The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
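
To make the definition of the identity score concrete, the following minimal Python sketch computes it the slow way, from a quadratic-time global alignment, and contrasts it with a linear-time alignment-free statistic. It is an illustration only and is not the method implemented in Identity: the scoring scheme (match = 1, mismatch = -1, gap = -1), the function names `global_identity` and `kmer_cosine`, and the choice of cosine similarity over k-mer counts are all assumptions made for this example. The abstract does not say which alignment-free statistics or model inputs the tool uses.

```python
from collections import Counter
from math import sqrt


def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Identity = matching columns / all alignment columns (gaps included).

    Uses Needleman-Wunsch with an assumed linear gap penalty; O(len(a) * len(b)).
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path, counting matches and alignment columns.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                matches += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    # Two empty sequences are treated as identical (an arbitrary convention).
    return matches / columns if columns else 1.0


def kmer_cosine(a, b, k=3):
    """Cosine similarity of k-mer count vectors; linear in sequence length."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    dot = sum(count * cb[kmer] for kmer, count in ca.items())
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    print(global_identity("ACGT", "AGGT"))   # one mismatch in four columns
    print(global_identity("ACGT", "ACT"))    # one gap column in four columns
    print(kmer_cosine("ACGTACGT", "ACGTTCGT"))
```

The first function needs a full dynamic-programming table, which is what becomes impractical for long sequences. The second only counts k-mers, so it scales linearly, but its output is a similarity statistic rather than an identity score; mapping such statistics to identity scores is the role the abstract assigns to the learned linear models.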


Year:  2021        PMID: 33554117      PMCID: PMC7850047          DOI: 10.1093/nargab/lqab001

Source DB:  PubMed          Journal:  NAR Genom Bioinform        ISSN: 2631-9268


