Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models.

Hani Z Girgis, Benjamin T James, Brian B Luczak.

Abstract

Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms run in quadratic time, which makes them slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is the identity scores (the percentage of identical nucleotides in an optimal alignment, including gaps, of two sequences); there is no need to visualize how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being 2 to 80 times faster than BLAST, Mash, MUMmer4 and USEARCH. Identity was the best-performing tool when searching for low-identity matches. In constructing phylogenetic trees from about 6000 transcripts, the tree built from the scores reported by Identity was the closest to the reference tree (in contrast to andi, FSWM and Mash). Identity can produce pairwise identity scores for bacterial genomes that are millions of nucleotides long; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
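As a rough illustration of the two quantities the abstract contrasts, the sketch below computes an identity score from an already-gapped pairwise alignment and, separately, a simple alignment-free statistic from k-mer histograms built in one linear pass. It is not the Identity tool's implementation: the function names, the choice of k = 3, and the use of Manhattan distance are assumptions made only for this example, and the step that maps such statistics to a predicted identity score with a general linear model is not shown.

```python
from collections import Counter


def identity_from_alignment(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical columns in a gapped pairwise alignment.

    Gap columns count toward the alignment length, matching the
    "including gaps" wording of the definition in the abstract.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


def kmer_histogram(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in a single linear pass (k is an assumed value)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def manhattan_distance(h1: Counter, h2: Counter) -> int:
    """One illustrative alignment-free statistic between two k-mer histograms."""
    return sum(abs(h1[key] - h2[key]) for key in h1.keys() | h2.keys())


if __name__ == "__main__":
    # Identity of a hand-made alignment with one mismatch and one gap.
    print(identity_from_alignment("ACGT-A", "ACCTTA"))

    # Alignment-free comparison of two short, unaligned sequences.
    h1 = kmer_histogram("ACGTACG")
    h2 = kmer_histogram("ACGTTCG")
    print(manhattan_distance(h1, h2))
```

The first function needs an alignment as input, which is the quadratic-time step the abstract refers to; the histogram-based statistic needs only the raw sequences, which is why alignment-free features can be computed in linear time.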


Year: 2021 PMID: 33554117 PMCID: PMC7850047 DOI: 10.1093/nargab/lqab001

Source DB: PubMed Journal: NAR Genom Bioinform ISSN: 2631-9268

43 in total

Review 1. Alignment-free sequence comparison-a review.

Authors: Susana Vinga; Jonas Almeida
Journal: Bioinformatics Date: 2003-03-01 Impact factor: 6.937

2. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

3. Compressive genomics.

Authors: Po-Ru Loh; Michael Baym; Bonnie Berger
Journal: Nat Biotechnol Date: 2012-07-10 Impact factor: 54.908

4. MeShClust: an intelligent tool for clustering DNA sequences.

Authors: Benjamin T James; Brian B Luczak; Hani Z Girgis
Journal: Nucleic Acids Res Date: 2018-08-21 Impact factor: 16.971

5. Accuracy of phylogeny reconstruction methods combining overlapping gene data sets.

Authors: Anne Kupczok; Heiko A Schmidt; Arndt von Haeseler
Journal: Algorithms Mol Biol Date: 2010-12-06 Impact factor: 1.405

6. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation.

Authors: Torbjørn Rognes
Journal: BMC Bioinformatics Date: 2011-06-01 Impact factor: 3.169

7. Red: an intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale.

Authors: Hani Z Girgis
Journal: BMC Bioinformatics Date: 2015-07-24 Impact factor: 3.169

8. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

Authors: Ruibang Luo; Binghang Liu; Yinlong Xie; Zhenyu Li; Weihua Huang; Jianying Yuan; Guangzhu He; Yanxiang Chen; Qi Pan; Yunjie Liu; Jingbo Tang; Gengxiong Wu; Hao Zhang; Yujian Shi; Yong Liu; Chang Yu; Bo Wang; Yao Lu; Changlei Han; David W Cheung; Siu-Ming Yiu; Shaoliang Peng; Zhu Xiaoqian; Guangming Liu; Xiangke Liao; Yingrui Li; Huanming Yang; Jian Wang; Tak-Wah Lam; Jun Wang
Journal: Gigascience Date: 2012-12-27 Impact factor: 6.524

2. MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores.

Authors: Hani Z Girgis
Journal: BMC Genomics Date: 2022-06-06 Impact factor: 4.547

2 in total