
SeqWho: Reliable, Rapid Determination of Sequence File Identity using k-mer Frequencies in Random Forest Classifiers.

Christopher Bennett, Micah Thornton, Chanhee Park, Gervaise Henry, Yun Zhang, Venkat Malladi, Daehwan Kim.

Abstract

MOTIVATION: With vast improvements in sequencing technologies and a growing number of protocols, sequencing is being used to answer complex biological problems. Consequently, analysis pipelines have become more time-consuming and complicated, usually requiring extensive pre-validation steps. Here we present SeqWho, a program designed to heuristically assess the quality of sequencing files and reliably classify the organism and protocol type by using Random Forest classifiers trained on biases native to k-mer frequencies and repeat sequence identities.
RESULTS: Using one of our primary models, we show that our method accurately and rapidly classifies human and mouse sequences from nine different sequencing libraries by species, library, and both together, 98.32%, 97.86%, and 96.38% of the time respectively. Ultimately, we demonstrate that SeqWho is a powerful method for reliably validating the quality and identity of the sequencing files used in any pipeline.
AVAILABILITY: https://github.com/DaehwanKimLab/seqwho.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) (2022). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
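
The abstract describes classifiers trained on k-mer frequencies. As a rough illustration of what such a feature vector might look like, the sketch below counts k-mers across a handful of reads and normalizes the counts into frequencies in a fixed order. It is not SeqWho's implementation: the function name kmer_frequencies, the choice of k=2, and the example reads are all assumptions made for illustration, and the repeat sequence features mentioned in the abstract are not covered.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(reads, k=2):
    """Return normalized k-mer frequencies over all reads, in a fixed order.

    Hypothetical helper for illustration only; not taken from SeqWho.
    """
    alphabet = "ACGT"
    # Enumerate all 4**k possible k-mers so every file maps to the same columns.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing N or any other non-ACGT symbol.
            if all(base in alphabet for base in kmer):
                counts[kmer] += 1
    total = sum(counts.values())
    # With no valid k-mers, return an all-zero vector instead of dividing by zero.
    return [counts[km] / total if total else 0.0 for km in kmers]


# Made-up reads; a real input would be sampled from a FASTQ file.
reads = ["ACGTAC", "GGNAC"]
features = kmer_frequencies(reads, k=2)
```

A vector like features, computed per file, is the kind of input a Random Forest classifier (for example, scikit-learn's RandomForestClassifier) could be trained on, with species or library type as the label.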

Year:  2022        PMID: 35134110      PMCID: PMC8963323          DOI: 10.1093/bioinformatics/btac050

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937

