Kristoffer Sahlin, Paul Medvedev.
Abstract
Long-read sequencing of transcripts with Pacific Biosciences (PacBio) Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (to scale) and makes use of quality values (to handle variable error rates). We test isONclust on three simulated and five biological data sets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches in terms of overall accuracy and/or scalability to large data sets.
Keywords: algorithms; clustering; long-read sequencing; sequencing data analysis; third-generation sequencing; transcriptomics
Year: 2020 PMID: 32181688 PMCID: PMC8884114 DOI: 10.1089/cmb.2019.0299
Source DB: PubMed Journal: J Comput Biol ISSN: 1066-5277 Impact factor: 1.479