
Alignment-free clustering of UMI tagged DNA molecules.

Baraa Orabi1, Emre Erhan1, Brian McConeghy2, Stanislav V Volik2, Stephane Le Bihan2, Robert Bell2, Colin C Collins2,3, Cedric Chauve4, Faraz Hach2,3.   

Abstract

MOTIVATION: Next-Generation Sequencing has led to the availability of massive genomic datasets whose processing raises many challenges, including the handling of sequencing errors. This is especially pertinent in cancer genomics, e.g. for detecting low allele frequency variations from circulating tumor DNA. Barcode tagging of DNA molecules with unique molecular identifiers (UMI) attempts to mitigate sequencing errors; UMI tagged molecules are polymerase chain reaction (PCR) amplified, and the PCR copies of UMI tagged molecules are sequenced independently. However, the PCR and sequencing steps can generate errors in the sequenced reads that can be located in the barcode and/or the DNA sequence. Analyzing UMI tagged sequencing data requires an initial clustering step, with the aim of grouping reads sequenced from PCR duplicates of the same UMI tagged molecule into a single cluster, and the size of the current datasets requires this clustering process to be resource-efficient.
RESULTS: We introduce Calib, a computational tool that clusters paired-end reads from UMI tagged sequencing experiments generated by substitution-error-dominant sequencing platforms such as Illumina. Calib clusters are defined as connected components of a graph whose edges are defined in terms of both barcode similarity and read sequence similarity. The graph is constructed efficiently using locality sensitive hashing and MinHashing techniques. Calib's default clustering parameters are optimized empirically, for different UMI and read lengths, using a simulation module that is packaged with Calib. Compared to other tools, Calib has the best accuracy on simulated data, while maintaining reasonable runtime and memory footprint. On a real dataset, Calib runs with far fewer resources than alignment-based methods, and its clusters reduce the number of tentative false positives in downstream variation calling.
AVAILABILITY AND IMPLEMENTATION: Calib is implemented in C++ and its simulation module is implemented in Python. Calib is available at https://github.com/vpc-ccg/calib.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
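
The clustering idea described under RESULTS (reads as graph nodes, an edge when both the barcodes and the read sequences are similar, clusters as connected components) can be illustrated with a minimal sketch. This is not Calib's implementation or API: the function names, parameters, and default values below are hypothetical, each read is simplified to a single (barcode, sequence) pair rather than a paired-end read, and a brute-force all-pairs comparison stands in for the locality sensitive hashing that Calib uses to avoid comparing every pair.

```python
import hashlib
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, n_segments):
    """Split seq into n_segments chunks; keep the minimum-hash k-mer of each."""
    seg_len = len(seq) // n_segments
    sketch = []
    for i in range(n_segments):
        segment = seq[i * seg_len:(i + 1) * seg_len]
        kmers = [segment[j:j + k] for j in range(len(segment) - k + 1)]
        sketch.append(min(kmers, key=kmer_hash) if kmers else None)
    return sketch


def cluster_reads(reads, max_barcode_errors=1, k=4, n_segments=3, min_shared=2):
    """Cluster (barcode, sequence) reads as connected components.

    Two reads are joined when their barcodes differ in at most
    max_barcode_errors positions AND at least min_shared of their
    per-segment minimizers agree. All parameter values are assumptions.
    """
    parent = list(range(len(reads)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    sketches = [minimizers(seq, k, n_segments) for _, seq in reads]
    # Brute-force pair enumeration: fine for a toy input, quadratic in general.
    for i, j in combinations(range(len(reads)), 2):
        if hamming(reads[i][0], reads[j][0]) > max_barcode_errors:
            continue
        shared = sum(
            a is not None and a == b for a, b in zip(sketches[i], sketches[j])
        )
        if shared >= min_shared:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(reads)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Calling `cluster_reads` on a list of `(barcode, sequence)` tuples returns lists of read indices, one list per cluster. Requiring agreement on both the barcode and the sequence sketch means that two different molecules that happen to receive similar barcodes are not merged, while a substitution error in either the barcode or one sequence segment does not split a true cluster.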

Year:  2019        PMID: 30351359     DOI: 10.1093/bioinformatics/bty888

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


