Robert S Harris, Monika Cechova, Kateryna D Makova.
Abstract
SUMMARY: Tandem DNA repeats can be sequenced with long-read technologies, but cannot be accurately deciphered due to the lack of computational tools taking high error rates of these technologies into account. Here we introduce Noise-Cancelling Repeat Finder (NCRF) to uncover putative tandem repeats of specified motifs in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. Using simulations, we validated the use of NCRF to locate tandem repeats with motifs of various lengths and demonstrated its superior performance as compared to two alternative tools. Using real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat involved in heat shock stress response.
Year: 2019 PMID: 31290946 PMCID: PMC6853708 DOI: 10.1093/bioinformatics/btz484
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (A) Performance of NCRF, TRF, and Minimap2 on simulated PacBio (upper panel) and Nanopore (lower panel) reads, binned by motif lengths. Solid bars are TPRs, crosshatched bars are FDRs. All 847 arrays totaled 822 kb, with 197 arrays and 170 kb for lengths 2–36 bp, 156/158 kb for 37–47 bp, 158/173 kb for 48–59 bp, 172/162 kb for 60–80 bp and 164/160 kb for 81–198 bp. (B) Observed lengths of (AATGG)n arrays (with and without consensus filtering) in PacBio and Nanopore reads. Reads were subsampled to a similar length distribution of 16.5 Gb (Supplementary Note S5). Filtered and unfiltered results for PacBio are very similar.
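The TPR and FDR bars in panel A can be read with the standard definitions TPR = TP / (TP + FN) and FDR = FP / (TP + FP). The following is a minimal sketch of that arithmetic, assuming scoring is done per base over half-open (start, end) intervals on a single read. The record above does not state whether the authors scored per base or per array, so the scoring unit, the function names `interval_bases` and `tpr_fdr`, and the example coordinates are all hypothetical.

```python
def interval_bases(intervals):
    """Return the set of base positions covered by half-open (start, end) intervals."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return covered


def tpr_fdr(true_intervals, called_intervals):
    """Compute per-base TPR and FDR (assumed scoring unit: one base)."""
    truth = interval_bases(true_intervals)
    called = interval_bases(called_intervals)
    tp = len(truth & called)   # repeat bases that were called
    fn = len(truth - called)   # repeat bases that were missed
    fp = len(called - truth)   # called bases outside any true repeat
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return tpr, fdr


# Hypothetical example: one simulated array and one shifted call.
truth = [(100, 200)]
calls = [(120, 220)]
tpr, fdr = tpr_fdr(truth, calls)
print(f"TPR={tpr:.2f} FDR={fdr:.2f}")
```

Representing positions as a set keeps the sketch short; for whole-genome data an interval-based overlap computation would be more memory-efficient.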