Noise-cancelling repeat finder: uncovering tandem repeats in error-prone long-read sequencing data.

Robert S Harris¹, Monika Cechova¹, Kateryna D Makova¹,².

Abstract

SUMMARY: Tandem DNA repeats can be sequenced with long-read technologies, but cannot be accurately deciphered due to the lack of computational tools taking high error rates of these technologies into account. Here we introduce Noise-Cancelling Repeat Finder (NCRF) to uncover putative tandem repeats of specified motifs in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. Using simulations, we validated the use of NCRF to locate tandem repeats with motifs of various lengths and demonstrated its superior performance as compared to two alternative tools. Using real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat involved in heat shock stress response.
AVAILABILITY AND IMPLEMENTATION: NCRF is implemented in C, supported by several Python scripts, and is available in bioconda and at https://github.com/makovalab-psu/NoiseCancellingRepeatFinder.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 31290946 PMCID: PMC6853708 DOI: 10.1093/bioinformatics/btz484

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Long tandem repeat (TR) arrays are associated with heterochromatin and play critical roles in the human genome. For instance, (TTAGGG)n TRs protect telomeres (Blackburn and Gall, 1978), (AATGG)n repeats are implicated in heat shock response (Goenka ), and the lengths of heterochromatin-associated TRs differ across populations (Altemose ; Wevrick and Willard, 1989) and change with aging and environmental exposure (Goenka ; Zhang ). Despite these important features of TRs, their length variation has been understudied due to a lack of experimental and computational techniques able to capture their full length. Long TRs cannot be studied with short sequencing reads, but can be profiled with long-read technologies (Pacific Biosciences, or PacBio, and Oxford Nanopore, or Nanopore). However, they are difficult to decipher because such technologies have distinctive error profiles (see below). Moreover, they are often absent from reference genomes and assemblies (Peona ). To our knowledge, no tool currently exists to identify TR arrays in long, error-prone reads. Tools solving similar problems, primarily developed to work with short reads or assembled genomes, have limitations when applied to this use case (Lower ). Some fail to consider unequal rates of insertions versus deletions [e.g. Tandem Repeats Finder, or TRF (Benson, 1999)]; others do not permit high sequencing error rates (e.g. short read mappers). General purpose aligners, e.g. Minimap2 (Li, 2018), even with parameterizations for long-read sequencing technologies, are not designed to find TRs. To address the shortcomings of existing tools in identifying user-specified TR arrays directly from error-prone long sequencing reads, we developed Noise-Cancelling Repeat Finder (NCRF). NCRF supports high and unequal rates of short insertions and deletions observed in long-read sequencing data. As a result, its performance is superior to alternative tools.

2 Developing NCRF

The aligner at the core of NCRF finds alignments of a given motif to a segment of a given DNA sequence, with the motif repeated as often as needed. It is a Smith–Waterman aligner (Smith and Waterman, 1981) with affine gap penalties. It uses a typical dynamic programming matrix with a row for each nucleotide in a single copy of the motif and a column for each nucleotide in the sequence, allowing for wraparound from the end to the beginning of the motif (Supplementary Note S1A). Typically, an alignment core uses a score for matches and penalties for mismatches and indels, but we allow different penalties for insertions and deletions because sequencing technologies can be biased as to which type of indel they introduce. Thus, technology-specific scoring parameters are tuned to observed sequencing error profiles (Supplementary Notes S1D, S2). The dynamic programming recurrence is further modified to support a high prevalence of short indels (Supplementary Note S1A). A finishing step discards alignment pieces with a high density of mismatches and indels, retaining only high-quality alignments (Supplementary Note S1B). Alignments identify intervals that putatively align to perfectly repeated copies of a motif. However, segments containing a mix of motif variants, or a similar motif, may also be reported. Such mixes are consistent with known evolutionary signatures of heterochromatic repeats (Plohl ). An optional consensus filtering step eliminates TR arrays lacking a single dominant motif. Intervals reported for more than one motif can be identified with an optional overlap-detection step. See Supplementary Note S1 and Supplementary Figure S1 for details.
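The wraparound idea can be illustrated with a minimal sketch. This is not NCRF's implementation: it uses linear rather than affine gap penalties, omits the short-indel modification and the finishing step, and returns only the best local score. The function name `wraparound_score` and the default scores are assumptions chosen for illustration, not NCRF's tuned parameters. The one property it shares with the description above is that insertions and deletions carry separate penalties and that the motif row index wraps from the last motif position back to the first.

```python
def wraparound_score(seq, motif, match=1, mismatch=-1, insertion=-1, deletion=-1):
    """Best local score of seq against an endlessly repeated motif.

    Hypothetical sketch: linear gap penalties, scores chosen for illustration.
    insertion: penalty for a sequence base absent from the motif.
    deletion:  penalty for a motif base absent from the sequence.
    """
    m = len(motif)
    prev = [0] * m          # scores for the previous sequence position
    best = 0
    for base in seq:
        cur = [0] * m
        for j in range(m):
            # Diagonal move; (j - 1) % m wraps from motif start to motif end.
            diag = prev[(j - 1) % m] + (match if base == motif[j] else mismatch)
            # Sequence base consumed without advancing in the motif.
            ins = prev[j] + insertion
            cur[j] = max(0, diag, ins)
        # Deletions move along the motif within one column. The dependency is
        # cyclic, so two passes around the motif are used to propagate it.
        for _ in range(2):
            for j in range(m):
                cur[j] = max(cur[j], cur[(j - 1) % m] + deletion)
        best = max(best, max(cur))
        prev = cur
    return best


if __name__ == "__main__":
    perfect = "CCCC" + "AATGG" * 3 + "TTT"
    one_deletion = "AATGG" + "AAGG" + "AATGG"   # middle copy lacks its T
    print(wraparound_score(perfect, "AATGG"))
    print(wraparound_score(one_deletion, "AATGG"))
```

Passing different values for `insertion` and `deletion` mimics, in a very reduced form, the technology-specific scoring described above.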

3 Analysis of simulated reads and tool comparison

We simulated PacBio and Nanopore sequencing reads for a mock genome mimicking the presence of long repeat arrays in the human reference genome (Supplementary Note S3). NCRF discovered 99% and 91% of the specified TRs in PacBio and Nanopore reads, respectively (Fig. 1A and Supplementary Table S2). In comparison, TRF discovered only 72% and 33%, and Minimap2 60% and 63%, of the TRs in PacBio and Nanopore reads, respectively. However, the false discovery rate (FDR) was much higher for NCRF than for TRF and Minimap2. We therefore introduced the optional consensus filtering step in NCRF, which reduced the FDR to below 1% while still outperforming both TRF and Minimap2 in true positive rate (TPR). For the remainder of this section, we refer only to consensus-filtered results.
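As a rough illustration of the two metrics, the sketch below computes TPR and FDR at the level of individual bases from half-open (start, end) intervals on a single read. The base-level definition, the interval representation and the function names are assumptions made for this example; the evaluation procedure actually used is described in the Supplementary Notes.

```python
def covered_bases(intervals):
    """Set of positions covered by half-open (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def tpr_fdr(truth, reported):
    """Base-level TPR and FDR (assumed definition, for illustration only).

    TPR = true repeat bases that were reported / all true repeat bases
    FDR = reported bases that are not true repeat bases / all reported bases
    """
    true_bases = covered_bases(truth)
    reported_bases = covered_bases(reported)
    true_positives = len(true_bases & reported_bases)
    tpr = true_positives / len(true_bases) if true_bases else 0.0
    fdr = (len(reported_bases) - true_positives) / len(reported_bases) if reported_bases else 0.0
    return tpr, fdr


if __name__ == "__main__":
    truth = [(0, 100)]                  # one simulated 100-bp array
    reported = [(10, 100), (200, 210)]  # one partial hit, one spurious hit
    print(tpr_fdr(truth, reported))
```

Under this definition, a tool can raise its TPR by reporting more intervals, but every spurious base also raises its FDR, which is the trade-off the consensus filter addresses.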

Fig. 1.

(A) Performance of NCRF, TRF and Minimap2 on simulated PacBio (upper panel) and Nanopore (lower panel) reads, binned by motif length. Solid bars are TPRs; crosshatched bars are FDRs. All 847 arrays totaled 822 kb, with 197 arrays and 170 kb for lengths 2–36 bp, 156 arrays and 158 kb for 37–47 bp, 158 arrays and 173 kb for 48–59 bp, 172 arrays and 162 kb for 60–80 bp and 164 arrays and 160 kb for 81–198 bp. (B) Observed lengths of (AATGG)n arrays (with and without consensus filtering) in PacBio and Nanopore reads. Reads were subsampled to a similar length distribution of 16.5 Gb (Supplementary Note S5). Filtered and unfiltered results for PacBio are very similar.

Further, we studied how the performance of all three tools was affected by the motif length. For this analysis, we divided mock repeat arrays into five bins by motif length (2–36 bp, 37–47 bp, 48–59 bp, 60–80 bp and 81–198 bp), each bin having ∼20% of the total repeat bases in the mock genome. In the two shortest bins, NCRF had TPRs of 97% for PacBio and 87% for Nanopore. This rate decreased as motifs grew longer, to 93% and 81%, respectively, for the middle bin, to 78% and 64% for the fourth bin, and to 45% and 36% for the longest bin. The same trend was observed for TRF, with TPR decreasing for longer bins. In all bins, NCRF's TPR was higher than TRF's. For PacBio, NCRF's TPR was between 8% and 13% higher than TRF's; for Nanopore, it was 27% to 45% higher. In contrast, TPR for Minimap2 fluctuated, apparently independent of the motif length. Still, NCRF had a higher TPR than Minimap2 for the short and middle bins, as well as for the fourth bin for PacBio. NCRF's FDR was below 1.2% across the board. TRF had a better (lower) FDR in all bins but one; however, this minor advantage (typically <0.2%) pales in comparison to NCRF's gain in TPR. Minimap2's FDR was worse than that of both NCRF and TRF in all bins. Surprisingly, both TRF and Minimap2 occasionally reported overlapping intervals for the same motif (Supplementary Table S2).
Several other tools were considered for this evaluation but rejected after preliminary investigation (Supplementary Note S4).

4 Applying NCRF to real sequencing data

Lastly, we applied NCRF to investigate perfect repeats of (AATGG)n in publicly available PacBio and Nanopore sequencing data (Jain ; Zook ) generated for the same individual, subsampled to a 16.5 Gb common read length distribution (Supplementary Note S5). Searching for >500-bp repeats of (AATGG)n, NCRF identified 8883 repeats in PacBio reads covering 9.8 Mb, an average of 0.6 bp per kb sequenced (Fig. 1B). In Nanopore reads, it identified 9947 repeats covering 35.6 Mb, an average of 2.2 bp per kb sequenced. Additional applications of NCRF to real sequencing data, as well as potential reasons behind differences in density between technologies, are presented in Cechova .
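The reported densities follow directly from the totals given above. The short check below reproduces the arithmetic; the function name `density_bp_per_kb` is hypothetical, and the inputs are the rounded values from the text (9.8 Mb and 35.6 Mb of repeats, 16.5 Gb sequenced per technology).

```python
def density_bp_per_kb(repeat_bp, sequenced_bp):
    """Repeat bases found per kilobase of sequence (illustrative helper)."""
    return repeat_bp / (sequenced_bp / 1000)


SEQUENCED_BP = 16.5e9  # 16.5 Gb subsample for each technology

pacbio_density = density_bp_per_kb(9.8e6, SEQUENCED_BP)     # 9.8 Mb of repeats
nanopore_density = density_bp_per_kb(35.6e6, SEQUENCED_BP)  # 35.6 Mb of repeats

if __name__ == "__main__":
    print(f"PacBio:   {pacbio_density:.1f} bp per kb")
    print(f"Nanopore: {nanopore_density:.1f} bp per kb")
```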

5 Conclusions

To our knowledge, NCRF is the first tool designed specifically to identify TR arrays in noisy and reference-free sequencing data, accounting for the unique characteristics of the long-read technologies. We anticipate NCRF will accelerate research of heterochromatin-associated TR arrays and will aid in unraveling their functions in the genome.

Funding

This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award No. R01GM130691. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Funding was also provided by the Eberly College of Sciences, The Huck Institute of Life Sciences, and the Institute for CyberScience, at Penn State, as well as under grants from the Pennsylvania Department of Health using Tobacco Settlement and CURE Funds.

Conflict of Interest: none declared.

13 in total

Review 1. Satellite DNAs between selfishness and functionality: structure, genomics and evolution of tandem repeats in centromeric (hetero)chromatin.

Authors: Miroslav Plohl; Andrea Luchetti; Nevenka Mestrović; Barbara Mantovani
Journal: Gene Date: 2007-12-04 Impact factor: 3.688

2. Aging stem cells. A Werner syndrome stem cell model unveils heterochromatin alterations as a driver of human aging.

Authors: Weiqi Zhang; Jingyi Li; Keiichiro Suzuki; Jing Qu; Ping Wang; Junzhi Zhou; Xiaomeng Liu; Ruotong Ren; Xiuling Xu; Alejandro Ocampo; Tingting Yuan; Jiping Yang; Ying Li; Liang Shi; Dee Guan; Huize Pan; Shunlei Duan; Zhichao Ding; Mo Li; Fei Yi; Ruijun Bai; Yayu Wang; Chang Chen; Fuquan Yang; Xiaoyu Li; Zimei Wang; Emi Aizawa; April Goebl; Rupa Devi Soligalla; Pradeep Reddy; Concepcion Rodriguez Esteban; Fuchou Tang; Guang-Hui Liu; Juan Carlos Izpisua Belmonte
Journal: Science Date: 2015-04-30 Impact factor: 47.728

3. Tandem repeats finder: a program to analyze DNA sequences.

Authors: G Benson
Journal: Nucleic Acids Res Date: 1999-01-15 Impact factor: 16.971

4. Long-range organization of tandem arrays of alpha satellite DNA at the centromeres of human chromosomes: high-frequency array-length polymorphism and meiotic stability.

Authors: R Wevrick; H F Willard
Journal: Proc Natl Acad Sci U S A Date: 1989-12 Impact factor: 11.205

5. How complete are "complete" genome assemblies?-An avian perspective.

Authors: Valentina Peona; Matthias H Weissensteiner; Alexander Suh
Journal: Mol Ecol Resour Date: 2018-08-16 Impact factor: 7.090

Review 6. Satellite DNA evolution: old ideas, new approaches.

Authors: Sarah Sander Lower; Michael P McGurk; Andrew G Clark; Daniel A Barbash
Journal: Curr Opin Genet Dev Date: 2018-03-23 Impact factor: 5.578

7. Identification of common molecular subsequences.

Authors: T F Smith; M S Waterman
Journal: J Mol Biol Date: 1981-03-25 Impact factor: 5.469

8. Human satellite-III non-coding RNAs modulate heat-shock-induced transcriptional repression.

Authors: Anshika Goenka; Sonali Sengupta; Rajesh Pandey; Rashmi Parihar; Girish Chandra Mohanta; Mitali Mukerji; Subramaniam Ganesh
Journal: J Cell Sci Date: 2016-08-15 Impact factor: 5.285

9. Nanopore sequencing and assembly of a human genome with ultra-long reads.

Authors: Miten Jain; Sergey Koren; Karen H Miga; Josh Quick; Arthur C Rand; Thomas A Sasani; John R Tyson; Andrew D Beggs; Alexander T Dilthey; Ian T Fiddes; Sunir Malla; Hannah Marriott; Tom Nieto; Justin O'Grady; Hugh E Olsen; Brent S Pedersen; Arang Rhie; Hollian Richardson; Aaron R Quinlan; Terrance P Snutch; Louise Tee; Benedict Paten; Adam M Phillippy; Jared T Simpson; Nicholas J Loman; Matthew Loose
Journal: Nat Biotechnol Date: 2018-01-29 Impact factor: 54.908

10. Genomic characterization of large heterochromatic gaps in the human genome assembly.

Authors: Nicolas Altemose; Karen H Miga; Mauro Maggioni; Huntington F Willard
Journal: PLoS Comput Biol Date: 2014-05-15 Impact factor: 4.475

16 in total

1. Finding and Characterizing Repeats in Plant Genomes.

Authors: Jacques Nicolas; Sébastien Tempel; Anna-Sophie Fiston-Lavier; Emira Cherif
Journal: Methods Mol Biol Date: 2022

Review 2. Revisiting tandem repeats in psychiatric disorders from perspectives of genetics, physiology, and brain evolution.

Authors: Xiao Xiao; Chu-Yi Zhang; Zhuohua Zhang; Zhonghua Hu; Ming Li; Tao Li
Journal: Mol Psychiatry Date: 2021-10-14 Impact factor: 15.992

3. Evolutionary Dynamics of Abundant 7-bp Satellites in the Genome of Drosophila virilis.

Authors: Jullien M Flynn; Manyuan Long; Rod A Wing; Andrew G Clark
Journal: Mol Biol Evol Date: 2020-05-01 Impact factor: 16.240

Review 4. Genomic Tackling of Human Satellite DNA: Breaking Barriers through Time.

Authors: Mariana Lopes; Sandra Louzada; Margarida Gama-Carvalho; Raquel Chaves
Journal: Int J Mol Sci Date: 2021-04-29 Impact factor: 5.923

5. High satellite repeat turnover in great apes studied with short- and long-read technologies.

Authors: Monika Cechova; Robert S Harris; Marta Tomaszkiewicz; Barbara Arbeithuber; Francesca Chiaromonte; Kateryna D Makova
Journal: Mol Biol Evol Date: 2019-07-02 Impact factor: 16.240

6. TRiCoLOR: tandem repeat profiling using whole-genome long-read sequencing data.

Authors: Davide Bolognini; Alberto Magi; Vladimir Benes; Jan O Korbel; Tobias Rausch
Journal: Gigascience Date: 2020-10-07 Impact factor: 6.524

Review 7. Probably Correct: Rescuing Repeats with Short and Long Reads.

Authors: Monika Cechova
Journal: Genes (Basel) Date: 2020-12-31 Impact factor: 4.096

8. Brain Regional Differences in Hexanucleotide Repeat Length in X-Linked Dystonia-Parkinsonism Using Nanopore Sequencing.

Authors: Charles Jourdan Reyes; Björn-Hergen Laabs; Susen Schaake; Theresa Lüth; Raphaela Ardicoglu; Aleksandar Rakovic; Karen Grütz; Daniel Alvarez-Fischer; Roland Dominic Jamora; Raymond L Rosales; Imke Weyers; Inke R König; Norbert Brüggemann; Christine Klein; Valerija Dobricic; Ana Westenberger; Joanne Trinh
Journal: Neurol Genet Date: 2021-07-06

9. A Long-Term Conserved Satellite DNA That Remains Unexpanded in Several Genomes of Characiformes Fish Is Actively Transcribed.

Authors: Rodrigo Zeni Dos Santos; Rodrigo Milan Calegari; Duílio Mazzoni Zerbinato de Andrade Silva; Francisco J Ruiz-Ruano; Silvana Melo; Claudio Oliveira; Fausto Foresti; Marcela Uliano-Silva; Fábio Porto-Foresti; Ricardo Utsunomia
Journal: Genome Biol Evol Date: 2021-02-03 Impact factor: 3.416

10. The string decomposition problem and its applications to centromere analysis and assembly.

Authors: Tatiana Dvorkina; Andrey V Bzikadze; Pavel A Pevzner
Journal: Bioinformatics Date: 2020-07-01 Impact factor: 6.937