LTR_FINDER_parallel: parallelization of LTR_FINDER enabling rapid identification of long terminal repeat retrotransposons.

Shujun Ou, Ning Jiang.

Abstract

Annotation of plant genomes is still a challenging task due to the abundance of repetitive sequences, especially long terminal repeat (LTR) retrotransposons. LTR_FINDER is a widely used program for the identification of LTR retrotransposons but its application on large genomes is hindered by its single-threaded processes. Here we report an accessory program that allows parallel operation of LTR_FINDER, resulting in up to 8500X faster identification of LTR elements. It takes only 72 min to process the 14.5 Gb bread wheat (Triticum aestivum) genome in comparison to 1.16 years required by the original sequential version. LTR_FINDER_parallel is freely available at https://github.com/oushujun/LTR_FINDER_parallel.
© The Author(s). 2019.

Keywords:  Genome annotation; LTR retrotransposon; LTR_FINDER; Transposable element

Year:  2019        PMID: 31857828      PMCID: PMC6909508          DOI: 10.1186/s13100-019-0193-0

Source DB:  PubMed          Journal:  Mob DNA


Introduction

Transposable elements (TEs) are the most prevalent components of eukaryotic genomes. Among the different TE classes, long terminal repeat (LTR) retrotransposons, including endogenous retroviruses (ERVs), are among the most repetitive TEs because of their high copy numbers and large element sizes [1]. LTR retrotransposons are found in almost all eukaryotes, including plants, fungi, and animals, but are most abundant in plant genomes [2]. For example, LTR retrotransposons contribute more than 65% and 70% of the genomes of bread wheat (Triticum aestivum) and maize (Zea mays), respectively [1]. Annotation of LTR retrotransposons relies primarily on de novo approaches because their terminal repeats are highly diverse. For this purpose, many computational programs have been developed in the past two decades. LTR_FINDER is one of the most popular LTR search engines [3], and its prediction quality outperforms that of counterpart programs [1]. However, LTR_FINDER runs on a single thread and is prohibitively slow for large genomes with long contigs, which prevents its application in those species. In this study, we applied a “divide and conquer” approach to simplify and parallelize the annotation task for the original LTR_FINDER and observed a speedup of up to 8500 times in the analysis of known genomes.

Methods

We hypothesized that complete sequences of highly complex genomes may contain a large number of complicated nested structures that exponentially increase the search space. To break down these complicated sequence structures, we split chromosomal sequences into relatively short segments (1 Mb) and executed LTR_FINDER on them in parallel. We expect the time complexity of LTR_FINDER_parallel to be O(n). For highly complicated regions (e.g., centromeres), one segment could take a rather long time (e.g., hours). To avoid extended operation time in such regions, we used a timeout scheme (300 s) to limit the longest time a child process can run. If a segment times out, the 1 Mb segment is further split into 50 Kb segments to salvage LTR candidates. After all segments are processed, the regional coordinates of LTR candidates are converted back to genome-level coordinates for the convenience of downstream analyses. LTR_FINDER_parallel is a Perl program that can be "downloaded and run" and does not require any form of installation. We used the original LTR_FINDER as the search engine, which is a binary and is also installation-free. Based on our previous study [1], we applied the optimized parameters for LTR_FINDER (-w 2 -C -D 15000 -d 1000 -L 7000 -l 100 -p 20 -M 0.85), which identify long terminal repeats ranging from 100 to 7000 bp with identity ≥85% and interval regions from 1 to 15 Kb. The output of LTR_FINDER_parallel is convertible to the popular LTRharvest [4] format, which is compatible with the high-accuracy post-processing filter LTR_retriever [1].
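The split, timeout, fallback, and coordinate-conversion steps described above can be illustrated with a minimal Python sketch. This is not the authors' Perl implementation: the function names (`split_sequence`, `lift`, `search_segment`, `annotate`, `run_with_timeout`) are hypothetical, the `search` callable merely stands in for one LTR_FINDER run on one segment (returning `None` to signal a timeout), a thread pool is assumed in place of the child processes mentioned in the text, and parsing of LTR_FINDER output is left out entirely.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Optional, Tuple

Hit = Tuple[int, int]  # (start, end), 1-based inclusive, relative to a segment
Search = Callable[[str], Optional[List[Hit]]]  # None means "timed out"


def split_sequence(seq: str, size: int, offset: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (offset, segment) pairs; offset is the 0-based start in the full sequence."""
    for start in range(0, len(seq), size):
        yield offset + start, seq[start:start + size]


def lift(hits: List[Hit], offset: int) -> List[Hit]:
    """Convert segment-local coordinates back to full-sequence coordinates."""
    return [(start + offset, end + offset) for start, end in hits]


def search_segment(offset: int, segment: str, search: Search, fallback: int) -> List[Hit]:
    """Search one segment; on timeout, retry on smaller pieces to salvage candidates."""
    hits = search(segment)
    if hits is not None:
        return lift(hits, offset)
    salvaged: List[Hit] = []
    for sub_offset, sub in split_sequence(segment, fallback, offset):
        # A piece that times out again contributes no candidates.
        salvaged.extend(lift(search(sub) or [], sub_offset))
    return salvaged


def annotate(seq: str, search: Search, size: int = 1_000_000,
             fallback: int = 50_000, threads: int = 4) -> List[Hit]:
    """Split seq into segments, search them concurrently, and merge the results."""
    segments = list(split_sequence(seq, size))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_segment = list(pool.map(
            lambda item: search_segment(item[0], item[1], search, fallback),
            segments))
    return sorted(hit for hits in per_segment for hit in hits)


def run_with_timeout(cmd: List[str], seconds: float) -> Optional[str]:
    """Run an external command; return its stdout, or None if the time limit is hit."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        return None
    return done.stdout
```

In such a design, a real `search` would write the segment to a FASTA file, call the LTR_FINDER binary through something like `run_with_timeout` with a 300 s limit, and parse the reported coordinates. Note that an element spanning a segment boundary cannot be found by any single segment search, which is consistent with the split-site differences reported in the Results.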

Results

To benchmark the performance of LTR_FINDER_parallel, we selected four plant genomes with sizes varying from 120 Mb to 14.5 Gb: Arabidopsis thaliana (version TAIR10) [5], Oryza sativa (rice, version MSU7) [6, 7], Zea mays (maize, version AGPv4) [8], and Triticum aestivum (wheat, version CS1.0) [9]. Each genome was analyzed both sequentially (1 thread) and in parallel (36 threads), with wall clock time and maximum memory recorded. Using our method, we observed a 5X to 8500X increase in speed for plant genomes of varying sizes (Table 1). For the 14.5 Gb bread wheat genome, the original LTR_FINDER took 10,169 h, or 1.16 years, to complete, while the parallel version completed in 72 min on a modern server with 36 threads, demonstrating an 8500X increase in speed (Table 1). Even when we analyzed each wheat chromosome separately, the original LTR_FINDER still took 20 days on average to complete. Among the genomes we tested, the parallel version of LTR_FINDER produced slightly different numbers of LTR candidates compared to the original version (0–2.73%; Table 1), which is likely due to the dynamic task control approach used for processing heavily nested regions. After filtering the LTR candidates in the rice genome with LTR_retriever [1], we found a difference of 28 LTR elements (out of ~1950 filtered elements) between the results of the original version and the parallel version of LTR_FINDER. Of these 28 elements, 25 were located at the sequence split sites. However, all 28 elements were represented by similar full-length copies identified at other locations in the genome, indicating little loss of coverage in the final LTR retrotransposon library. Given the substantial speed improvement (Table 1), we consider the parallel version to be a promising solution for large genomes.

Table 1. Benchmarking the performance of LTR_FINDER_parallel

| Genome | Arabidopsis | Rice | Maize | Wheat |
|---|---|---|---|---|
| Version | TAIR10 | MSU7 | AGPv4 | CS1.0 |
| Size | 119.7 Mb | 374.5 Mb | 2134.4 Mb | 14,547.3 Mb |
| Original memory (1 thread^a) | 0.37 Gbyte | 0.55 Gbyte | 5.00 Gbyte | 11.88 Gbyte^b |
| Parallel memory (36 threads^a) | 0.10 Gbyte | 0.12 Gbyte | 0.82 Gbyte | 17.67 Gbyte |
| Original time (1 thread) | 0.58 h | 2.1 h | 448.5 h | 10,169.3 h^b |
| Parallel time (36 threads) | 6.4 min | 2.6 min | 10.3 min | 71.8 min |
| Speed up | 5.4 X | 48.5 X | 2613 X | 8498 X |
| # of LTR candidates (1 thread) | 226 | 2851 | 60,165 | 231,043 |
| # of LTR candidates (36 threads) | 226 | 2834 | 59,658 | 237,352 |
| % difference in candidate # | 0.00% | 0.60% | 0.84% | −2.73% |

a Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz

b LTR_FINDER was run on each chromosome; the maximum memory and the total time are shown
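
The derived rows of Table 1 follow from the measured ones by simple arithmetic. The short Python sketch below shows one way to reproduce them; the formulas are inferred from the table values rather than stated in the text, and the names `TABLE_1`, `speedup`, and `percent_difference` are illustrative.

```python
# Values copied from Table 1: (original time in h, parallel time in min,
# LTR candidates with 1 thread, LTR candidates with 36 threads).
TABLE_1 = {
    "Arabidopsis": (0.58, 6.4, 226, 226),
    "Rice": (2.1, 2.6, 2851, 2834),
    "Maize": (448.5, 10.3, 60165, 59658),
    "Wheat": (10169.3, 71.8, 231043, 237352),
}


def speedup(original_hours: float, parallel_minutes: float) -> float:
    """Ratio of sequential to parallel wall clock time, both converted to minutes."""
    return original_hours * 60 / parallel_minutes


def percent_difference(original: int, parallel: int) -> float:
    """Difference in candidate count relative to the original version, in percent."""
    return (original - parallel) / original * 100


for genome, (hours, minutes, n_orig, n_par) in TABLE_1.items():
    print(f"{genome}: {speedup(hours, minutes):.1f}X, "
          f"{percent_difference(n_orig, n_par):+.2f}%")
```

With these definitions, a negative percentage (as for wheat) means the parallel run reported more candidates than the sequential run.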
References

1.  LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons.

Authors:  Shujun Ou; Ning Jiang
Journal:  Plant Physiol       Date:  2017-12-12       Impact factor: 8.340

2.  The contributions of transposable elements to the structure, function, and evolution of plant genomes.

Authors:  Jeffrey L Bennetzen; Hao Wang
Journal:  Annu Rev Plant Biol       Date:  2014-02-21       Impact factor: 26.379

3.  LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons.

Authors:  Zhao Xu; Hao Wang
Journal:  Nucleic Acids Res       Date:  2007-05-07       Impact factor: 16.971

4.  LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons.

Authors:  David Ellinghaus; Stefan Kurtz; Ute Willhoeft
Journal:  BMC Bioinformatics       Date:  2008-01-14       Impact factor: 3.169

5.  Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.

Authors: 
Journal:  Nature       Date:  2000-12-14       Impact factor: 49.962

6.  The map-based sequence of the rice genome.

Authors: 
Journal:  Nature       Date:  2005-08-11       Impact factor: 49.962

7.  Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data.

Authors:  Yoshihiro Kawahara; Melissa de la Bastide; John P Hamilton; Hiroyuki Kanamori; W Richard McCombie; Shu Ouyang; David C Schwartz; Tsuyoshi Tanaka; Jianzhong Wu; Shiguo Zhou; Kevin L Childs; Rebecca M Davidson; Haining Lin; Lina Quesada-Ocampo; Brieanne Vaillancourt; Hiroaki Sakai; Sung Shin Lee; Jungsok Kim; Hisataka Numa; Takeshi Itoh; C Robin Buell; Takashi Matsumoto
Journal:  Rice (N Y)       Date:  2013-02-06       Impact factor: 4.783

8.  Improved maize reference genome with single-molecule technologies.

Authors:  Yinping Jiao; Paul Peluso; Jinghua Shi; Tiffany Liang; Michelle C Stitzer; Bo Wang; Michael S Campbell; Joshua C Stein; Xuehong Wei; Chen-Shan Chin; Katherine Guill; Michael Regulski; Sunita Kumari; Andrew Olson; Jonathan Gent; Kevin L Schneider; Thomas K Wolfgruber; Michael R May; Nathan M Springer; Eric Antoniou; W Richard McCombie; Gernot G Presting; Michael McMullen; Jeffrey Ross-Ibarra; R Kelly Dawe; Alex Hastie; David R Rank; Doreen Ware
Journal:  Nature       Date:  2017-06-12       Impact factor: 49.962

9.  Shifting the limits in wheat research and breeding using a fully annotated reference genome.

Authors: 
Journal:  Science       Date:  2018-08-16       Impact factor: 47.728
