RepeatCraft: a meta-pipeline for repetitive element de-fragmentation and annotation.

Abstract

SUMMARY: Repetitive elements comprise a large proportion of many genomes and affect both genome evolution and regulation. Their classification and the study of their evolutionary history is a major emerging field. Various software tools exist to date to classify and map repeats across genomes. The major unresolved drawback, however, is the fragmented nature of many identified repeat loci, which ultimately makes the classification of novel repeats and their evolutionary analysis difficult. To improve on this, we developed a pipeline (RepeatCraft) that integrates results from several repeat element classification tools based on both sequence similarity and structural features. The pipeline de-fragments closely spaced repeat loci in the genome, reconstructing longer copies and thus allowing for better annotation and sequence comparison. The pipeline also includes a user interface that runs in a web browser, allowing easy access to and exploration of the repeat data.
AVAILABILITY AND IMPLEMENTATION: RepeatCraft is implemented in Python and the web application is implemented in R. Downloads and documentation are freely available at https://github.com/niccw/repeatCraftp.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 30165587 PMCID: PMC6419915 DOI: 10.1093/bioinformatics/bty745

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Repetitive elements comprise a large proportion of many metazoan genomes. Three major classes exist: DNA elements (cut-and-paste propagation), retroelements (copy-and-paste propagation) and simple repeats (non-autonomous). The length of individual repeat elements can vary from a few base pairs (simple repeats) to a dozen kilobases [e.g. helitrons (Kapitonov and Jurka, 2001)]. Active autonomous repeat elements encode enzymes that enable their propagation, e.g. transposase for DNA elements and reverse transcriptase for long interspersed nuclear elements (LINEs). Some elements harbour additional structural features, such as long terminal repeats (LTRs), that facilitate their recognition and transcription as well as their insertion into the genome. Despite the prevalence of repetitive elements in genomes and the many reports on their role in the evolution of gene regulation and genome stability (Feschotte and Pritham, 2007), their detection and classification remain problematic. Many unclassified repeat families are emerging in genomic studies (Chapman et al., 2010; Chalopin et al., 2013). One of the major problems is that current repeat annotation tools report fragmented pieces of transposons in the genome. This hampers classification both by sequence similarity to known elements and by structure, because the fragments lack detectable structural features. Very few tools exist to date that allow exploration and validation of raw RepeatMasker outputs (e.g. Bailly-Bechet et al.). Here, we introduce a meta-approach, RepeatCraft, that combines several annotation pipelines and a merging algorithm to solve this problem by extending annotated pieces of transposons into larger repeat loci and reclassifying them based on those complete sequences. The method both reduces the complexity of repeat ‘fragments’ found in the genome and facilitates the detection and classification of novel repeat classes.
Finally, we introduce a multi-functional web-based user interface for exploring repeat data with the help of sequence similarity and structural features.

2 Materials and methods

The pipeline is implemented in Python and builds on the coordinate file (in the general feature format, GFF) produced by any repeat masking tool, e.g. RepeatMasker (http://www.repeatmasker.org). The attribute field in the GFF file is reserved for the repeat family name and should also contain coordinates relative to the full-length or consensus sequence used in masking the genome (as, e.g. in the default RepeatMasker output). In the first step, the pipeline updates the coordinate GFF file to populate and format the attribute column with annotations and coordinates on the reference sequence. In the second step, it identifies short repeats (below a user-defined length, by default 100 bp). The third step merges in the output of other tools that label structural features such as LTRs [currently LTR_FINDER (Xu and Wang, 2007); Fig. 1]. In the final step, the sequence similarity and structural information are merged by the following rules. Neighbouring fragments of the same repeat element that do not overlap in their coverage of the corresponding consensus sequence are merged into, and labelled as, one locus. The pipeline then updates the annotation of each locus by combining the structural evidence for it. We implement two merging algorithms. ‘Strict’ merging requires a consecutive order of repeat fragments from the same family, without any intervening repeat of a different family (similar to Bailly-Bechet et al., yet also allowing for inversions). ‘Loose’ merging additionally allows such intervening families to be present, provided that they fall within the maximum separation cutoff (default 150 bp). Many of these intervening repeat families comprise very short stretches of simple repeats and can be useful in the classification of the larger repeat loci they are embedded in.
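The merging rules described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not RepeatCraft's actual code: the `Fragment` class, the `assign_loci` function and its parameters are hypothetical names. The sketch further assumes that the separation cutoff applies in both modes, that a fragment's consensus coverage is compared against every fragment already in the candidate locus, and that ignoring strand is an acceptable way to tolerate inversions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Fragment:
    """One repeat annotation (hypothetical container, not RepeatCraft's own)."""
    seqid: str       # scaffold or chromosome name
    start: int       # genomic start, 1-based inclusive
    end: int         # genomic end, inclusive
    family: str      # repeat family name from the GFF attribute field
    cons_start: int  # start of the match on the consensus sequence
    cons_end: int    # end of the match on the consensus sequence


def consensus_overlap(a: Fragment, b: Fragment) -> bool:
    """True if two fragments cover a shared stretch of the consensus."""
    return a.cons_start <= b.cons_end and b.cons_start <= a.cons_end


def assign_loci(fragments: List[Fragment], max_gap: int = 150,
                loose: bool = False) -> List[int]:
    """Return one locus id per input fragment; equal ids mean 'merged'.

    Strict mode only looks at the immediately preceding fragment.
    Loose mode skips over intervening fragments of other families as
    long as the gap stays within max_gap (assumed to apply to both modes).
    Strand is ignored, so inverted fragments can still be merged.
    """
    order = sorted(range(len(fragments)),
                   key=lambda i: (fragments[i].seqid, fragments[i].start))
    locus_of: List[Optional[int]] = [None] * len(fragments)
    members: Dict[int, List[Fragment]] = {}

    for pos, idx in enumerate(order):
        frag = fragments[idx]
        target = None
        # Walk upstream until the gap is too large or, in strict mode,
        # a fragment of a different family is encountered.
        for prev_pos in range(pos - 1, -1, -1):
            prev_idx = order[prev_pos]
            prev = fragments[prev_idx]
            if prev.seqid != frag.seqid or frag.start - prev.end - 1 > max_gap:
                break
            if prev.family == frag.family:
                target = locus_of[prev_idx]
                break
            if not loose:
                break
        # Fragments covering the same part of the consensus stay separate.
        if target is not None and any(consensus_overlap(frag, m)
                                      for m in members[target]):
            target = None
        if target is None:
            target = len(members)
            members[target] = []
        members[target].append(frag)
        locus_of[idx] = target
    return locus_of


# Toy data: two LINE fragments separated by a short simple repeat,
# plus a distant third fragment of the same family.
example = [
    Fragment("chr1", 100, 300, "LINE/L2", 1, 200),
    Fragment("chr1", 320, 340, "Simple_repeat", 1, 20),
    Fragment("chr1", 350, 600, "LINE/L2", 220, 470),
    Fragment("chr1", 2000, 2200, "LINE/L2", 500, 700),
]
print(assign_loci(example))              # strict: [0, 1, 2, 3]
print(assign_loci(example, loose=True))  # loose:  [0, 1, 0, 2]
```

In the toy data, strict merging keeps the two nearby LINE fragments apart because a simple repeat lies between them, whereas loose merging joins them and leaves the simple repeat with its own annotation. The distant fragment is never merged because it exceeds the separation cutoff.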

Fig. 1.

(A) Schematic diagram showing how RepeatCraft groups repeat ‘fragments’ based on their coverage of the consensus sequence (blue track) and the distance between consecutive repeats. By default, RepeatCraft only merges consecutive repeats (strict merging). The ‘loose’ merge considers non-consecutive, closely spaced repeats and retains the annotation of other short repeats (e.g. simple repeats) in between the fragments. (B) The range plot panel of the web application provides a track-based visualization of the RepeatCraft result, similar to a genome browser. The first track shows the annotation from RepeatMasker and the second track shows the improved annotation from RepeatCraft. The remaining tracks display the annotations from other tools (e.g. LTR_FINDER).

3 Results

Running the pipeline with default parameters reduces the number of highly fragmented repeat loci. Based on a genome assembly of Hydra magnipapillata (https://research.nhgri.nih.gov/hydra) (Chapman et al., 2010), we could reduce the total number of repeat loci detected by RepeatMasker by around 10% over all repetitive elements (Supplementary Table S1), while not changing the overall repeat coverage. The highest level of repeat locus extension (fragment reduction) was observed for unknown repeats (around 60 000 merges). This further allowed us to assign structural features (e.g. ORFs of over 100 bp) to about 90% of previously unknown elements. Family-level annotation improvement statistics are provided in Supplementary Tables S2–S4. Finally, we designed a web application, implemented in R, as a graphical user interface to the RepeatCraft pipeline (Fig. 1). It also functions as an interactive tool for visualizing and browsing the repeat element annotation. It allows users to study the sequence similarity and structural annotation of repeat families in a genome browser format, as well as on an individual family basis. The latter feature is of particular importance for the characterization of novel repeat elements, whose longest loci can now be studied in terms of structural features such as tandem repeats, LTRs and open reading frames.

Funding

WYW and OS are supported by a grant from the Austrian Science Fund (FWF): P30686-B29.

Conflict of Interest: none declared.

References

1. Rolling-circle transposons in eukaryotes.

Authors: V V Kapitonov; J Jurka
Journal: Proc Natl Acad Sci U S A Date: 2001-07-10 Impact factor: 11.205

Review 2. DNA transposons and the evolution of eukaryotic genomes.

Authors: Cédric Feschotte; Ellen J Pritham
Journal: Annu Rev Genet Date: 2007 Impact factor: 16.830

3. Evolutionary active transposable elements in the genome of the coelacanth.

Authors: Domitille Chalopin; Shaohua Fan; Oleg Simakov; Axel Meyer; Manfred Schartl; Jean-Nicolas Volff
Journal: J Exp Zool B Mol Dev Evol Date: 2013-08-01 Impact factor: 2.656

4. The dynamic genome of Hydra.

Authors: Jarrod A Chapman; Ewen F Kirkness; Oleg Simakov; Steven E Hampson; Therese Mitros; Thomas Weinmaier; Thomas Rattei; Prakash G Balasubramanian; Jon Borman; Dana Busam; Kathryn Disbennett; Cynthia Pfannkoch; Nadezhda Sumin; Granger G Sutton; Lakshmi Devi Viswanathan; Brian Walenz; David M Goodstein; Uffe Hellsten; Takeshi Kawashima; Simon E Prochnik; Nicholas H Putnam; Shengquiang Shu; Bruce Blumberg; Catherine E Dana; Lydia Gee; Dennis F Kibler; Lee Law; Dirk Lindgens; Daniel E Martinez; Jisong Peng; Philip A Wigge; Bianca Bertulat; Corina Guder; Yukio Nakamura; Suat Ozbek; Hiroshi Watanabe; Konstantin Khalturin; Georg Hemmrich; André Franke; René Augustin; Sebastian Fraune; Eisuke Hayakawa; Shiho Hayakawa; Mamiko Hirose; Jung Shan Hwang; Kazuho Ikeo; Chiemi Nishimiya-Fujisawa; Atshushi Ogura; Toshio Takahashi; Patrick R H Steinmetz; Xiaoming Zhang; Roland Aufschnaiter; Marie-Kristin Eder; Anne-Kathrin Gorny; Willi Salvenmoser; Alysha M Heimberg; Benjamin M Wheeler; Kevin J Peterson; Angelika Böttger; Patrick Tischler; Alexander Wolf; Takashi Gojobori; Karin A Remington; Robert L Strausberg; J Craig Venter; Ulrich Technau; Bert Hobmayer; Thomas C G Bosch; Thomas W Holstein; Toshitaka Fujisawa; Hans R Bode; Charles N David; Daniel S Rokhsar; Robert E Steele
Journal: Nature Date: 2010-03-14 Impact factor: 49.962

5. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons.

Authors: Zhao Xu; Hao Wang
Journal: Nucleic Acids Res Date: 2007-05-07 Impact factor: 16.971
