PLOTREP: a web tool for defragmentation and visual analysis of dispersed genomic repeats.

Gábor Tóth, Gábor Deák, Endre Barta, György B Kiss.

Abstract

Identification of dispersed or interspersed repeats, most of which are derived from transposons, retrotransposons or retrovirus-like elements, is an important step in genome annotation. Software tools that compare genomic sequences with precompiled repeat reference libraries using sensitive similarity-based methods provide reliable means of finding the positions of fragments homologous to known repeats. However, their output is often incomplete and fragmented owing to the mutations (nucleotide substitutions, deletions or insertions) that can result in considerable divergence from the reference sequence. Merging these fragments to identify the whole region that represents an ancient copy of a mobile element is challenging, particularly if the element is large and suffered multiple deletions or insertions. Here we report PLOTREP, a tool designed to post-process results obtained by sequence similarity search and merge fragments belonging to the same copy of a repeat. The software allows rapid visual inspection of the results using a dot-plot like graphical output. The web implementation of PLOTREP is available at http://bioinformatics.abc.hu/PLOTREP/.

Year: 2006 PMID: 16845104 PMCID: PMC1538846 DOI: 10.1093/nar/gkl263

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Repetitive sequences are ubiquitous in eukaryotic genomes (1,2). The fraction of the genome occupied by repetitive elements varies by species and ranges from a few percent in lower eukaryotes to >70% in some plants (3,4). Dispersed or interspersed repeats almost exclusively result from the transposition of mobile genetic elements, which belong to one of the main classes of DNA transposons, retrotransposons or retrovirus-like elements. Proper annotation of a genome involves the identification and classification of transposable elements (TEs). Recognizing repetitive DNA is essential for accurate genome assembly, while masking it is a prerequisite for sequence similarity searches aimed at gene prediction and functional annotation. Cataloging and further analysis of TEs promotes our understanding of TE and genome evolution. The best-known tools for systematic annotation of repeat families and subsequent repeat masking are RepeatMasker (A. F. Smit, R. Hubley and P. Green) and CENSOR (5,6). Both programs perform similarity searches based on local alignments using precompiled libraries of consensus or representative sequences of repeat families. A profile HMM-based method has been applied with success to find certain groups of TEs in the rice genome (7). BLAST-based searches can also prove useful in TE annotation of genomic sequences (8–10). These approaches offer reliable results in repeat identification. However, the coverage of the precompiled libraries is inevitably patchy for species with incompletely sequenced genomes. Because many TEs are species-specific, such TE and repeat databases have to be built for each genome sequencing project in parallel with genome assembly and annotation.
Structural features characteristic of particular superfamilies of TEs can be utilized to find superfamily members: the LTR_STRUC program (11) identifies LTR retrotransposons, while the FINDMITE (12) and MAK (13) programs are designed to locate MITEs (miniature inverted repeat TEs). Recently, several more general methods for automated de novo repeat identification and classification have been described and implemented in the programs RECON (14), PILER (15) and RepeatScout (16). While these tools perform relatively well in finding repetitive families, their output is often redundant and the quality of the consensus sequences derived is not comparable with that of the entries in manually curated databases. Another frequently observed problem when searching against a repeat library is that putative TEs in the query sequence appear only as partial, fragmented hits in the output of the program. This phenomenon is usually due to the considerable divergence of the repeat from the reference sequence in the database. It is more pronounced when good consensus sequences have not yet been generated for each subfamily of the particular TE. Divergence from the once active founder element, which is supposedly reconstructed in the consensus sequence, is caused by various mutations (nucleotide substitutions, deletions or insertions) in the sequence since the transposition. If the comparison involves a representative copy instead of a consensus, the sequence difference may double. Merging the fragmented hits to identify the whole region that represents an ancient copy of a mobile element is challenging, particularly if the element is large and has suffered multiple deletions or insertions. Here we report PLOTREP, a web tool designed to address the problem of apparent fragmentation of search results observed during repeat annotation of genomic sequences. PLOTREP can identify repeats in BAC-sized genomic regions (up to several hundred kb).
First, a sequence similarity search is carried out to detect matches against a library of various reference sequences. The user can compile and upload his/her own set of known repetitive sequences to be used as the reference library or select from the libraries offered by the server. The results of the search are then post-processed by the program and defragmentation of the regions belonging together is carried out. All fragments predicted to be parts of the same copy of a repeat (i.e. the result of a single transposition event) are merged and plotted to show the whole region covered by the element. Positions of large deletions and insertions with respect to the reference sequence are listed in the output beside the positions of the merged repetitive regions. PLOTREP allows rapid visual inspection of the results: a two-dimensional (2D) dot-plot like graphic is generated for each reference sequence showing unmerged and merged hits in the query sequence. The graphical output is particularly useful in the analysis of large, several kilobase-long TEs like retrotransposons and retrovirus-like elements, which are more prone to suffer deletions and insertions if they have been present in the genome for a long time.

METHODS AND IMPLEMENTATION

The input of the PLOTREP web tool is a genomic query sequence of length up to 1 Mb. The operation of the program can be divided into three main steps: (i) a sequence similarity search is carried out against a reference library of known interspersed repetitive sequences; (ii) the search result is post-processed to find matches that can be merged into one combined region representing a single copy of a repeat and (iii) the results of the first and second step are displayed in both tabular and graphical format. In the first step, significant local alignments between the genomic query and the sequences of the reference library are found by the software CENSOR (5). CENSOR, like RepeatMasker, the other well-known tool for library-based repeat identification, is designed to locate and mask regions in genomic sequences that correspond to known repetitive elements. CENSOR uses the fast and sensitive similarity search program WU-BLAST (W. Gish). Optionally, the BLASTN or BLASTX programs of the WU-BLAST package can be used directly instead of CENSOR. All three programs allow the relatively rapid identification of fragments homologous to sequences in a repeat library either supplied by the user or chosen from a list offered in the PLOTREP search form. The matching fragments are often not contiguous even if they originated from the same transposition event of a TE. Gaps, deletions and insertions of unrelated sequences may disrupt the alignment. Throughout this article, we refer to sequences probably homologous to the reference but not sufficiently similar to appear among the local alignments generated by CENSOR as ‘gaps’. The idea behind the second, defragmentation or merging step is based on the proven usability of the dot-plot method for repeat analysis. 2D dot-plots are often used to check and visually inspect repeats (including duplications and inversions) within sequences or local similarities between two otherwise unrelated sequences.
Generation of a full dot-plot with a program like Dotter (17) is very time consuming for large sequences and dot-plot programs do not allow the automatic determination of the borders of matching regions. Applying the dot-plot approach to the results of a relatively fast local similarity detection program combines the advantage of the visual inspection of the matches with the possibility of automated processing of the results if manual intervention is not feasible. A similar approach was proven useful in the BLAST2GENE program designed to convert BLAST output into independent genes and gene fragments (18). Hereafter the line that represents a matching fragment in the 2D plot will be referred to as a diagonal (Figure 1). The merging step involves diagonals that maintain consistency. Two diagonals are consistent if their order is the same with respect to both the query and the reference sequences. In PLOTREP we use the notion of the offset difference between diagonals. The offset difference is the distance between two parallel or nearly parallel diagonal lines. If the two sequences are drawn starting from the upper left corner of the rectangle then the absolute offset of a positively oriented diagonal is measured from the lower left corner, while the absolute offset of a negatively oriented diagonal is measured from the upper left corner. Diagonals closer to each other than a given maximum offset difference are combined into a group of consistent diagonals. Since deletions and insertions increase the offset difference, they prevent the flanking fragments from being grouped together. Therefore, pairs of neighboring groups are examined to determine whether they can be considered consistent under the assumption that they are separated only by a deletion (i.e. fragments are adjacent in the query sequence but separated in the reference sequence) or an insertion (i.e. fragments are adjacent in the reference sequence but separated in the query sequence).
Fragments are accepted as adjacent if they are closer to each other than a given maximum distance in one of the two sequences. Gaps separate two groups on diagonals with no or small offset difference between them. If the offset difference is below a pre-defined threshold, even such groups are combined. Depending on parameter settings, all or most fragments predicted to be parts of the same individual copy of a repeat are merged and boundaries are calculated for the whole region covered by the element.
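The grouping and junction rules described above can be illustrated with a minimal Python sketch. It is not the PLOTREP code (which is written in Perl and not reproduced here); the `Hit` record, the function names and the exact offset formula are assumptions made for illustration. The offset is taken as query start minus reference start (plus the reference length) for positive diagonals and query start plus reference end for negative diagonals, which is one way to realize the corner-based definition in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One local alignment (a 'diagonal'); 1-based inclusive coordinates."""
    q_start: int
    q_end: int
    r_start: int
    r_end: int
    strand: str  # "+" or "-"


def offset(hit, ref_len):
    """Absolute offset of a diagonal (assumed formula, see lead-in)."""
    if hit.strand == "+":
        # Distance from the lower left corner of the plot.
        return hit.q_start - hit.r_start + ref_len
    # Negative orientation: distance from the upper left corner.
    return hit.q_start + hit.r_end


def group_by_offset(hits, ref_len, max_offset_diff):
    """Group same-strand hits whose consecutive offsets differ by less than the limit."""
    groups = []
    for strand in ("+", "-"):
        ordered = sorted((h for h in hits if h.strand == strand),
                         key=lambda h: h.q_start)
        current = []
        for h in ordered:
            if current and abs(offset(h, ref_len)
                               - offset(current[-1], ref_len)) >= max_offset_diff:
                groups.append(current)
                current = []
            current.append(h)
        if current:
            groups.append(current)
    return groups


def classify_junction(left, right, max_adjacent, max_offset_diff):
    """Classify the discontinuity between two hits; 'left' precedes 'right' in the query."""
    if left.strand != right.strand:
        return "inconsistent"
    q_dist = right.q_start - left.q_end
    if left.strand == "+":
        r_dist = right.r_start - left.r_end
    else:
        # On a negative diagonal the reference runs backwards along the query.
        r_dist = left.r_start - right.r_end
    if q_dist < -max_adjacent or r_dist < -max_adjacent:
        return "inconsistent"  # order differs between query and reference
    q_adjacent = q_dist <= max_adjacent
    r_adjacent = r_dist <= max_adjacent
    if q_adjacent and r_adjacent:
        return "contiguous"
    if q_adjacent:
        return "deletion"   # adjacent in query, separated in reference
    if r_adjacent:
        return "insertion"  # adjacent in reference, separated in query
    if abs(q_dist - r_dist) < max_offset_diff:
        return "gap"        # separated in both, but on nearly the same diagonal
    return "unrelated"
```

In this sketch two hits 2000 bp apart in the query but only 5 bp apart in the reference are reported as flanking an insertion, whereas hits that are about equally far apart in both sequences are reported as flanking a gap.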

Figure 1

The diagram explains the terms used in the description of the algorithm and provides help to interpret the 2D plot. Matching fragments are shown as red diagonals (1–8). Fragments 1–5 are in positive orientation while fragments 6–8 are in negative orientation. The absolute offset of the diagonals is calculated as indicated for fragment 3 as an example. The offset differences separating fragments 1, 2 and 3 are small, therefore they can be grouped together as the initial step of merging. Similarly, fragments 4 and 5 can also be grouped. A gap is a discontinuity with small offset difference between flanking fragments like 6 and 7. Insertions and deletions are defined with respect to the reference sequence on the vertical axis, and their presence is examined after the groups of ‘same-diagonal’ fragments are formed. An insertion or a deletion is characterized by large offset difference and adjacency of the flanking fragments in the reference sequence or in the query, respectively. Depending on the parameters, fragments belonging together can be merged as shown by the black lines 9 and 10.

In the third step, the output is generated and displayed. The output contains (i) a diagram summarizing all repeats, both unmerged (i.e. the raw CENSOR output) and merged (processed in the second step), found in the query; (ii) tables with position data for unmerged and merged repeat regions and (iii) 2D dot-plot like representations of sequence similarity between the query sequence and each of those reference sequences that matched it in the first search step. The web interface is programmed in Perl CGI and JavaScript, while the programs performing the fragment merging step and generating the graphical output are written in Perl. The latter scripts are able to process not only the output of CENSOR but also the GFF-format RepeatMasker output or other simple plain text table listing positions of matched fragments. The scripts are available to the academic community upon request.
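As an illustration of the kind of "simple plain text table listing positions of matched fragments" mentioned above, the following Python sketch reads such a table into records. The six-column layout (repeat name, query start, query end, reference start, reference end, strand) and all names in the code are hypothetical; the actual formats accepted by the PLOTREP scripts are not specified in this article.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    repeat: str
    q_start: int
    q_end: int
    r_start: int
    r_end: int
    strand: str


def parse_fragment_table(lines):
    """Parse a whitespace-delimited table of matched fragments.

    Assumed (hypothetical) columns:
    repeat name, query start, query end, reference start, reference end, strand.
    Blank lines and lines starting with '#' are skipped.
    """
    fragments = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 6:
            raise ValueError(f"line {line_no}: expected 6 columns, got {len(fields)}")
        name, qs, qe, rs, re_, strand = fields
        if strand not in ("+", "-"):
            raise ValueError(f"line {line_no}: strand must be '+' or '-'")
        # Normalize so that start <= end even if a tool reports reversed coordinates.
        q_start, q_end = sorted((int(qs), int(qe)))
        r_start, r_end = sorted((int(rs), int(re_)))
        fragments.append(Fragment(name, q_start, q_end, r_start, r_end, strand))
    return fragments
```

Records of this shape carry everything the merging step needs: the position of each fragment in both sequences and its orientation.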

FEATURES

We designed PLOTREP to be suitable for anyone who wants to annotate BAC-sized or smaller genomic sequences and identify interspersed repeats similar to consensus or reference sequences representing known families of repetitive elements. To meet this requirement, PLOTREP finds matching regions in the genomic sequence and merges them if they are predicted to jointly compose the same individual copy of a repeat, which in turn is presumed to have resulted from a single transposition event. PLOTREP also allows rapid visual inspection of the findings via a dot-plot like 2D graphical output. On the other hand, PLOTREP can also fulfill the requirements of those who would like to analyze certain TEs or identify novel transposons or retrotransposons. We are interested in plant repetitive elements and this is reflected in the inclusion of the TIGR Plant Repeat Database (19) among the repeat libraries offered by the server. Repbase Update, the most comprehensive database of repetitive element consensus sequences, is also available for use in the CENSOR or BLASTN search step. Repbase Update is compiled and maintained by the Genetic Information Research Institute (6,20,21). A useful feature of the server is that it also allows searches against user-supplied repeat libraries, a frequent requirement when analyzing genomes where sequencing is still in progress and custom-made libraries are preferred because public repeat libraries do not contain TE sequences for the species. One approach to identify a TE which belongs to a new TE family is based on the detection of largish insertions into regions of known or predictable (i.e. conserved) sequence structure. Nested insertions of TEs into each other are frequently observed, particularly in large plant genomes (3,22,23). Thus, transposition of a sequence of unknown identity into a repetitive element that belongs to a known family or subfamily can be easily noticed.
Since PLOTREP can detect insertions into nearly full-length or even partial elements, and many such insertions result from (retro)transpositions, the program can help to identify unknown families of mobile genetic elements and to determine the relative ages of element families or subfamilies. When searching against a library of repetitive elements, a gap observed between two consecutive hits may be caused either by an unrelated sequence or, more probably, by extensive divergence of a homologous region from the repeat consensus/reference, which prevents detection of the remote similarity by the search program. By examining the offset difference between the diagonals on which the hits flanking the gap lie, PLOTREP can predict whether a region not detected by the similarity search may belong to a TE or not. However, small insertions (e.g. MITEs) or recombination may result in sequences interpreted by PLOTREP as a ‘gap’, therefore the origin of gaps reported by PLOTREP should be checked manually. Visual inspection of the graphical output, especially in the dot-plot like 2D form, can reveal even very complex patterns of nested insertions and element duplications. A unique feature of PLOTREP when using a user-supplied reference library is that long terminal repeat (LTR) retrotransposons are treated in a specific way. The LTR sequence and the internal sequence must be in two separate sequence entries and special rules apply for naming the two sequences (see the online Tutorial for details). In this case, PLOTREP attempts to merge fragments for the whole retrotransposon of the structure LTR–internal–LTR, and plots this combination on the 2D dot-plot like image. However, PLOTREP may be unable to resolve complex nested insertions or tandem LTR retroelements resulting from recombination between LTRs belonging to two different elements. The 2D diagram can greatly help the user interpret the results in such cases.
Manual editing may also be required if an element harbors homologies with two closely related but differently named elements. Consequently, the 2D graphical representation of sequence comparison often provides information not conspicuous from the summary figure and the tables, and this surplus is more pronounced when one inspects matches to large repetitive elements including retrotransposons but less so when looking at small repeats. This feature is particularly useful in analyzing plant genomes, since plant LTR retroelements longer than 10 kb are not uncommon (10,24–27).

INPUT AND OPTIONS

Input sequences, search method and defragmentation options can be specified in the Search page. One of the three search programs, CENSOR, BLASTN or BLASTX can be selected by the user, with CENSOR being the default. The query sequence of up to 1 Mb must be in ‘FASTA’ format. One can either paste it into the input field or upload it from a file. Only a single sequence entry is allowed; multisequence files are rejected by the program. A reference sequence library consisting of known repetitive elements to be searched against has to be specified. There are essentially two options to supply the library, except for BLASTX for which only the first option is available. First, the user can paste reference sequences into an input field or upload a sequence file. Second, the user can select a library from those stored on the server and offered in a list. In the user-supplied reference library, sequences must be in FASTA format and multiple sequence entries are allowed. LTR retroelements are handled in a specific manner (see above and the online Tutorial). Selectable server-based libraries currently include various sections of the Repbase Update database (6) and the TIGR Plant Repeat Database (19). More databases are planned for later addition. Five parameters affecting the second (post-processing and defragmentation) step of the search can be modified by the user. Maximum insertion length is the maximum length of an insertion allowed between two consecutive hits to merge them. An insertion longer than this will keep the fragments separated. Maximum deletion length is the maximum length of a deletion allowed between two adjacent hit fragments to merge them. A deletion longer than this will keep the fragments separated. Maximum gap length is the maximum length of a same-diagonal gap allowed between two consecutive hits to merge them. The default value is zero, in which case there is no limit and merging is guided only by the offset difference between the diagonals on which the two hits lie.
Minimum coverage to merge is the minimum total coverage of merged fragments in percentage of the reference sequence length to accept the merging of the fragments. Maximum relative offset difference is the maximum relative difference in offset with respect to total repeat length. Two consecutive fragments are merged only if the relative offset difference is smaller than this value. CENSOR can be run in three different sensitivity modes. The WU-BLAST parameter settings corresponding to these modes are listed in the online Tutorial. Certain parameters (word size, E-value threshold, gap penalties) of the direct WU-BLAST searches can also be adjusted by the user (see the online Tutorial for details).
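To make the interplay of the five defragmentation parameters concrete, here is a minimal Python sketch. The parameter names mirror the descriptions above, but the default values, the `MergeParams` class, both functions and the choice to apply the relative offset test to same-diagonal gaps only are assumptions for illustration and are not taken from PLOTREP. The relative offset difference is read here as the offset difference divided by the reference length.

```python
from dataclasses import dataclass


@dataclass
class MergeParams:
    # All default values are hypothetical placeholders, not PLOTREP's defaults,
    # except max_gap = 0, which the text describes as "no limit".
    max_insertion: int = 10000
    max_deletion: int = 5000
    max_gap: int = 0
    min_coverage_pct: float = 10.0
    max_rel_offset_diff: float = 0.05


def junction_allowed(kind, length, offset_diff, ref_len, params):
    """Decide whether two consecutive fragments may be merged across a junction."""
    if kind == "insertion":
        return length <= params.max_insertion
    if kind == "deletion":
        return length <= params.max_deletion
    if kind == "gap":
        # max_gap == 0 means no length limit; the offset test still applies.
        if params.max_gap and length > params.max_gap:
            return False
        return abs(offset_diff) / ref_len < params.max_rel_offset_diff
    raise ValueError(f"unknown junction kind: {kind!r}")


def coverage_pct(ref_intervals, ref_len):
    """Percentage of the reference covered by the union of 1-based inclusive intervals."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(ref_intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return 100.0 * covered / ref_len


def merge_accepted(ref_intervals, ref_len, params):
    """Accept a merged region only if it covers enough of the reference."""
    return coverage_pct(ref_intervals, ref_len) >= params.min_coverage_pct
```

Under these assumptions, a candidate merged region whose fragments together cover 5% of the reference would be rejected with a minimum coverage of 10%, even if every individual junction passed its length test.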

OUTPUT

The output consists of three main parts (Figure 2): (i) a summary figure; (ii) two alternative tables of position data, one for the original unmerged hits and another for the merged repeats and (iii) 2D dot-plot like figures. On the top of the Result page, there is a diagram summarizing the regions occupied by all repeats which have been found in the query. Both the unmerged fragments (i.e. raw CENSOR or BLAST output) and the merged ones (defragmented in the second, post-processing step) are indicated by two rows of horizontal color bars below a scaled line representing the query sequence. The diagram can be enlarged to view more details if fragments seem to overlap. Below it, a table with the merged repeats is shown by default but the user can also select and view another table displaying raw results of the CENSOR or BLAST search step. Columns of the table for merged fragments include (i) repeat name; (ii) position of the combined repeat region in the query; (iii) positions of insertions (with respect to the query sequence) if any; (iv) positions of deletions (with respect to the reference sequence) if any and (v) positions of gaps (with respect to the query sequence) if any. Columns of the table for raw CENSOR (or BLAST) output are (i) repeat name; (ii) position of the hit in the query sequence; (iii) length of the fragment in the query sequence; (iv) position of the matching fragment in the reference sequence; (v) sequence similarity; (vi) direction of the repeat in the query sequence. Both tables can be downloaded as plain text files. Sequences of either the merged repeat regions or the insertions can also be downloaded using links in the table of merged fragments. Sequences of the matching query fragments and the original CENSOR alignments can be downloaded using the appropriate links in the table containing the raw output. 
The 2D plot of sequence comparison between the query and a matching individual reference sequence can be viewed by clicking on the repeat name in either table. The query sequence is drawn horizontally and a single reference sequence is drawn vertically. If the plot was accessed from the table of merged fragments, a black line indicating the merged region appears beside the red lines representing the original fragments. The 2D plots can be zoomed in and out at the user's convenience. A similar diagram displaying the query sequence compared with itself, thus helping the user to recognize direct and inverted repeats, can be opened by clicking on the ‘Show DotPlot’ button below the summary figure. The online Tutorial explains the output options in more detail.
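How a self-comparison plot exposes direct and inverted repeats can be shown with a generic exact word-match dot-plot in Python. This is a textbook k-mer approach chosen for brevity, not the method behind the ‘Show DotPlot’ button, which is not described in this article; the function names and the word size are assumptions.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def self_dotplot(seq, k=8):
    """Return dots (i, j, orientation) for exact k-mer matches of seq against itself.

    Positions are 0-based word starts with i < j. "+" marks a direct repeat,
    "-" marks a word at j that is the reverse complement of the word at i.
    """
    seq = seq.upper()
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    dots = set()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        for j in index.get(word, ()):
            if j > i:
                dots.add((i, j, "+"))
        for j in index.get(reverse_complement(word), ()):
            if j > i:
                dots.add((i, j, "-"))
    return dots
```

Runs of "+" dots line up along positively oriented diagonals and runs of "-" dots along negatively oriented ones, which is the visual pattern a user looks for when checking for direct and inverted repeats.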

Figure 2

An example of PLOTREP results. A genomic query sequence was searched against a small user-supplied library containing LTR and internal sequences of an LTR retrotransposon. (A) A diagram summarizing all matching hits and those merged by PLOTREP, providing an overall picture of repeat positions. (B) A table listing positions of merged regions along with the positions of insertions, deletions and virtual gaps within these regions. (C) A 2D dot-plot like diagram displaying the comparison between the query (on the horizontal axis) and a library reference sequence (here in combined LTR–internal–LTR structure on the vertical axis). All matching fragments are shown as red lines and the merged regions are depicted as black lines. (D) A dot-plot like diagram displaying the query sequence compared with itself. (E) The sequence of a merged region or a region covered by an insertion can be downloaded by clicking on the ‘S’ button. (F) Local alignments generated by CENSOR can be viewed by clicking on the ‘A’ button in the table listing the raw CENSOR output of fragment positions (this alternatively displayed table is not shown here).

CONCLUSIONS

The defragmentation and visualization tool PLOTREP facilitates detection and further studies of repetitive elements in eukaryotic genomes. This software supports the identification of full-length elements even if they are fragmented and disrupted by insertions. Further analysis of sequences causing insertions larger than a few dozen base pairs may reveal previously unknown families of mobile genetic elements. Visual inspection of the 2D representation of fragments matching between the query sequence and a TE reference assists the user to grasp the repeat organization of the genomic region of interest.

27 in total

1. Repbase update: a database and an electronic journal of repetitive elements.

Authors: J Jurka
Journal: Trends Genet Date: 2000-09 Impact factor: 11.639

2. Highly abundant pea LTR retrotransposon Ogre is constitutively transcribed and partially spliced.

Authors: Pavel Neumann; Dana Pozárková; Jirí Macas
Journal: Plant Mol Biol Date: 2003-10 Impact factor: 4.076

3. Updating of transposable element annotations from large wheat genomic sequences reveals diverse activities and gene associations.

Authors: François Sabot; Romain Guyot; Thomas Wicker; Nathalie Chantret; Bastien Laubin; Boulos Chalhoub; Philippe Leroy; Pierre Sourdille; Michel Bernard
Journal: Mol Genet Genomics Date: 2005-10-11 Impact factor: 3.291

Review 4. Repbase Update, a database of eukaryotic repetitive elements.

Authors: J Jurka; V V Kapitonov; A Pavlicek; P Klonowski; O Kohany; J Walichiewicz
Journal: Cytogenet Genome Res Date: 2005 Impact factor: 1.636

5. CENSOR--a program for identification and elimination of repetitive elements from DNA sequences.

Authors: J Jurka; P Klonowski; V Dagman; P Pelton
Journal: Comput Chem Date: 1996-03

6. Athila4 of Arabidopsis and Calypso of soybean define a lineage of endogenous plant retroviruses.

Authors: David A Wright; Daniel F Voytas
Journal: Genome Res Date: 2002-01 Impact factor: 9.043

7. Genomic sequencing reveals gene content, genomic organization, and recombination relationships in barley.

Authors: Nils Rostoks; Yong-Jin Park; Wusirika Ramakrishna; Jianxin Ma; Arnis Druka; Bryan A Shiloff; Phillip J SanMiguel; Zeyu Jiang; Robert Brueggeman; Devinder Sandhu; Kulvinder Gill; Jeffrey L Bennetzen; Andris Kleinhofs
Journal: Funct Integr Genomics Date: 2002-04-25 Impact factor: 3.410

8. CACTA transposons in Triticeae. A diverse family of high-copy repetitive elements.

Authors: Thomas Wicker; Romain Guyot; Nabila Yahiaoui; Beat Keller
Journal: Plant Physiol Date: 2003-05 Impact factor: 8.340

9. BLAST2GENE: a comprehensive conversion of BLAST output into independent genes and gene fragments.

Authors: Mikita Suyama; David Torrents; Peer Bork
Journal: Bioinformatics Date: 2004-03-22 Impact factor: 6.937

10. Eight novel families of miniature inverted repeat transposable elements in the African malaria mosquito, Anopheles gambiae.

Authors: Z Tu
Journal: Proc Natl Acad Sci U S A Date: 2001-02-06 Impact factor: 11.205
