FishTEDB: a collective database of transposable elements identified in the complete genomes of fish.

Feng Shao1, Jianrong Wang2, Hongen Xu3, Zuogang Peng1.   

Abstract

Database URL: http://www.fishtedb.org/.

Year:  2018        PMID: 29688350      PMCID: PMC6404401          DOI: 10.1093/database/bax106

Source DB:  PubMed          Journal:  Database (Oxford)        ISSN: 1758-0463            Impact factor:   3.451


Introduction

Transposable elements (TEs) are discrete DNA segments that can insert into new chromosomal locations by one of two mechanisms (1). TEs are typically divided into Class I (‘copy and paste’ style, retrotransposons) and Class II (‘cut and paste’ style, transposons) based on whether the intermediate they use to move is RNA or DNA (2). On the basis of sequence similarities and structural relationships, these classes can be further subdivided into orders and superfamilies. Retrotransposons are commonly grouped into five distinct orders: long terminal repeat (LTR), Dictyostelium intermediate repeat sequence (DIRS), Penelope-like element (PLE), long interspersed nuclear element (LINE) and short interspersed nuclear element (SINE). DNA transposons consist of four main orders: terminal inverted repeat (TIR), Helitron, Crypton and Maverick (3). TEs are commonly considered molecular parasites owing to their ability to move and replicate within genomes. However, studies of TEs in the past several decades have shown that transposons can affect gene regulation, function and coding ability (4–6). Transposons also play important roles in new gene creation, chromosome rearrangement and genome evolution (7–11). Recently, the regulatory activities of TEs in both plants and animals have become a focus of research. For example, in the peppered moth, TEs enhance cortex gene expression levels, which underlies the adaptive coloration that arose during the industrial revolution (12). In oil palms, sporadic demethylation of a Karma TE within an intron of the MANTLED gene caused the mantled fruit phenotype (13).

Fish are the largest and oldest group of vertebrates. Thus far, 33 700 species have been recorded in Fishbase (http://www.fishbase.org/, version 10/2017), and this number is constantly increasing. Fish play a crucial role in modern biology. For example, zebrafish are not only model organisms for developmental biology but also a major disease research model (14, 15). Lungfish and coelacanth, which have been described as ‘living fossils’, provide a unique opportunity to understand the mechanisms that enabled the successful adaptation of vertebrates to land (16, 17). The content, diversity and distribution of TEs in fish genomes have been studied (18–21); however, the functions and evolutionary significance of transposons in fish genomes are largely unknown. A comprehensive database of fish TEs is needed to facilitate studies of TE functions and evolution in fish genomes.

In this study, we identified 33 260 consensus sequences of TEs classified into ∼50 superfamilies from 28 fish species, 1 lamprey and 1 lancelet, using de novo, structure-based and homology-based approaches. We integrated all data into a centralized database, FishTEDB, which allows users to browse, search and download all data. In addition, the web-based tools GetORF, BLAST and HMMER are provided to facilitate analyses of genomic sequences. FishTEDB can be used not only to study the origin, amplification mechanism and evolutionary dynamics of TEs in fish, but also for comparative analyses among vertebrates to elucidate the effects of TEs on genes and genomes.

Materials and methods

Data collection

All fish, lancelet and lamprey genomes used in this study were downloaded from public databases (Table 1). The Repbase Update collection (update 20150807) was retrieved from http://www.girinst.org/repbase/index.html (22). The Swiss-Prot data were downloaded from http://www.uniprot.org/downloads (23).
Table 1. Species in FishTEDB and their genome websites

| Species | Download link |
| --- | --- |
| Anguilla anguilla | https://www.ncbi.nlm.nih.gov/assembly/GCA_000695075.1 |
| Anguilla japonica | https://www.ncbi.nlm.nih.gov/assembly/GCA_000470695.1 |
| Astyanax mexicanus | ftp://ftp.ensembl.org/pub/release-84/fasta/astyanax_mexicanus/dna/ |
| Branchiostoma floridae | http://mosas.sysu.edu.cn/genome/download_data.php |
| Callorhinchus milii | http://esharkgenome.imcb.a-star.edu.sg/ |
| Ctenopharyngodon idellus | http://www.ncgr.ac.cn/grasscarp/ |
| Cynoglossus semilaevis | https://www.ncbi.nlm.nih.gov/assembly/GCA_000523025.1 |
| Dicentrarchus labrax | https://www.ncbi.nlm.nih.gov/assembly/GCA_000689215.1 |
| Electrophorus electricus | http://efishgenomics.zoology.msu.edu/?q=node/1 |
| Gadus morhua | ftp://ftp.ensembl.org/pub/release-84/fasta/gadus_morhua/dna/ |
| Gasterosteus aculeatus | ftp://ftp.ensembl.org/pub/release-84/fasta/gasterosteus_aculeatus/dna/ |
| Larimichthys crocea | https://www.ncbi.nlm.nih.gov/assembly/GCA_000972845.1 |
| Lates calcarifer | https://www.ncbi.nlm.nih.gov/assembly/GCA_001010145.1 |
| Latimeria chalumnae | ftp://ftp.ensembl.org/pub/release-84/fasta/latimeria_chalumnae/dna/ |
| Lepisosteus oculatus | ftp://ftp.ensembl.org/pub/release-84/fasta/lepisosteus_oculatus/dna/ |
| Neolamprologus brichardi | https://www.ncbi.nlm.nih.gov/assembly/GCA_000239395.1 |
| Nothobranchius furzeri | http://africanturquoisekillifishbrowser.org/downloads.html |
| Notothenia coriiceps | https://www.ncbi.nlm.nih.gov/assembly/GCA_000735185.1 |
| Oreochromis niloticus | ftp://ftp.ensembl.org/pub/release-84/fasta/oreochromis_niloticus/dna/ |
| Oryzias latipes | ftp://ftp.ensembl.org/pub/release-84/fasta/oryzias_latipes/dna/ |
| Periophthalmus magnuspinnatus | https://www.ncbi.nlm.nih.gov/assembly/GCA_000787105.1 |
| Petromyzon marinus | ftp://ftp.ensembl.org/pub/release-84/fasta/petromyzon_marinus/dna/ |
| Poecilia formosa | ftp://ftp.ensembl.org/pub/release-84/fasta/poecilia_formosa/dna/ |
| Scleropages formosus | https://www.ncbi.nlm.nih.gov/assembly/GCA_001005745.2 |
| Sinocyclocheilus grahami | https://www.ncbi.nlm.nih.gov/assembly/GCA_001515645.1 |
| Takifugu flavidus | https://www.ncbi.nlm.nih.gov/assembly/GCA_000400755.1 |
| Takifugu rubripes | ftp://ftp.ensembl.org/pub/release-84/fasta/takifugu_rubripes/dna/ |
| Tetraodon nigroviridis | ftp://ftp.ensembl.org/pub/release-84/fasta/tetraodon_nigroviridis/dna/ |
| Thunnus orientalis | https://www.ncbi.nlm.nih.gov/assembly/GCA_000418415.1 |
| Xiphophorus maculatus | ftp://ftp.ensembl.org/pub/release-84/fasta/xiphophorus_maculatus/dna/ |

Collection and identification of TEs in fish genomes

TE libraries of fish were generated using de novo, homology-based and structure-based methods (Figure 1). De novo identification of TEs was performed using RepeatModeler (http://www.repeatmasker.org/RepeatModeler/, version 1.0.7), which assists in automating the runs of RECON (24) and RepeatScout (25) to analyze fish genomic databases, and the output of this software was used to build, refine and classify consensus models of putative interspersed repeats. Repeats identified by RepeatModeler were filtered for tandem repeat coverage of >25%, using Tandem Repeats Finder (http://tandem.bu.edu/trf/trf.unix.help.html, version 4.07b) with the default parameters. The preserved sequences were used as queries for BlastX (identity > 30%, e-value < 1e-5 and percent query coverage > 50%) to search against Swiss-Prot data to filter protein-coding genes. We constructed a library of ncRNAs using tRNAscan-SE (version 1.3.1) (26) and Rfam (27) to filter tRNA and rRNA by Blastn (identity > 90%, BLAST e-value < 1e-5 and percent query coverage > 90%).
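The two BLAST-based filters described above share the same shape: a candidate repeat is discarded if any hit exceeds an identity threshold, falls below an e-value threshold and exceeds a query-coverage threshold. The following Python sketch illustrates that logic only. It assumes a hypothetical five-column, tab-separated hit table (query ID, subject ID, percent identity, e-value, percent query coverage), such as BLAST+ could emit with `-outfmt "6 qseqid sseqid pident evalue qcovs"`; the function and variable names are illustrative and are not taken from the FishTEDB pipeline.

```python
from typing import Iterable, List, Set

# Thresholds quoted in the text; the dictionary names are illustrative.
PROTEIN_FILTER = dict(min_identity=30.0, max_evalue=1e-5, min_qcov=50.0)  # BlastX vs Swiss-Prot
NCRNA_FILTER = dict(min_identity=90.0, max_evalue=1e-5, min_qcov=90.0)    # Blastn vs ncRNA library


def flagged_queries(rows: Iterable[str], min_identity: float,
                    max_evalue: float, min_qcov: float) -> Set[str]:
    """Return IDs of queries with at least one hit passing all three thresholds.

    Each row is assumed to be: qseqid, sseqid, pident, evalue, qcovs (tab-separated).
    """
    flagged: Set[str] = set()
    for row in rows:
        if not row.strip():
            continue  # skip blank lines
        qseqid, _sseqid, pident, evalue, qcov = row.rstrip("\n").split("\t")
        if (float(pident) > min_identity
                and float(evalue) < max_evalue
                and float(qcov) > min_qcov):
            flagged.add(qseqid)
    return flagged


def keep_unflagged(ids: Iterable[str], flagged: Set[str]) -> List[str]:
    """Keep candidate repeat IDs that were not flagged by a filter."""
    return [seq_id for seq_id in ids if seq_id not in flagged]


# Toy hit table: rep1 looks like a protein-coding gene, rep2 has only a weak hit.
example_hits = [
    "rep1\tsp|P12345\t45.0\t1e-20\t80",
    "rep2\tsp|P67890\t28.0\t1e-30\t95",
]
protein_like = flagged_queries(example_hits, **PROTEIN_FILTER)
survivors = keep_unflagged(["rep1", "rep2", "rep3"], protein_like)
```

The same function can be reused for the ncRNA step by passing `NCRNA_FILTER` and a Blastn hit table.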
Figure 1. Flowchart of the TE analysis pipeline.

For the LTR and non-LTR retroelements, given their easier-to-detect structural peculiarities (3), a structure-based approach was used. For LTR retrotransposons, LTR_STRUC (28) and MGEScan-LTR (http://darwin.informatics.indiana.edu/cgi-bin/evolution/daphnia_ltr.pl) were used to search the fish genome assemblies with default parameters. In MGEScan-LTR, intact LTR retroelements were identified using multiple empirical rules: similarity of the pair of LTRs at both ends, the presence of internal regions (IRs), di- (or tri-)nucleotides at the flanking ends and target site duplications (TSDs). We retained only the results that showed all four features. This framework was applied to identify a large number of novel elements, which were later analyzed to estimate the evolutionary history and relationships of LTR retrotransposons. Non-LTR retrotransposons were identified by the pHMM-based MGEScan-non-LTR (29) program with default parameters.

Given that Class II TEs lack easy-to-detect structural features, a homology-based method using TESeeker was employed to predict them. TESeeker is an automated, BLAST-based approach for identifying TEs that also makes use of the CAP3 assembly program and the ClustalW2 multiple sequence alignment tool, as well as numerous BioPerl scripts (30). In total, 257 transposase protein sequences from fish DNA transposons were extracted from RepBase and NCBI. These sequences were used as the library in TESeeker. Finally, we retained only the sequences with the highest quality in the consensus_contigs.fas file.
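The four-feature rule for intact LTR retroelements can be pictured as a simple conjunction of checks. The Python sketch below is illustrative only and is not MGEScan-LTR's implementation: the 0.9 similarity cutoff, the TG...CA terminal dinucleotides, the 5-bp TSD length and all names are assumptions chosen for the example, and `difflib.SequenceMatcher` stands in for a proper alignment.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class LtrCandidate:
    left_flank: str   # genomic bases immediately upstream of the 5' LTR
    ltr5: str         # 5' long terminal repeat
    internal: str     # internal region between the two LTRs
    ltr3: str         # 3' long terminal repeat
    right_flank: str  # genomic bases immediately downstream of the 3' LTR


def ltr_similarity(c: LtrCandidate) -> float:
    """Rough similarity of the two LTRs (1.0 means identical strings)."""
    return SequenceMatcher(None, c.ltr5, c.ltr3, autojunk=False).ratio()


def has_terminal_dinucleotides(c: LtrCandidate, start: str = "TG", end: str = "CA") -> bool:
    """Both LTRs begin and end with the assumed terminal dinucleotides."""
    return all(ltr.startswith(start) and ltr.endswith(end) for ltr in (c.ltr5, c.ltr3))


def has_tsd(c: LtrCandidate, length: int = 5) -> bool:
    """The bases flanking the element on each side are identical (assumed TSD length)."""
    return len(c.left_flank) >= length and c.left_flank[-length:] == c.right_flank[:length]


def is_intact(c: LtrCandidate, min_similarity: float = 0.9) -> bool:
    """Retain a candidate only if all four structural features are present."""
    return (ltr_similarity(c) >= min_similarity
            and len(c.internal) > 0
            and has_terminal_dinucleotides(c)
            and has_tsd(c))


intact = LtrCandidate("GGGACGTA", "TGAACCGGTTCA", "ATGAAA", "TGAACCGGTTCA", "ACGTATTT")
no_tsd = LtrCandidate("GGGACGTA", "TGAACCGGTTCA", "ATGAAA", "TGAACCGGTTCA", "TTTTTTTT")
```

Requiring all four features trades sensitivity for confidence: degraded or truncated copies are excluded, which suits a library of consensus sequences.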

TE classification and redundancy elimination in fish genomes

When identifying TEs in fish genomes, some software (TESeeker, RepeatModeler, MGEScan-LTR) can classify TEs into superfamilies, but the classification of some sequences remains unknown. REPCLASS (version 1.0, https://github.com/feschottelab/REPCLASS) and TEclass (31) were used to classify these TEs. REPCLASS was applied first. It uses an automated high-throughput workflow model, leveraging various programs to identify and classify TEs in new genomes, and it can classify consensus sequences into superfamilies. TEclass uses a support vector machine (SVM) based on oligomer frequencies to classify unknown TEs into DNA transposons, LTRs, LINEs and SINEs (31). Hence, for the consensus sequences that could not be classified into a superfamily by REPCLASS, we used TEclass (http://www.compgen.uni-muenster.de/tools/teclass/generate/index.pl?lang=en) to classify them into orders.

In the TE prediction step, we combined all of the results directly in a ‘union’ set of different types of evidence; therefore, the results contained redundant TEs that were predicted by different methods. We reduced the number of redundant sequences using CD-HIT (32) with the parameters cd-hit-est -c 0.90 and -n 8. Some transposons may insert in or next to other retrotransposons (especially LTR elements), forming highly TE-rich regions (nested TEs) (33–35). For example, some DNA transposons may insert into an LTR element. If all results were pooled before filtering, such DNA transposons would be filtered out because they are shorter than the LTR elements. Thus, to prevent interference by nested TEs, we removed redundancy within each superfamily separately. We aligned the sequences that could not be classified at the superfamily level (‘Unknown’ elements) to the corresponding genomes by BLAST (identity > 85% and coverage > 50%), and retained only sequences with copy number > 3.
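Per-superfamily redundancy removal can be sketched as grouping consensus sequences by their superfamily label and building one cd-hit-est command per group. The Python sketch below is a hypothetical illustration: the record layout, file names and helper names are assumptions, and only the `-c 0.90` and `-n 8` values come from the text (`-i` and `-o` are the standard CD-HIT input and output options).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (sequence ID, superfamily label, nucleotide sequence)


def group_by_superfamily(records: Iterable[Record]) -> Dict[str, List[Tuple[str, str]]]:
    """Collect (ID, sequence) pairs under their superfamily label."""
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for seq_id, superfamily, sequence in records:
        groups[superfamily].append((seq_id, sequence))
    return dict(groups)


def to_fasta(entries: Iterable[Tuple[str, str]]) -> str:
    """Render (ID, sequence) pairs as FASTA text."""
    return "".join(f">{seq_id}\n{sequence}\n" for seq_id, sequence in entries)


def cd_hit_command(in_fasta: str, out_fasta: str,
                   identity: float = 0.90, word_size: int = 8) -> List[str]:
    """Build the cd-hit-est argument list for one superfamily."""
    return ["cd-hit-est", "-i", in_fasta, "-o", out_fasta,
            "-c", f"{identity:.2f}", "-n", str(word_size)]


records = [
    ("te1", "Gypsy", "ACGTACGT"),
    ("te2", "hAT", "TTGACCA"),
    ("te3", "Gypsy", "ACGTACGA"),
]
groups = group_by_superfamily(records)
commands = {name: cd_hit_command(f"{name}.fa", f"{name}.nr.fa") for name in groups}
```

Each command list could then be passed to `subprocess.run` after writing the corresponding FASTA file. Because clustering never mixes superfamilies, a short DNA transposon nested in an LTR element is only compared with other members of its own superfamily.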

Implementation and web interface

To make this large amount of TE data available, a user-friendly web-based database, FishTEDB, was constructed. FishTEDB enables users to browse, search, download and analyze TEs (Figure 2). FishTEDB was built with Yii 2.0 (a high-performance PHP MVC framework for developing Web 2.0 applications). The server runs Linux (CentOS 6.7), with Nginx 1.10 (a high-performance HTTP server and reverse proxy server) as the web server, MySQL 5.7 as the database and PHP 7.0 for web development. Bootstrap 3.3, JavaScript, jQuery and HTML5 were also used for the web pages.
Figure 2. User interface introduction. (A) Browsing data shown in a superfamily-centric way; (B) Browsing data shown in a species-centric way.

Browser

All TEs are displayed in the browsing interface in both species- and superfamily-centric views. In the superfamily-centric view, users can browse a superfamily by clicking the corresponding number, and detailed information for each superfamily can be retrieved using the hyperlinks provided (Figure 2A). In the species-centric view, all TEs are assigned to their corresponding species, and TE data are browsed in the same way (Figure 2B). Users can also locate entries by keyword (TE class, TE order, TE superfamily or species name) in the search section, which is implemented with approximate string matching (Figure 3A). All data can be downloaded. In addition, we calculated the number of sequences in each superfamily and displayed the counts as a pie chart and histogram (Figure 4).
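Approximate string matching lets a search tolerate small typos in a keyword. The source does not state which algorithm FishTEDB uses, so the following Python sketch is only one possible illustration, built on the standard-library `difflib.get_close_matches`; the vocabulary, the 0.6 cutoff and the function name are assumptions.

```python
from difflib import get_close_matches
from typing import Iterable, List


def fuzzy_search(keyword: str, vocabulary: Iterable[str],
                 max_results: int = 5, cutoff: float = 0.6) -> List[str]:
    """Return vocabulary terms that approximately match the keyword, best first.

    Matching is case-insensitive; results keep their original spelling.
    """
    lowered = {term.lower(): term for term in vocabulary}
    hits = get_close_matches(keyword.lower(), list(lowered), n=max_results, cutoff=cutoff)
    return [lowered[hit] for hit in hits]


superfamilies = ["Gypsy", "Copia", "hAT", "Tc1-Mariner"]
result = fuzzy_search("gypsi", superfamilies)
```

Raising `cutoff` makes the search stricter, while lowering it returns more distant candidates.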
Figure 3. Snapshots of different functional sections provided in FishTEDB. (A) Screenshot of keyword search results; (B) BLAST interface and a sample of BLASTn results; (C) GetORF interface and output results; (D) HMMER interface of a test protein sequence in FishTEDB.

Figure 4. The statistics of consensus sequences. (A) Pie chart of different classes and orders; (B) Histogram of different superfamilies in TIR.

Tools

Three general sequence analysis tools, BLAST (36), GetORF (37) and HMMER (38), were configured in our database. Examples of BLASTN, GetORF and HMMER results are shown in Figure 3B–D, respectively. BLAST is used for homology searches: users can align query sequences of interest against FishTEDB to make a preliminary judgment about whether a query is a TE and which type it belongs to. BLAST can also help researchers detect whether TEs exist in the sequences upstream and downstream of genes of interest. Users can identify potential open reading frames (ORFs) in query sequences using the GetORF tool. Because some TEs differ considerably (especially between species) even though they belong to the same superfamily, a nucleotide-level BLAST alignment alone may be insufficient. GetORF can predict amino acid sequences (transposase, integrase, reverse transcriptase), which can be combined with BLAST and HMMER for TE identification and classification in species that are too distantly related to fish for nucleotide-level comparison. HMMER is used to identify the transposase, endonuclease and reverse transcriptase domains of transposons. All profile hidden Markov model (profile-HMM) databases were collected from a previous study (29) and Pfam (39).
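To make the idea of ORF prediction concrete, the following Python sketch scans all six reading frames for stretches that begin with ATG and end at the first in-frame stop codon. It is a minimal illustration and not the EMBOSS getorf program, which supports several ORF definitions and genetic codes; the function names and the `min_length` default are assumptions.

```python
from typing import Iterator, List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_length: int) -> Iterator[Tuple[int, int]]:
    """Yield 0-based, half-open (start, end) spans of ATG..stop ORFs on one strand."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens an ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_length:
                    yield (start, i + 3)  # span includes the stop codon
                start = None


def find_orfs(seq: str, min_length: int = 30) -> List[Tuple[str, int, int, str]]:
    """Return (strand, start, end, sequence) for ORFs on both strands.

    Coordinates on the '-' strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    found = [("+", s, e, seq[s:e]) for s, e in orfs_in_strand(seq, min_length)]
    rc = reverse_complement(seq)
    found += [("-", s, e, rc[s:e]) for s, e in orfs_in_strand(rc, min_length)]
    return found


orfs = find_orfs("ccATGAAATTTGGGTAAcc", min_length=9)
```

Predicted ORFs would then be translated and compared at the protein level, for example against profile HMMs, which is where distant homologs remain detectable.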

Results and discussion

In the seminal work of Barbara McClintock, TEs were proposed as the ‘controlling elements’ of maize (40). Since then, many researchers have paid close attention to the functions of TEs; however, to what extent the pervasive colonization of genomes by TEs has affected the evolution of eukaryotic gene regulation remains a matter of speculation and controversy (41). The evolution of fish began ∼530 million years ago during the Cambrian explosion (42). It was during this time that early chordates developed the skull and the vertebral column, leading to the first vertebrates (43). Thus, assuming that TE mechanisms are shared, investigating the roles of TEs in genome evolution and their impact on host genes in fish may offer insights for other vertebrates. In this study, we constructed an effective combined pipeline, suitable not only for fish but also for other vertebrates. FishTEDB provides a good basis for TE functional studies and serves as an auxiliary resource. First, FishTEDB can enrich the transposon data of vertebrates and promote transposon research. In particular, it provides a homology database for the identification and classification of TEs. Second, researchers can combine the tools in FishTEDB with their own sequences to rapidly locate potential TEs. We identified 33 260 TEs from 30 species: 28 fishes, 1 lamprey and 1 lancelet. Most TEs were classified into known superfamilies (Table 2). In addition, the results suggest that TEs are diverse in fish genomes. In particular, the Gypsy, L1, L2, R2, RTE, Rex, Tc1-Mariner and hAT superfamilies showed higher diversity than other superfamilies. Nevertheless, fishes and lancelet presented a lower diversity of SINEs.
Table 2. Summary of identified transposable element families (consensus sequences) in FishTEDB

ClassOrderSuperfamilyQuantity
FishLampreyLancelet
CLASS ILTRCopia4511
Gypsy178716029
DIRS199N3
ERV1871N
Ngaro9162
Pao574N
Unknown LTR3378214117
LINECR161198171
CREN1N
DRE3NN
Dong99NN
I21058
Jockey29285
L123253157
L227947572
Penelope1716915
Proto218N7
R1511
R26261021
RTE963384193
Rex9544839
Tad1711
Unknown LINE13792186
SINE5S414N
7SL1NN
ID10NN
MIR75N13
U31N
tRNA1984411
Unknown SINE347519
Unknown non-LTR18794398
CLASS IITIRAcadem20321
CACTA45N2
Tc1-Mariner22245811
hAT28045251
Mutator15N10
CMC277620
PIF-Harbinger438156
PIF-ISL2EU6313
PiggyBac94N17
Merlin3NN
Zator142
MuLE4218
Sola4528
P20NN
Kolobok96NN
Ginger19N11
Dada23N4
Zisupton5NN
Novosib21N2
CryptonCrypton27NN
HelitronHelitron162223
MaverickMaverick59NN
Unknown DNA467157190
UnknownUnknownUnknown6781452
Total3034414761440

Note. Numbers represent the number of consensus sequences and N indicates undetected.

It should be noted that we classified only ∼60% of consensus sequences into superfamilies. There are still many TEs that cannot be classified into known superfamilies. The karyotypes and genome sizes of fish are more diverse and complex than those of other vertebrates, and an extra level of complexity arises from the whole genome duplication (WGD) and rediploidization events that teleost fish have undergone during evolution (44). Therefore, we speculate that there are many fish-specific transposons, such as Zisupton (45). TE research is difficult without a dedicated database. The transposon information for zebrafish in RepBase is probably the most comprehensive thus far, but it is still not sufficient to support the classification of fish TEs. Nevertheless, these unclassified TEs may have potential effects on regulating host gene function and expression. In future studies, we will focus on the identification of novel superfamilies to further enrich TE data resources.