
SINEBase: a database and tool for SINE analysis.

Abstract

SINEBase (http://sines.eimb.ru) integrates the revisited body of knowledge about short interspersed elements (SINEs). A set of formal definitions concerning SINEs was introduced. All available sequence data were screened through these definitions and the genetic elements misidentified as SINEs were discarded. As a result, 175 SINE families have been recognized in animals, flowering plants and green algae. These families were classified by the modular structure of their nucleotide sequences and the frequencies of different patterns were evaluated. These data formed the basis for the database of SINEs. The SINEBase website can be used in two ways: first, to explore the database of SINE families, and second, to analyse candidate SINE sequences using specifically developed tools. This article presents an overview of the database and the process of SINE identification and analysis.

Year: 2012 PMID: 23203982 PMCID: PMC3531059 DOI: 10.1093/nar/gks1263

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Short interspersed elements (SINEs) are mobile genetic elements that have invaded the genomes of most higher eukaryotes (exceeding 10% of some genomes). Although these genomic parasites can be deleterious to the cell, their long-term presence in the genome has made SINEs a valuable source of genetic variation, providing regulatory elements for gene expression, alternative splice sites, polyadenylation signals and even functional RNA genes (1–4). At the same time, the classification and nomenclature of SINEs remain largely unsystematized. SINEBase is a manually curated database of the SINE families known to date. It aims to be a resource for scientists working on mobile elements as well as for a wide range of biologists analysing nucleic acid sequences. SINEBase can be considered a compendium of SINEs; its toolset allows individual SINE sequences to be assigned to known SINE families and/or analysed.

Definitions

Retro(trans)posons are genetic elements that amplify themselves in eukaryotic genomes through an RNA intermediate and thus require transcription and reverse transcription. Retrotransposons are divided into three classes: long terminal repeat (LTR) elements, long interspersed elements (LINEs) and SINEs. The elements that encode the enzymatic activities responsible for reverse transcription and integration of the DNA copy into the genome are called autonomous transposons. Nonautonomous retroposons rely on the enzymatic machinery of autonomous transposons. LTR retrotransposons and LINEs can be autonomous or nonautonomous, and their genomic copies are transcribed by the cellular RNA polymerase II (5,6). SINEs are defined as relatively short (<700 bp) nonautonomous retroposons transcribed by the cellular RNA polymerase III (pol III) from an internal promoter, whereas their reverse transcription depends on the reverse transcriptase of partner LINEs. Eukaryotic genomes can harbor hundreds of thousands (sometimes more) of SINE copies; copies originating from a common ancestral SINE can differ from each other by single-nucleotide alterations as well as by longer internal deletions or duplications (SINEs with such a duplication are called quasidimeric). Some of them can become founders of new SINE subfamilies.

SINEs consist of two or more modules: typically a head, a body and a tail. The 5′-terminal head originates from one of the cellular RNAs synthesized by pol III: tRNA, 7SL RNA or 5S rRNA. The body is either of unknown origin or descends from a partner LINE; SINEs with such a LINE-derived region mimic LINE RNA in reverse transcription (7). The body can also contain a central domain shared by distant SINE families (CORE and similar domains). The 3′-terminal tail is a sequence of variable length consisting of simple (often degenerate) repeats. In addition, two SINEs can combine into a dimeric SINE, thus giving rise to a new SINE family. SINEs consisting of the head and tail only are called simple, whereas dimeric, trimeric, etc. SINEs are called complex. Various aspects of SINE structure, biology and evolution have been reviewed elsewhere (4,8,9).

We consider SINEs as (i) short (<1 kb) interspersed (nontandem) genomic repeats; (ii) present in at least 100 copies per genome (except certain genomes where repetitive elements are not abundant, e.g. Arabidopsis thaliana); and (iii) showing at least 60% identity with a tRNA species (10), 5S rRNA or 7SL RNA over an overlap of at least 60 nt (with a few exceptions where transcription of the element by pol III was confirmed experimentally). We found that pol III promoters (e.g. boxes A and B) can serve only as an indication (but not a proof) that a sequence belongs to SINEs. (Even when more sophisticated methods of pol III promoter identification, e.g. position frequency matrices, were used, the proportion of false positives and/or misses remained high at different ‘stringency’ values.) SINEs should be distinguished from RNA pseudogenes: pseudogenes are generated by reverse transcription of functional RNAs of cellular origin (e.g. 5S rRNA) rather than of SINE RNAs transcribed from their genomic copies. In practical terms, most SINEs have extra (body) sequences, whereas simple SINEs have characteristic substitutions/indels shared with their source gene but not with the cellular RNA gene. In addition, SINEs significantly outnumber RNA pseudogenes.

The notion of ‘SINE family’ is widely used but not clearly defined. We consider a SINE family to be a set of SINEs (i) of a common origin and (ii) consisting of the same modules (except the tail, which can vary even within the same species). Thus, similar SINEs with different LINE-derived regions (e.g. mammalian Ther-2 and Mar-1) belong to different families. Long insertions are considered as modules. At the same time, internal deletions or duplications within modules do not give rise to a new family, although a combination of complete or almost complete SINEs (complex SINEs) is considered a new family (thus, pB1 and quasidimeric B1 are subfamilies of the same family, whereas dimeric Alu represents a distinct family). Finally, there are a few SINEs with similar structure but of independent origin (e.g. the simple SINEs ID in rodents, vic-1 in camels and DAS-I in armadillos); these are considered different families.
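The three working criteria for recognizing a SINE given above (length, copy number, and similarity to a pol III-transcribed RNA) can be read as a simple screening rule. The following is a minimal sketch of that rule, not the authors' implementation: the `Candidate` record, its field names and the function `meets_sine_criteria` are hypothetical, and the inputs (copy number, best RNA match) are assumed to have been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of a candidate repeat (field names are assumptions)."""
    length_bp: int                    # element length, bp
    copies_per_genome: int            # estimated copy number in the genome
    best_rna_identity: float          # identity (0..1) to best tRNA / 5S rRNA / 7SL RNA match
    best_rna_overlap_nt: int          # length of that overlap, nt
    interspersed: bool = True         # False for tandem repeats
    repeat_poor_genome: bool = False  # genome where repeats are not abundant
    pol3_confirmed: bool = False      # pol III transcription shown experimentally


def meets_sine_criteria(c: Candidate) -> bool:
    """Apply criteria (i)-(iii) from the text, including the stated exceptions."""
    # (i) short (<1 kb) interspersed (nontandem) repeat
    short_interspersed = c.length_bp < 1000 and c.interspersed
    # (ii) at least 100 copies, unless the genome is repeat-poor
    abundant = c.copies_per_genome >= 100 or c.repeat_poor_genome
    # (iii) >= 60% identity over >= 60 nt, unless pol III transcription is confirmed
    rna_derived = (
        c.best_rna_identity >= 0.60 and c.best_rna_overlap_nt >= 60
    ) or c.pol3_confirmed
    return short_interspersed and abundant and rna_derived
```

Passing the check does not prove that a sequence is a SINE; as noted above, RNA pseudogenes and other repeats still have to be excluded by inspection.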

DATABASE OF SINE FAMILIES

Data acquisition

We extracted consensus sequences of SINE families largely from two sources: original publications and the Repbase Update (RU; ver. 16.07) database (11). In many cases, they were refined using the available sequence databases. The consensus sequences were compared with the sequences of other SINEs, LINEs, tRNA species, 5S rRNA and 7SL RNA to identify their modules. Similar sequences were aligned and the differences were analysed. Elements composed of the same modules were considered to belong to the same SINE family. Some cases were particularly knotty, e.g. the CHRS family. This SINE is a quasioligomer: it contains a ∼20-nt degenerate motif that can be tandemly repeated more than 10 times. The variants differing in the number of these repeats were previously recognized as different SINE families (CHRS, CHR-2, CHRL, etc.). Multiple alignments clarifying such cases can be found in Supplementary Alignments S1. As a result, 175 SINE families were recognized according to the above definitions.
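The grouping step described above (elements composed of the same modules fall into one family, with the tail ignored) can be sketched as follows. This is an illustration only: the module labels, the element names and the functions `family_key` and `group_by_modules` are hypothetical, and the sketch covers only the module-composition criterion, not the requirement of a common origin, which needs sequence comparison.

```python
from collections import defaultdict


def family_key(modules):
    """Ordered module composition with the tail dropped (the tail may vary)."""
    return tuple(m for m in modules if m != "tail")


def group_by_modules(elements):
    """Group element names by module composition.

    `elements` maps an element name to an ordered sequence of module labels,
    e.g. ("tRNA", "body", "tail"). Labels are assumed to be assigned beforehand
    by comparing each consensus with RNA, central-domain and LINE sequences.
    """
    groups = defaultdict(list)
    for name, modules in elements.items():
        groups[family_key(modules)].append(name)
    return dict(groups)


# Hypothetical input: elem_A and elem_B differ only in having a tail annotated,
# whereas elem_C carries an additional LINE-derived module.
example = {
    "elem_A": ("tRNA", "body", "tail"),
    "elem_B": ("tRNA", "body"),
    "elem_C": ("tRNA", "body", "L2", "tail"),
}
candidate_families = group_by_modules(example)
```

Elements that share a key are only candidates for one family; elements of independent origin with the same structure would still have to be separated by hand.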

SINEBase organization

The heart of the database is the SINETable (also available as Supplementary Table S1) visualizing the main data about all SINE families known to date (length, distribution, copy number, schematic structure, etc.). The table contents can be limited to certain taxa and sorted by some characters (e.g. tail sequence). It contains links to SINE family-specific data (e.g. consensus sequence or publications) or to term descriptions. The databases of consensus sequences of SINE families, central domains and LINE-derived regions can be downloaded in the Download section, whereas individual consensus sequences and the multiple alignments are accessible as Supplementary Sequences S1 and Supplementary Alignments S2 and S3, respectively.

SINEBase tool

Based on our long-term experience in SINE analysis, we offer a tool for the identification of SINE families and modules (SINESearch). This tool can also establish that the sequence of interest is not a SINE or that it belongs to an unknown SINE family. In the latter case, SINESearch can be used to analyse the modules of the new SINE. It is a FASTA-based search that selects sequences using parameters other than FASTA’s statistical significance test. This circumvents two limitations of FASTA (as well as BLAST etc.) in the case of relatively short and degenerate similarities between nucleotide sequences: a bias towards short (almost) perfect matches, whereas the goal is full-length and significant similarities; and missed significant hits when the bank includes many sequences similar to the query. The search banks include our collections of full-length SINEs and their modules (certain RNAs, central domains and LINE-derived regions). SINESearch is simple to use and fast. The search parameters used, overlap length and sequence identity, are biologically sensible and allow easy adjustment of hit selection. The query sequence can be entered manually or uploaded. SINESearch offers four banks: SINEBank (consensus sequences of SINE families), RNABank (human tRNA species (10) plus 7SL RNA and 5S rRNA), LINEBank (SINE consensus sequences derived from partner LINEs) and COREBank [consensus sequences of central (CORE, Deu-, V-, Ceph-, α- and β-) domains]. The recommended protocol for the analysis of putative SINE sequences (explained in detail in the Help section) includes the following steps: preliminary analysis of the sequence of interest to exclude non-SINE sequences; SINESearch against SINEBank to identify SINEs that belong to known SINE families; and SINESearch against the other banks to identify individual modules of a putative SINE.
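The idea of selecting hits by overlap length and sequence identity instead of a statistical significance score can be illustrated with a short sketch. This is not the SINESearch code: the `Hit` record, the function `select_hits`, the default thresholds and the ranking by approximate number of identical positions are all assumptions made for illustration, and the alignments are assumed to come from a prior FASTA run.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical summary of one pairwise alignment from a prior search."""
    subject: str     # name of the bank sequence
    identity: float  # fraction of identical positions in the overlap (0..1)
    overlap: int     # overlap length, nt


def select_hits(hits: List[Hit], min_identity: float = 0.6,
                min_overlap: int = 60) -> List[Hit]:
    """Keep hits that pass both thresholds; rank longer, more similar hits first.

    The thresholds are illustrative defaults, not SINESearch settings.
    """
    kept = [h for h in hits
            if h.identity >= min_identity and h.overlap >= min_overlap]
    # Rank by the approximate number of identical positions (assumed criterion).
    return sorted(kept, key=lambda h: h.identity * h.overlap, reverse=True)


# Hypothetical hits: a short perfect match and a long, more divergent match.
hits = [
    Hit("bank_seq_1", identity=1.00, overlap=25),
    Hit("bank_seq_2", identity=0.75, overlap=180),
    Hit("bank_seq_3", identity=0.55, overlap=200),
]
selected = select_hits(hits)
```

With these inputs only `bank_seq_2` is retained: the short perfect match fails the overlap threshold and the long weak match fails the identity threshold, which is the behaviour the text describes as the goal (full-length similarity over short near-perfect matches).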

SINE data analysis

The length of SINE consensus sequences without the tail ranges from 75 to 662 nt, with a mean length of 253 nt (Figure 1). In terms of structure (Figure 2), the majority of families are monomeric (87%) tRNA-derived (84%; green sectors in Figure 2) SINEs. There are roughly three times fewer SINE families with a LINE-derived region (dark green sectors) than without it (light green sectors), although this ratio can decrease as new partner LINEs are identified. More than a quarter of the families contain CORE and similar domains (dotted sectors). The most common SINE structure is a tRNA-derived head followed by a body of unknown origin and a tail (41%); other patterns range from 2 to 14%. Complex SINE families amount to 13% (purple sectors in Figure 2).

Figure 1.

Length distribution for 175 eukaryotic SINE families (without tail). The range is from 75 to 662 nt with the mean and median length of 253 and 236 nt, respectively.

Figure 2.

Occurrence of different SINE structures. Complex SINEs are highlighted by shades of purple; brown and yellow sectors represent 7SL RNA- and 5S rRNA-derived SINEs, respectively; tRNA-derived are shown in shades of green (dark and light hues correspond to SINEs with and without LINE-derived region, respectively). Dotted sectors indicate SINEs with CORE domains; schematic SINE structures (as in SINETable/Supplementary Table S1) are shown next to or over the corresponding sectors. Percentage is the fraction of a structure among 175 SINE families, and the number in parentheses is the number of SINE families representing the structure.

The collection of consensus sequences was further analysed in an attempt to identify similar patterns in their structure. All tRNA-derived sequences of SINE families were used to generate a sequence logo, which was compared with that of human tRNA genes (Figure 3; Supplementary Alignments S4). Overall, the same sequence pattern is observed in both cases, although it is less pronounced for SINEs (i.e. SINE sequences are more variable). SINEs have a short G-rich extra sequence at the 5′-end compared with tRNAs (and of course extra downstream sequences).
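The statement that the pattern is "less pronounced" for SINEs corresponds, in a sequence logo, to lower information content per alignment column. The sketch below shows how that quantity is commonly computed for a gap-free nucleotide alignment. It is an assumption-laden illustration rather than the procedure used for Figure 3: the function names are hypothetical, and the sketch omits any small-sample correction that logo software may apply.

```python
import math
from collections import Counter


def column_information(column: str, alphabet: str = "ACGT") -> float:
    """Information content (bits) of one alignment column.

    Computed as log2(alphabet size) minus the Shannon entropy of the observed
    base frequencies; characters outside the alphabet (e.g. gaps) are ignored.
    """
    counts = Counter(b for b in column.upper() if b in alphabet)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(len(alphabet)) - entropy


def alignment_information(sequences):
    """Per-column information content for equal-length aligned sequences."""
    return [column_information("".join(col)) for col in zip(*sequences)]
```

A fully conserved column scores 2 bits and a column with all four bases at equal frequency scores 0 bits, so a more variable set of sequences yields a flatter logo.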

Figure 3.

Conservation of tRNA-derived sequences in SINEs. Sequence logos of (A) 175 tRNA-derived sequences in SINEs (including those in the second and third monomers of complex SINEs) and (B) 359 human tRNA genes (10) were generated by WebLogo 3.1 (13). The multiple alignments were edited to eliminate gaps in the logos. The original alignments of tRNAs and tRNA-derived sequences in SINEs are available in Supplementary Alignments S4.

Special surveys were carried out for the body region targeted at the central CORE-like domains and LINE-derived regions. This allowed us to identify such domains in certain SINEs, where they remained unnoticed (e.g. the CORE domain was found in several sea urchin SINEs), as well as to identify two new central domains named α and β. As a result, the new consensus sequences were generated for most central domains (CORE, Deu, V, Ceph, α and β) (Supplementary Alignments S2). A similar analysis of the body 3′ terminal regions allowed us to generate multiple alignments and consensus sequences for four LINE-derived regions corresponding to Bov-B, CR1 and two L2 LINEs (Supplementary Alignments S3).

Similar resources

The most comprehensive up-to-date database of repetitive genomic elements (REs) is RU (11). De facto, it has become the standard source for RE research and nomenclature. RU includes many other types of REs apart from SINEs, and its SINE consensus sequences represent families and subfamilies from groups of organisms and from individual genomes in the same pool. Clearly, SINEBase, which covers only SINE families, cannot be considered a replacement for RU. At the same time, our analysis revealed a number of discrepancies between SINEBase and RU, which we believe stem from certain errors and ambiguities in RU (Supplementary Table S2). (i) As many as 80 records annotated in RU as SINEs (527 at the time of analysis) were not included in SINEBase, as they correspond to other RE types (largely LINEs); (ii) SINEBase assigns consistent names to the same SINEs in different species and to SINE subfamilies, which in RU can be assigned different names (only 130 RU records correspond to SINEBase families, whereas 258 correspond to subfamilies and species variants); (iii) a substantial fraction of SINEBase families (45 in total) are missing from RU; and (iv) finally, SINEBase uses a straightforward SINE nomenclature, which in most cases relies on the names of previously described SINEs, whereas RU tends to rename them. As an example, the RU naming scheme (39) (which has changed several times) includes 43 records starting with ‘SINE2-’, and a text with numerous SINE2-2_CQ, SINE2-2_NV and SINE2-2_SP is hard to read. We renamed such families by omitting the redundant ‘SINEx’. Supplementary Table S2 lists all RU records annotated as SINEs and describes their status in SINEBase.

Furthermore, the RepeatMasker program (http://www.repeatmasker.org), which is routinely used to identify SINEs, relies on (slightly modified) RU records. RepeatMasker finds the best hit among RU sequences based on certain statistical parameters, such that a high similarity over short fragments can be considered more significant than a lower similarity throughout the element; this is particularly true for short sequences. At the same time, different SINE families can share the same (e.g. 5S rRNA-derived) module, and the sequences in this region can be highly similar, whereas those in the other regions are dissimilar. The situation is even worse for SINE subfamilies distinguished by diagnostic characters (often a single nucleotide), as RepeatMasker treats them on a par with random (non-diagnostic) mutations. Because many RU records belong to the same SINE family, RepeatMasker can recognize a set of sequences of the same family as several different SINEs. As a result, blind reliance on this tool often leads to confusing misidentifications of SINEs or even absurd errors in otherwise competent publications, e.g. hundreds of thousands of Alu copies in the genomes of mouse and rat (rodents have a different, much shorter B1 SINE; Alu is limited to primates) or a single 51-nt B2 in the human genome (SINEs are longer and repetitive; most likely it is a tRNA pseudogene) (12). We have designed the SINESearch tool to be free from these limitations.

SINEBase website

SINEBase is hosted on an Apache web server and uses CGI-Perl and JavaScript to generate dynamic HTML pages. It is fully functional with major web browsers. Some of the older browsers tested still support the core functions of the site, although some enhancements may not work (e.g. SINETable sorting). The database will be updated as new SINE data become available to us (at least biannually). We encourage the submission of new data on SINE families. SINEBase is freely available at http://sines.eimb.ru. There are no access restrictions for academic and commercial use. We kindly ask all users to cite this article if they use SINEBase in their publications.

CONCLUSION

As more and more genome sequences become available, the number of known SINEs will grow and new researchers will become involved in their analysis. SINEBase aims to bring some order to the system of SINEs and to provide a basis for further studies on these genomic elements. The database of SINE consensus sequences and motifs will be updated as new SINEs are described. We will develop new tools to assist in SINE analysis (identification of TSDs and of internal duplications are first on the list). We appreciate feedback from SINEBase users to improve the service.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Caption 1, Supplementary Tables 1 and 2, Supplementary Sequence 1, Supplementary Alignments 1–4 and Supplementary References [13-99].

FUNDING

Molecular and Cellular Biology Program of the Russian Academy of Sciences; Russian Foundation for Basic Research [project nos. 13-04-01678-a and 11-04-00439-a]. Funding for open access charge: Personal funds.

Conflict of interest statement. None declared.

