LINE FUSION GENES: a database of LINE expression in human genes.

Dae-Soo Kim, Tae-Hyung Kim, Jae-Won Huh, Il-Chul Kim, Seok-Won Kim, Hong-Seog Park, Heui-Soo Kim.

Abstract

BACKGROUND: Long Interspersed Nuclear Elements (LINEs) are the most abundant retrotransposons in humans. About 79% of human genes are estimated to contain at least one segment of LINE per transcription unit. Recent studies have shown that LINE elements can affect protein sequences, splicing patterns and expression of human genes.
DESCRIPTION: We have developed a database, LINE FUSION GENES, for elucidating LINE expression throughout the human gene database. We searched the 28,171 genes listed in the NCBI database for LINE elements and analyzed their structures and expression patterns. The results show that the mRNA sequences of 1,329 genes were affected by LINE expression. The LINE expression types were classified on the basis of LINEs in the 5' UTR, exon or 3' UTR sequences of the mRNAs. Our database provides further information, such as the tissue distribution and chromosomal location of the genes, and the domain structure that is changed by LINE integration. We have linked all the accession numbers to the NCBI data bank to provide mRNA sequences for subsequent users.
CONCLUSION: We believe that our work will interest genome scientists and might help them to gain insight into the implications of LINE expression for human evolution and disease.
AVAILABILITY: http://www.primate.or.kr/line.

Year:  2006        PMID: 16756682      PMCID: PMC1501021          DOI: 10.1186/1471-2164-7-139

Source DB:  PubMed          Journal:  BMC Genomics        ISSN: 1471-2164            Impact factor:   3.969


Background

Most retroelements have been considered harmful because they cause accumulation of insertion and deletion mutations in the host genome [1]. Mutation of retroelements could affect gene transcription and translation. However, recent investigations have shown that HERV and Alu elements in the introns or flanking regions of functional human genes provide alternative promoters, splicing sites and polyadenylation signals [2,3]. Unlike HERV and Alu, LINE elements tend to contain multiple potential splicing signals, such as exonic splicing enhancers (ESEs) [4], and polyadenylation signals [5] in their sequences.
There are four types of transposable elements in the human genome: long interspersed nuclear elements (LINEs or L1s), which are non-long terminal repeat (non-LTR) retrotransposons; short interspersed nuclear elements (SINEs); LTR retrotransposons (endogenous retroviruses); and DNA transposons [1]. Together they constitute 45% of the total genome. Most of these elements are inactive. However, a few LTR elements have been shown to contain intact open reading frames (ORFs) [6], and LINE elements also have the capacity for autonomous retrotransposition [7,8]. SINE elements cannot be expressed by themselves and depend on L1 elements for active mobility [9]. The L1 elements constitute about 17% of the human genome and are present in at least one copy in an estimated 79% of human genes [10]. A full-length L1 is about 6 kb long. It consists of a 5' untranslated region (5'UTR); two nonoverlapping open reading frames (ORF1 and ORF2) encoding an RNA binding protein [11], an endonuclease [12] and a reverse transcriptase [13]; and a 3'UTR that ends in an AATAAA polyadenylation signal and a polyA tail [9]. The Alu and SVA transposable elements and processed pseudogenes are believed to have been inserted into the genome by borrowing the endonuclease and reverse transcriptase from L1 elements [14-16].
The L1 element itself has also been inserted into new genomic locations during mammalian evolution. Such elements are mostly truncated and rearranged to form inactive copies of their progenitors. These insertional mutations are reported to be associated with twelve genetic diseases [17] and also contribute to protein variability or versatility [18]. Active or functional L1 elements, which are involved in shaping the human genome, are differentiated into three types depending on where they are inserted into the genome. First, a 6 kb full-length or variable-length 5'-truncated L1 element is inserted into the 5'UTR or introns of a gene, affecting its expression. In this process, LINE elements are probably reverse transcribed and integrated in the new location by target-primed reverse transcription (TPRT) [19]. LINE elements have provided not only many internal promoters at new genomic locations, but also 5'-UTR-located internal promoters, which could guide the transcription of many adjacent genes [20]. Second, retrotransposition of the L1 element results in the transduction of a 3'-UTR flanking fragment to a new genomic location; this is due to the effect of the ambiguous L1 polyadenylation signal [21]. Third, the L1 components are shuffled into exons, affecting the splicing site at transcription and consequently leading to the production of alternative mRNA transcripts [22].
Assembling genomic information and constructing a web database of genome annotations and genes with particular functions is generally useful for implementing functional studies and for understanding evolutionary genomic organization. Representative web databases of transposable elements in the human genome have been reported: a database of Alu elements incorporated within protein-coding genes [2], an HERV expression and structure analysis system [3] and a system for extrapolating functional annotation to the prediction of active LINE-1 elements [23].
Although it is well established that information about the structure and position of LINE elements in genes is important for functional studies of genetic diseases, such data are limited and are not included in any database that allows large amounts of scattered information to be searched easily. To address this deficiency, we developed a database for LINE expression and structure in the human genome, LINE FUSION GENES. Our database provides the structures and expression patterns of LINE elements including their relative positions in the genes, and additional information such as the tissue distribution and chromosomal location of the genes and their domain structures. To enhance ease of access for subsequent users, we linked all of the accession numbers to the NCBI data bank to provide mRNA sequences.

Construction and content

Identification of transcript variants by LINE insertion (LINE FUSION GENES)

First, 28,171 human gene mRNA sequences and human expressed sequence tags (ESTs) were downloaded from the NCBI database, Build 35 (INSDC), and aligned with the genomic assembly sequences (Build 35) using the SIM4 program [24]. Only alignments showing >97% sequence identity were carried forward. From these alignments we extracted positional information matching the exons to the genome sequence. On the basis of this information we collected contiguous sequences from 5 kb upstream of the 5'UTR end to the same distance downstream of the 3'UTR end. All the sequences were stored as mapping data for each gene. In addition, the DNA sequences of the LINE elements (LINE-1, LINE-2, LINE-3) were downloaded from Repbase Update [25]. We constructed a LINE component library, using BLASTX, from these 205 downloaded sequences, which included 5'UTR, ORF1, ORF2 and 3'UTR. We used RepeatMasker to search for LINE sequences in the contiguous segments. For each gene entry, the LINE locations on the contig, their orientation and their sequence were stored in the database. The locations of LINEs and exons on each contig were calculated from their positions. We then merged them on the basis of their positions and found 4,489 LINEs fused to mRNAs: 1,392 in the 5'UTR, 2,167 in the 3'UTR and 930 exonized. Finally, we constructed the LINE FUSION GENES database from chimeric transcripts with L1 5'UTR heads and cellular sequence tails (102), transcripts with an L1 3'UTR incorporated in their tails (676), and LINE elements that led to novel splice variants (632). Information about tissue expression and pathogenic LINE fusion transcripts was obtained from gene expression vocabulary (eVOC) annotations of cDNA library sources [26].
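The positional steps described above (identity filtering, the 5 kb flanking window, and merging LINE and exon positions) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Interval` type, the function names and the 1-based inclusive coordinate convention are assumptions made for illustration, and the inputs stand in for parsed SIM4 and RepeatMasker output.

```python
from dataclasses import dataclass

FLANK = 5_000        # 5 kb flank on each side, as stated in the text
MIN_IDENTITY = 97.0  # alignments must exceed this percent identity


@dataclass(frozen=True)
class Interval:
    """A labelled span on a contig; 1-based inclusive coordinates (assumed)."""
    start: int
    end: int
    label: str


def passes_identity(percent_identity: float) -> bool:
    """Keep only alignments with strictly more than 97% identity."""
    return percent_identity > MIN_IDENTITY


def contig_window(tx_start: int, tx_end: int) -> tuple[int, int]:
    """Extend a transcript span by 5 kb on both sides, clamped at position 1."""
    return max(1, tx_start - FLANK), tx_end + FLANK


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two inclusive intervals share at least one base."""
    return a.start <= b.end and b.start <= a.end


def fused_regions(line: Interval, regions: list[Interval]) -> list[str]:
    """Labels of the mRNA regions that a LINE hit overlaps, in input order."""
    return [r.label for r in regions if overlaps(line, r)]


# Hypothetical coordinates, for illustration only.
example_regions = [
    Interval(5001, 5200, "5'UTR"),
    Interval(5201, 5900, "exon"),
    Interval(9000, 9600, "3'UTR"),
]
print(fused_regions(Interval(5150, 5300, "L1"), example_regions))
```

Because the window is symmetric, the same calculation applies to genes on either strand; only the labelling of the upstream and downstream ends would differ.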

Classification of the LINE FUSION GENES

As shown in Figure 1, we classified the LINE FUSION GENES into three types (alternative promoter, alternative polyadenylation signal and exonization) on the basis of the effects of LINE insertion in the genes. These effects depend on the position and sequence of the inserted LINE.
Figure 1

Classification of the LINE fusion types. LINE FUSION GENES were classified into three types. (Type I) Alternative promoter: the promoters of LINEs incorporated near the 5'UTR or into an intron of the gene can act as antisense (ASPs) or sense (SPs) promoters, producing chimeric transcripts different from those of that gene. (Type II) Alternative poly A signal: LINEs with a poly A signal incorporated in the gene can affect the transcription process, resulting in alternative transcripts. (Type III) Exonization: LINEs can be recognized as splicing sites (AG-GT) or intact exons by the spliceosome. LINE elements are indicated by yellow boxes, exons by green boxes and 5' and 3' UTRs by blue boxes.

Type I. Alternative promoter

LINE FUSION GENES of Type I involve insertion near the 5'UTR of the gene or in an intron. LINEs have their own sense and antisense promoters in their 5'UTRs. Consequently, Type I genes might be transcribed from the promoters of the inserted LINE rather than from the cellular promoter. Several cases of Type I LINE FUSION GENES have been reported previously [27].

Type II. Alternative polyadenylation signal

If LINE elements have a polyadenylation signal within the 3' UTR gene flanking region, they could be responsible for a transduction event [8]. Such LINE expression occurs occasionally in human genes; the transcript is stopped by the LINE polyadenylation signal rather than the one endogenous to the gene. When the LINE is incorporated into the intron behind the 3'UTR, transcription is again occasionally stopped by the LINE polyadenylation signal rather than that of the gene. We classified such genes as Type II LINE FUSION GENES. In other words, Type II LINE FUSION GENES are LINE fusion genes with LINE polyadenylation signals on their 3' UTRs.

Type III. Exonization

Generally, the intron sequences are spliced out by the spliceosome, which recognizes the splicing site (AG-GT) between the intron and the exon. Most LINEs inserted into introns are spliced out and do not affect target gene expression. However, recent studies have shown that some LINEs can be recognized as splicing sites (AG-GT) or as intact exons by the spliceosome [28]. Consequently, the LINE sequences are fused to mRNA coding sequences. We classified these genes as Type III LINE FUSION GENES.
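The three types and the AG-GT splice signal check can be sketched in Python as follows. This is an illustrative simplification, not the database's own code: the region-to-type mapping follows the 5'UTR / 3'UTR / exon classification described above (Type I insertions in introns are not modelled), and the function names and the 0-based half-open coordinates are assumptions.

```python
# Mapping from the fused mRNA region to the fusion type (simplified).
FUSION_TYPE = {
    "5'UTR": "Type I (alternative promoter)",
    "3'UTR": "Type II (alternative polyadenylation signal)",
    "exon": "Type III (exonization)",
}


def classify(region: str) -> str:
    """Return the fusion type for a fused region label; KeyError if unknown."""
    return FUSION_TYPE[region]


def has_splice_signals(genomic: str, exon_start: int, exon_end: int) -> bool:
    """Check for AG just upstream and GT just downstream of a candidate exon.

    `genomic` is the sense-strand sequence; the exon occupies the 0-based,
    half-open range [exon_start, exon_end).
    """
    seq = genomic.upper()
    if exon_start < 2 or exon_end + 2 > len(seq):
        return False  # not enough flanking sequence to inspect
    acceptor = seq[exon_start - 2:exon_start]  # last two intron bases (AG)
    donor = seq[exon_end:exon_end + 2]         # first two intron bases (GT)
    return acceptor == "AG" and donor == "GT"


# Hypothetical sequence: intron end "ttttag", exon "ATGGCC", intron start "gtaagt".
demo = "ttttag" + "ATGGCC" + "gtaagt"
print(classify("exon"), has_splice_signals(demo, 6, 12))
```

A real spliceosome relies on more than these dinucleotides (for example branch points and enhancer motifs), so a positive result here only marks a candidate exon.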

Utility and discussion

LINE FUSION GENES uses JSP technology; the data come from a primary database. Users can efficiently retrieve three modes of information concerning LINE expression within genes. First, they can search for LINE expression within a gene by typing a gene ID or by clicking on the gene name, which is listed on the view page according to its chromosomal location. Second, the database provides type information in which LINE expression is classified into three types (alternative promoter, alternative polyadenylation signal and exonization). The type information can help users to speculate more readily about the effects of LINE expression within genes of interest. Third, users can search for genes of interest using accession numbers from the NCBI data bank or the HUGO symbol name provided on the view page, and can also retrieve mRNA sequences from the NCBI data bank for further study. The result pages are presented in a tabular format that provides the evidence for and information about LINE expression within genes. As shown in Figure 2, the LINEs are visualized by colors: red (5' UTR elements), blue (3' UTR elements) and green (ORF1 and ORF2). LINE fusion regions within mRNAs are indicated in red. Moreover, detailed information about the LINE fusion regions is displayed in the table on the result page. Occasionally, LINE incorporation results in domain changes in a protein. To assess these domain changes, users can check the domain description on the page. The domain information comes from RPS-BLAST [29] searches using the genes that contain LINEs as queries.
Figure 2

Part of output from LINE FUSION GENES. LINE FUSION GENES shows evidence of and information about expressed LINE events within genes. Both the LINE fusion regions and transcript information are shown in tabular form and a graphic view represents the family, orientation, structure and length of the LINE. This view provides more information such as the tissue distribution of the genes, merging LINE elements as evidence of their expression, and domain information related to LINE expression.
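The three retrieval modes described in this section (by gene, by fusion type, by accession number) can be pictured with a small in-memory Python sketch. Everything here is hypothetical: the `FusionRecord` fields, the function names and the placeholder records are invented for illustration and do not reflect the actual schema or contents of LINE FUSION GENES, which is served through JSP pages.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FusionRecord:
    """One hypothetical database row describing a LINE fusion gene."""
    gene_id: str
    symbol: str
    accession: str
    chromosome: str
    fusion_type: str


# Placeholder rows; identifiers are made up and carry no biological meaning.
RECORDS = [
    FusionRecord("G0001", "GENE_A", "ACC0001", "1", "alternative promoter"),
    FusionRecord("G0002", "GENE_B", "ACC0002", "7", "exonization"),
    FusionRecord("G0003", "GENE_C", "ACC0003", "7",
                 "alternative polyadenylation signal"),
]


def by_gene(records: list[FusionRecord], key: str) -> list[FusionRecord]:
    """Mode 1: look up by gene ID or gene symbol."""
    return [r for r in records if key in (r.gene_id, r.symbol)]


def by_type(records: list[FusionRecord], fusion_type: str) -> list[FusionRecord]:
    """Mode 2: list all genes of one fusion type."""
    return [r for r in records if r.fusion_type == fusion_type]


def by_accession(records: list[FusionRecord], accession: str) -> Optional[FusionRecord]:
    """Mode 3: find the record for an accession number, or None if absent."""
    return next((r for r in records if r.accession == accession), None)


print([r.symbol for r in by_type(RECORDS, "exonization")])
```

A production system would back these lookups with indexed database queries; the list scans here only make the three access paths explicit.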

Conclusion

From our in silico analysis of the human genome, 1,329 genes were identified as being affected by LINE elements during expression. LINE FUSION GENES is continually supplemented with new human gene data from the available sources. We are planning to update the database with full length human cDNA data obtained from various clinical samples representing human diseases. Through this update, we will be able to profile the patterns of LINE expression in various diseases and to identify LINEs that affect the expression of functional human genes. We will also supplement the database with LINE fusion genes from other mammalian species and compare them with those of humans. We also envision the integration of our HESAS [3] and LINE FUSION GENES databases, intended for release in 2007. We believe that our work will help us to gain insight into the implications of LINE expression for human evolution and disease.

Availability and requirements

LINE FUSION GENES is publicly available at http://www.primate.or.kr/line. Questions and comments are welcomed through the site.

Abbreviations

LINE – Long Interspersed Nuclear Element
HERV – Human Endogenous Retrovirus
SINE – Short Interspersed Nuclear Element
ORF – Open Reading Frame
BLAST – Basic Local Alignment Search Tool
JSP – Java Server Pages
RPS-BLAST – Reverse Position-Specific BLAST
HESAS – HERVs Expression and Structure Analysis System
EST – Expressed Sequence Tag
UTR – Untranslated Region
NCBI – National Center for Biotechnology Information
HUGO – Human Genome Organisation
INSDC – International Nucleotide Sequence Database Collaboration

Authors' contributions

DS Kim analyzed the contents of the paper and wrote the manuscript. HS Kim participated in the analysis and provided essential direction. TH Kim provided biological context and guidance during the initial phase of the bioinformatics analysis. HS Park and IC Kim contributed to manuscript correction and ongoing discussions. SW Kim helped in the general design of the database and the user interface. JW Huh provided biological direction. All authors read and approved the final manuscript.
  28 in total

Review 1.  Mobile elements and the human genome.

Authors:  E T Prak; H H Kazazian
Journal:  Nat Rev Genet       Date:  2000-11       Impact factor: 53.242

2.  Initial sequencing and analysis of the human genome.

Authors:  E S Lander; L M Linton; B Birren; C Nusbaum; M C Zody; J Baldwin; K Devon; K Dewar; M Doyle; W FitzHugh; R Funke; D Gage; K Harris; A Heaford; J Howland; L Kann; J Lehoczky; R LeVine; P McEwan; K McKernan; J Meldrim; J P Mesirov; C Miranda; W Morris; J Naylor; C Raymond; M Rosetti; R Santos; A Sheridan; C Sougnez; Y Stange-Thomann; N Stojanovic; A Subramanian; D Wyman; J Rogers; J Sulston; R Ainscough; S Beck; D Bentley; J Burton; C Clee; N Carter; A Coulson; R Deadman; P Deloukas; A Dunham; I Dunham; R Durbin; L French; D Grafham; S Gregory; T Hubbard; S Humphray; A Hunt; M Jones; C Lloyd; A McMurray; L Matthews; S Mercer; S Milne; J C Mullikin; A Mungall; R Plumb; M Ross; R Shownkeen; S Sims; R H Waterston; R K Wilson; L W Hillier; J D McPherson; M A Marra; E R Mardis; L A Fulton; A T Chinwalla; K H Pepin; W R Gish; S L Chissoe; M C Wendl; K D Delehaunty; T L Miner; A Delehaunty; J B Kramer; L L Cook; R S Fulton; D L Johnson; P J Minx; S W Clifton; T Hawkins; E Branscomb; P Predki; P Richardson; S Wenning; T Slezak; N Doggett; J F Cheng; A Olsen; S Lucas; C Elkin; E Uberbacher; M Frazier; R A Gibbs; D M Muzny; S E Scherer; J B Bouck; E J Sodergren; K C Worley; C M Rives; J H Gorrell; M L Metzker; S L Naylor; R S Kucherlapati; D L Nelson; G M Weinstock; Y Sakaki; A Fujiyama; M Hattori; T Yada; A Toyoda; T Itoh; C Kawagoe; H Watanabe; Y Totoki; T Taylor; J Weissenbach; R Heilig; W Saurin; F Artiguenave; P Brottier; T Bruls; E Pelletier; C Robert; P Wincker; D R Smith; L Doucette-Stamm; M Rubenfield; K Weinstock; H M Lee; J Dubois; A Rosenthal; M Platzer; G Nyakatura; S Taudien; A Rump; H Yang; J Yu; J Wang; G Huang; J Gu; L Hood; L Rowen; A Madan; S Qin; R W Davis; N A Federspiel; A P Abola; M J Proctor; R M Myers; J Schmutz; M Dickson; J Grimwood; D R Cox; M V Olson; R Kaul; C Raymond; N Shimizu; K Kawasaki; S Minoshima; G A Evans; M Athanasiou; R Schultz; B A Roe; F Chen; H Pan; J Ramser; H Lehrach; R Reinhardt; W R McCombie; M de la Bastide; N Dedhia; 
H Blöcker; K Hornischer; G Nordsiek; R Agarwala; L Aravind; J A Bailey; A Bateman; S Batzoglou; E Birney; P Bork; D G Brown; C B Burge; L Cerutti; H C Chen; D Church; M Clamp; R R Copley; T Doerks; S R Eddy; E E Eichler; T S Furey; J Galagan; J G Gilbert; C Harmon; Y Hayashizaki; D Haussler; H Hermjakob; K Hokamp; W Jang; L S Johnson; T A Jones; S Kasif; A Kaspryzk; S Kennedy; W J Kent; P Kitts; E V Koonin; I Korf; D Kulp; D Lancet; T M Lowe; A McLysaght; T Mikkelsen; J V Moran; N Mulder; V J Pollara; C P Ponting; G Schuler; J Schultz; G Slater; A F Smit; E Stupka; J Szustakowki; D Thierry-Mieg; J Thierry-Mieg; L Wagner; J Wallis; R Wheeler; A Williams; Y I Wolf; K H Wolfe; S P Yang; R F Yeh; F Collins; M S Guyer; J Peterson; A Felsenfeld; K A Wetterstrand; A Patrinos; M J Morgan; P de Jong; J J Catanese; K Osoegawa; H Shizuya; S Choi; Y J Chen; J Szustakowki
Journal:  Nature       Date:  2001-02-15       Impact factor: 49.962

3.  Repbase update: a database and an electronic journal of repetitive elements.

Authors:  J Jurka
Journal:  Trends Genet       Date:  2000-09       Impact factor: 11.639

4.  Predictive identification of exonic splicing enhancers in human genes.

Authors:  William G Fairbrother; Ru-Fang Yeh; Phillip A Sharp; Christopher B Burge
Journal:  Science       Date:  2002-07-11       Impact factor: 47.728

5.  Human L1 retrotransposition: cis preference versus trans complementation.

Authors:  W Wei; N Gilbert; S L Ooi; J F Lawler; E M Ostertag; H H Kazazian; J D Boeke; J V Moran
Journal:  Mol Cell Biol       Date:  2001-02       Impact factor: 4.272

6.  Long terminal repeats are used as alternative promoters for the endothelin B receptor and apolipoprotein C-I genes in humans.

Authors:  P Medstrand; J R Landry; D L Mager
Journal:  J Biol Chem       Date:  2000-10-27       Impact factor: 5.157

7.  Many human genes are transcribed from the antisense promoter of L1 retrotransposon.

Authors:  Pilvi Nigumann; Kaja Redik; Kert Mätlik; Mart Speek
Journal:  Genomics       Date:  2002-05       Impact factor: 5.736

8.  A new exon created by intronic insertion of a rearranged LINE-1 element as the cause of chronic granulomatous disease.

Authors:  C Meischl; M Boer; A Ahlin; D Roos
Journal:  Eur J Hum Genet       Date:  2000-09       Impact factor: 4.246

Review 9.  Interspersed repeats and other mementos of transposable elements in mammalian genomes.

Authors:  A F Smit
Journal:  Curr Opin Genet Dev       Date:  1999-12       Impact factor: 5.578

10.  LINE-1 RNA splicing and influences on mammalian gene expression.

Authors:  Victoria P Belancio; Dale J Hedges; Prescott Deininger
Journal:  Nucleic Acids Res       Date:  2006-03-22       Impact factor: 16.971
