Gökhan Karakülah, Nazmiye Arslan, Cihangir Yandım, Aslı Suner.
Abstract
INTRODUCTION: Recent studies highlight the crucial regulatory roles of transposable elements (TEs) in proximal gene expression in distinct biological contexts such as disease and development. However, computational tools that extract potential associations between TEs and proximal gene expression from RNA-sequencing data are still missing.
IMPLEMENTATION: Herein, we developed a novel R package that uses a linear regression model to study the potential influence of TE species on proximal gene expression in a given RNA-sequencing data set. Our R package, TEffectR, makes use of publicly available RepeatMasker TE and Ensembl gene annotations as well as several functions from other R packages. It calculates total read counts of TEs from sorted and indexed genome-aligned BAM files provided by the user, and determines statistically significant relations between TE expression and the transcription of nearby genes under diverse biological conditions.
AVAILABILITY: TEffectR is freely available at https://github.com/karakulahg/TEffectR, along with a tutorial exemplified by the analysis of RNA-sequencing data from normal and tumour tissue specimens obtained from breast cancer patients.
©2019 Karakülah et al.
Keywords: Gene expression; Linear model; R package; Regression; Gene regulation; Transposable elements
Year: 2019; PMID: 31824778; PMCID: PMC6899341; DOI: 10.7717/peerj.8192
Source DB: PubMed; Journal: PeerJ; ISSN: 2167-8359; Impact factor: 2.984
Figure 1. The workflow of the TEffectR package.
The package contains six core functions for identifying potential links between TEs and nearby genes at a genome-wide scale. TEffectR requires two inputs provided by the user: (i) a raw gene count matrix and (ii) genomic alignments of sequencing reads in BAM file format.
Table 1. Examples of significant associations of LINE, SINE, LTR and DNA transposons with genes previously linked to breast cancer, as reported by TEffectR along with multiple covariates.
Expression levels of TEs within the upstream 5-kb regions of the given genes and other covariates, such as tissue type (healthy vs. tumor) or patient number, were included in the linear regression model. The model p-value indicates the overall significance of the linear model. The p-value of each covariate indicates whether that factor is significantly associated with the expression of the given gene. The adjusted R-squared indicates how well the model, with its significant covariates, explains the expression of the gene. For example, an adjusted R-squared of 0.8422 indicates that the linear model explains 84.22% of the variance in the gene’s expression.
| Gene | Associated TE(s) | R-squared | Adjusted R-squared | Model p-value | Significant covariate p-value |
| --- | --- | --- | --- | --- | --- |
| KRT8 (CK8) | L2c (LINE) | 0.8532 | 0.8422 | 1.026E–16 | L2c: 1.356E–13 |
| SLC39A6 (LIV-1) | L2b | 0.7231 | 0.7023 | 3.100E–11 | L2b: 7.536E–08 |
| SAFB | L1MB7 (LINE) | 0.5131 | 0.4766 | 2.107E–06 | L1MB7: 1.114E–07 |
| CHEK2 | AluJb, AluSx AluS | 0.6362 | 0.5883 | 1.645E–07 | AluJb: 0.0433 |
| FEN1 | MIR3 | 0.5545 | 0.5211 | 3.703E–07 | MIR3: 2.572E–06 |
| CENPL | AluSx3, AluY | 0.5489 | 0.5027 | 2.118E–06 | AluSx3: 0.0007 |
| MCM4 | MLT1D | 0.5733 | 0.5413 | 1.587E–07 | MLT1D: 0.0012 |
| RMND1 | LTR5_Hs | 0.4318 | 0.3892 | 4.279E–05 | LTR5_Hs: 1.782E–05 |
| CPNE3 | MLT1H2 | 0.3910 | 0.3453 | 0.0002 | MLT1H2: 0.0002 |
| HLA-DPB1 | hAT-1_Mam | 0.8318 | 0.8192 | 1.548E–15 | hAT-1_Mam: 1.092E–14 |
| HSPB2 (HSP27) | MER5B | 0.7756 | 0.7587 | 4.791E–13 | MER5B: 8.050E–07 |
| PARP9 | MER5B | 0.5929 | 0.5624 | 6.287E–08 | MER5B: 1.141E–06 |
Notes.
Heo et al. (2013).
Kasper et al. (2005).
Hammerich-Hille et al. (2010).
Nagel et al. (2012).
Li et al. (2014); Li, Liang & Zhang (2014).
Chung et al. (2015).
Abdel-Fatah et al. (2014).
Tishchenko et al. (2016).
Kwok et al. (2015).
Dunning et al. (2016).
Heinrich et al. (2010).
Forero et al. (2016).
Wei et al. (2011).
Storm et al. (1995).
Tang et al. (2018).
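The relationship between the R-squared and adjusted R-squared values reported in the table can be checked with the standard adjustment formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of samples and p the number of predictors. The following is a minimal sketch, not the package's own code: the function name is hypothetical, and the values n = 44 and p = 3 are assumptions that are not stated in this record; they were chosen only because they are consistent with the KRT8 row.

```python
def adjusted_r_squared(r_squared: float, n_samples: int, n_predictors: int) -> float:
    """Standard adjusted R-squared for a linear model with an intercept.

    Hypothetical helper for illustration; not part of TEffectR.
    """
    residual_dof = n_samples - n_predictors - 1
    if residual_dof <= 0:
        raise ValueError("n_samples must exceed n_predictors + 1")
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / residual_dof


# R-squared for the KRT8 row is taken from the table.
# ASSUMPTION: 44 samples and 3 predictors (for example one TE, tissue type
# and patient); neither number is given in this record.
krt8_adjusted = adjusted_r_squared(0.8532, n_samples=44, n_predictors=3)
print(round(krt8_adjusted, 4))
```

Under these assumed values the result rounds to the adjusted R-squared listed for KRT8. Models with more predictors are penalised more strongly, which is why rows with several associated TEs can show a larger gap between the two columns.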
Figure 2. Scatter plots showing the correlations between the normalized read counts of the genes given in Table 1 and the normalized read counts of TEs present in their upstream 5-kb regions.
(CPM: counts per million).
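As a brief illustration of the CPM unit used in the figure, the sketch below scales raw read counts by the library size. It is a plain counts-per-million calculation under stated assumptions: the feature names and counts are hypothetical, and any additional normalization applied in the actual analysis is not described in this record and is not reproduced here.

```python
def counts_per_million(counts: dict[str, int]) -> dict[str, float]:
    """Scale raw read counts so that the library sums to one million.

    Hypothetical helper for illustration; not part of TEffectR.
    """
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("library has no reads")
    return {name: count * 1_000_000 / library_size for name, count in counts.items()}


# ASSUMPTION: made-up counts for one sample; names are placeholders.
sample_counts = {"GENE_A": 150, "TE_X": 50, "OTHER": 999_800}
cpm = counts_per_million(sample_counts)
print(cpm["GENE_A"], cpm["TE_X"])
```

Because the example library contains exactly one million reads, each CPM value equals its raw count; with a different library size the values would be rescaled proportionally.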