
Dragon PolyA Spotter: predictor of poly(A) motifs within human genomic DNA sequences.

Manal Kalkatawi, Farania Rangkuti, Michael Schramm, Boris R Jankovic, Allan Kamau, Rajesh Chowdhary, John A C Archer, Vladimir B Bajic

Abstract

MOTIVATION: Recognition of poly(A) signals in mRNA is relatively straightforward owing to the presence of an easily recognizable polyadenylic acid tail. However, identifying the poly(A) motifs in the primary genomic DNA sequence that correspond to poly(A) signals in mRNA is a far more challenging problem. Recognition of poly(A) signals is important for better gene annotation and for understanding gene regulation mechanisms. In this work, we present a poly(A) motif prediction method based on properties of the human genomic DNA sequence surrounding a poly(A) motif. These properties include thermodynamic, physico-chemical and statistical characteristics. For prediction, we developed Artificial Neural Network and Random Forest models, trained to recognize the 12 most common poly(A) motifs in human DNA. Our predictors are available as a free web-based tool at http://cbrc.kaust.edu.sa/dps. Compared with other reported predictors, our models achieve higher sensitivity and specificity and, furthermore, provide a consistent level of accuracy across the 12 poly(A) motif variants.

CONTACT: vladimir.bajic@kaust.edu.sa

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Substances: Poly A

Year: 2011 PMID: 22088842 PMCID: PMC3244764 DOI: 10.1093/bioinformatics/btr602

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The polyadenylic acid tail, or poly(A) tail, is a stretch of A nucleotides added to RNA during RNA processing, mainly to protect the stability of the primary RNA (Bernstein and Ross, 1989). In mammals, the poly(A) tail is added close to, and downstream of, the characteristic poly(A) signal, most often AAUAAA. The problem of predicting poly(A) signals has received considerable attention. Since the poly(A) signal lies approximately 10–30 nt from the poly(A) tail (Beaudoing et al.), recognizing the signal in mRNA is relatively simple. A more challenging problem is to find the motif in the primary genomic sequence that corresponds to the poly(A) signal site in the transcribed RNA. Predicting poly(A) motifs in DNA depends on successfully identifying relevant properties of the sequence surrounding such motifs.

We now briefly survey the work reported in this field so far. Statistical properties of nucleotide sequences were used, for example, to reveal putative poly(A) signals in yeast (Van Helden et al.) and in Arabidopsis (Ji et al., 2010). The program PROBE was developed to identify cis elements that potentially play regulatory roles in mRNA polyadenylation (Hu et al., 2005). Several tools have been developed for predicting poly(A) motifs in human sequences. The Polyadq tool (Tabaska and Zhang, 1999) derives its feature set from the 100 nt downstream of a candidate poly(A) motif. The ERPIN program (Legendre and Gautheret, 2003) uses 300 nt of flanking sequence upstream and downstream of the candidate poly(A) motifs. A method based on support vector machines (SVMs) was reported by Liu et al. (2003); it uses 100 nt of flanking sequence upstream and downstream of candidate poly(A) motifs. The PolyApred system, introduced by Ahmed et al. (2009), likewise uses 100 nt of flanking sequence upstream and downstream of the candidate poly(A) motif. POLYAR, a method for recognizing polyadenylation sites, was reported recently (Akhtar et al., 2010). The reported results of these tools are summarized in Table 1, together with the performance that the publicly available ones achieved on our datasets.

In this study, we present a web-based tool that implements two types of predictive models, one based on Artificial Neural Networks (ANNs) and the other on Random Forest (RF) (Breiman, 2001). Our models cover the 12 main variants of human poly(A) motifs, with accuracies from 82.06% to 94.4%.

Table 1.

Accuracy of various poly(A) prediction tools

Tool	Results reported by authors	Results on our AATAAA dataset
Polyadq	MCC = 0.41–0.51	Se = 28.23%; Sp = 83.88%; Acc = 56.05%
Polya_SVM (Cheng et al., 2006)	Se = 37.2–71.0%; Sp = 74.6–96.7%	Se = 58.30%; Sp = 64.42%; Acc = 61.36%
Polyar	Se = 23.9–94.9%; Sp = 14.7–66.4%	Se = 57.28%; Sp = 49.69%; Acc = 53.48%
Our Model (ANN)	Table 2	Se = 80.55%; Sp = 83.57%; Acc = 82.06%
Our Model (RF)	Table 2	Se = 86.10%; Sp = 91.60%; Acc = 88.90%
Polyah (Salamov and Solovyev, 1997)	MCC = 0.62	-
ERPIN	Se = 56%; Sp = 69–85%	-
Polyapred	Se = 57.0%; Sp = 75.8–95.7%	-
Poly(A) Signal Miner (Liu et al., 2003)	Se = 56.0–89.3%; Sp = 67.5–93.3%	-

2 METHODS

2.1 Datasets

We used human mRNA sequences and mapped the 100 nt at their 3′ ends back to the human genome using stringent BLASTN matching criteria. Negative records were selected from human chromosome 21. Among the candidate sequences, we selected those in which the poly(A) motif occurs at locations conforming to the distributions reported by Beaudoing et al. We flanked each such 6-nt poly(A) motif with 100 nt upstream and 100 nt downstream, resulting in training sequences 206 nt in length. Overall, 14 799 sequences for the 12 motif variants are available at http://cbrc.kaust.edu.sa/dps/code/DataToBuildModel.tar.gz. More details are given in Supplementary Material 1.
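The window construction described above (100 nt + 6-nt motif + 100 nt = 206 nt) can be sketched as follows. This is only an illustration under stated assumptions: the function name `extract_windows` is hypothetical, the motif list is taken from Table 2, and the sketch does not reproduce the BLASTN mapping or the positional filtering used to build the actual datasets.

```python
# The 12 motif variants listed in Table 2.
POLYA_MOTIFS = (
    "AAAAAG", "AAGAAA", "AATAAA", "AATACA", "AATAGA", "AATATA",
    "ACTAAA", "AGTAAA", "ATTAAA", "CATAAA", "GATAAA", "TATAAA",
)
FLANK = 100     # nt kept on each side of the motif
MOTIF_LEN = 6   # all 12 variants are hexamers


def extract_windows(sequence, motifs=POLYA_MOTIFS, flank=FLANK):
    """Yield (start, motif, window) for every motif hit that has full flanks.

    `extract_windows` is a hypothetical helper, not part of the published tool.
    """
    seq = sequence.upper()
    # Only positions with `flank` nt available on both sides are considered.
    for start in range(flank, len(seq) - MOTIF_LEN - flank + 1):
        hexamer = seq[start:start + MOTIF_LEN]
        if hexamer in motifs:
            yield start, hexamer, seq[start - flank:start + MOTIF_LEN + flank]


# Toy sequence: one AATAAA with exactly 100 nt on each side.
demo = "C" * 100 + "AATAAA" + "G" * 100
hits = list(extract_windows(demo))
print(len(hits), len(hits[0][2]))  # 1 206
```

Candidate motifs closer than 100 nt to either end of the input are skipped in this sketch, because a full 206-nt window cannot be formed for them.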

2.2 Features and feature selection

Our models use features derived from thermodynamic, compositional, statistical and other properties of nucleotides and polynucleotide sequences. The thermodynamic and structural properties of dinucleotides that we used were selected from Friedel et al. We also used the electron–ion interaction potential (EIIP) of nucleotides (Veljkovic and Slavic, 1972). Finally, our models use scores from position weight matrices (PWMs) computed in the regions upstream and downstream of the poly(A) motifs. This process resulted in 274 features (Supplementary Material 2).
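To illustrate two of the feature families named above, the following minimal sketch computes a mean EIIP value for a region and the best log-odds PWM score within a region. It is not the authors' feature extraction: the EIIP values are the commonly quoted per-nucleotide values, and the averaging, the pseudocount, the uniform background and the helper names (`mean_eiip`, `build_pwm`, `best_pwm_score`) are all assumptions made for illustration.

```python
import math

# Commonly quoted EIIP values per nucleotide (assumed here, not taken from the paper).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def mean_eiip(region):
    """Average EIIP over the unambiguous bases of a region (hypothetical feature)."""
    values = [EIIP[base] for base in region.upper() if base in EIIP]
    return sum(values) / len(values) if values else 0.0


def build_pwm(aligned, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sequences."""
    pwm = []
    for i in range(len(aligned[0])):
        column = [s[i] for s in aligned]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def best_pwm_score(region, pwm):
    """Return the highest PWM score over all windows in the region."""
    width = len(pwm)
    scores = [
        # Ambiguous bases such as N contribute a neutral score of 0.
        sum(pwm[i].get(region[j + i], 0.0) for i in range(width))
        for j in range(len(region) - width + 1)
    ]
    return max(scores) if scores else float("-inf")


# Toy usage: a PWM built from two copies of a made-up 3-nt element.
toy_pwm = build_pwm(["TGT", "TGT"])
print(mean_eiip("ACGT"), best_pwm_score("AATGTCC", toy_pwm))
```

In a setting like the one described, such values would be computed separately for the upstream and downstream flanks and concatenated with the other features.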

2.3 The tool

For details of the models, see Supplementary Material 2. Our tool contains two types of poly(A) motif predictors, ANN-based and RF-based. Each ANN model consists of an input layer, a hidden layer and an output layer. The output layer contains two neurons that predict whether the input pattern corresponds to a real or a false poly(A) motif (the neuron with the stronger output wins). To mitigate overfitting, we used an early stopping method (Zang and Yu, 2005). The RF model is based on the WEKA implementation (Hall et al.).
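The two mechanisms mentioned above, the winner-takes-all decision between the two output neurons and early stopping, can be sketched as follows. The patience-based stopping rule shown here is a generic criterion chosen for illustration and is not necessarily the one in the cited method; the function names and the `patience` parameter are hypothetical.

```python
def predict_label(real_output, false_output):
    """Return True when the 'real motif' neuron has the stronger activation."""
    return real_output > false_output


def early_stopping_epoch(validation_errors, patience=5):
    """Return the index of the epoch whose weights would be kept.

    Training is considered stopped once the validation error has not
    improved for `patience` consecutive epochs; the best epoch seen up
    to that point is returned.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch


# Made-up validation errors: the minimum within the patience window is at epoch 2.
errors = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
print(early_stopping_epoch(errors, patience=3))  # 2
print(predict_label(0.81, 0.23))                 # True
```

With a small `patience`, training stops before the later improvement at epoch 6 is seen; a larger value trades training time for a chance of finding it.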

3 RESULTS

To measure performance we used sensitivity Se = TP/(TP + FN), specificity Sp = TN/(TN + FP) and accuracy Acc = (TP + TN)/(TP + TN + FP + FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. We compared our results with those of publicly available tools when applied to our datasets (Table 1). We tested on the only motif common to all tools (AATAAA). In Table 2, we report the performance of our ANN and RF models on the 12 poly(A) motifs. The ANN model was trained on 50% of the data and tested on the remaining 50% (training takes a long time, so cross-validation was not applied). For the RF model, we achieved the best results using 100 trees, without restricting the maximal depth, and nine random features per node. Model performance in 100-fold cross-validation is shown.
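The three measures defined above translate directly into code. The sketch below uses made-up confusion-matrix counts; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (Se, Sp, Acc) as fractions from confusion-matrix counts."""
    se = tp / (tp + fn)                    # sensitivity: recovered true motifs
    sp = tn / (tn + fp)                    # specificity: rejected false motifs
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall fraction classified correctly
    return se, sp, acc


# Made-up counts for a test set with 100 positives and 100 negatives.
se, sp, acc = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(f"Se = {se:.2%}, Sp = {sp:.2%}, Acc = {acc:.2%}")
```

When positives and negatives are equally numerous, Acc reduces to the mean of Se and Sp. The Acc values in Table 1 equal that mean to within rounding (for example, Se = 80.55% and Sp = 83.57% give 82.06%), which would be consistent with balanced test sets.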

Table 2.

Performance of ANN and RF methods for 12 poly(A) motifs

Variant	ANN Se (%)	ANN Sp (%)	ANN Acc (%)	RF Se (%)	RF Sp (%)	RF Acc (%)
AAAAAG	94.57	85.44	90.01	93.2	95.6	94.4
AAGAAA	86.04	84.74	85.39	88.7	94.1	91.4
AATAAA	80.55	83.57	82.06	86.1	91.6	88.9
AATACA	91.71	88.39	90.05	87.3	92.5	89.9
AATAGA	95.18	93.37	94.27	86.7	91.3	89.0
AATATA	91.32	89.28	90.30	87.2	93.6	90.4
ACTAAA	89.85	89.49	89.67	85.0	91.1	88.1
AGTAAA	89.94	85.63	87.78	83.1	94.5	88.8
ATTAAA	83.71	83.96	83.84	85.2	92.6	88.9
CATAAA	91.56	91.98	91.77	83.5	92.4	88.0
GATAAA	88.75	91.66	90.20	87.9	92.5	90.2
TATAAA	92.30	86.20	89.25	86.1	94.2	90.1

4 CONCLUSION

We developed a web tool for the recognition of poly(A) motifs in human genomic DNA that demonstrates improved prediction accuracy over the existing publicly available poly(A) predictors. We hope that our tool will find good use in studies of human gene properties.

Funding: This work is supported by the Base Research Funds of VBB at King Abdullah University of Science and Technology. The open access charges for this article are covered from the same fund.

Conflict of Interest: none declared.

REFERENCES (13 in total)

1. Detection of polyadenylation signals in human DNA sequences.

Authors: J E Tabaska; M Q Zhang
Journal: Gene Date: 1999-04-29 Impact factor: 3.688

2. Bioinformatic identification of candidate cis-regulatory elements involved in human mRNA polyadenylation.

Authors: Jun Hu; Carol S Lutz; Jeffrey Wilusz; Bin Tian
Journal: RNA Date: 2005-08-30 Impact factor: 4.942

3. Prediction of mRNA polyadenylation sites by support vector machine.

Authors: Yiming Cheng; Robert M Miura; Bin Tian
Journal: Bioinformatics Date: 2006-07-26 Impact factor: 6.937

4. Recognition of 3'-processing sites of human mRNA precursors.

Authors: A A Salamov; V V Solovyev
Journal: Comput Appl Biosci Date: 1997-02

5. A classification-based prediction model of messenger RNA polyadenylation sites.

Authors: Guoli Ji; Xiaohui Wu; Yingjia Shen; Jiangyin Huang; Qingshun Quinn Li
Journal: J Theor Biol Date: 2010-05-26 Impact factor: 2.691

6. An in-silico method for prediction of polyadenylation signals in human sequences.

Authors: Huiqing Liu; Hao Han; Jinyan Li; Limsoon Wong
Journal: Genome Inform Date: 2003

7. DiProDB: a database for dinucleotide properties.

Authors: Maik Friedel; Swetlana Nikolajewa; Jürgen Sühnel; Thomas Wilhelm
Journal: Nucleic Acids Res Date: 2008-09-19 Impact factor: 16.971

8. Prediction of polyadenylation signals in human DNA sequences using nucleotide frequencies.

Authors: Firoz Ahmed; Manish Kumar; Gajendra P S Raghava
Journal: In Silico Biol Date: 2009

9. Poly(A), poly(A) binding protein and the regulation of mRNA stability.

Authors: P Bernstein; J Ross
Journal: Trends Biochem Sci Date: 1989-09 Impact factor: 13.807

10. POLYAR, a new computer program for prediction of poly(A) sites in human sequences.

Authors: Malik Nadeem Akhtar; Syed Abbas Bukhari; Zeeshan Fazal; Raheel Qamar; Ilham A Shahmuradov
Journal: BMC Genomics Date: 2010-11-19 Impact factor: 3.969
