Charles Chilaka, Steven Carr, Nabil Shalaby, Wolfgang Banzhaf.
Abstract
A microarray DNA sequencing experiment for a molecule of N bases produces a 4 × N data matrix. For each of the N positions, a quartet holds the signal strengths of binding of the experimental DNA to a reference oligonucleotide affixed to the microarray, one for each of the four possible bases (A, C, G, or T). The strongest signal in each quartet should result from a perfect complementary match between the experimental and reference DNA sequences, and should therefore indicate the correct base call at that position. The linear series of calls should constitute the DNA sequence. Variation in the absolute and relative signal strengths over the N quartets, due to variable base composition and other factors, can interfere with the accuracy and/or confidence of base calls in ways that are not fully understood. We used a feed-forward back-propagation neural network model to predict normalized signal intensities of a microarray-derived DNA sequence of N = 15,453 bases. The DNA sequence was encoded as n-gram neural input vectors, with n = 1, n = 2, and their composite. The data were divided into training, validation, and testing sets. Regression values were >99% overall, and improved with an increased number of neurons in the hidden layer and with the composite n-grams. We also observed a very low mean square error overall, which translates to a high performance value.
Keywords: Neural networks; Performance; Regression values; n-grams
Year: 2017 PMID: 29081611 PMCID: PMC5651225 DOI: 10.6026/97320630013313
Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063
The nucleotide percentages (ratios)
| Nucleotides | A | C | G | T |
| Ratios | 0.31 | 0.31 | 0.13 | 0.25 |
The dinucleotide percentages (ratios)
| Dinucleotides | Ratios |
| AA | 0.1 |
| AC | 0.09 |
| AG | 0.05 |
| AT | 0.07 |
| CA | 0.09 |
| CC | 0.11 |
| CG | 0.03 |
| CT | 0.09 |
| GA | 0.04 |
| GC | 0.04 |
| GG | 0.03 |
| GT | 0.03 |
| TA | 0.08 |
| TC | 0.07 |
| TG | 0.03 |
| TT | 0.06 |
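Ratios like those in the two tables above can be computed by counting n-grams and dividing by the total. The following is a minimal sketch, not the authors' code: the function name `ngram_ratios`, the toy sequence, and the use of overlapping (sliding-window) dinucleotides are all assumptions, since the record does not state how the dinucleotides were counted.

```python
from collections import Counter
from itertools import product


def ngram_ratios(seq, n):
    """Return the relative frequency of every possible n-gram over A, C, G, T."""
    seq = seq.upper()
    # Assumption: n-grams overlap (sliding window with step 1).
    grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    if not grams:
        raise ValueError("sequence is shorter than n")
    counts = Counter(grams)
    total = len(grams)
    # Include n-grams that never occur so the result always has 4**n entries.
    return {
        "".join(p): counts["".join(p)] / total
        for p in product("ACGT", repeat=n)
    }


toy = "ACGTACCA"  # hypothetical toy sequence, not the study data
mono = ngram_ratios(toy, 1)  # 4 nucleotide ratios
di = ngram_ratios(toy, 2)    # 16 dinucleotide ratios
```

With a sliding window, a sequence of N bases yields N - 1 dinucleotides, so the 16 ratios sum to 1 just as the four nucleotide ratios do.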
Figure 1. A neural network system for signal intensity prediction. The DNA sequences are first converted into n-gram profiles as input vectors. The neural network then predicts the normalized signal intensities after network training.
Figure 2. Algorithmic flowchart for computing n-gram profiles and normalizing the DNA sequence data.
Figure 3. A 1-2-gram composition performance plot with 40 neurons in the hidden layer, showing the training, validation, and testing data sets in terms of mean square error.
Figure 4. A 1-2-gram composition regression plot with 20 neurons in the hidden layer, showing training, validation, testing, and overall regression values.
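The captions describe converting the DNA sequence into n-gram input vectors and normalizing the signal intensities, but the record does not give the exact encoding. The sketch below shows one plausible reading under stated assumptions: each position is one-hot encoded by its n-gram, the composite is the concatenation of the 1-gram and 2-gram vectors, and intensities are min-max scaled to [0, 1]. All function names here are hypothetical.

```python
from itertools import product

import numpy as np

BASES = "ACGT"


def ngram_index(n):
    """Map each possible n-gram to a column index (4**n columns)."""
    return {"".join(p): i for i, p in enumerate(product(BASES, repeat=n))}


def encode(seq, n):
    """One-hot encode the n-gram starting at each position (assumed scheme)."""
    index = ngram_index(n)
    rows = len(seq) - n + 1
    x = np.zeros((rows, len(index)))
    for i in range(rows):
        x[i, index[seq[i:i + n]]] = 1.0
    return x


def encode_composite(seq):
    """Concatenate 1-gram and 2-gram vectors: 4 + 16 = 20 inputs per row."""
    x1 = encode(seq, 1)[:-1]  # drop the last base so rows align with 2-grams
    x2 = encode(seq, 2)
    return np.hstack([x1, x2])


def min_max_normalize(signal):
    """Scale signal intensities to [0, 1] (assumed normalization)."""
    signal = np.asarray(signal, dtype=float)
    lo, hi = signal.min(), signal.max()
    return (signal - lo) / (hi - lo)
```

For example, `encode_composite("ACGT")` gives a 3 × 20 matrix, and `min_max_normalize([2, 4, 6])` gives `[0.0, 0.5, 1.0]`.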
Best performance and regression values for 1-gram with varying number of neurons in the hidden layer
| No. of neurons | Best perf. value (MSE) | Training | Validation | Testing |
| 20 | 0.003209 | 0.99202 | 0.99054 | 0.97872 |
| 25 | 0.003173 | 0.99788 | 0.99055 | 0.9808 |
| 30 | 0.003137 | 0.99211 | 0.9907 | 0.97821 |
| 40 | 0.003195 | 0.99205 | 0.99049 | 0.97869 |
| Averages | 0.003178 | 0.99194 | 0.99057 | 0.97911 |
Best performance and regression values for 2-gram with varying number of neurons in the hidden layer
| No. of neurons | Best perf. value (MSE) | Training | Validation | Testing |
| 20 | 0.027319 | 0.94319 | 0.91605 | 0.88381 |
| 25 | 0.025913 | 0.9383 | 0.93036 | 0.90905 |
| 30 | 0.02278 | 0.9357 | 0.9302 | 0.90725 |
| 40 | 0.022379 | 0.93779 | 0.93157 | 0.89763 |
| Averages | 0.024598 | 0.93875 | 0.92705 | 0.89944 |
Best performance and regression values for 1-2-gram composition with varying number of neurons in the hidden layer
| No. of neurons | Best perf. value (MSE) | Training | Validation | Testing |
| 20 | 0.002849 | 0.99378 | 0.99148 | 0.9801 |
| 25 | 0.002666 | 0.99395 | 0.99212 | 0.98136 |
| 30 | 0.00313 | 0.9942 | 0.99078 | 0.98123 |
| 40 | 0.002525 | 0.99388 | 0.98245 | 0.98128 |
| Averages | 0.002793 | 0.99395 | 0.99171 | 0.98099 |
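The two kinds of quantities reported in these tables can be illustrated with a short sketch. It assumes that the performance value is the mean square error between targets and network outputs, and that the regression value is the Pearson correlation coefficient R between them; the 70/15/15 split is likewise an assumption, as the record says only that the data were divided into training, validation, and testing sets. The function names are hypothetical.

```python
import numpy as np


def split_indices(n_samples, train=0.70, val=0.15, seed=0):
    """Randomly partition sample indices (the 70/15/15 split is an assumption)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(train * n_samples)
    n_val = int(val * n_samples)
    return (
        order[:n_train],
        order[n_train:n_train + n_val],
        order[n_train + n_val:],  # remainder goes to the testing set
    )


def mse(targets, outputs):
    """Mean square error between targets and network outputs."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return float(np.mean((targets - outputs) ** 2))


def regression_r(targets, outputs):
    """Pearson correlation R between targets and outputs (assumed definition)."""
    return float(np.corrcoef(np.asarray(targets), np.asarray(outputs))[0, 1])


# Partition the N = 15,453 positions reported in the abstract.
train_idx, val_idx, test_idx = split_indices(15453)
```

Computing `mse` and `regression_r` separately on each of the three index sets would give one performance value and one regression value per set, which is the layout of the tables above.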