Biologically inspired intelligent decision making: a commentary on the use of artificial neural networks in bioinformatics.

Timmy Manning¹, Roy D Sleator², Paul Walsh³.

Abstract

Artificial neural networks (ANNs) are a class of powerful machine learning models for classification and function approximation which have analogs in nature. An ANN learns to map stimuli to responses through repeated evaluation of exemplars of the mapping. This learning approach results in networks which are recognized for their noise tolerance and ability to generalize meaningful responses for novel stimuli. It is these properties of ANNs which make them appealing for applications to bioinformatics problems where interpretation of data may not always be obvious, and where the domain knowledge required for deductive techniques is incomplete or can cause a combinatorial explosion of rules. In this paper, we provide an introduction to artificial neural network theory and review some interesting recent applications to bioinformatics problems.

Keywords: artificial neural networks; bioinformatics; gene identification; gene-gene interaction; genome wide association study; multilayer perceptron; protein structure prediction

Year: 2013 PMID: 24335433 PMCID: PMC4049912 DOI: 10.4161/bioe.26997

Source DB: PubMed Journal: Bioengineered ISSN: 2165-5979 Impact factor: 3.269

Introduction

Artificial neural networks (ANNs) are statistical machine learning models which emulate the processing technique of biological neurons to perform function approximation and pattern recognition from a set of exemplars, in a way which can generalize the learned mapping to new data. ANNs are finding use across a number of domains for classification and function approximation. Figure 1, which plots the number of bioinformatics papers in PubMed that reference neural networks over the period 1994 to 2009, suggests an increasing trend in the application of ANNs to bioinformatics problems.

Figure 1. The number of bioinformatics papers in PubMed that reference neural networks, grouped by year.

This paper begins with an introduction to neural networks, providing a description of what they are and how they are used, as well as a high level description of how they work. The advantages and disadvantages of ANNs are discussed, and information considered pertinent to their practical application is presented by way of a number of examples. Further analysis of the literature available on PubMed, as shown in Figure 2, indicates that the 3 main bioinformatics topics which reference neural networks are:

Figure 2. Breakdown of bioinformatics topics identified across a number of analyzed papers available on PubMed which reference neural networks.

Gene identification/prediction (554 papers)
Protein secondary structure prediction (486 papers)
Gene interaction (528 papers)

Accordingly, this review will examine recent and interesting applications of neural networks to these three problem areas. Chen and Kurgan provide a review of ANNs which focuses on protein bioinformatics. The application of ANNs to the analysis of (typically noisy) microarrays and mass spectra is reviewed by Lancashire, Lemetre, and Ball. A general discussion of the application of ANNs to the topics of quantitative structure-activity relationship (QSAR), gene expression data analysis, protein structure data analysis, biomarker identification and sequence data analysis is provided in review format by Yang.

Artificial Neural Networks

Broadly speaking, the function of a neural network is to enact a meaningful mapping function from a stimulus (the inputs to the neural network) to an expected response (the output of the neural network). For example, a neural network can map a nucleotide input sequence to a scalar output that differentiates between coding and non-coding regions, or can predict how a peptide chain will twist and turn based on the physical properties of the amino acids and their interactions. In principle, a neural network with a sufficiently large number of processing elements can approximate any continuous functional mapping with an arbitrary level of accuracy. The ANN is a machine learning approach that has the ability to autonomously identify and model complex nonlinear patterns and relations from a data set, without the need for the context of the data, explicit domain knowledge or operator interaction. The networks learn to carry out their desired mapping using exemplars of the inputs and expected outputs in a process referred to as “training.” The difference between the expected and actual output of the network for a given stimulus is referred to as the network error, and is calculated using an error function. In training, the parameters of the network are slowly adjusted such that the error of the network is iteratively reduced over the set of exemplars. A properly trained ANN will be robust and have the ability to generalize accurate output vectors given novel input vectors. However, neural networks operate as a black box; how or why an output is achieved will not be directly interpretable.

Architecture

Artificial neural network is an umbrella term covering a wide variety of graph-based machine learning approaches. Here, we limit discussion to the common multilayer perceptron (MLP) architecture. The perceptron is a simple feedforward linear classifier analogous to a single biological neuron. A perceptron accepts a number of input signals and fires (produces an output) if the combined input signal is above a threshold. The MLP architecture combines layers of perceptron-like processing elements (the neurons) connected by weighted connections (the synapses) to produce a network capable of dealing with complex nonlinearly separable mappings. The distributed nature of the processing which takes place in a neural network contributes to the robustness of the system. In the typical MLP architecture, the neurons are grouped into layers with full synaptic connections only between each successive layer, as shown in Figure 3. The first layer, referred to as the input layer, accepts the stimulus. The signal is propagated along the synapses through the hidden layer(s) to the output layer, where the response of the network is presented. An MLP can contain zero or more hidden layers. The neurons of the hidden layer have no direct connection with the outside world, only with the preceding and succeeding layers. Each synapse has an associated weight, corresponding to synaptic strength in biological systems, which it uses to scale the signals passed through it. The neural networks are trained by adjusting these synaptic weights to perform different mappings.
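The threshold behavior of a single perceptron can be illustrated with a minimal Python sketch. The weights and threshold below are hypothetical values chosen so that the unit behaves like a logical AND; they are not taken from the paper.

```python
def perceptron(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of the inputs exceeds the threshold."""
    combined = sum(x * w for x, w in zip(inputs, weights))
    return 1 if combined > threshold else 0


# Hypothetical weights and threshold: the unit only fires when both inputs are on.
AND_WEIGHTS = [1.0, 1.0]
AND_THRESHOLD = 1.5

for pattern in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pattern, perceptron(pattern, AND_WEIGHTS, AND_THRESHOLD))
```

A single unit of this kind can only separate its inputs with a straight line (or hyperplane), which is why the MLP stacks layers of such elements to handle nonlinearly separable mappings.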

Figure 3. Structure of a typical 3 layer feed forward multilayer perceptron artificial neural network.

The inputs to a neuron are roughly analogous to the dendrites of a biological neuron, and the output of the neuron is comparable to an axon. The artificial neuron and its biological counterpart are shown graphically in Figure 4A and B respectively.

Figure 4. Neurons. (A) An artificial neuron from the hidden or output layer of an MLP, and (B) a simplified depiction of a naturally occurring biological neuron.

The neurons of the input layer carry out no processing on the values they receive, and merely pass the input values to the next layer along the synapses. Each neuron in the hidden and output layers takes the sum of the values on each of its input synapses, each multiplied by the weight associated with that synapse, to generate the “activation” of the neuron. An “activation function” is applied to the sum to produce the output of each neuron. Many of the MLP training algorithms require the use of a differentiable activation function for calculating weight adjustments for hidden layer neurons. The sigmoid is usually favored as its derivatives and partial derivatives, which are needed by typical gradient descent approaches (discussed later), can be easily and efficiently produced. Efficiency is important as these values may need to be calculated thousands or perhaps millions of times in teaching an ANN for non-trivial problems. An example of a sigmoidal plot is given in Figure 5. The sigmoid maps the activation of a neuron to a continuous output value in the range 0 to 1 (although the value never reaches 0 or 1).
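The computation carried out by a single hidden or output neuron can be sketched as follows. This is a minimal illustration assuming the logistic sigmoid 1/(1 + e^(-a)); the input and weight values are hypothetical. It also shows why the sigmoid derivative is cheap to obtain: it can be written purely in terms of the neuron's own output, y(1 - y).

```python
import math


def sigmoid(activation):
    """Logistic sigmoid: maps any real activation into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-activation))


def sigmoid_derivative(output):
    """Derivative of the sigmoid expressed through its own output y: y * (1 - y)."""
    return output * (1.0 - output)


def neuron_output(inputs, weights):
    """Weighted sum of the inputs (the activation) passed through the sigmoid."""
    activation = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(activation)


# Hypothetical input signals and synaptic weights; the activation here is 0.1.
y = neuron_output([0.5, -1.0, 2.0], [0.4, 0.3, 0.1])
print(y, sigmoid_derivative(y))
```

Because the derivative reuses the already computed output, no further exponentials are needed during training.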

Figure 5. A sigmoid function. If this sigmoid were used as an activation function, the activation of the neuron would be a value on the x-axis and the corresponding output of the neuron would be the value on the y-axis.

Additional inputs with a constant value of 1 are connected to each hidden and output layer neuron. These inputs are referred to as biases. Adjusting the synaptic weight of the bias is equivalent to moving the center of the activation function. Alternative architectures to the feedforward MLP are available. For example, Chen and Kurgan provide an introduction to the application of two prominent alternative architectures to protein bioinformatics problems: recurrent and radial basis function (RBF) networks. In the generalized form of the MLP, synaptic connections are allowed between neurons in non-successive layers.
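A forward pass through a small fully connected MLP, including the bias inputs described above, might look like the sketch below. The layer sizes and weight values are hypothetical, and the convention of storing the bias weight as the last entry of each neuron's weight row is an assumption made for this example.

```python
import math


def sigmoid(activation):
    return 1.0 / (1.0 + math.exp(-activation))


def layer_forward(inputs, weight_rows):
    """Compute the outputs of one layer.

    Each row holds the weights of one neuron; the last entry of the row is the
    weight of the bias input, which always carries the constant value 1.
    """
    extended = list(inputs) + [1.0]
    return [sigmoid(sum(x * w for x, w in zip(extended, row))) for row in weight_rows]


def mlp_forward(inputs, layers):
    """Propagate a stimulus through successive layers and return the response."""
    signal = list(inputs)  # the input layer passes its values on unchanged
    for weight_rows in layers:
        signal = layer_forward(signal, weight_rows)
    return signal


# Hypothetical 2-2-1 network: two inputs, two hidden neurons, one output neuron.
hidden_layer = [[0.5, -0.4, 0.1], [-0.3, 0.8, -0.2]]
output_layer = [[1.0, -1.0, 0.0]]
print(mlp_forward([1.0, 0.0], [hidden_layer, output_layer]))
```

With an input of 0, a neuron's output depends only on its bias weight, which is one way to see that the bias shifts the center of the activation function.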

Training

There are many approaches to training neural networks, but the most typical is a supervised approach in which exemplars of the input to output mappings are used to train the network by adjusting the synaptic weights. Training is an iterative process. Weight updates can be made based on the error of each individual exemplar (online learning) or based on the error of a number of exemplars (batch learning). In each iteration:

A set of exemplars is applied to the network and the output is recorded.
The error of the network is quantified (using an error function).
The weights of the synapses are then slightly adjusted so that the error of the network would be lower if the same set of training instances were re-applied.

In machine learning, a single pass through the entire training set is referred to as an epoch. Through repeated adjustment of the synaptic weights, the network eventually settles on a configuration where no slight adjustments to the weights will result in a net decrease in the error value across the set of exemplars. Gradient descent learning is one approach to identifying how the synaptic weights should be adjusted. Gradient descent learning uses the idea of an “error surface” which maps the error of a network as a function of the weights. On the error surface, the slope with respect to a single weight (the partial derivative of the network error with respect to that weight) identifies whether the weight should be increased or decreased to lower its error contribution. An example error surface is shown in Figure 6 for a single weight plotted against network error.
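The difference between online and batch updates can be sketched on a deliberately simple model. The example below assumes a single linear neuron (output = w*x + b) trained with a squared error function on hypothetical exemplars generated from y = 2x + 1; the function names, learning rate and epoch count are illustrative choices, not values from the paper.

```python
def mean_squared_error(exemplars, w, b):
    """Error function: mean squared difference between actual and expected output."""
    return sum(((w * x + b) - t) ** 2 for x, t in exemplars) / len(exemplars)


def train_linear_neuron(exemplars, learning_rate=0.1, epochs=500, batch=False):
    """Train output = w*x + b by gradient descent, online or in batch mode."""
    w, b = 0.0, 0.0
    for _ in range(epochs):  # one epoch = one pass through the whole training set
        if batch:
            # Batch learning: accumulate the gradient over all exemplars, update once.
            grad_w = grad_b = 0.0
            for x, target in exemplars:
                error = (w * x + b) - target
                grad_w += error * x
                grad_b += error
            w -= learning_rate * grad_w / len(exemplars)
            b -= learning_rate * grad_b / len(exemplars)
        else:
            # Online learning: update the weights after every single exemplar.
            for x, target in exemplars:
                error = (w * x + b) - target
                w -= learning_rate * error * x
                b -= learning_rate * error
    return w, b


# Hypothetical exemplars of the mapping y = 2x + 1.
exemplars = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w_online, b_online = train_linear_neuron(exemplars, batch=False)
w_batch, b_batch = train_linear_neuron(exemplars, batch=True)
print(w_online, b_online, w_batch, b_batch)
```

Both schedules settle on the same weights here; they differ in how often the weights change within an epoch.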

Figure 6. An example of a simulated error surface. The value of a weight (on the x-axis) plotted against the error of the network (the y-axis). The solid red line represents the initial value of a synaptic weight. The dashed red line represents the slope of the error. The green line is a local minimum, a locally optimal weight value. The blue line is the globally optimal value for the weight at which the error contribution is minimized.

The initial value of a synaptic weight gives a point on the error surface (e.g., 0.35 in Fig. 6). The slope of the error surface at the initial weight value is shown as the dashed red line. A positive slope (as in Fig. 6) indicates that the error contribution can be reduced by reducing the weight value slightly (moving the weight value left along the x-axis). Conversely, a negative slope indicates that the weight value should be increased. By iteratively adjusting the value of each weight slightly in the direction of the negative of the gradient of the error surface, the error produced by the network can be reduced until a minimum is reached. For Figure 6, the value of the weight would be reduced gradually until it reaches a minimum of the error surface at a value of 0.24. At this point, any small change to the value of the weight should increase the error produced by the network on the training set. A learning rate, α, is specified to set how much the weights are adjusted each generation. The learning rate should be low to allow the learning algorithm to edge toward the optimal solution. A learning rate that is too high will cause the adjustments to overshoot minima, making optimization difficult for the algorithm. Conversely, a learning rate that is too low will cause slow convergence of the weights, and perhaps lead to the algorithm being unable to escape shallow local minima. A momentum rate can also be specified to increase learning speed by increasing weight adjustment size if consecutive adjustments are in the same direction.
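Gradient descent on a one-weight error surface can be sketched as below. The surface used here, (w^2 - 1)^2 + 0.3w, is a hypothetical stand-in with one local and one global minimum; it is not the surface of Figure 6, and the learning rate, momentum and step count are assumed values. The momentum term follows the common "classical momentum" formulation, which is one possible reading of the description above.

```python
def error(w):
    """Hypothetical error surface with a local minimum near w = 0.96
    and a lower, global minimum near w = -1.04."""
    return (w * w - 1.0) ** 2 + 0.3 * w


def slope(w):
    """Derivative of the error with respect to the weight."""
    return 4.0 * w * (w * w - 1.0) + 0.3


def gradient_descent(w, learning_rate=0.05, momentum=0.0, steps=500):
    """Repeatedly move the weight against the slope of the error surface."""
    velocity = 0.0
    for _ in range(steps):
        # A positive slope lowers the weight, a negative slope raises it.
        # With momentum > 0, consecutive adjustments in the same direction grow.
        velocity = momentum * velocity - learning_rate * slope(w)
        w += velocity
    return w


w_plain = gradient_descent(0.5)
w_momentum = gradient_descent(0.5, momentum=0.5)
print(w_plain, error(w_plain))
print(w_momentum, error(w_momentum))
```

Starting from w = 0.5, both runs settle in the nearby minimum, where the slope is effectively zero and any small change to the weight increases the error.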

The Feedforward Backpropagation Learning Algorithm

The contribution of a neuron on the output layer to the error of the network can be easily calculated as the difference between the expected and actual output. The contribution of the hidden layer neurons to the error is more difficult to calculate, as they may partially contribute to the error of many neurons in successive layers. The back propagation of errors (backpropagation, “Back-Prop” or BP) is a common mathematically provable gradient descent learning algorithm which uses differentiable activation functions to calculate the error of neurons and produce synaptic weight adjustments which reduce this error. If a sigmoid activation function is employed for the neurons of the hidden and output layers, the error signal and weight adjustment calculations can be reduced mathematically to the efficient forms given in Equations 1–4. For a more in-depth discussion of how these equations are derived, and a walk through of the first two iterations of weight updating for a simple example network, see the text “Neural Networks” by Phil Picton. For an output layer neuron k, the error signal, e, can be calculated using Equation 1 as a function of the difference between the expected output, t, and the observed output, o. New values for the synaptic weights feeding into k can then be calculated using Equation 2, where w and x_i are the weight and input signal respectively of a single synaptic input to the output neuron k, and α is the value of the learning rate constant. The error signal of a hidden layer neuron must be calculated with respect to the errors of all output layer and other hidden layer neurons to which it feeds its own output. To calculate the error signal, e, for a hidden layer neuron i, we first define the variable errSum as the summation of the error of each neuron to which i feeds its output, scaled by the synaptic weight connecting them. The error of i can be calculated using Equation 3, and the new synaptic weights feeding into i can be generated using Equation 4.
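As a sketch, the standard backpropagation rules for sigmoid neurons that are consistent with these definitions can be written as follows. The subscripts, the index j (ranging over the neurons to which i feeds its output) and the notation w_ij for the connecting weight are assumptions made for this presentation.

```latex
\begin{align}
e_k &= o_k\,(1 - o_k)\,(t_k - o_k) \tag{1}\\
w_i^{\text{new}} &= w_i + \alpha\, e_k\, x_i \tag{2}\\
e_i &= y_i\,(1 - y_i)\,\mathrm{errSum}_i,
\qquad \mathrm{errSum}_i = \sum_{j} e_j\, w_{ij} \tag{3}\\
w^{\text{new}} &= w + \alpha\, e_i\, x \tag{4}
\end{align}
```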
Where w and x are the weight and input signal respectively of a single synaptic input to the hidden layer neuron i, y is the output of neuron i, and α is the value of the learning rate constant. For a more in-depth discussion of the mathematical principles and theory of the back propagation algorithm, the reader is directed to the paper “Neural Networks,” by Yang. An interesting variation of backpropagation is the quick prop algorithm. In quick prop, two iterations of BP are run and the points on the error surface are recorded. Under the assumption that the error surface is approximately parabolic in shape with respect to each weight, quick prop uses the two identified points to estimate where the nadir of the error surface would be located. The synaptic weights can then be adjusted to jump to the expected optimized value. For many problems, the use of quick prop can reduce the processing power and the number of generations required relative to standard BP.
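The backpropagation updates described in this section can be sketched for a small network with sigmoid neurons. The code below is a minimal illustration assuming the standard sigmoid rules (output error signal o(1 - o)(t - o), hidden error signal y(1 - y)·errSum, weight change α·e·x); the 2-2-1 architecture, the initial weights, the logical OR exemplars and the learning rate are all hypothetical choices.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, hidden_w, output_w):
    """Return the extended inputs, extended hidden outputs and network output."""
    xb = list(x) + [1.0]  # append the bias input
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in hidden_w]
    hb = hidden + [1.0]   # append the bias input for the output neuron
    out = sigmoid(sum(w * v for w, v in zip(output_w, hb)))
    return xb, hb, out


def backprop_step(x, target, hidden_w, output_w, alpha):
    """One online backpropagation update for a single exemplar (in place)."""
    xb, hb, out = forward(x, hidden_w, output_w)
    # Error signal of the output neuron.
    e_out = out * (1.0 - out) * (target - out)
    # Error signals of the hidden neurons, using the pre-update output weights.
    e_hidden = []
    for i in range(len(hidden_w)):
        err_sum = e_out * output_w[i]  # only one downstream neuron in this network
        e_hidden.append(hb[i] * (1.0 - hb[i]) * err_sum)
    # Weight updates: new weight = old weight + alpha * error signal * input.
    for i in range(len(output_w)):
        output_w[i] += alpha * e_out * hb[i]
    for i, row in enumerate(hidden_w):
        for j in range(len(row)):
            row[j] += alpha * e_hidden[i] * xb[j]


def total_error(data, hidden_w, output_w):
    return sum((t - forward(x, hidden_w, output_w)[2]) ** 2 for x, t in data)


# Hypothetical exemplars (logical OR) and hand-picked initial weights.
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
hidden_w = [[0.1, -0.2, 0.05], [-0.3, 0.4, -0.1]]
output_w = [0.2, -0.1, 0.05]

error_before = total_error(data, hidden_w, output_w)
for epoch in range(2000):
    for x, t in data:
        backprop_step(x, t, hidden_w, output_w, alpha=0.5)
error_after = total_error(data, hidden_w, output_w)
print(error_before, error_after)
```

The hidden error signals are computed before any weight is changed, so that every update in a step is based on the same forward pass.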

Stopping Criteria

A number of approaches can be used to decide on the number of iterations for which to run a training algorithm. A simple approach is to continue for only a predetermined number of generations, normally set using domain knowledge or experience, above which it can be reasonably assumed that no improvement will be observed, or for as long as the available time or resources allow. A more dynamic approach is to continue iteratively until the error on the training set drops below a threshold specified a priori, or alternatively, until the error of the network over the entire training set drops by less than a specified value (for example 0.01%) over several generations. These simple approaches are, however, susceptible to “overtraining,” where the performance observed on the training data may not be representative of the performance of the network on novel data. As training proceeds, the training algorithm, in an attempt to reduce the error, may start to learn decision boundaries which over-fit the training data. This results in reduced error on the training data set at the expense of the ability to generalize to novel exemplars from outside the training data. This implies that the error on the training data (the resubstitution error) is not an accurate gauge of the ANN's error on novel data. To address this problem, it is common to divide the available exemplars into two independent sets: the training set and the testing set. During training (on the training set), the testing set is continually evaluated by the network, but it is not used to affect the weight adjustment of the synapses. The testing set gauges the generalization ability of the network. If the error on the testing set increases over several epochs, while the error on the training set continues to decrease, it is considered an indication that overtraining is occurring. Training can then be stopped and the network weights reverted to the values at which the testing error was minimized.
The performance of the ANN on the testing data cannot be used as an accurate gauge of accuracy either, as the training may have stopped at a point at which the ANN overfits the testing data. For this reason, a third independent set of exemplars, referred to as the validation set, is required to accurately measure the ANN error. Training an ANN for non-trivial problems can be computationally expensive, perhaps requiring thousands of updates to each synaptic weight. This can correspond to long training times. Florido et al. have described a method of sampling representative exemplars suitable for training an ANN from data sets. It is claimed that such a reduced (but representative) set of training data has the potential to improve training time and reduce overfitting.
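The stopping rule based on the testing set can be sketched as a small function operating on a recorded sequence of testing-set errors. This is a simplified illustration: the `patience` parameter (how many consecutive rises count as "several epochs") and the error values are hypothetical, and the training-set error is assumed to keep falling.

```python
def early_stopping_epoch(test_errors, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch testing errors.

    Training stops once the testing error has risen for `patience` consecutive
    epochs; the weights should then be reverted to those of `best_epoch`, the
    epoch at which the testing error was lowest.
    """
    best_epoch = 0
    best_error = float("inf")
    consecutive_rises = 0
    previous = None
    for epoch, err in enumerate(test_errors):
        if err < best_error:
            best_error, best_epoch = err, epoch
        if previous is not None and err > previous:
            consecutive_rises += 1
        else:
            consecutive_rises = 0
        if consecutive_rises >= patience:
            return epoch, best_epoch
        previous = err
    # The error never rose for long enough: training ran to the last epoch.
    return len(test_errors) - 1, best_epoch


# Hypothetical testing-set errors recorded after each epoch.
test_errors = [0.90, 0.60, 0.40, 0.35, 0.36, 0.38, 0.41, 0.45]
stop_epoch, best_epoch = early_stopping_epoch(test_errors)
print(stop_epoch, best_epoch)
```

In practice a copy of the weights would be saved whenever a new best testing error is seen, and the final accuracy would be measured on the separate validation set.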

Advantages and Disadvantages of ANNs for Bioinformatics

The case for the use of ANNs can be made by their proven successful application to a number of challenging problems in bioinformatics and other domains. Here, the strengths and weaknesses of ANNs, and the reasons why they are appealing for bioinformatics problems, are discussed.

Advantage: Generalization

Generalization in machine learning refers to the ability of the solution to extrapolate good outputs for unseen data or new combinations of inputs based on what it has been trained on. The generalization ability of ANNs is well documented, and is one of their strongest and most desirable qualities for bioinformatics, although it is a quality shared with many other machine learning and model based approaches. Strong generalization is highly desirable in many situations in the bioinformatics domain. For example, experiments can often produce gigabytes of data, but diseases and conditions can present differently in different environments. Expression levels of a biomarker gene, for example, may vary depending on the age, gender, health, race, etc. of a patient. Generating data to cover all permutations of even these four contributing factors may be time consuming, expensive or just not possible for rare conditions. Using only a sampling of these permutations, an ANN can often learn how these attributes affect the observed biomarkers, and learn to correctly classify (or at least make a very well informed guess as to the classification of) patients whose combination of attributes was not covered by the training data.

Advantage: No Need for Complete Domain Knowledge

Another appealing property of ANNs for bioinformatics is that domain knowledge does not need to be complete. There are also situations where exact solutions exist, but ANNs can be used to estimate the desired output where generating the exact solution is prohibitive in terms of cost or time. Protein folding is an example of a problem which is not fully understood, but to which ANNs have been applied with a good level of success. The 3D structure of a protein can be determined using X-ray crystallography, but this is a very time consuming approach, and so is not suitable for the vast number of sequences which can be generated in a single bioinformatics experiment. An ANN can be trained to hypothesize protein structure based on primary sequence alone (which can be identified much more quickly and cheaply), given the structure of a number of exemplars which may have been identified using X-ray crystallography. ANNs achieve this success without the need to understand the underlying mechanisms; instead, they examine how known contributing factors have affected the structure of proteins with identified form to conjecture how a novel chain will be affected. The application of ANNs to protein folding is discussed in more detail in the case studies section.

Advantage: Robust Solutions to Complex Problems

The third main advantage of ANNs for bioinformatics that we will discuss is the general robustness of the solutions that can be produced and the complexity of the problems to which they can successfully be applied. We demonstrate these points collectively by considering an example of how neural networks have been applied to microarray data. Generic microarrays covering a large number of genes can have a low statistical power due to the small sample sizes (number of arrays available) and high levels of noise (mismatch binding and contamination) typical in microarray experiments. Genes identified as differentially expressed in one experiment may not appear differentially expressed in another experiment. In fact, it has been demonstrated that thousands of microarray samples may be required to define reproducible biomarkers with confidence in some situations. ANNs are a technology which has provided good performance on microarray data in spite of the limitations of the arrays, by using readily available information from multiple heterogeneous sources to place meaning on the observed experimental results. Using the assumption that deregulated proteins result from, or cause, deregulation of the proteins with which they interact, Chuang et al. demonstrated that the information in protein-protein interaction (PPI) networks can be combined with microarray data to define more robust biomarkers. This approach evaluates the aggregate behavior of sub-networks of interacting genes (connected within the PPI network), allowing the contributions of genes with subtle differential expression to be combined to produce more reproducible and robust biomarkers. The CRANE algorithm is an example of such an approach, which employs an ANN to perform classifications based on identified dysregulated PPI subnetworks. The expression levels of the genes in the subnetworks form the inputs to the ANN.
The use of these subnetworks of related genes is shown to outperform groups of genes selected based on experimentally observed differential expression alone in classification problems. This example demonstrates how ANNs applied in the bioinformatics domain can:

Consider the impact of many attributes (capable of relatively high dimensionality) from multiple heterogeneous sources.
Work with multiple patterns (can use multiple PPI networks in generating its output, with potentially both positive and negative biomarkers).
Consider the impact of even low impact attributes which may or may not be present.
Produce robust, fault tolerant solutions which, to a degree, can handle contamination, low statistical power, effects of machine calibration, background noise, and limited repeatability of experiments observed in the data sets, all of which can be inherent in data generated from generic microarrays.

As discussed earlier, gene expression levels can also be impacted by the gender, age, race, etc. of the patient, but this can be addressed by the strong generalization ability of ANNs. For a detailed discussion of the application of ANNs to microarrays we suggest the paper “An introduction to artificial neural networks in bioinformatics-application to complex microarray and mass spectrometry datasets in cancer studies” by Lancashire, Lemetre and Ball.

Disadvantage: Local Minima

Learning algorithms such as back propagation are applied with the caveat that solutions may only be locally as opposed to globally optimal. In Figure 6 it can be seen that gradient descent starting at the initial point on the error surface (weight value 0.35) will adjust the weight value to a point where the error is reduced (weight value 0.24), but not necessarily to the value of the weight where the error is minimized. The point of lowest error on the error surface is referred to as the global minimum of the error surface and is given by a weight value of 0.66 for the example of Figure 6. The inability of gradient descent algorithms to consistently identify global minima is referred to as the local minima problem. This problem can be lessened by repeating training several times with different initial weight values (and therefore different starting points on the error surface), or through processes such as simulated annealing or neuroevolution.
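Repeating training from different initial weight values can be sketched on a one-weight error surface. The surface below, (w^2 - 1)^2 + 0.3w, is a hypothetical example with a local minimum near w = 0.96 and a global minimum near w = -1.04; it is not the surface of Figure 6, and the starting points and learning rate are assumed values.

```python
def error(w):
    """Hypothetical error surface with one local and one global minimum."""
    return (w * w - 1.0) ** 2 + 0.3 * w


def slope(w):
    return 4.0 * w * (w * w - 1.0) + 0.3


def descend(w, learning_rate=0.05, steps=500):
    """Plain gradient descent from a given starting weight."""
    for _ in range(steps):
        w -= learning_rate * slope(w)
    return w


def best_of_restarts(starts):
    """Run gradient descent from every starting weight and keep the lowest error."""
    finals = [descend(w0) for w0 in starts]
    return min(finals, key=error)


single_run = descend(0.5)                        # trapped in the local minimum
best = best_of_restarts([-2.0, -0.5, 0.5, 2.0])  # hypothetical starting points
print(single_run, error(single_run))
print(best, error(best))
```

Restarts do not guarantee that the global minimum is found; they only raise the chance that at least one starting point lies in its basin.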

Disadvantage: Selecting the Architecture

Whereas the number of input and output neurons is prescribed by the cardinality of the required mapping, the number of hidden layers and the number of neurons in each hidden layer is dictated by the complexity of the problem, and is typically empirically defined. Although this is still considered an unsolved task, Xu and Chen overview several opinions and approaches to the selection of appropriate architectures. If too few neurons are present, the potential complexity of the decision boundaries produced by the network will be limited (“under-fitting”), while too many neurons will encourage the network to overtrain by allowing overly intricate decision boundaries. One approach to this problem is neuroevolution: the use of evolutionary algorithms to discover good neural network architectures in an automated fashion. These evolutionary algorithms are not guaranteed to find the optimal solution, but should find a good solution in a reasonable amount of time. Topology and weight evolving artificial neural network (TWEANN) algorithms such as the NeuroEvolution of Augmenting Topologies (NEAT) and Cartesian Genetic Programming Evolved Artificial Neural Network (CGPANN) variations have recently shown good performance on bioinformatics data sets.

Disadvantage: Not Always the Best Approach

In comparative analyses, ANNs generally perform well, but do not necessarily offer the best performance. For example, in the paper “Why neural networks should not be used for HIV-1 protease cleavage site prediction” it is demonstrated that although ANNs are capable of classifying linearly separable data, superior performance is achieved by linear classifiers when applied to linear problems. Additionally, alternative machine learning approaches exist which have proven more effective than ANNs on a number of problems. Isroy et al. have performed a survey of papers dealing with machine learning based classification from three bioinformatics journals over 2010 and 2011. It was observed that, of the papers surveyed, 13% employed artificial neural networks, while 57% employed Support Vector Machines (SVM). The findings of Isroy et al. are presented in Table 1.

Table 1. Use rates of different machine learning algorithms in a sampling of bioinformatics papers, as presented by Isroy et al.

Algorithm	Percentage(2010)	Percentage(2011)	Percentage(2010–2011)
Decision Tree (DT)	26	24	26
Support Vector Machine (SVM)	51	69	57
Rule Based Learning	4	3	4
Artificial Neural Network (ANN)	10	17	13
Naive Bayes (NB)	16	14	15
k-Nearest Neighbor (KNN)	15	17	15

Chan et al. compared the relative performance of the MLP and two SVM variations in terms of receiver operating characteristic (ROC) and sensitivity at set specificity levels on a glaucoma diagnosis data set. The approaches are evaluated using the full set of attributes and a reduced set identified using principal component analysis (PCA). The results (presented in Table 2) show the SVM out-performing the MLP on this problem.

Table 2. Performance of the MLP and SVM on Glaucoma diagnosis, as presented by Chan et al.

Attributes	Classifier	ROC area	Sensitivity at specificity 0.9	Sensitivity at specificity 0.75
Full	MLP	0.883	0.66	0.859
Full	Gaussian SVM	0.914	0.776	0.878
Full	Linear SVM	0.893	0.66	0.853
PCA	MLP	0.898	0.713	0.846
PCA	Gaussian SVM	0.904	0.744	0.833
PCA	Linear SVM	0.888	0.667	0.853

ANNs are, however, still a very powerful tool, and numerous papers can be identified where the ANN matches or outperforms the SVM approach. Chowdhury et al., for example, in describing the CRANE algorithm discussed previously, argue that the way ANNs deal with sub-patterns makes them better suited to that problem than SVMs. Cho and Ryu compared the performance of MLPs and two variations of the SVM in combination with a number of feature selection algorithms on gene expression profiles. These results are presented in Table 3. It is noted that, on this data set, the MLP consistently performed as well as or better than the SVM approaches. The MLP also performed favorably compared with the self-organizing map (SOM), decision tree (DT) and k-nearest neighbor (KNN) algorithms (data not shown). Further work by Cho and Won produced similar results for leukemia, colon and lymphoma data sets.

Table 3. Comparing the performance of the MLP with the SVM on gene expression data in combination with different feature selection algorithms

	Pearson	Spearman	Euclidean distance	Cosine coefficient	Information gain	Mutual information	S/N ratio
MLP	97.1	70.6	97.1	79.4	72.9	62.1	94.1
SVM_RBF	97.1	70.6	91.2	70.6	58.8	58.8	94.1
SVM_linear	79.4	70.6	88.2	58.8	58.8	58.8	94.1

Chang et al. directly compared the performance of the MLP and SVM on the classification of breast tumor images. Their results showed that the MLP and SVM have comparable accuracy (see Table 4), but that the SVM could be trained much more quickly.

Table 4. The performance of the MLP and SVM on the task of identifying breast cancer from image data

	SVM	MLP
Accuracy (%)	85.60	84.80
Sensitivity (%)	95.45	84.55
Specificity (%)	77.86	77.14
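The three measures reported in Table 4 follow from the counts of correct and incorrect classifications. The sketch below uses the usual definitions of accuracy, sensitivity and specificity; the counts are hypothetical and are not the data of Chang et al.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    tp/fn: diseased cases classified correctly/incorrectly
    tn/fp: healthy cases classified correctly/incorrectly
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # proportion of diseased cases detected
    specificity = tn / (tn + fp)  # proportion of healthy cases recognized
    return accuracy, sensitivity, specificity


# Hypothetical counts for 100 diseased and 100 healthy cases.
accuracy, sensitivity, specificity = classification_metrics(tp=90, fp=20, tn=80, fn=10)
print(accuracy, sensitivity, specificity)
```

Two classifiers with similar accuracy can differ markedly in sensitivity, as the SVM and MLP columns of Table 4 illustrate.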

Implementations

There are a number of freely available open-source ANN implementations (in many programming languages) available through sites such as Google Code, SourceForge and GitHub. The WEKA project (Waikato Environment for Knowledge Analysis) is an open-source implementation of a library of different machine learning algorithms. Implementations in the R programming language are hosted on the site http://cran.r-project.org/.

Case Studies

In the following section, we present a number of varied example applications of ANNs to bioinformatics problems. We do not advocate that these approaches necessarily represent the best approaches or practices, but rather they serve as examples of how the principles of ANNs can be applied to different real world bioinformatics problems.

Peptide Secondary Structure Prediction

A protein comprises a chain, or multiple chains, of amino acid residues. The chemical properties of the amino acids in the peptide cause the chain to twist and fold into a number of regular structures to form a stable three-dimensional conformation. It is this three-dimensional conformation of the chains which determines the function of the protein. The structure of a protein is defined at several levels:

Primary structure: the order of the amino acid residues which constitute the protein
Secondary structure: the locations and identities of a number of regular local secondary structures along the primary structure of the protein (such as the α helix and the β strand)
Tertiary structure: the overall three-dimensional conformation taken by a single peptide chain
Quaternary structure: complexes formed from several peptide chains linked together which act as a single protein

Example: Sequence Similarity Based Secondary Structure Prediction

Rao et al. describe an approach to identifying the secondary structure of a peptide from its primary structure using an ANN. The main idea behind the approach is that segments of a peptide chain with similar primary sequence are assumed to have similar secondary structure. Under this assumption, the secondary structures for novel amino acid sequences can be generalized from similar amino acid sequences with known secondary structure classifications. To identify the secondary structure of an amino acid, a window of between 15 and 29 neighboring amino acids is used as the input to a neural network. The identity of each amino acid in the window is encoded using 20 inputs to the network. For a window of size W, the ANN is an MLP with a single hidden layer, comprising W*20 input neurons, W*2 + 1 hidden layer neurons, and 8 output neurons representing different structural designations. The secondary structure classification for an amino acid residue is given as the classification corresponding to the highest output of the network. For example, if the first output of the network has the highest value, the amino acid residue under investigation is designated as an α-helix. If the last output of the network has the highest value, the amino acid residue is designated as a coil. The network is trained using the scaled conjugate gradient descent algorithm on data taken from the DSSP database, which contains peptide sequences and their corresponding secondary structure classifications. In evaluations on a single sequence, the network achieved a Q8 score of 72.3%, meaning it assigned 72.3% of the amino acid residues to the correct one of the 8 possible secondary structure classes.
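
The input encoding and layer sizes described above can be illustrated with a minimal Python sketch. It is not the implementation of Rao et al.: the one-hot form of the 20 inputs per residue, the all-zero padding beyond the ends of the chain, and the function names are assumptions made for illustration only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one letter each


def encode_window(sequence, centre, window):
    """Encode `window` residues centred on index `centre` as window * 20 values."""
    half = window // 2  # assumes an odd window size, e.g. 15 to 29
    inputs = []
    for pos in range(centre - half, centre + half + 1):
        one_hot = [0.0] * len(AMINO_ACIDS)
        # Assumption: positions beyond either end of the chain, and any
        # non-standard residue letters, are left as all zeros.
        if 0 <= pos < len(sequence):
            index = AMINO_ACIDS.find(sequence[pos])
            if index >= 0:
                one_hot[index] = 1.0
        inputs.extend(one_hot)
    return inputs


def layer_sizes(window):
    """Input, hidden and output layer sizes for a window of size W."""
    return window * 20, window * 2 + 1, 8


def classify(outputs, labels):
    """Return the label of the output neuron with the highest activation."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return labels[best]
```

For a window of 15 residues, `layer_sizes(15)` gives 300 inputs, 31 hidden neurons and 8 outputs, and `encode_window` returns a vector of matching length.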

Example: PSIPRED

PSIPRED is an application which predicts a protein's secondary structure from its primary structure using a pair of artificial neural networks trained using BP. For a given sequence, PSIPRED uses a “sequence profile” to examine how highly conserved elements of the sequence are relative to homologs and distant homologs identified from a database. Matching against the sequence profile is more informative than matching against the sequence itself, both because functional regions of peptides tend to display a high level of conservation and because regions with high sequence similarity identified in the database may be purely coincidental. PSIPRED uses position specific scoring matrices (PSSMs), generated as a by-product of another program, PSI-BLAST, to present this information to the first neural network. BLAST is a tool for finding sequences in a database that are homologous to a given sequence and aligning them with it. For a sequence of length n, n − w + 1 words of length w can be generated. The database is then searched for each word using a finite state machine. Words are evaluated using a substitution matrix, and words scoring above a threshold T are extended in both directions. Position specific iterated BLAST (PSI-BLAST) makes a number of improvements over standard BLAST. One of the improvements is that once the original sequence alignment is completed, the identified similar sequences are used to form a PSSM of size 20 × n. The process can then be repeated iteratively, where the PSSM generated in each iteration is used in place of the original substitution matrix. This iterative process allows the discovery of distant homologs from a database. PSIPRED classifies an amino acid as one of 3 secondary structure states (Q3): helix, strand, or loop. The Q3 training and testing data are generated from the DSSP database Q8 classifications (as used in the sequence similarity based approach) using the approach specified by Rost and Sander.
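
The word generation step mentioned above (n − w + 1 overlapping words of length w from a sequence of length n) can be sketched in a few lines of Python. The function name and the example sequence are illustrative assumptions and are not taken from the BLAST source code.

```python
def words(sequence, w):
    """Return every overlapping word of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


# A sequence of length n = 6 yields n - w + 1 = 4 words of length w = 3.
example = words("MKVLAT", 3)
```
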
To generate the Q3 value for an amino acid in a given sequence, a sub-sequence is first generated comprising the amino acid and a window of 7 amino acids to either side. This sub-sequence is then fed into PSI-BLAST. The values of the PSSM generated by PSI-BLAST after 3 iterations are used to generate 300 inputs (15 × 20) to the first ANN. An additional input is associated with each amino acid in the window, representing whether that amino acid spans the N or C terminus. The first ANN has a single hidden layer with 75 neurons, and 3 output layer neurons each representing an individual Q3 classification. The output with the highest response is taken as the classification of the amino acid at the center of the window. This process is shown in Figure 7.
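
A minimal sketch of how such an input vector might be assembled is given below. It is not PSIPRED's code: the row-per-residue layout of the PSSM, the zero padding, and the convention that the extra input is 1 when a window position falls beyond the N or C terminus are assumptions, and any scaling of the profile scores is omitted.

```python
WINDOW = 15           # central residue plus 7 on either side
PROFILE_COLUMNS = 20  # one PSSM score per amino acid type


def first_network_input(pssm, centre):
    """Build the input vector for the residue at index `centre`.

    `pssm` is a list with one row per residue, each row holding 20 scores.
    """
    half = WINDOW // 2
    vector = []
    for pos in range(centre - half, centre + half + 1):
        if 0 <= pos < len(pssm):
            vector.extend(pssm[pos])  # 20 profile scores for this position
            vector.append(0.0)        # extra input: position lies inside the chain
        else:
            vector.extend([0.0] * PROFILE_COLUMNS)
            vector.append(1.0)        # extra input: position is past a terminus
    return vector
```

Under these assumptions each window position contributes 21 values, so the vector holds the 300 profile inputs plus 15 additional terminus inputs.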

Figure 7. Generating a Q3 classification for a specific amino acid (in the dashed box) using the first ANN of PSIPRED.

Once the first network has been applied to classify the entire sequence, a second neural network with 60 hidden layer neurons and 3 outputs is used to further refine the results. To classify an amino acid in the sequence, the outputs of the first network for the window of 15 amino acids are used as input to the second network. As before, an additional input is added for each amino acid in the window, representing whether the amino acid spans the N or C terminus. The output of the second network is still the Q3 classification of the central amino acid, but this network tends to be more accurate in deciding a conformation for an amino acid given the likely conformations of its direct neighbors. These steps are shown graphically in Figure 8.

Figure 8. Improving the accuracy of the Q3 score using a second ANN.

PSIPRED was independently evaluated in the CASP3 (Third Critical Assessment of Structure Prediction) competition, where it was identified as the top performing approach across a number of blind evaluations. PSIPRED version 2.0 also performed well in CASP4. PSIPRED 3.2 claims to achieve an average Q3 score of 81.6% (http://bioinf.cs.ucl.ac.uk/index.php?id=779).

Gene Identification

Neural networks have previously been applied to the categorization of coding (exons) and non-coding (introns and intragenic spacer data) regions of DNA. For an overview of eukaryote gene prediction strategies see Sleator.

Example: Gene Identification Using Coding Measures

An interesting example of this is the approach taken by Fogel, Chellapilla, and Fogel, who construct an ANN using neuroevolution to classify nucleotides as coding or non-coding. This work builds on network inputs identified by Uberbacher and Mural for the GRAIL application. As for many gene identification techniques, a window of nucleotides is pre-processed to extract features of the sequence, known as “coding measures.” Coding measures are statistical observations on the differences in distribution and repeated patterns of the nucleotides in coding and non-coding regions of DNA. These statistics present an opportune training set for neural network architectures: an established mapping that can be used for training a network to differentiate coding gene sequences on novel input. As such, the neural network is employed to define a nonlinear weighting for each of the coding measures, allowing consideration of how these coding measures affect the probability of coding when observed in various combinations. The “frame bias matrix” is an example of a coding measure that works at the nucleotide level. The frame bias matrix works on the observation that the four nucleotides (ACGT) have different probabilities of being observed in the three codon positions for both coding and non-coding regions. Therefore, the presence of specific nucleotides in codon locations can be considered as positive or negative indicators for the codon being in a coding region. The “coding sextuple word preferences” coding measure works on the principle that certain sextuple nucleotide combinations can be identified which occur more frequently in coding regions of DNA. An instance of this would be the sextuple ACCGTA in the coding sequence CACACGACCGTACTCACAT. Through examining known coding and non-coding regions, n-tuple words can be identified which have higher probabilities of being observed in either coding or non-coding regions of DNA.
Many coding measures are available, but this approach specifies the use of nine to form the input to the ANN; 2 at the nucleotide level and 7 at the word (n-tuple) level.

Frame bias matrix
Fickett Feature
Coding Sextuple word preferences
Coding sextuple in-frame word preferences
Word preferences in frame 1
Word preferences in frame 2
Word preferences in frame 3
Maximum word preferences in frames
Sextuple word commonality
Repetitive Sextuple word

To evaluate a nucleotide, a window of 99 nucleotides is isolated. The nucleotides in the window are pre-processed to generate the coding measures, which are then fed into the ANN. The network is a fully connected MLP with 14 hidden layer neurons and a single output representing the derived classification. The flow of data for this approach is demonstrated in Figure 9. Post-processing is performed on the output of the network to improve performance using domain knowledge.
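
The raw statistics underlying two of these measures can be illustrated with a short Python sketch. It does not reproduce the GRAIL definitions of the frame bias matrix or the sextuple word preferences. It only tabulates how often each nucleotide occurs at each codon position and lists the overlapping sextuple words of a sequence. The function names are hypothetical.

```python
def codon_position_frequencies(sequence, frame=0):
    """Relative frequency of A, C, G and T at each of the three codon positions."""
    counts = [{base: 0 for base in "ACGT"} for _ in range(3)]
    for i, base in enumerate(sequence[frame:]):
        if base in counts[i % 3]:  # ignore any non-ACGT symbols
            counts[i % 3][base] += 1
    frequencies = []
    for position in counts:
        total = sum(position.values())
        frequencies.append(
            {base: (n / total if total else 0.0) for base, n in position.items()}
        )
    return frequencies


def sextuple_words(sequence):
    """All overlapping 6-nucleotide words in the sequence, in order."""
    return [sequence[i:i + 6] for i in range(len(sequence) - 5)]
```

Comparing such frequency tables, or the counts of each sextuple word, between known coding and non-coding regions is what would turn these raw statistics into a coding measure.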

Figure 9. Classifying a nucleotide (in the dashed box) as coding or non-coding using an ANN.

The synaptic weights of the ANN, in this situation, were trained using an evolutionary algorithm, which is itself a nature inspired machine learning algorithm. In this evolutionary algorithm, large populations of potential solutions (in this case, sets of synaptic weights) are created and evaluated. In each generation, the half of the potential solutions with lower performance is purged, and the survivors are used as the basis for a new population of the original size. The candidate solutions for the new population are created by modifying a single weight from a solution of the previous generation which showed good performance. In this way, over many generations, useful elements of successful solutions should be propagated and increasingly successful networks should be created. The large space of potential solutions evaluated tends to produce a good (if not the optimal) network solution. 250 000 examples of coding and non-coding nucleotides were used to train the network, with the mean squared error and correct classification percentages used to select the best performing networks in each generation. The performance of this ANN based approach to gene identification was evaluated on two sets of human DNA sequences taken from GenBank. The network was reported to classify the majority of coding nucleotides correctly, with sensitivity (the ratio of true positives to the number of true positives and false negatives) of 74% and 64%, outperforming a number of other systems. In particular, the authors of this study note that 1.4 times more true positives were observed for this approach compared with the GRAIL server on the same data.
However, it was also reported that the network had a high false positive rate (some non-coding regions were incorrectly classified as coding regions), resulting in a specificity (the ratio of true negatives to the number of true negatives and false positives) of only 38% and 42%. The authors attribute this over-sensitivity to coding sub-sequences to the composition of the data used to train the network, as it was split equally between coding and non-coding exemplars, which does not reflect real world sequences where it is estimated that only 2% is coding. It is noted by the authors that a system that reports false positives is preferable to a system that reports false negatives, as it will be less likely to miss coding regions. On the other hand, a system that has a higher ratio of false negatives will report coding regions as being non-coding, and so they will potentially be excluded by researchers from further study.
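
The selection scheme described above (purge the lower-performing half, then refill the population by changing a single weight of a surviving solution) can be sketched as follows. This is an illustrative sketch and not the authors' algorithm: the Gaussian perturbation, the `step` size, the retention of survivors unchanged and the function names are assumptions.

```python
import random


def evolve(population, fitness, generations, rng=None, step=0.1):
    """Evolve weight vectors by truncation selection and single-weight mutation.

    `population` is a list of weight lists (at least two); `fitness` maps a
    weight list to a score, where higher is better.
    """
    rng = rng or random.Random()
    size = len(population)
    for _ in range(generations):
        # Keep the better-scoring half of the population, purge the rest.
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: size // 2]
        # Refill to the original size with mutated copies of survivors.
        children = []
        while len(survivors) + len(children) < size:
            child = list(rng.choice(survivors))
            index = rng.randrange(len(child))
            child[index] += rng.gauss(0.0, step)  # modify exactly one weight
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)
```

In the setting described in the text, `fitness` would score a weight set by the classification performance of the resulting network on the training exemplars. Because survivors are carried over unchanged in this sketch, the best score in the population cannot decrease from one generation to the next.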

Example: Neural Network for Promoter Prediction (NNPP)

A common problem for neural networks is detecting transient patterns which can occur at any point over a subsection of an input signal. Promoter binding sites are an example, as they can occur anywhere in a window of nucleotides relative to the transcription start site (TSS). Typically, this problem can be addressed by (1) training a network with exemplars of the pattern at all possible locations, or (2) training a network on specific exemplars of the pattern and applying the network by brute force to every point in the input space where the pattern may occur. The time delay neural network (TDNN) is a structured approach to this problem which combines elements of both these methods. The TDNN was originally applied to detecting the presence of specific phonemes in speech samples. The TDNN operates by learning feature detectors (the hidden layer neurons) for patterns, which are replicated to cover the input signal in a continuous overlapping manner. Each feature detector examines only a subsection of the input signal, referred to as the detector’s receptive field. The activations of all the feature detectors are then combined to determine if the desired pattern is present anywhere in the signal. TDNN weights are learned using a modified BP algorithm. An input signal is applied to the network in the standard feed-forward manner and BP is used to calculate the error and identify the weight adjustment for each synapse. As a set of replicated feature detectors are all looking for the same pattern (but at different points in the signal), a synaptic weight is actually updated using the average weight adjustment (Δ) generated by BP for the corresponding synapse across all copies of that feature detector. This approach means that the actual offset of the pattern in the exemplar signals does not affect training or recognition. The Neural Network for Promoter Prediction (NNPP) approach employs two of these TDNNs.
Given a nucleotide sequence, each TDNN examines a different overlapping window of nucleotides for patterns representing a TATA box and an initiator box respectively. Both TDNNs are trained separately and subsequently combined to form a super network which can consider the presence or absence of both binding sites and their relative positions to decide if a point in the sequence is a TSS. The input to the TDNN for identifying an initiator box is a window of 25 base pairs, ranging from 14 base pairs upstream to 11 base pairs downstream of the point in the sequence under investigation. Instead of a time delay as for the standard TDNN, each nucleotide in the window is considered as a time slice in a signal. The initiator box detector TDNN employs a receptive field size of 15 base pairs. Therefore, 11 feature detectors are required to cover all possible 15 nucleotide frames in the window, as demonstrated in Figure 10. Each base pair is encoded in four binary bits, so each feature detector will receive 60 (4*15) synaptic connections, connecting it to a subset of the window. The TDNN for detecting TATA boxes works in exactly the same manner, but examines a window from 40 base pairs upstream to 10 base pairs upstream of the point in the sequence under investigation.
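
The arithmetic behind the initiator box detector (a 25 base window, 15 base receptive fields, 11 feature detectors, 60 connections each) and the averaged weight update can be checked with a small Python sketch. The one-hot form of the four-bit encoding and the function names are assumptions. The sketch is not the NNPP implementation.

```python
BASES = "ACGT"


def encode(nucleotides):
    """Encode each base as four binary values (assumed one-hot), concatenated."""
    bits = []
    for base in nucleotides:
        bits.extend(1 if base == b else 0 for b in BASES)
    return bits


def receptive_fields(window, field_size):
    """Every contiguous frame of `field_size` bases inside `window`."""
    return [window[i:i + field_size] for i in range(len(window) - field_size + 1)]


def shared_weight_update(deltas):
    """Average the adjustments proposed for one synapse by each detector copy."""
    return sum(deltas) / len(deltas)
```

Applying `receptive_fields` to a 25 base window with a field size of 15 yields 11 frames, each of which encodes to 60 binary inputs.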

Figure 10. The windowed subsection of the input sequence and the receptive frames for the initiator box. Each receptive field frame is connected to a separate feature detector.

The NNPP was tested on the Adh region of the Drosophila genome. The data set comprised 2.9 million nucleotides with 92 annotated promoters. The NNPP super-network accepts a window of 51 bases comprising the two overlapping windows used by the pair of hidden layers. The window is moved along the entire sequence and a score is generated for each nucleotide as a TSS. The scores are post-processed using a simple smoothing function as part of the NNPP process. The NNPP approach correctly identified 69 of the 92 known promoters (a sensitivity of 75%), and achieved 99.82% specificity. If a more exacting threshold was applied to only accept promoter classifications where the NNPP has a confidence in the prediction of greater than or equal to 97%, the specificity increased to 99.96% (1 false positive per 2416 nucleotides), but the NNPP could still successfully detect 38% of the known promoters. Although the results produced cannot account for all promoter regions, the low levels of false positives observed have helped NNPP find a great number of applications in identifying and verifying putative TSSs, often to complement other TSS identification approaches.
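
The reported figures can be reproduced from the definitions of sensitivity and specificity with a few lines of Python. Treating "1 false positive per 2416 nucleotides" as 2415 true negatives for every false positive is an approximation made here for illustration, since it ignores the small number of true promoter positions.

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of actual positives that were detected."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Proportion of actual negatives that were correctly rejected."""
    return true_negatives / (true_negatives + false_positives)


# 69 of the 92 annotated promoters were identified.
promoter_sensitivity = sensitivity(69, 92 - 69)

# Roughly 1 false positive in every 2416 non-promoter nucleotides.
strict_specificity = round(100 * specificity(2415, 1), 2)
```
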

Gene-Gene Interaction

Genome wide association studies (GWAS) are used to identify genetic risk factors for common diseases. Genetic association studies directly compare the genotypes of target (displaying a specific trait or condition) and control populations. Any single nucleotide polymorphism (SNP) that is common in the target group and rare in the control group is taken as a likely contributing factor. Although examining individual SNPs in isolation has identified many genetic risk factors across a range of conditions such as type II diabetes and HDL-cholesterol, this approach has not been able to explain much of the causation thought to be attributable to genetic variation. Examining the target population in terms of epistasis (two or more interacting genes) is significantly more difficult because of the “curse of dimensionality”; as the pre-requisite for the situation (disease) becomes more complex, the amount of representative data is reduced and it becomes more difficult to identify. Additionally, epistatic interactions are typically observed with low effect sizes.

Example: ATHENA

ATHENA (Analysis Tool for Heritable and Environmental Network Associations) employs a neural network as a means of data mining such gene-gene interactions from genome wide association studies. Data mining is the process of discovering unknown patterns in large data sets, in this case the identification of the epistatic SNPs among a large number of unrelated SNPs. This approach uses a form of neuroevolution, grammatical evolution neural networks (GENN), to efficiently search the space of possible network inputs (feature selection) without the need for brute-force trial of all possible two locus SNP combinations. Similar to the approach of Corne et al. described previously, the GENN used in ATHENA is a form of evolutionary algorithm which evolves a population of differing neural network solutions, but GENN also attempts to evolve the architecture of the network. Combinations of inputs and hidden layer neurons which show relevance to predicting potential disease cases are replicated and disseminated across an increasing number of solutions over the following generations. Crossover and mutation are used to evolve new generations of the population. Allele variations, which form the inputs to the network, are encoded as (−1, 0, +1), representing the three forms of a gene with an SNP (AA, Aa, aa). The process was evaluated in silico using a simulation study, so that the true effect of each SNP is known and understood. The exemplars were generated with epistasis occurring under two models: the additive and dominant models. Under the additive model, for example, the penetrance of a disease is increased as a function of the number of recessive alleles; i.e., for AABB penetrance will be at a minimum, but at a maximum for aabb. Two thousand exemplars were generated using the genomeSIMLA application. Each exemplar consisted of 2 epistatic SNPs and 498 irrelevant (to the particular disease penetrance) SNPs.
Narrow-sense heritability was set at only 5% in the generated data, meaning very few of the case exemplars display the epistatic trait. The low epistatic effect size is typical of real world data. A 1% main effect is simulated for each of the epistatic loci. In some of the trials, a hybrid learning approach was used in which the BP algorithm trained the initial network population, and again after a number of generations. BP was run for a maximum of 100 epochs on each network. The authors also investigate the use of existing domain knowledge as a means of filtering the search space. This domain knowledge for the experiment is again simulated, mimicking the scores produced for SNP pairs generated by the Biofilter application. Biofilter examines available databases for published information which supports the selection of pairs of SNPs. The higher the implication score generated by Biofilter, the more support that can be found for that SNP pair. This implication level is simulated in the data by generating 4000 random edges. Trials were performed under differing implication levels, differing proportions of the population intelligently initialized using the domain knowledge, and in the presence or absence of backpropagation. To test each combination of these parameters, 100 data sets were generated. Sensitivity was defined as the proportion of those 100 data sets for which the best performing network identified (accepted as inputs) only the two SNPs generating the epistasis in the data set. The results attained by Turner et al.:

Demonstrate the ability of neural networks to identify and model nonlinear interaction in data sets in spite of low effect levels (5%) and in the presence of substantial levels of noise
Display the potential for efficiency gains when domain knowledge is incorporated into large search spaces, which is extremely important in the case of large scale problems such as genome wide association studies
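
The genotype encoding and the sensitivity measure used in this evaluation can be illustrated with a brief Python sketch. The assignment of −1 to AA, 0 to Aa and +1 to aa follows the order in which the values are listed in the text and is otherwise an assumption, as are the function names and the SNP identifiers used in the usage note.

```python
GENOTYPE_CODES = {"AA": -1, "Aa": 0, "aa": 1}  # assumed order of the three codes


def encode_genotypes(genotypes):
    """Map a list of genotype strings to the network's numeric inputs."""
    return [GENOTYPE_CODES[g] for g in genotypes]


def detection_rate(selected_per_data_set, causal_snps):
    """Proportion of data sets whose best network used exactly the causal SNPs."""
    causal = set(causal_snps)
    hits = sum(1 for selected in selected_per_data_set if set(selected) == causal)
    return hits / len(selected_per_data_set)
```

For instance, if the best networks for four hypothetical data sets selected `["s1", "s2"]`, `["s2", "s1"]`, `["s1"]` and `["s1", "s2", "s9"]`, and the causal pair is `["s1", "s2"]`, only the first two count as successes.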

Conclusions

Neural networks are a potentially powerful tool for bioinformatics, with reported successful applications across many areas and levels of the domain. The example applications given here show ANNs as being able to identify and model complex patterns and manage large data sets, which can be both sparse and noisy. The theory of neural networks is still evolving as the problems faced are changing. For example, Hawkins’s Hierarchical Temporal Memory (HTM) is an ANN model suggested as an alternative to storing the large amounts of data common in commercial and bioinformatics domains. Built on an improving understanding of how the brain works, the HTM is a rough approximation of how layer 3 of the neocortex operates. It is a more biologically plausible ANN, which contends that the brain is a memory system as opposed to a processor (as with MLPs). The approach postulates that data in many domains decreases in relevance as it ages. Instead of storing all the data, the HTM builds a model encoding the patterns in the data, and constantly updates the model as new data becomes available. The HTM is capable of identifying and modeling spatial (combinations that occur together) and temporal (spatial patterns that occur together over time) patterns, and of detecting anomalies in large data sets.

Another area of ANN research which is gaining in popularity, as its power is being fully understood, is the idea of a “deep neural network” (DNN): a neural network comprising many hidden layers. These deep network architectures can be powerful, but the typical backpropagation algorithm can struggle or become intractable when it is required to learn many hidden layers, as the error signal being back propagated is constantly reducing. Hinton et al. describe a variation of the restricted Boltzmann machine (RBM) neural network approach capable of learning many layers, where each layer is a further abstraction of features in the training data. These feature abstractions of the network are trained to encode an input signal (the training exemplars) through a number of layers, and decode it back through the network to be able to replicate a good approximation of the original signal. In bioinformatics, a common problem is the lack of classified data. Generative models, such as Hinton’s RBM, can be used to mitigate this issue, as they can handle the difficult problem of learning these abstractions without the need for labeled data. A smaller number of labeled exemplars can then be used to train the network to act as a classifier.

ANNs may not always be the best approach to solving a problem. Although ANNs work in the absence of key domain knowledge, significant domain knowledge can be required in selecting the inputs and knowing how best to pre-process the input values. However, identifying what is relevant is often an easier task than defining how these values should be interpreted. If a problem is well understood, and can be addressed using a set of known and understood rules, this can be preferable to, or less error prone than, the decisions or interpretations of stimuli produced by a neural network. There is also the need for a sufficient amount of accurately classified training data to be available to adequately describe the range of situations the network must learn, which may not be readily available.

63 in total

1. The PSIPRED protein structure prediction server.

Authors: L J McGuffin; K Bryson; D T Jones
Journal: Bioinformatics Date: 2000-04 Impact factor: 6.937

Review 2. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes.

Authors: J V Tu
Journal: J Clin Epidemiol Date: 1996-11 Impact factor: 6.437

3. A recurrent neural network for closed-loop intracortical brain-machine interface decoders.

Authors: David Sussillo; Paul Nuyujukian; Joline M Fan; Jonathan C Kao; Sergey D Stavisky; Stephen Ryu; Krishna Shenoy
Journal: J Neural Eng Date: 2012-03-19 Impact factor: 5.379

4. EEG segmentation for improving automatic CAP detection.

Authors: Sara Mariani; Andrea Grassi; Martin O Mendez; Giulia Milioli; Liborio Parrino; Mario G Terzano; Anna M Bianchi
Journal: Clin Neurophysiol Date: 2013-05-01 Impact factor: 3.708

5. Artificial neural network-aided image analysis system for cell counting.

Authors: P J Sjöström; B R Frydel; L U Wahlberg
Journal: Cytometry Date: 1999-05-01

6. Support vector machines for diagnosis of breast tumors on US images.

Authors: Ruey-Feng Chang; Wen-Jie Wu; Woo Kyung Moon; Yi-Hong Chou; Dar-Ren Chen
Journal: Acad Radiol Date: 2003-02 Impact factor: 3.173

7. ATHENA: a tool for meta-dimensional analysis applied to genotypes and gene expression data to predict HDL cholesterol levels.

Authors: Emily R Holzinger; Scott M Dudek; Alex T Frase; Ronald M Krauss; Marisa W Medina; Marylyn D Ritchie
Journal: Pac Symp Biocomput Date: 2013

8. Diagnosis of Hashimoto's thyroiditis in ultrasound using tissue characterization and pixel classification.

Authors: U R Acharya; S Vinitha Sree; M R K Mookiah; R Yantri; F Molinari; W Zieleźnik; J Małyszek-Tumidajewicz; B Stępień; R H Bardales; A Witkowska; J S Suri
Journal: Proc Inst Mech Eng H Date: 2013-04-16 Impact factor: 1.617

9. Segmentation, feature extraction, and multiclass brain tumor classification.

Authors: Jainy Sachdeva; Vinod Kumar; Indra Gupta; Niranjan Khandelwal; Chirag Kamal Ahuja
Journal: J Digit Imaging Date: 2013-12 Impact factor: 4.056

10. Sequence memory for prediction, inference and behaviour.

Authors: Jeff Hawkins; Dileep George; Jamie Niemasik
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2009-05-12 Impact factor: 6.237
