Peter Meinicke, Maike Tech, Burkhard Morgenstern, Rainer Merkl.
Abstract
BACKGROUND: Kernel-based learning algorithms are among the most advanced machine learning methods and have been successfully applied to a variety of sequence classification tasks within the field of bioinformatics. Conventional kernels utilized so far do not provide an easy interpretation of the learnt representations in terms of positional and compositional variability of the underlying biological signals.
Year: 2004 PMID: 15511290 PMCID: PMC535353 DOI: 10.1186/1471-2105-5-169
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Illustration of positional uncertainty. Figure 1 illustrates how positional uncertainty is represented by the proposed oligo functions. In the example, a certain oligomer occurs at positions -5 (green curve) and -9 (blue curve). The degree of uncertainty depends on the variance σ² of the Gaussian bumps centered on these positions. When the two Gaussians are added, the smoothness of the resulting oligo function (red curve) increases with increasing σ, and the assignment of the oligomer to particular positions becomes fuzzier.
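The construction described in the caption can be sketched in a few lines of Python. This is a minimal illustration, assuming unnormalized Gaussian bumps of the form exp(-(t - p)² / (2σ²)); the function name `oligo_function` and the evaluation grid are hypothetical and not taken from the paper.

```python
import math

def oligo_function(positions, sigma, t):
    """Sum of Gaussian bumps centered on each occurrence position.

    Assumption: bumps are unnormalized, exp(-(t - p)^2 / (2 * sigma^2)).
    """
    return sum(math.exp(-((t - p) ** 2) / (2.0 * sigma ** 2)) for p in positions)

# The example from the caption: one oligomer occurring at positions -5 and -9.
occurrences = [-5, -9]

for sigma in (0.5, 2.0):
    # Evaluate on integer positions between -12 and -2 (hypothetical grid).
    values = [oligo_function(occurrences, sigma, t) for t in range(-12, -1)]
    print(sigma, [round(v, 2) for v in values])
```

With a small σ the function is close to zero midway between the two occurrences (position -7), while a larger σ merges the two bumps into one smoother curve, which is the fuzziness the caption describes.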
Performance of oligo kernel classifiers with oligomer length K = 1,...,6. The first line shows the mean classification error, given in percent, on the test sets. The rates are averages over 50 runs on randomly partitioned data. The second line shows the standard deviation of the classification error. The last line shows the mean over the 50 optimal values of σ which had been chosen from the set {0.5,0.75,1,1.5,2} for each run to minimize the error on a validation set.
| oligomer length | 1 | 2 | 3 | 4 | 5 | 6 |
| mean (median) error | 11.8 (11.8) | 9.7 (9.6) | 8.9 (8.7) | 9.6 (9.5) | 12.7 (12.6) | 15.0 (15.0) |
| standard deviation | 1.3 | 1.3 | 1.4 | 1.2 | 1.3 | 1.2 |
| mean optimal σ | 0.8 | 0.8 | 1.25 | 1.34 | 1.24 | 1.27 |
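The evaluation protocol behind this table (pick σ per run on a validation set, then summarize the test errors over all runs) might look roughly as follows. This is only a sketch: `validation_error` is a hypothetical callable standing in for training and validating a classifier, and the use of the sample standard deviation is an assumption, since the caption does not say which estimator was used.

```python
import statistics

# Candidate values of sigma as listed in the caption.
SIGMA_GRID = (0.5, 0.75, 1.0, 1.5, 2.0)

def select_sigma(validation_error, grid=SIGMA_GRID):
    """Return the sigma with the lowest validation error for one run.

    `validation_error` is a hypothetical callable: sigma -> error.
    """
    return min(grid, key=validation_error)

def summarize(test_errors, chosen_sigmas):
    """Aggregate per-run results into the rows reported in the table."""
    return {
        "mean error": statistics.mean(test_errors),
        "median error": statistics.median(test_errors),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "standard deviation": statistics.stdev(test_errors),
        "mean optimal sigma": statistics.mean(chosen_sigmas),
    }
```

Because σ is restricted to the five grid values, a mean optimal σ such as 1.25 or 1.34 reflects different grid values winning in different runs, not a single fitted value.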
Performance of oligo kernel classifiers on an enlarged data set with four times as many examples as were used for the EcoGene-based analysis whose results are shown in table 1. The oligomer length again varies over K = 1,...,6. The table shows the mean classification error, given in percent, on the test sets. The rates are averages over 20 runs on randomly partitioned data with the same proportions of training, validation and test sets as for the results shown in table 1. According to the main paradigm of machine learning, we would expect the error to decrease as the data set grows. This is not the case: compared with table 1, the error rates rise by up to 6.4 percentage points. The results therefore indicate that the additional data, which have not been experimentally verified, are distributed differently from the verified TIS sequences from EcoGene. We conclude that these additional data should not be used for the analysis of TIS, because it cannot be excluded that the different distribution is due to erroneous annotation.
| oligomer length | 1 | 2 | 3 | 4 | 5 | 6 |
| mean error | 17.3 | 15.6 | 15.3 | 16.0 | 17.0 | 18.9 |
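The 6.4 figure quoted in the caption can be checked directly from the two tables. The short computation below assumes that the first table above (mean error for K = 1,...,6 on the EcoGene-based data) is the one the caption calls table 1; the variable names are illustrative only.

```python
# Mean test errors in percent, copied from the two tables, for K = 1,...,6.
verified = [11.8, 9.7, 8.9, 9.6, 12.7, 15.0]    # EcoGene-based data
enlarged = [17.3, 15.6, 15.3, 16.0, 17.0, 18.9]  # enlarged data set

# Increase in percentage points for each oligomer length.
increase = [round(e - v, 1) for v, e in zip(verified, enlarged)]

for k, delta in enumerate(increase, start=1):
    print(f"K = {k}: +{delta} percentage points")

print("largest increase:", max(increase))
```

The error increases for every oligomer length, and the largest increase is reached for K = 3 and K = 4.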
Figure 2. Image matrix of discriminative weight functions derived from trained classifiers based on the trimer kernel. Each of the 64 lines shows the values of one trimer-specific weight function obtained from an average over 50 runs (see text). Each of the 200 columns corresponds to a certain position with 0 indicating the position of the start codon. The function values are visualized by the color of the corresponding matrix elements. The complete matrix of function values has been scaled to yield a unit maximum which is located at the ATG line at position 0. For noise reduction all matrix elements with an absolute value below 0.1 have been zeroed. In general, maxima and minima indicate discriminative features which contribute to the prediction of positive (true) and negative (false) TIS, respectively. Note that the region of discriminative features is rather small and mainly concentrated around the start codon on the left (upstream) side of the image.
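The two post-processing steps named in the caption, scaling to a unit maximum and zeroing small entries, can be sketched as below. The function name and the toy matrix are hypothetical; the sketch assumes the global maximum is positive, as it is in the figure (the ATG line at position 0).

```python
def scale_and_threshold(matrix, threshold=0.1):
    """Scale a matrix to unit maximum, then zero entries with |value| < threshold.

    Assumption: the largest entry is positive, so dividing by it keeps signs.
    """
    peak = max(value for row in matrix for value in row)
    scaled = [[value / peak for value in row] for row in matrix]
    return [[value if abs(value) >= threshold else 0.0 for value in row]
            for row in scaled]

# Toy 2 x 3 matrix standing in for the 64 x 200 matrix of weight functions.
toy = [[4.0, -2.0, 0.2],
       [0.3, -0.1, 1.0]]
cleaned = scale_and_threshold(toy)
```

Negative entries survive the thresholding when their magnitude is large enough, which matters because minima are as informative as maxima here.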
Figure 3. Exemplary weight functions derived from trained classifiers based on the trimer kernel. Shown are discriminative weight functions for ATG (the most frequent start codon), GGA (having its highest peak in the Shine-Dalgarno region), AAA (showing a weak maximum downstream of the start codon) and TTT (with higher values in a region ≈ 20 nt upstream of the start codon); function values are plotted versus position. All values are normalized, i.e. they are relative values with respect to a unit global maximum over all functions. The complete set of weight functions for K = 3 can be found on our web page [20].
Figure 4. Oligomer ranking. Figure 4 shows the ten most important oligomers for discrimination based on the trimer, tetramer, pentamer and hexamer kernels. The bars show the relative norm of the oligomer-specific weight functions (see text), i.e. their relevance for classification. All values have been scaled to a unit maximum norm of the most discriminative oligomer.
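A ranking of this kind could be computed as in the following sketch. It assumes each weight function is available as a list of values on a position grid and uses the Euclidean norm; the caption does not state which norm was used, so that choice, like the function name and the toy values, is an assumption.

```python
import math

def rank_oligomers(weight_functions, top=10):
    """Rank oligomers by the norm of their weight function, scaled to unit maximum.

    `weight_functions` maps an oligomer to its sampled weight-function values.
    Assumption: Euclidean norm over the sampled positions.
    """
    norms = {oligo: math.sqrt(sum(v * v for v in values))
             for oligo, values in weight_functions.items()}
    largest = max(norms.values())
    ranked = sorted(norms.items(), key=lambda item: item[1], reverse=True)
    return [(oligo, norm / largest) for oligo, norm in ranked[:top]]

# Toy weight functions (hypothetical values, three positions each).
toy = {"ATG": [0.0, 3.0, 4.0], "GGA": [0.0, 0.0, 2.5], "AAA": [1.0, 0.0, 0.0]}
ranking = rank_oligomers(toy)
```

The most discriminative oligomer always receives the value 1, and all others are expressed relative to it, matching the scaling described for the bars.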
Comparison of oligo kernels (OK) with inhomogeneous Markov models of order 0 (MM1) and order 1 (MM2), based on monomer and dimer occurrences, respectively. All higher-order Markov models led to a severe breakdown in performance, with the error rising to ≈ 30 percent. The best spectrum kernel (SK) among the position-independent oligo kernels (σ → ∞) with K = 1,...,6 is included in the comparison to stress the importance of position information. The table shows the mean classification error, given in percent, on the test sets. The rates are averages over 50 runs on randomly partitioned data. The lowest classification error is achieved by the combined oligo kernel OK1...6, obtained by simply adding the kernels for lengths 1,...,6. The combined oligo kernel is closely followed by the best single-length kernel, the trimer kernel OK3, which still performs better than the two Markov-model-based methods. The "best" position-independent kernel SP2, based on dimer occurrences, performs worst, only slightly better than classification by chance.
| method | OK3 | OK1...6 | MM1 | MM2 | SP2 |
| mean (median) error | 8.9 (8.7) | 8.1 (7.8) | 11.4 (11.4) | 11.3 (11.4) | 44.6 (44.9) |
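How the kernels in this comparison relate to each other can be illustrated with a small sketch. It assumes the oligo kernel is the inner product of two oligo functions built from Gaussian bumps with variance σ², so that each pair of occurrences of the same oligomer at positions p and q contributes a term proportional to exp(-(p - q)² / (4σ²)); the σ-dependent constant factor is omitted. It also assumes both sequences are windows aligned to the same reference position. All function names are hypothetical.

```python
import math
from collections import defaultdict

def occurrences(seq, k):
    """Map each k-mer in seq to the list of positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions

def oligo_kernel(s, t, k, sigma):
    """Position-dependent kernel for oligomers of length k (constant factor omitted)."""
    ps, pt = occurrences(s, k), occurrences(t, k)
    total = 0.0
    for oligo, p_list in ps.items():
        for p in p_list:
            for q in pt.get(oligo, ()):
                total += math.exp(-((p - q) ** 2) / (4.0 * sigma ** 2))
    return total

def combined_kernel(s, t, sigma, lengths=range(1, 7)):
    """Combined kernel: plain sum of the kernels for each oligomer length."""
    return sum(oligo_kernel(s, t, k, sigma) for k in lengths)

def spectrum_kernel(s, t, k):
    """Position-independent limit: product of k-mer counts, summed over k-mers."""
    ps, pt = occurrences(s, k), occurrences(t, k)
    return sum(len(p_list) * len(pt[oligo])
               for oligo, p_list in ps.items() if oligo in pt)
```

For very large σ every exponential term approaches 1, so `oligo_kernel` approaches `spectrum_kernel`, which only counts shared oligomers. For very small σ only occurrences at identical positions contribute. Under these assumptions, this is the sense in which the spectrum kernel discards the position information that the table shows to be essential.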