Jonathan G Lees, Andrew J Miles, Robert W Janes, B A Wallace.
Abstract
BACKGROUND: Circular Dichroism (CD) spectroscopy is a widely used method for studying protein structures in solution. Modern synchrotron radiation CD (SRCD) instruments have considerably higher photon fluxes than do conventional lab-based CD instruments, and hence have the ability to routinely measure CD data to much lower wavelengths. Recently a new reference dataset of SRCD spectra of proteins of known structure, designed to cover secondary structure and fold space, has been produced which includes low wavelength (vacuum ultraviolet - VUV) data. However, the existing algorithms used to calculate protein secondary structures from CD data have not been designed to take optimal advantage of the additional information in these low wavelength data.
Year: 2006 PMID: 17112372 PMCID: PMC1676025 DOI: 10.1186/1471-2105-7-507
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The cross-validation performance of various algorithms using the SP175 reference dataset [9] with the standard [1] secondary structure assignment scheme.
| 0.048 | 0.956 | 0.046 | 0.960 | 0.041 | 0.970 | ||||
| 0.809 | 0.036 | 0.791 | 0.037 | 0.779 | |||||
| 0.073 | 0.792 | 0.064 | 0.849 | 0.063 | 0.853 | ||||
| 0.020 | 0.913 | 0.023 | 0.889 | 0.025 | 0.867 | ||||
| 0.052 | 0.325 | 0.053 | 0.297 | 0.052 | 0.319 | ||||
| 0.050 | 0.717 | 0.046 | 0.770 | 0.050 | 0.720 | ||||
| 0.049 | 0.954 | 0.048 | 0.956 | 0.042 | 0.969 | ||||
| 0.037 | 0.776 | 0.037 | 0.778 | 0.038 | 0.764 | ||||
| 0.083 | 0.725 | 0.067 | 0.832 | 0.065 | 0.841 | ||||
| 0.023 | 0.891 | 0.024 | 0.880 | 0.026 | 0.857 | ||||
| 0.055 | 0.261 | 0.054 | 0.277 | 0.052 | 0.295 | ||||
| 0.055 | 0.671 | 0.047 | 0.754 | 0.054 | 0.683 | ||||
The (nr) tag indicates that the cross-validation was carried out under more stringent (non-redundant) conditions such that no proteins in the training set with the same CATH homologous superfamily as that of the test protein were included. The best results (lowest δ or highest r) for each secondary structure type with the SP175 and SP175(nr) datasets are shown in bold.
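The non-redundant condition described above can be sketched as a leave-one-out split that also drops every training protein sharing the test protein's CATH homologous superfamily. This is a minimal illustration and not the authors' code: the function names and the superfamily labels are hypothetical, and it assumes that δ is the root mean square deviation and r the Pearson correlation coefficient between predicted and observed secondary structure fractions, which this excerpt does not define.

```python
import math


def training_indices(test_idx, superfamilies, non_redundant=False):
    """Indices of reference proteins used for training when test_idx is held out.

    superfamilies[i] is the (hypothetical) CATH homologous superfamily label
    of protein i. With non_redundant=True, proteins sharing the held-out
    protein's superfamily are excluded as well.
    """
    held_out = superfamilies[test_idx]
    return [
        i
        for i, sf in enumerate(superfamilies)
        if i != test_idx and not (non_redundant and sf == held_out)
    ]


def rmsd(predicted, observed):
    """Root mean square deviation (assumed here to correspond to delta)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def pearson_r(predicted, observed):
    """Pearson correlation coefficient (assumed here to correspond to r)."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


# Illustrative labels only: proteins 0 and 1 share a superfamily.
labels = ["A", "A", "B", "C"]
standard_split = training_indices(0, labels)
nr_split = training_indices(0, labels, non_redundant=True)
```

In this toy case the standard split trains on proteins 1, 2 and 3, while the (nr) split trains only on proteins 2 and 3, which is why the (nr) columns are the harder test.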
The cross-validation performances for different types of secondary structure assignments using the SP175 dataset with the PLS algorithm.
| β-sheet (parallel) | 0.060 | 0.233 | 0.99 | 6 | 0.02 |
| β-Turn II | 0.032 | 0.125 | 1.00 | 1 | 0.03 |
The results shown are for the optimal number of principal components, k. n is the proportion of residues in the reference dataset identified as having this type of secondary structure. Those secondary structures with ζ values greater than 1.0 are shown in bold.
The cross-validation performances of various algorithms using the alternative secondary structure assignment scheme.
| 0.968 | 0.895 | 0.875 | 0.687 | 0.968 | 0.883 | 0.861 | 0.678 | ||||
| 0.054 | 0.022 | 0.034 | 0.036 | 0.055 | 0.023 | 0.036 | 0.036 | ||||
| 0.966 | 0.894 | 0.684 | 0.841 | 0.965 | 0.881 | 0.677 | 0.837 | ||||
| 0.056 | 0.022 | 0.036 | 0.052 | 0.057 | 0.023 | 0.036 | 0.053 | ||||
| 0.889 | 0.863 | 0.641 | 0.839 | 0.881 | 0.854 | 0.628 | 0.833 | ||||
| 0.023 | 0.035 | 0.038 | 0.052 | 0.024 | 0.036 | 0.039 | 0.053 | ||||
| 0.971 | 0.868 | 0.867 | 0.835 | 0.969 | 0.856 | 0.846 | 0.830 | ||||
| 0.053 | 0.025 | 0.035 | 0.053 | 0.054 | 0.026 | 0.037 | 0.054 | ||||
| 0.957 | 0.911 | 0.811 | 0.640 | 0.827 | 0.954 | 0.888 | 0.751 | 0.530 | 0.772 | ||
| 0.063 | 0.021 | 0.041 | 0.039 | 0.054 | 0.065 | 0.023 | 0.047 | 0.043 | 0.062 | ||
| 0.958 | 0.815 | 0.669 | 0.796 | 0.955 | 0.774 | 0.668 | 0.771 | ||||
| 0.062 | 0.040 | 0.037 | 0.058 | 0.065 | 0.045 | 0.037 | 0.062 | ||||
The (nr) tag indicates that the cross-validation was carried out under the more stringent conditions where no proteins in the training set with the same CATH homologous superfamily as that of the test protein were included. The best results for each secondary structure type for the standard and non-redundant datasets are shown in bold.
The cross-validation performances of various algorithms using the three-state α-helix (H), β-sheet (E) and other (O) assignment scheme.
| 0.063 | 0.957 | 0.083 | 0.862 | 0.078 | 0.701 | 0.065 | 0.954 | 0.090 | 0.833 | 0.083 | 0.672 | |
| 0.062 | 0.958 | 0.070 | 0.904 | 0.071 | 0.757 | 0.065 | 0.955 | 0.072 | 0.897 | 0.073 | 0.746 | |
| 0.055 | 0.968 | 0.070 | 0.905 | 0.065 | 0.800 | 0.056 | 0.967 | 0.071 | 0.901 | 0.065 | 0.797 | |
| 0.057 | 0.966 | 0.069 | 0.906 | 0.066 | 0.796 | 0.058 | 0.965 | 0.071 | 0.902 | 0.066 | 0.792 | |
| 0.053 | 0.971 | 0.073 | 0.895 | 0.068 | 0.781 | 0.074 | 0.893 | 0.069 | 0.774 | |||
| 0.070 | 0.902 | 0.066 | 0.796 | 0.054 | 0.970 | 0.072 | 0.900 | 0.066 | 0.790 | |||
| 0.055 | 0.968 | 0.067 | 0.912 | 0.062 | 0.816 | 0.056 | 0.967 | 0.068 | 0.909 | 0.064 | 0.805 | |
| 0.057 | 0.965 | 0.056 | 0.964 | |||||||||
| 0.057 | 0.966 | 0.069 | 0.908 | 0.066 | 0.792 | 0.060 | 0.964 | 0.072 | 0.902 | 0.067 | 0.785 | |
The (nr) tag indicates that the cross-validation was carried out under more stringent conditions where no proteins in the training or validation set with the same CATH homologous superfamily as that of the test protein were included. The best results for each secondary structure type are shown in bold.
The cross-validation performance of the NN method using various numbers of hidden neurons.
| 0.058 | 0.965 | 0.081 | 0.873 | 0.086 | 0.609 | |
| 0.063 | ||||||
| 0.068 | 0.909 | 0.063 | 0.815 | |||
| 0.967 | 0.063 | 0.815 | ||||
The secondary structure assignment scheme is the three-state α-helix (H), β-sheet (E) and other (O). The best results for each secondary structural type are highlighted in bold; they indicate that 7 neurons are marginally optimal overall for the SP175 dataset.
Figure 1 – The effect of low-wavelength cut-off on performance accuracy. The ζ parameter was calculated for the SIMPLS algorithm using various low-wavelength cut-offs applied to the SP175 dataset. In shades of red are the performance curves for α-helices (H) and in shades of blue are those for β-sheets (E). The thick solid lines indicate the performance using k = 8 (i.e., 8 principal components). The thin lines are derived using progressively smaller values of k. The dashed lines are for values calculated using SELMAT3 instead of SIMPLS.
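Applying a low-wavelength cut-off amounts to discarding every data point below a chosen wavelength before the analysis is run. The following is a rough sketch of that truncation step only; the function name, the wavelength grid and the CD values are hypothetical and are not taken from the SP175 dataset.

```python
def apply_low_wavelength_cutoff(wavelengths_nm, spectrum, cutoff_nm):
    """Keep only the points measured at or above cutoff_nm.

    wavelengths_nm and spectrum are parallel lists; the returned pair
    preserves the original ordering of the retained points.
    """
    kept = [(w, v) for w, v in zip(wavelengths_nm, spectrum) if w >= cutoff_nm]
    return [w for w, _ in kept], [v for _, v in kept]


# Hypothetical four-point spectrum, truncated at 190 nm.
wavelengths = [175, 180, 190, 200]
values = [1.0, 2.0, 3.0, 4.0]
cut_wavelengths, cut_values = apply_low_wavelength_cutoff(wavelengths, values, 190)
```

Repeating the cross-validation for a series of cut-off values would give one performance point per cut-off, which is the kind of curve the figure plots.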
Figure 2a-c – The effect of low wavelength cut-off as a function of increasing errors in magnitude. An example of the effect of the low-wavelength cut-off as a function of increasing errors in spectral magnitude on the performance as judged by the parameters A) ζ, B) r, and C) δ, respectively, using the SIMPLS algorithm with the SP175 dataset. The different curves on each plot represent different results obtained after applying progressively larger scale factors to represent errors in magnitude. The numbers represent the variance of the normal distribution from which the scaling factors were randomly chosen. These were for the α-helix (H) secondary structure component as assigned by DSSP.
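The magnitude errors described in this caption can be imitated by multiplying a whole spectrum by a single random scale factor. The sketch below is only an illustration of that idea: the caption gives the variance of the normal distribution, whereas the mean of 1.0 (no error on average), the function name and the sample values are assumptions.

```python
import math
import random


def scale_spectrum(spectrum, variance, rng):
    """Multiply every point of a spectrum by one random scale factor.

    The factor is drawn from a normal distribution with the given variance.
    A mean of 1.0 is assumed here; the source only states the variance.
    """
    factor = rng.gauss(1.0, math.sqrt(variance))
    return [factor * value for value in spectrum]


# Seeded generator so the hypothetical example is repeatable.
rng = random.Random(0)
scaled = scale_spectrum([1.0, 2.0, 4.0], 0.01, rng)
```

Because a single factor is applied to the whole spectrum, the shape of the curve is preserved and only its magnitude changes, which mimics errors such as an inaccurate protein concentration or pathlength.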