Christian Blouin, Scott Perry, Allan Lavell, Edward Susko, Andrew J Roger.
Abstract
MOTIVATION: Aligning protein sequences with the best possible accuracy requires sophisticated algorithms. Since the optimal alignment is not guaranteed to be the correct one, even the best alignment is expected to contain sites that do not respect the assumption of positional homology. Because formulating rules to identify these sites is difficult, it is common practice to remove them manually. Although considered necessary in some cases, manual editing is time-consuming and not reproducible. We present here an automated editing method based on the classification of 'valid' and 'invalid' sites.
Year: 2009 PMID: 19770262 PMCID: PMC2778337 DOI: 10.1093/bioinformatics/btp552
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Boxplot of the classification performance for single alignments in the MANUEL corpus. The SVM model for a given alignment A was trained on all alignments in MANUEL other than A. Sn, Sp and Acc stand for sensitivity, specificity and accuracy, respectively. The proportion of valid sites is shown as the balance of the dataset for reference. The balance corresponds to the specificity and accuracy of a baseline classifier that considers all sites valid; the sensitivity of this classifier would be 1.0.
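The baseline described in the Fig. 1 caption can be checked with a short sketch. One assumption is made here: for the balance to equal the "specificity" of an all-valid classifier, Sp has to be computed as TP / (TP + FP), the fraction of predicted-valid sites that are truly valid, and not as the true-negative rate. This is inferred from the caption and is not stated in it. The function name and the toy labels are illustrative only.

```python
def classification_metrics(truth, predicted):
    """Return (Sn, Sp, Acc) for boolean labels where True means 'valid'.

    Assumption: Sp is computed as TP / (TP + FP), which is what makes the
    all-valid baseline score Sp == balance as the caption describes.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)

    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    acc = (tp + tn) / len(pairs)
    return sn, sp, acc


# Toy annotation (hypothetical): 8 valid sites and 2 invalid sites.
truth = [True] * 8 + [False] * 2
balance = sum(truth) / len(truth)

# Baseline classifier: every site is called valid.
baseline = [True] * len(truth)
sn, sp, acc = classification_metrics(truth, baseline)
print(sn, sp, acc, balance)
```

With these toy labels the baseline reaches a sensitivity of 1.0, while Sp and accuracy both equal the balance, so a useful classifier has to exceed the balance on both.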
Cross-validated performances of the classification of valid sites
| Experiment | Sn | Sp | Accuracy |
|---|---|---|---|
| SVM, f | 0.967 | 0.967 | 0.950 |
| SVM, f′ | 0.984 | 0.911 | 0.917 |
| GBLOCKS (a) | 0.431 | 0.999 | 0.584 |
| GBLOCKS (b) | 0.520 | 0.993 | 0.647 |
| GBLOCKS (c) | 0.544 | 0.963 | 0.651 |
| GBLOCKS (d) | 0.909 | 0.735 | 0.694 |

SVM classification was performed on single sites (f) and on a window of three sites (f′). The performance of GBLOCKS was evaluated under four sets of parameters:
(a) Min. block length = 10, no gaps (default).
(b) Min. block length = 5, gaps.
(c) Min. block length = 2, all gaps.
(d) Min. block length = 2, all gaps, <32K non-conserved contiguous positions.
Fraction of sites classified as valid in BaliBASE alignments
| Reference set | Test set | Valid (%) |
|---|---|---|
| Ref1 | All | 81.9 |
| Ref1 | Test1 | 82.7 |
| Ref1 | Test2 | 77.9 |
| Ref1 | Test3 | 83.3 |
| Ref2 | All | 78.7 |
| Ref3 | All | 67.3 |
| Ref3 | Test | 67.6 |
| Ref3 | Test1 | 67.1 |
| Ref4 | All | 26.1 |
| Ref5 | All | 58.9 |
Includes only MSAs with five or more sequences.
Fig. 2. ROC analysis of the conservation scores from AL2CO as a classifier for sequence editing. All sites from the MANUEL corpus were scored using AL2CO's default arguments. These scores were then compared against the corpus' manual annotation.
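As a rough illustration of the ROC analysis in Fig. 2, the sketch below sweeps a threshold over per-site scores and compares the resulting calls with a manual annotation. It assumes that a higher conservation score argues for keeping a site; the scores and labels are made-up toy values, not AL2CO output, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return ROC points (false-positive rate, true-positive rate).

    labels: True for sites annotated as valid.
    Assumption: higher scores indicate sites more likely to be valid.
    """
    n_pos = sum(1 for lab in labels if lab)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Sites sharing a score cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (hypothetical): conservation-like scores and manual annotation.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points, auc)
```

An area close to 0.5 would mean the score carries little information about the manual annotation, while an area close to 1.0 would mean a single threshold nearly reproduces it.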
Fig. 3. Boxplot of the classification performance of 3000 sites with respect to the size of the training set. Training set sizes are expressed in thousands of sites. A subset of the MANUEL corpus was randomly selected as a training set, while another subset of 3000 sites was selected for testing. Each training set size category was evaluated with 100 replicate experiments.
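The resampling protocol behind Fig. 3 can be sketched as follows. This covers only the drawing of training and test subsets, not the SVM training itself. It assumes the two subsets are disjoint, which the caption suggests but does not state; the corpus size, the seed, and the function name are hypothetical.

```python
import random


def replicate_splits(n_sites, train_size, test_size=3000,
                     replicates=100, seed=0):
    """Yield (train_indices, test_indices) for each replicate.

    Assumption: training and test sites are drawn without overlap.
    """
    rng = random.Random(seed)
    for _ in range(replicates):
        # One draw without replacement, then cut into two disjoint parts.
        chosen = rng.sample(range(n_sites), train_size + test_size)
        yield chosen[:train_size], chosen[train_size:]


# Hypothetical corpus of 20,000 sites with a training set of 5,000 sites.
splits = list(replicate_splits(n_sites=20000, train_size=5000))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each replicate would then train a classifier on the first index list and score it on the second, giving one point per replicate in the boxplot for that training set size.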