Abstract
Protein databases are heavily contaminated with erroneous (mispredicted, abnormal and incomplete) sequences, and these erroneous data significantly distort the conclusions drawn from genome-scale protein sequence analyses. In our earlier work we described the MisPred resource, which serves to identify erroneous sequences; here we present the FixPred computational pipeline, which automatically corrects sequences identified by MisPred as erroneous. The current version of the associated FixPred database contains corrected UniProtKB/Swiss-Prot and NCBI/RefSeq sequences from Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Danio rerio, Fugu rubripes, Ciona intestinalis, Branchiostoma floridae, Drosophila melanogaster and Caenorhabditis elegans; future releases of the FixPred database will include corrected sequences of additional metazoan species. The FixPred computational pipeline and database (http://www.fixpred.com) are easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats. Database URL: http://www.fixpred.com.
Year: 2014 PMID: 24705206 PMCID: PMC3975993 DOI: 10.1093/database/bau032
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Flow chart of the FixPred pipeline.
Figure 2. Screen shot of an entry of the FixPred database. The figure shows the corrected version (upper part) of an erroneous protein sequence of G. gallus, deposited in the UniProtKB/Swiss-Prot database with the protein ID FZD3_CHICK (lower part). The FZD3_CHICK protein was identified as erroneous by MisPred tool 4 (domain size deviation) because it contains only a fragment of the Frizzled (PF01534) domain. The erroneous protein was corrected by the FixPred pipeline in Step 2 by identifying a full-length version of the frizzled-3 precursor (NP_001258869.1).
Rate of correction of different types of sequence errors
| Error type identified | Erroneous sequences | Corrected sequences | Apparent rate of correction (%) |
|---|---|---|---|
| MisPred tool 1 | 3394 | 799 | 23.5 |
| MisPred tool 2 | 10 | 2 | 20.0 |
| MisPred tool 3 | 12 | 16 | 133.3 |
| MisPred tool 4 | 2033 | 592 | 29.1 |
| MisPred tool 5 | 890 | 36 | 4.0 |
| MisPred tool 6 | 916 | 4 | 0.4 |
| MisPred tool 7 | 479 | 32 | 6.7 |
| MisPred tool 8 | 50 | 0 | 0.0 |
| MisPred tool 9 | 3 | 0 | 0.0 |
| MisPred tool 10 | 331 | 3 | 0.9 |
(a) Erroneous sequences identified by MisPred tool 11 and corrected by the FixPred pipeline are not yet deposited in the FixPred database. These data will be released in the next update of FixPred.
(b) In the case of MisPred tool 3, correction of an erroneous sequence containing both nuclear and extracellular domains is expected to yield two corrected sequences (Supplementary Figure S1C).
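The apparent rate of correction in the table above is the number of corrected sequences expressed as a percentage of the erroneous sequences. A minimal Python sketch of that arithmetic follows, using counts copied from three rows of the table; the function name `apparent_rate` is illustrative and not part of FixPred.

```python
def apparent_rate(erroneous: int, corrected: int) -> float:
    """Corrected sequences as a percentage of erroneous sequences (one decimal)."""
    return round(100.0 * corrected / erroneous, 1)


# (erroneous, corrected) counts copied from the table above.
rows = {
    "MisPred tool 1": (3394, 799),
    "MisPred tool 3": (12, 16),   # one erroneous entry can yield two corrected ones
    "MisPred tool 4": (2033, 592),
}

rates = {tool: apparent_rate(e, c) for tool, (e, c) in rows.items()}
for tool, rate in rates.items():
    print(f"{tool}: {rate}%")
```

The value above 100% for MisPred tool 3 arises because one erroneous sequence can yield two corrected sequences, which is why the rate is described as apparent.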
Rate of correction of erroneous sequences of different metazoan species
| Species | Erroneous sequences | Corrected sequences | Apparent rate of correction (%) |
|---|---|---|---|
|  | 941 | 331 | 35.2 |
|  | 455 | 106 | 23.3 |
|  | 704 | 178 | 25.3 |
|  | 434 | 93 | 21.4 |
|  | 458 | 118 | 25.8 |
|  | 547 | 176 | 32.2 |
|  | 1376 | 180 | 13.1 |
|  | 507 | 97 | 19.1 |
|  | 1753 | 46 | 2.6 |
|  | 391 | 28 | 7.2 |
|  | 215 | 49 | 22.8 |
|  | 337 | 60 | 17.8 |
Correction of erroneous proteins in different steps of the FixPred pipeline (see Figure 1)
| Steps | Sequences analyzed | Sequences corrected | Proportion corrected | Percent of total correction |
|---|---|---|---|---|
| Total | 8118 | 1462 | 0.18 | 100 |
| Step 2 | 8118 | 822 | 0.10 | 56.2 |
| Step 3 | 7296 | 541 | 0.07 | 37.0 |
| Step 4 | 6755 | 75 | 0.01 | 5.1 |
| Step 5 | 6680 | 73 | 0.01 | 5.0 |
| Step 6 | 6607 | 21 | 0.00 | 1.4 |
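The "Sequences analyzed" column is consistent with a cascade in which each step receives only the sequences left uncorrected by the previous step. The Python sketch below reproduces the table's columns under that assumption, which is a reading of the numbers rather than a description of the pipeline's code; variable names are illustrative.

```python
# Counts copied from the table above.
corrected_per_step = {
    "Step 2": 822,
    "Step 3": 541,
    "Step 4": 75,
    "Step 5": 73,
    "Step 6": 21,
}
total_analyzed = 8118   # "Total" row, sequences analyzed
total_corrected = 1462  # "Total" row, sequences corrected

remaining = total_analyzed
table = []
for step, corrected in corrected_per_step.items():
    proportion = round(corrected / remaining, 2)                  # "Proportion corrected"
    percent_of_total = round(100 * corrected / total_corrected, 1)  # "Percent of total correction"
    table.append((step, remaining, proportion, percent_of_total))
    remaining -= corrected  # assumption: only uncorrected sequences move on

for row in table:
    print(row)
```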
Figure 3. Correction of an erroneous protein sequence by the FixPred pipeline. (A) The upper part of the screen shot shows an H. sapiens protein sequence (NP_001184026.2, trypsin-3 isoform 3 preproprotein) that was identified as erroneous by MisPred tool 1 because it has an extracellular domain but lacks a secretory signal peptide. (B) The erroneous protein was corrected by the FixPred pipeline in Step 2 by identifying a version (NP_002762.2, trypsin-3 isoform 2 preproprotein) that does not suffer from this type of error (see lower part of the screen shot).