Abstract
Correct prediction of the structure of protein-coding genes of higher eukaryotes is still a difficult task; therefore, public databases are heavily contaminated with mispredicted sequences. The high rate of misprediction has serious consequences because it significantly affects the conclusions that may be drawn from genome-scale sequence analyses of eukaryotic genomes. Here we present the MisPred database and computational pipeline that provide efficient means for the identification of erroneous sequences in public databases. The MisPred database contains a collection of abnormal, incomplete and mispredicted protein sequences from 19 metazoan species identified as erroneous by MisPred quality control tools in the UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, NCBI/RefSeq and EnsEMBL databases. Major releases of the database are automatically generated and updated regularly. The database (http://www.mispred.com) is easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats. DATABASE URL: http://www.mispred.com.
Year: 2013 PMID: 23864220 PMCID: PMC3713709 DOI: 10.1093/database/bat053
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. MisPred annotation of an erroneous protein sequence. The figure shows the entry for a protein sequence of X. tropicalis deposited in the NCBI/RefSeq database with the protein ID NP_001072931.1 and in the UniProtKB/TrEMBL database with the protein ID Q08CW3_XENTR. The protein was identified as erroneous by MisPred tool 4 (domain size deviation) because it contains only a fragment of a domain (Pfam-A domain PF01822, WSC).
Percentage of mispredicted sequences in various databases
| Database | Number of proteins | Identified as suspicious by MisPred | Percentage (%) |
|---|---|---|---|
| UniProtKB/SwissProt (release 2012_05, May 2012) | 59 000 | 2245 | 3.81 |
| UniProtKB/TrEMBL (release 2012_05, May 2012) | 598 362 | 65 786 | 10.99 |
| EnsEMBL (release 67, May 2012) | 392 818 | 34 050 | 8.67 |
| NCBI/RefSeq (May 2012) | 374 046 | 24 996 | 6.68 |
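
The percentage column is the number of flagged sequences divided by the number of proteins in each database. As a minimal sketch, the following recomputes that column from the counts in the table above; the names `counts` and `suspicious_percentage` are illustrative and are not part of MisPred.

```python
# Counts copied from the table above: (number of proteins, flagged by MisPred).
counts = {
    "UniProtKB/SwissProt": (59_000, 2_245),
    "UniProtKB/TrEMBL": (598_362, 65_786),
    "EnsEMBL": (392_818, 34_050),
    "NCBI/RefSeq": (374_046, 24_996),
}


def suspicious_percentage(total: int, flagged: int) -> float:
    """Return flagged / total as a percentage, rounded to two decimals."""
    return round(100.0 * flagged / total, 2)


percentages = {db: suspicious_percentage(*pair) for db, pair in counts.items()}

for db, pct in percentages.items():
    print(f"{db}: {pct:.2f}%")
```

Because the same protein may be deposited in more than one database, these rates are best compared per database rather than pooled.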
Percentage of mispredicted sequences in 19 metazoan species
| Database (per species) | Number of proteins | Identified as suspicious by MisPred | Percentage (%) |
|---|---|---|---|
| UniProtKB/SwissProt | 20 215 | 762 | 3.77 |
| UniProtKB/TrEMBL | 101 629 | 22 790 | 22.42 |
| EnsEMBL | 83 139 | 10 798 | 12.99 |
| NCBI/RefSeq | 23 125 | 1183 | 5.12 |
| UniProtKB/SwissProt | 16 526 | 588 | 3.56 |
| UniProtKB/TrEMBL | 61 249 | 6609 | 10.79 |
| EnsEMBL | 50 702 | 4471 | 8.82 |
| NCBI/RefSeq | 26 251 | 1096 | 4.18 |
| UniProtKB/SwissProt | 7750 | 251 | 3.24 |
| UniProtKB/TrEMBL | 33 859 | 2637 | 7.79 |
| EnsEMBL | 32 780 | 2042 | 6.23 |
| NCBI/RefSeq | 25 304 | 1212 | 4.79 |
| UniProtKB/SwissProt | 2244 | 100 | 4.46 |
| UniProtKB/TrEMBL | 26 611 | 2649 | 9.95 |
| EnsEMBL | 21 866 | 1874 | 8.57 |
| NCBI/RefSeq | 17 365 | 990 | 5.70 |
| UniProtKB/SwissProt | 43 | 2 | 4.65 |
| UniProtKB/TrEMBL | 32 893 | 2452 | 7.45 |
| EnsEMBL | 32 422 | 2359 | 7.28 |
| NCBI/RefSeq | 18 979 | 996 | 5.25 |
| UniProtKB/SwissProt | 1655 | 56 | 3.38 |
| UniProtKB/TrEMBL | 28 865 | 1898 | 6.58 |
| EnsEMBL | 22 579 | 1522 | 6.74 |
| NCBI/RefSeq | 22 515 | 1288 | 5.72 |
| UniProtKB/SwissProt | 2821 | 159 | 5.64 |
| UniProtKB/TrEMBL | 53 846 | 5307 | 9.86 |
| EnsEMBL | 39 423 | 3662 | 9.29 |
| NCBI/RefSeq | 26 263 | 2094 | 7.97 |
| UniProtKB/SwissProt | 173 | 12 | 6.94 |
| UniProtKB/TrEMBL | 48 816 | 4873 | 9.98 |
| EnsEMBL | 47 728 | 3999 | 8.38 |
| NCBI/RefSeq | 442 | 15 | 3.39 |
| UniProtKB/TrEMBL | 19 010 | 1523 | 8.01 |
| EnsEMBL | 17 281 | 1377 | 7.97 |
| NCBI/RefSeq | 13 825 | 837 | 6.05 |
| UniProtKB/SwissProt | 55 | 1 | 1.82 |
| UniProtKB/TrEMBL | 29 164 | 3040 | 10.42 |
| NCBI/RefSeq | 29 226 | 3042 | 10.41 |
| UniProtKB/SwissProt | 109 | 7 | 6.42 |
| UniProtKB/TrEMBL | 29 430 | 3189 | 10.84 |
| NCBI/RefSeq | 24 414 | 2804 | 11.49 |
| UniProtKB/SwissProt | 3152 | 136 | 4.31 |
| UniProtKB/TrEMBL | 33 786 | 2103 | 6.22 |
| EnsEMBL | 19 460 | 853 | 4.38 |
| NCBI/RefSeq | 19 577 | 860 | 4.39 |
| UniProtKB/SwissProt | 199 | 8 | 4.02 |
| UniProtKB/TrEMBL | 18 624 | 1390 | 7.46 |
| NCBI/RefSeq | 15 359 | 954 | 6.21 |
| UniProtKB/TrEMBL | 495 | 34 | 6.87 |
| NCBI/RefSeq | 15 995 | 824 | 5.15 |
| UniProtKB/SwissProt | 3354 | 140 | 4.17 |
| UniProtKB/TrEMBL | 22 309 | 971 | 4.35 |
| EnsEMBL | 25 438 | 1093 | 4.30 |
| NCBI/RefSeq | 23 229 | 1056 | 4.55 |
| UniProtKB/SwissProt | 566 | 17 | 3.00 |
| UniProtKB/TrEMBL | 21 341 | 1007 | 4.72 |
| NCBI/RefSeq | 19 222 | 986 | 5.13 |
| UniProtKB/TrEMBL | 157 | 7 | 4.46 |
| NCBI/RefSeq | 17 002 | 1467 | 8.63 |
| UniProtKB/SwissProt | 115 | 5 | 4.35 |
| UniProtKB/TrEMBL | 24 717 | 2493 | 10.09 |
| NCBI/RefSeq | 24 430 | 2483 | 10.16 |
| UniProtKB/SwissProt | 23 | 1 | 4.35 |
| UniProtKB/TrEMBL | 11 561 | 814 | 7.04 |
| NCBI/RefSeq | 11 523 | 809 | 7.02 |
Figure 2. MisPred analysis of a protein sequence for potential sequence errors. The sequence shown in Figure 1 was analysed with the various MisPred tools. This figure shows basic information about the input protein sequence (automatically generated sequence ID, species name, protein sequence, task status and date and time of the completion of the analysis).
Figure 3. MisPred analysis of a protein sequence for potential sequence errors. The sequence shown in Figure 1 was analysed with the various MisPred tools. The figure shows the primary conclusions based on the analyses for signal peptide, Pfam-A domains, transmembrane helix, GPI anchor, domain-size integrity and chromosomal localization of the exons encoding the protein. In the rows showing the Pfam-A domains present in this protein, the different characters represent the output of the HMMscan program. For example, in the first row, the characters (from left to right) indicate the model used (ls), the domain type identified (PF00051.10), the number of copies of this domain type in this protein (1), the first and last residues of the domain, defined by residue numbering of this protein (25 106), the first and last residues of the HMM of this domain type that align with PF00051 of this protein (1 85), the score of the match (84.6) and the E-value of the match (2.1e-24). Note that these analyses revealed that the protein is a secreted extracellular protein that contains a secretory signal peptide and two types of extracellular domains. In harmony with the extracellular localization of the protein, it does not contain intracellular signaling domains, nuclear domains or transmembrane helices. However, the protein is erroneous inasmuch as one of its extracellular protein domains, the Pfam-A domain PF01822 (WSC-domain), is truncated, an error that is detected by MisPred tool 4 (domain-size deviation).
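
The field order described in this caption can be illustrated with a small parser. This is only a sketch: it assumes the row is a single whitespace-separated line with exactly the nine fields listed above and that the E-value is written without an internal space; the `DomainHit` type and `parse_hit` function are hypothetical names, not part of MisPred or HMMER.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    model: str        # model used, e.g. "ls"
    accession: str    # Pfam-A domain type, e.g. "PF00051.10"
    copies: int       # copies of this domain type in the protein
    seq_start: int    # first residue of the domain in the protein
    seq_end: int      # last residue of the domain in the protein
    hmm_start: int    # first aligned position of the HMM
    hmm_end: int      # last aligned position of the HMM
    score: float      # score of the match
    evalue: float     # E-value of the match


def parse_hit(row: str) -> DomainHit:
    """Parse one whitespace-separated row (assumed layout) into a DomainHit."""
    fields = row.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}: {row!r}")
    copies, seq_start, seq_end, hmm_start, hmm_end = map(int, fields[2:7])
    return DomainHit(fields[0], fields[1], copies, seq_start, seq_end,
                     hmm_start, hmm_end, float(fields[7]), float(fields[8]))


# Row assembled from the values quoted in the caption.
hit = parse_hit("ls PF00051.10 1 25 106 1 85 84.6 2.1e-24")
domain_length = hit.seq_end - hit.seq_start + 1
print(hit.accession, domain_length)
```

Keeping the sequence and HMM coordinates as separate fields makes it straightforward to compare how much of the protein and how much of the model each match covers.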
Figure 4. MisPred analysis of a protein sequence for potential sequence errors. The sequence shown in Figure 1 was analysed with the various MisPred tools. This figure summarizes the conclusions: the sequence violates only one of the MisPred rules, namely that the size of one of its domains deviates significantly from the size typical of the given domain family. Note that conflict 11 is missing from the list of sequence error types, as MisPred tool 11 is not yet available in searches on the MisPred website. This tool will be released in the next update of MisPred.
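
The idea of a domain-size deviation check can be sketched as follows. The text here does not state the criterion MisPred tool 4 actually applies, so this example assumes a simple rule: a domain is flagged when its length lies more than a chosen number of standard deviations from the family mean. The function name, the `max_sd` threshold and the example numbers are all hypothetical and are not taken from MisPred or Pfam.

```python
def deviates_in_size(observed_len: int, family_mean: float,
                     family_sd: float, max_sd: float = 2.0) -> bool:
    """Return True if observed_len is more than max_sd standard
    deviations away from the family mean (assumed criterion)."""
    if family_sd <= 0:
        raise ValueError("family_sd must be positive")
    z = abs(observed_len - family_mean) / family_sd
    return z > max_sd


# Illustrative values only: a family with mean length 80 and SD 8.
truncated = deviates_in_size(40, family_mean=80.0, family_sd=8.0)
typical = deviates_in_size(78, family_mean=80.0, family_sd=8.0)
print(truncated, typical)
```

Under this assumed rule, a fragment of a domain such as the truncated WSC domain described above would be flagged, while a full-length copy would pass.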