Anders Bresell, Bengt Persson.
Abstract
BACKGROUND: Recent sequencing projects and the growth of sequence data banks enable oligopeptide patterns to be characterized on a genome or kingdom level. Several studies have focused on kingdom or habitat classifications based on the abundance of short peptide patterns. There have also been efforts at local structural prediction based on short sequence motifs. Oligopeptide patterns undoubtedly carry valuable information content. Therefore, it is important to characterize these informational peptide patterns to shed light on possible new applications and the pitfalls implicit in neglecting bias in peptide patterns.
Year: 2007 PMID: 17908308 PMCID: PMC2231379 DOI: 10.1186/1471-2164-8-346
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Number of different observed oligopeptides
| Length | Theoretical (20^length) | Swiss-Prot original | Swiss-Prot randomized | Genome original | Genome randomized |
| 4 | 160 000 | 159 999 | 160 000 | 160 000 | 160 000 |
| 5 | 3 200 000 | 3 021 259 | 3 136 980 | 3 196 081 | 3 199 490 |
| 6 | 64 000 000 | 25 025 493 | 34 155 965 | 52 989 609 | 58 435 452 |
Number of different observed oligopeptide patterns of lengths four, five and six in the original and randomized data sets.
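The counts in the table above can be reproduced in principle by sliding a window over every sequence and collecting the distinct windows. The following is a minimal sketch under stated assumptions: the function names are hypothetical, windows containing non-standard residue symbols are skipped, and the randomization shown (shuffling residues within each sequence) is only one plausible scheme, since the procedure actually used is not described in this excerpt.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def theoretical_count(k):
    """Number of possible oligopeptides of length k over 20 residues."""
    return len(AMINO_ACIDS) ** k


def distinct_oligopeptides(sequences, k):
    """Collect the distinct length-k windows found in the sequences."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            # Assumption: windows with non-standard symbols (X, B, Z, ...) are ignored.
            if all(residue in AMINO_ACIDS for residue in window):
                seen.add(window)
    return seen


def shuffle_residues(seq, rng):
    """Hypothetical randomization: permute the residues of one sequence."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


if __name__ == "__main__":
    rng = random.Random(0)
    proteins = ["MKTAYIAKQR", "MKTAYLLKQR"]  # toy data, not from the paper
    randomized = [shuffle_residues(p, rng) for p in proteins]
    for k in (4, 5, 6):
        print(k, theoretical_count(k),
              len(distinct_oligopeptides(proteins, k)),
              len(distinct_oligopeptides(randomized, k)))
```

Shuffling within a sequence preserves each protein's residue composition, so any difference between the original and randomized counts reflects residue order rather than residue abundance.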
Figure 1. Kingdom distributions of protein number and sequence length. The pie charts show the total sequence length and the number of proteins found in the kingdoms of archaea (A), bacteria (B) and eukaryotes (E). The fractions of proteins in bacteria and eukaryotes are approximately the same, but the eukaryotic proteins are 8% longer on average in the genome set compared to Swiss-Prot. The archaeal portion constitutes only 2–5% of the data, so its absolute number of observations will be considerably lower.
Figure 2. Peptide patterns among kingdoms. The Venn diagrams show the percentage of peptide patterns common to the kingdoms in the original sequence sets. Only a few peptide patterns are unique to a kingdom in the genome data set. As many as 98.7% of the peptide patterns are common to two or more kingdoms in the genome set, while the corresponding number for Swiss-Prot is 82.3%. <0.1 indicates less than 0.1%. The sum for each data set is 100%.
Figure 3. Amino acid residue differences in peptide sets and kingdoms. The graphic shows the differences in percentage points versus the overall residue frequencies in the respective kingdom (cf. Additional file 1). All peptide sets contain 100 peptide patterns, except ORP-A, for which only 54 and 6 patterns passed the filtering step for the Swiss-Prot and genome datasets, respectively (cf. Table 3). Green indicates that the residue is more abundant than background, while red indicates that it is less abundant than background. The difference in percentage points is shown by color intensities according to the scale.
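The quantity plotted in Figure 3 is a difference in percentage points between the residue composition of a peptide set and that of the kingdom background. A minimal sketch of that arithmetic follows; the function names are hypothetical, and restricting the count to the 20 standard residues is an assumption.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_percentages(sequences):
    """Percentage of each standard residue over all given sequences."""
    counts = Counter(r for seq in sequences for r in seq if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def percentage_point_difference(peptides, background):
    """Peptide-set composition minus background composition, per residue.

    Positive values mean the residue is more abundant in the peptide set
    than in the background; negative values mean it is less abundant.
    """
    pep = residue_percentages(peptides)
    bg = residue_percentages(background)
    return {aa: pep[aa] - bg[aa] for aa in AMINO_ACIDS}
```

For example, a peptide set made only of cysteines compared with a background that is half cysteine yields +50 percentage points for C.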
Swiss-Prot sequence features matching peptide patterns
| Dataset | wf | wof | Feature (Number of peptide patterns) |
| POP-A-Swissprot | 46 | 54 | NP_BIND 25, ACT_SITE 7, REPEAT 6, MOTIF 4, BINDING 4, ZN_FING 3, METAL 2 |
| POP-A-genome | 52 | 48 | METAL 22, ZN_FING 15, NP_BIND 7, ACT_SITE 3, DISULFID 3, CARBOHYD 2, VAR_SEQ 2, LIPID 1, SIGNAL 1, MUTAGEN 1, REPEAT 1 |
| POP-B-Swissprot | 13 | 87 | METAL 6, ACT_SITE 5, MOTIF 2, NP_BIND 1 |
| POP-B-genome | 67 | 33 | METAL 33, ZN_FING 18, DISULFID 7, BINDING 6, ACT_SITE 4, TRANSMEM 2, VAR_SEQ 2, STRAND 2, REPEAT 1, SE_CYS 1, SITE 1, TURN 1, HELIX 1 |
| POP-E-Swissprot | 24 | 76 | METAL 10, TRANSMEM 5, DNA_BIND 5, DISULFID 3, ZN_FING 2, BINDING 1, ACT_SITE 1 |
| POP-E-genome | 39 | 61 | TRANSMEM 10, ZN_FING 6, DISULFID 6, REPEAT 5, VAR_SEQ 4, COMPBIAS 2, METAL 2, CARBOHYD 2, VARIANT 1, BINDING 1, PROPEP 1, CONFLICT 1 |
| NEP-A-Swissprot | 16 | 84 | TRANSMEM 16 |
| NEP-A-genome | 6 | 94 | TRANSMEM 5, PEPTIDE 1 |
| NEP-B-Swissprot | 10 | 90 | TRANSMEM 7, VAR_SEQ 1, COMPBIAS 1, COILED 1 |
| NEP-B-genome | 38 | 62 | DISULFID 12, VAR_SEQ 8, REPEAT 6, TRANSMEM 5, COMPBIAS 3, METAL 2, ZN_FING 2, VARIANT 1, COILED 1, STRAND 1 |
| NEP-E-Swissprot | 7 | 93 | TRANSMEM 3, VAR_SEQ 2, COMPBIAS 1, NP_BIND 1 |
| NEP-E-genome | 14 | 86 | TRANSMEM 4, DISULFID 4, STRAND 3, ZN_FING 1, MOTIF 1, PROPEP 1 |
| ORP-A-Swissprot | 10 | 44 | METAL 3, BINDING 2, ACT_SITE 1, DISULFID 1, MOTIF 1, TRANSMEM 1, HELIX 1 |
| ORP-A-genome | 0 | 6 | |
| ORP-B-Swissprot | 7 | 93 | METAL 4, BINDING 3, NP_BIND 1 |
| ORP-B-genome | 29 | 71 | TRANSMEM 17, STRAND 4, HELIX 3, METAL 2, DNA_BIND 1, ACT_SITE 1, TURN 1, ZN_FING 1, DISULFID 1, CROSSLNK 1 |
| ORP-E-Swissprot | 24 | 76 | METAL 10, ZN_FING 9, DNA_BIND 3, NP_BIND 1, TRANSMEM 1 |
| ORP-E-genome | 40 | 60 | ZN_FING 22, DISULFID 5, COMPBIAS 2, REPEAT 2, CARBOHYD 2, TRANSMEM 2, LIPID 1, DNA_BIND 1, HELIX 1, COILED 1, VAR_SEQ 1, PROPEP 1 |
| URP-A-Swissprot | 13 | 87 | COMPBIAS 7, METAL 4, VAR_SEQ 3, ZN_FING 2, REPEAT 1, COILED 1 |
| URP-A-genome | 79 | 21 | ZN_FING 49, COMPBIAS 10, TRANSMEM 6, DISULFID 6, REPEAT 3, DNA_BIND 2, SIGNAL 2, ACT_SITE 2, NP_BIND 1, METAL 1, CARBOHYD 1, VAR_SEQ 1 |
| URP-B-Swissprot | 23 | 77 | ZN_FING 10, METAL 8, DNA_BIND 3, NP_BIND 1, TRANSMEM 1 |
| URP-B-genome | 47 | 53 | ZN_FING 24, DISULFID 9, COMPBIAS 2, TRANSMEM 2, HELIX 2, CARBOHYD 2, REPEAT 1, SIGNAL 1, METAL 1, TURN 1, NP_BIND 1, COILED 1, PROPEP 1, LIPID 1 |
| URP-E-Swissprot | 6 | 94 | NP_BIND 2, METAL 2, BINDING 1, ACT_SITE 1 |
| URP-E-genome | 31 | 69 | TRANSMEM 19, HELIX 4, STRAND 4, METAL 2, TURN 2, DNA_BIND 1, ACT_SITE 1, ZN_FING 1, CROSSLNK 1 |
Known Swiss-Prot sequence features of peptide patterns were retrieved by matching peptide patterns against Swiss-Prot entries. Features that matched fewer than 20% of the sequence hits of a peptide pattern or were ambiguous (see Methods section) were excluded. The table shows the number of feature-associated patterns for each peptide set and the respective number of patterns associated with each feature. Individual data on each pattern are given in Additional files 3 and 4. wf, with feature: number of peptide patterns that match a sequence feature type in at least 20% of all the sequence hits in Swiss-Prot (release 51.5); wof, without feature: number of peptide patterns that do not meet the criteria of wf.
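The 20% rule in the caption above can be illustrated with a short sketch. It assumes a hypothetical input layout in which each sequence hit of a pattern is represented by the set of Swiss-Prot feature types overlapping the match; the additional exclusion of ambiguous features described in the Methods is not modeled here.

```python
from collections import Counter


def associated_features(hit_features, min_fraction=0.20):
    """Return feature types present in at least min_fraction of the hits.

    hit_features: one collection of feature types per sequence hit
    (hypothetical layout; an empty collection means no feature overlaps).
    """
    total = len(hit_features)
    if total == 0:
        return {}
    counts = Counter()
    for features in hit_features:
        counts.update(set(features))  # count each type once per hit
    return {name: n / total
            for name, n in counts.items()
            if n / total >= min_fraction}


def has_feature(hit_features, min_fraction=0.20):
    """True if the pattern would be counted as 'wf', False for 'wof'."""
    return bool(associated_features(hit_features, min_fraction))
```

A pattern with four hits, one of which overlaps a METAL site, would pass (25%), whereas one METAL overlap among ten hits would not (10%).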
Categorization of peptide sets
| Peptide set | Kingdom | Filter | Top 100 by |
| POP | A | ≥ 10 in A, ≤ 2 in rand. A | Freq. in orig. A |
| POP | B | ≥ 10 in B, ≤ 2 in rand. B | Freq. in orig. B |
| POP | E | ≥ 10 in E, ≤ 2 in rand. E | Freq. in orig. E |
| NEP | A | ≤ 2 in A, ≥ 10 in rand. A | Freq. in rand. A |
| NEP | B | ≤ 2 in B, ≥ 10 in rand. B | Freq. in rand. B |
| NEP | E | ≤ 2 in E, ≥ 10 in rand. E | Freq. in rand. E |
| ORP | A | ≥ 10 in A, ≤ 2 in B+E | Freq. in orig. A |
| ORP | B | ≥ 10 in B, ≤ 2 in A+E | Freq. in orig. B |
| ORP | E | ≥ 10 in E, ≤ 2 in A+B | Freq. in orig. E |
| URP | A | ≤ 2 in A, | Freq. in orig. B+E |
| URP | B | ≤ 2 in B, | Freq. in orig. A+E |
| URP | E | ≤ 2 in E, | Freq. in orig. A+B |
The peptide sets POP, NEP, ORP and URP (each for archaea (A), bacteria (B) and eukaryota (E)) were generated by an initial filtering step selecting peptides occurring at least 10 times in one set and at most twice in another, as detailed in the table. The 100 highest ranked peptide patterns were then extracted according to the parameter settings. orig, original; rand, randomized.
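The two-step selection described in this caption (a count filter followed by a top-100 ranking) can be sketched as below. The function and argument names are hypothetical, the counts are assumed to be plain occurrence counts per pattern, and ties in the ranking are broken alphabetically, which is an arbitrary choice not taken from the source.

```python
from collections import Counter


def select_peptide_set(abundant, scarce, rank_by,
                       min_abundant=10, max_scarce=2, top_n=100):
    """Filter patterns by counts in two sets, then keep the top-ranked ones.

    abundant: pattern -> count in the set where it must occur >= min_abundant
    scarce:   pattern -> count in the set where it must occur <= max_scarce
    rank_by:  pattern -> count used for the final ranking
    """
    candidates = [
        pattern for pattern, n in abundant.items()
        if n >= min_abundant and scarce.get(pattern, 0) <= max_scarce
    ]
    # Highest ranking count first; alphabetical order breaks ties (assumption).
    candidates.sort(key=lambda p: (-rank_by.get(p, 0), p))
    return candidates[:top_n]


if __name__ == "__main__":
    # Toy counts, not taken from the paper.
    original_a = Counter({"AAAA": 12, "CCCC": 15, "DDDD": 9, "EEEE": 30})
    randomized_a = Counter({"AAAA": 1, "CCCC": 3, "EEEE": 2})
    # POP-A style: frequent in original A, rare in randomized A.
    print(select_peptide_set(original_a, randomized_a, rank_by=original_a))
```

Under these assumptions, a NEP-style set is obtained by swapping the roles of the original and randomized counts, and an ORP-style set by passing the combined counts of the other two kingdoms as `scarce`.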
The number of patterns that have cysteine-cysteine and proline-proline dipeptides in NEP
| Dataset | Dipeptide CC | Dipeptide PP |
| NEP-A-Genome | 0 | 2 |
| NEP-A-Swissprot | 0 | 0 |
| NEP-B-Genome | 0 | 12 |
| NEP-B-Swissprot | 0 | 14 |
| NEP-E-Genome | 12 | 1 |
| NEP-E-Swissprot | 1 | 1 |
| All-A-Genome | 4% | 35% |
| All-A-Swissprot | 3% | 34% |
| All-B-Genome | 4% | 38% |
| All-B-Swissprot | 3% | 38% |
| All-E-Genome | 18% | 57% |
| All-E-Swissprot | 18% | 54% |
All: all proteins from the kingdom.
Dipeptides of cysteine and proline are suspected to be structurally unfavored. Several of the peptides in bacterial NEPs do have diprolines but no dicysteines. Eukaryotic NEPs from the genome set have several dicysteines. However, for NEPs, the fraction of patterns having these dipeptides is considerably lower than the fraction of proteins having them in the respective kingdom section of the full data set.
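Both kinds of entries in the table above (the number of NEP patterns containing a dipeptide, and the percentage of all proteins in a kingdom containing it) reduce to a substring test. A minimal sketch, with hypothetical function names and toy inputs:

```python
def count_with_dipeptide(sequences, dipeptide):
    """Number of sequences (patterns or proteins) containing the dipeptide."""
    return sum(1 for seq in sequences if dipeptide in seq)


def percent_with_dipeptide(sequences, dipeptide):
    """Percentage of sequences containing the dipeptide at least once."""
    if not sequences:
        raise ValueError("no sequences given")
    return 100.0 * count_with_dipeptide(sequences, dipeptide) / len(sequences)


if __name__ == "__main__":
    nep_patterns = ["WPPGH", "MCCWY", "HWAYK"]        # toy patterns
    proteins = ["MKPPLACCD", "MSTNPKPQR", "MAPPGT"]   # toy proteins
    for dipeptide in ("CC", "PP"):
        print(dipeptide,
              count_with_dipeptide(nep_patterns, dipeptide),
              round(percent_with_dipeptide(proteins, dipeptide), 1))
```

Since each peptide set holds at most 100 patterns, the pattern counts in the table can be read roughly as percentages when compared against the "All" rows.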
Figure 4. Overlap of patterns between the peptide pattern subsets. Numbers give proportions of overlapping patterns. A, B and E indicate the kingdoms of archaea, bacteria and eukaryota, respectively. Clustering is based on single linkage hierarchical clustering and is shown schematically at the top. Peptide subsets with an overlap of 50% or more are boxed in red; the remaining ones are boxed in grey.
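The pairwise overlaps reported in Figure 4 can be illustrated as follows. This is only a sketch: the exact definition of the overlap proportion is not given in this excerpt, so the shared patterns are divided by the size of the smaller set as an assumption, and the single linkage clustering step is omitted. All names are hypothetical.

```python
from itertools import combinations


def overlap_proportion(set_a, set_b):
    """Shared patterns divided by the size of the smaller set (assumption)."""
    a, b = set(set_a), set(set_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def strongly_overlapping(named_sets, threshold=0.5):
    """List pairs of peptide subsets whose overlap reaches the threshold."""
    pairs = []
    for name_a, name_b in combinations(sorted(named_sets), 2):
        proportion = overlap_proportion(named_sets[name_a], named_sets[name_b])
        if proportion >= threshold:
            pairs.append((name_a, name_b, proportion))
    return pairs
```

With a threshold of 0.5, the returned pairs correspond to the subsets that the figure would box in red.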
Common patterns between ORPs and URPs in bacteria and eukaryotes
| ORP-E/URP-B | ORP-B/URP-E | ||||||
| Swiss-Prot | Genome | Swiss-Prot | Genome | ||||
| WDTAG | NP_BIND | PYVCK | ZN_FING | PGCSM | METAL | WNYWV | TRANSMEM |
| QGPPG | REGION | KPYTC | ZN_FING | CDKIT | METAL | RCWHY | METAL |
| EECGK | ZN_FING | YECNQ | ZN_FING | TRMKS | REGION | WAWGH | METAL |
| FHFIL | METAL | PHECK | ZN_FING | WQGQC | ZN_FING | ||
| AFHFI | METAL | KPYNC | ZN_FING | WFPKM | TRANSMEM | ||
| NPIIY | TRANSMEM | FECKQ | ZN_FING | MVPMW | TRANSMEM | ||
| NHCGK | ZN_FING | AMWWI | TRANSMEM | ||||
| PYQCK | ZN_FING | WGGWW | TRANSMEM | ||||
| KPHKC | ZN_FING | WHPEW | ACT_SITE | ||||
| PYKCQ | ZN_FING | WGIMH | TRANSMEM | ||||
| PYKCT | ZN_FING | IYWHF | DOMAIN | ||||
| PCGHN | ZN_FING | ||||||
| CMNGG | DISULFID | ||||||
| RLSCA | AWTWN | GANMQ | PMVWR | GHWYF | |||
| GWIIR | ECVWQ | PTDMQ | NFWQM | MTAWH | |||
| LRLSC | YWEFQ | ANMQR | YWGCP | WYVVH | |||
| ICLFL | YCQEY | GSYHD | GENHW | ANHWM | |||
| NYTPA | YHEWT | MIGDP | HGCCH | MWPVH | |||
| TLTWI | TMYCE | YHDVD | ACMHC | YWQVY | |||
| GHPIS | MIKCY | PYRKV | HNWPG | WIAAW | |||
| RNLSH | CYIFM | YPAME | SHIWY | EFWCR | |||
| KQRSM | GFHCN | RDVHP | WPMKH | MNAWA | |||
| FCAEA | SKFWY | LPHRY | WWIKA | YPCNY | |||
| FCQVR | FHIGG | WMPSW | LCHYW | ||||
| CKQDV | TYNFP | FPMDW | IRWQH | ||||
| RSKFW | VMFGN | FCDWY | HWAYK | ||||
| NTWHR | MIEGP | WHKRP | MGKWL | ||||
| CKPPN | NIMEF | IEAHW | FWWNP | ||||
| MFGCP | VYKHA | TMWRG | GWLWF | ||||
| MWIPK | HGTYP | WMAMN | GMNKW | ||||
| PKYCI | KDHHS | FGWQV | PRHYW | ||||
| QNVMC | AHDWC | WRNAW | |||||
| MCVDY | WDMNF | MAHDW | |||||
| MYCEA | IMTWM | WAMTQ | |||||
| WWVSM | EHWHT | RHWMI | |||||
| RNMCP | TYAMW | ||||||
The table shows patterns that are ORPs in eukaryotes and URPs in bacteria, and vice versa. The upper part shows patterns with associated features and the lower part lists those without feature association. Only peptide patterns where residue bias can be excluded are shown (p ≤ 0.05).