Niklaus Fankhauser, Tien-Minh Nguyen-Ha, Joël Adler, Pascal Mäser.
Abstract
BACKGROUND: Many parasitic organisms, eukaryotes as well as bacteria, possess surface antigens with amino acid repeats. Making up the interface between host and pathogen, such repetitive proteins may be virulence factors involved in immune evasion or cytoadherence. They find immunological applications in serodiagnostics and vaccine development. Here we use proteins that contain perfect repeats as a basis for comparative genomics between parasitic and free-living organisms.
Year: 2007 PMID: 18096064 PMCID: PMC2254594 DOI: 10.1186/1477-5956-5-20
Source DB: PubMed Journal: Proteome Sci ISSN: 1477-5956 Impact factor: 2.480
Comparison of programs for the detection of repetitive subsequences in proteins
| Reptile | Hashing² | No | Yes | Yes | 153 | Yes |
| REP [2] | Profiles of known repeats | Yes | No | No | 1.1 | No |
| RADAR [5] | Alignment | Yes | No | No | 28 | Yes |
| REPRO [7] | Alignment | Yes | No | No | n.a. | Yes |
| Internal Repeats finder [8] | Alignment | Yes | Yes | No | 14 | No |
| TRIPS [9] | Fourier transform | Yes | No | No | 12 | No |
| RepSeq [10] | Hashing | Yes | Yes | Yes | n.a. | Yes |
| ProtRepeatsDB [11] | Mixed | Yes | Yes | Yes | n.a. | Yes |
| Repper [12] | Fourier transform | Yes | No | No | n.a. | No |
¹ The T. brucei surface protein (GenBank accession AAK62893) with five GPEET repeats [25] was used for benchmarking.
² Word count using a hash table.
³ Using P < 0.001 (same as for Internal Repeats Finder).
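The hash-table word count mentioned in the footnotes can be illustrated with a minimal sketch. It assumes a fixed word length `k` and simply counts every overlapping word; the function name `count_words`, the parameter `k`, and the toy sequence are hypothetical and are not taken from Reptile itself, whose actual implementation is not described here.

```python
from collections import Counter

def count_words(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq using a hash table."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Toy sequence (NOT the real AAK62893 sequence): five GPEET units with
# arbitrary flanking residues, chosen only for illustration.
toy = "MKLA" + "GPEET" * 5 + "SWQ"

counts = count_words(toy, 5)
word, n = counts.most_common(1)[0]
print(word, n)
```

A single pass over the sequence fills the table, so the cost grows linearly with protein length for a given `k`. A real tool would also have to scan a range of word lengths and attach a P-value to each count, which this sketch does not attempt.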
Eukaryotic proteomes analyzed
| Metazoa | F | 38220 | |
| Metazoa | F | 35593 | |
| Viridiplantae | F | 34554 | |
| Metazoa | F | 22431 | |
| Metazoa | F | 16239 | |
| Metazoa | F | 15647 | |
| Metazoa | F | 13486 | |
| Protozoa | F | 13017 | |
| Metazoa | F | 11987 | |
| Fungi | F | 6525 | |
| Fungi | F | 5810 | |
| Fungi | F | 5326 | |
| Fungi | F | 5009 | |
| Protozoa | P | 9772 | |
| Protozoa | P | 9646 | |
| Protozoa | P | 9210 | |
| Protozoa | P | 8010 | |
| Fungi | P | 6569 | |
| Protozoa | P | 5283 | |
| Protozoa | P | 4071 | |
| Protozoa | P | 3886 | |
| Protozoa | P | 3790 | |
| Fungi | P | 1909 |
F, free-living; P, endoparasitic.
Figure 1. Comparative genomics of repeat-containing proteins. Double logarithmic plot of the percentage of highly repetitive (P < 10⁻¹⁰) proteins vs. mean protein length of eukaryotic proteomes. Ag, A. gambiae; At, A. thaliana; Br, B. rerio; Ce, C. elegans; Dd, D. discoideum; Dm, D. melanogaster; Hs, H. sapiens; Kl, K. lactis; Mm, M. musculus; Rn, R. norvegicus; Sc, S. cerevisiae; Sp, S. pombe; Yl, Y. lipolytica; Ch, C. hominis; Cn, C. neoformans; Ec, E. cuniculi; Eh, E. histolytica; Gd, G. duodenalis; Lm, L. major; Pf, P. falciparum; Ta, T. annulata; Tb, T. brucei; Tp, T. parva; r_S, Spearman coefficient.
A selection of the most repetitive proteins from pathogens
| Hypothetical protein, Tb927.1.1740 | Tb | 7154 | 132 × LAEESQQHTARSEADIDE | 2806 |
| Gene 11-1 protein*, Q8I6U6 | Pf | 10589 | 967 × EEV | 2457 |
| Conserved protein, LmjF29.0110 | Lm | 3418 | 146 × AEEQARR | 1080 |
| Proteophosphoglycan-like, LmjF35.0550 | Lm | 2425 | 105 × SSSSSAPSA | 1052 |
| Putative antigen*, Tb04.29M18.750 | Tb | 4455 | 66 × NEQYETLQRTNAA | 958 |
| Gb4*, Tb09.160.1200 | Tb | 8214 | 35 × VVIIDCRLGSLLIDYKVI | 701 |
| Hypothetical protein, Chro.50162 | Ch | 1589 | 84 × KKDAP | 407 |
| Hypothetical protein, Q8I455 | Pf | 2349 | 67 × LKEEER | 389 |
| Interspersed repeat antigen*, Q8I486 | Pf | 1720 | 67 × QEPVT | 313 |
| Putative antigen 332*, Q8IHN3 | Pf | 5507 | 144 × EEI | 274 |
| Cell wall surface anchor family, Q97P71 | Spn | 4776 | 1074 × SAS | 3418 |
| Cell surface SD repeat protein, Q88XB6 | Lpl | 3360 | 796 × DS | 1619 |
| Hypothetical protein, Q8E473 | Sag | 1310 | 106 × TSAS | 447 |
| Putative peptidoglycan-bound, Q8Y697 | Lmo | 903 | 78 × ADADA | 403 |
| Avirulence protein, Q5GYF3 | Xor | 1790 | 20 × ETVQRLLPVLCQDHGLTP | 401 |
| Serine/threonine-rich antigen, Q99QY4 | Sau | 2271 | 163 × STS | 391 |
| PE-PGRS family, PG54_MYCTU | Mtu | 1901 | 136 × GAG | 326 |
| Structural toxin RtxA, Q5X7A6 | Lpn | 7679 | 29 × RFEDDGPVV | 247 |
| Ice nucleation protein, Q8PD38 | Xca | 1333 | 52 × GYGST | 242 |
| PPE family protein, Q6MX44 | Mtu | 3300 | 95 × NTG | 184 |
Eukaryotic proteins (top) whose expression is confirmed by the presence of expressed sequence tags (EST) in GenBank are marked with an asterisk. L, length; pP, negative logarithm of the P-value; Sp, species (Ch, C. hominis; Lm, L. major; Pf, P. falciparum; Tb, T. brucei; Lmo, Listeria monocytogenes; Lpl, Lactobacillus plantarum; Lpn, Legionella pneumophila; Mtu, M. tuberculosis; Sau, S. aureus; Spn, S. pneumoniae; Sag, Streptococcus agalactiae; Xca, Xanthomonas campestris; Xor, X. oryzae).
Figure 2. Amino acid composition of the repeats. For each amino acid, the frequency in the repeats of P < 10⁻¹⁰ is plotted vs. its frequency in the remainder of the proteome (r_S, Spearman coefficient). Data are pooled for bacteria (n = 193) and eukaryotes (n = 49). The small diamonds at 0.05 mark the expected frequency for random distribution; the diagonal represents equal frequency in the repeats and in the remainder of the respective proteome. Complete data tables including standard deviation are provided as a supplementary file [Additional file 1].
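The comparison plotted in this figure comes down to two frequency vectors per proteome, one for the repeat regions and one for everything else. The sketch below shows that calculation on made-up input. The function name `aa_frequencies` and both toy sequences are hypothetical, and the pooling across proteomes and the Spearman coefficient are left out.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aa_frequencies(seq: str) -> dict:
    """Relative frequency of each standard amino acid in seq."""
    counts = Counter(c for c in seq if c in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}

# Toy inputs, for illustration only: a Ser/Ala-rich repeat region and an
# arbitrary non-repetitive remainder.
repeat_region = "SAS" * 4
remainder = "MKTLLVAGDE"

f_rep = aa_frequencies(repeat_region)
f_rest = aa_frequencies(remainder)

# Residues above the diagonal of the plot: more frequent in repeats.
enriched = [aa for aa in AMINO_ACIDS if f_rep[aa] > f_rest[aa]]
print(enriched)
```

With 20 residues, a uniform random distribution gives each one a frequency of 1/20 = 0.05, which is the reference value marked in the figure.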
Figure 3. Potential N-glycosylation sites in the repeats. The percentage of asparagines that are in glycosylation consensus (Asn-not Pro-Ser/Thr) is plotted for repeats of P < 10⁻¹⁰ and for the remainders of the respective proteomes. Bars indicate the median. The organism with 30% of asparagines in the repeats in N-glycosylation consensus is T. brucei.
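The consensus named in the caption (Asn, then any residue except Pro, then Ser or Thr) maps directly onto a regular expression. The following is a minimal sketch of how the fraction of asparagines in consensus could be computed for one sequence. The names `SEQUON` and `sequon_fraction` are hypothetical, and this is not necessarily how the authors computed the figure.

```python
import re

# Asn followed by any residue except Pro, then Ser or Thr.
# The lookahead keeps matches zero-width after the N, so overlapping
# sites such as the two in "NNSS" are each counted.
SEQUON = re.compile(r"N(?=[^P][ST])")

def sequon_fraction(seq: str) -> float:
    """Fraction of asparagines in seq that lie in an N-X-S/T consensus (X != Pro)."""
    n_total = seq.count("N")
    if n_total == 0:
        return 0.0
    return len(SEQUON.findall(seq)) / n_total

# The CRAM repeat unit from T. brucei has one Asn, followed by Glu-Thr.
print(sequon_fraction("ITGDCNETDDC"))
# A Pro at the middle position disqualifies the first Asn.
print(sequon_fraction("NPSANAT"))
```

A consensus match only marks a potential site; whether a given asparagine is actually glycosylated cannot be read from the sequence alone.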
Figure 4. Flowchart to Dora, database of repetitive antigens. Reptile, Phobius [20], and GPI-SOM [43] are integrated into an automated pipeline for the classification of proteins (top). The data are stored in a database that is accessible on-line [44] via the depicted interface (bottom). This allows user-defined Boolean queries for repeat-containing surface proteins.
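To make the idea of a Boolean query over classified proteins concrete, here is a small in-memory sketch. It does not reproduce Dora's schema or interface, which are not described here. The `Protein` class, its field names, and the `query` helper are all hypothetical, and the rows are a handful of entries copied from the table of repetitive membrane proteins.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Protein:
    name: str
    gpi: bool          # predicted GPI anchor
    tm_domains: int    # predicted transmembrane domains
    pp: float          # negative logarithm of the repeat P-value

# tm_domains is set to 0 for GPI rows as a placeholder; the table does
# not report a TM count for them.
records = [
    Protein("Circumsporozoite protein, Q7K740", True, 0, 145),
    Protein("Liver stage antigen, Q8IJ44", False, 1, 839),
    Protein("PF70 exoantigen, Q8IK15", False, 3, 165),
    Protein("CRAM, Tb10.6k15.3510", False, 1, 1050),
    Protein("VSG, Tb10.v4.0209", True, 0, 13),
]

def query(rows, predicate):
    """Return the rows for which the Boolean predicate holds."""
    return [r for r in rows if predicate(r)]

# (GPI-anchored OR exactly one TM domain) AND highly repetitive.
hits = query(records, lambda p: (p.gpi or p.tm_domains == 1) and p.pp > 100)
for p in hits:
    print(p.name)
```

Expressing the filter as a predicate keeps the AND/OR structure visible. A database-backed version would translate the same condition into a WHERE clause.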
Repetitive membrane proteins of P. falciparum (top) and T. brucei (bottom)
| Hypothetical protein, Q8IJ50 | GPI | 16 × EESHNFYNPTH | 184 |
| Circumsporozoite protein, Q7K740 | GPI | 38 × ANPN | 145 |
| Merozoite surface protein 8, Q8I476 | GPI | 32 × NN | 29 |
| Liver stage antigen, Q8IJ44 | 1 TM | 45 × AKEKLQEQQSDLEQER | 839 |
| Erythrocyte membrane protein 3, O96124 | 1 TM | 61 × QQNTGLKNTP | 665 |
| Trophozoite antigen, Q8IFL9 | 1 TM | 60 × NHKSD | 287 |
| Glycophorin-binding protein, Q8I6U8 | 1 TM | 10 × DPEGQIMREYAADPEYRKHL | 213 |
| MAEBL, Q8IHP3 | 1 TM | 19 × EEKKKADELKK | 213 |
| PF70 exoantigen, Q8IK15 | 3 TM | 8 × TKKPSKYTMNLDSPLLKGSS | 165 |
| MESA, Q8I492 | 1 TM | 94 × KE | 97 |
| PfEMP1, Q8I519 | 1 TM | 16 × GGGGGS | 77 |
| RESA, Q8IHN1 | 1 TM | 33 × EEN | 63 |
| Hypothetical protein, Tb11.02.2360 | GPI | 11 × TAVTDVNDNNSANTSNEDE | 229 |
| Hypothetical protein, Tb11.1550 | GPI | 12 × IIAHYC | 68 |
| Procyclin (EP-type), Tb10.6k15.0020 | GPI | 29 × PE | 46 |
| Hypothetical protein, Tb927.7.360 | GPI | 3 × DKEKTERTEVEEVPKKDPEG | 45 |
| Procyclin (GPEET-type), Tb927.6.510 | GPI | 6 × EETGP | 24 |
| VSG, Tb10.v4.0209 | GPI | 19 × AA | 13 |
| CRAM, Tb10.6k15.3510 | 1 TM | 80 × ITGDCNETDDC | 1050 |
| Hypothetical protein, Tb927.3.5530 | 2 TM | 49 × RLRAEEE | 337 |
| Hypothetical protein, Tb10.61.0660 | 3 TM | 12 × NEEVPAGVSARRGGVAMSF | 241 |
| Procyclic surface glycoprotein, Tb10.26.0790 | 2 TM | 5 × YGQPPPPQ | 31 |
| Invariant surface glycoprotein, Tb927.5.350 | 1 TM | 18 × EA | 12 |
TM, transmembrane domain; GPI, glycosylphosphatidyl-inositol anchor; pP, negative logarithm of the P-value. See text for full protein names.