Juliana Bernardes, Catherine Vaquero, Alessandra Carbone.
Abstract
BACKGROUND: With the availability of complete genome sequences of both human and non-human Plasmodium parasites, it is now possible to use comparative genomics to look for orthology across Plasmodium species and for species-specific genes. Such comparative analyses could provide important clues for the development of new strategies to prevent and treat malaria in humans. However, the number of functionally annotated proteins is still low for all Plasmodium species. For genomes that are hard to annotate because of sequence divergence, such as those of Plasmodium, domain co-occurrence becomes particularly important for trusting predictions. In particular, domain architecture prediction can be used to improve the performance of existing annotation methods, since homologous proteins might share their architectural context.
Keywords: Annotation; Database; Domain architecture prediction; Genome; Genome comparison; Plasmodium; Protein architecture
Year: 2017 PMID: 28592293 PMCID: PMC5463329 DOI: 10.1186/s12936-017-1887-8
Source DB: PubMed Journal: Malar J ISSN: 1475-2875 Impact factor: 2.979
Table 1 Features of Plasmodium genomes
| Species | Genome size (Mb) | #Proteins (a) | AT content (b) |
|---|---|---|---|
| | 23.33 | 5542 | 0.81 |
| | 22.98 | 5491 | 0.81 |
| | 27.01 | 5586 | 0.58 |
| | 24.40 | 5229 | 0.61 |
| | 26.18 | 5716 | 0.60 |
| | 23.92 | 5846 | 0.81 |
| | 18.97 | 5217 | 0.76 |
| | 18.78 | 5076 | 0.78 |
| | 22.76 | 5978 | 0.78 |
| | 22.94 | 7724 | 0.76 |
| | 22.03 | 5709 | 0.78 |
Numbers are reported from PlasmoDB [1]
(a) The symbol # stands for “the number of”
(b) AT richness is computed on CDS regions only and is given as a fraction
Fig. 1 Newly predicted domain architecture of P. falciparum gene PF3D7_1369500. Plasmobase “Look up” page associated with gene PF3D7_1369500. The CLADE predicted architecture (top) contains three domains: MIF4G, MIF4G_like_2, MIF4G_like. The Pfam_27 architecture is displayed below and shows no identified domains. The list of all domains identified by CLADE is given. Besides the three domains belonging to the CLADE architecture, one more domain, displayed in grey, was identified by CLADE but not selected by DAMA. The user might want to consider it in view of a putative functional annotation of the protein: any combination of CLADE domains can be selected and explored either in Plasmobase or in UniProt by clicking on the corresponding buttons (bottom). The list of orthologs and paralogs, according to PlasmoDB, can be displayed by clicking on the “show” link. Windows with information on the identified domains (Pfam domain name, Pfam accession number, CLADE model species, position of the domain in the protein, domain coverage, E-value, CLADE SVM probability, clan name if any) are displayed by hovering the mouse over the domain location, as illustrated by the information box for the blue domain. Note that the clade-centred model generated by the Bombyx mori MIF4G_like sequence is the one that obtained the best match with the Plasmodium sequence PF3D7_1369500
Fig. 2 Proteins with similar architectures explored in Plasmobase/UniProt. a All Plasmodium species contain a protein sequence sharing the same CLADE architecture as PF3D7_1369500. Selecting some of these species allows the user to explore these protein sequences in Plasmobase and to verify the information used for domain architecture identification in other species. The whole list of domains is reported (in this specific example, there is only one architecture per species). b Plasmobase allows the user to explore the UniProt database for architectures that are similar to the one identified by CLADE for PF3D7_1369500. There are similar architectures in Metazoa, Fungi, Viridiplantae and other clades. Selecting Fungi and Viridiplantae allows the user to compare the architectures among these clades. Fungi contains 41 sequences with the given architecture, and the full list is accessible
New domains identified in Plasmobase, possibly by co-occurrence (Cooc)
| Species | First time (a): Cooc (e) | First time (a): Total (f) | Enriching domains (b): Cooc (e) | Enriching domains (b): Total (f) | New domains (c): Cooc (e) | New domains (c): Total (f) | Brand-new domains (d): Cooc (e) | Brand-new domains (d): Total (f) |
|---|---|---|---|---|---|---|---|---|
| | 467 | 916 | 1052 | 1200 | 1519 | 2116 | 603 | 971 |
| | 368 | 691 | 893 | 984 | 1261 | 1675 | 496 | 741 |
| | 324 | 659 | 871 | 996 | 1195 | 1655 | 525 | 824 |
| | 289 | 578 | 858 | 955 | 1147 | 1533 | 504 | 736 |
| | 296 | 614 | 703 | 792 | 999 | 1406 | 392 | 632 |
| | 373 | 679 | 911 | 1004 | 1284 | 1683 | 513 | 737 |
| | 328 | 730 | 785 | 874 | 1113 | 1604 | 468 | 678 |
| | 316 | 666 | 768 | 841 | 1084 | 1507 | 431 | 635 |
| | 310 | 754 | 776 | 847 | 1086 | 1601 | 427 | 646 |
| | 253 | 714 | 660 | 724 | 913 | 1438 | 370 | 556 |
| | 310 | 702 | 779 | 851 | 1089 | 1553 | 426 | 642 |
(a) Number of domain predictions occurring on proteins with no annotation in PlasmoDB
(b) Number of new domains enriching known protein architectures
(c) Total number of new domains, corresponding to the sum of (a) and (b)
(d) Number of new domains that occur in no proteins for the current Plasmodium species, according to PlasmoDB
(e) Number of predicted domains that are supported by co-occurrence
(f) Total number of identified domains, predicted or not based on co-occurrence
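The footnotes state that the "New domains" columns are the sum of the "First time" and "Enriching domains" columns. A minimal Python sketch of that consistency check follows, using the rows transcribed from the table in table order. The function names `sum_mismatches` and `cooc_share` are illustrative assumptions, not part of Plasmobase.

```python
# Rows transcribed from the table above, in table order.
# Columns: first-time (Cooc, Total), enriching (Cooc, Total), new (Cooc, Total).
ROWS = [
    (467, 916, 1052, 1200, 1519, 2116),
    (368, 691, 893, 984, 1261, 1675),
    (324, 659, 871, 996, 1195, 1655),
    (289, 578, 858, 955, 1147, 1533),
    (296, 614, 703, 792, 999, 1406),
    (373, 679, 911, 1004, 1284, 1683),
    (328, 730, 785, 874, 1113, 1604),
    (316, 666, 768, 841, 1084, 1507),
    (310, 754, 776, 847, 1086, 1601),
    (253, 714, 660, 724, 913, 1438),
    (310, 702, 779, 851, 1089, 1553),
]


def sum_mismatches(rows):
    """Return the indices of rows where New != First time + Enriching."""
    bad = []
    for i, (f_cooc, f_tot, e_cooc, e_tot, n_cooc, n_tot) in enumerate(rows):
        if f_cooc + e_cooc != n_cooc or f_tot + e_tot != n_tot:
            bad.append(i)
    return bad


def cooc_share(cooc, total):
    """Percentage of identified domains that are supported by co-occurrence."""
    return 100.0 * cooc / total


print(sum_mismatches(ROWS))              # [] : every row satisfies the sum
print(round(cooc_share(1519, 2116), 1))  # 71.8 for the first row's new domains
```

For the first row, about 72% of the new domains are supported by co-occurrence; the same helper can be applied to any Cooc/Total pair in the table.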
Comparison between PlasmoDB and Plasmobase domain predictions
| Species | PlasmoDB: #Pred domains | PlasmoDB: #Prots with no domain (a) | Plasmobase: #Pred domains | Plasmobase: %Improv (b) | Plasmobase: #Prots with no domain (a) |
|---|---|---|---|---|---|
| | 6037 | 2068 (37.31%) | 7842 | 30 | 1526 (27.54%) |
| | 5783 | 2085 (37.97%) | 7035 | 22 | 1718 (31.29%) |
| | 5177 | 2132 (38.16%) | 6431 | 24 | 1830 (32.76%) |
| | 5469 | 1929 (36.89%) | 6430 | 18 | 1627 (31.11%) |
| | 4731 | 2449 (42.84%) | 5660 | 20 | 2242 (39.22%) |
| | 6110 | 2090 (35.75%) | 7355 | 20 | 1734 (29.66%) |
| | 4834 | 2017 (38.66%) | 6128 | 27 | 1572 (30.13%) |
| | 4715 | 1951 (38.43%) | 5924 | 26 | 1566 (30.85%) |
| | 5564 | 2038 (34.09%) | 6844 | 23 | 1872 (31.31%) |
| | 5134 | 4078 (52.79%) | 6265 | 22 | 3311 (42.87%) |
| | 5355 | 1981 (34.69%) | 6592 | 23 | 1557 (27.27%) |
(a) In parentheses, the percentage of proteins with no domain is computed as #Prots with no domain/#Proteins, where #Proteins is reported in Table 1
(b) The improvement, in percent, is computed as (#Predicted domains in Plasmobase - #Predicted domains in PlasmoDB)/#Predicted domains in PlasmoDB
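The two derived quantities in this table can be recomputed from the raw counts. The Python sketch below uses the first row together with the matching #Proteins value (5542) from the table of genome features. The function names are illustrative assumptions. The tabulated %Improv values appear to be reproduced when the difference is divided by the PlasmoDB count, so the sketch uses that denominator.

```python
def percent_without_domain(n_without_domain: int, n_proteins: int) -> float:
    """Percentage of proteins that carry no predicted domain."""
    return 100.0 * n_without_domain / n_proteins


def percent_improvement(n_plasmobase: int, n_plasmodb: int) -> float:
    """Relative gain in predicted domains, taken relative to the PlasmoDB count."""
    return 100.0 * (n_plasmobase - n_plasmodb) / n_plasmodb


# First row of the table: 5542 proteins, 6037 vs 7842 predicted domains.
print(round(percent_improvement(7842, 6037)))           # 30
print(round(percent_without_domain(2068, 5542), 1))     # 37.3 (PlasmoDB)
print(round(percent_without_domain(1526, 5542), 1))     # 27.5 (Plasmobase)
```

Dividing the same difference by the Plasmobase count instead would give about 23% for this row, which does not match the tabulated 30.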