
EvoProDom: evolutionary modeling of protein families by assessing translocations of protein domains.

Gon Carmi1, Alessandro Gorohovski1, Milana Frenkel-Morgenstern1.   

Abstract

Here, we introduce a novel 'evolution of protein domains' (EvoProDom) model for describing the evolution of proteins based on the 'mix and merge' of protein domains. We assembled and integrated genomic and proteomic data comprising protein domain content and orthologous proteins from 109 organisms. In EvoProDom, we characterized evolutionary events, particularly, translocations, as reciprocal exchanges of protein domains between orthologous proteins in different organisms. We showed that protein domains that translocate with high frequency are generated by transcripts enriched in trans-splicing events, that is, the generation of novel transcripts from the fusion of two distinct genes. In EvoProDom, we describe a general method to collate orthologous protein annotation from KEGG, and protein domain content from protein sequences using tools such as KoFamKOALA and Pfam. To summarize, EvoProDom presents a novel model for protein evolution based on the 'mix and merge' of protein domains rather than DNA-based evolution models. This confers the advantage of considering chromosomal alterations as drivers of protein evolutionary events.
© 2021 The Authors. FEBS Open Bio published by John Wiley & Sons Ltd on behalf of Federation of European Biochemical Societies.

Keywords:  protein domains; protein evolution; translocations

Year:  2021        PMID: 34196123      PMCID: PMC8409312          DOI: 10.1002/2211-5463.13245

Source DB:  PubMed          Journal:  FEBS Open Bio        ISSN: 2211-5463            Impact factor:   2.693


Abbreviations: DA, domain architecture; EvoProDom, evolution of protein domains; KEGG, Kyoto Encyclopedia of Genes and Genomes; KO, KEGG ortholog; PPI, protein–protein interaction.

Proteins are composed of a set of domains that correspond to conserved regions with well‐defined functional and structural properties [1]. Consistent with the domain‐oriented view of proteins, domains cluster together to form domain architectures (DAs), that is, ordered sequences of domains. ‘Domain promiscuity’ or ‘domain mobility’ describes the diversity of DAs which participate in protein assembly. Analysis of domain promiscuity can reveal the mechanisms by which domains are gained or lost [2]. Marsh and Teichmann [1] described five mechanisms by which proteins gain domains: (a) gene fusion, namely, the fusion of a pair of adjacent genes via alternative splicing in noncoding intergenic regions; (b) exon extension, whereby exon regions expand into adjacent introns to encode a new domain; (c) exon recombination, involving the direct merging of two exons from two different genes; (d) intron recombination or exon shuffling, in which an exon inserts into an intron of a different gene; and (e) retroposition, where a sequence located within one gene is transposed into a different gene, along with a flanking genetic sequence. The properties of a gained domain, for example, its position in the protein sequence and its number of exons, can identify which mechanism underlies domain addition. For example, the gain of a multi‐exon domain at the C terminus indicates gene fusion. Additionally, during metazoan evolution, new protein–protein interactions (PPIs) can emerge subsequent to the shuffling of exons encoding domains that mediate such interactions [3]. Work by Bornberg‐Bauer and Mar Albà [4] refined and expanded these mechanisms, introduced new concepts such as intrinsically disordered regions, and implied links between the emergence of de novo domains and the appearance of de novo genes.
Here, we present a novel ‘evolution of protein domains’ (EvoProDom) model that describes the evolution of proteins based on the ‘mix and merge’ of protein domains. In assembling this model, we collected and integrated genomic and proteomic data from 109 organisms. These data included protein domain content and orthologous protein annotation. In EvoProDom, we accounted for evolutionary events, including translocations, namely, the reciprocal exchange of protein domains between orthologous proteins in different organisms. We found that protein domains that frequently appear in translocation events are generated by transcripts enriched in trans‐splicing events, that is, events in which transcripts are produced from the fusion of two distinct genes [5]. EvoProDom, devised as a general method, obtains orthologous protein annotation and protein domain content by predicting both from protein sequences using KoFamKOALA [6] and the Pfam search tool [7, 8]. The EvoProDom method can be applied in other research fields, such as proteomics [9] and protein design [10], as well as in assessing PPIs in host‐virus systems [11].

Materials and methods

The EvoProDom model is based on full genomic and annotated proteome data. In addition, the model utilizes orthologous protein annotation and protein domain content. Orthologous protein groups were used to group proteins (RefSeq entries) from different organisms, thereby linking protein domain changes among orthologous proteins with the corresponding groups of organisms. Orthologous proteins were represented as Kyoto Encyclopedia of Genes and Genomes (KEGG) orthologs (KOs) [12, 13]. Protein domain content was identified with Pfam domains, and this content was associated with proteins. Accordingly, orthologous proteins were considered as a group of proteins with the same KO number, and each protein was considered as a list of Pfam domains. Both KO assignments and Pfam domains of proteins were predicted from protein sequences alone, using KoFamKOALA [6] and the Pfam search tool [7, 8], respectively. Because protein domain content and orthologous protein annotation are obtained with these sequence‐based methods, new organisms can be added to EvoProDom easily. Finally, statistical analysis was performed using R (R: A language and environment for statistical computing, version 3.3.2, 2016).
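The data model described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (which used Perl, bash, and MySQL); the class name `Protein`, the function `group_by_ko`, and all accessions, KO numbers, and domain names are hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    """One protein, viewed as a list of Pfam domains (hypothetical layout)."""
    refseq_id: str   # placeholder accession, not a real RefSeq entry
    organism: str    # organism ID in the style of Table 1
    ko_number: str   # KO assignment (placeholder value)
    domains: tuple   # Pfam domain names, ordered along the sequence


def group_by_ko(proteins):
    """Group proteins that share a KO number into orthologous groups."""
    groups = defaultdict(list)
    for protein in proteins:
        groups[protein.ko_number].append(protein)
    return dict(groups)


# Placeholder records for illustration only.
proteins = [
    Protein("XP_0001", "hsa", "K00001", ("DomA", "DomB")),
    Protein("XP_0002", "mmu", "K00001", ("DomA",)),
    Protein("XP_0003", "hsa", "K00002", ("DomC",)),
]
ko_groups = group_by_ko(proteins)
```

Grouping by KO number places the domain lists of orthologous proteins from different organisms side by side, which is the comparison on which the model relies.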

Data resources

The EvoProDom model was tested on a collection of 109 organisms of which 84 (77.06%), 6 (5.50%), and 19 (17.43%) are Eukaryota, Bacteria, and Viruses, respectively, with fully described genomes and annotated proteomes (Entrez/NCBI [14]) (Table 1). These organisms were grouped as follows: (a) 15 fish; (b) four subterranean, eight fossorial, and 21 aboveground animals [15, 16]; (c) 65 organisms with known PPIs (BioGrid version 3.5.173 [17, 18]); (d) 17 organisms with HiC datasets; (e) four cats; and (f) 15 pathogenic organisms [19]. Organisms with HiC datasets were identified by searching for ‘HiC’ in the NCBI GEO database (Table 1). HiC is an NGS‐based, high‐throughput method belonging to the chromosome conformation capture (3C) family. This method captures the 3D organization of a genome within the nucleus by analysis of DNA contact frequencies, as estimated from HiC datasets [20].
Table 1

The EvoProDom model was applied to an assembly of organisms from diverse taxa belonging to superdomains, that is, Eukaryota, Viruses, and Bacteria. In total, 109 organisms were included in the ensemble and grouped as follows: (a) 15 fish; (b) four subterranean (S), eight fossorial (F), and 21 aboveground (A) animals (SFA) [15, 16]; (c) 65 organisms with known PPIs (BioGrid version 3.5.173, [17, 18]); (d) 17 organisms with HiC datasets (GEO_hic); (e) four cats; and (f) 15 pathogenic organisms [19]. Organisms with HiC datasets were obtained by searching for ‘HiC’ in the NCBI GEO database. Taxonomy ID, organism ID, organism name and common name are provided. Additionally, assembly and group classification are indicated. In addition, statistics for proteins and isoforms are included such that listed proteins are the longest isoforms and isoforms are alternative splicing variants. Total comprises both proteins and isoforms*. *Only proteins and isoforms with KO annotation are included. Organism ID is a 3–4 letter code, where the lowercase letter code corresponds to KEGG organisms and uppercase letters correspond to organisms not included in the KEGG database.

| Organism ID | Organism name | Super kingdom | Ecology | Common name | Source (a) | Assembly | Total | Proteins | Isoforms |
|---|---|---|---|---|---|---|---|---|---|
| aga | Anopheles gambiae PEST | Eukaryota | na | African malaria mosquito | biogrid_3.5.173 | GCF_000005575.2_AgamP3 | 6802 | 5928 | 874 |
| aju | Acinonyx jubatus | Eukaryota | na | Cheetah | Cats | GCF_001443585.1_aciJub1 | 19 242 | 13 018 | 6224 |
| ame | Apis mellifera | Eukaryota | na | honey bee | biogrid_3.5.173 | GCF_000002195.4_Amel_4.5 | 12 559 | 6016 | 6543 |
| ani | Emericella nidulans FGSC A4 | Eukaryota | na | Aspergillus nidulans | biogrid_3.5.173 | GCF_000149205.2_ASM14920v2 | 3925 | 3925 | 0 |
| ASM | Astyanax mexicanus | Eukaryota | na | Mexican tetra | Fish | GCF_000372685.2_Astyanax_mexicanus‐2.0 | 29 294 | 16 593 | 12 701 |
| ath | Arabidopsis thaliana | Eukaryota | na | Thale cress | GEO_hic, biogrid_3.5.173 | GCF_000001735.4_TAIR10.1 | 21 347 | 11 664 | 9683 |
| bsp | Bacillus subtilis PY79 | Bacteria | na | na | GEO_hic | GCF_000497485.1_ASM49748v1 | 2399 | 2399 | 0 |
| bsu | Bacillus subtilis 168 | Bacteria | na | na | biogrid_3.5.173 | GCF_002009135.1_ASM200913v1 | 2425 | 2425 | 0 |
| bta | Bos taurus | Eukaryota | A | Cattle | biogrid_3.5.173, SFA | GCF_002263795.1_ARS‐UCD1.2 | 46 970 | 15 444 | 31 526 |
| CAA | Carassius auratus | Eukaryota | na | Goldfish | Fish | GCF_003368295.1 ASM336829v1 | 66 282 | 34 815 | 31 467 |
| cal | Candida albicans SC5314 | Eukaryota | na | na | biogrid_3.5.173, Jones et al. 2008 | GCF_000182965.3_ASM18296v3 | 3419 | 3419 | 0 |
| CAP | Cavia porcellus | Eukaryota | A | Domestic guinea pig | biogrid_3.5.173, SFA | GCF_000151735.1 Cavpor3.0 | 27 511 | 14 502 | 13 009 |
| ccar | Cyprinus carpio | Eukaryota | na | Common carp | Fish | GCF_000951615.1_common_carp_genome | 32 539 | 24 182 | 8357 |
| ccr | Caulobacter vibrioides | Bacteria | na | na | GEO_hic | GCF_000006905.1_ASM690v1 | 1994 | 1994 | 0 |
| cel | Caenorhabditis elegans | Eukaryota | na | Nematode | GEO_hic, biogrid_3.5.173 | GCF_000002985.6_WBcel235 | 7918 | 5462 | 2456 |
| cfa | Canis familiaris | Eukaryota | na | Dog | biogrid_3.5.173 | GCF_000002285.3_CanFam3.1 | 41 761 | 14 307 | 27 454 |
| cge | Cricetulus griseus | Eukaryota | A | Chinese hamster | biogrid_3.5.173, SFA | GCF_000419365.1_C_griseus_v1.0 | 23 914 | 14 931 | 8983 |
| CHA | Chrysochloris asiatica | Eukaryota | S | Cape golden mole | SFA | GCF_000296735.1_ChrAsi1.0 | 19 180 | 14 764 | 4416 |
| CHL | Chinchilla lanigera | Eukaryota | A | Long‐tailed chinchilla | SFA | GCF_000276665.1_ChiLan1.0 | 32 225 | 14 466 | 17 759 |
| COC | Condylura cristata | Eukaryota | F | Star‐nosed mole | SFA | GCF_000260355.1_ConCri1.0 | 21 431 | 12 911 | 8520 |
| COG | Cottoperca gobio | Eukaryota | na | Channel bull blenny | Fish | GCF_900634415.1 fCotGob3.1 | 27 249 | 15 024 | 12 225 |
| cre | Chlamydomonas reinhardtii | Eukaryota | na | Green algae | biogrid_3.5.173 | GCF_000002595.1_v3.0 | 3874 | 3835 | 39 |
| csab | Chlorocebus sabaeus | Eukaryota | na | Green monkey | biogrid_3.5.173 | GCF_000409795.2_Chlorocebus_sabeus_1.1 | 44 091 | 14 550 | 29 541 |
| DAN | Dasypus novemcinctus | Eukaryota | F | Nine‐banded armadillo | SFA | GCF_000208655.1_Dasnov3.0 | 26 476 | 15 213 | 11 263 |
| ddi | Dictyostelium discoideum AX4 | Eukaryota | na | na | biogrid_3.5.173 | GCF_000004695.1_dicty_2.7 | 4517 | 4508 | 9 |
| DIO | Dipodomys ordii | Eukaryota | F | Ord's kangaroo rat | SFA | GCF_000151885.1_Dord_2.0 | 21 281 | 14 129 | 7152 |
| dme | Drosophila melanogaster | Eukaryota | A | Fruit fly | SFA, GEO_hic, biogrid_3.5.173 | GCF_000001215.4_Release_6_plus_ISO1_MT | 15 749 | 6630 | 9119 |
| dre | Danio rerio | Eukaryota | na | Zebrafish | Fish, biogrid_3.5.173, GEO_hic | GCF_000002035.6_GRCz11 | 37 274 | 17 375 | 19 899 |
| ecb | Equus caballus | Eukaryota | na | Horse | biogrid_3.5.173 | GCF_002863925.1_EquCab3.0 | 44 295 | 15 529 | 28 766 |
| eco | Escherichia coli str. K‐12 substr. MG1655 | Bacteria | na | na | biogrid_3.5.173 | GCF_001566335.1_ASM156633v1 | 3194 | 3194 | 0 |
| ECTE | Echinops telfairi | Eukaryota | A | Small Madagascar hedgehog | SFA | GCF_000313985.1 EchTel2.0 | 16 955 | 13 827 | 3128 |
| ELE | Elephantulus edwardii | Eukaryota | A | Cape elephant shrew | SFA | GCF_000299155.1 EleEdw1.0 | 18 981 | 15 255 | 3726 |
| ERE | Erinaceus europaeus | Eukaryota | A | Western European hedgehog | SFA | GCF_000296755.1_EriEur2.0 | 21 873 | 14 153 | 7720 |
| fca | Felis catus | Eukaryota | A | Domestic cat | SFA, Cats | GCF_000181335.3_Felis_catus_9.0 | 39 855 | 14 572 | 25 283 |
| FUD | Fukomys damarensis | Eukaryota | S | Damara mole‐rat | SFA | GCF_000743615.1_DMR_v1.0 | 31 386 | 14 138 | 17 248 |
| gga | Gallus gallus | Eukaryota | A | Chicken | biogrid_3.5.173, SFA, GEO_hic | GCF_000002315.5_GRCg6a | 35 502 | 11 947 | 23 555 |
| gmx | Glycine max | Eukaryota | na | Soybean | biogrid_3.5.173 | GCF_000004515.4_Glycine_max_v2.0 | 32 653 | 21 054 | 11 599 |
| HCV | Hepatitis C virus | Viruses | na | HCV | biogrid_3.5.173, Jones et al. 2008 | GCF_000861845.1_ViralProj15432 | 1 | 1 | 0 |
| hgl | Heterocephalus glaber | Eukaryota | S | Naked mole‐rat | SFA | GCF_000247695.1_HetGla_female_1.0 | 31 478 | 14 565 | 16 913 |
| HHV1 | Human Herpesvirus 1 | Viruses | na | Herpes simplex virus type 1 | biogrid_3.5.173, Jones et al. 2008 | GCF_000859985.2_ViralProj15217 | 27 | 27 | 0 |
| HHV2 | Human Herpesvirus 2 | Viruses | na | HHV2 | biogrid_3.5.173 | GCF_000858385.2_ViralProj15218 | 27 | 27 | 0 |
| HHV3 | Human Herpesvirus 3 | Viruses | na | Varicella‐zoster virus | biogrid_3.5.173, Jones et al. 2008 | GCF_000858285.1_ViralProj15198 | 6 | 6 | 0 |
| HHV4 | Human gammaherpesvirus 4 | Viruses | na | EBV | GEO_hic, biogrid_3.5.173 | GCF_002402265.1_Decoy | 21 | 19 | 2 |
| HHV5 | Human Herpesvirus 5 | Viruses | na | Human cytomegalovirus | biogrid_3.5.173, Jones et al. 2008 | GCF_000845245.1_ViralProj14559 | 16 | 16 | 0 |
| HHV6A | Human Herpesvirus 6A | Viruses | na | HHV6A | biogrid_3.5.173 | GCF_000845685.1_ViralProj14462 | 5 | 5 | 0 |
| HHV6B | Human Herpesvirus 6B | Viruses | na | HHV6B | biogrid_3.5.173 | GCF_000846365.1_ViralProj14422 | 5 | 5 | 0 |
| HHV7 | Human Herpesvirus 7 | Viruses | na | HHV7 | biogrid_3.5.173, Jones et al. 2008 | GCF_000848125.1_ViralProj14625 | 4 | 4 | 0 |
| HHV8 | Human gammaherpesvirus 8 | Viruses | na | KSHV | GEO_hic, biogrid_3.5.173, Jones et al. 2008 | GCF_000838265.1_ViralProj14158 | 8 | 8 | 0 |
| HIV1 | Human Immunodeficiency Virus 1 | Viruses | na | HIV1 | biogrid_3.5.173, Jones et al. 2008 | GCF_000864765.1_ViralProj15476 | 5 | 5 | 0 |
| HIV2 | Human Immunodeficiency Virus 2 | Viruses | na | HIV2 | biogrid_3.5.173, Jones et al. 2008 | GCF_000856385.1_ViralProj14991 | 5 | 5 | 0 |
| HPV10 | Human papillomavirus type 10 | Viruses | na | HPV10 | biogrid_3.5.173, Jones et al. 2008 | GCF_000864905.1_ViralProj15504 | 7 | 6 | 1 |
| HPV16 | Human papillomavirus 16 | Viruses | na | HPV16 | biogrid_3.5.173, GEO_hic, Jones et al. 2008 | GCF_000863945.3_ViralProj15505 | 7 | 7 | 0 |
| HPV6b | Human papillomavirus type 6b | Viruses | na | HPV6b | biogrid_3.5.173, Jones et al. 2008 | GCF_000861945.1_ViralProj15454 | 6 | 6 | 0 |
| hsa | Homo sapiens | Eukaryota | A | Human | biogrid_3.5.173, SFA, GEO_hic | GCF_000001405.37_GRCh38.p11 | 76 306 | 14 484 | 61 822 |
| ICT | Ictidomys tridecemlineatus | Eukaryota | F | Thirteen‐lined ground squirrel | SFA | GCF_000236235.1 SpeTri2.0 | 28 828 | 14 776 | 14 052 |
| lav | Loxodonta africana | Eukaryota | A | African savanna elephant | SFA | GCF_000001905.1_Loxafr3.0 | 30 929 | 15 683 | 15 246 |
| lcf | Lates calcarifer | Eukaryota | na | Barramundi perch | Fish | GCF_001640805.1_ASM164080v1 | 31 308 | 17 416 | 13 892 |
| lcm | Latimeria chalumnae | Eukaryota | na | Coelacanth | Fish | GCF_000225785.1_LatCha1 | 22 318 | 13 088 | 9230 |
| LEO | Lepisosteus oculatus | Eukaryota | na | Spotted gar | Fish | GCF_000242695.1 LepOcu1 | 28 773 | 12 422 | 16 351 |
| MAM | Macaca mulatta | Eukaryota | na | Rhesus monkey | biogrid_3.5.173, GEO_hic | GCF_003339765.3 Mmul_10 | 49 563 | 15 063 | 34 500 |
| MARM | Marmota marmota | Eukaryota | F | European marmot | SFA | GCF_001458135.1 marMar2.1 | 23 284 | 15 082 | 8202 |
| mge | Mycoplasma genitalium | Bacteria | na | na | Jones et al. 2008 | GCF_000027325.1_ASM2732v1 | 265 | 265 | 0 |
| mgp | Meleagris gallopavo | Eukaryota | na | Turkey | biogrid_3.5.173 | GCF_000146605.2_Turkey_5.0 | 20 631 | 11 045 | 9586 |
| MIO | Microtus ochrogaster | Eukaryota | F | prairie vole | SFA | GCF_000317375.1_MicOch1.0 | 23 045 | 14 950 | 8095 |
| mmu | Mus musculus | Eukaryota | A | House mouse | biogrid_3.5.173, SFA, GEO_hic | GCF_000001635.26_GRCm38.p6 | 54 095 | 15 939 | 38 156 |
| mtv | Mycobacterium tuberculosis H37Rv | Bacteria | na | na | biogrid_3.5.173, Jones et al. 2008 | GCF_000195955.2_ASM19595v2 | 1874 | 1874 | 0 |
| ncc | Notothenia coriiceps | Eukaryota | na | Black rockcod | Fish | GCF_000735185.1_NC01 | 17 089 | 12 300 | 4789 |
| ncr | Neurospora crassa OR74A | Eukaryota | na | na | biogrid_3.5.173 | GCF_000182925.2_NC12 | 4303 | 3798 | 505 |
| NEL | Neotoma lepida | Eukaryota | A | Desert woodrat | SFA | GCF_001675575.1 ASM167557v1 | 11 060 | 11 060 | 0 |
| nfu | Nothobranchius furzeri | Eukaryota | na | Turquoise killifish | Fish | GCF_001465895.1_Nfu_20140520 | 25 760 | 15 051 | 10 709 |
| ngi | Nannospalax galili | Eukaryota | S | Upper Galilee mountains blind mole rat | SFA | GCF_000622305.1_S.galili_v1.0 | 28 587 | 15 163 | 13 424 |
| nle | Nomascus leucogenys | Eukaryota | na | Northern white‐cheeked gibbon | GEO_hic | GCF_000146795.2_Nleu_3.0 | 27 130 | 14 001 | 13 129 |
| nto | Nicotiana tomentosiformis | Eukaryota | na | Tobacco | biogrid_3.5.173 | GCF_000390325.2_Ntom_v01 | 21 031 | 12 501 | 8530 |
| oaa | Ornithorhynchus anatinus | Eukaryota | A | platypus | SFA | GCF_000002275.2_Ornithorhynchus_anatinus_5.0.1 | 13 803 | 10 377 | 3426 |
| oas | Ovis aries | Eukaryota | na | Sheep | biogrid_3.5.173 | GCF_000298735.2_Oar_v4.0 | 31 319 | 14 663 | 16 656 |
| OCD | Octodon degus | Eukaryota | F | Degu | SFA | GCF_000260255.1_OctDeg1.0 | 20 663 | 15 343 | 5320 |
| ocu | Oryctolagus cuniculus | Eukaryota | na | Rabbit | biogrid_3.5.173, GEO_hic | GCF_000003625.3_OryCun2.0 | 27 567 | 14 450 | 13 117 |
| ola | Oryzias latipes | Eukaryota | na | Japanese medaka | Fish | GCF_002234675.1_ASM223467v1 | 31 537 | 15 135 | 16 402 |
| ORA | Orycteropus afer afer | Eukaryota | A | Aardvark | SFA | GCF_000298275.1_OryAfe1.0 | 19 243 | 14 511 | 4732 |
| ORM | Oryzias melastigma | Eukaryota | na | Indian medaka | Fish | GCF_002922805.1 Om_v0.7.RACA | 29 506 | 15 615 | 13 891 |
| osa | Oryza sativa Japonica | Eukaryota | na | Rice | biogrid_3.5.173 | GCF_001433935.1_IRGSP‐1.0 | 18 258 | 12 404 | 5854 |
| PAP | Panthera pardus | Eukaryota | na | Leopard | Cats | GCF_001857705.1_PanPar1.0 | 42 102 | 14 693 | 27 409 |
| PEF | Perca flavescens | Eukaryota | na | Yellow perch | Fish | GCF_004354835.1 PFLA_1.0 | 30 056 | 16 335 | 13 721 |
| PEM | Peromyscus maniculatus bairdii | Eukaryota | A | Prairie deer mouse | SFA | GCF_000500345.1_Pman_1.0 | 33 249 | 15 592 | 17 657 |
| pfa | Plasmodium falciparum 3D7 | Eukaryota | na | Malaria parasite P. falciparum | biogrid_3.5.173, Jones et al. 2008 | GCF_000002765.4_ASM276v2 | 2001 | 1973 | 28 |
| phu | Pediculus humanus corporis | Eukaryota | na | Human body louse | biogrid_3.5.173 | GCF_000006295.1_JCVI_LOUSE_1.0 | 5292 | 5290 | 2 |
| pret | Poecilia reticulata | Eukaryota | na | Guppy | Fish | GCF_000633615.1_Guppy_female_1.0_MT | 30 412 | 15 280 | 15 132 |
| ptg | Panthera tigris altaica | Eukaryota | na | Tiger | Cats | GCF_000464555.1_PanTig1.0 | 21 205 | 13 229 | 7976 |
| ptr | Pan troglodytes | Eukaryota | A | Chimpanzee | biogrid_3.5.173, SFA | GCF_002880755.1_Clint_PTRv2 | 57 743 | 14 939 | 42 804 |
| rcu | Ricinus communis | Eukaryota | na | castor bean | biogrid_3.5.173 | GCF_000151685.1_JCVI_RCG_1.1 | 14 121 | 10 018 | 4103 |
| rno | Rattus norvegicus | Eukaryota | A | Norway rat | biogrid_3.5.173, SFA | GCF_000001895.5_Rnor_6.0 | 40 251 | 16 426 | 23 825 |
| sasa | Salmo salar | Eukaryota | na | Atlantic salmon | Fish | GCF_000233375.1_ICSASG_v2 | 63 095 | 28 784 | 34 311 |
| sce | Saccharomyces cerevisiae S288c | Eukaryota | na | Baker's yeast | biogrid_3.5.173, GEO_hic | GCF_000146045.2_R64 | 3588 | 3588 | 0 |
| SIV | Simian Immunodeficiency Virus | Viruses | na | SIV | biogrid_3.5.173 | GCF_000863925.1_ViralProj15501 | 4 | 4 | 0 |
| sly | Solanum lycopersicum | Eukaryota | na | Tomato | biogrid_3.5.173 | GCF_000188115.3_SL2.50 | 17 131 | 11 914 | 5217 |
| smo | Selaginella moellendorffii | Eukaryota | na | na | biogrid_3.5.173 | GCF_000143415.4_v1.0 | 19 207 | 14 135 | 5072 |
| SOA | Sorex araneus | Eukaryota | A | European shrew | SFA | GCF_000181275.2 SorAra2.0 | 17 318 | 14 075 | 3243 |
| sot | Solanum tuberosum | Eukaryota | na | Potato | biogrid_3.5.173 | GCF_000226075.1_SolTub_3.0 | 17 431 | 12 698 | 4733 |
| spo | Schizosaccharomyces pombe | Eukaryota | na | Fission yeast | GEO_hic, biogrid_3.5.173 | GCF_000002945.1_ASM294v2 | 3053 | 3053 | 0 |
| spu | Strongylocentrotus purpuratus | Eukaryota | na | Purple sea urchin | biogrid_3.5.173 | GCF_000002235.4_Spur_4.2 | 12 330 | 8773 | 3557 |
| ssc | Sus scrofa | Eukaryota | A | Pig | biogrid_3.5.173, SFA | GCF_000003025.6_Sscrofa11.1 | 47 355 | 15 297 | 32 058 |
| SV40 | Simian Virus 40 | Viruses | na | Macaca mulatta polyomavirus 1 | biogrid_3.5.173 | GCF_000837645.1_ViralProj14024 | 1 | 1 | 0 |
| TMV | Tobacco Mosaic Virus | Viruses | na | TMV | biogrid_3.5.173 | GCF_000854365.1_ViralProj15071 | 0 | 0 | 0 |
| URP | Urocitellus parryii | Eukaryota | F | Arctic ground squirrel | SFA | GCF_003426925.1 ASM342692v1 | 27 132 | 14 370 | 12 762 |
| USM | Ustilago maydis 521 | Eukaryota | na | na | biogrid_3.5.173 | GCF_000328475.2 Umaydis521_2.0 | 3265 | 3257 | 8 |
| VAV | Vaccinia Virus | Viruses | na | na | biogrid_3.5.173 | GCF_000860085.1_ViralProj15241 | 24 | 24 | 0 |
| vvi | Vitis vinifera | Eukaryota | na | Wine grape | biogrid_3.5.173 | GCF_000003745.3_12X | 20 127 | 12 120 | 8007 |
| xla | Xenopus laevis | Eukaryota | na | African clawed frog | biogrid_3.5.173 | GCF_001663975.1_Xenopus_laevis_v2 | 40 278 | 21 671 | 18 607 |
| zma | Zea mays | Eukaryota | na | Maize | biogrid_3.5.173 | GCF_000005005.2_B73_RefGen_v4 | 24 391 | 14 736 | 9655 |

(a) Jones et al. 2008 [19].


Orthologous protein annotation

Orthologous annotation was based on KEGG orthologs, or KO groups [12, 13]. Proteins were assigned to KO groups using KoFamKOALA, a Hidden Markov Model (HMM) profile‐based search tool [6]. To this end, an in‐house script was written to automatically assign proteins to KO groups with KoFamKOALA, based on protein sequence. Only proteins with a unique KO annotation were retained. Additionally, an organism code was generated by selecting 3–4 letters from an organism's name in uppercase format (a lowercase code represents organisms from the KEGG database; Table 1).
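The retention of proteins with a unique KO annotation can be sketched as follows. This is a minimal illustration and not the authors' in‐house script; the input format (a mapping from protein ID to a list of KO numbers), the function name `keep_unique_ko`, and all identifiers are assumptions and do not reflect the KoFamKOALA output format.

```python
def keep_unique_ko(assignments):
    """Keep only proteins that received exactly one distinct KO number.

    `assignments` maps a protein ID to the list of KO numbers reported
    for it (hypothetical layout, not the tool's native output).
    """
    unique = {}
    for protein_id, ko_numbers in assignments.items():
        distinct = set(ko_numbers)
        if len(distinct) == 1:
            unique[protein_id] = next(iter(distinct))
    return unique


# Placeholder assignments for illustration only.
assignments = {
    "prot_1": ["K00001"],            # unique KO: retained
    "prot_2": ["K00002", "K00003"],  # ambiguous: dropped
    "prot_3": [],                    # no KO: dropped
}
unique_ko = keep_unique_ko(assignments)
```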

Protein domain detection

Pfam (release 32.0, http://pfam.xfam.org/about) domains were predicted from protein sequences, using a Hidden Markov Model (HMM)‐based search tool [7, 8]. Accordingly, protein domain content was derived from protein sequences using an in‐house script. Additionally, each Pfam domain was classified based on its membership in a super‐family (‘clan’ as per Pfam nomenclature). These data were added to the protein domain content of every protein.
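Adding super‐family membership to the domain content of a protein amounts to a lookup, sketched below under stated assumptions. The mapping `domain_to_clan`, the function `annotate_with_clan`, and the domain and clan names are hypothetical; labeling domains without a clan as "Unknown" is an assumption of this sketch.

```python
def annotate_with_clan(domains, domain_to_clan, default="Unknown"):
    """Pair each Pfam domain with its super-family (clan), if it has one."""
    return [(domain, domain_to_clan.get(domain, default)) for domain in domains]


# Placeholder mapping for illustration only.
domain_to_clan = {"DomA": "ClanX", "DomB": "ClanX"}
annotated = annotate_with_clan(["DomA", "DomC"], domain_to_clan)
```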

EvoProDomDB

Genomic and proteomic data, along with orthologous protein and protein domain content data, were collated through shared fields. The resulting relational database, EvoProDomDB, was written in MySQL on MariaDB (10.0.26, https://mariadb.org/about/) to enable efficient searching. The EvoProDom model was implemented and tested on this database. EvoProDomDB holds orthologous protein annotation and protein domain content for 2 190 207 protein products (1 123 544 full length and 1 066 663 isoforms) (Table 1), which are distributed among 23 147 KO groups and contain 17 929 unique Pfam domains. The Pfam domains are distributed among 629 super‐families. In total, EvoProDomDB integrates data for 109 organisms from diverse taxa. EvoProDomDB was built from six relational tables sharing common fields, for example, organism identity (Fig. 1). Four relational tables, taxonomy, ko_annotation, clan_domain, and pfam_domain, provide annotation data: taxonomy ranks (for example, genus and species), KO assignments, super‐family descriptions, and domain descriptions, respectively.
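A six‐table layout of this kind can be sketched with Python's built‐in sqlite3 module, used here only as a stand‐in for MySQL on MariaDB. Table names follow the text where they are given; the name `pfam_data`, the key columns, and every column not named in the text are assumptions, and the inserted rows are placeholders.

```python
import sqlite3

# Hypothetical schema: only the table names taxonomy, ko_annotation,
# clan_domain, pfam_domain, and org_protein_annotation come from the text.
SCHEMA = """
CREATE TABLE taxonomy (organism_id TEXT PRIMARY KEY, genus TEXT, species TEXT);
CREATE TABLE ko_annotation (ko_number TEXT PRIMARY KEY, description TEXT);
CREATE TABLE clan_domain (clan_id TEXT PRIMARY KEY, description TEXT);
CREATE TABLE pfam_domain (
    pfam_id TEXT PRIMARY KEY,
    clan_id TEXT REFERENCES clan_domain(clan_id),
    description TEXT
);
CREATE TABLE org_protein_annotation (
    refseq_id TEXT PRIMARY KEY,
    organism_id TEXT REFERENCES taxonomy(organism_id),
    gene_symbol TEXT, chromosome TEXT, strand TEXT,
    protein_length INTEGER, protein_description TEXT,
    ko_number TEXT REFERENCES ko_annotation(ko_number),
    isoform TEXT
);
CREATE TABLE pfam_data (
    refseq_id TEXT REFERENCES org_protein_annotation(refseq_id),
    pfam_id TEXT REFERENCES pfam_domain(pfam_id),
    envfrom INTEGER, envto INTEGER, score REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder rows for illustration only.
conn.execute(
    "INSERT INTO org_protein_annotation VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("XP_0001", "hsa", "GENE1", "1", "+", 300, "placeholder protein",
     "K00001", "XP_0001"),
)
conn.executemany(
    "INSERT INTO pfam_data VALUES (?, ?, ?, ?, ?)",
    [("XP_0001", "PF_A", 10, 90, 55.0), ("XP_0001", "PF_B", 120, 200, 40.0)],
)

# Join protein annotation with domain content through the shared refseq_id.
rows = conn.execute(
    """
    SELECT p.ko_number, p.organism_id, d.pfam_id
    FROM org_protein_annotation AS p
    JOIN pfam_data AS d ON d.refseq_id = p.refseq_id
    ORDER BY p.ko_number, p.organism_id, d.envfrom
    """
).fetchall()
```

The join on a shared field mirrors how the tables are described as being linked; the actual keys used in EvoProDomDB are not stated in the text.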
Fig. 1

The MySQL schema for EvoProDomDB. Six relational tables were included. Of these, four contain annotation data: taxonomy ranks, for example, genus and species (taxonomy); KO descriptions (ko_annotation); super‐family descriptions (clan_domain); and Pfam domain descriptions (pfam_domain). The two main relational tables contain protein genomic and proteomic data (org_protein_annotation) and protein domain content (Pfam data; see the main text for details).

Protein genomic and proteomic data, along with protein domain content, were included in the relational tables as org_protein_annotation and Pfam data, respectively. The genomic and proteomic data included, for example, gene_symbol, chromosome, strand, refseq_id, protein length, and protein description. To these data, the KO number was added (ko_number). Proteomic and genomic data were uniquely linked by the identifier of the longest isoform (isoform). Protein domain content comprised standard Pfam domains, as retrieved from the Pfam search tool output [7, 8], and computed data that identified nonoverlapping Pfam domains with maximal score (putative domains), delimited by ‘envfrom’ and ‘envto’ coordinates. These coordinates delineate the largest region within the protein sequence in which a Pfam domain was predicted. A unique putative domain refers to the highest‐scoring domain among multiple copies of the same putative domain. To collect these data, standard and custom scripts were combined into a pipeline, which included construction of EvoProDomDB, using in‐house bash and Perl scripts. The EvoProDom model was implemented in Perl and bash scripts, with MySQL queries to retrieve data from EvoProDomDB. These data sources and databases are summarized in the study workflow (Fig. 2).
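One way to select nonoverlapping domains with maximal score is a greedy pass over hits sorted by score, sketched below. The text does not state which selection procedure was used, so the greedy strategy is an assumption, as are the names `DomainHit`, `select_putative`, and `unique_putative` and all hit values.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    name: str      # Pfam domain name (placeholder values below)
    envfrom: int   # envelope start, inclusive
    envto: int     # envelope end, inclusive
    score: float   # hit score


def overlaps(a, b):
    """Return True if the envelopes of two hits share at least one residue."""
    return a.envfrom <= b.envto and b.envfrom <= a.envto


def select_putative(hits):
    """Greedy selection: visit hits by decreasing score and keep a hit
    only if it overlaps none of the hits kept so far."""
    kept = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if not any(overlaps(hit, other) for other in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.envfrom)


def unique_putative(putative):
    """Keep the highest-scoring copy of each putative domain."""
    best = {}
    for hit in putative:
        if hit.name not in best or hit.score > best[hit.name].score:
            best[hit.name] = hit
    return best


# Placeholder hits for one protein.
hits = [
    DomainHit("DomA", 10, 90, 55.0),
    DomainHit("DomB", 80, 150, 30.0),   # overlaps the first DomA hit
    DomainHit("DomA", 200, 280, 42.0),  # second, lower-scoring DomA copy
    DomainHit("DomC", 300, 350, 12.5),
]
putative = select_putative(hits)
unique = unique_putative(putative)
```

In this example the DomB hit is discarded because it overlaps a higher‐scoring DomA envelope, and the first DomA copy is retained as the unique putative domain.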
Fig. 2

Study workflow: A collection of 109 organisms was used to implement and test the EvoProDom model. The collection included six categories: (a) 15 fish; (b) four subterranean, eight fossorial and 21 aboveground animals [15, 16]; (c) 65 organisms with known PPIs (BioGrid version 3.5.173, [17, 18]); (d) 17 organisms with HiC datasets; (e) four cats; and (f) 15 pathogenic organisms [19]. Protein domains were predicted using the Pfam (release 32.0) database, along with the search tool [7, 8]. Orthologous proteins were defined as belonging to a KEGG [12, 13] ortholog (KO) group. Assignment to a KO group was obtained using KofamKOALA [6].


Results

The EvoProDom model

We hypothesized that proteins evolve by means of ‘mix and merge’ or ‘shuffling’ of protein domains, which correspond to distinct functional units [1, 21, 22]. The evolutionary model that describes protein evolution as a function of protein domain dynamics was termed EvoProDom. The EvoProDom model defines and formulates standard evolutionary mechanisms, such as translocations, duplications, and indel (insertion and deletion) events, which acted upon protein domains that are recognized as Pfam domains [7, 8]. According to the EvoProDom model, proteins gained or lost function due to the respective presence or absence of function‐conferring domains. Accordingly, proteins were modeled as sets of protein domains and evolutionary events, such as translocations, were defined. These describe the gain and loss of particular domains among domain sets or DAs. The KEGG database catalogs diverse taxa and creates groups of orthologous proteins (KOs) based on shared function. Thus, all members of a KO group are orthologous proteins [6, 12, 13]. In the EvoProDom model, proteins were assigned to KO groups (see Materials and methods). Consequently, translocation events were mapped to groups of organisms according to underlying changes in DAs. Thus, evolutionary events, which acted upon domains and are manifested as changes in DAs, are reflected at the organism level. A link between changes at these two levels was, therefore, established. The EvoProDom model was implemented with and tested on the EvoProDomDB (see Materials and methods). In total, 6286 translocation events, involving 94 protein super‐families, were found (Table 2, Tables S1 and S2). This result indicates the existence of multiple evolutionary translocation events, as defined by the model.
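The link between changes in domain content and groups of organisms can be illustrated with a minimal sketch. It covers only the presence or absence of a domain across the organisms of one KO group and does not reproduce the model's formal definition of a translocation, which is not given in this section. The function names and the example data are hypothetical.

```python
from collections import defaultdict


def domain_presence(ko_members):
    """Map each domain to the organisms whose protein carries it.

    `ko_members` maps an organism ID to the set of domains found in its
    protein within a single KO group (hypothetical layout).
    """
    presence = defaultdict(set)
    for organism, domains in ko_members.items():
        for domain in domains:
            presence[domain].add(organism)
    return presence


def variable_domains(ko_members):
    """Report domains present in some, but not all, organisms of the group,
    together with the organisms that carry and that lack each domain."""
    all_organisms = set(ko_members)
    changes = {}
    for domain, present in domain_presence(ko_members).items():
        absent = all_organisms - present
        if absent:
            changes[domain] = (present, absent)
    return changes


# Placeholder KO group: DomB is missing from the 'dre' protein.
ko_group = {
    "hsa": {"DomA", "DomB"},
    "mmu": {"DomA", "DomB"},
    "dre": {"DomA"},
}
changes = variable_domains(ko_group)
```

Each reported domain comes with two organism sets, so a change in domain content within a KO group is expressed directly at the organism level.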
Table 2

Translocation events per super‐family (counts). Translocations are characterized by mobile domains in organism groups, which are classified based on superdomain taxonomy*. An organism group is assigned a representative superdomain if all of its organisms share the same superdomain taxonomy. Otherwise, it is assigned as ‘Mixed’. Finally, translocations are classified based on the superdomain classification of the organism groups involved (Translocation class), for example, Eukaryota‐Eukaryota, which represents the majority of translocations (over 99%). The most frequent clan for Eukaryota‐Eukaryota is Ig. Related to Tables S1 and S2. *Superdomain taxa are Eukaryota, Viruses, and Bacteria. Super‐family annotation is provided (Super family description).

| Translocation class | Super family Id | Super family name | Counts | Super family description |
|---|---|---|---|---|
| Eukaryota‐Eukaryota | 0011.26 | Ig | 1144 | Immunoglobulin superfamily |
| Eukaryota‐Eukaryota | 0010.21 | SH3 | 630 | Src homology‐3 domain |
| Eukaryota‐Eukaryota | 0465.3 | Ank | 529 | Ankyrin repeat superfamily |
| Eukaryota‐Eukaryota | 0001.27 | EGF | 414 | EGF superfamily |
| Eukaryota‐Eukaryota | 0361.4 | C2H2‐zf | 390 | Classical C2H2 and C2HC zinc fingers |
| Eukaryota‐Eukaryota | 0022.32 | LRR | 282 | Leucine Rich Repeat |
| Eukaryota‐Eukaryota | 0020.25 | TPR | 246 | Tetratrico peptide repeat superfamily |
| Eukaryota‐Eukaryota | 0229.11 | RING | 242 | Ring‐finger/U‐box superfamily |
| Eukaryota‐Eukaryota | 0186.14 | Beta_propeller | 222 | Beta propeller clan |
| Eukaryota‐Eukaryota | 0221.11 | RRM | 210 | RRM‐like clan |
| Eukaryota‐Eukaryota | 9999.0 | Unknown | 208 | null |
| Eukaryota‐Eukaryota | 0159.16 | E‐set | 187 | Ig‐like fold superfamily (E‐set) |
| Eukaryota‐Eukaryota | 0466.3 | PDZ‐like | 165 | PDZ domain‐like peptide‐binding superfamily |
| Eukaryota‐Eukaryota | 0016.22 | PKinase | 164 | Protein kinase superfamily |
| Eukaryota‐Eukaryota | 0266.9 | PH | 141 | PH domain‐like superfamily |
| Eukaryota‐Eukaryota | 0023.34 | P‐loop_NTPase | 121 | P‐loop containing nucleoside triphosphate hydrolase superfamily |
| Eukaryota‐Eukaryota | 0220.12 | EF_hand | 115 | EF‐hand like superfamily |
| Eukaryota‐Eukaryota | 0511.3 | Retroviral_zf | 95 | Retrovirus zinc finger‐like domains |
| Eukaryota‐Eukaryota | 0271.7 | F‐box | 79 | F‐box‐like domain |
| Eukaryota‐Eukaryota | 0003.21 | SAM | 74 | Sterile Alpha Motif (SAM) domain |
| Eukaryota‐Eukaryota | 0390.4 | zf‐FYVE‐PHD | 47 | FYVE/PHD zinc finger superfamily |
| Eukaryota‐Eukaryota | 0357.4 | SMAD‐FHA | 37 | SMAD/FHA domain superfamily |
| Eukaryota‐Eukaryota | 0063.25 | NADP_Rossmann | 37 | FAD/NAD(P)‐binding Rossmann fold Superfamily |
| Eukaryota‐Eukaryota | 0123.18 | HTH | 34 | Helix‐turn‐helix clan |
| Eukaryota‐Eukaryota | 0680.1 | WW | 34 | WW domain |
| Eukaryota‐Eukaryota | 0167.15 | Zn_Beta_Ribbon | 33 | Zinc beta‐ribbon |
| Eukaryota‐Eukaryota | 0006.20 | C1 | 25 | Protein kinase C, C1 domain |
| Eukaryota‐Eukaryota | 0306.4 | HeH | 24 | LEM/SAP HeH motif |
| Eukaryota‐Eukaryota | 0214.13 | UBA | 24 | UBA superfamily |
| Eukaryota‐Eukaryota | 0459.3 | BRCT‐like | 23 | BRCT like |
| Eukaryota‐Eukaryota | 0188.10 | CH | 23 | Calponin homology domain |
| Eukaryota‐Eukaryota | 0537.2 | CCCH_zf | 22 | CCCH‐zinc finger |
| Eukaryota‐Eukaryota | 0004.20 | Concanavalin | 20 | Concanavalin‐like lectin/glucanase superfamily |
| Eukaryota‐Eukaryota | 0072.20 | Ubiquitin | 19 | Ubiquitin superfamily |
| Eukaryota‐Eukaryota | 0033.14 | POZ | 17 | POZ domain superfamily |
| Eukaryota‐Eukaryota | 0154.11 | C2 | 11 | C2 superfamily |
| Eukaryota‐Eukaryota | 0007.18 | KH | 9 | K‐Homology (KH) domain Superfamily |
| Eukaryota‐Eukaryota | 0392.4 | Chaperone‐J | 8 | Chaperone J‐domain superfamily |
| Eukaryota‐Eukaryota | 0164.13 | CUB | 8 | CUB clan |
| Eukaryota‐Eukaryota | 0029.20 | Cupin | 8 | Cupin fold |
| Eukaryota‐Eukaryota | 0049.15 | Tudor | 8 | Tudor domain 'Royal family' |
| Eukaryota‐Eukaryota | 0172.17 | Thioredoxin | 8 | Thioredoxin‐like |
| Eukaryota‐Eukaryota | 0212.9 | SNARE | 8 | SNARE‐like superfamily |
| Eukaryota‐Eukaryota | 0124.15 | Peptidase_PA | 7 | Peptidase clan PA |
| Eukaryota‐Eukaryota | 0575.2 | EFTPs | 7 | Translation proteins of Elongation Factors superfamily |
| Eukaryota‐Eukaryota | 0137.15 | HAD | 7 | HAD superfamily |
| Eukaryota‐Eukaryota | 0021.18 | OB | 7 | OB fold |
| Eukaryota‐Eukaryota | 0364.4 | Leu‐IlvD | 7 | LeuD/IlvD‐like |
| Eukaryota‐Eukaryota | 0541.2 | SH2‐like | 6 | SH2, phosphotyrosine‐recognition domain superfamily |
| Eukaryota‐Eukaryota | 0671.1 | AAA_lid | 5 | AAA+ ATPase lid domain superfamily |
| Eukaryota‐Eukaryota | 0244.9 | PGBD | 5 | PGBD superfamily |
| Eukaryota‐Eukaryota | 0192.13 | GPCR_A | 5 | Family A G protein‐coupled receptor‐like superfamily |
| Eukaryota‐Eukaryota | 0173.11 | STIR | 5 | STIR superfamily |
| Eukaryota‐Eukaryota | 0602.2 | Kringle | 5 | Kringle/FnII superfamily |
| Eukaryota‐Eukaryota | 0642.1 | SOCS_box | 4 | SOCS‐box like superfamily |
| Eukaryota‐Eukaryota | 0178.16 | PUA | 4 | PUA/ASCH superfamily |
| Eukaryota‐Eukaryota | 0041.13 | Death | 4 | Death Domain Superfamily |
| Eukaryota‐Eukaryota | 0183.14 | PAS_Fold | 4 | PAS domain clan |
| Eukaryota‐Eukaryota | 0084.13 | ADP‐ribosyl | 3 | ADP‐ribosylation Superfamily |
| Eukaryota‐Eukaryota | 0015.20 | MFS | 3 | Major Facilitator Superfamily |
| Eukaryota‐Eukaryota | 0198.16 | HHH | 3 | Helix‐hairpin‐helix superfamily |
| Eukaryota‐Eukaryota | 0661.1 | Gain | 3 | GPCR autoproteolysis inducing |
| Eukaryota‐Eukaryota | 0497.3 | GST_C | 3 | Glutathione S‐transferase, C‐terminal domain |
| Eukaryota‐Eukaryota | 0030.16 | Ion_channel | 3 | Ion channel (VIC) superfamily |
| Eukaryota‐Eukaryota | 0107.12 | KOW | 2 | KOW domain |
| Eukaryota‐Eukaryota | 0492.3 | S4 | 2 | S4 domain superfamily |
| Eukaryota‐Eukaryota | 0055.13 | AMP‐binding_C | 2 | AMP‐binding enzyme C‐terminal domain superfamily |
| Eukaryota‐Eukaryota | 0055.13 | Nucleoplasmin | 2 | Nucleoplasmin‐like/VP (viral coat and capsid proteins) superfamily |
| Eukaryota‐Eukaryota | 0027.15 | RdRP | 2 | RNA‐dependent RNA polymerase |
| Eukaryota‐Eukaryota | 0202.11 | GBD | 2 | Galactose‐binding domain‐like superfamily |
| Eukaryota‐Eukaryota | 0028.22 | AB_hydrolase | 2 | Alpha/Beta hydrolase fold |
| Eukaryota‐Eukaryota | 0677.1 | GHMP_C | 1 | GHMP C‐terminal domain superfamily |
| Eukaryota‐Eukaryota | 0025.14 | His_Kinase_A | 1 | His Kinase A (phospho‐acceptor) domain |
| Eukaryota‐Eukaryota | 0088.16 | Alk_phosphatase | 1 | Alkaline phosphatase‐like |
| Eukaryota‐Eukaryota | 0607.2 | TNF_receptor | 1 | TNF receptor‐like superfamily |
| Mixed‐Mixed | 0070.13 | ACT | 1 | ACT‐like domain |
| Eukaryota‐Eukaryota | 0113.13 | GT‐B | 1 | Glycosyl transferase clan GT‐B |
| Eukaryota‐Eukaryota | 0449.3 | G‐PATCH | 1 | DExH‐box splicing factor binding site |
| Eukaryota‐Eukaryota | 0144.13 | Periplas_BP | 1 | Periplasmic binding protein like |
| Eukaryota‐Eukaryota | 0505.3 | Pentapeptide | 1 | Pentapeptide repeat |
| Eukaryota‐Eukaryota | 0547.2 | GF_recep_C‐rich | 1 | Growth factor receptor Cys‐rich |
| Eukaryota‐Mixed | 0021.18 | OB | 1 | OB fold |
| Eukaryota‐Eukaryota | 0026.20 | CU_oxidase | 1 | Multicopper oxidase‐like domain |
| Eukaryota‐Eukaryota | 0110.12 | GT‐A | 1 | Glycosyl transferase clan GT‐A |
Eukaryota‐Eukaryota0236.17PDDEXK1PD‐(D/E)XK nuclease superfamily
Eukaryota‐Eukaryota0672.1p351Baculovirus p35 protein superfamily
Eukaryota‐Eukaryota0125.15Peptidase_CA1Peptidase clan CA
Eukaryota‐Eukaryota0117.11uPAR_Ly6_toxin1uPAR/Ly6/CD59/snake toxin‐receptor superfamily
Eukaryota‐Eukaryota0005.27Kazal1Kazal like domain
Eukaryota‐Bacteria9999.0Unknown1null
Eukaryota‐Mixed9999.0Unknown1null
Eukaryota‐Eukaryota0196.12DSRM1DSRM‐like clan
Eukaryota‐Eukaryota0381.4Metallo‐HOrase1Metallo‐hydrolase/oxidoreductase superfamily
Eukaryota‐Eukaryota0114.12HMG‐box1HMG‐box like superfamily
Eukaryota‐Eukaryota0109.12CDA1Cytidine deaminase‐like (CDA) superfamily
Eukaryota‐Eukaryota0552.2Hect1Hect, E3 ligase catalytic domain
Eukaryota‐Eukaryota0426.4HRDC‐like1HRDC‐like superfamily
Eukaryota‐Eukaryota0630.1PSI1Plexin fold superfamily
Translocation events per superfamily (counts). Translocations are characterized by mobile domains in organism groups, which are classified by superdomain taxonomy*. An organism group is assigned a representative superdomain taxonomy if all of its organisms share the same superdomain taxonomy; otherwise, it is assigned as ‘Mixed’. Translocations are then classified by the superdomain assignments of the two organism groups (Translocation Class), for example, Eukaryota‐Eukaryota, which represents the majority of translocations (over 99%). The most frequent clan for Eukaryota‐Eukaryota is Ig. Related to Tables S1 and S2. *Superdomain taxa are Eukaryota, Viruses, and Bacteria. Super‐family annotation is provided (Super family Description).

Mapping of genes to proteins and alternative splicing

EvoProDom combines genomic information (genes) with proteins and, in turn, proteins with Pfam domain composition. In addition, proteins assigned to KO groups were also included [6, 12, 13]. Genes may map to more than one mRNA transcript and, in turn, to more than one protein product, each recognized by its RefSeq ID. These transcripts encode isoforms of a gene product and result from alternative splicing, that is, the differential inclusion of gene exons. Since protein domains mostly coincide with exons [1, 3, 5, 21], alternative splicing can change protein domain content, and such changes could be mistaken for changes in DAs that result from translocation events. Therefore, to avoid confounding effects of alternative splicing, only the longest isoform was used in the model (see Materials and methods). As such, each gene was associated with a single protein product.
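The longest-isoform rule can be illustrated with a minimal Python sketch. The record layout (gene ID, protein ID, protein length), the tie-breaking rule, and all identifiers below are assumptions made for illustration; they are not taken from EvoProDomDB.

```python
from typing import Dict, Iterable, Tuple


def longest_isoform_per_gene(
    isoforms: Iterable[Tuple[str, str, int]]
) -> Dict[str, str]:
    """Map each gene to the protein ID of its longest isoform.

    `isoforms` yields (gene_id, protein_id, protein_length) tuples.
    On ties, the isoform seen first is kept (an arbitrary choice here).
    """
    best: Dict[str, Tuple[str, int]] = {}
    for gene_id, protein_id, length in isoforms:
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (protein_id, length)
    return {gene: protein for gene, (protein, _) in best.items()}


# Hypothetical records: two isoforms of geneA and one of geneB.
records = [
    ("geneA", "XP_0001.1", 350),
    ("geneA", "XP_0002.1", 512),
    ("geneB", "NP_0003.1", 120),
]
selected = longest_isoform_per_gene(records)
print(selected)
```

After this step every gene maps to exactly one protein, which is the property the model relies on.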

Protein domain content

Overlapping domains within a protein are inconsistent with the linear structure of that protein. To resolve this issue, the highest scoring domain (the putative domain) was chosen from each group of overlapping domains. However, this procedure does not remove multiple copies of putative domains, whereas translocation events require a unique set of nonoverlapping putative domains. To this end, a similar procedure was applied to remove multiple copies of putative domains by choosing the copy with the maximal score; the resulting domains are subsequently referred to as unique putative domains.
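The two filtering steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions: hits are given as (name, start, end, score) records, and an "overlapping group" is taken to be a run of hits whose intervals overlap transitively. The `Hit` type, the function names, and the example hits are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    name: str
    start: int    # inclusive
    end: int      # inclusive
    score: float


def putative_domains(hits: List[Hit]) -> List[Hit]:
    """Keep the highest-scoring hit from each group of overlapping hits."""
    chosen: List[Hit] = []
    group: List[Hit] = []
    group_end = -1
    for hit in sorted(hits, key=lambda h: (h.start, h.end)):
        # A hit starting after the current group ends opens a new group.
        if group and hit.start > group_end:
            chosen.append(max(group, key=lambda h: h.score))
            group = []
        group.append(hit)
        group_end = max(group_end, hit.end)
    if group:
        chosen.append(max(group, key=lambda h: h.score))
    return chosen


def unique_putative_domains(putative: List[Hit]) -> List[Hit]:
    """Among multiple copies of the same domain, keep the highest-scoring copy."""
    best: Dict[str, Hit] = {}
    for hit in putative:
        if hit.name not in best or hit.score > best[hit.name].score:
            best[hit.name] = hit
    return sorted(best.values(), key=lambda h: h.start)


# Hypothetical hits on one protein.
hits = [
    Hit("SH3_1", 10, 60, 45.0),
    Hit("SH3_9", 12, 58, 30.0),   # overlaps the first hit and scores lower
    Hit("PH", 100, 200, 80.0),
    Hit("SH3_1", 250, 300, 50.0), # second, nonoverlapping copy of SH3_1
]
putative = putative_domains(hits)
unique = unique_putative_domains(putative)
print([h.name for h in putative])
print([h.name for h in unique])
```

The putative list keeps both SH3_1 copies, which is what duplication analysis needs, while the unique putative list keeps one entry per domain name, which is what mobile and translocation domains are defined on.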

The DA as a basic unit in EvoProDom

According to the EvoProDom model, evolutionary events, such as translocations and indels, operate on protein domains and on the organisms represented in orthologous groups, that is, on KOs and DAs. Therefore, EvoProDomDB organizes these data by DA. Briefly, each orthologous group (KO) was partitioned into distinct sets (items), each consisting of a list of domains (the DA) and the corresponding lists of proteins and organisms. Notably, duplicated organisms within these matched lists represent paralogous proteins. For each DA, gained and missing domains were determined from all DAs within a particular KO. Mobile domains and translocation domains, that is, mobile domains that had undergone translocation events, were determined from these data. In total, we found 6286 translocation events, involving 94 protein super‐families (Table 2, Tables S1 and S2). We identified 2042 mobile domains, of which 260 had undergone translocation and 1782 were involved in indel events (Tables S1 and S3).

Evolutionary mechanisms represented in EvoProDom

Implementation of DAs

First, DAs were generated from EvoProDomDB, while filtering for putative and unique putative domains (see Materials and methods). Each DA was uniquely identified by a (ko, item) pair. Each DA included: (a) a ko:item identifier; (b) a Pfam domain list; (c) a list of organisms (org_id); (d) a list of refseq_ids; (e) a list of missing domains; and (f) a list of gained domains. Importantly, the list of organisms (c) and the list of refseq_ids (d) were matched lists, that is, the first refseq belonged to the first organism, the second refseq belonged to the second organism, etc. All other DA information was shared by all organisms and corresponding refseqs; namely, all refseqs were members of the same KO group and presented similar domain content (item). Missing and gained patterns [(e) and (f), above] were computed for each KO group across all DAs as items. Of note, the minimal number of DAs, that is, items, was two. Domain architecture, the putative domain, and the unique putative domain were formally defined as follows:

Definition: DA
Algorithm: Let $D = \{d_1, \ldots, d_n\}$, where each $d_i$ is a protein domain, be the set of protein domains of a protein; $D$ is the DA of that protein. Grouping the proteins of a KO group $k$ by their DAs into distinct groups (items) is a partition of $k$.

Definition: Putative domains and unique putative domains
Assumptions: Let $p$ be a protein with DA $D$, and let $s(d)$ be the score of domain $d \in D$.
Algorithm: Domain $d$ is a putative domain if $s(d)$ is maximal among overlapping or nested domains. A unique putative domain is the highest scoring putative domain among multiple copies of the same domain within $p$.
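The DA record described above, with its six fields (a) to (f), can be sketched as a small Python data structure. Two points are assumptions of this sketch rather than statements from the text: a DA's missing domains are taken to be the domains seen in other DAs of the same KO group but not in this DA, and its gained domains are those it carries that are not shared by every DA of the KO group. The class name, function name, and all identifiers in the example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass
class DomainArchitecture:
    ko: str                      # (a) ko:item identifier, part 1
    item: int                    # (a) ko:item identifier, part 2
    domains: FrozenSet[str]      # (b) Pfam domain content
    org_ids: List[str] = field(default_factory=list)     # (c) matched with (d)
    refseq_ids: List[str] = field(default_factory=list)  # (d) matched with (c)
    missing: Set[str] = field(default_factory=set)       # (e)
    gained: Set[str] = field(default_factory=set)        # (f)


def build_das(
    ko: str, proteins: List[Tuple[str, str, FrozenSet[str]]]
) -> List[DomainArchitecture]:
    """Partition the proteins of one KO group by domain content.

    `proteins` holds (org_id, refseq_id, domains) tuples.
    """
    by_content: Dict[FrozenSet[str], DomainArchitecture] = {}
    for org_id, refseq_id, domains in proteins:
        da = by_content.get(domains)
        if da is None:
            da = DomainArchitecture(ko, len(by_content) + 1, domains)
            by_content[domains] = da
        da.org_ids.append(org_id)
        da.refseq_ids.append(refseq_id)

    das = list(by_content.values())
    all_domains: Set[str] = set().union(*(da.domains for da in das))
    shared = set(all_domains).intersection(*(da.domains for da in das))
    for da in das:
        # Assumed reading of "missing" and "gained" (see lead-in).
        da.missing = all_domains - da.domains
        da.gained = set(da.domains) - shared
    return das


# Hypothetical KO group: org1 has two paralogs with the same domain content.
proteins = [
    ("org1", "p1", frozenset({"d1", "d2"})),
    ("org2", "p2", frozenset({"d1"})),
    ("org1", "p3", frozenset({"d1", "d2"})),
]
das = build_das("ko_example", proteins)
for da in das:
    print(da.ko, da.item, sorted(da.domains), da.org_ids, da.refseq_ids,
          sorted(da.missing), sorted(da.gained))
```

Keeping `org_ids` and `refseq_ids` as parallel lists mirrors the matched lists in the text, so a repeated organism in `org_ids` marks paralogous proteins.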

Translocation and indel events of a mobile domain

Informally, a translocation of a mobile domain involves gain and loss of that domain in orthologous proteins from two KO groups, whereas a mobile domain is determined by gain/loss patterns within a single KO group. Therefore, the mobile domain is described and formally defined first. The main objective of the EvoProDom model was to link changes in domain content, namely, changes at the protein level, with the organism level. This highlights groups of organisms with orthologous proteins that share similar patterns of protein domain gain/loss. Protein domain composition was coupled with organisms by defining mobile and translocation domains. These definitions were based on groups of organisms, and their sizes, whose orthologous proteins share the same protein domain composition. For a particular domain, the organisms of a KO group were split into those whose orthologous proteins contain the domain and those whose orthologous proteins lack it, and the definition was based on the number of organisms in each group. A mobile domain was defined as follows:

Assumptions: Let $A$ and $B$ be sets of organisms with proteins in a KO group, $k$, such that organisms $a \in A$ contain domain $d$, whereas organisms $b \in B$ lack domain $d$.
Algorithm: Unique putative domain $d$ is mobile between organisms in $A$ and in $B$ if $|A| \geq 4$ and $|B| \geq 4$.

Next, translocations and indel events of mobile domains were described. Translocations and indel events are mutually exclusive events. Translocation domains comprise a subset of mobile domains showing patterns of gain and loss between two KO groups in a reciprocal manner, namely, a mobile domain that was gained in the first orthologous group and lost in the second in one group of organisms, and vice versa in another group of organisms (Fig. 3). Similar to the definition of a mobile domain, translocation event criteria were defined for groups of organisms with four or more members. For example, a translocation event of the Pfam domain FERM_C (FERM C‐terminal PH‐like domain) in FERM (F for 4.1 protein, E for ezrin, R for radixin, and M for moesin) proteins is shown in Fig. 3. 
In this translocation event, FERM_C was present in KEGG orthologous group number 16822, corresponding to FERM domain‐containing protein 6 (FRMD6). FERM_C was absent from the orthologous protein group number 10637, which corresponds to E3 ubiquitin‐protein ligase MYLIP [EC:2.3.2.27] (MYLIP, MIR) [23, 24, 25, 26]. This gain and loss pattern of FERM_C was observed among 29 orthologous proteins in two groups of organisms (A* and B*) consisting of five and six members, respectively. The first group, A*, includes CAA (Carassius auratus, goldfish), CHL (Chinchilla lanigera, long‐tailed chinchilla), ECTE (Echinops telfairi, small Madagascar hedgehog), ccar (Cyprinus carpio, common carp), and lav (Loxodonta africana, African savanna elephant); each of these organisms contains at least one protein that gained FERM_C in FRMD6 and lost it in MYLIP. The second group, B*, includes CHA (Chrysochloris asiatica, Cape golden mole), MIO (Microtus ochrogaster, prairie vole), PEM (Peromyscus maniculatus bairdii, prairie deer mouse), cge (Cricetulus griseus, Chinese hamster), ola (Oryzias latipes, Japanese medaka), and rno (Rattus norvegicus, Norway rat); each of these organisms contains at least one protein that gained FERM_C in MYLIP and lost it in FRMD6 (Table 1, Fig. 3). Since domain FERM_C showed reciprocal gain and loss patterns for a minimum of four organisms in both A* and B*, it was determined that this domain had undergone a translocation event, and it was referred to as a translocation domain (Fig. 3). Orthologous proteins are indicated by refseqs for each organism, with multiple proteins per organism representing paralogous proteins (Fig. 3).
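The reciprocal criterion in the FERM_C example can be checked with a short Python sketch. It assumes one reading of the criterion: A* contains the organisms that have the domain in the first KO group and lack it in the second, B* contains the reverse, and both groups must have four or more members. The function names are hypothetical, and the organism sets list only the A* and B* members named in the text, not the full KO groups.

```python
from typing import Set, Tuple

MIN_GROUP_SIZE = 4  # "four or more members", as stated in the text


def reciprocal_groups(
    with_k1: Set[str], without_k1: Set[str],
    with_k2: Set[str], without_k2: Set[str],
) -> Tuple[Set[str], Set[str]]:
    """Return (A*, B*) for one domain and two KO groups.

    A*: organisms that have the domain in KO 1 and lack it in KO 2.
    B*: organisms that lack the domain in KO 1 and have it in KO 2.
    """
    return with_k1 & without_k2, without_k1 & with_k2


def is_translocation(
    a_star: Set[str], b_star: Set[str], min_size: int = MIN_GROUP_SIZE
) -> bool:
    """Both reciprocal groups must reach the minimum size; otherwise, an indel."""
    return len(a_star) >= min_size and len(b_star) >= min_size


# Organism codes from the FERM_C example (FRMD6 = KO 1, MYLIP = KO 2).
frmd6_with = {"CAA", "CHL", "ECTE", "ccar", "lav"}
frmd6_without = {"CHA", "MIO", "PEM", "cge", "ola", "rno"}
mylip_with = set(frmd6_without)
mylip_without = set(frmd6_with)

a_star, b_star = reciprocal_groups(
    frmd6_with, frmd6_without, mylip_with, mylip_without
)
print(len(a_star), len(b_star), is_translocation(a_star, b_star))
```

With five organisms in A* and six in B*, both groups pass the size threshold, which matches the classification of FERM_C as a translocation domain.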
Fig. 3

Illustration of translocation event for FERM_C. FERM_C (red domain) underwent a reciprocal translocation event between two orthologous protein groups 16822 (FRMD6) and 10637 (MYLIP, MIR). Accordingly, the red domain (FERM_C) is present in FRMD6 and absent from MYLIP for organisms CAA, etc., while for organisms CHA, etc., FERM_C is present in MYLIP and missing from FRMD6. FERM_C (FERM C‐terminal PH‐like domain); FERM. Orthologous proteins are indicated by refseqs for each organism, and multiple proteins per organism represent paralogue proteins. Organism codes are indicated in Table 1.

Translocations and indel events were formally defined as follows:

Assumptions: Let $d$ be a mobile domain between $A_1$ and $B_1$ in $k_1$ and between $A_2$ and $B_2$ in $k_2$, where $A_i$ and $B_i$ are the sets of organisms that contain and lack $d$, respectively, in KO group $k_i$. Let $A^* = A_1 \cap B_2$ and $B^* = B_1 \cap A_2$.
Algorithm: Mobile domain $d$ undergoes translocation if $|A^*| \geq 4$ and $|B^*| \geq 4$. Otherwise, an indel event has occurred.

Over 77% of organisms in the EvoProDom database are eukaryotes. Therefore, translocation events are expected to predominantly involve eukaryotes. To test this prediction, translocation events, each of which involves two organism groups (A*, B*), were classified based on superdomain taxonomy, namely, Eukaryota, Viruses, and Bacteria. Briefly, each organism group was assigned to the superdomain taxonomy shared by all of its organisms; otherwise, the group was assigned as ‘Mixed’. Given these assignments, each translocation was classified by the composite of its two organism group assignments (A*–B*; Table S1). For example, the Eukaryota‐Eukaryota class comprises 6282 (99.94%) translocations (Tables S1 and S2). For this class, Ig_3 is the most frequent translocating domain (528/6282, 8.40%) and Ig is the most abundant superfamily (clan) (1144/6282, 18.21%; Table 2, Table S1). These results validate the prediction that translocations involving only eukaryotes are overrepresented as a consequence of eukaryotes predominating in the EvoProDom database. 
Interestingly, a single translocation, which involved the FDX‐ACB domain, was assigned to the Eukaryota‐Bacteria group. In addition, three translocations were assigned to the Mixed group, that is, translocations involving at least one bacterial species in either organism group (Tables S1 and S2). The ferredoxin‐fold anticodon binding (FDX‐ACB) domain is contained in phenylalanine‐tRNA synthetase (PheRS, also known as phenylalanine‐tRNA ligase) and is shared by bacteria and mitochondria [27, 28, 29, 30, 31, 32]. This translocation involves orthologous protein groups 01889 (FARSA, pheS; phenylalanyl‐tRNA synthetase alpha chain [EC:6.1.1.20]) and 01890 (FARSB, pheT; phenylalanyl‐tRNA synthetase beta chain [EC:6.1.1.20]) (Tables S1 and S2). These results indicate that translocations are not restricted to eukaryotes and support the theory of a bacterial origin of mitochondria. Moreover, examination of translocation domains and their orthologous protein groups (KO) revealed that they are common among bacterial species. For example, the translocation domain Abhydrolase_1 involves orthologous protein group 13700 (ABHD6; abhydrolase domain‐containing protein 6 [EC:3.1.1.23]), which was found in Alphaproteobacteria (e.g., ster, Sphingopyxis terrae, tax_id 33052), Betaproteobacteria (rhg, Rhodoferax sediminis Gr‐4, tax_id 2509614), Gammaproteobacteria (pfo, Pseudomonas fluorescens Pf0–1, tax_id 294), and Deltaproteobacteria (sur, Stigmatella aurantiaca, tax_id 41). The second orthologous protein group is 13703 (ABHD11; abhydrolase domain‐containing protein 11), found in Alphaproteobacteria (e.g., abg, Asaia bogorensis, tax_id 91915) and Verrucomicrobia (e.g., mkc, Methylacidiphilum kamchatkense, tax_id 431057; Tables S1 and S2). These results point to possible translocations among bacteria, which share orthologous proteins with eukaryotes. Similar to translocation events, the vast majority (96.67%) of indel events involve only eukaryotes (Table S3). 
The most frequent domain for indel class Eukaryota‐Eukaryota is SNF2_N, which belongs to the P‐loop_NTPase superfamily, with 290 indel events (Table S4), and the most frequent superfamily is ‘Unknown’, with 8382 indels (Table S5). However, we found 570 indel events that involve bacteria, 70 of which involve either domain gain in bacteria yet absence of the gene in eukaryotes, or vice versa (Table S3). Interestingly, we found two collections of indel events involving two orthologous proteins, 01889, phenylalanyl‐tRNA synthetase alpha chain [EC:6.1.1.20], and 01890, phenylalanyl‐tRNA synthetase beta chain [EC:6.1.1.20]. For example, in the collection of indel events for alpha chains, the PheRS_DBD1, PheRS_DBD2, and PheRS_DBD3 domains are gained in eukaryotes; that is, these events are classified as Bacteria‐Eukaryota, Eukaryota‐Eukaryota, and Mixed‐Eukaryota. However, the Phe_tRNA‐synt_N domain is gained in bacteria, namely, in indel events that are classified as Mixed‐Bacteria and Eukaryota‐Bacteria (Table S3). These results show that indel events are not restricted to eukaryotes.
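The superdomain classification used for both translocation and indel classes can be sketched in Python. The lookup table `superdomain_of`, the function names, and the use of an ASCII hyphen in class labels are assumptions of this sketch; the organism codes and their superdomains are those mentioned in the text.

```python
from typing import Dict, Iterable


def group_superdomain(
    org_ids: Iterable[str], superdomain_of: Dict[str, str]
) -> str:
    """Return the superdomain shared by all organisms, or 'Mixed' otherwise."""
    taxa = {superdomain_of[org] for org in org_ids}
    return taxa.pop() if len(taxa) == 1 else "Mixed"


def event_class(
    group_a: Iterable[str], group_b: Iterable[str],
    superdomain_of: Dict[str, str],
) -> str:
    """Compose the class label from the two organism group assignments."""
    return (f"{group_superdomain(group_a, superdomain_of)}-"
            f"{group_superdomain(group_b, superdomain_of)}")


# Organism codes taken from the text; pfo is Pseudomonas fluorescens.
superdomain_of = {
    "rno": "Eukaryota",
    "ola": "Eukaryota",
    "cge": "Eukaryota",
    "pfo": "Bacteria",
}

print(event_class(["rno", "ola"], ["cge"], superdomain_of))
print(event_class(["rno", "ola"], ["cge", "pfo"], superdomain_of))
print(event_class(["rno"], ["pfo"], superdomain_of))
```

A group is labeled with a single superdomain only when every member agrees, so one bacterial organism in an otherwise eukaryotic group is enough to make the group ‘Mixed’.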

Duplication of domains

Unique putative Pfam domains form the basis for defining mobile and translocation events. For duplication events, putative domains were considered instead, so as to retain nonoverlapping duplicates of Pfam domains (see Materials and methods). These putative domains were calculated for each orthologous protein group, that is, KO group, to assign a duplicate status of ‘duplicated’ or ‘nonduplicated’ for that KO group. Because this status varied among KO groups, the final duplication status of a Pfam domain was determined by the majority of duplicate status assignments for individual KO groups. Specifically, the final duplication status of a Pfam domain was ‘duplicated’ if the difference between the number of KOs with ‘duplicated’ status and the number with ‘nonduplicated’ status was significant, namely, in the 99th percentile of the cumulative sum of the differences. Similarly, a final ‘nonduplicated’ status was determined when considering ‘nonduplicated’ to ‘duplicated’ differences. The duplicate status of a domain in a given KO group was determined based on the consistency of the domain copy number across all DAs; that is, if the copy number was constant across all DAs, then ‘nonduplicated’ was assigned. Otherwise, ‘duplicated’ was assigned. Duplication was formally defined as follows:

Assumptions: Let $d$ be a putative domain and $k$ be a KO group with DAs $D_1, \ldots, D_m$ of putative domains. Then, $d$ is ‘nonduplicated’ in $k$ if the copy number of $d$ is the same in each $D_j$; otherwise, $d$ is ‘duplicated’ in $k$.
Algorithm: $d$ is duplicated if the difference between the number of KO groups where $d$ is ‘duplicated’ and the number of KO groups where it is ‘nonduplicated’ is significant (above 99% of the cumulative sum of the differences). A nonduplicated domain is similarly defined.
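The per-KO duplicate status and the difference that feeds the final call can be sketched in Python. Several choices here are assumptions of this sketch: each DA is represented as a list of putative domain names with repeats, a DA that lacks the domain counts as copy number zero, and KO groups that never contain the domain are skipped. The significance threshold (99th percentile of the cumulative sum of the differences) is not reproduced. All KO identifiers and DAs below are hypothetical.

```python
from collections import Counter
from typing import Dict, List

DA = List[str]  # putative domains of one DA, repeats included


def ko_duplication_status(domain: str, das: List[DA]) -> str:
    """'nonduplicated' if the copy number is the same in every DA of the KO group."""
    copy_numbers = {Counter(da)[domain] for da in das}
    return "nonduplicated" if len(copy_numbers) == 1 else "duplicated"


def status_difference(domain: str, ko_groups: Dict[str, List[DA]]) -> int:
    """Number of 'duplicated' KO groups minus number of 'nonduplicated' KO groups."""
    diff = 0
    for das in ko_groups.values():
        # Skip KO groups in which the domain never occurs (assumption).
        if not any(domain in da for da in das):
            continue
        status = ko_duplication_status(domain, das)
        diff += 1 if status == "duplicated" else -1
    return diff


# Hypothetical KO groups, each with two DAs.
ko_groups = {
    "ko_a": [["Ank_2", "Ank_2", "SH3_1"], ["Ank_2", "SH3_1"]],
    "ko_b": [["Ank_2", "Ank_2"], ["Ank_2", "PH"]],
    "ko_c": [["Ank_2"], ["Ank_2", "PH"]],
}
print(status_difference("Ank_2", ko_groups))
print(status_difference("SH3_1", ko_groups))
```

A positive difference means the domain is ‘duplicated’ in more KO groups than it is ‘nonduplicated’; the final call in the text additionally requires that this difference be significant.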

Translocation domains are enriched in chimeric transcripts

Chimeric transcripts are combined transcripts derived from two genes. Frenkel‐Morgenstern and Valencia [5] analyzed domain content enrichment within chimeric transcripts and found enriched domains belonging to the following super‐families, listed as domain (super‐family name): ANK (Ank), EFh (EF_hand), EGF‐like (EGF), GTP_EFTU (P‐loop_NTPase), IG‐like (E‐set), LRR (LRR), PH (zf‐FYVE‐PHD), Pkinase (PKinase), RING (RING), RRM (RRM), SH2 (SH2‐like), SH3 (SH3), WD40 (Beta_propeller), and ZnF (C2H2‐zf) [5]. Of these, the enrichment of EFh (EF_hand), EGF‐like (EGF), GTP_EFTU (P‐loop_NTPase), IG‐like (E‐set), Pkinase (PKinase), RRM (RRM), SH2 (SH2‐like), SH3 (SH3), WD40 (Beta_propeller), and ZnF (C2H2‐zf) was confirmed by RNA‐seq data analysis [5]. These domains were found in high copy numbers within proteins, such as Ank [33, 34, 35] and WD40 [36], or as repeats or highly abundant domains within proteins, such as SH3 [37, 38]. Therefore, we hypothesized that highly abundant domains might have experienced a high number of translocation events. To test this hypothesis, we applied EvoProDom to the collection of organisms (EvoProDomDB) and found a total of 2042 mobile domains. Of these, 260 had undergone translocation events and 1782 were involved in indel events (Tables S1 and S3). Translocation event and indel event frequencies were grouped by Pfam super‐family [7, 8] (Table 2 and Table S5, respectively). Among the 10 most frequent domain super‐families were SH3 (Src homology‐3 domain), Ig (Immunoglobulin super‐family), and Ank (Table 2). The most frequent super‐families of mobile domains involved in indel events were ‘Unknown’, P‐loop_NTPase, and TPR (Table S5). Translocation events observed in the SH3 super‐family members were as follows: SH3_2 (239 translocations), SH3_1 (198 translocations), and SH3_9 (193 translocations). 
SH3 (Src homology‐3) domains are small protein domains approximately 50 amino acids in length [39, 40] and are found in various membrane‐associated or intracellular proteins [41, 42, 43], such as fodrin and yeast actin‐binding protein (ABP‐1). Additionally, SH3 domains mediate PPIs by facilitating protein complex assembly [37]. Translocation events observed in the Ig super‐family were as follows: Ig_3 (533 translocations), ig (219 translocations), I‐set (135 translocations), V‐set (117 translocations), Ig_2 (116 translocations), C2‐set_2 (23 translocations), Ig_6 (5 translocations), and C1‐set (1 translocation). These domains are found in cell surface proteins, in intracellular muscle proteins (I‐set), and in the vertebrate immune system (V‐set) [44, 45]. The Ank repeats super‐family comprises Ank_2 (231 translocations), Ank_4 (184 translocations), Ank_5 (94 translocations), Ank_2 (19 translocations), and Ank_3 (1 translocation). These repeats are involved in PPIs that regulate cell cycle transition from G1 to S [33, 34, 35]. Such regulation is achieved by inhibitors of cyclin‐dependent kinase 4 protein complex formation and inhibition of CDK4/6 proteins [35]. These findings reveal that protein domains enriched in chimeric transcripts underwent many translocations. This supports a connection between chimeric transcripts and EvoProDom translocations. In addition, protein domains such as Pkinase and ubiquitin were found in multiple translocation events and formed new fusions. Moreover, one domain encoded in each novel transcript underwent a translocation event [5]. Note that super‐families with both high and low numbers of translocations, for example, SH3 (630) and SH2‐like (6), were enriched in chimeric transcripts (Table 2).

Discussion

Here, we presented a novel protein evolution model, EvoProDom, which is based on the ‘mix and merge’ of protein domains. The EvoProDom model was implemented with and tested on EvoProDomDB, which consists of genomic and proteomic data, along with orthologous protein and protein domain data, from 109 organisms from diverse taxa. In the EvoProDom model, translocation, indel, and duplication events were defined to reflect changes in the domain content of proteins in orthologous groups. Moreover, in this model, such changes in protein domain composition were manifested at the organism level. Thus, SH3, which binds ligands [37, 38] and mediates PPIs [46], was observed as a highly abundant protein domain in translocations. Repetitive domains, such as Ank [33, 34, 35] and WD40 [36], appeared in multiple copies in proteins. Generally, their 3D conformations mediate PPIs [33, 34, 35, 36] by modulating protein networks of parent proteins. This modulation is mediated by novel PPIs of chimeric proteins [47]. Indeed, such domains, for example, SH3_2, Ig, Ank_2, and others (see Results), were enriched in multiple fusion event‐generated chimeric transcripts [5]. As hypothesized, these domains participated in a high number of evolutionary translocation events. A probable explanation for the high frequency of these translocation events is the repetition of these domains. In general, fusions are produced by slippage of two parent genes. Fusion genes lose domains at junction sites. As a result, the proper function of the chimeric protein is impaired [47]. For example, fusion within the catalytic domain would render the protein nonfunctional. Selection would thus act against such a fusion. Repetitive domains, which appear in high copy numbers, would appear in chimeras at higher frequencies than expected from their sheer numbers alone, albeit, due to selection, with fewer repeats. Indeed, their average copy numbers were reduced in chimeric transcripts [5]. 
In EvoProDom, abundant domains or repetitive domains, for example, SH3, within KO groups resulted in higher numbers of distinct DAs. This translates into a higher number of (ko, item) pairs (see Materials and methods). Consequently, these domains contribute more to the pool of mobile domains from which translocation events were generated, and were thus highly abundant in translocation events. Collectively, these results indicate that translocation events involving repetitive domains and highly abundant domains rewire PPI networks to achieve adaptive evolution. The introduction of new organisms into EvoProDomDB required only full genomes and annotated proteomes. Orthologous protein and protein domain content data were identified from protein sequences using KofamKOALA [6] and the Pfam search tool [7, 8]. Therefore, usage of these tools enables the extension of EvoProDom to any new organism with a full genome and annotated proteome. Moreover, the combined use of these tools provides a general method for obtaining protein domain content and orthologous protein annotation from protein sequences. In conclusion, EvoProDom presents a novel model for protein evolution based on the ‘mix and merge’ view of protein domains rather than DNA‐based models. This confers the advantage of considering chromosomal alterations in evolutionary events.

Conflict of interest

The authors declare no conflict of interest.

Author contributions

MFM designed, supervised the study and wrote the paper; GC and AG produced the data, verified results, and wrote the paper.

Table S1. EvoProdom translocations.
Table S2. Superdomain translocations counts based on mobile domain.
Table S3. Raw indel events.
Table S4. Indel frequencies for indel classes based on mobile domain.
Table S5. Indel events per superfamily (counts).
  47 in total

1.  Signalling through SH2 and SH3 domains.

Authors:  B J Mayer; D Baltimore
Journal:  Trends Cell Biol       Date:  1993-01       Impact factor: 20.808

Review 2.  A decade of 3C technologies: insights into nuclear organization.

Authors:  Elzo de Wit; Wouter de Laat
Journal:  Genes Dev       Date:  2012-01-01       Impact factor: 11.361

Review 3.  The SH3 domain--a family of versatile peptide- and protein-recognition module.

Authors:  Tomonori Kaneko; Lei Li; Shawn S-C Li
Journal:  Front Biosci       Date:  2008-05-01

4.  The crystal structure of phenylalanyl-tRNA synthetase from thermus thermophilus complexed with cognate tRNAPhe.

Authors:  Y Goldgur; L Mosyak; L Reshetnikova; V Ankilova; O Lavrik; S Khodyreva; M Safro
Journal:  Structure       Date:  1997-01-15       Impact factor: 5.006

5.  Distinct functional domains contribute to degradation of the low density lipoprotein receptor (LDLR) by the E3 ubiquitin ligase inducible Degrader of the LDLR (IDOL).

Authors:  Vincenzo Sorrentino; Lilith Scheer; Ana Santos; Eric Reits; Boris Bleijlevens; Noam Zelcer
Journal:  J Biol Chem       Date:  2011-07-06       Impact factor: 5.157

6.  Structure of phenylalanyl-tRNA synthetase from Thermus thermophilus.

Authors:  L Mosyak; L Reshetnikova; Y Goldgur; M Delarue; M G Safro
Journal:  Nat Struct Biol       Date:  1995-07

Review 7.  The FERM domain: organizing the structure and function of FAK.

Authors:  Margaret C Frame; Hitesh Patel; Bryan Serrels; Daniel Lietha; Michael J Eck
Journal:  Nat Rev Mol Cell Biol       Date:  2010-11       Impact factor: 94.444

8.  How do proteins gain new domains?

Authors:  Joseph A Marsh; Sarah A Teichmann
Journal:  Genome Biol       Date:  2010-07-15       Impact factor: 13.583

9.  KEGG as a reference resource for gene and protein annotation.

Authors:  Minoru Kanehisa; Yoko Sato; Masayuki Kawashima; Miho Furumichi; Mao Tanabe
Journal:  Nucleic Acids Res       Date:  2015-10-17       Impact factor: 16.971

10.  KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold.

Authors:  Takuya Aramaki; Romain Blanc-Mathieu; Hisashi Endo; Koichi Ohkubo; Minoru Kanehisa; Susumu Goto; Hiroyuki Ogata
Journal:  Bioinformatics       Date:  2020-04-01       Impact factor: 6.937
