Literature DB >> 28836963

Bioinformatics comparisons of RNA-binding proteins of pathogenic and non-pathogenic Escherichia coli strains reveal novel virulence factors.

Pritha Ghosh1, Ramanathan Sowdhamini2.   

Abstract

BACKGROUND: Pathogenic bacteria have evolved various strategies to counteract host defences. They are also exposed to environments that are undergoing constant changes. Hence, in order to survive, bacteria must adapt themselves to the changing environmental conditions by performing regulations at the transcriptional and/or post-transcriptional levels. Roles of RNA-binding proteins (RBPs) as virulence factors have been very well studied. Here, we have used a sequence search-based method to compare and contrast the proteomes of 16 pathogenic and three non-pathogenic E. coli strains as well as to obtain a global picture of the RBP landscape (RBPome) in E. coli.
RESULTS: Our results show that there are no significant differences in the percentage of RBPs encoded by the pathogenic and the non-pathogenic E. coli strains. The differences in the types of Pfam domains as well as Pfam RNA-binding domains, encoded by these two classes of E. coli strains, are also insignificant. The complete and distinct RBPome of E. coli has been established by studying all known E. coli strains till date. We have also identified RBPs that are exclusive to pathogenic strains, and most of them can be exploited as drug targets since they appear to be non-homologous to their human host proteins. Many of these pathogen-specific proteins were uncharacterised and their identities could be resolved on the basis of sequence homology searches with known proteins. Detailed structural modelling, molecular dynamics simulations and sequence comparisons have been pursued for selected examples to understand differences in stability and RNA-binding.
CONCLUSIONS: The approach used in this paper to cross-compare proteomes of pathogenic and non-pathogenic strains may also be extended to other bacterial or even eukaryotic proteomes to understand interesting differences in their RBPomes. The pathogen-specific RBPs reported in this study, may also be taken up further for clinical trials and/or experimental validations.

Entities:  

Keywords:  Escherichia coli; Genome-wide survey; PELOTA; Pathogen; RNA-binding proteins; Ribonuclease PH; Uncharacterised; Virulence

Mesh:

Substances:

Year:  2017        PMID: 28836963      PMCID: PMC5571608          DOI: 10.1186/s12864-017-4045-3

Source DB:  PubMed          Journal:  BMC Genomics        ISSN: 1471-2164            Impact factor:   3.969


Background

Escherichia coli is one of the most abundant, facultative anaerobic gram-negative bacterium of the intestinal microflora and colonises the mucus layer of the colon. The core genomic structure is common among the commensal strains and the various pathogenic E. coli strains that cause intestinal and extra-intestinal diseases in humans [1]. In the pathogenic strains, novel genetic islands and small clusters of genes are present in addition to the core genomic framework and provide the bacteria with increased virulence [2-4]. The extracellular intestinal pathogen, enterohemorrhagic E. coli (EHEC), which cause diarrhea, hemorrhagic colitis and the haemolytic uremic syndrome, is the most devastating of the pathogenic E. coli strains [5, 6]. Pathogenic bacteria have evolved various strategies to counteract host defences. They are also exposed to environments that are undergoing constant changes. Hence, in order to survive, bacteria must adapt themselves to the changing environmental conditions by altering gene expression levels and in turn adjusting protein levels according to the need of the cell. Such regulations may occur at the transcriptional and/or post-transcriptional levels [7]. RNA-binding proteins (RBPs) are a versatile group of proteins that perform a diverse range of functions in the cell and are ‘master regulators’ of co-transcriptional and post-transcriptional gene expression like RNA modification, export, localization, mRNA translation, turnover [8-12] and also aid in the folding of RNA into conformations that are functionally active [13]. In bacteria, many different classes of RBPs interact with small RNAs (sRNA) to form ribonucleoprotein (RNP) complexes that participate in post-transcriptional gene regulation processes [14-23]. In eukaryotes, noncoding RNAs (ncRNAs) are known to be important regulators of gene expression [24-26]. Hence, bacterial RBPs that are capable of inhibiting this class of RNAs, are also capable of disrupting the normal functioning of their host cells, thus acting as virulence factors. Roles of RBPs like the Hfq [27-36], Repressor of secondary metabolites A (RsmA) [36-41] and endoribonuclease YbeY [42] as virulence factors, have also been very well studied. Here, we describe the employment of mathematical profiles of RBP families to study the RBP repertoire, henceforth referred to as the ‘RBPome’, in E. coli strains. The proteomes of 19 E. coli strains (16 pathogenic and three non-pathogenic strains) have been studied to compare and contrast the RBPomes of pathogenic and non-pathogenic E. coli. More than 40 different kinds of proteins have been found to be present in two or more pathogenic strains, but absent from all the three non-pathogenic ones. Many of these proteins are previously uncharacterised and may be novel virulence factors and probable candidates for further experimental validations. We have also extended our search method to probe to all available E. coli complete proteomes (till the date of the study) for RBPs, and thus obtain a bigger picture of the RBP landscape in all known E. coli strains. The search method can also be adapted in future for comparing the RBPomes of other species of bacteria as well. In addition, our work also discusses case studies on a few interesting RBPs. The first of them is an attempt to provide a structural basis for the inactivity of the Ribonuclease PH (RNase PH) protein from E. coli strain K12, the second study deals with the structural modelling and characterisation of RNA substrates of an ‘uncharacterised’ protein that is exclusively found in the pathogenic E. coli strains, whereas the third one involves the analysis of pathogen-specific Cas6 proteins and comparison with their non-pathogenic counterparts.

Methods

Dataset

Protein families were grouped on the basis of either structural homology (structure-centric families) or sequence homology (sequence-centric families). A dataset of 1285 RNA-protein and 14 DNA/RNA hybrid-protein complexes were collected from the Protein Data Bank (PDB) (May 2015) and were split into protein and RNA chains. The RNA-interacting protein chains in this dataset were classified into 182 Structural Classification of Proteins (SCOP) families, 135 clustered families and 127 orphan families (a total of 437 structure-centric families), on the basis of structural homology with each other. Sequence-centric RNA-binding families were retrieved from Pfam, using an initial keyword search of ‘RNA’, followed by manual curation to generate a dataset of 746 families. The structure-centric classification scheme, the generation of structure-centric family Hidden Markov Models (HMMs) and retrieval of sequence-centric family HMMs from the Pfam database (v 28) were as adapted from our previous study [43]. Proteomes of 19 E. coli strains were retrieved from UniProt Proteomes (May 2016) [44] for the comparative study of pathogenic and non-pathogenic strains. The names and organism IDs of the E. coli strains, their corresponding UniProt proteome IDs and the total number of proteins in each proteome have been listed in Table 1.
Table 1

E. coli proteomes for comparative study. The 19 E. coli proteomes from UniProt (May 2016) used in the study for the comparison of RBPomes of pathogenic and non-pathogenic strains have been listed in this table. The pathogenic and the non-pathogenic E. coli strains have been represented in red and green fonts, respectively

apoints to the differences among the strains at the genome level, considering the strain K12 (or lab strain) as the standard

E. coli proteomes for comparative study. The 19 E. coli proteomes from UniProt (May 2016) used in the study for the comparison of RBPomes of pathogenic and non-pathogenic strains have been listed in this table. The pathogenic and the non-pathogenic E. coli strains have been represented in red and green fonts, respectively apoints to the differences among the strains at the genome level, considering the strain K12 (or lab strain) as the standard All complete E. coli proteomes were retrieved from RefSeq (May 2016) [45] to study the overall RBP landscape in E. coli. The names of the E. coli strains, their corresponding assembly IDs and the total number of proteins in each proteome and have been listed in Table 2.
Table 2

Complete E. coli proteomes. The 166 E. coli complete proteomes from RefSeq (May 2016) that have been used in the study have been listed in this table

Organism/NameStrainAssemblyProteins
E. coli O157:H7 str. SakaiSakai substr. RIMD 0509952GCA_000008865.15292
E. coli IAI39IAI39GCA_000026345.14725
E. coli str. K-12 substr. MG1655K-12 substr. MG1655GCA_000005845.24140
E. coli O83:H1 str. NRG 857CNRG 857CGCA_000183345.14582
E. coli O104:H4 str. 2011C-34932011C-3493GCA_000299455.15149
E. coli CFT073CFT073GCA_000007445.14897
E. coli BL21(DE3)BL21(DE3)GCA_000009565.24302
E. coli str. K-12 substr. W3110K-12 substr. W3110GCA_000010245.14410
E. coli SE11SE11GCA_000010385.14968
E. coli SE15SE15GCA_000010485.14573
E. coli O103:H2 str. 12,00912,009GCA_000010745.15423
E. coli O111:H- str. 11,12811,128GCA_000010765.15673
E. coli UTI89UTI89GCA_000013265.14963
E. coli 536536GCA_000013305.14542
E. coli APEC O1APEC O1GCA_000014845.15292
E. coli O139:H28 str. E24377AE24377AGCA_000017745.15021
E. coli HSHSGCA_000017765.14366
E. coli B str. REL606REL606GCA_000017985.14344
E. coli ATCC 8739ATCC 8739GCA_000019385.14434
E. coli str. K-12 substr. DH10BK-12 substr. DH10BGCA_000019425.14450
E. coli SMS-3-5SMS-3-5GCA_000019645.14908
E. coli O157:H7 str. EC4115EC4115GCA_000021125.15631
E. coli O157:H7 str. TW14359TW14359GCA_000022225.15537
E. coli BW2952K-12 substr. BW2952GCA_000022345.14347
E. coli BL21(DE3)BL21(DE3)GCA_000022665.24302
E. coli DH1DH1GCA_000023365.14369
E. coli ‘BL21-Gold(DE3)pLysS AG’BL21-Gold(DE3)pLysS AGGCA_000023665.14322
E. coli O55:H7 str. CB9615CB9615GCA_000025165.15262
E. coli IHE3034IHE3034GCA_000025745.14911
E. coli 55,98955,989GCA_000026245.14953
E. coli IAI1IAI1GCA_000026265.14450
E. coli S88S88GCA_000026285.14696
E. coli O127:H6 str. E2348/69E2348/69GCA_000026545.14924
E. coli 04242GCA_000027125.15131
E. coli O26:H11 str. 11,36811,368GCA_000091005.15833
E. coli KO11FLKO11GCA_000147855.34850
E. coli ABU 83972ABU 83972GCA_000148365.14862
E. coli UM146UM146GCA_000148605.14779
E. coli WWGCA_000184185.14825
E. coli ETEC H10407ETEC H10407GCA_000210475.15124
E. coli UMNK88UMNK88GCA_000212715.25542
E. coli NA114NA114GCA_000214765.24720
E. coli PCN033PCN033GCA_000219515.34881
E. coli UMNF18UMNF18GCA_000220005.25521
E. coli O7:K1 str. CE10CE10GCA_000227625.15152
E. coli str. ‘clone D i2’clone D i2GCA_000233875.14740
E. coli str. ‘clone D i14’clone D i14GCA_000233895.14742
E. coli O55:H7 str. RM12579RM12579GCA_000245515.15213
E. coli P12bP12bGCA_000257275.14549
E. coli KO11FLKO11FLGCA_000258025.14732
E. coli WWGCA_000258145.14831
E. coli Xuzhou21Xuzhou21GCA_000262125.15402
E. coli str. K-12 substr. MG1655K-12 substr. MG1655GCA_000269645.24405
E. coli DH1DH1GCA_000270105.14351
E. coli str. K-12 substr. MG1655K-12 substr. MG1655GCA_000273425.14404
E. coli LF82LF82GCA_000284495.14544
E. coli O25b:H4-ST131EC958GCA_000285655.35037
E. coli O104:H4 str. 2009EL-20502009EL-2050GCA_000299255.15283
E. coli O104:H4 str. 2009EL-20712009EL-2071GCA_000299475.15227
E. coli APEC O78APEC O78GCA_000332755.14598
E. coli str. K-12 substr. MDS42K-12 substr. MDS42GCA_000350185.13713
E. coli LY180LY180GCA_000468515.14586
E. coli PMV-1PMV-1GCA_000493595.15100
E. coli JJ1886JJ1886GCA_000493755.15151
E. coli str. K-12 substr. MC4100K-12 substr. MC4100GCA_000499485.14284
E. coli O145:H28 str. RM13514RM13514GCA_000520035.15524
E. coli O145:H28 str. RM13516RM13516GCA_000520055.15354
E. coli ST540GCA_000597845.14498
E. coli ST540GCA_000599625.14532
E. coli ST540GCA_000599645.14550
E. coli ST2747GCA_000599665.14665
E. coli ST2747GCA_000599685.14585
E. coli ST2747GCA_000599705.14547
E. coli O145:H28 str. RM12761RM12761GCA_000662395.15349
E. coli O145:H28 str. RM12581RM12581GCA_000671295.15520
E. coli Nissle 1917Nissle 1917GCA_000714595.14990
E. coli KLYKLYGCA_000725305.14478
E. coli O157:H7 str. SS17SS17GCA_000730345.15532
E. coli O157:H7 str. EDL933EDL933GCA_000732965.15530
E. coli ATCC 25922ATCC 25922GCA_000743255.14940
E. coli BW25113K-12 substr. BW25113GCA_000750555.14398
E. coli ECONIH1GCA_000784925.15320
E. coli ER2796ER2796GCA_000800215.14311
E. coli K-12ER3413GCA_000800765.14309
E. coli RS218RS218GCA_000800845.24791
E. coli RM9387GCA_000801165.14775
E. coli 94–3024GCA_000801185.24792
E. coli str. K-12 substr. MG1655K-12 substr. MG1655GCA_000801205.14387
E. coli O157:H7 str. SS52SS52GCA_000803705.15489
E. coli APEC IMT5155APEC IMT5155GCA_000813165.14840
E. coli 6409GCA_000814145.24893
E. coli GCA_000819645.14996
E. coli O157:H16SantaiGCA_000827105.14776
E. coli 13031303GCA_000829985.14849
E. coli C41(DE3)GCA_000830035.14302
E. coli ECC-1470ECC-1470GCA_000831565.14673
E. coli BL21 (TaKaRa)GCA_000833145.14262
E. coli MNCRE44GCA_000931565.15137
E. coli K-12 substr. RV308GCA_000952955.14342
E. coli K-12 substr. HMS174GCA_000953515.14344
E. coli HUSEC2011GCA_000967155.15294
E. coli VR50VR50GCA_000968515.14968
E. coli CI5GCA_000971615.14874
E. coli K-12ER3454GCA_000974405.14375
E. coli K-12ER3440GCA_000974465.14367
E. coli K-12ER3476GCA_000974505.14354
E. coli K-12ER3445GCA_000974535.14360
E. coli K-12ER3466GCA_000974575.14415
E. coli K-12ER3446GCA_000974825.14357
E. coli K-12ER3475GCA_000974865.14359
E. coli K-12ER3435GCA_000974885.14443
E. coli K-12K-12 substr. AG100GCA_000981485.14394
E. coli O104:H4 str. C227–11C227–11GCA_000986765.15269
E. coli SEC470GCA_000987875.14941
E. coli SQ37GCA_000988355.14405
E. coli SQ88GCA_000988385.14403
E. coli SQ2203GCA_000988465.14402
E. coli CFSAN029787GCA_001007915.15090
E. coli K-12K-12 substr. GM4792GCA_001020945.24362
E. coli K-12K-12 substr. GM4792GCA_001021005.24368
E. coli PCN061PCN061GCA_001029125.14680
E. coli C43(DE3)GCA_001039415.14254
E. coli NCM3722GCA_001043215.14530
E. coli ACN001ACN001GCA_001051135.14671
E. coli DH1Ec095GCA_001183645.14345
E. coli DH1Ec104GCA_001183665.14342
E. coli DH1Ec169GCA_001183685.14342
E. coli RR1GCA_001276585.14337
E. coli SF-088GCA_001280325.15019
E. coli SF-468GCA_001280345.15218
E. coli SF-166GCA_001280385.14773
E. coli SF-173GCA_001280405.14936
E. coli O157:H7WS4202GCA_001307215.15294
E. coli str. K-12 substr. MG1655K-12 substr. MG1655GCA_001308065.14398
E. coli K-12 substr. MG1655_TMP32XR1GCA_001308125.14398
E. coli K-12 substr. MG1655_TMP32XR2GCA_001308165.14399
E. coli 2012C-4227GCA_001420935.15142
E. coli 2009C-3133GCA_001420955.15311
E. coli YD786GCA_001442495.14604
E. coli CQSW20GCA_001455385.14142
E. coli uk_P46212GCA_001469815.15023
E. coli ST648GCA_001485455.14838
E. coli CD306GCA_001513615.15021
E. coli JJ2434GCA_001513635.15099
E. coli ACN002GCA_001515725.14618
E. coli MRE600GCA_001542675.24603
E. coli str. K-12 substr. MG1655K-12 substr. MG1655GCA_001544635.14407
E. coli JEONG-1266GCA_001558995.15358
E. coli BC2566GCA_001559615.14209
E. coli BC3029GCA_001559635.14317
E. coli K-12DHB4GCA_001559655.14522
E. coli K-12C3026GCA_001559675.14731
E. coli str. K-12 substr. MG1655JW5437–1 substr. MG1655GCA_001566335.14405
E. coli SaT040GCA_001566615.14963
E. coli G749GCA_001566635.14983
E. coli ZH193GCA_001566675.15040
E. coli ZH063GCA_001577325.14984
E. coli JJ1887JJ1887GCA_001593565.15142
E. coli str. SanjiSanjiGCA_001610755.15172
E. coli 28RC1GCA_001612475.15504
E. coli SRCC 1675GCA_001612495.15511
E. coli Ecol_732GCA_001617565.15243
E. coli Ecol_743GCA_001618325.14866
E. coli Ecol_745GCA_001618345.14803
E. coli Ecol_448GCA_001618365.14956
E. coli B7AB7AGCA_000725265.15384
Complete E. coli proteomes. The 166 E. coli complete proteomes from RefSeq (May 2016) that have been used in the study have been listed in this table

Search method

The search method was described in our previous study [43] and is represented schematically in Fig. 1. A library of 1183 RBP family HMMs (437 structure-centric families and 746 sequence-centric families) were used as start points to survey the E. coli proteomes for the presence of putative RBPs. The genome-wide survey (GWS) for each E. coli proteome was performed with a sequence E-value cut-off of 10−3 and the hits were filtered with a domain i-Evalue cut-off of 0.5. i-Evalue (independent E-value) is the E-value that the sequence/profile comparison would have received if this were the only domain envelope found in it, excluding any others. This is a stringent measure of how reliable this particular domain may be. The independent E-value uses the total number of targets in the target database. We have now mentioned this definition in the revised manuscript. The Pfam (v 28) domain architectures (DAs) were also resolved at the same sequence E-value and domain i-Evalue cut-offs.
Fig. 1

Search scheme for the genome-wide survey. A schematic representation of the search method for the GWS has been represented in this figure. Starting from 437 structure-centric and 746 sequence-centric RBP families, a library of 1183 RBP family HMMs were built. These mathematical profiles were then used to search proteomes of 19 different E. coli strains (16 pathogenic and three non-pathogenic strains). It is to be noted here that the same search scheme has been used later to extend the study to all 166 available E. coli proteomes in the RefSeq database as of May 2016 (see text for further details)

Comparison of RNA-binding proteins across strains

The RBPs identified from 19 different strains of E. coli, were compared by performing all-against-all protein sequence homology searches using the BLASTP module of the NCBI BLAST 2.2.30 + suite [46] with a sequence E-value cut-off of 10−5. The hits were clustered on the basis of 30% sequence identity and 70% query coverage cut-offs to identify similar proteins i.e., proteins that had a sequence identity of greater than or equal to 30%, as well as a query coverage of greater than or equal to 70%, were considered to homologous in terms of sequence and hence clustered. These parameters were standardised on the basis of previous work from our lab to identify true positive sequence homologues [47]. Associations for proteins that were annotated as ‘hypothetical’ or ‘uncharacterised’, were obtained by sequence homology searches against the NCBI non-redundant (NR) protein database (February 2016) with a sequence E-value cut-off of 10−5. The BLASTP hits were also clustered on the basis of 100% sequence identity, 100% query coverage and equal length cut-offs to identify identical proteins. Clusters that consist of proteins from two or more of the pathogenic strains, but not from any of the non-pathogenic ones, will henceforth be referred as ‘pathogen-specific clusters’ and the proteins in such clusters as ‘pathogen-specific proteins’. Sequence homology searches were performed for these proteins against the reference human proteome (UP000005640) retrieved from Swiss-Prot (June 2016) [44] at a sequence E-value cut-off of 10−5. The hits were filtered on the basis of 30 percentage sequence identity and 70 percentage query coverage cut-offs.

Modelling and dynamics studies of RNase PH protein

The structures of the active and inactive monomers of the tRNA processing enzyme Ribonuclease PH (RNase PH) from strains O26:H11 (UniProt ID: C8TLI5) and K12 (UniProt ID: P0CG19), respectively, were modelled on the basis of the RNase PH protein from Pseudomonas aeruginosa (PDB code: 1R6M: A) (239 amino acids) using the molecular modelling program MODELLER v 9.15 [48]. The active and inactive RNase PH monomers are 238 and 228 amino acids in length, respectively and are 69% and 70% identical to the template, respectively. Twenty models were generated for each of the active and inactive RNase PH monomers and validated using PROCHECK [49], VERIFY3D [50], ProSA [51] and HARMONY [52]. The best model for each of the active and inactive RNase PH monomers were selected on the basis of Discrete Optimized Protein Energy (DOPE) score and other validation parameters obtained from the above-mentioned programs. The best models for the active and inactive RNase PH monomers were subjected to 100 iterations of the Powell energy minimisation method in the Tripos Force Field (in absence of any electrostatics) using SYBYL7.2 (Tripos Inc.). These were subjected to 100 ns (ns) molecular dynamics (MD) simulations (three replicates each) in the AMBER99SB protein, nucleic AMBER94 force field [53] using the Groningen Machine for Chemical Simulations (GROMACS 4.5.5) program [54]. The biological assembly (hexamer) of RNase PH from Pseudomonas aeruginosa (PDB code: 1R6M) served as the template and was obtained using the online tool (PISA) (http://www.ebi.ac.uk/pdbe/prot_int/pistart.html) [55]. The structures of the active and inactive hexamers of RNase PH from strains O26:H11 and K12, respectively were modelled and the 20 models generated for each of the active and inactive RNase PH hexamers were validated using the same set of tools, as mentioned above. The best models were selected and subjected to energy minimisations, as described above. Electrostatic potential on the solvent accessible surfaces of the proteins were calculated using PDB2PQR [56] (in the AMBER force field) and Adaptive Poisson-Boltzmann Solver (APBS) [57]. The head-to-head dimers were randomly selected from both the active and the inactive hexamers of the protein for performing MD simulations, to save computational time. Various energy components of the dimer interface were measured using the in-house algorithm, PPCheck [58]. This algorithm identifies interface residues in protein-protein interactions on the basis of simple distance criteria, following which the strength of interactions at the interface are quantified. 100 ns MD simulations (three replicates each) were performed with the same set of parameters as mentioned above for the monomeric proteins.

Modelling and dynamics studies of an ‘uncharacterised’ pathogen-specific protein

The structure of the PELOTA_1 domain (Pfam ID: PF15608) of an ‘uncharacterised’ pathogen-specific protein from strain O103:H2 (UniProt ID: C8TX32) (371 amino acids) was modelled on the basis of the L7Ae protein from Methanocaldococcus jannaschii (PDB code: 1XBI: A) (117 amino acids) and validated, as described earlier. The 64 amino acids long PELOTA_1 domain of the uncharacterised protein, has 36% sequence identity with the corresponding 75 amino acids domain of the template. The best model was selected as described in the case study on RNase PH. This model was subjected to 100 iterations of the Powell energy minimisation method in the Tripos Force Field (in absence of any electrostatics) using SYBYL7.2 (Tripos Inc.). Structural alignment of the modelled PELOTA_1 domain and the L7Ae K-turn binding domain from Archaeoglobus fulgidus (PDB code: 4BW0: B) was performed using Multiple Alignment with Translations and Twists (Matt) [59]. The same kink-turn RNA from H. marismortui, found in complex with the L7Ae K-turn binding domain from A. fulgidus, was docked onto the model, guided by the equivalents of the RNA-interacting residues (at a 5 Å cut-off distance from the protein) in the A. fulgidus L7Ae protein (highlighted in yellow in the upper panel of Fig. 7c) using the molecular docking program HADDOCK [60]. The model and the L7Ae protein from A. fulgidus, in complex with kink-turn RNA from H. marismortui, were subjected to 100 ns MD simulations (three replicates each) in the AMBER99SB protein, nucleic AMBER94 force field using the GROMACS 4.5.5 program.
Fig. 7

Uncharacterised pathogen-specific RNA-binding protein. The characterisation of the uncharacterised pathogen-specific RBP has been represented in this figure. a Schematic representation of the domain architecture of the protein. The RNA-binding PELOTA_1 domain and its model has been shown here. b Structural superposition of the L7Ae K-turn binding domain (PDB code: 4BW0: B) (in red) and the model of the uncharacterised protein PELOTA_1 domain (in blue). c. Comparison of the kink-turn RNA-bound forms of the L7Ae K-turn binding domain (PDB code: 4BW0: B) (up) and that of the model of the uncharacterised protein PELOTA_1 domain (down). The RNA-binding residues have been highlighted in yellow

Sequence analysis of pathogen-specific Cas6-like proteins

The sequences of all the proteins in Cluster 308 were aligned to the Cas6 protein sequence in E. coli strain K12 (UniProt ID: Q46897), using MUSCLE [61] and subjected to molecular phylogeny analysis using the Maximum Likelihood (ML) method and a bootstrap value of 1000 in MEGA7 (CC) [62, 63]. All reviewed CRISPR-associated Cas6 protein sequences were also retrieved from Swiss-Prot (March 2017) [44], followed by manual curation to retain 18 Cas6 proteins. Sequences of two uncharacterised proteins (UniProt IDs: C8U9I8 and C8TG04) from Cluster 308, known to be homologous to known CRISPR-associated Cas6 proteins (on the basis of sequence homology searches against the NR database, as described earlier) were aligned to those of the 18 reviewed Cas6 proteins using MUSCLE. The sequences were then subjected to molecular phylogeny analysis using the above-mentioned parameters. Secondary structure predictions for all the proteins were performed using PSIPRED [64]. The structures of Cas6 proteins from E. coli strain K12 (PDB codes: 4QYZ: K, 5H9E: K and 5H9F: K) were retrieved from the PDB. The RNA-binding and protein-interacting residues in the Cas6 protein structures were calculated on the basis of 5 Å and 8 Å distance cut-off criteria, from the associated crRNAs (PDB codes: 4QYZ: L, 5H9E: L and 5H9F: L, respectively) and the protein chains (PDB codes: 4QYZ: A-J, 5H9E: A-J and 5H9F: A-J, respectively), respectively.

Results

Genome-wide survey (GWS) of RNA-binding proteins in pathogenic and non-pathogenic E. coli strains

The GWS of RBPs was performed in 19 different E. coli strains (16 pathogenic and three non-pathogenic strains) and a total of 7902 proteins were identified (Additional file 1: Table S1). Figure 2a shows the number of RBPs found in each of the strains studied here. The pathogenic strains have a larger RBPome, as compared to the non-pathogenic ones - with strain O26:H11 encoding the greatest (441). The pathogenic strains also have bigger proteome sizes (in terms of the number of proteins in the proteome), as compared to their non-pathogenic counterparts, by virtue of maintaining plasmids in them. Hence, to normalise for proteome size, the number of RBPs in each of these strains were expressed as a function of their respective number of proteins in the proteome (Fig. 2b). We observed that the difference in the percentage of RBPs in the proteome among the pathogenic and the non-pathogenic strains are insignificant (Welch Two Sample t-test: t = 3.2384, df = 2.474, p-value = 0.06272).
Fig. 2

Statistics for the genome-wide survey of 19 E. coli strains. The different statistics obtained from the GWS have been represented in this figure. In panels a and b, the pathogenic strains have been represented in red and the non-pathogenic ones in green. The non-pathogenic strains have also been highlighted with green boxes. a. The number of RBPs in each strain. The pathogenic O26:H11 strain encodes the highest number of RBPs in its proteome. b. The percentage of RBPs in the proteome of each strain. These percentages have been calculated with respect to the proteome size of the strain under consideration. The difference in this number among the pathogenic and the non-pathogenic strains are insignificant (Welch Two Sample t-test: t = 3.2384, df = 2.474, p-value = 0.06272). c. The type of Pfam domains encoded by each strain. The difference in the types of Pfam domains, as well as Pfam RBDs, encoded by the pathogenic and the non-pathogenic strains are insignificant (Welch Two Sample t-test for types of Pfam domains: t = −1.3876, df = 2.263, p-value = 0.2861; Welch Two Sample t-test for types of Pfam RBDs: t = −0.9625, df = 2.138, p-value = 0.4317). d. The abundance of Pfam RBDs. 185 types of Pfam RBDs were found to be encoded in the RBPs, of which DEAD domains have the highest representation (approximately 4% of all Pfam RBDs)

Search scheme for the genome-wide survey. A schematic representation of the search method for the GWS has been represented in this figure. Starting from 437 structure-centric and 746 sequence-centric RBP families, a library of 1183 RBP family HMMs were built. These mathematical profiles were then used to search proteomes of 19 different E. coli strains (16 pathogenic and three non-pathogenic strains). It is to be noted here that the same search scheme has been used later to extend the study to all 166 available E. coli proteomes in the RefSeq database as of May 2016 (see text for further details) Statistics for the genome-wide survey of 19 E. coli strains. The different statistics obtained from the GWS have been represented in this figure. In panels a and b, the pathogenic strains have been represented in red and the non-pathogenic ones in green. The non-pathogenic strains have also been highlighted with green boxes. a. The number of RBPs in each strain. The pathogenic O26:H11 strain encodes the highest number of RBPs in its proteome. b. The percentage of RBPs in the proteome of each strain. These percentages have been calculated with respect to the proteome size of the strain under consideration. The difference in this number among the pathogenic and the non-pathogenic strains are insignificant (Welch Two Sample t-test: t = 3.2384, df = 2.474, p-value = 0.06272). c. The type of Pfam domains encoded by each strain. The difference in the types of Pfam domains, as well as Pfam RBDs, encoded by the pathogenic and the non-pathogenic strains are insignificant (Welch Two Sample t-test for types of Pfam domains: t = −1.3876, df = 2.263, p-value = 0.2861; Welch Two Sample t-test for types of Pfam RBDs: t = −0.9625, df = 2.138, p-value = 0.4317). d. The abundance of Pfam RBDs. 185 types of Pfam RBDs were found to be encoded in the RBPs, of which DEAD domains have the highest representation (approximately 4% of all Pfam RBDs) To compare the differential abundance of domains, if any, among the pathogens and the non-pathogens, the Pfam DAs of all the RBPs were resolved (to strengthen the results in this section, this study has been extended to all known E. coli proteomes and will be discussed in a later section). The number of different types of Pfam domains and that of Pfam RNA-binding domains (RBDs) found in each strain have been represented in Fig. 2c. We observed that the difference in the types of Pfam domains, as well as Pfam RBDs, encoded by the pathogenic and the non-pathogenic strains are insignificant (Welch Two Sample t-test for types of Pfam domains: t = − 1.3876, df = 2.263, p-value = 0.2861; Welch Two Sample t-test for types of Pfam RBDs: t = − 0.9625, df = 2.138, p-value = 0.4317). The number of different Pfam RBDs, found across all the 19 E. coli strains studied here, has been shown in Fig. 2d and also been listed in Table 3.
Table 3

Pfam RNA-binding domains. The Pfam RBDs and their corresponding occurrences in the GWS of 19 E. coli strains have been listed in this table. The Pfam domains listed are on the basis of Pfam database (v.28)

Pfam domainNumber of occurrencesPfam domainNumber of occurrences
DEAD250MnmE_helical19
S1247RNA_pol_Rpb2_719
GTP_EFTU168tRNA-synt_2e19
GTP_EFTU_D2168Ribosomal_L11_N19
HOK_GEF160KilA-N19
S4152Ribosomal_S419
PseudoU_synth_2150RTC19
CSD133Ribosomal_L219
SpoU_methylase114DnaB_bind19
tRNA_anti-codon113IPPT19
RNase_T99IF3_C19
tRNA-synt_195UPF002019
tRNA-synt_294RNA_pol_A_bac19
Ribonuc_L-PSP91RNA_pol_Rpb1_319
PRD90Se-cys_synth_N19
Anticodon_176RimM19
Formyl_trans_N75Val_tRNA-synt_C19
Aconitase75TruB_C_219
RF-170RNA_pol_Rpb1_419
tRNA-synt_1c58dsrm19
RNase_PH57RNA_pol_Rpb2_119
RNase_PH_C57Rho_N19
Dus57CheR19
HGTP_anticodon57SHS2_FTSA19
tRNA-synt_2b57Helicase_RecD19
tRNA_bind56RNA_pol_Rpb619
Sua5_yciO_yrdC56RNA_pol_Rpb2_619
tRNA_U5-meth_tr56SpoU_methylas_C19
zf-FPG_IleRS55Trm112p19
Ldr_toxin52RNA_pol_Rpb1_119
RtcB50CsrA19
CorA46Ribosomal_S20p19
MazE_antitoxin44TruB-C_219
ProQ41DALR_219
DbpA39IF-219
HA239tRNA_Me_trans19
PolyA_pol_RNAbd38KH_119
TruD38ABC119
IF2_N38tRNA-synt_1_219
PseudoU_synth_138PUA19
Methyltr_RsmF_N38CAT_RBD19
SpoU_sub_bind38PNPase19
SgrR_N38tRNA_m1G_MT19
tRNA_edit38SelB-wing_219
PolyA_pol38RNase_H19
FtsJ38PRC19
HRDC38Ribosomal_L18p19
tRNA-synt_1b38GIDA_assoc19
THUMP38RrnaAD19
DALR_138YjeF_N19
NusB38LigT_PEase19
RNase_E_G38Ub-RnfH19
ASCH37Nol1_Nop2_Fmu_219
DNA_pol_A_exo137RRF19
tRNA_SAD36B518
DHHA136FDX-ACB18
GTP_EFTU_D335GidB18
MqsA_antitoxin27Sigma70_r1_118
TGT21IF3_N18
RNA_pol_Rpb1_220B3_418
Queuosine_synth20tRNA-synt_2c17
RVT_120Colicin-DNase16
tRNA_synt_2f19PTS_2-RNA16
SelB-wing_319CRISPR_Cse114
Methyltrans_RNA19CRISPR_Cse214
Ribosomal_L1119FinO_N13
CRS1_YhbY19PIN13
Tyr_Deacylase19GIIM11
Ribosomal_L419IlvGEDA_leader11
MutS_II19SymE_toxin11
RNA_pol_Rpb1_519PRTase_110
TilS19PELOTA_110
Endonuclease_119DNA_primase_S10
Ribosomal_L25p19IlvB_leader9
Sigma54_CBD19YafO_toxin8
RapA_C19MT-A708
TilS_C19RPAP2_Rtr16
OB_NTP_bind19TisB_toxin5
GlutR_dimer19MqsR_toxin5
RNA_pol_Rpb2_319Ibs_toxin5
RtcR19RNA_ligase4
TruB_N19N363
RsmJ19Cloacin3
tRNA_bind_219Colicin_D2
tRNA-synt_1c_C19Viral_helicase12
RNA_pol_Rpb2_4519IF-2B1
HisG19Colicin_immun1
Rho_RNA_bind19RPOL_N1
PNPase_C19RNA_pol1
RNase_HII19RnlA_toxin1
RNA_pol_A_CTD19NYN1
RNA_pol_L19DUF38501
Rsd_AlgQ19
Pfam RNA-binding domains. The Pfam RBDs and their corresponding occurrences in the GWS of 19 E. coli strains have been listed in this table. The Pfam domains listed are on the basis of Pfam database (v.28) We found that E. coli encodes 185 different types of Pfam RBDs in their proteomes and the DEAD domain was found to be the most abundant, constituting approximately 4% of the total number of Pfam RBD domains in E. coli. The DEAD box family of proteins are RNA helicases that are required for RNA metabolism and thus are important players in gene expression [65]. These proteins use ATP to unwind short RNA duplexes in an unusual fashion and also help in the remodelling of RNA-protein complexes.

Comparison of RNA-binding proteins across strains reveals novel pathogen-specific factors

The proteins were clustered on the basis of sequence homology searches in order to compare and contrast the RBPs across the E. coli strains studied here. The 7902 proteins identified from all the strains were grouped into 384 clusters, on the basis of sequence homology with other members of the cluster (Additional file 2: Table S2). Greater than 99% of the proteins could cluster with one or more RBPs and formed 336 multi-member clusters (MMCs), whereas the rest of the proteins failed to cluster with other RBPs and formed 48 single-member clusters (SMCs). The distribution of members among all the 384 clusters has been depicted in Fig. 3.
Fig. 3

Clusters of RNA-binding proteins. The percentage of RBPs in the different clusters has been represented in this figure. The RBPs obtained from each of the 19 E. coli strains (16 pathogenic and three non-pathogenic strains) have been clustered on the basis of homology searches (see text for further details). Five of the biggest clusters and their identities are as follows: Cluster 5 (ATP-binding subunit of transporters), Cluster 41 (Small toxic polypeptides), Cluster 15 (RNA helicases), Cluster 43 (Cold shock proteins) and Cluster 16 (Pseudouridine synthases)

Clusters of RNA-binding proteins. The percentage of RBPs in the different clusters has been represented in this figure. The RBPs obtained from each of the 19 E. coli strains (16 pathogenic and three non-pathogenic strains) have been clustered on the basis of homology searches (see text for further details). Five of the biggest clusters and their identities are as follows: Cluster 5 (ATP-binding subunit of transporters), Cluster 41 (Small toxic polypeptides), Cluster 15 (RNA helicases), Cluster 43 (Cold shock proteins) and Cluster 16 (Pseudouridine synthases) The largest of the MMCs, consists of 1459 RBPs which are ATP-binding subunit of transporters. The E. coli genome sequence had revealed that the largest family of paralogous proteins were composed of ATP-binding cassette (ABC) transporters [66]. The ATP-binding subunit of ABC transporters share common features with other nucleotide-binding proteins [67] like, the E. coli RecA [68] and the F1-ATPase from bovine heart [69]. GCN20, YEF3 and RLI1 are examples of soluble ABC proteins that interact with ribosomes and regulate translation and ribosome biogenesis [70-72]. The other large MMCs were those of small toxic polypeptides that are components of the bacterial toxin-antitoxin (TA) systems [73-77], RNA helicases that are involved in various aspects of RNA metabolism [78, 79] and Pseudouridine synthases that are enzymes responsible for pseudouridylation, which is the most abundant post-transcriptional modification in RNAs [80]. Cold shock proteins bind mRNAs and regulate translation, rate of mRNA degradation etc. [81, 82]. These proteins are induced during the response of the bacterial cell towards temperature rise. The majority of the SMCs (38 out of 48 SMCs) are RBPs from pathogenic strains and lack homologues in any of the other strains considered here. These include proteins like putative helicases, serine proteases, and various endonucleases. Likewise, members of the small toxic Ibs protein family (IbsA, IbsB, IbsC, IbsD and IbsE that form Clusters 362, 363, 364, 365 and 366 respectively) from strain K12 are noteworthy examples of SMCs that are in non-pathogenic strains only. These Ibs proteins cause the cessation of growth when overexpressed [83].

Pathogen-specific proteins

In this study, the 226 pathogen-specific proteins that formed 43 pathogen-specific clusters are of special interest. Sixty-three of these proteins were previously uncharacterised and associations for all of these proteins were obtained on the basis of sequence homology searches against the NCBI-NR database. The function annotation of each of these clusters were transferred on the basis of homology. The biological functions and the number of RBPs constituting these pathogen-specific clusters have been listed in Table 4.
Table 4

Pathogen-specific RNA-binding protein clusters. The size of RBP clusters with members from only the pathogenic E. coli strains in our GWS of 19 E. coli strains have been listed in this table

Cluster numberNumber of membersCluster name
Cluster 28313KilA-N domain phage proteins
Cluster 29913tRNA(fMet)-specific VapC endonucleases
Cluster 17610DEAD/DEAH box helicases
Cluster 30710CRISPR type I-E/−associated proteins CasA/Cse1
Cluster 30810CRISPR-associated proteins Cas6/Cse3/CasE, subtype I-E
Cluster 30910CRISPR type I-E/−associated proteins CasB/Cse2
Cluster 31010CRISPR-associated proteins Cas5/CasD, subtype I-E
Cluster 6010Putative ATP/GTP-binding proteins
Cluster 3189ASCH domain-containing proteins
Cluster 1228Adenine DNA methyltransferases, phage-associated
Cluster 3058Post-segregational killing toxins
Cluster 1167RNA-binding proteins
Cluster 2986VagC, VagC-homologs
Cluster 3176Type I restriction-modification enzymes, M subunits
Cluster 3196Type I restriction-modification enzymes, R subunits
Cluster 1255Helicases
Cluster 2465DNA-binding proteins
Cluster 2875ATP-dependent helicases
Cluster 2885DEAD/DEAH box helicases
Cluster 2905DEAD/DEAH box helicases
Cluster 635UvrD/REP helicase-like proteins
Cluster 98a 5Protein kinases
Cluster 16142′-5′ RNA ligases
Cluster 3004proQ/FINO family proteins
Cluster 3134ATP-dependent helicases, res subunit of Type III restriction enzyme, SNF2 family helicases
Cluster 3294Serine/Threonine kinases
Cluster 3013YajA proteins
Cluster 3233Putative pyridoxal phosphate-dependent enzymes, Putative transferases
Cluster 3243Sigma-54 dependent transcription regulators, Putative transcriptional regulators of NtrC family
Cluster 3263Leucin-rich repeat proteins
Cluster 3303Ankyrin repeat-containing domain proteins
Cluster 1722Zn-dependent hydrolases, including glyoxylases, Beta-lactamases
Cluster 2382Hypothetical proteins
Cluster 3022Peptidyl-arginine deiminases
Cluster 3032Exonucleases
Cluster 3152KilA-N domain phage proteins
Cluster 3202Pyocin, putative colicin activity proteins
Cluster 3252Hyothetical proteins
Cluster 3272KilA-N domain proteins
Cluster 3322Hypothetical phage associated proteins
Cluster 3342Hypothetical proteins
Cluster 3352Chromosome partitioning ParA proteins
Cluster 3362Serine/Threonine kinases

aAll the proteins in this cluster have sequence homologues in humans

Pathogen-specific RNA-binding protein clusters. The size of RBP clusters with members from only the pathogenic E. coli strains in our GWS of 19 E. coli strains have been listed in this table aAll the proteins in this cluster have sequence homologues in humans If these pathogen-specific proteins are exclusive to the pathogenic strains, then they may be exploited for drug design purposes. To test this hypothesis, we surveyed the human (host) proteome for the presence of sequence homologues of these proteins. It was found that, barring the protein kinases that were members of Cluster 98 (marked in asterisk in Table 4), none of the pathogen-specific proteins were homologous to any human protein within the thresholds employed in the search strategy (please see section for details). Few of the pathogen-specific protein clusters are described in the following section. The DEAD/DEAH box helicases that use ATP to unwind short duplex RNA [65], formed three different clusters. In two of the clusters, the DEAD domains (Pfam ID: PF00270) were associated with C-terminal Helicase_C (Pfam ID: PF00271) and DUF1998 (Pfam ID: PF09369) domains. On the other hand, in a bigger cluster, the DEAD/DEAH box helicases were composed of DNA_primase_S (Pfam ID: PF01896), ResIII (Pfam ID: PF04851) and Helicase_C domains. Four of the pathogen-specific clusters were those of Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR) sequence-associated proteins, consisting of RBPs from 10 pathogenic strains each. Recent literature reports also support the role of CRISPR-associated proteins as virulence factors in pathogenic bacteria [84]. The KilA-N domains are found in a wide range of proteins and may share a common fold with the nucleic acid-binding modules of certain nucleases and the N-terminal domain of the tRNA endonuclease [85]. Fertility inhibition (FinO) protein and the anti-sense FinP RNA are members of the FinOP fertility inhibition complex which regulates the expression of the genes in the transfer operon [86-89]. tRNA (fMet)-specific endonucleases are the toxic components of a TA system. This site-specific tRNA-(fMet) endonuclease acts as a virulence factor by cleaving both charged and uncharged tRNA-(fMet) and inhibiting translation. The Activating Signal Cointergrator-1 homology (ASCH) domain is also a putative RBD due to the presence of an RNA-binding cleft associated with a conserved sequence motif characteristic of the ASC-1 superfamily [90].

Identification of the distinct RNA-binding protein repertoire in E. coli

We identified identical RBPs across E. coli strains, on the basis of sequence homology searches and other filtering criteria (as mentioned in the Methods section). Out of the 7902 RBPs identified in our GWS, 6236 had one or more identical partners from one or more strains and formed 1227 clusters, whereas 1666 proteins had no identical counterparts. Hence, our study identified 2893 RBPs from 19 E. coli strains that were distinct from each other. Identification of such a distinct pool of RBPs will help to provide an insight to the possible range of functions performed by this class of proteins in E. coli, and hence compare and contrast with the possible functions performed by RBPs in other organisms.

GWS of RNA-binding proteins in all known E. coli strains

We extended the above-mentioned study, by performing GWS of RBPs in 166 complete E. coli proteomes available in the RefSeq database (May 2016) and a total of 8464 proteins were identified (Additional file 3). It should be noted that, unlike the nomenclature system of UniProt, where the same protein occurring in different strains are denoted with different UniProt accession IDs, RefSeq assigns same or at times different accession IDs to the same protein occurring in different strains. Thus, on the basis of unique accession IDs, 8464 RBPs were identified. The 8464 RBPs were grouped into 401 clusters on the basis of sequence homology with other members of the cluster. We found that greater than 99% of the proteins could cluster with one or more RBPs and formed 339 MMCs, whereas the rest of the proteins failed to cluster with other RBPs and formed 62 SMCs. The above-mentioned GWS statistics for RBP numbers have been plotted in Fig. 4a. The number of different Pfam RBDs found across all complete E. coli proteomes has been shown in Fig. 4b. Similar to the afore-mentioned results, seen from the dataset of 19 E. coli proteomes, it was found that E. coli encodes 188 different types of Pfam RBDs in their proteomes and the DEAD domain was still observed to be the most abundant, constituting approximately 6% of the total number of Pfam RBD domains in E. coli. The length distribution of RBPs from E. coli have been plotted in Fig. 4c and RBPs of the length 201–300 amino acids were found to be the most prevalent.
Fig. 4

Statistics for the genome-wide survey of 166 E. coli strains. The different statistics obtained from the GWS have been represented in this figure. a The number of RBPs as determined by different methods (see text for further details). b The abundance of Pfam RBDs. 188 types of Pfam RBDs were found to be encoded in the RBPs, of which DEAD domains have the highest representation (approximately 6% of all Pfam RBDs). c The length distribution of RBPs

Statistics for the genome-wide survey of 166 E. coli strains. The different statistics obtained from the GWS have been represented in this figure. a The number of RBPs as determined by different methods (see text for further details). b The abundance of Pfam RBDs. 188 types of Pfam RBDs were found to be encoded in the RBPs, of which DEAD domains have the highest representation (approximately 6% of all Pfam RBDs). c The length distribution of RBPs

Identification of the complete distinct RBPome in 166 proteomes of E. coli

These 8464 RBPs (please see previous section) formed 1285 clusters of two or more identical proteins, accounting for 3532 RBPs, whereas the remaining 4932 RBPs were distinct from the others. Hence, 6217 RBPs, distinct from each other, were identified from all known E. coli strains, which is much greater than the number (2893) found from 19 E. coli proteomes. It should be noted that the pathogenicity annotations are not very clear for few of the 166 E. coli strains for which complete proteome information are available. Hence, we have performed the analysis for the pathogen-specific proteins using the smaller dataset of 19 proteomes, whereas all the 166 complete proteomes have been considered for the analysis for the complete E. coli RBPome.

Case studies

Three case studies on interesting RBPs were performed to answer some outstanding questions and have been described in the following sections. The first of the three examples, deals with a RNase PH protein that does not cluster with those from any of the other 165 E. coli proteomes considered in this study. This protein, which forms a SMC, is interesting in the biological context due to its difference with the other RNase PH proteins, both at the level of sequence as well as biological activity. The second case study deals with a protein that is a part of a pathogen-specific cluster, in which none of the proteins are well-annotated. This protein was found to encode a bacterial homologue of a well-known archaeo-eukaryotic RBD, whose RNA-binding properties are not as well studied as its homologues. The final study involves a sequence-based approach to analyse the pathogen-specific CRISPR-associated Cas6 proteins, and compare the same with similar proteins from the non-pathogenic strains.

Case study 1: RNase PH from strain K12 is inactive due to a possible loss of stability of the protein

RNase PH is a phosphorolytic exoribonuclease involved in the maturation of the 3′-end of transfer RNAs (tRNAs) containing the CCA motif [91-93]. The RNase PH protein from strain K12 was found to be distinct from all other known RNase PH proteins from E. coli and has a truncated C-terminus. In 1993, DNA sequencing studies had revealed that a GC base pair (bp) was missing in this strain from a block of five GC bps found 43–47 upstream of the rph stop codon [94]. This one base pair deletion leads to a translation frame shift over the last 15 codons, resulting in a premature stop codon (five codons after the deletion). This premature stop codon, in turn, leads to the observed reduction in size of the RNase PH protein by 10 residues. It was also shown by Jensen [94] that this protein lacks RNase PH activity. Figure 5a shows a schematic representation of the DAs of the active (up) and inactive (down) RNase PH proteins, with the five residues that have undergone mutations and the ten residues that are missing from the inactive RNase PH protein depicted in orange and yellow, respectively. These are the residues of interest in our study. The same colour coding has been used both in Fig. 5a and b.
Fig. 5

Modelling of the RNase PH proteins from two different E. coli strains. The structural modelling of the RNase PH protein has been represented in this figure. a Schematic diagram of the active (above) and the inactive (below) RNase PH proteins. The RNase PH and the RNase_PH_C domains, as defined by Pfam (v.28), have been represented in magenta and pink, respectively. The five residues that have undergone mutations due to a point deletion and the ten residues that are missing from the inactive RNase PH protein from strain K12 have been depicted in orange and yellow, respectively. These two sets of residues are the ones of interest in this study. b Model of the RNase PH monomer from strain O26:H11. The residues with the same colour codes as mentioned in panel (a), have been represented on the structure of the model. The residues that are within an 8 Å cut-off distance from the residues of interest have been highlighted in cyan (left). c Structure of the RNase PH hexamer from strain O26:H11 (left) and the probable structure of the inactive RNase PH hexamer from strain K12 (right). The dimers marked in black boxes are the ones that were randomly selected for MD simulations. d Electrostatic potential on the solvent accessible surface of the RNase PH hexamer from strain O26:H11 (left) and that of the inactive RNase PH hexamer from strain K12 (right)

Modelling of the RNase PH proteins from two different E. coli strains. The structural modelling of the RNase PH protein has been represented in this figure. a Schematic diagram of the active (above) and the inactive (below) RNase PH proteins. The RNase PH and the RNase_PH_C domains, as defined by Pfam (v.28), have been represented in magenta and pink, respectively. The five residues that have undergone mutations due to a point deletion and the ten residues that are missing from the inactive RNase PH protein from strain K12 have been depicted in orange and yellow, respectively. These two sets of residues are the ones of interest in this study. b Model of the RNase PH monomer from strain O26:H11. The residues with the same colour codes as mentioned in panel (a), have been represented on the structure of the model. The residues that are within an 8 Å cut-off distance from the residues of interest have been highlighted in cyan (left). c Structure of the RNase PH hexamer from strain O26:H11 (left) and the probable structure of the inactive RNase PH hexamer from strain K12 (right). The dimers marked in black boxes are the ones that were randomly selected for MD simulations. d Electrostatic potential on the solvent accessible surface of the RNase PH hexamer from strain O26:H11 (left) and that of the inactive RNase PH hexamer from strain K12 (right) To provide a structural basis for this possible loss of activity of the RNase PH protein from strain K12, we modelled the structures of the RNase PH protein monomer as well as the hexamer from strains O26:H11 and K12 (Fig. 5b and c). It is known in the literature that the hexamer (trimer of dimers) is the biological unit of the RNase PH protein and that the hexameric assembly is mandatory for the activity of the protein [95, 96]. The stability of both the monomer and the hexamer were found to be affected in strain K12, as compared to that in strain O26:H11. The energy values have been plotted in Fig. 6a. In both monomer and hexamer, there is a reduction in stability, suggesting that the absence of C-terminal residues affects the stability of the protein, perhaps more than a cumulative contribution to the stability of the protein. It should be noted that since the monomeric form of the inactive protein is less stable than that of its active counterpart, the hexameric assembly of the inactive RNase PH protein is only a putative one. Hence, the putative and/or unstable hexameric assembly of the RNase PH protein, leads to the loss of activity of the protein.
Fig. 6

Energy values for the active and inactive RNase PH monomers, dimers and hexamers. The energy values (in kJ/mol) for the active (blue) and the inactive (red) RNase PH proteins, as calculated by SYBYL (in panel a) and PPCheck (in panel b) have been plotted in this figure. a The energy values for the active and the inactive RNase PH monomers and hexamers. The results show that both the monomeric, as well as the hexameric forms of the inactive RNase PH protein, is unstable as compared to the those of the active RNase PH protein. b The interface energy values for the active and the inactive RNase PH dimers (as marked in black boxes in Fig. 5c). The results show that the dimer interface of the inactive RNase PH protein is less stabilised as compared to that of the active RNase PH protein

Energy values for the active and inactive RNase PH monomers, dimers and hexamers. The energy values (in kJ/mol) for the active (blue) and the inactive (red) RNase PH proteins, as calculated by SYBYL (in panel a) and PPCheck (in panel b) have been plotted in this figure. a The energy values for the active and the inactive RNase PH monomers and hexamers. The results show that both the monomeric, as well as the hexameric forms of the inactive RNase PH protein, is unstable as compared to the those of the active RNase PH protein. b The interface energy values for the active and the inactive RNase PH dimers (as marked in black boxes in Fig. 5c). The results show that the dimer interface of the inactive RNase PH protein is less stabilised as compared to that of the active RNase PH protein Figure 5b shows that the residues marked in cyan (left) are at an interacting distance of 8 Å from the residues of interest (left). These residues marked in cyan are a subset of the RNase PH domain, which is marked in magenta (right). Hence, the loss of possible interactions (between the residues marked in cyan and the residues of interest) and subsequently stability of the three-dimensional structure of the RNase PH domain might explain the inactive nature of the protein from strain K12. Figure 5d shows differences in the electrostatic potential on the solvent accessible surfaces of the active (left) and inactive (right) RNase PH proteins. To test this hypothesis for the possible loss of function of the RNase PH protein due to loss of stability of the monomer and/or the hexamer, we performed MD simulations to understand distortions, if any, of the monomer and a randomly selected head-to-head dimer (from the hexameric assembly) of both the active and the inactive proteins. The dimers have been marked in black boxes in Fig. 5c. Various energy components of the dimer interface, as calculated by PPCheck, have been plotted in Fig. 6b. The results show that the inactive RNase PH dimer interface is less stabilised as compared to that of the active protein. The trajectories of the MD runs have been shown in additional movie files (Additional file 4, Additional file 5, Additional file 6 and Additional file 7, for the active monomer, inactive monomer, active dimer and inactive dimer, respectively). Analyses of Additional file 4, and Additional file 5 shows a slight distortion in the short helix (pink) in the absence of residues of interest (orange and yellow), which might lead to overall loss of stability of the monomer. Further analyses (Additional file 6 and Additional file 7) show the floppy nature of the terminal part of the helices that are interacting in the dimer. This is probably due to the loss of the residues of interest, which have been seen to be structured and less floppy in the active RNase PH dimer (Additional file 6). For each of the systems, the H-bond traces for three replicates (represented in different colours) have been depicted. From these figures, we can observe that the replicates are showing similar H-bonding patterns. Analyses of the number of hydrogen bonds (H-bonds) formed in the system over each picosecond of the MD simulations of the active monomer, inactive monomer, active dimer and inactive dimer have been represented in Fig. 8a, b, c and d, respectively. Comparison of panels a and b of this figure shows a greater number of H-bonds being formed in the active monomer, as compared to that of the inactive monomer, over the entire time period of the simulation. Similarly, comparison of panels c and d of this figure shows a greater number of H-bonds being formed in the active dimer as compared to that of the inactive dimer, over the entire time period of the simulation. These losses of H-bonding interactions might lead to overall loss of stability of the dimer and subsequently that of the hexamer.
Fig. 8

Hydrogen bonding patterns in molecular dynamics simulations. The number of H-bonds formed over each picosecond of the MD simulations (described in this Chapter) have been shown in this figure. Each of the six panels (systems) shows the H-bond traces from three replicates (represented in different colours). a Active RNase PH monomer. b Inactive RNase PH monomer. c Active RNase PH dimer. d Inactive RNase PH dimer. e PELOTA_1 domain from the ‘uncharacterised’ protein in complex with kink-turn RNA. f L7Ae K-turn binding domain from A. fulgidus in complex with kink-turn RNA from H. marismortui

Case study 2: Uncharacterised pathogen-specific protein and its homologues show subtly different RNA-binding properties

In our study, we observed that Cluster 60 was composed of 10 proteins, each from a different pathogenic strain studied here. All the proteins in this cluster were either annotated as ‘putative’, ‘uncharacterised’, ‘hypothetical’ or ‘predicted’. To understand the RNA-binding properties of these orthologous pathogen-specific proteins, we resolved the Pfam DA of this protein. In particular, such an association to Pfam domains provide function annotation to a hitherto uncharacterised protein, from strain O103:H2, to RBD PELOTA_1. Hence, the structure of the RNA-binding PELOTA_1 domain of this protein was modelled on the basis of the L7Ae protein from M. jannaschii (Fig. 7a). Uncharacterised pathogen-specific RNA-binding protein. The characterisation of the uncharacterised pathogen-specific RBP has been represented in this figure. a Schematic representation of the domain architecture of the protein. The RNA-binding PELOTA_1 domain and its model has been shown here. b Structural superposition of the L7Ae K-turn binding domain (PDB code: 4BW0: B) (in red) and the model of the uncharacterised protein PELOTA_1 domain (in blue). c. Comparison of the kink-turn RNA-bound forms of the L7Ae K-turn binding domain (PDB code: 4BW0: B) (up) and that of the model of the uncharacterised protein PELOTA_1 domain (down). The RNA-binding residues have been highlighted in yellow Domains that are involved in core processes, such as RNA maturation, e.g. the tRNA endonucleases, and translation and with an archaeo-eukaryotic phyletic pattern includes the PIWI, PELOTA and SUI1 domains [97]. In 2014, Anantharaman and co-workers had shown associations of the conserved C-terminus of a phosphoribosyltransferase (PRTase) in the Tellurium resistance (Ter) operon to a PELOTA or Ribosomal_L7Ae domain (Pfam ID: PF01248) [98]. These domains are homologues of the eukaryotic release factor 1 (eRF1), which is involved in translation termination. Unlike the well-studied PELOTA domain, the species distribution of the PELOTA_1 domain is solely bacterial and not much is known in literature regarding the specific function of this domain. Structure of this modelled PELOTA_1 domain from the uncharacterised protein was aligned with that of the L7Ae kink-turn (K-turn) binding domain from an archaeon (A. fulgidus) (Fig. 7b). The model also retained the same basic structural unit as the eRF1 protein (data not shown). The L7Ae is a member of a family of proteins that binds K-turns in many functional RNA species [99]. The K-turn RNA was docked onto the model, guided by the equivalents of the known RNA-interacting residues from the archaeal L7Ae K-turning binding domain. Both the complexes have been shown in Fig. 7c with the RNA-interacting residues highlighted in yellow. MD simulations of both these complexes were performed and the trajectories have been shown in additional movie files Additional file 8 (PELOTA_1 domain model-k-turn RNA complex) and Additional file 9 (L7Ae K-turn binding domain-k-turn RNA complex). For each of the systems, the H-bond traces for three replicates (represented in different colours) have been depicted. From these figures, one can observe that the replicates are showing similar H-bonding patterns. Analyses of the number of H-bonds formed between the protein and the RNA over each picosecond of the MD simulations of the PELOTA_1 domain-RNA complex and the L7Ae K-turn binding domain-RNA complex have been represented in Fig. 8e and f, respectively. Comparison of panels e and f of this figure shows a greater number of H-bonds being formed in the L7Ae K-turn binding domain-RNA complex as compared to that of the PELOTA_1 domain-RNA complex over the entire time period of the simulation. These results show that the two proteins have differential affinity towards the same RNA molecule. This hints at the fact that these proteins might perform subtly different functions by the virtue of having differential RNA-binding properties. Hydrogen bonding patterns in molecular dynamics simulations. The number of H-bonds formed over each picosecond of the MD simulations (described in this Chapter) have been shown in this figure. Each of the six panels (systems) shows the H-bond traces from three replicates (represented in different colours). a Active RNase PH monomer. b Inactive RNase PH monomer. c Active RNase PH dimer. d Inactive RNase PH dimer. e PELOTA_1 domain from the ‘uncharacterised’ protein in complex with kink-turn RNA. f L7Ae K-turn binding domain from A. fulgidus in complex with kink-turn RNA from H. marismortui

Case study 3: Pathogen specific Cas6-like proteins might be functional variants of the well-characterised non-pathogenic protein

In many bacteria, as well archaea, CRISPR associated Cas proteins and short CRISPR-derived RNA (crRNA) assemble into large RNP complexes and provide surveillance towards invasion of genetic parasites [100-102]. The role of CRISPR-associated proteins as virulence factors in pathogenic bacteria has also been reported in recent literature [84]. We found that Cluster 308 consists of 10 pathogen-specific proteins, of which half of them were already annotated as Cas6 proteins, whereas the other half constituted of ‘uncharacterised’ or ‘hypothetical’ proteins. As mentioned in the Methods section, the latter proteins were annotated on the basis of sequence homology to known proteins in the NR database, as Cas6 proteins. Molecular phylogeny analysis of all the proteins from Cluster 308 and Cas6 from E. coli strain K12 has been depicted in Additional file 10a: Figure S1, which reinstates the fact that the pathogen-specific proteins are more similar to each other, in terms of sequence, than they are to the Cas6 protein from the non-pathogenic strain K12. Furthermore, a similar analysis of two previously uncharacterised proteins (UniProt IDs: C8U9I8 and C8TG04) (red) from this pathogen-specific Cas6 proteins cluster (Cluster 308), with other known Cas6 proteins has been shown Additional file 10b: Figure S1. From the phylogenetic tree, one can infer that the pathogen-specific Cas6 proteins are more similar in terms of sequence to the Cas6 from E. coli strain K12 (blue) than that from other organisms. Multiple sequence alignment (MSA) of all the proteins from Cluster 308 and Cas6 from strain K12 has been shown in Fig. 9. The RNA-binding residues in E. coli strain K12 Cas6 protein (union set of RNA-binding residues inferred from each of the three known PDB structures (see section)) have been highlighted in yellow on its sequence (CAS6_ECOLI) on the MSA. The corresponding residues in the other proteins on the MSA, which are same as that in CAS6_ECOLI, have also been highlighted in yellow, whereas those which differ have been highlighted in red. From Fig. 9a, we can conclude that the majority of the RNA-binding residues in CAS6_ECOLI are not conserved in the pathogen-specific Cas6 proteins, and can be defined as ‘class-specific residues’. A similar colouring scheme has been followed in Fig. 9b, to analyse the conservation of protein-interacting residues in these proteins. From these analyses, we can speculate that due to the presence of a large proportion of ‘class-specific residues’, the RNA-binding properties, as well as protein-protein interactions, might be substantially different among the Cas6 proteins from non-pathogenic and pathogenic E. coli strains, which might lead to functional divergence. Secondary structures of each of these proteins, mapped on their sequence (α-helices highlighted in cyan and β-strands in green) in Fig. 9c, also hint at slight structural variation among these proteins.
Fig. 9

Sequence analysis of pathogen-specific Cas6-like proteins. Comparison of sequence features of Cas6 proteins from pathogenic (Cluster 308) and non-pathogenic K12 strains. a Comparison of RNA-binding residues. The RNA-binding residues in E. coli strain K12 Cas6 protein have been highlighted in yellow on its sequence (CAS6_ECOLI) on the MSA. The corresponding residues in the other proteins on the MSA, which are same as that in CAS6_ECOLI, have also been highlighted in yellow, whereas those which differ have been highlighted in red. b Comparison of protein-interacting residues. The protein-interacting residues in E. coli strain K12 Cas6 protein have been highlighted in yellow on its sequence (CAS6_ECOLI). A similar colour scheme has also been followed here. c Secondary structure prediction. The α-helices have been highlighted in cyan and the β-strands in green

Sequence analysis of pathogen-specific Cas6-like proteins. Comparison of sequence features of Cas6 proteins from pathogenic (Cluster 308) and non-pathogenic K12 strains. a Comparison of RNA-binding residues. The RNA-binding residues in E. coli strain K12 Cas6 protein have been highlighted in yellow on its sequence (CAS6_ECOLI) on the MSA. The corresponding residues in the other proteins on the MSA, which are same as that in CAS6_ECOLI, have also been highlighted in yellow, whereas those which differ have been highlighted in red. b Comparison of protein-interacting residues. The protein-interacting residues in E. coli strain K12 Cas6 protein have been highlighted in yellow on its sequence (CAS6_ECOLI). A similar colour scheme has also been followed here. c Secondary structure prediction. The α-helices have been highlighted in cyan and the β-strands in green

Discussion

We have employed a sequence search-based method to compare and contrast the proteomes of 16 pathogenic and three non-pathogenic E. coli strains as well as to obtain a global picture of the RBP landscape in E. coli. The results obtained from this study showed that the pathogenic strains encode a greater number of RBPs in their proteomes, as compared to the non-pathogenic ones. The DEAD domain, involved in RNA metabolism, was found to be the most abundant of all identified RBDs. The complete and distinct RBPome of E. coli was also identified by studying all known E. coli strains till date. In this study, we identified RBPs that were exclusive to pathogenic strains, and most of them can be exploited as drug targets by virtue of being non-homologous to their human host proteins. Many of these pathogen-specific proteins were uncharacterised and their identities could be resolved on the basis of sequence homology searches with known proteins. Further, in this study, we performed three case studies on interesting RBPs. In the first of the three studies, a tRNA processing RNase PH enzyme from strain K12 was investigated that is different from that in all other E. coli strains in having a truncated C-terminus and being functionally inactive. Structural modelling and molecular dynamics studies showed that the loss of stability of the monomeric and/or the hexameric (biological unit) forms of this protein from E. coli strain K12, might be the possible reason for the lack of its functional activity. In the second study, a previously uncharacterised pathogen-specific protein was studied and was found to possess subtly different RNA-binding affinities towards the same RNA stretch as compared to its well characterised homologues in archaea and eukaryotes. This might hint at different functions of these proteins. In the third case study, pathogen-specific CRISPR-associated Cas6 proteins were analysed and found to have diverged functionally from the known prototypical Cas6 proteins.

Conclusions

The approach used in our study to cross-compare proteomes of pathogenic and non-pathogenic strains may also be extended to other bacterial or even eukaryotic proteomes to understand interesting differences in their RBPomes. The pathogen-specific RBPs reported in this study, may also be taken up further for clinical trials and/or experimental validations. The effect of the absence of a functional RNase PH in E. coli strain K12 is not clear. The role of the PELOTA_1 domain-containing protein may also be reinforced performing knockdown and rescue experiments. These might help to understand the functional overlap of this protein with its archaeal or eukaryotic homologues. Introduction of this pathogen-specific protein in non-pathogens might also provide probable answers towards its virulence properties. The less conserved RNA-binding and protein-interacting residues in the pathogen-specific Cas6 proteins, might point to functional divergence of these proteins from the known ones, but warrants further investigation. RNA-binding proteins in 19 E. coli proteomes. All the RBPs obtained in the GWS of 19 E. coli strains have been listed in this table. The pathogenic and non-pathogenic E. coli strains have been highlighted in red and green, respectively. (DOC 115 kb) Clusters of RNA-binding proteins obtained from 19 E. coli proteomes. The clusters of RBPs with more than one member in the GWS of 19 E. coli strains have been listed in this table. The RBPs were clustered based on BLASTP searches at E-value, percentage identity and percentage query coverage cut-offs of 10−5, 30 and 70, respectively. (DOC 376 kb) RNA-binding proteins in all complete E. coli proteomes. All the RBPs obtained in the GWS of 166 E. coli strains have been listed here. The RefSeq IDs of the proteins are listed along with the total number of strains in which the protein is present mentioned in brackets. (DOC 185 kb) 100 ns molecular dynamics simulations of the active RNase PH monomer in the AMBER99SB protein, nucleic AMBER94 force field. The protein has been colour coded as in Fig. 5b. Hydrogen bonds at distance and angle cut-offs of 3 Å and 20°, respectively, have been shown at the region of interest with black dotted lines. (MP4 50,376 kb) 100 ns molecular dynamics simulations of the inactive RNase PH monomer in the AMBER99SB protein, nucleic AMBER94 force field. The protein has been colour coded as in Fig. 5b. Hydrogen bonds at distance and angle cut-offs of 3 Å and 20°, respectively, have been shown at the region of interest with black dotted lines. (MP4 67,789 kb) 100 ns molecular dynamics simulations of the active RNase PH dimer in the AMBER99SB protein, nucleic AMBER94 force field. The protein has been colour coded as in Fig. 5b and c. Hydrogen bonds at distance and angle cut-offs of 3 Å and 20°, respectively, have been shown at the region of interest with black dotted lines. (MP4 67,624 kb) 100 ns molecular dynamics simulations of the inactive RNase PH dimer in the AMBER99SB protein, nucleic AMBER94 force field. The protein has been colour coded as in Fig. 5b and c. Hydrogen bonds at distance and angle cut-offs of 3 Å and 20°, respectively, have been shown at the region of interest with black dotted lines. (MP4 67,820 kb) 100 ns molecular dynamics simulations of the PELOTA_1 domain from the ‘uncharacterised’ protein in complex with kink-turn RNA, in the AMBER99SB protein, nucleic AMBER94 force field. The protein has been represented in blue and the RNA in red. Hydrogen bonds at distance and angle cut-offs of 3 Å and 20°, respectively, have been shown between the protein and the RNA has been shown with black dotted lines. (MP4 54,830 kb) 100 ns molecular dynamics simulations of the L7Ae K-turn binding domain from Archaeoglobus fulgidus in complex with kink-turn RNA from H. marismortui (PDB code: 4BW0: B), in the AMBER99SB protein, nucleic AMBER94 force field. The protein has been represented in blue and the RNA in red. Hydrogen bonds at distance and angle cut-offs of 3 Å and 20°, respectively, have been shown between the protein and the RNA has been shown with black dotted lines. (MP4 66,564 kb) Molecular phylogeny analysis of Cas6 proteins. a. All the proteins from Cluster 308 and Cas6 from E. coli strain K12. b. Two previously uncharacterised proteins (UniProt IDs: C8U9I8 and C8TG04) from Cluster 308, with other known Cas6 proteins, including that from E. coli strain K12. In both the panels, the above-mentioned two previously uncharacterised proteins from the pathogen-specific Cas6 proteins cluster (Cluster 308) have been highlighted in red and the Cas6 protein from E. coli strain K12 in blue. (JPEG 4531 kb)
  99 in total

Review 1.  Themes in RNA-protein recognition.

Authors:  D E Draper
Journal:  J Mol Biol       Date:  1999-10-22       Impact factor: 5.469

2.  A small RNA serving both the Hfq and CsrA regulons.

Authors:  Erik Holmqvist; Jörg Vogel
Journal:  Genes Dev       Date:  2013-05-15       Impact factor: 11.361

Review 3.  RNA helicases at work: binding and rearranging.

Authors:  Eckhard Jankowsky
Journal:  Trends Biochem Sci       Date:  2011-01       Impact factor: 13.807

Review 4.  Cold shock response in Escherichia coli.

Authors:  K Yamanaka
Journal:  J Mol Microbiol Biotechnol       Date:  1999-11

Review 5.  Pathogenicity islands and the evolution of microbes.

Authors:  J Hacker; J B Kaper
Journal:  Annu Rev Microbiol       Date:  2000       Impact factor: 15.500

6.  Hfq is essential for Vibrio cholerae virulence and downregulates sigma expression.

Authors:  Yanpeng Ding; Brigid M Davis; Matthew K Waldor
Journal:  Mol Microbiol       Date:  2004-07       Impact factor: 3.501

7.  Repression of small toxic protein synthesis by the Sib and OhsC small RNAs.

Authors:  Elizabeth M Fozo; Mitsuoki Kawano; Fanette Fontaine; Yusuf Kaya; Kathy S Mendieta; Kristi L Jones; Alejandro Ocampo; Kenneth E Rudd; Gisela Storz
Journal:  Mol Microbiol       Date:  2008-08-14       Impact factor: 3.501

Review 8.  Pathogenic Escherichia coli.

Authors:  James B Kaper; James P Nataro; Harry L Mobley
Journal:  Nat Rev Microbiol       Date:  2004-02       Impact factor: 60.633

9.  The Escherichia coli K-12 "wild types" W3110 and MG1655 have an rph frameshift mutation that leads to pyrimidine starvation due to low pyrE expression levels.

Authors:  K F Jensen
Journal:  J Bacteriol       Date:  1993-06       Impact factor: 3.490

10.  PPCheck: A Webserver for the Quantitative Analysis of Protein-Protein Interfaces and Prediction of Residue Hotspots.

Authors:  Anshul Sukhwal; Ramanathan Sowdhamini
Journal:  Bioinform Biol Insights       Date:  2015-09-21
View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.