A S M Zisanur Rahman, Lukas Timmerman, Flyn Gallardo, Silvia T Cardona.
Abstract
A first clue to gene function can be obtained by examining whether a gene is required for life in certain standard conditions, that is, whether a gene is essential. In bacteria, essential genes are usually identified by high-density transposon mutagenesis followed by sequencing of insertion sites (Tn-seq). These studies assign the term "essential" to whole genes rather than to the protein domain sequences that encode the essential functions. However, genes can code for multiple protein domains that evolve their functions independently. Therefore, when an essential gene codes for more than one protein domain, possibly only one of them is essential. In this study, we defined this subset of genes as "essential domain-containing" (EDC) genes. Using a Tn-seq data set built in Burkholderia cenocepacia K56-2, we developed an in silico pipeline to identify EDC genes and the essential protein domains they encode. We found forty candidate EDC genes and demonstrated growth defect phenotypes using CRISPR interference (CRISPRi). This analysis included two knockdowns of genes encoding the protein domains of unknown function DUF2213 and DUF4148. These putative essential domains are conserved in more than two hundred bacterial species, including human and plant pathogens. Together, our study suggests that essentiality should be assigned to individual protein domains rather than genes, contributing to a first functional characterization of protein domains of unknown function.
Year: 2022 PMID: 35046497 PMCID: PMC8770471 DOI: 10.1038/s41598-022-05028-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Schematics of Tn-seq reads mapped to the insertion sites in non-essential (a), essential (b), and essential domain-containing (EDC) genes (c–d). The number of transposon insertions relative to the length of the gene (minus the non-informative 10% at each of the 5′ and 3′ ends) is quantified and used to classify genes as non-essential (a) or essential (b) according to the relative number of reads mapped to that gene. Tn-seq analysis may miss EDC genes, which are essential genes containing an essential domain that does not span the whole length of the gene (c–d). Genes are represented by arrows. Tn-seq reads that map to regions of those genes are represented by black boxes. Essential and non-essential regions are colored in red and green, respectively.
Figure 2. Identification of putative essential domain-containing (EDC) genes from a Tn-seq dataset. Tn-seq reads are first mapped to the reference genome. A custom-built script identifies genes in which the transposon insertions are biased towards one half of the gene. The script parameters "min ratio" and "min reads" were set such that genes were selected when (i) a half-region of that gene (at least 40% of the total gene length) showed no insertions (min ratio = 0), and (ii) the other half contained mapped reads in at least 14% of that gene-half length (min reads = 0.14) (see Supplementary Fig. 1 and Material and Methods for details). Reads mapping to the 10% of the gene at each of the 5′ and 3′ ends were discarded from the analysis.
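The selection rule described in the Figure 2 caption can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' script: the `Gene` class, the function `classify_edc`, and the coordinate convention are hypothetical. In particular, "min ratio" is read here as the maximum allowed ratio of reads in the sparse half to reads in the dense half, and "min reads" as the minimum number of reads per base of the dense half; the original script may define both differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are assumed to run 5' -> 3'."""
    locus_tag: str
    start: int  # first base of the gene (5' end)
    end: int    # one past the last base of the gene (3' end)


def classify_edc(
    gene: Gene,
    read_positions: Sequence[int],
    trim: float = 0.10,       # fraction discarded at each gene end
    min_ratio: float = 0.0,   # assumed: max sparse/dense read ratio
    min_reads: float = 0.14,  # assumed: min reads per base of the dense half
) -> Optional[str]:
    """Return the insertion-free half ("5' half" or "3' half"), or None."""
    length = gene.end - gene.start
    lo = gene.start + trim * length   # start of the informative region
    hi = gene.end - trim * length     # end of the informative region
    mid = (lo + hi) / 2
    half_len = (hi - lo) / 2          # 40% of the gene when trim = 0.10

    # Count reads in each half, ignoring the trimmed 10% ends.
    five = sum(1 for p in read_positions if lo <= p < mid)
    three = sum(1 for p in read_positions if mid <= p < hi)

    sparse, dense = min(five, three), max(five, three)
    if dense == 0:
        return None                   # no informative reads at all
    if sparse / dense > min_ratio:
        return None                   # insertions are not biased enough
    if dense / half_len < min_reads:
        return None                   # too few reads in the permissive half
    return "5' half" if five < three else "3' half"


gene = Gene("example_gene", 0, 1000)
reads = list(range(600, 660))         # 60 reads, all in the 3' half
print(classify_edc(gene, reads))
```

With the default parameters, a gene passes only when one trimmed half carries no reads at all, which matches condition (i) of the caption; condition (ii) is approximated by the read-density check on the other half.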
Figure 3. Biased transposon insertion identifies putative essential domains of uncharacterized hypothetical proteins. Tn-seq reads from [17] were mapped to the B. cenocepacia K56-2 genome, and genes with biased insertion patterns were predicted to contain essential domains. (a) The script identified the well-characterized essential N-terminal domains of DnaK (BCAL3270) and NusA (BCAL1506). Their respective CRISPRi mutants demonstrated a conditional growth defect. (b) Two uncharacterized genes, BCAM1066 (WQ49_RS16145) and BCAS0158 (WQ49_RS10495), contain the Pfam domains DUF2213 (PF09979) and DUF4148 (PF13663), respectively, at the N-terminal end. The Tn-seq reads map to the C-terminal end of these genes, indicating the essentiality of DUF2213 and DUF4148. Putative essential domains are highlighted in blue. Black triangles represent the transposon insertion sites. Numbers on top of the domains denote amino acid sequence positions. Blue and red lines in the growth curves (a and b) represent growth in the absence and presence of rhamnose, respectively. Growth curves are shown for the most efficient sgRNAs. Growth curve values are the average of three independent biological replicates. Error bars indicate mean ± SD.
Putative essential genes and domains identified based on biased transposon insertions.
| K56-2 locus tag | Homolog J2315 locus tag | Product name | Function | Reads at 5′ half | Reads at 3′ half | Identified putative essential domain |
|---|---|---|---|---|---|---|
| WQ49_RS00050 | BCAL3469 | Cell division protein FtsL | Essential cell division protein | 0 | 23 | Domain (FtsL) |
| WQ49_RS00770 | BCAL3328 | NUDIX hydrolase | Nucleoside-diphosphatase | 0 | 49 | Domain (Nudix hydrolase) |
| WQ49_RS00885 | BCAL3305 | Preprotein translocase subunit YajC | Secretase/insertase | 21 | 0 | New |
| WQ49_RS01035 | BCAL3270 | DnaK | Chaperone | 0 | 227 | N-terminal Domain |
| WQ49_RS02920 | BCAM1451 | Hypothetical protein | Unknown | 43 | 0 | New |
| WQ49_RS03160 | BCAM1502 | Hypothetical protein | Unknown | 59 | 0 | New |
| WQ49_RS03550 | QU43_RS62245 | Hypothetical protein | Unknown | 33 | 0 | New |
| WQ49_RS03805 | BCAM1624 | MaoC family dehydratase | MaoC-like dehydratase | 46 | 0 | New |
| WQ49_RS04450 | BCAM1749 | Hypothetical protein | Unknown | 17 | 0 | New |
| WQ49_RS07360 | BCAM2338 | Glycosyl transferase family 1 | UDP-glycosyltransferase | 0 | 152 | Domain (Glyco_transf_28) |
| WQ49_RS07395 | QU43_RS66100 | Hypothetical protein | Unknown | 0 | 58 | New |
| WQ49_RS09185 | BCAS0417 | Cytochrome biogenesis protein CcdA | Electron transfer | 0 | 38 | New |
| WQ49_RS10495 | BCAS0158 | Hypothetical protein | Unknown | 0 | 34 | Domain (DUF4148) |
| WQ49_RS11915 | BCAL0324 | TatB | Protein transmembrane transporter | 0 | 57 | Domain (TatA_B_E) |
| WQ49_RS12045 | BCAL0298 | Thiamine biosynthesis protein ThiS | Thiamine biosynthesis protein ThiS | 0 | 50 | Domain (ThiS) |
| WQ49_RS12280 | BCAL0250 | 50S ribosomal protein L18 | Structural constituent of ribosome | 0 | 65 | Domain (Ribosomal_L18p) |
| WQ49_RS12305 | BCAL0245 | RplX | Structural constituent of ribosome | 20 | 0 | Domain (L24-Pfam) |
| WQ49_RS12315 | BCAL0243 | 30S ribosomal protein S17 | Structural constituent of ribosome | 0 | 64 | New |
| WQ49_RS12365 | BCAL0233 | RpsJ | Structural constituent of ribosome | 0 | 25 | New |
| WQ49_RS16145 | BCAM1066 | Hypothetical protein | Unknown | 0 | 425 | Domain (DUF2213) |
| WQ49_RS18705 | BCAM0549 | Molecular chaperone GroES | Chaperone | 0 | 21 | Domain (Cpn10) |
| WQ49_RS22170 | BCAM2699 | alpha/beta hydrolase | Putative hydrolase | 120 | 0 | Domain (Abhydrolase_3) |
| WQ49_RS23945 | BCAL0558 | Cca | 3′-Cytidine-cytidine-tRNA adenylyltransferase | 0 | 79 | Domain (PolyA Polymerase)/Domain (Binding) |
| WQ49_RS24070 | BCAL0585 | Hypothetical protein | Unknown | 0 | 23 | New |
| WQ49_RS25525 | BCAL0878 | FmdB family transcriptional regulator | Regulatory activity | 0 | 30 | Domain (CxxC_CXXC_SSSS) |
| WQ49_RS25680 | BCAL0909 | 16S rRNA maturation RNase YbeY | Endoribonuclease activity | 68 | 0 | Domain (UPF0054) |
| WQ49_RS26625 | BCAL2715 | RpmG | Structural constituent of ribosome | 0 | 31 | Domain (Ribosomal_L33) |
| WQ49_RS27920 | BCAL2334 | NADH-quinone oxidoreductase subunit K | NADH dehydrogenase | 0 | 21 | Domain (Oxidored_q2) |
| WQ49_RS28635 | BCAL2199 | Fe–S cluster assembly transcriptional regulator IscR | DNA-binding transcription factor | 39 | 0 | Domain (Rrf2) |
| WQ49_RS29230 | BCAL2091 | 30S ribosomal protein S2 | Structural constituent of ribosome | 0 | 86 | Domain (Ribosomal_S2) |
| WQ49_RS30770 | BCAL1788 | Biopolymer transporter ExbD | Transmembrane transporter | 0 | 47 | Domain (ExbD) |
| WQ49_RS31735 | NA | Hypothetical protein | Unknown | 0 | 42 | New |
| WQ49_RS31805 | BCAL1585 | Transcriptional regulator | DNA binding | 44 | 0 | New |
| WQ49_RS32210 | BCAL1506 | NusA | DNA-binding transcription factor | 0 | 93 | Domain (NusA_N) |
| WQ49_RS32225 | BCAL1503 | SMC-Scp complex | Cell division/chromosome separation | 0 | 94 | Domain (SMC) |
| WQ49_RS32625 | BCAL1424 | ABC transporter | ATPase | 63 | 0 | New |
| WQ49_RS34660 | BCAL0990 | 50S ribosomal protein L32 | Structural constituent of ribosome | 27 | 0 | New |
| WQ49_RS34895 | BCAL2925 | 50S ribosomal protein L19 | Structural constituent of ribosome | 0 | 26 | Domain (Ribosomal_L19) |
| WQ49_RS35060 | BCAL2958 | Membrane protein | Porin activity | 43 | 0 | Domain (OmpA) |
| WQ49_RS03390 | BCAM1545 | LuxR family transcriptional regulator | DNA binding | 251 | 0 | Domain (HTH luxR-type) |
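To read the table programmatically, the insertion-free half of each gene (the half that holds the putative essential domain) can be derived from the two read-count columns. The helper below is a minimal sketch; the function name `insertion_free_half` is hypothetical, and it assumes the pipe-delimited row layout shown above, with the 5′ and 3′ read counts in the fifth and sixth columns.

```python
from typing import Tuple


def insertion_free_half(row: str) -> Tuple[str, str]:
    """Return (locus tag, half without reads) for one pipe-delimited table row."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    locus_tag = cells[0]
    reads_5p, reads_3p = int(cells[4]), int(cells[5])
    if reads_5p == 0 and reads_3p > 0:
        return locus_tag, "5' half"
    if reads_3p == 0 and reads_5p > 0:
        return locus_tag, "3' half"
    return locus_tag, "undetermined"


rows = [
    "| WQ49_RS01035 | BCAL3270 | DnaK | Chaperone | 0 | 227 | N-terminal Domain |",
    "| WQ49_RS00885 | BCAL3305 | Preprotein translocase subunit YajC | Secretase/insertase | 21 | 0 | New |",
]
for row in rows:
    print(insertion_free_half(row))
```

For DnaK, the reads fall in the 3′ half, so the 5′ half is reported as insertion-free, which agrees with the N-terminal domain listed in the last column.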
Figure 4. Phylogenetic trees with taxonomic information of DUF2213 (PF09979) and DUF4148 (PF13663) and domain architectures of proteins containing these domains. (a–b) Phylogenetic trees of DUF2213 (a) and DUF4148 (b) across species, with taxonomic annotations. DUF2213 is widely distributed within bacterial, archaeal, phage and eukaryotic species, whereas DUF4148 is mostly distributed in bacteria (primarily in Proteobacterial species). Trees shown here are the majority-rule consensus trees. Taxonomic annotations were labelled based on the NCBI taxonomy database. Representative bacterial, archaeal, phage and eukaryotic species are highlighted in lilac, yellow, grey and green, respectively. The orange circles on the branches represent the bootstrap values. (c–d) Domain architectures of proteins containing DUF2213 (c) and DUF4148 (d) across species. Numbers on top of the domains in (c) and (d) represent amino acid sequence positions.