Kira S Makarova, Yuri I Wolf, Eugene V Koonin.
Abstract
A substantial fraction of archaeal genes, from ∼30% to as much as 80%, encode 'hypothetical' proteins or genomic 'dark matter'. Archaeal genomes typically contain a higher fraction of dark matter than bacterial genomes, primarily because isolation and cultivation of most archaea in the laboratory, and accordingly experimental characterization of archaeal genes, are difficult. In the present study, we report quantitative characteristics of the archaeal genomic dark matter and discuss comparative genomic approaches for functional prediction for 'hypothetical' proteins. We propose a list of top priority candidates for experimental characterization, including those with a broad distribution among archaea and those that are characteristic of poorly studied major archaeal groups such as Thaumarchaea, DPANN (Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanoarchaeota and Nanohaloarchaeota) and Asgard.
Keywords: archaeal proteins; computational biology; functional genomics
Year: 2019 PMID: 30710061 PMCID: PMC6393860 DOI: 10.1042/BST20180560
Source DB: PubMed Journal: Biochem Soc Trans ISSN: 0300-5127 Impact factor: 5.407
Figure 1. Dark matter in archaeal genomes.
Amino acid sequences of proteins encoded in 524 (nearly) completely sequenced archaeal genomes were, when possible, assigned to 13 443 arCOGs [16], and the rest were clustered together. The combination of arCOGs and clusters is referred to as ‘gene families’ here and elsewhere in the text. (A) The relative frequencies of ‘dark’ (no functional annotation), ‘gray’ (general functional prediction only), and ‘bright’ (functionally annotated) matter among archaeal gene families (arCOGs and clusters) and individual genes. (B) Distribution of the number of genomes represented in the ‘dark’, ‘gray’, and ‘bright matter’ gene families. The plot shows the Gaussian kernel smoothed probability density functions in log scale; the number of genomes ranges from 1 (ORFan gene) to 524 (strictly ubiquitous family). (C) The fraction of ‘dark matter’ genes in 524 archaeal genomes. (D) Distribution of the sequence lengths among the ‘dark’, ‘gray’, and ‘bright matter’ protein families. The plot shows the Gaussian kernel smoothed probability density functions, calculated for the family consensus sequences. (E) Distribution of the island lengths (lengths of contiguous blocks of genes) for the ‘dark matter’ genes and for a randomly selected gene set of the same size (285 155 genes).
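The island-length comparison in panel (E) can be illustrated with a minimal sketch. It assumes a single linear gene order represented as a list of booleans (True marks a 'dark matter' gene); the function names `island_lengths` and `random_control` and the toy data are hypothetical, and the procedure the authors actually used (for example, how contigs and genome boundaries were handled) is not described here.

```python
import random
from itertools import groupby


def island_lengths(is_dark):
    """Return the lengths of contiguous runs of True in a gene-order list."""
    return [sum(1 for _ in run) for flag, run in groupby(is_dark) if flag]


def random_control(is_dark, seed=0):
    """Mark the same number of genes at randomly chosen positions.

    This is an assumed null model: positions are sampled uniformly
    without replacement from the same gene order.
    """
    rng = random.Random(seed)
    n = len(is_dark)
    chosen = set(rng.sample(range(n), sum(is_dark)))
    return [i in chosen for i in range(n)]


# Toy gene order (hypothetical data): True marks a 'dark matter' gene.
genes = [True, True, False, True, True, True, False, False, True, False]

observed = island_lengths(genes)                 # runs of lengths 2, 3 and 1
control = island_lengths(random_control(genes))  # same gene count, random placement
```

Comparing the distribution of `observed` with that of `control` over many random seeds would indicate whether marked genes cluster more than expected by chance; with real data, each genome partition would need to be processed separately.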
Uncharacterized proteins, top priority candidates for experimental study
| arCOG or cluster | Representative locus tag | Number of genomes | Comments |
|---|---|---|---|
| All archaea (524 genomes) | | | |
| arCOG01159 | TK2157 | 492 | Coiled-coil protein; linked to arCOG01158, phosphoserine phosphatase SerB |
| | TK1195 | 463 | DUF357 family; tightly linked to arCOG02119 (DUF555 family) and cytidylyltransferase TagD; PDB: 2OO2 |
| | TK1697 | 454 | DUF359 family; predicted to be involved in CoA biosynthesis |
| | TK1296 | 441 | DUF424 family; linked to translational genes; PDB: 2QYA |
| arCOG01336 | TK0174 | 429 | AMMECR1 family; linked to arCOG04290, PIN- and Zn ribbon domains; PDB: 1VAJ |
| | TK2293 | 368 | General house-keeping gene context |
| | TK2131 | 336 | Linked to arCOG00578, uncharacterized Zn finger-containing protein; PDB: 2QZG |
| arCOG01917 | TK0022 | 392 | Zn ribbon domain-containing protein |
| | TK0743 | 343 | Membrane protein implicated in membrane remodeling or vesicle formation |
| arCOG04373 | TK0173 | 313 | YqgV/DUF77 family; possible thiamine binding protein; PDB: 1LXN |
| arCOG02884 | HVO_2173 | 293 | Membrane protein with extracellular Ig-like domain, predicted component of a putative secretion system |
| arCOG04140 | TK1882 | 252 | PDB: 2X3D |
| arCOG01907 | TK0182 | 245 | AIM24 family; PDB: 1PG6 |
| Asgard (8 genomes) | | | |
| cls.008013 | Lokiarch_14920 | 7 [0] | Related to Villin-1/gelsolin, predicted actin-binding protein; PDB: 3FG7 |
| cls.011087 | Lokiarch_54080 | 7 [0] | Membrane protein |
| DPANN (67 genomes) | | | |
| cls.004306 | NEQ255 | 43 [0] | Secreted protein, often encoded next to arCOG02487, a predicted component of secretion system and S-layer-like proteins |
| cls.004259 | NEQ050 | 36 [2] | MNT fused to HEPN, usually components of toxin–antitoxin systems, but typically in house-keeping context in DPANN |
| cls.004340 | NEQ484 | 35 [0] | Alpha helical protein, typically in house-keeping context |
| cls.004634 | CMH64_01370 | 34 [0] | Distantly related to YPEB or double PepSY-like domain-containing protein, an inhibitor of protease activity; PDB: 3NQZ |
| Thaumarchaeota (30 genomes) | | | |
| arCOG08720 | Nmar_1229 | 29 [0] | Metal-binding protein, DUF2024 family |
| | Nmar_1451 | 29 [0] | RHH C-terminal domain, possibly DNA-binding protein |
| | Nmar_1679 | 29 [0] | Membrane protein |
| | Nmar_1445 | 29 [0] | Zn-binding protein |
| | Nmar_1788 | 29 [0] | Membrane protein; likely co-transcribed with DNA replication initiation complex subunit, GINS15 family (arCOG00551) |
| | Nmar_0539 | 29 [0] | Membrane protein; likely co-transcribed with galactose-1-phosphate uridylyltransferase (arCOG00422) |
| | Nmar_1506 | 29 [0] | Membrane protein |
| | Nmar_0502 | 29 [0] | |
| | Nmar_0508 | 29 [0] | |
| | Nmar_1190 | 29 [0] | |
| | Nmar_0643 | 29 [0] | |
| | Nmar_0717 | 29 [0] | Likely co-transcribed with membrane associated Zn finger protein (arCOG08750) |
| | Nmar_0528 | 29 [0] | |
| | Nmar_1042 | 29 [0] | |
| | Nmar_0410 | 29 [0] | |
| | Nmar_0373 | 29 [0] | |
Notes: archaea-specific arCOGs are underlined; clusters (cls) are new groups of orthologs that could not be assigned to the previous version of arCOGs.
Numbers in square brackets give the number of archaeal genomes outside of the respective lineage in which the gene is present. Detailed information about these families is available at ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/archDark2018/.
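For readers who want to work with the table programmatically, the genome counts of the form N [M] can be split into their two parts. The sketch below is illustrative only: the helper names `parse_genome_count` and `lineage_coverage` are hypothetical, and it assumes that the first number counts genomes within the lineage named in the section header, which the table does not state explicitly.

```python
import re


def parse_genome_count(cell):
    """Split a cell such as '36[2]' or '29 [0]' into (count, outside_lineage).

    Cells without a bracketed part (e.g. '492') yield None for the second value.
    """
    m = re.fullmatch(r"\s*(\d+)\s*(?:\[(\d+)\])?\s*", cell)
    if m is None:
        raise ValueError(f"unrecognized genome count: {cell!r}")
    count = int(m.group(1))
    outside = int(m.group(2)) if m.group(2) is not None else None
    return count, outside


def lineage_coverage(cell, lineage_size):
    """Fraction of the lineage's genomes that carry the gene.

    Assumption: the first number refers to genomes inside the lineage.
    """
    count, _ = parse_genome_count(cell)
    return count / lineage_size


# Example using values from the table: a DPANN cluster present in 43 of 67 genomes.
dpann_fraction = lineage_coverage("43[0]", 67)
```

Under the stated assumption, the Thaumarchaeota families listed with 29 [0] would each be present in 29 of 30 genomes of that lineage and in no archaeal genome outside it.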
Figure 2. Genomic islands enriched in ‘dark matter’ genes.
Genes are shown by block arrows with the length roughly proportional to the size of the corresponding gene. For each gene, the arCOG number (bold) or new protein cluster number (gray) is indicated underneath the respective arrow. These numbers correspond to the assignments available on the ftp site (ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/archDark2018/). Signal peptides are indicated by blue triangles. For each genomic island, the organism name, genome partition accession number and co-ordinates of the locus are indicated on the right. Brief annotations of the proteins are shown above the arrows. Abbreviations and additional information for some genes: TerS and TerL, terminase small and large subunit, respectively; TadC and TadB, Tad secretion system, secretion accessory proteins C and B; antitoxins: HTH (helix-turn-helix) protein, RHH (ribbon–helix–helix) protein, AbrB, MNT (minimal nucleotidyltransferase); toxins: ribonucleases HEPN, PIN, RelE, Txe, PemK, HicA, ribosome interacting toxin Doc; PD-DExK, restriction family endonuclease.