Stefan W Grötzinger1, Intikhab Alam2, Wail Ba Alawi2, Vladimir B Bajic2, Ulrich Stingl3, Jörg Eppinger1.
Abstract
Reliable functional annotation of genomic data is the key step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and the poor homology of novel extremophiles' genomes pose significant challenges for attributing functions to the identified coding sequences. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptations. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO). Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO sequences consisting of 58 SAGs from six different taxa of bacteria and archaea was selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable if at least two relevant descriptors (GO-terms and/or consensus patterns) are present.
Scripts for annotation, as well as for the PPM algorithm, are available through the INDIGO website.
Keywords: GO-terms; PROSITE IDs; bioinformatics; extremophile; functional genomics; halophiles; protein sequence consensus patterns; single amplified genomes
Year: 2014 PMID: 24778629 PMCID: PMC3985023 DOI: 10.3389/fmicb.2014.00134
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
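The reliability rule stated in the abstract (a hit counts as reliable only if at least two relevant descriptors, GO-terms and/or PROSITE IDs, are present) can be sketched as follows. This is a minimal illustration, not the published PPM Processor script: the `GeneAnnotation` class, the function names, and the gene identifiers are hypothetical, and the stringent reliability filtering that precedes this step is not modeled. The descriptor IDs in the example are taken from the hit table of this record.

```python
from dataclasses import dataclass, field


@dataclass
class GeneAnnotation:
    """Hypothetical container for the descriptors annotated on one gene."""
    gene_id: str
    go_terms: set = field(default_factory=set)
    prosite_ids: set = field(default_factory=set)


def relevant_descriptors(gene, relevant_go, relevant_prosite):
    """Return the gene's descriptors that are on the lists of interest."""
    return (gene.go_terms & relevant_go) | (gene.prosite_ids & relevant_prosite)


def reliable_hits(genes, relevant_go, relevant_prosite, min_descriptors=2):
    """Keep genes carrying at least `min_descriptors` relevant descriptors."""
    hits = {}
    for gene in genes:
        found = relevant_descriptors(gene, relevant_go, relevant_prosite)
        if len(found) >= min_descriptors:
            hits[gene.gene_id] = sorted(found)
    return hits


# Illustrative input: gene IDs are made up for this sketch.
relevant_go = {"GO:0008839", "GO:0004665", "GO:0008977"}
relevant_prosite = {"PS01298", "PS00136", "PS00137", "PS00138"}
genes = [
    GeneAnnotation("gene_a", {"GO:0008839"}, {"PS01298"}),      # profile + pattern
    GeneAnnotation("gene_b", set(), {"PS00136", "PS00137"}),    # two patterns
    GeneAnnotation("gene_c", {"GO:0008839"}, set()),            # one descriptor only
]
result = reliable_hits(genes, relevant_go, relevant_prosite)
```

In this sketch `gene_c` is dropped because a single descriptor is treated as insufficient evidence, which mirrors the two-descriptor threshold described above.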
Two example and summary (.
| SAR86 clade | 25.0; 4.0 | 86 | 312 | 0.74 | 83 | 6.9 | 853 |
| MSBL1 | 54.0; 15.2 | 849 | 200 | 2.20 | 15 | 0.7 | 3293 |
Figure 1. Workflow of data integration into the INDIGO warehouse starting from assembled contig sequences.
List of proteins of interest (POIs), which were selected for this study.
| No. | Protein of interest | Reaction or function | Potential application |
| 1 | Alcohol DH | Interconversion of aldehydes/ketones and alcohols | Biocatalytic synthesis of chiral intermediates |
| 2 | Formate DH | Conversion of CO2 into formate | Biological carbon capture |
| 3 | Formaldehyde DH | Interconversion of formaldehyde and formate | Biological carbon capture, methanol conversion |
| 4 | Carbon monoxide DH | Interconversion of CO and CO2 | Biological carbon capture, metalloenzyme structures |
| 5 | Ene reductase | Stereoselective reduction of alkenes | Biocatalytic synthesis of chiral intermediates |
| 6 | Protease | Hydrolysis of peptide bonds | Detergents, food, and leather processing |
| 7 | Terpene synthase | Synthesis of basic, (multi-)cyclic terpene structures | Biocatalytic synthesis of complex intermediates |
| 8 | Nitrogenase | Fixation of nitrogen from air | Metalloenzyme structure and function |
| 9 | Lipase | Hydrolysis of triglyceride esters | Detergents, biodiesel synthesis |
| 10 | Carbonic anhydrase | Interconversion of CO2 and bicarbonate | Biological carbon capture, metalloenzyme structures |
| 11 | Acetylene hydratase | Synthesis of aldehydes from acetylene | Biocatalytic synthesis intermediates |
| 12 | Acetyl-CoA synthetase | Activation of acetate for further conversion | Biological carbon capture metabolism |
| 13 | pylRS | Aminoacyl tRNA synthetase, acting on pyrrolysine | Synthetic biology, expanding the genetic code |
| 14 | pyltRNA | tRNA coding for pyrrolysine | Synthetic biology, expanding the genetic code |
| 15 | Aquaporin | Integral membrane proteins controlling osmotic pressure | Water desalination membranes |
Figure 2. Flowchart illustrating the PPM (profile and pattern matching) algorithm, starting from an E.C. number-based list of proteins of interest (POIs) and a selected database subset, which may also be uploaded externally. Numbers refer to the example published here. Square brackets indicate the number of genes at each step of the example analysis; specific restriction filters are given in parentheses. The complete PPM algorithm, including the scripts AutoTECNo and PPM Processor, is available at the INDIGO webpage.
Bacterial (.
| 1 | 1.2 | 401 | 1658 | 9.4 | 47 | Atlantis II | |
| 13 | MSBL1 | 8.8 | 4262 | 14,159 | 16.8 | 63 | |
| 10 | MBGE | 9.5 | 4809 | 14,809 | |||
| 3 | MSBL1 | 2.8 | 1019 | 4114 | 15.2 | 54 | |
| 3 | SA2 cluster | 0.9 | 386 | 1367 | |||
| 1 | 0.4 | 176 | 598 | ||||
| 2 | MSBL1 | 1.2 | 321 | 1716 | 14 | 32 | Discovery |
| 17 | MSBL1 | 17.9 | 7462 | 27,110 | 26.2 | 44.8 | |
| 5 | MSBL1 | 3.2 | 1647 | 4989 | 26 | 23.4 | Kebrit |
| 3 | 2.3 | 1036 | 3168 | ||||
| 58 | 6 taxon. groups | 48.2 | 21,519 | 73,688 | 6 habitats | 3 pools |
Stepwise overview of the conversion of the 15 selected enzyme groups into non-redundant GO terms and PROSITE IDs.
| 1 | Alcohol DH | 101 | 32 | 25 | 20 | 20 | 0 | 12 | 8 | 2–6 |
| 2 | Formate DH | 29 | 6 | 6 | 4 | 4 | 1 | 7 | 6 | 6 |
| 3 | Formaldehyde DH | 23 | 9 | 4 | 3 | 2 | 0 | 0 | 0 | 0 |
| 4 | Carbon monoxide DH | 19 | 4 | 4 | 4 | 4 | 1 | 1 | 0 | 0 |
| 5 | Ene reductase | 1162 | 107 | 65 | 61 | 61 | 4 | 9 | 1 | 1–5 |
| 6 | Protease | 741 | 217 | 111 | 39 | 39 | 0 | 45 | 20 | 8 |
| 7 | Terpene synthase | 35 | 23 | 17 | 9 | 9 | 0 | 0 | 0 | 0 |
| 8 | Nitrogenase | 18 | 4 | 2 | 2 | 2 | 0 | 4 | 4 | 4 |
| 9 | Lipase | 380 | 26 | 25 | 24 | 24 | 0 | 9 | 6 | 0 |
| 10 | Carbonic anhydrase | 58 | 1 | 1 | 1 | 1 | 0 | 3 | 3 | 0 |
| 11 | Acetylene hydratase | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 12 | Acetyl-CoA synthetase | 8 | 3 | 3 | 3 | 2 | 0 | 1 | 0 | 0 |
| 13 | Aquaporin | 0 | 0 | 0 | 2 | 2 | 0 | 1 | 1 | 0 |
| Total | 2576 | 433 | 264 | 173 | 171 | 6 | 92 | 49 | 25 | |
T, Total in class; NR, non-redundant ones; S, selected for this study.
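The reduction from total (T) to non-redundant (NR) descriptors summarized in the table above can be illustrated with a short sketch. It assumes that "non-redundant" means a GO-term is counted only once, both within an enzyme group and across groups processed earlier; the source does not spell out this rule. The function name, the E.C. placeholders (`ec-1`, ...) and the GO placeholders (`GO:A`, ...) are hypothetical and are not real identifiers.

```python
def summarize_go_terms(groups, ec_to_go):
    """Count total and newly seen (non-redundant) GO-terms per enzyme group.

    groups   : dict mapping group name -> list of E.C. identifiers
    ec_to_go : dict mapping E.C. identifier -> list of GO-term identifiers
    """
    seen = set()      # GO-terms already attributed to an earlier group
    summary = {}
    for group, ec_numbers in groups.items():
        # All GO-terms linked to the group's E.C. numbers, duplicates included.
        terms = [go for ec in ec_numbers for go in ec_to_go.get(ec, [])]
        new_terms = set(terms) - seen
        seen |= new_terms
        summary[group] = {
            "ec_numbers": len(ec_numbers),
            "total": len(terms),
            "non_redundant": len(new_terms),
        }
    return summary, seen


# Placeholder data for illustration only.
groups = {"Alcohol DH": ["ec-1", "ec-2"], "Formate DH": ["ec-3"]}
ec_to_go = {
    "ec-1": ["GO:A", "GO:B"],
    "ec-2": ["GO:B"],
    "ec-3": ["GO:B", "GO:C"],
}
summary, all_terms = summarize_go_terms(groups, ec_to_go)
```

With this placeholder input, `GO:B` is counted once for the first group and not again for the second, which is the kind of collapse that reduces a long E.C. list to a much shorter descriptor list.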
Enzyme hits identified using the PPM algorithm and the respective PPM descriptors.
| Profile | GO:0008839 | 3 | – | |||
| GO:0009326 | 25 | – | ||||
| GO:0018492 | 17 | – | ||||
| GO:0043115 | 31 | – | ||||
| GO:0004665 | GO:0008977 | 16 | Pro 1 | Prephenate DH [1.3.1.13] | ||
| Pattern | PS00059 | 15 | – | |||
| PS00061 | 3 | – | ||||
| PS00136 | 3 | – | ||||
| PS00137 | 1 | – | ||||
| PS00138 | 2 | – | ||||
| PS00141 | 11 | – | ||||
| PS00501 | 6 | – | ||||
| PS00060 | PS00913 | 4 | Pat 1 | Fe—ADH [1.1.1.1] | ||
| PS00062 | PS00063 | PS00798 | 19 | Pat 2 | dkgA [1.1.1.274] | |
| PS00065 | PS00670 | PS00671 | 27 | Pat 3 | Glyoxylate red. [1.1.1.26] | |
| PS00381 | PS00382 | 2 | Pat 4 | Clp protease [3.4.21.92] | ||
| PS00490 | PS00551 | PS00932 | 11 | Pat 5 | Molybdopt. OR [e.g., 1.2.2.1] | |
| PS00090 | PS00699 | 10 | Pat 6 | Nitrogenase [1.18.6.1] | ||
| PS00692 | PS00746 | 7 | Pat 7 | |||
| PS00136 | PS00137 | 2 | Pat 8 | Subtilisin [3.4.21.*] | ||
| PS00136 | PS00138 | 4 | Pat 9 | |||
| PS00137 | PS00138 | 1 | Pat 10 | |||
| Profile and Pattern | GO:0008839 | PS01298 | 14 | PP 1 | DHPR [1.3.1.26] | |
Figure 3. Phylogenetic tree of the DHPR hits. To identify isoenzyme classes, every PPM set of hits was clustered into phylogenetic groups. For the 14 dihydrodipicolinate reductases (DHPRs), four closely related phylogenetic clusters were found. Scale bar: 0.1 amino acid substitutions per site.
Alignment of functionally important residues of the γ-CA chain A from .
Hits identified as reliable using the PPM algorithm.
| AF_D | ADH [MSBL1] | Pat 1 | ADH, iron-containing (60%) [ |
| HD_K | 2-Hydroxyacid DH [USMBL6] | Pat 3 | NAD-binding 2-hydroxyacid DH (63%) [ |
| HP_D | Halolysin [MSBL1] | Pat 8-10 | Peptidase S8 (53%) [ |
| SP_A | Subtilisin [MBGE] | Peptidase S8 (50%) [ | |
| PD_A | Prephenate DH [MBGE] | Pro 1 | Prephenate dehydrogenase (48%) [ |
| DR_A1 | DHPR [MSBL1] | PP 1 | DHPR (56%) [ |
| DR_A2 | DHPR [MBGE] | DHPR (57%) [ | |
| DR_D | DHPR [MSBL1] | DHPR (54%) [ | |
| DR_K | DHPR [MSBL1] | DHPR (53%) [ | |
| CA_A | γ-CA [MSBL1] | Manual PPM | Ferripyochelin binding protein (fbp, 43%) [ |
| CA_D | γ-CA [MSBL1] | Hypothetical protein (53%) [ |
Last letter of gene name indicates habitat: D, Discovery; A, Atlantis II; K, Kebrit.
Reliability of consensus patterns found in chosen hits as well as for carbonic anhydrases.
| PS00059 | Zinc-containing AD signature | 491 | 97.4 | 2.6 | 40 | 8.1 |
| PS00061 | Short-chain DHR family signature | 720 | 82.5 | 17.5 | 192 | 26.7 |
| PS00065 | NAD-binding 2-hydroxyacid DH signature | 235 | 77.4 | 22.6 | 210 | 89.4 |
| PS00136 | Subtilase family, aspartic acid active site | 328 | 45.1 | 54.9 | 90 | 27.4 |
| PS00137 | Subtilase family, histidine active site | 200 | 92.5 | 7.5 | 56 | 28.0 |
| PS00138 | Subtilase family, serine active site | 261 | 88.1 | 11.9 | 29 | 11.1 |
| PS00671 | NAD-binding 2-hydroxyacid DH signature 3 | 319 | 99.7 | 0.3 | 66 | 20.7 |
| PS00913 | Iron-containing ADH signature 1 | 42 | 81.0 | 19.0 | 26 | 61.9 |
| PS01298 | DHPR signature | 541 | 100.0 | 0.0 | 19 | 3.5 |
| PS00162 | A-CA signature | 64 | 100.0 | 0.0 | 32 | 50.0 |
| PS00704 | Prokaryotic-type CA signature 1 | 22 | 95.5 | 4.5 | 10 | 45.5 |
| PS00705 | Prokaryotic-type CA signature 2 | 25 | 100.0 | 0.0 | 5 | 20.0 |
TP, True positive; FP, False positive; FN, False negative (Sigrist et al., 2002).
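The consensus patterns behind the PROSITE IDs listed above are written in PROSITE pattern syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends). A pattern filter can be sketched by translating that syntax into a regular expression. This is a simplified illustration and not the scanner used in the study: the function names are hypothetical, the pattern and sequence in the example are toy inputs rather than real PROSITE entries, and rarer constructs such as `>` inside square brackets are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(sequence: str, pattern: str):
    """Return (1-based start, matched segment) for all, also overlapping, matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


toy_pattern = "[ST]-x(2)-{P}-G-D"            # toy pattern, not a PROSITE entry
matches = scan("MKSAAAGDL", toy_pattern)
```

A short pattern of this kind can match by chance in unrelated proteins, which is why the false positive and false negative rates tabulated above matter when a pattern hit is used as evidence for function.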
Figure 4. Analysis of the phylogenetic relationships of some of the genes identified by PPM analysis as highly reliably annotated: PPM profile and pattern hits dihydrodipicolinate reductases DR_A1, DR_A2, DR_D, and DR_K (A), PPM profile hit prephenate dehydrogenase PD_A (B), PPM pattern hit subtilisin SD_A (C), and manual PPM hits γ-carbonic anhydrases CA_A and CA_D (D). Scale bars: 0.1 (A–C) or 0.2 (D) amino acid substitutions per site.