P K Busk, B Pilgaard, M J Lezyk, A S Meyer, L Lange.
Abstract
BACKGROUND: Carbohydrate-active enzymes are found in all organisms and participate in key biological processes. These enzymes are classified into 274 families in the CAZy database, but the sequence diversity within each family makes it a major task to identify new family members and to provide a basis for predicting enzyme function. A fast and reliable method for de novo annotation of genes encoding carbohydrate-active enzymes is to identify conserved peptides in the curated enzyme families and then match these conserved peptides to the sequence of interest, as demonstrated for the glycosyl hydrolase and the lytic polysaccharide monooxygenase families. This approach not only assigns the enzymes to families but also provides functional prediction of the enzymes with high accuracy.
Keywords: Annotation; Carbohydrate-active enzymes; Genomics; Software
Year: 2017 PMID: 28403817 PMCID: PMC5389127 DOI: 10.1186/s12859-017-1625-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Steps in development and use of Hotpep for carbohydrate-active enzymes
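The matching step described in the abstract, in which conserved peptides from a curated family are searched for in a sequence of interest, can be illustrated with a minimal Python sketch. The peptides, their frequencies, the query sequence, and the function name `match_conserved_peptides` are hypothetical and chosen only for illustration; exact substring matching is also an assumption, and Hotpep's actual peptide lists, peptide lengths, and hit thresholds are not given here. The two values returned correspond to the "sum of the frequency of the conserved peptides" and "number of conserved peptides" columns described for the Hotpep output.

```python
from typing import Dict, List, Tuple


def match_conserved_peptides(
    sequence: str, peptides: Dict[str, float]
) -> Tuple[float, List[str]]:
    """Find which conserved peptides occur in a protein sequence.

    `peptides` maps each conserved peptide to its frequency in the family.
    Returns the summed frequency of the matched peptides and the list of
    matched peptides (in the order they appear in `peptides`).
    Exact substring matching is an assumption of this sketch.
    """
    hits = [pep for pep in peptides if pep in sequence]
    score = sum(peptides[pep] for pep in hits)
    return score, hits


# Hypothetical conserved peptides and frequencies for one family.
family_peptides = {"GFDGVM": 0.5, "WSDEFN": 0.25, "KHYPGD": 0.125}

# Hypothetical query protein sequence.
query = "MSTAGFDGVMLLPRWSDEFNAAK"

score, hits = match_conserved_peptides(query, family_peptides)
print(f"matched peptides: {hits}")
print(f"number of conserved peptides: {len(hits)}")
print(f"sum of frequencies: {score}")
```

A real annotation run would repeat this for every family and keep only sequences that pass some hit criterion; that criterion is not specified in this record, so none is implemented here.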
Bacterial strains and accession numbers
| Name | Phylum | Isolated from | Accession numbers |
|---|---|---|---|
| | Bacteroidetes | Gut and stomach | GCA_000463315.1 |
| | Firmicutes | Wood | GCA_000016545.1 |
| | Deinococcus-Thermus | Coastal desert | GCA_000317835.1 |
| | Firmicutes | Freshwater ditch | GCA_000233715.3 |
| | Proteobacteria | Tropical forest soil | GCA_000164865.1 |
| | Ignavibacteriae | Wooden surface of a chute | GCA_000279145.1 |
| | Bacteroidetes | Gut | GCA_000025925.1 |
| | Actinobacteria | Hexachlorocyclohexane-contaminated soil | GCA_000014565.1 |
| | Firmicutes | Soil/manure | GCA_000015865.1 |
| | Proteobacteria | Intracellular in shipworm | GCA_000023025.1 |
| | Firmicutes | Thermophilic anaerobic methanogenic reactor | GCA_000305935.1 |
| | Firmicutes | Soil | GCA_000145615.1 |
Fungal strains (Basidiomycota) and accession numbers
| Name | Order | Life style | Accession numbers |
|---|---|---|---|
| | | Brown rot | GCA_000006255.1 |
| | | Brown rot | GCA_000344655.2 |
| | | Brown rot | GCA_000344685.1 |
| | | Brown rot | GCA_000271625.1 |
| | | Brown rot | GCA_000292625.1 |
| | | Mycoparasite | GCA_000271645.1 |
| | | White rot | GCA_000275845.1 |
| | | White rot | GCA_000271585.1 |
| | | White rot | GCA_000271605.1 |
| | | White rot | GCA_000265015.1 |
| | | White rot | GCA_000264995.1 |
| | | White rot | GCA_000320585.2 |
| | | White rot | GCA_000264905.1 |
| | | White rot | GCA_000300595.1 |
| | | White rot | GCA_000320605.2 |
| | | White rot | GCA_000832265.1 |
Fig. 2 Hotpep user interface. Double-clicking on the Hotpep icon opens a DOS prompt where the name of the sequence directory (e.g., “Fungus fungus”) is entered
Fig. 3 Organization of the Hotpep output. a. The output is delivered in the sequence directory with one directory for each enzyme class in the CAZy database, a file containing a summary of the results and a file with all the families found for each accession number. b. Each of the class directories contains files with the hits for each family, a summary and a directory with functional predictions. c. The folder with functional predictions contains files for each EC number found and a summary
Fig. 4 Hotpep output. An output file with hits for the GH3 family opened in MS Excel. The columns (from left to right) contain the group where the sequence is annotated, the name of the sequence, the sum of the frequency of the conserved peptides, the number of conserved peptides, the protein sequence, the length of the sequence and the sequences of the conserved peptides
Fig. 5 Hotpep output for functional prediction. Same as Fig. 4 with the addition of a column labelled “Functions” with information on the putative functions of the annotated sequence
Annotation of 12 bacterial genomes
| Method | CAZy^a | Hotpep | dbCAN web | dbCAN download |
|---|---|---|---|---|
| Annotated proteins | 1768 | 1839 | 2300 | 1749 |
| True positives | - | 1546 | 1701 | 1571 |
| False positives | - | 296 | 599 | 178 |
| False negatives | - | 220 | 67 | 197 |
| Sensitivity | - | 0.88 | 0.87 | 0.89 |
| Precision | - | 0.84 | 0.71 | 0.90 |
| F1 score | - | 0.86 | 0.84 | 0.89 |
^a www.cazy.org
Annotation of 16 fungal genomes
| Method | JGI/CAZy^a | Hotpep | dbCAN web | dbCAN download |
|---|---|---|---|---|
| Annotated proteins | 3985 | 3534 | 6238 | 4490 |
| True positives | - | 3084 | 3463 | 3057 |
| False positives | - | 450 | 2775 | 1433 |
| False negatives | - | 901 | 522 | 928 |
| Sensitivity | - | 0.77 | 0.87 | 0.77 |
| Precision | - | 0.88 | 0.56 | 0.68 |
| F1 score | - | 0.82 | 0.68 | 0.72 |
^a Hori et al. [4]
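The sensitivity, precision, and F1 rows in the annotation tables can be related to the true-positive, false-positive, and false-negative counts. The sketch below assumes the standard definitions of these metrics (the formulas themselves are not stated in this record) and uses the dbCAN download column of the fungal table as input; the `Counts` class and the function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion counts for one annotation method against the reference."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def sensitivity(c: Counts) -> float:
    # Fraction of reference annotations that the method recovered.
    return c.tp / (c.tp + c.fn)


def precision(c: Counts) -> float:
    # Fraction of the method's annotations that agree with the reference.
    return c.tp / (c.tp + c.fp)


def f1_score(c: Counts) -> float:
    # Harmonic mean of precision and sensitivity, written in terms of counts.
    return 2 * c.tp / (2 * c.tp + c.fp + c.fn)


# Counts taken from the "dbCAN download" column of the 16-fungal-genome table.
fungal_dbcan_download = Counts(tp=3057, fp=1433, fn=928)

print(f"sensitivity = {sensitivity(fungal_dbcan_download):.2f}")  # 0.77
print(f"precision   = {precision(fungal_dbcan_download):.2f}")    # 0.68
print(f"F1 score    = {f1_score(fungal_dbcan_download):.2f}")     # 0.72
```

Note that true positives plus false negatives equals the number of proteins in the reference annotation (3057 + 928 = 3985 for the fungal genomes), and true positives plus false positives equals the number of proteins annotated by the method (3057 + 1433 = 4490).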