Han Zhang, Tanner Yohe, Le Huang, Sarah Entwistle, Peizhi Wu, Zhenglu Yang, Peter K Busk, Ying Xu, Yanbin Yin.
Abstract
Complex carbohydrates of plants are the main food sources of animals and microbes, and serve as promising renewable feedstock for biofuel and biomaterial production. Carbohydrate active enzymes (CAZymes) are the most important enzymes for complex carbohydrate metabolism. With an increasing number of plant and plant-associated microbial genomes and metagenomes being sequenced, there is an urgent need for automatic tools for genomic data mining of CAZymes. We developed the dbCAN web server in 2012 to provide a public service for automated CAZyme annotation for newly sequenced genomes. Here, dbCAN2 (http://cys.bios.niu.edu/dbCAN2) is presented as an updated meta server, which integrates three state-of-the-art tools for CAZome (all CAZymes of a genome) annotation: (i) HMMER search against the dbCAN HMM (hidden Markov model) database; (ii) DIAMOND search against the CAZy pre-annotated CAZyme sequence database and (iii) Hotpep search against the conserved CAZyme short peptide database. Combining the three outputs and removing CAZymes found by only one tool can significantly improve the CAZome annotation accuracy. In addition, dbCAN2 now also accepts nucleotide sequence submission, and offers the service to predict physically linked CAZyme gene clusters (CGCs), which will be a very useful online tool for identifying putative polysaccharide utilization loci (PULs) in microbial genomes or metagenomes.
Year: 2018 PMID: 29771380 PMCID: PMC6031026 DOI: 10.1093/nar/gky418
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. dbCAN is updated every year and now has 575 HMMs. X-axis: year; Y-axis: number of HMMs of families (blue) and subfamilies (red).
Figure 2. Overall design of dbCAN2 meta server. GCPU (gene cluster plot utility) and CGC-Finder (CAZyme gene cluster finder) are two tools developed for dbCAN2.
Comparison of tools for automated CAZyme annotation
| Tools + databases | Accuracy: Bacteria | Accuracy: Eukaryotes | Subfamily | Multi-family proteins | Domain repeats | Domain positions | Speed (c) |
|---|---|---|---|---|---|---|---|
| HMMER+dbCAN | 0.88 | 0.86 | Yes (a) | Yes | Yes | Yes | 69 |
| DIAMOND+CAZy | 0.89 | 0.84 | Yes (a) | No | No | No | 4 |
| Hotpep+PPR | 0.80 | 0.94 | Yes (b) | Yes | No | No | 7 |
| Predicted by ≥2 tools | 0.93 | 0.92 | | | | | |
(a) Twenty-four CAZyme families are classified into 207 subfamilies by phylogenetic clustering and CAZy expert curation (10).
(b) Three hundred and forty-two CAZyme families are classified into 7036 groups by PPR (15,16).
(c) Time is in seconds, measured on the Escherichia coli K-12 MG1655 proteome (4140 proteins).
The detailed calculations of accuracy and speed are available in Supplementary Table S1. No correspondence has been established between PPR groups and CAZy subfamilies; the dbCAN web server reports only the CAZy subfamily annotation, whenever it is available.
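The last row of the table, like the abstract, relies on a simple consensus rule: keep a protein only if at least two of the three tools call it a CAZyme. A minimal Python sketch of that rule follows. It assumes each tool's output has already been parsed into a dictionary of protein ID to family label; the function name `consensus_cazymes`, the input layout, and the toy IDs are illustrative assumptions, not part of dbCAN2 itself.

```python
from collections import defaultdict


def consensus_cazymes(predictions, min_tools=2):
    """Keep proteins called as CAZymes by at least `min_tools` tools.

    predictions: dict mapping tool name -> {protein ID: family label}
                 (hypothetical pre-parsed layout, not a dbCAN2 file format).
    Returns: dict mapping protein ID -> {tool name: family label}.
    """
    support = defaultdict(dict)
    for tool, hits in predictions.items():
        for protein_id, family in hits.items():
            support[protein_id][tool] = family
    # Drop proteins supported by fewer than `min_tools` tools.
    return {pid: calls for pid, calls in support.items() if len(calls) >= min_tools}


# Toy input: protein IDs and family assignments are made up for illustration.
example = {
    "HMMER": {"prot1": "GH5", "prot2": "GT2"},
    "DIAMOND": {"prot1": "GH5", "prot3": "CE1"},
    "Hotpep": {"prot1": "GH5", "prot2": "GT2", "prot4": "AA9"},
}
kept = consensus_cazymes(example)
print(sorted(kept))  # ['prot1', 'prot2']
```

Here `prot3` and `prot4` are dropped because only one tool reports each of them. The sketch counts agreement on CAZyme status only; it does not check whether the tools agree on the family label.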
Figure 3. Comparison of annotation results for multi-domain CAZymes using three different tools. (A) Two example proteins (AT1G11720.1 and YP_002573728.1) are illustrated with their CAZyme domain architecture based on dbCAN search. (B) DIAMOND search result for the two proteins showing the best CAZy protein hit; (C) HMMER search result against dbCAN HMM database, from which (A) is derived; (D) Hotpep search result against PPR library; Frequency means the sum of conserved peptide frequencies and Hits means the number of conserved peptide hits (15).
Figure 4. Screenshots of dbCAN2 result pages. (A) Venn diagram to show overlaps among the results of the three tools; (B) CGC-Finder result tab; (C) Overview tab combining results from the three tools and SignalP; (D) genomic location plot of an example CGC (signature genes are in red, green and blue colors, while non-signature genes are in gray); (E) detailed information of an example CGC.
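The idea of a physically linked gene cluster, with signature genes separated by a few non-signature genes as in panel (D), can be illustrated with a short Python sketch. This is not the CGC-Finder algorithm: the text here does not give its criteria, so the function name `find_gene_clusters`, the parameters `max_gap` and `min_signature`, and their default values are all hypothetical choices made for illustration.

```python
def find_gene_clusters(genes, max_gap=2, min_signature=2):
    """Group signature genes that lie close together on one contig.

    genes: list of (gene ID, is_signature) tuples in genomic order.
    max_gap: most non-signature genes allowed between two neighbouring
             signature genes of the same cluster (hypothetical default).
    min_signature: fewest signature genes a cluster must contain
                   (hypothetical default).
    Returns a list of clusters; each is the list of gene IDs from the
    first to the last signature gene, intervening genes included.
    """
    signature_idx = [i for i, (_, is_sig) in enumerate(genes) if is_sig]
    clusters = []
    run = []  # indices of signature genes in the cluster being built

    def flush():
        if len(run) >= min_signature:
            clusters.append([gene_id for gene_id, _ in genes[run[0]:run[-1] + 1]])

    for i in signature_idx:
        # Start a new cluster when too many non-signature genes intervene.
        if run and i - run[-1] - 1 > max_gap:
            flush()
            run = []
        run.append(i)
    flush()
    return clusters


# Toy contig: gene IDs and signature flags are made up for illustration.
contig = [
    ("g1", True), ("g2", False), ("g3", True),
    ("g4", False), ("g5", False), ("g6", False),
    ("g7", True), ("g8", True),
]
print(find_gene_clusters(contig))  # [['g1', 'g2', 'g3'], ['g7', 'g8']]
```

With `max_gap=2`, the three non-signature genes between `g3` and `g7` split the contig into two clusters. A real implementation would also need to distinguish the kinds of signature gene and handle contig boundaries, which this sketch leaves out.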