Yan Wang, Hang Zhang, Haolin Zhong, Zhidong Xue.
Abstract
Protein domains are the basic units of proteins that can fold, function, and evolve independently. Knowledge of protein domains is critical for classifying proteins, understanding their biological functions, annotating their evolutionary mechanisms, and designing proteins. Thus, over the past two decades, a number of protein domain identification approaches have been developed, and a variety of protein domain databases have been constructed. This review divides protein domain prediction methods into two categories, namely sequence-based and structure-based. These methods are introduced in detail, and their advantages and limitations are compared. Furthermore, this review provides a comprehensive overview of popular online protein domain sequence and structure databases. Finally, we discuss potential improvements to these prediction methods.
Year: 2021 PMID: 33680357 PMCID: PMC7895673 DOI: 10.1016/j.csbj.2021.01.041
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Schematic diagram of the structure and domains of an archaeal intein-encoded homing endonuclease PI-PFUI.
Sequence-based protein domain identification methods.
| Category | Method | Description | Year | URL | Reference |
|---|---|---|---|---|---|
| Homology-based | CHOP | Search target sequences against PDB, Pfam-A, and SWISS-PROT to find templates. | 2004 | | |
| | DomPred | Combine homology and secondary structure element alignment to find templates. | 2005 | | |
| | SSEP-Domain | Based on secondary structure element alignment and profile-profile alignment. | 2006 | | |
| | ThreaDom | Deduce domain boundary locations based on multiple threading alignments. | 2013 | | |
| | CLADE | Identify domains with a multi-source strategy that combines multiple profile HMMs. | 2016 | | |
| | MetaCLADE | A multi-source domain annotation tool for metagenomic datasets. | 2018 | | |
| Ab initio methods | Domain Guess by Size | Detect domain boundaries based on the distributions of chain and domain lengths. | 2000 | | |
| | CHOPnet | Feed-forward neural network that uses amino acid composition, secondary structure, and solvent accessibility as features. | 2004 | | |
| | PPRODO | Feed-forward neural network that uses the position-specific scoring matrix (PSSM) generated by PSI-BLAST as features. | 2005 | | |
| | DOMpro | RNN that uses secondary structure and solvent accessibility as features. | 2005 | | |
| | KemaDom | Combine three SVM classifiers that use different features as inputs to predict domain boundaries. | 2006 | | |
| | DomainDiscovery | SVM that uses inter-domain linker index, PSSM, secondary structure, and solvent accessibility as features. | 2006 | | |
| | IGRN | An improved general regression network model trained on PSSM, inter-domain linker index, secondary structure, and solvent accessibility information. | 2008 | | |
| | DomSVR | The sequence is encoded by physicochemical and biological properties; SVR uses the encoded sequence to predict domain boundaries. | 2010 | | |
| | DoBo | SVM that uses evolutionary domain boundary signals embedded in homologous proteins as input features. | 2011 | | |
| | DROP | An SVM that predicts domain linkers using 25 optimal features selected from a set of 3000 features. | 2011 | | |
| | DomHR | Identify domain boundaries in proteins by defining the edge of domain and boundary regions as a hinge region. | 2013 | | |
| | PDP-CON | Combine predicted results from six individual domain boundary prediction methods. | 2016 | | |
| | ConDo | Use long-range, coevolutionary features to train neural networks. | 2018 | | |
| | DNN-Dom | Combine CNN and BGRU to predict domain boundaries from amino acid composition, PSSM, solvent accessibility, and secondary structure. A balanced Random Forest is used to address the sample imbalance problem. | 2019 | | |
| | DeepDom | Use sequence information encoded by physicochemical properties to train a bidirectional LSTM model to predict domain boundaries. | 2019 | | |
| | FuPred | Predict protein domain boundaries using predicted contact maps generated by ANN. | 2020 | | |
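
Several entries in the table above, PDP-CON for example, combine the outputs of multiple boundary predictors. The following is a minimal sketch of one generic way to do this, by tolerance-window voting. It is not the published algorithm of PDP-CON or any other listed tool; the function name `consensus_boundaries` and the parameters `tolerance` and `min_votes` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def consensus_boundaries(
    predictions: Sequence[Sequence[int]],
    tolerance: int = 20,   # assumed window size in residues (illustrative only)
    min_votes: int = 2,    # assumed minimum number of agreeing predictors
) -> List[int]:
    """Merge boundary positions proposed by several predictors.

    Positions lying within `tolerance` residues of the first position in the
    current cluster are grouped. A cluster is kept when at least `min_votes`
    distinct predictors contributed to it, and it is reported as the middle
    element of its sorted positions.
    """
    # Tag every proposed position with the index of its predictor.
    tagged: List[Tuple[int, int]] = sorted(
        (pos, idx)
        for idx, positions in enumerate(predictions)
        for pos in positions
    )

    clusters: List[List[Tuple[int, int]]] = []
    for pos, idx in tagged:
        if clusters and pos - clusters[-1][0][0] <= tolerance:
            clusters[-1].append((pos, idx))
        else:
            clusters.append([(pos, idx)])

    consensus: List[int] = []
    for cluster in clusters:
        voters = {idx for _, idx in cluster}
        if len(voters) >= min_votes:
            positions = sorted(pos for pos, _ in cluster)
            consensus.append(positions[len(positions) // 2])
    return consensus


if __name__ == "__main__":
    # Three hypothetical predictors; only the boundary near residue 100
    # is supported by more than one of them.
    print(consensus_boundaries([[100, 250], [105], [98, 400]]))
```

With the hypothetical inputs shown, only the cluster around residue 100 is retained, because the boundaries at 250 and 400 are each proposed by a single predictor.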
Structure-based protein domain identification methods.
| Category | Method | Description | Year | URL | Reference |
|---|---|---|---|---|---|
| Structure-based | DomainParser | Represent the protein structure as a flow network and identify domains based on the maximum-flow/minimum-cut theorem. | 2000 | | |
| | PDP | Identify the dividing site that makes the contact density of the two parts lower than a threshold as the domain boundary. | 2003 | | |
| | DIAL | Identify the domain by clustering substructures on the basis of their spatial distances. | 2005 | | |
| | CATHEDRAL | Identify the domain by comparing the target structure with structure templates in CATH. | 2007 | | |
| | DDOMAIN | Identify the dividing site that makes the distance between the two parts exceed the threshold as the domain boundary. | 2007 | | |
| | DHcL | Identify the domain by calculating the van der Waals model of the protein. | 2008 | | |
| | SWORD | Assign structural domains through the hierarchical merging of protein units. SWORD provides different domain assignments using different merge schemes. | 2017 | www.dsimb.inserm.fr/sword/ | |
| Predicted structure-based | SnapDRAGON | DRAGON generates 100 models, and then structure-based domain assignment is used to parse the models into domains. Finally, a result is derived from the consistency of the predicted boundaries. | 2002 | | |
| | RosettaDOM | RosettaDOM is a hybrid method that uses homology-based methods to predict domain boundaries when homologous templates can be found. When templates are lacking, Rosetta is used to generate models, and the final domain boundary predictions are derived from the models. | 2005 | | |
| | OPUS-Dom | Generate a large ensemble of folded structure decoys with VECFOLD; predicted domain boundaries are derived from the consistency of the domain boundaries across the set of 3D models. | 2009 | | |
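
The contact-density idea described for PDP in the table above can be illustrated with a small sketch: try every single cut along the chain and keep the one with the lowest density of contacts between the two parts, provided it falls below a threshold. This is only a simplified illustration under stated assumptions, not the PDP implementation. The function name `best_single_cut`, the 8 Å Cα contact cutoff, the normalization by the product of the two part sizes, and the threshold value are all assumptions made here.

```python
import math
from typing import List, Optional, Sequence, Tuple

Coord = Tuple[float, float, float]


def best_single_cut(
    ca_coords: Sequence[Coord],
    contact_cutoff: float = 8.0,     # assumed C-alpha contact distance (angstroms)
    min_domain_len: int = 3,         # assumed minimum size of each part
    density_threshold: float = 0.1,  # assumed acceptance threshold
) -> Optional[int]:
    """Return the cut index k with the lowest inter-part contact density.

    A cut at k splits the chain into residues [0, k) and [k, n) (0-based).
    Density is (contacts across the cut) / (k * (n - k)), a simplified
    normalization chosen for this sketch. Returns None when no cut has a
    density below `density_threshold`.
    """
    n = len(ca_coords)

    # Collect all residue pairs (i < j) that are in contact.
    contacts: List[Tuple[int, int]] = []
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= contact_cutoff:
                contacts.append((i, j))

    best_k: Optional[int] = None
    best_density = float("inf")
    for k in range(min_domain_len, n - min_domain_len + 1):
        # A contact crosses the cut when i is before k and j is at or after k.
        crossing = sum(1 for i, j in contacts if i < k <= j)
        density = crossing / (k * (n - k))
        if density < best_density:
            best_k, best_density = k, density

    return best_k if best_density < density_threshold else None


if __name__ == "__main__":
    # Toy chain: two compact groups of five residues, about 100 angstroms apart.
    toy = [(float(x), 0.0, 0.0) for x in range(5)]
    toy += [(100.0 + x, 0.0, 0.0) for x in range(5)]
    print(best_single_cut(toy))
```

A real structure would need recursive splitting and handling of discontinuous domains, which this single-cut sketch deliberately leaves out.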
Fig. 2. Diagram of homology alignment-based methods to construct a domain database.
Statistics on the number of domain families annotated for the three kingdoms of life in different domain databases.
| Database | Source: Eukaryota | Source: Bacteria | Source: Archaea | Shared by the three kingdoms |
|---|---|---|---|---|
| Pfam | 8437 | 5857 | 1735 | 890 (6.2%) |
| SMART | 1120 | 428 | 241 | 166 (11.4%) |
| PROSITE | 2185 | 1262 | 641 | 500 (16.2%) |
| CATH | 2972 | 3665 | 918 | 399 (5.9%) |
| SCOP | 3166 | 2626 | 749 | 264 (4.4%) |
Average domain length for the three kingdoms in CATH and SCOP.
| Database | Mean length | Std | Eukaryota mean | Eukaryota std | Bacteria mean | Bacteria std | Archaea mean | Archaea std |
|---|---|---|---|---|---|---|---|---|
| CATH | 150.4 | 90.9 | 147.9 | 90.8 | 154.8 | 91.8 | 137.2 | 81.0 |
| SCOP | 196.8 | 129.7 | 179.5 | 128.9 | 215.4 | 129.5 | 189.5 | 115.1 |