Andrew F Neuwald, Christopher J Lanczycki, Theresa K Hodges, Aron Marchler-Bauer.
Abstract
For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect, or may misalign, sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating extremely large MSAs from them. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily, along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease-endonuclease-phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning and big data analyses. MAPGAPS, the auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA, and links to National Center for Biotechnology Information data files are available at https://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/.
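The idea of template MSAs that relate each subgroup alignment to the one above it can be illustrated with a small sketch. This is not the MAPGAPS file format or algorithm: the `HiMsaNode` class, the `to_parent` column map (a simplified stand-in for a template MSA) and the `project_to_root` function are hypothetical names introduced only for illustration. The sketch assumes that each subgroup column either maps to one parent column or is an insertion relative to the parent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HiMsaNode:
    """One alignment in a hypothetical hierarchy (not the MAPGAPS format)."""
    name: str
    n_columns: int
    parent: Optional["HiMsaNode"] = None
    # to_parent[i] is the parent column that this node's column i maps to,
    # or None when column i is an insertion relative to the parent.
    to_parent: List[Optional[int]] = field(default_factory=list)


def project_to_root(node: HiMsaNode, aligned_seq: str) -> str:
    """Re-express a row aligned to `node` in the root alignment's columns."""
    while node.parent is not None:
        parent_row = ["-"] * node.parent.n_columns
        for col, residue in enumerate(aligned_seq):
            target = node.to_parent[col]
            if target is not None:
                parent_row[target] = residue
            # Residues in insertion columns are dropped in this simplified sketch.
        aligned_seq = "".join(parent_row)
        node = node.parent
    return aligned_seq


# Toy hierarchy: superfamily (5 columns) > family (4 columns) > subfamily (3 columns).
root = HiMsaNode("superfamily", n_columns=5)
family = HiMsaNode("family", n_columns=4, parent=root, to_parent=[0, 1, None, 3])
subfamily = HiMsaNode("subfamily", n_columns=3, parent=family, to_parent=[0, 2, 3])

print(project_to_root(subfamily, "MKL"))  # M--L-
```

Composing the per-level maps is what lets a sequence that is aligned only against its closest subgroup be placed in the superfamily-wide alignment, which is the role the abstract attributes to the template MSAs.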
Year: 2020 PMID: 32500917 PMCID: PMC7297217 DOI: 10.1093/database/baaa042
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Steps required to create a large, high-quality MSA using the CDD hiMSAs and programs described here.
Figure 2. CDD hierarchies used here to create very large MSAs. A. Hierarchy for EEP domains (cd08372). B. Hierarchy for PH domains (cd00900). The subtree is shown for the RanBD family; other (+) nodes may be expanded in a similar manner.
Figure 3. Alignment of 42 representative, distantly related PH domains from distinct phyla and sharing ≤25% identity. Despite the weak similarity and an abundance of indels, conserved residues characteristic of this superfamily are generally well aligned. Residues generally conserved in 'all' PH domains are colored as follows: acidic residues, red (without highlighting); basic residues, cyan; hydrophilic residues, pink; histidine, glycine and proline, blue, green and black, respectively; hydrophobic and aromatic residues, red (highly conserved) or gray with yellow highlighting. Identifiers for phyla (left column) and sequences (right column in lower aligned region) are color coded by taxa as follows: metazoan, red; fungal, dark yellow; plant, green; protozoan, cyan.
Figure 4. CDD versus PFAM alignment quality. S-scores estimate the statistical significance of the correspondence between pairwise correlations in an MSA and 3D residue contacts in available structures. (For PDB identifiers, see the supplementary data S1 file.) Higher-quality MSAs should yield higher S-scores. A. Comparison of CD08372-MAPGAPS versus PF03372_full EEP domain MSAs based on 20 EEP protein structures. B. Comparison of CDD00900-MAPGAPS versus PF00169_full PH domain MSAs based on 76 protein structures. Two CDD MSAs were analyzed: one very diverse sub-alignment of randomly sampled sequences (average column relative entropy = 0.26 nats) and another, less diverse sub-alignment (denoted as hiRE) consisting of sequences very similar to those in the PFAM MSA (average column relative entropy = 0.83 nats). The PFAM MSA was of intermediate diversity (average column relative entropy = 0.69 nats).
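The average column relative entropy used above as a diversity measure can be sketched as follows. The caption does not state which background frequencies were used or how gaps were handled, so this sketch makes its own assumptions: a uniform background over the 20 standard amino acids, gaps and non-standard characters ignored, and no pseudocounts. The function names are hypothetical. Natural logarithms give values in nats, and lower averages indicate a more diverse alignment.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_relative_entropy(column, background):
    """Relative entropy (in nats) of one column versus background frequencies."""
    # Assumption: gaps and non-standard characters are ignored.
    residues = [c for c in column.upper() if c in background]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return sum(
        (k / n) * math.log((k / n) / background[aa]) for aa, k in counts.items()
    )


def average_relative_entropy(msa, background=None):
    """Mean column relative entropy over an MSA given as equal-length strings."""
    if background is None:
        # Assumption: uniform background; real analyses typically use
        # database-derived amino acid frequencies instead.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    columns = ["".join(col) for col in zip(*msa)]
    return sum(column_relative_entropy(c, background) for c in columns) / len(columns)


# Toy alignment: column 1 is fully conserved, column 2 is split between C and D.
toy_msa = ["AC", "AD"]
print(average_relative_entropy(toy_msa))
```

With a uniform background, a fully conserved column scores ln 20 and a column split evenly between two residues scores ln 10, so the toy alignment averages (ln 20 + ln 10) / 2 nats.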