Literature DB >> 30247628

PySFD: comprehensive molecular insights from significant feature differences detected among many simulated ensembles.

Sebastian Stolzenberg1.   

Abstract

MOTIVATION: Many modeling analyses of molecular dynamics (MD) simulations are based on a definition of states that can be (groups of) clusters of simulation frames in a feature space composed of molecular coordinates. With increasing dimension of this feature space (due to the increasing size or complexity of a simulated molecule), it becomes very difficult to cluster the underlying MD data and estimate a statistically robust model. To mitigate this "curse of dimensionality", one can reduce the feature space, e.g., with principal component or time-lagged independent component analysis transformations, focusing the analysis on the most important modes of transitions. In practice, however, all these reduction strategies may neglect important molecular details that are susceptible to experimental verification.
RESULTS: To recover such molecular details, I have developed PySFD (Significant Feature Differences analyzer for Python), a multi-processing software package that efficiently selects significantly different features of any user-defined feature type among potentially many different simulated state ensembles, such as meta-stable states of a Markov State Model (MSM). Applying PySFD on MSMs of an aggregate of 300 microseconds MD simulations recently performed on the major histocompatibility complex class II (MHCII) protein, I demonstrate how this toolkit can extract and visualize valuable mechanistic information from big MD simulation data, e.g., in form of networks of dynamic interaction changes connecting functionally relevant sites of a protein complex.
AVAILABILITY AND IMPLEMENTATION: PySFD is freely available under the L-GPL license at https://github.com/markovmodel/PySFD. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2018. Published by Oxford University Press.

Entities:  

Mesh:

Substances:

Year:  2019        PMID: 30247628      PMCID: PMC6499238          DOI: 10.1093/bioinformatics/bty818

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Fueled by perpetual advances in supercomputing capabilities (Glaser ; Shaw ; Stone ), high-throughput molecular dynamics (MD) simulations of increasing size and timescales are becoming amenable. Most often, these advances are paralleled with an increasing heterogeneity in supercomputing resources (e.g., different compute nodes containing different numbers and types of CPUs/GPUs). This heterogeneity is reflected in high-throughput MD datasets in terms of numbers and lengths of individual MD simulations, and thus demands means to analyze these data appropriately (Faradjian and Elber, 2004; Preto and Clementi, 2014; Schaudinnus ; Wriggers ): For example, Markov State Models (MSMs; Bowman ; Noé ) are capable to compute thermodynamic and kinetic properties from many shorter off-equilibrium MD simulations, which for practical supercomputing reasons (inter-core/-node communication times, job queuing policies) are much more feasible to generate than a single, or a few long MD simulations. Also, MSMs can be used “on-the-fly” in adaptive MD simulations (Bowman ; Doerr and De Fabritiis, 2014; Plattner and Noé, 2015) that continuously select new restarting points to enhance the sampling of under-explored molecular conformations or transitions. In particular, MSMs estimate transition probabilities between micro-states, which are usually defined as clusters in a conformational feature space (inter-atomic distances, backbone dihedrals, chain rotamers, …). The more complex a simulated system, the higher the dimension of this feature space, which requires more and better clusters, and more observed inter-cluster transitions to estimate a statistically robust MSM. This “curse of dimensionality” for such feature spaces can be alleviated by including only coarse-grained or fewer localized features, and/or only the most important, uncorrelated eigen modes in such feature space (principal component analysis, or time-lagged independent component analysis) (Pérez-Hernández ; Pérez-Hernández and Noé, 2016). By definition, all these reduction strategies (and thus an MSM) do not encode all molecular features at the same time, many of which may be equally important to understand a protein’s mechanism. In principle, however, such important features may be recovered a posteriori because each MSM micro-state represents a set of simulation frames, and thus an average feature value. For example, by correlating average feature values with MSM eigenvectors along micro-state (Pérez-Hernández ), one can thus extract features that represent best the slowest eigenvectors of an MSM. Alternatively, one can identify significant feature differences (SFDs) among pairs of simulated ensembles—e.g., meta-stable states (sets) of micro-states, even across different mutants—as implemented for non-covalent contact frequencies in (Farabella ), or in the PIA (Stolzenberg, 2014; Stolzenberg , 2016), and pyHVis3D (Knapp ) tools. In this paper, I have developed the object-oriented Python package PySFD (Significant Feature Differences analyzer for Python), a generalized and more powerful framework that efficiently detects and visualizes significant differences in any user-defined feature between many pairs or many groups of molecular simulation state ensembles. As a result, these significantly different features can be used to distinguish or even classify these ensembles from one another for verification of stationary distributions (estimated, e.g., from MSMs) and their underlying simulations, and further inspire novel molecular predictions that are directly testable in experiments, such as mutagenesis or substituted cysteine accessibility measurements (Liapakis ). In this paper, I describe the basic concepts of PySFD, and its main functionalities. In the Supplementary Information, I illustrate its capabilities by applying it on 300μs MD simulations I had performed on an MHCII (HLA-B1DR*01:01) protein complex (Wieczorek ), an important peptide exchanger in the adaptive immune system.

2 Materials and methods

Given a number of molecular input trajectories for each simulated state ensemble, PySFD detects and visualizes SFDs in three different stages (I-III, Fig. 1): In the Feature Extraction stage (I), PySFD considers various groups of feature types (SRF, PRF, sPBSF, PPRF, PsPBSF, see Supplementary Information) in form of classes inherited from the FeatureAgent class. In each simulation frame, features are tabulated as Python pandas data frames (McKinney, 2010) and can be further coarse-grained into user-defined regions by residual identity (and optionally by backbone/side-chain identity) via a user-defined function (e.g., mean or sum). In stage II (see Supplementary Information), these feature tables are aggregated into means (and optionally higher statistical moments) with uncertainties, providing a way to characterize different state ensembles and/or simulated systems in form of SFDs. These differences can then be used to study molecular mechanisms directly, or to generate state- and/or system-independent insights, e.g., in form of feature type redundancies or feature selection input for various machine/deep learning algorithms. In stage III (see Supplementary Information), these SFDs can be overlaid with molecular representations of the simulated system using the PyMOL (Schrödinger, 2010) or VMD (Humphrey ) programs. In any of these cases, it remains the user’s responsibility to choose/define meaningful feature types and other PySFD parameters, and interpret the results in accordance with the particular scientific question being addressed.
Fig. 1.

Workflow of the PySFD software: The PySFD main class receives input trajectories realizing different molecular ensembles, and FeatureAgent-derived classes (see Supplementary Methods section), i.e. SRF (single residual feature), PRF (pairwise residual feature), sPBSF (sparse pairwise backbone/side-chain feature), PPRF (pairwise, pairwise residual feature), and PsPBSF (pairwise sparse pairwise backbone/side-chain feature) as arguments (I) to compute (coarse-grained) feature tables and histograms, feature type redundancies, and (common) significant feature differences (SFDs) among the simulated ensembles (II). The feature difference tables can be visualized via PyMOL and/or VMD (III), as illustrated in the lower right corner by white and black ribbons representing snapshots of the simulated ensembles 1 and 2, respectively. Residues with SFDs (here, χ1 rotamers) are rendered as sticks and colored by their corresponding ensemble

Workflow of the PySFD software: The PySFD main class receives input trajectories realizing different molecular ensembles, and FeatureAgent-derived classes (see Supplementary Methods section), i.e. SRF (single residual feature), PRF (pairwise residual feature), sPBSF (sparse pairwise backbone/side-chain feature), PPRF (pairwise, pairwise residual feature), and PsPBSF (pairwise sparse pairwise backbone/side-chain feature) as arguments (I) to compute (coarse-grained) feature tables and histograms, feature type redundancies, and (common) significant feature differences (SFDs) among the simulated ensembles (II). The feature difference tables can be visualized via PyMOL and/or VMD (III), as illustrated in the lower right corner by white and black ribbons representing snapshots of the simulated ensembles 1 and 2, respectively. Residues with SFDs (here, χ1 rotamers) are rendered as sticks and colored by their corresponding ensemble

3 Conclusion

PySFD is an object-oriented Python package I have developed to detect and visualize significant feature differences among molecular simulations, such as MD ensembles. In the Supplementary Information, I have applied PySFD on meta-stable MSM sets of 300 microseconds of MD simulation performed on the MHCII protein complex. From a machine/deep learning perspective, PySFD selects (i.e. “filters”) features that are significantly different between simulated ensembles (i.e. “classes”). This selection strategy is similar to the feature selection with one-way analysis of variance (ANOVA) (Saeys ), which performs F-tests between inter-ensemble and intra-ensemble variances, and which is directly accessible to PySFD feature tables, e.g., via the scikit-learn Python package (Pedregosa ). However, PySFD differs from “ANOVA” as it retains information about the sign and magnitude of each individual SFD, which makes PySFD’s “pre-learning” analysis by itself very useful. Click here for additional data file.
  19 in total

1.  Computing time scales from reaction coordinates by milestoning.

Authors:  Anton K Faradjian; Ron Elber
Journal:  J Chem Phys       Date:  2004-06-15       Impact factor: 3.488

Review 2.  A review of feature selection techniques in bioinformatics.

Authors:  Yvan Saeys; Iñaki Inza; Pedro Larrañaga
Journal:  Bioinformatics       Date:  2007-08-24       Impact factor: 6.937

3.  Identification of slow molecular order parameters for Markov model construction.

Authors:  Guillermo Pérez-Hernández; Fabian Paul; Toni Giorgino; Gianni De Fabritiis; Frank Noé
Journal:  J Chem Phys       Date:  2013-07-07       Impact factor: 3.488

4.  Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations.

Authors:  Frank Noé; Christof Schütte; Eric Vanden-Eijnden; Lothar Reich; Thomas R Weikl
Journal:  Proc Natl Acad Sci U S A       Date:  2009-11-03       Impact factor: 11.205

5.  GPU-accelerated molecular modeling coming of age.

Authors:  John E Stone; David J Hardy; Ivan S Ufimtsev; Klaus Schulten
Journal:  J Mol Graph Model       Date:  2010-07-08       Impact factor: 2.518

6.  Enhanced modeling via network theory: Adaptive sampling of Markov state models.

Authors:  Gregory R Bowman; Daniel L Ensign; Vijay S Pande
Journal:  J Chem Theory Comput       Date:  2010       Impact factor: 6.006

7.  Fast recovery of free energy landscapes via diffusion-map-directed molecular dynamics.

Authors:  Jordane Preto; Cecilia Clementi
Journal:  Phys Chem Chem Phys       Date:  2014-09-28       Impact factor: 3.676

8.  The substituted-cysteine accessibility method (SCAM) to elucidate membrane protein structure.

Authors:  G Liapakis; M M Simpson; J A Javitch
Journal:  Curr Protoc Neurosci       Date:  2001-05

9.  Mechanism of the Association between Na+ Binding and Conformations at the Intracellular Gate in Neurotransmitter:Sodium Symporters.

Authors:  Sebastian Stolzenberg; Matthias Quick; Chunfeng Zhao; Kamil Gotfryd; George Khelashvili; Ulrik Gether; Claus J Loland; Jonathan A Javitch; Sergei Noskov; Harel Weinstein; Lei Shi
Journal:  J Biol Chem       Date:  2015-04-13       Impact factor: 5.157

10.  Allosteric signalling in the outer membrane translocation domain of PapC usher.

Authors:  Irene Farabella; Thieng Pham; Nadine S Henderson; Sebastian Geibel; Gilles Phan; David G Thanassi; Anne H Delcour; Gabriel Waksman; Maya Topf
Journal:  Elife       Date:  2014-10-28       Impact factor: 8.140

View more
  1 in total

1.  Identification of kinetic order parameters for non-equilibrium dynamics.

Authors:  Fabian Paul; Hao Wu; Maximilian Vossel; Bert L de Groot; Frank Noé
Journal:  J Chem Phys       Date:  2019-04-28       Impact factor: 3.488

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.