
Parallel Generative Topographic Mapping: An Efficient Approach for Big Data Handling.

Arkadii Lin, Igor I Baskin, Gilles Marcou, Dragos Horvath, Bernd Beck, Alexandre Varnek.

Abstract

Generative Topographic Mapping (GTM) can be efficiently used to visualize, analyze and model large chemical data. The GTM manifold needs to span the chemical space deemed relevant for a given problem. Therefore, the Frame set (FS) of compounds used for manifold construction must cover that chemical space well. Intuitively, the FS size must grow with the size and diversity of the target library. At the same time, GTM training can be very slow, or even become technically impossible, at FS sizes on the order of 10^5 compounds, which is a very small number compared to today's commercially accessible compounds and, especially, to the theoretically feasible molecules. To solve this problem, we propose a Parallel GTM algorithm based on merging "intermediate" manifolds constructed in parallel for different subsets of molecules. The ensemble of these subsets forms the FS for the "final" manifold. To assess the efficiency of the new algorithm, 80 GTMs were built on FSs of different sizes, ranging from 10 to 1.8 M compounds selected from the ChEMBL database. Each GTM was challenged to build classification models for up to 712 biological activities (depending on the FS size). With the novel Parallel GTM procedure, we could thus cover the entire spectrum of possible FS sizes, whereas previous studies were forced to rely on the working hypothesis that FS sizes of a few thousand compounds are sufficient to describe the ChEMBL chemical space. In fact, this study formally proves this to be true: a FS containing only 5000 randomly picked compounds is sufficient to represent the entire ChEMBL collection (1.8 M molecules), in the sense that a further increase in the number of FS compounds has no beneficial impact on the predictive performance of the above-mentioned 712 activity classification models.
Parallel GTM may, however, be required to generate maps based on very large FSs, which might improve chemical space cartography of big commercial and virtual libraries approaching billions of compounds.
© 2020 The Authors. Published by Wiley-VCH Verlag GmbH & Co. KGaA.

Keywords:  Big Data; ChEMBL; Frame set; Parallel Generative Topographic Mapping

Year:  2020        PMID: 32347666      PMCID: PMC7757192          DOI: 10.1002/minf.202000009

Source DB:  PubMed          Journal:  Mol Inform        ISSN: 1868-1743            Impact factor:   3.353

