Mariam Pirashvili, Lee Steinberg, Francisco Belchi Guillamon, Mahesan Niranjan, Jeremy G Frey, Jacek Brodzki.
Abstract
Topological data analysis (TDA) is a family of recent mathematical techniques that seek to understand the 'shape' of data. Here it is used to understand the structure of the descriptor space produced by a standard chemical informatics software package, from the point of view of solubility. We have used the mapper algorithm, a TDA method that creates low-dimensional representations of data, to create a network visualization of the solubility space. While descriptors with clear chemical implications are prominent features in this space, reflecting their importance to the chemical properties, the analysis also reveals an unexpected and interesting correlation between chlorine content and rings, with implications for solubility prediction. A parallel representation of the chemical space was generated using persistent homology applied to molecular graphs. Links between this chemical space and the descriptor space were shown to be in agreement with chemical heuristics. The use of persistent homology on molecular graphs, extended by the use of norms on the associated persistence landscapes, allows the conversion of discrete shape descriptors to continuous ones, and a perspective on the application of these descriptors to quantitative structure-property relations is presented.
Keywords: Chemical space; Mapper; Persistent homology; Solubility
Year: 2018 PMID: 30460426 PMCID: PMC6755597 DOI: 10.1186/s13321-018-0308-5
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 The pipeline. A flowchart illustrating the two main threads of study performed. Both the Mapper (top) and persistent homology (bottom) routes use simple molecular SMILES strings as input. We use dashed lines to emphasise that we use the average persistence landscape metric as a surrogate for the persistence distortion distance
Fig. 2 A continuous example of the implementation of the mapper algorithm, with the function f being the height and using the Euclidean metric
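The construction in this caption can be sketched on a discrete point cloud. The following is a minimal illustration, not the Ayasdi Workbench implementation used in the paper. The function names, the interval cover, the single-linkage clustering rule and all parameter values (`n_intervals`, `overlap`, `eps`) are assumptions chosen for the example. The filter is the height (y-coordinate) and distances are Euclidean, as in the figure.

```python
import math
from itertools import combinations


def single_linkage(indices, points, eps):
    """Group the given point indices into clusters by chaining together
    points whose Euclidean distance is at most eps."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in remaining
                    if math.dist(points[i], points[j]) <= eps}
            remaining -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper(points, f, n_intervals, overlap, eps):
    """Return (nodes, edges): nodes are clusters found inside overlapping
    slices of the filter range, edges join clusters that share points."""
    values = [f(p) for p in points]
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals
    pad = overlap * length  # how far each interval is widened on both sides
    nodes = []
    for i in range(n_intervals):
        a, b = lo + i * length - pad, lo + (i + 1) * length + pad
        members = [j for j, v in enumerate(values) if a <= v <= b]
        nodes.extend(single_linkage(members, points, eps))
    edges = {(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]}
    return nodes, edges


# 40 points on the unit circle; the filter f is the height.
circle = [(math.cos(2 * math.pi * i / 40), math.sin(2 * math.pi * i / 40))
          for i in range(40)]
nodes, edges = mapper(circle, lambda p: p[1],
                      n_intervals=3, overlap=0.25, eps=0.3)
print(len(nodes), "nodes,", len(edges), "edges")
```

With these settings the bottom and top slices each give one cluster and the middle slice gives two, so the resulting graph is a 4-cycle that mirrors the loop in the data.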
Fig. 3 A visualisation of the superlevel set filtration of a molecular graph, for the molecule dibromomethane. The 3D model of the molecule (b) can be viewed as a metric graph (c). We first consider one of the hydrogen atoms as the base point (d), and get a corresponding persistence diagram (e). If, instead, we choose the carbon atom as the base point (f), we associate a different persistence diagram (g) to the graph. Note that in this second diagram the points have multiplicity 2
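The base-point construction in this caption can be illustrated with a small sketch. It computes the geodesic distance from the chosen base atom and then the 0-dimensional persistence of the superlevel sets of that distance. Several things here are assumptions rather than the paper's procedure: the bond lengths are rough illustrative values, the function only looks at vertex values (adequate for a tree such as this molecule, not for graphs with rings), and the component that never dies is paired with the global minimum.

```python
import heapq


def geodesic_distances(graph, base):
    """Dijkstra shortest-path distances from the base vertex."""
    dist = {base: 0.0}
    heap = [(0.0, base)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def superlevel_persistence(graph, base):
    """0-dimensional persistence pairs (birth, death), birth >= death,
    for the superlevel sets of the distance-to-base function."""
    f = geodesic_distances(graph, base)
    parent, birth, diagram = {}, {}, []

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Sweep from the farthest vertices down towards the base point.
    for v in sorted(f, key=f.get, reverse=True):
        parent[v], birth[v] = v, f[v]
        for u in graph[v]:
            if u not in parent:
                continue
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            # Elder rule: the component born at the higher value survives.
            old, young = (ru, rv) if birth[ru] >= birth[rv] else (rv, ru)
            if birth[young] > f[v]:
                diagram.append((birth[young], f[v]))
            parent[young] = old
    # Assumed convention: pair the surviving component with the minimum.
    diagram.append((max(f.values()), min(f.values())))
    return sorted(diagram, reverse=True)


# Dibromomethane as a weighted graph; lengths in angstroms are approximate.
C_H, C_BR = 1.09, 1.93
molecule = {
    "C": {"H_a": C_H, "H_b": C_H, "Br_a": C_BR, "Br_b": C_BR},
    "H_a": {"C": C_H},
    "H_b": {"C": C_H},
    "Br_a": {"C": C_BR},
    "Br_b": {"C": C_BR},
}
from_carbon = superlevel_persistence(molecule, "C")
from_hydrogen = superlevel_persistence(molecule, "H_a")
print(from_carbon)
print(from_hydrogen)
```

Under these assumptions the carbon base point yields two distinct points, each occurring twice, which is consistent with the multiplicity noted in the caption, while the hydrogen base point yields three distinct points.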
Fig. 4 A visualisation of the Hausdorff distance. Let the red triangles be the subset X, and the orange circles be the subset Y. Panel b illustrates one directed part of the definition: the smallest radius for which the disc around the orange circle farthest away from any of the triangles includes a triangle. To make the definition symmetric, this step is repeated for the triangles, and the maximum of the two radii is chosen
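The two-step definition in this caption translates directly into code. The sketch below assumes finite point sets in the plane with the Euclidean metric; the function names and the sample coordinates are illustrative only.

```python
import math


def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(X, Y):
    """Symmetric Hausdorff distance: the larger of the two directed terms."""
    return max(directed_hausdorff(X, Y), directed_hausdorff(Y, X))


triangles = [(0.0, 0.0), (1.0, 0.0)]  # subset X
circles = [(0.0, 1.0), (4.0, 0.0)]    # subset Y

print(directed_hausdorff(circles, triangles))  # circles to triangles: 3.0
print(directed_hausdorff(triangles, circles))  # triangles to circles: about 1.414
print(hausdorff(triangles, circles))           # maximum of the two: 3.0
```

The circle at (4, 0) is the one farthest from any triangle, so the circle-to-triangle term dominates and sets the distance.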
Fig. 5 A visual explanation of persistence landscapes. The persistence diagram (left) is tilted, so that the diagonal becomes the new horizontal axis (top right). The landscape functions are the resulting piecewise linear functions (bottom right)
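As a rough illustration of this construction, the sketch below evaluates landscape functions from a small persistence diagram and approximates an L1 norm of one of them. It uses the standard definition, in which each point (b, d) with b ≤ d contributes a tent function and the k-th landscape function is the k-th largest tent value. The diagram, the function names and the grid-based integration are assumptions for the example, not the paper's code.

```python
def tent(b, d, t):
    """Tent function of a diagram point (b, d): rises from b, peaks at
    the midpoint with height (d - b) / 2, and falls back to zero at d."""
    return max(0.0, min(t - b, d - t))


def landscape(diagram, k, t):
    """k-th landscape function at t (k = 1 is the outermost layer)."""
    heights = sorted((tent(b, d, t) for b, d in diagram), reverse=True)
    return heights[k - 1] if k <= len(heights) else 0.0


def landscape_l1_norm(diagram, k, lo, hi, steps=400):
    """Trapezoid-rule approximation of the integral of the k-th
    landscape function over [lo, hi]."""
    h = (hi - lo) / steps
    values = [landscape(diagram, k, lo + i * h) for i in range(steps + 1)]
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


diagram = [(0.0, 4.0), (1.0, 3.0)]  # hypothetical (birth, death) pairs

print(landscape(diagram, 1, 2.0))  # 2.0, the peak of the tent over (0, 4)
print(landscape(diagram, 2, 2.0))  # 1.0, the peak of the tent over (1, 3)
print(landscape(diagram, 3, 2.0))  # 0.0, there is no third layer
norm_1 = landscape_l1_norm(diagram, 1, 0.0, 4.0)
norm_2 = landscape_l1_norm(diagram, 2, 0.0, 4.0)
```

Because each norm is a real number that varies continuously with the diagram points, this is one way a discrete diagram can be turned into a continuous descriptor, as the abstract describes.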
The table shows how the correlation of the feature X% with solubility changes with the number of rings (nCIC)
| nCIC | X% | AMW | MW |
|---|---|---|---|
| All | | | |
| 0 | | | |
| 1 | | | |
| 2 | | | |
The number of chlorine atoms in the molecule is responsible for this change. Also shown are the correlation values of average molecular weight (AMW), which itself correlates well with X%, and of molecular weight (MW). The highest (bold italic) and lowest (italic) correlation values are emphasised
Fig. 6 The first row shows three different analyses coloured by rows per node. The red patches indicate groupings of a large number of molecules. The first analysis uses the MDS lenses and norm correlation metric (resolution: 30, gain: 2.5, not equalized), the second uses MDS lenses and the Variance Normalized Euclidean metric (resolution: 35, gain: 2.5, equalized) and the last one uses PCA lenses and the Variance Normalized Euclidean metric (resolution: 30, gain: 2.5, equalized). The second row shows the same analyses coloured by nCIC. Here blue corresponds to no cycles, green to 1 cycle, etc. The presented graphs have been created using Ayasdi Workbench
Fig. 7 Coloured by rows per node (a), LogS (b), nCIC (c), MW (d), AMW (e) and nCL (f). We can see the red region in (d), corresponding to molecules with a high number of chlorines, matches the blue patch in (b). These are molecules with two rings, as we can see from (c), with a particularly low solubility. It is precisely these molecules which distort the colour gradient in (b). This visualisation was created using Ayasdi Workbench
Fig. 8 The first row shows tSNE embeddings of the two distance matrices, (a) and (b), coloured by number of atoms and number of rings, respectively. The second row shows the MDS embeddings of the same matrices
Fig. 9 The tSNE planar embedding of the combined matrix constructed using SNF. Coloured by number of atoms (a), number of cycles (b), number of chlorines (c) and solubility (d)