Diogo A R S Latino, João Aires-de-Sousa.
Abstract
The combination of chemoinformatics approaches with NMR techniques and the increasing availability of data allow the resolution of problems far beyond the original application of NMR in structure elucidation/verification. Applications range from process monitoring and metabolic profiling to product authentication and quality control. An application related to the automatic analysis of complex mixtures concerns mixtures of chemical reactions. We encoded mixtures of chemical reactions with the difference between the ¹H NMR spectra of the products and the reactants. All the signals arising from all the reactants of the co-occurring reactions were taken together (a simulated spectrum of the mixture of reactants) and the same was done for products. The difference spectrum is taken as the representation of the mixture of chemical reactions. A data set of 181 chemical reactions was used, each reaction manually assigned to one of 6 types. From this data set, we simulated mixtures where two reactions of different types would occur simultaneously. Automatic learning methods were trained to classify the reactions occurring in a mixture from the ¹H NMR-based descriptor of the mixture. Unsupervised learning methods (self-organizing maps) produced a reasonable clustering of the mixtures by reaction type, and allowed the correct classification of 80% and 63% of the mixtures in two independent test sets of different similarity to the training set. With random forests (RF), the percentage of correct classifications was increased to 99% and 80% for the same test sets. The RF probability associated with the predictions yielded a robust indication of their reliability. This study demonstrates the possibility of applying machine learning methods to automatically identify types of co-occurring chemical reactions from NMR data.
Using no explicit structural information about the reaction participants, reaction elucidation is performed without structure elucidation of the molecules in the mixtures.
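The difference-spectrum encoding described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the 0 to 12 ppm range, the 0.5 ppm bin width, the `(shift, intensity)` peak lists, and all function names are hypothetical choices made here.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (chemical shift in ppm, integrated intensity)


def binned_spectrum(peaks: List[Peak], lo: float = 0.0, hi: float = 12.0,
                    width: float = 0.5) -> List[float]:
    """Accumulate peak intensities into fixed-width chemical-shift bins.

    The range and bin width are assumptions made for this sketch.
    """
    n_bins = int(round((hi - lo) / width))
    spectrum = [0.0] * n_bins
    for shift, intensity in peaks:
        if lo <= shift < hi:
            spectrum[int((shift - lo) / width)] += intensity
    return spectrum


def difference_descriptor(reactants: List[Peak], products: List[Peak]) -> List[float]:
    """Products spectrum minus reactants spectrum, bin by bin."""
    r = binned_spectrum(reactants)
    p = binned_spectrum(products)
    return [pi - ri for pi, ri in zip(p, r)]


def mixture_descriptor(reactions: List[Tuple[List[Peak], List[Peak]]]) -> List[float]:
    """Pool the reactant peaks of all co-occurring reactions, pool the
    product peaks likewise, and take the difference of the two pooled spectra."""
    all_reactants = [pk for reactants, _ in reactions for pk in reactants]
    all_products = [pk for _, products in reactions for pk in products]
    return difference_descriptor(all_reactants, all_products)


# Hypothetical peak lists for two reactions (reactants, products).
rxn1 = ([(5.8, 2.0)], [(3.1, 2.0)])
rxn2 = ([(7.2, 1.0)], [(4.4, 1.0)])
descriptor = mixture_descriptor([rxn1, rxn2])
```

Negative bins mark signals that disappear with the reactants and positive bins mark signals that appear with the products. Because binning is additive, the descriptor of a mixture equals the sum of the descriptors of its individual reactions.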
Year: 2014 PMID: 24551112 PMCID: PMC3923800 DOI: 10.1371/journal.pone.0088499
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Types of photochemical reactions (from top): [3+2] photocycloaddition of azirines to C = C, [2+2] photocycloaddition of C = C to C = O, [4+2] and [4+4] photocycloaddition of olefins to carbon-only aromatic rings, [2+2] photocycloaddition of C = C to C = C, [3+2] photocycloaddition of s-triazolo[4,3-b]pyridazine to C = C, and [2+2] photocycloaddition of C = C to C = S.
Number of reactions by reaction type and partition to be used to generate training and test sets of Partition 2 of mixtures of reactions.
| Types of Reactions | Number of Reactions | For Partition 2* |
| [3+2] photocycloaddition of azirines to C = C | 20 | 16/4 |
| [2+2] photocycloaddition of C = C to C = O | 31 | 23/8 |
| [4+2] and [4+4] photocycloaddition of olefins to carbon-only aromatic rings | 20 | 16/4 |
| [2+2] photocycloaddition of C = C to C = C | 73 | 56/17 |
| [3+2] photocycloaddition of s-triazolo[4,3-b]pyridazine to C = C | 10 | 8/2 |
| [2+2] photocycloaddition of C = C to C = S | 27 | 21/6 |
| Total | 181 | 140/41 |
*Number of reactions in the training/test set to be used to generate Partition 2 of mixtures of reactions.
Number of reaction mixtures in each mixture class (mixture of two reactions of different types) for the two partitions of the data set.
| Class of mixture | Reaction 1 | Reaction 2 | Partition 1* | Partition 2* |
| A | [3+2] photocycloaddition of azirines to C = C | [2+2] photocycloaddition of C = C to C = O | 413/207 | 368/32 |
| B | [3+2] photocycloaddition of azirines to C = C | [4+2] and [4+4] photocycloaddition of olefins to carbon-only aromatic rings | 267/133 | 256/16 |
| C | [3+2] photocycloaddition of azirines to C = C | [2+2] photocycloaddition of C = C to C = C | 975/487 | 896/68 |
| D | [3+2] photocycloaddition of azirines to C = C | [3+2] photocycloaddition of s-triazolo[4,3-b]pyridazine to C = C | 132/67 | 128/8 |
| E | [3+2] photocycloaddition of azirines to C = C | [2+2] photocycloaddition of C = C to C = S | 360/180 | 352/20 |
| F | [2+2] photocycloaddition of C = C to C = O | [4+2] and [4+4] photocycloaddition of olefins to carbon-only aromatic rings | 413/206 | 368/32 |
| G | [2+2] photocycloaddition of C = C to C = O | [2+2] photocycloaddition of C = C to C = C | 1510/754 | 1288/136 |
| H | [2+2] photocycloaddition of C = C to C = O | [3+2] photocycloaddition of s-triazolo[4,3-b]pyridazine to C = C | 206/104 | 184/16 |
| I | [2+2] photocycloaddition of C = C to C = O | [2+2] photocycloaddition of C = C to C = S | 558/279 | 506/40 |
| J | [4+2] and [4+4] photocycloaddition of olefins to carbon-only aromatic rings | [2+2] photocycloaddition of C = C to C = C | 974/486 | 896/68 |
| K | [4+2] and [4+4] photocycloaddition of olefins to carbon-only aromatic rings | [3+2] photocycloaddition of s-triazolo[4,3-b]pyridazine to C = C | 133/67 | 127/8 |
| L | [4+2] and [4+4] photocycloaddition of olefins to carbon-only aromatic rings | [2+2] photocycloaddition of C = C to C = S | 360/180 | 353/20 |
| M | [2+2] photocycloaddition of C = C to C = C | [3+2] photocycloaddition of s-triazolo[4,3-b]pyridazine to C = C | 498/250 | 448/34 |
| N | [2+2] photocycloaddition of C = C to C = C | [2+2] photocycloaddition of C = C to C = S | 1302/651 | 1232/85 |
| O | [3+2] photocycloaddition of s-triazolo[4,3-b]pyridazine to C = C | [2+2] photocycloaddition of C = C to C = S | 180/90 | 176/10 |
*Number of reaction mixtures in the training/test sets.
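The class labels A to O correspond to the 15 unordered pairs of the six reaction types. As a rough sketch, and assuming (this record does not state the pairing rule) that every reaction of one type is combined with every reaction of the other type, the pairs and the total number of mixtures can be enumerated as below. The shortened type names are labels chosen for this example only.

```python
from itertools import combinations
from string import ascii_uppercase

# Number of reactions per type, in the order used by the tables above.
reaction_counts = {
    "[3+2] azirines to C=C": 20,
    "[2+2] C=C to C=O": 31,
    "[4+2]/[4+4] olefins to aromatic rings": 20,
    "[2+2] C=C to C=C": 73,
    "[3+2] s-triazolo[4,3-b]pyridazine to C=C": 10,
    "[2+2] C=C to C=S": 27,
}


def mixture_classes(counts):
    """Label every unordered pair of distinct reaction types A, B, C, ...

    The mixture count per class is the product of the two reaction counts,
    which assumes exhaustive pairing (an assumption of this sketch).
    """
    classes = {}
    for label, (t1, t2) in zip(ascii_uppercase, combinations(counts, 2)):
        classes[label] = (t1, t2, counts[t1] * counts[t2])
    return classes


classes = mixture_classes(reaction_counts)
total = sum(n for _, _, n in classes.values())
print(len(classes), total)  # 15 12421
```

Under that assumption the total of 12421 mixtures equals the sum of the training and test sets reported for Partition 1 (8280 + 4141).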
Figure 2. Toroidal surface of a 49×49 Kohonen SOM trained with 8280 mixtures of two photochemical reactions encoded by the ¹H NMR descriptor.
After the training, each neuron was colored according to the reaction mixtures of the training set that are mapped onto it. The colors correspond to the classes in Table 1. Black neurons correspond to conflicts.
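A toroidal map has no edges: the last row is adjacent to the first, and likewise for columns. The following sketch shows one way to compute wrap-around grid distances and to find the neuron onto which a descriptor is mapped. The Chebyshev grid metric, the squared Euclidean matching criterion, and the function names are assumptions of this example; the record does not specify them.

```python
import math


def torus_distance(a, b, size=49):
    """Grid distance between neurons a and b on a size x size toroidal map.

    Uses wrap-around row/column offsets and a Chebyshev (square
    neighbourhood) metric, which is an assumption of this sketch.
    """
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return max(min(dr, size - dr), min(dc, size - dc))


def winning_neuron(weights, x):
    """Return the grid position whose weight vector is closest to x.

    `weights` maps (row, col) -> list of floats; closeness is measured
    by squared Euclidean distance (assumed here).
    """
    best, best_d = None, math.inf
    for pos, w in weights.items():
        d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        if d < best_d:
            best, best_d = pos, d
    return best


# Opposite corners of the map are direct neighbours on the torus.
print(torus_distance((0, 0), (48, 48)))  # 1
```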
Figure 3. The same map as in Figure 2, with two different filters applied: top, only neurons belonging to mixtures of classes A, B, C, D, and E are colored; bottom, only neurons belonging to mixtures of classes C, G, J, M, and N are colored.
The colors correspond to the classes in Table 1. Black neurons correspond to conflicts between these classes and white neurons correspond to empty neurons or neurons belonging to other classes.
Classification of mixtures of reactions (mixtures of two reactions) by Kohonen SOMs and Counter-Propagation Neural Networks of dimension 49×49.
| Partition* | Data set | Best ind. SOM (% correct) | Best ind. CPNN (% correct) | Ensemble of five SOM (% correct) | Ensemble of five CPNN (% correct) | Ensemble of ten SOM (% correct) | Ensemble of ten CPNN (% correct) |
| 1 | Training | 80.6 | 61.3 | 86.7 | 73.0 | 89.0 | 75.6 |
| 1 | Test | 71.1 | 57.7 | 77.4 | 69.1 | 79.6 | 71.8 |
| 2 | Training | 82.9 | 68.4 | 89.4 | 77.4 | 91.4 | 78.6 |
| 2 | Test | 52.6 | 47.2 | 59.4 | 57.2 | 62.6 | 57.5 |
*Partition 1: 8280 and 4141 mixtures of reactions in the training and test sets, respectively; Partition 2: 7578 and 593 mixtures of reactions in the training and test sets, respectively.
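The ensemble columns combine the predictions of several independently trained maps. A common way to do this is majority voting, sketched below. The voting rule and the tie-break (first class encountered wins) are assumptions of this example; the record does not describe how the ensembles were combined.

```python
from collections import Counter


def ensemble_vote(predictions):
    """Return the class predicted by most maps for one mixture.

    `predictions` holds one class label per map. Ties are broken in
    favour of the class seen first (an arbitrary choice for this sketch).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


def percent_correct(true_labels, per_map_predictions):
    """Percentage of mixtures whose voted class equals the true class.

    `per_map_predictions[i]` is the list of labels the maps assign to mixture i.
    """
    hits = sum(ensemble_vote(p) == t
               for t, p in zip(true_labels, per_map_predictions))
    return 100.0 * hits / len(true_labels)


# Hypothetical predictions of five maps for a single mixture.
print(ensemble_vote(["A", "C", "A", "G", "A"]))  # A
```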
Figure 4. Representation of the six output layers of a 49×49 CPNN trained with 7578 mixtures of two reactions.
High values of the weights in each output layer are represented by blue, and low values by red. Output layers corresponding to the following reaction types, from left to right: First row – [3+2] photocycloaddition of azirines to C = C and [2+2] photocycloaddition of C = C to C = O reaction types. Second row – [4+2] and [4+4] photocycloaddition of olefins to carbon-only aromatic rings, and [2+2] photocycloaddition of C = C to C = C reaction types. Third row – [3+2] photocycloaddition of s-triazolo[4,3-b]pyridazine to C = C and [2+2] photocycloaddition of C = C to C = S reaction types.
Classification of mixtures of reactions (mixtures of two reactions) by Random Forests.
| Partition* | Data set | % Correct predictions |
| 1 | Training | 99.2 |
| 1 | Test | 99.1 |
| 2 | Training | 99.6 |
| 2 | Test | 80.3 |
*Partition 1: 8280 and 4141 mixtures of reactions in the training and test sets, respectively; Partition 2: 7578 and 593 mixtures of reactions in the training and test sets, respectively.
Confusion matrix for the classification of mixtures obtained by RF for the test set of partition 2.
| | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | % |
| A | 25 | – | 5 | – | – | – | 2 | – | – | – | – | – | – | – | – | 78.1 |
| B | – | 15 | – | – | – | – | – | – | – | 1 | – | – | – | – | – | 93.8 |
| C | 5 | – | 60 | – | – | – | 3 | – | – | – | – | – | – | – | – | 88.2 |
| D | – | – | – | 8 | – | – | – | – | – | – | – | – | – | – | – | 100.0 |
| E | 1 | – | 7 | – | 10 | – | – | – | – | – | – | – | – | 2 | – | 50.0 |
| F | – | – | – | – | – | 28 | 1 | – | – | 3 | – | – | – | – | – | 87.5 |
| G | – | – | 1 | – | – | – | 135 | – | – | – | – | – | – | – | – | 99.3 |
| H | – | – | – | – | – | – | – | 16 | – | – | – | – | – | – | – | 100.0 |
| I | – | – | – | – | – | – | 16 | – | 21 | – | – | – | – | 3 | – | 52.5 |
| J | – | – | – | – | – | 8 | 1 | – | – | 59 | – | – | – | – | – | 86.8 |
| K | – | – | – | – | – | – | – | – | – | – | 8 | – | – | – | – | 100.0 |
| L | – | – | – | – | – | 2 | – | – | – | 7 | – | 11 | – | – | – | 55.0 |
| M | – | – | – | – | – | – | – | – | – | – | – | – | 34 | – | – | 100.0 |
| N | – | – | 3 | – | – | – | 34 | – | 7 | – | – | – | – | 41 | – | 48.2 |
| O | – | – | – | – | – | – | – | – | – | – | – | – | 5 | – | 5 | 50.0 |
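The last column of the matrix can be recomputed from the counts. The sketch below reads each row as a true class and each column as a predicted class, an interpretation inferred here from the row sums matching the Partition 2 test-set sizes rather than stated in the record. Empty cells are omitted, and the function names are hypothetical.

```python
# Rows: true class; inner keys: predicted class (empty cells omitted).
confusion = {
    "A": {"A": 25, "C": 5, "G": 2},
    "B": {"B": 15, "J": 1},
    "C": {"A": 5, "C": 60, "G": 3},
    "D": {"D": 8},
    "E": {"A": 1, "C": 7, "E": 10, "N": 2},
    "F": {"F": 28, "G": 1, "J": 3},
    "G": {"C": 1, "G": 135},
    "H": {"H": 16},
    "I": {"G": 16, "I": 21, "N": 3},
    "J": {"F": 8, "G": 1, "J": 59},
    "K": {"K": 8},
    "L": {"F": 2, "J": 7, "L": 11},
    "M": {"M": 34},
    "N": {"C": 3, "G": 34, "I": 7, "N": 41},
    "O": {"M": 5, "O": 5},
}


def row_percent(matrix, cls):
    """Percentage of mixtures of true class `cls` that were predicted correctly."""
    row = matrix[cls]
    return 100.0 * row.get(cls, 0) / sum(row.values())


def overall_percent(matrix):
    """Percentage of all mixtures that fall on the diagonal."""
    correct = sum(row.get(cls, 0) for cls, row in matrix.items())
    total = sum(sum(row.values()) for row in matrix.values())
    return 100.0 * correct / total


print(round(row_percent(confusion, "A"), 1))   # 78.1
print(round(overall_percent(confusion), 1))    # 80.3
```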
Relationship between the prediction accuracy and the probability associated with each prediction by RFs for the test set of partition 2.
| Classes | No selection: N. of mixtures | No selection: N. of correct (%) | Probability ≥0.5: N. of mixtures | Probability ≥0.5: N. of correct (%) | Probability ≥0.6: N. of mixtures | Probability ≥0.6: N. of correct (%) | Probability ≥0.8: N. of mixtures | Probability ≥0.8: N. of correct (%) |
| A (32) | 31 | 25 (80.7) | 12 | 11 (91.7) | 4 | 4 (100.0) | 2 | 2 (100.0) |
| B (16) | 15 | 15 (100.0) | 7 | 7 (100.0) | 4 | 4 (100.0) | 1 | 1 (100.0) |
| C (68) | 76 | 60 (79.0) | 52 | 50 (96.1) | 41 | 41 (100.0) | 14 | 14 (100.0) |
| D (8) | 8 | 8 (100) | 6 | 6 (100.0) | 4 | 4 (100.0) | 2 | 2 (100.0) |
| E (20) | 10 | 10 (100) | 3 | 3 (100.0) | 3 | 3 (100.0) | – | – |
| F (32) | 38 | 28 (73.7) | 24 | 21 (87.5) | 17 | 16 (94.1) | 6 | 6 (100.0) |
| G (136) | 192 | 135 (70.3) | 130 | 115 (88.5) | 91 | 88 (96.7) | 37 | 37 (100.0) |
| H (16) | 16 | 16 (100.0) | 14 | 14 (100.0) | 14 | 14 (100.0) | 5 | 5 (100.0) |
| I (40) | 28 | 21 (75.0) | 17 | 15 (88.2) | 10 | 8 (80.0) | – | – |
| J (68) | 70 | 59 (84.3) | 43 | 41 (95.4) | 27 | 26 (96.3) | 11 | 10 (90.9) |
| K (8) | 8 | 8 (100.0) | 8 | 8 (100.0) | 7 | 7 (100.0) | 6 | 6 (100.0) |
| L (20) | 11 | 11 (100.0) | 7 | 7 (100.0) | 5 | 5 (100.0) | 1 | 1 (100.0) |
| M (34) | 39 | 34 (87.2) | 30 | 30 (100.0) | 28 | 28 (100.0) | 14 | 14 (100.0) |
| N (85) | 46 | 41 (89.1) | 25 | 25 (100.0) | 13 | 13 (100.0) | 6 | 6 (100.0) |
| O (10) | 5 | 5 (100.0) | 5 | 5 (100.0) | 3 | 3 (100.0) | 2 | 2 (100.0) |
| Total | 593 | 476 (80.3) | 383 | 358 (93.5) | 271 | 264 (97.4) | 107 | 106 (99.1) |
Class labels and number of reactions in each class.
Number of mixtures predicted to belong to each class.
Number of true positives for each class and (in parentheses) its percentage among the number of mixtures predicted to belong to that class.
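The columns of this table can be thought of as the result of discarding predictions whose probability falls below a threshold and then counting, per predicted class, how many remain and how many of those are correct. A minimal sketch follows. The record layout `(true class, predicted class, probability)`, the function name, and the example values are hypothetical. In a random forest the probability is commonly taken as the fraction of trees voting for the predicted class, which is assumed rather than confirmed for this study.

```python
from collections import defaultdict


def reliability_table(records, threshold=0.0):
    """Per predicted class: (number of predictions kept, number correct).

    Only predictions with probability >= threshold are counted.
    """
    predicted = defaultdict(int)
    correct = defaultdict(int)
    for true_cls, pred_cls, prob in records:
        if prob >= threshold:
            predicted[pred_cls] += 1
            if pred_cls == true_cls:
                correct[pred_cls] += 1
    return {c: (n, correct[c]) for c, n in predicted.items()}


# Hypothetical predictions: (true class, predicted class, probability).
records = [
    ("A", "A", 0.90),
    ("C", "A", 0.45),
    ("G", "G", 0.70),
    ("N", "G", 0.55),
    ("I", "G", 0.40),
]

print(reliability_table(records))        # {'A': (2, 1), 'G': (3, 1)}
print(reliability_table(records, 0.5))   # {'A': (1, 1), 'G': (2, 1)}
```

Raising the threshold keeps fewer predictions but a larger share of them is correct, which is the trade-off the table reports.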
Confusion matrix for the classification of mixtures with probability higher than 0.5 obtained by RF for the test set of partition 2.
| | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | % |
| A | 11 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | 100.0 |
| B | – | 7 | – | – | – | – | – | – | – | – | – | – | – | – | – | 100.0 |
| C | 1 | – | 50 | – | – | – | – | – | – | – | – | – | – | – | – | 98.0 |
| D | – | – | – | 6 | – | – | – | – | – | – | – | – | – | – | – | 100.0 |
| E | – | – | 2 | – | 3 | – | – | – | – | – | – | – | – | – | – | 60.0 |
| F | – | – | – | – | – | 21 | – | – | – | 1 | – | – | – | – | – | 95.5 |
| G | – | – | – | – | – | – | 115 | – | – | – | – | – | – | – | – | 100.0 |
| H | – | – | – | – | – | – | – | 14 | – | – | – | – | – | – | – | 100.0 |
| I | – | – | – | – | – | – | 10 | – | 15 | – | – | – | – | – | – | 60.0 |
| J | – | – | – | – | – | 3 | – | – | – | 41 | – | – | – | – | – | 93.2 |
| K | – | – | – | – | – | – | – | – | – | – | 8 | – | – | – | – | 100.0 |
| L | – | – | – | – | – | – | – | – | – | 1 | – | 7 | – | – | – | 87.5 |
| M | – | – | – | – | – | – | – | – | – | – | – | – | 30 | – | – | 100.0 |
| N | – | – | – | – | – | – | 5 | – | 2 | – | – | – | – | 25 | – | 78.1 |
| O | – | – | – | – | – | – | – | – | – | – | – | – | – | – | 5 | 100.0 |
Impact of the ratio of the two reactions in the mixture and the integration normalization on the % of correct predictions.
| RATIO Ai/Bi | % Correct predictions (NORM = 1) | % Correct predictions (0.2≤NORM≤1) |
| 1 ( | 99.1 | – |
| 2 | 96.2 | 75.3 |
| 5 | 82.7 | 70.6 |
The same mixtures of the test set of partition 1 were used, but with different ratios between the two reactions, and different normalization factors in the spectra integration.
Relationship between the prediction accuracy and the probability associated with each prediction by RFs for the test set of partition 1, simulated with simultaneous random variation of three parameters: yields (range 50–100%), NORM (range 0.2–1.0) and RATIO (range 1–4).
| Classes | No selection: N. of mixtures | No selection: N. of correct (%) | Probability ≥0.5: N. of mixtures | Probability ≥0.5: N. of correct (%) | Probability ≥0.6: N. of mixtures | Probability ≥0.6: N. of correct (%) | Probability ≥0.8: N. of mixtures | Probability ≥0.8: N. of correct (%) |
| A (414) | 179 | 178 (99.4) | 141 | 141 (100) | 95 | 95 (100) | 17 | 17 (100) |
| B (266) | 79 | 75 (94.9) | 25 | 25 (100) | 8 | 8 (100) | 2 | 2 (100) |
| C (974) | 1108 | 751 (67.8) | 863 | 642 (74.4) | 693 | 556 (80.2) | 322 | 291 (90.4) |
| D (134) | 62 | 61 (98.4) | 44 | 44 (100) | 36 | 36 (100) | 9 | 9 (100) |
| E (360) | 163 | 146 (89.6) | 96 | 94 (97.9) | 57 | 57 (100) | 6 | 6 (100) |
| F (412) | 116 | 106 (91.4) | 68 | 66 (97.1) | 48 | 48 (100) | 11 | 11 (100) |
| G (1510) | 3327 | 1476 (44.4) | 2600 | 1367 (52.6) | 2208 | 1269 (57.5) | 1250 | 895 (71.6) |
| H (206) | 89 | 89 (100) | 74 | 74 (100) | 61 | 61 (100) | 36 | 36 (100) |
| I (558) | 224 | 221 (98.7) | 153 | 153 (100) | 96 | 96 (100) | 21 | 21 (100) |
| J (972) | 670 | 477 (71.2) | 417 | 325 (77.9) | 313 | 256 (81.8) | 121 | 108 (89.3) |
| K (134) | 36 | 26 (72.2) | 29 | 22 (75.9) | 23 | 21 (91.3) | 9 | 9 (100) |
| L (360) | 122 | 115 (94.3) | 64 | 64 (100) | 37 | 37 (100) | 7 | 7 (100) |
| M (488) | 449 | 288 (64.1) | 352 | 243 (69) | 283 | 200 (70.7) | 196 | 155 (79.1) |
| N (1314) | 1586 | 1046 (66) | 1230 | 880 (71.5) | 964 | 730 (75.7) | 434 | 353 (81.3) |
| O (180) | 72 | 67 (93.1) | 52 | 51 (98.1) | 33 | 33 (100) | 11 | 11 (100) |
| Total | 8282 | 5122 (61.8) | 6208 | 4191 (67.5) | 4955 | 3503 (70.7) | 2452 | 1931 (78.8) |
Class labels and number of reactions in each class.
Number of mixtures predicted to belong to each class.
Number of true positives for each class and (in parentheses) its percentage among the number of mixtures predicted to belong to that class.