Abstract
A priori calculation of thermophysical properties and predictive thermodynamic models can be very helpful for developing new industrial processes. Group contribution methods link the target property to contributions based on chemical groups or other molecular subunits of a given molecule. However, the fragmentation of the molecule into its subunits is usually done manually, which impedes the fast testing and development of new group contribution methods based on large databases of molecules. The aim of this work is to develop strategies to overcome the challenges that arise when attempting to fragment molecules automatically while keeping the definition of the groups as simple as possible. These strategies are implemented in two fragmentation algorithms. The first algorithm finds only one solution, while the second finds all possible fragmentations. Both algorithms are tested by fragmenting a database of more than 20,000 molecules for use with the group contribution model Universal Quasichemical Functional Group Activity Coefficients (UNIFAC). Comparison of the results with a reference database shows that both algorithms are capable of successfully fragmenting all the molecules automatically. Furthermore, applying them to a larger database shows that the newly developed algorithms are capable of fragmenting structures previously thought impossible to fragment.
Keywords: Cheminformatics; Group contribution method; Incrementation; Molecule fragmentation; Property prediction; RDKit; UNIFAC
Year: 2019 PMID: 33430960 PMCID: PMC6701077 DOI: 10.1186/s13321-019-0382-3
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fragmentation scheme developed in this work for the published UNIFAC groups, together with the descriptors of each pattern used for sorting
| Number | Name | SMILES | Descriptor 1 | Descriptor 2 | Descriptor 3 | Descriptor 4 | Descriptor 5 | Descriptor 6 | Descriptor 7 | Descriptor 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CH3 | [CH3;X4] | False | False | 1 | True | 0 | False | 0 | 0 |
| 2 | CH2 | [CH2;X4] | False | False | 1 | False | 0 | False | 0 | 0 |
| 3 | CH | [CH1;X4] | False | False | 1 | False | 0 | False | 0 | 0 |
| 4 | C | [CH0;X4], | False | False | 1 | False | 0 | False | 0 | 0 |
| 5 | CH2=CH | [CH2]=[CH] | False | False | 2 | True | 0 | False | 0 | 1 |
| 6 | CH=CH | [CH]=[CH] | False | False | 2 | False | 0 | False | 0 | 1 |
| 7 | CH2=C | [CH2]=[C], | False | False | 2 | False | 0 | False | 0 | 1 |
| 8 | CH=C | [CH]=[CH0], | False | False | 2 | False | 0 | False | 0 | 1 |
| 9 | ACH | [cH] | False | False | 1 | False | 0 | True | 0 | 0 |
| 10 | AC | [cH0] | False | False | 1 | False | 0 | True | 0 | 0 |
| 11 | ACCH3 | [c][CH3;X4] | False | False | 2 | False | 0 | True | 0 | 0 |
| 12 | ACCH2 | [c][CH2;X4] | False | False | 2 | False | 0 | True | 0 | 0 |
| 13 | ACCH | [c][CH;X4] | False | False | 2 | False | 0 | True | 0 | 0 |
| 14 | OH | [OH] | False | False | 1 | True | 1 | False | 0 | 0 |
| 15 | CH3OH | [CH3][OH] | True | False | 2 | False | 1 | False | 0 | 0 |
| 16 | H2O | [OH2] | True | False | 1 | False | 1 | False | 0 | 0 |
| 17 | ACOH | [c][OH] | False | False | 2 | False | 1 | True | 0 | 0 |
| 18 | CH3CO | [CH3][CH0]=O | False | False | 3 | True | 1 | False | 0 | 1 |
| 19 | CH2CO | [CH2][CH0]=O | False | False | 3 | False | 1 | False | 0 | 1 |
| 20 | CH=O | [CH]=O | False | False | 2 | True | 1 | False | 0 | 1 |
| 21 | CH3COO | [CH3]C(=O)[OH0] | False | False | 4 | True | 2 | False | 0 | 1 |
| 22 | CH2COO | [CH2]C(=O)[OH0] | False | False | 4 | False | 2 | False | 0 | 1 |
| 23 | HCOO | [CH](=O)[OH0] | False | False | 3 | True | 2 | False | 0 | 1 |
| 24 | CH3O | [CH3][OH0] | False | False | 2 | True | 1 | False | 0 | 0 |
| 25 | CH2O | [CH2][OH0] | False | False | 2 | False | 1 | False | 0 | 0 |
| 26 | CHO | [CH][OH0] | False | False | 2 | False | 1 | False | 0 | 0 |
| 27 | THF | [CH2;R][OH0] | False | False | 2 | False | 1 | True | 0 | 0 |
| 28 | CH3NH2 | [CH3][NH2] | True | False | 2 | False | 1 | False | 0 | 0 |
| 29 | CH2NH2 | [CH2][NH2] | False | False | 2 | True | 1 | False | 0 | 0 |
| 30 | CHNH2 | [CH][NH2] | False | False | 2 | False | 1 | False | 0 | 0 |
| 31 | CH3NH | [CH3][NH] | False | False | 2 | True | 1 | False | 0 | 0 |
| 32 | CH2NH | [CH2][NH] | False | False | 2 | False | 1 | False | 0 | 0 |
| 33 | CHNH | [CH][NH] | False | False | 2 | False | 1 | False | 0 | 0 |
| 34 | CH3N | [CH3][N], | False | False | 2 | False | 1 | False | 0 | 0 |
| 35 | CH2N | [CH2][N] | False | False | 2 | False | 1 | False | 0 | 0 |
| 36 | ACNH2 | [c][NH2] | False | False | 2 | False | 1 | True | 0 | 0 |
| 37 | C5H5N | n1[cH][cH][cH][cH][cH]1 | True | False | 6 | False | 1 | True | 0 | 0 |
| 38 | C5H4N | n1[c][cH][cH][cH][cH]1, n1[cH][c][cH][cH][cH]1, n1[cH][cH][c][cH][cH]1 | False | False | 6 | True | 1 | True | 0 | 0 |
| 39 | C5H3N | n1[c][c][cH][cH][cH]1, n1[c][cH][c][cH][cH]1, n1[c][cH][cH][c][cH]1, n1[c][cH][cH][cH][c]1, n1[cH][c][c][cH][cH]1, n1[cH][c][cH][c][cH]1 | False | False | 6 | False | 1 | True | 0 | 0 |
| 40 | CH3CN | [CH3]C#N | True | False | 3 | False | 1 | False | 1 | 0 |
| 41 | CH2CN | [CH2]C#N | False | False | 3 | True | 1 | False | 1 | 0 |
| 42 | COOH | C(=O)[OH] | False | False | 3 | True | 2 | False | 0 | 1 |
| 43 | HCOOH | [CH](=O)[OH] | True | False | 3 | False | 2 | False | 0 | 1 |
| 44 | CH2Cl | [CH2]Cl | False | True | 2 | True | 1 | False | 0 | 0 |
| 45 | CHCl | [CH]Cl | False | True | 2 | False | 1 | False | 0 | 0 |
| 46 | CCl | [CH0]Cl | False | True | 2 | False | 1 | False | 0 | 0 |
| 47 | CH2Cl2 | [CH2](Cl)Cl | True | False | 3 | False | 2 | False | 0 | 0 |
| 48 | CHCl2 | [CH](Cl)Cl | False | True | 3 | True | 2 | False | 0 | 0 |
| 49 | CCl2 | C(Cl)Cl | False | True | 3 | False | 2 | False | 0 | 0 |
| 50 | CHCl3 | [CH](Cl)(Cl)Cl | True | False | 4 | False | 3 | False | 0 | 0 |
| 51 | CCl3 | C(Cl)(Cl)(Cl) | False | True | 4 | True | 3 | False | 0 | 0 |
| 52 | CCl4 | C(Cl)(Cl)(Cl)(Cl) | True | False | 5 | False | 4 | False | 0 | 0 |
| 53 | ACCl | [c]Cl | False | True | 2 | False | 1 | True | 0 | 0 |
| 54 | CH3NO2 | [CH3][N+](=O)[O-] | False | False | 4 | True | 3 | False | 0 | 1 |
| 55 | CH2NO2 | [CH2][N+](=O)[O-] | False | False | 4 | False | 3 | False | 0 | 1 |
| 56 | CHNO2 | [CH][N+](=O)[O-] | False | False | 4 | False | 3 | False | 0 | 1 |
| 57 | ACNO2 | [c][N+](=O)[O-] | False | False | 4 | False | 3 | True | 0 | 1 |
| 58 | CS2 | C(=S)=S | True | False | 3 | False | 2 | False | 0 | 2 |
| 59 | CH3SH | [CH3][SH] | True | False | 2 | False | 1 | False | 0 | 0 |
| 60 | CH2SH | [CH2][SH] | False | False | 2 | True | 1 | False | 0 | 0 |
| 61 | Furfural | O=[CH]c1[cH][cH][cH]o1 | True | False | 7 | False | 2 | True | 0 | 1 |
| 62 | DOH | [OH][CH2][CH2][OH] | True | False | 4 | False | 2 | False | 0 | 0 |
| 63 | I | [IH0] | False | True | 1 | True | 1 | False | 0 | 0 |
| 64 | Br | [BrH0] | False | True | 1 | True | 1 | False | 0 | 0 |
| 65 | CH#C | [CH]#C | False | False | 2 | True | 0 | False | 1 | 0 |
| 66 | C#C | C#C | False | False | 2 | False | 0 | False | 1 | 0 |
| 67 | DMSO | [CH3]S(=O)[CH3] | True | False | 4 | False | 2 | False | 0 | 1 |
| 68 | ACRY | [CH2]=[CH1][C]#N | True | False | 4 | False | 1 | False | 1 | 1 |
| 69 | Cl(C=C) | [$(Cl[C]=[C])] | False | True | 3 | True | 1 | False | 0 | 0 |
| 70 | C=C | [CH0]=[CH0] | False | False | 2 | False | 0 | False | 0 | 1 |
| 71 | ACF | [c]F | False | True | 2 | False | 1 | True | 0 | 0 |
| 72 | DMF | [CH](=O)N([CH3])[CH3] | True | False | 5 | False | 2 | False | 0 | 1 |
| 73 | HCON(CH2)2 | [CH](=O)N([CH2])[CH2], | False | False | 5 | False | 2 | False | 0 | 1 |
| 74 | CF3 | C(F)(F)F | False | True | 4 | True | 3 | False | 0 | 0 |
| 75 | CF2 | C(F)F | False | True | 3 | False | 2 | False | 0 | 0 |
| 76 | CF | [C]F | False | True | 2 | False | 1 | False | 0 | 0 |
| 77 | COO | [CH0](=O)[OH0], | False | False | 3 | False | 2 | False | 0 | 1 |
| 78 | SiH3 | [SiH3] | False | False | 1 | True | 1 | False | 0 | 0 |
| 79 | SiH2 | [SiH2] | False | False | 1 | False | 1 | False | 0 | 0 |
| 80 | SiH | [SiH] | False | False | 1 | False | 1 | False | 0 | 0 |
| 81 | Si | [Si] | False | False | 1 | False | 1 | False | 0 | 0 |
| 82 | SiH2O | [SiH2][OH0] | False | False | 2 | False | 2 | False | 0 | 0 |
| 83 | SiHO | [SiH][OH0] | False | False | 2 | False | 2 | False | 0 | 0 |
| 84 | SiO | [Si][OH0] | False | False | 2 | False | 2 | False | 0 | 0 |
| 85 | NMP | [CH3]N1[CH2][CH2][CH2]C(=O)1 | True | False | 7 | False | 2 | False | 0 | 1 |
| 86 | CCl3F | C(Cl)(Cl)(Cl)F | True | False | 5 | False | 4 | False | 0 | 0 |
| 87 | CCl2F | C(Cl)(Cl)F | False | True | 4 | True | 3 | False | 0 | 0 |
| 88 | HCCl2F | [CH](Cl)(Cl)F | True | False | 4 | False | 3 | False | 0 | 0 |
| 89 | HCClF | [CH](Cl)F | False | True | 3 | True | 2 | False | 0 | 0 |
| 90 | CClF2 | C(Cl)(F)F | False | True | 4 | True | 3 | False | 0 | 0 |
| 91 | HCClF2 | [CH](Cl)(F)F | True | False | 4 | False | 3 | False | 0 | 0 |
| 92 | CClF3 | C(Cl)(F)(F)F | True | False | 5 | False | 4 | False | 0 | 0 |
| 93 | CCl2F2 | C(Cl)(Cl)(F)F | True | False | 5 | False | 4 | False | 0 | 0 |
| 94 | CONH2 | C(=O)[NH2] | False | False | 3 | True | 2 | False | 0 | 1 |
| 95 | CONHCH3 | C(=O)[NH][CH3] | False | False | 4 | True | 2 | False | 0 | 1 |
| 96 | CONHCH2 | C(=O)[NH][CH2] | False | False | 4 | False | 2 | False | 0 | 1 |
| 97 | CON(CH3)2 | C(=O)N([CH3])[CH3] | False | False | 5 | True | 2 | False | 0 | 1 |
| 98 | CONCH3CH2 | C(=O)N([CH3])[CH2] | False | False | 5 | False | 2 | False | 0 | 1 |
| 99 | CON(CH2)2 | C(=O)N([CH2])[CH2] | False | False | 5 | False | 2 | False | 0 | 1 |
| 100 | C2H5O2 | [OH0; | False | False | 4 | True | 2 | False | 0 | 0 |
| 101 | C2H4O2 | [OH0; | False | False | 4 | False | 2 | False | 0 | 0 |
| 102 | CH3S | [CH3]S | False | False | 2 | True | 1 | False | 0 | 0 |
| 103 | CH2S | [CH2]S | False | False | 2 | False | 1 | False | 0 | 0 |
| 104 | CHS | [CH]S | False | False | 2 | False | 1 | False | 0 | 0 |
| 105 | MORPH | [CH2]1[CH2][NH][CH2][CH2]O1 | True | False | 6 | False | 2 | False | 0 | 0 |
| 106 | C4H4S | [cH]1[cH][s;X2][cH][cH]1 | True | False | 5 | False | 1 | True | 0 | 0 |
| 107 | C4H3S | [c]1[cH][s;X2][cH][cH]1, [cH]1[c][s;X2][cH][cH]1 | False | False | 5 | True | 1 | True | 0 | 0 |
| 108 | C4H2S | [c]1[c][s;X2][cH][cH]1, [c]1[cH][s;X2][cH][c]1, [cH]1[c][s;X2][c][cH]1, [cH]1[c][s;X2][cH][c]1 | False | False | 5 | False | 1 | True | 0 | 0 |
| 109 | NCO | N=C=O | False | False | 3 | True | 2 | False | 0 | 2 |
| 118 | (CH2)2SU | [CH2]S(=O)(=O)[CH2] | False | False | 5 | False | 3 | False | 0 | 2 |
| 119 | CH2CHSU | [CH2]S(=O)(=O)[CH] | False | False | 5 | False | 3 | False | 0 | 2 |
In the group names, AC stands for an aromatic carbon atom. The group names are based on the original UNIFAC names as described on their webpage [44]. If several patterns were employed to find one group, they are shown separated by commas. The underlined patterns were added to improve the agreement between the results of the algorithm and the reference database. The descriptor values for each group, as described in the “Simple fragmentation” section, are also shown in this table. For sorting, the boolean descriptor values can be replaced by integer values (True: 1, False: 0). Descriptors: 1: whether the pattern has zero bonds; 2: whether the pattern is simple; 3: number of atoms defining the group; 4: whether the number of available bonds is one (first the patterns with one bond, then the patterns with more bonds); 5: number of atoms in the pattern that are neither hydrogen nor carbon; 6: whether the pattern includes atoms in a ring; 7: number of triple bonds; 8: number of double bonds.
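The sorting described in the table note can be sketched in a few lines of Python. The descriptor tuples below are copied from the table for six groups. The sort direction is an assumption: the note only states, for descriptor 4, that patterns with one bond come first, so this sketch treats every descriptor the same way and sorts in descending order. The names `DESCRIPTORS`, `sort_key`, and `search_order` are hypothetical.

```python
# Descriptor tuples copied from the table for a handful of groups.
# Order: (1 zero bonds, 2 simple, 3 number of atoms, 4 one available bond,
#         5 atoms that are neither H nor C, 6 ring, 7 triple bonds, 8 double bonds)
DESCRIPTORS = {
    "CH3":   (False, False, 1, True,  0, False, 0, 0),
    "CH2":   (False, False, 1, False, 0, False, 0, 0),
    "OH":    (False, False, 1, True,  1, False, 0, 0),
    "CH3OH": (True,  False, 2, False, 1, False, 0, 0),
    "COOH":  (False, False, 3, True,  2, False, 0, 1),
    "CH2Cl": (False, True,  2, True,  1, False, 0, 0),
}


def sort_key(name):
    # Booleans become 1/0 so that the whole tuple compares numerically.
    return tuple(int(value) for value in DESCRIPTORS[name])


# ASSUMPTION: larger descriptor values are searched first (descending sort).
search_order = sorted(DESCRIPTORS, key=sort_key, reverse=True)
print(search_order)
```

Under this assumption, the whole-molecule pattern CH3OH is searched before any of its parts, and COOH is searched before OH, so that a larger group is tried before the smaller groups it contains.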
Fig. 1 Example of a molecule with different functional groups where non-unique group assignment is possible. The groups identified are marked by the dotted line. Depending on where the algorithm starts to assign the groups, the result of the fragmentation is different. If the molecule is fragmented starting from left to right, the result might be the one shown in a, while if it is fragmented from right to left, the result might be as shown in b. SMILES: C[NH]C(=O)OC
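The order dependence described in this caption can be reproduced with a toy example that needs no cheminformatics library. The candidate matches are listed by hand for C[NH]C(=O)OC, using group names from the fragmentation scheme; a real implementation would obtain them from a substructure search. The function `greedy_assign` is a hypothetical illustration, not the algorithm of the paper, and the two results need not coincide with panels a and b.

```python
# Heavy atoms of C[NH]C(=O)OC in SMILES order:
# 0: CH3, 1: NH, 2: C, 3: =O, 4: O, 5: CH3
N_ATOMS = 6

# Hand-listed candidate matches (group name, atom indices). Illustrative only.
MATCHES = [
    ("CH3NH",   {0, 1}),
    ("CONHCH3", {0, 1, 2, 3}),
    ("COO",     {2, 3, 4}),
    ("CH3O",    {4, 5}),
    ("CH3",     {0}),
    ("CH3",     {5}),
]


def greedy_assign(matches):
    """Accept each match in the given order unless it overlaps earlier ones."""
    used, groups = set(), []
    for name, atoms in matches:
        if used.isdisjoint(atoms):
            used |= atoms
            groups.append(name)
    complete = used == set(range(N_ATOMS))
    return sorted(groups), complete


# Two traversal orders: by lowest atom index, and by highest atom index.
left_to_right = sorted(MATCHES, key=lambda m: min(m[1]))
right_to_left = sorted(MATCHES, key=lambda m: -max(m[1]))

print(greedy_assign(left_to_right))   # (['CH3', 'CH3NH', 'COO'], True)
print(greedy_assign(right_to_left))   # (['CH3O', 'CONHCH3'], True)
```

Both traversals cover every atom, yet they yield different group lists, which is the ambiguity a fixed pattern order is meant to remove.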
Fig. 2 Example of a molecule with different functional groups where incomplete group assignment is possible. The groups identified are marked by the dotted line. The chlorine atom cannot be assigned to a group from the fragmentation scheme. SMILES: c1c(Cl)c([OH])ccc1
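A small sketch shows how such an incomplete assignment can arise. The matches for c1c(Cl)c([OH])ccc1 are listed by hand, and the two search orders are assumptions chosen for illustration: one tries single-atom groups first, the other tries the larger groups first. The function `fragment` is hypothetical and only reports which atoms remain unassigned.

```python
# Heavy atoms of c1c(Cl)c([OH])ccc1 in SMILES order:
# 0: cH, 1: c, 2: Cl, 3: c, 4: OH, 5: cH, 6: cH, 7: cH
ALL_ATOMS = set(range(8))

# Hand-listed matches per group (illustrative, not from a substructure search).
MATCHES = {
    "ACH":  [{0}, {5}, {6}, {7}],
    "AC":   [{1}, {3}],
    "OH":   [{4}],
    "ACCl": [{1, 2}],
    "ACOH": [{3, 4}],
}


def fragment(group_order):
    """Greedy assignment; returns (group counts, atoms left unassigned)."""
    used, counts = set(), {}
    for name in group_order:
        for atoms in MATCHES[name]:
            if used.isdisjoint(atoms):
                used |= atoms
                counts[name] = counts.get(name, 0) + 1
    return counts, ALL_ATOMS - used


small_first = ["ACH", "AC", "OH", "ACCl", "ACOH"]
large_first = ["ACCl", "ACOH", "OH", "ACH", "AC"]

print(fragment(small_first))  # chlorine (atom 2) is left over
print(fragment(large_first))  # every atom is assigned
```

When AC claims the ring carbon before ACCl is tried, the chlorine has no group left to join, because the scheme contains no standalone chlorine group. Trying the larger pattern first avoids the leftover atom in this example.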
Fig. 3 Schematic representation of the simple fragmentation algorithm
Fig. 4 Schematic representation of the complete fragmentation algorithm
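The idea of finding all possible fragmentations can be illustrated with a generic backtracking search over non-overlapping matches. This is a minimal sketch and not the complete algorithm of the paper, which is shown only schematically here. The matches for C[NH]C(=O)OC are listed by hand, and `all_fragmentations` is a hypothetical helper.

```python
# Heavy atoms of C[NH]C(=O)OC in SMILES order:
# 0: CH3, 1: NH, 2: C, 3: =O, 4: O, 5: CH3
MATCHES = [
    ("CH3NH",   frozenset({0, 1})),
    ("CONHCH3", frozenset({0, 1, 2, 3})),
    ("COO",     frozenset({2, 3, 4})),
    ("CH3O",    frozenset({4, 5})),
    ("CH3",     frozenset({0})),
    ("CH3",     frozenset({5})),
]


def all_fragmentations(n_atoms, matches):
    """Return every set of non-overlapping matches that covers all atoms."""
    solutions = []

    def extend(covered, chosen):
        if len(covered) == n_atoms:
            solutions.append(sorted(chosen))
            return
        # Always cover the lowest free atom next, so no solution repeats.
        target = min(set(range(n_atoms)) - covered)
        for name, atoms in matches:
            if target in atoms and covered.isdisjoint(atoms):
                extend(covered | atoms, chosen + [name])

    extend(frozenset(), [])
    return solutions


for solution in all_fragmentations(6, MATCHES):
    print(solution)
```

For this molecule the search returns two complete fragmentations. The number of branches grows quickly with the number of atoms and overlapping matches, so an exhaustive search of this kind is practical only for small molecules.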
Results of the fragmentation with both algorithms on the reference database
| Algorithm | Sorted patterns? | N_SMILES | N_fragmented (%) | N_likeRefDB (%) |
|---|---|---|---|---|
| Simple | Yes | 28,678 | 28,677 (> 99.9%) | 28,305 (98.7%) |
| Simple | No | 28,678 | 18,969 (66.1%) | 14,493 (50.5%) |
| Complete | Yes | 24,336 | 24,335 (> 99.9%) | 22,084 (90.7%) |
| Complete | No | 24,336 | 24,335 (> 99.9%) | 18,532 (76.1%) |
For the complete algorithm, only the molecules with 20 or fewer heavy atoms were fragmented.
Fig. 5 The only molecule that could not be fragmented. SMILES: C1=CN=CC#C1