
Flexible heuristic algorithm for automatic molecule fragmentation: application to the UNIFAC group contribution model.

Simon Müller

Abstract

A priori calculation of thermophysical properties and predictive thermodynamic models can be very helpful for developing new industrial processes. Group contribution methods link the target property to contributions based on chemical groups or other molecular subunits of a given molecule. However, the fragmentation of the molecule into its subunits is usually done manually, which impedes the fast testing and development of new group contribution methods based on large databases of molecules. The aim of this work is to develop strategies to overcome the challenges that arise when attempting to fragment molecules automatically while keeping the definition of the groups as simple as possible. Furthermore, these strategies are implemented in two fragmentation algorithms. The first algorithm finds only one solution while the second algorithm finds all possible fragmentations. Both algorithms are tested by fragmenting a database of 20,000+ molecules for use with the group contribution model Universal Quasichemical Functional Group Activity Coefficients (UNIFAC). Comparison of the results with a reference database shows that both algorithms are capable of successfully fragmenting all the molecules automatically. Furthermore, when applying them to a larger database, it is shown that the newly developed algorithms are capable of fragmenting structures previously thought impossible to fragment.


Keywords:  Cheminformatics; Group contribution method; Incrementation; Molecule fragmentation; Property prediction; RDKit; UNIFAC

Year:  2019        PMID: 33430960      PMCID: PMC6701077          DOI: 10.1186/s13321-019-0382-3

Source DB:  PubMed          Journal:  J Cheminform        ISSN: 1758-2946            Impact factor:   5.514


Introduction

Cheminformatics is a growing field due to increasing computational capabilities and improvements in the accuracy of its predictions. The chemical space is vast, and the number of molecules that can be produced with new and, in some cases, even automated synthesis routes is increasing. However, before investing resources into synthesizing and characterizing molecules, a predictive approach for their properties would help narrow down the possible candidates. In addition, for the application of thermodynamic models or the a priori calculation of thermophysical properties, predictive methods can be helpful and in some cases even necessary. These methods, which relate properties to molecular structures, are usually named QSPR methods (Quantitative Structure Property Relationship). One subgroup of these models is the group contribution method. The idea behind this method is to divide the value of a property of the complete molecule into contributions based on its chemical groups or other molecular subunits. Group contribution models have been successfully applied to a wide variety of properties including density [1, 2], critical properties [3-5], enthalpy of vaporization [6], normal boiling points [7, 8], water-octanol partition coefficients [9-11], infinite dilution activity coefficients [12] and many more. Also, for Gibbs excess energy models [13-15] and equations of state [16-19], they provide an approach that allows widening their application range to molecules composed of the same chemical groups relatively easily. However, in the development and application of these models a manual mapping of the groups has to be performed in most cases. This can hinder the fast development and testing of possible different group combinations, especially for larger numbers of molecules. Jochelson [20], in 1968, already described a simple automatic routine for substructure counting.
Most research since then [21-28] has focused more on describing algorithms for substructure search, ring perception and aromaticity perception. In a recent paper, Ertl [29] proposed a new algorithm for automatic chemical group definition based on a large database. Fortunately, most current cheminformatics toolkits already include search and perception features, which makes it possible to create new advanced fragmentation algorithms that focus on other problems. One of the free tools offered online for structure analysis is Checkmol [28, 30]. It is an open-source program for finding a defined set of functional groups within a molecular structure. However, it only checks whether a group is present without counting its occurrences. Przemieniecki [31] developed an implementation of UNIFAC with automatic group fragmentation by means of a non-standardized way of specifying the fragmentation scheme. Some other free web services that allow a complete automatic fragmentation of molecules also exist, including the ones from the companies DDBST GmbH [32] and Xemistry GmbH [33]. In the first case, fragmentation is limited to the schemes supported by the webpage. In the second case, it is possible to provide custom fragmentation rules, allowing for fragmentation using different schemes. However, the terms of use only allow manual use of the website and do not permit using the results in commercial applications. Furthermore, knowing how the algorithm works would make it possible to debug it, find errors and improve it. Tools that implement group contribution models like Octopus [34], thermo [35] or UManSysProp [36] would largely benefit from an improved, flexible, automated fragmentation algorithm that can handle complex molecules and is based on standardized ways to define the fragmentation scheme.
The goal of this work is to provide flexible algorithms that only need a simple fragmentation scheme based on the SMARTS language [37] which is easy to use for the rapid development and testing of group contribution methods on larger datasets.

Challenges of automatic fragmentation

Several challenges like non-unique group assignment, incomplete group assignment and the composition of the fragmentation scheme itself can arise when developing an automatic fragmentation algorithm. These will be discussed in more detail in this section. The examples described are based on the fragmentation scheme from Table 1.
Table 1

Fragmentation scheme developed in this work for the published UNIFAC groups and the respective pattern descriptors used for sorting

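Number | Name | SMARTS pattern(s) | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8
1 | CH3 | [CH3;X4] | False | False | 1 | True | 0 | False | 0 | 0
2 | CH2 | [CH2;X4] | False | False | 1 | False | 0 | False | 0 | 0
3 | CH | [CH1;X4] | False | False | 1 | False | 0 | False | 0 | 0
4 | C | [CH0;X4], [CH0;X3] | False | False | 1 | False | 0 | False | 0 | 0
5 | CH2=CH | [CH2]=[CH] | False | False | 2 | True | 0 | False | 0 | 1
6 | CH=CH | [CH]=[CH] | False | False | 2 | False | 0 | False | 0 | 1
7 | CH2=C | [CH2]=[C], [CH2]=[c] | False | False | 2 | False | 0 | False | 0 | 1
8 | CH=C | [CH]=[CH0], [CH]=[cH0] | False | False | 2 | False | 0 | False | 0 | 1
9 | ACH | [cH] | False | False | 1 | False | 0 | True | 0 | 0
10 | AC | [cH0] | False | False | 1 | False | 0 | True | 0 | 0
11 | ACCH3 | [c][CH3;X4] | False | False | 2 | False | 0 | True | 0 | 0
12 | ACCH2 | [c][CH2;X4] | False | False | 2 | False | 0 | True | 0 | 0
13 | ACCH | [c][CH;X4] | False | False | 2 | False | 0 | True | 0 | 0
14 | OH | [OH] | False | False | 1 | True | 1 | False | 0 | 0
15 | CH3OH | [CH3][OH] | True | False | 2 | False | 1 | False | 0 | 0
16 | H2O | [OH2] | True | False | 1 | False | 1 | False | 0 | 0
17 | ACOH | [c][OH] | False | False | 2 | False | 1 | True | 0 | 0
18 | CH3CO | [CH3][CH0]=O | False | False | 3 | True | 1 | False | 0 | 1
19 | CH2CO | [CH2][CH0]=O | False | False | 3 | False | 1 | False | 0 | 1
20 | CH=O | [CH]=O | False | False | 2 | True | 1 | False | 0 | 1
21 | CH3COO | [CH3]C(=O)[OH0] | False | False | 4 | True | 2 | False | 0 | 1
22 | CH2COO | [CH2]C(=O)[OH0] | False | False | 4 | False | 2 | False | 0 | 1
23 | HCOO | [CH](=O)[OH0] | False | False | 3 | True | 2 | False | 0 | 1
24 | CH3O | [CH3][OH0] | False | False | 2 | True | 1 | False | 0 | 0
25 | CH2O | [CH2][OH0] | False | False | 2 | False | 1 | False | 0 | 0
26 | CHO | [CH][OH0] | False | False | 2 | False | 1 | False | 0 | 0
27 | THF | [CH2;R][OH0] | False | False | 2 | False | 1 | True | 0 | 0
28 | CH3NH2 | [CH3][NH2] | True | False | 2 | False | 1 | False | 0 | 0
29 | CH2NH2 | [CH2][NH2] | False | False | 2 | True | 1 | False | 0 | 0
30 | CHNH2 | [CH][NH2] | False | False | 2 | False | 1 | False | 0 | 0
31 | CH3NH | [CH3][NH] | False | False | 2 | True | 1 | False | 0 | 0
32 | CH2NH | [CH2][NH] | False | False | 2 | False | 1 | False | 0 | 0
33 | CHNH | [CH][NH] | False | False | 2 | False | 1 | False | 0 | 0
34 | CH3N | [CH3][N], [CH3][n] | False | False | 2 | False | 1 | False | 0 | 0
35 | CH2N | [CH2][N] | False | False | 2 | False | 1 | False | 0 | 0
36 | ACNH2 | [c][NH2] | False | False | 2 | False | 1 | True | 0 | 0
37 | C5H5N | n1[cH][cH][cH][cH][cH]1 | True | False | 6 | False | 1 | True | 0 | 0
38 | C5H4N | n1[c][cH][cH][cH][cH]1, n1[cH][c][cH][cH][cH]1, n1[cH][cH][c][cH][cH]1 | False | False | 6 | True | 1 | True | 0 | 0
39 | C5H3N | n1[c][c][cH][cH][cH]1, n1[c][cH][c][cH][cH]1, n1[c][cH][cH][c][cH]1, n1[c][cH][cH][cH][c]1, n1[cH][c][c][cH][cH]1, n1[cH][c][cH][c][cH]1 | False | False | 6 | False | 1 | True | 0 | 0
40 | CH3CN | [CH3]C#N | True | False | 3 | False | 1 | False | 1 | 0
41 | CH2CN | [CH2]C#N | False | False | 3 | True | 1 | False | 1 | 0
42 | COOH | C(=O)[OH] | False | False | 3 | True | 2 | False | 0 | 1
43 | HCOOH | [CH](=O)[OH] | True | False | 3 | False | 2 | False | 0 | 1
44 | CH2Cl | [CH2]Cl | False | True | 2 | True | 1 | False | 0 | 0
45 | CHCl | [CH]Cl | False | True | 2 | False | 1 | False | 0 | 0
46 | CCl | [CH0]Cl | False | True | 2 | False | 1 | False | 0 | 0
47 | CH2Cl2 | [CH2](Cl)Cl | True | False | 3 | False | 2 | False | 0 | 0
48 | CHCl2 | [CH](Cl)Cl | False | True | 3 | True | 2 | False | 0 | 0
49 | CCl2 | C(Cl)Cl | False | True | 3 | False | 2 | False | 0 | 0
50 | CHCl3 | [CH](Cl)(Cl)Cl | True | False | 4 | False | 3 | False | 0 | 0
51 | CCl3 | C(Cl)(Cl)(Cl) | False | True | 4 | True | 3 | False | 0 | 0
52 | CCl4 | C(Cl)(Cl)(Cl)(Cl) | True | False | 5 | False | 4 | False | 0 | 0
53 | ACCl | [c]Cl | False | True | 2 | False | 1 | True | 0 | 0
54 | CH3NO2 | [CH3][N+](=O)[O-] | False | False | 4 | True | 3 | False | 0 | 1
55 | CH2NO2 | [CH2][N+](=O)[O-] | False | False | 4 | False | 3 | False | 0 | 1
56 | CHNO2 | [CH][N+](=O)[O-] | False | False | 4 | False | 3 | False | 0 | 1
57 | ACNO2 | [c][N+](=O)[O-] | False | False | 4 | False | 3 | True | 0 | 1
58 | CS2 | C(=S)=S | True | False | 3 | False | 2 | False | 0 | 2
59 | CH3SH | [CH3][SH] | True | False | 2 | False | 1 | False | 0 | 0
60 | CH2SH | [CH2][SH] | False | False | 2 | True | 1 | False | 0 | 0
61 | Furfural | O=[CH]c1[cH][cH][cH]o1 | True | False | 7 | False | 2 | True | 0 | 1
62 | DOH | [OH][CH2][CH2][OH] | True | False | 4 | False | 2 | False | 0 | 0
63 | I | [IH0] | False | True | 1 | True | 1 | False | 0 | 0
64 | Br | [BrH0] | False | True | 1 | True | 1 | False | 0 | 0
65 | CH#C | [CH]#C | False | False | 2 | True | 0 | False | 1 | 0
66 | C#C | C#C | False | False | 2 | False | 0 | False | 1 | 0
67 | DMSO | [CH3]S(=O)[CH3] | True | False | 4 | False | 2 | False | 0 | 1
68 | ACRY | [CH2]=[CH1][C]#N | True | False | 4 | False | 1 | False | 1 | 1
69 | Cl(C=C) | [$(Cl[C]=[C])] | False | True | 3 | True | 1 | False | 0 | 0
70 | C=C | [CH0]=[CH0] | False | False | 2 | False | 0 | False | 0 | 1
71 | ACF | [c]F | False | True | 2 | False | 1 | True | 0 | 0
72 | DMF | [CH](=O)N([CH3])[CH3] | True | False | 5 | False | 2 | False | 0 | 1
73 | HCON(CH2)2 | [CH](=O)N([CH2])[CH2], [CH](=O)N([CH2])[CH3] | False | False | 5 | False | 2 | False | 0 | 1
74 | CF3 | C(F)(F)F | False | True | 4 | True | 3 | False | 0 | 0
75 | CF2 | C(F)F | False | True | 3 | False | 2 | False | 0 | 0
76 | CF | [C]F | False | True | 2 | False | 1 | False | 0 | 0
77 | COO | [CH0](=O)[OH0], [cH0](=O)[oH0] | False | False | 3 | False | 2 | False | 0 | 1
78 | SiH3 | [SiH3] | False | False | 1 | True | 1 | False | 0 | 0
79 | SiH2 | [SiH2] | False | False | 1 | False | 1 | False | 0 | 0
80 | SiH | [SiH] | False | False | 1 | False | 1 | False | 0 | 0
81 | Si | [Si] | False | False | 1 | False | 1 | False | 0 | 0
82 | SiH2O | [SiH2][OH0] | False | False | 2 | False | 2 | False | 0 | 0
83 | SiHO | [SiH][OH0] | False | False | 2 | False | 2 | False | 0 | 0
84 | SiO | [Si][OH0] | False | False | 2 | False | 2 | False | 0 | 0
85 | NMP | [CH3]N1[CH2][CH2][CH2]C(=O)1 | True | False | 7 | False | 2 | False | 0 | 1
86 | CCl3F | C(Cl)(Cl)(Cl)F | True | False | 5 | False | 4 | False | 0 | 0
87 | CCl2F | C(Cl)(Cl)F | False | True | 4 | True | 3 | False | 0 | 0
88 | HCCl2F | [CH](Cl)(Cl)F | True | False | 4 | False | 3 | False | 0 | 0
89 | HCClF | [CH](Cl)F | False | True | 3 | True | 2 | False | 0 | 0
90 | CClF2 | C(Cl)(F)F | False | True | 4 | True | 3 | False | 0 | 0
91 | HCClF2 | [CH](Cl)(F)F | True | False | 4 | False | 3 | False | 0 | 0
92 | CClF3 | C(Cl)(F)(F)F | True | False | 5 | False | 4 | False | 0 | 0
93 | CCl2F2 | C(Cl)(Cl)(F)F | True | False | 5 | False | 4 | False | 0 | 0
94 | CONH2 | C(=O)[NH2] | False | False | 3 | True | 2 | False | 0 | 1
95 | CONHCH3 | C(=O)[NH][CH3] | False | False | 4 | True | 2 | False | 0 | 1
96 | CONHCH2 | C(=O)[NH][CH2] | False | False | 4 | False | 2 | False | 0 | 1
97 | CON(CH3)2 | C(=O)N([CH3])[CH3] | False | False | 5 | True | 2 | False | 0 | 1
98 | CONCH3CH2 | C(=O)N([CH3])[CH2] | False | False | 5 | False | 2 | False | 0 | 1
99 | CON(CH2)2 | C(=O)N([CH2])[CH2] | False | False | 5 | False | 2 | False | 0 | 1
100 | C2H5O2 | [OH0;!$(OC=O);!R][CH2;!R][CH2;!R][OH] | False | False | 4 | True | 2 | False | 0 | 0
101 | C2H4O2 | [OH0;!$(OC=O);!R][CH;!R][CH2;!R][OH], [OH0;!$(OC=O);!R][CH2;!R][CH;!R][OH] | False | False | 4 | False | 2 | False | 0 | 0
102 | CH3S | [CH3]S | False | False | 2 | True | 1 | False | 0 | 0
103 | CH2S | [CH2]S | False | False | 2 | False | 1 | False | 0 | 0
104 | CHS | [CH]S | False | False | 2 | False | 1 | False | 0 | 0
105 | MORPH | [CH2]1[CH2][NH][CH2][CH2]O1 | True | False | 6 | False | 2 | False | 0 | 0
106 | C4H4S | [cH]1[cH][s;X2][cH][cH]1 | True | False | 5 | False | 1 | True | 0 | 0
107 | C4H3S | [c]1[cH][s;X2][cH][cH]1, [cH]1[c][s;X2][cH][cH]1 | False | False | 5 | True | 1 | True | 0 | 0
108 | C4H2S | [c]1[c][s;X2][cH][cH]1, [c]1[cH][s;X2][cH][c]1, [cH]1[c][s;X2][c][cH]1, [cH]1[c][s;X2][cH][c]1 | False | False | 5 | False | 1 | True | 0 | 0
109 | NCO | N=C=O | False | False | 3 | True | 2 | False | 0 | 2
118 | (CH2)2SU | [CH2]S(=O)(=O)[CH2] | False | False | 5 | False | 3 | False | 0 | 2
119 | CH2CHSU | [CH2]S(=O)(=O)[CH] | False | False | 5 | False | 3 | False | 0 | 2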

In the group names, AC stands for an aromatic carbon atom. The names of the groups are based on the original UNIFAC names as described on their webpage [44]. If several patterns were employed to find one group, they are shown separated by commas. The underlined patterns were added to improve the matching of the algorithm in comparison to the results of the reference database. The values of the descriptors for each group, as described in the “Simple fragmentation” section, are also shown in this table. For sorting, the boolean descriptor values can be replaced by integer values (True: 1, False: 0). Descriptors: 1: whether the pattern has zero bonds; 2: whether the pattern is simple; 3: number of atoms defining the group; 4: whether the number of available bonds is one (first the patterns with one bond, then patterns with more bonds); 5: number of atoms in the pattern that are neither hydrogen nor carbon; 6: whether the pattern includes atoms in a ring; 7: number of triple bonds; 8: number of double bonds.


Non-unique group assignment

For the assignment of the groups several solutions might be possible. The order in which the different groups are searched has an influence. For example, an ACOH group (hydroxyl bound to an aromatic carbon atom) can be recognized as such or fragmented into an aromatic carbon (AC) and a hydroxyl (OH) group. Furthermore, depending on the order in which the non-overlapping fragmentation is performed on the molecule structure, different results might be attained. For example, if a molecule is fragmented starting from left to right (Fig. 1a), the result obtained can be different from the one obtained if the molecule is fragmented from right to left (Fig. 1b).
Fig. 1

Example of a molecule with different functional groups where non-unique group assignment is possible. The groups identified are marked by the dotted line. Depending on where the algorithm starts to assign the groups, the result of the fragmentation is different. If the molecule is fragmented starting from left to right, the result might be the one shown in a, while if it is fragmented from right to left, the result might be as shown in b. SMILES: C[NH]C(=O)OC

In these cases, the algorithm must either deliver the correct fragmentation as a first solution or find all solutions and then specify how to choose the correct one.

Incomplete group assignment

This case occurs when it is not possible to assign one or more atoms to a specific group. In some cases, the order of the groups searched can also lead to this situation. For example, in Fig. 2 if the AC groups (aromatic carbon) are searched first, the remaining chlorine atom cannot be assigned to any other functional group from the fragmentation scheme. In other cases, there will be molecules with atoms or functional groups that are just not defined in the fragmentation scheme. However, in most cases where the fragmentation is possible, this issue can be avoided if the algorithm specifies the order in which the functional groups are searched.
Fig. 2

Example of a molecule with different functional groups where incomplete group assignment is possible. The groups identified are marked by the dotted line. The chlorine atom cannot be assigned to a group from the fragmentation scheme. SMILES: c1c(Cl)c([OH])ccc1


The fragmentation scheme

Defining the fragmentation scheme is decidedly important for the accuracy of the algorithm. If the groups were defined to target very specific functional groups or to avoid overlap with other groups, non-unique or incomplete group assignments would be minimized. A lot of time and testing can be invested in developing highly specific patterns for any given group contribution method, as has already been done for UNIFAC by Salmina et al. [38]. However, if the algorithm includes a way to prioritize the groups from the fragmentation scheme, in most cases the groups do not have to be highly specific, so more time can be spent on developing different fragmentation schemes instead of refining one specific scheme.

Strategies to overcome the challenges

To overcome the challenges described in the section “Challenges of automatic fragmentation”, three features were implemented in this work:

Heuristic group prioritization

The patterns of the fragmentation scheme are sorted based on a set of heuristically determined descriptors. These descriptors can be, for example, the number of atoms describing the pattern, the number of bonds available or the number of double bonds.
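
As a minimal sketch of this prioritization, the snippet below sorts a handful of groups by their descriptor tuples, using the descriptor values listed for them in Table 1. It assumes that the descriptors are compared in the listed order (descriptor 1 first, descriptor 8 last) and in descending order, with booleans mapped to 1 and 0; the names `scheme`, `sort_key` and `search_order` are illustrative and not taken from the published code.

```python
from typing import List, Tuple

# (name, descriptors 1..8) for a few groups, values copied from Table 1
Descriptors = Tuple[bool, bool, int, bool, int, bool, int, int]

scheme: List[Tuple[str, Descriptors]] = [
    ("CH2",   (False, False, 1, False, 0, False, 0, 0)),
    ("CH3",   (False, False, 1, True,  0, False, 0, 0)),
    ("OH",    (False, False, 1, True,  1, False, 0, 0)),
    ("ACOH",  (False, False, 2, False, 1, True,  0, 0)),
    ("CH2Cl", (False, True,  2, True,  1, False, 0, 0)),
    ("CH3OH", (True,  False, 2, False, 1, False, 0, 0)),
]


def sort_key(entry: Tuple[str, Descriptors]) -> Tuple[int, ...]:
    """Map booleans to integers (True: 1, False: 0) so tuples compare numerically."""
    _, descriptors = entry
    return tuple(int(value) for value in descriptors)


# Assumption: lexicographic comparison of the descriptor tuple, descending.
sorted_scheme = sorted(scheme, key=sort_key, reverse=True)
search_order = [name for name, _ in sorted_scheme]
print(search_order)
```

With this ordering, a pattern that describes a whole molecule (zero bonds, such as CH3OH) is searched before any of its parts, and ACOH is searched before OH.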

Parent–child group prioritization

The complete fragmentation scheme is analyzed to find patterns that are contained within others. E.g. CH2 is contained in CONHCH2. Whenever searching for a specific pattern, if the group has such a parent pattern, the parent pattern is searched first. After that, the child pattern is searched.
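
The ordering rule can be sketched as follows, assuming the parent relationships have already been determined by a substructure test between patterns. The `parents` mapping and the function name `parent_first_order` are hypothetical; only the CH2/CONHCH2 relationship is taken from the text, and the CH3/CONHCH3 pair is added by analogy.

```python
from typing import Dict, List


def parent_first_order(names: List[str], parents: Dict[str, List[str]]) -> List[str]:
    """Return a search order in which every pattern is preceded by its parents."""
    order: List[str] = []
    seen = set()

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        for parent in parents.get(name, []):
            visit(parent)  # parents are emitted before the child
        order.append(name)

    for name in names:
        visit(name)
    return order


# Hypothetical input: child pattern -> patterns that contain it
parents = {"CH2": ["CONHCH2"], "CH3": ["CONHCH3"]}
names = ["CH3", "CH2", "CONHCH3", "CONHCH2"]
order = parent_first_order(names, parents)
print(order)
```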

Adjacent group search

To avoid incomplete group assignments, whenever a part of the structure is already fragmented, the subsequent matches have to be adjacent to the groups already found.
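
A minimal sketch of such an adjacency test is shown below. It assumes the molecule is available as a plain neighbor map of heavy-atom indices and that matches are sets of atom indices; the function name `is_adjacent` and the atom numbering are illustrative. The example graph corresponds to the heavy atoms of C[NH]C(=O)OC numbered in SMILES order.

```python
from typing import Dict, Set


def is_adjacent(match: Set[int], assigned: Set[int], neighbors: Dict[int, Set[int]]) -> bool:
    """True if any atom of the match is bonded to an already assigned atom.

    If nothing has been assigned yet, every match is accepted.
    """
    if not assigned:
        return True
    return any(neighbors[atom] & assigned for atom in match)


# Heavy atoms of C[NH]C(=O)OC: 0=C, 1=N, 2=C, 3=O (carbonyl), 4=O, 5=C
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3, 4}, 3: {2}, 4: {2, 5}, 5: {4}}
assigned = {0, 1}  # e.g. a CH3NH group that was found first

print(is_adjacent({2, 3, 4}, assigned, neighbors))  # bonded to atom 1
print(is_adjacent({5}, assigned, neighbors))        # only bonded to atom 4
```

The overlap test (a match must not reuse atoms that are already assigned) is a separate check and is not part of this function.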

The algorithms

Two types of algorithm can be used to fragment molecules. The first type (simple fragmentation) searches for one possible solution and accepts the first one found. The second type (complete fragmentation) tries to find all possible solutions to fragment the molecule. To achieve this, a full tree search on the complete structure over the entire fragmentation scheme has to be performed. Since more than one solution is inherently possible, a way should be provided to prioritize the determined solutions and select one.

Simple fragmentation

In the simple fragmentation algorithm, only one solution is searched. The patterns are sorted based on automatically calculated descriptors. In this work, the following set of 8 heuristically chosen descriptors was used to sort the patterns in descending order:
1. Whether the pattern has zero bonds: first the patterns without bonds, then the patterns with bonds are sorted.
2. Whether the pattern is simple, i.e. consists of one atom with valence one or of one atom with valence one connected to a carbon atom: first the simple patterns, then the others are sorted.
3. Number of atoms defining the group: this number includes the atoms actually matched by the pattern as well as the ones defining the vicinity in the case of recursive SMARTS.
4. Whether the number of available bonds is one: first the patterns with one bond, then the patterns with more bonds are sorted.
5. Number of atoms in the pattern that are neither hydrogen nor carbon.
6. Whether the pattern includes atoms in a ring: first the patterns that describe a partial ring (aliphatic or aromatic), then the other patterns are sorted.
7. Number of triple bonds.
8. Number of double bonds.
As a first step, the algorithm performs a quick search for the different groups in the fragmentation scheme, applying the heuristic group prioritization and the parent–child group prioritization as described above. The search goes sequentially through the sorted fragmentation scheme, adding groups that are found and do not overlap with groups that were already found. If it finds a valid fragmentation, this is taken as the solution. If no solution is found after trying all fragmentation patterns, the area around the unassigned atoms is cleared of adjacent groups and the search is repeated applying all three features described above, i.e. searching only for non-overlapping groups that are contiguous to the groups already found. The clearing and searching might be repeated several times if no solution is found after the first iteration. In each subsequent iteration, a larger portion of the molecule connected to the unassigned atoms is cleared. If a valid fragmentation is found, this is taken as the solution. Figure 3 shows a flow-diagram-like schematic representation of the algorithm.
Fig. 3

Schematic representation of the simple fragmentation algorithm
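
The first pass of the simple fragmentation can be sketched as a greedy, non-overlapping assignment over an ordered list of candidate matches. The sketch below is an illustration under stated assumptions, not the published implementation: the matches are written by hand as sets of heavy-atom indices for the molecule of Fig. 2 (c1c(Cl)c([OH])ccc1, atoms numbered in SMILES order) instead of being obtained from a substructure search, and the clearing and re-searching iterations are not included. The function name `greedy_fragment` is hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hand-written candidate matches for c1c(Cl)c([OH])ccc1
# atoms: 0=cH, 1=c, 2=Cl, 3=c, 4=O, 5=cH, 6=cH, 7=cH
MATCHES: Dict[str, List[FrozenSet[int]]] = {
    "ACH":  [frozenset({0}), frozenset({5}), frozenset({6}), frozenset({7})],
    "AC":   [frozenset({1}), frozenset({3})],
    "ACCl": [frozenset({1, 2})],
    "ACOH": [frozenset({3, 4})],
    "OH":   [frozenset({4})],
}


def greedy_fragment(n_atoms: int, group_order: List[str]) -> Tuple[List[str], Set[int]]:
    """Assign groups in the given order, skipping matches that overlap.

    Returns the list of assigned group names and the set of unassigned atoms.
    """
    assigned: Set[int] = set()
    groups: List[str] = []
    for name in group_order:
        for atoms in MATCHES.get(name, []):
            if assigned.isdisjoint(atoms):
                groups.append(name)
                assigned |= atoms
    return groups, set(range(n_atoms)) - assigned


# Searching AC first leaves the chlorine atom (index 2) without a group.
unsorted_groups, unsorted_left = greedy_fragment(8, ["AC", "ACH", "OH", "ACCl", "ACOH"])

# Order following the descriptor values of Table 1 (ACCl and ACOH before OH, ACH, AC).
sorted_groups, sorted_left = greedy_fragment(8, ["ACCl", "ACOH", "OH", "ACH", "AC"])

print(unsorted_groups, unsorted_left)
print(sorted_groups, sorted_left)
```

The first call reproduces the incomplete assignment described for Fig. 2; the second call assigns every atom in a single pass.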


Complete fragmentation

With the complete fragmentation algorithm, all possible solutions are searched. While the simple fragmentation algorithm might take milliseconds to find the fragmentation, the complete fragmentation algorithm might take minutes or even hours due to the vast space of possible combinations. Its search time increases exponentially with increasing molecule size. However, in contrast to the simple fragmentation, it finds all fragmentations, and therefore its success in finding a solution does not depend on the order of the searched patterns. This algorithm was implemented as a recursive algorithm that performs a complete tree search of all possible combinations of fragmentation. To reduce the fragmentation space that needs to be searched, the algorithm keeps track of the solutions already found and of the group combinations that lead to an incomplete fragmentation. If several solutions were found in the end, the solutions were sorted by the number of different patterns and the first solution was taken as the determined fragmentation. This way, patterns with larger groups are prioritized over smaller patterns. Figure 4 shows a flow-diagram-like schematic representation of the algorithm.
Fig. 4

Schematic representation of the complete fragmentation algorithm
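
The idea of the complete search can be illustrated with a small recursive enumeration over candidate matches. This is a simplified sketch, not the published algorithm: it branches on the lowest-numbered unassigned atom so that each solution is produced once, rather than keeping track of visited group combinations, and it uses a hand-picked subset of candidate matches for the molecule of Fig. 1 (C[NH]C(=O)OC, heavy atoms numbered in SMILES order) instead of a real substructure search. Selecting the solution with the fewest groups is one interpretation of sorting by the number of patterns; all names are illustrative.

```python
from typing import FrozenSet, List, Tuple

Match = Tuple[str, FrozenSet[int]]  # (group name, heavy-atom indices covered)

# Hand-picked candidate matches for C[NH]C(=O)OC
# atoms: 0=CH3, 1=NH, 2=C, 3=O (carbonyl), 4=O, 5=CH3
MATCHES: List[Match] = [
    ("CH3", frozenset({0})),
    ("CH3", frozenset({5})),
    ("CH3NH", frozenset({0, 1})),
    ("CONHCH3", frozenset({0, 1, 2, 3})),
    ("COO", frozenset({2, 3, 4})),
    ("CH3O", frozenset({4, 5})),
]


def all_fragmentations(n_atoms: int, matches: List[Match]) -> List[List[str]]:
    """Enumerate every set of non-overlapping matches that covers all atoms."""
    solutions: List[List[str]] = []

    def extend(assigned: FrozenSet[int], chosen: List[str]) -> None:
        if len(assigned) == n_atoms:
            solutions.append(list(chosen))
            return
        # Branch only on the first unassigned atom to avoid duplicate solutions.
        target = min(a for a in range(n_atoms) if a not in assigned)
        for name, atoms in matches:
            if target in atoms and not (atoms & assigned):
                chosen.append(name)
                extend(assigned | atoms, chosen)
                chosen.pop()

    extend(frozenset(), [])
    return solutions


solutions = all_fragmentations(6, MATCHES)
best = min(solutions, key=len)  # assumption: fewer, larger groups are preferred
print(solutions)
print(best)
```

For this input, two complete fragmentations exist, which mirrors the non-unique group assignment discussed for Fig. 1.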


Computational details

In this work, the RDKit [39] Python module was used to implement the algorithm. It supports the Simplified Molecular Input Line Entry System (SMILES) [40] and the SMiles ARbitrary Target Specification (SMARTS) [37] languages for specifying the molecular structures and the functional group patterns, respectively. The SMARTS language is used as it provides a standardized, feature-rich, easily learnable and widespread approach to describe the molecular patterns. To implement the parent–child group prioritization as described in the “Parent–child group prioritization” section, it is necessary to test whether one pattern is contained within another. RDKit already works well when testing for most of the parent–child relationships. However, in some cases where the explicit number of hydrogen atoms is important, the results are incorrect. For example, RDKit matches ‘[CH3][OH]’ as being contained in ‘[CH3][O;H0]’. Because of this, in this work, after a positive match the explicit number of hydrogen atoms is tested to avoid false positives. The research group of Computational Molecular Design at the University of Hamburg offers an online tool called SMARTSviewer [41, 42] that makes developing SMARTS patterns easier. This tool was used in the development process of the fragmentation scheme. The same group is also developing new algorithms to find the relationships between SMARTS patterns. In the future, these developments might help improve the capabilities of cheminformatics modules such as RDKit to discern whether a pattern is contained within another. The open-source thermodynamics Python module thermo [35] includes a large database of structures including single molecules and mixtures. After excluding salts and radicals, this comprises a total set of 62,380 structures in the form of SMILES. For a subset of structures of this large database, fragmentations are available for use with the UNIFAC model.
These structures were automatically fragmented using the service provided on the DDBST GmbH webpage [32]. This work first compares the results of the newly developed fragmentation algorithms with this reference database and then checks whether the new algorithms can fragment more structures than previously thought. In SMILES that include heavy isotopes of hydrogen, e.g. deuterium, these were replaced by normal hydrogen atoms. This results in 28,678 available SMILES with their corresponding UNIFAC fragmentation in the reference database. To make the implementation of the algorithm in another group contribution model easier, the functions and the reference databases are made available as separate files in Additional files 1, 2, 3, 4 and on GitHub [43].

Results and discussion

The fragmentation scheme for UNIFAC developed in this work can be found in Table 1. A version of the fragmentation scheme sorted according to the description in the “Simple fragmentation” section can be found in Additional file 5. The focus of this work is to develop a fragmentation algorithm that is as independent as possible from the chosen fragmentation scheme to allow for a faster development of new group contribution methods. For this reason, the SMARTS for each pattern were kept as simple as possible. The few patterns that were made more specific to better match the results from the reference database have been underlined. However, the vast majority of the SMARTS are as simple as they can be. The fragmentation results are summarized in Table 2. Since the order of the patterns searched can have an influence on the end result, the sorted and unsorted cases are listed separately in the table.
Table 2

Results of the fragmentation with both algorithms on the reference database

Algorithm | Sorted patterns? | N_SMILES | N_fragmented (%) | N_likeRefDB (%)
Simple | Yes | 28,678 | 28,677 (> 99.9%) | 28,305 (98.7%)
Simple | No | 28,678 | 18,969 (66.1%) | 14,493 (50.5%)
Complete | Yes | 24,336 | 24,335 (> 99.9%) | 22,084 (90.7%)
Complete | No | 24,336 | 24,335 (> 99.9%) | 18,532 (76.1%)

For the complete algorithm, only the molecules with 20 or fewer heavy atoms were fragmented

It can be observed that the simple fragmentation algorithm with the sorted patterns is able to fragment all molecules except the one shown in Fig. 5. This is because there is no group in the fragmentation scheme matching the structure. The algorithm was able to fragment every structure for which it should have been possible. This is a very encouraging result. With patterns sorted automatically based on a set of general descriptors, as many as 98.7% of the fragmented molecules match the fragmentation found by the algorithm of the reference database. Most of the remaining 1.3% of the fragmentations from the reference database can be explained by a different aromaticity perception. In RDKit, a chemical bond is described either as aromatic or as a single/double bond, whereas in the reference database in some cases no distinction is made.
Fig. 5

Only molecule that was not possible to fragment. SMILES: C1=CN=CC#C1

For the simple fragmentation algorithm, as expected, the sorting of the patterns plays a major role in the success of finding any solution at all, and it is especially important for finding the same solution as the reference database. To evaluate the complete fragmentation algorithm, only the molecules with 20 or fewer heavy atoms from the reference database were included. This was done because for very large molecules the algorithm takes hours to find all solutions. Table 2 shows that, since this algorithm searches for all possible fragmentations, the number of fragmented molecules is independent of whether the patterns are sorted or not. However, the results show that the sorting of the patterns has an influence on whether the solution chosen at the end is equal to the solution of the reference database. This is because the order in which the different patterns are searched defines the order of the found solutions, from which the first one is selected. The complete fragmentation algorithm could be refined further to sort the determined solutions at the end in a more elaborate way, for example, based on the descriptors of the patterns. However, this is out of the scope of this work. Lastly, the algorithms were applied to the large database of structures included in thermo [35] to find out whether the new algorithms are capable of fragmenting molecules that were not in the reference database. In this case, the simple fragmentation algorithm was applied first with the sorted patterns. If no solution was found with the simple fragmentation algorithm, the complete fragmentation algorithm was applied if the structure had fewer than 20 heavy atoms. With this combined fragmentation algorithm, a total of 33,560 structures were fragmented successfully. This number is 17% larger than the 28,677 fragmented structures in the reference database. This shows that the newly developed algorithms are capable of fragmenting more structures than the algorithm used in the reference database.

Conclusions

Several challenges exist when attempting to fragment molecules into a set of predefined functional groups or molecular subunits. The strategies developed and implemented for the two algorithms in this work show that it is possible to automate group fragmentation based on computed descriptors for the patterns in the fragmentation scheme. Both algorithms are capable of fragmenting every molecule of a reference database of structures into their respective UNIFAC groups. Furthermore, the algorithms are capable of fragmenting molecules that could not be fragmented by the algorithm of the reference database. The advancements of this work make it possible to accelerate the development of new group contribution models by allowing different fragmentation schemes to be tested on large databases of molecules much faster than with manual fragmentation, which is the existing standard for most group contribution models. It is a step forward in the direction of completely automated QSPR methods and maybe even completely automated group contribution development.

Additional file 1. Reference database of structures with fragmentations by the DDBST online fragmentation tool.
Additional file 2. Large database of structures without fragmentations by another method used to test the capability of the algorithms on more molecules.
Additional file 3. Code to reproduce results from the paper.
Additional file 4. Class encapsulating both algorithms for use in new applications.
Additional file 5: Table S1. Sorted fragmentation scheme developed in this work for the published UNIFAC groups and the respective pattern used for sorting.
  6 in total

1.  Simultaneous prediction of aqueous solubility and octanol/water partition coefficient based on descriptors derived from molecular structure.

Authors:  D J Livingstone; M G Ford; J J Huuskonen; D W Salt
Journal:  J Comput Aided Mol Des       Date:  2001-08       Impact factor: 3.686

2.  From structure diagrams to visual chemical patterns.

Authors:  Karen Schomburg; Hans-Christian Ehrlich; Katrin Stierand; Matthias Rarey
Journal:  J Chem Inf Model       Date:  2010-09-27       Impact factor: 4.956

3.  CO: A chemical ontology for identification of functional groups and semantic comparison of small molecules.

Authors:  Howard J Feldman; Michel Dumontier; Susan Ling; Norbert Haider; Christopher W V Hogue
Journal:  FEBS Lett       Date:  2005-08-29       Impact factor: 4.124

4.  An algorithm to identify functional groups in organic molecules.

Authors:  Peter Ertl
Journal:  J Cheminform       Date:  2017-06-07       Impact factor: 5.514

5.  Functionality pattern matching as an efficient complementary structure/reaction search tool: an open-source approach.

Authors:  Norbert Haider
Journal:  Molecules       Date:  2010-07-27       Impact factor: 4.411

6.  Extended Functional Groups (EFG): An Efficient Set for Chemical Characterization and Structure-Activity Relationship Studies of Chemical Compounds.

Authors:  Elena S Salmina; Norbert Haider; Igor V Tetko
Journal:  Molecules       Date:  2015-12-23       Impact factor: 4.411

