Gabriel Sigmund (1), Mehdi Gharasoo (2), Thorsten Hüffer (1), Thilo Hofmann (1)
(1) Department of Environmental Geosciences, Centre for Microbiology and Environmental Systems Science, University of Vienna, Althanstrasse 14, 1090 Wien, Austria.
(2) Ecohydrology Research Group, Department of Earth and Environmental Sciences, University of Waterloo, 200 University Ave W, Waterloo, Ontario, Canada N2L 3G1.

Zhang et al.[1] published a paper on machine-learning-based predictions
of organic contaminant sorption onto carbonaceous materials and resins.
The authors provide a novel approach to predict concentration-dependent
sorption distribution coefficients (KD) for these materials without the need to link them to any specific
isotherm model. This study is a valuable contribution to the field
that can stimulate the scientific discussion in the adsorption-modeling
community regarding (i) mechanistic assumptions prior to model building,
(ii) the parametrization of the model based on these assumptions,
(iii) the grouping of data to train the algorithm, and (iv) data filtering
strategies. We recently published a paper on a similar topic[2] and are confident that this discussion will help
improve the future applicability of machine learning techniques
to sorption phenomena.

Zhang et al. used the BET specific
surface area and total pore volume to describe the sorbent materials
and state that “these two parameters are critical for deciding
the adsorption of organic compounds through hydrophobic interactions
and pore-filling, two key mechanisms for organic compounds to be adsorbed
by various adsorbents.” These processes are of general importance.
However, it is well accepted for carbonaceous sorbents that π–π
electron donor–acceptor interactions are a key mechanism for
the sorption of organic compounds.[3,4] These interactions
can be related to the polarizability of the compound[5] as well as the aromaticity of the sorbent materials, which
can be approximated by the broadly available molar H/C ratio of the
material's elemental composition.[6]

The correct parametrization of the
dominant sorption processes is crucial. Zhang et al. built their model
on two highly correlated sorbent parameters, i.e., the BET specific
surface area and total pore volume, both determined from N2 physisorption.[7,8] Hydrophobic interactions cannot
be assigned directly to the BET specific surface area; instead, if
hydrophobicity is a key driver for sorption of organic compounds,
it would be important to include a sorbate hydrophobicity parameter
such as log Dow or the hexadecane–water
partitioning coefficient (“L”), which is widely used
in ppLFER models, including models for carbonaceous materials.[5,9] Zhang et al. suggest the use of the McGowan volume as a hydrophobicity
proxy. While it is true that hydrophobicity tends to increase with
molecular size, other aspects, such as the polarity of a compound,
are not accounted for by the McGowan volume.

Zhang et al. subdivided their data
sets into four categories (i.e., biochar, carbon nanotubes, granular
activated carbon, and resins). Based on our recent study,[2] the sorption mechanisms for the various carbonaceous
sorbents are not fundamentally different and, when sorbent material
properties are well parametrized, these data can be combined. In the
case of Zhang et al., this would result in only two categories, i.e.,
carbonaceous materials and resins. Thereby the machine learning algorithm
could be trained on a larger data set for carbonaceous materials,
which might improve its generalization and forecasting capabilities.

With deterministic data filtering
techniques such as cosine similarity, as used by Zhang et al., the data
may be systematically filtered to a level that is not necessarily
representative of the original trend. Random-based statistical techniques
such as low-entropy data removal or significance test measures may
be better choices for data filtering. Because random allocation of data
into different sets for training, validation, and testing results
in varying goodness of fit, we suggest multitraining as a technique
to increase model generalization. Thereby, even a poor set can contribute
to the model's predictability and performance.

Consideration of the above aspects will further improve the applicability
of machine learning algorithms for studying contaminant dynamics.
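
The multitraining suggestion made above can be illustrated with a minimal sketch. It assumes that multitraining means fitting one model per random data partition and averaging the resulting predictions, which is one possible reading and not necessarily the exact procedure intended. The data are synthetic, the model is a plain least-squares fit rather than a neural network, and the split is simplified to a training set and a held-out set. The names `multitrain` and `predict_ensemble` are hypothetical.

```python
import numpy as np


def multitrain(X, y, n_rounds=20, train_frac=0.7, seed=0):
    """Fit one least-squares model per random split.

    Returns the coefficients of every round and the RMSE of each
    round on its own held-out data.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_train = int(train_frac * n)
    Xb = np.column_stack([np.ones(n), X])  # prepend an intercept column
    coefs, rmses = [], []
    for _ in range(n_rounds):
        idx = rng.permutation(n)  # new random allocation each round
        train, held_out = idx[:n_train], idx[n_train:]
        beta, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)
        resid = Xb[held_out] @ beta - y[held_out]
        coefs.append(beta)
        rmses.append(float(np.sqrt(np.mean(resid ** 2))))
    return np.array(coefs), np.array(rmses)


def predict_ensemble(coefs, X):
    """Average the predictions of all per-split models."""
    Xb = np.column_stack([np.ones(X.shape[0]), X])
    return (Xb @ coefs.T).mean(axis=1)


# Synthetic stand-in data (assumption): three descriptors, one response.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 1.5 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.2, size=200)

coefs, rmses = multitrain(X, y)
y_hat = predict_ensemble(coefs, X)
ensemble_rmse = float(np.sqrt(np.mean((y_hat - y) ** 2)))

print("held-out RMSE per split: min", rmses.min(), "max", rmses.max())
print("ensemble RMSE on all data:", ensemble_rmse)
```

The spread between the smallest and largest held-out RMSE shows how strongly the goodness of fit depends on the random allocation. Averaging over all rounds lets every partition, including an unfavorable one, contribute to the final prediction.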