Literature DB >> 31909322

Toward a Robust, Universal Predictor of Gas Hydrate Equilibria by Means of a Deep Learning Regression.

Michael K B Landgrebe¹, Diakanua Nkazi¹.

Abstract

Due to offshore reservoirs being developed in ever deeper and colder waters, gas hydrates are increasingly becoming a significant factor when considering the profitability of a reservoir due to flow disruptions, equipment, and safety hazards arising from the hydrate plug formation. Due to low-dosage hydrate inhibitors such as kinetic inhibitors competing with traditional thermodynamic inhibitors such as methanol, accurate information regarding the hydrate equilibrium conditions is required to determine the optimal hydrate control strategy. Existing thermodynamic models can prove inflexible regarding parameter adjustment and the incorporation of new data. Developing a multivariate regression model capable of generalizing hydrate equilibria over a wide range of conditions, with results competing with thermodynamic models is worthwhile. A multilayer perceptron neural network of three hidden layers has undergone supervised training means of a backpropagation to accurately predict uninhibited hydrate equilibrium pressure for a range of gas mixtures with nine input features, excluding hydrogen sulfide and electrolytes, from a dataset of 1209 equilibrium points, 670 of which are multicomponent gases, sampled in a rigorous data sampling campaign from existing experimental studies. Statistical significance of results has been emphasized, with models validated using 10-fold cross-validation and holdout validation, facilitating hyperparameter optimization without overfitting, while stratified holdout ensures testing a wide range of conditions. The developed model has proven to outperform two popular thermodynamic models. Various scoring metrics are used, with an average cross-validated R 2 of 0.987 ± (0.003). An R 2 of 0.993 and mean absolute percentage error of 5.56% are yielded for holdout validation. Auxiliary models are included to determine the multicomponent prediction capability and dependency on individual data sources. Multicomponent data prediction has proven successful; results prove that the model accurately generalizes hydrate equilibria and is well suited to predicting unseen data. Positive results are largely insensitive to exact model parameters, thus indicating a robust, replicable methodology.

Entities: CellLine Chemical Disease Gene

Year: 2019 PMID： 31909322 PMCID： PMC6941195 DOI： 10.1021/acsomega.9b02961

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

Gas hydrates are solid, crystalline masses which form as a result of a guest molecule stabilizing a water lattice. These hydrates form when an appropriate guest molecule in the presence of water is present at temperature and pressure conditions facilitating hydrate formation.[1] The stabilizing guest molecule may be a low-carbon-number organic molecule or a small inorganic gas molecule. These structures are achieved through hydrogen bonds between the encaging water molecules and are stabilized through interactions between the guest molecule and encaging water molecules.[2,3] Recently, interest over the possible use of gas hydrates as a safer alternative to liquefied natural gas (LNG) for transport and storage of natural gas has arisen.[4] Most prevalent to the petroleum industry, however, are the flow assurance issues related to hydrates. A consequence of gas hydrate formation is the formation of hydrate plugs through the nucleation, growth, and agglomeration of hydrate particles.[2,5] Hydrate plugs can result in flow interruptions, additional operational costs, equipment damage, and possibly hazardous conditions. Significant research has been directed toward gas hydrate flow assurance issues, notably predicting the conditions where hydrates are likely stable.[6] The conditions at which hydrates are most likely to be encountered at are high pressures and low temperatures,[7] prevalent when considering offshore wells, transit lines for liquid and gas flow, and equipment.[8] Hydrates form in both oil and gas wells, requiring the presence of water, which tends to accumulate as stagnant zones in pipeline elevation changes.[8] As many shallow reservoirs with favorable temperatures have been produced, offshore production is trending toward deeper and colder wells; thus, gas hydrates will likely be a significant factor when considering profitability of potential reservoirs.[5,8] Removal of hydrate plugs can pose serious hazards;[5] thus, avoiding or slowing hydrate growth is practiced, with approaches ranging from pipe insulation to fluid-phase chemical injection, and chemical means through use of thermodynamic or kinetic inhibitors, which prevent nucleation or slow hydrate growth, respectively. Kinetic inhibitors as low-dosage hydrate inhibitors (LDHIs) require addition at significantly lower concentration than thermodynamic inhibitors, thus possibly facilitating lower costs of hydrate control.[2,9] As antiagglomerants require the presence of liquid hydrocarbons,[9] these are not recommended for use with the model developed in this investigation. Due to the wide range of hydrate control options, determining the optimal combination of preventative means proves challenging. With the growing popularity of LDHI additives, it is essential to determine whether hydrate prevention rather than control is a cost-effective strategy.[5] To establish an optimal hydrate management strategy, information regarding the temperature and pressure region wherein hydrates will be stable for a given gas is vital.

Existing Means of Predicting Gas Hydrate Equilibrium Conditions

Early hydrate equilibrium predictive models, including the Katz[10] gas gravity plot, are often used as a preliminary estimate of equilibrium conditions.[5] Among the models attempting to yield more reliable results are thermodynamic models. The van der Waals and Platteeuw[11] model has served as a basis for many subsequent models.[5,6,12] Such models entail predicting hydrate equilibrium conditions through a statistical thermodynamic approach based on intrinsic factors and can predict hydrate structure,[12] with other advantages discussed in Sloan and Koh.[5] Ballard and Sloan[13] and Giavarini and Hester[12] list several computerized thermodynamic models. Many of these models use as a base the van der Waals and Platteuw[11] model or later variations to model the hydrate phase, while an equation of state such as the Peng–Robinson equation of state models the fluid phase.[14] Chapoy et al.[15] lists model shortcomings, notably the difficulty of adjusting model parameters and determining whether results correspond with a local minimum, significant factors when considering the complex nature of natural gas hydrate equilibria. Gas hydrate equilibrium modeling has also been achieved through machine learning, using easily measured variables to generalize behavior over a wide range of data, facilitating the development of models more able to accurately predict the highly nonlinear, multimodal phenomena of gas hydrate equilibria than those based on statistical thermodynamics. A wide range of machine learning models have been published. Input features for developed models vary, typically including temperature or pressure and some means of accounting for fluid composition. Regression model output is either the equilibrium temperature or pressure. Note that models have been considered with a focus on multicomponent hydrocarbon and natural gases. The neural network models of Heydari et al.[16] and Hesami et al.[17] predict equilibrium temperature from inputs of pressure and specific gravity. Zenali et al.[18] developed neural network and ANFIS models from a dataset of over 700 points, for inputs of temperature and gas composition. Ghavipour et al.[19] developed a model from 130 data points for inputs of pressure and the gas composition up to n-butane, while the train-test data split is an interesting aspect, being governed by leave-one-out, a form of cross-validation. Models have been developed to predict hydrate equilibrium conditions where thermodynamic inhibitors and electrolytes are present. Elgibaly and Elkamel[20] developed models with up to 16 inputs, including gas composition, hydrogen sulfide, and inhibitor and electrolyte concentration, later expanded on in Elgibaly and Elkamel,[21] accounting for inhibitor costs. Employing a train-test-validate split, Chapoy et al.[15] developed a neural network with 19 inputs from a dataset of over 3000 equilibrium points, including hydrate structure, gas compositions, and inhibitor and electrolyte concentration, while excluding methane molar gas-phase compositions less than 50%. While many studies cover an extensive range of conditions, many include copious amounts of data for pure and binary components. The sparseness of reliable publicly available multicomponent data renders multicomponent model development challenging. Soroush et al.[22] developed a two-hidden-layer neural network, from just under 300 equilibrium points obtained from the literature, achieving a holdout validation R2 of 0.998, and while limited by a small dataset, restricts the methane + ethane molar gas-phase composition to a minimum of 66.38%, increasing model relevance to natural gas. While neural networks have become popular, support vector machine models continue to be developed, such as the works of Ghiasi et al.[23] While neural network models have yielded promising results over a range of experimental data, there is a lack of associated error for the results of many models, which may be assessed through cross-validation practices. Ward[24] highlights that equilibrium experimental measurements must adequately account for hydrate metastability. As data used to develop models have usually been sampled from the literature, confidence in some dated sources of data may be raised due to outdated methodologies and apparatus. The cross-validated neural network developed in this investigation predicts the hydrate equilibrium pressure for a given temperature and gas composition. As extensive experimental data from various sources are used for model development, equilibrium is defined as the point at which complete dissociation of formed hydrates occurs, as recommended by Sloan and Koh.[5] Overall, a robust, flexible alternative to the thermodynamic model is in demand.

Model Limitations and Utility

The proposed model uses a multilayer perceptron neural network trained to predict gas hydrate equilibrium pressure for inputs of temperature and gas composition, indicated in eq . Features only accommodate sI and sII hydrate formers, excluding black oils and heavy condensates. The model is considered usable for all gases with a minimum methane molar fraction of 0.5 and conforms to the features and units listed in Table . The proposed model aims to perform competitively with existing popular thermodynamic models in terms of uninhibited pressure prediction, and development has focused on proving significance of results through assessment of multicomponent performance and data source dependency. Should the model prove to be able to compete with existing thermodynamic models, it would be an indication that machine learning models have the potential to see significant industrial use. Simultaneously performing an analysis on results including both uninhibited and inhibited systems could cast doubt over the reliability of results, with larger datasets with more features leading to concerns that certain underperforming ranges of conditions may go unnoticed. As such, the first step to developing a model capable of achieving industrial confidence is predicting uninhibited systems with an accuracy competing with industrially accepted thermodynamic models. Historically, hydrate prediction models have typically accounted for the presence of thermodynamic inhibitors, however, with the increasing popularity of LDHIs, proving model performance using uninhibited equilibrium data carries significance. Proving uninhibited hydrate equilibrium prediction capability would allow the model to be used to assess whether operations are likely to fall within the hydrate stability zone, thus risking hydrate formation. Furthermore, an accurate uninhibited model would allow for pressure predictions in calculations to determine the viability of kinetic inhibitors, which typically involve determining the driving force for hydrate formation, calculated directly or through correlation with the degree of subcooling, the difference in temperature between system and equilibrium pressure.[25] This degree of subcooling is also used together with induction time as a measurement of kinetic inhibitor performance and limitations,[26] thus resulting in knowledge of hydrate equilibrium conditions being necessary to efficiently select the optimal inhibition strategy. Furthermore, the proposed model is limited through a lack of electrolytes as a model feature. While this does exclude use of the model for application in offshore drilling and well production, many offshore production facilities involve natural gas separation prior to pipelining or transportation. When considering a gas separated from other reservoir fluids, hydrate equilibrium calculations can take place without having to consider electrolytes as a feature. As such, the model could see use in subsea gas pipelines which have been preceded by phase separation, offshore gas processing facilities, and river or surface gas pipelines or processing facilities during cold seasons where hydrate formation may unexpectedly occur.

Table 1

Features Influencing Hydrate Formation Included in Model Development

temperature (°C)

C₁ mol fraction

C₂ mol fraction

C₃ mol fraction

i-C₄ mol fraction

n-C₄ mol fraction

C₅₊ mol fraction

CO₂ mol fraction

N₂ mol fraction

pressure (MPa)

The main purpose of this research is to investigate the possibility of developing a machine learning model capable of yielding superior results to existing thermodynamic models. While many machine learning models in the field of predicting gas hydrate equilibrium conditions have been developed, some of which have been listed, it has been discussed that a lack of analysis into multicomponent prediction capability, datasets including methane concentrations below molar fractions of 0.5, and uncertainty over the statistical significance of reported results and as to whether results are merely valid for favorable randomizations of train-test data necessitate definitive model development and analysis with metrics which allow performance when compared to existing thermodynamic models to be judged. Through application of both cross- and holdout validation facilitating greater model optimization, together with assessment of multicomponent performance and existing thermodynamic models, this model serves to fill a gap in existing research by providing an optimized model fit for predicting unseen data. Positive results are largely insensitive to exact model parameters, indicating a robust, repeatable methodology has been used. It is expected that research conducted for similar application could follow the design choices and validation practices or build off some of this work to develop similar models. Finally, given the abundance of data which has been published in works such as Sloan and Koh[5] for thermodynamically inhibited systems and successful machine learning models accounting for inhibitors such as Chapoy et al.,[15] it is likely that a successful uninhibited model could easily be converted to one capable of modeling these extra features

Methodology

In this research, the approach taken to model the conditions of gas hydrate equilibria is that of a multivariate regression, implemented through an artificial neural network. Due to the significant nonlinearity of the data, neural networks are well suited to model the conditions of gas hydrate equilibria. Leshno et al.[27] detail that a multilayer feed-forward network is capable of acting as a universal approximator, which is the end goal of this research. The neural network models developed in this research have been designed as feed-forward, multilayer perceptrons, trained in a supervised manner by means of a backpropagation through use of datasets of published equilibrium data. Models consist of two to three hidden layers. Model development has been performed using Python (v. 3.6.8) and the Keras (v. 2.2.4)[92] library running on a Tensorflow (v. 1.12.0) backend to develop and train the neural network. Other libraries utilized include scikit-learn (v. 0.20.2), Pandas (v.0.24.1), and Numpy (v. 1.15.4), while Matplotlib (v. 3.0.2) has been used to develop various plots and illustrations.

Model Features

Features selected to generalize gas hydrate equilibria are listed in Table , where pressure is the output of the model. Rajnauth et al.[7] detail temperature, pressure, and gas composition as factors influencing hydrate formation. GPSA[1] and Sloan[28] identify methane, ethane, propane, iso-butane, n-butane, carbon dioxide, nitrogen, and hydrogen sulfide as hydrate formers. Due to the scope of this investigation and a lack of sufficient independent experimental studies publishing data for mixtures including the component, the presence of hydrogen sulfide has not been included in model development. While hydrogen sulfide as a feature has been implemented in works such as Chapoy et al.,[15] the manner in which cross-validation is performed in this research requires a significant number of independent studies covering a broad range of conditions. Carroll[29] details heavy non-hydrate-forming hydrocarbons exhibiting an azeotropic effect when mixed with pure methane in hydrate-forming systems that is accounted for in the models developed through the inclusion of a lumped sum of hydrocarbons present in the gas phase with a carbon number of five or greater as a feature. Compositions are accounted for as molar fractions. Hydrate equilibrium conditions, temperature (°C) and pressure (MPa) have been selected as model feature and output, respectively. While GPSA[1] lists kinetics and mass transfer as factors influencing hydrate formation, it is assumed that sampled data sources have adequately accounted for metastability effects when measuring equilibria by allowing sufficient time for formed hydrates to stabilize and a sufficiently low heating rate during dissociation up to equilibria.[24,30] Although GPSA[1] lists salinity as impacting hydrate formation, the effect of inhibitors and electrolytes on hydrate equilibria is beyond the scope this study. It is assumed there is sufficient water present to facilitate hydrate formation. The features listed in Table do not influence pressure prediction in equal proportions. No redundant features have been selected. Gas temperature is regarded as the feature most significantly influencing the pressure at which hydrate equilibrium is achieved, with a direct proportional relationship well established. While the lumped fraction of organic molecules with five or more carbon atoms does not occupy structure I or II hydrate cages, the presence of these molecules induces an azeotropic effect,[29] influencing hydrate equilibrium conditions, albeit likely the lowest contribution to predicted pressure of all listed features. As methane, ethane, and nitrogen may occupy both structure I and II hydrates,[31,32] their respective contribution toward the pressure prediction falls short in the presence of larger organic gases. While carbon dioxide similarly to methane may occupy cages for both hydrate structures, dissociation occurs in a different thermodynamic manner from methane,[33] requiring consideration despite carbon dioxide generally only being a low-molar-fraction component. Propane and butane fractions heavily influence the hydrate structure, thus potentially having a large impact on equilibrium conditions and are thus weighted accordingly. Further complexity is introduced through molecules such as n-butane being unable to occupy even the large cages of hydrate structure II without the assistance of a smaller molecule such as methane.[5] When considering the possible coexistence of hydrate structures, it can be seen that propane, isobutane, and n-butane will most significantly affect hydrate equilibrium conditions, being weighted only behind temperature.

Dataset Sampling

A vast quantity of data spanning a wide range of conditions is required to develop the neural network. All data sampled have been obtained from experimental studies on gases reporting several features listed in eq and Table . Hydrate equilibrium data have been collected through a rigorous data sampling campaign. To avoid biasing the model, the sampling of generated data of any kind has been avoided, and studies using software or equations to predict the conditions of hydrate equilibrium have been excluded. As a regression model is being developed, sampling data outside of equilibrium bias results and reduce the overall accuracy of the model. Ruffine et al.[34] attribute much of the difference in hydrate formation between laboratory-scale experiments and natural gas equipment to inadequate time taken to form stable hydrates and time over which hydrates occupy a stable state. Sampled data have been checked where possible to ensure that the reported values do occur at equilibrium, and care is taken to ensure no obvious neglect of metastability is reported. A limited number of multicomponent hydrate equilibrium measurements have been published, largely due to the time taken to perform these experimental measurements. Ward[24] attributes the need to account for metastability effects as a significant time investment for measurements. Several data sources used in this study have routinely been included in other prediction models in the field, notably that by Sloan and Koh,[5] which constitutes a substantial quantity of data by listing various experimental hydrate equilibrium studies and measurements. While there is an overlap between data used in this study and several others published in the field, other equilibrium data sources published in recent years or overlooked in the past have been included. All data sources have been referenced.[5,19,24,32,43−91] While natural gas hydrates conform to three distinct hydrate structures, Tohidi et al.[35] investigated the occurrence of structure H hydrates in reservoir fluids including natural gas and determined that structure H, while possible when operating within the hydrate stability zone, is unlikely to form when considering reservoir fluids such as natural gas at equilibrium. Tohidi et al.[35] identify structure II hydrates as the dominant structure in reservoir fluid hydrate formation. For this model, it is assumed that in the case of a hydrate avoidance strategy, the hydrate stability zone is avoided. As equilibrium is examined rather than conditions within the hydrate stability zone, experimental data are overwhelmingly sampled from sources where sI or sII hydrates are present.

Dataset Composition and Grouping

Sampled data have been compiled into datasets for model development; summaries of both are provided in Tables and 3. Units of measurement for each feature have been converted to those in Table . Conversions between metric temperatures have been performed on the basis that 0° C = 273.15 K. Normalization is performed to fit data within the features of the model by setting the sum of molar fraction compositions to unity due to source decimal place inconsistencies and the presence of insignificant quantities of inert components which do not interfere with hydrate formation. The regression was performed adequately without scaling temperature or pressure, and these features are left in their original units listed in Table . Both a multicomponent exclusive and complete dataset have been compiled with 670 and 1209 equilibrium gas measurements, respectively, where the complete dataset includes all samples present in the multicomponent dataset in addition to pure and binary methane measurements. While the final model is based on the complete dataset, separate models are developed for comparative purposes with regard to predictions of complex gases such as natural gas. A constraint placed on both datasets to ensure results are relevant to natural gas is a minimum 0.5 molar fraction of methane in the gas phase. Compiled datasets do not cover an even distribution of the range of conditions investigated, largely due to the lack of experimental data for certain ranges. As such, datasets are non-Gaussian, requiring consideration when developing model training and testing sets. Grouping data by experimental source facilitates sampling techniques ensuring better representation of the entire dataset for model train-test sets. Grouping is performed by assigning a unique code in the dataset to equilibrium measurements from the same source and experimental setup. Binary measurements have been grouped per source, per binary component, while several pure methane sources have been assigned two groups with similar distributions over 81 of the 123 pure methane samples to avoid biasing results. Sources reporting less than four samples are excluded from the dataset, while groups of 50 or more samples are restricted to multicomponent dominant sources. The grouping of equilibrium data in each dataset allows assessment of model dependency on individual data sources. Due to the inclusion of historical data sources which may have employed outdated measurement practices for identifying the equilibrium point or may not have sufficiently accounted for metastability such that measurements are slightly within hydrate stability, heavy reliance upon individual sources to cover ranges of data is often unwanted. While the data sampling campaign is conducted in an attempt to render statistical variations across samples insignificant, due to the limited availability of data, assessing dependencies is worthwhile. Considering grouping during model development allows these factors to be mitigated to some extent and reduces the bias on reported results.

Table 2

Multicomponent Exclusive Dataset Summary

feature	temperature (°C)	C₁ mol fraction	C₂ mol fraction	C₃ mol fraction	i-C₄ mol fraction	n-C₄ mol fraction	C₅₊ mol fraction	CO₂ mol fraction	N₂ mol fraction	pressure (MPa)
sample count	670	670	670	670	670	670	670	670	670	670
mean	13.5732	0.8524	0.0651	0.0300	0.0038	0.0065	0.0024	0.0145	0.0253	8.1922
σ	6.9344	0.1026	0.0604	0.0304	0.0078	0.0115	0.0057	0.0350	0.0620	8.8865
min	0.0000	0.5000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.4966
25%	7.4500	0.8025	0.0169	0.0096	0.0000	0.0000	0.0000	0.0000	0.0000	2.6164
50%	14.1550	0.8654	0.0543	0.0204	0.0001	0.0004	0.0000	0.0004	0.0004	5.1435
75%	19.2502	0.9280	0.0795	0.0357	0.0040	0.0085	0.0018	0.0143	0.0090	10.4623
max	30.3735	0.9940	0.2500	0.1698	0.0461	0.0510	0.0340	0.3140	0.4000	68.2300

Table 3

Complete Dataset Summary

feature	temperature (°C)	C₁ mol fraction	C₂ mol fraction	C₃ mol fraction	i-C₄ mol fraction	n-C₄ mol fraction	C₅₊ mol fraction	CO₂ mol fraction	N₂ mol fraction	pressure (MPa)
sample count	1209	1209	1209	1209	1209	1209	1209	1209	1209	1209
mean	12.5068	0.8758	0.0413	0.0223	0.0066	0.0049	0.0014	0.0225	0.0252	9.2320
σ	7.3586	0.1168	0.0620	0.0445	0.0301	0.0109	0.0044	0.0667	0.0725	11.1019
min	0.0000	0.5000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.1800
25%	6.1500	0.8375	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	2.8300
50%	12.1000	0.8950	0.0088	0.0019	0.0000	0.0000	0.0000	0.0000	0.0000	5.3778
75%	18.3500	0.9725	0.0647	0.0308	0.0029	0.0055	0.0000	0.0051	0.0068	10.8080
max	31.8500	1.0000	0.4360	0.4980	0.5000	0.0582	0.0340	0.5000	0.4975	72.2600

Model Validation

The neural network is trained by reducing the error arising from disparity between actual and predicted training data pressure. While the model training score may be reported as an exceptionally favorable figure, this does not provide an indication of a model’s capability to predict unseen data, and may be achieved at the cost of overfitting the model, thus reducing generalization capability.[37] To test the predictive capability of the model, data unseen during training must be used. The means of performing model validation has been selected to effectively utilize a dataset limited in size, facilitating a high degree of model tuning to obtain a near-optimal result, while yielding the variance associated with results. The validation strategy involves a holdout test set being separated from training data, and training data subdivided into train-test sets through 10-fold cross-validation. Cross-validation provides an indication of how the model performs under a range of conditions and outputs the bias and standard deviation associated with results over 10-folds. This 10-fold cross-validation provides 10 results obtained from 10 randomized combinations of train-test data, indicating the ability of a model to predict conditions over the entire range of data provided, and ensures that high-scoring models are not merely accurate for a favorable randomization. The cross-validation variance serves to highlight potential outliers or ranges of conditions which are poorly fitted or lack a sufficient quantity of data needed to make accurate predictions. Following cross-validation, the model is trained using all training data, and is validated using the holdout set. Holdout validation ensures that tuning the model parameters based on cross-validation results can be performed without significant risk of information regarding the test set leaking into the neural network, which could render results unreliable. Holdout validation facilitates testing the cross-validated model using unseen data and ensures that model parameter tuning has not led to overfitting. Due to the non-Gaussian nature of the dataset, randomly sampling indices for both holdout and cross-validation sets is unlikely to reflect an adequate range of equilibrium conditions present in the dataset. While grouping facilitates sampling a wide range of conditions, a highly uneven number of equilibrium measurements per group is noted, illustrated in Figure . Thus, as traditional group sampling techniques are unviable, the holdout validation split is achieved by means of a randomized, stratified split, whereby dataset indices are randomly divided into training and holdout validation sets such that the proportion of groups relative to each other is held approximately constant. Stratified sampling ensures that a wide range of conditions is represented in both the training and holdout validation sets, while introducing randomization. Exactly 10% of the dataset is used to develop the holdout validation set for all models developed. As several models have been developed, cross-validation is performed by one of two approaches. For models investigating model dependency on individual data sources, approximately 30% of training set randomly grouped data to be used for testing, while the remainder is used for training such that data from the same group are not used for both training and testing for the cross-validation fold. For models which do not consider dependency on data sources, cross-validation indices are randomly selected such that exactly 30% of indices are used for testing and indices from the same group are likely to be present in both training and test sets for the cross-validation fold.

Figure 1

Multicomponent dataset grouping. Complete dataset grouping.

Model Development

While model G is the end product of this research, to assess model capacity to predict multicomponent data, multiple models for each dataset have been developed. Models A, C, and F have been developed from the multicomponent exclusive dataset, while models B, D, E, G, and J have been developed using the complete dataset. To assess the dependency of models on individual sources of data for certain ranges, additional models have been developed applying grouped cross-validation. Models A, B, C, D, E, and J undergo a randomized group split. Models F and G employ a standard nongrouped randomized split. Additional models demonstrating the effect of altering hyperparameters have also been developed; models D and E compare results from separate activation functions. All models developed are summarized in Table , with neural network topology, cross-validation grouping, and the results of cross-validation and holdout validation displayed where applicable.

Table 4

Summary of Models and Results

model	dataset	holdout	group	hidden layers	neuron count	activation	R² (CV)	σ (CV)	R² (holdout)
A	multicomponent exclusive	true	true	2	64	ReLU	0.90622	0.05410	0.95114
B	complete	true	true	2	256	ReLU	0.96106	0.01933	0.98686
C	multicomponent exclusive	false	true	3	256	ReLU	0.91382	0.04722
D	complete	false	true	3	256	ReLU	0.95199	0.03075
E	complete	false	true	2	128	Sigmoid	0.94695	0.04611
F	multicomponent exclusive	true	false	3	352	ReLU	0.98294	0.00736	0.97389
Ga	complete	true	false	3	352	ReLU	0.98703	0.00296	0.99315
Hb	multicomponent exclusive	true	false	0			0.60671	0.02469	0.68302
Ib	complete	false	false	0			0.87105	0.05564
J	complete	true	true	3	352	ReLU	0.95814	0.02154	0.98918

End product of this research.

Simple polynomial regression.

End product of this research. Simple polynomial regression. For each model, nine inputs are present corresponding to Table excluding pressure, which is the output. Each input node connects to each neuron present in the subsequent hidden layer, each node of which connects to all nodes of the following layer, be it another hidden layer or the output. As a regression is being performed, the model output consists of a single node, yielding equilibrium pressure. Initial random weights of the network have been initialized according to a normal distribution. Models consist of two or three hidden layers. The loss function to be minimized during model training has been selected as the mean-square error (MSE), provided in eq , selected to prioritize minimizing outliers. To ensure replicable results, the randomization of cross-validation and holdout validation are seeded such that successive iterations of the same model parameters and network topology sample the same indices during training and testing. Cross-validation and holdout sets are separately seeded. The seeding of model randomization allows the effect of different model parameters on results to be assessed. Several metrics have been employed to assess model performance, as model G is considered the end product of this research and model F is additionally compared to existing thermodynamic models to gauge multicomponent performance, and these two models undergo analysis using a wide range of metrics commonly included in regression studies, including the coefficient of determination or R2 score, mean absolute percentage error (MAPE), and root-mean-square error (RMSE). While the R2 score is typically the standard metric for use with regression problems when combined with other metrics such as RMSE, as the number of features in a model increases, there is a chance that poor regressions are not detected due to an apparently high R2 score. A metric such as the R2 score, which penalizes pressure predictions on test data both greater or less than the actual test value without reducing the error metric due to overshoot and undershoot balancing is sought after due to the relative scale allowing decisions on model accuracy to be made without comparison with other models. The adjusted coefficient of determination (Radjusted2), defined in eq , accounts for the data points and features employed. Similarly, the widely used MAPE metric has issues with large dataset models. Tofallis[36] proved that MAPE metrics are biased through tending to select underperforming models, while symmetric mean absolute percentage error (SMAPE), provided in eq , exhibits a relatively neutral bias While other absolute metrics are often employed for model comparison and accuracy assessment such as mean deviation, as this model has been developed to predict pressure in an unscaled fashion, whereby pressure may vary from 0.18 to 72.26 MPa, coupled with the non-Gaussian nature of the dataset, use of relative metrics has been prioritized when performing model selection. Due to differences in random sampling for training and testing between models, comparing absolute metrics is unlikely to indicate the better model with certainty, as it is possible that one model was tested using significantly more high-pressure samples than the other, causing a model performing relatively well to be passed over. For assessing multicomponent performance and source dependency, variance over 10-fold cross-validation and altering the random seeding are the primary means of identifying model flaws.

Hyperparameter Optimization

This investigation performs hyperparameter optimization by means of a grid-search procedure, whereby model parameters iterated through predefined values in a range to select a model achieving a close-to-optimal regression testing result over 10 cross-validation folds. A significant risk when attempting to optimize model hyperparameters is the possibility of overfitting. Cawley and Talbot[37] elaborate that overfitting is a risk whenever model selection is performed over a limited dataset, even if cross-validation is applied. Iterating hyperparameters risks providing a regression highly specific to the sample data, thus overfitting while lowering generalization ability for unseen data.[37] The validation practice employed in this investigation serves to mitigate the risk of loss in generalization capacity. By performing a grid-search on cross-validated model proceeded by testing using an unseen holdout validation set, it is possible to avoid overfitting based on holdout testing results. Due to the computational times involved in training highly nonlinear, multilayered neural network regression models, parameter iterations are predefined, thus obtaining the global minimum is unlikely. Additionally, variance in results between successive trained models may result in the highest scoring model alternating between several different parameter combinations. Nevertheless, an iterative approach allows for significant improvement to results, likely yielding results close to optimal. Parameters selected to undergo grid-search include the number of hidden layers, neurons, training epochs, batch size, hidden layer activation, and the optimizer used to minimize the training loss function. These parameters are assigned a wide range of predefined values to facilitate the grid-search. Results exhibiting a high mean R2 with acceptable standard deviation undergo holdout validation to ensure the model is not overfitted. Models are further tested by varying the random seeding to ensure that results are not merely valid for a favorable train-test split. The best model is selected based on the criteria of adequately low cross-validation variance, and a high-scoring holdout validation. The presence of hidden layers allows nonlinearity to be introduced into the model by means of activation functions,[27] and at least one hidden layer must be present. The activation function governs how neurons transform input. Leshno et al.[27] detail that any continuous function can be modeled by a nonconstant, nonpolynomial activation function. Several nonlinear activation functions are selected to be iterated as part of the grid-search procedure, including the rectified linear unit (ReLU) and sigmoid activation functions. Glorot and Bengio[38] discuss the difficulty of implementing the sigmoid activation for deep learning application. The ReLU hidden-layer activation function successfully introduces nonlinearity into the system and allows complex behavior to be learned.[39] Several activation functions in a deep learning context considered in this investigation are provided and discussed in Nwankpa et al.[40] The means by which the model training loss is reduced is governed by use of an optimizer, which minimizes a cost function during training.[41] The optimizer should be selected such that the global minima, rather than some local minima, is obtained when minimizing loss. The stochastic gradient descent (SGD) optimizer, while widely used in machine learning,[41,42] results in additional parameters requiring tuning, raising computing time. While application of the step-decay method or alternate learning rate schedulers, which reduce the number of parameters, could reduce computation time,[42] an alternative is to utilize optimizers which incorporate adaptive learning rates. Several such adaptive optimizers are investigated as hyperparameters during grid-search, such as adam, a gradient-based optimizer, which is introduced and discussed in Kingma and Ba.[41] Adam allows for fewer hyperparameters to be iterated and thus simplifies model selection. In all neural network models with the exception of Model E, which employs a softplus activation function, a linear activation function is used for the output layer so as to provide a continuous, real output.

Topology and Parameter Selection

Neural network topology and parameters are selected following hyperparameter optimization. The traditional approach of selecting parameters which lead to a minimized validation loss is not possible due to the application of cross-validation. Instead, parameters are selected based on the grid-search iteration results, whereby the best-performing models over 10-fold cross-validation undergo holdout validation, after which the best model is selected. As such, the number of hidden layers, neurons, activation functions, epochs, batch sizes, optimizers, and regularization practices are selected based on model results from a range of iterated parameters, rather than being predefined. Several constraints are present in terms of topology. Iterated parameters also include number of hidden layers and neurons. All hyperparameter optimization searches are restricted to at least one hidden layer, and a maximum of four hidden layers. Furthermore, the number of neurons within each hidden layer is held constant across all layers; for instance, if hidden layer 2 consists of 100 neurons, hidden layer 3 will also contain 100 neurons. Similarly, hidden layer activation functions are held constant across layers. These constraints have been made so as to allow a large number of different parameters to be iterated over a wide range of values and cases within an acceptable training time frame. The introduction of regularization to the neural network has been implemented through dropout practices, which serve to improve generalization capability through preventing overfitting by disabling random neurons in the network during training. Dropout allows the occurrence of dead neurons to be minimized, whereby neurons are gradually assigned weights so low that they cease to alter input in any significant manner, and pairs well with the ReLU activation function for this reason. A constraint on the dropout is that each hidden layer employs the same dropout as other layers. Dropout is not performed on input or output layers.

Comparison with Existing Industrially Accepted Models

While many different models, thermodynamic and machine learning alike, have been developed to predict gas hydrate equilibrium pressure for a variety of input features, as machine learning models require a substantial amount of data to adequately generalize behavior, unless the exact same dataset has been used across model training and testing for separate models, a comparison between the results of this study and others is not possible through simply comparing reported test accuracies. It is however possible to compare models by assessing prediction capabilities through calculating accuracies for each model through use of the datasets compiled in this research. As discussed, thermodynamic models have garnered significant application in industry, and any proposed model must be able to verifiably at least compete with some existing popular models. Through obtaining accuracies for proven successful models, benchmark metrics may be obtained, which serve as an indication as to whether or not the proposed model is successful enough to see practical application. Two popular thermodynamic models have been tested using the sample datasets with the goal of assessing performance for both the complete and multicomponent exclusive datasets. CSMHYD is one such thermodynamic model which has been used to predict equilibrium pressure for both datasets, courtesy of the Colorado School of Mines (1998), is an older thermodynamic model which still sees significant use, and can be seen in comparative studies conducted by Ballard and Sloan[13] as being competitive with many more modern thermodynamic models. Comparison with a cutting-edge thermodynamic model is also desired, and as such Multiflash (v 7.0.40) has additionally been tested using the sample datasets, courtesy of KBC Advanced Technologies Limited. This software has seen extensive industrial application, and achieving a competitive performance against such a model would prove to greatly elevate the credibility of models developed in this research. It must be emphasized however that the models developed in this investigation are merely designed to predict gas hydrate equilibrium pressure for uninhibited, sweet natural gases in the absence of significant electrolytes, while software such as Multiflash provides a host of features. To use these thermodynamic models, a number of assumptions are required, notably that the sample was saturated with water. This is justifiable as all data collected have been experimentally determined, typically with excess water present. The exact quantity of water is not required, and varying the amount from saturation has a minimal effect of results. Furthermore, the C5+ fraction is completely assigned to n-pentane. This is done as although Multiflash facilitates component lumping, the CSMHYD executable does not. The effect of this has been assessed for the data being predicted for both models, and in the overwhelming majority of cases, results in a deviation in predicted pressure by less than 2% when changing the spread of C5+ fractions across various heavier hydrocarbons, both lumped and not lumped. This minimal deviation is expected as only hydrate structures sI and sII are investigated by this model, neither of which can accommodate molecules larger than n-butane as hydrate formers. Based on the C5+ fraction not actively forming hydrates, the results of the comparative studies, and the slight differences in using lumped fractions versus simply assigning the entire C5+ fraction to n-pentane, this factor is considered insignificant to assessing the overall predictive capabilities of the models, and all equilibrium predictions involving C5+ have been assigned to n-pentane. Studies which assess thermodynamic model performance typically investigate single-component, binary, ternary, and natural gas performance separately so as to identify possible weaknesses for a model;[13] however, as the majority of data used to develop the neural network models in this investigation have been used for model training, only a small portion of the original dataset remains for neural network validation, which has not been seen by the model. As such, comparison with both multicomponent and complete datasets will help reveal possible weak points for the thermodynamic model and provide an unbiased means of comparing model performance when using a smaller sample set which, due to the non-Gaussian nature of the dataset, may not be representative of the remaining data. To compare these thermodynamic models with the developed neural network models, it is necessary to perform tests using the entirety of both complete and multicomponent datasets, in addition to the holdout indices for these datasets used to provide final validation metrics for the neural network model. As sufficient data have been collected during data compilation, it is possible to conclude that the neural network model must compete with the best result of these two sample tests. Accuracy for these thermodynamic models has been assessed according to the same metrics used to judge the end product of this research, measured from the differences between experimental and predicted equilibrium pressures.

Results

Due to the numerous criteria used to evaluate model performance, including multicomponent data prediction and dependency on individual data sources, multiple models have been developed and are summarized in Tables and 5. Hyperparameter optimization has been performed on models employing holdout validation. Note that model selection based on holdout validation results has been limited to avoid leaking holdout validation data, as opposed to hyperparameter optimization whereby model selection is performed from a wide range of conditions. Model G is considered the end product of this research, trained and tested using the complete dataset and cross-validated using a 10-fold nongrouped sampling approach, while undergoing holdout testing by means of stratified sampling to ensure overfitting has not occurred during hyperparameter optimization. Model G has achieved satisfactory results, and all relevant metrics for both training and testing are listed in Table . The model uses a topology summarized in Table , with a single neuron linear activated output and three hidden layers each consisting of 352 neurons using the ReLU activation function and a dropout of 5% so as to avoid dead neurons arising from use of the ReLU activation function. The 10-fold cross-validation testing yields a mean R2 score of 0.9870 and a standard deviation of 0.0030. MAPE metrics for testing indicate an average performance of 7.019% with a standard deviation of 0.723% across the 10 cross-validation folds. For cross-validation, the discrepancy between R2 and Radjusted2 is relatively low, as is also the case with differences between MAPE and SMAPE metrics. These results indicate that variance between cross-validation folds is sufficiently low and generalization of behavior has been adequately performed. It is worth emphasizing that cross-validation is merely a tool which allows further insight into the ability of a model to predict unseen data and to prevent overfitting. The final assessment of model performance is the accuracy of holdout validation. For model G, holdout validation indicates an R2 of 0.9931 and a mean absolute percentage error of 5.558%, indicating the model is well suited toward predicting unseen data. As the sample size for holdout validation is lower than that used for cross-validation testing, adjusted metrics prove more significant, with Radjusted2 = 0.9926 and SMAPE = 5.689%. Overall, both cross-validation and holdout validation indicate the model is unlikely to be overfitted, has successfully generalized gas hydrate equilibria, and achieves satisfactory performance. Further hyperparameters not listed in Table include 8000 training epochs and a training batch size of 256. The model is proven to outperform two popular thermodynamic models, with a comparison provided in Table . Positive results are largely insensitive to exact model parameters, thus indicating a robust, replicable methodology. It must be emphasized that the model is only applicable to sweet, uninhibited natural gases which have undergone phase separation. Hydrogen sulfide was excluded as a model feature in this investigation due to poor grouped cross-validation performance.

Table 5

Optimized Model Average 10-Fold Cross-Validation Performance

model	sample size	R²	R² adjusted	MAPE (%)	SMAPE (%)	RMSE
model F (train)	422	0.997187 ± (0.002007)	0.997125 ± (0.002051)	3.254632 ± (0.931851)	3.250130 ± (0.893337)	0.441012 ± (0.156115)
model F (test)	181	0.982944 ± (0.007355)	0.982046 ± (0.007742)	7.701709 ± (1.041027)	7.563736 ± (0.877334)	1.175228 ± (0.270948)
model F (holdout)	67	0.973893	0.969770	7.535356	7.367613	1.308022
model G (train)	762	0.997900 ± (0.001391)	0.9978405 ± (0.00143)	2.698079 ± (0.703972)	2.702965 ± (0.718815)	0.483724 ± (0.151075)
model G (test)	326	0.987034 ± (0.002958)	0.9866648 ± (0.003042)	7.019491 ± (0.722523)	7.004434 ± (0.803799)	1.220705 ± (0.128513)
model G (holdout)	121	0.993147	0.992591	5.557521	5.688998	1.010064

Table 6

Comparison of Developed Models with Existing Thermodynamic Models

model	dataset	validation samples	R²	R² adjusted	MAPE (%)	SMAPE (%)	RMSE
model F	multicomponent (holdout)	67	0.973893	0.969770	7.535356	7.367613	1.308022
model Ga	complete (holdout)	121	0.993147	0.992591	5.557521	5.688998	1.010064
CSMHYD	multicomponent	670	0.947842	0.947131	14.690349	13.846418	2.028001
CSMHYD	multicomponent (holdout)	67	0.943707	0.934819	14.899993	13.892934	1.920705
CSMHYD	complete	1209	0.960327	0.960029	11.731032	11.246104	2.210369
CSMHYD	complete (holdout)	121	0.967904	0.965302	12.060525	11.435506	2.185910
multiflash	multicomponent	670	0.953865	0.953236	12.681098	12.098921	1.907313
multiflash	multicomponent (holdout)	67	0.948401	0.940254	12.541839	11.715145	1.838883
multiflash	complete	1209	0.976174	0.975995	10.127942	9.635545	1.712942
multiflash	complete (holdout)	121	0.977557	0.975737	10.246058	9.632095	1.827887

End product of this research.

Cross-validation 10-fold variance across multicomponent exclusive neural network models. Cross-validation 10-fold variance across complete dataset neural network models. Note model G is the end result of this research. End product of this research.

Discussion

Hyperparameter optimization performed by means of a grid-search procedure indicates that models comprising at least two hidden layers, with a high neuron count, are capable of achieving sufficiently accurate and replicable results. To lower computation times and the risk of overfitting, no more than four hidden layers are investigated, and hidden layer neurons and activations are constant across layers. The factor of training time proves significant considering the large number of parameters iterated during hyperparameter optimization. Hyperparameter optimization revealed that the best models have, in most cases, been developed through the use of the adam optimizer, which has been applied to all neural networks listed in Table . While a significant number of publications in the field employ the sigmoid activation function for hidden layer activation, during grid-search procedures, it was noted that sigmoid models exhibited longer training times or required more epochs to achieve near-optimal solutions. This behavior is expected, based on the observations of Glorot and Bengio[38] and Maas et al.,[39] due to deep learning application. Models applying ReLU to hidden layer activation are observed to usually achieve favorable results in less time than sigmoid models. While ReLU activation can result in dead neurons for large networks, only three hidden layers are present in the final model with 352 neurons per hidden layer, and ReLU activation does not result in significant issues. Results between models applying these two hidden-layer activations do not differ drastically, as indicated by complete dataset models D and E, which utilize ReLU and sigmoid activation, respectively, applying ungrouped cross-validation without significant model selection in the absence of a holdout. All remaining neural networks listed in Table have thus been developed using the ReLU activation function to govern hidden layer activation so as to reduce the time taken for model development and grid-searches. Altering the output function for model E from linear to softplus presented no significant effect on parameter search times; thus, all other models have been developed using a linear output function. Traditional L1/L2 regularization was attempted; however, results from implementing dropout regularization practices instead proved significantly more effective at reducing overfitting. While dropout is typically implemented in machine learning studies above 20% neuron dropout, the effect of high percentage dropout can be seen in Figure (left panel), whereby a 30% dropout applied to a neural network with the same toplogy as model G causes a rapid divergence of validation error from training error. Excess dropout leads to the testing error rapidly diverging from the training error after a few epochs, and is unwanted. A 5% dropout has been identified as an optimal solution and has been applied to model F and model G. The application of dropout has paired well with the ReLU activation function, applied to the hidden layers of model F and model G, and serves to prevent extremely low weight neurons. By deactivating 5% of neurons randomly, the network can exit a cycle of ever-decreasing neuron weights and thus serve to prevent dead neurons.

Figure 3

Effect of excess dropout applied to hidden layers of model G. Early loss divergence unwanted. Loss reported each epoch. Model G regression plot. Validation samples from holdout validation set, not used in model training.

Comparison of Results with Existing Thermodynamic Models

To assess the ability model G relative to successful models, CSMHYD and Multiflash have been selected and tested using the entirety of both the multicomponent and complete datasets and further tested using only the holdout samples used to provide a final evaluation of the neural network model performance. Results for each of these cases are provided in Table . It is possible to place the results presented in Table in perspective through examining the works of Ballard and Sloan,[13] which examines the equilibrium pressure prediction accuracy of several thermodynamic models. The study was conducted using a wide range of data and includes an analysis for uninhibited systems. The dataset used to assess performance in this regard consisted of 1685 data points, 161 of which are gases with three or more components, which constitutes multicomponent gases in this investigation. The study assessed the MAPE for pressure predictions for each model. For the entire dataset, CSMHYD and Multiflash yielded approximately a 8.7 and 9.2% errors, respectively. For ternary component mixtures, CSMHYD and Multiflash yielded approximately 16.4 and 16.7% errors, respectively. Finally, for natural gas mixtures, CSMHYD and Multiflash predicted pressure with approximately 12.1 and 9.6% errors, respectively. As can be seen from Table , results from using the dataset developed in this investigation to assess CSMHYD and Multiflash differ from the Ballard and Sloan[13] study to some extent. While values for multicomponent prediction fall between the Ballard and Sloan[13] ternary and natural gas prediction values, results for testing the entire dataset differ most significantly. Testing the complete dataset of 1209 samples revealed 11.7 and 10.1% MAPE errors for CSMHYD and Multiflash, respectively. The reason for this occurrence can be explained through examining the contents of the test set; of the 1685 data points Ballard and Sloan[13] use to test model performance, 632 are single-component samples, yielding vastly more accurate predictions for both thermodynamic models than for other gas mixtures, and 747 further samples are binary. In this investigation, 670 samples consist of at least three components, significantly more than the 161 multicomponent gas points used in Ballard and Sloan. As such, the less favorable errors reported in Table can be explained through the models being tested using significantly more difficult-to-predict samples, with the tendency for Multiflash to provide better natural gas predictions than CSMHYD illustrated in Ballard and Sloan[13] causing the program to perform better than CSMHYD for the datasets developed in this investigation. As such, it can be seen that errors for CSMHYD and Multiflash for this investigation correlate to an extent with existing literature, thus ruling out the possibility that potential inaccurate dataset points present render model comparisons void. Having discussed how the values in Table for the two thermodynamic models fall within the expected error margins, comparison may now be drawn between the performance of these models and the optimized models of this research, model F and model G. As the neural network models have been trained using 70% of data made available to cross-validation following the 10% holdout validation split, the entirety of the datasets cannot be used to assess performance. This would unfairly favor the machine learning model above the thermodynamic model. As such, the holdout validation set is used to test all models to provide a fair comparison. Concerning the thermodynamic models, there is not much difference between the results of holdout and entire dataset testing. This is largely due to the stratified manner in which holdout indices are generated and likely serve to preserve the ratio of multicomponent, binary and pure methane samples to an extent across both test cases. For both multicomponent and complete dataset testing, it can be seen that model F and model G perform significantly more accurately than either thermodynamic model. Furthermore, Multiflash can be seen providing more accurate predictions than CSMHYD. Model G, the end product of this research, is seen achieving a SMAPE 5.69% compared to the SMAPE of 9.32% for Multiflash tested using the holdout validation samples of the complete dataset, a 3.63% improvement. Multicomponent performance proves to yield even larger improvements, with model F yielding a SMAPE of 7.37%, a 6.53% improvement over CSMHYD and 4.35% improvement over Multiflash when considering holdout testing. Additional models investigated in the Ballard and Sloan[13] study can be seen to perform similarly to CSMHYD and Multiflash. This occurrence may be explained through most thermodynamic models being based on the van der Waals and Platteuw[11] model. Based on the MAPE scores listed for pressure predictions for the other thermodynamic models listed in Ballard and Sloan,[13] it can be seen that model G could likely compete with all of the listed models. To prove this, further testing would naturally be required, and extrapolating conclusions based on relationships between separate studies using different datasets should be avoided. Overall, it may be concluded that the end product of this research can at the very least compete with existing thermodynamic models and likely could significantly serve to improve gas hydrate equilibrium pressure calculations. Having compared model performance against that of existing thermodynamic models, the research objective has been met. While the model is limited to uninhibited, sweet natural gases having undergone phase separation, these results serve as an indicator that the methodology could certainly be applied to a universal model to include these features and could compete with existing models. Having assessed the absolute performance of the machine learning models for various metrics and against existing popular models, it is now necessary to analyze the results of cross-validation and holdout validation in detail so as to assess multicomponent performance, model dependency on individual sources of data, and prove that results are not merely representative of favorable combinations of training and testing data.

Multicomponent Data Prediction

To assess multicomponent data prediction, results of several models developed using the complete dataset are compared with models developed from the multicomponent dataset. As absolute performance is not as important for comparative purposes as it is for assessing the optimized models G and F developed in this investigation to potentially see practical application, assessing the variance of R2 score coupled with MAPE should be sufficient. Note that due to the relatively low number of model features and the size of the datasets, adjusted metrics such as Radjusted2 do not differ vastly from the original metric. Overall, complete dataset models achieve significantly more accurate R2 scores in terms of average cross-validation performance and lower variance over 10-folds, in addition to holdout validation performance. This observation is consistent for models applying both grouped and ungrouped cross-validation. Models A and B have been developed from the multicomponent and complete datasets, respectively, applying grouped cross-validation. Hyperparameter optimization results in different parameters used to develop these models. Model B achieved a mean cross-validated R2 score of 0.9611, 0.061 higher than that of model A, with a significantly lower standard deviation of 0.0193, as reported in Table . A comparison between models F and G, which have been developed using the multicomponent exclusive and complete datasets, respectively, reveals similar behavior, despite applying ungrouped cross-validation. While not as extreme as the disparity in results between models A and B, model G performs significantly better than model F in all assessed criteria, notably a slightly improved average cross-validation R2, and a 0.68% lower average MAPE, as indicated in Table . A key observation, however, is that results for model F are acceptable, despite the model being developed from a multicomponent exclusive dataset. With a cross-validation R2 of 0.9739 ± (0.0074), MAPE of 7.702% ± (1.041%), holdout validation R2 score of 0.9739, and a mean absolute percentage error of 7.535% indicating overfitting is unlikely, it can be seen that even without methane binary data present in the dataset, results are significant and could be applied to real natural gas hydrate equilibrium predictions. It may be inferred from the significant disparity between the results of model F and model G that the presence of pure and binary methane data in the complete dataset likely assists the model in predicting multicomponent data for ranges where multicomponent training data are sparse. This behavior may be attributed to binary mixtures such as methane–propane being capable of exhibiting sII hydrate formation, thus providing a basis for predictions for multicomponent gases under ranges of conditions where few multicomponent data points are present during training. The prediction of sII hydrate equilibrium is of significant importance, as evidenced by the findings of Tohidi et al.,[35] which details that sII hydrates are most likely to be the dominant structure when considering natural gas hydrates at equilibrium for usually encountered conditions. As such, it is likely that not all of the increased model performance for complete dataset models is due to the presence of easier-to-predict equilibrium points. In addition to assisting equilibrium predictions in ranges with sparse multicomponent data, the presence of pure and binary methane components may serve to introduce a damping effect into model training, whereby the potential inaccuracy resulting from erroneous multicomponent equilibrium measurements could be reduced through the presence of these widespread, easier-to-predict equilibrium measurements. This effect is likely responsible for the minimum R2 score for an individual fold exceeding 0.97 for model G, in terms of the results illustrated in Figure (right panel) in addition to the seed tests present in Figure (bottom left panel). As such, results are favorable to multicomponent exclusive model F. Were the increased minimum-fold score merely due to easier-to-predict data, the disparity between weak folds in seed tests would not be so great. Finally, due to the acceptable results of multicomponent models such as model F, it can be seen from models B and G that the inclusion of pure and binary component data into the training dataset does not hamper the model or provide grounds for dismissing model results, but rather serves to improve model generalization capability where limited training data are available. Thus, model G is likely to accurately predict unseen multicomponent equilibria within the range of conditions present in the complete dataset.

Figure 2

Figure 4

Cross-validation seed tests displaying 10-fold validation results for various train-test split randomizations, with the purpose of proving that reported results are not merely valid for a favorable seeding, and further serving to identify weak folds arising from unfavorable combinations of data.

Source Dependency

Despite the weaker regressions yielded, multicomponent exclusive models do not perform impermissibly poorly. While model A achieves a mean cross-validated regression coefficient barely exceeding 0.9, this is largely due to the limited data available in the multicomponent dataset, combined with the limited range of data available to model training as a result of performing grouped cross-validation. This is confirmed by a holdout validation R2 > 0.95 and a reasonable cross-validation standard deviation. Model C has been developed without a holdout set in an attempt to improve model A by providing more data to cross-validation. While hyperparameter optimization cannot be performed a holdout, results can still yield useful information. Model C yields minimal improvements to model A compared to the disparity between results of models A and F. This indicates that an inadequate distribution of data has been made available to model training during cross-validation and thus indicates a degree of source dependency. From Figure , model A can be seen providing both highly accurate (R2 > 0.98) and poor (R2 < 0.9) predictions for most folds, while extremely poor predictions of R2 ∼ 0.825 are noted for a small number of folds. Contrarily, ungrouped model F displays a relatively low variance in results, the lowest fold score exceeding R2 = 0.94. Thus, the weaker performance of model A may be attributed to grouped cross-validation offering a less balanced distribution of data available to model training. Increasing the data available to cross-validation such as with model C is insufficient to ensure an adequate distribution of training data. Unlike the stratified manner in which holdout validation indices are sampled, grouped cross-validation completely excludes entire groups from training for a specific cross-validation fold, thus causing extra data supplied to each group to have a minimal impact on the distribution of training data. While results of the complete dataset group cross-validated model B are a significant improvement over model A, analysis of the variance in cross-validation results reveals that individual folds may yield unexpectedly poor regressions, despite acceptable mean results. Unlike model A, model B exhibits a relatively low standard deviation, thus indicating that most cross-validation folds present similar R2 scores. Figure (right panel) illustrates the relatively low variance of model B, achieving a minimum-fold R2 score of approximately 0.925. However, on altering the random seeding used to determine cross-validation train-test indices, model B displays unexpectedly poor regressions for certain folds. The seed tests in Figure (top left panel) facilitate the identification of individual weak folds arising from unfavorable train-test combinations of data. A variance in results when altering the cross-validation random seeding is expected due to the non-Gaussian nature and limited size of the dataset. Furthermore, as hyperparameter optimization is conducted using a constant random seeding, it is expected that other cross-validation seeds will display varied results, but similar interquartile ranges over 10-fold cross-validation. Figure (top left panel) indicates that box plots do correlate to a certain degree with the interquartile range for model B in Figure (right panel); however, certain train-test combinations lead to exceptionally weak folds, with R2 scores as low as 0.805 (Figure ). Unlike model A, where poor predictions are noted across the interquartile range, these poor regressions are the result of singular weak folds, likely due to the model attempting to predict equilibrium pressure in ranges represented by limited sources, where binary and pure component data are insufficient to supplement a lack of multicomponent data. This outlier only appears in a few random seedings, due to near-completely isolated groups, absent from the train set. For holdout validation, weak folds are accounted for by isolated groups present in both the training and holdout sets. While it is possible that several inaccurate data points are present in the datasets, due to dataset size and sampling methodology, it is unlikely that this would have such a significant effect on weak folds. Additionally, as is demonstrated by the consistent performance of model G across seed tests in Figure (bottom left panel), no exceedingly weak folds are present. Thus, weak folds in Figure (top left panel) for model B are most likely due to ranges of data isolated during cross-validation.

Figure 5

10-Fold cross-validation train-test indices. Note grouped cross-validation implies that data from the same group cannot be present in both training and test sets for the same cross-validation fold.

10-Fold cross-validation train-test indices. Note grouped cross-validation implies that data from the same group cannot be present in both training and test sets for the same cross-validation fold. As with comparisons between multicomponent exclusive, grouped cross-validated models A and C, complete dataset models D and E offer no significant improvement over model B, despite the greater amount of data available to cross-validation. Model G achieves greatly improved results over model B, largely through applying ungrouped cross-validation. This factor is observed across multicomponent and complete dataset models. The disparity between results of grouped and ungrouped cross-validated models for both datasets further infers that a dependency on individual data sources for certain ranges of conditions is likely. This dependency is assessed through grouped cross-validation, as ranges of conditions considered by only one or two sources may be used to form the test set for a specific cross-validation fold, thus resulting in the training set lacking data in the range of conditions considered. This explains the high variance for grouped cross-validated models illustrated in Figure (left panel) and unusually weak folds present for these models illustrated in the seed tests of Figure (top left and right panels). Significant model dependency on individual sources is an unwanted factor which could potentially result in poorer results than expected from the reported model variance when predicting the equilibrium conditions of a gas unseen by the model during training. The presence of experimental or measurement error is of great concern when considering both datasets. Due to the limited availability of multicomponent gas hydrate equilibrium data, and a wide range of conditions being required to develop a useful model, it is required to include in the data sampling campaign historical sources employing older experimental methodologies and measurement apparatus. Nevertheless, the discussed damping and supplementary factors introduced through the complete dataset in addition to the large number of data points sampled from various experimental studies serve to reduce the potential impact of inaccurate data points on model generalization capability.

Model Generalization

A notable observation from the results in Table is that for grouped cross-validated models, holdout validation results prove favorable to the mean 10-fold cross-validation results. This is expected, as cross-validation removes data from model training for testing, while holdout validation allows the model to be trained with all cross-validation data. As a result of the stratified selection of holdout indices, a wider range of conditions is represented in the holdout validation results than cross-validation. Due to the reduced disparity between the holdout validation results of complete dataset models B and G when compared to the magnitude of difference when considering cross-validation results, it may be inferred that despite significant variation of model hyperparameters, both models are capable of generalizing gas hydrate equilibria from the complete dataset. The holdout validation R2 difference between models B and G differs by only 0.0063, indicating that sufficient independent data groups are present in the dataset to facilitate grouped cross-validation. This inference is made possible through randomly seeding holdout indices such that the validation set is constant across models of the same dataset, as illustrated in Figure . To belay concerns that the hyperparameters and neural network topologies of models B and G differ too significantly to render model comparison meaningful, model J has been developed to use the same hyperparameters as model G, while employing grouped cross-validation as with model B. As expected, model J achieves highly similar results to model B, with the slight difference explained by means of the bias-variance trade-off. From both models B and G achieving accurate regressions, it can be inferred that the results of model B could serve as a lower estimate to predictions made using model G when gas hydrate equilibrium pressure is predicted across ranges where suspect groups are present. A similar inference can be drawn based on the difference in results between models A and F. Despite the reliance on individual sources of data proving to be worth considering, results indicate that neural network models are capable of generalizing the behavior of the multicomponent dataset. Results of model F indicate that the neural network models are likely to predict multicomponent data accurately, within the expected margins of error, indicated by the 10-fold cross-validation variance and the lack of extremely weak single folds present in Figure (left panel). To prove that the results for model F are not merely representative of a favorable random seeding, Figure (top right panel) has been developed, indicating no model exhibits significantly worse single folds than those for model F in Figure (left panel). It can thus be concluded that the range of data available to model training and testing covers a wide range of conditions and renders the model reliable across the range of conditions present in the dataset.

Figure 6

Holdout validation indices for both datasets. Note the stratified manner by which data are sampled, the relative proportion of samples per group is held approximately constant across training and test sets.

Model Flexibility

Results indicate that due to the size of the datasets, augmenting further data or removing certain entries poses little threat to the existing optimal neural network topologies or hyperparameters requiring drastic change, as seen by models F and G, which achieve good results while applying the same neural network topology despite major differences between the two datasets. As such, through physical data gradually becoming available while the model is practically applied, or by accessing new sources of data, data may be easily incorporated into existing datasets and the model may thus be improved by using the existing topology and hyperparameters. Results are largely insensitive to exact parameters, thus indicating a robust, replicable method. This implies that dataset changes may not require a computationally intensive hyperparameter optimization procedure. Hence, it can be seen that a major advantage of the methodology is its flexibility, and the ability to rapidly augment new data into the model. This would prove an immense advantage for operation where proprietary equilibrium data are available, as over time, a highly accurate model specifically tailored toward operation over a narrow range of conditions could be developed without significant effort. Due to the methodology proven as robust and replicable, it is expected that other researchers would prove to be able to utilize the design choices and validation practices provided in this research to build models for similar thermodynamic applications.

Model Overview

Ideally, complete dataset models would be tested only using multicomponent data; however, it is essential to test model performance on pure methane and binary methane mixtures even though these are not relevant to natural gas mixtures, as ensuring the model has not been overfitted is of paramount importance. A more prudent final validation practice, when considering practical application of a model developed, would be to test the performance of these models on completely independent equilibrium natural gas mixtures, which have not been seen during model development and are relevant to conditions likely encountered during operation. In summary, several aspects of the model have been assessed: the ability of the model to predict equilibrium data over a wide range of conditions, the ability of the model to predict multicomponent data, and the dependency of the model on individual sources of data used in training and thus the potential for experimental or measurement errors to influence the final neural network. This investigation has proven through means of an extensive 10-fold cross-validation procedure, and stratified holdout validation approach, whereby validation data are proportionally sampled from each independent experimental source of data, that the model does indeed accurately predict equilibrium pressure over a wide range of conditions. Comparison with existing models further reinforces this. Seed tests for various models further prove that model results are not merely valid for a favorable cross-validation random seeding. While the final model developed is likely to have some susceptibility to dataset errors, the wide range of data sampled and damping factors associated with the complete dataset serve to minimize the impact of these errors on generalization. The results of model B and low variance of model G indicate accurate multicomponent predictions. The model designed to be applied to real natural gas hydrate prediction is model G.

Conclusions

In this investigation, a neural network multilayer perceptron has been trained by means of a backpropagation in a supervised manner to successfully model gas hydrate equilibrium with an emphasis on multicomponent gases, yielding an average cross-validation-adjusted coefficient of determination of 0.9867, a standard deviation of 0.0030, and a holdout validated Radjusted2 of 0.9926. All data used in the development of the models have been obtained through a robust data sampling campaign conducted on existing gas hydrate equilibrium measurements present in the literature. Through the use of large datasets sampled from a wide range of conditions, a highly flexible model has been developed and positive results have been achieved, which are largely insensitive to model parameters, thus indicating a robust, repeatable methodology. Further improvement to generalization capable through dataset alterations without hyperparameter changes is possible. The end product of the model is the equilibrium pressure of the desired gas mixture at a certain temperature. Sampled features include the system temperature and gas-phase composition, including methane, ethane, propane, iso-butane, n-butane, carbon dioxide, nitrogen, and a lumped fraction of molecules of five or more carbon atoms. Hydrogen sulfide was excluded as a feature due to lack of independent data sources. The model is limited to sweet, uninhibited gases proceeding phase separation. Model validation applies both 10-fold cross-validation and holdout validation, facilitating the optimization of model parameters without a significant risk of compromising model generalization capability through overfitting, and ensures that a wide range of conditions is used to both train and test models. A comparison with two popular thermodynamic models indicates that the model likely could serve to greatly improve equilibrium predictions in the field. An emphasis has been placed on ensuring that the developed model can successfully predict multicomponent equilibria, and a detailed analysis on the susceptibility of models to experimental error through dependency on individual data sources has been performed through the grouping of independent data sources. The final model developed for this investigation has been found to adequately predict multicomponent hydrate equilibrium pressure. While a degree of source dependency has been established, the damping effect introduced through the inclusion of pure and binary methane gas equilibrium data in the complete dataset serves to mitigate potentially poor results which fall outside the expected variance in results, and models designed to apply grouped validation practices may serve as a lower estimate of model performance in ranges of conditions with sparse independent experimental data. As such, generalization can be assumed. Further emphasis has been placed on ensuring that results are statistically significant, with 10-fold validation and seed tests proving that model results are universally valid across the dataset, and not merely valid for favorable randomizations. Overall, data sampling, model development, and comparison have achieved the research objectives, and they indicate that machine learning models could supplant traditional models in terms of thermodynamic predictions.

Table A1

Randomly Selected Equilibrium Samples and Respective Model G Pressure Predictionsa

T (°C)	C₁	C₂	C₃	i-C₄	n-C₄	C₅₊	CO₂	nitrogen	P_actual (MPa)	P_predicted (MPa)
27.407	0.8724	0.0757	0.0308	0.0051	0.0079	0.0040	0.0000	0.0040	31.522	30.608
22.061	0.8536	0.0826	0.0638	0.0000	0.0000	0.0000	0.0000	0.0000	6.619	7.025
4.550	0.9725	0.0142	0.0108	0.0025	0.0000	0.0000	0.0000	0.0000	1.571	1.500
16.600	0.8799	0.0541	0.0420	0.0000	0.0240	0.0000	0.0000	0.0000	3.686	3.610
13.100	0.9597	0.0000	0.0000	0.0403	0.0000	0.0000	0.0000	0.0000	2.580	2.745
0.440	0.9821	0.0057	0.0019	0.0003	0.0004	0.0003	0.0011	0.0082	2.842	2.743
18.290	0.8654	0.0523	0.0183	0.0038	0.0050	0.0236	0.0312	0.0004	9.279	9.320
12.630	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	9.619	9.540

Samples taken from holdout set, and are not used for model training.

4 in total

1. Gas hydrate nucleation and cage formation at a water/methane interface.

Authors: Robert W Hawtin; David Quigley; P Mark Rodger
Journal: Phys Chem Chem Phys Date: 2008-07-23 Impact factor: 3.676

2. The chemistry of low dosage clathrate hydrate inhibitors.

Authors: Andrea Perrin; Osama M Musa; Jonathan W Steed
Journal: Chem Soc Rev Date: 2013-01-10 Impact factor: 54.564

3. Structure and composition analysis of natural gas hydrates: 13C NMR spectroscopic and gas uptake measurements of mixed gas hydrates.

Authors: Yutaek Seo; Seong-Pil Kang; Wonho Jang
Journal: J Phys Chem A Date: 2009-09-03 Impact factor: 2.781

4. Pectin as an Extraordinary Natural Kinetic Hydrate Inhibitor.

Authors: Shurui Xu; Shuanshi Fan; Songtian Fang; Xuemei Lang; Yanhong Wang; Jun Chen
Journal: Sci Rep Date: 2016-03-21 Impact factor: 4.379

4 in total