
A novel machine learning approach to predict the export price of seafood products based on competitive information: The case of the export of Vietnamese shrimp to the US market.

Nguyen Minh Khiem¹,², Yuki Takahashi³, Hiroki Yasuma³, Khuu Thi Phuong Dong⁴, Tran Ngoc Hai⁵, Nobuo Kimura³.

Abstract

Predicting the export price of shrimp is important for Vietnam's fisheries. It not only promotes product quality but also helps policy makers determine strategies to develop the national shrimp industry. Competition in global markets is considered to be an important factor, one that significantly influences price. In this study, we predicted trends in the export price of Vietnamese shrimp based on competitive information from six leading exporters (China, India, Indonesia, Thailand, Ecuador, and Chile) who, alongside Vietnam, also export shrimp to the US. The prediction was based on a dataset collected from the US Department of Agriculture (USDA), the Food and Agriculture Organization of the United Nations (FAO), and the World Trade Organization (WTO) (May-1995 to May-2019) that included price, required farming certificates, and disease outbreak data. A super learner technique, which combined 10 single algorithms, was used to make predictions in selected base periods (3, 6, 9, and 12 months). It was found that the super learner obtained results in all base periods that were more accurate and stable than any candidate algorithms. The impacts of variables in the predictive model were interpreted by a SHapley Additive exPlanations (SHAP) analysis to determine their influence on the price of Vietnamese exports. The price of Indian, Thai, and Chinese exports highlighted the advantages of being a World Trade Organization member and the disadvantages of the prevalence of shrimp disease in Vietnam, which has had a significant impact on the Vietnamese shrimp export price.

Year: 2022 PMID: 36174038 PMCID: PMC9522284 DOI: 10.1371/journal.pone.0275290

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752

Introduction

Shrimp production and export is an important economic activity in Vietnam. About 90% of Vietnamese shrimp is exported, with a value of 2 billion USD in 2011 [1]. According to the Food and Agriculture Organization of the United Nations (FAO) [2], Vietnam is the world’s third largest producer of farmed white leg shrimp and giant tiger prawn, accounting for 13% of the global total, while China is the largest (32%) and Indonesia is second (15%), followed by India (12%), Ecuador (9%), and Thailand (6%). Exports of seafood are important not only for economic growth but also for rural development and the improvement of livelihoods [3]. The EU, USA, and Japan are the three main importers of Vietnamese shrimp products, together accounting for more than 50% of the total export value. These markets have strict requirements for imported seafood products based on food safety criteria, traceability, and quality assurance. Vietnam competes with other exporters to access the global market. Thus, Vietnam tries to satisfy mandatory requirements, seize other competitive advantages, and minimize shrimp disease. Food safety measures imposed by developed countries directly affect the trade flows of exporting countries [4]. Food safety requirements are met by global GAP certification and the Safe Quality Food standard [5]. Traceability based on the Hazard Analysis Critical Control Point is required in the agriculture chain to increase the value of the product [6]. Country of Origin Labeling and Aquaculture Stewardship Council certificates are also used to indicate the quality of exported products. Since 2003, the US market has imposed an anti-dumping measure on shrimp imports from China, India, Thailand, Vietnam, Ecuador, and Brazil, which is believed to have affected the export price of those countries [7]. Participating in the World Trade Organization (WTO) constitutes a competitive advantage that enables countries to access high-value markets.
By promoting free trade by reducing tariffs in global markets, member countries have the ability to compete with other exporters and domestic products. This is an important factor that influences price and enhances the national shrimp industry. Shrimp disease is a challenge for producer countries because it threatens the volume and quality of the exported product, and is therefore a competitive disadvantage. Previous studies have evaluated losses caused by early mortality syndrome on commercial shrimp aquaculture [8], economic losses [9], and the global perspective of shrimp disease [10]. The losses due to EMS in Vietnam have been reported [11-13]. The application of computing techniques for farming enhancement, disease prediction, and market trend analysis is widely used. For example, an expert system [14] and tools for processing digital images [15] were used to diagnose shrimp disease. Machine learning has been used for predicting disease occurrence in cultured shrimp. Previous studies used machine learning to predict the occurrence of shrimp disease [16-18] and create applications for aquaculture [19]. Machine learning has also been used in sales forecasting [20-22]. Researchers [23] applied a regression model to predict the stock price, while other research [24] used a random forest algorithm to forecast supply chain demand. Accurate price prediction is very important for fishery exports because it helps to determine global market trends, enhancing the quality of seafood products. There are millions of tons of shrimp exported to the international market from producer countries and, therefore, estimates of price need to be as precise as possible. This information can be used by exporters to determine strategies to increase exports, leading to more financial benefits for national economies and providing motivation to shrimp farmers. 
Although machine learning algorithms are useful for making predictions, they still depend on accurate datasets and each algorithm has its own strength of prediction. For example, the random forest algorithm was outperformed in terms of prediction by a dataset of power generation and power system security [25], and a logistic regression performed better than a neural network in the prediction of occurrence of early mortality syndrome [17]. A study preferred a probabilistic neural network to a logistic regression model when analyzing a dataset of general shrimp diseases [16]. To increase accuracy and overcome the dependence of the algorithm on the dataset, the combination of many machine learning algorithms was proposed [26]. This enabled the generation of a more powerful predictive model, called the super learner. Researchers [27] used the super learner to predict the phenotypic antiretroviral susceptibility of HIV in humans and found that it was as good as or better than any single algorithm. A combination of ten single machine learning algorithms (a neural network, linear regression, randomized trees, XGBoost, loess, random forest, polyMARS, MARS, lasso, and support vector regression) [28], was used to optimize the accuracy of daily stream flow forecasts. One study [29] used the super learner to improve the accuracy of prediction of mortality risk in an older population. In addition to improving accuracy, it is also important to determine the importance of each variable used in predictions. For exports, this will help identify the factors that have advantageous and disadvantageous influences on the price of products. The producer will then develop strategies to boost exports by increasing the beneficial conditions and minimizing the negative effects. However, there are difficulties with the machine learning approach because it creates a black-box model that is difficult to interpret, making it difficult for humans to understand and trust. 
Due to the effort required to evaluate the contribution of each variable in the output of a machine learning analysis, the SHapley Additive exPlanations (SHAP) method was introduced to interpret predictive models. In 2017, a unified approach used SHAP as a way to explain the importance of each variable in the output of a machine learning analysis [30]. An interpretation by SHAP was also conducted in a previous study [31]. SHAP not only successfully explained the internal logic of prediction but also verified the credibility of a predictive model [32]. In this study, we used a super learner, which had the potential to give accurate and stable price predictions for Vietnamese shrimp products exported to the US market. It combined 10 candidate algorithms to predict the exported price of Vietnamese shrimp based on information from competitive exporters (China, Thailand, Indonesia, India, Ecuador, and Chile). To interpret the prediction, SHAP was used to determine how each predictor influenced the export price and then suggested solutions for the development of the Vietnamese shrimp industry. This study provided a new approach to make accurate Vietnamese price predictions based on information from competitors rather than the use of information solely from Vietnam, as in previous research [33]. It also provided a method to interpret the predicted result.

Materials and methods

Preliminaries

Monthly data were collected from the US Department of Agriculture, WTO, and FAO for the period from May-1995 to May-2019. The seven leading exporters of frozen shrimp products to the US market (China, Thailand, India, Indonesia, Ecuador, Chile, and Vietnam) were included in the dataset. The exporting countries were direct competitors in the US market. Therefore, an increase in the price of exports from any of the listed countries would shift the demand curve of shrimp products imported to the US market [34]. This would lead to an increase in demand for shrimp products imported from other countries among US consumers. The prices of shrimp products imported from other countries would then increase. Additionally, the variables influencing the Vietnamese export price, and the competitive advantages and drawbacks (i.e., shrimp disease) of these countries were assessed. The difference in price between Vietnam and each competitor country was considered. The correlations between the variables and price in Vietnam and other countries were evaluated and it was determined whether they were negative or positive. A negative value meant that the Vietnamese price was lower than that of the other countries and vice versa. The export price was presented in US dollars; thus, the correlated price was also reported in US dollars. In Fig 1, the correlations between Vietnamese export prices and those of other countries are shown. The prices of Vietnamese exported shrimp were mainly found to be between 10–15 USD per kg, while prices >15 and <10 USD were not common. Export prices of other countries had a wider but smaller distribution range.

Fig 1

Price correlations between Vietnamese export shrimp and that of other countries.

The US is one of the largest markets for seafood products in the world. Product quality, food safety, and traceability issues are extremely important in the US market. Shrimp producers that export to the US need to adhere strictly to product requirements. We selected some of the mandatory US requirements that applied to shrimp products from all producer countries, such as global GAP, Hazard Analysis Critical Control Point, Safe Quality Food, and Aquaculture Stewardship Council. Global GAP and Safe Quality Food aim to provide a food safety assurance that protects consumer health, while Hazard Analysis Critical Control Point guarantees the product origin. The Aquaculture Stewardship Council practices are used to enhance the responsibility of producers in terms of minimizing their impact on the environment. This certificate requires a limitation on the use of wild fish as an ingredient in shrimp feed, as well as the regular assessment of water quality to avoid pollution and disease outbreaks. These farming certificates are not only needed in response to the requirements of the US market but will also beneficially increase the price of products and the competitive advantages among exporters in the long term due to improvements in product quality [35]. The implementation date for the certificates differs among the producer countries, e.g., Vietnam has applied global GAP since Sep-2007, while Indonesia has only applied this certification to shrimp since Oct-2011. This was considered likely to affect the export price as both countries export to the same destination, i.e., the US market. To protect the domestic shrimp industry, anti-dumping tariffs were set by the US government in 2003. These apply to China, India, Thailand, Vietnam, and Ecuador, while Indonesia and Chile are not affected by this requirement [7]. It was considered that the Vietnamese price was impacted by this practice, leading to an inability to compete with the non-impacted countries.
To participate in global trade, many countries have attempted to become members of the WTO. This would provide an opportunity to access stringent markets. The different times at which producer countries have joined the WTO are likely to have affected the export price. The number of shrimp exporting countries participating in the WTO will affect the price of Vietnam’s shrimp export. Membership of the WTO was therefore selected as one of the variables in the super learner process to evaluate the export price. Disease is a serious concern in export shrimp production. It can cause problems for producer countries due to US market concerns regarding disease transmission and residual chemicals in the final product due to the materials applied for treatment. This will reduce the competitiveness of exporters and export volumes. Before imports can be accepted, shrimp producers in countries where disease is confirmed could be required to show evidence of safe products. Therefore, early mortality syndrome was selected as a variable that could influence the Vietnamese export price. Other variables in the dataset, such as enrofloxacin antibiotic residues and Country of Origin Labeling, were not used in predictions. Because they applied to all export countries at the same time, they were not meaningful for predicting price variations. We only focused on competitor information and, therefore, the exchange rate and US income per capita were also omitted. To validate the association between the independent variables and the target value (the Vietnamese export price), the Pearson correlation [36] method was applied. This method measures the strength of the relationship between two variables [37], based on their coefficient. The Pearson correlation coefficient between variables X and Y is defined as:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\,\operatorname{var}(Y)}} \quad (1)$$

where cov is the covariance and var is the variance. Variables that had a high correlation with the Vietnamese export price were selected for prediction.
We obtained 13 independent variables for use in the super learner process to predict the Vietnamese export price (see Table 1).

Table 1

List of variables.

Variables	Description
DifferencePriceVN_China	Price gap between Vietnam and China
DifferencePriceVN_Indonesia	Price gap between Vietnam and Indonesia
DifferencePriceVN_India	Price gap between Vietnam and India
DifferencePriceVN_Thailand	Price gap between Vietnam and Thailand
DifferencePriceVN_Chile	Price gap between Vietnam and Chile
DifferencePriceVN_Ecuador	Price gap between Vietnam and Ecuador
Certificated_SQF	Number of competitive countries with Safe Quality Food certificates
Certificated_HACCP	Number of competitive countries with Hazard Analysis and Critical Control Point certifications
Certificated_ASC	Number of competitive countries with Aquaculture Stewardship Council certificates
Infected_EMS	Number of competitive countries confirmed to be infected by Early Mortality Syndrome in cultured shrimp
Member_WTO	Number of competitive countries that are members of the WTO
Applied_GAP	Number of competitive countries that apply global Good Agricultural Practice
Imposed_ANTI	Number of competitive countries subject to anti-dumping laws by the USA.
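
The Pearson screening described before Table 1 can be sketched as follows. This is a minimal illustration on made-up numbers, not the study's data: the helper name `pearson` and both price arrays are assumptions introduced here. The price gap is computed as Vietnam minus competitor, so that a negative value means the Vietnamese price is lower, as stated in the text.

```python
import numpy as np

def pearson(x, y):
    """Pearson coefficient: cov(X, Y) / sqrt(var(X) * var(Y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / np.sqrt(x.var() * y.var())

# Hypothetical monthly prices in USD per kg (illustrative values only).
price_vn = np.array([11.2, 12.0, 12.5, 13.1, 12.8, 13.6])
price_india = np.array([9.8, 10.1, 10.9, 11.0, 11.4, 11.5])

# Price gap in the style of Table 1: negative when Vietnam is cheaper.
gap_vn_india = price_vn - price_india

r = pearson(gap_vn_india, price_vn)
print(f"r = {r:.3f}")
```

A variable would then be kept for prediction when the absolute value of `r` is high; the cut-off itself is not stated in the text.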

Linear regression

Linear regression builds models that assume a linear relationship between input variables (x) and the single output variable (y) via scale factors to each input, called coefficients (β). The formula of this algorithm is as follows:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i \quad (2)$$

where β0 is the intercept (when x is 0), i indicates the ith sample in the dataset, ε is a random error term, x corresponds to variables in the dataset, p is the pth independent variable, and y is the Vietnamese export price. Parameters were set to increase the reliability of the predictions for this algorithm. In the scikit-learn Python package, the parameter fit_intercept, which corresponds to β0 in (2), was set to “True” to calculate the intercept for this model. The parameters normalize and copy_X, which are used to normalize by L2-norm and copy all variables in the dataset, were set to “True.” Other parameters such as n_jobs and positive were set to default.
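
A minimal sketch of this configuration on synthetic data is given below. One caveat: the `normalize` parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so the sketch assumes a recent release and substitutes a `StandardScaler` step, which is a related but not identical preprocessing choice. The data, the coefficient vector `true_beta`, and the variable names are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))      # 13 synthetic predictors, mirroring Table 1
true_beta = rng.normal(size=13)     # hypothetical coefficients
y = 12.0 + X @ true_beta + rng.normal(scale=0.1, size=200)

model = make_pipeline(
    StandardScaler(),  # assumption: stands in for the removed normalize=True
    LinearRegression(fit_intercept=True, copy_X=True),
)
model.fit(X, y)

r2 = model.score(X, y)              # coefficient of determination on the fit data
print(f"R^2 = {r2:.4f}")
```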

Lasso regression

The lasso regression, which is an abbreviation of “least absolute shrinkage and selection operator,” is used for variable selection and estimation in linear regression models [38]. A constraint is imposed on model parameters to make regression coefficients for variables shrink toward zero. Some variables have a zero coefficient and are eliminated, while variables with non-zero coefficients are used to evaluate the model. The coefficients of linear regression (β0, β1,…,βp) in lasso are as follows:

$$(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p) = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$

where x and y are input and output variables, respectively, and λ is a non-negative tuning parameter that is used to control the shrinkage. The higher the value of λ, the larger the shrinkage of the model. Similar to linear regression, the parameters in this algorithm including fit_intercept, normalize, and copy_X were set to “True.” The parameter alpha was used to control regularization strength, with a value of 0.8 performing best. The parameter max_iter, the maximum number of iterations, was set to 500. Other parameters such as tol (used for the stopping criterion), precompute, random_state, and others were given default values.
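
The shrinkage behaviour described above can be seen in a short experiment. The sketch below is illustrative only: it uses synthetic data in which just three of 13 predictors matter, and the helper `n_zero` is a hypothetical name. It counts how many coefficients are driven exactly to zero for a small penalty and for the alpha of 0.8 reported in the text.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 13))
# Only the first three synthetic predictors actually drive the target.
y = (3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2]
     + rng.normal(scale=0.5, size=300))

def n_zero(alpha):
    """Fit a lasso model and count coefficients shrunk exactly to zero."""
    lasso = Lasso(alpha=alpha, max_iter=500, fit_intercept=True, copy_X=True)
    lasso.fit(X, y)
    return int(np.sum(lasso.coef_ == 0.0))

zeros_small = n_zero(0.01)  # weak penalty
zeros_paper = n_zero(0.8)   # alpha reported in the text
print(zeros_small, zeros_paper)
```

A larger penalty eliminates at least as many variables as a smaller one, which is the variable-selection effect the section describes.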

Ridge regression

This is a biased estimation method that is used for estimating the coefficients of a regression model where the independent variables are strongly correlated [39]. The term bias in machine learning can be understood as the extent to which the model fails to produce a plot that is in line with the samples. The ridge regression imposes a penalty term on the coefficients β to control bias, thus improving the accuracy of prediction. The cost function for a ridge regression performs an L2 regularization as follows:

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

where λ is the penalty parameter, and x and y are the input and output variables, respectively. To fairly evaluate ridge, linear, and lasso regression, the parameters of ridge including fit_intercept, normalize, and copy_X were also set to “True.” The max_iter was set to 500 and alpha was set to 0.8. The tol, positive, and random_state were set to defaults. The parameter solver was set to “svd,” which uses a singular value decomposition of the independent data (13 independent variables in the dataset) to compute the ridge coefficients.

Elastic net

Elastic net regression uses penalties from both the lasso and ridge methods to regularize a regression model. It improves the regularization of the predicted model by learning from the shortcomings of the ridge and lasso models. The limitation of the lasso regression is that, for high-dimensional data (too many variables) with only a few samples, it selects only a few variables, while the ridge method keeps many highly correlated variables in the dataset. To overcome these issues, the elastic net performs variable selection and regularization simultaneously. Accordingly, there are two stages involving the lasso and ridge methods: first, it finds the ridge regression coefficient and, second, it uses a lasso-type shrinkage of the coefficient. To implement this algorithm, we set parameter values the same as the ridge algorithm: alpha equal to 0.8; max_iter at 500; and fit_intercept, normalize, and copy_X set to “True.” Here, the parameter l1_ratio was set to 0.5, and selection, which determines how the coefficients are updated in every iteration, was set to the “random” option.
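
The contrast between the ridge and elastic net penalties can be sketched with the reported settings (alpha 0.8, max_iter 500, solver “svd,” l1_ratio 0.5, selection “random”). The data below are synthetic and constructed for illustration: two predictors are almost collinear and a third is pure noise. The `normalize` option is omitted because recent scikit-learn releases no longer accept it.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Ridge

rng = np.random.default_rng(2)
base = rng.normal(size=(300, 1))
# Two strongly correlated synthetic predictors plus one unrelated predictor.
X = np.hstack([
    base,
    base + rng.normal(scale=0.05, size=(300, 1)),
    rng.normal(size=(300, 1)),
])
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.3, size=300)

ridge = Ridge(alpha=0.8, solver="svd", max_iter=500,
              fit_intercept=True, copy_X=True)
enet = ElasticNet(alpha=0.8, l1_ratio=0.5, max_iter=500,
                  selection="random", random_state=0)
ridge.fit(X, y)
enet.fit(X, y)

print("ridge      :", np.round(ridge.coef_, 3))
print("elastic net:", np.round(enet.coef_, 3))
```

Ridge keeps a non-zero weight on every predictor, whereas the L1 part of the elastic net can remove the unrelated one while still sharing weight across the correlated pair.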

K-nearest neighbor

This method is based on the similarity concept that assumes similar things exist in close proximity [40]. Hence, k-nearest neighbor finds the similarity among data points by calculating the distance among them. The Euclidean distance is often used to measure how close two data points are. Here, K represents the specified number of samples that need to be grouped in terms of similarity. All the samples in one group have the same label (or value). Therefore, once a new sample is placed into a specific group, it is assigned the label of that group. Depending on the data, an appropriate k-value must be established. In this study, we took K = 5 to make predictions. The parameter weights was set to the value “distance” to indicate that closer neighbors of a query point will have a greater influence than neighbors that are farther away. Leaf_size was set to 20, the distance metric was set to “minkowski,” and algorithm was set to “kd_tree.” Other parameters were set to default values.
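
A minimal sketch of the stated k-nearest neighbor settings follows, using synthetic two-dimensional data rather than the study's 13 variables. For a regression target, scikit-learn's `KNeighborsRegressor` predicts a weighted average of the neighbors' values instead of assigning a group label. The variable names and the query point are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] + 0.5 * X[:, 1]          # synthetic, noise-free target

knn = KNeighborsRegressor(
    n_neighbors=5,
    weights="distance",   # closer neighbors count more
    algorithm="kd_tree",
    leaf_size=20,
    metric="minkowski",   # with the default p=2 this is the Euclidean distance
)
knn.fit(X, y)

# With distance weighting, a query that coincides with a training point
# reproduces that point's target value.
pred_train_point = knn.predict(X[:1])[0]
pred_new = knn.predict([[5.0, 5.0]])[0]
print(pred_train_point, y[0], pred_new)
```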

Support vector regression

A support vector machine attempts to determine a line (called the hyperplane in multidimensional space) that will separate two or more classes of data. Support vector regression was built on the principle of a support vector machine, but is used for regression problems [41]. A support vector regression performs the mapping between inputs and outputs by developing a hyperplane. The data points on either side of the hyperplane that are closest to the hyperplane are called support vectors and are used to determine the boundary line. To increase the prediction accuracy, a support vector regression attempts to find the best hyperplane within a threshold value (distance between the hyperplane and boundary line), instead of minimizing the error between the real and predicted values. To implement this algorithm, the parameter kernel was set to “sigmoid”; degree was set to 3; gamma was set to “auto”; cache_size, used to specify the size of the kernel cache, was set to 100 megabytes; and max_iter was set to 500. The regularization parameter C, which is the penalty parameter of the error term, was set to 2. Other parameters such as tol, shrinking, and epsilon were set to default values.
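
The settings above map onto scikit-learn's `SVR` as sketched below. The data are synthetic, and the `StandardScaler` step is an assumption added for the sketch; the text does not say whether inputs were scaled before this algorithm. With `max_iter=500` the solver may stop before full convergence and emit a warning.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 13))
y = 12.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=150)

svr = make_pipeline(
    StandardScaler(),  # assumption: scaling is not stated in the text
    SVR(kernel="sigmoid", degree=3, gamma="auto", C=2,
        cache_size=100, max_iter=500),
)
svr.fit(X, y)

# Support vectors are the training points that define the fitted function.
n_support = svr[-1].support_.shape[0]
print("support vectors:", n_support, "of", len(y))
```

Note that `degree` only affects the polynomial kernel, so it has no effect with the sigmoid kernel used here.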

Decision tree

The decision tree algorithm is based on the structure of a tree to predict the outcome from independent variables in both classification and regression. There are many nodes located from the root to the branches of the tree. In this architecture, the root node and internal nodes are labeled with input values, while the leaf is the output value. Depending on the type of output value, a decision tree is used for classification or regression. A decision tree where the target variable can take continuous values is called a regression tree. Here, we used a regression tree to predict the Vietnamese export price. In this algorithm, the parameter criterion, used to measure the quality of a split in the tree, was set to “gini.” The splitter strategy was set to “random.” The max_depth was set to 10, min_samples_split was set to 5, and min_samples_leaf was set to 5. Other parameters such as min_weight_fraction_leaf, max_features, and random_state were set to default values.

Random forest

The random forest is based on the decision tree concept [42]. As its name suggests, multiple trees are built to make a prediction. The random forest algorithm randomly selects samples and uses the best split of a subset of variables to build, simultaneously, multiple sub-decision trees. Majority voting across the sub-decision trees is used to obtain the final result (for regression, the predictions of the trees are averaged). The random forest is therefore more flexible than a decision tree. In the implementation of random forest, n_estimators, used to indicate the number of trees, was set to 1,000. The parameter criterion, used to measure the quality of a split in the forest, was set to “entropy.” The parameter min_samples_split, which indicates the minimum number of samples required to split an internal node, was set to 5. The max_depth was set to 5. Other parameters, such as min_samples_leaf, min_weight_fraction_leaf, max_features, and min_impurity_split, were set to the default values.

Gradient boosting

Gradient boosting is also based on the decision tree concept. Unlike the decision tree, gradient boosting sequentially builds sub-decision trees. The next sub-tree is built with the purpose of improving the error of the previous tree to enhance the ensemble performance [43]. The process continues to build sub-trees until the specified number of iterations is reached. The prediction of the final model is the sum of the predictions of previous tree models. Similar to random forest, n_estimators was set to 1,000 in the implementation of gradient boosting algorithm. The parameter max_depth was set to 5. The loss function used was least squares regression. The parameter min_samples_split was also set to 5. The parameter subsample used to control variance and bias was set equal to 1. Other parameters, such as alpha, max_features, and min_impurity_split, were set to their default values.

Neural network

This algorithm is inspired by the human brain [44]. There are many connected nodes located inside multiple layers in the structure of this algorithm. The computation is performed inside each node and provides an output by mathematically processing inputs with connected weights. The previous nodes will output a value, which is then used as the input for the next node in the network. Generally, the complex structure of a neural network consists of multiple hidden layers between the input and output layers. The complexity of the algorithm is generated through a number of hidden layers that are used for mapping the input and output. In this study, we applied a neural network with five layers: one input layer to receive values from the independent variables, three hidden layers for the mapping process, and one output layer for the target variable. For the neural network, the multi-layer perceptron regressor of scikit-learn was used to obtain the prediction. The parameter hidden_layer_sizes was set to three layers, each with 30 neurons. The activation function was set to “tanh.” The solver was set to “lbfgs.” The batch_size, which identifies the size of minibatches for stochastic optimizers, was set to 10. The learning_rate, used to schedule weight updates, was set to “constant.” Other parameters were set to default values.
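
The five-layer network described above corresponds roughly to the following `MLPRegressor` sketch on synthetic data. Two points are assumptions of the sketch rather than statements from the text: inputs are standardized first, and `max_iter` is raised to 500. The `batch_size` setting is left out because scikit-learn does not use minibatches with the “lbfgs” solver, and `learning_rate` likewise only matters for the “sgd” solver.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 13))
y = (12.0 + np.tanh(X[:, 0]) + 0.5 * X[:, 1]
     + rng.normal(scale=0.1, size=200))

mlp = make_pipeline(
    StandardScaler(),  # assumption: scaling is not stated in the text
    MLPRegressor(
        hidden_layer_sizes=(30, 30, 30),  # three hidden layers, 30 neurons each
        activation="tanh",
        solver="lbfgs",
        learning_rate="constant",         # ignored by lbfgs, kept for fidelity
        max_iter=500,                     # assumption: not given in the text
        random_state=0,
    ),
)
mlp.fit(X, y)

net = mlp[-1]
print("layers:", net.n_layers_)
print("weight shapes:", [w.shape for w in net.coefs_])
```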

Extra tree regression

Extra tree regression is an ensemble technique that works by creating a large number of unpruned sub-decision trees from the training dataset. The extra tree prediction is made by averaging the predictions of the sub-decision trees. This technique is similar to that of the random forest, but it randomly chooses a subset of features to build sub-trees, whereas the random forest makes the optimal choice. The difference makes the extra tree run faster because it does not need to calculate the optimal pathway. In this study, the extra tree algorithm was used to combine the predictions of the 10 candidate algorithms and make the final prediction in the super learner process, with the following parameters. The n_estimators was set to 1,000. The parameter max_depth was set to 5. The criterion was set to “absolute_error.” The parameter min_samples_split was also set to 5. Other parameters, such as min_weight_fraction_leaf, bootstrap, and min_impurity_decrease, were set to their default values. In this study, all of the algorithms used were obtained from the scikit-learn package [45] for Python.

Methodology

The super learner is a prediction method that allows researchers to combine the results of a set of single machine learning algorithms into one to improve the predictive performance [46]. The advantage of this method is that the prediction accuracy of its model is as good as or better than any model from a single algorithm. This method is based on the theory of cross-validation and can generate the optimal weighted combination among base algorithms, which is both adaptive and robust for use with a small number of samples [47]. The cross-validation means that all candidate models use the same k-fold splits of the dataset. Because the super learner was developed using a stacked generalization technique, it uses a new model to combine the predictions from multiple candidate models that are already trained. In previous work on predicting Vietnamese export prices, random forest and gradient boosting were identified as the best single machine learning approaches [33]. Therefore, these two algorithms were first chosen for use in the super learner. Forward selection was applied to iteratively add new potential candidate algorithms to the super learner. To be selected into the model, the potential algorithm had to contribute to the super learner and reduce the error of the model. The more algorithms added, the more accurate the ensemble model. However, too many candidate algorithms will increase the time and computing cost of implementation. To balance the accuracy and computing cost, we set the number of algorithms in the super learner to 10. The forward selection steps for predicting the export price are described in Table 2.

Table 2

Forward selection of candidate algorithms.

Step	Candidate algorithm	MAPE
0	Random forest, Gradient boosting	5.16%
1	Elastic net	3.21%
2	Lasso	2.84%
3	Decision tree	2.27%
4	K-nearest neighbor	1.95%
5	Linear regression	1.16%
6	SVR	1.01%
7	Ridge	0.95%
8	Neural network	0.80%

After evaluating the suitability of the combination of algorithms, we used the above 10 candidate algorithms (linear regression, elastic net, k-nearest neighbor, support vector regression, decision tree, random forest, gradient boosting, neural network, lasso, and ridge) to make the base prediction. Then the extra trees algorithm was used to combine all base predictions and obtain the final result. The concept of the super learner is shown in Fig 2. Here, we divided the dataset into two subsets: 75% and 25% for the training and testing sets, respectively. According to a time series analysis, the training set used the data for the period from May-1995 to Apr-2013, while the testing set consisted of the following time period from May-2013 to May-2019.
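
The stacking procedure and the chronological split can be sketched with scikit-learn's `StackingRegressor`, which fits each candidate on shared cross-validation folds and trains a final estimator on their out-of-fold predictions. This is a stand-in chosen for illustration and is not necessarily how the authors implemented the super learner. The data are synthetic, only seven of the ten candidates are included for brevity, and tree counts are reduced to keep the run short. The row count of 289 matches the number of months from May-1995 to May-2019, and the cut at 216 matches May-1995 to Apr-2013.

```python
import numpy as np
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(8)
n_months = 289                       # May-1995 to May-2019, one row per month
X = rng.normal(size=(n_months, 13))  # synthetic stand-ins for Table 1 variables
y = (12.0 + X[:, 0] - 0.8 * X[:, 1] + 0.5 * X[:, 2] * X[:, 3]
     + rng.normal(scale=0.2, size=n_months))

split = 216                          # first ~75% of months for training
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

candidates = [  # subset of the candidates, with reduced tree counts
    ("rf", RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)),
    ("gb", GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=0)),
    ("enet", ElasticNet(alpha=0.8, l1_ratio=0.5)),
    ("lasso", Lasso(alpha=0.8)),
    ("knn", KNeighborsRegressor(n_neighbors=5, weights="distance")),
    ("ols", LinearRegression()),
    ("ridge", Ridge(alpha=0.8)),
]

super_learner = StackingRegressor(
    estimators=candidates,
    final_estimator=ExtraTreesRegressor(
        n_estimators=200, max_depth=5, min_samples_split=5, random_state=0),
    cv=5,  # every candidate is evaluated on the same five folds
)
super_learner.fit(X_train, y_train)

mape = mean_absolute_percentage_error(y_test, super_learner.predict(X_test))
print(f"MAPE = {mape:.2%}")
```

Forward selection in the sense of Table 2 would repeat this fit while adding one candidate at a time and keeping it only if the MAPE decreases.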

Fig 2

Concept of the super learner.

Candidate algorithm (Algo_1 to Algo_n) and independent variable in dataset (var_1 to var_n).

To determine how each variable influenced the price, SHAP was used. SHAP builds on the Shapley value, which was introduced by Shapley in 1953 [48] to estimate the importance of an individual player in a collaborative team game. It evaluates the contribution of each player to the final result of the game. The Shapley value is calculated by averaging the marginal contribution of the variable’s values across all possible sets (coalitions). Thus, the Shapley value indicates how to distribute the predictions fairly among independent variables. This concept was later developed to interpret the contribution of each predictor in a machine learning analysis [49]. A prediction can therefore be explained by assuming that each independent variable in the dataset acts as a “player” in a game and the predicted result is the payout [48]. Accordingly, SHAP is used to evaluate the importance of variables for an outcome when making predictions [50]. The contribution of each variable i is indicated by the SHAP value as follows:

$$\Phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ f(S \cup \{i\}) - f(S) \right]$$

where f(S) is the outcome of the machine learning model using the subset of variables S, while N is the complete set of all variables. The contribution of variable i (called Φ) could have negative and positive signs. A positive value contributes to the prediction of activity, while a negative value contributes to the prediction of inactivity. In this method, the difference between the average and component prediction (for each subset S) is fairly distributed among the variables of interest, which guarantees that a full explanation of the predicted result will be delivered. Therefore, SHAP is an ideal solution in the interpretation of prediction problems. However, the computing power required to estimate the Shapley value is extremely large because it needs to handle 2^k possible sets of variable values.
Because the number of subsets grows exponentially with the number of variables k, the Shapley value is approximated by sampling subsets over a limited number of iterations (called M). There is no rule for setting the optimal M: decreasing M reduces the computing time but increases the variance of the Shapley value estimate, and vice versa. Hence, M needs to be large enough to obtain a good estimate of the Shapley value and small enough to complete the computation in a reasonable time [49]. In this study, we used the SHAP package for Python to estimate the Shapley value for each variable in the dataset. To visualize the contribution of each variable, a summary plot was constructed. Each point is a Shapley value for one variable and one data point. The y-axis lists the variables, the x-axis shows the Shapley value, and the color indicates the value of the variable from low to high. In the summary plot, the variables are ranked in descending order of importance, enabling us to determine which variable is the most important and which is the least important. We evaluated the contribution of all of the independent variables related to competitive factors to the prediction of the export price.
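The averaging over coalitions described above can be made concrete with a small sketch. This is not the authors' code or the SHAP package. It is a brute-force illustration that enumerates every subset for a hypothetical three-variable toy "model" (`toy_value`, with made-up effects), which is only feasible because the number of variables is tiny.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, players):
    """Exact Shapley values: weighted average of marginal contributions
    of each player over every subset S of the remaining players."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (n - |S| - 1)! / n! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi


def toy_value(subset):
    """Hypothetical payout function (illustrative numbers, not study data)."""
    value = 10.0  # baseline prediction with no variables present
    if "india" in subset:
        value += 2.0
    if "wto" in subset:
        value += 1.0
    if "india" in subset and "wto" in subset:
        value += 0.5  # interaction, shared equally between the two
    return value


phi = shapley_values(toy_value, ["india", "wto", "sqf"])
for name, contribution in phi.items():
    print(f"{name}: {contribution:.2f}")
```

The three contributions sum to the difference between the payout with all variables and the baseline, which is the "fair distribution" property referred to in the text. With k variables the inner loops visit 2^(k-1) subsets per variable, which is why sampling over M iterations is used in practice.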

Results

Variable selection

The Pearson correlation method was applied to find the variables with the strongest association with the Vietnamese export price. The correlation coefficient takes a value between −1 and 1, where 0 indicates no correlation, 1 a perfect positive correlation, and −1 a perfect negative correlation. For the continuous variables differencePriceVN_China, differencePriceVN_Chile, differencePriceVN_Thailand, differencePriceVN_Indonesia, differencePriceVN_India, and differencePriceVN_Ecuador, the correlation values were −0.76, 0.58, 0.47, 0.65, 0.89, and 0.72, respectively. For the binary variables certificated_SQF, Certificated_HACCP, Certificated_ASC, Infected_EMS, Member_WTO, Applied_GAP, and Imposed_ANTI, the correlation with the Vietnamese price was high, ranging from 0.52 to 0.96. These highly correlated variables were used to predict the Vietnamese export price in the 3-, 6-, 9-, and 12-month base predictions.
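As a rough illustration of the screening step, the sketch below computes the Pearson coefficient for one continuous and one binary predictor. The series (`vn_price`, `price_diff`, `member_wto`) are invented for the example and are not the study's data. For a binary predictor, the same formula gives what is usually called the point-biserial correlation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly values (illustrative only).
vn_price = [10.0, 11.0, 12.0, 13.0, 14.0]   # target: Vietnamese export price
price_diff = [1.0, 1.5, 2.0, 2.5, 3.0]      # continuous predictor
member_wto = [0, 0, 1, 1, 1]                # binary predictor

r_diff = pearson(vn_price, price_diff)
r_wto = pearson(vn_price, member_wto)
print(f"continuous predictor: r = {r_diff:.2f}")
print(f"binary predictor:     r = {r_wto:.2f}")
```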

Prediction accuracy

We predicted the export price based on historical data for the previous 3, 6, 9, and 12 months. We used the mean absolute error (MAE) and mean squared error (MSE) to measure the accuracy of prediction in the testing subset (May-2013 to May-2019) as follows:

$$\mathrm{MAE} = \frac{1}{m}\sum_{j=1}^{m} \left| y_j - \hat{y}_j \right|, \qquad \mathrm{MSE} = \frac{1}{m}\sum_{j=1}^{m} \left( y_j - \hat{y}_j \right)^2$$

where m is the number of test samples, y_j is the actual value, and \hat{y}_j is the predicted value. Then, the percentage error was calculated (for MAE, mean absolute percentage error MAPE = (MAE × 100)/average price, and for MSE, mean square percentage error MSPE = (MSE × 100)/(average price)²). The average price in the dataset was 11.9 USD. The results obtained using the candidate algorithms and super learner are given in Table 3.
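The error measures defined above are simple to compute. The following sketch, with hypothetical function names, implements MAE and MSE and then applies the paper's percentage conversion to the super learner's 3-month MAE and MSE from Table 3, using the stated average price of 11.9 USD.

```python
def mae(actual, predicted):
    """Mean absolute error over paired actual/predicted values."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error over paired actual/predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)


def percentage_errors(mae_value, mse_value, average_price):
    """Percentage errors as defined in the text:
    MAPE = MAE * 100 / average price, MSPE = MSE * 100 / average price^2."""
    mape = mae_value * 100 / average_price
    mspe = mse_value * 100 / average_price ** 2
    return mape, mspe


AVERAGE_PRICE = 11.9  # USD, as reported in the text

# Super learner, 3-month base (MAE and MSE taken from Table 3).
mape_sl, mspe_sl = percentage_errors(0.095, 0.021, AVERAGE_PRICE)
print(f"MAPE = {mape_sl:.2f}%, MSPE = {mspe_sl:.2f}%")
```

Note that this MAPE is the MAE scaled by the mean price, which is how the paper defines it, rather than the per-observation percentage error that the name often denotes elsewhere.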

Table 3

Prediction errors for the 3-, 6-, 9-, and 12-month bases.

Period base	Metric	Candidate algorithm										Super learner
Period base	Metric	Linear Reg.	Elastic Net	SVR	Decision Tree	K-NN	Random Forest	Gradient Boosting	Neural network	Ridge	Lasso	Super learner
3 months	MAE	0.822	0.755	0.696	1.036	0.794	0.788	0.708	1.195	0.663	0.650	0.095
	MAPE	6.91%	6.34%	5.85%	8.71%	6.67%	6.62%	5.95%	10.04%	5.57%	5.46%	0.80%
	MSE	1.063	0.867	0.677	1.783	0.973	0.978	0.782	3.386	0.637	0.651	0.021
	MSPE	0.75%	0.61%	0.48%	1.26%	0.69%	0.69%	0.55%	2.39%	0.45%	0.46%	0.01%
6 months	MAE	1.156	0.761	0.717	0.893	0.781	0.719	0.734	1.644	1.037	0.705	0.142
	MAPE	9.71%	6.39%	6.03%	7.50%	6.56%	6.04%	6.17%	13.82%	8.71%	5.92%	1.19%
	MSE	2.236	0.872	0.714	1.313	1.016	0.761	0.761	5.655	1.777	0.701	0.063
	MSPE	1.58%	0.62%	0.50%	0.93%	0.72%	0.54%	0.54%	3.99%	1.25%	0.50%	0.04%
9 months	MAE	2.047	0.748	0.720	0.742	0.835	0.707	0.696	1.509	1.510	0.668	0.126
	MAPE	17.20%	6.29%	6.05%	6.24%	7.02%	5.94%	5.85%	12.68%	12.69%	5.61%	1.06%
	MSE	6.358	0.875	0.747	0.956	1.199	0.770	0.729	4.103	3.549	0.705	0.044
	MSPE	4.49%	0.62%	0.53%	0.68%	0.85%	0.54%	0.51%	2.90%	2.51%	0.50%	0.03%
12 months	MAE	3.556	0.750	0.783	1.033	0.955	0.867	0.755	2.714	1.992	0.704	0.133
	MAPE	29.88%	6.30%	6.58%	8.68%	8.03%	7.29%	6.34%	22.81%	16.74%	5.92%	1.12%
	MSE	19.157	0.889	0.878	1.714	1.432	1.222	0.846	9.880	6.254	0.772	0.032
	MSPE	13.53%	0.63%	0.62%	1.21%	1.01%	0.86%	0.60%	6.98%	4.42%	0.55%	0.02%

In the 3-month period, the super learner produced a prediction with a MAPE of 0.80% and an MSPE of 0.01%, an improvement on the accuracy of every single algorithm. The average MAPE of the candidate algorithms was 6.81%, while the average MSPE was 0.83%. Accordingly, the super learner improved the MAPE by about 6 percentage points and the MSPE by about 0.80 percentage points compared to the average single algorithm. Among the candidate algorithms, the best performance was achieved by lasso (MAPE of 5.46% and MSPE of 0.46%) and ridge (MAPE of 5.57% and MSPE of 0.45%), while the worst accuracy was obtained with the neural network (MAPE of 10.04% and MSPE of 2.39%). Fig 3A shows how close the values predicted by the super learner were to the actual values.
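The averages quoted for the 3-month base can be reproduced directly from Table 3. The short check below uses the ten candidate MAPE and MSPE values in table order. The variable names are illustrative.

```python
# 3-month MAPE and MSPE (in %) of the ten candidate algorithms, from Table 3.
candidate_mape = [6.91, 6.34, 5.85, 8.71, 6.67, 6.62, 5.95, 10.04, 5.57, 5.46]
candidate_mspe = [0.75, 0.61, 0.48, 1.26, 0.69, 0.69, 0.55, 2.39, 0.45, 0.46]

# Super learner, 3-month base (from Table 3).
super_mape, super_mspe = 0.80, 0.01

mean_mape = sum(candidate_mape) / len(candidate_mape)
mean_mspe = sum(candidate_mspe) / len(candidate_mspe)

# Improvement in percentage points relative to the average candidate.
gain_mape = mean_mape - super_mape
gain_mspe = mean_mspe - super_mspe

print(f"average candidate MAPE = {mean_mape:.2f}%, gain = {gain_mape:.2f} points")
print(f"average candidate MSPE = {mean_mspe:.2f}%, gain = {gain_mspe:.2f} points")
```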

Fig 3

a. Prediction for a 3-month base by the super learner. b. Prediction for a 6-month base by the super learner. c. Prediction for a 9-month base by the super learner. d. Prediction for a 12-month base by the super learner.

In the 6-month period, the super learner produced a prediction with a MAPE of 1.19% and an MSPE of 0.04%. The combined approach substantially reduced the error compared to the average prediction from the candidate algorithms (MAPE of 6.50% and MSPE of 1.08%). Among the candidate algorithms, lasso achieved the highest accuracy (MAPE of 5.92% and MSPE of 0.50%), while the neural network (MAPE of 13.82% and MSPE of 3.99%) had the worst performance. Fig 3B shows the prediction for this 6-month period.

In the 9-month period, the super learner produced a prediction with a MAPE of 1.06% and an MSPE of 0.03%. Among the candidate algorithms, lasso achieved the highest accuracy (MAPE of 5.61% and MSPE of 0.50%), while linear regression produced the worst prediction with the highest error (MAPE of 17.20% and MSPE of 4.49%). Compared to the best candidate algorithm (lasso), the super learner improved the error by at least 4 percentage points (for MAPE) and 0.47 percentage points (for MSPE). The accuracy of the predictions is presented in Fig 3C.

In the 12-month period, the super learner produced a prediction with a MAPE of 1.12% and an MSPE of 0.02%. In the stand-alone approach, the accuracy of the candidate algorithms differed substantially. The lowest MAPE and MSPE were obtained using lasso (5.92% and 0.55%, respectively). The MAPE values obtained with the ridge, neural network, and linear regression methods were large (16.74%, 22.81%, and 29.88%, respectively), while the other algorithms produced predictions with MAPE values in the range of 6.30–8.68%. Fig 3D shows a comparison of the actual values and the values predicted by the super learner.

The SHAP evaluation

To indicate the importance of each variable used in the dataset, a summary plot of SHAP values is presented in Fig 4. The variables were ranked in descending order of contribution, with the highest importance at the top and the lowest at the bottom. The horizontal location of each point indicates its impact on the prediction, while the color indicates the value of the variable, from low (blue) to high (red). The results indicated that differencePriceVN_India, member_WTO, differencePriceVN_Thailand, differencePriceVN_China, infected_EMS and differencePriceVN_Ecuador had large and positive impacts on the prediction of the target variable (Vietnamese export price). The red points for these variables lay on the right-hand side of the SHAP axis, indicating a positive impact. Generally, the difference in export price between Vietnam and other competitive exporters had a strong influence on the predictive model, while farming certificates had little impact on the prediction. Among the competitors, India had the most influence on the Vietnamese export price. Participating as a member of the WTO was the second most important factor, followed by Thailand, China, outbreak of disease (early mortality syndrome), and Ecuador, while Indonesia, Chile, and farming certificates (Safe Quality Food, Hazard Analysis Critical Control Point, and Aquaculture Stewardship Council) were less significant.
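The text does not state how the ranking in Fig 4 was computed. SHAP summary plots conventionally order variables by their mean absolute SHAP value across data points, so the sketch below assumes that convention. The SHAP values in `shap_rows` are invented for illustration and are not the study's results.

```python
def rank_by_mean_abs_shap(shap_rows, names):
    """Rank variables by mean absolute SHAP value, largest first.

    shap_rows: one list of SHAP values per data point, in the order of `names`.
    """
    n_rows = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n_rows
        for j, name in enumerate(names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values for three data points (illustrative only).
names = ["differencePriceVN_India", "member_WTO", "certificated_SQF"]
shap_rows = [
    [0.9, -0.4, 0.05],
    [-0.7, 0.5, -0.02],
    [0.8, 0.3, 0.01],
]

ranking = rank_by_mean_abs_shap(shap_rows, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value matters: a variable whose contributions are large but change sign from one data point to another would otherwise average out to a misleadingly small importance.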

Fig 4

SHAP interpretation.

Discussion

The super learner produced predictions that were not only highly accurate but also stable across different periods of historical data. The MAPE in all predictions was very low at 0.80%, 1.19%, 1.06%, and 1.12% for the 3-, 6-, 9-, and 12-month periods, respectively. Similarly, the lowest MSPE was obtained with the super learner, with values of 0.01% (3 months), 0.04% (6 months), 0.03% (9 months), and 0.02% (12 months). Compared to every candidate algorithm, this method reduced the prediction error. The MAPE was improved by more than 4 percentage points using the super learner compared to the best single candidate (lasso), with reductions from 5.46% to 0.80%, 5.92% to 1.19%, 5.61% to 1.06%, and 5.92% to 1.12% for the 3-, 6-, 9-, and 12-month base predictions, respectively. There was an improvement of at least 0.4 percentage points in the MSPE for the super learner compared to the best single approach. Additionally, the combined method resulted in a stable prediction, meaning that the accuracy was less dependent on the length of the historical data. The MAPE values for the 6- and 12-month periods were 1.19% and 1.12%, respectively. Fig 3 shows the stable accuracy of the super learner, which minimized errors in all predictions. These results indicate that the super learner is a suitable approach for predicting the export price of Vietnamese shrimp. The combination of candidate algorithms makes the ensemble model powerful, overcoming the dependence of each algorithm on the dataset. This work improved on the previous predictions presented in [33]. A highly accurate and stable prediction was obtained using the super learner, which outperformed the single approaches of the random forest and gradient boosting.
To gain an advantage in a global market, a producer not only has to satisfy the mandatory product requirements but also needs to reach an agreement on an appropriate price. Therefore, accurately predicting the price of an exported shrimp product is essential for enabling a producer to compete in the market. The factors influencing the prediction were evaluated. In the SHAP evaluation (shown in Fig 4), the variable with the largest impact was the difference in price between Vietnam and India (differencePriceVN_India). This was because India is the country exporting the most frozen shrimp in the global market (3.89 billion USD in 2019), followed by Ecuador, Vietnam, and Indonesia at 3.6 billion USD, 1.9 billion USD, and 1.4 billion USD, respectively. About 32% of all shrimp imported to the US originates from India [51], and it therefore has a very strong competitive position, significantly affecting other exporters, including Vietnam. In our analysis, the correlation between Indian and Vietnamese prices (in Fig 1) had a value of 0.89, the highest in our study. The distribution of Indian prices was similar to that of Vietnamese prices but lower. Indian prices ranged mainly between 8 and 11 USD, while Vietnamese prices were 10–15 USD. This means that India has lower export prices, giving it a competitive advantage over Vietnam, as reflected in the SHAP evaluation (Fig 4). India has promoted the development of shrimp production, which has become a key sector of the Indian economy, contributing 70% of the value of India’s seafood exports [52]. From 2011 to 2018, farmed shrimp production in India increased by 23%, far surpassing the average global growth rate of 5.6% [53]. The export prices of Thai and Chinese shrimp products were also leading factors that affected the export price for Vietnam, ranking near the top of the SHAP evaluation. Thailand is famous for the quality of its shrimp products.
Seafood is the industry that generates the most income for Thailand, and frozen shrimp are the highest-value export from the country. The FAO [54] reported that Thailand was the top global exporter of shrimp products, which were the country’s most important traded commodity in terms of value. About 82% of the shrimp produced is exported, while the remaining 18% is consumed domestically. In the production of shrimp, Thailand has prioritized food safety, welfare, and traceability among shrimp farmers, and its production is conducted in an environmentally responsible manner. The US is the most important importer of Thai shrimp products, and Thailand is a strong competitor of Vietnam in the US shrimp market. With the advantage of its large potential farming area, China can also produce a large quantity of shrimp, and it has become the world’s largest producer of shrimp [2]. The Chinese shrimp export price was the lowest among the exporters investigated here (as shown in Fig 5), which had an impact on the export price in Vietnam. Chile is also a strong competitor of Vietnam in exporting shrimp to the US market due to its advantageous geographical location. Chile and the US are both located in the Americas, which reduces transportation costs, and there are fewer trade barriers imposed by the US, e.g., the anti-dumping law. Chile also has favorable natural conditions, e.g., a long coastline, which favors shrimp production.

Fig 5

Comparison of prices among competitors, including Vietnam.

As a member of the WTO, Vietnam has obtained an advantage in exporting shrimp to the US market, which has partly offset the unfair imposition of the anti-dumping tariff by the US government, a measure used to protect inefficient domestic industries [7]. Vietnam became the 150th member of the WTO in 2007, and the export volume and price of shrimp products were positively affected. The price increased from 11.78 USD/kg (Dec-2006) to 13.00 USD/kg (Jan-2007). In addition, WTO members face lower trade barriers; thus, they benefit from lower tariffs, fewer regulations, and less restrictive import quotas. Disease clearly influenced the export price. It not only affected the quality of shrimp but also caused a scarcity in the quantity available for export. The global shrimp industry has been severely affected by early mortality syndrome (EMS) and has experienced huge losses. In addition to the decrease in production for domestic consumption, shrimp losses have directly affected national exports by causing fluctuations in the exported volume and price. According to the US Department of Agriculture data used in this study, the average export price of Vietnamese shrimp to the US was 11.7 USD/kg, which increased to 12.2 USD/kg after early mortality syndrome was observed. Like Vietnam, Thailand has also experienced large changes in the export price due to the early mortality syndrome outbreak. The average price before the disease outbreak was 8.8 USD/kg, but this increased to 10.3 USD/kg after infection. The Chinese export price increased from 5.5 to 7.8 USD/kg after EMS was confirmed, while the price of the shrimp product from Chile increased by an average of 4.7 USD for each exported kilogram, from 8.3 to 13.0 USD/kg. The early mortality syndrome outbreak clearly affected the export prices of shrimp products in the global market. The disease reduced the production volume and thus caused the global price to increase.
Another impact of the disease on global trade was the reduction of opportunities for exports. Importers could implement policies restricting imported shrimp from affected countries [8], which would reduce the competitiveness and export volume of exporters. Although other factors, including Aquaculture Stewardship Council certificates, global GAP, Safe Quality Food, and Hazard Analysis Critical Control Point, had less of an impact on the predictive model, they also affected the export price of Vietnam and other countries in terms of the assurance of food safety, traceability, and disease risk. It is likely that Vietnam will obtain a better price for its shrimp products if it fully implements the assurance certificates for exported shrimp, which may give it a competitive advantage over other producer countries in the international market. The more the requirements of the US are satisfied, the greater the export volume and price Vietnam will obtain. These certificates not only help increase productivity and reduce the risks from diseases during stocking, but they also ensure the safety of exported foods [55]. Currently, the export price of Vietnamese shrimp is higher than that of other exporters (as shown in Fig 5). This presents difficulties for Vietnam in terms of competition with other countries in the US market. Product quality is a significant consideration during farming because enhancing it is the main goal of the Vietnamese shrimp production industry. This explains why the US still prefers to import shrimp products from Vietnam even though the price is higher than that of products from other countries. Vietnam is one of the major exporting countries to the US market; however, the requirements imposed on Vietnamese producers by the US market have increased the cost of production.
For example, Hazard Analysis Critical Control Point, a traceability certificate, is considered the best strategy for gaining consumer trust regarding exported seafood products. Accordingly, Vietnam needs to implement Hazard Analysis Critical Control Point in cultured shrimp for export. However, the impact on the production cost needs to be considered [56]. To increase confidence in the origin of its products, Vietnam issued a national traceability regulation (Circular No.03/2011 BNN-PTNT) through the Vietnamese Directorate of Fisheries in March-2011. Applying certificates in farming will satisfy the US market, but the product price will need to increase to guarantee a profit for farmers. Currently, the export price of Vietnam’s shrimp is 30% higher than that of Ecuador, India, and Indonesia [57]. Therefore, planning a good strategy for a suitable export price and increasing product quality will make Vietnam’s shrimp products more competitive with those of other exporters, leading to an increase in the ranking of the country among the top exporting countries in the future.

20 Jun 2022

PONE-D-22-12822

A novel machine learning approach to predict the export price of seafood products based on competitive information: evidence from the export of Vietnamese shrimp to the US market

PLOS ONE Dear Dr. Nguyen, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 04 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, José F. Vicent Francés, Ph.D. Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. 
We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section. 3. Thank you for stating the following in the Acknowledgments Section of your manuscript: "This work was supported by the Hokkaido University DX Doctoral Fellowship [grant number JPMJSP2119]." We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." Please include your amended statements within your cover letter; we will change the online submission form on your behalf. 4. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability. 
"Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized. Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access. We will update your Data Availability statement to reflect the information you provide in your cover letter. Additional Editor Comments : Due to reviewer feedback, my decision is Major Revision [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: Partly ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: No Reviewer #2: No ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? 
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: No ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: In this paper, the authors use a super learner, with the aim of to give accurate and stable price predictions for Vietnamese shrimp products exported to the US market. In the paper, 10 algorithms are combined to predict the exported price of Vietnamese shrimp based on information from competitive exporters as China, Thailand, Indonesia, or India. In addition, a SHAP method is used to determine how each variable (predictor) influenced in the price. In my opinion, the paper is interesting and novel but very simple. 
There are some technical weaknesses that must be solved and it will help clarify the visibility of the paper. The use of acronyms is abused, which makes reading difficult. Please use them in necessary cases but not continuously. First of all, the paper would have to be reorganized: The Materials and Method section should be divided into Preliminaries, where all the methods and dataset used in the paper are included and Methodology, where they put I am not sure if the variables the authors use (in table 1) are correlations or simple differences since the name is CorrelationPrice but the description talks about differences. It is mandatory to clarify that. In line 209 the authors say that “After evaluating the suitability of the combination of algorithms…” but: How do the authors assess the adequacy of these algorithms? Why choose 10 algorithms if it is known in advance that some of them will not give precise results? The inclusion of an explanatory flow chart is mandatory. I don't understand the title of the Section “Results Machine Learning”. It must be changed. Figures 2, 3, 4 and 5 are practically the same. That means that the predictions at 3, 6, 9 and 12 months based on super learner are practically identical. Is this temporary specification necessary? Regarding the data, nothing is said about the amount of data used, nothing is said about data cleaning, nothing is said about whether this data is balanced or not. An important weakness of the paper is the technical section. Then, nothing is said about the different types of algorithms used (they only describe a simple definition as Wikipedia) . They must specify much more the technical part of the data, the technical part of the algorithms used as well as the technical part of the super-learne. They are treated very weakly. For example, authors do not provide any table or description with the training parameters and the hyper-parameters of each model. This must be explicitly indicated. 
Last, since the problem is a time series forecasting task, why have the authors not used well-known algorithms in this field such as ARIMA, VAR or VARMAX models?
Reviewer #2: The authors use a super learner (regression model) as a predictor of Vietnamese shrimp prices as well as the SHAP method to estimate the most important features/factors that influence the predicted variable. The authors report strong results in the accuracy of the predictions and the feature importance analysis allows the highlight of useful information from the trained model. Their approach is interesting since the ability to propose (i) an accurate prediction of shrimp prices and (ii) the reasons behind the model prediction (or feature importance) can both be useful for producers or policymakers to develop more adapted strategies in the future.
Comment on data availability: The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction. The authors did not make all data used in their study fully available which thus prevents reviewers or readers from independently verifying the validity of their results. The authors write that the "Data cannot be shared publicly because it is sensitive and relates to Vietnam and other countries economics sector." but then that they have "we have some papers which relate this data, but have not been published yet". For me, these two statements may appear contradictory. Moreover, the data used in their study (which includes price, required farming certificates, and disease outbreaks from May-1995 to May-2019) seems to be published by international organizations such as the US Department of Agriculture (USDA), the Food and Agriculture Organization of the United Nations (FAO), and the World Trade Organization (WTO), and publicly available to consult or download. For example, the FAO statistics that are part of the data used by the authors seem to be publicly available (e.g. Global aquaculture production can be downloaded at https://www.fao.org/fishery/en/collection/aquaculture?lang=en). In my opinion, in order to show the reproducibility of their results, the authors should provide clear references to download the data as well as the source code of the scripts they used for data cleaning, features extraction, and classification. If, for reasons that should be approved by the journal, the authors are not authorized to reproduce and/or make the raw data used in their study available, they should at least provide an archive containing the extracted features used to train their models.
Comment on the reproducibility of results: The authors did not detail enough the methodology used in their work in order to make it easily replicable. For each of their base models, the authors must also provide all information and/or parameters in order for the readers to be able to reproduce their results. A few examples include (but are not limited to) the depth of the decision tree, number of trees in the random forest (and other models), number of nodes per layer, error function, learning rate, optimization function.. for the neural network, regularization parameter C in the SVR, etc... Moreover, the authors should also specify if they were default parameters or, if not, how those parameters were selected. The source code of the scripts they used for data cleaning, features extraction and classification would greatly help in this regard. In this state, the paper does not offer the required information (data and methodology) to replicate the proposed results and that essential information constitutes one of the most important prerequisites that should be met before this work can be accepted for publication.
Other comments about the paper:
1. The title says "evidence", but it is not clear what kind of evidence it's referring to. Moreover, the super learner approach is hardly novel in machine learning and should not be referred to as such in a general sense. If the authors referred to the first time a super learner is used to specifically predict shrimp prices, they should be more precise so that the reader is not misled to think that the super learner proposed by the authors is a novel approach in machine learning.
2. Page 8, "Finally, we obtained 13 independent variables for use in the super learner process to predict the Vietnamese export price": the analysis resulting in the choice of variables is interesting and is clearly described, however, how did the authors determine that these 13 variables were independent? The authors should specify the statistical tests that were performed, describe the methods used to select these tests, and publish the precise results of their statistical analysis that lead to the conclusion that these 13 variables were indeed independent.
3. Page 8: moreover, in addition to the analysis proposed to determine the choice of selected variables, the authors should also support their choice by performing a statistical analysis or a forward feature selection (or any other traditional feature selection technique, e.g. see https://doi.org/10.1016/B978-0-444-81892-8.50040-7). These kinds of analyses are helpful to quantify the importance of all variables and validate the choice proposed by the authors or discuss why the choice proposed by the authors varies from these more traditional methods. For example, in the discussion, the authors state that the price of Indian shrimps significantly affects the price of Vietnamese shrimps. It would be interesting to perform a correlation analysis between the price of India and Vietnam and report the result. Additionally, a forward feature selection analysis (and the evolution of prediction accuracy when incrementally adding more features) will also show if the model would already perform well with a smaller number of features (or even with just the price of Indian shrimps) or if adding more features was essential to reach the excellent performances reported by the authors. This type of analysis becomes increasingly important in machine learning and, in my opinion, should be systematically performed.
4. Pages 10-14: if the target audience is machine learning experts, the description of the base models written by the authors is too general, often simplistic, and offers little useful information (no detailed explanation on the choice of parameters, model architecture, why the model was specifically chosen, etc..). If the target audience is not supposed to have prior knowledge of machine learning, then the descriptions should be easy and clear to understand. But at the risk of repeating myself, in addition to the model's description, authors are required to provide all the necessary information about their models to ensure the reproducibility of their results.
5. Page 11: Formula (1): please specify what p and n stand for. Formula (2): I would suggest writing beta under the argmin function to make it explicit that the argmin function finds the beta that minimizes the sum. Please also specify the N parameter so it doesn't get confused with the n of the first formula (you may also consider changing the n of Formula (1)).
6. As the authors correctly stated, a super learner model is at least as good, and (hopefully) better, than any individual models that compose it. Thus, comparing the super learner results to the candidate models does not give any information to the reader concerning the performance of the authors' approach compared to state-of-the-art models for price predictions. I think that it would have been much more interesting and informative, in order to evaluate the performance of their approach, if the authors would also have compared the super-learning results with state-of-the-art models for price predictions such as LSTM/GRU, ARIMA, etc.. (as done in https://doi.org/10.1007/s00521-020-05172-3).
7. Page 21: the sentence "These performances proved that the super learner is a novel approach that produced the predictions in this study." should be rephrased. Did the authors want to emphasize that the super learner produced the best predictions? Plus, although the authors demonstrated the superior results of the super learner compared to any of its composing predictors, as stated earlier, the super learner approach is hardly novel in machine learning and should not be referred to as such in a general sense by the authors.
8. Line 460: "contributing 70% of the value of India’s exports": Did the authors want to say "India's seafood exports"? Because the sentence tends to state that it's 70% of India's total exports.
9. Figure 2-5: price should have an uppercase 'p'
**********
6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No
**********
[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
15 Jul 2022
Reviewer #1: In this paper, the authors use a super learner with the aim of giving accurate and stable price predictions for Vietnamese shrimp products exported to the US market. In the paper, 10 algorithms are combined to predict the export price of Vietnamese shrimp based on information from competing exporters such as China, Thailand, Indonesia, or India. In addition, a SHAP method is used to determine how each variable (predictor) influences the price. In my opinion, the paper is interesting and novel but very simple. There are some technical weaknesses that must be addressed, which will help improve the clarity of the paper.
Acronyms are overused, which makes reading difficult. Please use them where necessary but not continuously.
Answer: Thank you very much for your comment. We have reduced the number of acronyms in the manuscript.
First of all, the paper would have to be reorganized: The Materials and Method section should be divided into Preliminaries, where all the methods and dataset used in the paper are included and Methodology, where they put
Answer: Thank you very much for your comment. We reorganized the Materials and Method section of the manuscript according to your suggestion.
I am not sure if the variables the authors use (in Table 1) are correlations or simple differences since the name is CorrelationPrice but the description talks about differences. It is mandatory to clarify that.
Answer: Thank you very much for your comment. The variables are differences between the price of Vietnamese shrimp and the price of shrimp from other countries. The difference is calculated in one direction, the price of Vietnam minus the price of the other country, for each month in the dataset. Therefore, the data contain both negative and positive values. If a value is positive, the Vietnam price is higher; if it is negative, the Vietnam price is lower. Both the magnitude and the direction of these differences contribute to the prediction. Also, the correlation between the Vietnamese export price and that of other countries is shown in Fig 1. We changed the name “CorrelationPrice” to “DifferencePrice” in Table 1 to match the description.
In line 209 the authors say that “After evaluating the suitability of the combination of algorithms…” but: How do the authors assess the adequacy of these algorithms? Why choose 10 algorithms if it is known in advance that some of them will not give precise results? The inclusion of an explanatory flow chart is mandatory.
Answer: Thank you very much for your comment. In the previous paper (https://doi.org/10.1007/s12562-021-01498-6), random forest and gradient boosting were the best single machine learning approaches to predict the price of Vietnamese export shrimp. In the super learner, a combination of algorithms is required to generate an ensemble model. Therefore, the selected set of algorithms, called S, contains random forest and gradient boosting. We iteratively add new algorithms to the current set, S. To evaluate a new candidate algorithm, we added it to S and tested the accuracy of the ensemble model. If it improved the accuracy, we added this new algorithm to S. Otherwise, we eliminated it and chose another algorithm. The more algorithms we added, the more accurate the ensemble model became. However, too many candidate algorithms increase the time and computational cost of implementation. Here, we set the maximum number of algorithms to 10 for the current dataset. The forward selection of the algorithms is specified in Table 2. We added a description of this process to the manuscript at lines 360 - 371.
I don't understand the title of the Section “Results Machine Learning”. It must be changed.
Answer: Thank you very much for your comment. The results section has many subsections. We modified the heading you mention to “prediction accuracy.”
Figures 2, 3, 4 and 5 are practically the same. That means that the predictions at 3, 6, 9 and 12 months based on super learner are practically identical. Is this temporary specification necessary?
Answer: Thank you very much for your comment. The prediction by the super learner has stable accuracy for different periods of data (3, 6, 9, or 12 months). Therefore, there are only small differences (small errors) among periods. Figures 2, 3, 4, and 5 are intended to emphasize that the super learner can overcome the dependence of the algorithm on long- or short-term periods of data. To make these figures easier to compare, we combined them into one figure, now Fig 3. We explain this point in the manuscript at lines 520–524.
Regarding the data, nothing is said about the amount of data used, nothing is said about data cleaning, nothing is said about whether this data is balanced or not.
Answer: Thank you very much for your comment. The monthly datasets were collected from the US Department of Agriculture, WTO, and FAO for the period from May 1995 to May 2019. Therefore, there are 289 rows of data. We only focus on the competition among 7 countries that export to the US market, including China, Thailand, India, Indonesia, Ecuador, Chile, and Vietnam. Therefore, we chose 13 variables related to competition among countries as explained at lines 137 to 149. The dataset used in this research has 289 rows and 13 columns. For machine learning, we separated 2 periods: May 1995 to Apr 2013 for training the model, and May 2013 to May 2019 for testing accuracy, as described at lines 376 to 380. We also added the distribution of export prices of Vietnamese shrimp and the correlation between the Vietnamese export price and that of other countries to Fig 1. Also, an explanation of Fig 1 was added to the manuscript at lines 139 – 149.
An important weakness of the paper is the technical section. Then, nothing is said about the different types of algorithms used (they only describe a simple definition as Wikipedia). They must specify much more the technical part of the data, the technical part of the algorithms used as well as the technical part of the super-learner. They are treated very weakly. For example, authors do not provide any table or description with the training parameters and the hyper-parameters of each model. This must be explicitly indicated.
Answer: Thank you very much for your comment. We added the description of parameters used in each algorithm in the manuscript.
Linear regression: line 206 – 210
Lasso regression: line 224 – 228
Ridge regression: line 238 – 242
Elastic net: line 252 – 255
K-nearest neighbor: line 263 – 267
SVR: line 278 – 282
Decision Tree: line 291 – 294
Random forest: line 301 – 306
Gradient Boosting: line 313 – 317
Neural network: line 329 – 334
Extra tree: line 344 – 348
Last, since the problem is a time series forecasting task, why have the authors not used well-known algorithms in this field such as ARIMA, VAR or VARMAX models?
Answer: Thank you very much for your comment. We did not use these algorithms (ARIMA, VAR, and VARMAX) in this paper as we already evaluated them in a previous paper (https://doi.org/10.1007/s12562-021-01498-6). Although these algorithms are state-of-the-art, these time series analyses were not suitable for our dataset because of their large prediction errors. Also, we tested VARMAX on our dataset and compared its accuracy to the current candidate algorithms. Here, we applied the Vector Autoregressive Moving Average with eXogenous regressors (VARMAX) model (statsmodels.tsa.statespace.varmax.VARMAX(train_feature,order = (2,0))) using Python version 3.7. For the testing subset, the mean absolute error was 8.82% for the 3-month base of data, 15.15% for the 6-month base, 17.30% for the 9-month base, and 21.00% for the 12-month base, a larger average error than that of linear regression. In fact, linear regression was one of the worst candidate algorithms for the super learner. When we combined VARMAX with the current 10 algorithms, the error (MAPE) for the 3-month-based prediction was 1.17%, larger than the current error (0.80%). Therefore, these time series analyses were not added to the super learner.
Reviewer #2: The authors use a super learner (regression model) as a predictor of Vietnamese shrimp prices as well as the SHAP method to estimate the most important features/factors that influence the predicted variable. The authors report strong results in the accuracy of the predictions and the feature importance analysis allows the highlight of useful information from the trained model. Their approach is interesting since the ability to propose (i) an accurate prediction of shrimp prices and (ii) the reasons behind the model prediction (or feature importance) can both be useful for producers or policymakers to develop more adapted strategies in the future.
Comment on data availability: The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction. The authors did not make all data used in their study fully available which thus prevents reviewers or readers from independently verifying the validity of their results. The authors write that the "Data cannot be shared publicly because it is sensitive and relates to Vietnam and other countries economics sector." but then that they have "we have some papers which relate this data, but have not been published yet". For me, these two statements may appear contradictory. Moreover, the data used in their study (which includes price, required farming certificates, and disease outbreaks from May-1995 to May-2019) seems to be published by international organizations such as the US Department of Agriculture (USDA), the Food and Agriculture Organization of the United Nations (FAO), and the World Trade Organization (WTO), and publicly available to consult or download. For example, the FAO statistics that are part of the data used by the authors seem to be publicly available (e.g. Global aquaculture production can be downloaded at https://www.fao.org/fishery/en/collection/aquaculture?lang=en). In my opinion, in order to show the reproducibility of their results, the authors should provide clear references to download the data as well as the source code of the scripts they used for data cleaning, features extraction, and classification. If, for reasons that should be approved by the journal, the authors are not authorized to reproduce and/or make the raw data used in their study available, they should at least provide an archive containing the extracted features used to train their models.
Answer: Thank you for your comment. Other papers now partly use this dataset and are in the process of publication. We are willing to share with PLOS ONE the original dataset that was analyzed to train the super learner model in this paper.
Comment on the reproducibility of results: The authors did not detail enough the methodology used in their work in order to make it easily replicable. For each of their base models, the authors must also provide all information and/or parameters in order for the readers to be able to reproduce their results. A few examples include (but are not limited to) the depth of the decision tree, number of trees in the random forest (and other models), number of nodes per layer, error function, learning rate, optimization function.. for the neural network, regularization parameter C in the SVR, etc... Moreover, the authors should also specify if they were default parameters or, if not, how those parameters were selected. The source code of the scripts they used for data cleaning, features extraction and classification would greatly help in this regard. In this state, the paper does not offer the required information (data and methodology) to replicate the proposed results and that essential information constitutes one of the most important prerequisites that should be met before this work can be accepted for publication.
Answer: Thank you very much for your comment. We added the description of parameters for each algorithm.
Linear regression: line 206 – 210
Lasso regression: line 224 – 228
Ridge regression: line 238 – 242
Elastic net: line 252 – 255
K-nearest neighbor: line 263 – 267
SVR: line 278 – 282
Decision Tree: line 291 – 294
Random forest: line 301 – 306
Gradient Boosting: line 313 – 317
Neural network: line 329 – 334
Extra tree: line 344 – 348
Other comments about the paper:
1. The title says "evidence", but it is not clear what kind of evidence it's referring to. Moreover, the super learner approach is hardly novel in machine learning and should not be referred to as such in a general sense. If the authors referred to the first time a super learner is used to specifically predict shrimp prices, they should be more precise so that the reader is not misled to think that the super learner proposed by the authors is a novel approach in machine learning.
Answer: Thank you very much for your comment. The word “evidence” may cause confusion for the reader. Here, the “evidence” in the title refers to the case of Vietnamese shrimp exports. It does not mean evidence for the super learner. The super learner was highly accurate when it was applied to predict the dataset of Vietnamese exports. We changed “evidence from” to “the case of” in the title.
2. Page 8, "Finally, we obtained 13 independent variables for use in the super learner process to predict the Vietnamese export price": the analysis resulting in the choice of variables is interesting and is clearly described, however, how did the authors determine that these 13 variables were independent? The authors should specify the statistical tests that were performed, describe the methods used to select these tests, and publish the precise results of their statistical analysis that lead to the conclusion that these 13 variables were indeed independent.
Answer: Thank you very much for your comment. To determine the independent variables used for predicting the export price of Vietnamese shrimp, we applied the Pearson correlation method. In machine learning, the Pearson correlation method is widely used to select variables/features for prediction (https://doi.org/10.1260/1748-3018.6.3.385). We calculated the association between the candidate variables and the Vietnamese export price. The Pearson correlation assigns a value between −1 and 1, where 0 is no correlation, 1 is total positive correlation, and −1 is total negative correlation. Therefore, an absolute value of correlation from 0.5 to 1 is generally acceptable and considered a high association with the target value.
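The screening rule described in this answer can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the variable names, and the toy price series are hypothetical, and only the 0.5 threshold on the absolute Pearson correlation is taken from the response.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def screen_features(features, target, threshold=0.5):
    """Keep features whose absolute correlation with the target meets the threshold."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores


# Toy monthly series (hypothetical numbers, for illustration only).
target_price = [10.0, 11.0, 12.0, 13.0, 14.0]
candidates = {
    "price_a": [8.0, 8.5, 9.5, 10.0, 11.0],  # moves closely with the target
    "price_b": [5.0, 7.0, 4.0, 6.0, 5.0],    # only weakly related to the target
}
kept, scores = screen_features(candidates, target_price)
```

In this toy example only `price_a` passes the threshold. Any variable retained despite a lower score would have to be added back by hand, since the rule itself is a simple cutoff.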
Although the correlation between the Thai and Vietnamese prices is 0.47 (<0.5), we also used it to fully evaluate the competition. These 13 variables satisfy the criteria of the Pearson correlation. Therefore, they were used in the prediction. Also, we explained the method of selecting the variables in the manuscript at lines 188 – 197 and the results of the selection at lines 420–431.
3. Page 8: moreover, in addition to the analysis proposed to determine the choice of selected variables, the authors should also support their choice by performing a statistical analysis or a forward feature selection (or any other traditional feature selection technique, e.g. see https://doi.org/10.1016/B978-0-444-81892-8.50040-7). These kinds of analyses are helpful to quantify the importance of all variables and validate the choice proposed by the authors or discuss why the choice proposed by the authors varies from these more traditional methods. For example, in the discussion, the authors state that the price of Indian shrimps significantly affects the price of Vietnamese shrimps. It would be interesting to perform a correlation analysis between the price of India and Vietnam and report the result. Additionally, a forward feature selection analysis (and the evolution of prediction accuracy when incrementally adding more features) will also show if the model would already perform well with a smaller number of features (or even with just the price of Indian shrimps) or if adding more features was essential to reach the excellent performances reported by the authors. This type of analysis becomes increasingly important in machine learning and, in my opinion, should be systematically performed.
Answer: Thank you very much for your comment. We chose the Pearson correlation method to select variables with a strong influence on the target value, the Vietnamese export price. Because the absolute value of the correlation ranges from 0 to 1, a correlation from 0.5 to 1 is acceptable and considered to represent a high impact on the target value. Most variables satisfied this condition. The price of Indian shrimp strongly influenced the Vietnam price, showing a correlation value of 0.89, the highest among all countries (Fig 1). Also, the distribution of the Indian price is more similar to that of the Vietnam price, but lower. The Indian price was mainly 8–11 USD, while the Vietnam price was 10–15 USD. This means that India has a competitive advantage over Vietnam on price, which was estimated in the SHAP evaluation (Fig 4). The explanation regarding the competition of Indian prices in the Discussion is at lines 540 – 549.
4. Pages 10-14: if the target audience is machine learning experts, the description of the base models written by the authors is too general, often simplistic, and offers little useful information (no detailed explanation on the choice of parameters, model architecture, why the model was specifically chosen, etc..). If the target audience is not supposed to have prior knowledge of machine learning, then the descriptions should be easy and clear to understand. But at the risk of repeating myself, in addition to the model's description, authors are required to provide all the necessary information about their models to ensure the reproducibility of their results.
Answer: Thank you very much for your comment. We described the parameters of each algorithm in the manuscript. We also explain how these algorithms were selected at lines 360–371.
5. Page 11: Formula (1): please specify what p and n stand for. Formula (2): I would suggest writing beta under the argmin function to make it explicit that the argmin function finds the beta that minimizes the sum. Please also specify the N parameter so it doesn't get confused with the n of the first formula (you may also consider changing the n of Formula (1)).
Answer: Thank you very much for your comment. We modified equations (1) and (2) in the manuscript. They are now equations (2) and (3), respectively.
6. As the authors correctly stated, a super learner model is at least as good, and (hopefully) better, than any individual models that compose it. Thus, comparing the super learner results to the candidate models does not give any information to the reader concerning the performance of the authors' approach compared to state-of-the-art models for price predictions. I think that it would have been much more interesting and informative, in order to evaluate the performance of their approach, if the authors would also have compared the super-learning results with state-of-the-art models for price predictions such as LSTM/GRU, ARIMA, etc… (as done in https://doi.org/10.1007/s00521-020-05172-3).
Answer: Thank you very much for your comment. As shown in our previous paper (https://doi.org/10.1007/s12562-021-01498-6), we already tried time series forecasting algorithms to find the best single algorithm for prediction, but we did not select them. Although these algorithms are state-of-the-art models, they seem to be unsuitable for our dataset due to their larger error compared to machine learning algorithms. In this study, we again tested the VARMAX algorithm using the Vector Autoregressive Moving Average with eXogenous regressors model (statsmodels.tsa.statespace.varmax.VARMAX(train_feature,order = (2,0))) with Python version 3.7. For the testing subset, the mean absolute error was 8.82% for the 3-month base of data, 15.15% for the 6-month base, 17.30% for the 9-month base, and 21.00% for the 12-month base, a larger average error than that of linear regression. In the super learner, the compatibility among algorithms is important. We also combined VARMAX with the current 10 algorithms. The resulting error (MAPE) was 1.17% for the 3-month base period, larger than that of the current 10 algorithms (0.80%).
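For reference, a MAPE figure such as the one quoted here is conventionally the mean of the absolute errors divided by the actual values, expressed as a percentage. The sketch below assumes that standard definition, since the response does not spell out the formula; the function name and the sample prices are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes the standard definition and that no actual value is zero.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical export prices and forecasts, for illustration only.
actual_prices = [10.0, 12.5, 8.0]
forecasts = [10.1, 12.4, 8.0]
error_percent = mape(actual_prices, forecasts)
```

Lower values are better under this metric, so the 0.80% obtained with the current 10 algorithms is preferable to the 1.17% obtained once VARMAX is included.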
Therefore, these time series forecasting algorithms were not added to the super learner.
7. Page 21: the sentence "These performances proved that the super learner is a novel approach that produced the predictions in this study." should be rephrased. Did the authors want to emphasize that the super learner produced the best predictions? Plus, although the authors demonstrated the superior results of the super learner compared to any of its composing predictors, as stated earlier, the super learner approach is hardly novel in machine learning and should not be referred to as such in a general sense by the authors.
Answer: Thank you very much for your comment. We rephrased this sentence in the manuscript at lines 521–524.
8. Line 460: "contributing 70% of the value of India’s exports": Did the authors want to say "India's seafood exports"? Because the sentence tends to state that it's 70% of India's total exports.
Answer: Thank you very much for your comment. We changed “India’s exports” to “India’s seafood exports” (lines 547 – 548).
9. Figure 2-5: price should have an uppercase 'p'
Answer: Thank you very much for your comment. We changed ‘price’ to ‘Price’ in Fig 3 (previously Fig 2–Fig 5).
Submitted filename: Response to Reviewers.docx
13 Sep 2022
A novel machine learning approach to predict the export price of seafood products based on competitive information: The case of the export of Vietnamese shrimp to the US market
PONE-D-22-12822R1
Dear Dr. Nguyen,
We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.
To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.
If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.
Kind regards,
José F. Vicent Francés, Ph.D.
Academic Editor
PLOS ONE
Additional Editor Comments (optional):
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #2: All comments have been addressed
**********
2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #2: Yes
**********
3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #2: Yes
**********
4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.
Reviewer #2: Yes
**********
5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #2: Yes
**********
6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #2: I believe the authors have adequately addressed the comments raised in the previous round of review. I particularly appreciate that the authors made their dataset fully available and provided all technical information to ensure the reproducibility of their results. For me, this manuscript is now acceptable for publication. Thank you to the authors for their interesting work.
**********
7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #2: Yes: Cédric Simar
**********
19 Sep 2022
PONE-D-22-12822R1
A novel machine learning approach to predict the export price of seafood products based on competitive information: The case of the export of Vietnamese shrimp to the US market
Dear Dr. Nguyen:
I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.
If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.
If we can help with anything else, please email us at plosone@plos.org.
Thank you for submitting your work to PLOS ONE and supporting open access.
Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. José F. Vicent Francés
Academic Editor
PLOS ONE
