Hoa Thi Pham, Joseph Awange, Michael Kuhn.
Abstract
Machine learning (ML) has been widely used worldwide to develop crop yield forecasting models. However, it is still challenging to identify the most critical features from a dataset. Although either feature selection (FS) or feature extraction (FX) techniques have been employed, no research compares their performances or, more importantly, the benefits of combining both methods. Therefore, this paper proposes a framework that uses no feature reduction (All-F) as a baseline to investigate the performance of FS, FX, and a combination of both (FSX). The case study employs the vegetation condition index (VCI)/temperature condition index (TCI) to develop 21 rice yield forecasting models for eight sub-regions in Vietnam based on ML methods, namely linear, support vector machine (SVM), decision tree (Tree), artificial neural network (ANN), and Ensemble. The results reveal that FSX takes full advantage of both FS and FX: FSX-based models perform best in 18 of the 21 cases, compared with 2 for FS-based and 1 for FX-based models. These FSX-, FS-, and FX-based models improve on All-F-based models by 21% on average and by up to 60% in terms of RMSE. Furthermore, the 21 best models are based on Ensemble (13 models), Tree (6 models), linear (1 model), and ANN (1 model). These findings highlight the significant role of FS, FX, and especially FSX, coupled with a wide range of ML algorithms (especially Ensemble), in enhancing the accuracy of crop yield prediction.
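The improvement figures quoted above can be read as a relative reduction in RMSE against the All-F baseline. The abstract does not give the exact formula, so the following is a minimal sketch under that assumption. The function names and the yield values (tons/hectare) are hypothetical and chosen only for illustration.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_improvement(baseline_rmse, reduced_rmse):
    """Percentage reduction in RMSE relative to the baseline (assumed definition)."""
    return 100.0 * (baseline_rmse - reduced_rmse) / baseline_rmse


# Hypothetical yields in tons/hectare, not taken from the paper.
observed = [5.0, 5.5, 6.0, 6.5]
all_f_pred = [5.4, 5.1, 6.4, 6.1]  # every prediction is off by 0.4 t/ha
fsx_pred = [5.2, 5.3, 6.2, 6.3]    # every prediction is off by 0.2 t/ha

baseline = rmse(observed, all_f_pred)
reduced = rmse(observed, fsx_pred)
improvement = rmse_improvement(baseline, reduced)
print(f"All-F RMSE: {baseline:.2f}, FSX RMSE: {reduced:.2f}, improvement: {improvement:.0f}%")
```

A positive value means the reduced-feature model has a lower RMSE than the All-F model. A negative value would mean that feature reduction made the forecast worse.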
Keywords: TCI; VCI; crop yield; feature extraction; feature selection; machine learning
Year: 2022 PMID: 36081066 PMCID: PMC9460661 DOI: 10.3390/s22176609
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
The feature dimensionality reduction methods used and the main findings in reviewed papers.
| Crop; Regions | Features | Methods | Related Findings |
|---|---|---|---|
| Rice; Punjab State of India [ | Features related to agriculture and weather | RRF; CBFS, RFE | Selected ten of the most significant features |
| Sugarcane; São Paulo State, Brazil [ | NDVI: At the start, in the middle, 1–10 months after the harvest starts; amplitude; max; derivative; integral | Wrapper combining ANN | Selected seven essential features |
| Bio-oil; Unclear [ | Biomass composition and pyrolysis conditions | GA, filter, wrapper | GA outperforms filter and wrapper methods. |
| Unclear crop; Tamil Nadu, India [ | Canal length; the number of tanks, tube and open wells; planting area; amount of fertilizers, seed quantity; cumulative rainfall and radiation; max/average/min temperatures | FFS, BFE, CBFS, RFVarImp, VIF | Methods achieved similar accuracy (FFS and BFE were slightly better than the others, but FFS took less time) when combined with MLR and M5Prime, but accuracy varied with ANN; The adjusted |
| Unclear crop; Tamil Nadu, India [ | Canal length; the number of tanks, tube and open wells; planting area; amount of fertilizers, seed quantity; cumulative rainfall and radiation; max/average/min temperatures | FFS, CBFS, VIF, RFVarImp | FFS gives good accuracy; RF achieves the highest quality for all feature subsets compared with ANN, SVM, and KNN. |
| Soybean; Southern France [ | Features related to climate, soil, and management | Filter, wrapper, embedded | The subsets selected by wrapper combined with SVM and LR provided the best results. |
| Winter wheat; Germany [ | Weekly weather data, soil conditions, and crop phenology variables | SHAP explanation | The accuracies of models using 50/75 percent of components did not decline significantly compared with the model using full features; some even slightly improved. |
| Corn and Soybean; The United States [ | Weather components, soil conditions, and management | The trained CNN-RNN model | Model accuracy did not decline notably compared with the model based on full features; some models even improved slightly. |
| Alfalfa; Wisconsin, the United States [ | Vegetation indices | RFE | All models based on RF, SVM, and KNN were improved when using selected features. |
| Tea; Bangladesh [ | Satellite-derived hydro-meteorological variables | Dragonfly and SVM | Combining RF with the dragonfly algorithm and SVR-based feature selection improves prediction performance. |
| Alfalfa; Kentucky and Georgia [ | Weather, historical yield, sown date | CBFS, ReliefF, wrapper | CBFS was better than ReliefF and wrapper; ML combined with FS offered promise in forecasting performance. |
| Sugarcane; Teodoro Sampaio-São Paulo in Brazil [ | Soil and weather | RReliefF | FS eliminated nearly 40% of the features but increased the mean absolute error (MAE) by 0.19 Mg/ha. |
| Coffee; Brazil [ | Leaf area index (LAI), tree height, crown diameter, and individual RGB band values | Pearson, Spearman, F-test, RFE, Mutual Information | Most learners that used only the most important parameters (LAI and crown diameter) and the most critical months improved prediction compared with using all features. |
| Winter wheat, Corn; Kansas, USA [ | VCI and TCI | PCA | The contribution of PCA was unclear because PCA-ML was not compared with ML-only. |
| Rice, Potato; Bangladesh [ | VCI and TCI | PCA | The contribution of PCA was unclear because PCA-ML was not compared with ML-only. |
| Cotton; Unclear region [ | Max/min temperature, relative humidity, wind speed, sunshine hours | PCA | A significant improvement in PCR-based prediction models compared with models using MLR. |
| Rice; Vietnam [ | VCI and TCI | PCA | PCA coupled with EmbBoostTree was better than ML-only at an average of 18.5% and up to 45% of RMSE. |
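Several methods in the table are correlation-based filters or PCA. The sketch below shows how one filter step and one PCA step could be chained into a combined FS-then-FX pipeline. It assumes that FSX means applying FX to the features retained by FS, which this record does not spell out. The feature names, data values, and helper functions are hypothetical, and the power-iteration PCA is a stand-in for a library routine.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def select_features(features, target, k):
    """FS step: keep the k feature names with the largest |r| against the target."""
    ranked = sorted(features, key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    return ranked[:k]


def first_component_scores(columns, iterations=200):
    """FX step: scores on the first principal component, found by power iteration."""
    n = len(columns[0])
    centred = []
    for col in columns:
        mean = sum(col) / n
        centred.append([v - mean for v in col])
    p = len(centred)
    # Sample covariance matrix of the centred columns.
    cov = [[sum(a * b for a, b in zip(centred[i], centred[j])) / (n - 1)
            for j in range(p)] for i in range(p)]
    w = [1.0] * p  # starting vector; assumed not orthogonal to the leading eigenvector
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return [sum(w[i] * centred[i][t] for i in range(p)) for t in range(n)]


# Hypothetical weekly index values and yields (tons/hectare), for illustration only.
features = {
    "vci_w1": [40, 55, 60, 70, 80],
    "vci_w2": [42, 50, 65, 68, 85],
    "tci_w1": [50, 48, 52, 49, 51],
}
yield_t = [4.8, 5.2, 5.6, 5.9, 6.4]

selected = select_features(features, yield_t, k=2)                        # FS
scores = first_component_scores([features[name] for name in selected])   # FX on the FS output
print(selected, [round(s, 2) for s in scores])
```

The resulting `scores` column would then replace the raw features as input to a regression model. In practice a tested library implementation of PCA would be preferable to this hand-rolled version.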
Figure 1. Flowchart illustrating the selection of the best performing ML-based crop yield prediction models based on FS, FX, FSX, and All-F.
Figure 2. Eight sub-regions for developing rice yield prediction in mainland Vietnam: Northwest, Northeast, RRD, NCC, SCC, Highlands, Southeast, and MRD [33].
Figure 3. The RMSE of All-F-, FS-, FX-, and FSX-based models in different ML algorithms, units are tons/hectare.
Figure 4. Percentage of FS-, FX-, and FSX-based models outperforming All-F-based models (a1); FS-based models being better than FX-based models, FSX-based models being better than FS-based models, and FSX-based models being better than FX-based models (a2).
Figure 5. (a1–a3) display RMSE values of the best models generated from separate All-F, FS, FX, and FSX sets, while (a4) presents the RMSE of the overall best model selected over different feature subsets (All-F, FS, FX, and FSX) for each rice season. The numbers in the columns refer to the ML method used: 1 (linear), 3 (SVMQuadratic), 4 (SVMCubic), 5 (SVMRBF), 6 (Tree), 7 (EmbBoostTree), 8 (EmbBagTree), 9 (ANNReLu), 11 (ANNSigmoid), and 12 (ANNNone). The text at the top of each column denotes the dimensionality reduction technique used; (b) the accuracy improvement of the best overall models compared with the All-F-based models.