Magdalena K. Sobol, Sarah A. Finkelstein.
Abstract
This paper investigates the suitability of supervised machine-learning classification methods for the classification of biomes from pollen datasets. We assign modern pollen samples from Africa and Arabia to five biome classes using a previously published African pollen dataset and a global ecosystem classification scheme. To test the applicability of traditional and machine-learning-based classification models for the task of biome prediction from high-dimensional modern pollen data, we train a total of eight classification models: Linear Discriminant Analysis, Logistic Regression, Naïve Bayes, K-Nearest Neighbors, Classification Decision Tree, Random Forest, Neural Network, and Support Vector Machine. The ability of each model to predict biomes from pollen data is statistically tested on an independent test set. The Random Forest classifier outperforms the other models in its ability to correctly classify biomes from pollen data. Of the eight models, the Random Forest classifier scores highest on all of the metrics used for model evaluation and predicts four out of five biome classes to a high degree of accuracy, including arid, montane, and tropical and subtropical closed and open systems, e.g. forests and savanna/grassland. The model has the potential for accurate reconstructions of past biomes and awaits application to fossil pollen sequences. The Random Forest model may be used to investigate vegetation changes on both long and short time scales, e.g. during glacial and interglacial cycles, or more recent and abrupt climatic anomalies such as the African Humid Period. Such applications may contribute to a better understanding of past shifts in vegetation cover and ultimately provide valuable information on the drivers of climate change.
Year: 2018 PMID: 30138366 PMCID: PMC6122137 DOI: 10.1371/journal.pone.0202214
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
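The workflow described in the abstract — fitting several classifiers to a matrix of modern pollen samples and scoring them on a held-out test set — can be sketched with scikit-learn. The data below are synthetic stand-ins (random "taxon" abundances), not the paper's pollen dataset, and only three of the eight model families are shown.

```python
# Sketch: fit several classifiers and compare held-out accuracy,
# as in the study's model comparison. Synthetic data throughout.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 40))      # 300 samples x 40 pollen taxa (synthetic)
y = rng.integers(0, 5, 300)    # 5 biome classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
}
# Fit each model on the training split, score on the independent test split
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```

On real pollen data the test-set scores, not training fit, are what the study compares across models.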
African biomes represented in the modern pollen data organized by biome, number of representative modern pollen samples, biogeographic region, and country.
| Biome | Pollen | Biogeographic region | Country |
|---|---|---|---|
| Deserts and Xeric Shrublands | 239 | Namib and Karoo deserts and shrublands | South Africa, Namibia |
| | | Kaokoveld Desert | Namibia, Angola |
| | | Madagascar Spiny Desert | Madagascar |
| | | Horn of Africa deserts | Somalia |
| | | Socotra Island Desert | Yemen |
| Flooded Grasslands and Savannas | 21 | Sahelian flooded savannas | Mali, Chad, Niger, Nigeria, Cameroon, Senegal, Mauritania |
| | | Zambezian flooded savannas | Botswana, Namibia, Angola, Zambia, Malawi, Mozambique |
| | | Sudd flooded grasslands | Sudan, Ethiopia |
| Montane Grasslands and Shrublands | 120 | East African moorlands | Kenya, Tanzania, Uganda, D.R. Congo, Rwanda |
| | | Ethiopian Highlands | Somalia, Eritrea, Sudan |
| | | Zambezian montane savannas and woodlands | South Africa, Lesotho, Swaziland |
| Tropical and Subtropical Grasslands, Savannas, and Shrublands | 415 | Angolan Escarpment woodlands | Angola |
| | | Zambezian woodlands and savannas | Zambia, Tanzania, Malawi, Zimbabwe, Mozambique, Angola, Namibia, Botswana, D.R. Congo, Burundi |
| | | Sudanian savannas | Central African Republic, Chad, Uganda, Ethiopia, D.R. Congo, Cameroon, Sudan, Nigeria, Eritrea |
| | | East African acacia savannas | Kenya, Tanzania, Sudan, Ethiopia, Uganda |
| Tropical and Subtropical Moist Broadleaf Forests | 314 | Madagascar moist forests | Madagascar |
| | | Guinean moist forests | Ghana, Guinea, Côte d’Ivoire, Liberia, Sierra Leone, Togo |
| | | Eastern Arc montane forests | Tanzania, Kenya |
| | | East African coastal forests | Tanzania, Kenya, Mozambique, Somalia |
| | | Albertine Rift highland forests | D.R. Congo, Rwanda, Uganda, Burundi, Tanzania |
| | | East African highland forests | Kenya, Tanzania, Uganda |
| | | Seychelles and Mascarene Islands forests | Mauritius, Seychelles, Comoros, Reunion, Rodrigues |
| | | Gulf of Guinea Islands forests | São Tomé and Príncipe, Equatorial Guinea |
| | | Macaronesian forests | Azores, Madeira, Canary, Cape Verde Islands |
| | | Congolian coastal forests | Cameroon, Gabon, R. Congo, Nigeria, Equatorial Guinea, Benin |
| | | Western Congo Basin forests | Central African Republic, Cameroon, R. Congo, Gabon, D.R. Congo, Equatorial Guinea |
| | | Northeastern Congo Basin forests | D.R. Congo, Central African Republic, Sudan, Uganda |
| | | Southern Congo Basin forests | D.R. Congo, Congo, Angola |
Fig 1. Simplified representation of the classification process for the statistical and machine-learning algorithms used for biome prediction.
a) Linear Discriminant Analysis, b) Logistic Regression, c) K-Nearest Neighbors, d) Classification Decision Tree, e) Random Forest, f) Support Vector Machines, and g) Neural Networks. The Naïve Bayes classifier is not depicted. Red and green dots in panels a), c), and f) represent two classes of data, while pink stars represent a new pollen assemblage without a biome label. Pink lines in d) and e) represent decision paths.
Fig 2. Distribution of modern pollen samples (Gajewski et al., 2002) across African biomes (Olson et al., 2001).
List of hyper-parameters identified for each model using random grid search (Bergstra & Bengio, 2012), their optimized values, and argument descriptions (Pedregosa et al., 2011).
| Model | Parameter | Value | Argument description |
|---|---|---|---|
| LDA | n_components | 3 | Number of components for dimensionality reduction |
| | solver | svd | Solver to use |
| LR | multi_class | multinomial | Class type; either ‘one-versus-rest’ or ‘multinomial’ |
| | C | 973.755518841459 | Inverse of regularization strength |
| | solver | lbfgs | Algorithm to use in the optimization problem |
| | fit_intercept | False | Specifies if a constant should be added to the decision function |
| | class_weight | None | Weights associated with classes |
| NB | alpha | 0.97375551884146 | Smoothing parameter |
| | fit_prior | True | Whether to learn class prior probabilities or not |
| | class_prior | None | Prior probabilities of the classes |
| KNN | n_neighbors | 6 | Number of neighbors to use |
| | weights | distance | Weight function used in prediction |
| | algorithm | brute | Algorithm used to compute the nearest neighbors |
| | p | 1 | Power parameter for the Minkowski metric |
| CDT | max_features | sqrt | Number of features to consider when looking for the best split |
| | min_samples_split | 0.031313293 | Minimum number of samples required to split an internal node |
| | splitter | random | Strategy used to choose the split at each node |
| | criterion | entropy | Function measuring the quality of a split |
| | class_weight | None | Weights associated with classes |
| RF | max_features | sqrt | Number of features to consider when looking for the best split |
| | min_samples_split | 0.007066305 | Minimum number of samples required to split an internal node |
| | class_weight | balanced_subsample | Weights associated with classes |
| | criterion | entropy | Function measuring the quality of a split |
| | n_estimators | 98 | Number of trees in the forest |
| SVM | kernel | poly | Kernel type to be used in the algorithm |
| | C | 21.234911067828 | Penalty parameter C of the error term |
| | gamma | 617.482509627716 | Kernel coefficient |
| | degree | 1 | Degree of the polynomial kernel function |
| NN | hidden_layer_sizes | 200 | The n-th element representing the number of neurons in the n-th hidden layer |
| | alpha | 0.017436642900 | Regularization term |
| | activation | relu | Activation function for the hidden layer |
| | solver | adam | Solver for weight optimization |
| | batch_size | 32 | Size of minibatches for stochastic optimizers |
| | learning_rate | adaptive | Learning rate schedule for weight updates |
| | learning_rate_init | 0.0001 | The initial learning rate used |
| | max_iter | 123 | Maximum number of iterations |
Models were fitted to 10 folds for each of 50 candidates, totaling 500 fits. Acronyms denote: LDA for Linear Discriminant Analysis, LR for Logistic Regression, NB for Naïve Bayes, SVM for Support Vector Machines, KNN for K-Nearest Neighbors, CDT for Classification Decision Tree, RF for Random Forest and NN for Neural Networks.
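A random grid search of this kind can be sketched with scikit-learn's `RandomizedSearchCV`. The parameter distributions below are illustrative, not the study's exact search space, the data are synthetic, and the candidate and fold counts are reduced (the study used 50 candidates and 10 folds, i.e. 500 fits) so the sketch runs quickly.

```python
# Sketch: randomized hyper-parameter search for a Random Forest,
# analogous to the study's random grid search. Synthetic data;
# illustrative distributions; reduced n_iter/cv for speed.
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.random((150, 20))
y = rng.integers(0, 5, 150)

param_distributions = {
    "n_estimators": randint(10, 100),
    "max_features": ["sqrt", "log2"],
    "min_samples_split": uniform(0.001, 0.05),  # fraction of samples
    "criterion": ["gini", "entropy"],
    "class_weight": [None, "balanced_subsample"],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions, n_iter=10, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

Sampling candidates at random rather than exhaustively is what allows continuous ranges (e.g. for `min_samples_split`) to be searched at fixed cost.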
Evaluation metrics calculated on the test set (reported as proportions) for biome predictions from each classifier.
| Evaluation | Logistic Regression | Linear Discriminant Analysis | Naive Bayes | Support Vector Machines | K-Nearest Neighbors | Decision Tree | Random Forests | Neural Networks |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.82 | 0.77 | 0.78 | 0.77 | 0.79 | 0.76 | 0.86 | 0.77 |
| Precision | 0.82 | 0.80 | 0.81 | 0.79 | 0.80 | 0.77 | 0.85 | 0.75 |
| F1 | 0.81 | 0.79 | 0.78 | 0.75 | 0.79 | 0.75 | 0.85 | 0.76 |
| Kappa | 0.74 | 0.69 | 0.71 | 0.67 | 0.71 | 0.66 | 0.80 | 0.67 |
Recall for each vegetation type is calculated as the weighted proportion of correct predictions among all samples of a given known vegetation type. Precision for each predicted vegetation type is calculated as the weighted proportion of correctly classified samples among all predictions of that type.
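The four reported metrics correspond directly to scikit-learn functions; a minimal sketch with toy labels (not the paper's predictions):

```python
# Sketch: the four evaluation metrics from the table, computed on
# toy test-set labels and predictions (five classes, two samples each).
from sklearn.metrics import (accuracy_score, precision_score,
                             f1_score, cohen_kappa_score)

y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 1, 1, 1, 2, 2, 3, 4, 4, 4]

metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred, average="weighted"),
    "F1": f1_score(y_true, y_pred, average="weighted"),
    "Kappa": cohen_kappa_score(y_true, y_pred),  # chance-corrected agreement
}
print(metrics)  # Accuracy 0.80, Kappa 0.75 for these toy labels
```

`average="weighted"` matches the table notes' support-weighted averaging across vegetation types; Cohen's kappa corrects accuracy for agreement expected by chance.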
Evaluation summaries for the prediction on individual biomes on the test set for the Random Forests classifier.
| Overall accuracy 0.86; overall kappa 0.75 | DXS | FGS | MGS | TSMBF | TSGSS | Recall | Precision | F1 | Kappa |
|---|---|---|---|---|---|---|---|---|---|
| Deserts and Xeric Shrublands (DXS) | | 0 | 1 | 0 | 1 | 0.73 | 0.92 | 0.81 | 0.76 |
| Flooded Grasslands and Savannas (FGS) | 2 | | 0 | 0 | 0 | 0.00 | 0.00 | - | - |
| Montane Grasslands and Shrublands (MGS) | 1 | 0 | | 1 | 0 | 0.77 | 0.83 | 0.80 | 0.77 |
| Tropical and Subtropical Moist Broadleaf Forests (TSMBF) | 0 | 0 | 2 | | 2 | 0.93 | 0.87 | 0.90 | 0.86 |
| Tropical and Subtropical Grasslands, Savannas, and Shrublands (TSGSS) | 5 | 0 | 0 | 1 | | 0.92 | 0.86 | 0.89 | 0.83 |
Numbers of correct predictions run along the diagonal. Recall for each vegetation type is calculated as the weighted proportion of correct predictions among all samples of a given known vegetation type. Precision for each predicted vegetation type is calculated as the weighted proportion of correctly classified samples among all predictions of that type.
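Per-biome recall and precision of the kind tabulated above can be derived from the rows and columns of a confusion matrix; a toy three-class sketch (unweighted, unlike the table's support-weighted values):

```python
# Sketch: per-class recall and precision from a confusion matrix,
# analogous to the per-biome evaluation. Toy three-class labels.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all true of that class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted as class
```

The diagonal holds the correct predictions; each row sums to the true-class support (recall denominator) and each column to the prediction count (precision denominator).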
Fig 3. Mean decrease in accuracy (MDA) calculated for the machine-learning classifiers, identifying the pollen taxa that contribute most to predictive accuracy.
a) Linear Discriminant Analysis, b) Logistic Regression, c) Naïve Bayes, d) K-Nearest Neighbors, e) Classification Decision Tree, f) Random Forest, g) Support Vector Machine, h) Neural Network. Error bars are standard error of the mean. For each model, the 30 most important taxa are plotted. Abbreviations of pollen taxon names along with their MDA percentages may be found in S2 Table.
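Mean decrease in accuracy can be estimated with permutation importance: shuffle one feature (pollen taxon) at a time and measure the resulting drop in test-set accuracy. A sketch on synthetic data where, by construction, only the first feature is informative:

```python
# Sketch: permutation importance as an estimate of mean decrease in
# accuracy (MDA). Synthetic data; the class depends only on feature 0.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((400, 5))
y = (X[:, 0] > 0.5).astype(int)   # only feature 0 carries signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times and average the accuracy drop
result = permutation_importance(clf, X_te, y_te,
                                n_repeats=10, random_state=0)
```

In this setup `result.importances_mean` is large for feature 0 and near zero for the uninformative features, mirroring how the figure ranks taxa by MDA.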
Evaluation metrics calculated for the PFT-based biome model (Jolly et al., 1998, Table 4).
| Overall accuracy 0.71; overall kappa 0.63 | DESE | STEP | SAVA | XERO | WAMF | TDFO | TSFO | TRFO | Recall | Precision | F1 | Kappa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Desert (DESE) | | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0.38 | 1.00 | 0.56 | 0.55 |
| Steppe (STEP) | 0 | | 25 | 14 | 0 | 1 | 0 | 0 | 0.76 | 0.16 | 0.27 | 0.70 |
| Savanna (SAVA) | 0 | 27 | | 25 | 3 | 7 | 3 | 0 | 0.76 | 0.74 | 0.75 | 0.64 |
| Temperate Xerophytic Woods/Scrub (XERO) | 0 | 2 | 3 | | 6 | 0 | 0 | 0 | 0.90 | 0.49 | 0.63 | 0.57 |
| Warm Mixed Forest (WAMF) | 0 | 2 | 2 | 54 | | 1 | 0 | 0 | 0.70 | 0.88 | 0.78 | 0.73 |
| Tropical Dry Forest (TDFO) | 0 | 3 | 41 | 8 | 7 | | 8 | 2 | 0.23 | 0.70 | 0.35 | 0.32 |
| Tropical Seasonal Forest (TSFO) | 0 | 0 | 2 | 0 | 4 | 0 | | 0 | 0.84 | 0.63 | 0.72 | 0.70 |
| Tropical Rain Forest (TRFO) | 0 | 0 | 0 | 0 | 0 | 0 | 8 | | 0.62 | 0.87 | 0.72 | 0.72 |
Numbers of correct predictions run along the diagonal. Recall and precision are calculated as in Table 4.
Fig 4. Detrended correspondence analysis (DCA) of the modern pollen assemblages, color-coded by biome type (Olson et al., 2001).