Literature DB >> 28303176

Abundance distributions for tree species in Great Britain: A two-stage approach to modeling abundance using species distribution modeling and random forest.

Louise Hill1, Andy Hector1, Gabriel Hemery2, Simon Smart3, Matteo Tanadini1, Nick Brown1.   

Abstract

High-quality abundance data are expensive and time-consuming to collect and often highly limited in availability. Nonetheless, accurate, high-resolution abundance distributions are essential for many ecological applications ranging from species conservation to epidemiology. Producing models that can predict abundance well, with good resolution over large areas, has therefore been an important aim in ecology, but poses considerable challenges. We present a two-stage approach to modeling abundance, combining two established techniques. First, we produce ensemble species distribution models (SDMs) of trees in Great Britain at a fine resolution, using much more common presence-absence data and key environmental variables. We then use random forest regression to predict abundance by linking the results of the SDMs to a much smaller amount of abundance data. We show that this method performs well in predicting the abundance of 20 of 25 tested British tree species, a group that is generally considered challenging for modeling distributions due to the strong influence of human activities. Maps of predicted tree abundance for the whole of Great Britain are provided at 1 km2 resolution. Abundance maps have a far wider variety of applications than presence-only maps, and these maps should allow improvements to aspects of woodland management and conservation including analysis of habitats and ecosystem functioning, epidemiology, and disease management, providing a useful contribution to the protection of British trees. We also provide complete R scripts to facilitate application of the approach to other scenarios.

Entities:  

Keywords:  abundance distributions; abundance–occupancy relationships; biotic effects; mapping

Year:  2017        PMID: 28303176      PMCID: PMC5306018          DOI: 10.1002/ece3.2661

Source DB:  PubMed          Journal:  Ecol Evol        ISSN: 2045-7758            Impact factor:   2.912


Introduction

Robust information on the distribution and abundance of species is essential for many applications in ecology and conservation. Advances in species distribution modeling have driven an explosion in the use of these and similar techniques, which are now widespread (Araújo & Guisan, 2006). However, the most commonly desired output from these techniques, an estimate of the probability of species occurrence, is restricted in its uses. Furthermore, due to limitations in the quality of available data, the actual output of species distribution models (SDMs) is often even less useful, producing only a relative, not absolute, likelihood of presence, and sometimes worse (Guillera‐Arroita et al., 2015; Pearce & Boyce, 2006). For many ecological questions, estimates of abundance would be far more valuable as they provide much more information about the state of populations and properties of ecosystems (Hui et al., 2009; Pearce & Ferrier, 2001; Sagarin, Gaines, & Gaylord, 2006). Predicting abundance distributions with accuracy is challenging. Even where presence–absence or presence‐only data are easy to find, large amounts of abundance data are rarely available (Nielsen et al. 2005; Van Couwenberghe, Collet, Pierrat, Verheyen, & Gégout, 2013). Therefore, finding effective ways to model abundance is an important area of research in ecology. A variety of different approaches, including looking for a fundamental relationship between the area of occupancy and abundance (Gaston et al., 2000) and looking at how occupancy patterns change with different grain size (Hui et al., 2009) among others (Hwang & He, 2011; Wenger & Freeman, 2008), have been attempted. However, none of these has produced consistently satisfactory results and each has significant theoretical or practical limitations. Another approach has been to investigate relationships between likelihood of occurrence and abundance. This approach assumes that species abundance and occurrence are controlled by the same or related environmental factors (Brown, 1984; Van Couwenberghe et al., 2013). Various studies have attempted to correlate the results of SDMs or related models with abundance data to produce models predicting abundance in unknown areas (Johnson & Seip, 2008; Nielsen et al. 2005; Van Couwenberghe et al., 2013). However, wide variation in the relationships between species occurrence, and abundance has been reported, with various studies showing weak relationships (Gaston et al., 2000; Nielsen et al. 2005; Van Couwenberghe et al., 2013). Another study by Pagel et al. (2014) used a hierarchical model to predict abundances in time and space using a combination of plentiful occurrence data and restricted abundance data. Their method produced unbiased results, but with very low precision in predictions, perhaps due to inflexibility in their models or not using environmental covariates. These studies have all suggested promise for a technique combining the use of large amounts of occurrence data with small amounts of abundance data, but none have yet performed well enough to be of use for many real‐world applications. We present a two‐stage modeling approach for predicting abundance, where the results of SDMs produced using the R package biomod2 (Thuiller, Georges, Engler, & Breiner, 2016) are regressed against abundance data and additional predictors using random forest regression with the R packages caret and randomForest (Breiman, 2001; Kuhn et al., 2016; Liaw & Wiener, 2002). This approach performs well in almost all cases tested here and is flexible and simple to use. We argue that poor correlations between SDM results and abundance previously reported may be partly due to the use of less powerful or inappropriate modeling techniques in other studies. SDMs are first produced using, in our case, presence–absence data to produce a map of estimated probability of occupancy for the species of interest. Separate abundance data are then used to fit a random forest regression that predicts abundance from probability of occupancy. Additional predictors, which may be expected to influence abundance but not occupancy, can be included at this stage. We also include the SDM results of co‐occurring species as covariates in the random forest regression, allowing biotic effects to be accounted for in the prediction of abundance and producing more realistic species responses. We have used this approach to produce distribution maps showing the abundance of 20 common tree species in Great Britain. Available tree distribution data for Great Britain were surprisingly poor, presenting a knowledge gap for ecologists working on British woodlands, particularly in light of major threats such as emerging tree diseases (Boyd, Freer‐Smith, Gilligan, & Godfray, 2013). Our distributions show total combined area covered by each species within each square kilometer (hectares per square kilometer) across Great Britain and are a significant improvement on previously widely available distribution data. We envisage that such distribution maps could make an important contribution in a number of fields related to British forestry, from conservation planning to epidemiology.

Methods

We predicted abundance of tree species using a combination of two established techniques. First, we used the R package biomod2 (Thuiller et al., 2016) to produce ensemble species distribution models (SDMs) of trees in Great Britain at 1 km2 resolution. Then, we used random forest regression, with caret (Kuhn et al., 2016) and randomForest (Liaw & Wiener, 2002) packages in R, to link the results of these SDMs to a much smaller amount of abundance data, to predict abundance across Great Britain at the same resolution.

Stage 1: Fitting species distribution models

From the Distribution Database of the Botanical Society of the British Isles (BSBI) (see Data Accessibility), we downloaded all records from Great Britain between 1950 and 2014 for 25 commonly found tree species: Acer campestre L., Acer platanoides L., Acer pseudoplatanus L., Alnus glutinosa (L.) Gaertner, Betula pendula Roth, Betula pubescens Ehrhart, Carpinus betulus L., Castanea sativa Miller, Corylus avellana L., Crataegus monogyna von Jacquin, Fagus sylvatica L., Fraxinus excelsior L., Populus tremula L., Prunus avium L., Prunus padus L., Pseudotsuga menziesii Franco, Quercus petraea Lieblein, Quercus robur L., Salix caprea L., Salix cinerea L., Sorbus aria Crantz, Taxus baccata L., Tilia cordata Miller, Ulmus glabra Hudson, and Ulmus procera Salisbury. We discarded records with location data less precise than tetrad level (2 × 2 km) and simplified data with more precise locations to tetrad level. We chose tetrad resolution as a suitable compromise between having a high number of records to use and a small spatial scale, as using coarse scales can be problematic when modeling species distributions (Dengler, Löbel, & Dolnik, 2009; Guisan, Graham et al., 2007). We then converted this presence‐only data to presence–absence. We considered tetrads for which botanical surveys had been undertaken at least twice since 1950, and where at least 50 species of plants were recorded in each survey, to be “well‐surveyed” (Groom, 2013) (a map is provided in Appendix 1). Any well‐surveyed tetrads that did not have records for the species of interest were reclassified as “absence” points for that species (i.e., locations where the species was likely to be either truly absent or at very low abundance and therefore playing little role in defining the dominant ecological characteristics of that tetrad). Accounting for the likelihood that common trees will have a higher detection probability than most species of plants, we kept this threshold low enough to prevent the exclusion of tetrads in species‐poor areas, while being high enough to prevent the inclusion of too many poorly surveyed tetrads (Groom, 2013). This produced a total of 18,993 tetrads from across Great Britain that were considered well surveyed and subsequently used as presence or absence points. Data manipulation was carried out using custom‐written scripts in Python (Python Software Foundation: Version 3.3.2). We downloaded data on a variety of ecological variables across Great Britain from a variety of free sources (Table 1). See Data Accessibility for details. Preprocessing of layers was carried out in ArcGIS (ESRI 2014) to ensure identical extent, cell size, and coordinate system for use in species distribution modeling. All environmental covariates were used at 1 km resolution: vector datasets were rasterized to 1 km resolution.
Table 1

Ecological variables downloaded and produced for species distribution modeling. Details of data sources can be found in Data Accessibility

VariableDescriptionUnitSource
bio1Annual mean temperature°C × 10Worldclim
bio2Mean diurnal temperature range: mean of monthly (max temp − min temp)°C × 10Worldclim
bio3Isothermality (bio2/bio7 × 100)°C × 10Worldclim
bio4Temperature seasonality: standard deviation × 100°C × 1000Worldclim
bio5Max temperature of warmest month°C × 10Worldclim
bio6Min temperature of warmest month°C × 10Worldclim
bio7Temperature annual range°C × 10Worldclim
bio8Mean temperature of wettest quarter°C × 10Worldclim
bio9Mean temperature of driest quarter°C × 10Worldclim
bio10Mean temperature of warmest quarter°C × 10Worldclim
bio11Mean temperature of coldest quarter°C × 10Worldclim
bio12Annual precipitationmmWorldclim
bio13Precipitation of wettest monthmmWorldclim
bio14Precipitation of Driest MonthmmWorldclim
bio15Precipitation seasonality: coefficient of variationcmWorldclim
bio16Precipitation of wettest quartermmWorldclim
bio17Precipitation of driest quartermmWorldclim
bio18Precipitation of warmest quartermmWorldclim
bio19Precipitation of coldest quartermmWorldclim
altitudeAltitudem × 10Worldclim
slopeSlope%Derived from Altitude using ArcGIS (Slope)
aspectAspectDegreesDerived from Altitude using ArcGIS (Slope)
directradiatDirect radiation: incoming direct solar radiationWatt hr m−2 Derived from Altitude using ArcGIS (Solar Radiation Analysis)
directduratDirect duration: duration of direct solar radiationHoursDerived from Altitude using ArcGIS (Solar Radiation Analysis)
diffuseradiatDiffuse radiation: incoming scattered solar radiationWatt hr m−2 Derived from Altitude using ArcGIS (Solar Radiation Analysis)
nfiNational Forest Inventory Great Britain 2014, forested areasNominalForestry Commission
soilSoil typeNominalEuropean Soil Database
soiltextDominant soil surface textural classNominalEuropean Soil Database
octopTopsoil organic carbon contentNominalEuropean Soil Database
awctopTopsoil available water capacityNominalEuropean Soil Database
mintopTopsoil minerologyNominalEuropean Soil Database
ancient_esAncient woodlands in England, Scotland and WalesNominalNatural England, Forestry Commission Scotland and National Resources Wales
land cover 07UK Land cover map 2007 (1 km2)NominalCountryside Survey/CEH
Ecological variables downloaded and produced for species distribution modeling. Details of data sources can be found in Data Accessibility We then fitted species distribution models (SDMs) to these data. For reviews of these methods, see Elith and Leathwick (2009) and Pearson and Dawson (2003). SDMs use species records and environmental variables to fit models that describe the relationship of the species’ distribution to the environmental variables, which can then be used to predict the occupancy probability or related measures across a wider landscape (Elith & Leathwick, 2009; Thuiller, 2003). SDMs for all species were produced using the package biomod2 in R (R Core Team 2015; Thuiller et al., 2016). We selected 15 environmental variables as covariates from the original set of 33. We removed one of each pair of variables with a pairwise Pearson's correlation coefficient higher than 0.7, while retaining variables that are known to be important determinants of plant growth (Guisan, Zimmermann et al., 2007; Prentice et al., 1992). The final selection was altitude, aspect, slope, direct incoming solar radiation, mean diurnal temperature range, temperature seasonality, annual precipitation, ancient woodland locations, topsoil available water capacity, topsoil minerology, topsoil organic carbon content, topsoil texture class, soil category, National Forest Inventory (NFI) forest type, and land cover type(see Appendix 2 for pairwise Pearson's correlations between selected variables). We ran six algorithms (GLM, GAM, classification tree analysis (CTA), generalized boosting models (GBM), random forest (RF), and maximum entropy (MaxEnt)) 15 times for each species using the 15 environmental covariates, producing a total of (25 species × 6 algorithms × 15 repeats) = 2,250 models. Each model run was carried out using a randomly chosen 70% of the presence–absence data (Heikkinen, Marmion, & Luoto, 2012; Thuiller, 2003); the remaining 30% were used for cross‐validation to assess the performance of each model using two model assessment criteria; area under the receiver operator curve (ROC) and the true skill statistic (TSS; Allouche, Tsoar, & Kadmon, 2006). For each species, we selected the best‐performing models (see Table 2) to build an ensemble distribution model (a mean of the raw model results, weighted by the model ROC scores), producing a single distribution map for each species that represents a robust estimate of a species’ British distribution at 1 km2 resolution (Thuiller et al., 2016). The model selection process was as follows. Firstly, we assessed ROC and TSS scores—for both metrics, a higher value indicates better model fit—and if there was a leading group of models whose ROC and TSS scores were a step higher than the remainder, this leading group was chosen. Often this leading group contained just the 15 random forest models. Otherwise, the top 20 models with the highest scores were selected. Secondly, we visually assessed the predicted responses of the species to each environmental covariate for each of these models. Any models that contained biologically implausible responses were rejected, as were models where the responses or predicted occurrence maps disagreed greatly from the overall consensus, as these can lead to development of inappropriate ensemble models (H. Hannemann, personal communication). See the walkthrough of R code in Supporting Information for an example of how models were chosen and example response curves. After rejection of implausible models, the final number of models used to produce each ensemble ranged between 11 and 20. Ensemble models were therefore robust, biologically plausible, and had high predictive power for the majority of species (see Table 2).
Table 2

The number, type, and prediction accuracy of the individual models used to build ensemble distribution models for each tree species. Algorithms included were GAM (generalized additive model), GBM (generalized boosted regression), GLM (General Linear Model), RF (Random Forest), and MaxEnt (Maximum Entropy)

SpeciesNumber of models used to build ensembleAlgorithms includedMean ROC scoreMean TSS score
Acer campestre 20GAM, RF, GBM0.920.71
Acer platanoides 20GLM, GAM, RF, GBM0.760.44
Acer pseudoplatanus 20GAM, RF, GBM0.850.55
Alnus glutinosa 15RF0.800.46
Betula pendula 15RF0.790.46
Betula pubescens 15RF0.800.46
Carpinus betulus 20RF, GBM, MaxEnt0.780.40
Castanea sativa 15RF0.810.47
Corylus avellana 16RF, GBM0.860.46
Crataegus monogyna 20GLM, GBM, RF, GBM0.960.82
Fagus sylvatica 20GAM, RF, GBM0.810.48
Fraxinus excelsior 20GLM, GAM, RF, GBM0.920.83
Populus tremula 17RF, GBM0.710.31
Prunus avium 11RF0.750.36
Prunus padus 20RF, GBM0.800.48
Pseudotsuga menziesii 19GAM, RF, GBM, MaxEnt0.760.39
Quercus petraea 15RF0.820.49
Quercus robur 16RF, GBM0.900.64
Salix caprea 16RF, GBM0.790.42
Salix cinerea 16RF, GBM0.780.42
Sorbus aria 20RF, GBM0.840.53
Taxus baccata 20GAM, RF, GBM0.800.44
Tilia cordata 15RF, GBM0.760.36
Ulmus glabra 15RF0.790.43
Ulmus procera 15RF0.890.61
The number, type, and prediction accuracy of the individual models used to build ensemble distribution models for each tree species. Algorithms included were GAM (generalized additive model), GBM (generalized boosted regression), GLM (General Linear Model), RF (Random Forest), and MaxEnt (Maximum Entropy) Nonsignificant variables were not removed from the models because of the very large size of our datasets, and because the models were used to make predictions rather than to test hypotheses. Therefore, final models may include terms that were not important to the outcome, but this should not have had a detrimental impact on the model fit. The numbers and types of models selected for each species are displayed in Table 2.

Stage 2: Modeling abundance using random forest regression

Abundance data for trees, in the form of hectares covered by a species per square kilometer (or percent cover), were obtained from the Countryside Survey and myForest (see Data Accessibility). The Countryside Survey is a large‐scale survey in Great Britain measuring many aspects of landscapes and the countryside, including diversity and abundance of plant species. It uses a random stratified sampling procedure to capture a representative sample of all land cover types. By contrast, myForest is a service set up to help woodland owners map and manage their forests, which currently holds data on over 45,000 ha of woodlands across Great Britain, but does not contain any records outside of woodlands. For all tree species combined, 9,800 randomly selected abundance data points from the Countryside Survey and 9,453 abundance data points from myForest were used, making an average of 770 abundance data points per species (see Appendix 5 for numbers of data points per species). The two abundance datasets (Countryside Survey and myForest) were rescaled to express them as hectares covered per kilometer squared (percent cover), in order to make them comparable. For the myForest data, which was originally provided in the format percentage cover of each species within a woodland patch, this involved multiplying each percentage cover record by the proportion of woodland cover in the relevant kilometer square. For this, we used a shapefile downloaded from the National Forest Inventory (NFI), containing outlines of all woodlands over 0.5 ha in Great Britain. For the Countryside Survey data, which was collected using a more complex methodology (details available in Barr et al., 1993) where linear features such as hedgerows were sampled separately from the rest of the landscape, more manipulation was required. The data were weighted by the length of linear features in the kilometer squared, to account for the fact that linear features are more likely to contain trees and the lengths of them are not equal across the country. The weighting was done using (linear plot percentage cover × percent of kilometer square covered by linear features) + (nonlinear plot percentage cover × remaining area), with all required information taken from the Countryside Survey. Tree cover data for England and Wales from Bluesky's National Canopy Map were made available by the Woodland Trust, to be used as a modeling covariate. Three layers from this were used: the total tree cover, tree cover derived only from woodlands included in the NFI, and tree cover derived from trees outside woodlands. The National Canopy Map layers were used in England and Wales, while the more basic NFI layers were used in Scotland where complete tree cover data were not available. We also used the NFI dataset to calculate the proportion of each square taken up by broadleaved woodland edge, which was defined as any woodland within 50 m of nonwoodland (Aune, Gunnar, & Moen, 2005). All these layers were used as covariates in the random forest regression (below). We used R version 3.2.3 for all modeling and data processing (R Core Team 2015). We used random forest regression to model the relationships between abundance, the probability of occupancy predicted by the SDMs, and our tree cover covariates which we expected would be important for modeling tree abundance (Breiman, 2001). A separate random forest regression was implemented for each species. The SDM outputs for all species were included as variables for each species, so that the models would also capture interactions between species (such as competition). Potentially, this could also capture variation in other variables that are not included in that species’ SDM but which correlate with the distribution of other species. Models had the form: where P̂ is the predicted probability of occupancy from the relevant SDM, CA is cover from all trees, CW is cover from woodland trees only, CO is cover from trees outside woodland only, and CE is cover from woodland edge. Models were run using the combined myForest and Countryside Survey data. We chose to use random forest regression because it is insensitive to data distribution and therefore copes well with our data which has a high percentage of zeros. It can also take a large number of potentially collinear variables, and is robust to overfitting, making it extremely useful for prediction (Prasad, Iverson, & Liaw, 2006; Segal, 2004). We used these models to predict abundance of each species across the whole of Great Britain at 1 km2 resolution. We used root‐mean‐square error (RMSE) and mean absolute error (MAE), produced by k‐fold cross‐validation with 10‐fold, to evaluate our models. These two commonly used evaluation metrics give interpretations of a model's average error when testing it against independent data, in this case, the 10% that was left out of each run (Chai & Draxler, 2014). A schematic overview of the whole two‐stage method is shown in Figure 1.
Figure 1

Schematic showing the outline of the two‐stage method for predicting abundance distributions. The first stage uses SDMs to produce maps of predicted probability of occupancy, while the second stage takes these maps as inputs and uses Random Forest regression to produce maps of predicted abundance. Distribution data inputs are shown in square boxes and model covariates in round boxes, and model outputs are shaded in solid gray and modeling processes in hashed gray

Schematic showing the outline of the two‐stage method for predicting abundance distributions. The first stage uses SDMs to produce maps of predicted probability of occupancy, while the second stage takes these maps as inputs and uses Random Forest regression to produce maps of predicted abundance. Distribution data inputs are shown in square boxes and model covariates in round boxes, and model outputs are shaded in solid gray and modeling processes in hashed gray

Results

Species distribution modeling

All selected models had useful prediction capability (AUC > 0.7) (Boyce, Vernier, Nielsen, & Schmiegelow, 2002). In general, prediction accuracy of the selected models was good and they successfully predicted a large proportion of known presence or absence points. The selected models had ROC scores between 0.71 and 0.96 and TSS scores between 0.31 and 0.83 (Table 2). Ensemble models were built using 100% of the available data, so evaluations are not given for ensemble models as this would test the models on the same data they were generated with, resulting in unfair evaluation statistics. For the four species with the lowest predictive power (lowest TSS and ROC scores), (Populus tremula, Prunus avium, Pseudotsuga menziesii, and Tilia cordata) (Table 2), we investigated further to ensure that all ecological factors known to be important to them were included in the model runs. However, no further improvements to the model fit were found. These were species that tend to be either widespread but uncommon throughout their range (P. tremula, P. avium, T. cordata) or non‐native trees whose distribution is largely controlled by human planting (P. menziesii), and as a result, it is unlikely to be possible to generate high‐scoring distribution models for these species (Guisan, Zimmermann et al., 2007). For 21 of 25 species, however, SDMs produced high‐quality ensemble models.

Abundance modeling

In general, the random forest models were very successful in predicting the abundance of tree species. Figure 2 shows the predicted against observed abundance for four representative species; graphs for all other species are included in Appendix 3. For the majority of species, the predictions of the models are similar to the observed values.
Figure 2

Observed abundance against abundance predicted by Random Forest regression, as used to assess model performance, shown for four tree species. The line on each graph is the 1:1 line showing perfect model fit

Observed abundance against abundance predicted by Random Forest regression, as used to assess model performance, shown for four tree species. The line on each graph is the 1:1 line showing perfect model fit We produced root‐mean‐square error (RMSE) and mean absolute error (MAE) scores using 10‐fold cross‐validation to evaluate our models’ performance. These two commonly used model evaluation metrics give interpretations of a model's average error when testing it against independent data (Chai & Draxler, 2014). Table 3 shows RMSE and MAE scores for each species; the error scores are given in the same scale as the response variable, that is, hectares covered per square kilometer (percent cover). All the models have RMSE scores under 10, and most are under 5. All MAE scores are under 5. The average prediction error for most of the models produced is therefore <5%.
Table 3

Root‐mean‐square error (RMSE) and mean absolute error (MAE) scores for the Random Forest regression model for each species. The number of nonzero data points available for each species is also shown

SpeciesRMSEMAENumber of nonzero data points
Acer campestre 1.440.35315
Acer platanoides 1.270.1942
Acer pseudoplatanus 4.011.40634
Alnus glutinosa 2.400.66195
Betula pendula 6.882.29802
Betula pubescens 4.091.09127
Carpinus betulus 3.791.05320
Castanea sativa 9.563.58501
Corylus avellana 4.471.47935
Crataegus monogyna 1.100.23339
Fagus sylvatica 8.452.91918
Fraxinus excelsior 4.951.881629
Populus tremula NA NA 16
Prunus avium 1.980.56401
Prunus padus NA NA 9
Pseudotsuga menziesii 7.661.96193
Quercus petraea 5.991.84209
Quercus robur 6.502.541867
Salix caprea 1.380.2874
Salix cinerea 0.160.0355
Sorbus aria NA NA 3
Taxus baccata 2.210.4986
Tilia cordata 1.040.1456
Ulmus glabra NA NA 22
Ulmus procera NA NA 27
Root‐mean‐square error (RMSE) and mean absolute error (MAE) scores for the Random Forest regression model for each species. The number of nonzero data points available for each species is also shown For six species, Acer platanoides, Populus tremula, Prunus padus, Sorbus aria, Ulmus glabra, and Ulmus procera, there were too few nonzero abundance data points to use 10‐fold cross‐validation. We chose 50 positive data points as the cutoff for using 10‐fold cross‐validation, as this gives an average of five nonzero data points per fold. Acer platanoides had 42 positive abundance data points, so for this species, we used eightfold cross‐validation to maintain an average of five nonzero data points per fold. However, for the remaining five species, we felt that there was not enough data available to produce reliable abundance models (see Table 3). These species were omitted, and maps of predicted abundance of the remaining 20 species across Great Britain were produced (Figure 3 and downloadable from the Sylva Foundation website and Oxford University Research Archive (see Data Accessibility). Where adequate abundance data were available, however, random forest regression was able to improve the prediction of the species for which the SDMs had a poorer fit. We were able to successfully model the abundance of Prunus avium, Pseudotsuga menziesii, and Tilia cordata despite the SDM prediction accuracy for these species being poorer than the other species.
Figure 3

Maps of predicted abundance for four species, in hectares per km2, or percent cover. Note the scale varies between species. Maps for all other successfully modeled species are available to download from Sylva Foundation website and Oxford University Research Archive (see Data Accessibility)

Maps of predicted abundance for four species, in hectares per km2, or percent cover. Note the scale varies between species. Maps for all other successfully modeled species are available to download from Sylva Foundation website and Oxford University Research Archive (see Data Accessibility) We also calculated R 2 scores for the models, and these are available in Appendix 5. However, we recommend caution when interpreting these scores, as R 2 is not the most appropriate metric to use in this situation. R 2 is affected by the extent of the dependent variable (Gelman & Hill, 2007), and as the maximum abundance varied greatly between species, this confounds comparison between our models. The high percentage of zeros in our datasets also produces difficulty in the interpretation of R 2. For instance, for Acer campestre, over 97% of the available abundance data points were zero. The model tended to predict very slightly higher than zero for these points (generally between zero and one percent cover), resulting in a low R 2 (.523). However, the observed vs predicted graph (Figure 2) and the low RMSE and MAE scores (Table 3) for Acer campestre show that the model generally predicts very close to the true abundance, despite scatter in the data, and this is mirrored for most other species. For applications where the difference between zero and one or two percent cover is unimportant, these models can be used directly for predicting abundance; where it is more important, the predicted against observed graph can be used to select a cutoff below which predicted abundance will be coerced to zero.

Discussion

The two‐stage modeling approach produced good or excellent predictions of abundance for the majority of species across the whole of Great Britain, despite only being trained on a relatively small amount of abundance data. This is in contrast to several previous studies looking for relationships between SDM results and abundance, which have shown little or no relationship (Gaston et al., 2000; Johnson & Seip, 2008; Nielsen, Johnson, Heard, & Boyce, 2005). However, to our knowledge, no previous studies have used random forest regression to model this relationship, and doing so has a number of advantages. Most importantly perhaps is that it does not make any assumptions about the shape of the relationship. Previous studies have attempted to use the negative binomial and other theoretical distributions, but we argue that this is likely to be an oversimplification that may mask true relationships. The shape of such a relationship, which is likely to have several different drivers, may not follow a simple mathematical function, and is known to vary between species (Gaston et al., 2000; Harris, 2015; Nielsen et al. 2005). The use of random forest regression allows for such variation, making it a much more powerful technique for this application (De'ath & Fabricius, 2000; Evans & Cushman, 2009; Prasad et al., 2006). Our two‐stage modeling approach has a number of other advantages. It can incorporate biotic effects, and include covariates that are expected to influence abundance separately from those expected to influence occupancy. It makes use of the large amount of presence or presence–absence data that are often available, rather than discarding it. It will work with any measure of abundance (number of individuals, percentage cover, biomass, etc.) and has been shown to be effective over large spatial extents. It may be a particularly powerful approach where occurrence and abundance are not influenced by exactly the same factors (see Nielsen et al., 2005). Although not tested here, this method also has the potential to be effective when used with the results of more problematic SDMs, such as those made using presence‐only data, which can only predict a relative likelihood of occupancy (Araújo & Peterson, 2012). We can also make use of the covariate allocation of random forest to gain insights into underlying ecological processes within the community. For each species for which we have modeled abundance, we can inspect which variables are having the strongest effects in the model (see Appendix 4) (Breiman, 2001). This means we can see which other species’ SDM results are most strongly associated with the abundance of our species of interest, allowing us to identify possible biotic interactions such as competition. This does not allow us to distinguish causal relationships because of the possibility that hidden covariates could be at play; two species’ SDM results could be correlated with each other not because of a biotic interaction, but because they are both influenced by an underlying factor. However, it does provide a qualitative estimate of biotic effects that could be an interesting starting point for further study. The inclusion of biotic effects may have the additional benefit of improving model performance for predicting abundance under new conditions, such as future climate scenarios (Anderson, 2013; Araújo & Guisan, 2006; Elith & Leathwick, 2009; Harris, 2015). Species distributions and abundances are predicted to be strongly influenced in future by both climatic changes and biotic effects, and to our knowledge, this is the first technique for predicting abundance which is able to make some account of these biotic effects. However, the approach will not be able to incorporate changes to biotic effects with novel species assemblages, or other factors such as dispersal limitation, without further modification. Not all species were successfully modeled using this technique. Prunus padus, Populus tremula, Sorbus aria, Ulmus glabra, and Ulmus procera were all unsuccessful, in each case because very little abundance data were available for these species in our datasets. For example, our combined abundance dataset contained only four nonzero data points for Sorbus aria, demonstrating the difficulty in acquiring abundance data even for such a well‐studied system. However, various tree species which are generally considered to be difficult to model—such as Pseudotsuga menziesii, a non‐native species whose distribution is still largely controlled by planting, and Tilia cordata, which is thought to be both rare and widespread in Britain due to an unusual ecological history (Pigott, 1991)—were successfully modeled by random forest regression, despite showing relatively poor SDM performance. Overall, the method performed well for the majority of species and seems to be generally effective across a range of species, provided that sufficient abundance data are available. British trees exist in highly human‐modified landscapes where their distributions have without exception been altered by human land use and preferences (Hopkins & Kirby, 2007; Rackham, 2008). This is a challenging scenario for modeling abundance; other studies which have tried to model abundance of vascular plant species have avoided trees for this reason (Van Couwenberghe et al., 2013). However, despite this, the models performed well for the majority of species. This suggests that the models may be flexible enough to work in a variety of contexts and are likely to perform even better in less human‐dominated landscapes. This flexibility is one of the major advantages of using random forest regression, and we expect it to offer broad application in modeling abundance of a wide range of species (Prasad et al., 2006). The next step for evaluating the method will be to compare its performance to other published methods for predicting abundance, which could be done by evaluating the relative performance of this and other methods with a variety of published datasets. The abundance maps that we have produced are the best quality abundance distributions currently available for these species in Great Britain; previously, the best widely available distribution maps for trees in Great Britain were presence‐only maps on 10‐ or 2‐km square scales (see Figure 4). Our maps are modeled, not directly observed, and as is the case for modeling any highly noisy system, will not accurately predict abundance in every 1 km square; however, they are expected to capture overall patterns of distribution well. As more data, particularly abundance data and better quality environmental covariates, become available, our maps can continue to be improved. Abundance maps have a far wider variety of applications than presence‐only maps, and these maps will allow significant improvements to these applications. British woods face a range of threats, including invasive diseases such as ash dieback, undermanagement or overmanagement leading to poor woodland quality, pollution, and damage by deer (Rackham, 2008). These improved maps should allow better planning and management of woodlands, analysis of habitats and ecosystem functioning, and epidemiology and disease management, and will be a useful contribution to the protection of British trees.
Figure 4

(a) Presence records of Acer campestre, downloaded from the BSBI database (some of the best available distribution data at the country‐wide level). The data are presence only on 2 × 2 km (tetrad) scale. Note that where presence is not recorded, it is impossible to say whether the species is truly absent. Compare with our modeled abundance distribution (b) showing modeled hectares covered by A. campestre per square kilometer for every 1 km square

(a) Presence records of Acer campestre, downloaded from the BSBI database (some of the best available distribution data at the country‐wide level). The data are presence only on 2 × 2 km (tetrad) scale. Note that where presence is not recorded, it is impossible to say whether the species is truly absent. Compare with our modeled abundance distribution (b) showing modeled hectares covered by A. campestre per square kilometer for every 1 km square

Conclusion

The two‐stage method to predict abundance, using random forest regression to model the relationship between SDM outputs and abundance, is robust and easy to use producing good results for the majority of British tree species. Images and raster files of our abundance maps for the 20 successfully modeled tree species are available to download from the Sylva Foundation website and Oxford University Research Archive (see Data Accessibility). Both SDMs and random forest regression are well‐established techniques, and using them together in this combination is a user‐friendly way to produce good‐quality maps of predicted abundance. This opens the way for more abundance maps to be produced for a wider range of scenarios, which itself could drive improvements in a number of ecological research areas, from responses to climate change to epidemiology. To facilitate this, we provide annotated R code in the Supporting Information for the entire process, to act as a guide for those wishing to use this method themselves.

Data Accessibility

Ancient Woodland shapefile data: England: available from Natural England at http://www.gis.naturalengland.org.uk/pubs/gis/GIS_register.asp (accessed 17/06/2016). Scotland: available from Forestry Commission Scotland at https://gateway.snh.gov.uk/natural-spaces/dataset.jsp?dsid=AWI (accessed 17/06/2016). Wales: available from National Resources Wales at http://lle.wales.gov.uk/Catalogue/Item/AncientWoodlandInventory2011/?lang=en (accessed 17/06/2016). BSBI Distribution Database: available at http://bsbidb.org.uk/ (accessed 17/06/2016). Countryside Survey data: the data used is available to download from the Oxford University Research Archive, https://ora.ox.ac.uk/ “Merged abundance dataset from myForest and the Countryside Survey”. Landcover Map 2007: available from Centre for Ecology and Hydrology at http://www.ceh.ac.uk/services/land-cover-map-2007 (accessed 17/06/2016). myForest: the data used is available to download from the Oxford University Research Archive, https://ora.ox.ac.uk/ “Merged abundance dataset from myForest and the Countryside Survey”. National Canopy Map was made available for this study by kind permission of the Woodland Trust. National Forest Inventory Great Britain 2014: available from Forestry Commission at http://www.forestry.gov.uk/datadownload (accessed 17/06/2016). R scripts: see Supporting Information. Raster and image files for the abundance maps produced for all species are available to download from the Sylva Foundation website, https://sylva.org.uk/ and Oxford University Research Archive, https://ora.ox.ac.uk/ “Predicted abundance maps for British Trees”. Soil data from European Soil Database: available at http://esdac.jrc.ec.europa.eu/content/european-soil-database-v20-vector-and-attribute-data (accessed 17/06/2016). Worldclim: free global climate data, available at http://www.worldclim.org/ (accessed 17/06/2016).

Conflict of Interest

None declared. Click here for additional data file.
AltitudeAspectSlopeDirect incoming solar radiationMean diurnal temperature rangeTemperature seasonalityAnnual precipitationTopsoil available water capacityTopsoil minerologyTopsoil organic carbon contentTopsoil texture class
Altitude10.0153960.5186520.591347−0.20693−0.031620.6213560.2318880.0008050.3031980.428471
Aspect0.0153961−0.020160.062394−0.05345−0.02690.0623060.0147670.0194120.0351330.041259
Slope0.518652−0.0201610.224245−0.081780.007090.3312020.104236−0.012460.1545480.180757
Direct incoming solar radiation0.5913470.0623940.2242451−0.30348−0.176340.6273650.2255410.0552760.2317020.407135
Mean diurnal temperature range−0.20693−0.05345−0.08178−0.3034810.620974−0.57396−0.08620.068989−0.26488−0.27498
Temperature seasonality−0.03162−0.02690.00709−0.176340.6209741−0.37328−0.137550.054918−0.21352−0.27218
Annual precipitation0.6213560.0623060.3312020.627365−0.57396−0.3732810.279320.0392240.369680.506227
Topsoil available water capacity0.2318880.0147670.1042360.225541−0.0862−0.137550.2793210.1526530.3051730.590749
Topsoil minerology0.0008050.019412−0.012460.0552760.0689890.0549180.0392240.15265310.1422620.298536
Topsoil organic carbon content0.3031980.0351330.1545480.231702−0.26488−0.213520.369680.3051730.14226210.386809
Topsoil texture class0.4284710.0412590.1807570.407135−0.27498−0.272180.5062270.5907490.2985360.3868091
SpeciesMost important variable in abundance model
Acer campestre Cover of trees in NFI
Acer platanoides Probability of occupancy of Crataegus monogyna
Acer pseudoplatanus Cover of trees in NFI
Alnus glutinosa Probability of occupancy of Crataegus monogyna
Betula pendula Cover of all trees
Betula pubescens Cover of trees outside of NFI
Carpinus betulus Cover of all trees
Castanea sativa Cover of trees in NFI
Corylus avellana Cover of trees in NFI
Crataegus monogyna Probability of occupancy of Betula pendula
Fagus sylvatica Cover of all trees
Fraxinus excelsior Cover of trees in NFI
Prunus avium Probability of occupancy of Prunus avium
Pseudotsuga menziesii Cover of trees in NFI
Quercus petraea Cover of trees outside of NFI
Quercus robur Cover of trees in NFI
Salix caprea Probability of occupancy of Fagus sylvatica
Salix cinerea Probability of occupancy of Corylus avellana
Taxus baccata Cover of trees in NFI
Tilia cordata Probability of occupancy of Fagus sylvatica
Species R 2 RMSEMAENumber of data points per speciesNumber of nonzero data points
Acer campestre .5231.440.35679315
Acer platanoides .2071.270.1944442
Acer pseudoplatanus .4264.011.40906634
Alnus glutinosa .2712.400.66484195
Betula pendula .4506.882.291,261802
Betula pubescens .5964.091.09501127
Carpinus betulus .6903.791.05755320
Castanea sativa .7649.563.58982501
Corylus avellana .3444.471.471,282935
Crataegus monogyna .0491.100.23413339
Fagus sylvatica .4968.452.911,388918
Fraxinus excelsior .3974.951.881,9861,629
Populus tremula .126NANA40016
Prunus avium .5891.980.56886401
Prunus padus .004NANA3949
Pseudotsuga menziesii .5967.661.96600193
Quercus petraea .8415.991.84584209
Quercus robur .4626.502.542,3031,867
Salix caprea .6441.380.2844574
Salix cinerea .0810.160.0339255
Sorbus aria .055NANA3923
Taxus baccata .3722.210.4951886
Tilia cordata .4421.040.1446156
Ulmus glabra .037NANA39422
Ulmus procera .013NANA39327
  8 in total

Review 1.  Moving beyond assumptions to understand abundance distributions across the ranges of species.

Authors:  Raphael D Sagarin; Steven D Gaines; Brian Gaylord
Journal:  Trends Ecol Evol       Date:  2006-07-11       Impact factor: 17.712

2.  Estimating species occurrence, abundance, and detection probability using zero-inflated distributions.

Authors:  Seth J Wenger; Mary C Freeman
Journal:  Ecology       Date:  2008-10       Impact factor: 5.499

3.  Extrapolating population size from the occupancy-abundance relationship and the scaling pattern of occupancy.

Authors:  Cang Hui; Melodie A McGeoch; Belinda Reyers; Peter C le Roux; Michelle Greve; Steven L Chown
Journal:  Ecol Appl       Date:  2009-12       Impact factor: 4.657

4.  Uses and misuses of bioclimatic envelope modeling.

Authors:  Miguel B Araújo; A Townsend Peterson
Journal:  Ecology       Date:  2012-07       Impact factor: 5.499

5.  A framework for using niche models to estimate impacts of climate change on species distributions.

Authors:  Robert P Anderson
Journal:  Ann N Y Acad Sci       Date:  2013-09       Impact factor: 5.691

Review 6.  Ancient woodlands: modern threats.

Authors:  Oliver Rackham
Journal:  New Phytol       Date:  2008-09-02       Impact factor: 10.151

Review 7.  The consequence of tree pests and diseases for ecosystem services.

Authors:  I L Boyd; P H Freer-Smith; C A Gilligan; H C J Godfray
Journal:  Science       Date:  2013-11-15       Impact factor: 47.728

8.  Editorial commentary on 'BIOMOD - optimizing predictions of species distributions and projecting potential future shifts under global change'.

Authors:  Wilfried Thuiller
Journal:  Glob Chang Biol       Date:  2014-10-13       Impact factor: 10.863

  8 in total
  3 in total

1.  Evaluating the productivity of ancient Pu'er tea trees (Camellia sinensis var. assamica): a multivariate modeling approach.

Authors:  Shuqiao Zhang; Wendou Liu; Xinmeng Cheng; Zizhi Wang; Fengjun Yuan; Wengui Wu; Shengxi Liao
Journal:  Plant Methods       Date:  2022-07-27       Impact factor: 5.827

2.  A multi-data ensemble approach for predicting woodland type distribution: Oak woodland in Britain.

Authors:  Duncan Ray; Maurizio Marchi; Andrew Rattey; Alice Broome
Journal:  Ecol Evol       Date:  2021-06-21       Impact factor: 2.912

3.  A machine learning approach to integrating genetic and ecological data in tsetse flies (Glossina pallidipes) for spatially explicit vector control planning.

Authors:  Anusha P Bishop; Giuseppe Amatulli; Chaz Hyseni; Evlyn Pless; Rosemary Bateta; Winnie A Okeyo; Paul O Mireji; Sylvance Okoth; Imna Malele; Grace Murilla; Serap Aksoy; Adalgisa Caccone; Norah P Saarman
Journal:  Evol Appl       Date:  2021-05-05       Impact factor: 5.183

  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.