
Automatic variable selection in ecological niche modeling: A case study using Cassin's Sparrow (Peucaea cassinii).

John L Schnase1, Mark L Carroll1.   

Abstract

MERRA/Max provides a feature selection approach to dimensionality reduction that enables direct use of global climate model outputs in ecological niche modeling. The system accomplishes this reduction through a Monte Carlo optimization in which many independent MaxEnt runs, operating on a species occurrence file and a small set of randomly selected variables in a large collection of variables, converge on an estimate of the top contributing predictors in the larger collection. These top predictors can be viewed as potential candidates in the variable selection step of the ecological niche modeling process. MERRA/Max's Monte Carlo algorithm operates on files stored in the underlying filesystem, making it scalable to large data sets. Its software components can run as parallel processes in a high-performance cloud computing environment to yield near real-time performance. In tests using Cassin's Sparrow (Peucaea cassinii) as the target species, MERRA/Max selected a set of predictors from Worldclim's Bioclim collection of 19 environmental variables that have been shown to be important determinants of the species' bioclimatic niche. It also selected biologically and ecologically plausible predictors from a more diverse set of 86 environmental variables derived from NASA's Modern-Era Retrospective Analysis for Research and Applications Version 2 (MERRA-2) reanalysis, an output product of the Goddard Earth Observing System Version 5 (GEOS-5) modeling system. We believe these results point to a technological approach that could expand the use of global climate model outputs in ecological niche modeling, foster exploratory experimentation with otherwise difficult-to-use climate data sets, streamline the modeling process, and, eventually, enable automated bioclimatic modeling as a practical, readily accessible, low-cost, commercial cloud service.

Year:  2022        PMID: 35061658      PMCID: PMC8782318          DOI: 10.1371/journal.pone.0257502

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Ecological niche modeling (ENM) consists of a set of techniques and tools that use species occurrence records and environmental data to predict the relative suitability of habitats [1]. It is used across a wide range of disciplines, including fields as diverse as biogeography and phylogeny [2], conservation biology and epidemiology [3, 4], invasion biology [5], and archaeology [6]. In recent years, ecological niche models have become particularly important in understanding the influence of climate change on the geographic distribution of species [7]. This, in turn, has led to greater use of global climate model (GCM) outputs as environmental predictors [8]. GCMs combine observations from an array of satellite, airborne, and in-situ sensors to create global representations of the climate system, including historical simulations and future projections for hundreds of climate variables [9]. The largest and most sophisticated of these, however, produce complex, petabyte-scale data sets, which complicates variable selection and limits their direct use in ecological modeling [10-12]. Part of the problem lies in the fact that most ENM software tools require predictors and observations to be memory-resident in order for the programs to work [13, 14]. This results in run-times and space requirements that have linear or higher-order scaling properties with respect to the size of a model’s inputs. This generally poses few difficulties. But when the number of predictors becomes large, compute times can become impractically long, models can become overly complex, and efforts to understand any particular variable’s contribution to model formation, either as an aspect of model analysis or as a way of selecting subsets of variables for further model refinement, can become challenging [10, 15–19]. 
An effective way of dealing with large, externally-stored environmental data sets that preserves the advantages of conventional tools while overcoming this limitation would benefit the ENM community. In previous work, we demonstrated the potential of a MaxEnt-based Monte Carlo method that addresses this issue by screening large data collections for viable predictors [14]. Based on a machine learning approach to maximum entropy modeling, MaxEnt is one of the most popular software packages in use today by the ENM community [20-22]. Among its many advantages, MaxEnt ranks the contribution of predictor variables in the formation of its models. Our Monte Carlo method exploits this feature in an ensemble strategy whereby many independent MaxEnt runs, each drawing on a small, random subset of variables stored in the filesystem, converge on a global estimate of the top contributing subset of variables in the larger collection. These top-contributing predictors can then be studied in more detailed ways, augmented with other variables, and further refined prior to final model construction. We believe a screening step, such as this, could help the ENM process, particularly when working with large, multidimensional data sets where selection through ecological reasoning or other means is not apparent. In our earlier, proof-of-concept work, we implemented the Monte Carlo selection algorithm as a single-threaded program running on a MacBook Pro laptop computer [14]. In the current study, we have implemented a parallel version of the Monte Carlo method in a high-performance cloud computing environment. Our goal this time has been to characterize the run-time performance and scaling properties of a parallel implementation of the Monte Carlo algorithm and demonstrate its variable selection behavior with two example use cases. 
We call the prototype system MERRA/Max to reflect its reliance on MaxEnt and our interest in using the technology to screen for bioclimatic predictors in NASA’s Modern-Era Retrospective Analysis for Research and Applications Version 2 (MERRA-2) dataset, what we view as an underutilized and potentially important GCM resource for the ecological modeling community [23, 24]. A second goal for this paper is to open a discussion about the potential merits of this technology and lay the groundwork for experiments to evaluate its scientific value more fully. Reanalyses, such as MERRA-2, simulate hundreds of low-level physical drivers of the Earth system at extraordinarily fine temporal scale, and they do so over the entire four-decade span of the satellite era. A technology like MERRA/Max makes this remarkable resource practically available to the ENM community. In this paper, we begin to make the case for that. Some of the most important conservation questions scientists hope to answer are hobbled by current predictor selection methods. Rare species, for example, are generally represented by sparse occurrence records. Effective ENM, in these cases, requires a small number of high-quality predictors to avoid overfitting. We show how MERRA/Max can help with that. As large climate data sets continue to grow in size, they become less accessible to the science community and less usable in today’s suite of machine learning and statistical analysis tools. MERRA/Max shows how parallelizable, external-memory algorithms can address that problem. And, while the topic of variable selection in ENM is represented by a vast literature, most approaches in use today are difficult or impossible to automate, do not scale well to large data sets, or provide limited insight into the underlying biology or ecology of the organisms being studied. In the pages that follow, we show that a technology like MERRA/Max can potentially help overcome these limitations. 
This project builds on a twenty-year history of technology research and development at NASA focusing on applications of high-performance computing to ecological modeling [25-29] and the big data challenges of Earth science [30-40]. It complements this body of work by looking at ways that machine learning and high-performance cloud computing can extend existing capabilities and open new opportunities for research. In a fully realized, operational implementation of the technologies described here, we see MERRA/Max as one element of a bioclimatic modeling service enabled by a suite of high-performance data subsetting and data analytic tools of the sort becoming increasingly available to the research community through commercial cloud services [41-46].

Materials and methods

System architecture and implementation

We implemented MERRA/Max in a 100-core testbed within the NASA Center for Climate Simulation’s (NCCS’s) Advanced Data Analytics Platform (ADAPT). ADAPT is a managed virtual machine (VM) environment most closely resembling a platform-as-a-service (PaaS) cloud [47]. It features over 300 physical hypervisors that host one or more VMs, each having access to multiple shared, centralized data repositories. The hypervisor hardware consists of 2.2 GHz 24-core Intel Xeon Broadwell E5-2650 v4 processors with 256 GB of memory. The MERRA/Max testbed consists of a dedicated set of ten 10-core Debian Linux 9 Stretch VMs. We used shell scripts, R Version 4.0.1 [48], ENMeval Version 0.3.1 [49, 50], and MaxEnt Version 3.4.1 [51] to develop MERRA/Max’s software components, which collectively realize the Monte Carlo algorithm through the interactions shown in Fig 1.
Fig 1

MERRA/Max architecture.

Conceptual diagram showing the major hardware and software components of the MERRA/Max prototype. The study’s testbed consisted of 10 virtual machines (VMs) within NASA’s ADAPT science cloud, with each VM contributing 10 processing cores to the testbed. Numbered arrows indicate the system’s processing workflow.

Conceptually, MERRA/Max sits atop a collection of variables stored in the underlying filesystem; when provided a species occurrence file, the system screens the collection to find the most important predictors for the input provided (Fig 1, Steps 1–5). The Monte Carlo screening process is initiated by the mcensemble.sh script, which launches an mcsprint.sh script on each of the 10 MERRA/Max VMs (Fig 1, Step 2). The mcsprint.sh script, in turn, creates parallel sprint runs by launching an R run-time environment and an MCSprint.R program on each VM’s 10 processor cores (Fig 1, Steps 2.1). The MCSprint.R programs perform repeated MaxEnt runs on random pairs of variables read from the shared filesystem until a desired level of sampling is achieved (Fig 1, Steps 3.1). MCSprint.R maintains a tally table that tracks the number of times each variable is used along with its accumulating permutation importance, then writes the table to a shared directory (Fig 1, Steps 4.1). We used the operating system’s MCSprint.R process identifiers to create unique pid.tt file names for the output tally tables. When all the sprints have completed their work, MCSelect.R concatenates the pid.tt files into a global tally table (Fig 1, Step 4), computes the average permutation importance for each variable, then sorts them to reveal the top contributing variables identified by the ensemble’s runs (Fig 1, Step 5). While MERRA/Max relies on MaxEnt to perform selection, it is important to note that the resulting set of selected variables can be used in any ENM application or species distribution modeling approach.
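The tally-table workflow described above can be illustrated with a minimal Python sketch. This is not the MERRA/Max code, which is written in R and shell script. In particular, toy_importance is a hypothetical stand-in for a MaxEnt run, and the variable names and signal strengths are invented so that the example runs without MaxEnt or any environmental layers.

```python
import random
from collections import defaultdict

def toy_importance(pair, signal):
    """Hypothetical stand-in for one MaxEnt run: returns a
    permutation-importance-like score (summing to 100) per variable."""
    raw = {v: signal[v] + random.random() * 0.1 for v in pair}
    total = sum(raw.values())
    return {v: 100.0 * raw[v] / total for v in pair}

def sprint(variables, runs, signal, v=2):
    """One sprint: repeated runs on random subsets of v variables.
    The tally table maps variable -> [times used, summed importance]."""
    tally = defaultdict(lambda: [0, 0.0])
    for _ in range(runs):
        pair = random.sample(variables, v)
        for name, score in toy_importance(pair, signal).items():
            tally[name][0] += 1
            tally[name][1] += score
    return dict(tally)

def select(tallies, top=6):
    """Merge per-sprint tally tables and rank by mean importance."""
    merged = defaultdict(lambda: [0, 0.0])
    for table in tallies:
        for name, (count, total) in table.items():
            merged[name][0] += count
            merged[name][1] += total
    means = {name: total / count for name, (count, total) in merged.items()}
    return sorted(means, key=means.get, reverse=True)[:top]

random.seed(0)
variables = [f"var{i:02d}" for i in range(1, 20)]   # 19 invented variables
signal = {name: 0.05 for name in variables}          # weak by default
signal.update({"var03": 1.0, "var18": 2.0})          # two strong predictors
# 100 sprints of 5 runs each, executed sequentially here for simplicity
tallies = [sprint(variables, runs=5, signal=signal) for _ in range(100)]
top = select(tallies)
```

In the real system each sprint runs as a separate process and writes its table to a pid.tt file; here the tables are simply held in a list before being merged.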

Run-time performance and scaling properties

MERRA/Max’s run-time performance and scaling properties are affected by several factors. The total amount of time needed for MERRA/Max to complete a screening run (T) is primarily determined by the number of variables in the collection being screened (N), the average number of random samples taken of each variable in the collection (S), the number of random variables used in each independent MaxEnt sampling run (V), and the number of processor cores available in the compute environment (C). To understand the interplay of these factors, we gathered timing metrics on a series of ensembles with varying values of N, S, and C. In a first ensemble with N = 2 and C = 10, we used 10 parallel sprints (one sprint per core) in which each sprint performed five sequential two-variable sampling runs to achieve an average sample size for each collection variable of S = 50. We then completed the S = 50 series with ensembles in which N and C were proportionally increased to N = 18 and C = 90. This process was repeated using 10- and 15-run sprints respectively to create a timing series for S = 100 and S = 150. These measurements allowed us to quantify MERRA/Max’s run-time performance, estimated optimal performance, and its scaling properties within the constraints of a 100-core testbed. We did not quantify non-algorithmic influences on run time, such as predictor resolution, occurrence file size, competing system processes, filesystem performance, processor failures, process failures (abends), or MaxEnt parameter settings, as these tend to be intrinsic properties of the science question being studied, the compute environment, or the MaxEnt software itself. These non-algorithmic factors either have an idiosyncratic impact on overall run time that is constant for any particular application of MERRA/Max, or they are beyond user control. 
For development testing, we used Cassin’s Sparrow (Peucaea cassinii Woodhouse, 1852) as the target species [52] and Worldclim Version 2.1’s 19 Bioclim variables at a resolution of 5.0 arc-minutes (Table 1) as environmental predictors [53, 54]. We obtained Cassin’s Sparrow observational records for the year 2016 from the Global Biodiversity Information Facility (GBIF) [55]. Because of their secretive nature, Cassin’s Sparrows are generally detected in the field by the presence of singing males that define and defend breeding territories that range in size from 0.6 to 12.9 acres [52, 56–58]. After removing replicates, we thinned the records to non-overlapping observations within a 16 km buffer around each point to avoid double counting the same individuals. This resulted in a total of 609 observations, which were used for testing throughout the study. The predictor layers were clipped to the coverage area of our observational data, reprojected, and formatted for use by MaxEnt using rgdal Version 1.5–18 [59] following the guidelines of Hijmans et al. [60].
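The thinning step can be sketched as a greedy filter that keeps a record only if it lies at least 16 km from every record already kept. The paper does not state the exact thinning algorithm, so the greedy rule, the haversine distance, and the example coordinates below are all assumptions made for illustration.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def thin(points, min_km=16.0):
    """Drop exact duplicates, then keep a point only if it is at least
    min_km away from every point already kept (greedy, order-dependent)."""
    kept = []
    for p in dict.fromkeys(points):  # removes replicates, preserves order
        if all(haversine_km(p, k) >= min_km for k in kept):
            kept.append(p)
    return kept

# Invented coordinates: one replicate, one near neighbor (~5.6 km), one distant point
records = [(32.00, -103.00), (32.00, -103.00), (32.05, -103.00), (32.50, -103.00)]
thinned = thin(records)
```

Because the rule is greedy, the result depends on record order; a production workflow might instead use a dedicated thinning package.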
Table 1

Bioclim variables.

Variable  Description
bio01     Annual mean temperature
bio02     Mean diurnal range (mean of monthly (max temp - min temp))
bio03     Isothermality (bio2/bio7) (×100)
bio04     Temperature seasonality (standard deviation ×100)
bio05     Maximum temperature of warmest month
bio06     Minimum temperature of coldest month
bio07     Temperature annual range (bio5 - bio6)
bio08     Mean temperature of wettest quarter
bio09     Mean temperature of driest quarter
bio10     Mean temperature of warmest quarter
bio11     Mean temperature of coldest quarter
bio12     Annual precipitation
bio13     Precipitation of wettest month
bio14     Precipitation of driest month
bio15     Precipitation seasonality (coefficient of variation)
bio16     Precipitation of wettest quarter
bio17     Precipitation of driest quarter
bio18     Precipitation of warmest quarter
bio19     Precipitation of coldest quarter

We adopted a standard MERRA/Max screening configuration that we used as the default in all our timing trials and use cases. This included a MaxEnt feature class (FC) setting of LQHP (linear, quadratic, hinge, and product), a regularization multiplier (RM) setting of 1.0, 10-replicate cross-validation, and ten thousand background points from across the study area [14]. We used V = 2 random variables in all the independent MaxEnt sampling runs. Additional details about MERRA/Max’s default screening parameters and the rationale for their choice are provided in S1 Appendix.

Use case scenarios and selection behavior

To demonstrate MERRA/Max’s selection behavior and show how the system might be used in actual practice, we developed two use case scenarios in which we modeled the bioclimatic niche of Cassin’s Sparrow, a species known to be sensitive to many of the variables used in the study [52, 56, 58, 61]. Each use case involved three steps. The first was a Variable Screening step, in which MERRA/Max selected the top six contributing predictors from a collection of variables using an average sampling rate of S = 50. In previous work, we demonstrated that sampling at this rate converges quickly on a stable set of top predictors [14]. Here, we confirmed this behavior by first performing three ensemble runs. We then used the averaged results from these three screenings to settle on the top contributors. This was followed by a Predictor Refinement step, where we used variance inflation factor (VIF) analysis to reduce collinearities in the selected predictors [62]. VIF shows the degree to which standard errors are inflated by multicollinearity. Using ENMtools Version 1.4.4 [63], we first calculated Pearson correlation coefficient (r), coefficient of determination (r²), and VIF [1 ÷ (1 − r²)] values for the selected predictors, then eliminated the least contributing variable in any pair of variables having r > 0.8, r² > 0.8, and VIF > 10.0 [62]. In a final Model Calibration / Final Model Run step, we used the ENMeval R package [49, 50] to identify optimal settings for the remaining, non-collinear predictors by performing a series of MaxEnt runs across all possible combinations of five feature classes (L, LQ, H, LQH, and LQHP) and regularization multiplier values ranging from 0.5 to 4.0 in increments of 0.5. The combination of settings resulting in the lowest value for Akaike’s information criterion corrected for small sample size (AICc) [64] was taken to be an optimal tuning configuration. 
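The pairwise collinearity test in the Predictor Refinement step can be illustrated with a short Python sketch. The study used ENMtools for these calculations; the functions below are an independent illustration of the stated formula VIF = 1 ÷ (1 − r²) and the three thresholds, and the sample values are invented.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def vif(r):
    """Pairwise variance inflation factor, 1 / (1 - r^2)."""
    return 1.0 / (1.0 - r ** 2)

def is_collinear(r, r_max=0.8, r2_max=0.8, vif_max=10.0):
    """True when a pair exceeds all three thresholds named in the text."""
    return r > r_max and r ** 2 > r2_max and vif(r) > vif_max

# Invented predictor values sampled at five locations
r = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])   # r = 0.8
flag_moderate = is_collinear(r)       # r is not above 0.8, so not flagged
flag_strong = is_collinear(0.96)      # r^2 = 0.9216, VIF is about 12.8
```

Note that VIF > 10.0 is equivalent to r² > 0.9, so of the three thresholds the VIF criterion is the binding one.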
We used the same 2016 Cassin’s Sparrow occurrence data in each scenario that we used for development testing. However, the two use cases operated on different sets of environmental predictors. In the first, we again used WorldClim’s 19 Bioclim variables. In the second use case, to gain experience with an even larger collection and demonstrate the system’s application to a novel set of Intergovernmental Panel on Climate Change (IPCC)-class GCM outputs [10, 65], we used variables obtained directly from the Modern-Era Retrospective Analysis for Research and Applications Version 2 (MERRA-2) reanalysis. In contrast to Worldclim’s Bioclim predictors, which are derived from 30-year averages of spatially interpolated weather station temperature and precipitation data [53, 54], the MERRA-2 reanalysis is produced by NASA’s Goddard Earth Observing System Version 5 (GEOS-5) [23, 66, 67]. The system integrates observational data with numerical models to produce a global temporally and spatially consistent synthesis of over 600 climate-related variables. MERRA-2’s spatial resolution is 1/2° latitude × 5/8° longitude (i.e., 55.5 × 69.4 km at the equator) × 72 vertical levels extending through the stratosphere. Its temporal resolution is hourly and extends from 1979 to the present, nearly the entire satellite era. The complete MERRA-2 collection is about one petabyte in size. For the current study, we created a test collection of 86 MERRA-2 variables of potential ENM interest. These were drawn from four MERRA-2 collections and included modeled, two-dimensional values for atmospheric attributes and heat, wind, radiation, and land surface attributes (Table 2). The test collection contains weekly and monthly maximum, minimum, and average values (or sums as appropriate) for each variable for the 40 years spanning 1980 to 2020. 
Importantly, the collection contains modeled values for the temperature and precipitation variables that form the basis for Bioclim’s 19 predictors, which highlight climate conditions generally understood to relate to a species’ physiology, plus an extended array of environmental attributes of potentially more direct biological significance, such as soil moisture and evaporation, wind direction and speed, and various solar radiation fluxes (Table 2) [53, 68–72]. For our use case, we used xarray [73] to create annual averages for the 86 variables for the year 2016, the year corresponding to the observation year of our Cassin’s Sparrow occurrence data. We used modeled values at 850 hPa, where appropriate, to reflect surface conditions. The hPa (hectopascal) atmospheric pressure unit is an expression of altitude. Generally, 850 hPa lies immediately above the atmospheric boundary layer (about 1.5 km), where daily surface variations in temperature, humidity, wind speed, etc. have little if any effect on measured or modeled values [9]. These layers were then prepared for use with MaxEnt as described above.
Table 2

MERRA-2 variables.

ID   Name         Description
M2T1NXSLV 2D Atmospheric single-level diagnostics
M01  PS           Time averaged surface pressure
M02  U850         Eastward wind at 850 hPa
M03  V850         Northward wind at 850 hPa
M04  T850         Temperature at 850 hPa
M05  Q850         Specific humidity at 850 hPa
M06  H1000        Height at 1000 hPa
M07  TS           Surface skin temperature
M08  QV2M         2-meter specific humidity
M09  QV10M        10-meter specific humidity
M10  T2M          2-meter air temperature
M11  T10M         10-meter air temperature
M12  U2M          2-meter eastward wind
M13  U10M         10-meter eastward wind
M14  U50M         Eastward wind at 50 meters
M15  V2M          2-meter northward wind
M16  V10M         10-meter northward wind
M17  V50M         Northward wind at 50 meters
M2T1NXFLX 2D Surface turbulent flux diagnostics
M18  EFLUX        Latent heat flux (positive upward)
M19  HFLUX        Sensible heat flux (positive upward)
M20  TAUX         Eastward surface wind stress
M21  TAUY         Northward surface wind stress
M22  RHOA         Surface air density
M23  TSH          Effective turbulence skin temperature
M24  QSH          Effective turbulence skin humidity
M25  PGENTOT      Total generation of precipitation
M26  PREVTOT      Total re-evaporation of precipitation
M2T1NXRAD 2D Surface and TOA radiation fluxes
M27  EMIS         Surface emissivity
M28  ALBEDO       Surface albedo
M29  LWGEM        Emitted longwave at the surface
M30  LWGAB        Surface absorbed longwave
M31  LWGABCLR     Surface absorbed longwave assuming clear sky
M32  LWGABCLRCLN  Surface absorbed longwave assuming clear clean sky
M33  LWGNT        Surface net downward longwave flux
M34  LWGNTCLR     Surface net downward longwave flux assuming clear day
M35  LWGNTCLRCLN  Surface net downward longwave flux assuming clear clean day
M36  SWGDN        Surface incident shortwave flux
M37  SWGDNCLR     Surface incident shortwave flux assuming clear sky
M38  SWGNT        Surface net downward shortwave flux
M39  SWGNTCLR     Surface net downward shortwave flux assuming clear sky
M40  SWGNTCLN     Surface net downward shortwave flux assuming clean sky
M41  SWGNTCLRCLN  Surface net downward shortwave flux assuming clear clean sky
M42  TAUTOT       Optical thickness of all clouds
M43  CLDTOT       Total cloud fraction
M2T1NXLND 2D Land surface diagnostics
M44  GRN          Vegetation greenness fraction (LAI-weighted)
M45  LAI          Leaf area index
M46  GWETPROF     Total profile soil wetness
M47  GWETROOT     Root zone soil wetness
M48  GWETTOP      Top soil layer wetness
M49  TSURF        Mean land surface temperature (incl. snow)
M50  TPSNOW       Top snow layer temperature
M51  TUNST        Surface temperature of unsaturated (but non-wilting) zone
M52  TSAT         Surface temperature of saturated zone
M53  TWLT         Surface temperature of wilting zone
M54  SNODP        Snow depth
M55  RUNOFF       Overland runoff
M56  BASEFLOW     Baseflow
M57  QINFIL       Soil water infiltration rate
M58  FRUNST       Fractional unsaturated (but non-wilting) area
M59  FRSAT        Fractional saturated area
M60  FRSNO        Fractional snow-covered area
M61  FRWLT        Fractional wilting area
M62  PARDFLAND    Surface downward photosynthetically active radiation diffuse flux
M63  PARDRLAND    Surface downward photosynthetically active radiation beam flux
M64  SHLAND       Sensible heat flux from land
M65  LHLAND       Latent heat flux from land
M66  LWLAND       Net downward longwave flux over land
M67  SWLAND       Net downward shortwave flux over land reservoirs
M68  GHLAND       Downward heat flux into top soil layer
M69  TWLAND       Total water stored in land reservoirs
M70  TELAND       Energy stored in all land
M71  WCHANGE      Total land water change per unit time
M72  ECHANGE      Total land energy change per unit time
M73  SPLAND       Spurious land energy source
M74  SPWATR       Spurious land water source
M75  SPSNOW       Spurious snow energy source
M76  PRMC         Total profile soil moisture content
M77  RZMC         Root zone soil moisture content
M78  SFMC         Top soil layer soil moisture content
M79  PRECTOT      Total surface precipitation
M80  SNOMAS       Snow mass
M81  EVPSOIL      Bare soil evaporation
M82  EVPTRNS      Transpiration
M83  EVPINTR      Interception loss
M84  EVPSBLN      Sublimation
M85  SMLAND       Snowmelt over land
M86  EVLAND       Evaporation from land

To evaluate MERRA/Max’s selection behavior, we created initial MaxEnt models using the top six predictors selected by the three screenings in the Variable Screening step. Then, using the overall top six variables found in the Variable Screening step, we created a final MaxEnt model in the Model Calibration / Final Model Run step that reflected any improvements gained in the Predictor Refinement step or by Model Calibration. The potential distribution maps produced by the final models were judged for reasonableness based on first-hand knowledge of the species, its habitat preferences, what is known about Cassin’s Sparrow’s range from the published literature [52, 56–58, 61], and observational records from Cornell Lab’s eBird citizen-scientist database [74]. We further compared these final model predictions to results obtained by replicating, in part, the work of Salas et al. [75], in which traditional MaxEnt variable-selection techniques were used to model the bioclimatic niche of Cassin’s Sparrow. Here, we used our 2016 Cassin’s Sparrow occurrence data in combination with the seven Worldclim Bioclim variables used by the Salas team: bio03, bio06, bio08, bio09, bio12, bio14, and bio18. The Salas team chose these predictors by first removing one of each pair of highly correlated variables to avoid collinearity among the variables. The team then chose between highly correlated variables by selecting those that were identified in one or more species-specific studies as influencing the species’ range or population dynamics. In cases where the literature search could not differentiate between two highly correlated variables, the team used a qualitative assessment of the distribution of values of the variable at all presence points and the relationship between the variable and species presence or pseudo-absence [75]. We used ENMeval, as described above, to identify optimal tuning parameters for the Salas-derived model. 
To gain a quantitative perspective on performance, we used AICc [64] as a measure of a model’s relative explanatory power (lower values indicating less information loss) and area under the receiver operating characteristic curve (AUC) [76], percent correctly classified (PCC) [77], and the True Skill Statistic (TSS) [78, 79] as measures of model accuracy (higher values in all cases indicating greater accuracy). Similarities between our first use case’s final Bioclim model and the Salas-derived Bioclim model were examined using Warren’s I statistic [80], Schoener’s D statistic [81], and Pearson’s r statistic [82]. All input data used in this study, along with a set of example scripts, are provided in S1 File.
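For readers unfamiliar with these measures, the following Python sketch shows the commonly used definitions of PCC, TSS, Schoener’s D, and Warren’s I. It is an illustration rather than the code used in the study: the confusion-matrix counts and suitability surfaces are invented, and the form of Warren’s I shown here (one minus half the squared Hellinger distance) is assumed to match the form used by the authors’ tools.

```python
from math import sqrt

def pcc(tp, fp, fn, tn):
    """Percent correctly classified, expressed as a fraction."""
    return (tp + tn) / (tp + fp + fn + tn)

def tss(tp, fp, fn, tn):
    """True Skill Statistic: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0

def normalize(surface):
    """Rescale a suitability surface so that its cells sum to 1."""
    total = sum(surface)
    return [value / total for value in surface]

def schoener_d(a, b):
    """Schoener's D: 1 - 0.5 * sum of absolute cell-wise differences."""
    pa, pb = normalize(a), normalize(b)
    return 1.0 - 0.5 * sum(abs(x - y) for x, y in zip(pa, pb))

def warren_i(a, b):
    """Warren's I: 1 - 0.5 * sum of squared differences of square roots."""
    pa, pb = normalize(a), normalize(b)
    return 1.0 - 0.5 * sum((sqrt(x) - sqrt(y)) ** 2 for x, y in zip(pa, pb))

# Invented confusion matrix: 80 true positives, 30 false positives,
# 20 false negatives, 70 true negatives
accuracy = pcc(80, 30, 20, 70)   # 150 / 200
skill = tss(80, 30, 20, 70)      # 0.8 + 0.7 - 1

# Invented four-cell suitability surfaces for two models
model_a = [0.9, 0.6, 0.2, 0.1]
model_b = [0.8, 0.7, 0.3, 0.1]
d_overlap = schoener_d(model_a, model_b)
i_overlap = warren_i(model_a, model_b)
```

Both overlap statistics range from 0 (no overlap) to 1 (identical surfaces).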

Results

The first ensemble of the 50-sample timing series (S = 50) required a total run time of T = 7.9 minutes to screen a two-variable collection (N = 2) using 10 processor cores (C = 10) (Fig 2A). At one sprint per core, and with each MaxEnt sampling run operating on two (V = 2) randomly selected variables at a time, this first ensemble needed 50 MaxEnt runs to do its work. The shortest achievable screening time (Tmin) is possible only when the number of cores needed for perfect parallelism (Cmax) is actually available, in this case, 50:
$$C_{max} = \frac{N \times S}{V} = \frac{2 \times 50}{2} = 50$$
Fig 2

MERRA/Max run-time performance and scaling properties.

Figure shows the relationship between the amount of time it takes MERRA/Max to complete a screening run (T) (shown by the left Y axis and the colored lines labeled A, B, and C), the number of variables in the collection being scanned (N), the average number of random samples taken of each variable in the collection during the screening process (S), and the number of processor cores available in the compute environment (C) (shown by the colored vertical bars and right Y axis). MERRA/Max’s parallel implementation scales linearly with respect to S, and, for any given collection of size N and sample size S, the estimated minimum possible run time (Tmin) (shown in parentheses) can be achieved when enough cores are available for a completely parallel screening of the collection.

Because only 10 cores were available, each of the parallel sprints had to perform five sequential MaxEnt sampling runs to achieve the S = 50 sampling goal, a repeat factor (R) of 5:
$$R = \frac{C_{max}}{C} = \frac{50}{10} = 5$$
By accounting for this performance cost, we estimate that MERRA/Max’s minimum possible run time, in a completely parallel screening of this first data set, would have been about 1.6 minutes:
$$T_{min} = \frac{T}{R} = \frac{7.9}{5} \approx 1.6 \text{ minutes}$$
In each subsequent ensemble of the S = 50 series, we added two variables to the scanned collection and 10 cores to the pool of available processors. With this proportional scaling of variables and processors, average run times remained constant across the series at T = 7.9 ± 0.3 minutes (Tmin = 1.6 ± 0.1 minutes) (Fig 2A). In the S = 100 timing series, R = 10 sequential MaxEnt runs were used in each sprint to achieve the desired sampling level (Fig 2B), and in the S = 150 series, R = 15 runs were used (Fig 2C). In both cases, run times scaled linearly with sample size and remained relatively constant across the series, with T = 14.8 ± 2.5 minutes (Tmin = 1.6 ± 0.1 minutes) for the S = 100 series and T = 22.6 ± 0.5 minutes (Tmin = 1.5 ± 0.1 minutes) for the S = 150 series.
The Bioclim collection consists of N = 19 variables. To achieve an average per-variable sampling goal of S = 50 with C = 100 cores, each sprint in the Bioclim use case (Fig 3A) performed R = 5 sequential MaxEnt runs in the Variable Screening step, resulting in ensembles comprising a total of 500 runs. Averaged over three such ensembles, MERRA/Max took T = 6.4 ± 0.5 minutes (Tmin = 1.3 ± 0.1 minutes) to identify bio18 (precipitation of the warmest quarter), bio03 (isothermality), bio05 (maximum temperature of the warmest month), bio08 (mean temperature of the wettest quarter), bio13 (precipitation of the wettest month), and bio16 (precipitation of the wettest quarter) as the top six contributing variables of the collection. In the subsequent Predictor Refinement step, predictor pairs bio13-bio16 and bio16-bio18 were shown to be correlated, which led us to discard bio16 from the selection set. In the Model Calibration / Final Model Run step, the remaining five non-correlated variables were used to create a final model in which the top four contributing variables (bio18, bio03, bio05, and bio13) accounted for approximately 98% of overall permutation importance, and the performance metrics were AICc 12,232, AUC 0.83, PCC 0.75, and TSS 0.49.
Fig 3

MERRA/Max use case scenarios.

Figure shows the results of two use cases involving Cassin’s Sparrow observational data and predictor data sets of contrasting size and complexity: the Bioclim collection with N = 19 variables (A) and a MERRA-2 reanalysis test collection comprising N = 86 variables (B). A Variable Screening step was used in each scenario to select the top six contributing variables in the underlying collection. Correlated variables (indicated with red text and yellow highlight) were identified in a Predictor Refinement step and thinned to reduce collinearities. In a third step, Model Calibration and a Final Model Run were performed with the remaining non-correlated variables (green highlight). AICc is Akaike’s information criterion corrected for small sample size, AUC is area under the receiver operating characteristic curve, PCC is percent correctly classified, TSS is True Skill Statistic, Parameters is MaxEnt’s measure of model complexity, r is Pearson’s correlation coefficient, r2 is the coefficient of determination, and VIF is variance inflation factor. The estimated minimum run time (Tmin) for a completely parallel screening is shown in parentheses. Maps created by the authors show MaxEnt logistic output, which can be interpreted as an estimate of habitat suitability between 0 and 1 with warmer colors indicating better predicted conditions for the species.

In the MERRA-2 use case (Fig 3B), MERRA/Max screened a collection of N = 86 variables of coarser resolution (approximately 50 km for MERRA-2 vs. 8 km for Bioclim). To achieve the S = 50 sampling goal, each sprint performed R = 22 MaxEnt sampling runs, which resulted in 2200-run ensembles. The average run time across three such ensembles in the Variable Screening step increased to T = 18.0 ± 2.2 minutes; however, because of the coarser predictor resolution, times for the MaxEnt sampling runs decreased, which resulted in an estimated theoretical lower bound of only Tmin = 0.8 ± 0.1 minutes.
The six top contributing variables identified in the Variable Screening step included M05 (specific humidity), M39 (surface net downward shortwave flux assuming a clear day), M38 (surface net downward shortwave flux), M81 (bare soil evaporation), M03 (northward wind), and M04 (temperature). In the Predictor Refinement step, the M38-M39 pair showed strong correlation, which led us to discard M38. In the Model Calibration / Final Model Run step, the remaining five non-correlated variables were used to create a final model in which the top four contributing variables (M39, M05, M04, and M81) accounted for approximately 97% of overall permutation importance, and the performance metrics were AICc 7,023, AUC 0.83, PCC 0.72, and TSS 0.44. In the Salas-derived Cassin’s Sparrow model (Fig 4A), where a traditional approach to variable selection was used to identify the seven predictors used in MaxEnt, the top four contributing variables (bio18, bio06, bio14, and bio09) accounted for approximately 83% of overall permutation importance, and the model’s performance metrics were AICc 12,169, AUC 0.83, PCC 0.76, and TSS 0.50.
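The thinning performed in the Predictor Refinement step of both use cases can be illustrated with a simple greedy rule. This is a sketch under stated assumptions, not the procedure used in the study: the function `thin_correlated` is hypothetical, the order in which the text lists the screened variables is treated as their rank, and the correlated pairs are supplied directly rather than computed from r or VIF values. The rule keeps a variable only if it is not flagged as correlated with a variable already kept.

```python
def thin_correlated(ranked_vars, correlated_pairs):
    """Greedy thinning: walk the variables in rank order and keep each one
    unless it forms a flagged correlated pair with an already-kept variable."""
    flagged = {frozenset(pair) for pair in correlated_pairs}
    kept = []
    for var in ranked_vars:
        if all(frozenset((var, other)) not in flagged for other in kept):
            kept.append(var)
    return kept


# Bioclim use case: top six screened variables and the reported correlated pairs
bioclim_kept = thin_correlated(
    ["bio18", "bio03", "bio05", "bio08", "bio13", "bio16"],
    [("bio13", "bio16"), ("bio16", "bio18")],
)

# MERRA-2 use case: top six screened variables and the reported correlated pair
merra2_kept = thin_correlated(
    ["M05", "M39", "M38", "M81", "M03", "M04"],
    [("M38", "M39")],
)

print(bioclim_kept)
print(merra2_kept)
```

With these inputs the rule drops bio16 and M38, which agrees with the selections reported above. In practice the choice of which member of a correlated pair to discard may also rest on ecological judgment.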
Fig 4

Cassin’s Sparrow baseline model and maps.

Figure shows results from a MaxEnt run that builds on the Cassin’s Sparrow bioclimatic modeling work of Salas et al. [75] and reflects a more traditional approach to ENM (A) and Cassin’s Sparrow’s range map based on observational data (B). Highlighted variables indicate those that were also selected by MERRA/Max in the Bioclim use case. Range map provided by eBird (www.ebird.org), created 28 July 2020, and reprinted from [83] under a CC BY license, with permission from the Cornell Lab of Ornithology.

Discussion

Climate change research is giving rise to new technology requirements at the intersection of big data, machine learning, and high-performance computing [84]. There are few places where this is more clearly seen than with studies focusing on the climate’s impact on species distribution and abundance [85]. For nearly twenty years, the ecological modeling community’s tool-of-choice for this work has been MaxEnt [22]. Few, if any, machine learning programs have been more widely used or more carefully studied [16, 18, 84, 86–92]. Today, however, there is increasing interest in using GCM outputs as predictors in ENM [10-12], which has brought into focus one of the more challenging problems with existing machine learning systems: how to make them work with large, complex, feature-rich, high-dimensional data sets [93-99]. This study reflects our efforts to address this problem. Our approach to dimensionality reduction involves a parallel, out-of-core Monte Carlo selection method implemented in a high-performance, cloud computing setting. Monte Carlo optimizations are a way of finding approximate answers to problems that are solvable in principle but lack a practical means of solution [100]. Out-of-core (or external memory) algorithms process data sets that are too large to fit into a computer’s main memory [101, 102]. They are currently a major focus of research in the machine learning community [102-104]. With MERRA/Max, we bring these concepts together to find a useful subset of predictors in a large collection of environmental variables in a reasonable amount of time. Early results are encouraging and suggest that the approach holds promise from both a technological and scientific perspective. To begin, we have shown that MERRA/Max’s parallel implementation of the Monte Carlo selection algorithm scales linearly with respect to the average number of samples taken of each variable in the collection being screened. 
It can recruit additional processor cores to maintain a constant run time regardless of the size of the collection. In a best case, where sufficient processors are available for complete parallelism, MERRA/Max can screen a predictor data set of any size in the time it takes for a single MaxEnt run using only two predictors in the target collection. Collectively, these results confirm that near real-time performance and, in the vernacular of high-performance cloud computing, “infinite scalability” are achievable [105-107]. MERRA/Max’s run-time performance and scaling properties are consistent with what one would expect of an “embarrassingly parallel” workload, where subtasks are completely independent and able to run concurrently. The more important question is: Does this approach actually benefit science? We try to assess MERRA/Max’s potential value to science by addressing three interrelated questions:

(1) Is MERRA/Max making useful, ecologically plausible selections?

It is difficult to know how best to evaluate selection success in this work, given that we are proposing the Monte Carlo method as a preliminary screening step, which, presumably, would be followed by further refinements to the set of selected predictors based on the biology and ecology of the species, collinearity, or the other considerations that have traditionally guided ENM variable selection. Depending on the circumstances, post-selection refinement might mean additional winnowing, substitution or augmentation of the screened predictors with other variables, or that the selected variables are discarded altogether. That being said, the Bioclim use case seems to confirm that MERRA/Max’s selections are both valid and useful. In the most general, qualitative sense, the habitat suitability map produced by the final model in the Bioclim use case is consistent with what is known about Cassin’s Sparrow’s range from observational records [83] (Figs 3A and 4B).
Likewise, the set of selected variables is consistent with what is known of the species’ natural history. Cassin’s Sparrow is a desert-adapted, ground-dwelling (and, notably, ground-nesting) species, whose breeding biology is exquisitely linked to conditions of temperature and precipitation and their consequent influence on vegetation availability, insect abundance, and terrestrial microclimates [52, 56, 58, 61, 108–110]. In fact, field studies over the past century suggest that Cassin’s Sparrow is an itinerant breeder, so responsive to temperature and precipitation that birds make seasonal, inter-clutch moves within their range to find optimal conditions for breeding [52, 109, 111]. The Bioclim scenario’s ordered selection of bio18 (precipitation of the warmest quarter), bio03 (isothermality, i.e., temperature evenness, or how large the daily temperature variation is compared to its annual variation), bio05 (maximum temperature of the warmest month), bio13 (precipitation of the wettest month), and bio08 (mean temperature of the wettest quarter) is entirely consistent with this picture. It is also largely consistent with the variables assembled by the Salas team using a more traditional approach to variable selection [75]. Both sets have bio18 and bio03 in common, both of these variables are highly influential, and both are known to be important determinants of range in arid-adapted birds, especially in desert and grassland species of conservation concern [75, 85, 112–118]. Where the two predictor data sets differ, for example, bio05 (maximum temperature of the warmest month) in the Bioclim use case vs. bio06 (minimum temperature of the coldest month) in the Salas-derived model, bio13 (precipitation of the wettest month) vs.
bio12 (annual precipitation) and bio14 (precipitation of the driest month), an argument could be made, in light of Cassin’s Sparrow’s distinctive seasonal breeding dynamics, which are likely influenced by the North American Monsoon, that MERRA/Max found the more relevant predictors [119-121]. From a quantitative perspective, the final model in MERRA/Max’s Bioclim scenario (Fig 3A) demonstrated strong evaluation metrics (AUC, PCC, TSS of 0.83, 0.75, 0.49 respectively) that were almost identical to those obtained in the Salas-derived model (0.83, 0.76, 0.50) (Fig 4A) [122]. A high degree of similarity between the Bioclim use case and Salas-derived model is further confirmed by their AICc values (12,232 and 12,169 respectively) and the results we obtained for the D, I, and r statistics (0.974, 0.999, and 0.997 respectively) [80]. The top four predictors in the Bioclim scenario accounted for 98% of overall permutation importance in the final model; the top four predictors in the Salas-derived model accounted for 83%. While these results reflect an admittedly limited trial at this point, taken together, they suggest that MERRA/Max’s use in the Bioclim scenario produced a bioclimatic niche model for Cassin’s Sparrow that, within its training range, is ecologically reasonable, statistically robust, and at least as good as (if not better than) what might be obtained in a traditional application of MaxEnt. This gives us confidence that the Monte Carlo method is, in fact, finding a useful subset of predictors in a larger pool of possible predictors.

(2) Does MERRA/Max create new research opportunities?

Another measure of MERRA/Max’s potential value to science is to consider whether new avenues of research are opened up with this technology. We believe they are, and the MERRA-2 use case helps explain why. Here we have a situation where the size and complexity of the target collection, as well as the obscure nature of the data, make a priori variable selection difficult.
We have no immediate basis for distinguishing which variables in the MERRA-2 test collection might be the most important contributors to a final model. With 86 variables in play, even something as straightforward as correlation analysis provides little help, requiring nearly 3700 pair-wise comparisons in this case. Yet, going into this with no pre-vetting of the test collection whatsoever, we are struck by the ecological and biological relevance of the predictors selected by MERRA/Max in the MERRA-2 use case scenario. In the same way that the Earth’s climate is ultimately driven by a balance between incoming and outgoing energy, so too is the natural history of Cassin’s Sparrow linked to the energetics of the species’ diurnal and seasonal activities and the locations where those activities occur [58]. Viewed this way, it makes ecological sense to see that MERRA/Max identified M39 (surface net downward shortwave flux assuming a clear day) as the most important variable for modeling Cassin’s Sparrow’s potential habitat in the MERRA-2 test collection. Likewise, a unique aspect of Cassin’s Sparrow’s breeding biology is an energetically demanding skylark display in which males define and defend territories and secure mates by aerial flight songs. Wind has a pronounced impact on this behavior [58]. While its effectiveness as a proxy for surface conditions in this setting is unknown, it is notable that MERRA/Max identified a low-level meridional wind component, M03 (northward wind), as a top contributor. Finally, given Cassin’s Sparrow’s ground-dwelling habit and the importance of low-level environmental conditions to almost all aspects of the species’ life, it is not surprising to see M05 (specific humidity), M04 (temperature), and M81 (bare soil evaporation) in MERRA/Max’s selection set. These results must be interpreted with caution.
After all, even models based on meaningless variables can be classified as excellent according to widely used evaluation metrics [123], and high predictive accuracy does not necessarily connote robust inferential capacity [17]. What is more, MERRA-2 variables represent the low-level physical drivers of many of the Earth system’s biological processes [10, 124]: an interpretation of ecological plausibility could be made for almost any of the MERRA-2 variables. That being said, studies have shown that MaxEnt’s ranking of variable importance can capture biologically realistic assessments of factors governing range boundaries when models are built using best-practice procedures and variables are ranked based on permutation importance [7, 17]. And, with Cassin’s Sparrow, we have a species whose behavioral and energetic ecology has been studied in significant detail [58]. Of the many potential contributors in the MERRA-2 collection, the types of variables selected by MERRA/Max are known to be particularly important environmental influences for the species and are notably consistent with our mechanistic, process-based understanding of the bird’s natural history [52, 56, 58, 61, 125]. Further, in the use case scenario’s final MaxEnt model, we see a reasonable habitat suitability map based on our understanding of Cassin’s Sparrow’s current range (Fig 4B), a relatively robust set of metrics (AUC 0.83, PCC 0.72, TSS 0.44), and the top four contributing variables accounting for a significant proportion (i.e., 97%) of overall permutation importance (Fig 3B). While an interpretation of causality between selected variables and species occurrence may not be supported by the data currently at hand, it does appear that MERRA/Max and the Monte Carlo selection method have detected a signal in the MERRA-2 data that has both biological and statistical significance for Cassin’s Sparrow [126, 127]. What about the larger question regarding new research opportunities? 
A particular type of question that might be better addressed with MERRA/Max’s combination of data and technology concerns the conservation status of arid-adapted birds. Accurate assessments are often difficult with these species [85, 112–118, 128]. With Cassin’s Sparrow, for example, numerous studies over the past half-century have painted a confusing picture. Some find evidence for a retraction of viable habitat and declining regional populations [61], others find mixed results and too little data to establish with confidence an overall conservation status [56], and many sources identify the species as stable and of little immediate worry [129-132]. In recent work, of nine grassland birds of conservation concern, Cassin’s Sparrow was the only species to project gains in suitable habitat over the next fifty years [75]. This ambiguous picture is not unique to Cassin’s Sparrow; however, in this case, the species’ itinerant breeding habit no doubt contributes to the confusion: it is simply impossible to know what part of the bird’s population one is seeing at any given time. Understanding the conservation status of a bird like Cassin’s Sparrow means being able to tease apart the species’ physiological capacity for seasonal response to weather from the species’ longer-term, adaptive response to a changing climate’s effect on the landscape [116]. Ultimately, one would like to distinguish short-term transformations in the bird’s bioclimatic niche within a coherent, long-running temporal framework. Historical and multi-temporal scale modeling are not new [133-136]; however, with four decades of climate attributes modeled on an hourly basis, reanalyses, such as MERRA-2, are uniquely able to provide the high-temporal resolution, longitudinal environmental data for this. A technology like MERRA/Max transforms MERRA-2 into a viable experimental sandbox. And, thanks to long-running citizen-scientist efforts, such as the U.S. 
Geological Survey’s North American Breeding Bird Survey (BBS) [137, 138]; Cornell University’s eBird, Great Backyard Bird Count Surveys, and other projects [74, 139, 140]; Audubon’s Christmas Bird Counts [141, 142]; and the wealth of online museum specimen records in resources like the Global Biodiversity Information Facility (GBIF) [143], the NSF-funded VertNet databases [144], and the U.S. Geological Survey’s BISON species occurrence database [145], there now exists widespread availability of avian observations that provide good coverage for the MERRA-2 time span, making multi-temporal scale investigations like this possible [146]. Another dimension of conservation research that could potentially be advanced with a technology like MERRA/Max is the modeling of rare or endangered species. Rare species are among the most in need of predictive distribution modeling but are often the most difficult to model. Known as the “rare species modeling paradox” [147], these species generally have a low number of occurrence records, which can lead to model over-parameterization and overfitting if too many predictors are used [11, 148, 149]. A new strategy using ensembles of small models (ESMs) was recently developed to overcome this limitation. It involves fitting many two-variable models, filtering the results against a weighted AUC-based performance threshold, then averaging the remaining models to produce an ensemble average model. The approach is particularly useful when applied to rare species, because it simultaneously winnows the starting pool of predictors while generating a final model in which the number of predictors has been kept low in each of the ensemble’s contributing models [150, 151]. MERRA/Max also uses bivariate ensembles in its Monte Carlo approach to variable selection.
In contrast to the ESM method, however, MERRA/Max discards its ensemble models after tallying the permutation importance of each model’s two variables, producing in the end a small selection of top contributing variables for further consideration. By separating variable selection from final model construction, MERRA/Max provides the modeler with greater latitude in the overall ENM process, offering, in a sense, a supervised approach that could enable more carefully crafted results. Finally, spatiotemporal projection is a critical element of conservation research that could also potentially benefit from MERRA/Max. ENMs are commonly used to predict the impact of climate change on biodiversity. The reliability of those predictions, however, depends on a model’s transferability in space and time, which, in turn, is influenced by variable selection [152-154]. While the work presented here has focused exclusively on performance evaluations within the calibration range, it is important to note that MERRA-2’s selected variables, along with similar GCM outputs, form the basis for the IPCC’s research activities. As a consequence, models using MERRA/Max-selected variables are particularly well-suited for extrapolative studies using data from IPCC’s Global Projection scenarios [155-159]. We close this section by considering briefly the most apparent technical shortcoming in the MERRA-2 use case: the inherently coarse spatial resolution of reanalysis data. Given that species’ responses to the environment are scale dependent, there is a recognized need within the ecological research community for higher resolution reanalysis products, to which various efforts are now responding [158, 160–162]. Increasing evidence, however, shows that both coarse and fine resolution variables are important across scales [163]. In the context of rapid prescreening, in particular, we feel that MERRA-2’s coarse resolution data serves an important purpose. 
Selection times are fast with coarse-resolution data, MERRA/Max-selected variables are relevant, and many resolution shortcomings in the selected variables can be addressed in the refinement step, either by downscaling variables of interest or going to an alternative source for a higher resolution product, such as remote sensing data [164, 165] or NatureServe’s high-resolution data sets [166].

(3) Can MERRA/Max improve the ecological niche modeling process?

Finally, in evaluating potential benefits to science, we can consider whether a tool like MERRA/Max could improve the work practices of ecological modeling. Here again, we think it can. There is heightened awareness of the significance of dimensionality in understanding environmental spaces and the importance of variable selection in modeling those spaces [15, 18, 167]. This awareness is accompanied by a recognition that logistic difficulties often preclude examining large numbers of variables, which has led to a search for alternative means of variable selection and calls for process automation [10, 17, 18, 88, 168]. A comprehensive review of these approaches is beyond the scope of this paper. In general, however, they include greater use of biological insight and expert knowledge in the selection of predictor variables; reliance on manual or statistical analysis of the published literature to identify predictors; use of statistical algorithms for variable prescreening based on cluster analysis, collinearity reduction, or calibration-/projection-range analogue analysis; and various types of classical principal component analysis (PCA) [153].
To understand where a technology fits within this conceptual framework, it is important to note that any selection process that involves human intervention is difficult or impossible to automate; in-core statistical software tools are inherently unscalable to large data sets; whether implemented as in-core or out-of-core, compute-intensive algorithms, such as PCA, are likewise not scalable; and mathematical approaches to feature reduction that operate on predictors in isolation from the feature of interest, i.e., the dependent variable, are not directly influenced by the underlying biology or ecology of the species being studied, which may limit the insights one might otherwise gain in the modeling process. MERRA/Max attempts to overcome all these limitations. Finally, Cobos et al. [92] provide a helpful framework for understanding where a technology like MERRA/Max could fit in the overall ENM workflow (Fig 5). The work of ENM can be thought of as a multi-step process ranging from initial data preparation and cleaning, to model calibration, final model construction, model evaluation, and the assessment of extrapolation risk. Among the tasks associated with data cleaning, the selection of viable predictors is crucial, time-consuming, and the place where a means for rapid, automatic, preselection could improve the overall workflow, especially if it enabled exploration of a large universe of predictors.
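The Monte Carlo prescreening idea discussed in this section (many independent two-variable runs whose permutation importances are tallied to rank the full collection) can be sketched in a few lines. This is a toy illustration under loud assumptions, not the MERRA/Max implementation: `toy_importance` is a hypothetical stand-in for a MaxEnt run that splits 100% importance in proportion to made-up hidden weights, the tally is taken as a per-variable mean, and the runs execute sequentially here although they are independent and could run in parallel.

```python
import random
from collections import defaultdict


def toy_importance(pair, weights):
    """Hypothetical stand-in for one two-variable MaxEnt run: returns a
    permutation-importance split (summing to 100) for the two variables."""
    a, b = pair
    total = weights[a] + weights[b]
    return {a: 100.0 * weights[a] / total, b: 100.0 * weights[b] / total}


def screen(variables, score_pair, n_runs, top_k, seed=0):
    """Monte Carlo screening: sample random variable pairs, tally the
    importance each variable receives, and return the top_k by mean."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(n_runs):
        pair = tuple(rng.sample(variables, 2))   # two distinct variables
        for var, importance in score_pair(pair).items():
            totals[var] += importance
            counts[var] += 1
    mean_importance = {v: totals[v] / counts[v] for v in counts}
    ranked = sorted(mean_importance, key=mean_importance.get, reverse=True)
    return ranked[:top_k]


# Twenty made-up variables, three of which carry a strong hidden signal.
variables = [f"v{i:02d}" for i in range(20)]
weights = {v: 1.0 for v in variables}
for strong in ("v03", "v07", "v11"):
    weights[strong] = 10.0

top = screen(variables, lambda pair: toy_importance(pair, weights),
             n_runs=500, top_k=3)
print(sorted(top))
```

Because the scoring function sees only the two variables in each run, every run is independent of the others, which is the property that makes the workload embarrassingly parallel. Replacing `toy_importance` with a call to an actual niche model would be the point at which species occurrence data enter the selection.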
Fig 5

Ecological niche modeling (ENM) process.

Schematic description of the ENM process. Color bars under each step reflect an approximate amount of time that may be needed, ranging from low (blue) to high (red). The use of MERRA/Max to prescreen a large collection of predictors could support variable selection in the data cleaning step. Image provided by [92] and adapted for use here under a CC-BY license.

Future work

Our plans for future work are being shaped by a vision where the technical complexities described here are abstracted away from the end-users, and low cost, easy access, and simplicity make MERRA/Max a practical and useful tool for the conservation research community. On the technology front, our next steps will focus on developing a “cloud bursting” capability that will allow MERRA/Max ensembles to migrate from NASA’s private ADAPT science cloud to a public commercial cloud in response to resource demands that outstrip local capacity. This will allow us to scale MERRA/Max to larger data sets and more demanding science questions. For example, we would like to make better use of our experimental MERRA-2 test collection, which spans 40 years and includes monthly and weekly maximum, minimum, and average values for each of the collection’s 86 variables, a total of N = 660,480 files. Beyond that, expanding MERRA/Max to accommodate all of MERRA-2’s 600-plus variables is a more challenging long-term goal that could open interesting new research opportunities for the ENM community. Other technical improvements on the horizon include the use of refined selection criteria and automatic stopping rules that would enable true convergence in the selection process. Finally, given that each step in our use case scenarios is carried out by a program that could readily participate in an orchestrated workflow, we would like to know whether a practical level of automation for the entire MaxEnt-enabled ENM process might be possible. On the science front, we plan to pursue two distinct threads of development. 
First, we want to follow up on the example posed in the previous section and see if MERRA/Max and the MERRA-2 test collection could provide a better understanding of the true conservation status of Cassin’s Sparrow by temporally filtering occurrence data and environmental predictors to the breeding season and evaluating fine-grained changes in the species’ bioclimatic niche over the past four decades. Pursuing that question for Cassin’s Sparrow alongside other grassland birds of conservation concern, such as those studied by Salas et al. [75], would provide the additional benefit of extending our experiences to other species and would allow us to test the ecological and conservation applicability of this technology, which we view as an important next step. The second thread of science development will examine the extensibility of this approach to other types of problems. For example, we have used MERRA/Max and MaxEnt to study the hydrological cycle in NASA’s Arctic-Boreal Vulnerability Experiment (ABoVE) [169]. Changes to the hydrological cycle in the Arctic are particularly complex, because observed and projected warming directly impacts permafrost and leads to variable responses in surface water extent [170]. In preliminary work, using the locations of observed increases and decreases in surface water extent as dependent variables (in essence, treating them as “pseudo” species occurrences), the technologies and techniques described here successfully replicated observed patterns of surface water change [171]. If these findings are validated by further experimentation, the view of how MERRA/Max, the MERRA-2 reanalysis, and MaxEnt might be applied to studies of climate change and its impact on the Earth system becomes significantly broadened.
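The size quoted above for the experimental MERRA-2 test collection (N = 660,480 files) can be reproduced with simple arithmetic. The breakdown below is an assumption chosen because it matches the stated total: one file per variable, per statistic, per period, with 12 monthly and 52 weekly periods in each of the 40 years.

```python
years = 40
monthly_periods = 12      # assumed: one monthly file per calendar month
weekly_periods = 52       # assumed: 52 weekly files per year
statistics = 3            # maximum, minimum, and average values
variables = 86

files = years * (monthly_periods + weekly_periods) * statistics * variables
print(f"{files:,}")
```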

Conclusions

In this paper, we have described a prototype system called MERRA/Max that implements a feature selection approach to dimensionality reduction that is specifically intended to enable direct use of GCM outputs in ENM. The system accomplishes this reduction through a Monte Carlo optimization in which many independent MaxEnt runs, operating on a species occurrence file and a small set of variables randomly selected from a large collection of variables, converge on an estimate of the top contributing predictors in the larger collection. These top predictors become candidates for consideration in the variable selection step of the ENM process. MERRA/Max’s Monte Carlo algorithm operates on files stored in the underlying filesystem and is thus scalable to large data sets. We implemented its program components using open-source and commercial off-the-shelf software. These components can run independently as parallel processes in a high-performance cloud computing environment to yield near real-time performance. Within this framework, variable selection is guided by the indirect biological influences injected into MERRA/Max’s feature reduction process by the species occurrence files. We find evidence for this tailoring of results in our use case scenarios. In preliminary tests using a single bird species and observations from a single year, MERRA/Max selected reasonable and familiar climatological predictors from the classic Bioclim collection of environmental variables. MERRA/Max also selected biologically and ecologically plausible predictors from a larger and much more diverse set of environmental variables derived from NASA’s MERRA-2 reanalysis.
Our experience is limited at this point, but we feel that these results point to a technological approach that could expand the use of GCM outputs in ENM, foster exploratory experimentation with otherwise difficult-to-use climate data sets, streamline the modeling process, and, eventually, enable automated bioclimatic modeling as a cloud service.

MERRA/Max parameterization.

Overview of MERRA/Max’s default screening parameters and supporting documentation. (PDF)

Data and scripts.

Compressed file folder containing input data and example scripts used in the study. (ZIP)