
Fine scale prediction of ecological community composition using a two-step sequential Machine Learning ensemble.

Icíar Civantos-Gómez1,2, Javier García-Algarra3, David García-Callejas4,5, Javier Galeano2, Oscar Godoy4, Ignasi Bartomeus5.   

Abstract

Prediction is one of the last frontiers in ecology. Indeed, predicting fine-scale species composition in natural systems is a complex challenge as multiple abiotic and biotic processes operate simultaneously to determine local species abundances. On the one hand, species intrinsic performance and their tolerance limits to different abiotic pressures modulate species abundances. On the other hand, there is growing recognition that species interactions play an equally important role in limiting or promoting such abundances within ecological communities. Here, we present a joint effort between ecologists and data scientists to use data-driven models to predict species abundances using reasonably easy to obtain data. We propose a sequential data-driven modeling approach that in a first step predicts the potential species abundances based on abiotic variables, and in a second step uses these predictions to model the realized abundances once accounting for species competition. Using a curated data set over five years we predict fine-scale species abundances in a highly diverse annual plant community. Our models show a remarkable spatial predictive accuracy using only easy-to-measure variables in the field, yet such predictive power is lost when temporal dynamics are taken into account. This result suggests that predicting future abundances requires longer time series analysis to capture enough variability. In addition, we show that these data-driven models can also suggest how to improve mechanistic models by adding missing variables that affect species performance such as particular soil conditions (e.g. carbonate availability in our case). Robust models for predicting fine-scale species composition informed by the mechanistic understanding of the underlying abiotic and biotic processes can be a pivotal tool for conservation, especially given the human-induced rapid environmental changes we are experiencing. 
This objective can be achieved by combining the knowledge gained from classic modelling approaches in ecology with recently developed data-driven models.


Year:  2021        PMID: 34871304      PMCID: PMC8675934          DOI: 10.1371/journal.pcbi.1008906

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.475


Introduction

In the face of human-induced rapid environmental change, the ability to predict species responses to environmental change within a community context is more pressing than ever [1]. However, fine scale prediction is a recognized weak spot in ecology [2-6]. Within the realm of community ecology, most prediction efforts rely on a mechanistic understanding of how multiple abiotic and biotic processes regulate species population dynamics [7]. In particular, theoretical frameworks centered around the study of the determinants of species coexistence and the development of mechanistic models that take into account the effects of the environment and species interactions on the maintenance of biodiversity are an active field of research [8]. These recent developments point out ecological processes that drive the dynamics of interacting species such as those occurring in plant competitive networks [9-11]. Moreover, this body of theory has also shown direct applications to better predict species abundances under controlled experimental conditions [12, 13]. Yet, current theory and associated modelling tools fail in most cases to accurately predict basic features of ecological communities observed in nature such as species abundances, composition, and species turnover in space and time [14]. To overcome this limitation, there is a recent call to address the complexity of multispecies processes occurring in nature [15, 16]. However, a major stumbling block to advancing on this front is parameterizing and validating those models in real communities, which is currently prohibitive due to the complexity of estimating all parameters with confidence from observational data [17]. To tackle the trade-off between model complexity and data availability, we aim to develop an alternative approximation: a mechanistically informed data-driven approach that allows us to achieve predictive power with affordable data requirements.
In a nutshell, existing phenomenological approaches that summarize well-known mechanistic processes require feeding models that describe the population dynamics of interacting species with information about 1) the intrinsic ability of species to grow in the absence of interactions, 2) the strength of intra- and inter-specific interactions, and 3) how these two sets of parameters change in the presence of different abiotic and biotic variables such as soil conditions or multitrophic species interactions (e.g. pollinators, herbivores) [18, 19]. This is in most cases unfeasible for two reasons: 1) we need to gather detailed information under natural conditions, which for many systems is impossible due to the long lifespan of species or the inability to detect and quantify the strength of species interactions, and 2) this approach considers that all species within a community can potentially interact with one another [17, 20]. As the number of parameters to estimate scales exponentially with the number of species in the community, estimating all parameters for large communities quickly becomes an intractable problem. Moreover, because species abundances are not likely to vary independently (i.e. the population sizes of species A, B, and C covary), it is often difficult to estimate with confidence the strength and sign of many inter-specific parameters. Even if we find a suitable ecosystem to parameterize these models, gathering all required information is labor intensive and highly time consuming. Hence, to resolve this conundrum, we cannot rely simply on gathering more and better data. We also need simpler models, and to search for indirect methods to obtain enough information to be predictive. A key challenge, for example, is that mechanistic models often require empirical data that are not easy to measure [21]. Hence, we need models that move closer to what we can actually measure in the field. But how to capture complex systems with simpler models?
Fortunately, there is a possibility worth exploring. The problem of inferring key behaviours from complex data has been solved using Machine Learning approaches. Machine Learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. In the past decade, Machine Learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome [22-26]. However, its potential has been unleashed mostly in applied domains, as predictions made with Machine Learning approaches often lack the interpretability needed to explain the mechanisms behind the algorithm's decisions. As scientists, we are often uncomfortable with predictions that have no theoretical basis [27]. However, we can combine the power of data-driven models with stronger theoretical foundations [28]. Here we address this issue by partnering ecologists and data scientists to develop an efficient and predictive data-driven model rooted in the known ecological mechanisms thought to explain species occurrences and abundances at local scales. First, we explain the core problem, then we propose a solution, and finally, we test the predictions against a well-resolved data set consisting of five years of observations describing the community composition of 23 species co-occurring in a Mediterranean annual grassland.

The problem

To predict species abundances within a community context, we know that different abiotic factors determine species performance and their tolerance limits [29], from which one can derive potential species abundances [30]. However, we also know that the final species fate will be modulated by the positive and negative interactions established among and within the species able to grow in a particular place [31, 32]. Of course, stochastic processes arising for instance from dispersal events or random birth and death dynamics [33, 34] are also recognized to have increasing importance in modulating species persistence, but for a first approximation and for the sake of simplicity they are not included in the modelling approach developed here. This is justified as many annual study systems (including ours, see below) complete their life-cycle within a year and "re-start" anew each year. Hence, mechanistic models to understand species population dynamics and their ability to persist in the long run are often formalized as a set of coupled equations where each response variable (i.e. population size of a given species at a given time and location) depends on and modifies the outcome of the rest of the response variables (i.e. population sizes of this and other species) [31, 32]. A clear example using the standard Lotka-Volterra equations is the persistence of the populations of three plant species following rock-scissors-paper dynamics [35, 36], in which each species has to win and lose simultaneously against different competitors in order to avoid the collapse of the system. This kind of circular dependence requires measuring all parameters for all species to be able to estimate their behaviour. Even when these parameters are correctly measured after long hours in the field, the predictive power of such mechanistic models is still very low (see Section 1 in S1 Text).
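To make the rock-scissors-paper example concrete, the following minimal sketch iterates a discrete-time Lotka-Volterra (Beverton-Holt style) competition model for three species with cyclic dominance. All growth rates and interaction coefficients are made-up illustrative values, not parameters from any real system:

```python
import numpy as np

# Illustrative parameters only: equal intrinsic growth rates and a cyclic
# competition matrix where each species suppresses the next one strongly
# (alpha = 1.4) and the previous one weakly (alpha = 0.6).
r = np.array([1.5, 1.5, 1.5])
alpha = np.array([
    [1.0, 1.4, 0.6],
    [0.6, 1.0, 1.4],
    [1.4, 0.6, 1.0],
])

def step(x):
    """One season: x_i(t+1) = x_i(t) * r_i / (1 + sum_j alpha[i, j] * x_j(t))."""
    return x * r / (1.0 + alpha @ x)

x = np.array([0.5, 0.3, 0.2])
for _ in range(200):
    x = step(x)

# The multiplicative update keeps all abundances strictly positive. Note that
# computing even one step requires every competitor's abundance: this is the
# circular dependence discussed in the text.
print(x)
```

The point of the sketch is the data requirement, not the dynamics: every update of a single species needs the full vector of competitor abundances and the full interaction matrix.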
In our particular scenario, the mechanistic hypothesis is that the abundance of any given plant species is influenced by the environment (e.g. precipitation, soil properties) and the abundance of competitors in that particular season. In mathematical terms, given a subplot k (we drop this index to simplify the notation), the predicted abundance of species j in the subplot k (our spatial sampling unit) at a given season t (year) is:

X_j(t) = f(A_1(t), …, A_n(t), X_1(t), …, X_{m-1}(t))    (Eq 1)

where f is a function of n abiotic variables and m − 1 abundances of competitors, excluding individuals of X_j(t). Alternatively, it is possible to use data-driven predictive models where the response variable is a function of abiotic and biotic features. While this distinction among features is ecologically important in terms of the ultimate mechanisms driving species abundances, from the point of view of the data scientist that distinction is not relevant, as long as the model behaves properly. The predictive model is just a special class of function:

\hat{X}_j(t) = g(A_1(t), …, A_n(t), X_1(t), …, X_{m-1}(t))    (Eq 2)

Here g is a supervised predictive model, trained with the values of the n abiotic variables and the m − 1 abundances of competitors, excluding X_j, from the initial season t_0 to t. We denote it g to make clear that it does not belong to the set of mechanistic models of Eq 1. Once the model is built, one simply feeds it the values of the features at subplot k during season t to predict \hat{X}_j(t). If the available data set includes all these values, the data engineer enjoys a wide range of chances to pick a subset of features to train and tune the model. For instance, if the set includes a long enough series of data recorded during previous seasons t_0 to t, one can train the model with that set and predict the abundances at subplot k during season t + 1, given that all the covariates are available and that subplot k belongs to the sampled set. That is what we call the temporally trained predictive model:

\hat{X}_j(t+1) = g(A_1(t+1), …, A_n(t+1), X_1(t+1), …, X_{m-1}(t+1))    (Eq 3)

Note that we do not try to use a time-series approach.
You could also use that same model to predict the abundance of X_j(t) at a non-sampled subplot l. That is the spatially trained predictive model:

\hat{X}^l_j(t) = g(A^l_1(t), …, A^l_n(t), X^l_1(t), …, X^l_{m-1}(t))    (Eq 4)

where the predictive model is the same as in Eq 3, but the covariates take the values at subplot l, season t, instead. While abiotic variables are often easy to measure, obtaining spatially explicit data on species abundances for the whole community is prohibitive, and in fact, it would be equivalent to measuring community composition to predict community composition. If you want to predict the aforementioned abundance you need the values of X^l_1(t), …, X^l_{m-1}(t). In any case, and for the sake of being pedagogic, we start by testing the scenario where the full data set is available, and the field team recorded a detailed sample of species abundances and abiotic parameters for each subplot. In this case, it is simple to build a predictive model that works for a nearby piece of land, where all those variables are known: this is the very essence of Machine Learning. So, the abundance at t of species j in a given subplot, whose field data are known but were not used to build the model, could be estimated by Eq 2 by feeding the model with the measured abiotic variables and competitor abundances at that spot. Prediction gets harder when trying to apply the model to a real-world scenario. For example, how do we know in advance the abundance of competitor individuals elsewhere in the community? Eq 4 is deceptive because, while the abiotic variables are relatively simple to measure for the subplot we are interested in, none of the biotic covariates are known at t. Similarly, if we want to apply the model to predict the abundances of the incoming season (Eq 3), abiotic features A(t + 1) may be gathered without an extraordinary effort, but we would have to wait until we record the number of individuals of each competing species X_1(t + 1), X_2(t + 1), … at subplot k. For that task, the predictive model would be less useful.
To put it bluntly, imagine you have sampled 100 areas (i.e. subplots) out of 10000 to build a detailed map of the density of one species that has 20 competitors. The sample is representative of the population and there are no quality issues. Even with that optimal starting point, you would need to count the individuals of competitor species for each of the unknown 9900 plots. The only way to avoid that time-consuming task is to predict those abundances, but as each predictor includes the abundances of competitors, the problem is recursive. A possible strategy to overcome the deadlock is dropping the conflicting variables. That is, getting rid of the species abundances and relying just on abiotic data, which are easy to measure:

\hat{X}_j(t) = g_A(A_1(t), …, A_n(t))    (Eq 5)

This model is valid to predict for an unknown plot at t or for one of the sampled plots at t + 1 if we know the values of the set A(t + 1). From an ecological perspective this model ignores direct species interactions. For the data scientist, feature engineering is a common procedure to build and test different models. Data sets have redundant information and dimensionality reduction is often desirable. Therefore, we can start building a model with only abiotic predictors. Even from an extreme data-centric approach, this solution looks very weak for this predictive challenge. But weak doesn't mean useless. A smart mix of weak models may produce an accurate predictor; that is the basis of ensemble methods [37]. This first model generates a set of competing species abundances driven only by abiotic factors. In a second step, we predict again species abundances with the same abiotic data and the predicted abundance of competitors modeled in step one. Thus, we end up with a two-step predictor that is an ad-hoc ensemble method for this scenario. In the first step, from the abiotic conditions at year t, which are easy to measure for each subplot k, we predict \hat{X}_j(t), the abundance of each competing species j, ignoring the biotic interactions as in Eq 5.
In the second step, we combine those observed abiotic features with the predicted, biotically constrained abundances \hat{X}_1(t), …, \hat{X}_{m-1}(t):

\hat{\hat{X}}_j(t) = g(A_1(t), …, A_n(t), \hat{X}_1(t), …, \hat{X}_{m-1}(t))    (Eq 6)

where g represents the model of Eq 2. The main difference with the one-step model is that competitor species abundances X(t) are replaced by their predicted values \hat{X}(t). This procedure is valid for spatially-trained models (abundance for a subplot l at year t) or temporally-trained models (for a recorded subplot k at year t + 1) once the abiotic magnitudes have been recorded and the proper training sets are chosen.

Materials and methods

Data description

We tracked during five years (2015–2019) the local abundances of individuals of 23 annual plant species distributed along 9 plots (plot size 8.5 m × 8.5 m) located along a salinity gradient of 1 km long by 800 m wide in a highly diverse Mediterranean grassland at Doñana National Park (SW Spain, 37º 04´ N, 6º 18´ W). The plots are placed, on average, > 100 m apart from each other. Each plot is further subdivided into 36 subplots of 1 m2. For each of these subplots, we compiled across each of these five years the number of adult individuals of each plant species at their phenological peak (i.e. when at least half of the individuals are in bloom). This period extends, on average and across species, from February to June each year. Thus, overall, we gathered abundance data from 36 subplots in each of the 9 plots, during 5 years, for a total of 1620 plant communities. These subplots represent the basic unit of our study and their scale is appropriate given the small size of the annual plants and the high micro-habitat heterogeneity. For example, the plots in the upper part are rarely flooded, whereas those in the middle and lower parts are annually flooded by vernal pools. This spatial configuration of plots allows capturing small scale variation (due to the different soil conditions created by salinity, among other variables) as well as large scale variation (induced by vernal pools) in the dynamics of annual plant communities in our system. In addition, we empirically measured at the subplot level an array of physical and chemical soil properties at the beginning of the survey (spring 2015) to characterize the abiotic properties of each community (one soil sample per subplot, see Table B in S1 Text for a summary). These values are kept from year to year for this study as they are stable through time in this type of environment.
Finally, we obtained annual precipitation values for each year and the whole study area from a nearby weather station maintained by the regional government "Junta de Andalucía" (El Rocío-Almonte, about 10 km away). Soil data were only recorded during 2015 because these are soils with a high content of clay whose properties vary little from one year to another. Therefore, although we acknowledge that within the five years of study some soil properties might have changed, we assume this variation is of little magnitude compared to other abiotic variation such as precipitation and flooding. As an initial data assessment, we performed an exploratory analysis, studying abundance distributions for each species and the relationship between their averages and variances. In total, the data set contains abundance values for 37240 species-plot combinations. The distribution of abundances is extremely skewed due to 75.6% of values being zero. Each zero means that the field team did not find any individual of the particular species inside the sampled subplot during that season. Most species are scarcely represented because they were only recorded in some years and in some particular plots of the soil salinity gradient. This is a well-known issue in spatial distribution models [38]. Even if zero values were ignored, the uneven distribution of abundances would remain, as generally expected from species-abundance distributions (Fig 1A). The mean value and the variance of abundances scale with each other. This phenomenon is known as Taylor's Law and, in our case, the scaling has an exponent of 2.15 with an adjusted R2 = 0.92 [39]. Taylor's Law appears in different contexts in ecology with exponents close to 2, as in this case [40, 41], which suggests that our sampling is representative of empirical community structures. In any case, no sample is discarded to build the predictors.
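The mean-variance scaling can be checked with an ordinary least-squares fit on log-transformed values. The sketch below uses synthetic negative-binomial counts as a stand-in for the field data; the exponent 2.15 reported above comes from the real data set, not from this toy example:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the abundance matrix: 23 species x 200 samples,
# with overdispersed (negative binomial) counts so variance grows faster
# than the mean, as Taylor's Law describes.
true_means = rng.uniform(1, 50, size=23)
counts = rng.negative_binomial(n=2, p=2 / (2 + true_means[:, None]), size=(23, 200))

m = counts.mean(axis=1)
v = counts.var(axis=1, ddof=1)

# Taylor's Law: v = a * m^b, i.e. log(v) = log(a) + b * log(m),
# so b is the slope of a straight-line fit in log-log space.
b, log_a = np.polyfit(np.log(m), np.log(v), deg=1)
print(f"estimated exponent b = {b:.2f}")  # typically between 1 and 2 for these toy counts
```

On the real data, `m` and `v` would simply be the per-species mean and variance of the subplot counts.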
Fig 1

Species abundances.

A: Boxplots of the distribution of individuals for each species, highlighting the median value. B: Scatter plot of the mean vs. variance for individuals by species, and regression line to check how they fit Taylor’s Law.


Methods

Regression models

We implemented in Python three regression models to tackle the problem of predicting species abundances (see a full list of packages at the end of this document). Linear Regression, Random Forest regression, and XGBoost all use the equation presented in Eq 6. Specifically, the Linear Regression Model (LRM) is the simplest choice to achieve a balance between interpretability and precision. It explains the outcome as a function of the multiple input features and has inspired many mechanistic models. This simple model provides fair results when the underlying function is linear or there are linear combinations of features. We also used more flexible models to improve results. Random Forest Regression (RFR) is a tree-based ensemble method and belongs to the family of Classification and Regression Trees (CART) [42]. It combines the predictions from multiple weak trees to make accurate predictions [43]. A random subset of samples is drawn with replacement from the training sample. All of them have the same distribution. These randomly selected samples grow decision trees, and the average of their predictions yields the model's outcome [44]. Alternatively, XGBoost (eXtreme Gradient Boosting) relies on the concept of gradient tree boosting [45, 46]. Boosting is a sequential algorithm that makes predictions for T rounds on the entire training sample and iteratively improves the performance of the boosting algorithm using the information from the prior round's prediction accuracy. It is faster to train and less prone to overfitting than a Boosted Regression Tree (BRT) [47]. XGBoost produces black-box models, hard to visualize and tune compared to RFR. Note that our aim is not to compare performance across a wide range of modelling techniques, but to show how different modelling approaches, ranging from simple linear regression to more complex XGBoost, can be explored within our framework.
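A minimal sketch of such a comparison on synthetic data, assuming scikit-learn is available; sklearn's GradientBoostingRegressor stands in for XGBoost to keep the example dependency-light, and all data and hyperparameters are illustrative, not the authors' settings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                      # stand-in abiotic features
y = 5 + X[:, 0] * X[:, 1] + rng.normal(size=500)   # non-linear (interaction) target

# The same 80/20 random split used in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Linear": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=200, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),  # XGBoost stand-in
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))
    print(name, round(scores[name], 3))
# The tree ensembles capture the x0*x1 interaction that the linear model cannot.
```

The qualitative pattern mirrors the one reported later in the Results: the linear model struggles with non-linear structure while the tree ensembles do not.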
One common feature of all these methods is that they are sensitive to the random splitting of training and testing sets, which we set to an 80/20 ratio. We checked for the spatial autocorrelation of each species abundance and found that for all species Moran's I was low (I < 0.2). Hence, we do not further model the spatial component directly in our models, but we do take into account the spatial distribution in our training and testing sets. For each model we perform a 4-fold spatial cross-validation [48] using the K-Folds cross-validator provided by the Python package Verde [49]. In addition, we provide the results of 100 runs of such models.
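The intuition behind spatial cross-validation can be sketched with scikit-learn's GroupKFold as a dependency-light stand-in for Verde's blocked K-fold cross-validator: whole plots are held out together, so training and testing subplots never share a plot. The layout below (9 plots × 36 subplots) mirrors the sampling design; the data are synthetic:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(1)
n_plots, n_subplots = 9, 36
plot_id = np.repeat(np.arange(n_plots), n_subplots)   # spatial group per sample
X = rng.normal(size=(n_plots * n_subplots, 4))        # stand-in features

gkf = GroupKFold(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, groups=plot_id)):
    train_plots = set(plot_id[train_idx])
    test_plots = set(plot_id[test_idx])
    # The hallmark of blocked/spatial CV: no plot appears in both sets,
    # so spatial autocorrelation cannot leak from training into testing.
    assert train_plots.isdisjoint(test_plots)
    print(f"fold {fold}: test plots {sorted(test_plots)}")
```

Verde's cross-validator additionally blocks on coordinates rather than plot labels, but the leakage-prevention logic is the same.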

Mechanistic model

Finally, we show in S1 Text the implementation of a mechanistic model built on a population dynamics framework suited to characterize the dynamics of annual plant populations [17]. In our implementation, abiotic variables can potentially affect the intrinsic growth rates of the species modelled, as well as intra- and inter-specific interaction coefficients. In S1 Text we show how this model is improved when adding the effect of abiotic variables identified as important by the data-driven model.

Feature engineering

The original data set for this regression analysis includes 40 variables. There are 13 abiotic measurements: 12 of soil conditions (pH, total salinity, carbonates, organic matter, C/N ratio, and Cl, C, N, P, Ca, Mg, K, and Na concentrations; Table B in S1 Text) for each subplot, plus the annual precipitation, common to all plots. The additional 23 numerical features are the abundances of each species in the subplot (Table C in S1 Text). There is also a factor called species that corresponds to the identity of the plant species for which we want to predict its abundance. Note that we build a unique model that works for any focal species, so this factor must be kept to inform the predictor (hereafter we refer to the ABIOTIC and ALLFEATURES datasets in tables and plots). Decision tree methods, in particular Random Forests and Boosted Decision Trees, as well as Ridge Regression, are not much affected by multi-collinearity [50]. However, since it is good practice to remove redundant features from any data set used for training, we used Spearman correlation as a filter-based feature selection method. In addition, for the three models (Linear Regression, Random Forests, and XGBoost), we ran a filter feature selection procedure to drop those variables that are less relevant for the outcome [51]. The permutation importance technique tests the performance of a model after removing each feature and replacing it with noise [52].
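The permutation-importance idea can be sketched with scikit-learn's permutation_importance, which shuffles each feature rather than literally replacing it with noise (a closely related procedure). Data and column roles below are synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
# Only the first two columns carry signal; columns 2-4 play the role of the
# random-noise benchmark features described in the text.
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=400)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance {imp:.3f}")
# Features whose importance falls below the noise columns would be dropped.
```

In the paper's procedure, a feature scoring below an explicitly added random-noise column (as in Table 1) is a candidate for removal.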

Model evaluation

To assess the performance of the regression models we compute the Root Mean Square Error (RMSE) and the coefficient of determination R2 [53]. RMSE is a distance between the vectors of recorded values (y) and predicted values (\hat{y}):

RMSE = \sqrt{ (1/N) \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 }    (Eq 7)

The coefficient of determination R2 is the proportion of variation of the response variable explained by the regression compared to a null model:

R^2 = 1 - \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 / \sum_{i=1}^{N} (y_i - \bar{y})^2    (Eq 8)

The second term of Eq 8 is the Relative Squared Error (RSE): it normalizes the squared error of the model by the total squared error of the null (mean) predictor.
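Eqs 7 and 8 translate directly into code; a minimal sketch with a small made-up example:

```python
import numpy as np

def rmse(y, y_hat):
    """Root Mean Square Error (Eq 7)."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def rse(y, y_hat):
    """Relative Squared Error: model squared error divided by the squared
    error of the null (mean) predictor."""
    return np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

def r2(y, y_hat):
    """Coefficient of determination (Eq 8): R^2 = 1 - RSE."""
    return 1.0 - rse(y, y_hat)

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
y_hat = np.array([2.5, 1.5, 3.5, 1.5, 4.5])
print(rmse(y, y_hat))            # 0.5
print(round(r2(y, y_hat), 4))    # 0.9023
```

The RSE form is also what is plotted later (RSE = 1 − R2) for the per-species error distributions.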

The two-step model

As we mentioned above, the prediction of abundances in this scenario poses a major challenge because the problem is recursive. To predict the abundance of species X we need to know in advance the abundance of each of its competitors, but those abundances are dependent on the rest of the species as well. To solve this limitation, and given the fact that soil features and annual rainfall are easier to obtain, a predictor that can dispense with all abundances is more practical, at the price of reduced predictive power. Dropping that information is equivalent to ignoring direct interactions among species. That would be unacceptable for a mechanistic model as too naïve a simplification, but Machine Learning has developed strategies to deal with this kind of hindrance. Stacked models are a kind of ensemble model that performs sequential learning [54]. Predicted values of stage n are fed as features to stage n + 1, mixed with the original features. We have built a two-step sequential model following this idea. This stacked generalization predicts the abundances of competing species using the abiotic Random Forest model (first step) and then binds these predicted columns to the abiotic set to train the full-featured predictor (second step). During the first step, the model is trained with the abiotic data and predicts the abundance of competitor individuals. These predictions may be weak, but by combining them with the abiotic variables we can use this semi-synthetic data set to train an all-features model to perform the final prediction. This can be applied with any modeling tool, and we exemplify it here using Linear Regression, Random Forest, or XGBoost.
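A minimal sketch of the two-step stacking on synthetic data, assuming scikit-learn; the shapes, hyperparameters, and data-generating process are illustrative, not the authors' settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, n_abiotic, n_species = 600, 4, 5
A = rng.normal(size=(n, n_abiotic))                  # abiotic features
W = rng.normal(size=(n_abiotic, n_species))
# Synthetic "observed" abundances: non-negative, driven by the abiotic data.
X_obs = np.maximum(0, A @ W + rng.normal(scale=0.3, size=(n, n_species)))

A_tr, A_te, X_tr, X_te = train_test_split(A, X_obs, test_size=0.2, random_state=0)

# Step 1: abiotic-only model predicting every species at once (Eq 5).
step1 = RandomForestRegressor(n_estimators=100, random_state=0).fit(A_tr, X_tr)
X_tr_hat, X_te_hat = step1.predict(A_tr), step1.predict(A_te)

# Step 2: for a focal species, append the *predicted* competitor abundances
# to the abiotic features (Eq 6); no field counts are needed at test time.
focal = 0
F_tr = np.hstack([A_tr, np.delete(X_tr_hat, focal, axis=1)])
F_te = np.hstack([A_te, np.delete(X_te_hat, focal, axis=1)])
step2 = RandomForestRegressor(n_estimators=100, random_state=0)
step2.fit(F_tr, X_tr[:, focal])
r2_two_step = step2.score(F_te, X_te[:, focal])
print("two-step R^2:", round(r2_two_step, 3))
```

The crucial property is that `F_te` contains only abiotic measurements and step-1 predictions, so the stacked predictor can be applied where no competitor counts exist.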
In the final step of the analyses, we build full predictors to evaluate spatial prediction by randomly splitting the data into training and testing sets using the spatial cross-validation explained above. When using the model to predict the abundance of a sampled subplot during the incoming season, the training set excludes the samples of the year we want to predict. Please note that the training set may include years after the one predicted (for instance, training with 2018 and 2019 to predict 2017), as our goal is simply to evaluate the goodness of the procedure. We do not explore here other approaches such as the use of time-series data.

Results

Before building the models we selected the training features by looking at the correlation analysis and Feature Importance. The first method showed two subsets of strongly correlated features (Figs A and B in S1 Text). We kept C and dropped organic matter, N, and C/N ratio. Salinity remains in the training set while Na, Cl and K are removed. After dropping these variables we ran the Feature Importance method for the Random Forest with the abiotic set (Table 1). Results show that Annual precipitation is the most relevant abiotic feature, after Species, which is just the focal species whose abundance we want to predict. Carbonates, C, P, and Salinity follow in importance, while Ca, Mg and pH are less relevant than the added random noise, so they could be ignored to build the final model.
Table 1

Feature importance for the Random Forest model with the ABIOTIC set of variables.

Feature                  Importance
Species 0.328
Annual precipitation 0.250
Carbonates 0.098
C 0.091
P 0.067
Salinity 0.049
Random noise 0.036
Ca 0.028
pH 0.027
Mg 0.023
We applied the Feature Importance method with the full set of features as well (Table D in S1 Text). Results show that Annual precipitation is, again, the most relevant abiotic feature. The numbers of individuals of abundant competitor species such as POMA, LEMA, CHFU, and SASO (see Table C in S1 Text for species acronyms), as well as the concentration of carbonates, also proved relevant for the Random Forest built with the full set. As a result of both selection procedures, the models (Linear, RF and XGBoost) trained with the abiotic set work with only 6 features: salinity, precipitation, C, Ca, P and carbonates (co3). For the all-features and two-step models, we keep Mg and pH as well, because their rank in the feature importance tables was slightly higher for XGBoost. Thus, the training set for these two latter models includes 8 abiotic and 23 biotic features, one for the abundance of each species. As the model is unique, there is a circular problem with the abundance of individuals of the focal species when acting as competitors. For instance, to predict the abundance of HOMA individuals in a particular subplot, we would need to know in advance the abundance of HOMA individuals as competitors. Getting rid of the HOMA column is unfeasible, because those values are important for predicting the abundance of any other species. So, before building the full-set model, this value is set to 0 wherever the competitor and focal species are the same. We keep the same rule for the two-step predictor. We found that models that include the full set of abiotic and biotic features perform quite well regarding their R2 to predict species abundances within a spatial context. This is an important result because it shows a direct application of using Machine Learning approaches to describe relevant characteristics of ecological communities such as the spatial distribution of species relative abundances.
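The zeroing rule described above can be sketched as a small feature-construction helper; names, values, and the column layout are illustrative only:

```python
import numpy as np

abiotic = np.array([412.0, 0.7, 3.1])           # e.g. precipitation, salinity, carbonates
competitors = np.array([12.0, 0.0, 5.0, 31.0])  # abundances of 4 species in the subplot

def feature_row(focal_idx, abiotic, competitors):
    """Build the predictor row for one focal species: the focal species'
    own abundance is set to 0 among the competitor columns, breaking the
    circularity of using a species to predict itself."""
    comp = competitors.copy()
    comp[focal_idx] = 0.0
    return np.concatenate([abiotic, comp])

row = feature_row(2, abiotic, competitors)
print(row)  # the competitor column for species 2 is zeroed
```

Keeping the column (rather than dropping it) preserves a single fixed feature layout, so one model can serve every focal species.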
Specifically, we build 100 models that only differ in the random split of training and testing sets, including all features and years. The median R2 values are 0.095 for the Linear Regressor, 0.809 for XGBoost and 0.867 for Random Forest (Table 2). Prediction would be a practical tool with the two latter models, but the results may be deceiving to ecologists. To predict the abundance of species X we need to know beforehand the abundances of the rest of the species, so the painstaking field work is not avoided.
Table 2

Prediction errors for spatial application.

                     Median R2                          Median RMSE
Model                Linear   Random Forest   XGBoost   Linear   Random Forest   XGBoost
All features         0.095    0.867           0.809     37.505   14.361          17.234
Abiotic features     0.024    0.852           0.827     38.969   15.138          16.383
Two-step model       0.222    0.868           0.809     34.789   14.290          16.171
The weak performance of the Linear Regressor hints at the non-linear nature of the prediction challenge. The F statistic for the LRM trained on the abiotic data set is nearly null. According to the t values, the order of significance of the variables is Ca, C and salinity, with annual precipitation in fourth place (Table E in S1 Text). Even though this is a rough comparison, the Feature Importance for the RFR model is quite different, with annual precipitation as the most important variable (Table 1). The median R2 value for the Random Forest predictor trained just with abiotic information is very close to that of the predictor trained with all features: 0.852 vs. 0.867. This figure provided the hint to try the two-step method. Results are quite encouraging, as the median R2 of the two-step models is 0.868 using Random Forest for the second stage and 0.831 using XGBoost. The median R2 of the two-step is virtually identical to the value of 0.867 we got with the model built with the full data set. The same happens when we compare the median RMSE values of both methods: 14.290 (two-step) vs. 14.361 (all features). The practical advantage of the two-step method is that it does not require knowing in advance the abundance of competitor species. Fig 2 shows the improvement of the R2 distribution with the two-step method and Random Forest as the second-stage model. XGBoost results were slightly worse (Figs C and D in S1 Text).
Fig 2

Prediction errors with a two-step Random Forest Regressor.

A: Relative Squared Error distributions for 100 random choices of training/testing sets, vertical lines set at median values. B: Root Mean Square Error distributions for the same collection of predictors.

Although R2 is useful to make global comparisons among predictors (i.e. among species), we still require an assessment of prediction accuracy by species because of the asymmetry in their observed abundances. To evaluate the three methods with a species-specific approach, we performed 100 runs, following the steps described in the previous section, and measured both RMSE and RSE for each species (RSE = 1 − R2, used just for plotting convenience on a logarithmic scale). Overall, we found that the relative squared error is fairly small for abundant species such as Hordeum marinum or Chamaemelum fuscatum, while it shows a wide spread for plants that are relatively rare in the study area (Fig 3, see also Figs E and F in S1 Text).
Fig 3

Prediction errors by species using a two-step Random Forest Regressor.

A: Relative Squared Error distributions for 100 random choices of training/testing sets. B: Root Mean Square Error distributions for the same collection of predictors. See Table C in S1 Text for species acronyms.

Fig 4 shows the distribution of errors for a particular run. The two-step Random Forest model seems to be much more accurate at predicting zeros than the abiotic RF model.
Fig 4

Prediction errors by individuals.

Each dot is the value of y − ŷ, where y is the recorded value of abundance and ŷ the regression prediction. There are 37260 predictions for each run. A: Error values for a run of the two-step model with Random Forest. B: Error values for a run of the abiotic model with Random Forest.

The Random Forest models do not predict negative values. The Linear Regressor and XGBoost return between 9% and 25% negative values, which make no biological sense. We have kept them as predicted to compute the R2 index, and we have also kept the decimal values, in order to make fair comparisons among the different models. Just as we predict species abundances across space, we could predict species abundances over time with the same models trained on a different data set. From a modelling perspective, prediction over time is a widespread application of Machine Learning: given a curated yearly series of data, it is straightforward to build a predictor for the incoming season, and if the quality of the predictions is good enough, it would allow us to anticipate how plants will respond to changes in future environmental conditions. Unfortunately, this expectation does not hold for the data analyzed here, and it comes as no surprise. This annual plant system is highly variable: propagules can disperse over a wide range of distances after individuals complete their life cycle. Such dispersal kernels, in combination with variation in flooding events, make our system highly dynamic in both space and time. We therefore do not believe that our system is as stable through time as systems with longer-lived species such as shrubs or trees. Table 3 shows the evaluation results of the temporally trained predictors, each trained with all features of four years and tested on the remaining one. The median R2 values are very disappointing for all models.
A potential explanation is that, despite the fair size of the data set, the temporal sample is tiny. In addition, yearly fluctuations in weather are heavily marked in this study system, with annual precipitation ranging from 384 mm in 2019 to 625 mm in 2016. The fact that there are only five values for the time-related variables, one per year, makes prediction fail because the test data often fall outside the conditions covered by the training data. One possible workaround is dropping annual precipitation to reduce overfitting, bearing in mind that the feature analysis showed it to be the most relevant independent feature. The results show a mild improvement, but even the best R2 (0.08 for 2017) tells us that the predictive value is nearly null. Results for the Linear Regression and XGBoost predictors are even worse.
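The temporal evaluation can be sketched as a leave-one-year-out loop: train on four years, test on the held-out fifth, as in Table 3. Data below are synthetic and IID across years, so here the held-out year is of course predictable; in the real data, the single precipitation value per year is precisely what breaks extrapolation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in: five field seasons; features and effect sizes invented.
rng = np.random.default_rng(1)
years = np.repeat([2015, 2016, 2017, 2018, 2019], 80)
X = rng.normal(size=(years.size, 5))
y = 5 * X[:, 0] + rng.normal(size=years.size)

# Leave-one-year-out: each year in turn serves as the test set.
results = {}
for held_out in np.unique(years):
    train, test = years != held_out, years == held_out
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train], y[train])
    results[int(held_out)] = r2_score(y[test], model.predict(X[test]))
```

Note that R2 on a held-out group can be negative, as in Table 3, whenever the model predicts worse than the test-year mean.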
Table 3

Prediction errors splitting by year and using Random Forest.

                    With Precipitation             Without Precipitation
Predicted Year      Median RMSE    Median R2       Median RMSE    Median R2
2015                33.39          -3.95           27.77          -2.42
2016                21.58          -1.34           21.77          -1.38
2017                49.54          -0.01           47.26           0.08
2018                42.74          -0.16           41.47          -0.09
2019                54.81           0.07           54.85           0.07
Regardless of the differences in the ability of the Random Forest models to predict species abundance over time or across space, these models have the potential to provide novel insights into some key processes that modulate the response variable studied (species abundances in our case). This new information can in turn be incorporated into mechanistic predictions from population dynamics models that describe the abundance trajectories of interacting species. This latter type of model is much more familiar to ecologists. The possibility of feedback from the data-driven models to the mechanistic models is exemplified in our system with a particular focus on soil carbonates. The inclusion of this abiotic variable, which was ranked second in importance just after annual precipitation by the Feature Importance method (Table 1), yields an overall improvement in the predictions derived from the mechanistic models (S1 Text).
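The Feature Importance screening that flagged soil carbonates can be sketched with scikit-learn's impurity-based importances. The feature names below are placeholders standing in for the study's covariates, and the data are synthetic.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the field data; names are placeholders, not the
# study's actual measurements.
X, y = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)
names = ["precipitation", "CaCO3", "salinity", "C"]

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Rank covariates by impurity-based importance; candidates near the top are
# the ones worth adding to the mechanistic model as covariables.
ranked = sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1])
```

Impurity-based importances are a rough screening tool (they can be biased toward high-cardinality features), which is why a highly ranked candidate is best confirmed by re-parameterizing the mechanistic model, as done here with CaCO3.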

Discussion

By combining ecological knowledge with data-driven models, we showed that it is possible to develop reliable models that predict reasonably well the behaviour of complex systems, such as the abundances of the multiple species that compose ecological communities. Plant species composition at fine-resolution scales is hard to predict because densities and relative abundances are governed partly by abiotic factors, which determine where species can potentially thrive, and partly by the network of species interactions in which they are embedded, which modifies their reproductive success. In fact, these two axes of variation defining species persistence probabilities have been at the core of the species niche concept [55] and of the development of modern community ecology theory [56], but they have rarely been exploited for predictive purposes. Here, we show a simple methodology that uses easy-to-obtain abiotic information to accurately predict species abundances while also taking into account their potential biotic interactions. Our models are sensitive to the breadth of the training data, and as such they capture spatial anomalies (where we have more data) better than temporal anomalies. For this latter purpose, an alternative approach based on time series may yield better results. Machine learning-based methods have been extensively applied to relate species distributions to environmental factors through species distribution models. While the literature on species distribution modelling is vast, most of it is centered on large-scale distributional patterns of species occurrences [57], often involving only abiotic variables [58], and in the vast majority of cases prediction is limited to species presence or absence. However, most ecosystem functioning processes happen at the community scale. At this scale, species interactions are thought to determine species performance, quantified by their probability of persistence [31] and their relative abundance [59].
We show that a data-driven sequential model that first predicts the potential species abundances for a given set of abiotic variables, and then uses these predictions to refine the predicted realized species abundances, performs fairly well compared with more data-hungry models. However, note that when species abundance is low (median value under 3 individuals), the uncertainty of the abiotic prediction increases. To avoid this issue, the model could be refined in a future development through careful resampling of low-abundance species before performing the first step. We discarded this procedure because raising the overall R2 with simple SMOTE-based resampling required resampling percentages that were too high for this particular application [60]. In any case, a remarkable fact is that the two-step model is much better at predicting absences than the abiotic one. The existence of competing species seems to play an important role as an inhibitor of the growth of a particular species, and this information is lost when the model only works with abiotic features. The fact that this two-step process matches the predictions of a one-step model with all data available is remarkable. One possible explanation is that observed plant abundances measured empirically in the field only capture fully developed individuals, missing early stages of competition among seedlings that, despite dying soon, affect final species abundances. In our case, the best-performing data-driven model is the Random Forest, closely followed by XGBoost. As expected, the assumptions of linear models are too simple when there are complex interactions among features, as the exploratory analysis suggested. Which model is more appropriate may depend on the data set at hand. Interestingly, this data-driven exercise can also help us enhance mechanistic models. We had already used mechanistic models to understand the species dynamics in our ecological system.
Aware of the importance of the abiotic environment, we modelled species reproductive success as a function not only of competitors but also of other environmental variables such as soil salinity [18]. To our surprise, the feature importance selection procedure highlighted CaCO3, and not salinity, as a key determinant of species abundances, even though salinity was the most obvious variable initially selected in the field. Although initially counter-intuitive, this result is congruent with the fact that we sampled a hypersaline environment in which phosphorous (a key element for plant growth) is not available for plant absorption. Rather, it is retained in carbonate minerals such as calcite and dolomite, and plants can mostly obtain phosphorous thanks to the enzymes of mycorrhizal fungi. With this new knowledge, we re-parameterized the mechanistic annual plant model by adding CaCO3 as a covariable affecting both the intrinsic fecundity rates and the pairwise interactions among species. With this update we obtained a significantly better predictive error than with the biotic-only parameterization (Table A in S1 Text). Hence, we show that ecological processes can shed light on data-driven models, and those can in turn refine which ecological processes are important to include in the mechanistic models. In our relatively simple proof of concept, the mechanistic formulation of the parametric model was not influenced by the data-driven model, but more complex feedbacks are of course conceivable, for example more appropriate functional responses (e.g. non-linearities) of some variables, or interactions among variables. In any case, data-driven methodologies are especially suitable when one has data on many different environmental variables, which would be unfeasible to include in a parametric model one by one. This exercise is tailored to the problem at hand.
For example, an implicit assumption of this modelling framework is that plant species can reach all quadrats in the grassland and are not limited by dispersal. This assumption is reasonable in a study system in which seeds are small and can be dispersed by wind and by small animals such as ants, and in which the system also gets flooded in extremely wet years. Similarly, we focused our modelling on plant-plant competitive interactions, which are the main interactions structuring these grassland communities [61], and ignored other interactions such as pollination or herbivory. However, the same approach can be used to model other interaction types in other systems, as long as there are initial data to train the models. When modelling species with lower detectability than plants, or hyper-diverse communities, further enhancements may be needed to obtain sensible results. In our case, we obtain a good spatial predictive ability, but we fail to predict temporally. Given the strong across-year variation in precipitation, we believe this is due to the limited number of years available to train the models, and not to an inherent limitation of the framework. It might also be that stochastic events, which create variation from unknown sources (e.g. random birth-death processes, perturbations in population sizes, dispersal events in no particular direction), are more prevalent in the temporal dimension than deterministic processes such as species interactions [62]. In any case, given the expected ongoing environmental change in many abiotic variables such as precipitation regimes and temperatures, we envision this kind of predictive model being especially suitable, in combination with semi-automated species monitoring schemes (e.g. NEON, [63]), to anticipate global change effects on delicate and highly diverse ecosystems such as Mediterranean grasslands.
We want to highlight that the proposed approach complements current approaches to understanding fine-scale community composition, such as multivariate methods (e.g. CCA [64]) or time series analysis [65], which may be more suitable depending on the question to be answered or the data available. Including the temporal resolution of soil properties may also enhance model performance.

Conclusion

The rate at which ecological data are generated is increasing substantially [63]. Open and reliable data sets hold the potential to facilitate the application of near-term forecasting protocols [6]. However, for those efforts to thrive, we need simple models that can work with the sparse data typical of ecological surveys. A more predictive ecology would help anticipate how several ongoing critical environmental changes, such as climate change, affect multiple properties of ecosystems, and at the same time it would provide information about which management actions are required to maintain healthy ecosystems. Taken together, our results show that two-step ensemble models are a promising tool for reaching efficient management without prohibitive data-collection costs.

List of packages

Python: python 3.8.8 [66], matplotlib 3.3.4 [67], numpy 1.20.1 [68], pandas 1.2.4 [69], seaborn 0.11.1 [70], scikit-learn 0.24.1 [71], verde 1.6.1 [49], xlsxwriter 1.3.8 [72], xgboost 1.4.2 [73]. R: r-base 4.1.0 [74], cowplot 1.1.1 [75], ggplot2 3.3.3 [76], gridExtra 2.3.0 [77], patchwork 1.1.1 [78], scales 1.1.1 [78], tidyverse 1.3.1 [79].

Abundance prediction with population dynamics models and supplementary figures and tables.

(PDF) Click here for additional data file. 9 Jun 2021 Dear Dr García-Algarra, Thank you very much for submitting your manuscript "Fine scale prediction of ecological community composition using a two-step sequential machine learning ensemble" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. Reviewer 1 and 2 raise important methodological concerns which need to be addressed. In particular, as also suggested by reviewer 2, I encourage the authors to better and more deeply discuss the limitations of their approach. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. 
If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Jacopo Grilli Associate Editor PLOS Computational Biology Natalia Komarova Deputy Editor PLOS Computational Biology *********************** Reviewer 1 and 2 raise important methodological concerns which needs to be addressed. In particular, as also suggested by reviewer 2, I encourage the authors to better and more deeply discuss the limitations of their approach. Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: uploaded as an attachement Reviewer #2: Summary: This paper proposes a two-step ensemble model and feature engineering workflow to predict species abundances in real world populations over space and time. The aim is ambitious, and it speaks to an important division in quantitative ecology between traditional mechanistic population models, mostly using formal Bayesian models with MCMC, and more flexible machine learning models. The work is similar to semi-supervised learning with weak labels where the predictions from the first step are propagated to avoid measuring difficult interaction covariates in the field. They provide a nice worked example. I like the idea and think the conversational tone throughout is useful. Given that formal models have been largely non-effective, and that there is no hope is directly measuring species interaction coefficients over long periods of time, we definitely need all new ideas we can get. 
To make this paper more palatable to a large audience, the introduction needs to slow down, avoid making grand statements, and really consider the current state of quantitative ecology. My only concern for the results is that with a single example, and no simulations, it is difficult to guess which parts of the model will be most sensitive to typically challenging ecological scenarios. A short list might include • Incomplete detection • Large clines in environmental conditions • Spatial autocorrelation • Rare events/time lag (such as seed back in this case) • Increasing numbers of interacting species • Asymmetric interactions It would be impossible for the authors to tackle all of these ideas, but without alteast some exploration in the text, I don’t know that I see this as more than a nice example and perhaps not broad enough for the journal. I think this can be fairly easily remedied. The authors might feel the natural desire to defend their ideas and avoid being too critical, but I prefer that when a new modeling approach is proposed that we can assess both its strengths and weaknesses. The paper lacks clear guidance and when such a workflow is a good idea and when it would be problematic and rather opts for statements like L354 ‘The fact that this two-step process enhances predictions over a one-step model with all data available is remarkable’ Comments The authors are really taking on a large and important moment in quantitative community ecology. Decades of mechanistic models show relatively little promise in predicting population abundances dynamics in natural settings over broad spatial areas. These conditions need to made clear in the paper, there are plenty of very good community microcosm experiments (https://scholar.google.com/citations?hl=en&user=Nkgv64gAAAAJ&view_op=list_works&sortby=pubdate). See several papers by H. 
Lynch in predicting penguin population dynamics (particularly https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2656.12790) or very similar work by J. HillrisLambers (https://onlinelibrary.wiley.com/doi/abs/10.1111/ele.13236) showing that current models can’t produce large scale dynamics. I’ll avoid citing my own work here, but there are many similar overarching papers outlining the need for prediction in community ecology [1-3]. I don’t think the paper needs to repeat the work in those papers, but it needs to help set the stage. Rather than creating a simple old/new dichotomy between mechanistic and data-driven models (L24), I think the authors need to spend more time in setting up the tradeoffs between the two. Why were mechanistic models originally favored? Data availability, lack of computational power to do MCMC? Focus on hypothesis testing over prediction (see [4])? To make the kind of impact they want, the authors need to slow down in the intro, give some specific examples, rather than general citations, and really engage with the current state of the field. Try to avoid proposing a ‘solution’ (always dangerous, L22) rather quickly. I’m a very sympathetic reader, since I already use these kinds of models, but the target audience should be a much broader group of quantitative ecologists, so the arguments need to be well laid out. L51. I work with these kinds of ML ecological models every day. I emphatically reject they have ‘solved’ anything (famously see [5]). I could show some well-constructed and thoughtful neural networks that do an absolutely terrible job at prediction compared to regression. Similar to the comment above, if the authors want to really make a contribution in this area, more subtly and accuracy is needed. L55 is particularly worrisome and shows some CS envy. This kind of sentence makes me stop reading a paper. 
One thing that jumps out at me is that the virtue of the sequential ML models, especially in the computer vision space, has been their ability to scale to enormous amounts of data (going back to [6]). The argument has always been that the learning potential for these types of models greatly exceeds the formal hierarchical Bayesian models and does away with needing to set priors. But in ecology we almost never have such large datasets (though its getting easier). I worry about the applicability of this kind of method to the average ecology dataset, especially as the number of species increases. How would this scale? I don’t have an intuition for which parts of the model would be most sensitive to larger number of taxa. Similarly for spatial heterogeneity of the landscape. The authors should either dial back their language in the conclusion or look to simulations to help explore when these kind of models will be most successful. Figure 2 B needs some text to help orient me. What do the authors hope to convey with this right-hand panel? Minor Comments: Slightly awkward phrasing in the abstract ‘inform back’. In general, I try not to provide too many line edits unless they interfere with readability. Missing ‘e’ in ‘especially’, in the abstract. Try to be careful in the introduction to consider a wide range of sources. The topic is heated and will be best received if the authors don’t cite their own work in such prominent spots (e.g. L7-11). 1. Anderegg LDL, HilleRisLambers J. Local range boundaries vs. large-scale trade-offs: climatic and competitive constraints on tree growth. Ecol Lett. 2019;22: 787–796. doi:https://doi.org/10.1111/ele.13236 2. Youngflesh C, Jenouvrier S, Hinke JT, DuBois L, Leger JS, Trivelpiece WZ, et al. Rethinking “normal”: The role of stochasticity in the phenology of a synchronously breeding seabird. J Anim Ecol. 2018;87: 682–690. doi:https://doi.org/10.1111/1365-2656.12790 3. 
Dietze MC, Fox A, Beck-Johnson LM, Betancourt JL, Hooten MB, Jarnevich CS, et al. Iterative near-term ecological forecasting: Needs, opportunities, and challenges. Proc Natl Acad Sci. 2018;115: 1424–1432. doi:10.1073/pnas.1710231115 4. Betts MG, Hadley AS, Frey DW, Frey SJK, Gannon D, Harris SH, et al. When are hypotheses useful in ecology and evolution? Ecol Evol. n/a. doi:https://doi.org/10.1002/ece3.7365 5. Nguyen A, Yosinski J, Clune J. Deep Neural Networks Are Easily Fooled: High Confidence Predictions for Unrecognizable Images. 2015. pp. 427–436. Available: https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Nguyen_Deep_Neural_Networks_2015_CVPR_paper.html 6. Dean J, Corrado GS, Monga R, Chen K, Devin M, Le QV, et al. Large Scale Distributed Deep Networks. NIPS. 2012. Reviewer #3: Kia ora koutou, Thank you so much for offering me the opportunity to read carefully your manuscript. The manuscript is clear, and the methodology interesting and robust. There may be some minor revision needed to convey the message more smoothly and I indicate them in the attached PDF document. Best wishes, ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. 
Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: Yes: Giulio Valentino Dalla Riva Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. 
Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols Submitted filename: Civantos2021PLOSCompBio.docx Click here for additional data file. Submitted filename: Report.pdf Click here for additional data file. 9 Aug 2021 Submitted filename: response_to_reviewers.pdf Click here for additional data file. 27 Sep 2021 Dear Dr García-Algarra, Thank you very much for submitting your manuscript "Fine scale prediction of ecological community composition using a two-step sequential machine learning ensemble" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations. Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. 
Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Jacopo Grilli Associate Editor PLOS Computational Biology Natalia Komarova Deputy Editor PLOS Computational Biology *********************** A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK] Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: First, I want to thank the authors for taking most of my prior suggestions seriously and working hard to improve their manuscript. In the comments below I’ll focus on specific (new) line numbers where issues remain, as well as the comment numbers the authors introduced in their response to my review. Introduction: At a high-level, I’m still concerned that the authors are couching their paper in terms of the (real) scaling problems inherent in theory-based modeling approaches to species interactions (e.g. Lotka-Volterra; Lines 21, 25-48, etc), but then glosses over that (A) the method they propose has the exact same issue (number of terms scales quadratically with the number of species and (B) other mechanistic modeling frameworks exist that don’t have this scaling issue. See previous Comments XX. Line 56: The authors point to the lack of interpretability of machine learning models, but then go on to present an uninterpretable analysis. For example, there are still no plots showing the shapes of the relationships inferred (previous Comment XX). The two-stage modeling approach is also particularly hard to interpret because it’s not clear what these stages represent. Counter to what is implied (but not outright stated, e.g. 
lines 71 and 406), the ABIOTIC model is most definitely NOT a model of the fundamental niche, but instead is a model of the realized niche, with the impacts of species interactions still implicitly present in the data.

Equation 2: It's not clear to me what point is being made here. The only difference between equations 1 and 2 is that you're calling the function g instead of f.

Equation 3: As noted in previous Comment XX, I have real concerns about the approach to temporal modeling that's implied here, which differs strongly from standard time-series and population modeling approaches (which use X_t to predict X_t+1). Indeed, in this round of review I downloaded the authors' data and can confirm that, counter to their abysmal results in Table 3, a linear model that just uses X_t to predict X_t+1 (without ANY other covariates) has an R2 of 0.42 (compared to the reported -3.95 to 0.08). Adding in species identity and the precip for both times t and t+1 (but not ANY soil variables, species interactions, interactions between species and precip, or nonlinear relationships), a simple linear model increases to an R2 of 0.56. Admittedly this is still worse than the spatial Random Forest, but it does strongly suggest that the authors revise their approach and all of their Results and Discussion around temporal models.

As a separate comment, accessing the raw data made it clear to me that the soil covariate data for a subplot were identical for every single year. First, this was not obvious to me when reading the paper, so it should be more prominent in the Methods. Second, it should be reiterated in the Results and Discussion, as it's utterly unsurprising that their current temporal model does so poorly when the covariate data used to predict temporal variation are themselves not changing in time.

Line 118: measuring, not measure

Line 134: Unless you have a time machine I don't know about, measuring abiotic data for time t+1 when you're currently at time t isn't possible.
If you want to predict to time t+1, you need to either use covariates at time t or (uncertain) forecasts of covariates at time t+1. Second point: throughout the paper (here, L147, etc.), the authors keep saying that it's easier to measure abiotic covariates than it is to measure the biotic data itself (species composition), but they don't provide any evidence to support this argument. Indeed, as previously noted, their own data set samples species every year but abiotic covariates only once, strongly implying that the latter is harder. On top of this, there are known technological approaches to measuring species composition at fine spatial resolutions and non-trivial spatial scales (e.g. NEON's 1 m^2 resolution imaging spectroscopy), but as far as I know there is no equivalent technological approach to measuring below-ground nutrient concentrations.

Line 150: I'm not sure why the authors think this approach is "too radical", since they're just describing species distribution modeling, which is widespread in the discipline.

Line 166: Unclear how the uncertainties in the predicted values are being handled in the second-stage model. Please comment on this explicitly. I suspect they are being ignored, in which case this should be revisited in the Discussion as a limitation / future direction.

L198: Still not adequately addressing how the approach described deals with the zero-inflation problem (previous Comment 2).

Lines 304-305: Sentence very hard to parse. I had to read it 3-4 times just to be sure it was a sentence, not a fragment, and it was still unclear what the authors were trying to convey.

Line 311: Not entirely following why this was done over just not including species j as a covariate when predicting species j.

Line 322: latter, not former

Figure 3: The decision to both use a different species order in panel B and to switch the mapping between color and species is extremely misleading.
I think it is essential to keep the colors the same between panels, and I recommend keeping the species order the same too.

Line 356: Why? Please revisit in the Discussion.

Line 369: First, this comes as a surprise to me, as I would have expected a simple repeated-measures model to have done fairly well because of the tendency of plants to stay put through time. Second, as discussed above, I don't think this analysis is correct.

L377: Ambiguous what "this variable" is referring to.

L391: Feature

Line 392: Still introducing new Methods in the Results. This is not OK. Per previous Comment 8, you really need to lay out this analysis in the Methods and preferably also motivate it in the Introduction.

Line 444: This likewise feels like new Methods and Results in the Discussion. The mechanistic modeling part is not adequately explained.

Line 476: NEON is all caps, acronyms should be defined, and it needs a citation.

Comment 5: I'll defer to the Editor, but I'm not fully convinced that a comparison to conventional methods for vegetation analysis (e.g. CCA) is "beyond the scope". It seems like, at a minimum, the existence of such methods should come up in the Discussion.

Comment 6: Did not address the issue about differences in sensitivity to spatial and temporal anomalies. It would be good to add a quick line to the Discussion.

Reviewer #2: I appreciate the authors' willingness to engage with reviewers and improve the text. The intro is much better. I like the idea and think it is worth exploring. The authors continue to overstate the strength of their results, but I think this point is stylistic and rather minor. A good revision.

Reviewer #3: Kia ora koutou. Thank you so much for your effort in answering my questions. I'm satisfied with the answers and the edits. Ngā mihi, Giulio

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous, but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: Yes: Ben Weinstein
Reviewer #3: Yes: Giulio Valentino Dalla Riva

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript.
Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article's retracted status in the References list and also include a citation and full reference for the retraction notice.

25 Oct 2021
Submitted filename: response_to_the_reviewers_2nd_revision.pdf

12 Nov 2021

Dear Dr García-Algarra,

We are pleased to inform you that your manuscript 'Fine scale prediction of ecological community composition using a two-step sequential machine learning ensemble' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted, you will need to complete some formatting changes, which you will receive in a follow-up email.
A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be coordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Jacopo Grilli
Associate Editor
PLOS Computational Biology

Natalia Komarova
Deputy Editor
PLOS Computational Biology

***********************************************************

1 Dec 2021

PCOMPBIOL-D-21-00507R2
Fine scale prediction of ecological community composition using a two-step sequential machine learning ensemble

Dear Dr García-Algarra,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department, and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.
Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Olena Szabo
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
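As a technical footnote to Reviewer #1's comment on Equation 3: the lag-1 baseline the reviewer describes (an ordinary linear model using abundance at time t as the only predictor of abundance at time t+1) can be sketched as below. This is an illustrative reconstruction on synthetic data, not the authors' code or field data; the autocorrelation strength and the abundance distribution are assumptions, so the R^2 it produces is not the 0.42 the reviewer reports.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-subplot abundances: positive values with
# temporal autocorrelation, so X_t carries information about X_{t+1}.
# The 0.7 persistence and noise scale are assumed, for illustration only.
n = 500
x_t = rng.gamma(2.0, 5.0, size=n)                # abundance at time t
x_t1 = 0.7 * x_t + rng.normal(0.0, 2.0, size=n)  # abundance at time t+1

# Lag-1 linear baseline: least-squares fit of X_{t+1} on X_t alone.
slope, intercept = np.polyfit(x_t, x_t1, 1)
pred = slope * x_t + intercept

# Coefficient of determination of the fit.
ss_res = np.sum((x_t1 - pred) ** 2)
ss_tot = np.sum((x_t1 - x_t1.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"lag-1 baseline R^2: {r2:.2f}")  # close to 0.86 in expectation here
```

The value of such a baseline is diagnostic: if a one-covariate autoregression outperforms a temporal model built on rich abiotic covariates, the temporal model is likely discarding the information contained in current abundances.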