Species distribution modeling based on the automated identification of citizen observations.

Christophe Botella¹,²,³,⁴, Alexis Joly¹, Pierre Bonnet³,⁵, Pascal Monestiez⁴, François Munoz⁶.

Abstract

PREMISE OF THE STUDY: A species distribution model computed with automatically identified plant observations was developed and evaluated to contribute to future ecological studies.
METHODS: We used deep learning techniques to automatically identify opportunistic plant observations made by citizens through a popular mobile application. We compared species distribution modeling of invasive alien plants based on these data to inventories made by experts.
RESULTS: The trained models have a reasonable predictive effectiveness for some species, but they are biased by the massive presence of cultivated specimens.
DISCUSSION: The method proposed here allows for fine-grained and regular monitoring of some species of interest based on opportunistic observations. More in-depth investigation of the typology of the observations and the sampling bias should help improve the approach in the future.

Keywords: automated species identification; citizen science; crowdsourcing; deep learning; invasive alien species; species distribution modeling

Year: 2018 PMID: 29732259 PMCID: PMC5851560 DOI: 10.1002/aps3.1029

Source DB: PubMed Journal: Appl Plant Sci ISSN: 2168-0450 Impact factor: 1.936

Identifying organisms is a key step in accessing information related to the ecology of species. Specifically, large‐scale monitoring of species distribution dynamics is essential in the context of global change. Such monitoring requires intensive occurrence data, but such data are lacking due to the level of expertise necessary to correctly identify and record living organisms. This is especially true for plants, which are one of the most difficult groups to identify, with more than 350,000 known species on earth. The Rio Conference of 1992 (the Earth Summit, United Nations Conference on Environment and Development [UNCED], Rio de Janeiro, Brazil, 3–14 June 1992 [http://www.un.org/geninfo/bp/enviro.html]) recognized this taxonomic gap as a major obstacle to the global implementation of the Convention on Biological Diversity. Gaston and O'Neill (2004) discussed the potential of using automated identification approaches, typically based on machine learning and multimedia data analysis methods, to produce more intensive occurrence data. They suggested that if the scientific community is able to (1) overcome the production of large training data sets, (2) more precisely identify and evaluate error rates, (3) scale up automated approaches, and (4) detect novel species, it will then be possible to initiate the development of a generic automated species identification system. Such a system should then open important opportunities for studies in biology, ecology, and related fields. Since Gaston and O'Neill (2004) raised the question, enormous work has been done on the development of automated approaches for plant species identification (Casanova et al., 2009; Yanikoglu et al., 2014; Lee et al., 2015; Champ et al., 2016; Goëau et al., 2016; Joly et al., 2016; Wilf et al., 2016; Wäldchen and Mäder, 2017). Deep learning techniques in particular have been recently shown to achieve impressive recognition performance (Goëau et al., 2017). 
Some of these results were integrated into effective web or mobile tools and have initiated close interactions between computer scientists and end‐users such as ecologists, botanists, educators, land managers, and the general public. One remarkable realization in this domain is the Pl@ntNet mobile application (Affouard et al., 2017). It is used in an eponymous citizen science initiative (SciStarter, available at https://scistarter.com/project/16909-PlntNet) by a growing number of users around the world (more than 6 million downloads since 2013), and tens of thousands of plant pictures are submitted each day. Because a large fraction of this observation stream is geolocalized, it has great potential in terms of biodiversity monitoring and species distribution modeling (SDM). As the use of opportunistic data coming from citizen science initiatives has already been proven by Giraud et al. (2016) to strengthen the estimate of relative bird species abundance, we can expect other potential uses for such data types in a botanical context with Pl@ntNet. Acquiring a large amount of opportunistic data still occurs at the expense of data quality and reliability, however. Many irrelevant pictures are submitted by the users of the Pl@ntNet application. This includes non‐plant pictures, plant pictures of poor quality, or pictures of taxa that are not in the designated checklist (e.g., potted plants, ornamental and horticultural varieties, hybrids). Because the machine learning algorithm is not able to filter all of these pictures, many of them result in false positives (i.e., they are predicted as occurrences of species belonging to the checklist). 
Indeed, for a species automatically identified from a picture, two problems may induce identification error: (1) there is an intrinsic taxonomic uncertainty given the picture alone (i.e., it does not contain the discriminant visual pattern[s] that would make an expert certain about the exact species identification) or (2) the species was misidentified. Figure 1 illustrates typical examples of identification errors for Acer monspessulanum L. In Fig. 1B, one can see that the small symmetrical lobes at the base of the leaf might be confused with those of a young specimen of A. campestre L., which is probably the cause of the model uncertainty. Figure 1C illustrates the problem of taxonomic uncertainty: several species cannot be distinguished by the feature recorded in the observer's image, as reflected in the closeness of the confidence values of the first two species. Finally, Fig. 1D shows a leaf of Hedera helix L. with three major lobes that have strong visual similarity to those of the A. monspessulanum leaf. Manually cleaning such large and noisy data streams is not feasible. These problems imply that not all species are equal in their potential for automatic identification. Several factors make a species automatically identifiable from a photograph: the scale of the discriminant visual pattern (for example, there are many issues with the Poaceae family because discriminant features are often too small to be easily captured with a photograph), the visual saliency of the pattern compared to other species, and the temporality of the pattern due to the phenology of its organ.

Figure 1

Four unvalidated Pl@ntNet plant pictures representing, or identified as, Acer monspessulanum and their respective predicted confidence values for the highest ranked species (the sum of scores over all species is always 100). (A) The species is A. monspessulanum and is well predicted. (B) The species is A. monspessulanum, but the model confuses it with A. campestre. (C) The species is A. monspessulanum or A. pseudoplatanus, but the species cannot be determined with the fruit only; there is an intrinsic taxonomic uncertainty. (D) The species is Hedera helix but is predicted as A. monspessulanum because this leaf is quite similar, as one can see by comparing it with (A).

In this article, we explore the possibility of exploiting automatically identified observations, without human validation, for SDM. Specifically, we study the impact of the degree of uncertainty of the retained occurrences when training the popular MAXENT niche modeling approach (Merow et al., 2013). Given the type of Pl@ntNet users, candidate species have to be automatically identifiable by non‐expert observers who are often not familiar with the discriminant part of the plant that needs to be photographed. In addition, species that are visually similar in pictures must be avoided, and the chosen species must be well illustrated in the predictive model training database. In addition to these criteria that allow automatic species identification, we must take into account the requirements of using SDM on presence‐only data to acquire meaningful results. More precisely, the species must have contrasted environmental preferences regarding the study domain, its realized habitat must not be overly constrained by its dispersal capacity or important historical perturbations, and there must be enough observation points regarding the environmental variables considered. 
Considering these constraints on species selection, the available data, and the potential use‐cases, we applied our protocol to the modeling of the distribution of five species classified in major and moderate categories of invasion by the National Mediterranean Botanical Conservatory of Porquerolles for the southeastern region of France (Conservatoire botanique national méditerranéen de Porquerolles, 2018). Invasive species represent a major economic cost to our society (estimated at nearly €12 billion a year in Europe) and are one of the main threats to biodiversity conservation (Weber and Gut, 2004). The early detection of the appearance of these species is a key element in managing them and reducing the cost of such management. The analysis of Pl@ntNet data can provide a highly valuable response to this problem because the presence of these species is often correlated with that of human activity (and thus to the density of Pl@ntNet data occurrences), and the constant flow of observations enables annual monitoring of species distributions.

METHODS

Automatic species identification and the Pl@ntNet workflow

We first present the workflow of the Pl@ntNet system that yields automatically identified observations. To compute automatic species identification, we use a convolutional neural network (CNN). CNNs have been shown to considerably improve the accuracy of automated plant species identification compared to previous methods (Grinblat et al., 2016; Ghazi et al., 2017; Goëau et al., 2017). More generally, CNNs have recently received much attention in the computer vision community because of the impressive performance they can achieve on a large variety of classification tasks. Details of the CNN architecture and of the training procedure we used in this study are provided in Appendix 1. The network was trained in a supervised manner on a set of 332,000 human‐validated plant images belonging to approximately 11,000 species and an additional rejection class (containing non‐plant pictures taken by Pl@ntNet users, e.g., faces, animals, manufactured objects). These species cover a large part of the European and North African floras, according to the network of people initially involved in the production and validation of these data (this network was initiated with the Tela Botanica non‐governmental organization [http://www.tela-botanica.org] and the network of French‐speaking botanists, composed of professionals and amateurs). This data set also includes a few hundred species of common tropical plants from two tropical regions: the Indian Ocean region and tropical Amazonia. Data from these two regions were collected by scientists and engineers from research institutes and universities working on these floras, representatives of the Tela Botanica network in these regions, and Pl@ntNet users. The data validation process was conducted using the IdentiPlante web tool (http://www.tela-botanica.org/appli:identiplante), essentially dedicated to the Tela Botanica community, and was also accessible on the Pl@ntNet Android app. 
These applications display all botanical records shared by the project members. Logged‐in users are able to provide new identifications, post comments, and vote on previous identifications. The revised data are regularly crawled by the visual search engine, which picks up observations considered correctly identified according to a predefined set of rules on the votes and on possible conflicts. These validation tools allow coverage of a growing number of species, from 800 in 2013 up to 11,000 in 2016.

Species distribution modeling using automatically identified Pl@ntNet observations

We performed SDM based on the unvalidated Pl@ntNet observations made in France in 2016. In total, the data represent approximately 2 million observations (most observations have only one image and some have up to five images). Each image x was passed to the CNN to receive an automated species prediction in the form of a categorical distribution p(k|x) estimating the probability that the image x is from the k‐th species (according to the softmax classification layer of the CNN). For the observations composed of several images, the predictions were simply averaged, i.e., $p(k|x) = \frac{1}{n_x} \sum_{i=1}^{n_x} p(k|x_i)$ for an observation x composed of $n_x$ images $x_1, \dots, x_{n_x}$. We then kept only the observations for which the most probable species (denoted as kmax) belonged to the set of the five potential invasive species considered in our study: Acer negundo L., Carpobrotus edulis (L.) N. E. Br., Erigeron karvinskianus DC., Opuntia ficus‐indica (L.) Mill., and Reynoutria japonica Houtt. The resulting number of occurrences per species and per interval of confidence values p(kmax|x) is provided in Fig. 2. For low values of p(kmax|x), the level of noise is high (e.g., with several false positives for p(kmax|x) < 30%). For the highest values of p(kmax|x) (e.g., p(kmax|x) > 95%), the level of noise is more reasonable but the number of occurrences is also much lower. Thus, to maximize SDM performance, one could expect the best trade‐off at an intermediate threshold.
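The averaging and filtering steps described above can be sketched as follows. This is a minimal illustration, not the Pl@ntNet pipeline: the function names, the toy three‐class probability vectors, and the use of a non‐strict comparison (confidence ≥ threshold) are assumptions made for the example.

```python
import numpy as np

def observation_prediction(image_probs):
    """Average the per-image softmax vectors of one observation."""
    probs = np.asarray(image_probs, dtype=float)
    return probs.mean(axis=0)

def filter_observations(observations, target_classes, p_min):
    """Keep observations whose most probable class k_max is a target
    species and whose confidence p(k_max|x) reaches p_min."""
    kept = []
    for obs_id, image_probs in observations.items():
        p = observation_prediction(image_probs)
        k_max = int(np.argmax(p))
        if k_max in target_classes and p[k_max] >= p_min:
            kept.append((obs_id, k_max, float(p[k_max])))
    return kept

# Toy data (hypothetical): three classes, classes 0 and 2 are "target" species.
observations = {
    "obs_a": [[0.9, 0.05, 0.05], [0.7, 0.2, 0.1]],  # two images, averaged
    "obs_b": [[0.4, 0.35, 0.25]],                   # top class too uncertain
    "obs_c": [[0.1, 0.1, 0.8]],
}
kept = filter_observations(observations, target_classes={0, 2}, p_min=0.7)
print(kept)
```

Raising `p_min` removes more misidentifications but also shrinks the training set, which is the trade‐off examined in the Results.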

Figure 2

The number of Pl@ntNet observations per species and per confidence values p(kmax|x).

To validate the species distribution models trained from automatically identified data, we used a second reference data set comprising count data collected and validated by French expert naturalists. This data set, referred to as Inventaire National du Patrimoine Naturel (INPN; https://www.gbif.org/dataset/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d; Dutrève and Robert, 2016), comes from the Global Biodiversity Information Facility (https://www.gbif.org/). The underlying occurrences were collected in various contexts, including floras and regional catalogs, specific inventories, field notebooks, and surveys carried out by botanical conservatories. We kept only a subset of these data corresponding to the five invasive species considered in our study. The resulting data set contains 20,810 occurrences (see Table 1 for the detailed numbers per species) aggregated in 3242 quadrat cells of 100 km² distributed on a regular grid of 5175 quadrat cells covering the French territory.
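As an illustration of how point occurrences can be aggregated into 100‐km² quadrat cells, the sketch below bins coordinates on a regular 10 km × 10 km grid. It assumes coordinates are already in a projected system expressed in metres; the grid origin, the function names, and the sample points are hypothetical and do not reproduce the INPN grid.

```python
from collections import Counter

CELL_SIZE_M = 10_000  # 10 km x 10 km cells, i.e., 100 km² each

def cell_index(x_m, y_m, cell_size=CELL_SIZE_M):
    """Map projected coordinates (metres) to integer (column, row) grid indices."""
    return (int(x_m // cell_size), int(y_m // cell_size))

def counts_per_cell(points):
    """Count occurrences falling in each quadrat cell."""
    return Counter(cell_index(x, y) for x, y in points)

# Hypothetical occurrence coordinates in metres.
points = [
    (652_300.0, 6_862_100.0),
    (655_900.0, 6_868_700.0),  # same cell as the first point
    (671_000.0, 6_861_000.0),
]
counts = counts_per_cell(points)
print(counts)
```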

Table 1

Detailed number of occurrences in the Inventaire National du Patrimoine Naturel (INPN) data set by species

Species name	No. of observations	No. of 100‐km² areas
Acer negundo L.	5217	904
Carpobrotus edulis (L.) N. E. Br.	484	114
Erigeron karvinskianus DC.	711	306
Opuntia ficus‐indica (L.) Mill.	120	44
Reynoutria japonica Houtt.	14,278	2623

Species distribution models were computed via MAXENT (Phillips et al., 2004, 2006), a popular environmental niche modeling method. In particular, we used the implementation of the maxnet (Phillips et al., 2017) R package that expands the input environmental variables with several functions (including linear, quadratic, threshold, hinge, and first‐order interactions). Because we used presence‐only SDM, we used pseudo‐absence localities for model parameterization (see Appendix 2 for more details). MAXENT was computed on a set of 29 input environmental variables, including bioclimatic, pedological, topological, hydrographical, and land cover variables from CHELSA Climate data 1.1 (Karger et al., 2017), Consultative Group on International Agricultural Research–Consortium for Spatial Information (CGIAR‐CSI) potential evapo‐transpiration (ETP) data (Zomer et al., 2007, 2008), ESDBv.2 (Panagos, 2006; Van Liedekerke et al., 2006; Panagos et al., 2012), U.S. Geological Survey Digital Elevation data, the Institut National de l'information Géographique et forestière–Système d'Administration Nationale des Données et Référentiels sur l'Eau (IGN‐SANDRE) BD Carthage, CORINE Land Cover 2012 data, and IGN ROUTE500 data. The detailed methodology of how these variables were collected and formatted is described in Appendix 3. The full list of the variables used is presented in Table 2. For each of the considered species, we computed seven models with varying levels of minimal confidence of species occurrences, i.e., different threshold values pmin(kmax|x) of the categorical probability p(kmax|x). We know that the global sampling effort in Pl@ntNet is highly correlated with human population density and the proximity to roads and to the coastline. 
In our study, the sampling intensity was so high compared to the species abundance that we strongly overestimated the species abundance in cities, on beaches, and on roads. Consequently, we fitted MAXENT models that included variables for urban areas, proximity to roads, and distance to the coastline. In the predicted abundance function, we then kept these variables constant across space to cancel the effect of the sampling effort (see Appendix 2 for more details). This approach has already been proposed and successfully used in the SDM literature (Warton et al., 2013; Stolar and Nielsen, 2015). The predictive effectiveness of the models was then assessed using the INPN count data as a validation set. We used two evaluation metrics: (1) the true skill statistic (TSS; Allouche et al., 2006), and (2) the accuracy on the 10% densest quadrats (A10DQ; see Appendix 2 for more details). The TSS is the sum of the sensitivity and the specificity minus one, computed by comparing the SDM‐predicted presences/absences of a species with the reference (the INPN data set). It is a meaningful measure of the model's ability to detect presences while simultaneously minimizing false positives. It is computed by binarizing the continuous SDM prediction at the threshold that maximizes the TSS. We chose the A10DQ as a complementary metric because it evaluates the accuracy of the models in predicting the quadrats with the highest abundance (INPN count), which is an especially interesting property from the perspective of invasive species management.
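The two metrics can be sketched in a few lines. The TSS function follows the definition given above (sensitivity + specificity − 1, maximized over binarization thresholds). The exact definition of A10DQ is in Appendix 2, which is not reproduced here, so the version below is an assumed reading: the share of the 10% densest reference quadrats that also fall in the model's top 10%. Function names and toy values are hypothetical.

```python
import numpy as np

def max_tss(scores, presence):
    """Return (best TSS, threshold) over all candidate binarization thresholds."""
    scores = np.asarray(scores, dtype=float)
    presence = np.asarray(presence, dtype=bool)
    best_tss, best_thr = -1.0, None
    for thr in np.unique(scores):
        pred = scores >= thr
        sensitivity = (pred & presence).sum() / presence.sum()
        specificity = (~pred & ~presence).sum() / (~presence).sum()
        tss = sensitivity + specificity - 1.0
        if tss > best_tss:
            best_tss, best_thr = tss, float(thr)
    return best_tss, best_thr

def a10dq(scores, counts, fraction=0.10):
    """Assumed reading of A10DQ: overlap between the top `fraction` of
    quadrats by reference count and the top `fraction` by model score."""
    scores = np.asarray(scores, dtype=float)
    counts = np.asarray(counts, dtype=float)
    n_top = max(1, int(round(fraction * len(counts))))
    top_ref = set(np.argsort(-counts, kind="stable")[:n_top].tolist())
    top_model = set(np.argsort(-scores, kind="stable")[:n_top].tolist())
    return len(top_ref & top_model) / n_top

# Toy example: ten quadrats with model scores and reference counts.
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.7, 0.05, 0.6, 0.4, 0.15]
counts = [12, 0, 0, 3, 0, 8, 0, 0, 1, 0]
presence = [c > 0 for c in counts]

tss, thr = max_tss(scores, presence)
top10 = a10dq(scores, counts, fraction=0.10)
top20 = a10dq(scores, counts, fraction=0.20)
print(tss, thr, top10, top20)
```

Under this reading, a random ranking of quadrats yields an expected A10DQ equal to the chosen fraction, which matches the 10% random baseline quoted in the Results.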

Table 2

List and details of the environmental descriptors used in this study

Name	Description	Nature	Values^a	Local image
CHBIO_2	Mean monthly temp (max, min)	quanti.	[7.8, 21.0]	Yes
CHBIO_7	Temp. annual range	quanti.	[16.7, 42.0]	Yes
CHBIO_8	Mean temp. of wettest quarter	quanti.	[−14.2, 23.0]	Yes
CHBIO_9	Mean temp. of driest quarter	quanti.	[−17.7, 26.5]	Yes
CHBIO_10	Mean temp. of warmest quarter	quanti.	[−2.8, 26.5]	Yes
CHBIO_11	Mean temp. of coldest quarter	quanti.	[−17.7, 11.8]	Yes
CHBIO_13	Precip. of wettest month	quanti.	[43.0, 285.5]	Yes
CHBIO_14	Precip. of driest month	quanti.	[3.0, 135.6]	Yes
CHBIO_15	Precip. seasonality (CV)	quanti.	[8.2, 26.5]	Yes
CHBIO_18	Precip. of warmest quarter	quanti.	[19.8, 851.7]	Yes
CHBIO_19	Precip. of coldest quarter	quanti.	[60.5, 520.4]	Yes
etp	Potential evapotranspiration	quanti.	[133, 1176]	Yes
alti	Elevation	quanti.	[−188, 4672]	Yes
shade	Shade level	quanti.	[0, 1]	No
slope	Ground slope	quanti.	[0, 13457]	No
dmer	Distance to coastline	quanti.	[|0, 32767|]	No
droute	Distance to roads	quanti.	[|0, 32767|]	No
proxi_eau	<50 m to fresh water	bool.	{0, 1}	Yes
awc_top	Topsoil available water capacity	ordinal	{0, 120, 165, 210}	Yes
bs_top	Base saturation of the topsoil	ordinal	{35, 62, 85}	Yes
cec_top	Topsoil cation exchange capacity	ordinal	{7, 22, 50}	Yes
crusting	Soil crusting class	ordinal	[|0, 5|]	Yes
dgh	Depth to a gleyed horizon	ordinal	{20, 60, 140}	Yes
dimp	Depth to an impermeable layer	ordinal	{60, 100}	Yes
erodi	Soil erodibility class	ordinal	[|0, 5|]	Yes
oc_top	Topsoil organic carbon content	ordinal	{1, 2, 4, 8}	Yes
pd_top	Topsoil packing density	ordinal	{1, 2}	Yes
text	Dominant surface textural class	ordinal	[|0, 5|]	Yes
clc	Ground occupation	categ.	[|1, 48|]	Yes

bool. = Boolean data; categ. = categorical data; CV = coefficient of variation of monthly precipitation; quanti. = quantitative data.

^a Data presented in curly brackets ({ }) contain the list of all possible values of the variable, i.e., a discrete set; square brackets ([ ]) indicate the continuous range of values that the variable can take, i.e., a continuous interval; square brackets with vertical lines ([| |]) indicate the range of integers between the two bounds given, i.e., a discrete interval.


RESULTS

Figure 3 displays the evaluation metrics as a function of the confidence threshold pmin(kmax|x) applied to filter the automatic predictions. We found that the influence of the confidence threshold varied among species, but there was an overall trend represented by the average curve (Fig. 3, black solid line). Thresholds that were too low did not filter identification errors sufficiently, so the model was biased by too many irrelevant occurrences. A threshold that was too high (above 70%) also degraded model performance (in particular, the accuracy on the quadrat cells with the highest counts; see Fig. 3) because the number of retained occurrences in the training set decreased significantly with increasing threshold. Models based on too few occurrences could not provide a relevant prediction of species distribution. With the current Pl@ntNet data, the chosen species, and the variables, a confidence threshold of 70% represented a good compromise for SDM. It filtered identification errors effectively for most species while retaining enough occurrences for model training. The most problematic species was Reynoutria japonica: it had very poor TSS for all thresholds (a TSS score of 0 would be a random prediction of presence and absence), indicating that the SDM did not distinguish presence and absence zones very well. This species is the most widespread, which leads to poor SDM performance. Nevertheless, for the best threshold, A10DQ showed that 20% of the densest INPN quadrats were predicted by the model fitted on Pl@ntNet, which is significantly better than a random ranking of quadrats (which would give an average of 10% and a standard deviation of 1.3%). Consequently, the model could capture information on the distribution of Reynoutria from the Pl@ntNet data. Conversely, very good results were obtained for both metrics for Opuntia ficus‐indica and Carpobrotus edulis.
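The random baseline quoted above (average of 10%, standard deviation of about 1.3%) can be checked with a short calculation. The sketch below assumes that a random ranking amounts to drawing the model's top 10% of quadrats uniformly at random from the 5175 grid cells, so that the number of hits follows a hypergeometric distribution; whether the authors used exactly this setting is not stated in the text.

```python
import math

def random_a10dq_baseline(n_quadrats, fraction=0.10):
    """Mean and standard deviation of the A10DQ score under a uniformly
    random ranking, using the hypergeometric distribution."""
    n_top = int(n_quadrats * fraction)   # size of each "top" set
    p = n_top / n_quadrats               # chance that a random pick is a dense quadrat
    mean_hits = n_top * p
    var_hits = n_top * p * (1 - p) * (n_quadrats - n_top) / (n_quadrats - 1)
    # Divide by n_top to express hits as a share of the densest quadrats.
    return mean_hits / n_top, math.sqrt(var_hits) / n_top

mean, sd = random_a10dq_baseline(5175)
print(f"mean = {mean:.3f}, sd = {sd:.4f}")
```

With these assumptions the calculation gives a mean close to 0.10 and a standard deviation close to 0.0125, in line with the rounded figures reported above.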

Figure 3

Predictive effectiveness of the species distribution models trained on Pl@ntNet data as a function of the confidence threshold value pmin(kmax|x) showing accuracy on the 10% densest quadrats (A) and true skill statistics (TSS; conversion of prediction value into presence/absence with the threshold that maximizes TSS) (B).

Figure 4 further shows the distributions predicted for each species using pmin(kmax|x) = 70%. For comparison, we also displayed the expert count data of INPN, as well as the specificity and sensitivity of our model measured with that data (at TSS max). Most regions with high INPN counts were reasonably well predicted by the models. Accordingly, sensitivity values were generally accurate for most species. Nevertheless, there were also regions for which the Pl@ntNet model and INPN data disagreed; in these regions the Pl@ntNet model predicted high abundances but there were none or very few occurrences in the INPN data. The strongest disagreement occurred for Reynoutria japonica, i.e., the taxon for which the specificity was the lowest. Other false‐positive prediction regions included the west coast for Opuntia ficus‐indica and Carpobrotus edulis and the “Golfe du Lion” (arc on the southeast coast) for O. ficus‐indica and Erigeron karvinskianus.

Figure 4

Maps of species distribution models computed from Pl@ntNet data (based on pmin(kmax|x) = 70%) and of expert count data from the Inventaire National du Patrimoine Naturel (INPN). The sensitivity and specificity used for the computation of the true skill statistic (for pmin(kmax|x) = 70%) are provided for each species.

DISCUSSION

Visual inspection of Pl@ntNet observations occurring in such false‐positive regions revealed that for the vast majority such observations did not correspond to erroneous identifications (pmin(kmax|x) = 70% is a high enough threshold to remove noise efficiently). Rather, they corresponded to real occurrences that can be classified in three main categories (see Fig. 5 for examples of observations belonging to the different categories). The first category can be qualified as cultivated specimens, i.e., specimens planted and/or maintained by humans such as gardening plants, house plants, ornamental plants in city parks, etc. Most occurrences of Opuntia ficus‐indica on the west coast belonged to this category. A second category of observations could be qualified as casual invasive specimens, i.e., isolated specimens that often flourish close to human construction but that do not form self‐replacing populations. Cultivated and casual invasive specimens present in the observations reveal that the species is able to grow in a great diversity of habitats. These specimens should be identified, either to (1) filter them for model learning, (2) evaluate the correlation between species gardening intensity and its abundance in wild surroundings, or (3) learn more complex models that integrate dispersal mechanisms and quantify more precisely the importance of gardening intensity on the species’ capacity to colonize a region. To identify cultivated specimens, several options are possible: for example, learning models can be used to identify the context of the picture or the user can be asked to clarify the type of environment where the observation was made, especially when observations appear ambiguous. Apart from the issue of correctly predicting species occurrences in the wild, frequent occurrences of cultivated and casual invasive specimens in a region where there is no presence in the wild can reflect the risk of future invasion in the wild.

Figure 5

Pl@ntNet observations with a species prediction score of more than 70% for plants living in natural conditions or cultivated for ornamental purpose.

A last category of observations can be qualified as newly inventoried invasive specimens, i.e., non‐isolated specimens living in natural areas that have yet to be inventoried in the INPN data. Notably, the majority of occurrences of Carpobrotus edulis on the west coast belong to this category. Newly inventoried invasive specimens could provide an early warning for territory managers. For example, we found newly inventoried specimens of Reynoutria japonica in the Pl@ntNet data, and we suspect that poor performance of its SDM could reflect a negative bias in the evaluation metrics of this species. Typically, specimens occurring outside of presence areas identified by experts and not categorized as cultivated or casual invasive should be prioritized for expert validation.

In this study, our sampling effort correction approach was based on prior knowledge of sampling intensity in the Pl@ntNet data. We could not evaluate the errors related to the sampling effort bias without complementary systematic survey data. Nevertheless, the INPN data have their own heterogeneity in the spatial distribution of the sampling effort. These data were collected by independent regional conservatories, and variations in sampling by different workforces may have introduced regional heterogeneity. Furthermore, some zones are not surveyed by conservatories, typically cities in most cases, which tends to bias the Pl@ntNet model error in urban areas. The study of global sampling effort bias is crucial for exploiting presence‐only data collected without protocol. The spatially heterogeneous sampling effort is especially problematic when it is correlated with environmental variables impacting the species distribution. 
For example, the sampling effort is correlated with the distance to the coastline, which is also a variable influencing the abundance of Opuntia ficus‐indica, Erigeron karvinskianus, and Carpobrotus edulis. Because our bias correction method removes the distance to the coastline effect, it partially removes the ability of the model to capture this effect on the species distribution. When we included these variables in the predicted distribution of the three species (results not presented in this article), we found a much greater predicted abundance gradient toward the coast. However, the maps presented in Fig. 4 show that the model captured a part of the coastal effect through other variables that are correlated with the distance to coastline. The same problem will occur with other invasive species that tend to grow near roads as a result of constant perturbation or dispersal mechanisms. More generally, we note that the presence of invasive species is strongly influenced by human activity. It is also highly correlated with observational intensity in opportunistic presence‐only data. Thus, this category of species represents a major methodological challenge for improving SDM based on presence‐only data and represents a clear path for future research.
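The bias correction discussed above, and the side effect it has on genuine coastal gradients, can be illustrated with a toy log‐linear model. This is a simplified stand‐in, not the maxnet model used in the study: the coefficients, covariate values, and function names are invented for the example, and no feature expansion is applied.

```python
import numpy as np

def predict_log_intensity(X, coef, intercept=0.0):
    """Log-linear prediction: intercept + X @ coef."""
    return intercept + np.asarray(X, dtype=float) @ np.asarray(coef, dtype=float)

def predict_effort_corrected(X, coef, effort_cols, reference_values, intercept=0.0):
    """Predict after fixing the sampling-effort covariates to constant
    reference values, so they no longer vary across space."""
    X_fixed = np.array(X, dtype=float, copy=True)
    X_fixed[:, effort_cols] = reference_values
    return predict_log_intensity(X_fixed, coef, intercept)

# Hypothetical covariates: [temperature, distance_to_road, distance_to_coast].
X = np.array([
    [12.0, 0.1, 5.0],    # site near a road and near the coast
    [12.0, 3.0, 40.0],   # site with the same climate, farther inland
])
coef = np.array([0.2, -0.8, -0.05])  # made-up fitted coefficients

raw = predict_log_intensity(X, coef)
corrected = predict_effort_corrected(X, coef, effort_cols=[1, 2],
                                     reference_values=[0.0, 0.0])
print(raw, corrected)
```

In this toy case the two sites receive identical corrected predictions because they share the same climate. That is the intended removal of the sampling effect, but it also erases any real preference of the species for coastal sites, which is the limitation described in the text.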

CONCLUSIONS

This study is the first to evaluate the potential of automated identification of opportunistic plant observations for modeling species distributions. The described methodology allowed us to analyze the potential usefulness of the Pl@ntNet data. By comparing SDMs trained on Pl@ntNet unvalidated observations with validated independent count data on a large spatial scale, we found that the data are rich enough to be used for SDM with only a single year of data collection. However, we also showed that distributions reported from Pl@ntNet data do not precisely match those of expert data. The main reasons for these deviations appear to be the presence of cultivated or casual invasive specimens in the data set, the detection of real presence in new areas, and the limits of the sampling bias correction method. Noticing these limits allowed us to underline significant research challenges for SDMs and to provide possible methods to usefully integrate information provided by opportunistic citizen science observations into conservation management.

7 in total

1. Automated species identification: why not?

Authors: Kevin J Gaston; Mark A O'Neill
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2004-04-29 Impact factor: 6.237

2. Capitalizing on opportunistic data for monitoring relative abundances of species.

Authors: Christophe Giraud; Clément Calenge; Camille Coron; Romain Julliard
Journal: Biometrics Date: 2015-10-23 Impact factor: 2.571

3. Computer vision cracks the leaf code.

Authors: Peter Wilf; Shengping Zhang; Sharat Chikkerur; Stefan A Little; Scott L Wing; Thomas Serre
Journal: Proc Natl Acad Sci U S A Date: 2016-03-07 Impact factor: 11.205

4. Climatologies at high resolution for the earth's land surface areas.

Authors: Dirk Nikolaus Karger; Olaf Conrad; Jürgen Böhner; Tobias Kawohl; Holger Kreft; Rodrigo Wilber Soria-Auza; Niklaus E Zimmermann; H Peter Linder; Michael Kessler
Journal: Sci Data Date: 2017-09-05 Impact factor: 6.444

5. Finite-Sample Equivalence in Statistical Models for Presence-Only Data.

Authors: William Fithian; Trevor Hastie
Journal: Ann Appl Stat Date: 2013-12-01 Impact factor: 2.083

6. Going deeper in the automated identification of Herbarium specimens.

Authors: Jose Carranza-Rojas; Herve Goeau; Pierre Bonnet; Erick Mata-Montero; Alexis Joly
Journal: BMC Evol Biol Date: 2017-08-11 Impact factor: 3.260

7. Model-based control of observer bias for the analysis of presence-only data in ecology.

Authors: David I Warton; Ian W Renner; Daniel Ramp
Journal: PLoS One Date: 2013-11-18 Impact factor: 3.240
