Theodor Sperlea, Stefan Füser, Jens Boenigk, Dominik Heider.
Abstract
BACKGROUND: Microbes are essential components of all ecosystems because they drive many biochemical processes and act as primary producers. In freshwater ecosystems, the biodiversity and composition of microbial communities can be used as indicators of environmental quality. Recently, some environmental features that influence microbial ecosystems have been identified. However, the impact of human action on lake microbiomes is not well understood, in part because environmental data, although theoretically accessible, are not easily available.
Keywords: Data enrichment; Database; Ecology; GPS; Microbial ecology
Year: 2018 PMID: 30497363 PMCID: PMC6266930 DOI: 10.1186/s12859-018-2419-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Sample workflow for the use of SEDE-GPS. Based on user-defined GPS positions, SEDE-GPS queries a set of modules and returns all relevant data. This data can then be used in analyses of any geo-located process. Because the dataset contains a very large number of features after data enrichment with SEDE-GPS, we recommend including a feature selection step before using the gathered data for model construction, e.g., based on machine learning. Data sources are represented by their respective logos, which were taken from Wikimedia (https://commons.wikimedia.org/wiki/Main_Page)
Table 1 Modules and their subfields currently available in SEDE-GPS
| Module | Subfields | Additional Input | Data Processing | No. of features | Runtime (ms) |
|---|---|---|---|---|---|
| OSM Land Use | - | Radius | Pixel decompression | 20 | 24823 ± 2421 |
| OSM POIs | Craft | Radius | Bounding boxes | 7 | 3229 ± 342 |
| OSM POIs | Leisure | Radius | Bounding boxes | 15 | 7202 ± 622 |
| OSM POIs | Powerplants | Radius | Bounding boxes | 11 | 5053 ± 503 |
| OSM POIs | Special buildings | Radius | Bounding boxes | 13 | 6881 ± 453 |
| OSM POIs | Tourism | Radius | Bounding boxes | 8 | 3096 ± 382 |
| OSM POIs | Transport | Radius | Bounding boxes | 13 | 6951 ± 496 |
| OSM POIs | Urban | Radius | Bounding boxes | 6 | 2402 ± 401 |
| CDC | Average of the day | Date | | 4 | <1 |
| CDC | Average of the month | Date | | 4 | 2 ± 0 |
| CDC | Average of the year | Date | | 4 | 211 ± 0 |
| Eurostat | Agriculture | | | 721 | 711 ± 80 |
| Eurostat | Business Demography | | | 778 | 1467 ± 83 |
| Eurostat | Crime Statistics | | | 4 | 16 ± 4 |
| Eurostat | Demography | | | 15077 | 2611 ± 79 |
| Eurostat | Economic Accounts | | | 67 | 431 ± 41 |
| Eurostat | Education Stat. | | | 30 | 31 ± 5 |
| Eurostat | Labour Market Stat. | | | 99 | 172 ± 17 |
| Eurostat | Science & Technology | | | 644 | 3718 ± 400 |
| Eurostat | Tourism Stat. | | | 44 | 163 ± 11 |
| Eurostat | Transport | | | 59 | 13383 ± 224 |
| | - | Radius | | 1 | 1014 ± 316 |
| Total | | | | 17629 | 83567 |
Runtime means and standard deviation were calculated from ten measurements
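Several modules in the table above take a radius as additional input and are processed via bounding boxes. As a rough illustration of what such a step can look like, the sketch below converts a GPS position and a radius into a latitude/longitude bounding box. It is not taken from SEDE-GPS: the function name `bounding_box`, the spherical-Earth approximation, and the example coordinates are assumptions made for this example.

```python
import math

# Assumption: spherical Earth with the mean radius; adequate for radii of a few km.
EARTH_RADIUS_KM = 6371.0


def bounding_box(lat, lon, radius_km):
    """Return (south, west, north, east) in degrees for a box that encloses
    a circle of radius_km around (lat, lon).

    Hypothetical helper; not valid near the poles or across the antimeridian.
    """
    # Angular size of the radius along a meridian.
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    # Parallels shrink with cos(latitude), so the longitude span grows.
    dlon = math.degrees(
        radius_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat)))
    )
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)


# Example: a 5 km box around an arbitrary position in the Alps.
south, west, north, east = bounding_box(47.5, 13.5, 5.0)
```

A box like this could then be passed to a map query to collect points of interest; any features counted inside the box but outside the circle would need an extra distance filter.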
Table 2 Performance (R² values) of machine learning models trained to predict alpha diversity from SEDE-GPS output
| Dataset | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Euk Chao1 | 0.292 | 0.003 | 0.713 | 0.980 | 0.0415 | 0.214 | 0.631 | 0.518 | 0.496 | 0.999 |
| Euk Shannon | 0.228 | 0.0167 | 0.791 | 0.993 | 0.000 | 0.180 | 0.635 | 0.582 | 0.680 | 1.000 |
| Euk Simpson_e | 0.277 | 0.0146 | 0.556 | 0.976 | 0.107 | 0.238 | 0.671 | 0.559 | 0.546 | 0.980 |
| Euk Simpson | 0.150 | 0.001 | 0.742 | 0.906 | 0.014 | 0.090 | 0.545 | 0.346 | 0.432 | 0.995 |
| Prok Chao1 | 0.768 | 0.461 | 0.832 | 0.991 | 0.0695 | 0.420 | 0.635 | 0.915 | 0.955 | 0.979 |
| Prok Shannon | 0.527 | 0.011 | 0.940 | 0.991 | 0.172 | 0.538 | 0.626 | 0.930 | 0.993 | 0.999 |
| Prok Simpson_e | 0.345 | 0.128 | 0.849 | 0.991 | 0.035 | 0.304 | 0.622 | 0.937 | 0.840 | 0.999 |
| Prok Simpson | 0.459 | 0.008 | 0.915 | 0.986 | 0.168 | 0.453 | 0.627 | 0.904 | 0.880 | 0.991 |
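The datasets in the table above are named after four alpha diversity metrics (Chao1, Shannon, Simpson, Simpson evenness). For readers unfamiliar with them, the following sketch computes common textbook variants from a vector of per-taxon read counts. The exact definitions used in the study are not given here, so the choices below (natural logarithm for Shannon, Simpson as 1 − Σp², evenness as (1/Σp²)/S, bias-corrected Chao1) are assumptions.

```python
import math


def shannon(counts):
    """Shannon entropy H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Simpson diversity, here taken as 1 - sum(p^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def simpson_evenness(counts):
    """Simpson evenness, here taken as (1 / sum(p^2)) / observed richness."""
    total = sum(counts)
    observed = sum(1 for c in counts if c > 0)
    dominance = sum((c / total) ** 2 for c in counts)
    return (1.0 / dominance) / observed


def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + F1*(F1-1) / (2*(F2+1)),
    where F1 and F2 are the numbers of singletons and doubletons."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2.0 * (doubletons + 1))


# Example: four taxa, two of them rare.
example_counts = [5, 1, 1, 2]
richness_estimate = chao1(example_counts)
```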
Fig. 2 Performance of machine learning models predicting microbial lake alpha diversity based on the output of SEDE-GPS. Stars represent the performance of models trained on the respective dataset; box plots represent confidence intervals of R² values gathered from the respective model. Models were trained on the output of SEDE-GPS after feature selection and evaluated using LOOCV ("Methods" section). Only results for the four best-performing models are shown; for the others, see Table 2
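To make the evaluation scheme concrete, the sketch below runs leave-one-out cross-validation (LOOCV) and computes R² from the held-out predictions. It uses a one-feature least-squares line purely as a stand-in model; the models actually compared in the study are not specified here, and the function names `fit_line` and `loocv_r2` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.
    Assumes the xs are not all identical."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loocv_r2(xs, ys):
    """Hold out each sample once, fit on the rest, predict the held-out
    sample, then compute R^2 = 1 - SS_res / SS_tot over all predictions."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[i] + intercept)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Toy data: a noisy linear relationship between one feature and a target.
feature = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
target = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
r2 = loocv_r2(feature, target)
```

Because every prediction comes from a model that never saw the sample it predicts, an R² computed this way can be negative for a poorly generalizing model, unlike R² on the training data.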
Fig. 3 Stability of feature lists over LOOCV iterations. Jaccard distances and Kendall's τ were calculated for pairs of feature lists for the 50 most important features of each dataset. Dots and error bars represent average values and standard deviations, respectively. At maximum distance, the Jaccard distance and Kendall's τ would assume values of 1 and −1, respectively. The feature lists of both groups are rather stable; however, those of the Prokaryote datasets are more stable than their Eukaryote counterparts
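The two stability measures named in the caption can be sketched as follows. Jaccard distance compares which features appear in two lists, while Kendall's τ compares their ordering. How the study handled features present in only one of the two lists is not stated here; this sketch assumes τ is computed over the shared features only, with no ties and no duplicate entries, and the feature names are made up.

```python
from itertools import combinations


def jaccard_distance(list_a, list_b):
    """1 - |A ∩ B| / |A ∪ B|; 0 for identical sets, 1 for disjoint sets."""
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def kendall_tau(ranked_a, ranked_b):
    """Kendall's tau-a over features that occur in both ranked lists.
    Returns 1 for identical order and -1 for fully reversed order."""
    rank_b = {feature: i for i, feature in enumerate(ranked_b)}
    shared = [feature for feature in ranked_a if feature in rank_b]
    pairs = list(combinations(shared, 2))
    if not pairs:
        raise ValueError("need at least two shared features")
    score = 0
    for first, second in pairs:
        # `first` precedes `second` in ranked_a by construction.
        score += 1 if rank_b[first] < rank_b[second] else -1
    return score / len(pairs)


# Hypothetical top-feature lists from two LOOCV iterations.
iteration_1 = ["forest_5km", "main_street", "climate", "industrial_area"]
iteration_2 = ["forest_5km", "climate", "main_street", "parking"]
distance = jaccard_distance(iteration_1, iteration_2)
tau = kendall_tau(iteration_1, iteration_2)
```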
Table 3 Features with the highest weights for prediction of different alpha diversity metrics for Prokaryotes and Eukaryotes in Austrian lakes
| Prokaryotes | | | |
|---|---|---|---|
| Chao1 | Shannon Entropy | Simpson Diversity | Simpson Evenness |
| Industrial Area, Villages, Street (2-5 km) | Forests (5km) | Forests (5km) | Forests (5km) |
| Forests (5km) | Main street (small), married people | Forests | Main street (small), married people |
| Climate, Demography, City Structures | Forests (2km) | Buildings, Highways, Water, Parking, Parks | Forests (1km) |
| Climate, Demography, City Structures | Climate, Demography, City Structures | Forests (1km) | Buildings, Highways, Water, Parking, Parks |
| Main street (small), married people | Green space, small villages, Industrial area | Mining, main streets | Mining, main streets |
| Eukaryotes | | | |
| Chao1 | Shannon Entropy | Simpson Diversity | Simpson Evenness |
| Forests | Main streets | Main streets | Economy (parking, GDP, Agrarian structures), Population |
| Family Demography | Beach & Water | Beach & Water | Economy (parking, GDP, Agrarian structures), Population |
| Climate, Demography, City Structures | Picnic Site (5km) | Economy (parking, GDP, Agrarian structures), Population | Beach & Water |
| Altitude, Climate, Demography, City Structures | Highway Pull-ins | Towns | Towns |
| Climate, Demography, City Structures | Urban regions, Av. Temperature, Parks | Urban regions, Av. Temp., Parks | Highway Pull-ins |
For features in bold, a linear regression shows a positive relationship with the respective target variable
Fig. 4 Decline of average importance of features over the 25 highest ranked features. Feature weights were calculated using EFS and averaged over the LOOCV iterations. Ribbons indicate standard deviation. Average importance values were normalized so that the first feature has an average weight of 1. For all datasets except Euk Simpson, feature weights fall below 0.5 after the twelfth highest-weighted feature
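The averaging and normalization described in the caption can be illustrated with a short sketch. It assumes that each LOOCV iteration yields a mapping from feature name to weight and that a feature missing from an iteration counts as weight 0; both are assumptions for this example, as are the function name and the toy weights.

```python
import statistics


def normalized_importance(weights_per_iteration):
    """Average each feature's weight over all iterations, rank features by
    that average, and rescale so the top-ranked feature has weight 1.

    weights_per_iteration: list of dicts {feature_name: weight}.
    Assumes at least one iteration and a positive top weight.
    """
    if not weights_per_iteration:
        raise ValueError("no iterations given")
    features = set().union(*weights_per_iteration)
    averaged = {
        feature: statistics.mean(
            iteration.get(feature, 0.0) for iteration in weights_per_iteration
        )
        for feature in features
    }
    ranked = sorted(averaged.items(), key=lambda item: item[1], reverse=True)
    top_weight = ranked[0][1]
    return [(feature, weight / top_weight) for feature, weight in ranked]


# Hypothetical weights from two LOOCV iterations.
iterations = [
    {"forest_5km": 0.8, "main_street": 0.4},
    {"forest_5km": 0.6, "main_street": 0.2},
]
ranking = normalized_importance(iterations)
```

Dividing by the top weight makes curves from different datasets comparable on one axis, which is what allows a common threshold such as 0.5 to be read off the figure.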