Giulia Agostinetto, Davide Bozzi, Danilo Porro, Maurizio Casiraghi, Massimo Labra, Antonia Bruno.
Abstract
Large amounts of data from microbiome-related studies have been (and are currently being) deposited in international public databases. These datasets represent a valuable resource for the microbiome research community and could serve future researchers interested in integrating multiple datasets into powerful meta-analyses. However, this huge amount of data lacks harmonization and is far from being exploited to its full potential to build a foundation that places microbiome research at the nexus of many subdisciplines within and beyond biology. There is thus an urgent need for data accessibility and reusability, according to the findable, accessible, interoperable and reusable (FAIR) principles, as supported by the National Microbiome Data Collaborative and FAIR Microbiome. To tackle the challenge of accelerating discovery and advances in skin microbiome research, we collected, integrated and organized existing microbiome data resources from human skin 16S rRNA amplicon-sequencing experiments. We generated a comprehensive collection of datasets, enriched in metadata, and organized this information into data frames ready to be integrated into microbiome research projects and advanced post-processing analyses, such as data science applications (e.g. machine learning). Furthermore, we created a data retrieval and curation framework built on three stages to maximize the retrieval of datasets and their associated metadata. Lastly, we highlighted some caveats regarding metadata retrieval and suggested ways to improve future metadata submissions. Overall, our work resulted in a curated collection of skin microbiome datasets accompanied by a state-of-the-art analysis of the last 10 years of the skin microbiome field. Database URL: https://github.com/giuliaago/SKIOMEMetadataRetrieval
Year: 2022 PMID: 35576001 PMCID: PMC9216470 DOI: 10.1093/database/baac033
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 4.462
Figure 1. Schematic representation of the three-step framework adopted in the study to collect datasets and metadata and generate three differently curated data frames.
Figure 2. (A) Number of studies and samples from Data Frame 3 released every year starting from 2012. (B) Comparison of the number of samples released each year for the three Data Frames. Data Frames 1 and 2 contain samples starting from 2008, whereas Data Frame 3 contains samples only from 2012. (C) Distribution of the variable ‘sex’ in the three Data Frames. In all three cases, the majority of the samples do not have this information reported. (D) The number of Taxon ID/Scientific names used in the three Data Frames (barplot) and relative abundance (as a logarithm) of the Taxon ID/Scientific names used for the samples in Data Frame 3 (pie chart). (E–H) Comparison of the median number of spots (E), bases (F), average read length (G) and insert size (H) in the three Data Frames. (I) Read length distribution in the three Data Frames. (J) Distribution of the insert size in the three Data Frames.
Figure 3. (A) Number of samples and studies that used specific 16S rRNA hypervariable regions in Data Frame 3. (B) The number of studies and samples for each disease/condition investigated in Data Frame 3. (C–E) Frequency of use of the different sequencing platforms (C), clustering methods (D) and taxonomic databases (E) in Data Frame 3. (F) Table showing the WoS research areas and Scopus research subjects that described the scientific journals in which the studies of Data Frame 3 were published. The research areas/subjects are divided into three boxes depending on how often they were associated with the Scopus research subject ‘Medicine’. From left to right, the boxes show the research areas/subjects that were always (left), sometimes (center) and never (right) associated with the Scopus research subject ‘Medicine’. (G) Geographical distribution of the studies included in Data Frame 3.
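As an illustration of the kind of metadata completeness check summarized in Figure 2C, the following minimal Python sketch tallies the values of a 'sex' field and reports the share of samples with no usable value. It is not the authors' code: the column names (`run_accession`, `sex`, `release_year`), the list of placeholder strings treated as missing, and the small inline table are all assumptions made for this example and do not come from the SKIOME repository.

```python
import csv
import io
from collections import Counter

# Placeholder strings treated as "not reported".
# This list is an assumption for the example, not taken from the study.
MISSING_TOKENS = {
    "", "na", "n/a", "none", "missing", "unknown",
    "not provided", "not applicable", "not collected",
}


def normalize(value):
    """Lower-case a metadata value and map placeholders to 'not reported'."""
    text = (value or "").strip().lower()
    return "not reported" if text in MISSING_TOKENS else text


def field_distribution(rows, field):
    """Count the normalized values of one metadata field across all samples."""
    return Counter(normalize(row.get(field)) for row in rows)


def missing_fraction(rows, field):
    """Return the fraction of samples whose field holds no usable value."""
    counts = field_distribution(rows, field)
    total = sum(counts.values())
    return counts["not reported"] / total if total else 0.0


# Made-up table standing in for one curated data frame (hypothetical columns).
SAMPLE_CSV = """run_accession,sex,release_year
RUN0001,female,2015
RUN0002,,2016
RUN0003,Male,2016
RUN0004,not provided,2018
RUN0005,missing,2019
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
sex_counts = field_distribution(rows, "sex")
sex_missing = missing_fraction(rows, "sex")

print(dict(sex_counts))
print(f"{sex_missing:.0%} of samples lack a usable 'sex' value")
```

To apply the same check to a real export, replace the inline table with a file opened through `csv.DictReader` and adjust the field name and placeholder list to match the actual metadata columns.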