Mélanie Courtot, Dipayan Gupta, Isuru Liyanage, Fuqi Xu, Tony Burdett.
Abstract
The BioSamples database at EMBL-EBI is the central institutional repository for sample metadata storage and connection to EMBL-EBI archives and other resources. The technical improvements to our infrastructure described in our last update have enabled us to scale and accommodate an increasing number of communities, resulting in a higher number of submissions and more heterogeneous data. The BioSamples database now has a valuable set of features and processes to improve data quality, in particular by enriching metadata content and following FAIR principles. In this manuscript, we describe how BioSamples in 2021 handles requirements from our community of users through exemplar use cases: how increased findability of samples and improved data management practices support the goals of the ReSOLUTE project, how the plant community benefits from being able to link genotypic to phenotypic information, and how cumulatively those improvements contribute to more complex multi-omics data integration supporting COVID-19 research. Finally, we present the underlying technical features used as pillars throughout those use cases and how they are reused for expanded engagement with communities such as FAIRplus and the Global Alliance for Genomics and Health. Availability: The BioSamples database is freely available at http://www.ebi.ac.uk/biosamples. Content is distributed under the EMBL-EBI Terms of Use available at https://www.ebi.ac.uk/about/terms-of-use. The BioSamples code is available at https://github.com/EBIBioSamples/biosamples-v4 and distributed under the Apache 2.0 license.
Year: 2022 PMID: 34747489 PMCID: PMC8728232 DOI: 10.1093/nar/gkab1046
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The BioSamples layer cake of FAIRification. Technical pillars underpin three levels of data management supporting different communities of users described in subsequent sections.
Manual curation applied to specific samples and exported to ZOOMA. All future samples with similar attribute values will be annotated automatically through the ZOOMA pipeline.
| BioSamples ID | Attribute name | Attribute value | Ontology annotation |
|---|---|---|---|
| SAMN14168014 | strain | SARS-CoV-2 | |
| SAMN14450688 | host disease | COVID19 | |
| SAMN14428242 | isolation source | nasal swab | |
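The curate-once, annotate-many idea described in the caption above can be sketched as a simple lookup. This is only an illustration under stated assumptions: it does not call the real ZOOMA service, the names `CURATED_RULES`, `normalise` and `annotate` are hypothetical, and the ontology term identifiers are placeholders because the annotation column did not survive in the table.

```python
# Hypothetical curated rules: (attribute name, attribute value) -> ontology term.
# The "EXAMPLE:..." identifiers are placeholders, NOT real ontology annotations.
CURATED_RULES = {
    ("strain", "sars-cov-2"): "EXAMPLE:0001",
    ("host disease", "covid19"): "EXAMPLE:0002",
    ("isolation source", "nasal swab"): "EXAMPLE:0003",
}


def normalise(text):
    """Lower-case and collapse whitespace so similar values match one rule."""
    return " ".join(text.lower().split())


def annotate(sample_attributes):
    """Return {attribute name: ontology term, or None if no curated rule matches}."""
    annotations = {}
    for name, value in sample_attributes.items():
        key = (normalise(name), normalise(value))
        annotations[name] = CURATED_RULES.get(key)
    return annotations


# A later submission with similar attribute values picks up the curated terms.
new_sample = {"Host Disease": "COVID19", "isolation source": "Nasal  swab"}
print(annotate(new_sample))
```

Exact matching after normalisation is the simplest possible notion of "similar" and is an assumption of this sketch; the matching used by the actual pipeline may be more permissive.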
Merged attributes in the curation process. Attribute 1 refers to the most popular or selected attribute. In the merging process, attribute 2 is replaced by attribute 1.
| Attribute 1 | Attribute 2 | Description |
|---|---|---|
| aluminium | Al | Short forms |
| disease | illness | Synonym |
| collection time | collection timestamp | |
| environmental feature | environmetal feature | Spelling mistakes |
| host disease status | host disease stautus | |
| cell surface marker | cell surface markers | Inflections |
| samp size | sample size | |
| age years | age yrs | Different units and formatting |
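The merging step in the table above amounts to renaming each "attribute 2" to its "attribute 1". The following is a minimal sketch of that idea, not the BioSamples implementation: `MERGE_MAP` and `merge_attributes` are hypothetical names, the pairs are copied from the table as it appears, and collecting values into lists when both names occur on one sample is an assumption made here to avoid losing data.

```python
# Attribute 2 -> attribute 1, copied from the table as it appears.
MERGE_MAP = {
    "Al": "aluminium",
    "illness": "disease",
    "collection timestamp": "collection time",
    "environmetal feature": "environmental feature",
    "host disease stautus": "host disease status",
    "cell surface markers": "cell surface marker",
    "sample size": "samp size",
    "age yrs": "age years",
}


def merge_attributes(attributes):
    """Rename merged attribute names; keep every distinct value in a list."""
    merged = {}
    for name, value in attributes.items():
        canonical = MERGE_MAP.get(name, name)  # unknown names pass through unchanged
        values = merged.setdefault(canonical, [])
        if value not in values:
            values.append(value)
    return merged


sample = {"illness": "COVID-19", "age yrs": "42", "organism": "Homo sapiens"}
print(merge_attributes(sample))
```

Keeping the mapping as plain data means a curator could extend it without touching the merge logic, which is one reasonable design under these assumptions.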
Figure 2. Sample relationships and inter-archival relationships. A donor patient sample (top middle) is hosted in EGA under controlled access for privacy and confidentiality. A tissue sample (top left) is generated from that donor, and its information is hosted in EGA as well. Both samples are assigned BioSamples IDs upon submission. Metadata attributes that can be made public are then imported by BioSamples, where that metadata can be linked to the corresponding viral sample (top left) in BioSamples, whose sequencing data is hosted by ENA and linked to the sample.