Nicolas Matentzoglu, James P Balhoff, Susan M Bello, Chris Bizon, Matthew Brush, Tiffany J Callahan, Christopher G Chute, William D Duncan, Chris T Evelo, Davera Gabriel, John Graybeal, Alasdair Gray, Benjamin M Gyori, Melissa Haendel, Henriette Harmse, Nomi L Harris, Ian Harrow, Harshad B Hegde, Amelia L Hoyt, Charles T Hoyt, Dazhi Jiao, Ernesto Jiménez-Ruiz, Simon Jupp, Hyeongsik Kim, Sebastian Koehler, Thomas Liener, Qinqin Long, James Malone, James A McLaughlin, Julie A McMurry, Sierra Moxon, Monica C Munoz-Torres, David Osumi-Sutherland, James A Overton, Bjoern Peters, Tim Putman, Núria Queralt-Rosinach, Kent Shefchek, Harold Solbrig, Anne Thessen, Tania Tudorache, Nicole Vasilevsky, Alex H Wagner, Christopher J Mungall.
Abstract
Despite progress in the development of standards for describing and exchanging scientific information, the lack of easy-to-use standards for mapping between different representations of the same or similar objects in different databases poses a major impediment to data integration and interoperability. Mappings often lack the metadata needed to be correctly interpreted and applied. For example, are two terms equivalent or merely related? Are they narrow or broad matches? Or are they associated in some other way? Such relationships between the mapped terms are often not documented, which leads to incorrect assumptions and makes the mappings hard to use in scenarios that require a high degree of precision (such as diagnostics or risk prediction). Furthermore, the lack of descriptions of how mappings were done makes it hard to combine and reconcile mappings, particularly curated and automated ones. We have developed the Simple Standard for Sharing Ontological Mappings (SSSOM), which addresses these problems by: (i) introducing a machine-readable and extensible vocabulary to describe metadata that makes imprecision, inaccuracy and incompleteness in mappings explicit; (ii) defining a simple, easy-to-use table-based format that can be integrated into existing data science pipelines without the need to parse or query ontologies, and that integrates seamlessly with Linked Data principles; (iii) implementing open and community-driven collaborative workflows that are designed to evolve the standard continuously to address changing requirements and mapping practices; and (iv) providing reference tools and software libraries for working with the standard. In this paper, we present the SSSOM standard, describe several use cases in detail and survey some of the existing work on standardizing the exchange of mappings, with the goal of making mappings Findable, Accessible, Interoperable and Reusable (FAIR). The SSSOM specification can be found at http://w3id.org/sssom/spec.
Database URL: http://w3id.org/sssom/spec
Year: 2022 PMID: 35616100 PMCID: PMC9216545 DOI: 10.1093/database/baac035
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 4.462
Figure 1. Example of mappings between different identifiers representing statements about similarity or identity of concepts across resources and vocabularies. Even with this simplified example, it is possible to see a range of mapping types, and that providing information about each mapping is crucial to understanding the bigger picture. This information helps avoid errors such as mistakenly conflating two variants of a disease.
Desired features of a mapping standard, with examples of cases where the desired feature is met and examples where the desired feature is not met (negative examples)
| Feature | Why | Examples | Negative example |
|---|---|---|---|
| Explicit relationship types | Applications that demand highly accurate results require mapping relations with explicit precision and semantics | EC:2.2.1.2 exactMatch GO:0004801 (transaldolase activity) | Two-column file that maps FMA ‘limb’ to Uberon ‘limb’, hiding differences in species-specificity |
| Explicit confidence | Different use cases require different levels of confidence and accuracy | A mapping tool assigns a confidence score based on the amount of evidence that is explicitly recorded | Without the confidence score we cannot filter out automated mappings with low confidence |
| Provenance | Understanding how a mapping was created (e.g. automatically or by a human expert curator) is crucial to interpreting it | Mapping file of automated mappings with a link to the tool used; curated mapping file with curators’ ORCIDs provided | Two-column mapping file with no indication of how the mapping was made, and no supplementary metadata file |
| Explicit declaration of completeness | Must be able to distinguish between absence due to lack of information vs deliberate omission | Mapping file where rejected mappings are explicitly recorded | Mapping file where absence of a mapping can mean either that the mapping was explicitly rejected OR that it was not considered/reviewed |
| FAIR principles | Mappings should be Findable, Accessible, Interoperable and Reusable | Mapping file available on the web with clear licensing conditions, in standard format, with full metadata and a persistent identifier | Mapping files exchanged via email |
| Unambiguous identifiers | Mapping should make use of standard, globally unambiguous identifiers such as CURIEs or IRIs | Standard ontology CURIEs like UBERON:0002101 for entities, with prefixes registered in a registry or as part of the metadata | Identifiers are used without explicitly defined prefixes; mappings are created between strings rather than identifiers |
| Allows composability | Mappings from different sources should be combinable, and it should be possible to chain mappings together | Defined mapping predicates (relations) such that reasoning about chains A -> B -> C is possible (where allowed by semantics of the predicate) | Two mapping files with implicit or undefined relationships -> unclear whether these can be combined or composed |
| Follows Linked Data principles | Allows interoperation with semantic data tooling, facilitates data merging | All mapped entities have URIs, and metadata elements also have defined URIs; available in JSON-LD/RDF | No reuse of existing vocabularies for metadata or for relating mapped entities |
| Well-described data model | Allows interoperation and standard tooling | Data model provided in both human and machine-readable form | Ad hoc file format with unclear semantics |
| Tabular representation | Ease of curation and rapid analysis | A mapping available as a TSV that is directly usable in common data science frameworks; may complement a richer serialization | Ad hoc flat-file format requiring a custom parser |
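
Two of the features above, explicit confidence and composability, become easy to act on once each mapping carries its predicate and confidence as fields. The following is a minimal Python sketch under stated assumptions, not part of any SSSOM tooling: the `Mapping` tuple, the helpers `filter_by_confidence` and `compose_exact`, the identifiers `X:1`, `Y:2` and `Y:3`, and all confidence values are hypothetical. Chaining is restricted to skos:exactMatch, which SKOS declares transitive; assigning the chain the minimum of the two confidences is an illustrative choice only.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    """Hypothetical in-memory record for one mapping row."""
    subject_id: str
    predicate_id: str
    object_id: str
    confidence: float


def filter_by_confidence(mappings, threshold):
    """Keep only mappings whose confidence is at least the threshold."""
    return [m for m in mappings if m.confidence >= threshold]


def compose_exact(left, right):
    """Chain A -> B (left) and B -> C (right) into A -> C.

    Only skos:exactMatch links are chained; any other predicate is skipped
    because its semantics may not permit composition.
    """
    by_subject = {}
    for m in right:
        if m.predicate_id == "skos:exactMatch":
            by_subject.setdefault(m.subject_id, []).append(m)

    composed = []
    for a in left:
        if a.predicate_id != "skos:exactMatch":
            continue
        for b in by_subject.get(a.object_id, []):
            # Assumption: the chain is only as trustworthy as its weakest link.
            composed.append(
                Mapping(a.subject_id, "skos:exactMatch", b.object_id,
                        min(a.confidence, b.confidence))
            )
    return composed


# Invented data: only the EC/GO pair comes from the table above.
set_a = [
    Mapping("EC:2.2.1.2", "skos:exactMatch", "GO:0004801", 0.95),
    Mapping("X:1", "skos:relatedMatch", "GO:0004801", 0.90),
]
set_b = [
    Mapping("GO:0004801", "skos:exactMatch", "Y:2", 0.80),
    Mapping("GO:0004801", "skos:exactMatch", "Y:3", 0.40),
]

trusted_b = filter_by_confidence(set_b, 0.5)
chained = compose_exact(set_a, trusted_b)
```

In this sketch the relatedMatch row never contributes to a chain, and the low-confidence row is dropped before composition. Neither step would be possible with a two-column file that records no predicate or confidence.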
Figure 2. Example of basic SSSOM mapping model with some illustrative mapping metadata elements.
Recommended values of predicate_id capturing a broad range of use cases, drawn from SKOS vocabularies and from OWL
| Predicate | Description |
|---|---|
| owl:sameAs | The subject and the object are instances (OWL individuals), and the two instances are the same. |
| owl:equivalentClass | The subject and the object are classes (OWL class), and the two classes are the same. |
| owl:equivalentProperty | The subject and the object are properties (OWL object, data, annotation properties), and the two properties are the same. |
| rdfs:subClassOf | The subject and the object are classes (OWL class), and the subject is a subclass of the object. |
| rdfs:subPropertyOf | The subject and the object are properties (OWL object, data, annotation properties), and the subject is a subproperty of the object. |
| skos:relatedMatch | The subject and the object are associated in some unspecified way. |
| skos:closeMatch | The subject and the object are sufficiently similar that they can be used interchangeably in some information retrieval applications. |
| skos:exactMatch | The subject and the object can, with a high degree of confidence, be used interchangeably across a wide range of information retrieval applications. |
| skos:narrowMatch | The object of the triple is a narrower concept than the subject of the triple. |
| skos:broadMatch | The object of the triple is a broader concept than the subject of the triple. |
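
The direction of a predicate matters when a mapping is read in reverse. As a hedged illustration, the sketch below swaps subject and object for the predicates in the table above. It relies on the SKOS and OWL definitions (the match and equivalence predicates listed as symmetric, skos:narrowMatch and skos:broadMatch as inverses of each other). The function name `invert` and the identifiers `A:1` and `B:2` are hypothetical.

```python
# Predicates from the table that read the same in both directions.
SYMMETRIC = {
    "owl:sameAs",
    "owl:equivalentClass",
    "owl:equivalentProperty",
    "skos:relatedMatch",
    "skos:closeMatch",
    "skos:exactMatch",
}

# Predicates whose inverse is another predicate in the table.
INVERSE = {
    "skos:narrowMatch": "skos:broadMatch",
    "skos:broadMatch": "skos:narrowMatch",
}


def invert(subject_id, predicate_id, object_id):
    """Return the mapping with subject and object swapped.

    Returns None when no predicate in the table expresses the reverse
    direction (for example rdfs:subClassOf or rdfs:subPropertyOf).
    """
    if predicate_id in SYMMETRIC:
        return (object_id, predicate_id, subject_id)
    if predicate_id in INVERSE:
        return (object_id, INVERSE[predicate_id], subject_id)
    return None


# "A:1 broadMatch B:2" says the object B:2 is broader than A:1, so the
# reversed statement is "B:2 narrowMatch A:1".
flipped = invert("A:1", "skos:broadMatch", "B:2")
```

Returning `None` for the subclass and subproperty predicates is a deliberate choice in this sketch: silently reusing the same predicate in the reverse direction would assert the opposite hierarchy.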
Figure 3. An example SSSOM TSV table (generated by the developers of the environmental exposure ontology (19) using rdf-matcher (26)), with a table header (lines that start with #, shown in purple) that contains the mapping set metadata, followed by the mappings (27).
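
To illustrate how such a file can be consumed without ontology tooling, here is a minimal sketch that uses only the Python standard library to separate the `#`-prefixed header from the tab-separated mappings. The function `read_sssom_tsv`, the sample content, the example.org URL and the confidence value are made up for illustration. The column and metadata names other than `predicate_id` are assumptions and should be checked against the specification. The header is kept as raw text here, whereas a real pipeline would parse it with a proper metadata parser.

```python
import csv
import io


def read_sssom_tsv(text):
    """Split SSSOM-style TSV text into raw header lines and mapping rows."""
    header_lines = []
    body_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Drop the "#" marker and at most one following space, so any
            # deeper indentation inside the metadata block is preserved.
            content = line[1:]
            if content.startswith(" "):
                content = content[1:]
            header_lines.append(content)
        elif line.strip():
            body_lines.append(line)

    # The first non-comment line holds the column names.
    reader = csv.DictReader(io.StringIO("\n".join(body_lines)), delimiter="\t")
    return header_lines, list(reader)


# Invented sample; only the EC/GO pair appears in the tables above.
SAMPLE = (
    "# mapping_set_id: https://example.org/mappings/demo.sssom.tsv\n"
    "# license: https://creativecommons.org/publicdomain/zero/1.0/\n"
    "subject_id\tpredicate_id\tobject_id\tconfidence\n"
    "EC:2.2.1.2\tskos:exactMatch\tGO:0004801\t0.9\n"
)

header, rows = read_sssom_tsv(SAMPLE)
```

Each row comes back as a dictionary keyed by column name, so the mappings can be filtered or joined directly, while the header lines remain available for whatever metadata handling the application needs.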