Julie A McMurry, Nick Juty, Niklas Blomberg, Tony Burdett, Tom Conlin, Nathalie Conte, Mélanie Courtot, John Deck, Michel Dumontier, Donal K Fellows, Alejandra Gonzalez-Beltran, Philipp Gormanns, Jeffrey Grethe, Janna Hastings, Jean-Karim Hériché, Henning Hermjakob, Jon C Ison, Rafael C Jimenez, Simon Jupp, John Kunze, Camille Laibe, Nicolas Le Novère, James Malone, Maria Jesus Martin, Johanna R McEntyre, Chris Morris, Juha Muilu, Wolfgang Müller, Philippe Rocca-Serra, Susanna-Assunta Sansone, Murat Sariyar, Jacky L Snoep, Stian Soiland-Reyes, Natalie J Stanford, Neil Swainston, Nicole Washington, Alan R Williams, Sarala M Wimalaratne, Lilly M Winfree, Katherine Wolstencroft, Carole Goble, Christopher J Mungall, Melissa A Haendel, Helen Parkinson.
Abstract
In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline 10 lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers. We also outline the important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.
Year: 2017 PMID: 28662064 PMCID: PMC5490878 DOI: 10.1371/journal.pbio.2001414
Source DB: PubMed Journal: PLoS Biol ISSN: 1544-9173 Impact factor: 8.029
Fig 1. Anatomy of a web-based identifier.
An exemplary unique resource identifier (URI) is shown below; it is composed of American Standard Code for Information Interchange (ASCII) characters and follows a pattern that starts with a fixed set of characters (URI pattern). That URI pattern is followed by a local identifier (local ID): an identifier that, by itself, is guaranteed to be unique only within the database or source. A local ID is sometimes referred to as an “accession.” Note that this figure illustrates the simplest representation; nuances regarding versioning are covered in Lesson 6 and Fig 5.
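The anatomy described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `local_id_of` and the pattern `https://example.org/records/` are hypothetical and do not correspond to any real database.

```python
def local_id_of(uri: str, uri_pattern: str) -> str:
    """Return the local ID that follows a fixed URI pattern.

    Raises ValueError if the URI is not plain ASCII, does not start
    with the given pattern, or has nothing after the pattern.
    """
    if not uri.isascii():
        raise ValueError("URI contains non-ASCII characters")
    if not uri.startswith(uri_pattern):
        raise ValueError("URI does not start with the expected URI pattern")
    local_id = uri[len(uri_pattern):]
    if not local_id:
        raise ValueError("URI has no local ID after the URI pattern")
    return local_id


# Hypothetical URI pattern, used for illustration only.
EXAMPLE_PATTERN = "https://example.org/records/"

print(local_id_of(EXAMPLE_PATTERN + "ABC123", EXAMPLE_PATTERN))  # ABC123
```

Because the local ID is only locally unique, the pair (URI pattern, local ID) rather than the local ID alone is what identifies the record globally.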
Fig 5. Record-level versioning and release-level versioning.
Fig 2. A summary of the 10 recommendations and their direct or indirect impact on different kinds of identifier roles.
Desirable characteristics for database identifiers in the life sciences.
| Characteristics | Definition | General rationale/impact on data integration | Specific example of a possible ramification due to non-adherence |
|---|---|---|---|
| | One | Avoids collisions that result in integrating on the wrong entity. | A physician uses a wrong clinical guideline and makes a wrongful diagnosis because the info button within the clinical information system is linked to the wrong document. |
| | One entity should ideally be identified by no more than one URI. | (1) Eliminates the cost of maintaining public mappings between equivalent identifiers | A researcher fails to make a pathway discovery because she does not realize that |
| | The | Avoids | A researcher is unable to reproduce an experiment because the link to a record is dead. |
| | Identifier must NOT be reassigned to an altogether different entity, though the original entity may evolve provided a change history is documented. | Avoids integrating on the wrong entity. | Because a new chemical gets an old ID, a chemist uses the wrong chemical in a reaction. |
| | If the entity’s definition or essential metadata changes substantially, (Lesson 7) the | Avoids integrating on the wrong entity state (specified through version). | A given experiment is not reproducible because the specific build version of a gene sequence was not specified. |
| | The identifier must NOT be deleted (but may be deprecated). | Avoids | Information about a gene model is completely lost. |
| | The | Avoids the unnecessary proliferation of resolvable identifiers issued by third parties (for entities that are not resolvable and/or not identified in their native context) See also | A dozen different third-party providers mint identifiers for entities that are not actually under their control. Harmonization between these off-brand identifiers is painful. |
| | The | Avoids the need for special handling of edge cases when integrating data at scale. | Data integrators spend time cleaning identifiers and handling edge-cases instead of doing science. |
| | The total set of assignable identifiers for the | Facilitates validation and extraction from scientific text, thus the pattern should be as tightly specified as possible (see Lesson 3). | Identifiers cannot be validated and a provider may find it hard to assess their impact in the literature. |
| | The | Avoids potential points of failure due to malformed URL, XML, etc. | Use of the identifier produces malformed XML and/or requires special detection and encoding. |
| | The | Lowers barriers for data generators to deposit data. | Data generators become reluctant to deposit data in order to minimize costs. |
| | The | Enables integration on the basis of scientific merit, rather than on the restrictions of the license. | When there are license restrictions on the identifier and/or label (not just the content) it thwarts meaningful reuse and redistribution of whole datasets. |
| | The identifier scheme should be documented. | Encourages consistent use of existing identifiers by others and reduces the number of ways identifiers are represented. | Inconsistent informal approaches to referencing are difficult to harmonize post-hoc. By extension, impact is harder to assess. |
CURIE, compact uniform resource identifier; Local ID, local identifier; URI, unique resource identifier; URL, uniform resource locator; XML, extensible markup language.
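The table notes that a tightly specified identifier pattern facilitates validation and extraction from scientific text. A minimal sketch of that idea follows; the pattern (two uppercase letters followed by six digits) and the function names are assumptions invented for illustration, not the scheme of any real provider.

```python
import re

# Hypothetical local ID pattern: two uppercase letters, then six digits.
LOCAL_ID_PATTERN = r"[A-Z]{2}[0-9]{6}"

_FULL = re.compile(LOCAL_ID_PATTERN)
_IN_TEXT = re.compile(r"\b" + LOCAL_ID_PATTERN + r"\b")


def is_valid_local_id(candidate: str) -> bool:
    """True only if the whole string matches the documented pattern."""
    return _FULL.fullmatch(candidate) is not None


def extract_local_ids(text: str) -> list:
    """Find every substring of free text that matches the pattern."""
    return _IN_TEXT.findall(text)


print(is_valid_local_id("AB123456"))   # True
print(is_valid_local_id("AB12345"))    # False (too short)
print(extract_local_ids("We used AB123456 and CD000001, not AB12."))
```

The looser the pattern, the more false positives such extraction produces, which is one reason a tightly specified pattern is useful.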
Fig 3. Contributions and roles related to content as they correspond to identifier creation versus identifier reuse.
The decision about whether to create a new identifier or reuse an existing one depends on the role you play in the creation, editing, and republishing of content; for certain roles (and when several roles apply) that decision is a judgement call. Asterisks convey cases in which the best course of action is often to correct/improve the original record in collaboration with the original source; the guidance about identifier creation versus reuse is meant to apply only when such collaboration is not practicable (and an alternate record is created). A given actor commonly has multiple roles along this spectrum; for instance, a given record in monarchinitiative.org may reflect a combination of (a) corrections Monarch staff made in collaboration with the original data source, (b) post-ingest curation by Monarch staff, and (c) expanded content integrated from multiple sources.
Fig 4. Examples of provisioning resolvable Unique Resource Identifiers (URIs).
Compact URIs (CURIEs; Panel A), URIs (Panel B), and Access URLs (Panel C) with no redirection (the Zebrafish Identification Network [ZFIN]), in-house redirection (UniProt and Ensembl), and third-party resolvers (using identifiers.org and digital object identifiers [DOI]). In each case, the URI can be algorithmically derived from the CURIE because the local identifier (local ID) portion itself is included (unmodified) within the URI. Access URL design patterns differ substantially by provider and may change over time. As long as access URLs (and other ephemeral links) are not used as the referenced identifier, they can include prefix and colon (Mouse Genome Informatics [MGI]) or not (Ensembl), they may include the entire local ID (Biosample) or not (DOI), and they may include type (MGI) or not (ZFIN).
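The statement that a URI can be algorithmically derived from a CURIE can be illustrated with a short sketch. It assumes a prefix map from CURIE prefixes to URI patterns; the prefix `ex`, its URI pattern, and the function name `expand_curie` are hypothetical and chosen only for the example.

```python
# Hypothetical prefix map: CURIE prefix -> URI pattern.
PREFIX_MAP = {
    "ex": "https://example.org/records/",
}


def expand_curie(curie: str, prefix_map: dict) -> str:
    """Derive a URI by appending the unmodified local ID to the URI pattern."""
    prefix, colon, local_id = curie.partition(":")
    if not colon or not prefix or not local_id:
        raise ValueError("not a CURIE of the form prefix:local_id: " + repr(curie))
    if prefix not in prefix_map:
        raise KeyError("unknown CURIE prefix: " + repr(prefix))
    return prefix_map[prefix] + local_id


print(expand_curie("ex:ABC123", PREFIX_MAP))
# https://example.org/records/ABC123
```

The derivation stays this simple only because the local ID is carried over unmodified; if a provider altered the local ID inside the URI, a per-provider transformation would be needed instead.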
Recommendation for versioning.
| | Recommendation | UniProt | RefSeq | Ensembl |
|---|---|---|---|---|
| General versioning practices | Primary versioning strategy | Record level | Record level | Release level |
| | Past versions are accessible | All versions of individual records are accessible | All versions of individual records are accessible | Maintains all archives for at least 5 years; some key releases may be maintained for longer. All databases maintained for at least 10 years (currently all databases available from 2004) |
| | Release versioning available | No past releases available | | |
| | Documentation exists regarding what kinds of record changes prompt a new version to be issued. | | | |
| URL version practices | The base identifier (the one with no explicit version) should resolve (302 redirect) to the most recent version. | | | |
| | | Remove dot suffix from the Local ID, e.g.: | Remove dot suffix from the Local ID, e.g.: | Remove build number from the URI, e.g.: |
| | Older versions must resolve. | | | |
| | Illegal or invalid version should produce an informative HTTP error code and an HTML page explaining the error. | Error not returned | | |
| | A list of all previous versions should be available. | See “history” tab in user interface | See format dropdown in user interface | |
| | Link from older version to current version should ideally be provided. | P12345.3 | Link available at the top of the page | Plans to support |
| | Two versions (or dates) should ideally be comparable. | Record history provides comparison | Record history provides comparison | Unsupported |
Local ID, local identifier; URI, unique resource identifier; URL, uniform resource locator.
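The URL version practices in the table (the base identifier redirects to the most recent version, older versions still resolve, and invalid versions yield an informative error) can be sketched as a small resolver. This is a simplified model under stated assumptions, not the behaviour of any provider's actual service: the dot-suffix convention follows the `P12345.3` form that appears in the table, while the version registry `KNOWN_VERSIONS` and the function names are hypothetical.

```python
import re

_VERSIONED = re.compile(r"(?P<base>.+?)\.(?P<version>[0-9]+)")


def split_version(local_id: str):
    """Split 'P12345.3' into ('P12345', 3); an unversioned ID gives (id, None)."""
    match = _VERSIONED.fullmatch(local_id)
    if match is None:
        return local_id, None
    return match.group("base"), int(match.group("version"))


def resolve(local_id: str, known_versions: dict):
    """Return (HTTP-style status, target or message) for a requested ID."""
    base, version = split_version(local_id)
    versions = known_versions.get(base)
    if not versions:
        return 404, "unknown identifier " + base
    if version is None:
        # Base identifier: redirect to the most recent version.
        return 302, base + "." + str(max(versions))
    if version not in versions:
        return 404, "identifier " + base + " has no version " + str(version)
    # Explicit versions, including older ones, resolve directly.
    return 200, base + "." + str(version)


# Hypothetical registry of issued versions, for illustration only.
KNOWN_VERSIONS = {"P12345": {1, 2, 3}}

print(resolve("P12345", KNOWN_VERSIONS))    # (302, 'P12345.3')
print(resolve("P12345.2", KNOWN_VERSIONS))  # (200, 'P12345.2')
print(resolve("P12345.9", KNOWN_VERSIONS))  # (404, 'identifier P12345 has no version 9')
```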
Recommendations for identifier lifecycle management.
| Recommended Handling | Example |
|---|---|
| Single obsoleted identifier: | |
| | UniProt entries Q57339 and O08022 have been merged into Q00626. Q57339 and O08022 are redirected to Q00626. |
| | UniProt entry P29358 has been split into P68250 and P68251. P29358 displays a warning and links to the demerged entries: |
ID, identifier.
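The merge and split examples in the table can be modelled with two lookup tables. The accessions below are the ones quoted in the table; everything else (the function name `lookup`, the status labels, and the assumption that any accession absent from both tables is current) is a hypothetical simplification and not a description of how UniProt implements this.

```python
# Accessions taken from the examples in the table above.
MERGED_INTO = {"Q57339": "Q00626", "O08022": "Q00626"}
SPLIT_INTO = {"P29358": ["P68250", "P68251"]}


def lookup(accession: str) -> dict:
    """Describe how a request for an accession would be handled."""
    if accession in MERGED_INTO:
        # Merged entries redirect to the surviving entry.
        return {"status": "redirect", "targets": [MERGED_INTO[accession]]}
    if accession in SPLIT_INTO:
        # Split entries cannot redirect to one target, so they warn
        # and link to every entry the original was demerged into.
        return {"status": "warning", "targets": list(SPLIT_INTO[accession])}
    # Simplifying assumption: anything else is treated as a current entry.
    return {"status": "current", "targets": [accession]}


print(lookup("Q57339"))  # {'status': 'redirect', 'targets': ['Q00626']}
print(lookup("P29358"))  # {'status': 'warning', 'targets': ['P68250', 'P68251']}
```

In both cases the obsoleted identifier still resolves to something informative rather than being deleted, which matches the table's requirement that identifiers may be deprecated but not removed.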
Fig 6. Eagle-i record-level citation widget.