
A sea of standards for omics data: sink or swim?

Jessica D Tenenbaum, Susanna-Assunta Sansone, Melissa Haendel

Abstract

In the era of Big Data, omic-scale technologies, and increasing calls for data sharing, it is generally agreed that the use of community-developed, open data standards is critical. Far less agreed upon is exactly which data standards should be used, the criteria by which one should choose a standard, or even what constitutes a data standard. It is impossible simply to choose a domain and have it naturally follow which data standards should be used in all cases. The 'right' standards to use is often dependent on the use case scenarios for a given project. Potential downstream applications for the data, however, may not always be apparent at the time the data are generated. Similarly, technology evolves, adding further complexity. Would-be standards adopters must strike a balance between planning for the future and minimizing the burden of compliance. Better tools and resources are required to help guide this balancing act.

Keywords:  Data Sharing; Data Standards; Information dissemination; Terminology

Year:  2013        PMID: 24076747      PMCID: PMC3932466          DOI: 10.1136/amiajnl-2013-002066

Source DB:  PubMed          Journal:  J Am Med Inform Assoc        ISSN: 1067-5027            Impact factor:   4.497


Background

Members of the scientific community are increasingly expected to share data, and to do so in a standards-compliant manner. This is evidenced by recent mandates, announcements, and requests for information by funding agencies1–5 and journals,6 and by numerous essays and announcements from the scientific community,7–10 including pre-competitive initiatives by the life science industry.11 However, the scientific community is not necessarily well poised to comply.12 All stakeholders (funders, journal editors, researchers, and those supporting them) struggle to navigate the existing standards and make informed decisions.13 As an example, in 2009 one of our groups aimed to create a standards-compliant, integrated data repository for clinical and ‘omics’ data, among other types. This raised the question: with which standards should we comply? Through subsequent efforts to answer this question, three key points have become clear:

1. Different groups and individuals have different definitions of what constitutes a ‘data standard’.
2. Even within one domain, no single standard is the ‘right’ standard across all cases; rather, one must select a standard (or even specific pieces of a standard) based on one's particular needs.
3. Integrated resources and registries are needed to help researchers navigate the fluid standards landscape and to choose and implement the right standard for their respective projects.

The focus of that project was on omics data standards, but these points apply across the spectrum of biomedical data types. High-dimensional ‘Big Data’ equate to large numbers of parameters, which in turn require yet more data for sufficient statistical power. Importantly, this massive amount of data lends itself to many different analytic approaches, putting comprehensive analysis beyond the capabilities of any one researcher.
The size and complexity of these data, combined with the growing scarcity of research funding and the quest for personalized medicine, make it increasingly important to maximize the utility of research dollars through data sharing and re-use. Efforts to this end are demonstrated by a spate of new data sharing and aggregation initiatives by academics, private–public partnerships, and publishers, for example Sage Bionetworks,14 the Pistoia Alliance (http://www.pistoiaalliance.org) and DRYAD,15 among others.16–18 At the national level in the USA, the data sharing trend is reflected in programs such as the National Institutes of Health's (NIH) recently announced ‘Big Data to Knowledge’ (BD2K) initiative,19 and the White House Office of Science and Technology Policy's recent directive that the results of government-funded research be made publicly available.20 The Innovative Medicines Initiative (http://www.imi.europa.eu/) is Europe's largest public–private initiative supporting collaborative research projects and building networks of industrial and academic experts to boost pharmaceutical innovation in Europe. Internationally, the Research Data Alliance (https://rd-alliance.org/) has been established by an international steering group drawn from funding agencies in the USA, EU, and Australia; and recently the global alliance for genomic and clinical data sharing has brought together over 70 leading healthcare, research, and disease advocacy organizations, involving researchers from more than 40 countries, to enable secure sharing of genomic and clinical data.21

These types of initiatives, together with the evolving portfolio of grass-roots standards, have heightened the need to maximize awareness and discoverability of standards. Such efforts are becoming more common,22–26 but they lack integration or unification. There is a clear need for some level of coordination, without its taking the form of a top-down authority.
How can we avoid requiring would-be standard adopters to spend considerable time and effort becoming well versed with a multitude of standards solely in order to rule most of them out?

What is a data standard?

The International Organization for Standardization defines a standard as ‘…a document that provides requirements, specifications, guidelines or characteristics that can be used consistently to ensure that materials, products, processes and services are fit for their purpose’.27 Standards range from de jure, that is, ordained by some official organization such as the International Organization for Standardization or the American National Standards Institute, to de facto, that is, developed by grass-root initiatives and commonly adopted, but not prescribed by an official or specific authority. The BioSharing registry (http://biosharing.org/) houses a fairly comprehensive, curated list of data standards (primarily de facto) in the life science, environmental, and biomedical space. These standards are divided into three categories. First, content standards take the form of reporting guidelines, for example, minimum information checklists. These vary from general guidance to itemized prescriptions of the information that should be provided (ie, curation guidelines), including both data and metadata. The second category consists of syntax standards in the form of representations and formats that facilitate the exchange of information. These fall broadly into two types: delimited text, or a ‘markup language’ such as XML. Third are the semantic standards in the form of terminology artifacts, such as controlled vocabularies or ontologies. These add an interpretive layer to the data by defining the concepts or terms in a domain, and in some cases the relationships between them. Other discussions of standards include the notion of a data model, which extends beyond terms and their definitions to describe the relationships between concepts in a domain.28 Other groups also use additional terms such as conceptual model, conceptual schema, ontology, or domain analysis model,29–32 but generally differ on what each of these terms means.
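The three categories can be made concrete with a small sketch. In the Python snippet below, all field names and the ontology term ID are invented for illustration: a minimum-information checklist plays the role of a content standard, an ontology-style term identifier stands in for a semantic standard, and tab-delimited serialization stands in for a syntax standard.

```python
# Illustrative sketch: the three kinds of standard applied to one metadata record.
# Field names and the term ID "XYZ:0001234" are hypothetical.

# Content standard: a minimum-information checklist of required fields.
REQUIRED_FIELDS = {"experiment_design", "sample_annotation", "raw_data_file"}

record = {
    "experiment_design": "case vs control",
    "sample_annotation": "XYZ:0001234",  # semantic standard: an ontology term ID, not free text
    "raw_data_file": "run01.cel",
}

# Checklist compliance: every required field must be present.
missing = REQUIRED_FIELDS - record.keys()
assert not missing, f"checklist violation: missing {sorted(missing)}"

# Syntax standard: exchange the record as tab-delimited text both sides agree on.
columns = sorted(record)
print("\t".join(columns))
print("\t".join(record[c] for c in columns))
```

The point of the sketch is that the three categories are orthogonal: the same record could satisfy the checklist while using free text instead of a term ID, or carry proper term IDs in an ad hoc format.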
This is in fact part of the confusion—even data standard experts do not agree on what constitutes a data standard. Nevertheless, focusing just within the context of transcriptomics, preliminary investigation yielded a list of 15 potentially relevant standards (table 1). Note that this list could grow depending on the type of sample and organism used, as many terminologies are species specific. Now imagine if a researcher has an associated dataset from a proteomics investigation, for example. How is a mere mortal to sort through these?
Table 1

A sampling of (some of the) standards related to microarray-based transcriptomics, generated by non-experts for evaluation of relevance to a project involving microarray-based transcriptomics data

MIAME (reporting guideline): Minimum Information About a Microarray Experiment. Specifies six components that must be included to describe a microarray experiment, for example raw and processed data, experimental design, sample annotation, and protocols. MIAME does not specify how these components must be represented, for example in any given format or using any given terminology.
ISA-TAB (exchange format): Generic format for experimental representations; conversion tools to MAGE-TAB, MIMiML, and other formats exist.
MAGE-TAB (exchange format): MicroArray and Gene Expression-Tabular. Simple tab-delimited, spreadsheet-based format. Used by ArrayExpress.
MAGE-ML (exchange format): MicroArray and Gene Expression-Markup Language. No longer supported.
SOFT (exchange format): Simple Omnibus Format in Text. Line-based, plain-text format designed for rapid batch submission of data. Used by GEO.
MIMiML (exchange format): MIAME Notation in Markup Language. Optimized for microarray and other high-throughput molecular abundance data. Used by GEO.
GO (terminology artifact): Gene Ontology. Controlled vocabulary for annotation of gene function and cellular location. Part of the OBO Foundry.
EFO (terminology artifact): Experimental Factor Ontology. Provides a systematic description of many experimental variables. Used by ArrayExpress.
OBI (terminology artifact): Ontology for Biomedical Investigations. Broader scope for experimental representations. Part of the OBO Foundry.
MGED Ontology (terminology artifact): Integrated into OBI.
MAGE-OM (object model): MicroArray and Gene Expression Object Model. The object model from which MAGE-ML was derived.
FuGE (object model): Generic object model for functional genomics.
SEND (exchange format): Standard for Exchange of Nonclinical Data; an implementation of the CDISC (Clinical Data Interchange Standards Consortium) SDTM (Standard Data Tabulation Model).
GEML (exchange format), FUGO (terminology artifact), MAML (exchange format): These three standards have since been deprecated and/or replaced by other standards, but that progression may not always be clear to novice users.

Fit for purpose

In biomarker discovery, the phrase ‘fit-for-purpose’ refers to the notion that the degree of rigor for assay validation should be tailored to the intended purpose of a given biomarker study.33 The same is true for data standards adoption. While each individual project will inevitably have its own specific requirements, it can be useful to group projects across a spectrum of rigor. At the lowest level, there is the use case of data sharing within a laboratory or between collaborators. While minimum information guidelines should be followed, for the most part any documentation need only be human readable, and issues requiring clarification are merely a walk down the hall or an e-mail away (at least until the student graduates or the postdoc moves on). Data that are to be shared publicly, for example, accompanying a publication, require more rigor. Ideally, a prospective consumer of the data can both understand and reproduce those data without needing to contact the original author. Furthermore, much of the content of publications is now aggregated and curated by various online resources. These value-added services can be much more efficient and effective at making content available via secondary sources when quality data standards are used. Minimally structured data can be very helpful for such purposes; for example, the use of a unique identifier to describe a molecule or a standardized vocabulary term to denote the disease area under study. The highest level of rigor is needed for contribution of data to a structured data repository. In this case, additional effort is warranted in the form of structured fields and a standardized, machine-readable format. Such rigor enables querying across multiple datasets and integrative meta-analysis combining more than one set. One key point in differentiating between these levels of rigor is that there are different ‘flavors’ of annotation. 
At every level, there is a difference between what needs to be documented, and what needs to be documented in a structured and queryable fashion. While the option exists to select a standard that allows for maximum structure and then adopt it only loosely, complexity can deter would-be standards adopters, and can waste development time if such rigor will ultimately never be needed. Categories of criteria to be used in evaluating data standards for adoption include:

The standard itself:
- specification documentation
- ease of implementation (eg, level of documentation, requirement for programmer support)
- human and machine readability
- formal structure
- expressivity: the breadth of information that can be represented
- ease of use, for example minimal required fields or a text-based interface
- familiarity to biologists

Adoption and user community:
- broad adoption and implementation outside the initial group
- support supplied by the user community
- use by community databases
- software development that supports the standard (eg, for curating, or for submitting to databases)
- responsiveness to community requests
- availability of examples of use
- requirements of relevant authoritative bodies, for example funders (NIH, National Science Foundation, Centers for Medicare & Medicaid Services), publishers, etc

Additional factors:
- integration/compatibility with other standards
- extensibility and flexibility to cover new domains
- conversion and mapping, when applicable
- cost (eg, open vs licensing fee)

Of course, specific projects may have additional criteria, and different projects will weight the items differently. Unfortunately, standards adoption, when it happens, is often determined less by an objective criteria-based evaluation and more by historical precedent (‘my advisor used standard X’), marketing (‘I saw a press release about standard X’), or sociopolitical circumstance (‘I know someone on the standard X team’).
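One way to make such an evaluation explicit, rather than relying on precedent or marketing, is a simple weighted scorecard. In the sketch below, the criteria names, the per-project weights, and the 1-5 ratings are all invented for illustration; a real evaluation would use the full criteria list above.

```python
# A toy weighted scorecard for comparing candidate standards.
# Criteria, weights (importance to this project), and 1-5 ratings are invented.
weights = {"documentation": 3, "community_adoption": 3, "tooling": 2, "open_cost": 1}

candidates = {
    "Standard A": {"documentation": 4, "community_adoption": 5, "tooling": 3, "open_cost": 5},
    "Standard B": {"documentation": 5, "community_adoption": 2, "tooling": 4, "open_cost": 3},
}

def score(ratings):
    """Weighted sum of per-criterion ratings."""
    return sum(weights[c] * r for c, r in ratings.items())

# Rank candidates from best to worst under this project's weights.
for name in sorted(candidates, key=lambda n: score(candidates[n]), reverse=True):
    print(f"{name}: {score(candidates[name])}")
```

The value of such an exercise is less the final number than the forced discussion of which criteria actually matter for the project at hand.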
What makes it even more difficult to select standards empirically, based on objective criteria, is that standards are often complex. Even well-documented standards can be dense and impenetrable to prospective users who were not involved in their development. This is one reason why standards are often duplicated or reinvented. Other factors include the desire for some level of control, or recognition for doing the work.

Resources wanted

The recent data and informatics working group report to the advisory committee to the director of the NIH included recommendations to establish a minimal metadata framework for data sharing, and to create catalogs and tools to facilitate data sharing.2 A truly minimal set of metadata elements is important if we are to have any hope of compliance, because the activation energy required for data curation and annotation represents a significant hurdle to data sharing. The minimum information for biological and biomedical investigations (MIBBI) project, part of the broader BioSharing effort, worked with different research communities to coordinate their ‘minimum information’ checklists,34 but each community has some unique requirements. Data annotation also presents an inherent tension: the easier we make it for investigators to annotate their datasets, the harder it will be to ensure discoverability. Conversely, the more discoverable we make the datasets, for example through annotation using controlled terminologies, the more burden we place on the data generators. Researchers need better tools and resources to identify, evaluate, and implement standards. BioSharing is a valuable resource for registering and discovering standards, and has adopted the initial set of criteria described above, requiring communities to perform a self-appraisal and tag their entries accordingly. The standards development community also has an active role to play if they wish to maximize the use and uptake of their work. Reviewers of publications, and of their adherence to data standards, should include biocurators. In the absence of widely agreed upon metrics to evaluate community standards, the decision about which standard is right falls on the researcher. For the reasons described above, this situation is problematic. Table 2 lists some potential resources and functionalities to address this problem.
For any of these resources, it is important to note that technology is dynamic, and therefore so are any associated standards. Relevant resources must be similarly dynamic and up to date.
Table 2

Potential resources to assist in the selection and adoption of appropriate standards

Lay person's primer to standards: A text document for the lay person describing the standard, what problem it helps solve, and how it achieves that. Although FAQs address a number of these questions, one must first identify the standard and find the respective FAQ. This would be a centralized collection of documentation requiring no previous knowledge.
‘Consumer reviews’: A rating system along the lines of Amazon product reviews. Ontology registries such as the NCBO and the OBO Foundry enable or perform reviews, but the reviews are few in number, not substantive, or infrequent. As discussed above, the utility of a standard depends on the purpose for which it is being used, so information beyond numeric scores is needed.
Standard-selection wizard: Decision support methods could be used to ask a researcher about the intended goals and make recommendations accordingly, for example ‘What instrument type was used to generate the data?’ and ‘Will these data be deposited in a public data repository? If so, which one?’ Clearly this would require significant resources and ongoing maintenance.
Standards-adoption ‘helpdesk’: A centralized resource of real humans with expertise across a number of standards. Once a standard has been selected, many have rich user communities and distribution lists for help with questions. However, for an individual investigator who wants to be standards-compliant and does not know where to begin, expert advice can save significant time in researching options.
Quality assurance tools: Similar to syntax validators such as those for RDF, tools to gauge or validate standards compliance are useful for data submitters as well as reviewers.

NCBO, National Center for Biomedical Ontology (http://www.bioontology.org/); RDF, Resource Description Framework.
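As a minimal stand-in for the quality assurance tools described in Table 2, the sketch below checks only that a submission is well-formed XML, using nothing beyond the Python standard library. Real compliance checkers (for example, RDF syntax validators or MAGE-TAB checkers) go much further, validating vocabulary terms and required fields as well as syntax.

```python
# A crude first-pass QA check: is the submission at least well-formed XML?
# This is only syntax validation; standard-specific checks would follow.
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<experiment><sample id='s1'/></experiment>"))  # True
print(is_well_formed("<experiment><sample></experiment>"))           # mismatched tag: False
```

Even a check this shallow is useful to both submitters and reviewers, because it catches transport and encoding errors before any human effort is spent on content review.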


Discussion

While one can conjure up motivating scenarios from a regulatory or archiving standpoint, the value proposition behind adherence to standards only really makes sense if data are to be shared beyond the team that originally created them. Thanks in part to policies put in place by some funders and publishers,8 many high throughput datasets are made publicly available and, at some level, standards compliant. However, these policies have a number of restrictions that make them fall short. Some apply only to data generation through grants that exceed US$500 000.2 Some require only a very low bar of compliance, and data are still difficult if not impossible to interpret. In many cases, the policies are simply not enforced,7 although the government and the NIH have recently taken steps to rectify that fact.3 19 20 Ideally, it should be noted, researchers themselves would be shielded from the complexity of data standards. Developers, informaticists, and curators are perhaps better equipped to delve into data standards than would be a clinician or bench scientist, but even they are typically not experts in specialized standards. In an ideal world, data generators would have access to user-friendly tools that enable the seamless use of relevant standards and can be customized to fit the different data and domain needs.9 The actual standards would be hidden from the data generators, and their use made automatic through intuitive, user-friendly tools. Although we have described some tools for the discovery and evaluation of standards if one is so inclined, the real challenge is incentivizing researchers to go to the trouble. This will probably need a combination of proverbial carrots and sticks. On the penalty side, funders and publishers must continue to develop and publicize progressive data-sharing policies, and to enforce those policies through the delay of publication or future funding, if necessary. 
On the incentives side, a formal system for data citation must be developed, and those citations acknowledged and valued by funders, professional organizations, and university promotion and tenure committees. Recent activity in the realm of data publishing has been an important first step.35 36 Only when obstacles are minimized and incentives are properly aligned will investigators be able to justify the effort required to do the right thing.
