Towards interoperable reporting standards for omics data: hopes and hurdles.

Susanna-Assunta Sansone, Philippe Rocca-Serra, Dawn Field, Chris F Taylor, Weida Tong, Marco Brandizi, Eamonn Maguire, Nataliya Sklyar.

Abstract

BACKGROUND: As the size and complexity of scientific datasets and the corresponding information stores grow, standards for collecting, describing, formatting, submitting and exchanging information are playing an increasingly active role. Several initiatives occupy strategic positions in the international scenario, both within and across domains. However, the job of harmonising reporting standards is still very much a work in progress; both software interoperability and data integration remain challenging as things stand.
RESULTS: The status quo with respect to standardization initiatives is summarized here, with particular emphasis on the motivation for, and the challenges of, ongoing synergistic activities amongst the academic community focused on the creation of truly interoperable standards.
CONCLUSIONS: Groups generating standards should engage with ongoing cross-domain activities to simplify the integration of heterogeneous data sets to the greatest possible extent.

Year: 2009 PMID: 21347181 PMCID: PMC3041584

Source DB: PubMed Journal: Summit Transl Bioinform ISSN: 2153-6430

Background

The growing complexity of datasets

In the life sciences, the cycle of data generation and processing is being vastly accelerated by the development of high-throughput experimental methods associated with genomic and post-genomic technologies (e.g. genomics, transcriptomics, proteomics, and metabolomics, hereafter referred to as ‘omics’). Biological and biomedical studies commonly range from simple single-assay studies to complex multi-assay studies. As an example of the latter, consider the reporting of a study that examines the effect of different drugs on a number of subjects by characterizing the metabolic profile of their urine (by mass spectroscopy), measuring protein and gene expression in the liver (by mass spectrometry and DNA microarrays, respectively), and conducting conventional histological analysis. Omics studies are information intensive, and to record their complex structure it is necessary to define and capture the experimental metadata, including experimental design, sample source(s) and treatment(s), the preparation of the sample for the analytical assay, the processes and instruments used throughout, and the final data. It is widely recognized that capturing experimental metadata at this level of granularity is required to correctly interpret the results that they contextualize and to enable efficient data sharing.

Standardization Initiatives Focused on Particular Domains of Application

Many groups have risen to this challenge; standards for collecting, describing, formatting, submitting and exchanging both the data and metadata from such complex studies either are under development or have been released [1]. Currently, several standards initiatives occupy strategic positions in the international scenario, largely falling into two groups identifiable by the needs of their respective user communities. One group of initiatives is driven by regulatory frameworks, and often supported by accredited (de jure) Standards Developing Organizations (SDOs). Most significantly, these efforts focus on the Voluntary eXploratory Data Submissions (VXDS) and electronic data submission programs of the US Food and Drug Administration (FDA) [2-4] or on initiatives by other governmental agencies, such as the US Environmental Protection Agency (EPA) [5]. These initiatives also include long-standing efforts in the clinical and non-clinical domains [6] alongside more recent activities in the pharmacogenomics area that add complex omics technologies to biomedical studies [7]. A second group of initiatives, which address particular (omics or other) technologies or defined domains of application (e.g. systems biology, pathways), has emerged from the academic community, in many cases with the support of commercial organizations such as instrument vendors. Such initiatives (e.g., [8-14]) are focused on supporting tool interoperability and data exchange among public and proprietary systems through the development of three kinds of (de facto) reporting standards: ‘minimum information’ checklists (specifications of data set content, however encoded), ontologies (semantics) and file formats (syntax).
Minimum information checklists are easy-to-read, structured documents that reflect the consensus view of the essential pieces of information that should be reported; ontologies provide the semantics needed to describe the minimum information requirements, and file formats provide the syntax to transmit and exchange these. By combining these three kinds of reporting standards, a submission tool, for example, should guide researchers through the process of meeting the reporting requirements made by a given minimum information specification, enable straightforward practical use of ontology terms, and export the collected information in a standard format to a given database.
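As a rough illustration of how the three kinds of reporting standards could work together in a submission tool, the following minimal Python sketch checks a record against a checklist (content), attaches term identifiers from a controlled vocabulary (semantics), and exports tab-delimited text (syntax). The field names, the term identifiers with the `EX:` prefix, and the output layout are hypothetical placeholders chosen for this example; they are not taken from any actual checklist, ontology, or file format discussed here.

```python
import csv
import io

# Hypothetical "minimum information" checklist: the required fields (content).
CHECKLIST = ["sample_source", "treatment", "assay_technology"]

# Hypothetical controlled vocabulary: label -> term ID (semantics).
# These IDs are made-up placeholders, not real ontology accessions.
ONTOLOGY = {
    "liver": "EX:0000001",
    "urine": "EX:0000002",
    "mass spectrometry": "EX:0000003",
    "dna microarray": "EX:0000004",
}


def missing_fields(record):
    """Return the checklist fields that are absent or empty in the record."""
    return [f for f in CHECKLIST if not record.get(f)]


def annotate(value):
    """Return the term ID for a known label, or an empty string otherwise."""
    return ONTOLOGY.get(value.strip().lower(), "")


def export_tsv(records):
    """Serialize checklist-complete records as tab-delimited text (syntax)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header = []
    for f in CHECKLIST:
        header += [f, f + "_term_id"]
    writer.writerow(header)
    for rec in records:
        gaps = missing_fields(rec)
        if gaps:
            raise ValueError(f"record is missing required fields: {gaps}")
        row = []
        for f in CHECKLIST:
            row += [rec[f], annotate(rec[f])]
        writer.writerow(row)
    return out.getvalue()


record = {
    "sample_source": "Liver",
    "treatment": "drug A",
    "assay_technology": "DNA microarray",
}
tsv = export_tsv([record])
```

In this sketch an incomplete record is rejected before export, and a value with no matching vocabulary term (such as the treatment above) is exported with an empty term column rather than a guessed identifier.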

Fragmentation of Standards

Domain-specific initiatives are regarded as important because they address ‘real world’ data reporting requirements. Unfortunately, focusing on particular communities’ interests or technologies leads to duplication of effort. More seriously, the development of (largely arbitrarily) different standards severely hinders data integration. Nowadays researchers are able to perform multi-assay studies where the same sample is run through the full range of omics and conventional technologies, in combination. In this specific case, it is critical that the reporting standards are designed to be interoperable and fit neatly together like the pieces of a jigsaw, with users being able to take the pieces that are relevant to reporting their study. The fragmentation severely hinders the interoperability of the databases and tools that implement such reporting standards. This scenario is illustrated by ArrayExpress [15] and PRIDE [16], two EBI public repositories for transcriptomics and proteomics data respectively. These systems implement (non-interoperable) standards applicable only to their own omics domains. Consequently, users have to deal with different submission formats and diverse representations of the metadata and terminologies when depositing their datasets in these systems, and similarly when downloading other datasets. Such fragmentation has a strong impact on the user community, particularly by hampering deposition and integration of multi-assay studies.

Results and Discussion

Integrative Cross-Domain Standardization Initiatives

Fortunately, amongst the academic community a number of initiatives aim to foster the harmonization and consolidation of the three kinds of reporting standard previously described.
Content: Twenty-seven groups now participate in the Minimum Information for Biomedical or Biological Investigations (MIBBI) project, which offers a one-stop shop for those exploring the range of extant ‘minimum information’ checklists (such as MIAME [17]) and fosters their collaborative, integrative development [18].
Semantics: More than 70 groups participate in the Open Biological Ontology (OBO) Foundry. The objective of the project is to encourage the development of orthogonal, interoperable ontologies [19].
Syntax: Several groups participate in the Functional Genomics (FuGE) project to develop a single generic data model that will underpin a variety of XML-based file formats by providing a single common framework [20]. Recently, a (growing) number of communities have begun a complementary initiative to collaboratively develop the Investigation/Study/Assay (ISA-TAB) format, a tabular framework for presenting experimental metadata [21] that uses a reference system to complement existing biomedical formats such as the Study Data Tabulation Model (SDTM, [22]).
These integrative cross-domain reporting standards are implemented by the BioInvestigation Index, a new prototype infrastructure at EBI set to provide users with a common structured representation and (public) storage mechanism for a variety of studies [23]. Although it relies on EBI production systems, such as ArrayExpress and PRIDE, the BioInvestigation Index shields users from the diverse reporting standards by implementing the MIBBI, OBO Foundry and ISA-TAB synergistic efforts in its annotation and submission tool [24].
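To make the Investigation/Study/Assay idea more concrete, the sketch below models a multi-assay study in Python in which each assay only points to a data file kept in its own native format, while the shared metadata records which samples each technology measured. This is an illustrative assumption about how such a reference system might be organized; the class names, attributes, sample identifiers, and file names are hypothetical and do not reproduce the actual ISA-TAB specification.

```python
from dataclasses import dataclass, field


@dataclass
class Assay:
    technology: str
    data_file: str    # reference to a file in the technology's own format
    sample_ids: list  # samples measured by this assay


@dataclass
class Study:
    identifier: str
    sample_ids: list
    assays: list = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: list = field(default_factory=list)


def technologies_per_sample(study):
    """Map each study sample to the technologies that measured it."""
    usage = {s: [] for s in study.sample_ids}
    for assay in study.assays:
        for s in assay.sample_ids:
            if s not in usage:
                raise ValueError(f"assay references unknown sample {s!r}")
            usage[s].append(assay.technology)
    return usage


# Hypothetical multi-assay study: one liver sample measured by two technologies.
study = Study(
    identifier="study-1",
    sample_ids=["subject1_liver", "subject1_urine"],
    assays=[
        Assay("transcriptomics", "expression_data.txt", ["subject1_liver"]),
        Assay("proteomics", "protein_data.xml", ["subject1_liver"]),
        Assay("metabolomics", "urine_profile.txt", ["subject1_urine"]),
    ],
)
investigation = Investigation(title="Example drug study", studies=[study])
usage = technologies_per_sample(study)
```

Because every assay refers back to the same sample identifiers, a tool reading this structure could tell that the transcriptomics and proteomics files describe the same liver sample, which is the kind of cross-domain link that fragmented, domain-specific formats make hard to recover.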

Hopes and Hurdles

To achieve interoperability from a technical perspective, ‘meta’ standardization projects such as MIBBI, OBO Foundry and ISA-TAB help (i) resolve overlaps between domain-specific standards and (ii) plug gaps where they exist. It is also anticipated that some reporting standards will be more mature (‘ready’ to be integrated) than others, particularly because development takes time and ‘buy-in’ both from potential users and from those that govern them (journals, funders, regulators). These are technically complex but demonstrably tractable tasks. By contrast, the sociological barriers facing these kinds of large-scale collaborations can be far more challenging, mandating extensive liaison between communities. Managing the process of consensus building from start to finish takes time and expertise. However, the time participants can dedicate to these projects is chronically limited due to lack of financial support. The massively collaborative nature of such undertakings requires frequent face-to-face workshops to create the necessary conditions for the building of consensus. Unfortunately, for the initiatives that have emerged from the academic community, this is difficult without central grants or with limited funds [25]. Despite this chronic resource limitation, the lack of standardization is so problematic for researchers and those that support them, repeatedly proving to be a significant bottleneck in the collection, sharing, and integration of data, that both developers and potential users continue to participate on an almost exclusively voluntary basis. Two stakeholders have pivotal roles to play as enablers. Journals increasingly require compliance with appropriate consensual reporting standards, contingent on the availability of appropriate software and public repositories [26, 27, 28]. Consistent reporting has a positive and long-lasting impact on the value of collective scientific outputs.
This has also been recognized by funding agencies that are increasingly playing an active role in the strategic stewardship of omics data, often through the development of data policies encouraging the use of (existing) standards and public standards-compliant repositories for data collection and management [29].

Conclusions

This paper has illustrated the growing number of standards and the complexity facing those attempting to use them, for example, to report or integrate datasets from multiple domains. We have also indicated the existence of a number of synergistic projects seeking to simplify the process of integrating reporting standards, where possible. Of course, this is not an exhaustive list; several coordinative infrastructure initiatives work to address the problem of sharing and archiving large amounts of data according to common standards (e.g., [30-32]). There are many benefits accruing to the development and acceptance of reporting standards. For example, by limiting the range and variability of standards, the development and maintenance costs of standards-compliant products come down for commercial and academic software developers. This results in more appropriate resources for the biomedical and scientific community, making the job of capturing, annotating, integrating, sharing and exploiting (meta)data simpler, increasing the prima facie value of the data to others (secondary users), and by extension, increasing the return on the investment of (public) funds that supported their generation. Above all, ‘top-down’ coordination is needed to help bring these standardization efforts closer together and address the fragmentation issue. Although regulatory- or biomedical-driven initiatives have far stricter guidelines than academia, much could be learned from an exchange of ideas and practices between these sectors.

18 in total

1. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.

Authors: A Brazma; P Hingamp; J Quackenbush; G Sherlock; P Spellman; C Stoeckert; J Aach; W Ansorge; C A Ball; H C Causton; T Gaasterland; P Glenisson; F C Holstege; I F Kim; V Markowitz; J C Matese; H Parkinson; A Robinson; U Sarkans; S Schulze-Kremer; J Stewart; R Taylor; J Vilo; M Vingron
Journal: Nat Genet Date: 2001-12 Impact factor: 38.330

2. Summary recommendations for standardization and reporting of metabolic analyses.

Authors: John C Lindon; Jeremy K Nicholson; Elaine Holmes; Hector C Keun; Andrew Craig; Jake T M Pearce; Stephen J Bruce; Nigel Hardy; Susanna-Assunta Sansone; Henrik Antti; Par Jonsson; Clare Daykin; Mahendra Navarange; Richard D Beger; Elwin R Verheij; Alexander Amberg; Dorrit Baunsgaard; Glenn H Cantor; Lois Lehman-McKeeman; Mark Earll; Svante Wold; Erik Johansson; John N Haselden; Kerstin Kramer; Craig Thomas; Johann Lindberg; Ina Schuppe-Koistinen; Ian D Wilson; Michael D Reily; Donald G Robertson; Hans Senn; Arno Krotzky; Sunil Kochhar; Jonathan Powell; Frans van der Ouderaa; Robert Plumb; Hartmut Schaefer; Manfred Spraul
Journal: Nat Biotechnol Date: 2005-07 Impact factor: 54.908

Review 3. Clinical genomics data standards for pharmacogenetics and pharmacogenomics.

Authors: Amnon Shabo
Journal: Pharmacogenomics Date: 2006-03 Impact factor: 2.533

4. Minimum information specification for in situ hybridization and immunohistochemistry experiments (MISFISHIE).

Authors: Eric W Deutsch; Catherine A Ball; Jules J Berman; G Steven Bova; Alvis Brazma; Roger E Bumgarner; David Campbell; Helen C Causton; Jeffrey H Christiansen; Fabrice Daian; Delphine Dauga; Duncan R Davidson; Gregory Gimenez; Young Ah Goo; Sean Grimmond; Thorsten Henrich; Bernhard G Herrmann; Michael H Johnson; Martin Korb; Jason C Mills; Asa J Oudes; Helen E Parkinson; Laura E Pascal; Nicolas Pollet; John Quackenbush; Mirana Ramialison; Martin Ringwald; David Salgado; Susanna-Assunta Sansone; Gavin Sherlock; Christian J Stoeckert; Jason Swedlow; Ronald C Taylor; Laura Walashek; Anthony Warford; David G Wilkinson; Yi Zhou; Leonard I Zon; Alvin Y Liu; Lawrence D True
Journal: Nat Biotechnol Date: 2008-03 Impact factor: 54.908

5. Democratizing proteomics data.

Authors:
Journal: Nat Biotechnol Date: 2007-03 Impact factor: 54.908

Review 6. The minimum information about a proteomics experiment (MIAPE).

Authors: Chris F Taylor; Norman W Paton; Kathryn S Lilley; Pierre-Alain Binz; Randall K Julian; Andrew R Jones; Weimin Zhu; Rolf Apweiler; Ruedi Aebersold; Eric W Deutsch; Michael J Dunn; Albert J R Heck; Alexander Leitner; Marcus Macht; Matthias Mann; Lennart Martens; Thomas A Neubert; Scott D Patterson; Peipei Ping; Sean L Seymour; Puneet Souda; Akira Tsugita; Joel Vandekerckhove; Thomas M Vondriska; Julian P Whitelegge; Marc R Wilkins; Ioannnis Xenarios; John R Yates; Henning Hermjakob
Journal: Nat Biotechnol Date: 2007-08 Impact factor: 54.908

7. The first RSBI (ISA-TAB) workshop: "can a simple format work for complex studies?".

Authors: Susanna-Assunta Sansone; Philippe Rocca-Serra; Marco Brandizi; Alvis Brazma; Dawn Field; Jennifer Fostel; Andrew G Garrow; Jack Gilbert; Federico Goodsaid; Nigel Hardy; Phil Jones; Allyson Lister; Michael Miller; Norman Morrison; Tim Rayner; Nataliya Sklyar; Chris Taylor; Weida Tong; Guy Warner; Stefan Wiemann
Journal: OMICS Date: 2008-06

8. The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics.

Authors: Andrew R Jones; Michael Miller; Ruedi Aebersold; Rolf Apweiler; Catherine A Ball; Alvis Brazma; James Degreef; Nigel Hardy; Henning Hermjakob; Simon J Hubbard; Peter Hussey; Mark Igra; Helen Jenkins; Randall K Julian; Kent Laursen; Stephen G Oliver; Norman W Paton; Susanna-Assunta Sansone; Ugis Sarkans; Christian J Stoeckert; Chris F Taylor; Patricia L Whetzel; Joseph A White; Paul Spellman; Angel Pizarro
Journal: Nat Biotechnol Date: 2007-10 Impact factor: 54.908

9. An integrated bioinformatics infrastructure essential for advancing pharmacogenomics and personalized medicine in the context of the FDA's Critical Path Initiative.

Authors: Weida Tong; Stephen C Harris; Hong Fang; Leming Shi; Roger Perkins; Federico Goodsaid; Felix W Frueh
Journal: Drug Discov Today Technol Date: 2007

10. The minimum information about a genome sequence (MIGS) specification.

Authors: Dawn Field; George Garrity; Tanya Gray; Norman Morrison; Jeremy Selengut; Peter Sterk; Tatiana Tatusova; Nicholas Thomson; Michael J Allen; Samuel V Angiuoli; Michael Ashburner; Nelson Axelrod; Sandra Baldauf; Stuart Ballard; Jeffrey Boore; Guy Cochrane; James Cole; Peter Dawyndt; Paul De Vos; Claude DePamphilis; Robert Edwards; Nadeem Faruque; Robert Feldman; Jack Gilbert; Paul Gilna; Frank Oliver Glöckner; Philip Goldstein; Robert Guralnick; Dan Haft; David Hancock; Henning Hermjakob; Christiane Hertz-Fowler; Phil Hugenholtz; Ian Joint; Leonid Kagan; Matthew Kane; Jessie Kennedy; George Kowalchuk; Renzo Kottmann; Eugene Kolker; Saul Kravitz; Nikos Kyrpides; Jim Leebens-Mack; Suzanna E Lewis; Kelvin Li; Allyson L Lister; Phillip Lord; Natalia Maltsev; Victor Markowitz; Jennifer Martiny; Barbara Methe; Ilene Mizrachi; Richard Moxon; Karen Nelson; Julian Parkhill; Lita Proctor; Owen White; Susanna-Assunta Sansone; Andrew Spiers; Robert Stevens; Paul Swift; Chris Taylor; Yoshio Tateno; Adrian Tett; Sarah Turner; David Ussery; Bob Vaughan; Naomi Ward; Trish Whetzel; Ingio San Gil; Gareth Wilson; Anil Wipat
Journal: Nat Biotechnol Date: 2008-05 Impact factor: 54.908
