ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level.

Philippe Rocca-Serra, Marco Brandizi, Eamonn Maguire, Nataliya Sklyar, Chris Taylor, Kimberly Begley, Dawn Field, Stephen Harris, Winston Hide, Oliver Hofmann, Steffen Neumann, Peter Sterk, Weida Tong, Susanna-Assunta Sansone.

Abstract

UNLABELLED: The first open source software suite for experimentalists and curators that (i) assists in the annotation and local management of experimental metadata from high-throughput studies employing one or a combination of omics and other technologies; (ii) empowers users to uptake community-defined checklists and ontologies; and (iii) facilitates submission to international public repositories.
AVAILABILITY AND IMPLEMENTATION: Software, documentation, case studies and implementations at http://www.isa-tools.org.

Year: 2010 PMID: 20679334 PMCID: PMC2935443 DOI: 10.1093/bioinformatics/btq415

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 HIGH-THROUGHPUT OMICS STUDIES

The development of high-throughput genomic and post-genomic (hereafter, ‘omics’) technologies entails changes in the handling, processing and sharing of data (Schofield et al., 2009). Omics datasets are often complex and rich in context. Studies may run material through several kinds of assay, using both omics and other technologies; for example, studying the effect of a compound on rat liver through transcriptome, proteome and metabolome profiling (using high-throughput sequencing and two kinds of mass spectrometry, respectively) alongside conventional analyses (e.g. histopathology). Such data must be accompanied by enough contextual information (i.e. metadata: sample characteristics, technology and measurement types, instrument parameters and sample-to-data relationships) to make datasets comprehensible and reusable if they are to underpin future investigations. Many funders and journals require that researchers share data, and encourage the enrichment and standardization of experimental metadata (Field et al., 2009). Consequently, more and richer studies are flowing into public databases. However, two bottlenecks can significantly hamper this process, necessitating urgent solutions. First, international public repositories for omics data such as GEO (Barrett et al., 2009), ArrayExpress (Parkinson et al., 2009), PRIDE (Vizcaíno et al., 2010), ENA, SRA and DRA (Shumway et al., 2010), have their own submission formats, data models and terminologies, created for specific types of assay. This complicates the submission process for researchers producing multi-assay studies (and greatly increases the risk that these datasets become irrevocably fragmented). Secondly, the shortage of curators to check and annotate submissions to public repositories, a situation unlikely to change soon, necessitates better annotation at source (by experimentalists or community-based efforts; Howe et al., 2008).
Free software, with automated content validation, is required to facilitate the collection, management and curation of a variety of studies in-house, and to format those data for submission to public repositories. Such software should support community-defined reporting standards, such as the minimum information checklists listed by the MIBBI Portal (Taylor et al., 2007), and ontologies (Côté et al., 2006; Smith et al., 2007; Noy et al., 2009). The Investigation/Study/Assay (ISA) infrastructure described here is the first general-purpose format and freely available desktop software suite designed to regularize local management of experimental metadata by enabling curation at source, supporting community-defined reporting standards and preparing studies for submission to public repositories.

2 THE ISA FORMAT AND SOFTWARE SUITE

The software suite comprises five platform-independent Java-based software components for local use, including a relational database (Fig. 1), built around the ISA-Tab format. The components work both as stand-alone applications and as a unified system to assist in the local management and storage of experimental metadata, and to facilitate data submission to international public repositories. All components run as ‘desktop’ applications; in addition, the database component features a web-based query interface.

Fig. 1. The role of each ISA software component, showing their interrelations, target users and the flow of information through the system.

2.1 ISA-Tab: an extensible, cross-domain format

‘Investigation’, ‘Study’ and ‘Assay’ are the three key entities around which the general-purpose ISA-Tab format for structuring and communicating metadata is built (Sansone et al., 2008). Investigation contains all the information needed to understand the overall goals and means used in an experiment; Study is the central unit, containing information on the subject under study, its characteristics and any treatments applied. Each Study has associated Assay(s), producing qualitative or quantitative data, defined by the type of measurement (e.g. gene expression) and the technology employed (e.g. high-throughput sequencing). The hierarchical structure of ISA-Tab enables the representation of studies employing one or a combination of omics and other technologies, overcoming the fragmentation of the existing submission formats built for specific types of assay. To facilitate conversion, ISA-Tab has been designed with reference to these existing omics formats (Jones et al., 2007), complementing and extending their work where necessary; for example, it shares both syntax and the use of easily manipulable tab-delimited text files with ArrayExpress’ MAGE-Tab (Rayner et al., 2006). Additionally, where omics-based technologies are used in clinical or non-clinical studies, ISA-Tab complements existing biomedical formats such as the Study Data Tabulation Model (http://www.cdisc.org/sdtm), endorsed by the US Food and Drug Administration. ISA-Tab also complements the XML formats used by the PRIDE, ENA, SRA and DRA repositories, and consequently offers a way to render their experimental metadata documents in a more user-friendly format. Note though that ISA-Tab is simply a format; the decision on how to regulate its use (e.g. enforcing the filling of required fields, or the use of ontologies) is left to local administrators' use of ISA software components, or the growing number of other systems and groups implementing the format (e.g. Krestyaninova et al., 2009; SysMO-DB, http://www.sysmo-db.org/community; XperimentR, http://www.imperial.ac.uk/bioinfsupport/resources/data_management/; more given on the ISA web site).
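The Investigation/Study/Assay hierarchy and the tab-delimited layout can be illustrated with a minimal Python sketch. It is not the ISA-Tab specification: the class names, the `read_tab_table` helper and the column names in `ASSAY_TEXT` are assumptions chosen for illustration only.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Assay:
    measurement_type: str   # e.g. "gene expression"
    technology_type: str    # e.g. "high-throughput sequencing"
    rows: list = field(default_factory=list)  # one dict per table row


@dataclass
class Study:
    identifier: str
    assays: list = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: list = field(default_factory=list)


def read_tab_table(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Hypothetical assay table; the column names are illustrative only.
ASSAY_TEXT = (
    "Sample Name\tAssay Name\tRaw Data File\n"
    "liver-1\trun-1\trun1.fastq\n"
    "liver-2\trun-2\trun2.fastq\n"
)

assay = Assay("gene expression", "high-throughput sequencing",
              read_tab_table(ASSAY_TEXT))
study = Study("S1", [assay])
investigation = Investigation("Compound effect on rat liver", [study])

# Walk the hierarchy: one Investigation, its Studies, their Assays.
for s in investigation.studies:
    for a in s.assays:
        print(s.identifier, a.measurement_type, a.technology_type, len(a.rows))
```

Because every Assay hangs off a Study, a multi-assay study stays in one structure rather than being split across assay-specific formats.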

2.2 ISAcreator: a user-friendly editor

This desktop application enables users (i.e. experimentalists) to compile experimental metadata sets, and to import and edit existing ISA-Tab formatted files. It breaks down overall descriptions into relatively simple parts, uses graphical abstraction to enable visualization of the information described and facilitates time-efficient description of experimental steps by remembering prior behaviour (through user profiles). ISAcreator's aesthetically pleasing interface makes extensive use of Java Swing and external open source libraries (e.g. Prefuse, http://prefuse.org/). The editor uses a style of form- and spreadsheet-based data entry that is likely to be familiar to researchers, augmenting basic functionality such as ‘auto-fill’ and ‘undo’ with advanced features, listed below.

2.2.1 Ontology support

A dedicated ‘widget’ allows ontology terms to be searched for and inserted in real time via the BioPortal (Noy et al., 2009) and the Ontology Lookup Service (Côté et al., 2006). Terms from those sources are imported along with core metadata (identifiers, definitions and ontology version); term selection is facilitated by a search history displaying prior choices (through user profiles).

2.2.2 Design wizard

The design wizard offers an alternative way to enter information. It leverages common patterns to reduce repetitive tasks, guiding users through a series of questions that elicit information about the design of the Study and associated Assay(s).

2.2.3 Spreadsheet import

As a second alternative, this widget enables the mapping and import of information from existing spreadsheets, as well as the reformatting and reannotation of legacy data.

2.2.4 Data file chooser

This widget attaches data files, located either locally or on a remote system accessible by FTP, to an experimental metadata set. Upon completion of a valid investigation report, ISAcreator outputs a compressed ‘ISArchive’ containing the ISA-Tab-formatted metadata and either the actual data files or, if necessary (e.g. because of their large size), a reference to them consisting of their address and file name.
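The embed-or-reference packaging described above can be sketched in Python as follows. This is an assumption-laden illustration, not ISAcreator's implementation: the `build_archive` function, the size threshold, the `data/` folder and the `data_references.json` manifest are all hypothetical.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def build_archive(archive_path, metadata_files, data_files, max_bytes):
    """Zip the metadata; embed small data files, reference large ones."""
    references = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in metadata_files:
            zf.write(path, arcname=Path(path).name)
        for path in data_files:
            path = Path(path)
            if path.stat().st_size <= max_bytes:
                zf.write(path, arcname="data/" + path.name)
            else:
                # Too large to embed: record address and file name instead.
                references.append({"address": str(path.parent),
                                   "file_name": path.name})
        # Hypothetical manifest name; not part of any ISA specification.
        zf.writestr("data_references.json", json.dumps(references))
    return references


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "i_investigation.txt").write_text("Investigation Title\tDemo\n")
    (tmp / "small.csv").write_text("a,b\n1,2\n")
    (tmp / "large.raw").write_bytes(b"\0" * 5000)

    refs = build_archive(tmp / "demo.zip",
                         [tmp / "i_investigation.txt"],
                         [tmp / "small.csv", tmp / "large.raw"],
                         max_bytes=1000)
    with zipfile.ZipFile(tmp / "demo.zip") as zf:
        names = sorted(zf.namelist())

print(names)
print([r["file_name"] for r in refs])
```

In this sketch the small file travels inside the archive, while the large one is represented only by its address and file name.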

2.3 ISAconfigurator: standards-compliant templates

This desktop application allows ‘power users’ (i.e. community curators) to customize the fields displayed by ISAcreator, for example to meet the requirements of one or more MIBBI minimum information checklists by declaring certain fields mandatory, or by specifying allowed values (e.g. drawn from a set of ontology terms, or formatted in a specific manner). Configuration files from ISAconfigurator are read by ISAcreator, which then generates interface components as required.
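The idea of a template that marks fields as mandatory, restricts them to allowed values or imposes a format can be sketched in Python. The `FieldRule` class, the `check_value` function and the field names in `TEMPLATE` are hypothetical; the actual ISAconfigurator file format is not shown here.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldRule:
    name: str
    mandatory: bool = False
    allowed_values: Optional[frozenset] = None  # e.g. a set of ontology terms
    pattern: Optional[str] = None               # required format, as a regex


def check_value(rule, value):
    """Return a list of problems with one value under one rule."""
    value = (value or "").strip()
    if not value:
        return [f"{rule.name}: missing"] if rule.mandatory else []
    problems = []
    if rule.allowed_values is not None and value not in rule.allowed_values:
        problems.append(f"{rule.name}: '{value}' not an allowed value")
    if rule.pattern is not None and not re.fullmatch(rule.pattern, value):
        problems.append(f"{rule.name}: '{value}' has the wrong format")
    return problems


# Hypothetical template; field names and rules are illustrative only.
TEMPLATE = [
    FieldRule("Organism", mandatory=True,
              allowed_values=frozenset({"Rattus norvegicus", "Mus musculus"})),
    FieldRule("Collection Date", pattern=r"\d{4}-\d{2}-\d{2}"),
    FieldRule("Comment"),
]

record = {"Organism": "rat", "Collection Date": "2009/03/01", "Comment": ""}
for rule in TEMPLATE:
    for problem in check_value(rule, record.get(rule.name)):
        print(problem)
```

Keeping the rules as data rather than code is what would let an editor generate its entry forms from the same template.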

2.4 ISAvalidator: adherence to templates

This desktop application also reads configuration files and checks both that completed ISA-Tab files meet specified requirements and that associated data files have been linked. Whether ISA-Tab files are created with ISAcreator or in another way (e.g. with spreadsheet software), ISAvalidator checks that the document is syntactically correct and internally consistent, and reports on errors (e.g. missing or incorrect values).
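A minimal Python sketch of these three kinds of check (syntax, required values and linked data files) on a single tab-delimited table is given below. It is not ISAvalidator's logic: the `validate_table` function, its parameters and the sample table are assumptions made for illustration.

```python
import csv
import io


def validate_table(text, required_columns, file_column, available_files):
    """Check a tab-delimited table; return a list of error messages."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["table is empty"]
    header, errors = rows[0], []
    for column in required_columns:
        if column not in header:
            errors.append(f"missing column '{column}'")
    for line_no, row in enumerate(rows[1:], start=2):
        # Syntactic check: every row must match the header width.
        if len(row) != len(header):
            errors.append(f"line {line_no}: expected {len(header)} "
                          f"fields, found {len(row)}")
            continue
        record = dict(zip(header, row))
        for column in required_columns:
            if column in record and not record[column].strip():
                errors.append(f"line {line_no}: '{column}' is empty")
        # Consistency check: each named data file must have been supplied.
        name = record.get(file_column, "").strip()
        if name and name not in available_files:
            errors.append(f"line {line_no}: data file '{name}' not found")
    return errors


# Hypothetical table and file list, for illustration only.
TABLE = (
    "Sample Name\tRaw Data File\n"
    "liver-1\trun1.fastq\n"
    "\trun2.fastq\n"
    "liver-3\n"
)
errors = validate_table(TABLE, ["Sample Name"], "Raw Data File",
                        {"run1.fastq"})
for error in errors:
    print(error)
```

The function works on plain text, so it applies equally to files written by an editor or exported from spreadsheet software.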

2.5 BioInvestigation Index: local storage

An ISArchive provides a simple way to store and share information in a structured manner, but those tasks are better performed by uploading such a file to an instance of our ‘BioInvestigation Index’ (BII), or another system that implements ISA-Tab import. The BII includes a management tool and relational database (tested with Oracle, MySQL and PostgreSQL). The former enables validation and loading of an ISArchive and provides simple permissions functionality to link users (or groups of users) to studies. The latter manages the storage of experimental metadata, which can be collectively searched and browsed via a query interface or web services; the destination for associated data files, and their protocol for transfer, is custom defined by the local administrator on installation. As an example, a publicly accessible instance of the BII, maintained by the European Bioinformatics Institute (http://www.ebi.ac.uk/bioinvindex), has proven useful as a curation and storage system for multi-assay studies, and as a mechanism for submitting data files to ArrayExpress, PRIDE, ENA and SRA. Installation of the BII system requires some knowledge of database management. However, it is portable enough to be easily installed in individual labs, to maximize the efficiency with which high-throughput studies can be managed and shared among users that have been granted access to them.

2.6 ISAconverter: submission to public repositories

ISAconverter recodes the relevant parts of ISArchives as MAGE-Tab, PRIDE XML or SRA-XML (used by ArrayExpress, PRIDE and ENA, SRA and DRA, respectively), enabling combined submission to public omics repositories. It is readily extensible to support export of other formats, e.g. SOFT required by GEO (Barrett et al., 2009). Mappings for format elements are available in the ISA-Tab specification and documentation on the ISA web site.
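Recoding through element mappings can be illustrated, for the tabular case only, with a short Python sketch. The mapping in `COLUMN_MAP` and the target column names are invented for illustration; the real mappings are those documented in the ISA-Tab specification, and the XML targets would need more than column renaming.

```python
import csv
import io

# Hypothetical source-to-target column mapping; the real mappings are
# documented in the ISA-Tab specification and are not reproduced here.
COLUMN_MAP = {
    "Sample Name": "sample_id",
    "Raw Data File": "data_file",
}


def convert_table(text, column_map):
    """Rename mapped columns and drop columns the target does not use."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    kept = [c for c in reader.fieldnames if c in column_map]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([column_map[c] for c in kept])
    for row in reader:
        writer.writerow([row[c] for c in kept])
    return out.getvalue()


SOURCE = (
    "Sample Name\tInternal Note\tRaw Data File\n"
    "liver-1\tchecked\trun1.cel\n"
)
converted = convert_table(SOURCE, COLUMN_MAP)
print(converted, end="")
```

Supporting a further tabular target in this sketch would mean supplying another mapping dictionary, which is one way to read the claim that the converter is readily extensible.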

3 COLLABORATIONS AND CASE STUDIES

Developed for the European multi-site ‘CarcinoGENOMICS’ project (Vinken et al., 2008), the ISA software suite version one was released in early 2009. The core ISA developers are engaged with an ever-growing number of collaborators: case studies from early implementers already provide evidence of the diverse life science scenarios in which the suite's various components have been successfully tested and are being used with large datasets (details on the ISA web site). The main limitations recorded to date are simply the person hours required to specify the standards and ontologies to be used and to actually curate studies. Demonstrable acceptance and community engagement have also brought a new funding stream for this project, allowing us to continue the collaborative development of this exemplar system that supports data sharing policies, promotes the uptake of community-defined reporting standards and ontologies and enables curation at source (Field et al., 2009). The ISA components, in particular the BII, have been designed to provide core functionalities. Inevitably, each collaborator has additional in-house requirements that are too specific to be included as core functionality. This may be due to the nature of their studies or their need for one or more ISA software components to be interoperable with existing systems. To support further collaborative development, the core ISA developers are setting up an environment for distributed development, and are augmenting the ISA code base with Application Programming Interfaces (APIs). Ongoing collaborative activities include: a module to enable the analysis of ISA-Tab formatted metadata and any associated data, using R; integration with other data management and analysis systems (e.g. Fang et al., 2009; MetWare, http://metware.org); and giving assistance to the growing number of projects exploring the tools and underlying format (e.g. Sage, http://sagecongress.org/WP/workstreams/Standards; Kawaji et al., 2009).
Other collaborative activities include an enhanced user authentication system, support for additional formats such as RDF, OWL and SOFT, converters to/from lab equipment-related file formats (e.g. sampling robots and mass spectrometers) and improved packaging and distribution mechanisms to offer a single download bundle to facilitate installation.

18 in total

1. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration.

Authors: Barry Smith; Michael Ashburner; Cornelius Rosse; Jonathan Bard; William Bug; Werner Ceusters; Louis J Goldberg; Karen Eilbeck; Amelia Ireland; Christopher J Mungall; Neocles Leontis; Philippe Rocca-Serra; Alan Ruttenberg; Susanna-Assunta Sansone; Richard H Scheuermann; Nigam Shah; Patricia L Whetzel; Suzanna Lewis
Journal: Nat Biotechnol Date: 2007-11 Impact factor: 54.908

2. The first RSBI (ISA-TAB) workshop: "can a simple format work for complex studies?".

Authors: Susanna-Assunta Sansone; Philippe Rocca-Serra; Marco Brandizi; Alvis Brazma; Dawn Field; Jennifer Fostel; Andrew G Garrow; Jack Gilbert; Federico Goodsaid; Nigel Hardy; Phil Jones; Allyson Lister; Michael Miller; Norman Morrison; Tim Rayner; Nataliya Sklyar; Chris Taylor; Weida Tong; Guy Warner; Stefan Wiemann
Journal: OMICS Date: 2008-06

3. The carcinoGENOMICS project: critical selection of model compounds for the development of omics-based in vitro carcinogenicity screening assays.

Authors: Mathieu Vinken; Tatyana Doktorova; Heidrun Ellinger-Ziegelbauer; Hans-Jürgen Ahr; Edward Lock; Paul Carmichael; Erwin Roggen; Joost van Delft; Jos Kleinjans; José Castell; Roque Bort; Teresa Donato; Michael Ryan; Raffaella Corvi; Hector Keun; Timothy Ebbels; Toby Athersuch; Susanna-Assunta Sansone; Philippe Rocca-Serra; Rob Stierum; Paul Jennings; Walter Pfaller; Hans Gmuender; Tamara Vanhaecke; Vera Rogiers
Journal: Mutat Res Date: 2008-04-26 Impact factor: 2.433

4. ArrayTrack: an FDA and public genomic tool.

Authors: Hong Fang; Stephen C Harris; Zhenjiang Su; Minjun Chen; Feng Qian; Leming Shi; Roger Perkins; Weida Tong
Journal: Methods Mol Biol Date: 2009

5. NCBI GEO: archive for high-throughput functional genomic data.

Authors: Tanya Barrett; Dennis B Troup; Stephen E Wilhite; Pierre Ledoux; Dmitry Rudnev; Carlos Evangelista; Irene F Kim; Alexandra Soboleva; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Rolf N Muertter; Ron Edgar
Journal: Nucleic Acids Res Date: 2008-10-21 Impact factor: 16.971

6. Big data: The future of biocuration.

Authors: Doug Howe; Maria Costanzo; Petra Fey; Takashi Gojobori; Linda Hannick; Winston Hide; David P Hill; Renate Kania; Mary Schaeffer; Susan St Pierre; Simon Twigger; Owen White; Seung Yon Rhee
Journal: Nature Date: 2008-09-04 Impact factor: 49.962

7. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project.

Authors: Chris F Taylor; Dawn Field; Susanna-Assunta Sansone; Jan Aerts; Rolf Apweiler; Michael Ashburner; Catherine A Ball; Pierre-Alain Binz; Molly Bogue; Tim Booth; Alvis Brazma; Ryan R Brinkman; Adam Michael Clark; Eric W Deutsch; Oliver Fiehn; Jennifer Fostel; Peter Ghazal; Frank Gibson; Tanya Gray; Graeme Grimes; John M Hancock; Nigel W Hardy; Henning Hermjakob; Randall K Julian; Matthew Kane; Carsten Kettner; Christopher Kinsinger; Eugene Kolker; Martin Kuiper; Nicolas Le Novère; Jim Leebens-Mack; Suzanna E Lewis; Phillip Lord; Ann-Marie Mallon; Nishanth Marthandan; Hiroshi Masuya; Ruth McNally; Alexander Mehrle; Norman Morrison; Sandra Orchard; John Quackenbush; James M Reecy; Donald G Robertson; Philippe Rocca-Serra; Henry Rodriguez; Heiko Rosenfelder; Javier Santoyo-Lopez; Richard H Scheuermann; Daniel Schober; Barry Smith; Jason Snape; Christian J Stoeckert; Keith Tipton; Peter Sterk; Andreas Untergasser; Jo Vandesompele; Stefan Wiemann
Journal: Nat Biotechnol Date: 2008-08 Impact factor: 54.908

8. The FANTOM web resource: from mammalian transcriptional landscape to its dynamic regulation.

Authors: Hideya Kawaji; Jessica Severin; Marina Lizio; Andrew Waterhouse; Shintaro Katayama; Katharine M Irvine; David A Hume; Alistair R R Forrest; Harukazu Suzuki; Piero Carninci; Yoshihide Hayashizaki; Carsten O Daub
Journal: Genome Biol Date: 2009-04-19 Impact factor: 13.583

9. BioPortal: ontologies and integrated data resources at the click of a mouse.

Authors: Natalya F Noy; Nigam H Shah; Patricia L Whetzel; Benjamin Dai; Michael Dorf; Nicholas Griffith; Clement Jonquet; Daniel L Rubin; Margaret-Anne Storey; Christopher G Chute; Mark A Musen
Journal: Nucleic Acids Res Date: 2009-05-29 Impact factor: 16.971

10. ArrayExpress update--from an archive of functional genomics experiments to the atlas of gene expression.

Authors: Helen Parkinson; Misha Kapushesky; Nikolay Kolesnikov; Gabriella Rustici; Mohammad Shojatalab; Niran Abeygunawardena; Hugo Berube; Miroslaw Dylag; Ibrahim Emam; Anna Farne; Ele Holloway; Margus Lukk; James Malone; Roby Mani; Ekaterina Pilicheva; Tim F Rayner; Faisal Rezwan; Anjan Sharma; Eleanor Williams; Xiangqun Zheng Bradley; Tomasz Adamusiak; Marco Brandizi; Tony Burdett; Richard Coulson; Maria Krestyaninova; Pavel Kurnosov; Eamonn Maguire; Sudeshna Guha Neogi; Philippe Rocca-Serra; Susanna-Assunta Sansone; Nataliya Sklyar; Mengyao Zhao; Ugis Sarkans; Alvis Brazma
Journal: Nucleic Acids Res Date: 2008-11-10 Impact factor: 16.971
