Literature DB >> 23023984

InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data.

Richard N Smith¹, Jelena Aleksic, Daniela Butano, Adrian Carr, Sergio Contrino, Fengyuan Hu, Mike Lyne, Rachel Lyne, Alex Kalderimis, Kim Rutherford, Radek Stepan, Julie Sullivan, Matthew Wakeling, Xavier Watkins, Gos Micklem.

Abstract

SUMMARY: InterMine is an open-source data warehouse system that facilitates the building of databases with complex data integration requirements and a need for a fast customizable query facility. Using InterMine, large biological databases can be created from a range of heterogeneous data sources, and the extensible data model allows for easy integration of new data types. The analysis tools include a flexible query builder, genomic region search and a library of 'widgets' performing various statistical analyses. The results can be exported in many commonly used formats. InterMine is a fully extensible framework where developers can add new tools and functionality. Additionally, there is a comprehensive set of web services, for which client libraries are provided in five commonly used programming languages. AVAILABILITY: Freely available from http://www.intermine.org under the LGPL license. CONTACT: g.micklem@gen.cam.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical Species

Mesh：

Year: 2012 PMID： 23023984 PMCID： PMC3516146 DOI： 10.1093/bioinformatics/bts577

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Integrative analysis exploiting diverse datasets is a powerful approach in modern biology, but it can also be complex and time consuming. There are huge quantities of data available in a range of different formats, and a high level of computational expertise is often required for performing analysis, even once the data have been successfully integrated into a single database [for a general review of biological data integration, see Triplet and Butler (2011)]. InterMine is a data warehouse framework initially developed for FlyMine to address these issues for the Drosophila community (Lyne ). It is being adopted by a number of major model organism databases including budding yeast, rat, zebrafish, mouse and nematode worm (Balakrishnan ; Shimoyama ), and is in use by the modENCODE project—a large-scale international initiative to characterize functional elements in the fly and worm genomes (Celniker ; Contrino ), as well as for drug target discovery (Chen ), Drosophila transcription factors (Pfreundt ) and mitochondrial proteomics (Smith ). Many features of the InterMine system have been designed to lower the effort required for the setup and maintenance of large-scale biological databases, and these are outlined later in the text and discussed in more detail in the Supplementary Material.

1.1 Data loading and integration

The InterMine database build system allows for the integration of datasets of varying size and complexity. It comes equipped with data integration modules that load data from common biological formats (e.g. GFF3, FASTA, OBO, BioPAX, GAF, PSI and chado—see Supplementary Material for a full list), while custom sources can be added by writing data parsers using the Java application programming interface (API) or by creating XML files in the appropriate format, for which a Perl API exists. InterMine’s core biological model is based on the Sequence Ontology (Eilbeck ) and can be easily extended by editing an XML file. All model-based components of the system are automatically derived from the data model, allowing simple error-free upgrading of the user interface and API as new data types are added. One of the challenges of integrating data from different sources is that valuable datasets can become outdated when the gene models, and therefore identifiers, are updated. InterMine addresses this with an identifier resolver system that automatically replaces outdated identifiers with current ones when loading the dataset into the InterMine database. This is achieved using a map of current identifiers and synonyms from relevant databases—the identifiers matching a single synonym are changed to the current version, while the ones matching none or multiple synonyms are discarded. Integrating datasets can also be problematic if data values from different data sources conflict. To address this, database developers are able to set a priority for each data type, giving precedence to the more reliable data source at any level of data model granularity. To track data provenance, the database records each time a database row is created or modified.

1.2 Performance-tuned architecture

At the core of InterMine is the ObjectStore, a custom object/relational mapping system written in Java and optimized for read-only database performance. The ObjectStore accepts object queries from a client (the web application or web services), generates SQL to execute in the underlying PostgreSQL database and materializes objects from the results (Fig. 1).

Fig. 1.

InterMine system architecture

InterMine system architecture The Query optimizer improves performance using PostgreSQL tools and using pre-computed tables of results, which are created after all datasets have been loaded, joining connected data from different tables in the database. In this way, InterMine does not require, as other biological data warehouses do, a pre-defined database schema to perform. The analysis of estimated query execution times enables performance tuning, and query results are cached using a smart caching system, which can use previously run queries or parts of them to improve retrieval speed. This allows fast data mining of even large databases: modMine, the InterMine for the modENCODE project (Contrino ), currently contains >325 GB of data.

1.3 Data access

Users can access data both through RESTful web services and through the web application. The web application includes a faceted keyword search and powerful query tools. Users can input their own lists of identifiers (e.g. from genes), and lists and queries can be saved in a user’s private ‘MyMine’ account. Data identifiers are automatically updated to the most current ones for user input data, taking out a time-consuming step necessary for analysis across datasets. Type-specific report pages can be set up to provide collated summary information and relevant embedded interactive graphical displays (such as an interaction viewer, genome browser and expression heat maps), both for specific items such as individual genes and for lists of data items. The report pages provide summaries of results from a wide range of data types, facilitating serendipitous discovery. InterMine provides a set of analysis ‘widgets’. These include tables (e.g. displaying orthologues and interactions), visual charts (e.g. displaying chromosome distribution graphs, gene expression results and interaction networks) and enrichment analysis widgets—see Supplementary Material for details (covering functional terms, protein domains and publications). Through a well-defined framework, developers can also create further widgets to customize the web application to their needs. Query ‘templates’ can easily be created for commonly run queries, and the QueryBuilder allows construction of advanced custom queries. InterMine also includes functionality for querying features with overlapping genome coordinates. Query results can be saved and exported from the InterMine web application, e.g. as tab-delimited, comma-delimited, GFF3, FASTA or BED files, and can also be exported to Galaxy (Goecks ), where other data analysis workflows can be established and performed, such as analysing high-throughput sequencing data. Web service APIs can be used to access the core functionality of InterMine such as uploading lists of data identifiers and running queries. They can also be used to access the metadata and specialized resources, such as retrieving enrichment statistics, running the region search or exploring the data model. Custom-written client libraries exist in a number of programming languages, including Python, Perl, Ruby, Java and JavaScript, enabling users to automate data-based workflows or access data directly without using the InterMine web application. The InterMine web application automatically generates code from queries for each of these client languages. In addition, query results are available in many formats including JSON or can be embedded directly in HTML, making it possible to view data from an InterMine database within another website (e.g. www.modencode.org). A recent use example of these facilities is the development of the YeastMine iPhone application (M. Cherry, personal communication).

2 CONCLUSIONS

Through its speed and flexibility, InterMine provides an advanced system for setting up biological data warehouses that facilitate bioinformatics analysis. As new data types emerge, the data model can be easily adapted to accommodate them. There is already a range of user-friendly tools, and the extensible framework means that specific analysis tools can be developed to cater to the changing needs of the users, making it particularly well-suited to the needs of biomedical researchers. Many of the major model organism databases are adopting InterMine as their data warehouse of choice, putting it in a good position to work on cross-organism interoperability as part of future development. There is an active InterMine developer community, and funding is in place until 2018, promising long-term sustainability of InterMine as a resource and continued improvements to meet researchers’ needs.

10 in total

1. TargetMine, an integrated data warehouse for candidate gene prioritisation and target discovery.

Authors: Yi-An Chen; Lokesh P Tripathi; Kenji Mizuguchi
Journal: PLoS One Date: 2011-03-08 Impact factor: 3.240

2. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors: Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal: Genome Biol Date: 2010-08-25 Impact factor: 13.583

3. MitoMiner: a data warehouse for mitochondrial proteomics data.

Authors: Anthony C Smith; James A Blackshaw; Alan J Robinson
Journal: Nucleic Acids Res Date: 2011-11-24 Impact factor: 16.971

4. modMine: flexible access to modENCODE data.

Authors: Sergio Contrino; Richard N Smith; Daniela Butano; Adrian Carr; Fengyuan Hu; Rachel Lyne; Kim Rutherford; Alex Kalderimis; Julie Sullivan; Seth Carbon; Ellen T Kephart; Paul Lloyd; E O Stinson; Nicole L Washington; Marc D Perry; Peter Ruzanov; Zheng Zha; Suzanna E Lewis; Lincoln D Stein; Gos Micklem
Journal: Nucleic Acids Res Date: 2011-11-12 Impact factor: 16.971

5. RGD: a comparative genomics platform.

Authors: Mary Shimoyama; Jennifer R Smith; Tom Hayman; Stan Laulederkind; Tim Lowry; Rajni Nigam; Victoria Petri; Shur-Jen Wang; Melinda Dwinell; Howard Jacob
Journal: Hum Genomics Date: 2011-01 Impact factor: 4.639

6. YeastMine--an integrated data warehouse for Saccharomyces cerevisiae data as a multipurpose tool-kit.

Authors: Rama Balakrishnan; Julie Park; Kalpana Karra; Benjamin C Hitz; Gail Binkley; Eurie L Hong; Julie Sullivan; Gos Micklem; J Michael Cherry
Journal: Database (Oxford) Date: 2012-03-20 Impact factor: 3.451

7. The Sequence Ontology: a tool for the unification of genome annotations.

Authors: Karen Eilbeck; Suzanna E Lewis; Christopher J Mungall; Mark Yandell; Lincoln Stein; Richard Durbin; Michael Ashburner
Journal: Genome Biol Date: 2005-04-29 Impact factor: 13.583

8. Unlocking the secrets of the genome.

Authors: Susan E Celniker; Laura A L Dillon; Mark B Gerstein; Kristin C Gunsalus; Steven Henikoff; Gary H Karpen; Manolis Kellis; Eric C Lai; Jason D Lieb; David M MacAlpine; Gos Micklem; Fabio Piano; Michael Snyder; Lincoln Stein; Kevin P White; Robert H Waterston
Journal: Nature Date: 2009-06-18 Impact factor: 49.962

9. FlyTF: improved annotation and enhanced functionality of the Drosophila transcription factor database.

Authors: Ulrike Pfreundt; Daniel P James; Susan Tweedie; Derek Wilson; Sarah A Teichmann; Boris Adryan
Journal: Nucleic Acids Res Date: 2009-11-01 Impact factor: 16.971

10. FlyMine: an integrated database for Drosophila and Anopheles genomics.

Authors: Rachel Lyne; Richard Smith; Kim Rutherford; Matthew Wakeling; Andrew Varley; Francois Guillier; Hilde Janssens; Wenyan Ji; Peter Mclaren; Philip North; Debashis Rana; Tom Riley; Julie Sullivan; Xavier Watkins; Mark Woodbridge; Kathryn Lilley; Steve Russell; Michael Ashburner; Kenji Mizuguchi; Gos Micklem
Journal: Genome Biol Date: 2007 Impact factor: 13.583

10 in total

108 in total

Review 1. Mapping the tumour human leukocyte antigen (HLA) ligandome by mass spectrometry.

Authors: Lena Katharina Freudenmann; Ana Marcu; Stefan Stevanović
Journal: Immunology Date: 2018-05-08 Impact factor: 7.397

2. The mouse gene expression database: New features and how to use them effectively.

Authors: Jacqueline H Finger; Constance M Smith; Terry F Hayamizu; Ingeborg J McCright; Jingxia Xu; Janan T Eppig; James A Kadin; Joel E Richardson; Martin Ringwald
Journal: Genesis Date: 2015-06-18 Impact factor: 2.487

3. Identifying Cell Subpopulations and Their Genetic Drivers from Single-Cell RNA-Seq Data Using a Biclustering Approach.

Authors: Funan Shi; Haiyan Huang
Journal: J Comput Biol Date: 2017-07 Impact factor: 1.479

4. Integrative Comparison of mRNA Expression Patterns in Breast Cancers from Caucasian and Asian Americans with Implications for Precision Medicine.

Authors: Yanxia Shi; Albert Steppi; Ye Cao; Jianan Wang; Max M He; Liren Li; Jinfeng Zhang
Journal: Cancer Res Date: 2016-11-15 Impact factor: 12.701

5. Astrogenomics: big data, old problems, old solutions?

Authors: Aaron Golden; S Djorgovski; John M Greally
Journal: Genome Biol Date: 2013-08-13 Impact factor: 13.583

6. Chemogenomic study of gemcitabine using Saccharomyces cerevisiae as model cell-molecular insights about chemoresistance.

Authors: Lucas de Sousa Cavalcante; Tales A Costa-Silva; Tiago Antônio Souza; Susan Ienne; Gisele Monteiro
Journal: Braz J Microbiol Date: 2019-09-12 Impact factor: 2.476

7. ZFIN, The zebrafish model organism database: Updates and new directions.

Authors: Leyla Ruzicka; Yvonne M Bradford; Ken Frazer; Douglas G Howe; Holly Paddock; Sridhar Ramachandran; Amy Singer; Sabrina Toro; Ceri E Van Slyke; Anne E Eagle; David Fashena; Patrick Kalita; Jonathan Knight; Prita Mani; Ryan Martin; Sierra A T Moxon; Christian Pich; Kevin Schaper; Xiang Shao; Monte Westerfield
Journal: Genesis Date: 2015-07-08 Impact factor: 2.487

Review 8. Cross-organism analysis using InterMine.

Authors: Rachel Lyne; Julie Sullivan; Daniela Butano; Sergio Contrino; Joshua Heimbach; Fengyuan Hu; Alex Kalderimis; Mike Lyne; Richard N Smith; Radek Štěpán; Rama Balakrishnan; Gail Binkley; Todd Harris; Kalpana Karra; Sierra A T Moxon; Howie Motenko; Steven Neuhauser; Leyla Ruzicka; Mike Cherry; Joel Richardson; Lincoln Stein; Monte Westerfield; Elizabeth Worthey; Gos Micklem
Journal: Genesis Date: 2015-07-08 Impact factor: 2.487

9. COVID-WAREHOUSE: A Data Warehouse of Italian COVID-19, Pollution, and Climate Data.

Authors: Giuseppe Agapito; Chiara Zucco; Mario Cannataro
Journal: Int J Environ Res Public Health Date: 2020-08-03 Impact factor: 3.390

10. Using WormBase: A Genome Biology Resource for Caenorhabditis elegans and Related Nematodes.

Authors: Christian Grove; Scott Cain; Wen J Chen; Paul Davis; Todd Harris; Kevin L Howe; Ranjana Kishore; Raymond Lee; Michael Paulini; Daniela Raciti; Mary Ann Tuli; Kimberly Van Auken; Gary Williams
Journal: Methods Mol Biol Date: 2018