Pavel S Novichkov [1], John-Marc Chandonia [1], Adam P Arkin [1,2]
[1] Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
[2] Department of Bioengineering, University of California, Berkeley, CA 94720, USA.
Abstract
BACKGROUND: Many organizations face challenges in managing and analyzing data, especially when relevant datasets arise from multiple sources and methods. Analyzing heterogeneous datasets and additional derived data requires rigorous tracking of their interrelationships and provenance. This task has long been a Grand Challenge of data science and has more recently been formalized in the FAIR principles: that all data objects be Findable, Accessible, Interoperable, and Reusable, both for machines and for people. Adherence to these principles is necessary for proper stewardship of information, for testing regulatory compliance, for measuring the efficiency of processes, and for facilitating reuse of data-analytical frameworks.

FINDINGS: We present the Contextual Ontology-based Repository Analysis Library (CORAL), a platform that greatly facilitates adherence to all 4 of the FAIR principles, including the especially difficult challenge of making heterogeneous datasets Interoperable and Reusable across all parts of a large, long-lasting organization. To achieve this, CORAL's data model requires that data generators extensively document the context for all data, and our tools maintain that context throughout the entire analysis pipeline. CORAL also features a web interface for data generators to upload and explore data, as well as a Jupyter notebook interface for data analysts, both backed by a common API.

CONCLUSIONS: CORAL enables organizations to build FAIR data types on the fly as they are needed, avoiding the expense of bespoke data modeling. CORAL provides a uniquely powerful platform to enable integrative cross-dataset analyses, generating deeper insights than are possible using traditional analysis tools.