Robert L. Grossman, Center for Translational Data Science, University of Chicago, 900 East 57th Street, KCBD 10142, Chicago, IL 60637, USA. Electronic address: robert.grossman@uchicago.edu.
Abstract
Data commons collate data with cloud computing infrastructure and commonly used software services, tools, and applications to create biomedical resources for the large-scale management, analysis, harmonization, and sharing of biomedical data. Over the past few years, data commons have been used to analyze, harmonize, and share large-scale genomics datasets. Data ecosystems can be built by interoperating multiple data commons. It can be quite labor intensive to curate, import, and analyze the data in a data commons. Data lakes provide an alternative to data commons and simply provide access to data, with the data curation and analysis deferred until later and delegated to those that access the data. We review software platforms for managing, analyzing, and sharing genomic data, with an emphasis on data commons, but also cover data ecosystems and data lakes.
The commoditization of sensors has resulted in new generations of instruments that produce large datasets that are available to genetics researchers. Next generation sequencing produced whole exome and whole genome datasets that were 200 to 800 GB or larger, and large projects such as The Cancer Genome Atlas (TCGA) [1] contain more than 2 PB of data and derived data. Over the next few years, the research community will collect single-cell atlases [2], next generation imaging that captures the cellular microenvironment, and atlases of cancer cells' interactions with the immune system, all of which will produce ever larger datasets.

The accumulation of all these data has resulted in several challenges for the genetics research community. First, the size of the datasets is too large for all but the largest research organizations to manage and analyze. Second, the current model, in which research groups set up their own computing infrastructure, download their own copy of the data, add their own data, and analyze the integrated dataset, is simply too expensive for government and private funding organizations to support. Third, the information technology (IT) expertise to set up the required large-scale computing environments and the bioinformatics expertise to set up the required bioinformatics environments are difficult for most organizations to support. Fourth, because of batch effects (see Glossary) [3], it is usually considered wise to re-analyze all of the data (from raw data) using a common set of bioinformatics pipelines to minimize the presence of batch effects.

The importance of the appropriate data and computing infrastructure to create 'knowledge bases' and 'knowledge networks' to support precision medicine has been described in several reports [4,5]. In this review article, we describe some of the data, analysis, and collaboration platforms that have emerged to deal with these challenges.
Platforms for Data Sharing
Cloud Computing
Over the past 15 years, large-scale internet companies, such as Google, Amazon, and Facebook, have developed new computing infrastructure for their own internal use that became known as cloud computing platforms [6]. Some of these companies then made these platforms available to customers, including Amazon's Amazon Web Services (AWS), Google's Cloud Platform (GCP), and Microsoft's Azure. Importantly, open source versions of some of these platforms were also developed [7], including OpenStack (openstack.org) and OpenNebula (opennebula.org), enabling organizations to set up their own on-premise clouds. On-premise clouds are also called private clouds [8] to distinguish them from commercial public clouds that are used by multiple organizations.

NIST has developed a definition of cloud computing that includes the following characteristics [8]: (i) elastic, in the sense that large-scale resources are available; and (ii) self-provisioned, in the sense that a user can provision the required computing infrastructure directly through a portal or application programming interface (API).

Although it took time for cloud computing to be adapted for biomedical informatics, there was early recognition within the cancer community of the importance of this technology [9,10], and several university- and institute-based projects developed production-level cloud computing platforms to support the cancer research community, including the Bionimbus Protected Data Cloud [11], the Galaxy Cloud [12,13], Globus Genomics [14], and the Cancer Genome Collaboratory [15]. In addition, commercial companies, including DNAnexus [16] and Seven Bridges [17], developed cloud-based solutions for processing genomic data.

It may be helpful to divide computing platforms supporting biomedical research into three generations: (i) databases, (ii) data clouds, and (iii) data commons (Figure 1).
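For example, self-provisioning through an API might look like the following minimal sketch, which uses the AWS SDK for Python (boto3); the machine image ID and instance type are illustrative placeholders, and valid AWS credentials are assumed to be configured in the environment.

```python
import boto3

# Connect to the EC2 service; credentials are assumed to be configured
# in the environment (e.g., AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
ec2 = boto3.client("ec2", region_name="us-east-1")

# Self-provision a single virtual machine. In an elastic cloud, the same
# call scales out simply by raising MaxCount.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical machine image ID
    InstanceType="m5.xlarge",         # illustrative instance type
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```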
Figure 1.
Some of the Important Differences between Data Clouds and Data Commons.
Databases and Data Portals
First generation platforms operated databases in which biomedical datasets were deposited, beginning with GenBank [18]. As the web became the dominant infrastructure for collaboration, data portals emerged as applications that made the data in the underlying databases readily available to researchers. For the purposes here, one can think of a data portal as a website that provides interactive access to data in an underlying database. Although data portals are outside the scope of this review article, it is still important to mention the University of California, Santa Cruz (UCSC) Genome Browser [19] and cBioPortal [20] as two of the most important examples from this category.

The UCSC Genome Browser has been in continuous development since it was first launched in 2000 to help visualize the first working draft of the human genome assembly [19]. Today, it contains more than 160 assemblies from more than 90 species and can be run not only over the web but also downloaded and run locally using a version called genome browser in a box (GBiB) [21].

The cBioPortal for Cancer Genomics [20] is a widely used resource that integrates and visualizes cancer genomic data, including mutations, copy number variation, gene expression data, and clinical information. Currently, cBioPortal includes data from TCGA that are processed by the Broad Institute's Firehose and data from the International Cancer Genome Consortium (ICGC) that are processed by the PANCAN Analysis Working Group, plus additional smaller datasets [22]. cBioPortal was one of the first cancer data portals to organize data by genomic alterations, such as mutations, deletions, copy number variation, and expression levels, in a way that seemed natural to research oncologists and to tie the alterations back to the original cases to support further investigation when desired.

With next generation sequencing, the size of genomics datasets began to grow, and large-scale computing infrastructure was required to process, manage, and distribute the data. Several systems were developed to process datasets such as TCGA. CGHub was developed by UCSC to host the BAM files from the TCGA project [23]. The Firehose system, developed by the Broad Institute, integrates data from TCGA and processes the data using applications from the Genome Analysis Toolkit (GATK) [24] and algorithms such as GISTIC2.0 [25] and MutSig [26]. The results can be browsed and accessed via a website (gdac.broadinstitute.org) and are available in the Broad's FireCloud system [27].
Data Clouds
Second generation systems co-locate computing with biomedical data, enabling researchers to compute over the data. An early example is the BLAST service [28] provided by the National Center for Biotechnology Information. Over the past decade, cloud computing has enabled the co-location of on-demand, large-scale computing infrastructure with hosted biomedical data, creating new opportunities for large-scale analysis. Here, we use the term 'data cloud' for this integrated infrastructure. A working definition of a data cloud for biomedical data is a cloud computing platform [6] that manages and analyzes biomedical data and, usually, integrates the security and compliance required to work with controlled access biomedical data, such as germline genomic data. Examples of biomedical data clouds include the Bionimbus Protected Data Cloud developed by the University of Chicago [11], the Cancer Genomics Cloud developed by Seven Bridges Genomics [17], the Cancer Collaboratory developed by the Ontario Institute for Cancer Research [29], and the Galaxy Cloud [12,13] developed by the Galaxy Project.

We now describe three important milestones in the use of large-scale cloud computing in genomics. The first milestone was the launch of the National Cancer Institute (NCI) Genomic Data Commons [30], which used an OpenStack-based private cloud to analyze and harmonize genomic and associated clinical data from more than 18 000 cancer tumor-normal pairs, including TCGA [1]. By data harmonization, we mean applying a uniform set of pipelines for cleaning, applying quality control criteria, processing, and post-processing submitted data [31]. The second milestone was the development of the three NCI Cloud Pilots: the ISB Cancer Genomics Cloud by the Institute for Systems Biology [32], FireCloud by the Broad Institute [27], and the Cancer Genomics Cloud by Seven Bridges Genomics [17], each of which provided cloud-based computing infrastructure to analyze TCGA data. The first two Cloud Pilots used GCP and the third used AWS. The third milestone was the analysis of 2800 whole genomes using multiple distributed public and private clouds by the PANCAN Analysis Working Group [29].

Cloud computing is widely used today to support scientific research in many disciplines outside the biomedical sciences. In general, the architecture of these systems is simpler, since the security and compliance infrastructure required for working with controlled access biomedical data is not needed.
Data Commons
Third generation systems integrate biomedical data, computing and storage infrastructure, and software services required for working with data to create a data commons. Some examples of data commons and six core requirements for data commons are reviewed in [33]. A working definition of a 'data commons' is the colocation of data with cloud computing infrastructure and commonly used software services, tools, and applications for managing, integrating, analyzing, and sharing data that are exposed through APIs to create an interoperable resource [33].

Some of the core services (data commons services) required for a data commons are as follows:
(i) authentication services for identifying researchers;
(ii) authorization services for determining which datasets researchers can access;
(iii) digital ID services for assigning permanent identifiers to datasets and accessing data using these IDs;
(iv) metadata services for assigning metadata to a digital object identified by a digital ID and accessing the metadata;
(v) security and compliance services so that data commons can support controlled access data;
(vi) data model services for integrating data with respect to one or more data models; and
(vii) workflow services for executing bioinformatics pipelines so that data can be analyzed and harmonized.

Accessing controlled access data requires services (i) and (ii). With service (iii), data stored in commons are findable and accessible. With service (iv), data stored in data commons can be reusable and interoperable. In practice, whether data are reusable depends in large part on the quality of the data annotation prepared by the data submitter. With services (iii) and (iv), data stored in commons are findable, accessible, interoperable, and reusable, which is sometimes abbreviated as FAIR. The importance of making biomedical data FAIR has been stressed in efforts such as the FORCE11 initiative [34] and the National Institutes of Health (NIH) Big Data to Knowledge (BD2K) initiative [35]. Recently, a framework for metrics to measure the 'FAIRness' of services has also been developed [36].

Workflow services (vii) in data commons are quite varied and include running existing workflows that have been integrated into the commons, pulling existing workflows from workflow repositories outside the commons and applying them to data in the commons, and developing new workflows and using them to analyze data in the commons. Also, some commons allow users to execute workflows, while others limit this to the data commons administrators.

An example of a data commons is the NCI Genomic Data Commons (GDC) [30,37], used by more than 100 000 distinct cancer researchers in 2018. With the GDC [30], data commons began to curate and integrate contributed data using a common data model [core service (vi)], harmonize the contributed data using a common set of bioinformatics pipelines [core service (vii)], support the visual exploration of data through a data portal, and expose APIs to the core services (i)-(v) to support third party applications over the integrated and harmonized data; a sketch of how a third party application might use the digital ID and metadata services appears below.
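As a concrete illustration of core services (iii) and (iv), the following minimal sketch resolves a digital ID to its metadata and cloud storage locations. The endpoint shape follows the GA4GH Data Repository Service (DRS) convention, which is one possible realization rather than anything this review mandates; the host name and GUID are hypothetical.

```python
import requests

COMMONS = "https://data.example-commons.org"  # hypothetical commons host
GUID = "dg.EX01/0f7a3c1e-0000-0000-0000-000000000000"  # hypothetical digital ID

# Core service (iii): resolve the permanent identifier.
resp = requests.get(f"{COMMONS}/ga4gh/drs/v1/objects/{GUID}")
resp.raise_for_status()
obj = resp.json()

# Core service (iv): the returned object carries the metadata (size,
# checksums, creation time) and one access method per storage location.
print(obj["size"])
print([method["type"] for method in obj["access_methods"]])  # e.g., ['s3', 'https']
```

Because any commons can serve the same API shape, a third party application written against it works unchanged across commons.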
Project Data: Object Data and Structured Data
Data in a data commons are usually organized into projects, with different projects potentially having their own data models and collecting different subsets of clinical, molecular, imaging, and other data. It is an open question how best to organize data across projects so that they can be integrated, harmonized, and queried. One natural division that is emerging is the distinction between the structured data, the unstructured data, and the data objects in a project. The object data typically include FASTQ or BAM files [38] used in genomics, image files, video files, and other large files, such as archive or backup files associated with a project. The structured data include clinical data, demographic data, biospecimen data, variant data [39], and other data associated with a data schema. The unstructured data include text, notes, articles, and other data that are not associated with a schema.

Part of the curation process is to align the structured data in a project with an appropriate ontology. Examples include using the Human Phenotype Ontology [40] and the NCI Thesaurus [41] for curating clinical phenotype data and CDISC [42] for curating clinical trials data. It can be quite challenging and labor intensive to match ontologies to clinical data, and several tools have been developed to make this easier [43,44].

If we call all the structured data, unstructured data, and associated schemas 'project core data', then it is quite common for a project's object data to be 1000 times (or more) larger than the project's core data. For example, with TCGA's projects [1], the data objects were measured in tens to hundreds of terabytes, while the projects' core data were measured in tens of gigabytes.

In practice, a project's object data are assigned globally unique identifiers (GUIDs) and metadata, are stored in clouds using services (iii) and (iv), and are immutable (although new versions may be added to the project), while the project core data are often updated, both as part of the curation and quality assurance process and as new data are added to the project.

A project's object data are searched via their metadata [core service (iv)], while a project's core data can be searched via their data model [core service (vi)]. Of course, a project's object data can be processed to produce features that can then be managed and searched. Examples include developing algorithms for identifying particular types of cells in cell images and searching for these cells, or processing BAM files to compute data quality scores and searching for BAM files with particular data quality problems. When data are curated and integrated with a common data model, synthetic cohorts can be created through a query, such as 'find all males over 50 years old who smoked and have a KRAS mutation' [45]; a sketch of such a query appears below.

Another way to think of this is that core services (i)-(v) support 'shallow' indexing and search via metadata, while core services (i)-(vi) support 'deep' indexing and search via the data model attached to project core data. In either case, when the services are exposed via APIs to third party applications, data become portable and data commons become interoperable, both of which are usually considered important requirements [33].
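The following sketch shows what such a synthetic-cohort query can look like in practice against the public NCI GDC API. The filter fields are drawn from the GDC data model; the smoking and KRAS-mutation criteria from the example above would require additional fields and are omitted to keep the sketch short.

```python
import requests

# Deep indexing (core service vi): filter cases by fields in the data model.
filters = {
    "op": "and",
    "content": [
        {"op": "=", "content": {"field": "demographic.gender", "value": "male"}},
        # age_at_diagnosis is recorded in days; 50 years is roughly 18250 days
        {"op": ">", "content": {"field": "diagnoses.age_at_diagnosis", "value": 18250}},
    ],
}

resp = requests.post(
    "https://api.gdc.cancer.gov/cases",
    json={"filters": filters, "fields": "case_id,demographic.gender", "size": "5"},
)
resp.raise_for_status()
for hit in resp.json()["data"]["hits"]:
    print(hit["case_id"])
```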
Data Lakes
Sometimes the term 'data lake' is used when data are stored simply with digital IDs and metadata (shallow indexing), but without a data model; a minimal sketch of such a record appears below. Data models and schemas are applied later, when the data are analyzed, rather than when the data are stored. Additional information about data lakes can be found in [46]. Since it can be very labor intensive to import data with respect to a data model, and since not all the data in a commons are used, a data lake has the advantage that the effort to align the data with a data model is deferred until the data are analyzed. Of course, by the time the data are analyzed and aligned with a data model, the expertise to do this may no longer be easily available.

Through the use of cloud computing, data commons can support large-scale data, but this also creates sustainability challenges due to the cost of large-scale storage and compute. One sustainability model that can be attractive to an organization is to provide the data at no cost, but to control the cost of the computing resources by using a 'pay for compute' model [33], establishing quotas for compute, giving compute allocations, or distributing 'chits' that can be redeemed for compute.

Just as data lakes require less curation than data commons, data catalogs require less curation than data lakes. A 'data catalog' is simply a listing of data assets, some basic metadata, and their locations, without a common mechanism for accessing the data, such as is used in a data lake.
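To make the contrast concrete, here is a toy sketch of a shallow data-lake index record next to a minimal data-model check of the kind a commons applies on import; all names and the one-line schema are hypothetical.

```python
# Shallow indexing: a digital ID, free-form metadata, and locations,
# with no data model enforced at storage time (schema applied on read).
lake_record = {
    "guid": "dg.EX01/1111-2222",                   # digital ID (service iii)
    "metadata": {"assay": "WGS", "site": "lung"},  # free-form (service iv)
    "urls": ["s3://example-lake/sample-001.bam"],
}

# Deep indexing: a (hypothetical) minimal schema that a commons enforces
# when a record is imported, deferring nothing to analysis time.
CASE_SCHEMA = {"required": ["submitter_id", "primary_site", "gender"]}

def conforms(record: dict, schema: dict) -> bool:
    """True if the record carries every field the data model requires."""
    return all(field in record for field in schema["required"])

case_record = {"submitter_id": "case-001", "primary_site": "Lung", "gender": "male"}
print(conforms(case_record, CASE_SCHEMA))           # True
print(conforms({"submitter_id": "case-002"}, CASE_SCHEMA))  # False
```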
Workflows
Bioinformatics workflows are often data intensive and complex, consisting of several different programs with the outputs of one program used as the inputs to another. For this reason, specialized workflow management systems have been developed so that workflows can be mapped efficiently to different high-performance, parallel, and distributed computer systems. Workflow languages have been developed so that domain specialists knowledgeable about the workflows can describe them in a manner that is independent of the specific underlying physical architecture of the system executing them. Despite many years of effort, though, there is still no standard language for expressing workflows in general and bioinformatics workflows in particular [47,48]. Within the cancer genomics community, the Common Workflow Language (CWL) [49] is gaining in popularity. The GA4GH Consortium (ga4gh.org) supports a technical effort to standardize bioinformatics workflows, which includes the workflow execution service (WES) and task execution service (TES). With the growing use of container-based environments for program execution, such as Docker, it is becoming more common to encapsulate workflows in containers to make them easier to reuse [50]. Before the wide adoption of containers, workflows were encapsulated in virtual machines for the same reason. Examples of services for accessing reproducible workflows include Dockstore [51] and BioCompute Objects [52].
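As an illustration, the sketch below submits a CWL workflow to a WES endpoint and polls its status. The service host, workflow URL, and input GUID are hypothetical placeholders; the request fields follow the GA4GH WES specification.

```python
import json
import requests

WES = "https://workflows.example-commons.org/ga4gh/wes/v1"  # hypothetical host

# Submit a run; WES takes the workflow description by reference and the
# inputs as a JSON document (here, a digital ID for a FASTQ file).
run = requests.post(
    f"{WES}/runs",
    data={
        "workflow_url": "https://example.org/workflows/align.cwl",  # hypothetical
        "workflow_type": "CWL",
        "workflow_type_version": "v1.0",
        "workflow_params": json.dumps({"reads": "dg.EX01/1111-2222"}),
    },
).json()

# Poll until the engine reports a terminal state (COMPLETE, EXECUTOR_ERROR, ...).
status = requests.get(f"{WES}/runs/{run['run_id']}/status").json()
print(run["run_id"], status["state"])
```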
Data and Commons Governance
A common definition of IT governance is [53]: (i) assure that the investments in IT generate business value; (ii) mitigate the risks that are associated with IT; and (iii) operate in such a way as to make good long-term decisions, with accountability and traceability to those funding IT resources, those developing and supporting IT resources, and those using IT resources. This definition can be easily adapted to provide a good definition of data commons governance: (i) assure that the investments in the data commons generate value to the research community; (ii) manage the balance between the risks associated with participant data and the benefits realized from research involving these data [54]; and (iii) operate in such a way as to make good long-term decisions, with accountability and traceability to the sponsors that fund the data commons, the engineers that develop, manage, and operate the data commons, and the researchers that use it.

An overview of principles for data commons and a description of eight principles for biomedical data commons can be found in [55]. A survey of how data are made available and controlled in commons is in [56]. A survey of data commons governance models is in [57]. The GA4GH framework for sharing data is described in [54].

The data governance structure for international data commons, such as the INRG Data Commons [58] and the ICGC Data Commons [59], can be challenging and may involve restrictions on the movement of the underlying controlled access genomic data.
Building and Operating a Data Commons
Building a data commons usually consists of the following steps (Figure 2):
Figure 2.
Building a Data Commons.
Data commons support the entire life cycle of data, including defining the data model, importing data, cleaning data, exploring data, analyzing data, and then sharing new research discoveries.
(1) Put in place data governance agreements that govern the contribution, management, and use of the data in the data commons, and commons governance agreements that govern the development, operations, use, and sustainability of the commons.
(2) Develop a data model (or data models) that describes the data in the commons.
(3) Set up and configure the data commons itself.
(4) Work with the community to submit data to the data commons.
(5) Import, clean, and curate the submitted data.
(6) Process and analyze the data using bioinformatics pipelines to produce harmonized data products. This is often done with analysis working groups.
(7) Open up the commons to external researchers and third party applications, and interoperate with other commons.

To support these activities, a data commons usually has the following components:
- A data exploration portal (or, more simply, a data portal) for viewing, exploring, visualizing, and downloading the data in the commons.
- A data submission portal for submitting data to the commons (a call to such a portal's API is sketched after this list).
- An API supporting third party applications.
- Systems for the large-scale processing of data in the commons to produce derived data products.
- Systems to support analysis working groups and other team science constructs used for the collaborative analysis and annotation of data in the commons. What are being called 'workspaces' are one of the mechanisms emerging to support this.
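As an illustration of steps (4) and (5), the sketch below submits a single case record to a commons through a submission API. The endpoint shape loosely follows the GDC submission API; the host, program, project, token, and record are all hypothetical.

```python
import requests

API = "https://api.example-commons.org/v0/submission"  # hypothetical host
TOKEN = "..."  # credential issued by the commons' authentication service

# A minimal entity conforming to a hypothetical project data model; the
# submission service validates it against the model before accepting it.
case = {
    "type": "case",
    "submitter_id": "case-001",
    "projects": {"code": "EXAMPLE"},  # link to the project node
}

resp = requests.put(
    f"{API}/MYPROGRAM/EXAMPLE",  # hypothetical program/project
    json=case,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(resp.status_code, resp.json())
```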
Data Ecosystems Containing Multiple Data Commons
As the number of data commons grows, there will be an increasing need for data commons to interoperate and for applications to be able to access data and services from multiple data commons. It may be helpful to think of this situation as laying the foundation for a data ecosystem [60].

Sometimes the data commons services (i)-(vi) described above are called 'framework services', since they provide the framework for building a data commons and, in fact, can be used to support multiple data commons that interoperate (Figure 3, Key Figure). As mentioned above, when these services are exposed through an API, either as part of a data commons or as part of framework services supporting multiple data commons, they can support an ecosystem of third party applications [45].
Figure 3.
Key Figure Data Commons Framework Services
This diagram shows how data commons framework services can support multiple data commons and an ecosystem of workspaces, notebooks, and applications.
There is no generally accepted definition of a data ecosystem at this time, but, at a minimum, a 'data ecosystem' for biomedical data (as opposed to a data commons) should support the following:
(i) Authentication and authorization services, so that a community of researchers can access an ecosystem of data and applications with a common (research) identity and common authorizations shared across data commons and applications.
(ii) A collection of applications powered by FAIR-compliant APIs that are shared across multiple data commons.
(iii) The ability for multiple data commons to interoperate through framework services and, preferably, through data peering [33], so that access to data across data commons and applications is transparent, frictionless, and without egress charges, as long as the access is through a digital ID.
(iv) Shared data models, or portions of data models, to simplify the ability of third party applications to access data from multiple data commons and applications. Projects within a larger overall program, or in related programs, may share a data model. More commonly, different projects may share some common data elements within a core data model, with each project having additional data elements unique to that project.
(v) Support for workspaces, which may include: (a) the ability to create synthetic (or virtual) cohorts and export cohorts to workspaces; (b) the ability to execute bioinformatics workflows within workspaces; and (c) workspace services for processing, exploring, and analyzing data using containers, virtual machines, or other mechanisms.
(vi) Security and compliance services.

Often workspace services (v)(b) and (v)(c) use a user-pay model, as mentioned above.

An example of a cancer data ecosystem is the NCI Cancer Research Data Commons or NCRDC [61]. The NCRDC spans the GDC [30] and the Cloud Resources [61], so that both AWS and GCP can be used to analyze data from the GDC and to support integrative analysis of data uploaded by researchers together with data from the GDC and other third party datasets. Data commons for proteomic and imaging data are in the process of being added to the NCRDC. The NCRDC uses the framework services described above so that multiple data commons and other NCRDC resources can share authentication, authorization, ID, and metadata services. In particular, this approach allows applications to be built that span multiple data commons, as sketched below.
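Here is a minimal sketch of that shared identity, assuming two hypothetical member commons and a token issued once by the ecosystem's framework services; the project-listing endpoint is likewise hypothetical.

```python
import requests

TOKEN = "..."  # issued once by the ecosystem's common authorization service
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# The same research identity is accepted by every commons in the ecosystem,
# so a third party application needs no per-commons credentials or code.
for commons in (
    "https://genomics.example-commons.org",  # hypothetical member commons
    "https://imaging.example-commons.org",   # hypothetical member commons
):
    # hypothetical endpoint listing the projects this identity may access
    resp = requests.get(f"{commons}/api/authorized-projects", headers=HEADERS)
    print(commons, resp.status_code)
```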
Concluding Remarks and Future Directions
We have reviewed some of the more recent data and computing platforms that have been used to analyze the large-scale data being produced in biology, medicine, and health care, with a particular emphasis on data commons. See Figure 4 for an overview of the different platforms. Data commons provide several important advantages, including the following:
Figure 4.
Data Platforms.
Data platforms can be categorized along four axes: the data architecture, the extent of the data curation and harmonization, the analysis architecture of a resource, and the analysis architecture of the ecosystem. The red lines can be viewed as classifying platforms using parallel coordinates and these four dimensions. The top line is the parallel coordinates associated with the National Cancer Institute (NCI) Cancer Research Data Commons, the line below is the parallel coordinates for the NCI Genomic Data Commons, the two lines below are two possible architectures for data lakes, while the bottom line is an architecture for a repository of files. Abbreviations: API, application programming interface; DCF, data commons framework; NA, not applicable; SaaS, software as a service.
- Data commons support repeatable, reproducible, and open research.
- Some diseases depend upon having a critical mass of data to provide the required statistical power for the scientific evidence (e.g., to study combinations of rare mutations in cancer).
- With more data, smaller effects can be studied (e.g., to understand the effect of environmental factors on disease).
- Data commons enable researchers to work with large datasets at much lower cost to the sponsor than if each researcher set up their own local environment.
- Data commons generally provide higher security and greater compliance than most local computing environments.
- Data commons support large-scale computation so that the latest bioinformatics pipelines can be run.
- Data commons can interoperate with each other so that, over time, data sharing can benefit from a 'network effect'.

Over the next few years, one of the most important changes will be the ability of patients to submit their own data to a data commons and to gain some understanding of their own data in terms of the overall data available in the commons and the broader data ecosystem that the commons is part of (see Outstanding Questions). The ability of patients to contribute their own data and to have control over how the data are used by the research community [62] is an important aspect of what is sometimes called patient partnered research.