Michael C Schatz, Anthony A Philippakis, Enis Afgan, Eric Banks, Vincent J Carey, Robert J Carroll, Alessandro Culotti, Kyle Ellrott, Jeremy Goecks, Robert L Grossman, Ira M Hall, Kasper D Hansen, Jonathan Lawson, Jeffrey T Leek, Anne O'Donnell Luria, Stephen Mosher, Martin Morgan, Anton Nekrutenko, Brian D O'Connor, Kevin Osborn, Benedict Paten, Candace Patterson, Frederick J Tan, Casey Overby Taylor, Jennifer Vessio, Levi Waldron, Ting Wang, Kristin Wuichet.
Abstract
The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts. The AnVIL is a federated cloud platform designed to manage and store genomics and related data, enable population-scale analysis, and facilitate collaboration through the sharing of data, code, and analysis results. By inverting the traditional model of data sharing, the AnVIL eliminates the need for data movement while also adding security measures for active threat detection and monitoring and provides scalable, shared computing resources for any researcher. We describe the core data management and analysis components of the AnVIL, which currently consists of Terra, Gen3, Galaxy, RStudio/Bioconductor, Dockstore, and Jupyter, and describe several flagship genomics datasets available within the AnVIL. We continue to extend and innovate the AnVIL ecosystem by implementing new capabilities, including mechanisms for interoperability and responsible data sharing, while streamlining access management. The AnVIL opens many new opportunities for analysis, collaboration, and data sharing that are needed to drive research and to make discoveries through the joint analysis of hundreds of thousands to millions of genomes along with associated clinical and molecular data types.
Year: 2022 PMID: 35199087 PMCID: PMC8863334 DOI: 10.1016/j.xgen.2021.100085
Source DB: PubMed Journal: Cell Genom ISSN: 2666-979X
Figure 1. Inverting the model for data sharing
(Left) In the traditional model, project data (shown in purple, orange, and green) are copied to multiple sites where they are accessed by users on institutional computing clusters. Under this model, each institution must establish its own data center, and collaboration is achieved primarily through copying files between data centers. (Right) In the inverted model, users connect to a cloud-enabled resource such as the AnVIL to remotely access and analyze the data without copying. In this model, users virtually access a unified data center, allowing for deeper collaboration and sharing of the results.
Figure 2. Overview of the AnVIL ecosystem
(Top) The AnVIL is a federated cloud environment for the analysis of large genomic and related datasets. The AnVIL is built on a set of established components that bring together widely used platforms. The Terra platform provides a compute environment with secure data and analysis sharing capabilities. Dockstore provides standards-based sharing of containerized tools and workflows. R/Bioconductor, Jupyter, and Galaxy provide environments for users at different skill levels to construct and execute analyses. The Gen3 data commons framework provides data and metadata ingest, querying, and organization. (Bottom) The AnVIL has been used in a number of flagship NHGRI and other genomics projects. Summary of the genomics datasets available within the AnVIL as of December 2021, as shown at https://anvilproject.org/data. WGS, whole-genome sequencing; WXS, whole-exome sequencing.