Matthew N Bernstein, Ariella Gladstein, Khun Zaw Latt, Emily Clough, Ben Busby, Allissa Dillman.
Abstract
The Sequence Read Archive (SRA) is a large public repository that stores raw next-generation sequencing data from thousands of diverse scientific investigations. Despite its promise, reuse and re-analysis of SRA data have been hindered by the heterogeneity and poor quality of the metadata that describe its biological samples. Recently, the MetaSRA project standardized these metadata by annotating each sample with terms from biomedical ontologies. In this work, we present a pair of Jupyter notebook-based tools that use the MetaSRA to build structured datasets from the SRA and thereby facilitate secondary analyses of the SRA's human RNA-seq data. The first tool, the Case-Control Finder, finds suitable case and control samples for a given disease or condition, with cases and controls matched by tissue or cell type. The second tool, the Series Finder, finds ordered sets of samples for addressing biological questions about changes over a numerical property such as time. These tools were developed during a three-day NCBI Codeathon held in March 2019 at the University of North Carolina at Chapel Hill.
Keywords: Hackathon; Jupyter; MetaSRA; Metadata; Ontology; RNA-seq; Sequence Read Archive
Year: 2020 PMID: 32864105 PMCID: PMC7445559 DOI: 10.12688/f1000research.23180.2
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure 1. Data flows for hypothesis-driven query tools.
An overview of the backend processing functions called from the Jupyter notebooks.
Figure 2. Example results from the Case-Control Finder.
Results from running the Case-Control Finder for the query “liver cancer.” (A) The Case-Control Finder displays the number of case/control samples matched by each tissue and cell type. (B) The user can select either the case samples or control samples for a given tissue or cell type and display the most common ontology terms associated with those selected samples. Displayed here are the most common terms associated with the case samples labeled as “liver.” (C) The notebook also displays four pie charts for viewing the fraction of samples belonging to a cell line (top left), each sex (top right), each developmental stage (bottom left), and whether they were given an experimental treatment (bottom right).
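The matching step summarized in panel (A) can be illustrated with a minimal Python sketch. This is not the notebook's actual code or the MetaSRA API: the sample IDs, the term labels, and the function `match_cases_controls` are all hypothetical, and treating every sample that lacks the condition term as a control is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical toy annotations: sample ID -> set of ontology term names.
# These IDs and labels are invented for illustration, not real SRA records.
SAMPLES = {
    "S1": {"liver", "liver cancer"},
    "S2": {"liver"},
    "S3": {"liver", "hepatocyte", "liver cancer"},
    "S4": {"blood"},
    "S5": {"hepatocyte", "liver"},
}


def match_cases_controls(samples, condition, tissue_terms):
    """Group samples by tissue/cell-type term and split them by case status.

    Assumption: a sample is a case if it carries `condition`, otherwise a
    control. Only tissue terms with at least one case AND one control are kept.
    """
    groups = defaultdict(lambda: {"case": [], "control": []})
    for sample_id, terms in sorted(samples.items()):
        role = "case" if condition in terms else "control"
        # A sample can contribute to several tissue/cell-type groups.
        for tissue in terms & tissue_terms:
            groups[tissue][role].append(sample_id)
    return {t: g for t, g in groups.items() if g["case"] and g["control"]}


matched = match_cases_controls(
    SAMPLES, "liver cancer", {"liver", "hepatocyte", "blood"}
)
for tissue in sorted(matched):
    group = matched[tissue]
    print(tissue, len(group["case"]), "cases,", len(group["control"]), "controls")
```

In this toy data, "blood" is dropped because it has no case samples, which mirrors the idea of reporting only tissues and cell types where cases and controls can be compared.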
Figure 3. Example results from the Series Finder.
Results from running the Series Finder for the query “brain” sorted by “age,” where unit is specified as “year.” (A) The Series Finder displays the number of samples sorted by age. (B) The user can select samples associated with a given time point for further exploration. Here the samples annotated as “year = 63” are selected. The notebook then displays four pie charts for viewing the fraction of samples belonging to a cell line (top left), each sex (top right), each developmental stage (bottom left), and whether they were given an experimental treatment (bottom right). (C) Given the selected samples from (B), the notebook displays the most frequent terms associated with those selected samples.
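The ordering shown in panel (A) can be sketched in a few lines of Python. As before, this is only an illustration under stated assumptions: the record layout, the sample IDs, and the function `build_series` are hypothetical and do not reflect the notebook's real data model.

```python
from collections import defaultdict

# Hypothetical records: (sample ID, ontology term, property, value, unit).
# The values are invented for illustration, not real SRA metadata.
RECORDS = [
    ("S1", "brain", "age", 63, "year"),
    ("S2", "brain", "age", 21, "year"),
    ("S3", "brain", "age", 63, "year"),
    ("S4", "brain", "age", 8, "month"),
    ("S5", "liver", "age", 40, "year"),
]


def build_series(records, term, prop, unit):
    """Return (value, [sample IDs]) pairs sorted by the numeric value.

    Only records matching the requested term, property, and unit are kept,
    so values expressed in different units are never mixed in one series.
    """
    by_value = defaultdict(list)
    for sample_id, sample_term, sample_prop, value, sample_unit in records:
        if (sample_term, sample_prop, sample_unit) == (term, prop, unit):
            by_value[value].append(sample_id)
    return sorted(by_value.items())


series = build_series(RECORDS, "brain", "age", "year")
for value, sample_ids in series:
    print(f"year = {value}: {len(sample_ids)} sample(s)")
```

Filtering on the unit before sorting is the key design choice here: the sample recorded in months is excluded rather than silently compared against values in years.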