
Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive.

Matthew N Bernstein¹, Ariella Gladstein², Khun Zaw Latt³, Emily Clough⁴, Ben Busby⁴, Allissa Dillman⁴.

Abstract

The Sequence Read Archive (SRA) is a large public repository that stores raw next-generation sequencing data from thousands of diverse scientific investigations. Despite its promise, reuse and re-analysis of SRA data have been hindered by the heterogeneity and poor quality of the metadata that describe its biological samples. Recently, the MetaSRA project standardized these metadata by annotating each sample with terms from biomedical ontologies. In this work, we present a pair of Jupyter notebook-based tools that use the MetaSRA to build structured datasets from the SRA in order to facilitate secondary analyses of the SRA's human RNA-seq data. The first tool, called the Case-Control Finder, finds suitable case and control samples for a given disease or condition, where the cases and controls are matched by tissue or cell type. The second tool, called the Series Finder, finds ordered sets of samples for the purpose of addressing biological questions pertaining to changes over a numerical property such as time. These tools were the result of a three-day NCBI Codeathon held in March 2019 at the University of North Carolina at Chapel Hill.

Keywords: Hackathon; Jupyter; MetaSRA; Metadata; Ontology; RNA-seq; Sequence Read Archive

Year: 2020 PMID: 32864105 PMCID: PMC7445559 DOI: 10.12688/f1000research.23180.2

Source DB: PubMed Journal: F1000Res ISSN: 2046-1402

Introduction

The Sequence Read Archive (SRA; Leinonen et al.) is a large public repository that stores next-generation sequencing data from thousands of diverse scientific investigations. Despite its promise, reuse and re-analysis of SRA data have been hindered by the heterogeneity and poor quality of the metadata that describe its biological samples (Gonçalves & Musen, 2019). Recently, the MetaSRA project (Bernstein et al.) standardized these metadata by annotating each sample with terms from biomedical ontologies including the Cell Ontology (Bard et al.), Uberon (Mungall et al.), the Disease Ontology (Schriml et al.), Cellosaurus (Bairoch, 2018), and the Experimental Factor Ontology (Malone et al.). The MetaSRA also features an interface (http://metasra.biostat.wisc.edu) for querying human RNA-seq samples using these ontology term annotations.

However, the MetaSRA web interface cannot produce structured datasets, such as those that match case samples associated with a target condition or disease with healthy control samples. Similarly, it cannot search for samples associated with a particular condition and/or tissue type that are ordered according to a numeric property (e.g., age). Constructing such datasets is non-trivial and requires further processing of the results provided by the MetaSRA website.

Specifically, finding case and control samples for a given disease requires matching case samples to control samples according to their tissue or cell type. For example, if one were to naively search the MetaSRA for “liver cancer” samples, the results would include samples from Kim et al., which consist of isolated T cells from liver tumors. Therefore, only matched T cell samples would make appropriate controls. Furthermore, given these search results, users may wish to filter samples according to whether they are poorly annotated (i.e., are missing cell type or tissue information), whether they are derived from a cell line, or whether they were experimentally treated. Moreover, the user may wish to explore other ontology terms associated with either the case or control samples to check for variables that may confound downstream analyses. Finding longitudinal or time-series data presents similar challenges. To the best of our knowledge, no existing tool addresses these tasks.

To address these two tasks, we produced two Jupyter notebook-based tools. The first tool, called the Case-Control Finder, searches the SRA via the MetaSRA terms to produce matched case and control samples for a given disease or condition, where the cases and controls are matched by tissue and cell type. The second tool, called the Series Finder, finds ordered sets of samples for the purpose of answering biological questions pertaining to changes over a numerical property (e.g., time). More specifically, the Series Finder produces ordered sets of samples, where the order is determined by a temporal property in the metadata as standardized by the MetaSRA’s real-valued properties. Examples of temporal properties include the age of the person from whom a given sample originated and the time that a given sample of cells has spent differentiating in vitro. These tools promise to facilitate the construction of suitable public datasets for secondary analyses.

Methods

The tools presented in this work were written in Python (v3.6) and make use of the Python packages pandas (McKinney, 2011), Matplotlib (Hunter, 2007), and seaborn (https://seaborn.pydata.org). The notebooks can be run in the cloud via Google Colab. Links to the notebooks can be found in the README within the GitHub repository (https://github.com/mbernste/hypothesis-driven-SRA-queries).

Case-Control Finder

The Case-Control Finder implements the following steps to produce a dataset of matched case and control samples for a given disease (Figure 1A):

1. Generate candidate case and control samples. Generate the set of candidate case samples by querying for all samples associated with a user-specified condition or disease using the MetaSRA-mapped ontology terms. Also, find all candidate control samples that are not associated with the target condition/disease.

2. Filter poorly annotated samples. Filter samples based on a metadata completeness threshold, which requires that all samples be associated with either a tissue term or a cell type term. The tissue/cell type information is required for downstream matching of case samples to control samples.

3. Apply user-specified filters. Further filter samples according to user-specified filtering parameters. The user can filter out cell line samples, treated samples, and in vitro differentiated samples. The user can also remove all diseased samples from the candidate control samples for the purpose of generating a healthy control set.

4. Match by tissue, cell type, age, and sex. The candidate case samples are then matched with the candidate control samples by their tissue and cell type terms. Optionally, the user can also match samples by age and sex. Specifically, given that each sample can be associated with multiple ontology terms in the MetaSRA, a set of case samples is matched with a set of control samples when both sets of samples are labelled with the same set of tissue and cell type terms. For example, a set of case samples annotated with the set of terms “liver” and “epithelial cell” will be matched only to control samples also labelled strictly with these terms (Figure 2A). This ensures that case samples are matched with maximally similar control samples and avoids matching samples from different tissue types. For example, a set of case samples labelled with both the terms “liver” and “epithelial cell” will not be matched with a set of samples labelled only as “epithelial cell,” as there is no guarantee that the latter set of samples originates in the liver.

Figure 1.

Data flows for hypothesis-driven query tools.

An overview of the backend processing functions called from the Jupyter notebooks.
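The four steps above can be illustrated with a minimal Python sketch. This is not the notebooks' actual code: the function name `match_cases_to_controls`, the dictionary keys (`tissue_cell_terms`, `disease_terms`, `is_cell_line`), and the toy samples are all hypothetical, and the sketch assumes the healthy-control option is enabled and omits the optional age and sex matching.

```python
from collections import defaultdict


def match_cases_to_controls(samples, condition, drop_cell_lines=True):
    """Group case and healthy-control sample IDs by their exact tissue/cell type term set.

    `samples` is a list of dicts with hypothetical keys; this is an
    illustrative stand-in for the MetaSRA annotations, not their real schema.
    """
    cases = defaultdict(list)
    controls = defaultdict(list)
    for sample in samples:
        terms = frozenset(sample["tissue_cell_terms"])
        # Step 2: skip poorly annotated samples (no tissue or cell type term).
        if not terms:
            continue
        # Step 3: one example of a user-specified filter.
        if drop_cell_lines and sample["is_cell_line"]:
            continue
        # Step 1: split candidates into cases and healthy controls.
        if condition in sample["disease_terms"]:
            cases[terms].append(sample["id"])
        elif not sample["disease_terms"]:
            controls[terms].append(sample["id"])
    # Step 4: keep only term sets that label both cases and controls.
    return {t: (cases[t], controls[t]) for t in cases if t in controls}


# Toy data (invented for illustration only).
samples = [
    {"id": "S1", "tissue_cell_terms": {"liver", "epithelial cell"},
     "disease_terms": {"liver cancer"}, "is_cell_line": False},
    {"id": "S2", "tissue_cell_terms": {"liver", "epithelial cell"},
     "disease_terms": set(), "is_cell_line": False},
    {"id": "S3", "tissue_cell_terms": {"epithelial cell"},
     "disease_terms": set(), "is_cell_line": False},
    {"id": "S4", "tissue_cell_terms": set(),
     "disease_terms": {"liver cancer"}, "is_cell_line": False},
    {"id": "S5", "tissue_cell_terms": {"liver"},
     "disease_terms": {"liver cancer"}, "is_cell_line": True},
]

matched = match_cases_to_controls(samples, "liver cancer")
for terms, (case_ids, control_ids) in matched.items():
    print(sorted(terms), case_ids, control_ids)
```

In this toy example, the control sample labelled only “epithelial cell” (S3) is not paired with the “liver” plus “epithelial cell” case (S1), because matching requires the full term sets to be identical.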

Figure 2.

Example results from the Case-Control Finder.

Results from running the Case-Control Finder for the query “liver cancer.” (A) The Case-Control Finder displays the number of case/control samples matched by each tissue and cell type. (B) The user can select either the case samples or control samples for a given tissue or cell type and display the most common ontology terms associated with those selected samples. Displayed here are the most common terms associated with the case samples labelled as “liver.” (C) The notebook also displays four pie charts for viewing the fraction of samples belonging to a cell line (top left), each sex (top right), each developmental stage (bottom left), and whether they were given an experimental treatment (bottom right).

Series Finder

The Series Finder finds RNA-seq samples that are associated with a numerical property (e.g., age or time point) for a given tissue or cell type. To do so, the Series Finder uses the real-value property annotations provided by the MetaSRA, where each real-value property is structured as a tuple consisting of a property name (e.g., age), a numerical value, and a unit (e.g., year). To perform a query, the user provides an ontology term, such as a tissue or cell type, as well as a property name and unit. The Series Finder then finds all samples that are associated with the target ontology term and real-value property. The user can also specify a set of filters (e.g., for filtering diseased samples or cell line samples), and the Series Finder will remove all samples that meet the filter specification. The Series Finder then returns all remaining samples ordered by their associated numerical values (Figure 1B).
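The query described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the notebook's implementation: the function `find_series`, the dictionary keys (`terms`, `properties`, `is_cell_line`), and the toy samples are hypothetical, and only a single cell-line filter is shown.

```python
def find_series(samples, term, prop, unit, drop_cell_lines=False):
    """Return (sample ID, value) pairs for samples annotated with `term`
    and a (prop, value, unit) tuple, ordered by value.

    The sample layout is a hypothetical stand-in for the MetaSRA's
    ontology-term and real-value property annotations.
    """
    hits = []
    for sample in samples:
        # Keep only samples annotated with the target ontology term.
        if term not in sample["terms"]:
            continue
        # Example of a user-specified filter.
        if drop_cell_lines and sample["is_cell_line"]:
            continue
        # Each real-value property is a (name, value, unit) tuple.
        for name, value, value_unit in sample["properties"]:
            if name == prop and value_unit == unit:
                hits.append((sample["id"], value))
                break
    # Order the remaining samples by their numerical value.
    return sorted(hits, key=lambda hit: hit[1])


# Toy data (invented for illustration only).
samples = [
    {"id": "A", "terms": {"brain"}, "properties": [("age", 63.0, "year")], "is_cell_line": False},
    {"id": "B", "terms": {"brain"}, "properties": [("age", 2.0, "year")], "is_cell_line": False},
    {"id": "C", "terms": {"brain"}, "properties": [("age", 30.0, "month")], "is_cell_line": False},
    {"id": "D", "terms": {"liver"}, "properties": [("age", 40.0, "year")], "is_cell_line": False},
    {"id": "E", "terms": {"brain"}, "properties": [("age", 25.0, "year")], "is_cell_line": True},
]

series = find_series(samples, "brain", "age", "year", drop_cell_lines=True)
print(series)
```

Requiring an exact unit match, as this sketch does, keeps values in different units (here, months versus years) from being ordered against each other without conversion.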

Results and use cases

We used the Case-Control Finder to query for liver cancer RNA-seq samples matched with healthy control samples. This query resulted in 21 sets of samples representing different tissues or cell types, including epithelial cells, hepatocytes, stem cells, and liver tissue (Figure 2A). The Case-Control Finder identified common terms associated with the “liver cancer” case samples (Figure 2B) and categorized these samples by cell line status, sex, developmental stage, and treatment status (Figure 2C).

We used the Series Finder to find all brain samples in the SRA ordered by the age of the sample donor. This query resulted in samples spanning many ages (Figure 3A). This dataset could prove useful for exploring gene expression-based signatures of aging. The Series Finder also identified common terms at each age (Figure 3B) and, for each age’s sample set, categorized those samples by cell line status, sex, developmental stage, and treatment status (Figure 3C).

Figure 3.

Example results from the Series Finder.

Results from running the Series Finder for the query “brain” sorted by “age,” where the unit is specified as “year.” (A) The Series Finder displays the number of samples sorted by age. (B) The user can select samples associated with a given time point for further exploration. Here the samples annotated as “year = 63” are selected. The notebook then displays four pie charts for viewing the fraction of samples belonging to a cell line (top left), each sex (top right), each developmental stage (bottom left), and whether they were given an experimental treatment (bottom right). (C) Given the selected samples from (B), the notebook displays the most frequent terms associated with those selected samples.

Conclusion and future work

We implemented two Jupyter notebooks for performing hypothesis-driven queries of public RNA-seq samples in the SRA. These tools are built upon the standardized metadata provided by the MetaSRA project and enable querying of the metadata beyond what is natively possible via the MetaSRA website interface. Given the SRA accessions of the RNA-seq samples that these tools produce, a user can then retrieve the gene expression data for these samples in order to perform secondary analyses. Specifically, the user can either download and process the raw reads from the SRA or obtain preprocessed gene expression profiles from recent mass preprocessing efforts such as recount2 (Collado-Torres et al., 2017), ARCHS4 (Lachmann et al.), and refine.bio (Greene et al.). Finally, these notebooks come pre-packaged with metadata files from the latest version of the SRA, as provided by the SRAdb (Zhu et al.), and the MetaSRA. When the MetaSRA releases a new version of annotated metadata, these notebooks will be updated to track the new release.

We also note a few limitations to this work. First, given that the MetaSRA annotates the SRA samples using an automated computational pipeline, its annotations contain some errors. Errors in the MetaSRA may propagate to the results produced by these tools, and thus the datasets produced by these tools are best used as candidate datasets for downstream analysis. We point the reader to Bernstein et al. for an analysis of the MetaSRA’s accuracy. Second, the SRA stores sequencing data for both bulk RNA-seq and single-cell RNA-seq samples; however, this information is not encoded in any standardized way within the SRA, nor is it captured by the MetaSRA. Thus, results returned by these tools may include a mixture of single-cell and bulk data. For these reasons, we encourage users to validate the results returned by these tools by consulting their entries in the SRA before proceeding with downstream analyses.

Lastly, to facilitate access to these tools, it would be beneficial to implement them within an easy-to-use web interface rather than Jupyter notebooks. Future work will entail either integrating these tools into the MetaSRA website or implementing a stand-alone web application for these tools using a platform such as R Shiny.

Data availability

The figures and datasets produced in the analyses can be found on GitHub: https://github.com/mbernste/hypothesis-driven-SRA-queries/tree/master/results

Software availability

All code is maintained on GitHub: https://github.com/mbernste/hypothesis-driven-SRA-queries

Archived code as at time of publication: https://doi.org/10.5281/zenodo.3957949 (Bernstein, 2020)

License: CC0

Reviewer report

The authors have addressed all of my comments and significantly improved the tools.

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly
Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly

Reviewer Expertise: Bioinformatics; Computational Biology; Machine Learning

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer report

Thank you to the authors for their thoughtful response and changes to the manuscript and notebooks. I'm happy with the current version of the manuscript and associated materials.

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly
Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly

Reviewer Expertise: genetics, bioinformatics, data science education

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer report

This paper describes the development of two Jupyter notebook-based tools (Case-Control Finder and Series Finder) for improving the ease with which researchers can identify cases within the SRA for further study. The paper does a nice job describing what the tools are and how they can be helpful, and the code and examples provided and explained in the paper function as expected (that is, they are reproducible). However, there are a few limitations in the implementation that will limit its utility for researchers:

1. The fact that this tool requires a static version of the SRA metadata to be loaded limits its ability to be updated and requires the authors to manually download the metadata. Access to the SRA by API would improve this process.

2. While the provided examples work well, there are limitations for unfamiliar users and failures in cases that, on reading the paper, seem like they should work.
- For example, in the Series Finder, if I change `term` to "heart" (instead of "brain"), almost all subsequent cells fail.
- In the Case-Control Finder, if I change `condition` to "brain cancer", all but one of the samples returned are controls (which does not align with what is in the SRA?) and visualization formatting becomes difficult.
- Clarifying what the user options are (or giving examples) at each place where the user is free to change the input could avoid this. Similarly, the functions lack documentation, examples, and checks on their input, so diving into the code becomes critical for use, which will limit users. Adding documentation and checks for user input would help overall.

Minor issues:

1. I was able to download locally using the "not recommended" approach; however, Docker asked for a password using the suggested approach in the README (I didn't investigate further).

2. In the paper and notebooks, the tool would be improved by focusing on the readability of visualizations. For example, flipping the bar charts in Figure 2A by 90 degrees (and the accompanying charts in the notebook) would make the labels more readable. Considering the colors in figure C, such that "orange" is not used in all three pie charts (when they do not represent the same categories), would also be helpful, as would summarizing the number of samples shown in the pie charts.

3. The sentence in the Introduction starting with "More specifically, the Series Finder produces..." is unclear. Specifically, on reading, I'm not sure what a temporal property would be in the metadata (other than the listed age). As a reader, this limits my understanding of one of the two notebooks provided and my ability to use the tool.

4. I may be missing it, but it seems like cases and controls would benefit most from also being matched on age and sex to make them truly useful for further analysis. It does not seem that this functionality exists.

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly
Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly

Reviewer Expertise: genetics, bioinformatics, data science education

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author response

We greatly appreciate the reviewer's valuable suggestions and feedback. Please see our responses below:

1. We agree that using the MetaSRA’s API would be a great idea; however, the API restricts queries that return too many results. Specifically, for queries that return too many results, the API returns an error message that the search results are too large. This severely restricts our ability to use the API for these tools. We note that the MetaSRA is released in discrete chunks and does not track every ongoing change to the SRA; thus, whenever the MetaSRA version changes, we will update the static version of the MetaSRA packaged with these tools. We have added text to this manuscript detailing our commitment to performing these updates. Lastly, we added text to the README that makes it more explicit to the user which version of the MetaSRA these tools are using.

2.
- We tested the query “heart” and it should now return results. We also provide more thorough input validation for cases in which the query does not return results.
- We have updated the code so that the tools retrieve samples that are annotated with a descendant term of the query term (e.g., a sample labelled as “brain glioma” should be retrieved when the user inputs the query “brain cancer”). Now the query “brain cancer” will retrieve many more samples than before. We do note a few issues with the particular query “brain cancer” (which maps to term DOID:1319 in the Disease Ontology). Specifically, we found that the MetaSRA failed to label many samples as “brain cancer” because many of the subterms (e.g., “brain glioma”) are missing important synonyms that would have led the MetaSRA to pick them up. For example, the term “brain glioma” (DOID:0060108) is not associated with the simple synonym “glioma”; thus, unless a glioma sample was described using the string “brain glioma”, which appears to be rare, the MetaSRA failed to annotate it as a “brain glioma”. Instead, the MetaSRA labels glioma samples using an alternative “glioma” term from the Experimental Factor Ontology (EFO:0005543), which does not have “brain cancer” as an ancestor term, but instead has “brain neoplasm” as an ancestor (EFO:0003833). This case points to the fact that there is still work to be done both in standardizing the metadata in the SRA and in constructing comprehensive ontologies. Unfortunately, these issues remain out of scope for this work; however, we now include new text in the Conclusion section that discusses how the original MetaSRA annotations contain some errors and that these errors may propagate to the output of these tools.
- Thank you for this suggestion. We have added more detailed instructions for each input parameter. We also perform more thorough validation of the user’s input. Lastly, we have added more documentation to each function in utils to help a user who wishes to dive further into the code.

Responses to minor issues:

1. We apologize for this password issue. Given how few dependencies these notebooks use, we decided that Docker is probably overkill for this project and therefore removed this option altogether. We instead uploaded these notebooks to Google Colab to run in the cloud. If a user would like to run the notebooks locally, we now detail all of the dependencies in the file “requirements.txt” within the repository and offer guidance on installing these dependencies in the README.

2. Thank you for these suggestions. We flipped the bar charts 90 degrees and also use a different color palette for each pie chart. We note that the same samples are used to construct each of the four pie charts.

3. We added text to this sentence highlighting another example of a temporal property: the time that cells have spent differentiating in vitro. To this end, we have also added another parameter to the query that enables users to select only in vitro differentiating cells in order to answer possible biological questions pertaining to differentiation.

4. This is definitely an important feature; thank you for suggesting it. We now enable the user to match by age and sex in the notebook (see Section “3. Set filtering parameters”). Specifically, if the user sets the variable “MATCH_BY_SEX” to True, we only consider samples that are annotated by sex in the MetaSRA and then match accordingly. Similarly, if the user sets “MATCH_BY_AGE” to True, we only consider samples that are annotated with age and then match accordingly.

Reviewer report

Bernstein et al. provide two Jupyter notebook-based tools to facilitate re-analysis of human RNA-seq data deposited to the SRA. The tools were built on top of annotated metadata of RNA-seq samples from the MetaSRA, and provide some visualizations of the summary statistics of the query results. I have the following suggestions and comments:

1. The authors should indicate how to access the Jupyter notebooks in the abstract.

2. It would require less overhead for users if the authors made their Jupyter notebook tools available to execute on Binder or Google Colab.

3. Since the MetaSRA mapped RNA-seq samples to biomedical ontologies, it would be useful to have the Jupyter tools also enable queries using ontology terms in addition to free text. For instance, a researcher may want to focus on samples from non-small cell lung carcinoma (DOID:3908) rather than any type of lung cancer.

4. Currently, both notebooks load the metadata of the SRA samples from a preprocessed file in the Git repository. It would be useful to make the tools interoperable with the MetaSRA through its API so that they can query against the most up-to-date version of the SRA, which may include many more samples, as the volume of public RNA-seq data is drastically increasing.

5. Please provide the available options for the structured query, including "target_property" and "UNIT", in the "Series Finder" notebook.

6. Please provide an assessment of the precision and recall of the tools in terms of retrieving the correct samples given queries.

7. Can the authors please comment on the applicability of the tools to bulk vs. single-cell samples?

8. Please add discussion about how to perform secondary analysis on the SRA samples after obtaining the structured data from the Jupyter notebooks.

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly
Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly

Reviewer Expertise: Bioinformatics; Computational Biology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author response

We greatly appreciate the reviewer's valuable feedback. Please find our responses to each point below:

1. Within the abstract we now point the reader to the tools’ GitHub repository, which describes how the tools can be executed either locally or in the cloud via Google Colab.

2. We have set up Google Colab notebooks to run these tools in the cloud. Links to the notebooks are found within the README in the GitHub repository.

3. We thank you for this suggestion. We have updated the tools to now accept both ontology term names (i.e., free text) and ontology term IDs.

4. We agree that using the MetaSRA’s API would be a great idea; however, the API restricts queries that return too many results. Specifically, for queries that return too many results, the API returns an error message that the search results are too large. This severely restricts our ability to use the API for these tools. We note that the MetaSRA is released in discrete chunks and does not track every ongoing change to the SRA; thus, whenever the MetaSRA version changes, we will update the static version of the MetaSRA packaged with these tools. We have added text to this manuscript detailing our commitment to performing these updates. Lastly, we added text to the README that makes it more explicit to the user which version of the MetaSRA these tools are using.

5. Within the instructions (within Section 1 of the Series Finder), we now provide the user with example properties (such as “passage number” and “time”) as well as example units (such as “hour” and “day”). We also point the user to the Units Ontology for a full set of available units that are used by the underlying MetaSRA annotations.

6. We note that the accuracy of the results is dependent on the accuracy of the MetaSRA annotations, which have been thoroughly evaluated in the original MetaSRA publication by Bernstein et al. (2017). Therefore, we added text to the “Conclusion and future work” section that points readers to this analysis. We have also added text to this section that clarifies that these tools are for selecting an initial candidate set of samples from the SRA; however, given that the annotations are not error-free, we encourage the user to further validate the datasets returned by these tools before performing downstream analysis.

7. The SRA stores sequencing data for both bulk and single-cell data; however, this information is not encoded in the metadata in a standardized way, nor is it captured by the MetaSRA. Therefore, one limitation of the tools presented in this work is that they may return datasets that comprise both bulk and single-cell samples. We describe this limitation in the Conclusion section and again encourage users to validate the results returned by these tools before performing downstream analyses.

8. In the Conclusion section, we now point the reader to databases of pre-processed SRA data including recount2, ARCHS4, and refine.bio. From these resources, users can download pre-processed expression data for the samples returned by the tools presented in this work.

References

1. Modeling sample variables with an Experimental Factor Ontology.

Authors: James Malone; Ele Holloway; Tomasz Adamusiak; Misha Kapushesky; Jie Zheng; Nikolay Kolesnikov; Anna Zhukova; Alvis Brazma; Helen Parkinson
Journal: Bioinformatics Date: 2010-03-03 Impact factor: 6.937

2. Reproducible RNA-seq analysis using recount2.

Authors: Leonardo Collado-Torres; Abhinav Nellore; Kai Kammers; Shannon E Ellis; Margaret A Taub; Kasper D Hansen; Andrew E Jaffe; Ben Langmead; Jeffrey T Leek
Journal: Nat Biotechnol Date: 2017-04-11 Impact factor: 54.908

3. The sequence read archive.

Authors: Rasko Leinonen; Hideaki Sugawara; Martin Shumway
Journal: Nucleic Acids Res Date: 2010-11-09 Impact factor: 16.971

4. Uberon, an integrative multi-species anatomy ontology.

Authors: Christopher J Mungall; Carlo Torniai; Georgios V Gkoutos; Suzanna E Lewis; Melissa A Haendel
Journal: Genome Biol Date: 2012-01-31 Impact factor: 13.583

5. MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive.

Authors: Matthew N Bernstein; AnHai Doan; Colin N Dewey
Journal: Bioinformatics Date: 2017-09-15 Impact factor: 6.937

6. The variable quality of metadata about biological samples used in biomedical experiments.

Authors: Rafael S Gonçalves; Mark A Musen
Journal: Sci Data Date: 2019-02-19 Impact factor: 6.444

7. Human Disease Ontology 2018 update: classification, content and workflow expansion.

Authors: Lynn M Schriml; Elvira Mitraka; James Munro; Becky Tauber; Mike Schor; Lance Nickle; Victor Felix; Linda Jeng; Cynthia Bearer; Richard Lichenstein; Katharine Bisordi; Nicole Campion; Brooke Hyman; David Kurland; Connor Patrick Oates; Siobhan Kibbey; Poorna Sreekumar; Chris Le; Michelle Giglio; Carol Greene
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

8. 4-1BB Delineates Distinct Activation Status of Exhausted Tumor-Infiltrating CD8⁺ T Cells in Hepatocellular Carcinoma.

Authors: Hyung-Don Kim; Seongyeol Park; Seongju Jeong; Yong Joon Lee; Hoyoung Lee; Chang Gon Kim; Kyung Hwan Kim; Seung-Mo Hong; Jung-Yun Lee; Sunghoon Kim; Hong Kwan Kim; Byung Soh Min; Jong Hee Chang; Young Seok Ju; Eui-Cheol Shin; Gi-Won Song; Shin Hwang; Su-Hyung Park
Journal: Hepatology Date: 2019-10-18 Impact factor: 17.425

9. SRAdb: query and use public next-generation sequencing data from within R.

Authors: Yuelin Zhu; Robert M Stephens; Paul S Meltzer; Sean R Davis
Journal: BMC Bioinformatics Date: 2013-01-17 Impact factor: 3.169

10. Massive mining of publicly available RNA-seq data from human and mouse.

Authors: Alexander Lachmann; Denis Torre; Alexandra B Keenan; Kathleen M Jagodnik; Hoyjin J Lee; Lily Wang; Moshe C Silverstein; Avi Ma'ayan
Journal: Nat Commun Date: 2018-04-10 Impact factor: 17.694
