The RNASeq-er API: a gateway to systematically updated analysis of public RNA-seq data.

Robert Petryszak¹, Nuno A Fonseca¹, Anja Füllgrabe¹, Laura Huerta¹, Maria Keays¹, Y Amy Tang¹, Alvis Brazma¹.

Abstract

MOTIVATION: The exponential growth of publicly available RNA-sequencing (RNA-Seq) data poses an increasing challenge to researchers wishing to discover, analyse and store such data, particularly those based in institutions with limited computational resources. EMBL-EBI is in an ideal position to address these challenges and to allow the scientific community easy access to not just raw, but also processed RNA-Seq data. We present a Web service to access the results of a systematically and continually updated standardized alignment as well as gene and exon expression quantification of all public bulk (and in the near future also single-cell) RNA-Seq runs in 264 species in European Nucleotide Archive, using Representational State Transfer.
RESULTS: The RNASeq-er API (Application Programming Interface) enables ontology-powered search for and retrieval of CRAM, bigWig and bedGraph files, gene and exon expression quantification matrices (Fragments Per Kilobase Of Exon Per Million Fragments Mapped, Transcripts Per Million, raw counts) as well as sample attributes annotated with ontology terms. To date, over 270 000 RNA-Seq runs in nearly 10 000 studies (1 PB of raw FASTQ data) in 264 species in ENA have been processed and made available via the API.
AVAILABILITY AND IMPLEMENTATION: The RNASeq-er API can be accessed at http://www.ebi.ac.uk/fg/rnaseq/api. The commands used to analyse the data are available in supplementary materials and at https://github.com/nunofonseca/irap/wiki/iRAP-single-library.
CONTACT: rnaseq@ebi.ac.uk; rpetry@ebi.ac.uk.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2017 PMID: 28369191 PMCID: PMC5870697 DOI: 10.1093/bioinformatics/btx143

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The rapid growth of RNA-sequencing (RNA-Seq) data observed in recent years is set to continue as the costs of sequencing experiments decrease and novel technologies and analysis methods, e.g. single-cell RNA-Seq (Linnarson), reach maturity. Figure 1 highlights the sustained exponential growth in the number of public bulk RNA-Seq runs in the European Nucleotide Archive (ENA).

Fig. 1

Cumulative number of public bulk RNA-Seq runs in ENA, in species covered by the API

A ‘run’ is a unit of biological assay performed on a sequencing machine for a single, de-multiplexed sequencing library preparation. Figure 2 shows the number of runs in the top 20 RNA-Seq data-rich species in ENA.

Fig. 2

The number of sequencing runs in the top 20 RNA-Seq data-rich species in ENA

This sustained growth only exacerbates the challenges facing researchers wishing to discover, analyse and store available RNA-Seq data, particularly those based in institutions with limited computational resources. EMBL-EBI is in an ideal position to address these challenges and to allow the scientific community easy access to not just raw, but also processed RNA-Seq data. We have therefore undertaken the task of on-going standardized alignment and gene and exon expression quantification of all public bulk (and in the near future also single-cell) RNA-Seq data in ENA (Silvester) in 264 species with genome references in Ensembl (Cunningham), Ensembl Genomes (Kersey) and WormBase ParaSite (Howe), depositing the results on the public EMBL-EBI FTP server, and making them discoverable via the RNASeq-er API (Application Programming Interface).
Our fully automated analysis pipeline processes new RNA-Seq runs as soon as they become public in ENA and makes the results available via the API shortly after. In addition, all RNA-Seq runs in a given species are re-processed when a new genome assembly is released. While the initial processing of the bulk of public RNA-Seq data took around 6 months, the pipeline (utilising 2000 cores in parallel) is capable of processing around 500-1000 sequencing runs per day and thus provides results for any new run in ENA within days of it becoming public. Re-processing for a new genome assembly typically takes one to two weeks, with the exception of human and mouse (due to the sheer volume of data) and of species with large genomes (it took over a month to re-process all wheat runs after the new TGACv1 genome reference was released).
The RNASeq-er API enables ontology-powered search for and retrieval of CRAM, bigWig and bedGraph files at individual ENA run level, and of gene and exon expression quantification matrices [Fragments Per Kilobase Of Exon Per Million Fragments Mapped (FPKM), Transcripts Per Million (TPM), raw counts] at ENA study level. The API returns data in tab-delimited and JSON formats, and provides an additional search filter by the minimum percentage of reads mapped to the genome reference in a given run. Please note that it is up to the user of the API to specify the minimum desired percentage of mapped reads; no such filtering is applied by the API a priori. The API also provides access to baseline gene expression quantifications, aggregated across all runs in each of over 4000 normal tissue, cell type, developmental stage, sex and strain conditions in 61 species.
To facilitate discoverability and to allow for interpretation of the analysed data, the API also provides sample attributes per run, including corresponding ontology terms derived from manual curation in ArrayExpress (Kolesnikov) and Expression Atlas (Petryszak). Where manually curated sample annotations are not available, BioSamples database (Faulconbridge) records are used instead. The API has also been incorporated into the BioServices Python package (Cokelaer) and a CPAN Perl package (http://search.cpan.org/dist/Bio-EBI-RNAseqAPI/).
The analysis pipeline behind the RNASeq-er API offers an important service to researchers performing RNA-Seq experiments who choose to submit their data to ArrayExpress via the Annotare submission tool (https://www.ebi.ac.uk/fg/annotare): the deposited studies are not only described by rich, ontology-annotated experimental metadata; the associated raw data are also analysed for free and, for qualifying studies, subsequently visualized in Expression Atlas (via private access if pre-publication).
This combined metadata-rich deposition, analysis and visualization service aims to make data depositions not only easily discoverable, but also to facilitate understanding and reproducibility of the underlying research results. The results of our analysis can also inform and feed into the submitters’ own downstream analyses well before the paper is ready for submission to a journal.
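As a rough illustration of the mapped-read filter described above, the following Python sketch builds a request URL and applies the same threshold to records that have already been downloaded. The base URL is the one given in the abstract. The path layout (format, threshold, method name, study accession), the "tsv" format label, the getRunsByStudy method name and the MAPPING_QUALITY field name are assumptions made for illustration and should be checked against the API documentation. The study and run accessions are hypothetical, and no network request is made.

```python
from urllib.parse import quote

# Base URL as given in the paper's abstract.
API_ROOT = "http://www.ebi.ac.uk/fg/rnaseq/api"


def runs_by_study_url(study_id, min_mapped_pct=0, fmt="json"):
    """Build a URL requesting all runs of a study above a mapping threshold.

    ASSUMPTION: the path layout and the method name 'getRunsByStudy'
    are illustrative; consult the API documentation for the real routes.
    """
    if fmt not in ("json", "tsv"):
        raise ValueError("fmt must be 'json' or 'tsv'")
    if not 0 <= min_mapped_pct <= 100:
        raise ValueError("min_mapped_pct must be between 0 and 100")
    study = quote(study_id, safe="")
    return f"{API_ROOT}/{fmt}/{min_mapped_pct}/getRunsByStudy/{study}"


def filter_runs(records, min_mapped_pct):
    """Keep run records whose mapped-read percentage meets the threshold.

    ASSUMPTION: each record is a dict with a 'MAPPING_QUALITY' entry
    holding the percentage of reads mapped to the genome reference.
    """
    return [r for r in records if float(r["MAPPING_QUALITY"]) >= min_mapped_pct]


if __name__ == "__main__":
    # Hypothetical accessions, for illustration only.
    print(runs_by_study_url("SRP000001", min_mapped_pct=70))
    sample = [
        {"RUN_ID": "SRR0000001", "MAPPING_QUALITY": 92},
        {"RUN_ID": "SRR0000002", "MAPPING_QUALITY": 41},
    ]
    print([r["RUN_ID"] for r in filter_runs(sample, 70)])
```

Because the API applies no threshold of its own, a client written along these lines would need to choose one explicitly; a value of 0 returns every run regardless of mapping rate.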

2 Implementation

The analysis of each sequencing run is performed using the iRAP pipeline (Fonseca). First, quality-filtered (Petryszak, Supplementary Material) reads are aligned to the latest genome reference via TopHat 2 (Kim). Note that so far we have used STAR (Dobin) for the wheat genome reference, but now that TopHat 2 has been improved to handle large genome references, we plan to use only TopHat 2 for all species. Then the resulting BAM (Li) file is converted to CRAM (Fritz) format; bigWig (https://genome.ucsc.edu/goldenpath/help/bigWig.html) and bedGraph (https://genome.ucsc.edu/goldenpath/help/bedgraph.html) genome track files are also generated. Where groups of technical replicates corresponding to a single biological sample were identified via manual curation in ArrayExpress, the corresponding CRAM, bigWig and bedGraph files are aggregated for each such biological replicate.
The expression levels (raw counts) of genes and exons defined in the corresponding GTF file (obtained from the same source as the genome reference) are quantified using HTSeq (Anders) and DEXSeq (Anders), respectively. FPKM and TPM are then calculated. The gene lengths are based on the union of exons. Finally, for each gene the median TPM expression and coefficient of variation are calculated across all runs that have the same unique combination of sample attributes, including tissue, cell type, developmental stage, sex and strain.
The full API documentation is available in the Supplementary data. The latest API documentation is also available at http://www.ebi.ac.uk/fg/rnaseq/api/ (html) and http://www.ebi.ac.uk/fg/rnaseq/api/doc (pdf).
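The quantification steps described in this section can be sketched in Python as follows. This is a minimal illustration, not the iRAP implementation. It assumes the standard FPKM and TPM definitions, uses the sum of gene-level counts as the library size, treats exon coordinates as 1-based and inclusive (as in GTF), and uses the sample standard deviation for the coefficient of variation. All function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def union_exon_length(exons):
    """Total length of the union of (start, end) exon intervals.

    Coordinates are assumed 1-based and inclusive, as in GTF.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(exons):
        if cur_end is None or start > cur_end:
            # The previous merged block is finished; count it.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping exon: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def fpkm_and_tpm(counts, lengths):
    """Compute FPKM and TPM per gene from raw counts and gene lengths (bp).

    ASSUMPTION: library size is the sum of the counts passed in.
    """
    total = sum(counts.values())
    rate = {g: counts[g] / lengths[g] for g in counts}
    rate_sum = sum(rate.values())
    fpkm = {g: rate[g] * 1e9 / total if total else 0.0 for g in counts}
    tpm = {g: rate[g] * 1e6 / rate_sum if rate_sum else 0.0 for g in counts}
    return fpkm, tpm


def median_and_cv(tpms):
    """Median and coefficient of variation (sample sd / mean) of TPM values."""
    m = mean(tpms)
    cv = stdev(tpms) / m if len(tpms) > 1 and m > 0 else float("nan")
    return median(tpms), cv


def summarise_by_condition(runs):
    """Group one gene's (attributes, tpm) pairs by attribute combination."""
    groups = defaultdict(list)
    for attrs, tpm in runs:
        groups[tuple(sorted(attrs.items()))].append(tpm)
    return {key: median_and_cv(vals) for key, vals in groups.items()}
```

Sorting the attribute items before building the key means that two runs annotated with the same tissue, sex and so on fall into the same group regardless of the order in which the attributes were recorded. A group with a single run has no defined coefficient of variation, so the sketch returns NaN in that case.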

References (16 in total)

1. STAR: ultrafast universal RNA-seq aligner.

Authors: Alexander Dobin; Carrie A Davis; Felix Schlesinger; Jorg Drenkow; Chris Zaleski; Sonali Jha; Philippe Batut; Mark Chaisson; Thomas R Gingeras
Journal: Bioinformatics Date: 2012-10-25 Impact factor: 6.937

2. Content discovery and retrieval services at the European Nucleotide Archive.

Authors: Nicole Silvester; Blaise Alako; Clara Amid; Ana Cerdeño-Tárraga; Iain Cleland; Richard Gibson; Neil Goodgame; Petra Ten Hoopen; Simon Kay; Rasko Leinonen; Weizhong Li; Xin Liu; Rodrigo Lopez; Nima Pakseresht; Swapna Pallreddy; Sheila Plaister; Rajesh Radhakrishnan; Marc Rossello; Alexander Senf; Dmitriy Smirnov; Ana Luisa Toribio; Daniel Vaughan; Vadim Zalunin; Guy Cochrane
Journal: Nucleic Acids Res Date: 2014-11-17 Impact factor: 16.971

3. Detecting differential usage of exons from RNA-seq data.

Authors: Simon Anders; Alejandro Reyes; Wolfgang Huber
Journal: Genome Res Date: 2012-06-21 Impact factor: 9.043

4. ArrayExpress update--simplifying data submissions.

Authors: Nikolay Kolesnikov; Emma Hastings; Maria Keays; Olga Melnichuk; Y Amy Tang; Eleanor Williams; Miroslaw Dylag; Natalja Kurbatova; Marco Brandizi; Tony Burdett; Karyn Megy; Ekaterina Pilicheva; Gabriella Rustici; Andrew Tikhonov; Helen Parkinson; Robert Petryszak; Ugis Sarkans; Alvis Brazma
Journal: Nucleic Acids Res Date: 2014-10-31 Impact factor: 16.971

5. HTSeq--a Python framework to work with high-throughput sequencing data.

Authors: Simon Anders; Paul Theodor Pyl; Wolfgang Huber
Journal: Bioinformatics Date: 2014-09-25 Impact factor: 6.937

6. Expression Atlas update--an integrated database of gene and protein expression in humans, animals and plants.

Authors: Robert Petryszak; Maria Keays; Y Amy Tang; Nuno A Fonseca; Elisabet Barrera; Tony Burdett; Anja Füllgrabe; Alfonso Muñoz-Pomer Fuentes; Simon Jupp; Satu Koskinen; Oliver Mannion; Laura Huerta; Karine Megy; Catherine Snow; Eleanor Williams; Mitra Barzine; Emma Hastings; Hendrik Weisser; James Wright; Pankaj Jaiswal; Wolfgang Huber; Jyoti Choudhary; Helen E Parkinson; Alvis Brazma
Journal: Nucleic Acids Res Date: 2015-10-19 Impact factor: 16.971

7. Updates to BioSamples database at European Bioinformatics Institute.

Authors: Adam Faulconbridge; Tony Burdett; Marco Brandizi; Mikhail Gostev; Rui Pereira; Drashtti Vasant; Ugis Sarkans; Alvis Brazma; Helen Parkinson
Journal: Nucleic Acids Res Date: 2013-11-21 Impact factor: 16.971

8. Expression Atlas update--a database of gene and transcript expression from microarray- and sequencing-based functional genomics experiments.

Authors: Robert Petryszak; Tony Burdett; Benedetto Fiorelli; Nuno A Fonseca; Mar Gonzalez-Porta; Emma Hastings; Wolfgang Huber; Simon Jupp; Maria Keays; Nataliya Kryvych; Julie McMurry; John C Marioni; James Malone; Karine Megy; Gabriella Rustici; Amy Y Tang; Jan Taubert; Eleanor Williams; Oliver Mannion; Helen E Parkinson; Alvis Brazma
Journal: Nucleic Acids Res Date: 2013-12-04 Impact factor: 16.971

9. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions.

Authors: Daehwan Kim; Geo Pertea; Cole Trapnell; Harold Pimentel; Ryan Kelley; Steven L Salzberg
Journal: Genome Biol Date: 2013-04-25 Impact factor: 13.583

10. WormBase 2016: expanding to enable helminth genomic research.

Authors: Kevin L Howe; Bruce J Bolt; Scott Cain; Juancarlos Chan; Wen J Chen; Paul Davis; James Done; Thomas Down; Sibyl Gao; Christian Grove; Todd W Harris; Ranjana Kishore; Raymond Lee; Jane Lomax; Yuling Li; Hans-Michael Muller; Cecilia Nakamura; Paulo Nuin; Michael Paulini; Daniela Raciti; Gary Schindelman; Eleanor Stanley; Mary Ann Tuli; Kimberly Van Auken; Daniel Wang; Xiaodong Wang; Gary Williams; Adam Wright; Karen Yook; Matthew Berriman; Paul Kersey; Tim Schedl; Lincoln Stein; Paul W Sternberg
Journal: Nucleic Acids Res Date: 2015-11-17 Impact factor: 16.971
