Benilton S Carvalho, Gabriella Rustici.
Abstract
High-throughput technologies are widely used in functional genomics and in an increasing number of applications. For many 'wet lab' scientists, analysing the large amount of data generated by such technologies is a major bottleneck that can be overcome only through specialized training in advanced data analysis methodologies and the use of dedicated bioinformatics software tools. In this article, we discuss the challenges of delivering training in the analysis of high-throughput sequencing data and how we addressed these challenges in the hands-on training courses that we have developed at the European Bioinformatics Institute.
Keywords: bioinformatics training; high-throughput sequencing analysis; open-source software; practical courses; statistical methodologies
Year: 2013 PMID: 23543353 PMCID: PMC3771233 DOI: 10.1093/bib/bbt018
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Main profiles of the users applying to our courses on the analysis of HTS data
| Use case | Description |
|---|---|
| 1 | I am starting a project involving HTS applications, such as RNA-seq and/or ChIP-seq, and I need to learn how to analyse the data that I will generate. I have never done this kind of analysis before, and I have very little familiarity with data analysis tools. Bioinformatics support is lacking in our department, so it is vital for me to acquire these skills if I want my project to be successful. |
| 2 | I am involved in HTS projects. The analysis is done by a bioinformatician, but I would like to learn more about the analysis to be able to have a better interaction with the bioinformatician. I have run some simple analysis tasks using pre-compiled scripts, but I would not know how to modify them to suit my needs. |
| 3 | I have been involved in microarray data analysis projects for a long time and now I am switching to HTS data analysis. I feel confident with using tools for microarray analysis, but I want to know what I need to use to analyse HTS data. I am confident in the use of some programming languages. |
| 4 | I am a bioinformatician, supporting various research groups with their analysis needs. I run HTS data analysis using some tools, but I want to learn how to use other tools as well as keep up to date with the latest algorithms that are being developed in this field. |
Figure 1: A typical RNA-seq data analysis workflow: the major steps involved in this pipeline are indicated, alongside some of the tools used to carry out individual steps. Quality assessment is first performed on the sequence reads before mapping them to a reference genome. The reads are then quantified into counts and normalized to minimize technical variability. Then statistical models for count data are applied to infer differential expression or differential exon usage.
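The normalization step in this workflow can be illustrated with a minimal Python sketch. It assumes median-of-ratios scaling (the approach popularized by DESeq), which is only one of several possible normalization methods. The function names `size_factors` and `normalize` and the toy counts are hypothetical and are not taken from the course material, which is taught with R and Bioconductor.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one scaling factor per sample by median-of-ratios.

    counts: list of genes, each a list of raw read counts per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide every count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Toy data (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
raw = [[10, 20], [100, 200], [5, 10], [0, 3]]
factors = size_factors(raw)
normalized = normalize(raw)
```

In this toy example the second sample receives a size factor twice that of the first, so the normalized counts of the first three genes agree across samples. The only remaining difference is in the fourth gene, which was excluded from the factor estimate because of its zero count.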
Figure 2: Number of applications (solid line) received since 2009 for HT data analysis courses at EMBL-EBI and number of participants in such courses (dashed line).
Learning objectives for lectures (L) and practicals (P) of the ‘EMBO practical course on the analysis of high-throughput sequencing data’
| Day | Lecture/practical title | Learning objectives | Software |
|---|---|---|---|
| 1 | Understanding the HTS data analysis workflow (L) | Provides an overview of the course structure and introduces the HTS data analysis workflow. We discuss the content of each session and how these sessions are connected to each other. Participants are encouraged to discuss their course expectations. | |
| | Introduction to R and Bioconductor (L/P) | The lecture gives a quick overview of the Bioconductor project [ | R; Bioconductor |
| | Short read representation, manipulation and assessment (L/P) | NGS data consist of a large number of short reads. The lecture introduces the FASTQ format used to store short reads and how to assess the quality of such data. The practical allows participants to run a quality assessment of short-read data and generate a quality report using the FASTX toolkit software or the Bioconductor package ShortRead [ | FASTX toolkit; Bioconductor package: ShortRead |
| | Mapping strategies for sequence reads (L/P) | The lecture presents the different methods for mapping short-read data to the reference genome. The practical teaches participants how to use Bowtie [ | Bowtie; TopHat |
| 2 | Representing and manipulating alignments (L/P) | The lecture introduces the BAM format used to store aligned reads and discusses how to manipulate and visualize such data. The practical focuses on how to use SAMtools [ | SAMtools; IGV; Bioconductor packages: ShortRead, GenomicRanges and Rsamtools |
| | Annotation of genes and genomes (L/P)ᵃ | The lecture and the practical are dedicated to understanding how to retrieve and use genomic annotations using web-based resources such as BioMart [ | Bioconductor packages: GenomicFeatures, BSgenome, biomaRt and rtracklayer |
| | Estimating expression over genes and exons with simple counts (L/P) | The lecture discusses how to go from aligned reads to expression estimation. Strategies for the discovery of novel transcribed regions are also presented [ | Bioconductor packages: GenomicRanges, Rsamtools, biomaRt |
| 3 | Statistical concepts and methodologies for data analyses (L) | Gives an overview of the fundamental statistical elements needed to understand and perform downstream analysis steps. The statistical models used to handle RNA-seq data are presented, as well as considerations on experimental design. | |
| | Normalizing RNA-seq data (L) | Covers how to properly normalize RNA-seq data. Various normalization approaches are presented and compared [ | |
| | Haplotype and isoform level expression estimation (L/P)ᵃ | The lecture introduces the methods used to measure the expression of different isoforms and is followed by a practical using MMSEQ [ | MMSEQ |
| 4 | Differential expression (L) | Explains how to calculate differential expression from RNA-seq data; different Bioconductor packages available to perform this analysis are compared [ | |
| | Alternative exon usage (L)ᵇ | Explains how to calculate differential exon usage from RNA-seq data [ | |
| | Multiple testing (L)ᵃ | Addresses the importance of multiple testing corrections when measuring differential expression [ | |
| | Differential expression with RNA-seq (P) | Is dedicated to running differential expression and differential exon usage analysis with the Bioconductor packages DESeq [ | Bioconductor packages: DESeq and DEXSeq |
| 5 | Introduction to ChIP-Seq data and analysis (L) | Provides an overview of ChIP-seq and how to analyse this data [ | |
| | ChIP-Seq data analysis with Bioconductor (P) | Illustrates common ChIP-seq analysis steps based on a number of Bioconductor packages. The package chipseq is used to perform filtering steps and obtain diagnostic plots to assess the data quality. The same package is then used to call peaks and the output is compared with the results obtained with the commonly used peak-finding algorithms MACS [ | Bioconductor packages: ShortRead, GenomicRanges, chipseq, BSgenome, GenomicFeatures, rtracklayer, DESeq; MACS; SISSR; MEME |
| | Differential analysis of ChIP-Seq data (L/P) | The lecture focuses on how to perform differential analysis of ChIP-seq data [ | Bioconductor package: DiffBind |
| 6 | ENA: introduction, data model and browsing (L/P)ᵇ | The lecture is dedicated to explaining how HTS data are stored in public repositories [ | European Nucleotide Archive |
| | Data submission and compression format (L/P)ᵇ | The lecture shows participants how to submit their own HTS data to ENA [ | European Nucleotide Archive |
For organizers who wish to run shorter courses, we have marked with 'a' the sessions that can be shortened and with 'b' the sessions that can be excluded. For more information on any of the Bioconductor packages listed in this table, please refer to the individual package pages available at http://bioconductor.org/packages/release/.
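To give a flavour of the short-read quality assessment covered on day 1, the following minimal Python sketch computes the mean base quality of each read in a FASTQ file. It assumes unwrapped four-line FASTQ records and the Phred+33 quality encoding. The function name `mean_qualities` and the two example reads are hypothetical, and the course itself uses the FASTX toolkit and the ShortRead package for this step rather than this code.

```python
from io import StringIO


def mean_qualities(handle, offset=33):
    """Yield (read_id, mean Phred score) for each four-line FASTQ record.

    Assumes each record is exactly four lines (header, sequence,
    '+' separator, quality string) and that qualities are encoded
    as ASCII characters with the given offset (Phred+33 assumed).
    """
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of file (or a blank line)
        handle.readline()  # sequence line, not needed here
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        scores = [ord(ch) - offset for ch in quality]
        yield header[1:], sum(scores) / len(scores)


# Two made-up reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
example = StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n")
results = list(mean_qualities(example))
```

Reads with a low mean quality, such as the second read here, are typical candidates for trimming or filtering before mapping.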