Patrick Perkins, Serina Mazzoni-Putman, Anna Stepanova, Jose Alonso, Steffen Heber.
Abstract
BACKGROUND: Ribo-seq is a popular technique for studying translation and its regulation. A Ribo-seq experiment produces a snapshot of the location and abundance of actively translating ribosomes within a cell's transcriptome. In practice, Ribo-seq data analysis can be sensitive to quality issues such as read length variation, low read periodicity, and contamination with ribosomal and transfer RNA. Various software tools for preprocessing, quality assessment, analysis, and visualization of Ribo-seq data have been developed. However, many of these tools require considerable practical knowledge of software applications, and multiple tools often have to be used in combination.
Keywords: Data analysis; Next-generation sequencing; Ribo-seq; Web application
Year: 2019 PMID: 31167636 PMCID: PMC6551240 DOI: 10.1186/s12864-019-5700-7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Summary of a Ribo-seq experiment and subsequent computational analysis. a The experiment starts with isolation of mRNA-ribosome complexes, followed by nuclease digestion of mRNA sequences that are not protected by associated ribosomes. The mRNA fragments shielded by the ribosomes are then purified, followed by library generation, deep sequencing, and data analysis. b Plot of ribosome-protected fragment counts around the translation start and stop sites. During translation, the p-site holds the amino acid that is linked with the growing polypeptide chain, and therefore more accurately represents the exact position or codon within the coding sequence that the ribosome is interacting with. Because of this, an adjustment must be made to account for the positional difference between the first position of a read and the corresponding p-site of the read. Pausing of ribosomes at each codon leads to trinucleotide periodicity. The majority of reads are expected to be ‘in frame’ with the start codon. c Differences between Ribo-seq and RNA-seq read densities around the start and stop codons. In contrast to RNA-seq data, Ribo-seq data tend to show a large peak around the start codon, as well as a larger percentage of sequencing reads in frame with the start codon
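The p-site adjustment and frame assignment described in panel b can be sketched in a few lines of Python. This is a minimal illustration, not riboStreamR's implementation: it assumes plus-strand reads in transcript coordinates, and the function names and the offset of 12 used in the example are hypothetical.

```python
def p_site_position(read_start: int, offset: int) -> int:
    """Shift the first position of a read to its p-site (transcript coordinates)."""
    return read_start + offset


def frame(p_site: int, cds_start: int) -> int:
    """Frame (0, 1 or 2) of a p-site relative to the first base of the start codon."""
    # Python's % is non-negative for a positive modulus, so upstream
    # positions also get a frame in {0, 1, 2}.
    return (p_site - cds_start) % 3


def frame_fractions(read_starts, offset, cds_start):
    """Fraction of reads whose p-site falls in frame 0, 1 and 2."""
    counts = [0, 0, 0]
    for start in read_starts:
        counts[frame(p_site_position(start, offset), cds_start)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0, 0.0, 0.0]


# Illustrative values only: four read starts, an assumed offset of 12 nt,
# and a start codon whose first base is at transcript position 100.
fractions = frame_fractions([88, 91, 94, 89], offset=12, cds_start=100)
```

With strong trinucleotide periodicity, most of the mass in `fractions` is expected to sit in frame 0.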
Fig. 2 riboStreamR data processing pipeline. a Overview of the flow of data from sample collection in the lab to processing in R. The steps in red boxes and arrows are not handled by riboStreamR, while those in blue are handled within riboStreamR. b riboStreamR requires three types of input file: a set of BAM files, a genome annotation file, and a FASTA file containing the genome nucleotide sequences. c The uploaded sequencing data are converted into a GRanges object, where each row is an individual alignment and each column contains attribute information (metadata) about the alignment. d P-site adjustment method. Reads are separated by length, and a meta-gene read density plot around the translation start sites is produced for each read length. The p-site adjustment for each alignment length is chosen to be the distance from the largest in-frame upstream peak to the translation start site
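The offset selection in panel d can be illustrated with a short Python sketch. It is a simplified reading of the caption, not the package's code: the input is assumed to be pairs of read length and the signed distance from the read's first position to the translation start site (negative means upstream), and the whole upstream region is searched for the peak. The function name is hypothetical.

```python
from collections import Counter, defaultdict


def estimate_p_site_offsets(reads):
    """Pick one p-site offset per read length.

    reads: iterable of (length, dist), where dist is the position of the
    read's first base minus the position of the translation start site.
    """
    # One meta-gene profile (distance -> read count) per read length.
    profiles = defaultdict(Counter)
    for length, dist in reads:
        profiles[length][dist] += 1

    offsets = {}
    for length, profile in profiles.items():
        # Keep only upstream positions that are in frame with the start site.
        in_frame_upstream = {d: n for d, n in profile.items() if d < 0 and d % 3 == 0}
        if in_frame_upstream:
            peak = max(in_frame_upstream, key=in_frame_upstream.get)
            offsets[length] = -peak  # distance from the peak to the start site
    return offsets


# Toy data: 28-nt reads peak in frame at -12, 30-nt reads at -15.
toy_reads = [(28, -12)] * 5 + [(28, -13)] * 7 + [(28, -9)] * 2 + [(30, -15)] * 4 + [(30, 3)] * 9
offsets = estimate_p_site_offsets(toy_reads)
```

Note that the out-of-frame position -13 holds more 28-nt reads than -12 in the toy data but is ignored, because only in-frame peaks are eligible.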
Table 1 Description of alignment attributes. Each attribute is contained within a separate metadata column in the GRanges object
| Attribute Name | Description |
|---|---|
| seqnames | Chromosome on which the aligned read is mapped. |
| ranges | Start and end position of the alignment in genomic coordinates. |
| strand | Strand to which the aligned read is mapped. |
| sample | Sample name from which the alignment originates. |
| exp | Experiment type of the sample from which the alignment originates, either ‘Ribo’ for Ribo-seq, or ‘RNA’ for RNA-seq. |
| length | Length of the aligned read in nucleotides. |
| gene | Gene to which the read is mapped. Corresponds to Gene IDs within the provided annotation file. ‘Other’ if not mapped within a gene. |
| feature | Feature type to which the read is mapped. Feature types correspond to those included in the user-provided annotation file. |
| pos | Genomic position of alignment based on p-site adjustment. |
| start_dist | Distance from translation start site (TSS) of a gene to p-site position, in transcript coordinates (with introns removed). The major isoform of each gene is used to calculate this distance. |
| end_dist | Distance from p-site position to translation stop codon (TSC) of a gene, in transcript coordinates (with introns removed). The major isoform of each gene is used to calculate this distance. |
| gc | The percentage of nucleotides in the aligned read which are G’s or C’s. |
| mapq | The mapq score of the alignment. Typically, alignments with a mapq score of 50 are considered uniquely mapped, while all other scores are considered multi-mapping. |
| frame | The trinucleotide frame of a read’s p-site, relative to the TSS of the gene’s major isoform. Reads that map within an mRNA are assigned either a 0, 1, or 2, while reads which map outside the mRNA are assigned ‘none’. |
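
Two of the simpler attributes in the table, `gc` and the unique-mapping reading of `mapq`, can be computed as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the threshold of 50 is taken from the table and made a parameter because the score that marks a unique alignment may differ between aligners.

```python
def gc_percent(read_sequence: str) -> float:
    """Percentage of nucleotides in a read that are G or C."""
    sequence = read_sequence.upper()
    if not sequence:
        return 0.0
    gc_count = sequence.count("G") + sequence.count("C")
    return 100.0 * gc_count / len(sequence)


def is_uniquely_mapped(mapq: int, unique_score: int = 50) -> bool:
    """True if the alignment carries the score treated as 'uniquely mapped'.

    unique_score defaults to the value quoted in the table; it is an
    assumption that should be checked against the aligner that was used.
    """
    return mapq == unique_score


# Illustrative read: 4 of 6 bases are G or C.
example_gc = gc_percent("ATGCGC")
```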
Fig. 3 Output examples for each tool in the platform. Descriptions of each tool can be found in Table 2
Table 2 Description of tools within riboStreamR
| Tool Name | Description |
|---|---|
| Data Upload | The tool allows users to upload their RNA-seq, Ribo-seq, annotation, and fasta files. After data upload and pre-processing, it displays a table with the number of alignments within each of the uploaded BAM files. |
| Summary Table | The tool provides a table with summary statistics for each sample, including total number of alignments, percentage of uniquely mapped reads, feature percentages, complexity, duplication values, and periodicities. |
| Read Length | Computes read length distributions for any combination of input files or data subsets. |
| GC % | Computes GC percentage distributions for any combination of input files or data subsets. |
| Feature % | Generates bar charts of the relative numbers of alignments mapping to the different feature types or any other alignment attribute. |
| Single Gene | Visualizes the read densities within single genes. Genes are displayed in genomic coordinates with gray regions indicating exons. Density bars can be color coded to differentiate between data subsets. |
| Meta Periodicity | The tool generates a meta gene distribution around the TSS and TSC for a single sample, showing the read density at each nucleotide. This is useful for gauging the level of periodicity. |
| Sample Meta Distribution | Generates line graphs of the distribution of reads around the TSS and TSC. The tool allows users to compare multiple samples or other read subsets. |
| Length Periodicity | Computes a bar graph that shows the relative number of alignments within each frame. |
| Differential Analysis | The tool computes a table which contains RPKM values for each gene, as well as the results of a differential analysis using edgeR, including the logFC, |
| Report Generator | The tool generates a report containing the outputs of the user selected tools, as well as the parameters used to produce the output, a description of the processing methods, and any user-provided notes. |
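
The RPKM values reported by the Differential Analysis tool follow a standard normalization (reads per kilobase of feature length per million mapped reads). The sketch below shows that formula only; how riboStreamR derives gene lengths and library sizes is not stated here, so the inputs and the function name are assumptions.

```python
def rpkm(read_counts, gene_lengths_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads, per gene.

    read_counts: dict of gene ID -> number of reads assigned to the gene
    gene_lengths_bp: dict of gene ID -> gene length in base pairs
    total_mapped_reads: library size used for normalization
    """
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    values = {}
    for gene, count in read_counts.items():
        length = gene_lengths_bp[gene]
        # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
        values[gene] = count * 1e9 / (length * total_mapped_reads)
    return values


# Toy example: geneB has twice the reads but is twice as long, so both
# genes end up with the same RPKM.
toy_rpkm = rpkm({"geneA": 500, "geneB": 1000}, {"geneA": 1000, "geneB": 2000}, 1_000_000)
```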
Fig. 4 Examples of riboStreamR’s graphical output customization options. On the left side is a representation of the toolbar. (A) Filtering parameters, shown in blue, allow plotting of distinct subsets of the input data; (A1) Read Length Distribution (RLD) plot where each of the three lines represents the alignments from a different sample; (A2) RLD plot where only alignments mapped to the CDS are included; (A3) RLD plot where only reads mapping to tRNA or rRNA regions are included. (B) Organizational parameters, shown in green, allow the user to adjust how the filtered data are grouped and positioned in the output; (B1) same as A1; (B2) RLD plots where the two panels separate alignments mapped within a CDS from those mapped to any other feature, and the lines distinguish the three samples; (B3) RLD plots where each panel is a separate sample, and the two lines represent reads mapping to different feature types. (C) Plotting parameters, shown in orange, change the aesthetics of the graphical output; (C1) same as A1 and B1; (C2) reduced bandwidth of the line plot to simplify comparisons between individual read lengths; (C3) reduced x-axis range and a different color scheme to highlight differences between samples
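The filtering and grouping behind the read length distribution panels can be mimicked on plain records. The sketch below is illustrative only: alignments are assumed to be dictionaries with the `sample`, `feature` and `length` attributes from Table 1, and the function name and parameters are hypothetical rather than part of riboStreamR.

```python
from collections import Counter, defaultdict


def read_length_distribution(alignments, features=None, group_by="sample"):
    """Relative read length frequencies per group.

    features: optional set of feature types to keep (filtering step)
    group_by: attribute used to split the data into lines or panels
    """
    groups = defaultdict(Counter)
    for alignment in alignments:
        if features is not None and alignment["feature"] not in features:
            continue  # drop alignments outside the requested subset
        groups[alignment[group_by]][alignment["length"]] += 1

    distribution = {}
    for group, counts in groups.items():
        total = sum(counts.values())
        distribution[group] = {length: n / total for length, n in sorted(counts.items())}
    return distribution


toy_alignments = [
    {"sample": "s1", "feature": "CDS", "length": 28},
    {"sample": "s1", "feature": "CDS", "length": 28},
    {"sample": "s1", "feature": "CDS", "length": 30},
    {"sample": "s1", "feature": "tRNA", "length": 20},
    {"sample": "s2", "feature": "CDS", "length": 29},
]
# CDS-only distribution per sample, analogous to panel A2.
cds_only = read_length_distribution(toy_alignments, features={"CDS"})
```

Passing `group_by="feature"` instead splits the same data by feature type, which corresponds to the kind of regrouping shown in the (B) panels.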