Joshua G Dunn, Jonathan S Weissman.
Abstract
BACKGROUND: Next-generation sequencing (NGS) informs many biological questions with unprecedented depth and nucleotide resolution. These assays have created a need for analytical tools that enable users to manipulate data nucleotide-by-nucleotide robustly and easily. Furthermore, because many NGS assays encode information jointly within multiple properties of read alignments (for example, in ribosome profiling, the locations of ribosomes are jointly encoded in alignment coordinates and length), analytical tools are often required to extract the biological meaning from the alignments before analysis. Many assay-specific pipelines exist for this purpose, but there remains a need for user-friendly, generalized, nucleotide-resolution tools that are not limited to specific experimental regimes or analytical workflows.
Keywords: Bioinformatics; Genomics; Python; Ribosome profiling; Sequencing
Year: 2016 PMID: 27875984 PMCID: PMC5120557 DOI: 10.1186/s12864-016-3278-x
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
File formats used in genomics
| Data type | Format | Implementation |
|---|---|---|
| Feature annotations (e.g. genes, transcripts, exons, origins of replication) | BED, extended BED* | Plastid |
| Feature annotations | BigBed | Plastid + kentUtils |
| Feature annotations | GTF2* | Plastid |
| Feature annotations | GFF3* | Plastid |
| Feature annotations | PSL* | Plastid |
| Read alignments | bowtie | Plastid |
| Read alignments | BAM | Plastid + Pysam [27] |
| Reduced count data | bedGraph | Plastid |
| Reduced count data | BigWig | Plastid + kentUtils |
| Reduced count data | wiggle (fixedStep) | Plastid |
| Reduced count data | wiggle (variableStep) | Plastid |
| Sequence | FASTA | via Biopython |
| Sequence | twobit | via twobitreader |
For each category of genomics data, many file formats exist. Plastid includes readers for each format that standardize the representation of data for each type, so that the meaning of each data type is separated from its format on disk. *tabix compression for these formats is supported via Pysam [27]
Fig. 1 Uses of Plastid in analysis workflows. Plastid (yellow box) contains tools for both exploratory data analysis (blue, center) and command-line scripts for specific tasks (green, right). Plastid standardizes representation of data across the variety of file formats used to represent genomics data (left). Quantitative data are represented as arrays of data over the genome. Read alignments may be transformed into arrays using a mapping function appropriate to a given assay. Transcripts are represented as chains of segments that automatically account for their discontinuities during analysis. Plastid integrates directly with the SciPy stack (blue, center). For exploratory analysis in other environments (blue, above) or further processing in external programs (right, green), Plastid imports and exports data in standardized formats.
Fig. 2 Mapping functions extract biological data from read alignments. a. Mapping functions use various properties of a read alignment to determine the genomic position(s) at which it should be counted. b. Mapping functions for ribosome profiling use alignment coordinates and lengths to estimate ribosome positions, revealing features of translation, like a peak of density at the start codon (red circle) and three-nucleotide periodicity of ribosomal translocation (inset). c. For bisulfite sequencing, the fraction of C-to-T transitions at each cytosine is mapped, revealing a CpG island. d. A mapping function for DMS-seq differentiates structured from unstructured regions of a selenocysteine insertion element in the 3′ UTR of human SEPP1. DMS reactivity (blue bars) matches A and C residues predicted to be unstructured (yellow).
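The idea of a mapping function in panels a and b can be illustrated with a minimal sketch in plain Python. This is not Plastid's implementation or API: the function name `fiveprime_map`, the `(start, end, strand)` tuple layout, and the offset of 12 nucleotides are all assumptions chosen for illustration.

```python
from collections import Counter


def fiveprime_map(alignments, offset=0):
    """Count each read at a fixed offset from its 5' end.

    `alignments` is an iterable of (start, end, strand) tuples using
    0-based, half-open genomic coordinates (a hypothetical layout).
    Returns a Counter keyed by (position, strand).
    """
    counts = Counter()
    for start, end, strand in alignments:
        if offset >= end - start:
            continue  # the offset falls outside the read, so skip it
        if strand == "+":
            pos = start + offset      # 5' end is the leftmost base
        else:
            pos = end - 1 - offset    # 5' end is the rightmost base
        counts[(pos, strand)] += 1
    return counts


# Three toy ribosome-protected footprints; the offset is illustrative only.
reads = [(100, 128, "+"), (100, 130, "+"), (200, 228, "-")]
psites = fiveprime_map(reads, offset=12)
```

Both plus-strand reads are counted at position 112 even though their lengths differ, which is the behavior a fixed 5′ offset implies. An offset that depends on read length would replace the single `offset` argument with a lookup keyed by `end - start`.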
Fig. 3 SegmentChains automate many common tasks. a. SegmentChain and Transcript objects automatically convert coordinates between genomic and transcript-relative spaces. b. SegmentChains and Transcripts can therefore convert read alignments or quantitative data aligned to the genome to arrays of values at each position in the chain. c. Subsections (green, pink) of chains can be fetched using start and end points relative to the parental chains. SegmentChains automatically generate the corresponding genomic coordinates. d. Regions of a chain can be masked from computations without altering the chain coordinates.
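The coordinate conversions in panels a and c can be sketched as follows. This is a simplified stand-in, not Plastid's `SegmentChain` class: the function names are hypothetical, chains are plain lists of `(start, end)` half-open intervals sorted by position, and only the plus strand is handled.

```python
def chain_to_genome(segments, x):
    """Convert a chain-relative position `x` to a genomic coordinate."""
    if x < 0:
        raise IndexError("chain positions must be non-negative")
    for start, end in segments:
        length = end - start
        if x < length:
            return start + x
        x -= length  # skip past this segment
    raise IndexError("position lies beyond the end of the chain")


def genome_to_chain(segments, g):
    """Convert genomic coordinate `g` to a chain-relative position."""
    offset = 0
    for start, end in segments:
        if start <= g < end:
            return offset + (g - start)
        offset += end - start
    raise KeyError("genomic position is not covered by the chain")


def subchain(segments, a, b):
    """Return genomic segments covering chain-relative interval [a, b)."""
    out = []
    offset = 0  # chain position of the current segment's first base
    for start, end in segments:
        length = end - start
        lo = max(a - offset, 0)
        hi = min(b - offset, length)
        if lo < hi:
            out.append((start + lo, start + hi))
        offset += length
    return out


# A toy three-exon transcript on the plus strand.
exons = [(100, 150), (200, 260), (400, 450)]
spanning = subchain(exons, 40, 60)  # crosses the first exon-exon junction
```

In this example `spanning` covers the last ten bases of the first exon and the first ten of the second, so the discontinuity is handled without the caller tracking exon boundaries.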
Fig. 4 Plastid streamlines analysis. a. The quality of a ribosome profiling dataset may be assayed by comparing the numbers of read counts in the first versus second half of each coding region. Plastid makes it possible to implement such analyses with few lines of easily readable code. b. Plastid readily integrates with the tools in the SciPy stack. Here, first- and second-half counts from (a) are plotted against each other using matplotlib, and a Pearson correlation coefficient is calculated using SciPy.
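The comparison described in this caption can be sketched without Plastid or SciPy, assuming per-nucleotide count vectors for each coding region are already in hand. The gene names and counts below are invented toy data, and `half_counts` and `pearson_r` are hypothetical helper names; the figure itself uses Plastid objects and SciPy.

```python
import math


def half_counts(cds_counts):
    """Sum counts over the first and second halves of one coding region."""
    mid = len(cds_counts) // 2
    return sum(cds_counts[:mid]), sum(cds_counts[mid:])


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy per-nucleotide counts over three coding regions (invented data).
genes = {
    "geneA": [3, 0, 1, 2, 0, 4],
    "geneB": [10, 8, 9, 11, 7, 12],
    "geneC": [0, 1, 0, 0, 1, 0],
}

first, second = zip(*(half_counts(v) for v in genes.values()))
r = pearson_r(first, second)
```

A value of `r` close to 1 indicates that the two halves of each coding region carry similar read counts, which is the consistency check the caption describes.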
Plastid includes configurable mapping functions that cover many use cases in sequencing analysis
| Method | Map reads | Sample use |
|---|---|---|
| Fiveprime | At a fixed offset from their 5′ ends | Ribosome profiling with RNase I (e.g. yeast, human), RNA-seq |
| Threeprime | At a fixed offset from their 3′ ends | Ribosome profiling with RNase I, RNA-seq |
| Fiveprime, variable | At an offset from 5′ end determined by read length | Ribosome profiling with RNase I, RNA-seq |
| Fiveprime, variable and stratified by read length | At an offset from 5′ end determined by read length, partitioning reads of each length into separate arrays | ORF annotation from ribosome profiling data |
| Center-weighted | Fractionally over entire length, optionally trimming a fixed number of nucleotides from the 5′ and 3′ ends | Ribosome profiling with MNase (e.g. |
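The center-weighted rule in the last row of the table can be sketched as below. This is an illustrative reading of the table's description rather than Plastid's code: the function name `center_map` and the parameter name `nibble` are assumptions, and reads are given as hypothetical `(start, end)` half-open intervals.

```python
from collections import defaultdict


def center_map(alignments, nibble=0):
    """Spread each read fractionally over its trimmed length.

    `nibble` nucleotides are removed from both ends of every read, and
    the read's single count is divided evenly among the remaining
    positions. Reads too short to survive trimming are skipped.
    """
    counts = defaultdict(float)
    for start, end in alignments:
        lo, hi = start + nibble, end - nibble
        if lo >= hi:
            continue  # nothing is left after trimming
        weight = 1.0 / (hi - lo)
        for pos in range(lo, hi):
            counts[pos] += weight
    return counts


# One 10-nt read trimmed by 3 nt per side leaves 4 central positions.
profile = center_map([(100, 110)], nibble=3)
```

Because the trimming is symmetric, strand does not affect the result, and each retained read still contributes a total of one count.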
Plastid’s command-line scripts automate common analysis tasks
| Category | Task |
|---|---|
| Analysis of count and alignment data | Count the number of read alignments covering arbitrary regions of interest in the genome, and calculate read densities (in reads per nucleotide and in RPKM) over these regions |
| Analysis of count and alignment data | Count the number of read alignments and calculate read densities (in RPKM) specifically for genes and sub-regions (5′ UTR, CDS, 3′ UTR), correcting gene and sub-region boundaries for overlapping genes |
| Analysis of count and alignment data | Fetch vectors of counts at each nucleotide position in one or more regions of interest, saving each vector as its own line-delimited text file |
| Analysis of count and alignment data | Create wiggle or bedGraph files from alignment files after applying a read mapping rule (e.g. to map ribosome-protected footprints at their P-sites), for visualization in a genome browser |
| Analysis of count and alignment data | Compute a metagene profile of read alignments, counts, or quantitative data over one or more regions of interest |
| Analysis of count and alignment data | Estimate sub-codon phasing in ribosome profiling data |
| Analysis of count and alignment data | Estimate position of ribosomal P-site within ribosome profiling read alignments as a function of read length |
| Manipulation of genomic features | Empirically annotate multimapping regions of a genome, given alignment criteria |
| Manipulation of genomic features | Determine parent-child relationships of features in a GFF3 file |
| Manipulation of genomic features | Convert transcripts between BED, BigBed, GTF2, GFF3, and PSL formats |
| Manipulation of genomic features | Find all unique splice junctions in one or more transcript annotations, and optionally export these in Tophat’s .juncs format |
| Manipulation of genomic features | Compare a set of splice junctions to a reference set, and, if possible with equal sequence support, slide discovered junctions to compatible known junctions |
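Two of the quantities named in the table, read density and sub-codon phasing, reduce to short calculations once a per-nucleotide count vector is available. The sketch below uses hypothetical function names and the standard RPKM definition (reads per kilobase of region per million mapped reads); it does not reproduce the command-line scripts themselves.

```python
def reads_per_nucleotide(counts):
    """Mean number of mapped reads per position in a region."""
    return sum(counts) / len(counts)


def rpkm(counts, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    return sum(counts) * 1e9 / (len(counts) * total_mapped_reads)


def subcodon_phasing(cds_counts):
    """Fraction of counts in each codon position (frames 0, 1, 2).

    Assumes `cds_counts` starts at the first nucleotide of a codon.
    """
    frame_totals = [sum(cds_counts[frame::3]) for frame in range(3)]
    total = sum(frame_totals)
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [t / total for t in frame_totals]


# A 1,000-nt toy region with 500 reads, in a library of 10 million reads.
region = [1] * 500 + [0] * 500
density = reads_per_nucleotide(region)
region_rpkm = rpkm(region, total_mapped_reads=10_000_000)

# Toy P-site counts over three codons, enriched in the first position.
phasing = subcodon_phasing([5, 1, 0, 6, 2, 1, 4, 0, 1])
```

Strong enrichment in one frame, as in the toy `phasing` result, is the three-nucleotide periodicity expected of well-mapped ribosome profiling data.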
Fig. 5 Maximal spanning windows enable isoform-independent analysis. A maximal spanning window over a set of transcripts (or other genomic features) is defined as the largest possible window surrounding a shared landmark (in this example, a start codon; vertical line), over which the Nth nucleotide from the landmark in each transcript corresponds to the same genomic position. Maximal spanning windows thus enable position-wise analysis over fractions of genes when isoform distributions are unknown. Plastid uses maximal spanning windows for metagene analysis, measuring sub-codon phasing in ribosome profiling, and estimating ribosomal P-site offsets.
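The definition in this caption can be turned into a brute-force sketch: walk outward from the landmark in both directions and stop as soon as the transcripts disagree on the genomic position or one of them ends. The function names and the list-of-intervals representation are assumptions, only the plus strand is handled, and this is not presented as the algorithm Plastid uses internally.

```python
def positions(segments):
    """Expand (start, end) half-open segments into genomic positions."""
    return [p for start, end in segments for p in range(start, end)]


def maximal_spanning_window(transcripts, landmark):
    """Genomic positions shared, offset by offset, around `landmark`.

    `transcripts` is a list of segment lists; `landmark` is a genomic
    position present in every transcript (ValueError otherwise).
    """
    pos_lists = [positions(t) for t in transcripts]
    idxs = [pl.index(landmark) for pl in pos_lists]

    down = 0  # number of shared positions downstream of the landmark
    while (all(i + down + 1 < len(pl) for i, pl in zip(idxs, pos_lists))
           and len({pl[i + down + 1] for i, pl in zip(idxs, pos_lists)}) == 1):
        down += 1

    up = 0  # number of shared positions upstream of the landmark
    while (all(i - up - 1 >= 0 for i in idxs)
           and len({pl[i - up - 1] for i, pl in zip(idxs, pos_lists)}) == 1):
        up += 1

    ref, i0 = pos_lists[0], idxs[0]
    return ref[i0 - up:i0 + down + 1]


# Two toy isoforms sharing a start codon at 100 but splicing differently.
tx1 = [(90, 120), (200, 230)]
tx2 = [(95, 120), (200, 215), (300, 320)]
window = maximal_spanning_window([tx1, tx2], landmark=100)
```

Here the window starts where the shorter 5′ end begins and stops where the isoforms diverge after their second exon, so every offset from the landmark refers to the same genomic base in both transcripts.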
Fig. 6 Metagene profiles reveal genomic signals. Schematic of metagene analysis. Normalized arrays of quantitative data (e.g. ribosomal P-sites; top) are taken at each position in the maximal spanning windows of multiple genes. These arrays are aligned at a landmark of interest (here, a start codon), and the median value of each column (nucleotide position) is taken to be the average (bottom).
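The column-wise median described in this caption can be sketched as follows. The normalization used here (dividing each gene's window by its own mean count) and the `min_counts` filter are assumptions made for illustration; the caption only states that the arrays are normalized. Windows are assumed to be equal in length and already aligned at the landmark.

```python
from statistics import median


def metagene_profile(windows, min_counts=1):
    """Median of normalized counts at each aligned position.

    `windows` is a list of equal-length count vectors, one per gene,
    aligned so that the landmark sits at the same index in each.
    Genes with fewer than `min_counts` total counts are excluded.
    """
    normalized = []
    for counts in windows:
        total = sum(counts)
        if total < min_counts:
            continue  # too few reads to normalize meaningfully
        mean = total / len(counts)
        normalized.append([c / mean for c in counts])
    # zip(*rows) iterates over columns, i.e. nucleotide positions
    return [median(column) for column in zip(*normalized)]


# Toy windows for four genes; the third has no reads and is dropped.
windows = [
    [0, 0, 4, 2, 2],
    [0, 1, 8, 3, 4],
    [0, 0, 0, 0, 0],
    [1, 0, 5, 2, 2],
]
profile = metagene_profile(windows)
```

Using the median rather than the mean keeps a single highly expressed gene from dominating the profile, and per-gene normalization puts genes with different expression levels on a common scale before they are combined.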
Computing requirements for genomes and datasets of varying size
| Test | Organism | Run time (hh:mm:ss) | Peak memory usage (MB) |
|---|---|---|---|
| Read counting | Yeast | 00:01:18 ± 00:00:01 | 255 ± 0 |
| Read counting | Fly | 00:36:34 ± 00:00:03 | 1138 ± 7 |
| Read counting | Human | 00:19:56 ± 00:00:01 | 1053 ± 2 |
| Manipulate annotations | Yeast | 00:00:27 ± 00:00:02 | 467 ± 0 |
| Manipulate annotations | Fly | 00:03:37 ± 00:00:03 | 2620 ± 1 |
| Manipulate annotations | Human | 00:18:42 ± 00:01:49 | 4419 ± 1 |
| Export browser track | Yeast | 00:00:58 ± 00:00:00 | 281 ± 1 |
| Export browser track | Fly | 00:09:05 ± 00:00:40 | 2452 ± 7 |
| Export browser track | Human | 00:06:11 ± 00:00:03 | 537 ± 0 |
| Build crossmap | Yeast | 00:00:35 ± 00:00:00 | 100 ± 0 |
| Build crossmap | Fly | 00:10:44 ± 00:00:10 | 328 ± 7 |
| Build crossmap | Human | 04:11:51 ± 00:06:32 | 130 ± 1 |
Four command-line scripts were executed on yeast, fly, and human datasets. Runtimes and peak memory usage are given as the mean ± standard deviation of three replicates. See methods for details