Adam Roberts, Harvey Feng, Lior Pachter
Abstract
BACKGROUND: Probabilistic assignment of ambiguously mapped fragments produced by high-throughput sequencing experiments has been demonstrated to greatly improve accuracy in the analysis of RNA-Seq and ChIP-Seq, and is an essential step in many other sequence census experiments. A maximum likelihood method using the expectation-maximization (EM) algorithm for optimization is commonly used to solve this problem. However, batch EM-based approaches do not scale well with the size of sequencing datasets, which have been increasing dramatically over the past few years. Thus, current approaches to fragment assignment rely on heuristics or approximations for tractability.
Year: 2013 PMID: 24314033 PMCID: PMC3881492 DOI: 10.1186/1471-2105-14-358
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Accuracy Comparison for Synthetic Data. Accuracy of eXpress, eXpress-D, RSEM, and Cufflinks at multiple sequencing depths in a simulation of a billion fragments (read pairs) generated with (B) and without (A) sequence-specific bias. Expanded from Figure 2 in [10] and produced using the same synthetic datasets. Each algorithm was presented with the same multisized subsets of 1 billion simulated fragments, and the Spearman rank correlation coefficient was calculated between the resulting estimates and the ground-truth abundance values used in the simulation. The stars mark where each software package crashed or was halted due to the test machine's memory constraint (512 GB for all but eXpress-D).
Figure 2. Method overview. The top portion (A) shows the procedure for running the distributed batch EM algorithm with Spark, ignoring sequence-specific bias. First, blocks of partitioned alignments (yellow) and targets (magenta) are distributed to the slave nodes, stored in memory as RDDs, and cached. An initial set of parameter estimates (green with black symbols) is broadcast to the slaves holding alignments. The alignments on each slave are probabilistically assigned, and new parameter estimates are partially accumulated (green with white symbols) on each node. These partial estimates are then sent back to the master to be fully combined and re-broadcast for the next round. When sequence-specific bias is enabled, additional processing (B) takes place between some rounds. The parameter estimates are sent to the slaves holding targets, and the expected sequence bias given the current abundance estimates is accumulated and normalized on the master node to produce updated weights [9]. Auxiliary parameters (error, bias, and fragment lengths) are fixed after 1000 rounds, and the estimation procedure stops when the abundance parameters converge.
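The round structure in panel (A) can be sketched in plain Python under simplifying assumptions: eXpress-D itself runs on Spark (in Scala), the partition loop below stands in for the map/reduce over cached RDDs, and fragment-length, error, and bias terms are omitted. The toy fragments and starting abundances are made up for illustration.

```python
from collections import defaultdict

def em_round(partitions, abundances):
    """One batch-EM round: probabilistically assign each fragment's
    alignments under the current abundances, then pool the partial
    expected counts, mirroring the per-slave accumulate / master
    combine step in Figure 2(A)."""
    totals = defaultdict(float)
    for alignments in partitions:          # each slave's cached alignment block
        partial = defaultdict(float)       # partial accumulator on that slave
        for frag in alignments:            # frag = candidate target IDs
            weights = {t: abundances[t] for t in frag}
            z = sum(weights.values())
            for t, w in weights.items():   # E-step: fractional assignment
                partial[t] += w / z
        for t, c in partial.items():       # partials sent back to the master
            totals[t] += c
    n = sum(totals.values())               # M-step: renormalize to abundances
    return {t: c / n for t, c in totals.items()}

# Toy data: two partitions of fragments, some ambiguously mapped to "a"/"b"
parts = [[["a", "b"], ["a"], ["a"]], [["a", "b"], ["b"]]]
est = {"a": 0.5, "b": 0.5}
for _ in range(50):                        # re-broadcast and iterate
    est = em_round(parts, est)
```

With two unique-to-`a` fragments, one unique-to-`b`, and two ambiguous ones, the estimates converge to 2/3 and 1/3, splitting the ambiguous mass proportionally.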
Protocol Buffer specifications
| Message / Field | Type | Description |
| Fragment | | |
| name | string | Unique query name of fragment in SAM file |
| paired | bool | Boolean specifying if both ends were sequenced |
| alignments | FragmentAlignments | Collection of alignments for fragment |
| FragmentAlignment | | |
| target_id | uint32 | ID of target aligned to (index in SAM header) |
| read_l | ReadAlignment | Alignment information for 5' (left) read, if it exists |
| read_r | ReadAlignment | Alignment information for 3' (right) read, if it exists |
| ReadAlignment | | |
| first | bool | Boolean specifying if this end was sequenced first |
| left_pos | uint32 | 0-based left endpoint of alignment to reference |
| right_pos | uint32 | 0-based right endpoint of alignment to reference |
| mismatch_indices | byteArray | Positions in read that differ from reference |
| mismatch_nucs | byteArray | Nucleotides in read at mismatches, 2 bits/nuc |
| Target | | |
| name | string | Unique name of target sequence |
| id | uint32 | Index of target in SAM header |
| length | uint32 | Number of nucleotides in target sequence |
| seq | byteArray | Nucleotides of target sequence, 2 bits/nuc |
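The `mismatch_nucs` and `seq` fields store nucleotides at 2 bits per base. A minimal Python sketch of such a packing is below; the A/C/G/T to 0-3 code and the left-aligned bit order are illustrative assumptions, since the spec above fixes only the density.

```python
# 2-bits-per-nucleotide packing, as used for the byteArray fields above.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed encoding, 4 bases/byte

def pack_2bit(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(seq[i:i + 4]))  # left-align a short final group
        out.append(byte)
    return bytes(out)

def unpack_2bit(data: bytes, length: int) -> str:
    """Decode `length` bases; the byte array alone cannot recover the
    length, which is why Target carries an explicit `length` field."""
    bases = "ACGT"
    out = []
    for i in range(length):
        shift = 2 * (3 - i % 4)
        out.append(bases[(data[i // 4] >> shift) & 0b11])
    return "".join(out)
```

This cuts sequence storage to a quarter of one-byte-per-base ASCII, which matters when whole target sequences are cached in memory on the slaves.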
eXpress pre-processes the input data (SAM/BAM and FASTA files) and converts it to a format compatible with the distributed file system's partitioning scheme. The information for each target and fragment is put into a space-efficient Protocol Buffer that retains only the information necessary for optimization, which is then serialized and encoded in base64. Each target or fragment takes up exactly one line in the file created for input into eXpress-D.
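The one-record-per-line layout can be sketched as follows. JSON stands in here for the Protocol Buffer serialization the real tool uses (so the payloads differ), but the base64 step and the newline-free line-per-record property are the same, which is what lets the distributed file system split the file at arbitrary line boundaries.

```python
import base64
import json

def encode_records(records):
    """Serialize each target/fragment record and base64-encode it so
    that every record occupies exactly one line with no embedded
    newlines (JSON is a stand-in for the Protocol Buffer encoding)."""
    lines = []
    for rec in records:
        payload = json.dumps(rec, separators=(",", ":")).encode()
        lines.append(base64.b64encode(payload).decode())
    return "\n".join(lines)

def decode_records(text):
    """Inverse of encode_records: one base64 line -> one record."""
    return [json.loads(base64.b64decode(line)) for line in text.splitlines()]

# Hypothetical Target records matching the spec above
recs = [{"name": "t1", "id": 0, "length": 8},
        {"name": "t2", "id": 1, "length": 4}]
text = encode_records(recs)
```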
Figure 3. Runtimes. Average time required for 100 iterations on EC2 for different amounts of input data, running on data simulated with (purple) and without (teal) sequence-specific bias. For the with-bias runs, the timing is for iterations after the first 20, which take a constant 30 minutes to learn the bias model. The cluster size is scaled as 3 slave nodes (6 cores) for each 50 million fragments. The results show that eXpress-D running on Spark maintains constant runtime when resources are scaled linearly with the amount of data.
Slope of timing and resource curves
| Method | Runtime slope (mpmf) | Resource slope (cpmf) |
| eXpress-D | 0.05 | 0.12 |
| | 1.8 | 0 |
| | 6 | 0 |
| | 27 | 0 |
We computed the mean slope between the runtime samples from Figure 3 of this manuscript and Figure 2(b) of [10] to compare the scaling of the four methods in units of minutes per million fragments (mpmf). While eXpress, Cufflinks, and RSEM were all run on the same machine fixed at 8 cores and 24 GB RAM, the resources of eXpress-D were increased at a rate of 0.12 cores per million fragments (cpmf), allowing it to scale in approximately constant time.
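The mpmf figure is just the average slope between consecutive (fragments, runtime) sample points. A short sketch, with made-up sample points chosen only to illustrate the calculation (not the published measurements):

```python
def mean_slope(samples):
    """Mean slope between consecutive (fragments_in_millions, minutes)
    samples, i.e. a runtime-scaling figure in minutes per million
    fragments (mpmf)."""
    slopes = [(t2 - t1) / (x2 - x1)
              for (x1, t1), (x2, t2) in zip(samples, samples[1:])]
    return sum(slopes) / len(slopes)

# Hypothetical samples: 50M fragments -> 10 min, 100M -> 12.5, 200M -> 17.5
mpmf = mean_slope([(50, 10.0), (100, 12.5), (200, 17.5)])  # -> 0.05 mpmf
```

A near-zero mpmf, as reported for eXpress-D, indicates runtime stays roughly flat as data grows, because resources grow with it at the stated cpmf rate.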