| Literature DB >> 26846841 |
Christos Kozanitis, David A. Patterson.
Abstract
BACKGROUND: The impressively low cost and improved quality of genome sequencing provides to researchers of genetic diseases, such as cancer, a powerful tool to better understand the underlying genetic mechanisms of those diseases and treat them with effective targeted therapies. Thus, a number of projects today sequence the DNA of large patient populations each of which produces at least hundreds of terra-bytes of data. Now the challenge is to provide the produced data on demand to interested parties.Entities:
Year: 2016 PMID: 26846841 PMCID: PMC4741060 DOI: 10.1186/s12859-016-0904-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 An example interval tree. This tree stores the intervals [1,5], [7,15], [16,19], [20,25], and [22,28]. Each node consists of two parts: the top part is the node's key, the midpoint of the span covered by all intervals of the subtree, and the bottom part is the list of intervals that overlap the key. In this example the intervals together span [1,28]; since the midpoint of that span is 13, the root of the tree is keyed by 13 and stores interval [7,15], the only interval in the collection that overlaps 13. The intervals that do not overlap the midpoint are used to build the left and right subtrees recursively, so that all intervals of the left subtree end before the midpoint and all intervals of the right subtree start after it
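The construction described in the caption can be sketched in a few lines of Python. This is an illustrative centered interval tree, not the paper's implementation; all names (`Node`, `build`, `query`) are ours, and for simplicity the key is computed as the integer midpoint of the subtree's span.

```python
# Hedged sketch of the centered interval tree in Fig. 1.
class Node:
    def __init__(self, key, overlaps, left, right):
        self.key = key            # midpoint of the span of this subtree's intervals
        self.overlaps = overlaps  # intervals that contain the key
        self.left = left
        self.right = right

def build(intervals):
    """Recursively build a centered interval tree from (start, end) pairs."""
    if not intervals:
        return None
    lo = min(s for s, e in intervals)
    hi = max(e for s, e in intervals)
    mid = (lo + hi) // 2
    center = [iv for iv in intervals if iv[0] <= mid <= iv[1]]
    left = [iv for iv in intervals if iv[1] < mid]    # end before the midpoint
    right = [iv for iv in intervals if iv[0] > mid]   # start after the midpoint
    return Node(mid, center, build(left), build(right))

def query(node, point):
    """Return all stored intervals that contain `point`."""
    if node is None:
        return []
    hits = [iv for iv in node.overlaps if iv[0] <= point <= iv[1]]
    if point < node.key:
        hits += query(node.left, point)
    elif point > node.key:
        hits += query(node.right, point)
    return hits

tree = build([(1, 5), (7, 15), (16, 19), (20, 25), (22, 28)])
query(tree, 17)  # -> [(16, 19)]
```

Because each level of the recursion splits the remaining intervals around a midpoint, a point lookup visits one node per level, giving logarithmic query time for reasonably balanced input.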
Fig. 2 Our distributed range-join architecture. In this figure, distributed table A joins with distributed table B on genomic-interval overlap. Table A goes through a number of transformations that let the Spark driver build an interval forest storing index pointers into the original data. The driver then propagates the interval forest to the workers, which transform table B by performing interval lookups on the forest. The result of this operation is table T1, which contains tuples of data from table B together with pointers to data of table A. To materialize the result, we join T1 with table A1 on those pointers, obtaining table T, the final result of the operation. The text under each table shows the data type of the contents of each table
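The dataflow in Fig. 2 can be sketched in plain Python, with no Spark dependency. This is a simplification under stated assumptions: rows are `(start, end, payload)` tuples, the "index" is a flat list probed by linear scan rather than the paper's interval forest, and the function names (`build_index`, `probe`, `range_join`) are ours.

```python
# Hedged sketch of the broadcast range-join pattern in Fig. 2 (no Spark).
def build_index(table_a):
    # Built once on the driver; maps each interval of table A to a
    # pointer (row id) into the original data.
    return [(start, end, row_id) for row_id, (start, end, _) in enumerate(table_a)]

def probe(index, b_row):
    # Return pointers to table A rows whose interval overlaps b_row's interval.
    b_start, b_end, _ = b_row
    return [row_id for a_start, a_end, row_id in index
            if a_start <= b_end and b_start <= a_end]

def range_join(table_a, table_b):
    index = build_index(table_a)  # broadcast to every worker in the real system
    # Table T1: each row of B paired with pointers into A (computed per partition).
    t1 = [(b, probe(index, b)) for b in table_b]
    # Materialize table T by resolving the pointers back to rows of A.
    return [(table_a[p], b) for b, ptrs in t1 for p in ptrs]
```

In the distributed setting the small index is shipped to every worker, so table B never has to be shuffled on the join key; only the pointer-resolution step touches table A's data again.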
Performance of different implementations of interval-keyed joins
| Method | Runtime (hr) |
|---|---|
| Interval Tree (this paper) | 0.28 |
| Shuffle Join (ADAM) | 2.5 |
| Brute force (default Spark SQL) | >14 |
Fig. 3 Scalability of the range join. The execution time drops linearly as the cluster size increases
Comparison between Spark SQL and existing software methods used to retrieve genomic data. Complex queries that today need more than a hundred lines of code to implement take an order of magnitude fewer lines of code in Spark SQL without sacrificing performance
| Software tool | Lines of code | Runtime (min) |
|---|---|---|
| Spark SQL | 11 | 16 |
| State-of-the-art software: BEDtools | 1 | 26 |
| State-of-the-art software: samtools API-based code | 130 | 1 |
| State-of-the-art software: total | 131 | 27 |
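To make the comparison concrete, an interval-overlap join can be expressed as a single declarative query. The sketch below uses SQLite rather than Spark SQL, and the table and column names are illustrative; note that a plain SQL engine evaluates this predicate as the brute-force cartesian product listed in the runtime table above, which is exactly what the interval-tree join avoids.

```python
import sqlite3

# Hedged example: an overlap join written as declarative SQL.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE variants (vstart INT, vend INT, name TEXT);
CREATE TABLE genes    (gstart INT, gend INT, gene  TEXT);
INSERT INTO variants VALUES (4, 12, 'v1'), (30, 40, 'v2');
INSERT INTO genes    VALUES (1,  5, 'g1'), (10, 20, 'g2');
""")
# Two intervals overlap iff each one starts before the other ends.
rows = con.execute("""
SELECT v.name, g.gene
FROM variants v JOIN genes g
  ON v.vstart <= g.gend AND g.gstart <= v.vend
ORDER BY v.name, g.gene
""").fetchall()
# rows == [('v1', 'g1'), ('v1', 'g2')]
```

The query itself is a handful of lines, which is the point of the comparison: the declarative form stays short, and an engine with an interval-aware join strategy can execute it efficiently.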