Chengsheng Mao, Alal Eran, Yuan Luo.
Abstract
Efficient large-scale annotation of genomic intervals is essential for personal genome interpretation in the realm of precision medicine. There are 13 possible relations between two intervals according to Allen's interval algebra. Conventional interval trees are routinely used to identify the genomic intervals satisfying a coarse relation with a query interval, but cannot support efficient queries for more refined relations, such as each of Allen's relations individually. We design and implement a novel approach to address this unmet need. By rewriting Allen's interval relations, we transform an interval query into a range query, then adapt and utilize range trees for querying. We implement two types of range trees, a basic 2-dimensional range tree (2D-RT) and an augmented range tree with fractional cascading (RTFC), and compare them with the conventional interval tree (IT). Theoretical analysis shows that, among the three trees, RTFC achieves the best time complexity for interval queries regarding all of Allen's relations. We also perform comparative experiments on the efficiency of RTFC, 2D-RT and IT in querying noncoding element annotations in a large collection of personal genomes. Our experimental results show that 2D-RT is more efficient than IT for interval queries regarding most of Allen's relations, and that RTFC is even more efficient than 2D-RT. The results demonstrate that RTFC is an efficient data structure for querying large-scale datasets regarding Allen's relations between genomic intervals, such as those required for interpreting genome-wide variation in large populations.
Year: 2019 PMID: 30911095 PMCID: PMC6434014 DOI: 10.1038/s41598-019-41451-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. An example of interval queries regarding four different intersection relations, i.e., overlapping from the front (o), overlapping from behind (oi), contains (di) and contained in (d). The four types of interval queries are based on the query interval q and the data interval set S.
Allen’s interval relations and their transformation to the bound range for range query. Note: [x, y] is an interval from the data interval set, and [x′, y′] is the query interval.
[Table: columns are Symbols, Relation, Illustration, Definition, and Rewriting as range query; the row contents are not recoverable here.]
Figure 2. If an interval is associated with a 2-dimensional point, an interval query can be transformed into a range query. The query interval [x′, y′] is associated with the point (x′, y′), and the interval query regarding each of Allen’s interval relations corresponds to the range query over the indicated area (marked in the figure by the corresponding relation symbol; the “=” query corresponds exactly to the point (x′, y′)). Because the start of an interval is less than or equal to its end, points in the lower-right area below the line y = x do not correspond to valid intervals.
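The transformation described in this caption can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions, not the paper's own table: it assumes integer coordinates (so a strict inequality such as x < x′ becomes x ≤ x′ − 1), proper intervals with start < end, and that each relation is read as "data interval relative to query interval". The names `allen_to_box`, `in_box` and `classify` are hypothetical.

```python
import math

INF = math.inf
RELATIONS = ["<", "m", "o", "s", "d", "f", "=", "fi", "di", "si", "oi", "mi", ">"]

def allen_to_box(rel, qx, qy):
    """Return a closed box (x_lo, x_hi, y_lo, y_hi) over data points (x, y).

    Assumptions (not from the source): integer coordinates, x < y for data
    intervals, qx < qy for the query, and 'rel' describes the data interval
    relative to the query interval [qx, qy].
    """
    boxes = {
        "<":  (-INF, INF, -INF, qx - 1),        # y < qx
        "m":  (-INF, INF, qx, qx),              # y = qx
        "o":  (-INF, qx - 1, qx + 1, qy - 1),   # x < qx < y < qy
        "s":  (qx, qx, -INF, qy - 1),           # x = qx, y < qy
        "d":  (qx + 1, INF, -INF, qy - 1),      # qx < x, y < qy
        "f":  (qx + 1, INF, qy, qy),            # qx < x, y = qy
        "=":  (qx, qx, qy, qy),                 # x = qx, y = qy
        "fi": (-INF, qx - 1, qy, qy),           # x < qx, y = qy
        "di": (-INF, qx - 1, qy + 1, INF),      # x < qx, qy < y
        "si": (qx, qx, qy + 1, INF),            # x = qx, qy < y
        "oi": (qx + 1, qy - 1, qy + 1, INF),    # qx < x < qy < y
        "mi": (qy, qy, -INF, INF),              # x = qy
        ">":  (qy + 1, INF, -INF, INF),         # qy < x
    }
    return boxes[rel]

def in_box(x, y, box):
    """Closed-box membership test for the point (x, y)."""
    x_lo, x_hi, y_lo, y_hi = box
    return x_lo <= x <= x_hi and y_lo <= y <= y_hi

def classify(x, y, qx, qy):
    """Return every relation whose box contains the data interval [x, y]."""
    return [r for r in RELATIONS if in_box(x, y, allen_to_box(r, qx, qy))]
```

For example, `classify(1, 5, 3, 7)` returns `['o']`: the data interval [1, 5] overlaps the query [3, 7] from the front. Under the stated assumptions, every valid data interval falls into exactly one of the 13 boxes.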
Figure 3. A simple example of search by fractional cascading. Given a value v, the query searches for the minimum value no less than v in each of A1 and A2. (a) Create the indexing array I from A1 to A2; we refer to such an index as an FC-index. (b) Suppose v ∈ (v3, v4]; the binary search for v in A1 returns v4 and its position 4, so the corresponding FC-index into A2 is I[4] = 3. (c) Directly return A2[I[4]] = v6 as the search result from A2.
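The search in this caption can be sketched as follows. This is a minimal illustration that assumes both arrays are sorted and that every value of A2 also occurs in A1 (as is the case for a child array in a range tree); positions are 0-based here, unlike the 1-based positions in the caption, and the function names are hypothetical.

```python
from bisect import bisect_left

def build_fc_index(a1, a2):
    """For each position i in a1, store the position in a2 of the smallest
    value >= a1[i]. A final sentinel entry maps 'past the end of a1' to
    'past the end of a2'."""
    index, j = [], 0
    for v in a1:
        while j < len(a2) and a2[j] < v:
            j += 1
        index.append(j)
    index.append(len(a2))
    return index

def fc_search(a1, index, v):
    """One binary search in a1, then a constant-time lookup for a2.

    Returns (i, j): positions of the smallest value >= v in a1 and in a2
    (each equal to the array length if no such value exists)."""
    i = bisect_left(a1, v)
    return i, index[i]

a1 = [1, 3, 5, 7, 9, 11]
a2 = [3, 7, 11]              # assumed to be a subset of a1
fc = build_fc_index(a1, a2)
i, j = fc_search(a1, fc, 6)  # a1[i] == 7 and a2[j] == 7
```

The lookup in `a2` is correct only because `a2` contributes no value that lies strictly between `v` and `a1[i]`; the subset assumption guarantees this.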
Figure 4. The three tree structures for a data interval set with n intervals. For brevity of illustration, the pointer fields of each node pointing to the left, right or parent node are omitted inside the node, and represented by the corresponding arrows between nodes. (a) The data structure of range tree with fractional cascading (RTFC). (b) The data structure of the basic 2-dimensional range tree (2D-RT). (c) The data structure of interval tree (IT) implemented as an augmented RB-tree (red-black tree) with n nodes.
The algorithm to build a range tree with fractional cascading. The input is the original interval set; the output is the root node of the resulting range tree.
| Step | Operation |
|---|---|
| 1. | Sort … |
| 2. | |
| 3. | Create a leaf node … |
| 4. | |
| 5. | Split … |
| 6. | Sort … |
| 7. | Create an FC-index … |
| 8. | Create an FC-index … |
| 9. | |
| 10. | |
| 11. | Create a node … |
| 12. | |
| 13. | |
Figure 5. An example to illustrate the construction of RTFC and the query using RTFC. (a) The original interval set and the corresponding indices. (b) The resulting tree structure and the query processes. The red arrow lines represent the search path of x1 = 2.5 and x2 = 9.5. The item bounded by a red box in a node represents the searched index in the data array for y1 = 2.5. The green shaded intervals represent the result intervals. (c) Range query transformed from interval query.
The algorithm to execute an interval query using the range tree with fractional cascading. The inputs are the range tree, the query interval, and the relation. The output is the set of all intervals in the range tree that satisfy the relation with the query interval.
| Step | Operation |
|---|---|
| 1. | Transform interval query with respect to interval … |
| 2. | Initialize the result set … |
| 3. | Find the split node … |
| 4. | |
| 5. | |
| 6. | Report the interval in … |
| 7. | |
| 8. | |
| 9. | Perform binary search on … |
| 10. | |
| 11. | |
| 12. | |
| 13. | j = … |
| 14. | |
| 15. | Report interval, … |
| 16. | |
| 17. | |
| 18. | |
| 19. | |
| 20. | |
| 21. | |
| 22. | |
| 23. | |
| 24. | Report the interval in … |
| 25. | |
| 26. | |
| 27. | |
| 28. | |
| 29. | |
| 30. | |
| 31. | Report interval, … |
| 32. | |
| 33. | |
| 34. | |
| 35. | |
| 36. | |
| 37. | |
| 38. | |
| 39. | |
| 40. | Report the interval in … |
| 41. | |
| 42. | |
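To make the build and query procedures concrete, here is a compact Python sketch of a range tree with fractional cascading over points (x, y) = (start, end). It is an illustration under assumptions rather than the authors' implementation: it uses a recursive descent over nodes instead of an explicit split-node walk, it works on closed query boxes, it requires a non-empty input, and the names `Node`, `fc_index` and `range_query` are hypothetical.

```python
import math
from bisect import bisect_left

def fc_index(parent, child):
    """FC-index: for each position in parent (plus a sentinel), the position
    in child of the first entry >= the parent entry."""
    idx, j = [], 0
    for item in parent:
        while j < len(child) and child[j] < item:
            j += 1
        idx.append(j)
    idx.append(len(child))
    return idx

class Node:
    def __init__(self, pts):
        # pts: non-empty list of (x, y) tuples, already sorted by x
        self.min_x, self.max_x = pts[0][0], pts[-1][0]
        self.left = self.right = None
        if len(pts) == 1:
            self.ys = [(pts[0][1], pts[0][0])]   # stored as (y, x)
            return
        mid = len(pts) // 2
        self.left, self.right = Node(pts[:mid]), Node(pts[mid:])
        # y-sorted array of the whole subtree, plus FC-indices to both children
        self.ys = sorted(self.left.ys + self.right.ys)
        self.to_left = fc_index(self.ys, self.left.ys)
        self.to_right = fc_index(self.ys, self.right.ys)

def range_query(root, x1, x2, y1, y2):
    """Report all points with x1 <= x <= x2 and y1 <= y <= y2."""
    out = []

    def visit(node, i):
        # i is the position of the first entry with y >= y1 in node.ys
        if i >= len(node.ys) or node.max_x < x1 or node.min_x > x2:
            return
        if x1 <= node.min_x and node.max_x <= x2:
            # x-range fully covered: scan the y-array from i while y <= y2
            while i < len(node.ys) and node.ys[i][0] <= y2:
                y, x = node.ys[i]
                out.append((x, y))
                i += 1
            return
        # partial overlap: follow the FC-indices, no further binary search
        visit(node.left, node.to_left[i])
        visit(node.right, node.to_right[i])

    # the only binary search happens once, at the root
    visit(root, bisect_left(root.ys, (y1, -math.inf)))
    return out

intervals = sorted([(1, 4), (2, 3), (2, 9), (5, 6), (5, 6), (7, 8), (0, 10), (3, 3)])
root = Node(intervals)
hits = range_query(root, 2, 5, 3, 6)   # starts in [2, 5], ends in [3, 6]
```

An interval query for a given relation would first be rewritten as a box (x1, x2, y1, y2) and then passed to `range_query`. Only the root performs a binary search; every other node reuses the position handed down through the FC-index.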
Total number of intervals in our experiments.
| | ENCODE intervals | gnomAD intervals |
|---|---|---|
| Total number | 1,340,125,581 | 241,056,551 |
The total building time (in seconds) of the three tree structures on ENCODE genomic intervals for the 23 chromosomes.
| | RTFC | 2D-RT | IT |
|---|---|---|---|
| Total building time (s) | 6971.58 | 11569.15 | |
Abbreviations: RTFC = range tree with fractional cascading; 2D-RT = basic 2-dimensional range tree; IT = interval tree.
The query time of the 11 intersection queries (in seconds) and the corresponding result set sizes with gnomAD intervals as query intervals and ENCODE intervals as data intervals.
| Query | RTFC | 2D-RT | IT | Result size |
|---|---|---|---|---|
| o | | 258.18 | 1774.81 | 36,731,416 |
| oi | | 30.57 | 128.20 | 36,573,171 |
| d | | 29.00 | 132.00 | 96,633 |
| di | | 986.35 | 1935.54 | 47,218,140,890 |
| s | | 137.67 | 128.68 | 3,005 |
| si | 144.63 | 156.85 | | 113,873,940 |
| f | | 43.68 | 126.52 | 3,220 |
| fi | | 554.89 | 1733.57 | 113,684,059 |
| m | | 548.73 | 1729.51 | 113,785,360 |
| mi | | 155.19 | 134.88 | 114,013,875 |
| = | 147.69 | 157.90 | | 417 |
The coarse interval query types in the “findOverlaps” function and the corresponding refined relations in Allen’s interval algebra.
| Coarse types | Refined relations | NCList query time | RTFC query time |
|---|---|---|---|
| “any” | | 5654.42 | |
| “within” | | 942.71 | |
| “start” | | 1007.73 | |
| “end” | | 1033.35 | |
| “equal” | | 968.37 | |
Column “NCList query time” indicates the query time (in seconds) using the “findOverlaps” function with gnomAD intervals as query intervals and ENCODE intervals as data intervals. Column “RTFC query time” indicates the time (in seconds) to execute all the interval queries for the corresponding refined relations using RTFC.
The time consumption (in seconds) of BEDTools and BEDOPS for coarse interval intersection queries, compared with RTFC.
| Tools | BEDTools | BEDOPS* sort | BEDOPS* query | BEDOPS* total | RTFC build | RTFC query | RTFC total |
|---|---|---|---|---|---|---|---|
| Time consumption | 126911.72 | 2111.65 | 2388.56 | 4500.21 | 6971.58 | | 8797.40 |
The data intervals are ENCODE intervals and the query intervals are gnomAD intervals. *Unlike RTFC and BEDTools, BEDOPS cannot return detailed intersection information; it returns only the subset of overlapping intervals in the first BED file and ignores the overlapping intervals in the second BED file.