Laura K Wiley, R Michael Sivley, William S Bush.
Abstract
Efficient storage and retrieval of genomic annotations based on range intervals is necessary, given the amount of data produced by next-generation sequencing studies. The indexing strategies of relational database systems (such as MySQL) greatly inhibit their use in genomic annotation tasks. This has led to the development of stand-alone applications that are dependent on flat-file libraries. In this work, we introduce MyNCList, an implementation of the NCList data structure within a MySQL database. MyNCList enables the storage, update and rapid retrieval of genomic annotations from the convenience of a relational database system. Range-based annotations of 1 million variants are retrieved in under a minute, making this approach feasible for whole-genome annotation tasks. Database URL: https://github.com/bushlab/mynclist
Year: 2013 PMID: 23894185 PMCID: PMC3724366 DOI: 10.1093/database/bat056
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Nested intervals disrupt ordering and query strategies. Consider two historic rosters and the question: who played in 2008 (marked by the red line)? In the non-nested example, sorting players by their first year on the team gives the same order as sorting by their last year. We can therefore use a traditional index on start and end positions to scan backwards quickly, stopping at the 2004–2007 range (player C). In the nested example, sorting players by their last year gives a different order than sorting by their first year. If we applied the same reverse search used in the non-nested example and stopped at the 2004–2007 range (player N), we would skip player M. We must therefore search the entire set of players when intervals are nested.
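The failure described in the Figure 1 caption can be reproduced with a short Python sketch. The rosters below are hypothetical: only the 2004–2007 range and the roles of players C, N and M come from the caption, and every other name and year is an assumption made for illustration. The function names are likewise invented here and are not part of MyNCList.

```python
# Hypothetical rosters as (player, first_year, last_year); closed intervals.
NON_NESTED = [("A", 2000, 2003), ("B", 2002, 2005), ("C", 2004, 2007),
              ("D", 2006, 2009), ("E", 2008, 2011)]
NESTED = [("L", 2000, 2002), ("M", 2001, 2010), ("N", 2004, 2007),
          ("O", 2006, 2009), ("P", 2008, 2011)]


def played_in(roster, year):
    """Full scan of every interval: always correct, but touches every row."""
    return sorted(name for name, first, last in roster if first <= year <= last)


def played_in_early_stop(roster, year):
    """Scan backwards by first year and stop at the first interval that ends
    before the query year. This shortcut is only valid when sorting by first
    year and sorting by last year give the same order (no nesting)."""
    hits = []
    for name, first, last in sorted(roster, key=lambda r: r[1], reverse=True):
        if first > year:
            continue  # started after the query year, cannot match
        if last < year:
            break     # assumes every earlier interval also ended earlier
        hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    for label, roster in (("non-nested", NON_NESTED), ("nested", NESTED)):
        print(label, played_in(roster, 2008), played_in_early_stop(roster, 2008))
```

On the nested roster the early-stop scan halts at player N and never reaches player M, whose interval encloses N's, whereas the full scan still finds M.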
Figure 2. The NCList algorithm and update procedure. (A) The original interval organization, in which contiguous overlapping intervals are grouped and each interval points to a sublist of the intervals it completely overlaps. (B) Transition structure highlighting the intervals to be added (bolded, with alphabetic labels). (C) The completed structure, with the inserted intervals fully incorporated.
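The organization in panel (A) can be illustrated with a minimal in-memory Python sketch of a nested containment list. This is not the MyNCList implementation, which lives in MySQL, and it does not cover the update procedure of panels (B) and (C). The class, function names and sample intervals are assumptions made for illustration, and intervals are treated as closed.

```python
class Node:
    """One interval plus the sublist of intervals it fully contains."""
    def __init__(self, start, end, label):
        self.start, self.end, self.label = start, end, label
        self.sublist = []


def build_nclist(intervals):
    """Build nested lists from (start, end, label) tuples.
    Sorting by start ascending, then end descending, ensures a containing
    interval is always visited before the intervals it contains."""
    top, stack = [], []
    for start, end, label in sorted(intervals, key=lambda iv: (iv[0], -iv[1])):
        node = Node(start, end, label)
        # Pop ancestors that end too early to contain the new interval.
        while stack and stack[-1].end < end:
            stack.pop()
        (stack[-1].sublist if stack else top).append(node)
        stack.append(node)
    return top


def _first_ending_at_or_after(nodes, pos):
    """Binary search: within one list no interval contains another, so ends
    are sorted and the first candidate overlap can be found in O(log n)."""
    lo, hi = 0, len(nodes)
    while lo < hi:
        mid = (lo + hi) // 2
        if nodes[mid].end < pos:
            lo = mid + 1
        else:
            hi = mid
    return lo


def query(nodes, qstart, qend):
    """Return labels of all intervals overlapping [qstart, qend]."""
    hits = []
    i = _first_ending_at_or_after(nodes, qstart)
    while i < len(nodes) and nodes[i].start <= qend:
        hits.append(nodes[i].label)
        hits.extend(query(nodes[i].sublist, qstart, qend))
        i += 1
    return hits


# Hypothetical intervals; M fully contains N and O.
INTERVALS = [("L", 2000, 2002), ("M", 2001, 2010), ("N", 2004, 2007),
             ("O", 2006, 2009), ("P", 2008, 2011)]
INTERVALS = [(s, e, name) for name, s, e in INTERVALS]

if __name__ == "__main__":
    nclist = build_nclist(INTERVALS)
    print(sorted(query(nclist, 2008, 2008)))
```

Because contained intervals are moved into sublists, each list stays sorted by both start and end, so a query can stop scanning a list as soon as an interval starts after the query range while still descending into the sublists of every hit.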
Figure 3. Query performance of MyNCList compared with partition-based and multiple-indexing strategies in MySQL. Performance is shown in annotations per second (A) and in raw execution time (B).