Eric Dun Ho, Qin Cao, Sau Dan Lee, Kevin Y Yip.
Abstract
BACKGROUND: High-throughput experimental methods have fostered the systematic detection of millions of genetic variants from any human genome. To help explore the potential biological implications of these genetic variants, software tools have been previously developed for integrating various types of information about these genomic regions from multiple data sources. Most of these tools were designed either for studying a small number of variants at a time, or for local execution on powerful machines.
Year: 2014 PMID: 25306238 PMCID: PMC4210471 DOI: 10.1186/1471-2164-15-886
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Schematic illustration of the VAS workflow. Genomic features are pre-sorted and stored in data files with pointers for direct access to particular genomic locations. A user supplies the list of genetic variants and selects the genomic features to integrate with the variants at the client side. The variants extractor produces a compressed form of the input variants. The task is then sent to the backend and put into a waiting queue, and the user is shown a waiting page. When an execution daemon becomes available, it fetches the next task in the queue and uses the customized algorithms to perform data integration. The integration results are stored in a tab-delimited file. The user will then be shown a summary page of the integration results. An email notification will also be sent, with a link for the user to retrieve the summary page later. The user can then view the integration details of each input variant, perform interactive analysis on the UCSC Genome Browser, or download the annotation results in tab-delimited or Excel format.
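The queueing step of this workflow can be illustrated with a minimal sketch. The `TaskQueue` class, its method names, and the `toy_integrate` function are hypothetical and are not taken from VAS; the sketch only assumes a first-in-first-out queue served by a single worker, with results stored as tab-delimited text.

```python
from collections import deque


class TaskQueue:
    """Hypothetical FIFO queue of annotation tasks (not the VAS implementation)."""

    def __init__(self):
        self._waiting = deque()
        self.results = {}  # task_id -> tab-delimited text

    def submit(self, task_id, variants, features):
        """Client side: enqueue a task and return its position in the queue."""
        self._waiting.append((task_id, variants, features))
        return len(self._waiting)

    def run_next(self, integrate):
        """Daemon side: fetch the oldest task, integrate, store tab-delimited rows."""
        if not self._waiting:
            return None
        task_id, variants, features = self._waiting.popleft()
        rows = integrate(variants, features)
        self.results[task_id] = "\n".join(
            "\t".join(str(cell) for cell in row) for row in rows
        )
        return task_id


def toy_integrate(variants, features):
    """Placeholder integration: pair every variant with every selected feature."""
    return [(chrom, pos, name) for chrom, pos in variants for name in features]


queue = TaskQueue()
queue.submit("task-1", [("chr1", 12345)], ["open_chromatin", "dbSNP"])
queue.submit("task-2", [("chr2", 67890)], ["GWAS"])
first_done = queue.run_next(toy_integrate)  # processes the oldest task first
```

Keeping submission and execution separate is what lets the user leave the waiting page and come back later: the result is looked up by task identifier once a worker has finished.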
List of genomic features provided by VAS
| Type | Genomic features |
|---|---|
| Chromatin | ENCODE open chromatin, histone modifications, protein-DNA binding |
| Genomic states | ChromHMM segmentation |
| Expression | ENCODE RNA-seq |
| Sequence | UCSC |
| Annotation | Gencode |
| Variations | dbSNP |
| Diseases | GWAS Catalog |
Figure 2. Amount of data upload and uploading time required at various sizes of the input list of genetic variants in our simulation study, before and after client-side data compression. The data uploading time for the uncompressed case was estimated based on the file size and the data transfer rate when transferring the compressed version of the same files.
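The estimate described in this caption can be sketched as follows. The compression scheme used by VAS is not specified here, so the sketch assumes plain `zlib` compression of a text variant list; the function names and the synthetic variant lines are illustrative only.

```python
import zlib


def compress_variants(lines):
    """Return (raw_bytes, compressed_bytes) for a list of variant lines."""
    raw = "\n".join(lines).encode("ascii")
    return raw, zlib.compress(raw, 9)


def decompress_variants(blob):
    """Invert compress_variants and return the original list of lines."""
    return zlib.decompress(blob).decode("ascii").split("\n")


def estimate_uncompressed_time(raw_size, packed_size, packed_seconds):
    """Estimate upload time of the raw file from the rate observed on the packed file."""
    rate = packed_size / packed_seconds  # bytes per second
    return raw_size / rate


# Synthetic input (assumption): 10,000 tab-delimited variant lines.
variants = [f"chr1\t{1000 + 37 * i}\tA\tG" for i in range(10000)]
raw, packed = compress_variants(variants)
restored = decompress_variants(packed)
ratio = len(raw) / len(packed)  # how many times smaller the upload becomes
```

Because the transfer rate is taken from the compressed upload, the estimated uncompressed time scales linearly with the compression ratio.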
Figure 3. Usage of VAS. (a) Selecting genomic features to be integrated with the genetic variants. (b) Summary of the annotation results. Genomic features identified around each genetic variant (within a 10 kb window in this case) are shown, where a darker color indicates a stronger signal value. (c) Detailed view of a genetic variant, with an embedded UCSC Genome Browser image in which each genomic feature is shown as a signal track.
Figure 4. An example of point-to-region data integration using our algorithm.
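The algorithm shown in the figure is not reproduced in this record. As a rough illustration of what point-to-region integration means, the sketch below looks up each variant position in a pre-sorted list of feature regions by binary search. It assumes half-open, non-overlapping regions on a single chromosome; the function name and the example data are hypothetical.

```python
from bisect import bisect_right


def annotate_points(points, regions):
    """Map each position to the label of the region containing it, or None.

    regions: list of (start, end, label), sorted by start, non-overlapping,
    with half-open coordinates [start, end).
    """
    starts = [start for start, _end, _label in regions]
    hits = {}
    for pos in points:
        # Index of the last region whose start is <= pos (or -1 if none).
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < regions[i][1]:
            hits[pos] = regions[i][2]
        else:
            hits[pos] = None
    return hits


regions = [(100, 200, "peak_A"), (500, 650, "peak_B")]
annotated = annotate_points([50, 150, 200, 600], regions)
```

Each lookup costs logarithmic time in the number of regions, which is why pre-sorting the features pays off when the variant list is large.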
Figure 5. An example of region-to-region data integration using our algorithm.
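Region-to-region integration arises when each variant is extended to a flanking window. The sketch below is a generic illustration, not the algorithm in the figure: it assumes the features are sorted, non-overlapping, and half-open, and it lets the query windows overlap one another freely. All names and data are hypothetical.

```python
from bisect import bisect_right


def features_in_windows(windows, features):
    """For each (start, end) window, list labels of the features overlapping it.

    features: list of (start, end, label), sorted by start and non-overlapping,
    so their end coordinates are sorted as well. Coordinates are half-open.
    """
    ends = [end for _start, end, _label in features]
    result = []
    for w_start, w_end in windows:
        # First feature whose end lies beyond the window start.
        j = bisect_right(ends, w_start)
        hits = []
        # Collect features until one starts at or after the window end.
        while j < len(features) and features[j][0] < w_end:
            hits.append(features[j][2])
            j += 1
        result.append(hits)
    return result


features = [(100, 200, "A"), (500, 650, "B"), (900, 1000, "C")]
windows = [(0, 100), (150, 550), (600, 950), (1000, 1100)]
overlaps = features_in_windows(windows, features)
```

The binary search finds the first candidate, and the short forward scan collects only the features that actually overlap, so the cost per window depends on the number of hits rather than on the size of the feature file.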
Data integration time of different methods
| Method | Integrating variant locations (s) | Integrating variant flanking windows (s) |
|---|---|---|
| BigBed | 277.90 | 275.63 |
| Interval tree | 0.41 | 0.60 |
| Relational database | 8.05 | 736.23 |
| Tabix | 8.87 | 8.88 |
| Our algorithms | 0.21 | 0.52 |
For the BigBed reader and the interval tree, we used the implementations in bx-python. For the relational database, we tried several indexing methods, including the standard B-tree index and a spatial index, and report here the shortest time among these approaches. Tabix was called using the pytabix library in Python.
Some distinctive features of VAS as compared to some related tools
| Tool | CADD | GEMINI | GWASdb | GWAVA | HaploReg | RegulomeDB | VAS |
|---|---|---|---|---|---|---|---|
| Client-side data compression | No | (local) | N/A | No | No | No | Yes |
| Input variants allowed | ∼100,000 | (Unlimited) | 1 | >10,000 | 10,000 | ∼5,000 | 3,000,000 |
| Genomic features/aggregated | 63 | (User defined) | 37 | 14 | 10 | 1,012 | 1,000+ |
| features provided | (5 categories) | (6 categories) | (13 categories) | (16 categories) | |||
| Data storage and integration | (Not described) | Relational DB | Relational DB | (Not described) | Relational DB | Relational DB | Customized |
| Searching flanking regions | No | No | Yes | No | No | No | Yes |
| Asynchronous access of results | Yes | (local) | No | No | No | No | Yes |
| Linkout to genome browser | No | No | UCSC | Ensembl | No | UCSC | UCSC |
For GWAVA and RegulomeDB, the maximum number of input variants allowed is based on our own tests of the system. Properties of the tools are based on their versions on 8th September 2014.
Lists of genetic variants from the personal genome project tested on VAS
| Sample | Total number of variants | PGP variants | Chromosomal location | dbSNP ID | Clinical importance | Found by VAS |
|---|---|---|---|---|---|---|
| hu47A9D1 | 960,613 | APOA5-S19W | chr11:116662407/chr11:116167616 | rs3135506 | Low | Yes |
| | | APOE-C130R | chr19:45411941/chr19:50103780 | rs429358 | High | Yes |
| | | MBL2-G54D | chr10:54531235/chr10:54201240 | rs1800450 | Low | Yes |
| | | MBL2-R52C | chr10:54531242/chr10:54201247 | rs5030737 | Low | Yes |
| | | MTRR-I49M | chr5:7870973/chr5:7923972 | rs1801394 | Low | Yes |
| | | MYO7A-R302H | chr11:76869378/chr11:76547025 | rs41298135 | High | Yes |
| | | rs5186 | chr3:148459988/chr3:149942677 | rs5186 | Low | Yes |
| hu7DA960 | 960,613 | AMPD1-Q12X | chr11:115236057/chr11:115037579 | rs17602729 | Low | Yes |
| | | KCNE1-D85N | chr21:35821680/chr21:34743549 | N/A | High | Yes |
| | | KRT5-G138E | chr12:52913668/chr12:51199934 | rs11170164 | Low | Yes |
| | | MBL2-G54D | chr10:54531235/chr10:54201240 | rs1800450 | Low | Yes |
| | | rs5186 | chr3:148459988/chr3:149942677 | rs5186 | Low | Yes |
| hu8D40D6 | 598,897 | APOE-C130R | chr19:45411941/chr19:50103780 | rs429358 | High | Yes |
| | | HFE-S65C | chr6:26091185 | N/A | Low | Yes |
| | | MTRR-I49M | chr5:7870973/chr5:7923972 | rs1801394 | Low | Yes |
| | | PRPH-D141Y | chr12:49689404 | rs58599399 | High | Yes |
| | | RPF1-A91V | chr10:72360387/chr10:72030392 | rs35947132 | Low | Yes |
| | | SERPINA1-E288V | chr14:94847262/chr14:93917014 | rs17580 | Low | Yes |
| hu998A3D | 960,613 | BTD-D444H | chr3:15686693/chr3:15661696 | rs13078881 | Low | Yes |
| | | C3-R102G | chr19:6718387/chr19:6669386 | rs2230199 | Moderate | Yes |
| | | COL4A1-Q1334H | chr13:110818598/chr13:109616598 | rs3742207 | Low | Yes |
| | | HFE-S65C | chr6:26091185 | N/A | Low | Yes |
| | | MTRR-I49M | chr5:7870973/chr5:7923972 | rs1801394 | Low | Yes |
| | | rs5186 | chr3:148459988/chr3:149942677 | rs5186 | Low | Yes |
| | | SERPINA1-E366K | chr14:94844947/chr14:93914699 | rs28929474 | High | Yes |
| hgD53911 | 612,647 | COL4A1-Q1334H | chr13:110818598/chr13:109616598 | rs3742207 | Low | Yes |
| | | MTRR-I49M | chr5:7870973/chr5:7923972 | rs1801394 | Low | Yes |
| | | PKD1-R4276W | chr16:2139814/chr16:2079814 | rs114251396 | High | Yes |
| | | rs5186 | chr3:148459988/chr3:149942677 | rs5186 | Low | Yes |
| | | SCNN1G-E197K | chr16:23200963/chr16:23108463 | rs5738 | Low | Yes |
| | | VWF-R854Q | chr12:6143978/chr12:6014238 | rs41276738 | Moderate | Yes |
The variants listed in the "PGP variants" column include likely pathogenic and rare (<2.5%) pathogenic variants according to the reports available on the Personal Genome Project Web site. The information in the "Chromosomal location", "dbSNP ID" and "Clinical importance" columns was all obtained from these reports.
Time measurement of GEMINI and VAS
| Tool | Trial | Data loading/uploading (s)* | Data integration (s) | Total (s) |
|---|---|---|---|---|
| GEMINI (our testing results) | Trial 1 | 9,944.6 | 154.1 | 10,098.6 |
| | Trial 2 | 9,960.5 | 155.5 | 10,116.1 |
| | Trial 3 | 10,182.4 | 156.9 | 10,339.3 |
| | Trial 4 | 10,182.3 | 162.8 | 10,345.1 |
| | Trial 5 | 10,053.2 | 169.1 | 10,222.2 |
| | Average | 10,064.6 | 159.7 | 10,224.3 |
| | Std. dev. | 115.2 | 6.2 | 117.6 |
| VAS | Trial 1 | 9.9 | 1,711.1 | 1,721.1 |
| | Trial 2 | 10.4 | 1,772.3 | 1,782.7 |
| | Trial 3 | 9.7 | 1,552.5 | 1,562.1 |
| | Trial 4 | 9.2 | 1,541.6 | 1,550.8 |
| | Trial 5 | 9.6 | 1,580.9 | 1,590.5 |
| | Average | 9.8 | 1,631.7 | 1,641.4 |
| | Std. dev. | 0.4 | 103.7 | 104.1 |
*Time for GEMINI to load the data into the database and perform pre-computations, and time for VAS to upload the file from the client browser to our server.
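The summary rows of the timing table can be recomputed from the five trials. The short check below uses only the numbers in the table; the variable and function names are illustrative. The reported standard deviations agree with the sample (n - 1) form, which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# Per-trial timings in seconds, copied from the table above.
gemini_loading = [9944.6, 9960.5, 10182.4, 10182.3, 10053.2]
vas_integration = [1711.1, 1772.3, 1552.5, 1541.6, 1580.9]


def summarize(values):
    """Return (average, sample standard deviation), rounded to one decimal."""
    return round(mean(values), 1), round(stdev(values), 1)


gemini_summary = summarize(gemini_loading)
vas_summary = summarize(vas_integration)

# Ratio of the reported average total times (GEMINI / VAS).
speedup = 10224.3 / 1641.4
```

On these figures the average total time of GEMINI is roughly six times that of VAS, with nearly all of GEMINI's time spent on loading and pre-computation and nearly all of VAS's time spent on integration.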