Jianping Jiang1, Jianlei Gu1,2, Tingting Zhao1, Hui Lu1,2.
Abstract
BACKGROUND: Next-generation sequencing (NGS) has been widely used in both clinics and research. It has become the most powerful tool for diagnosing genetic disorders and investigating disease etiology through the discovery of genetic variants. Variants identified by NGS are stored in variant call format (VCF) files. However, querying and filtering VCF files are extremely difficult for researchers without programming skills. Furthermore, as the mutation data are increasing exponentially, there is an urgent need to develop tools to manage these variant data in a centralized way.
Keywords: NGS; VCF visualization; software; variant filtering; variant management
Year: 2019 PMID: 31127704 PMCID: PMC6625089 DOI: 10.1002/mgg3.641
Source DB: PubMed Journal: Mol Genet Genomic Med ISSN: 2324-9269 Impact factor: 2.183
Figure 1. Hierarchical system design of VCF‐Server. VCF‐Server is a web application based on the Browser/Server architecture.
Figure 2. VCF‐Server workflow. A public server has been provided for easy access, and local deployment is also possible. The user uploads one or more VCF files with or without whitelists/blacklists or private annotation databases. The back end stores data separately for each individual user and invokes different modules on demand. ⅰ, ⅱ, ⅲ and ⅳ are different application scenarios on the VCF‐Server.
Comparison of existing noncommercial visualization and mining tools for VCF
| Features | VCF‐Server | BrowseVCF | VCF‐Miner | myVCF | VCF.Filter |
|---|---|---|---|---|---|
| VCF index required | × | × | × | × | √ |
| Flexible variant filtering/selection | √ | √ | √ | × | √ |
| | √ | × | × | × | × |
| | √ | × | × | × | × |
| | √ | × | × | × | × |
| Output view customize | √ | × | √ | × | × |
| Output format | HTML, VCF, CSV | HTML, TSV | HTML, VCF, TSV | HTML | Text, VCF |
| Docker container | √ | × | √ | × | × |
| | √ | × | × | × | × |
| | √ | × | × | × | × |
| Data Persistence | √ | × | √ | √ | × |
| | √ | × | × | × | × |
| Login required | × | × | √ | × | × |
| Multiple tasks support | √ | × | √ | × | × |
| Installation required | × | √ | × | √ | × |
| Dependency for installation | × | √ | × | √ | × |
| Fully open source | √ | √ | × | √ | √ |
| Application architecture | B/S | B/S | B/S | B/S | Stand‐alone |
| Graphical user interface engine | HTML + Node.js | HTML + Python‐CGI | HTML + Java | HTML + django | Java GUI |
| VCF parser | C/HTSlib | Python/wormtable | Java | Python/PyVCF | Java/HTSJDK |
| Database engine | MongoDB | Berkeley DB | MongoDB | SQLite | None |
| Citation | | Salatino & Ramraj | Hart et al. | Pietrelli & Valenti | Muller et al. |
The highlighted features (bold) are unique to VCF‐Server.
Data Persistence: storing variants and VCFs in a database for fast, multi‐user access.
B/S: Browser/Server architecture.
Figure 3. Screenshots of different functional modules on VCF‐Server. (a) VCF file management online. (b) Annotating VCF with commonly used annotation databases. (c) Variant filtering and visualization. (d) VCF index management.
Performance of VCF‐Server, BrowseVCF, and VCF‐Miner on exome and whole‐genome data
VCF‐Server

| VCF file | Size (Mb) | Variants | Step 1: Pre‐processing, L0 | Step 1: Pre‐processing, L1 | Step 1: Pre‐processing, L2 | Step 2: Indexing | Step 3: Filtering |
|---|---|---|---|---|---|---|---|
| GIAB_v.3.3.2.NA12878.vcf.gz | 129 | 3,775,119 | 2m2s | 3m1s | | | |
| WES.Trio.vcf.gz | 5.7 | 129,131 | 4s | 7s | | | |
| WGS.Trio.vcf.gz | 236 | 4,852,720 | 3m3s | 5m32s | 6m30s | | |
| WGS.Trio.anno.vcf.gz | 287 | 4,852,720 | 4m13s | 6m43s | 8m5s | | |

BrowseVCF

| VCF file | Size (Mb) | Variants | Step 1: Pre‐processing | Step 2: Indexing | Step 3: Filtering |
|---|---|---|---|---|---|
| GIAB_v.3.3.2.NA12878.vcf.gz | 129 | 3,775,119 | 3m24s | 21m53s | 3m59s |
| WES.Trio.vcf.gz | 5.7 | 129,131 | 10s | 58s | 11s |
| WGS.Trio.vcf.gz | 236 | 4,852,720 | | 13m49s | 7m37s |
| WGS.Trio.anno.vcf.gz | 287 | 4,852,720 | | 37m31s | 14m10s |

VCF‐Miner

| VCF file | Size (Mb) | Variants | Step 1: Pre‐processing | Step 2: Indexing | Step 3: Filtering |
|---|---|---|---|---|---|
| GIAB_v.3.3.2.NA12878.vcf.gz | 129 | 3,775,119 | 12m2s | 24m40s | 10s |
| WES.Trio.vcf.gz | 5.7 | 129,131 | 25s | 27s | |
| WGS.Trio.vcf.gz | 236 | 4,852,720 | 17m6s | 28m20s | |
| WGS.Trio.anno.vcf.gz | 287 | 4,852,720 | 20m30s | 41m55s | 12s |
All benchmarks were run on an Ubuntu Linux 16.04 server with an Intel CPU at 2.6 GHz and 64 GB RAM. All operations were performed using a single CPU. Bold values indicate the fastest method for each step.
Step 1 (pre‐processing): the step required to parse and import the input VCF file into the database. L0, L1, and L2 denote the sets of VCF fields that VCF‐Server parses and imports into the database. With L0, the CHROM, POS, ID, REF, ALT, QUAL, and INFO fields are parsed and imported. With L1, the L0 fields plus the FORMAT string are parsed and imported. With L2, VCF‐Server parses and imports all of the fields.
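The three parsing levels can be illustrated with a minimal Python sketch. It is not VCF‐Server's implementation (the comparison table lists its parser as C/HTSlib); the names `parse_vcf_line`, `parse_info`, `sample_names`, and the example line are hypothetical and chosen only for illustration. The sketch assumes a tab-separated VCF data line with the standard fixed columns.

```python
# Hypothetical sketch of level-dependent VCF parsing (not VCF-Server's code).
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
# Fields listed for level L0 in the table note.
L0_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "INFO"]

# Invented example line, for illustration only.
EXAMPLE_LINE = "chr1\t12345\trs1\tA\tG\t50\tPASS\tDP=10;DB\tGT:DP\t0/1:10\n"


def parse_info(info):
    """Turn an INFO string such as 'DP=10;DB' into {'DP': '10', 'DB': True}."""
    result = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else True  # flags without '=' become True
    return result


def parse_vcf_line(line, level=0, sample_names=()):
    """Parse one VCF data line, keeping more fields as the level increases."""
    cols = line.rstrip("\n").split("\t")
    fixed = dict(zip(VCF_COLUMNS, cols))

    # L0: the fixed fields listed above, with INFO expanded into a dict.
    record = {name: fixed[name] for name in L0_FIELDS}
    record["POS"] = int(record["POS"])
    record["INFO"] = parse_info(record["INFO"])

    # L1: L0 plus the raw FORMAT string.
    if level >= 1:
        record["FORMAT"] = fixed.get("FORMAT", "")

    # L2: every remaining field, including FILTER and per-sample values.
    if level >= 2:
        record["FILTER"] = fixed["FILTER"]
        keys = record["FORMAT"].split(":") if record["FORMAT"] else []
        record["samples"] = {
            name: dict(zip(keys, value.split(":")))
            for name, value in zip(sample_names, cols[9:])
        }
    return record
```

Calling `parse_vcf_line(EXAMPLE_LINE, level=2, sample_names=["NA12878"])` keeps the per-sample genotype data, while `level=0` drops it. Importing fewer fields is presumably why the lower levels show shorter pre‐processing times in the table.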
Step 2 (indexing): indexes were built for the following fields: CHROM + POS, ID, REF + ALT, QUAL, FILTER.
Step 3 (filtering): the query was executed on the FILTER field, keeping only PASS variants.
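The indexing and filtering steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard-library `sqlite3` module as a stand-in for MongoDB (the database engine listed for VCF‐Server in the comparison table), and the table name, column names, function names, and example rows are all hypothetical.

```python
import sqlite3

# Invented example variants: (chrom, pos, id, ref, alt, qual, filter).
EXAMPLE_VARIANTS = [
    ("chr1", 100, "rs1", "A", "G", 50.0, "PASS"),
    ("chr1", 200, ".", "C", "T", 10.0, "LowQual"),
    ("chr2", 50, "rs2", "G", "A", 99.0, "PASS"),
]


def build_variant_store(records):
    """Load variants into an in-memory SQLite table and build the indexes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants ("
        "chrom TEXT, pos INTEGER, variant_id TEXT, "
        "ref TEXT, alt TEXT, qual REAL, filter_status TEXT)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?)", records)

    # One index per field (or field pair) named in the benchmark note:
    # CHROM + POS, ID, REF + ALT, QUAL, FILTER.
    conn.execute("CREATE INDEX idx_chrom_pos ON variants (chrom, pos)")
    conn.execute("CREATE INDEX idx_id ON variants (variant_id)")
    conn.execute("CREATE INDEX idx_ref_alt ON variants (ref, alt)")
    conn.execute("CREATE INDEX idx_qual ON variants (qual)")
    conn.execute("CREATE INDEX idx_filter ON variants (filter_status)")
    conn.commit()
    return conn


def pass_variants(conn):
    """Return only the variants whose FILTER value is PASS."""
    cursor = conn.execute(
        "SELECT chrom, pos, variant_id, ref, alt, qual, filter_status "
        "FROM variants WHERE filter_status = ? ORDER BY chrom, pos",
        ("PASS",),
    )
    return cursor.fetchall()
```

With the example rows, `pass_variants(build_variant_store(EXAMPLE_VARIANTS))` keeps the two PASS variants and drops the LowQual one. In a document store such as MongoDB the same idea would be expressed as one index per field or field pair plus an equality query on the FILTER value.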