Mohammad Zia, Paul Spurgeon, Adrian Levesque, Thomas Furlani, Jianxin Wang.
Abstract
BACKGROUND: High-throughput sequencing technologies are increasingly used in basic genetic research and in clinical applications. More and more variants underlying Mendelian and complex diseases are being discovered and documented using these technologies. However, after variant calling, narrowing the results down to a short list of candidate disease-causing variants remains challenging for most users, especially those without computational skills.
Keywords: Annotation; Complex diseases; Genome; Genotyping; High-throughput sequencing; Mendelian diseases; VCF; Variants; WES; WGS
Year: 2019 PMID: 30704396 PMCID: PMC6357466 DOI: 10.1186/s12859-019-2636-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Comparison of existing open-source software tools with similar functions
| Features | GenESysV | GEMINI | BrowseVCF | VCF-Miner | Mendel,MD | BiERapp |
|---|---|---|---|---|---|---|
| Graphical User Interface [a] | Yes | No | Yes | Yes | Yes | Yes |
| Study type | Single cohort complex disease, Case/Control, and Mendelian inheritance | Single cohort complex disease and Mendelian inheritance | Single cohort complex and Mendelian inheritance | Single cohort complex disease and Mendelian inheritance | Mendelian only | Single cohort complex disease, Case/Control, and Mendelian inheritance |
| Whole genome, exome, or targeted study | All | All | All | All | WES or targeted study | WES or targeted study |
| Can handle studies with large numbers of samples | Yes | Yes | No | No | No | No |
| Database Type | Elasticsearch | Sqlite3 | Wormtable & BerkeleyDB | MongoDB | PostgreSQL | SQLite & MongoDB |
| Flag variants for further filtering | Yes | No | No | No | No | No |
[a] Features listed here are not exhaustive.
Fig. 1 Schematic view of the GenESysV design. Input VCF file(s) are parsed into JSON format files using multiple CPU cores in parallel. An Elasticsearch index mapping file and a GUI configuration file are also created during data parsing. The GUI configuration file is used to guide the automatic creation of the web graphical interface at a later stage. Elasticsearch index creation is also done in parallel to further speed up the entire data import process.
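The parsing step in Fig. 1 can be illustrated with a minimal sketch. This is not the GenESysV `load_vcf.py` code: the function names (`split_header`, `parse_line`, `vcf_to_json`), the JSON layout, and the assumption that `GT` is the first FORMAT subfield are all hypothetical, chosen only to show how VCF lines might be turned into JSON documents by a pool of worker processes.

```python
import json
from multiprocessing import Pool

# Names of the first seven fixed VCF columns (INFO and FORMAT follow them).
FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"]

# Tiny in-memory VCF used for the demonstration below (hypothetical data).
DEMO = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t100\trs1\tA\tG\t50\tPASS\tAF=0.5;DB\tGT:DP\t0|1:10\t1|1:12\n",
    "1\t200\t.\tC\tT\t30\tPASS\tAF=0.01\tGT:DP\t0|0:8\t0|1:9\n",
]


def split_header(lines):
    """Separate sample names (from the #CHROM line) from the data lines."""
    samples, data = [], []
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines are skipped in this sketch
        if line.startswith("#CHROM"):
            samples = line.rstrip("\n").split("\t")[9:]
        elif line.strip():
            data.append(line)
    return samples, data


def parse_line(args):
    """Convert one VCF data line into a JSON string (one document per variant)."""
    line, samples = args
    fields = line.rstrip("\n").split("\t")
    doc = dict(zip(FIXED, fields[:7]))
    doc["POS"] = int(doc["POS"])
    # INFO holds semicolon-separated KEY=VALUE pairs or bare flags.
    info = {}
    for item in fields[7].split(";"):
        if item == ".":
            continue
        key, _, value = item.partition("=")
        info[key] = value if value else True
    doc["INFO"] = info
    # Assumption: GT is the first FORMAT subfield of every sample column.
    doc["samples"] = [
        {"Sample_ID": name, "GT": call.split(":")[0]}
        for name, call in zip(samples, fields[9:])
    ]
    return json.dumps(doc, sort_keys=True)


def vcf_to_json(lines, workers=2):
    """Parse VCF lines into JSON documents using a pool of worker processes."""
    samples, data = split_header(lines)
    with Pool(workers) as pool:
        return pool.map(parse_line, [(line, samples) for line in data])


if __name__ == "__main__":
    for document in vcf_to_json(DEMO):
        print(document)
```

Each worker handles whole lines independently, so the work splits cleanly across cores. A real loader would also need to handle multi-allelic sites, typed INFO fields, and annotation subfields such as those added by VEP.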
Fig. 2 The interface of GenESysV and the querying process. The querying process starts with the selection of study name, dataset name, and analysis type (a), followed by setting filters (b) and selecting variant/annotation-related attributes (c). The output is a table displaying up to 400 records (d). The full results can be obtained by clicking the “Export to CSV” button (top right).
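To make the filter-then-select workflow in Fig. 2 concrete, here is a minimal sketch of how a set of filters and attributes could be translated into an Elasticsearch search body. The helper `build_query` and its filter encoding are hypothetical, and the sketch assumes a flat index in which field names such as `EUR_AF`, `Consequence`, and `FILTER` are top-level fields; the actual GenESysV mapping and query construction may differ (for example, per-sample fields may be nested).

```python
import json


def build_query(filters, attributes, size=400):
    """Build an Elasticsearch search body from simple (operator, value) filters.

    filters: dict mapping field name -> (operator, value), where operator is
             "eq", "in", or one of the range operators "lt", "lte", "gt", "gte".
    attributes: field names to return for each hit.
    size: maximum number of hits to return (400 mirrors the display limit).
    """
    clauses = []
    for field, (op, value) in filters.items():
        if op == "eq":
            clauses.append({"term": {field: value}})
        elif op == "in":
            clauses.append({"terms": {field: list(value)}})
        elif op in ("lt", "lte", "gt", "gte"):
            clauses.append({"range": {field: {op: value}}})
        else:
            raise ValueError(f"unsupported operator: {op}")
    return {
        "size": size,
        "_source": list(attributes),
        # Filter context: every clause must match, and no scoring is needed.
        "query": {"bool": {"filter": clauses}},
    }


if __name__ == "__main__":
    body = build_query(
        filters={
            "EUR_AF": ("lt", 0.01),
            "Consequence": ("in", ["frameshift_variant", "stop_gained"]),
            "FILTER": ("eq", "PASS"),
        },
        attributes=["CHROM", "POS", "REF", "ALT", "SYMBOL"],
    )
    print(json.dumps(body, indent=2))
```

Putting all clauses in the `filter` context of a `bool` query expresses a logical AND without relevance scoring, which suits yes/no variant filters.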
Fig. 3 Benchmarking of VCF data import. Comparison of VCF import between GenESysV and GEMINI. For comparison purposes, import performance for Annovar-annotated VCF files is also shown in this figure. The phase3 VCF file from the 1000 Genomes Project was downloaded and annotated with VEP or Annovar. A series of VCF files containing the full set or subsets of variants was generated by including variants from the first 100, 250, 500, 750, 1000, 1250, 1500, 1750, 2000 samples. These VCF files were used as inputs for import using the data loading script (load_vcf.py) in our GenESysV package. The tests were performed on a server with 24 CPU cores (Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz) and 128 GB memory; the maximum heap size for the Java virtual machine was set to 31 GB using the -Xmx flag, as recommended by the Elasticsearch documentation.
Comparison of query performance between GenESysV and GEMINI [a]
| Query | GenESysV filters and attributes | GEMINI query | AshkenazimTrio (6,312,781 variants), GenESysV | AshkenazimTrio, GEMINI | 1000 Genomes Project phase3 (2504 samples, 85,211,311 variants), GenESysV | 1000 Genomes Project phase3, GEMINI |
|---|---|---|---|---|---|---|
| Get all novel and detrimental variants | Filters: Limit Variants to dbSNP_ID: excluded, IMPACT: HIGH, FILTER: PASS. Attributes: CHROM, POS, REF, ALT, IMPACT, QUAL, FILTER. | select chrom, start, ref, alt, qual, impact_severity, filter from variants where in_dbsnp = 0 and impact_severity == 'HIGH' and filter is Null | 0.73 s/0.21 s (64) [b] | 2 m 41.35 s/1.06 s (74) | 33.22 s/0.49 s (55) | 2 m 42.80 s/0.75 s (20) |
| Get all rare, loss-of-function variants | Filters: EUR_AF (<): 0.01, Consequence: frameshift_variant, splice_acceptor_variant, splice_donor_variant, start_lost, start_retained_variant, stop_gained, stop_lost, FILTER: PASS. Attributes: CHROM, POS, REF, ALT, SYMBOL. | select chrom, start, ref, alt, qual, gene from variants where is_lof = 1 and aaf_1kg_eur < 0.01 and filter is Null limit 400 | 1.20 s/0.34 s (315) | 2.39 s/0.35 s (269) | 9.97 s/0.53 s (400) [c] | 2.60 s/0.59 s (400) [c] |
| Get rare, loss-of-function variants that are also heterozygous in selected samples | Filters: Consequence: frameshift_variant, splice_acceptor_variant, splice_donor_variant, start_lost, start_retained_variant, stop_gained, stop_lost, FILTER: PASS, EUR_AF (<): 0.01, Sample_ID: HG003 [d], HG004, GT: 0\|1, 1\|0. Attributes: CHROM, POS, REF, ALT, SYMBOL, Sample_ID, GT. | "select chrom, start, ref, alt, qual, gene, gts.HG003, gts.HG004 from variants where is_lof = 1 and aaf_1kg_eur < 0.01 and filter is Null" --gt-filter "gt_types.HG003 == HET" or "gt_types.HG004 == HET" | 0.71 s/0.37 s (239) [e] | 3.21 s/0.47 s (213) | 51.47 s/2.28 s (31) | 1 m 33.57 s/3.52 s (36) |
| Get missense variants in human HLA region | Filters: CHROM: 6, POS (>=): 28477797, POS (<=): 33448354, Consequence: missense_variant. Attributes: CHROM, POS, dbSNP_ID, REF, ALT, SYMBOL, MAX_AF. | select chrom, start, ref, alt, gene, max_aaf_all, impact, rs_ids from variants where chrom = 'chr6' and start >= 28477797 and end <= 33448354 and impact = 'missense_variant' limit 400 | 0.41 s/0.39 s (400) [c] | 3.70 s/0.51 s (400) [c] | 6.77 s/0.62 s (400) [c] | 7.72 s/0.78 s (400) [c] |
[a] Testing was performed on a 16 CPU core (2.3GHz Intel Xeon E312xx (Sandy Bridge, IBRS update)) cloud instance running Ubuntu 16.04 OS, with 32 GB memory and a solid state drive. VCF files were annotated with VEP.
[b] Query time (No. variants returned). The first number in the query time field is the time spent on the query when the system is cold, i.e. the system cache is empty. The second number is the time spent on repeating the query when the data has been cached by the first run of the same query. Each query was run three times and the median value is reported.
[c] These queries return more than 400 variants (the default upper limit set in GenESysV for display in the web browser). To avoid measuring time spent in file downloading, we limited the number of variants returned by GEMINI to 400 to make the two comparable.
[d] These sample IDs are for the AshkenazimTrio dataset. They are replaced with HG00096 and HG00097, respectively, when testing against the 1000 Genomes Project phase3 dataset.
[e] GenESysV does not always return the same number of variants as GEMINI for equivalent queries. See the supplementary material for a possible explanation.
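The timing protocol in the footnotes (cold versus cached runs, median of three) can be sketched as a small harness. This is only one possible reading of that protocol, not the authors' benchmarking script: `run_query` and `clear_cache` are hypothetical callbacks supplied by the caller, where `run_query` executes one query and returns its rows, and `clear_cache` empties whatever caches are in play before a cold run.

```python
import statistics
import time


def median_time(run_query, repeats=3, before_each=None):
    """Time run_query `repeats` times; return (median seconds, rows returned)."""
    timings = []
    count = None
    for _ in range(repeats):
        if before_each is not None:
            before_each()  # e.g. empty the cache so the run starts cold
        start = time.perf_counter()
        rows = run_query()
        timings.append(time.perf_counter() - start)
        count = len(rows)
    return statistics.median(timings), count


def benchmark(run_query, clear_cache, repeats=3):
    """Report cold and cached median query times plus the number of rows."""
    # Cold runs: the cache is cleared before every timed execution.
    cold, count = median_time(run_query, repeats, before_each=clear_cache)
    # Cached runs: the last cold run has already populated the cache.
    warm, _ = median_time(run_query, repeats)
    return {"cold_s": cold, "warm_s": warm, "variants": count}


if __name__ == "__main__":
    # Stand-in query and cache-clearing callbacks, for demonstration only.
    result = benchmark(lambda: list(range(400)), lambda: None)
    print(result)
```

Using the median rather than the mean keeps a single slow outlier (for example, a background process on a shared cloud instance) from dominating the reported time.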