Alex V Kotlar, Cristina E Trevino, Michael E Zwick, David J Cutler, Thomas S Wingo.
Abstract
Accurately selecting relevant alleles in large sequencing experiments remains technically challenging. Bystro (https://bystro.io/) is the first online, cloud-based application that makes variant annotation and filtering accessible to all researchers for terabyte-sized whole-genome experiments containing thousands of samples. Its key innovation is a general-purpose, natural-language search engine that enables users to identify and export alleles and samples of interest in milliseconds. The search engine dramatically simplifies complex filtering tasks that previously required programming experience or specialty command-line programs. Critically, Bystro's annotation and filtering capabilities are orders of magnitude faster than previous solutions, saving weeks of processing time for large experiments.
Keywords: Annotation; Big data; Bioinformatics; Cloud; Filtering; Genomics; Natural-language search; Online; Web
Year: 2018 PMID: 29409527 PMCID: PMC5801807 DOI: 10.1186/s13059-018-1387-3
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Using Bystro online to find alleles of interest in sequencing experiments. a After logging in (https://bystro.io/), users upload one or more VCF or SNP-format files, containing alleles from a sequencing experiment, from a computer or a connected Amazon S3 bucket. Datasets of over 890 GB, containing thousands of samples and tens of millions of variants, are supported. The data are rapidly annotated in the cloud, using descriptions from public sources (e.g. RefSeq, dbSNP, ClinVar, and others). The annotated results can be filtered using Bystro’s natural-language search engine, and any search results can be saved as new annotations. Annotated experiments and saved results can be viewed online, downloaded as tab-delimited text, or uploaded back to linked Amazon S3 buckets. b An example of using Bystro’s natural-language search engine to filter 1000 Genomes Phase 3 (https://bystro.io/public). To do so, users may type natural phrases, specific terms, or numerical ranges, or apply filters on any annotated field. Queries are flexible, allowing misspelled terms such as “earl-onset” to match accurately. Complex tasks, such as identifying de novo variants, can be achieved by using Boolean operators (AND, OR, NOT, +, -), exact-match filters, and user-defined terms. For instance, after labeling the “proband” and their “parents,” the user could simply search proband -parents or combine it with additional parameters for more refined queries, e.g. proband -parents missingness < .1 gnomad.exomes.af_nfe < .001
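The de novo search described in panel b can be read as a set operation: keep the sites where the sample labeled as the proband carries a non-reference allele and none of the samples labeled as parents do. The following Python sketch illustrates only that logic. The variant identifiers, sample names, and the `de_novo_candidates` helper are hypothetical, and the code is not Bystro's implementation or query syntax.

```python
from typing import Dict, List, Set

# Hypothetical toy data: variant ID -> samples carrying a non-reference allele.
carriers: Dict[str, Set[str]] = {
    "chr1:1000:A:G": {"child"},
    "chr1:2000:C:T": {"child", "mother"},
    "chr2:3000:G:A": {"father"},
}


def de_novo_candidates(
    carriers: Dict[str, Set[str]], proband: str, parents: Set[str]
) -> List[str]:
    """Return variants carried by the proband and by none of the parents."""
    return sorted(
        variant
        for variant, samples in carriers.items()
        if proband in samples and not (samples & parents)
    )


candidates = de_novo_candidates(carriers, "child", {"mother", "father"})
print(candidates)  # ['chr1:1000:A:G']
```

Additional criteria such as a missingness or allele-frequency cutoff would simply add further conditions to the same filter.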
Fig. 2 Online performance comparison of Bystro, VEP, wANNOVAR, and GEMINI. Bystro, wANNOVAR, VEP, and GEMINI (running on Galaxy) were run under similar conditions. Total processing time was recorded for 1000 Genomes Phase 3 WGS VCF files, containing either the full dataset (2504 samples, 8.49 × 10⁷ variant sites) or subsets (2504 samples and 5 × 10⁴, 3 × 10⁵, 1 × 10⁶, and 6 × 10⁶ variants). Only Bystro successfully processed more than 1 × 10⁶ variants online; wANNOVAR (not shown) could not complete the smallest 5 × 10⁴ variant subset; VEP could not complete more than 5 × 10⁴ variants; and GEMINI/Galaxy could not complete more than 1 × 10⁶ variants. Online, VEP outputted a restricted subset of annotation data compared to its offline version. GEMINI and Bystro (but not VEP) outputted whole-genome CADD scores, while only Bystro also returned whole-genome PhyloP and PhastCons conservation scores. Bystro was faster than GEMINI/Galaxy by 144–212× across all time points
Bystro, VEP, ANNOVAR offline command-line performance
| Software | Dataset | Samples | Variants | Variants/s | Bystro speedup |
|---|---|---|---|---|---|
| Bystro | 1000G Phase 3 chr1 | 2504 | 1 × 10⁶ | 8156 ± 195 | – |
| Bystro | 1000G Phase 3 chr1 | 2504 | 2 × 10⁶ | 8484 ± 67.9 | – |
| Bystro | 1000G Phase 3 chr1 | 2504 | 4 × 10⁶ | 8516 ± 57.2 | – |
| Bystro | 1000G Phase 3 chr1 | 2504 | 6.5 × 10⁶ | 7779 ± 21.8 | – |
| Bystro | 1000G Phase 1 | 1092 | 3.9 × 10⁷ | 5417 ± 76.8 | – |
| Bystro | 1000G Phase 3 | 2504 | 8.5 × 10⁷ | 7904 ± 15.9 | – |
| VEP | 1000G Phase 1 | 1092 | 3.9 × 10⁷ | 18.67 ± 0.58 | 290× |
| VEP | 1000G Phase 3 | 2504 | 8.5 × 10⁷ | 10.00 ± 0.00 | 790× |
| ANNOVAR | 1000G Phase 3 chr1 | 2504 | 1 × 10⁶ | 74.67 ± 0.21 | 109× |
| ANNOVAR | 1000G Phase 3 chr1 | 2504 | 2 × 10⁶ | 75.32 ± 0.06 | 113× |
| ANNOVAR | 1000G Phase 3 chr1 | 2504 | 4 × 10⁶ | 75.15 ± 0.39 | 113× |
| ANNOVAR | 1000G Phase 3 chr1 | 2504 | 6.5 × 10⁶ | NA | NA |
| ANNOVAR | 1000G Phase 1 | 1092 | 3.9 × 10⁷ | NA | NA |
| ANNOVAR | 1000G Phase 3 | 2504 | 8.5 × 10⁷ | NA | NA |
Bystro, VEP, and ANNOVAR were similarly configured with eight threads on Amazon i3.2xlarge servers. “Dataset” refers to the VCF file used. “Variants/s” is the average of three trials. VEP performance was recorded after 2 × 10⁵ sites in consideration of time. In runs of 1 × 10⁶ or more annotated sites, VEP performance did not deviate from the 2 × 10⁵ value. ANNOVAR could not complete the full Phase 1, Phase 3, or Phase 3 chromosome 1 datasets due to memory limitations. Thus, ANNOVAR was compared to Bystro on subsets of 1000 Genomes Phase 3 chromosome 1. Bystro run times included time taken to compress outputs. 1000 Genomes Phase 1 performance reflects IO limitations.
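The fold differences in the last column follow directly from the throughput values. The short Python sketch below reproduces two of them and, as a rough projection only, converts throughput into wall-clock time for the 8.5 × 10⁷-variant Phase 3 file. The helper names are illustrative, and the projection assumes constant throughput over the whole file, which the table itself does not guarantee.

```python
# Mean throughput (variants/s) copied from the table above.
bystro = {"1000G Phase 1": 5417.0, "1000G Phase 3": 7904.0}
vep = {"1000G Phase 1": 18.67, "1000G Phase 3": 10.00}


def fold_speedup(fast: float, slow: float) -> float:
    """Ratio of two throughputs measured on the same dataset."""
    return fast / slow


def hours_to_annotate(n_variants: float, variants_per_s: float) -> float:
    """Projected run time in hours, assuming constant throughput."""
    return n_variants / variants_per_s / 3600


for dataset in bystro:
    print(dataset, round(fold_speedup(bystro[dataset], vep[dataset])))
# 1000G Phase 1 290
# 1000G Phase 3 790

phase3_variants = 8.5e7
print(round(hours_to_annotate(phase3_variants, 7904.0), 1))      # about 3.0 hours
print(round(hours_to_annotate(phase3_variants, 10.0) / 24, 1))   # about 98.4 days
```

Under that assumption, the difference between roughly three hours and roughly 98 days is consistent with the abstract's statement that weeks of processing time are saved on large experiments.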
Online comparison of Bystro and recent programs in filtering 8.49 × 10⁷ variants from 1000 Genomes
| Group | Search query | Time (s) | Variants | Tr:Tv |
|---|---|---|---|---|
| 1 | Exonic | 0.030 ± 0.030 | 993,343 | 2.96 |
| 2 (a) | cadd > 20 maf < .001 pathogenic expert review missense | 0.029 ± 0.009 | 65 | 1.71 |
| 2 (b) | cadd > 20 maf < .001 pathogenic expert’s review non-synonymous | 0.036 ± 0.019 | 65 | 1.71 |
| 2 (c) | cadd > 20 maf < .001 pathogen expert-reviewed nonsynonymous | 0.044 ± 0.025 | 65 | 1.71 |
| 3 (a) | Early onset breast cancer | 0.046 ± 0.029 | 4335 | 2.51 |
| 3 (b) | Early-onset breast cancer | 0.037 ± 0.020 | 4335 | 2.51 |
| 3 (c) | Early onset breast cancers | 0.033 ± 0.015 | 4335 | 2.51 |
| 4 (a) | Pathogenic nonsense Ehlers-Danlos | 0.038 ± 0.027 | 1 | NA |
| 4 (b) | Pathogenic nonsense E.D.S | 0.078 ± 0.087 | 1 | NA |
| 4 (c) | Pathogenic stopgain eds | 0.040 ± 0.022 | 1 | NA |
The full 1000 Genomes Phase 3 VCF file (853 GB, 8.49 × 10⁷ variants, 2504 samples) was filtered in the publicly available Bystro web application using the Bystro natural-language search engine. VEP, GEMINI, and wANNOVAR (not shown) were also tested, but were unable to annotate this dataset or filter it. Bystro’s search engine uses a natural-language parser that allows for unstructured queries: queries in groups 2, 3, and 4 show phrasing variations that did not affect the results returned, as would be expected for a search engine that can handle normal language variation. “Tr:Tv” is the transition to transversion ratio automatically calculated for each query by the search engine. The transition to transversion ratio of 2.96 for the “exonic” query is close to the ~ 2.8–3.0 ratio expected in coding regions, suggesting that the search engine accurately identified exonic (coding) variants.
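For readers unfamiliar with the statistic, a transition is a purine-to-purine (A↔G) or pyrimidine-to-pyrimidine (C↔T) substitution, and every other single-base substitution is a transversion. The following Python sketch shows one way to compute the ratio from reference/alternate allele pairs. It is an illustration with hypothetical function names and toy data, not the calculation Bystro performs internally.

```python
from typing import Iterable, Optional, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def tr_tv_ratio(snps: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Transition:transversion ratio over single-nucleotide substitutions.

    Indels, non-ACGT alleles, and ref == alt records are skipped.
    Returns None when there are no transversions (ratio undefined).
    """
    transitions = transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else None


# Toy input: three transitions, one transversion, one deletion (ignored).
toy = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A")]
print(tr_tv_ratio(toy))  # 3.0
```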
Online comparison of Bystro and GEMINI/Galaxy in filtering 1 × 10⁶ variants
| No. | Program | Query | Time (s) | Variants | Ts/Tv |
|---|---|---|---|---|---|
| 1 | Bystro | cadd > 15 alt:(a \|\| c \|\| t \|\| g) | 0.004 ± 0 | 28,099 | 2.512 |
| 1 | GEMINI | SELECT * FROM variants JOIN variant_impacts ON variants.variant_id = variant_impacts.variant_id WHERE cadd_scaled > 15 | 442 ± 87 | 22,063 | NA |
| 2 | Bystro | gnomad.exomes.af < .001 cadd > 15 missense | 0.007 ± 0.003 | 6840 | 3.083 |
| 2 | GEMINI | SELECT * FROM variants JOIN variant_impacts ON variants.variant_id = variant_impacts.variant_id WHERE cadd_scaled > 15 AND aaf_exac_all < .001 AND variant_impacts.impact = "missense_variant" | 77.6 ± 18.6 | 5160 | NA |
| 3 | Bystro | gnomad.exomes.af < .001 cadd > 15 nonsynonymous | 0.006 ± 0.001 | 6840 | 3.083 |
| 3 | GEMINI | SELECT * FROM variants JOIN variant_impacts ON variants.variant_id = variant_impacts.variant_id WHERE cadd_scaled > 15 AND aaf_exac_all < .001 AND variant_impacts.impact = "nonsynonymous_variant" | NA | 0 | NA |
Bystro was compared to the latest hosted version of GEMINI (v0.8.1, on the Galaxy platform) in filtering the 1 × 10⁶ variant subset of 1000 Genomes Phase 3, which was the largest tested file that GEMINI/Galaxy could process. GEMINI requires structured SQL queries, while Bystro allows for shorter, unstructured search. In query 1, Bystro searched for CADD scores only within single-nucleotide polymorphisms (using alt:(a || c || t || g) or equivalently the regex query alt:/[actg]/), to normalize results with GEMINI, which provides no CADD data for insertions and deletions. In queries 2 and 3, Bystro’s search engine returned identical results for the synonymous terms “missense” and “nonsynonymous,” despite annotating such sites only as “nonsynonymous.” In contrast, GEMINI required the specific term “missense_variant.” GEMINI/Galaxy and Bystro returned different results because the latest version of GEMINI on Galaxy (0.8.1) uses outdated annotation sources. Comparisons between Bystro and GEMINI/Galaxy are further limited as GEMINI does not provide a natural-language parser, annotation field filters, an interactive result browser, per-query statistics, or the ability to filter saved search results. Notably, Bystro also performed substantially faster, returning all results in < 1 s.
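The stated equivalence between the alternation alt:(a || c || t || g) and the regex alt:/[actg]/ can be checked outside Bystro with a small Python sketch. Two assumptions are made here and are not taken from the source: the regex is applied to the whole field value, case-insensitively, and the example alt values (including the indel encodings) are hypothetical.

```python
import re

# Assumption: the pattern must match the entire alt value, ignoring case.
SNP_ALT_PATTERN = re.compile(r"[actg]", re.IGNORECASE)


def matches_alternation(alt: str) -> bool:
    """Mirror of the Boolean form: alt is a, c, t, or g."""
    return alt.lower() in {"a", "c", "t", "g"}


def matches_regex(alt: str) -> bool:
    """Mirror of the regex form: alt matches [actg] in full."""
    return SNP_ALT_PATTERN.fullmatch(alt) is not None


# Hypothetical alt values: three single-base alleles and two indel-like values.
alt_values = ["A", "t", "+AT", "-2", "G"]
snp_alts = [alt for alt in alt_values if matches_regex(alt)]
print(snp_alts)  # ['A', 't', 'G']

# Both formulations select exactly the same values.
print(all(matches_alternation(a) == matches_regex(a) for a in alt_values))  # True
```

Either form restricts the result set to single-base alternate alleles, which is what makes the Bystro counts in query 1 comparable to a tool that has no CADD scores for insertions and deletions.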