William McLaren, Laurent Gil, Sarah E Hunt, Harpreet Singh Riat, Graham R S Ritchie, Anja Thormann, Paul Flicek, Fiona Cunningham.
Abstract
The Ensembl Variant Effect Predictor is a powerful toolset for the analysis, annotation, and prioritization of genomic variants in coding and non-coding regions. It provides access to an extensive collection of genomic annotation, with a variety of interfaces to suit different requirements, and simple options for configuring and extending analysis. It is open source, free to use, and supports full reproducibility of results. The Ensembl Variant Effect Predictor can simplify and accelerate variant interpretation in a wide range of study designs.
Keywords: Genome; NGS; SNP; Variant annotation
Year: 2016 PMID: 27268795 PMCID: PMC4893825 DOI: 10.1186/s13059-016-0974-4
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Comparison of features of VEP with Annovar [95] and SnpEff [66]
| Class | Feature | VEP | Annovar | SnpEff |
|---|---|---|---|---|
| General | Language | Perl | Perl | Java |
| | Availability (non-commercial) | Free | Registration required | Free |
| | Availability (commercial) | Free | License required | Free |
| | Licence | Apache 2.0 | Unspecified, not open source | LGPLv3 |
| Input | VCF | Yes | Yes | Yes |
| | rsID | Yes | No | No |
| | HGVS | Yes | No | No |
| | BED | No | No | Yes |
| | Sequence variants | Yes | Yes | Yes |
| | Structural variants | Yes | Yes | Yes |
| Output | VCF | Yes | Yes (non-standard) | Yes |
| | HGVS | Yes | Yes | Yes |
| | Summary statistics | Yes | Yes | Yes |
| | Graphical summary | Yes | No | Yes |
| | Customizable output | Yes | No | No |
| Transcript sets | Ensembl | Yes | Yes | Yes |
| | RefSeq | Yes | Yes | Yes |
| | GENCODE Basic | Yes | Yes | No |
| | Species supported | ~5000 | 94 | ~4500 |
| | User-created databases | Yes | Yes | Yes |
| Interfaces | Local package | Yes | Yes | Yes |
| | Submission-based web interface | Ensembl Tools | wAnnovar | Galaxy |
| | Instant prediction web interface | Yes | No | No |
| | Cloud/VM | Yes | No | Yes |
| | API access | Perl, REST | No | No |
| Consequence types | Sequence Ontology | Yes | No | Yes |
| | Impact classification | Yes | No | Yes |
| | Number of classes | 33 | 19 | 42 |
| | Default reporting level | Transcript | Gene | Transcript |
| | Summary level reporting | Optional, customisable | Default, customisable | No |
| | Splicing predictions | Yes (via plugins) | Yes (via external data) | Yes (experimental) |
| | Loss of function prediction | Yes (via plugins) | No | Yes |
| | Nonsense-mediated decay assessment | No | No | Yes |
| Non-coding | Regulatory features | Yes | Yes | Yes |
| | Support multiple cell lines | Yes | No | Yes |
| | TFBS scoring | Yes | No | No |
| | miRNA structure location | Yes (via plugins) | No | No |
| Known variants | Report known variants | Yes | Yes | Yes |
| | Filter by frequency | Yes | Yes | Yes |
| | Clinical significance | Yes | Yes | Yes |
| Other filters | Pre-set filters | Yes | Yes | Yes |
| | Arbitrary filtering | Yes | No | Yes |
| Other | Per-individual annotation | Basic | No | Somatic versus germline |
| | Annotation with custom data | Yes | Yes | Yes |
| | Custom code extensions via Plugin architecture | Yes | No | No |
miRNA microRNA, TFBS transcription factor binding site, VM virtual machine
Gene and transcript-related fields reported by the VEP
| Property | Description |
|---|---|
| Gene ID | Ensembl stable identifier for affected gene |
| Gene symbol | Common name for gene, e.g., from HGNC |
| Transcript ID | Ensembl stable identifier for affected transcript |
| RefSeq ID | NCBI RefSeq identifier for affected transcript |
| CCDS ID | Consensus coding sequence (CCDS) identifier uniting Havana, Ensembl, and NCBI |
| Biotype | GENCODE biotype of affected transcript |
| cDNA coordinates | Coordinates of input variant in unprocessed cDNA |
| CDS coordinates | Coordinates of input variant in processed coding sequence (CDS) |
| Distance | Distance to transcript if variant falls outside transcript boundaries |
| Consequence type | SO consequence type of input variant allele on transcript |
| Exon | Number(s) of affected exon(s) |
| Intron | Number(s) of affected intron(s) |
| TSL | Transcript Support Level (TSL) highlights well-supported and poorly supported transcript models |
| APPRIS | Annotation of principal splice isoforms (APPRIS) is a system that annotates alternatively spliced transcripts using a range of computational methods, assigning primary and alternative statuses to transcripts |
| HGVS | HGVS notations for input variant relative to the coding sequence |
| Phenotype | Flag indicating known association with a phenotype or disease |
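When VCF output is chosen, VEP writes fields like those above into a single INFO entry. The sketch below shows one way such an entry could be unpacked in Python. It assumes the usual layout in which the annotation sits under a `CSQ` key, the field order is declared in the VCF header after `Format: `, fields are separated by `|`, and per-transcript blocks are separated by commas. The header text, gene and transcript identifiers, and field subset are hand-written placeholders, not real VEP output.

```python
def parse_csq_header(header_line):
    """Return the ordered field names declared in a CSQ header line."""
    # The field list follows "Format: " and runs up to the closing quote.
    fmt = header_line.split("Format: ", 1)[1]
    fmt = fmt.split('"', 1)[0]
    return fmt.split("|")


def parse_csq_value(csq_value, field_names):
    """Return one dict per comma-separated annotation block."""
    records = []
    for entry in csq_value.split(","):
        values = entry.split("|")
        records.append(dict(zip(field_names, values)))
    return records


# Hypothetical, abbreviated header; real headers list many more fields.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format: '
          'Allele|Consequence|SYMBOL|Gene|Feature|BIOTYPE|EXON|INTRON">')

# Hypothetical CSQ value with two transcript-level annotations.
csq = ("A|missense_variant|GENE1|ENSG_EXAMPLE|ENST_EXAMPLE_1|protein_coding|3/10|,"
       "A|intron_variant|GENE1|ENSG_EXAMPLE|ENST_EXAMPLE_2|protein_coding||2/9")

fields = parse_csq_header(header)
annotations = parse_csq_value(csq, fields)
for ann in annotations:
    # Print the exon number if present, otherwise the intron number.
    print(ann["Feature"], ann["Consequence"], ann["EXON"] or ann["INTRON"])
```

Reading the field order from the header rather than hard-coding it keeps the parser working when the set of reported fields is changed through VEP options.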
Protein-related fields reported by the VEP
| Property | Description |
|---|---|
| Protein ID | Ensembl stable identifier for affected protein product |
| RefSeq ID | NCBI RefSeq identifier for affected protein |
| SWISSPROT ID | Manually curated protein identifier from UniProt |
| TrEMBL ID | Automatically generated identifier from UniProt |
| UniParc ID | Combined protein identifier from UniProt |
| Protein coordinates | Coordinates of input variant in protein product |
| Codons | Reference and alternative codons as generated by input variant |
| Amino acids | Reference and alternative amino acids as generated by input variant |
| SIFT | SIFT pathogenicity prediction and score |
| PolyPhen | PolyPhen-2 pathogenicity prediction and score |
| Protein domains | Protein domains overlapping input variant |
| HGVS | HGVS notations for input variant relative to the protein sequence |
Examples of VEP plugins
| Plugin | Maintained by | Functionality |
|---|---|---|
| CADD | Martin Kircher | Integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations |
| dbNSFP | Ensembl | Provides pre-calculated scores from dbNSFP for many pathogenicity prediction tools for every possible missense variant in the human genome |
| dbscSNV | Ensembl | Retrieves data for splice variants from dbscSNV |
| ExAC | Ensembl | Retrieves allele frequencies from the Exome Aggregation Consortium (ExAC) project |
| GWAVA | Graham Ritchie | Predicts the functional impact of variants on non-coding elements from, e.g., ENCODE using GWAVA |
| GXA | Ensembl | Reports data from the Expression Atlas |
| LD | Ensembl | Finds variants in linkage disequilibrium with any overlapping existing variants |
| LOFTEE | Konrad Karczewski | Predicts if stop gain, splice site, or frameshift variants lead to loss of function (LoF) in the affected protein |
| MaxEntScan | Ensembl | Compares scores for reference and mutant splice site sequences using a maximum entropy method |
| miRNA | Ensembl | Reports whether a variant is predicted to fall in a stem or loop region of a mature miRNA |
| UpDownStream | Ensembl | Configures the distance searched for transcripts around input variants (5 kb either side by default); useful in species with small intergenic distances or for investigating long-range trans-acting regulatory interactions |
| VAX | Michael Yourshaw | Incorporates data from KEGG, Human Protein Atlas, MitoCarta, OMIM, and more into VEP output |
For a full list of plugins see [76]
Regulatory element-related fields reported by the VEP
| Property | Description |
|---|---|
| Regulatory or Motif feature ID | Ensembl identifier for affected regulatory element |
| Motif name | External name for transcription factor binding motif |
| Motif position | Coordinates of input variant in transcription factor binding motifs |
| Motif score | Score reflecting effect of input variant on closeness of binding motif sequence to consensus |
| Informative position | Flag indicating if the position occupied by the variant in the binding motif is important in the consensus sequence |
Co-located variant-related fields reported by the VEP
| Property | Description |
|---|---|
| Variant ID | External identifier for variant co-located with input, e.g., rsID from dbSNP |
| Somatic | Somatic status of co-located variant |
| GMAF | Global minor allele and frequency of co-located variant from combined 1000 Genomes phase 3 populations |
| Other frequencies | Frequency data from continental level 1000 Genomes phase 3 data and two NHLBI–Exome Sequencing Project populations |
| Clinical significance | Clinical significance status of co-located variant as reported by ClinVar |
| Phenotype | Flag indicating known association with a phenotype or disease |
| PubMed ID | NCBI PubMed IDs of publications citing co-located variant |
Example filters available in the VEP
| Option or command | Description |
|---|---|
| Runtime filters | |
| --no_intergenic | Filter out variants that fall in intergenic regions |
| --pick | Choose one consequence for each variant; priority is given to the canonical transcript for each gene, protein-coding transcripts, and more severe consequence types, e.g., missense_variant is more severe than intron_variant |
| --per_gene | Picks one consequence using the same methodology as --pick but chooses one per overlapping gene |
| --filter_common | Filter out variants that are co-located with a known variant that has a minor allele frequency greater than 1% |
| Results filters using filter_vep.pl | |
| SIFT is deleterious OR PolyPhen is probably_damaging | Filter for results where SIFT or PolyPhen-2 predicts the variant protein will be non-functional |
| AFR >0.1 AND EUR <0.05 | Filter for variants co-located with those that are common in African populations but rare in European populations |
| Gene in gene_list.txt AND Phenotype matches cancer | Filter for results for variants that fall in the genes with IDs listed in gene_list.txt and that have been annotated with a cancer phenotype from a custom dataset (VEP script only) |
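The three results filters above are boolean expressions over annotation fields. As a rough illustration of their logic only (not of how filter_vep.pl is implemented), they can be written as Python predicates over dictionaries. The record layout, variant names, gene identifiers, and values below are hypothetical. The sketch assumes SIFT and PolyPhen hold only the prediction term, and it approximates `matches` with a case-insensitive substring test.

```python
def damaging_by_sift_or_polyphen(rec):
    """SIFT is deleterious OR PolyPhen is probably_damaging."""
    return (rec.get("SIFT") == "deleterious"
            or rec.get("PolyPhen") == "probably_damaging")


def common_in_afr_rare_in_eur(rec):
    """AFR > 0.1 AND EUR < 0.05; records missing a frequency do not pass."""
    afr, eur = rec.get("AFR"), rec.get("EUR")
    if afr is None or eur is None:
        return False
    return afr > 0.1 and eur < 0.05


def in_gene_list_with_cancer_phenotype(rec, gene_ids):
    """Gene in gene_list.txt AND Phenotype matches cancer."""
    phenotype = rec.get("Phenotype", "")
    return rec.get("Gene") in gene_ids and "cancer" in phenotype.lower()


# Hypothetical annotated records, one per variant.
records = [
    {"id": "var1", "SIFT": "deleterious", "PolyPhen": "benign",
     "AFR": 0.25, "EUR": 0.01, "Gene": "ENSG_A", "Phenotype": "breast cancer"},
    {"id": "var2", "SIFT": "tolerated", "PolyPhen": "probably_damaging",
     "AFR": 0.02, "EUR": 0.03, "Gene": "ENSG_B", "Phenotype": ""},
    {"id": "var3", "SIFT": "tolerated", "PolyPhen": "benign",
     "AFR": 0.4, "EUR": 0.2, "Gene": "ENSG_A"},
]
gene_ids = {"ENSG_A"}  # stands in for the contents of gene_list.txt

damaging = [r["id"] for r in records if damaging_by_sift_or_polyphen(r)]
afr_specific = [r["id"] for r in records if common_in_afr_rare_in_eur(r)]
cancer_genes = [r["id"] for r in records
                if in_gene_list_with_cancer_phenotype(r, gene_ids)]
print(damaging, afr_specific, cancer_genes)
```

Treating a missing frequency as a failed comparison is a choice made for this sketch; whether absent fields should pass or fail depends on the analysis.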
Fig. 1 A typical VEP Web results page. Section (1) gives summary pie charts and statistics. Section (2) contains a preview of the results table with navigation, filtering, and download options. The preview table contains hyperlinks to genes, transcripts, regulatory features, and variants in the Ensembl browser. The results can be downloaded in VCF, text, or custom VEP file formats
Fig. 2 Example of JSON output as produced by the VEP script and REST API (redacted and prettified for display)
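JSON output of this kind can be consumed with a standard JSON parser. The following sketch flattens a result into one row per transcript. The sample document is hand-written for illustration and is not the content of Fig. 2. The key names (`most_severe_consequence`, `transcript_consequences`, `consequence_terms`, `gene_symbol`, `transcript_id`, `impact`) are assumed to follow the VEP JSON layout, and the identifiers are placeholders.

```python
import json

# Hypothetical, heavily abbreviated VEP-style JSON for a single input variant.
sample_json = """
[
  {
    "input": "example variant (hand-written)",
    "most_severe_consequence": "missense_variant",
    "transcript_consequences": [
      {"gene_symbol": "GENE1", "transcript_id": "ENST_EXAMPLE_1",
       "consequence_terms": ["missense_variant"], "impact": "MODERATE"},
      {"gene_symbol": "GENE1", "transcript_id": "ENST_EXAMPLE_2",
       "consequence_terms": ["intron_variant", "non_coding_transcript_variant"],
       "impact": "MODIFIER"}
    ]
  }
]
"""


def summarize(vep_results):
    """Return (transcript, gene symbol, joined consequence terms) tuples."""
    rows = []
    for result in vep_results:
        # Intergenic variants may carry no transcript-level block at all.
        for tc in result.get("transcript_consequences", []):
            rows.append((tc["transcript_id"],
                         tc.get("gene_symbol"),
                         "&".join(tc["consequence_terms"])))
    return rows


results = json.loads(sample_json)
rows = summarize(results)
for row in rows:
    print(*row, sep="\t")
```

Because each transcript carries a list of consequence terms, the sketch joins them into one string so that every transcript still maps to a single row.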
Comparison of runtime
| Tool | Chr. 21 | All |
|---|---|---|
| Annovar | 0 m 38.933 s (1732 v/s) | 21 m 50.037 s (3415 v/s) |
| SnpEff | 1 m 46.178 s (635 v/s) | 46 m 39.142 s (1598 v/s) |
| SnpEff (threaded)* | 1 m 21.046 s (832 v/s) | 10 m 28.274 s (7121 v/s) |
| VEP | 0 m 47.216 s (1428 v/s) | 62 m 9.107 s (1200 v/s) |
Two datasets from Illumina’s Platinum Genomes were used [93], both on the GRCh37 assembly: 67,416 variants from chromosome 21 and the whole-genome set of 4,474,140 variants. Each tool was configured to use the Ensembl release 75 gene set, with options configured for the fastest runtime. Runtime and speed in variants per second (v/s) are shown. *SnpEff was run in threaded mode, but multiple warnings and errors were produced during these runs.
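The speeds in the table follow from dividing each dataset's variant count by the wall-clock time in seconds. The short Python check below recomputes them from the reported times and counts; the function and variable names are illustrative.

```python
def variants_per_second(n_variants, minutes, seconds):
    """Throughput given a variant count and a runtime in minutes and seconds."""
    return n_variants / (minutes * 60 + seconds)


CHR21_VARIANTS = 67416      # chromosome 21 dataset
ALL_VARIANTS = 4474140      # whole-genome dataset

# (minutes, seconds) for the chromosome 21 run and the whole-genome run.
runtimes = {
    "Annovar": ((0, 38.933), (21, 50.037)),
    "SnpEff": ((1, 46.178), (46, 39.142)),
    "SnpEff (threaded)": ((1, 21.046), (10, 28.274)),
    "VEP": ((0, 47.216), (62, 9.107)),
}

speeds = {}
for tool, (chr21_time, all_time) in runtimes.items():
    speeds[tool] = (
        round(variants_per_second(CHR21_VARIANTS, *chr21_time)),
        round(variants_per_second(ALL_VARIANTS, *all_time)),
    )
    print(f"{tool}: {speeds[tool][0]} v/s (chr 21), {speeds[tool][1]} v/s (all)")
```

For example, VEP's whole-genome run took 62 × 60 + 9.107 = 3729.107 s, and 4,474,140 / 3729.107 ≈ 1200 v/s, matching the table.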