Véronique Geoffroy, Cécile Pizot, Claire Redin, Amélie Piton, Nasim Vasli, Corinne Stoetzel, André Blavier, Jocelyn Laporte, Jean Muller.
Abstract
Background. Most genetic disorders are caused by single nucleotide variations (SNVs) or small insertions/deletions (indels). High-throughput sequencing has broadened the catalogue of human variation, including common polymorphisms, rare variations and disease-causing mutations. However, identifying one variation among hundreds or thousands of others is still a complex task for biologists, geneticists and clinicians. Results. We have developed VaRank, a command-line tool for the ranking of genetic variants detected by high-throughput sequencing. VaRank scores and prioritizes variants annotated either by Alamut Batch or SnpEff. A barcode allows users to quickly view the presence/absence of variants (with homozygote/heterozygote status) in analyzed samples. VaRank supports the commonly used VCF input format for variant analysis, thus allowing it to be easily integrated into NGS bioinformatics analysis pipelines. VaRank has been successfully applied to disease-gene identification as well as to molecular diagnostics setup for several hundred patients. Conclusions. VaRank is implemented in Tcl/Tk, a platform-independent scripting language, but has been tested only in Unix environments. The source code is available under the GNU GPL and, together with sample data and detailed documentation, can be downloaded from http://www.lbgi.fr/VaRank/.
Keywords: Annotation; Barcode; Human genetics; Molecular diagnostic; Mutation detection; Next generation sequencing; Software; Variant ranking
Year: 2015 PMID: 25780760 PMCID: PMC4358652 DOI: 10.7717/peerj.796
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. High throughput sequencing data analysis workflow and VaRank positioning.
Figure 2. VaRank’s workflow.
The workflow is separated into 4 major steps: (i) sequencing data from a single or from multiple VCF files are integrated, including a variant call quality summary; (ii) annotation of each variant, including genetic and predictive information (functional impact, putative effects in protein coding regions, population frequency, phenotypic features…) from different sources, performed by either Alamut Batch or SnpEff; (iii) presence/absence of variants (with homozygote/heterozygote status) within all samples, represented in a barcode; and (iv) prioritization, to score and rank variants according to their predicted pathogenic status. The final output files are available for each sample.
Figure 3. Barcode.
(A) The barcode represents the SNV’s zygosity status in an ordered list of samples. Samples that are homozygous for the reference allele are represented using “0,” heterozygous variants are represented using “1” and homozygous variants are represented with “2.” (B) Selected annotations from the VaRank output representing 3 variants from a single patient. The barcode gives an overview of the presence/absence of one variant in all other patients analyzed. The family barcode gives a user ordered view of the presence/absence of one variant in a selection of patients. Together with this, the total counts of alleles are given in the last 4 columns. (C) Example of pedigrees and barcodes that can be specifically used in family analyses such as trio exome sequencing. On the left, homozygous mutations in a consanguineous family could be highlighted by the “121” barcode indicating homozygous variants (“2”) in the proband inherited from heterozygous parents (“1”). On the right de novo variants in the proband could be highlighted with the proposed barcode “010.”
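The barcode encoding described in this caption can be illustrated with a minimal Python sketch. It is not VaRank's own code (VaRank is written in Tcl/Tk); the function names, the zygosity labels and the father/proband/mother ordering of the trio are assumptions made for the example, as is the choice to treat samples without a call as homozygous reference.

```python
from typing import Dict, List

# Digit codes taken from the caption: 0 = homozygous reference,
# 1 = heterozygous variant, 2 = homozygous variant.
# The dictionary keys are hypothetical labels chosen for this sketch.
ZYGOSITY_CODE = {"hom_ref": "0", "het": "1", "hom_var": "2"}


def build_barcode(sample_order: List[str], zygosity: Dict[str, str]) -> str:
    """Concatenate one digit per sample, following the given sample order.

    Samples absent from `zygosity` are treated as homozygous reference
    (an assumption of this sketch).
    """
    return "".join(ZYGOSITY_CODE[zygosity.get(s, "hom_ref")] for s in sample_order)


def matches_family_pattern(family_barcode: str, pattern: str) -> bool:
    """Return True when a family barcode equals the searched pattern."""
    return family_barcode == pattern


# Assumed ordering of the trio: father, proband, mother.
trio = ["father", "proband", "mother"]

recessive = build_barcode(
    trio, {"father": "het", "proband": "hom_var", "mother": "het"}
)
de_novo = build_barcode(trio, {"proband": "het"})

print(recessive, matches_family_pattern(recessive, "121"))
print(de_novo, matches_family_pattern(de_novo, "010"))
```

Because the family barcode is user-ordered, the pattern to search for depends on the order in which the family members are listed.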
Summary description of the annotations provided by VaRank using Alamut Batch.
| Column name | Annotation |
|---|---|
| VariantID | Variant identifier [#chr]_[genomicposition]_[RefBase]_[VarBase] |
| Gene | Gene symbol |
| omimId | OMIM® id |
| TranscriptID | RefSeq transcript id |
| TranscriptLength | Length of transcript (full cDNA length) |
| Chr | Chromosome of variant |
| Start | Start position of variant |
| End | End position of variant |
| Ref | Nucleotide sequence in the reference genome (restricted to 50bp) |
| Mut | Alternate nucleotide sequence (restricted to 50bp) |
| Uniprot | Uniprot |
| protein | Protein id (NCBI) |
| posAA | Amino acid position |
| wtAA_1 | Reference codon |
| varAA_1 | Alternate codon |
| Phred_QUAL | QUAL: The Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is −10 * log(1 − |
| HomHet | Homozygote or heterozygote status |
| TotalReadDepth | Total number of reads covering the position |
| VarReadDepth | Number of reads supporting the variant |
| %Reads_variation | Percent of reads supporting variant over those supporting reference sequence/base |
| VarType | Variant Type (substitution, deletion, insertion, duplication, delins) |
| CodingEffect | Variant Coding effect (synonymous, missense, nonsense, in-frame, frameshift, start loss, stop loss) |
| VarLocation | Variant location (upstream, 5’UTR, exon, intron, 3’UTR, downstream) |
| Exon | Exon (nearest exon if intronic variant) |
| Intron | Intron |
| gNomen | Genomic-level nomenclature |
| cNomen | cDNA-level nomenclature |
| pNomen | Protein-level nomenclature |
| rsID | dbSNP variation |
| rsValidation | dbSNP validated status |
| rsClinicalSignificance | dbSNP variation clinical significance |
| rsAncestralAllele | dbSNP ancestral allele |
| rsHeterozygosity | dbSNP variation average heterozygosity |
| rsMAF | dbSNP variation global minor allele frequency |
| rsMAFAllele | dbSNP variation global minor allele |
| rsMAFCount | dbSNP variation sample size |
| 1000g_AF | 1,000 genomes global allele frequency |
| 1000g_AFR_AF | 1,000 genomes allele frequency in African population |
| 1000g_SAS_AF | 1,000 genomes allele frequency in South Asian population |
| 1000g_EAS_AF | 1,000 genomes allele frequency in East Asian population |
| 1000g_EUR_AF | 1,000 genomes allele frequency in European population |
| espRefEACount | ESP reference allele count in European American population |
| espRefAACount | ESP reference allele count in African American population |
| espRefAllCount | ESP reference allele count in all population |
| espAltEACount | ESP alternate allele count in European American population |
| espAltAACount | ESP alternate allele count in African American population |
| espAltAllCount | ESP alternate allele count in all population |
| espEAMAF | Minor allele frequency in European American population |
| espAAMAF | Minor allele frequency in African American population |
| espAllMAF | Minor allele frequency in all population |
| espAvgReadDepth | Average sample read Depth |
| delta MESscore (%) | % difference between the MaxEntScan splice score of the variant and that of the reference base |
| wtMEScore | WT seq. MaxEntScan score |
| varMEScore | Variant seq. MaxEntScan score |
| delta SSFscore (%) | % difference between the SpliceSiteFinder splice score of the variant and that of the reference base |
| wtSSFScore | WT seq. SpliceSiteFinder score |
| varSSFScore | Variant seq. SpliceSiteFinder score |
| delta NNSscore (%) | % difference between the NNSPLICE splice score of the variant and that of the reference base |
| wtNNSScore | WT seq. NNSPLICE score |
| varNNSScore | Variant seq. NNSPLICE score |
| DistNearestSS | Distance to Nearest splice site |
| NearestSS | Nearest splice site |
| localSpliceEffect | Splicing effect in variation vicinity (New donor Site, New Acceptor Site, Cryptic Donor Strongly Activated, Cryptic Donor Weakly Activated, Cryptic Acceptor Strongly Activated, Cryptic Acceptor Weakly Activated) |
| SiftPred | SIFT prediction |
| SiftWeight | SIFT score ranges from 0 to 1. The amino acid substitution is predicted damaging if the score is <=0.05, and tolerated if the score is >0.05. |
| SiftMedian | SIFT median ranges from 0 to 4.32. This is used to measure the diversity of the sequences used for prediction. A warning will occur if this is greater than 3.25 because this indicates that the prediction was based on closely related sequences. The number should be between 2.75 and 3.5 |
| PPH2pred | PolyPhen-2 prediction using HumVar model are either “neutral, possibly damaging, probably damaging” or “neutral, deleterious” depending on the annotation engine. |
| phyloP | phyloP |
| PhastCons | PhastCons score |
| GranthamDist | Grantham distance |
| VaRank_VarScore | Prioritization score according to VaRank |
| AnnotationAnalysis | Yes or No indicates whether the variation could be annotated by any annotation engine |
| Avg_TotalDepth | Total read depth average at the variant position for all samples analyzed that have the variation |
| SD_TotalDepth | Standard deviation associated with Avg_TotalDepth |
| Count_TotalDepth | Number of samples considered for the average total read depth |
| Avg_SNVDepth | Variation read depth average at the variant position for all samples analyzed that have the variation |
| SD_SNVDepth | Standard deviation associated with Avg_SNVDepth |
| Count_SNVDepth | Number of samples considered for the average SNV read depth |
| familyBarcode | Homozygote or heterozygote status for the sample of interest and its associated samples |
| Barcode | Homozygote or heterozygote status for all samples analyzed together (Hom: 2; Het: 1; sample names are given in the first line of the file: ## Barcode) |
| Hom_Count | Number of homozygotes over all samples analyzed together |
| Het_Count | Number of heterozygotes over all samples analyzed together |
| Allele_Count | Number of alleles supporting the variant |
| Sample_Count | Total number of samples |
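
The count columns at the end of the table can be derived from the barcode itself. The Python sketch below shows one plausible derivation, together with the VariantID layout given in the first row of the table. The function names are hypothetical, and computing Allele_Count as two alleles per homozygote plus one per heterozygote is an assumption about how "number of alleles supporting the variant" is obtained, not a statement about VaRank's implementation.

```python
from typing import Dict


def variant_id(chrom: str, position: int, ref: str, alt: str) -> str:
    """Build an identifier following [#chr]_[genomicposition]_[RefBase]_[VarBase]."""
    return f"{chrom}_{position}_{ref}_{alt}"


def barcode_counts(barcode: str) -> Dict[str, int]:
    """Summarize a barcode string made of the digits 0, 1 and 2."""
    hom = barcode.count("2")  # homozygous for the variant
    het = barcode.count("1")  # heterozygous for the variant
    return {
        "Hom_Count": hom,
        "Het_Count": het,
        # Assumption: each homozygote carries two variant alleles,
        # each heterozygote carries one.
        "Allele_Count": 2 * hom + het,
        "Sample_Count": len(barcode),
    }


example_id = variant_id("1", 12345, "A", "G")
example_counts = barcode_counts("0120010")
print(example_id)
print(example_counts)
```

For the seven-sample barcode "0120010", this yields one homozygote, two heterozygotes and four variant alleles.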
Figure 4. Distribution of variants in 180 patients for 217 genes.
The gray line represents the distribution of the number of variants identified in each sample in a cohort of 180 patients sequenced for 217 genes. The dark line represents the cumulative number of non-redundant (NR) variants in the same dataset due to each new sample added.
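The two curves described in this caption can be computed as in the following Python sketch: a per-sample variant count and a running count of non-redundant variants as each sample is added. The function name and the toy cohort are hypothetical; the variant identifiers only serve as hashable keys.

```python
from typing import Dict, List, Set, Tuple


def variant_accumulation(samples: Dict[str, Set[str]]) -> List[Tuple[str, int, int]]:
    """Return (sample, variants in sample, cumulative non-redundant variants).

    Samples are processed in the insertion order of the dictionary.
    """
    seen: Set[str] = set()
    rows: List[Tuple[str, int, int]] = []
    for name, variants in samples.items():
        seen |= variants  # add any variant not observed in earlier samples
        rows.append((name, len(variants), len(seen)))
    return rows


# Toy cohort, invented for illustration only.
cohort = {
    "S1": {"1_100_A_G", "2_200_C_T"},
    "S2": {"1_100_A_G", "3_300_G_A"},
    "S3": {"2_200_C_T"},
}

accumulation = variant_accumulation(cohort)
for row in accumulation:
    print(row)
```

In this toy cohort the cumulative count stops growing at the third sample, because that sample carries no variant that was not already seen.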
Scoring scheme description.
Scores in bold reflect score values after the adjustment score is applied.
| Variant category | Option name | VaRank score | Definitions |
|---|---|---|---|
| Known mutation | | 110 | Known mutation as annotated by HGMD and/or dbSNP (rsClinicalSignificance = “pathogenic/probable-pathogenic”). |
| Nonsense | | 100 | A single-base substitution in DNA resulting in a STOP codon (TGA, TAA or TAG). |
| Frameshift | | 100 | Exonic insertion/deletion of a non-multiple of 3bp, often resulting in a premature stop in the reading frame of the gene. |
| Essential splice site | | 90 | Variation in one of the canonical splice sites resulting in a significant effect on splicing. |
| Start loss | | 80 | Variation leading to the loss of the initiation codon (Met). |
| Stop loss | | 80 | Variation leading to the loss of the STOP codon. |
| Intron-exon boundary | | 70 | Variation outside of the canonical splice sites (donor site is −3 to +6, acceptor site −12 to +2). |
| Missense | | 50 | A single-base substitution in DNA resulting in a change in the amino acid. |
| Indel in-frame | | 40 | Exonic insertion/deletion of a multiple of 3bp. |
| Deep intron-exon boundary | | 25 | Intronic variation resulting in a significant effect on splicing. |
| Synonymous coding | | 10 | A single-base substitution in DNA not resulting in a change in the amino acid. |
Notes.
Each variant score is adjusted (+5) if high conservation at the genomic level is observed (phastCons cutoff >0.95).
Missense scores are adjusted (+5) for each deleterious prediction (SIFT and/or PPH2).
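The scoring scheme and its two adjustment notes can be summarized in a short Python sketch. The base scores and the +5 adjustments come from the table and notes above; the category keys and the function name are hypothetical, and applying the conservation adjustment to every category follows the wording of the first note rather than any knowledge of VaRank's internals.

```python
from typing import Optional

# Base scores from the scoring scheme table. The keys are hypothetical labels
# for this sketch, not VaRank option names.
BASE_SCORES = {
    "known_mutation": 110,
    "nonsense": 100,
    "frameshift": 100,
    "essential_splice_site": 90,
    "start_loss": 80,
    "stop_loss": 80,
    "intron_exon_boundary": 70,
    "missense": 50,
    "indel_in_frame": 40,
    "deep_intron_exon_boundary": 25,
    "synonymous_coding": 10,
}

PHASTCONS_CUTOFF = 0.95  # conservation cutoff stated in the notes
ADJUSTMENT = 5           # +5 per adjustment, as stated in the notes


def sketch_score(
    category: str,
    phastcons: Optional[float] = None,
    sift_deleterious: bool = False,
    pph2_deleterious: bool = False,
) -> int:
    """Return the base score plus the adjustments described in the notes."""
    score = BASE_SCORES[category]
    if phastcons is not None and phastcons > PHASTCONS_CUTOFF:
        score += ADJUSTMENT
    if category == "missense":
        # One adjustment per deleterious prediction (SIFT and/or PPH2).
        score += ADJUSTMENT * (int(sift_deleterious) + int(pph2_deleterious))
    return score


# Rank a few invented variants from highest to lowest score.
variants = [
    ("v1", sketch_score("synonymous_coding", phastcons=0.99)),
    ("v2", sketch_score("missense", phastcons=0.99, sift_deleterious=True, pph2_deleterious=True)),
    ("v3", sketch_score("nonsense")),
]
ranked = sorted(variants, key=lambda item: item[1], reverse=True)
print(ranked)
```

Under these assumptions a conserved missense variant predicted deleterious by both SIFT and PPH2 reaches 65, which still ranks below any nonsense or frameshift variant.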
Analysis of 32 patients with Bardet-Biedl syndrome.
From the BBS dataset, mutations and ranking for 32 patients sequenced for 30 genes. The ranking for each mutation has been obtained from the “filteredVariants” output. Mutations in italics are predicted to affect splicing.
| Patient# | Gene | RefSeq | Mutation (cDNA) | Mutation (protein) | Ranking |
|---|---|---|---|---|---|
| P11 | | NM_024649.4 | c.[436C>T];[436C>T] | p.[R146*];[R146*] | Rank 1 |
| ALD6 | | | c.[436C>T];[(592-?)_(830+?)del] | p.[R146*];[?] | Rank 1 |
| ALO47 | | | | p.[R160Q];[R160Q] | Rank 1 |
| AIO57 | | | c.[670G>A];[670G>A] | p.[E224K];[E224K] | Rank 1 |
| AMK19 | | | c.[951+1G>A];[1169T>G] | p.[?];[M390R] | Rank 1, Rank 2 |
| P9 | | | | p.[?];[?] | Rank 1 |
| AKH61 | | | c.[1169T>G];[1169T>G] | p.[M390R];[M390R] | Rank 1 |
| P1 | | | | p.[?];[?] | Rank 2 |
| AHZ63 | | | c.[1473+4T>A];[=] | p.[?];[=] | Rank 1 |
| AGA99 | | NM_031885.3 | | p.[?];[?] | Rank 1 |
| P2 | | | | p.[?];[?] | Rank 1 |
| P7 | | | c.[565C>T];[565C>T] | p.[R189*];[R189*] | Rank 1 |
| ALG76 | | | c.[626T>C];[626T>C] | p.[L209P];[L209P] | Rank 1 |
| AKX44 | | | c.[814C>T];[814C>T] | p.[R272*];[R272*] | Rank 1 |
| AGL23 | | | c.[1992delT];[1992delT] | p.[H665Tfs*675];[H665Tfs*675] | Rank 1 |
| P13 | | NM_152384.2 | c.[149T>G];[149T>G] | p.[L50R];[L50R] | Rank 1 |
| ALG5 | | | c.[413G>A];[413G>A] | p.[R138H];[R138H] | Rank 1 |
| AIZ46 | | NM_018848.3 | c.[3G>A];[110A>G] | p.[M1I];[Y37C] | Rank 1, Rank 2 |
| AIZ62 | | | c.[571G>T];[724G>T] | p.[E191*];[A242S] | Rank 1, Rank 2 |
| P10 | | | | p.[?];[?] | Rank 1 |
| ALB60 | | NM_198428.2 | c.[855del];[855del] | p.[W285*];[W285*] | Rank 1 |
| ALS67 | | NM_024685.3 | c.[271_272insT];[728_731delAAGA] | p.[C91Lfs*95];[K243Ifs*257] | Rank 1, Rank 2 |
| AMA70 | | | c.[271_272insT];[1201G>T] | p.[C91Lfs*95];[G401*] | Rank 1, Rank 2 |
| JSL | | | c.[285A>T];[2119-2120delGT] | p.[R95S];[V707*fs] | Rank 1, Rank 2 |
| P8 | | | c.[1181_1182insGCATTTAT];[1181_1182insGCATTTAT] | p.[S396Lfs*401];[S396Lfs*401] | Rank 1 |
| AMR64 | | | c.[1241T>C];[1241T>C] | p.[L414S];[L414S] | Rank 1 |
| AKR68 | | | c.[1241T>C];[1241T>C] | p.[L414S];[L414S] | Rank 2 |
| ALP79 | | NM_001178007.1 | c.[865G>C];[205C>T(;)1859A>G] | p.[A289P];[L69F(;)Q620R] | Rank 1, Rank 2 |
| ALB64 | | NM_015120.4 | c.[1724C>G];[1724C>G] | p.[S575*];[S575*] | Rank 1 |
| AIA84 | | | c.[3340del];[3340del] | p.[E1112Rfs*1120];[E1112Rfs*1120] | Rank 1 |
| ADC44 | | | c.[7904insC];[7904insC] | p.[N2636Qfs*59];[N2636Qfs*59] | Rank 1 |
| AKO26 | | | c.[10879C>T];[10879C>T] | p.[R3627*];[R3627*] | Rank 1 |
Notes.
The second mutation of the patient, a complete heterozygous deletion of exons 8 and 9 (c.(592-?)_(830+?)del), is a pathogenic CNV that cannot be ranked by VaRank.
Parent of BBS patients; a single heterozygous mutation is expected.
This validated mutation was filtered out in the “filteredVariants” results due to low sequencing quality (only 7 reads supported the variant) but ranked at the second position in the non-filtered results.
Analysis of 203 patients with intellectual disability.
(A) Mutations and ranking in the 25 positive patients from the 107 patients sequenced for 217 genes (Redin et al., 2014). (B) Mutations and ranking from 12 novel positive patients with ID identified in an additional cohort of 96 patients screened for 275 genes. Patients are sorted according to the mode of inheritance and the identified gene. Known mutations (from the literature) are highlighted in bold. A ranking in parentheses gives the rank of variations with a similar score. Modes of inheritance include: AD, autosomal dominant; AR, autosomal recessive; XL, X-linked; XLD, X-linked dominant.
(A)
| Patient# | Sex | Gene | Chromosome | Mode of inheritance | Mutation (cDNA) | Mutation (protein) | Ranking |
|---|---|---|---|---|---|---|---|
| APN-58 | M | | 21 | AD | c.[613C>T];[=] | p.[R205*];[=] | Rank 2 |
| APN-87 | M | | | | c.[621_624delinsGAA];[=] | p.[E208Nfs*3];[=] | Rank 1 |
| APN-63 | M | | 9 | AD | c.[1733C>G];[=] | p.[P578R];[=] | Rank 1 (2) |
| APN-14 | M | | 12 | AD | c.[6118_6125del];[=] | p.[G2040Nfs*32];[=] | Rank 1 (2) |
| APN-46 | M | | 17 | AD | c.[2332_2336del];[=] | p.[G778Efs*7];[=] | Rank 1 |
| APN-122 | F | | 22 | AD | c.[2955_2970dup];[=] | p.[P992Rfs*325];[=] | Rank 1 |
| APN-38 | M | | 1 | AD | c.[724C>T];[=] | p.[E242*];[=] | Rank 2 |
| APN-139 | M | | 6 | AD | c. | p.[?];[?] | Rank 1 |
| APN-41 | M | | 18 | AD | c.[514_517del];[=] | p.[K172Ffs*61];[=] | Rank 1 |
| APN-117 | F | | | | c.[520C>T];[=] | p.[R174*];[=] | Rank 1 |
| APN-138 | M | | X | XL | | | Rank 1 (2) |
| APN-137 | M | | X | XL | c.[811_812del];[0] | p.[E271Aspfs*11];[0] | Rank 1 (2) |
| APN-42 | M | | X | XL | c.[10889del];[0] | p.[R3630Efs*27];[0] | Rank 1 |
| APN-113 | M | | X | XL | | | Rank 1 |
| APN-82 | M | | X | XL | c.[894_903del];[0] | p.[W299Tfs*18];[0] | Rank 1 |
| APN-68 | M | | X | XL | c.[3097C>T];[0] | p.[E1033*];[0] | Rank 1 (2) |
| APN-34 | M | | X | XL | c.[2152G>C];[0] | p.[A718P];[0] | Rank 1 |
| APN-135 | M | | | | c.[1296dup];[0] | p.[E433*];[0] | Rank 1 |
| APN-16 | M | | X | XL | c.[797_798delinsTT];[0] | p.[C266F];[0] | Rank 1 |
| APN-130 | F | | X | XLD | | | Rank 2a |
| APN-142 | F | | | | | | Rank 1a |
| APN-105 | M | | X | XL | c.[1249+5G>C];[0] | p.[Y406Ffs*24];[0] | Rank 4 |
| APN-43 | M | | X | XL | c. | p.[?];[0] | Rank 1 |
| APN-110 | M | | X | XL | | | Rank 1 |
Notes.
Known mutation not annotated as pathogenic in dbSNP.
Figure 5. Representation of the non-redundant variations by functional type in 3 datasets.
The chart is built upon the Intellectual disability (ID) and Bardet-Biedl Syndrome (BBS) (consolidated from 188 patients addressed for BBS) datasets discussed in the Results section together with an enhanced exome dataset (35 exomes). The “truncating” category corresponds to frameshift, nonsense, stoploss and startloss.