Abhishek Kumar1, Obul Reddy Bandapalli2, Nagarajan Paramasivam3,4, Sara Giangiobbe5, Chiara Diquigiovanni6, Elena Bonora6, Roland Eils3,7, Matthias Schlesner3,8, Kari Hemminki5,9, Asta Försti5,9.
Abstract
Whole-genome sequencing methods in familial cancer are useful to unravel rare, clinically important cancer-predisposing variants. Here, we present improvements in our pedigree-based familial cancer variant prioritization pipeline, referred to as FCVPPv2, including 12 tools for evaluating deleteriousness and 5 intolerance scores for missense variants. This pipeline is also capable of assessing non-coding regions by combining FANTOM5 data with sets of tools like Bedtools, ChromHMM, Miranda, SNPnexus and Targetscan. We tested this pipeline in a family with a history of papillary thyroid cancer. Only one variant, causing the amino acid change G573R (dbSNP ID rs145736623, NM_019609.4:exon11:c.G1717A:p.G573R) in the carboxypeptidase gene CPXM1, survived our pipeline. This variant is located in a region that is highly conserved across vertebrates, in the peptidase_M14 domain (Pfam ID PF00246). The CPXM1 gene may be involved in adipogenesis and extracellular matrix remodelling, and it has been suggested to be a tumour suppressor in breast cancer. However, the presence of the variant in the ExAC database suggests that it is a rare polymorphism or a low-penetrance risk allele. Overall, our pipeline is a comprehensive approach for the prediction of predisposing variants in high-risk cancer families, for which functional characterization is a crucial step to confirm their role in cancer predisposition.
Year: 2018 PMID: 30072699 PMCID: PMC6072708 DOI: 10.1038/s41598-018-29952-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Summary of the familial cancer variant prioritization pipeline version 2 (FCVPPv2). The pipeline uses the Platypus tool[13] for joint variant calling after mapping of the sequencing reads from cases and controls. FCVPPv2 uses several external tools and resources for variant annotation, namely ExAC, 1000 Genomes phase III data, ANNOVAR, dbNSFPv3, dbSNP and EVS6500. Candidate variants are filtered on read quality parameters: coverage must be >5 and the quality score (QUAL) must be >20. The minor allele frequency (MAF) must be below 0.1% in the European populations in all databases used. The variants are then screened with respect to the family pedigree, which is the most critical step in germline genomics (shown in black shade). After this step, variants are ranked with the help of CADD v1.3[20]; variants with a CADD PHRED score of >10 belong to the top 10% of probable functional and deleterious variants in the human genome. These deleterious variants are subsequently divided into 4 categories based on their location. Coding variants are considered deleterious based on the consensus of 12 deleteriousness prediction tools and 5 intolerance scores. Variants in the 5′ UTR are considered regulatory based on Haploreg V4.1[21], RegulomeDB[22] and SNPnexus[9], while variants in the 3′ UTR are considered regulatory if supported by the presence of a miRNA binding site according to the Miranda[7] and Targetscan 7.0[8] tools, with additional hints from Haploreg V4.1[21] and RegulomeDB[22]. For variants in the non-coding segments we combined several state-of-the-art tools such as chromHMM, Segway and FunSeq2 with FANTOM5 data. Non-coding (intergenic and intronic) variants may not always have CADD > 10 even though they have regulatory implications, so we analyzed all non-coding variants after pedigree segregation, with or without CADD > 10.
Putative deleterious or regulatory variants are visualized using Locuszoom, SniPA and the UCSC genome browser. Potential variants are also checked against sets of additional features, e.g. the list of known CPGs[1], clinically relevant variants (ClinVar), expression data and somatic mutations. We also checked the sequencing data of all cases and controls in a particular family for correctness using the IGV browser.
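The filtering steps summarized in the Figure 1 legend can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FCVPPv2 code: the `Variant` class, its field names and the function names are hypothetical, and the thresholds are the ones quoted in the legend. The helper `cadd_top_fraction` uses the standard PHRED scaling, under which a score of 10 corresponds to the top 10% of variants.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record; field names are illustrative, not from FCVPPv2.
    coverage: int        # read depth at the site
    qual: float          # variant quality score (QUAL)
    max_maf: float       # highest European MAF across databases, as a fraction
    segregates: bool     # True if the variant follows the disease in the pedigree
    cadd_phred: float    # CADD PHRED-scaled score
    coding: bool = True  # non-coding variants are kept with or without CADD > 10


def passes_filters(v: Variant) -> bool:
    """Apply the legend's thresholds in order: quality, MAF, pedigree, CADD."""
    if not (v.coverage > 5 and v.qual > 20):
        return False
    if not v.max_maf < 0.001:  # MAF below 0.1%
        return False
    if not v.segregates:
        return False
    # The CADD cut-off is only enforced for coding variants.
    return v.cadd_phred > 10 if v.coding else True


def cadd_top_fraction(phred: float) -> float:
    """Fraction of variants ranked at or above a given PHRED-scaled score."""
    return 10 ** (-phred / 10)
```

With these definitions, a non-coding variant with a low CADD score still passes as long as it meets the quality, frequency and segregation criteria.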
Figure 2. Overview of strategies for regulatory variant detection in the non-coding segments of the human genome. We utilized the FANTOM5 data via the SlideBase tool (slidebase.binf.ku.dk), with 32693 enhancers and 184476 promoters (downloaded in March 2017). We matched our pedigree-segregated variants with the FANTOM5 data using Bedtools intersect to retrieve a list of potentially critical variants localized within promoters and/or enhancers, and we examined the status of transcription factor (TF) binding sites using SNPnexus[9]. We checked the signals for chromatin binding using ChromHMM and genomic segmentation data from Segway via CADD v1.3[20]. In addition, we examined whether the putative non-coding regulatory variants were localized in ultra-conserved non-coding elements (UCNEs) and their clusters, also known as ultra-conserved genomic regulatory blocks (UGRBs), with the help of UCNEbase[24], and whether these variants were located in ultra-sensitive and sensitive regions (Ultrasen), as defined by FunSeq2[23]. The top-ranked variants were examined for their regulatory nature using the Locuszoom, SniPA, UCSC and ZENBU genome browsers. We also examined whether the putative enhancer variants fall into the category of super-enhancers using the super-enhancer archive (SEA)[25] and dbSUPER[26]. Expression profiles, RNA-seq data-based information, and motif changes and disruptions were gathered from FANTOM5 data via SlideBase.
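The matching of variants to promoter and enhancer intervals can be illustrated with a small pure-Python stand-in for an interval intersection. This is a sketch of the idea only and does not call Bedtools; the function name and the example coordinates are hypothetical. It assumes regions follow the BED convention (0-based start, exclusive end) and variant positions are 1-based.

```python
def variants_in_regions(variants, regions):
    """Return (chrom, pos, region_name) for every variant inside a region.

    variants: iterable of (chrom, pos) with 1-based positions.
    regions:  iterable of (chrom, start, end, name) in BED coordinates
              (0-based start, exclusive end).
    """
    # Group regions by chromosome so each variant is compared only
    # against intervals on its own chromosome.
    by_chrom = {}
    for chrom, start, end, name in regions:
        by_chrom.setdefault(chrom, []).append((start, end, name))

    hits = []
    for chrom, pos in variants:
        for start, end, name in by_chrom.get(chrom, []):
            # Convert the 1-based position to 0-based before testing.
            if start <= pos - 1 < end:
                hits.append((chrom, pos, name))
    return hits


# Hypothetical coordinates, for illustration only.
example_regions = [("chr1", 100, 200, "enhancer_A"), ("chr2", 50, 60, "promoter_B")]
example_variants = [("chr1", 150), ("chr1", 201), ("chr2", 60), ("chr3", 10)]
example_hits = variants_in_regions(example_variants, example_regions)
```

A linear scan like this is adequate for a handful of candidate variants; for genome-wide sets, a sorted or indexed interval structure (as Bedtools uses) would be the more practical choice.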
Summary of intolerance scores and conservational scores.
| Tools | Details | Score range | Significant score |
|---|---|---|---|
| Residual Variation Intolerance Score (RVIS) | Based upon allele frequency data | Negative to positive | RVIS < 0: intolerant |
| RVIS - ExAC data set | | | |
| RVIS - local data set | | | |
| pLI score | Developed by ExAC Consortium for loss-of-function (LoF) mutations | | pLI ≥ 0.9: highly LoF-intolerant |
| Z-score | Developed by ExAC Consortium for missense and synonymous variants | | Positive Z-scores: intolerant |
| Genomic Evolutionary Rate Profiling (GERP) | | −12.3 to 6.17 | >2 |
| PhastCons | | 0 to 1 | >0.3 |
| Phylogenetic P-value (PhyloP) | | −14 to +6 | ≥3.0 |
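The significance cut-offs in this table translate directly into simple threshold checks. The following Python sketch is illustrative only: the dictionary keys and the function name are hypothetical, and the cut-offs are copied from the table.

```python
# Cut-offs as listed in the table; key names are illustrative.
INTOLERANCE_RULES = {
    "RVIS": lambda s: s < 0,          # negative RVIS: intolerant gene
    "pLI": lambda s: s >= 0.9,        # highly LoF-intolerant
    "Z": lambda s: s > 0,             # positive Z-score: intolerant
    "GERP": lambda s: s > 2,
    "PhastCons": lambda s: s > 0.3,
    "PhyloP": lambda s: s >= 3.0,
}


def intolerance_flags(scores):
    """Map each supplied score to True if it meets its cut-off.

    Scores that are missing from the input are simply skipped.
    """
    return {
        name: rule(scores[name])
        for name, rule in INTOLERANCE_RULES.items()
        if name in scores
    }
```

Returning one flag per score, rather than a single verdict, keeps the evidence visible when scores disagree.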
Summary of used tools for deleteriousness prediction for missense variants.
| Tools | Methodology | Score range | Prediction |
|---|---|---|---|
| Sorting Intolerant from Tolerant (SIFT) | Position-specific scoring matrix (PSSM) with Dirichlet priors | 0 to 1* | D – Damaging (<0.05) |
| Polymorphism Phenotyping version-2 (PolyPhen-v2) | Naïve Bayes classifier trained using supervised machine learning | 0 to 1** | D – Probably damaging (0.957–1) |
| PolyPhen2_HDIV (HumDiv$) | | | |
| PolyPhen2_HVAR (HumVar%) | | | |
| Log ratio test (LRT) | Uses log ratio test | 0 to 1*** | D – Deleterious |
| MutationTaster | Naïve Bayes model operated on the integrated data source | 0 to 1** | A – disease_causing_automatic |
| MutationAssessor | Multiple sequence alignment (MSA) and conservation scores | −5.135 to 6.49** | H – High |
| Functional Analysis Through Hidden Markov Models (FATHMM) | Hidden Markov models (HMM) | −18.09 to 11.0* | D – Damaging (≤ −1.5) |
| MetaSVM | Support vector machine (SVM) based score, derived by incorporating different scores# | −2 to 3** | D – Damaging (>0) |
| MetaLR | Logistic regression (LR) based score, derived by combining different scores# | 0 to 1** | D – Damaging (>0.5) |
| Variant Effect Scoring Tool version 3 (VEST3) | Supervised machine learning-based method | 0 to 1** | NA |
| Protein Variation Effect Analyzer (PROVEAN) | Pair-wise alignment-based scoring method | −14 to 14* | D – Damaging (≤ −2.5) |
| Reliability index (RI) | SVM based | 0 to 10** | D – Damaging (≥5) |
*Lower scores indicate deleterious nature.
**Higher scores indicate deleterious nature.
***Score cannot decide deleterious nature.
$HumDiv - collection of Mendelian disease variants (5564 deleterious + 7539 neutral in 978 human proteins) against divergence from close mammalian homologs of human proteins (≥ 95% sequence identity).
%HumVar - compilation of all human variants (22196 deleterious + 21119 neutral) associated with some disease (non-cancer mutations) or loss of activity/function vs. common (MAF > 1%) human polymorphisms with no reported association with a disease.
#10 scores from SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, MutationAssessor, FATHMM, LRT, SiPhy and PhyloP, and the maximum frequency observed in the 1000 Genomes data.
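For the tools in this table that report a numeric cut-off, a consensus can be tallied by counting how many call a variant damaging. The sketch below is an illustration, not the pipeline's implementation: the names are hypothetical, only tools with a numeric threshold in the table are included, and the `min_fraction` parameter is an assumption, since no specific consensus cut-off is given here.

```python
# Numeric cut-offs taken from the table. Tools with categorical
# predictions only (LRT, MutationTaster, MutationAssessor) or no
# cut-off (VEST3) are left out of this sketch.
DAMAGING_RULES = {
    "SIFT": lambda s: s < 0.05,       # lower scores are deleterious
    "PolyPhen2": lambda s: s >= 0.957,
    "FATHMM": lambda s: s <= -1.5,    # lower scores are deleterious
    "MetaSVM": lambda s: s > 0,
    "MetaLR": lambda s: s > 0.5,
    "PROVEAN": lambda s: s <= -2.5,   # lower scores are deleterious
    "RI": lambda s: s >= 5,
}


def damaging_calls(scores):
    """Return the sorted names of tools that call the variant damaging."""
    return sorted(
        tool for tool, s in scores.items()
        if tool in DAMAGING_RULES and DAMAGING_RULES[tool](s)
    )


def is_consensus_damaging(scores, min_fraction=0.6):
    """True if at least min_fraction of the scored tools call it damaging.

    min_fraction is a hypothetical parameter chosen for illustration.
    """
    scored = [t for t in scores if t in DAMAGING_RULES]
    if not scored:
        return False
    return len(damaging_calls(scores)) / len(scored) >= min_fraction
```

Note the mixed score directions: SIFT, FATHMM and PROVEAN flag low scores, while the other tools flag high scores, so each rule has to encode its own comparison.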
Figure 3. Summary of the papillary thyroid cancer (PTC) family and variant ranking within this family. (A) Pedigree of the PTC family. (B) Variant ranking for the PTC family and selection of the CPXM1 variant as the top deleterious variant.
Overview of the 7 top-ranked germline variants detected in the PTC family.
| Gene Name | Gene Description | Variant | Variant nomenclature$ | Variant type | No. of cases | No. of unknown cases | ANNOVAR Annotation | Exonic Classification | CADD score |
|---|---|---|---|---|---|---|---|---|---|
| C1orf27 | chromosome 1 open reading frame 27 | 1_186355211_G_A | NM_017847.5:exon4:c.G326A:p.R109H | SNVs | 2 | 0 | exonic | nonsynonymous SNV | 25.1 |
| FAM129A | family with sequence similarity 129, member A | 1_184792402_T_C | NM_052966.3:exon8:c.A884G:p.K295R | SNVs | 2 | 0 | exonic | nonsynonymous SNV | 23.9 |
| ZBTB41 | zinc finger and BTB domain containing 41 | 1_197128680_C_T | NM_194314.2:exon10:c.G2539A:p.D847N | SNVs | 2 | 0 | exonic | nonsynonymous SNV | 23.1 |
| CPXM1 | carboxypeptidase X (M14 family), member 1 | 20_2776248_C_T | NM_019609.4:exon11:c.G1717A:p.G573R | SNVs | 2 | 0 | exonic | nonsynonymous SNV | 32 |
| KCNE3 | potassium voltage-gated channel, Isk-related family, member 3 | 11_74167200_AATAT_A | NM_005472.4:c.*1097–1097delATAT | Indel | 2 | 0 | ncRNA_UTR3 | . | 11 |
| AR | androgen receptor | X_66765158_T_TGCAGCAGCA | NM_000044.3:c.239_240insGCAGCAGCA | Indel | 2 | 0 | exonic | nonframeshift insertion | 12.8 |
| NLK | nemo-like kinase | 17_26522009_T_TCACA | NM_016231.4:c.*347_*348insCACA | Indel | 2 | 0 | UTR3 | . | 11.7 |
$ - as per guidelines of the Human Genome Variation Society (HGVS, website http://www.hgvs.org/).
Figure 4. Overview of the top missense variants in the PTC family. (A) The 4 top-ranked variants are shown with their favorable and unfavorable features. Grantham scores: 0–50, conservative; 51–100, moderately conservative; 101–150, moderately radical; ≥151, radical. (B) Location of the G573R variant in the peptidase M14 domain of CPXM1. (C) The G573R variant is localized in a highly conserved region. CPXM1 protein sequences were downloaded from GenBank for human (GenBank ID NP_062555.1), gorilla (XP_004061758.1), cat (XP_003983774.1), pig (XP_003134381.1), seal (XP_021544821.1), lizard (XP_008120663.1), Xenopus (XP_002936314.1), catfish (XP_017320329.1), carp (XP_018934262.1), molly (XP_014844715.1) and zebrafish (XP_693256.4).
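The Grantham score bins quoted in the Figure 4 legend can be expressed as a small lookup function. This is a minimal sketch; the function name is hypothetical and the bin edges are those given in the legend.

```python
def grantham_category(score: int) -> str:
    """Classify an amino acid substitution by its Grantham score."""
    if score < 0:
        raise ValueError("Grantham scores are non-negative")
    if score <= 50:
        return "conservative"
    if score <= 100:
        return "moderately conservative"
    if score <= 150:
        return "moderately radical"
    return "radical"
```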