Simona De Summa1, Giovanni Malerba2, Rosamaria Pinto1, Antonio Mori3, Vladan Mijatovic3, Stefania Tommasi1.
Abstract
BACKGROUND: NGS technology is a powerful alternative to standard Sanger sequencing in the clinical setting. The proprietary software generally used for variant calling often depends on preset parameters that may not suit every gene. GATK, which is widely used in academia, offers many parameters for variant calling. However, GATK's self-adjusting parameter calibration requires data from a large number of exomes. When these are not available, as is typically the case in a diagnostic laboratory, the operator must set the parameters manually (hard filtering). The aim of the present paper was to set up a procedure for choosing the best parameters for GATK hard filtering, using classification trees on true and false variants from sequences simulated from a real dataset.
Keywords: Indel; NGS; SNV; Targeted gene panel; Variant calling; Variant filtering
Year: 2017 PMID: 28361668 PMCID: PMC5374681 DOI: 10.1186/s12859-017-1537-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
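The abstract describes choosing hard-filter thresholds with classification trees trained on true and false variants. As a minimal sketch of the basic building block of such a tree, the snippet below searches a single annotation for the threshold that best separates the two classes by Gini impurity. The function names and the toy values are hypothetical, and this is not the authors' implementation, which is not shown in this record.

```python
from typing import List, Tuple


def gini(labels: List[bool]) -> float:
    """Gini impurity of boolean labels (True = true variant)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values: List[float], labels: List[bool]) -> Tuple[float, float]:
    """Return (threshold, weighted impurity) of the best rule `value >= threshold`.

    Candidate thresholds are midpoints between consecutive distinct values.
    If no split improves on the unsplit data, the threshold is NaN.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("nan"), gini(labels))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        threshold = (lo + hi) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Toy, made-up genotype-quality values: three true and three false calls.
toy_gq = [99.0, 99.0, 98.0, 70.0, 60.0, 55.0]
toy_is_true = [True, True, True, False, False, False]
threshold, impurity = best_split(toy_gq, toy_is_true)
```

A full classification tree repeats this search recursively over several annotations, which is how multi-condition rules can arise.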
Overall GATK unfiltered alterations identified in HC and LC dataset (100 replicates of a dataset of 26 individuals and 11 genes)
| | TV | FV |
|---|---|---|
| HC dataset | | |
| Indels | 17,359 | 64,281 |
| SNVs | 89,743 | 1,372 |
| LC dataset | | |
| Indels | 7,656 | 80,489 |
| SNVs | 96,203 | 17,043 |
TV true variants, FV false variants
Descriptive statistics of GATK filters in the HC dataset, stratifying calls by type (SNV/Indels). Data are displayed as mean ± sd
| Filter | SNVs TV | SNVs FV | SNVs p-value | Indels TV | Indels FV | Indels p-value |
|---|---|---|---|---|---|---|
| BQRS | 0.11 ± 0.03 | -0.6 ± 0.5 | <0.0001 | 0.28 ± 0.12 | 0.15 ± 0.05 | <0.0001 |
| RPRS | -0.067 ± 0.05 | 0.05 ± 0.4 | 0.0009 | -0.74 ± 0.22 | 0.23 ± 0.05 | <0.0001 |
| CRS | 0.0007 ± 0.02 | -0.009 ± 0.29 | 0.72 | 0.001 ± 0.07 | 0.007 ± 0.03 | 0.6 |
| DP | 96.61 ± 0.58 | 49.25 ± 5.06 | <0.0001 | 109.4 ± 9.57 | 96.01 ± 0.1 | <0.0001 |
| MQ | 60 ± 0 | 59.99 ± 0.07 | - | 60 ± 0 | 60 ± 0 | - |
| MQRS | -0.03 ± 0.02 | -0.05 ± 0.28 | 0.3 | -0.02 ± 0.09 | -0.21 ± 0.04 | <0.0001 |
| GQ | 99 ± 0 | 79.15 ± 12.06 | - | 99 ± 0 | 73.16 ± 2.7 | - |
Each mean is the mean of the per-replicate median values across the 100 replicates
BQRS BaseQRankSum, RPRS ReadPosRankSum, CRS ClippingRankSum, DP depth of coverage, MQ MappingQuality, MQRS MappingQualityRankSum, GQ genotype quality, TV true variants, FV false variants
Descriptive statistics of GATK filters in the LC dataset, stratifying calls by type (SNV/Indels). Data are displayed as mean ± sd
| Filter | SNVs TV | SNVs FV | SNVs p-value | Indels TV | Indels FV | Indels p-value |
|---|---|---|---|---|---|---|
| BQRS | 0.02 ± 0.02 | -0.27 ± 0.12 | <0.0001 | 0.16 ± 0.16 | 0.01 ± 0.03 | <0.0001 |
| RPRS | -0.19 ± 0.03 | -0.31 ± 1.1 | <0.0001 | -1.29 ± 0.3 | 0.04 ± 0.04 | <0.0001 |
| CRS | -0.02 ± 0.02 | -0.05 ± 0.07 | <0.0001 | -0.004 ± 0.12 | -0.04 ± 0.01 | 0.001 |
| DP | 19.97 ± 0.17 | 22.72 ± 1.3 | <0.0001 | 21.83 ± 1.97 | 20.24 ± 0.42 | <0.0001 |
| MQ | 60 ± 0 | 60 ± 0 | - | 60 ± 0 | 60 ± 0 | - |
| MQRS | -0.06 ± 0.01 | -0.1 ± 0.08 | <0.0001 | -0.15 ± 0.15 | -0.1 ± 0.04 | 0.03 |
| GQ | 99 ± 0 | 20.04 ± 2.9 | - | 99 ± 0 | 17.94 ± 0.92 | - |
Each mean is the mean of the per-replicate median values across the 100 replicates
BQRS BaseQRankSum, RPRS ReadPosRankSum, CRS ClippingRankSum, DP depth of coverage, MQ MappingQuality, MQRS MappingQualityRankSum, GQ genotype quality, TV true variants, FV false variants
Performance of the individual filters in discriminating between true and false variants, evaluated by the AUC of the ROC curve and grouped by variant type (SNV or Indel), genotype call (homozygote or heterozygote) and depth of sequencing (LC or HC dataset)
| Filter | SNV homozygous | SNV heterozygous | Indel homozygous | Indel heterozygous |
|---|---|---|---|---|
| HC dataset | ||||
| BQRS | 0.73 | 0.53 | 0.5 | 0.53 |
| RPRS | 0.57 | 0.61 | 0.52 | 0.68 |
| CRS | 0.5 | 0.51 | 0.53 | 0.5 |
| DP | 0.79 | 0.8 | 0.76 | 0.6 |
| MQ | 0.55 | 0.5 | 0.6 | 0.63 |
| MQRS | 0.52 | 0.53 | 0.58 | 0.58 |
| GQ | 0.65 | 0.95 | 0.77 | 0.77 |
| ADT | 0.96 | 0.8 | 0.56 | 0.94 |
| ADTL | 0.8 | 0.77 | 0.72 | 0.94 |
| FS | 0.51 | 0.62 | 0.51 | 0.54 |
| LC dataset | ||||
| BQRS | 0.58 | 0.65 | 0.5 | 0.52 |
| RPRS | 0.53 | 0.5 | 0.54 | 0.74 |
| CRS | 0.52 | 0.5 | 0.51 | 0.51 |
| DP | 0.63 | 0.67 | 0.52 | 0.62 |
| MQ | 0.51 | 0.54 | 0.54 | 0.52 |
| MQRS | 0.52 | 0.52 | 0.51 | 0.5 |
| GQ | 0.79 | 0.99 | 0.53 | 0.97 |
| ADT | 0.98 | 0.99 | 0.5 | 0.92 |
| ADTL | 0.67 | 0.98 | 0.54 | 0.98 |
| FS | 0.5 | 0.54 | 0.5 | 0.54 |
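As a minimal sketch of how a per-filter AUC like those above can be computed, the snippet below uses the rank-based definition: the probability that a randomly chosen true variant has a higher annotation value than a randomly chosen false one, with ties counted as one half. The function names are hypothetical. The second function folds the result so that 0.5 always means "no discrimination" whichever direction the filter points; whether the authors folded their values this way is not stated, so treat it as an assumption.

```python
from typing import Sequence


def auc(true_values: Sequence[float], false_values: Sequence[float]) -> float:
    """P(random true value > random false value), ties counted as 0.5."""
    wins = 0.0
    for t in true_values:
        for f in false_values:
            if t > f:
                wins += 1.0
            elif t == f:
                wins += 0.5
    return wins / (len(true_values) * len(false_values))


def discriminative_auc(true_values: Sequence[float], false_values: Sequence[float]) -> float:
    """Direction-independent AUC (assumption: lower-is-better filters are folded)."""
    a = auc(true_values, false_values)
    return max(a, 1.0 - a)
```

The pairwise loop is quadratic in the number of calls, so for datasets of the size reported here a sort-based (Mann-Whitney) formulation would be preferable; the loop is kept for readability.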
Parameters and their thresholds selected by classification trees
| Sequencing depth | Variant type | Genotype by GATK | Filter rule |
|---|---|---|---|
| 20x | SNV | homozygous | ADT ≥ 0.98 |
| 20x | SNV | heterozygous | ADT < 0.55 |
| 20x | INDEL | homozygous | N/A (*) |
| 20x | INDEL | heterozygous | ADT < 0.26 & GQ ≥ 98.5 & DP ≥ 23.5 & MQ ≥ 59.5 |
| 100x | SNV | homozygous | ADT ≥ 0.96 |
| 100x | SNV | heterozygous | GQ ≥ 68.5 |
| 100x | INDEL | homozygous | ADTL ≥ 5.08 |
| 100x | INDEL | heterozygous | ADT < 0.15 & MQ ≥ 59.91 & GQ ≥ 98.5 |
(*): no reliable filters were selected by classification trees
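The rules in this table can be sketched as a lookup keyed by sequencing depth, variant type and genotype. The snippet assumes that each rule states the condition a call must meet to be retained, and that a call with no reliable rule (the N/A row) is kept unfiltered. `RULES` and `keep_call` are hypothetical names, and how the ADT and ADTL annotations are computed is outside this sketch.

```python
from typing import Callable, Dict, Optional, Tuple

Annotations = Dict[str, float]
Rule = Optional[Callable[[Annotations], bool]]

# Thresholds copied from the table; None marks the row with no reliable filter.
RULES: Dict[Tuple[str, str, str], Rule] = {
    ("20x", "SNV", "homozygous"): lambda a: a["ADT"] >= 0.98,
    ("20x", "SNV", "heterozygous"): lambda a: a["ADT"] < 0.55,
    ("20x", "INDEL", "homozygous"): None,
    ("20x", "INDEL", "heterozygous"): lambda a: (
        a["ADT"] < 0.26 and a["GQ"] >= 98.5 and a["DP"] >= 23.5 and a["MQ"] >= 59.5
    ),
    ("100x", "SNV", "homozygous"): lambda a: a["ADT"] >= 0.96,
    ("100x", "SNV", "heterozygous"): lambda a: a["GQ"] >= 68.5,
    ("100x", "INDEL", "homozygous"): lambda a: a["ADTL"] >= 5.08,
    ("100x", "INDEL", "heterozygous"): lambda a: (
        a["ADT"] < 0.15 and a["MQ"] >= 59.91 and a["GQ"] >= 98.5
    ),
}


def keep_call(depth: str, variant_type: str, genotype: str, annotations: Annotations) -> bool:
    """Return True if the call satisfies the rule for its category (assumed: rule = keep)."""
    rule = RULES[(depth, variant_type, genotype)]
    if rule is None:
        return True  # assumption: no reliable filter means the call is left unfiltered
    return rule(annotations)
```

For example, `keep_call("100x", "SNV", "heterozygous", {"GQ": 99.0})` evaluates the single GQ condition for that category.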
Results by the application of selection parameters and their thresholds on simulated datasets
| | TV (%) | FV (%) | Variants selected by hard filtering (%) |
|---|---|---|---|
| HC dataset | | | |
| Homo SNVs | | | |
| Overall | 2,382 (66.6) | 1,195 (33.4) | 93.9 |
| Selected | 2,238 (98.6) | 31 (1.3) | |
| Het SNVs | | | |
| Overall | 87,361 (99.8) | 177 (0.2) | 98.6 |
| Selected | 86,166 (99.9) | 24 (0.03) | |
| Homo indels | | | |
| Overall | 54 (0.12) | 43,871 (99.8) | 27.7 |
| Selected | 15 (75) | 5 (25) | |
| Het indels | | | |
| Overall | 17,305 (45.8) | 20,410 (54.1) | 84.6 |
| Selected | 14,646 (94) | 935 (6) | |
| LC dataset | | | |
| Homo SNVs | | | |
| Overall | 2,084 (12.38) | 14,721 (87.6) | 96.9 |
| Selected | 2,020 (92.2) | 171 (7.8) | |
| Het SNVs | | | |
| Overall | 95,119 (97.6) | 2,322 (2.3) | 99.4 |
| Selected | 94,602 (99.9) | 80 (0.08) | |
| Homo indels | | | |
| Overall | 154 (0.4) | 45,623 (99.6) | 100 |
| Selected | 154 (0.4) | 45,623 (99.6) | |
| Het indels | | | |
| Overall | 7,502 (17.6) | 34,889 (82.3) | 43 |
| Selected | 3,226 (99.1) | 27 (0.8) | |
Percentages in parentheses refer to the share of unfiltered variants for “overall” calls and to the share of variants retained after hard filtering for “selected” calls; the selection percentage indicates the proportion of variants selected from the total callset. TV true variants, FV false variants
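The percentages in this table can be reproduced from the raw counts. In the sketch below, the parenthesised values are the TV share among all calls in a row, and the last column appears to match the share of true variants retained (selected TV divided by overall TV); that second reading is inferred from the numbers and is not stated in the source. The function name `summarize` is hypothetical.

```python
from typing import Dict


def summarize(overall_tv: int, overall_fv: int, selected_tv: int, selected_fv: int) -> Dict[str, float]:
    """Percentages derived from overall and post-filter TV/FV counts."""
    return {
        # Share of true variants among all unfiltered calls.
        "overall_tv_pct": 100.0 * overall_tv / (overall_tv + overall_fv),
        # Share of true variants among calls that passed hard filtering.
        "selected_tv_pct": 100.0 * selected_tv / (selected_tv + selected_fv),
        # Share of true variants that survived filtering (inferred meaning of the last column).
        "tv_retained_pct": 100.0 * selected_tv / overall_tv,
    }


# HC dataset, heterozygous indels (counts taken from the table).
hc_het_indels = summarize(17305, 20410, 14646, 935)
```

For the HC heterozygous indels this gives values that agree with the tabulated 45.8, 94 and 84.6 to within rounding.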
Homopolymeric sequences flanking false positive variants
| Call type | Chr | Position | Flanking sequence | No. of occurrences |
|---|---|---|---|---|
| HC dataset | | | | |
| Homo SNVs | chr10 | 131565164 | CCGGT | 77 |
| Homo SNVs | chr3 | 178921420 | GGACT | 73 |
| Het SNVs | chr13 | 48919347 | TAAAC | 63 |
| Het SNVs | chr3 | 178937372 | CTTGG | 9 |
| Homo Indels | chr4 | 55602995 | AGAGC | 1842 |
| Homo Indels | chr10 | 89693016 | AAGTT | 1802 |
| Het Indels | chr13 | 48955363 | AGTTA | 2175 |
| Het Indels | chr3 | 178941853 | CTATC | 1678 |
| LC dataset | | | | |
| Homo SNVs | chr2 | 204736165 | GGGTT | 334 |
| Homo SNVs | chr13 | 48954225 | GGTAA | 241 |
| Het SNVs | chr7 | 140534584 | AAACA | 32 |
| Het SNVs | chr13 | 48955464 | CTTTG | 20 |
| Homo Indels | chr7 | 140481508 | AACAG | 1153 |
| Homo Indels | chr7 | 140481513 | TAAAA | 1084 |
| Het Indels | chr3 | 69915434 | TAAAG | 1202 |
| Het Indels | chr10 | 89693016 | AAGTT | 1107 |
The variant locus is the 6th nucleotide (bold) of the 11-nucleotide string (flanking sequence)