Ruibang Luo, Fritz J Sedlazeck, Tak-Wah Lam, Michael C Schatz.
Abstract
The accurate identification of DNA sequence variants is an important but challenging task in genomics. It is particularly difficult for single molecule sequencing, which has a per-nucleotide error rate of ~5-15%. Meeting this demand, we developed Clairvoyante, a multi-task five-layer convolutional neural network model for predicting variant type (SNP or indel), zygosity, alternative allele and indel length from aligned reads. For the well-characterized NA12878 human sample, Clairvoyante achieves 99.67, 95.78, 90.53% F1-score on 1KP common variants, and 98.65, 92.57, 87.26% F1-score for whole-genome analysis, using Illumina, PacBio, and Oxford Nanopore data, respectively. Training on a second human sample shows Clairvoyante is sample agnostic and finds variants in less than 2 h on a standard server. Furthermore, we present 3,135 variants that are missed using Illumina but supported independently by both PacBio and Oxford Nanopore reads. Clairvoyante is available open-source (https://github.com/aquaskyline/Clairvoyante), with modules to train, utilize and visualize the model.
Year: 2019 PMID: 30824707 PMCID: PMC6397153 DOI: 10.1038/s41467-019-09025-z
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Time per epoch for different GPU and CPU models during model training
| Equipment | Seconds per epoch per 11 M samples |
|---|---|
| GTX 1080 Ti | 170 |
| GTX 980 | 250 |
| GTX Titan | 520 |
| Tesla K40 w/top power setting | 580 |
| Tesla K40 | 620 |
| Tesla K80 (one socket) | 700 |
| GTX 680 | 780 |
| Intel Xeon E5-2680 v4 28-core | 2900 |
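To put the timings in relative terms, each entry can be expressed as a speedup over the CPU row. The snippet below is a minimal sketch: the dictionary and function names are illustrative, and only the numbers come from the table above.

```python
# Seconds per epoch per 11 M samples, copied from the table above.
SECONDS_PER_EPOCH = {
    "GTX 1080 Ti": 170,
    "GTX 980": 250,
    "GTX Titan": 520,
    "Tesla K40 w/top power setting": 580,
    "Tesla K40": 620,
    "Tesla K80 (one socket)": 700,
    "GTX 680": 780,
    "Intel Xeon E5-2680 v4 28-core": 2900,
}

CPU_BASELINE = "Intel Xeon E5-2680 v4 28-core"


def speedup_over_cpu(timings, baseline=CPU_BASELINE):
    """Return {device: baseline_seconds / device_seconds} for every device."""
    base = timings[baseline]
    return {name: base / secs for name, secs in timings.items()}


if __name__ == "__main__":
    speedups = speedup_over_cpu(SECONDS_PER_EPOCH)
    # Print the fastest device first.
    for name, factor in sorted(speedups.items(), key=lambda kv: -kv[1]):
        print(f"{name:32s} {factor:5.1f}x")
```

On these figures the fastest GPU in the table completes an epoch roughly 17 times faster than the 28-core CPU.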
Performance of Clairvoyante on Illumina data at common variant sites in 1KGp3
| Seq. tech. | Model trained on | Trained epochs | Ending learning rate and lambda | Call variants in | Best variant quality cutoff | Overall FPR (%) | Overall FNR (%) | Overall precision (%) | Overall F1 score (%) | SNP FPR (%) | SNP FNR (%) | SNP precision (%) | SNP F1 score (%) | Indel FPR (%) | Indel FNR (%) | Indel precision (%) | Indel F1 score (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Illumina | HG001 | 67a | 1.E−05 | HG001 | 67 | 0.28 | 0.45 | 99.72 | 99.64 | 0.07 | 0.10 | 99.93 | 99.91 | 1.93 | 3.38 | 98.00 | 97.31 |
| Illumina | HG001 | 999 | 1.E−03 | HG001 | 119 | 0.25 | 0.41 | 99.75 | 99.67 | 0.08 | 0.08 | 99.92 | 99.92 | 1.64 | 3.13 | 98.30 | 97.58 |
| Illumina | HG001 | 1499 | 1.E−04 | HG001 | 128 | 0.28 | 0.41 | 99.72 | 99.66 | 0.08 | 0.08 | 99.92 | 99.92 | 1.87 | 3.11 | 98.07 | 97.48 |
| Illumina | HG001 | 1999 | 1.E−05 | HG001 | 147 | 0.29 | 0.42 | 99.71 | 99.64 | 0.08 | 0.09 | 99.92 | 99.92 | 1.95 | 3.24 | 97.98 | 97.37 |
| Illumina | HG001 | 67a | 1.E−05 | HG002 | 58 | 0.32 | 0.51 | 99.68 | 99.59 | 0.11 | 0.15 | 99.89 | 99.87 | 2.11 | 3.58 | 97.82 | 97.12 |
| Illumina | HG001 | 999 | 1.E−03 | HG002 | 107 | 0.30 | 0.49 | 99.70 | 99.61 | 0.11 | 0.14 | 99.89 | 99.87 | 1.94 | 3.47 | 97.99 | 97.26 |
| Illumina | HG001 | 1499 | 1.E−04 | HG002 | 151 | 0.34 | 0.54 | 99.66 | 99.56 | 0.11 | 0.16 | 99.89 | 99.86 | 2.26 | 3.80 | 97.65 | 96.92 |
| Illumina | HG001 | 1999 | 1.E−05 | HG002 | 147 | 0.37 | 0.54 | 99.63 | 99.55 | 0.12 | 0.15 | 99.88 | 99.86 | 2.46 | 3.89 | 97.45 | 96.77 |
| Illumina | HG002 | 66a | 1.E−05 | HG001 | 53 | 0.31 | 0.80 | 99.69 | 99.44 | 0.08 | 0.14 | 99.92 | 99.89 | 2.18 | 6.26 | 97.68 | 95.67 |
| Illumina | HG002 | 999 | 1.E−03 | HG001 | 96 | 0.28 | 0.76 | 99.72 | 99.48 | 0.07 | 0.13 | 99.93 | 99.90 | 2.00 | 6.00 | 97.87 | 95.90 |
| Illumina | HG002 | 1499 | 1.E−04 | HG001 | 134 | 0.33 | 0.81 | 99.67 | 99.43 | 0.08 | 0.15 | 99.92 | 99.88 | 2.37 | 6.34 | 97.48 | 95.53 |
| Illumina | HG002 | 1999 | 1.E−05 | HG001 | 148 | 0.35 | 0.83 | 99.65 | 99.41 | 0.08 | 0.15 | 99.92 | 99.89 | 2.50 | 6.50 | 97.34 | 95.38 |
| Illumina | HG002 | 66a | 1.E−05 | HG002 | 54 | 0.28 | 0.76 | 99.72 | 99.48 | 0.07 | 0.13 | 99.93 | 99.90 | 2.01 | 6.17 | 97.86 | 95.80 |
| Illumina | HG002 | 999 | 1.E−03 | HG002 | 99 | 0.24 | 0.72 | 99.76 | 99.52 | 0.06 | 0.13 | 99.94 | 99.90 | 1.76 | 5.80 | 98.14 | 96.13 |
| Illumina | HG002 | 1499 | 1.E−04 | HG002 | 124 | 0.27 | 0.72 | 99.73 | 99.50 | 0.07 | 0.12 | 99.93 | 99.90 | 1.96 | 5.87 | 97.92 | 95.99 |
| Illumina | HG002 | 1999 | 1.E−05 | HG002 | 132 | 0.28 | 0.73 | 99.72 | 99.50 | 0.07 | 0.12 | 99.93 | 99.90 | 2.03 | 5.94 | 97.84 | 95.91 |
| DeepVariant | | | | | 3 | 0.04 | 0.06 | 99.96 | 99.95 | 0.01 | 0.03 | 99.99 | 99.98 | 0.27 | 0.28 | 99.72 | 99.72 |
| LoFreq | | | | | Benchmarked SNP only | | | | | 10.59 | 0.51 | 85.02 | 91.69 | – | – | – | – |
| GATK UnifiedGenotyper, HG001 | | | | | 3 | 0.19 | 0.35 | 99.81 | 99.73 | 0.10 | 0.07 | 99.90 | 99.91 | 0.80 | 2.43 | 99.14 | 98.35 |
| GATK HaplotypeCaller, HG001 | | | | | 4 | 0.07 | 0.11 | 99.93 | 99.91 | 0.013 | 0.03 | 99.99 | 99.98 | 0.50 | 0.66 | 99.47 | 99.41 |
| GATK UnifiedGenotyper, HG002 | | | | | 3 | 0.20 | 0.37 | 99.80 | 99.71 | 0.12 | 0.09 | 99.87 | 99.89 | 0.73 | 2.52 | 99.20 | 98.33 |
| GATK HaplotypeCaller, HG002 | | | | | 5 | 0.07 | 0.10 | 99.93 | 99.92 | 0.023 | 0.05 | 99.98 | 99.96 | 0.38 | 0.50 | 99.60 | 99.55 |
a Fast training mode
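The F1 column can be cross-checked against the precision and FNR columns. The sketch below assumes the usual definitions, which the table itself does not spell out: recall = 100 − FNR, and F1 is the harmonic mean of precision and recall. The function name and the small set of sample rows are illustrative; the numbers are taken from the first Illumina row of the table above.

```python
def f1_from_precision_and_fnr(precision_pct, fnr_pct):
    """F1 (in %) from precision (%) and false negative rate (%).

    Assumes recall = 100 - FNR and F1 = 2PR / (P + R).
    """
    recall_pct = 100.0 - fnr_pct
    total = precision_pct + recall_pct
    if total == 0:
        return 0.0
    return 2.0 * precision_pct * recall_pct / total


# (label, precision %, FNR %, reported F1 %) from the HG001-on-HG001 fast-mode row.
SAMPLE_ROWS = [
    ("Overall", 99.72, 0.45, 99.64),
    ("SNP", 99.93, 0.10, 99.91),
    ("Indel", 98.00, 3.38, 97.31),
]

if __name__ == "__main__":
    for label, precision, fnr, reported in SAMPLE_ROWS:
        computed = f1_from_precision_and_fnr(precision, fnr)
        print(f"{label:8s} computed={computed:.2f} reported={reported:.2f}")
```

Because the published precision and FNR values are already rounded to two decimals, the recomputed F1 can differ from the reported value in the last digit.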
Performance of Clairvoyante on PacBio data at common variant sites in 1KGp3
| Seq. tech. | Model trained on | Trained epochs | Ending learning rate and lambda | Call variants in | Best variant quality cutoff | Overall FPR (%) | Overall FNR (%) | Overall precision (%) | Overall F1 score (%) | SNP FPR (%) | SNP FNR (%) | SNP precision (%) | SNP F1 score (%) | Indel FPR (%) | Indel FNR (%) | Indel precision (%) | Indel F1 score (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PacBio | HG001 | 50a | 1.E−05 | HG001 | 69 | 1.51 | 7.41 | 98.38 | 95.39 | 0.32 | 1.43 | 99.68 | 99.12 | 10.94 | 60.59 | 76.24 | 51.96 |
| PacBio | HG001 | 999 | 1.E−03 | HG001 | 94 | 1.39 | 7.07 | 98.51 | 95.64 | 0.26 | 1.29 | 99.74 | 99.22 | 10.39 | 58.41 | 78.21 | 54.31 |
| PacBio | HG001 | 1499 | 1.E−04 | HG001 | 89 | 2.17 | 6.06 | 97.70 | 95.78 | 0.25 | 1.18 | 99.75 | 99.28 | 16.44 | 49.38 | 72.02 | 59.45 |
| PacBio | HG001 | 1999 | 1.E−05 | HG001 | 85 | 2.43 | 5.81 | 97.43 | 95.78 | 0.26 | 1.18 | 99.74 | 99.28 | 18.20 | 46.98 | 70.44 | 60.50 |
| PacBio | HG001 | 50a | 1.E−05 | HG002 | 75 | 1.78 | 7.48 | 98.07 | 95.22 | 0.71 | 1.47 | 99.28 | 98.91 | 10.29 | 60.05 | 77.70 | 52.77 |
| PacBio | HG001 | 999 | 1.E−03 | HG002 | 96 | 1.98 | 7.31 | 97.87 | 95.21 | 0.75 | 1.45 | 99.23 | 98.89 | 11.54 | 58.54 | 76.08 | 53.67 |
| PacBio | HG001 | 1499 | 1.E−04 | HG002 | 114 | 2.07 | 7.77 | 97.76 | 94.91 | 0.76 | 1.45 | 99.23 | 98.89 | 12.21 | 63.04 | 72.67 | 49.00 |
| PacBio | HG001 | 1999 | 1.E−05 | HG002 | 123 | 1.97 | 7.94 | 97.86 | 94.87 | 0.75 | 1.44 | 99.24 | 98.90 | 11.50 | 64.74 | 73.09 | 47.57 |
| PacBio | HG002 | 72a | 1.E−05 | HG001 | 56 | 1.63 | 9.22 | 98.20 | 94.35 | 0.49 | 2.55 | 99.49 | 98.46 | 10.72 | 68.46 | 72.43 | 43.94 |
| PacBio | HG002 | 999 | 1.E−03 | HG001 | 99 | 1.69 | 8.47 | 98.16 | 94.73 | 0.57 | 1.84 | 99.42 | 98.79 | 10.62 | 67.39 | 73.31 | 45.14 |
| PacBio | HG002 | 1499 | 1.E−04 | HG001 | 116 | 2.43 | 8.25 | 97.36 | 94.47 | 0.80 | 1.91 | 99.19 | 98.64 | 14.89 | 64.53 | 66.97 | 46.37 |
| PacBio | HG002 | 1999 | 1.E−05 | HG001 | 127 | 2.34 | 8.57 | 97.45 | 94.34 | 0.89 | 2.04 | 99.09 | 98.52 | 13.56 | 66.58 | 68.06 | 44.83 |
| PacBio | HG002 | 72a | 1.E−05 | HG002 | 55 | 1.88 | 7.08 | 97.98 | 95.38 | 0.55 | 1.25 | 99.44 | 99.10 | 12.15 | 58.10 | 75.20 | 53.82 |
| PacBio | HG002 | 999 | 1.E−03 | HG002 | 88 | 1.86 | 6.59 | 98.01 | 95.66 | 0.49 | 1.15 | 99.51 | 99.18 | 12.45 | 54.11 | 76.34 | 57.32 |
| PacBio | HG002 | 1499 | 1.E−04 | HG002 | 101 | 2.02 | 5.81 | 97.85 | 95.99 | 0.42 | 1.02 | 99.57 | 99.28 | 14.10 | 47.73 | 76.11 | 61.98 |
| PacBio | HG002 | 1999 | 1.E−05 | HG002 | 101 | 2.05 | 5.70 | 97.83 | 96.03 | 0.41 | 0.99 | 99.59 | 99.30 | 14.40 | 46.87 | 75.96 | 62.52 |
| GATK UnifiedGenotyper, HG001 | | | | | 1 | 0.83 | 99.92 | 8.19 | 0.15 | 0.94 | 99.91 | 8.19 | 0.17 | – | – | – | – |
| GATK HaplotypeCaller, HG001 | | | | | 1 | 0.08 | 97.91 | 52.26 | 4.02 | 0.06 | 97.65 | 61.16 | 4.53 | 0.67 | 99.98 | 0.39 | 0.04 |
| GATK UnifiedGenotyper, HG002 | | | | | 1 | 0.75 | 99.91 | 9.91 | 0.17 | 0.85 | 99.90 | 9.91 | 0.19 | – | – | – | – |
| GATK HaplotypeCaller, HG002 | | | | | 1 | 0.69 | 98.74 | 63.81 | 2.47 | 0.79 | 98.58 | 63.88 | 2.78 | 0.02 | 100.00 | 15.66 | 0.007 |
a Fast training mode
Performance of Clairvoyante on ONT data at common variant sites in 1KGp3
| Seq. tech. | Model trained on | Trained epochs | Ending learning rate and lambda | Call variants in | Best variant quality cutoff | Overall FPR (%) | Overall FNR (%) | Overall precision (%) | Overall F1 score (%) | SNP FPR (%) | SNP FNR (%) | SNP precision (%) | SNP F1 score (%) | Indel FPR (%) | Indel FNR (%) | Indel precision (%) | Indel F1 score (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ONT | HG001 (except for chr1) | 110a | 1.E−05 | HG001 (chr1) | 47 | 3.40 | 17.09 | 95.93 | 88.94 | 3.33 | 9.29 | 96.34 | 93.44 | 3.99 | 86.42 | 76.59 | 23.07 |
| ONT | HG001 (except for chr1) | 999 | 1.E−03 | HG001 (chr1) | 53 | 4.05 | 16.31 | 95.20 | 89.07 | 3.79 | 8.95 | 95.85 | 93.39 | 6.28 | 81.70 | 73.19 | 29.28 |
| ONT | HG001 (except for chr1) | 1499 | 1.E−04 | HG001 (chr1) | 70 | 2.95 | 14.99 | 96.55 | 90.41 | 2.55 | 7.76 | 97.24 | 94.68 | 6.32 | 79.28 | 75.46 | 32.51 |
| ONT | HG001 (except for chr1) | 1999 | 1.E−05 | HG001 (chr1) | 74 | 2.96 | 14.78 | 96.55 | 90.53 | 2.52 | 7.59 | 97.28 | 94.78 | 6.68 | 78.64 | 74.90 | 33.24 |
| Nanopolish, HG001 | | | | | 20 | 0.04 | 21.57 | 97.51 | 86.93 | 0.03 | 15.66 | 98.11 | 90.70 | 0.12 | 63.46 | 88.57 | 51.73 |
| DeepVariant (chr1) | | | | | 3 | 0.22 | 23.82 | 96.26 | 85.05 | 0.21 | 15.30 | 96.78 | 90.33 | 0.30 | 89.11 | 72.98 | 18.95 |
| LoFreq | | | | | – | Benchmarking SNP only | | | | 2.79 | 46.99 | 90.69 | 66.91 | – | – | – | – |
| GATK UnifiedGenotyper, HG001 | | | | | 1 | 3.66 | 84.44 | 80.07 | 26.05 | 4.12 | 82.47 | 80.07 | 28.77 | – | – | – | – |
| GATK HaplotypeCaller, HG001 | | | | | 1 | 0.41 | 98.65 | 76.04 | 2.65 | 0.45 | 98.48 | 76.73 | 2.97 | 0.14 | 99.98 | 9.59 | 0.03 |
| Nanopolish, afcut0.2, depthcut4, and chr19 | | | | | 20 | 0.15 | 34.13 | 90.00 | 76.07 | 0.08 | 27.23 | 94.56 | 82.25 | 0.73 | 83.06 | 36.45 | 23.13 |
| Nanopolish, 1kgp3, and chr19 | | | | | 20 | 0.08 | 22.71 | 95.28 | 85.35 | 0.05 | 16.49 | 96.88 | 89.70 | 0.30 | 66.77 | 73.64 | 45.79 |
a Fast training mode
Fig. 1 IGV screen captures of the selected variants. a A heterozygous SNP from T to G at chromosome 11, position 98,146,409, called only in the PacBio and ONT data; b a heterozygous deletion AA at chromosome 20, position 3,200,689, not called in any of the three technologies; c a heterozygous insertion ATCCTTCCT at chromosome 1, position 184,999,851, called only in the Illumina data; and d a heterozygous deletion G at chromosome 1, position 5,072,694, called in all three technologies. The tracks from top to bottom show the alignments of the Illumina, PacBio, and ONT reads from HG001 aligned to the human reference GRCh37.
Fig. 2 A Venn diagram showing the number of known variants left undetected by each sequencing technology or combination of technologies.
Fig. 3 Clairvoyante network architecture and layer details. The description under each layer includes (1) the layer's function; (2) the activation function used; (3) the dimension of the layer in parentheses (input layer: height × width × arrays, convolution layer: height × width × filters, fully connected layer: nodes); and (4) the kernel size in brackets (height × width).
Fig. 4 Selected illustrations of how Clairvoyante represents the three common types of small variant, and a nonvariant. The figure includes: (top left) a C > G SNP, (top right) a 9-bp insertion, (bottom left) a 4-bp deletion, and (bottom right) a nonvariant with a reference allele. The color intensity represents the strength of a given variant signal. The SNP, insertion, and deletion examples are ideal, with almost zero background noise. The nonvariant example illustrates what the background noise looks like when it is not mixed with any variant signal.