Han Yang1, Fei Gu2, Lei Zhang3,4, Xian-Sheng Hua3.
Abstract
Genome variant calling is a challenging yet critical task for downstream studies. Existing methods rely almost exclusively on high-depth DNA sequencing data, and their performance degrades sharply on low-depth data. Using public human Oxford Nanopore (ONT) data from the Genome in a Bottle (GIAB) Consortium, we trained a generative adversarial network for low-depth variant calling. Our method, denoted LDV-Caller, projects high-depth sequencing information from low-depth data. It achieves a 94.25% F1 score on low-depth data, while the F1 score of the state-of-the-art method on data with twice the sequencing depth is 94.49%. As a result, the cost of genome-wide sequencing examinations can be reduced substantially. In addition, we validated the trained LDV-Caller model on 157 public Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) samples, whose mean sequencing depth is 2982x. LDV-Caller yields a 92.77% F1 score using only 22x sequencing depth, which demonstrates that our method has the potential to analyze different species from low-depth sequencing data alone.
Year: 2022 PMID: 35637238 PMCID: PMC9151722 DOI: 10.1038/s41598-022-12346-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1 Examples of alignments for different genome variants. For ease of understanding, we simplify the illustration of the alignments. Each alignment consists of a reference sequence and the related reads. (a) is extracted at a non-variant genome site, where the differing C comes from a sequencing error. (b) shows a heterozygous SNP variant with allele C. (c) is a heterozygous insertion variant, e.g., from T to TAAT. (d) is a homozygous deletion variant, i.e., from AGT to A. (c) and (d) are known as indel variants.
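The variant classes in Figure 1 can be illustrated with a toy genotyping heuristic: given the base counts at a site, classify it by the fraction of reads supporting a non-reference allele. This is a simplified sketch for intuition only, not the paper's neural approach; the `call_site` function and its thresholds are illustrative assumptions.

```python
# Toy genotype caller: classify a pileup column by alternate-allele fraction.
# Illustrative thresholds only; real callers model sequencing errors
# probabilistically rather than with hard cutoffs.

def call_site(ref_base, base_counts, het_low=0.25, het_high=0.75):
    """base_counts: dict mapping base -> read count at this site."""
    depth = sum(base_counts.values())
    if depth == 0:
        return "no_call"
    # Most frequent non-reference allele and its fraction of the depth.
    alts = {b: n for b, n in base_counts.items() if b != ref_base}
    if not alts:
        return "hom_ref"
    alt_base, alt_count = max(alts.items(), key=lambda kv: kv[1])
    frac = alt_count / depth
    if frac < het_low:
        return "hom_ref"          # likely sequencing error (Fig. 1a)
    if frac <= het_high:
        return f"het_{alt_base}"  # heterozygous variant (Fig. 1b)
    return f"hom_{alt_base}"      # homozygous variant

print(call_site("A", {"A": 19, "C": 1}))   # lone mismatch -> hom_ref
print(call_site("A", {"A": 10, "C": 9}))   # ~50% alt -> het_C
```

Note that at low depth the observed allele fraction becomes noisy, which is exactly why naive counting degrades and a learned model such as LDV-Caller helps.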
Figure 2 Illustration of the input pileup images for two typical methods. (a) is for DeepVariant, and (b) is for Clair. The front images of (a) and (b) are reconstructed from the remaining slices. (a) is a multi-channel color image whose channels encode, e.g., read base, base quality, mapping quality, etc. (b) is a three-dimensional image containing four kinds of counting information.
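The counting-style input in (b) can be sketched as follows: for each reference position, tally how many aligned reads support each base. This is a deliberately simplified encoding, assuming gapless reads given as hypothetical (start, sequence) pairs; Clair's real tensor additionally encodes insertions, deletions, and strand information.

```python
# Simplified counting pileup: per-position base counts from aligned reads.
# Assumes gapless alignments given as (start_position, read_sequence) pairs.

BASES = "ACGT"

def counting_pileup(reads, ref_len):
    # One row per reference position, one column per base.
    counts = [[0] * len(BASES) for _ in range(ref_len)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < ref_len and base in BASES:
                counts[pos][BASES.index(base)] += 1
    return counts

reads = [(0, "ACGT"), (1, "CGTA"), (2, "GTAC")]
pileup = counting_pileup(reads, 6)
print(pileup[2])  # position 2 covered by G, G, G -> [0, 0, 3, 0]
```

A matrix like this, stacked with the other count channels, is what the generator G consumes when "projecting" a high-depth pileup from a low-depth one.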
Experimental results on the GIAB dataset. We used chromosomes 1 to 22 of samples HG001, HG002, and HG003. The suffix 'ds' indicates the down-sampling rate. Both SNPs and Indels are used for the overall evaluation.
| Method | DS rate | Train data | Test data | Precision | Recall | F1 score | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|---|
| Clair | 1.0ds | HG001:chr1-chr22 | HG002:chr1-chr22 | 0.9524 | 0.9376 | 0.9449 | 2,815,372 | 140,722 | 187,362 |
| Clair | 1.0ds | HG001:chr1-chr22 | HG003:chr1-chr22 | 0.9555 | 0.9331 | 0.9441 | 2,675,506 | 124,673 | 191,895 |
| Clair | 0.5ds | HG001:chr1-chr22 | HG002:chr1-chr22 | 0.9344 | 0.9020 | 0.9179 | 2,708,574 | 190,127 | 294,101 |
| LDV-Caller | 0.5ds | HG001:chr1-chr22 | HG002:chr1-chr22 | 0.9622 | 0.9235 | 0.9425 | 2,773,166 | 108,865 | 229,584 |
| Clair | 0.5ds | HG002:chr2-chr22 | HG002:chr1 | 0.9323 | 0.8999 | 0.9158 | 214,797 | 15,588 | 23,887 |
| LDV-Caller | 0.5ds | HG002:chr2-chr22 | HG002:chr1 | 0.9581 | 0.9200 | 0.9387 | 219,592 | 9,615 | 19,095 |
| Clair | 0.3ds | HG001:chr1-chr22 | HG002:chr1-chr22 | 0.8289 | 0.7716 | 0.7992 | 2,316,904 | 478,278 | 685,727 |
| LDV-Caller | 0.3ds | HG001:chr1-chr22 | HG002:chr1-chr22 | 0.9300 | 0.8601 | 0.8937 | 2,582,572 | 194,462 | 420,102 |
| Clair | 0.5ds | HG001:chr1-chr22 | HG003:chr1-chr22 | 0.8924 | 0.8479 | 0.8705 | 2,436,327 | 293,749 | 431,043 |
| LDV-Caller | 0.5ds | HG001:chr1-chr22 | HG003:chr1-chr22 | 0.9476 | 0.9102 | 0.9285 | 2,609,887 | 145,095 | 257,491 |
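The precision, recall, and F1 columns above follow directly from the TP/FP/FN counts; a small helper reproduces them (checked here against the Clair 1.0ds HG001-to-HG002 row).

```python
# Precision/recall/F1 from true positives, false positives, false negatives.

def prf1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Clair, 1.0ds, trained on HG001:chr1-chr22, tested on HG002:chr1-chr22.
p, r, f1 = prf1(tp=2_815_372, fp=140_722, fn=187_362)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.9524 0.9376 0.9449
```

The same helper recovers the abstract's headline numbers, e.g. F1 = 0.9425 for LDV-Caller at 0.5ds on HG002.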
Experimental results of the ablation study on 0.5ds data. HG001 is the training data and HG002 the test data. Since the LDV-Caller has two extra submodules, G and D, compared with the Clair caller C, we ran two experiments to validate the effectiveness of G and D. A check mark (✓) indicates that the corresponding submodule is used. The first row is effectively the Clair method, while the last row is our LDV-Caller. MSE evaluates submodule G, while the other metrics evaluate the whole LDV-Caller.
| G | D | MSE | Precision | Recall | F1 score |
|---|---|---|---|---|---|
|   |   | –      | 0.9344 | 0.9020 | 0.9179 |
| ✓ |   | 0.1996 | 0.9516 | 0.9127 | 0.9317 |
| ✓ | ✓ | 0.2014 | 0.9622 | 0.9235 | **0.9425** |
Significant value in bold.
Figure 3 Precision/sensitivity curves for SNPs and Indels on HG002.
Figure 4 Example of an IGV visualization. The variant is at chr21:15,980,562 of HG002. This variant is not detected by the Clair method on 0.3ds down-sampled data, but our LDV-Caller correctly recovers it from the 0.3ds data with high confidence. (a) is the alignment of the original high-depth data. (b) is the alignment of the 0.3ds down-sampled data. (c) contains three pileup images: from left to right, the low-depth pileup image, the predicted "high-depth" pileup image, and the real high-depth pileup image.
Experimental results on the SARS-CoV-2 dataset.
| Method | Depth | Precision | Recall | F1 score |
|---|---|---|---|---|
| Clair | 22x | 0.9170 | 0.9123 | 0.9146 |
| LDV-Caller | 22x | 0.9392 | 0.9165 | **0.9277** |
Significant value in bold.
Figure 5 The framework of our LDV-Caller. It includes three sub-models: generative model G, adversarial discriminator D, and variant caller C. Detailed structures are given in Table 4.
The detailed layers in our LDV-Caller. k, d, and c respectively denote kernel size, dilation rate, and channel size.
| Module | Details |
|---|---|
| Input | counting-based pileup features |
| Input | convert to 2D image |
| G (encoder) | (conv2d, conv2d, maxpool2d), (k=3, d=1,2, c=16) |
| G (encoder) | (conv2d, conv2d, maxpool2d), (k=3, d=1,2, c=32) |
| G (encoder) | (conv2d, conv2d, maxpool2d), (k=3, d=1,2, c=64) |
| G (bottleneck) | (conv2d, conv2d), (k=3, d=1,2, c=64) |
| G (decoder) | (deconv2d, concat, conv2d, conv2d), (k=3, d=1,2, c=64) |
| G (decoder) | (deconv2d, concat, conv2d, conv2d), (k=3, d=1,2, c=32) |
| G (decoder) | (deconv2d, concat, conv2d, conv2d), (k=3, d=1,2, c=16) |
| D | backbone with the same structure as the encoder of G |
| D | global-avgpooling layer, hidden size=64 |
| D | fully-connected layer with sigmoid, hidden size=1 |
| C | BiLSTM, layers=2, hidden size=128 |
| C | 2 fully-connected layers, hidden sizes=192, 96 |
| C | 4 output fully-connected layers with softmax, hidden size=# of classes |
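Since each encoder stage in the table ends with a max-pool that halves the resolution and each decoder stage starts with a deconvolution that doubles it, the generator's U-Net shape arithmetic can be traced without any deep-learning framework. The sketch below assumes 'same'-padded convolutions, 2x2 pooling, and 2x upsampling; the 32x8 input size is a hypothetical example, not a size stated in the paper.

```python
# Trace feature-map sizes through the U-Net-style generator of Table 4.
# Assumes 'same' convolutions, 2x2 max-pooling, and 2x upsampling deconvs.
# The 32x8 input is a hypothetical example size.

def trace_generator(h, w):
    shapes = [("input", h, w, 1)]
    # Encoder: three (conv, conv, maxpool) stages with c = 16, 32, 64.
    for c in (16, 32, 64):
        h, w = h // 2, w // 2
        shapes.append((f"enc c={c}", h, w, c))
    shapes.append(("bottleneck c=64", h, w, 64))
    # Decoder: three (deconv, concat, conv, conv) stages with c = 64, 32, 16.
    # Each concat joins the skip connection from the matching encoder stage.
    for c in (64, 32, 16):
        h, w = h * 2, w * 2
        shapes.append((f"dec c={c}", h, w, c))
    return shapes

for name, h, w, c in trace_generator(32, 8):
    print(f"{name:16s} {h}x{w}x{c}")
```

The symmetric encoder/decoder channel counts (16-32-64-64-32-16) and the final output matching the input resolution are what let G emit a "high-depth" pileup image of the same size as the low-depth input.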