Thomas K F Wong, Louis Ranjard, Yu Lin, Allen G Rodrigo.
Abstract
BACKGROUND: Pooling techniques, where multiple sub-samples are mixed in a single sample, are widely used to take full advantage of high-throughput DNA sequencing. Recently, Ranjard et al. (PLoS ONE 13:0195090, 2018) proposed a pooling strategy without the use of barcodes. Three sub-samples were mixed in different known proportions (i.e. 62.5%, 25% and 12.5%), and a method was developed to use these proportions to reconstruct the three haplotypes effectively.
Keywords: Barcode; Haplotype reconstruction; Pooling strategy
Year: 2018 PMID: 30348075 PMCID: PMC6198429 DOI: 10.1186/s12859-018-2424-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The results of estimation on the sample proportions by HaploJuice
| Case | Actual sample proportion | | | Estimated sample proportion (average ± standard deviation) | | |
|---|---|---|---|---|---|---|
| 1 | 0.5 | 0.4 | 0.1 | 0.50 ± 0.001 | 0.40 ± 0.001 | 0.10 ± 0.001 |
| 2 | 0.5 | 0.3 | 0.2 | 0.50 ± 0.001 | 0.30 ± 0.001 | 0.20 ± 0.001 |
| 3 | 0.6 | 0.3 | 0.1 | 0.60 ± 0.001 | 0.30 ± 0.001 | 0.10 ± 0.001 |
| 4 | 0.7 | 0.2 | 0.1 | 0.70 ± 0.001 | 0.20 ± 0.001 | 0.10 ± 0.001 |
One hundred data sets were simulated for each case.
Comparison of performance of different methods on reconstruction of three haplotypes for simulated data sets
a. Proportion of three samples: 0.5, 0.4, 0.1 (total length of three haplotypes: 30k)

| Software | # contigs ≥ 500 bp | Longest contig | N50 | Haplotype coverage | Error rate |
|---|---|---|---|---|---|
| HaploJuice | | | | | |
| hmmfreq | | 9855 ± 6.8 | 9850 ± 6.3 | 98.5 ± 0.0 | 0.276 ± 0.254 |
| shoRAH | 30.8 ± 11.7 | 9819 ± 124.8 | 9799 ± 116.7 | 97.5 ± 3.5 | 0.646 ± 0.492 |
| SAVAGE | 9.8 ± 3.5 | 9972 ± 11.8 | 305 ± 300.3 | 51.3 ± 7.1 | 0.001 ± 0.004 |
| PredictHaplo | 2.0 ± 0.2 | 9991 ± 4.2 | 9984 ± 5.6 | 67.7 ± 5.7 | 0.102 ± 0.034 |
| QuRe | 3.7 ± 1.9 | 6993 ± 1306.3 | 7374 ± 686.5 | 43.8 ± 13.5 | 0.331 ± 0.318 |

b. Proportion of three samples: 0.5, 0.3, 0.2 (total length of three haplotypes: 30k)

| Software | # contigs ≥ 500 bp | Longest contig | N50 | Haplotype coverage | Error rate |
|---|---|---|---|---|---|
| HaploJuice | | | | | |
| hmmfreq | | 9854 ± 5.8 | 9850 ± 7.6 | 98.5 ± 0.0 | 0.089 ± 0.104 |
| shoRAH | 27.9 ± 6.6 | 9814 ± 118.3 | 9789 ± 113.9 | 97.1 ± 4.7 | 0.591 ± 0.358 |
| SAVAGE | 11.4 ± 3.4 | 9983 ± 8.2 | 436 ± 281.8 | 54.7 ± 7.1 | 0.001 ± 0.005 |
| PredictHaplo | 2.0 ± 0.2 | 9991 ± 3.7 | 9984 ± 5.8 | 68.0 ± 6.6 | 0.087 ± 0.040 |
| QuRe | 4.2 ± 2.2 | 7348 ± 820.8 | 7436 ± 776.9 | 44.9 ± 15.9 | 0.761 ± 0.851 |

c. Proportion of three samples: 0.6, 0.3, 0.1 (total length of three haplotypes: 30k)

| Software | # contigs ≥ 500 bp | Longest contig | N50 | Haplotype coverage | Error rate |
|---|---|---|---|---|---|
| HaploJuice | | | | | |
| hmmfreq | | 9854 ± 5.6 | 9849 ± 6.2 | 98.5 ± 0.0 | 0.210 ± 0.214 |
| shoRAH | 25.2 ± 5.9 | 9837 ± 115.0 | 9808 ± 113.3 | 97.4 ± 4.8 | 0.749 ± 0.516 |
| SAVAGE | 11.2 ± 3.0 | 9971 ± 20.9 | 419 ± 260.5 | 53.9 ± 6.3 | 0.001 ± 0.006 |
| PredictHaplo | 2.0 ± 0.0 | 9991 ± 3.5 | 9984 ± 4.7 | 66.7 ± 0.0 | 0.089 ± 0.025 |
| QuRe | 3.9 ± 1.9 | 7074 ± 1284.4 | 7300 ± 716.6 | 39.1 ± 14.5 | 0.492 ± 0.597 |

d. Proportion of three samples: 0.7, 0.2, 0.1 (total length of three haplotypes: 30k)

| Software | # contigs ≥ 500 bp | Longest contig | N50 | Haplotype coverage | Error rate |
|---|---|---|---|---|---|
| HaploJuice | | | | | |
| hmmfreq | | 9855 ± 6.2 | 9850 ± 6.7 | 98.5 ± 0.0 | 0.240 ± 0.220 |
| shoRAH | 20.2 ± 4.7 | 9835 ± 115.0 | 9812 ± 106.4 | 93.8 ± 11.2 | 0.912 ± 0.630 |
| SAVAGE | 15.2 ± 3.0 | 9974 ± 10.6 | 708 ± 161.7 | 65.1 ± 7.0 | 0.001 ± 0.005 |
| PredictHaplo | 2.0 ± 0.0 | 9991 ± 3.8 | 9984 ± 4.7 | 66.7 ± 0.0 | 0.088 ± 0.021 |
| QuRe | 3.6 ± 1.8 | 6787 ± 1333.0 | 7121 ± 809.6 | 28.4 ± 11.2 | 0.319 ± 0.535 |
One hundred data sets were generated for each case, each case using a different set of sample proportions. Values are reported as average ± standard deviation. The best value in each column is highlighted among the software whose contigs cover over 90% of the haplotypes.
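The column metrics above can be illustrated with a short sketch. It assumes that N50 is computed only over contigs of at least 500 bp, which is a common convention but is not stated in this section; the function name `contig_summary` and the contig lengths are hypothetical.

```python
def contig_summary(contig_lengths, min_length=500):
    """Summarise contig lengths (in bp): count, longest contig, and N50.

    Assumption: contigs shorter than min_length are ignored for all metrics.
    """
    kept = sorted((n for n in contig_lengths if n >= min_length), reverse=True)
    if not kept:
        return {"n_contigs": 0, "longest": 0, "n50": 0}

    # N50: the length of the contig at which the running total, taken from
    # the longest contig downwards, first reaches half of the total length.
    half_total = sum(kept) / 2
    running = 0
    n50 = 0
    for length in kept:
        running += length
        if running >= half_total:
            n50 = length
            break

    return {"n_contigs": len(kept), "longest": kept[0], "n50": n50}


# Hypothetical contig lengths, not taken from the tables.
summary = contig_summary([4000, 3000, 2000, 1000, 400])
print(summary)
```

A low N50 combined with a long longest contig, as in the SAVAGE rows, indicates that most of the assembled length sits in short fragments.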
Frequencies of the three kangaroo sub-samples in the mixture of reads [8], as estimated by our method for three amplicons
| Amplicon | Target proportions | | | Average estimated proportions (average variation in %) | | |
|---|---|---|---|---|---|---|
| Amplicon 1 | 0.625 | 0.250 | 0.125 | 0.656 (4.9%) | 0.229 (8.3%) | 0.115 (8.0%) |
| Amplicon 2 | 0.625 | 0.250 | 0.125 | 0.640 (2.4%) | 0.246 (1.6%) | 0.114 (8.7%) |
| Amplicon 3 | 0.625 | 0.250 | 0.125 | 0.646 (3.4%) | 0.251 (0.3%) | 0.103 (17.9%) |
The estimates reveal that the ratios of the sub-samples varied when they were mixed during library preparation. Ten data sets were used for each amplicon.
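As a rough check on the percentages in parentheses, the sketch below computes the absolute relative deviation of an estimated proportion from its target. This reading of "variation" is an assumption. The table reports an average over ten data sets, so values computed from the averaged proportions are close to, but need not equal, the tabulated ones. The function name is hypothetical.

```python
def relative_deviation_percent(target, estimate):
    """Absolute deviation of estimate from target, as a percentage of target."""
    return abs(estimate - target) / target * 100.0


targets = (0.625, 0.250, 0.125)
amplicon_1 = (0.656, 0.229, 0.115)  # average estimates for amplicon 1

deviations = [
    relative_deviation_percent(t, e) for t, e in zip(targets, amplicon_1)
]
print([round(d, 2) for d in deviations])
```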
Comparison of performance of different methods on reconstruction of three haplotypes for real kangaroo data sets from the mixture of reads [8] for (a) amplicon 1, (b) amplicon 2, and (c) amplicon 3
a. Amplicon 1 (total length of three haplotypes: 13921)

| Software | # contigs ≥ 500 bp | Longest contig | N50 | Haplotype coverage | Error rate |
|---|---|---|---|---|---|
| HaploJuice | | | | | |
| hmmfreq | | 4485 ± 0.6 | 4484 ± 0.6 | 96.6 ± 0.0 | 0.26 ± 0.10 |
| shoRAH | 24.0 ± 2.6 | 4592 ± 7.0 | 4592 ± 6.0 | 95.6 ± 10.4 | 1.05 ± 0.32 |
| SAVAGE | 13.2 ± 2.1 | 903 ± 132.3 | 482 ± 169.6 | 47.3 ± 5.2 | 0.02 ± 0.04 |
| PredictHaplo | 1.1 ± 0.3 | 4630 ± 2.0 | 462 ± 1461.3 | 36.5 ± 10.5 | 0.01 ± 0.01 |
| QuRe | 4.0 ± 1.9 | 4343 ± 9.9 | 3909 ± 1373.7 | 74.9 ± 21.8 | 0.42 ± 0.32 |

b. Amplicon 2 (total length of three haplotypes: 12694)

| Software | # contigs ≥ 500 bp | Longest contig | N50 | Haplotype coverage | Error rate |
|---|---|---|---|---|---|
| HaploJuice | | | | | |
| hmmfreq | | 3998 ± 4.0 | 3998 ± 4.0 | 94.5 ± 0.1 | |
| shoRAH | 24.2 ± 5.7 | 4119 ± 14.5 | 4118 ± 12.1 | 90.8 ± 13.5 | 0.41 ± 0.48 |
| SAVAGE | 8.8 ± 3.8 | 1806 ± 761.5 | 572 ± 81.7 | 50.2 ± 4.7 | 0.00 ± 0.00 |
| PredictHaplo | 2.0 ± 0.0 | 4140 ± 2.6 | 4136 ± 0.0 | 65.2 ± 0.0 | 0.00 ± 0.00 |
| QuRe | 2.4 ± 0.7 | 3746 ± 4.7 | 3373 ± 1185.0 | 38.4 ± 14.3 | 0.22 ± 0.28 |

c. Amplicon 3 (total length of three haplotypes: 15391)

| Software | # contigs ≥ 500 bp | Longest contig | N50 | Haplotype coverage | Error rate |
|---|---|---|---|---|---|
| HaploJuice | | 5116 ± 9.1 | | | |
| hmmfreq | | 5029 ± 3.1 | 5027 ± 3.6 | 98.0 ± 0.1 | 0.23 ± 0.11 |
| shoRAH | 27.6 ± 3.0 | | 5111 ± 7.4 | 96.3 ± 10.5 | 1.91 ± 0.44 |
| SAVAGE | 11.8 ± 2.3 | 2510 ± 672 | 550 ± 40.4 | 55.6 ± 4.3 | 0.01 ± 0.01 |
| PredictHaplo | 1.6 ± 0.5 | 5170 ± 3.9 | 3070 ± 2642.4 | 53.3 ± 17.2 | 0.14 ± 0.09 |
| QuRe | 3.0 ± 1.1 | 4567 ± 2.1 | 4106 ± 1442.7 | 35.6 ± 12.5 | 0.25 ± 0.28 |
There are 10 data sets for each amplicon, each with a total read coverage of 1600x. For each data set, the sub-samples were mixed in the proportions 0.125, 0.25 and 0.625. Values are reported as average ± standard deviation. The best value in each column is highlighted among the methods whose contigs cover over 90% of the three haplotypes.
Fig. 1 Coverage of HaploJuice contigs as a function of haplotype genetic distances. The figure shows how the performance of HaploJuice varies with different genetic distances between the sub-samples
Fig. 2 Performance of HaploJuice with different sample frequencies. Panels (a) and (b) show the haplotype coverages and the error rates of the contigs under different sub-sample proportions, respectively
The average running time (in min) of different methods to reconstruct haplotypes for each kangaroo data set
| HaploJuice | hmmfreq | shoRAH | SAVAGE | PredictHaplo | QuRe |
|---|---|---|---|---|---|
| 0.14 | 13.53 | 7.81 | 11.21 | 4.30 | 139.93 |
Fig. 3 Workflow of HaploJuice. HaploJuice first estimates the sub-sample proportions from a mixture of reads using a maximum likelihood method. The algorithm then reconstructs the haplotype sequences using a dynamic programming method
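The first step of the workflow, estimating proportions by maximum likelihood, can be illustrated with a deliberately simplified sketch. It is not HaploJuice's actual likelihood: it assumes that every window contains three distinct sub-sequences, one per sub-sample, that their read counts follow a multinomial distribution, and that the proportions are ordered from largest to smallest. The function names, the grid resolution and the read counts are hypothetical.

```python
import math


def log_likelihood(proportions, windows):
    """Multinomial log-likelihood (up to a constant) of per-window read counts.

    Counts in each window are sorted in decreasing order and matched to the
    proportions, which are assumed to be given in decreasing order too.
    """
    total = 0.0
    for counts in windows:
        ordered = sorted(counts, reverse=True)
        for count, p in zip(ordered, proportions):
            total += count * math.log(p)
    return total


def estimate_proportions(windows, grid=200):
    """Grid search for (f1, f2, f3) with f1 >= f2 >= f3 > 0 and sum 1."""
    best, best_ll = None, -math.inf
    for i in range(1, grid):
        for j in range(1, i + 1):
            k = grid - i - j
            if k < 1 or k > j:
                continue  # keep the ordering f1 >= f2 >= f3 and f3 > 0
            candidate = (i / grid, j / grid, k / grid)
            ll = log_likelihood(candidate, windows)
            if ll > best_ll:
                best, best_ll = candidate, ll
    return best


# Hypothetical read counts of the three most frequent sub-sequences
# in three windows of a pooled sample.
windows = [(500, 400, 100), (510, 390, 100), (490, 410, 100)]
estimate = estimate_proportions(windows)
print(estimate)
```

A grid search keeps the sketch dependency-free; under these assumptions the maximiser is simply the pooled sorted counts divided by the total read count, so the search mainly serves to make the likelihood explicit.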
The expected frequencies of top-n most frequent sub-sequences for a mixture from 3 samples
| Case | Expected frequencies of sub-sequences | | |
|---|---|---|---|
| 1 | | | |
| 2 | | | |
| 3 | | | |
| 4 | | | |
| 5 | | | |
This gives a total of B3 = 5 cases. The remaining symbols are the proportions of erroneous sequences
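The count of five cases matches the number of ways to partition three samples into groups, which is the Bell number for three elements. The sketch below enumerates those partitions and, for each, sums the proportions of the samples in a group. It assumes that samples in the same group share an identical sub-sequence in the window, so their proportions add up, and it ignores erroneous sequences. The function names are hypothetical; the proportions are the target proportions quoted in the abstract.

```python
def set_partitions(items):
    """Yield every partition of a list into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Add the first item to each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition


def expected_frequencies(proportions):
    """Expected sub-sequence frequencies for each partition of the samples."""
    cases = []
    for partition in set_partitions(list(range(len(proportions)))):
        freqs = sorted(
            (sum(proportions[i] for i in block) for block in partition),
            reverse=True,
        )
        cases.append(freqs)
    return cases


cases = expected_frequencies([0.625, 0.25, 0.125])
for freqs in cases:
    print(freqs)
```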
There are a total of 27 cases for generating 3 sub-sequences by 3 haplotypes
| | Haplotypes which generate the sub-sequences | | | Expected frequencies | | |
|---|---|---|---|---|---|---|
| Case | subseq1 | subseq2 | subseq3 | subseq1 | subseq2 | subseq3 |
| 1 | | | | | | |
| 2 | | | | | | |
| 3 | | | | | | |
| 4 | | | | | | |
| 5 | | | | | | |
| 6 | | | | | | |
| 7 | | Erroneous | | | | |
| 8 | | Erroneous | | | | |
| ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ |
| 26 | Erroneous | Erroneous | | | | |
| 27 | Erroneous | Erroneous | | | | |
h_i indicates that the sub-sequence is generated from haplotype i, and 'Erroneous' marks an erroneous sub-sequence. f_i is the estimated proportion of sample i, and the remaining symbols are the proportions of erroneous sub-sequences