Andrew D Hellicar, Ashfaqur Rahman, Daniel V Smith, John M Henshall.
Abstract
BACKGROUND: Despite ongoing reductions in genotyping costs, genomic studies involving large numbers of species with low economic value (such as Black Tiger prawns) remain cost prohibitive. In this scenario, DNA pooling is an attractive option for reducing genotyping costs. However, genotyping of pooled samples comprising DNA from many individuals is challenging due to the presence of errors that exceed the allele frequency quantisation size and therefore cannot simply be corrected by clustering techniques. The solution to the calibration problem is a correction to the allele frequency that mitigates errors incurred in the measurement process. We highlight the limitations of existing calibration solutions, such as the assumptions they impose on the variation between allele frequencies 0, 0.5, and 1.0, and the limited set of error types they address. We propose a novel machine learning method to address the limitations identified.
Year: 2015 PMID: 26156142 PMCID: PMC4495942 DOI: 10.1186/s12859-015-0593-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
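The quantisation argument in the abstract can be made concrete. A pool of N diploid individuals can only realise allele frequencies that are multiples of 1/(2N), so a measurement error larger than half that step cannot be undone by snapping to the nearest admissible value. A minimal sketch, with an illustrative pool size and error magnitude that are assumptions, not values from the paper:

```python
# Illustrative only: pool size and error magnitude are assumptions.
N = 10                    # assumed number of diploid individuals in the pool
step = 1.0 / (2 * N)      # allele-frequency quantisation step (0.05 here)

true_freq = 8 / (2 * N)   # 8 B-allele copies among 2N chromosomes -> 0.40
error = 0.08              # assumed measurement distortion, larger than step/2
measured = true_freq + error

# Snapping to the nearest admissible frequency lands on the wrong level:
recovered = round(measured / step) * step   # 0.50, not the true 0.40
```

This is why the paper pursues a calibration (correction of the measured frequency) rather than a clustering-style rounding step.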
Figure 1. Polynomial calibration functions. (a) Examples of calibration functions for the heterozygous case. (b) Distortion corrections for calibration functions corresponding to E = 0.2.
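For the 2nd-order Lagrange variant that appears in the results tables below, the calibration can be pictured as a quadratic interpolated through three (measured, true) anchor pairs at true allele frequencies 0, 0.5, and 1.0. A sketch with invented anchor values (in the paper the per-SNP anchors would come from the genotype clusters of individual samples):

```python
def lagrange2(x0, x1, x2, y, x):
    """Evaluate the quadratic Lagrange interpolant through
    (x0, y[0]), (x1, y[1]), (x2, y[2]) at the point x."""
    l0 = (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
    l1 = (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
    l2 = (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1))
    return y[0] * l0 + y[1] * l1 + y[2] * l2

# Assumed per-SNP anchors: mean measured frequency of the AA, AB and BB
# clusters (invented values), mapped to true frequencies 0, 0.5 and 1.
measured_anchors = (0.04, 0.42, 0.97)
calibrated = lagrange2(*measured_anchors, y=(0.0, 0.5, 1.0), x=0.42)
# By construction the interpolant returns 0.5 at the AB anchor.
```

Between the anchors the quadratic interpolates smoothly, which is exactly the assumption on the variation between allele frequencies 0, 0.5, and 1.0 that the abstract identifies as a limitation of such methods.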
Figure 2. Data cleaning workflow. The steps involve removing all data corresponding to bad SNPs, then removing all SNP results for bad samples (both individual and pool samples); finally, low-amplitude detections are removed. The size of the data set is shown next to the arrows. Input data files are lightly shaded; output files are darkly shaded.
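The filtering steps in the workflow above can be sketched as successive record filters. The record fields, thresholds, and the criteria for "bad" SNPs and samples below are assumptions for illustration, not details from the paper:

```python
# Hypothetical sketch of the cleaning workflow; fields and thresholds assumed.
records = [
    # (snp_id, sample_id, is_pool, amplitude)
    ("snp1", "ind1", False, 0.9),
    ("snp1", "pool1", True, 0.8),
    ("snp2", "ind1", False, 0.2),   # low-amplitude detection
    ("snp3", "ind2", False, 0.7),   # bad SNP and bad sample
]
bad_snps = {"snp3"}        # e.g. SNPs failing quality checks (assumed)
bad_samples = {"ind2"}     # both individual and pool samples can be bad
MIN_AMPLITUDE = 0.3        # assumed low-amplitude detection threshold

cleaned = [
    r for r in records
    if r[0] not in bad_snps
    and r[1] not in bad_samples
    and r[3] >= MIN_AMPLITUDE
]
# Only records for good SNPs, good samples and adequate amplitude remain.
```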
Figure 3. SNP coordinate pairs. (H_A, H_B) coordinates for the AA (blue), AB (purple), and BB (red) cases, plotted for four SNPs. Black points correspond to individual samples for which no call data is provided by the platform. Pre-calibration errors E_A(0.5) and E_B(0.5) are shown for the top-right SNP.
Figure 4. Data set generation. The workflow for generating data sets for the various training and testing regimes is shown. Numbers correspond to the number of (H_A, H_B) pairs copied to each data set. Pool samples are darkly shaded; individual samples are lightly shaded. Pool samples are copied multiple times to the mixed data sets to ensure equal representation.
Parameters describing machine learning approaches
| Parameter | Value | Min | Max |
|---|---|---|---|
| MLP | | | |
| num layers | 2 | 1 | 3 |
| nodes per layer | 2 | 1 | 6 |
| learning rate | 0.11 | 0.01 | 1.0 |
| momentum | 0.15 | 0.01 | 1.0 |
| non-linearity | Sigmoid in all hidden layers | | |
| SVM | | | |
| nu | 0.092 | 0.01 | 1.0 |
| C | 0.027 | 0.01 | 1.0 |
| kernel | Gaussian | | |
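As an illustration of how the tabled hyperparameters might be instantiated, the sketch below maps them onto scikit-learn estimators. Scikit-learn, the random features, and the stand-in allele-frequency target are assumptions for the example, not the paper's actual toolkit or data:

```python
# Sketch only: the paper does not state its ML toolkit; scikit-learn is an
# assumption used to show how the tabled hyperparameters could be set.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))          # stand-in for (H_A, H_B) pairs
y = X[:, 1] / (X[:, 0] + X[:, 1])       # stand-in allele-frequency target

# "2 layers, 2 nodes per layer", sigmoid hidden units; momentum applies
# to the SGD solver, matching the table's learning rate and momentum.
mlp = MLPRegressor(hidden_layer_sizes=(2, 2), activation="logistic",
                   solver="sgd", learning_rate_init=0.11, momentum=0.15,
                   max_iter=2000, random_state=0).fit(X, y)

# nu-SVM regression with a Gaussian (RBF) kernel; nu and C from the table.
svm = NuSVR(nu=0.092, C=0.027, kernel="rbf").fit(X, y)

preds = svm.predict(X[:5])
```

The Min/Max columns would then delimit the search range over which each value was tuned.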
Sets used for machine learning under the different regimes, in the format (training sets; testing sets)
Allele frequency MSE obtained by the polynomial calibration methods

| Method | | | | |
|---|---|---|---|---|
| None | 8.83 | 3.27 | 12.32 | 15.80 |
| k-correction | 4.26 | 3.74 | 8.35 | 12.44 |
| Piecewise linear | 4.07 | 3.72 | 8.23 | 12.38 |
| 2nd order Lagrange | 4.21 | 4.01 | 8.28 | 12.34 |
| Piecewise Hermite | 2.68 | 2.40 | 8.73 | 14.77 |
| Piecewise Hermite equal deriv. | 7.62 | 7.54 | 14.16 | 20.69 |
| Piecewise Hermite equal domain | 3.54 | 3.45 | 10.27 | 16.99 |
| Best approach applied per SNP | 2.58 | 2.45 | 7.57 | 11.34 |
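The final row, the best approach applied per SNP, is an oracle that picks the lowest-MSE method for each SNP individually; the same per-SNP argmin also yields per-method win percentages. A sketch with invented per-SNP values:

```python
import numpy as np

# Invented per-SNP MSEs: rows = calibration methods, columns = SNPs.
mse = np.array([
    [4.1, 2.0, 9.5],    # e.g. k-correction
    [3.9, 2.5, 8.0],    # e.g. piecewise linear
    [2.7, 3.1, 8.6],    # e.g. piecewise Hermite
])

best_per_snp = mse.min(axis=0)       # oracle: best method for each SNP
overall = best_per_snp.mean()        # analogue of the table's last row

# Fraction of SNPs on which each method wins (analogue of win percentages):
wins = np.bincount(mse.argmin(axis=0), minlength=mse.shape[0])
pct = 100.0 * wins / mse.shape[1]
```

Because each SNP keeps its own winner, the oracle row can only be at least as good as every single method, which is consistent with it having the lowest values in most columns.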
Percentage of SNPs where a given method obtains the best performance

| Method | | | |
|---|---|---|---|
| None | 0 | 14.6 | 35.4 |
| k-correction | 4.2 | 27.1 | 29.2 |
| Piecewise linear | 0 | 14.6 | 10.4 |
| 2nd order Lagrange | 0 | 8.3 | 14.6 |
| Piecewise Hermite | 85.4 | 22.9 | 4.2 |
| Piecewise Hermite equal deriv. | 0 | 2.1 | 2.1 |
| Piecewise Hermite equal domain | 10.4 | 10.4 | 4.2 |
Machine learning allele frequency MSEs

| Method | | | | | |
|---|---|---|---|---|---|
| LR | | 4.56 | 3.12 | 7.58 | 10.58 |
| | | 6.10 | 2.72 | 6.66 | 7.44 |
| | | 32.93 | 1.15 | 19.03 | 5.28 |
| MLP | | 2.58 | 2.01 | 8.90 | 15.10 |
| | | 4.96 | 2.42 | 6.34 | 8.00 |
| | | 16.35 | 2.17 | 10.92 | 5.91 |
| SVM | | 4.22 | 2.68 | 6.78 | 9.29 |
| | | 6.64 | 2.55 | 6.55 | 6.54 |
| | | 10.05 | 2.37 | 8.40 | 7.05 |
Ratio of the best machine learning approach MSE to the best existing technique MSE for each training and testing set combination
| | | | |
|---|---|---|---|
| | 0.63 | 0.82 | 0.75 |
| | 1.22 | 0.77 | 0.53 |
| | 2.47 | 1.02 | 0.43 |