Aaron R Sharp, Joshua A Udall.
Abstract
BACKGROUND: Physical mapping of DNA with restriction enzymes allows for the characterization and assembly of much longer molecules than is feasible with sequencing. However, assemblies of physical map data are sensitive to input parameters, which describe noise inherent in the data collection process. One possible way to determine the parameter values that best describe a dataset is by trial and error.
Year: 2016 PMID: 27454532 PMCID: PMC4965707 DOI: 10.1186/s12859-016-1099-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Input parameter values
| Parameter | Overlap significance threshold | False positive labels per 100 kbp | Proportion restriction sites unlabeled | Min. molecule length (kbp) | Min. labels per molecule |
|---|---|---|---|---|---|
| Values | 1.11E-04 | 0.5 | 0.15 | 100 | 6 |
| | 1.11E-06 | 1.5 | 0.3 | 150 | 8 |
| | 1.11E-08 | 2.5 | 0.45 | 180 | 10 |
| | 1.11E-10 | | | | |
| | 1.11E-12 | | | | |
Min. is short for minimum
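
The trial-and-error search over these values can be pictured as a full factorial grid. The sketch below is only an illustration, not the authors' tooling: the dictionary keys and the function name `enumerate_parameter_sets` are hypothetical, and the values are transcribed from the table above. It enumerates every combination and confirms that five thresholds and three values for each of the other four parameters give 405 parameter sets.

```python
from itertools import product

# Values transcribed from the "Input parameter values" table.
# The key names are hypothetical labels chosen for this sketch.
PARAMETER_GRID = {
    "significance_threshold": [1.11e-4, 1.11e-6, 1.11e-8, 1.11e-10, 1.11e-12],
    "false_positive_per_100kbp": [0.5, 1.5, 2.5],
    "proportion_sites_unlabeled": [0.15, 0.30, 0.45],
    "min_molecule_length_kbp": [100, 150, 180],
    "min_labels_per_molecule": [6, 8, 10],
}


def enumerate_parameter_sets(grid):
    """Return one dict per combination of parameter values (full factorial)."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


parameter_sets = enumerate_parameter_sets(PARAMETER_GRID)
print(len(parameter_sets))  # 5 * 3 * 3 * 3 * 3 = 405
```

Each dictionary in `parameter_sets` corresponds to one candidate assembly configuration.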
Map data collected
| Date Run | Quantity (Mbp) | Molecule N50 (kbp) | Average labels per 100 kbp |
|---|---|---|---|
| 28-May-14 | 5,861.00 | 218.6 | 7.2 |
| 04-Jun-14 | 15,723.90 | 154.5 | 8.2 |
| 05-Jun-14 | 32,131.70 | 150.4 | 8.6 |
| 05-Jun-14 | 18,135.40 | 143.9 | 9 |
| 22-Jul-14 | 7,122.50 | 188.7 | 6.1 |
| 23-Jul-14 | 9,651.20 | 175.8 | 9.3 |
| 24-Jul-14 | 2,833.90 | 165.8 | 9.1 |
| 24-Jul-14 | 5,492.80 | 198.6 | 10.2 |
| 25-Jul-14 | 15,037.10 | 189.7 | 6.1 |
| 28-Jul-14 | 6,246.70 | 189.7 | 6.6 |
| 29-Jul-14 | 4,848.80 | 155.4 | 10 |
| 30-Jul-14 | 9,029.30 | 163.8 | 10.1 |
| 31-Jul-14 | 15,970.40 | 168.3 | 10.1 |
| 05-Aug-14 | 12,213.10 | 171.2 | 10.3 |
| 06-Aug-14 | 15,718.60 | 169.8 | 10.2 |
| 07-Aug-14 | 7,312.50 | 161.5 | 10.6 |
| 07-Aug-14 | 1,176.00 | 155.5 | 11.3 |
| 07-Aug-14 | 17,104.90 | 160 | 10.9 |
| 07-Aug-14 | 15,670.10 | 150.6 | 11 |
Compute resources required for de novo assembly
| Assembly step | Sort | Split | Pairwise alignment | Assembly | Total |
|---|---|---|---|---|---|
| Applicable parameters | Minimum lengthᵃ, minimum labelsᵃ | Minimum lengthᵃ, minimum labelsᵃ | Minimum lengthᵃ, minimum labelsᵃ, significance thresholdᵃ, false positive, false negative | Minimum length, minimum sites, significance threshold, false positive, false negative | Minimum length, minimum sites, significance threshold, false positive, false negative |
| Steps run | 1 | 1 | 9 | 405 | - |
| Parallel jobs per step | 1 | 3 | 1,250 | 1 | - |
| Minutes elapsed | 1 | 6 | 3,442,500 | 105,614 | 3,548,121 |
| Predictedᵇ minutes | 405 | 2,733 | 154,912,500 | 105,614 | 155,021,252 |
| Megabytes RAM used | 1 | 5,667 | 176,321,250 | 1,130,838,484 | 1,307,165,402 |
| Predictedᵇ megabytes | 405 | 2,295,135 | 7,934,456,000 | 1,130,838,484 | 9,067,590,024 |
| Megabytes disk space used | 580 | 580 | 4,640,000 | 51,874 | 4,693,034 |
| Predictedᵇ megabytes | 2,900 | 2,900 | 208,800,000 | 51,874 | 209,089,674 |
ᵃ Input parameter applies only as an output filter; it does not affect the algorithm's internal workings. A step run with lenient parameters can serve as input for a more stringent downstream step, which will filter its input.
ᵇ Estimation of resources required if all 405 Sort, Split, and Pairwise alignment steps were run.
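
The predicted figures for the Pairwise alignment column can be reproduced with simple linear scaling. The sketch below rests on two assumptions that the table suggests but does not state outright: that the 9 alignment runs correspond to the 3 × 3 combinations of false positive and false negative values (the only alignment parameters that are not output filters), and that cost scales linearly with the number of runs. The function name `predict_unshared` is hypothetical.

```python
# Counts of tested values per parameter, from the "Input parameter values" table.
N_THRESHOLDS, N_FALSE_POS, N_FALSE_NEG, N_LENGTHS, N_LABELS = 5, 3, 3, 3, 3

total_assemblies = N_THRESHOLDS * N_FALSE_POS * N_FALSE_NEG * N_LENGTHS * N_LABELS

# Assumption: alignments need rerunning only when the false positive or
# false negative value changes; the other parameters act as output filters.
alignment_runs = N_FALSE_POS * N_FALSE_NEG


def predict_unshared(observed, runs_done, runs_needed):
    """Scale an observed total linearly from runs_done to runs_needed runs."""
    return observed * runs_needed // runs_done


# Observed Pairwise alignment totals, transcribed from the table.
observed_alignment_minutes = 3_442_500
observed_alignment_disk_mb = 4_640_000

predicted_minutes = predict_unshared(
    observed_alignment_minutes, alignment_runs, total_assemblies
)
predicted_disk_mb = predict_unshared(
    observed_alignment_disk_mb, alignment_runs, total_assemblies
)

print(predicted_minutes)  # 154912500
print(predicted_disk_mb)  # 208800000
```

Both results match the predicted Pairwise alignment entries in the table, which is consistent with the linear-scaling assumption for that step.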
Fig. 1 Assembly accuracy and contig N50 lengths are affected by different input parameters. Contig N50 lengths are relatively stable to permutations of false positive label rates (FP, per 100 kbp), false negative label rates (FN), and minimum labels per molecule (Labels) (left). When the same assemblies are grouped by minimum molecule length (Lengths, in kbp) and significance threshold (P-val.) (right), more substantial changes in response to these input parameters are observed. Some inaccurate assemblies have high N50 lengths. The converse is also true.
Fig. 2 Several metrics of assembly contiguity and internal consistency fail to predict assembly accuracy.