Daniel Svensson, Rickard Sjögren, David Sundell, Andreas Sjödin, Johan Trygg.
Abstract
BACKGROUND: Selecting the proper parameter settings for bioinformatic software tools is challenging. Not only will each parameter have an individual effect on the outcome, but there are also potential interaction effects between parameters. Both of these effects may be difficult to predict. To make the situation even more complex, multiple tools may be run in a sequential pipeline where the final output depends on the parameter configuration for each tool in the pipeline. Because of the complexity and difficulty of predicting outcomes, in practice parameters are often left at default settings or set based on personal or peer experience obtained in a trial-and-error fashion. To allow for the reliable and efficient selection of parameters for bioinformatic pipelines, a systematic approach is needed.
Keywords: Assembly; Classification; Design of Experiments; MinION; Nanopore; Optimization; Scaffolding; Sequencing; Variant calling
Year: 2019 PMID: 31615395 PMCID: PMC6794737 DOI: 10.1186/s12859-019-3091-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Schematic visualization of doepipeline design space movement. Example of optimization of two factors (A and B) through both the screening phase (a) and the optimization phase (b), completed in 3 iterations. Each dot represents an executed pipeline with the parameters set by factors A and B. Triangles represent executed pipelines using the optima of an Ordinary Least Squares (OLS) model calculated in each optimization iteration. Red dots and triangles represent the best configuration of factors found in each iteration. Dashed lines represent the current high and low parameter settings in each iteration. Screening phase: a generalized subset design (GSD) using three levels and a reduction factor of 2 is used to span the design space. The pipelines are executed with the factor configurations suggested by the GSD and an approximate optimum is found (red dot). Optimization phase: in iteration 2, an optimization design is created around the best configuration found in the screening phase (black dots). In iteration 3, the design space is moved in the direction of the configuration of factors that produced the best result (red triangle) in iteration 2. doepipeline halts when the best response is produced by a configuration of factors that lies close to the center point (red triangle in iteration 3).
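The optimization phase described in the caption can be illustrated with a minimal sketch. It is not the doepipeline implementation: `run_pipeline` is a hypothetical stand-in for executing a real pipeline, the three-level full factorial design, the grid search over the fitted model, and the `tol` closeness rule are all assumptions made for illustration, and the GSD screening phase is left out.

```python
import itertools
import numpy as np

def run_pipeline(a, b):
    """Hypothetical pipeline response to maximize (best at A=7, B=3)."""
    return -((a - 7.0) ** 2) - 2.0 * (b - 3.0) ** 2

def quadratic_terms(points):
    """Model matrix for an OLS model with linear, interaction and squared terms."""
    a, b = points[:, 0], points[:, 1]
    return np.column_stack([np.ones_like(a), a, b, a * b, a ** 2, b ** 2])

def optimize(center, span, bounds, max_iter=10, tol=0.3):
    center = np.asarray(center, dtype=float)
    span = np.asarray(span, dtype=float)
    lower, upper = np.asarray(bounds, dtype=float).T
    levels = np.array(list(itertools.product((-1.0, 0.0, 1.0), repeat=2)))
    for iteration in range(1, max_iter + 1):
        # Current low and high settings (the dashed lines in the figure).
        low = np.clip(center - span / 2, lower, upper)
        high = np.clip(center + span / 2, lower, upper)
        # Execute the pipeline at every design point (the dots).
        design = np.clip(center + levels * span / 2, lower, upper)
        scores = np.array([run_pipeline(*p) for p in design])
        # Fit the OLS model and locate its optimum inside the current design.
        coef, *_ = np.linalg.lstsq(quadratic_terms(design), scores, rcond=None)
        grid = np.array(list(itertools.product(
            np.linspace(low[0], high[0], 21), np.linspace(low[1], high[1], 21))))
        predicted = grid[np.argmax(quadratic_terms(grid) @ coef)]
        # Execute the pipeline at the model optimum as well (the triangle).
        candidates = np.vstack([design, predicted])
        results = np.append(scores, run_pipeline(*predicted))
        best = candidates[np.argmax(results)]
        # Halt when the best configuration lies close to the center point.
        if np.all(np.abs(best - center) <= tol * span):
            return best, iteration
        # Otherwise move the design space toward the best configuration.
        center = best
    return best, max_iter

best, iterations = optimize(center=(2.0, 8.0), span=(4.0, 4.0),
                            bounds=[(0.0, 10.0), (0.0, 10.0)])
print(best, iterations)
```

Because the toy response is itself quadratic, the fitted model is essentially exact here; with a real pipeline the model optimum is only a suggestion, which is why it is executed and compared against the design points before the design space is moved.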
Factors in the de-novo assembly case
| Parameter | Abbr. | Type | Min | Max | Default | Optimized |
|---|---|---|---|---|---|---|
| Size of k-mer ( | KMER | Ordinal | 20 | 90 | 31 (a) | 38 |
| Minimum mean k-mer coverage of a unitig ( | MIKC | Quantitative | 2 | 15 | sqrt(median) (b) | 8.5 |
| Minimum alignment length of a read ( | MIAL | Ordinal | 20 | 60 | 40 | 30 |
| Minimum number of pairs required for building contigs ( | MIPA | Ordinal | 5 | 15 | 10 | 15 |
The four factors investigated in the de-novo assembly case are described above. The letter in parentheses following the parameter name is the parameter used in the abyss-pe command line interface. Min and max values define the design space. a: No default value is explicitly specified in the ABySS documentation; a k-mer size of 31 was used here for comparison purposes. b: This refers to the square root of the median k-mer coverage, which is affected by the sequencing depth and the choice of k-mer size. The optimized values are the combination of factor values that produced the best outcome, as found by doepipeline.
Responses in the de-novo assembly case
| Response | Abbr. | Criterion | Low/high limit (a) | Target | Default (b) | Optimized |
|---|---|---|---|---|---|---|
| Total sequence in assembly (bp) | tSeq | Maximize | 1,830,000 | 1,894,157 | 1,835,427 | 1,864,165 |
| Number of contigs in assembly | nSeq | Minimize | 95 | 85 | 91 | 89 |
| N50 | N50 | Maximize | 28,000 | 35,000 | 28,149 | 31,847 |
The three responses measured in the de-novo assembly case are described above. a: Responses with the criterion maximize have a low limit, and those with the criterion minimize have a high limit. b: Default values are based on using a k-mer size of 31 and leaving all other parameters unchanged.
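The limit and target columns imply that each response can be mapped onto a common scale before configurations are compared. A minimal sketch of one common way to do this follows: a linear desirability per response, combined by a geometric mean. This excerpt does not state how doepipeline combines responses, so the linear shape and the geometric mean are assumptions; only the numbers are taken from the table.

```python
import math

# (limit, target) per response, taken from the table above.
RESPONSES = {
    "tSeq": (1_830_000, 1_894_157),  # maximize: low limit, then target
    "nSeq": (95, 85),                # minimize: high limit, then target
    "N50": (28_000, 35_000),         # maximize: low limit, then target
}

def desirability(value, limit, target):
    """Return 0 at the limit or worse, 1 at the target or better, linear between.

    The same formula covers both criteria, because the limit lies below the
    target for maximize and above it for minimize.
    """
    score = (value - limit) / (target - limit)
    return min(1.0, max(0.0, score))

def overall_desirability(values):
    """Geometric mean of the individual desirabilities (assumed combination rule)."""
    scores = [desirability(values[name], *RESPONSES[name]) for name in RESPONSES]
    return math.prod(scores) ** (1.0 / len(scores))

default = overall_desirability({"tSeq": 1_835_427, "nSeq": 91, "N50": 28_149})
optimized = overall_desirability({"tSeq": 1_864_165, "nSeq": 89, "N50": 31_847})
print(f"default={default:.3f} optimized={optimized:.3f}")
```

Under these assumptions the optimized configuration scores higher than the default one, since all three responses move from near their limits toward their targets.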
Factors in the scaffolding case
| Parameter | Abbr. | Type | Min | Max | Default | Optimized |
|---|---|---|---|---|---|---|
| Minimum alignment length to allow a contig to be included for scaffolding ( | ALEN | Ordinal | 0 | 5000 | 0 | 0 |
| Minimum gap between two contigs ( | GLEN | Ordinal | −3000 | 3000 | −200 | −750 |
| Maximum link ratio between two best contig pairs ( | RRAT | Quantitative | 0.1 | 0.7 | 0.3 | 0.325 |
| Minimum identity of the alignment of the long reads to the contig sequences ( | IDEN | Ordinal | 30 | 90 | 70 | 82 |
The four factors investigated in the scaffolding case are described above. The letter in parentheses following the parameter name is the parameter used in the SSPACE command line interface. Min and max values define the design space. The optimized values are those that in combination produced the best outcome, as found by doepipeline.
Factors in the k-mer classification case
| Parameter | Abbr. | Type | Min | Max | Default | Optimized |
|---|---|---|---|---|---|---|
| Minimum k-mer hits | MH | Ordinal | 1 | 200 | (a) | 14 |
| Standard deviation of the relative errors of the estimate | PRES | Ordinal | 10 | 18 | 12 | 17 |
| Minimum tax-ID score threshold | FILT | Quantitative | 0 | 0.05 | 0 | 0 |
The three factors investigated in the k-mer classification case are described above. Min and max values define the design space. The optimized values are those that in combination produced the best outcome, as found by doepipeline. a: To our knowledge, the KrakenUniq documentation does not state the default value.
Factors in the variant calling case
| Step | Parameter | Abbr. | Type | Min | Max | Default | Optimized |
|---|---|---|---|---|---|---|---|
| Variant calling | Global assumed mismapping rate for reads ( | GMQ | Ordinal | 20 | 55 | 45 | 46 |
| Variant calling | Minimum base quality for calling ( | MBQ | Ordinal | 5 | 25 | 10 | 10 |
| Variant calling | Minimum reads per alignment start ( | RAS | Ordinal | 5 | 25 | 10 | 20 |
| Variant calling | Minimum confidence threshold for calling ( | SCC | Quantitative | 5 | 25 | 10 | 5 |
| Variant filtering | Quality by depth ( | QD | Quantitative | 0 | 10 | 2 | 0.41 |
| Variant filtering | Read position rank sum test ( | RPRS | Quantitative | −40 | 0 | −20 | −37.5 |
| Variant filtering | Fisher test for strand bias ( | FS | Quantitative | 0 | 250 | 200 | 62.5 |
| Variant filtering | Strand odds ratio ( | SOR | Quantitative | 0 | 20 | 10 | 8.16 |
The factors investigated in the variant calling case are described above. The optimization was carried out sequentially for two main steps, variant calling and variant filtering; the step each factor belongs to is indicated. For the variant calling step, the factor's corresponding command line flag is given in parentheses after the parameter name. For the variant filtering step, the corresponding information tag annotated in the VCF file is indicated in parentheses. The min and max values define the design space. The default values for all factors are also indicated; for the calling step they are the built-in default values of the HaplotypeCaller tool, while for the filtering step the default values are those recommended by the GATK team. The optimized values are those that in combination produced the best outcome, as found by doepipeline.
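To make the role of the filtering factors concrete, the sketch below applies the default and optimized thresholds from the table to a single variant's annotations. Several things here are assumptions rather than content from this excerpt: the INFO tag names (`QD`, `ReadPosRankSum`, `FS`, `SOR`), the direction of each comparison (fail on low `QD` and `ReadPosRankSum`, fail on high `FS` and `SOR`, following the usual GATK hard-filter convention), and the example annotation values, which are made up.

```python
# Thresholds from the table above; tag names are assumed GATK INFO annotations.
DEFAULT = {"QD": 2.0, "ReadPosRankSum": -20.0, "FS": 200.0, "SOR": 10.0}
OPTIMIZED = {"QD": 0.41, "ReadPosRankSum": -37.5, "FS": 62.5, "SOR": 8.16}

# Assumed direction: True means a variant fails when its value is BELOW the
# threshold, False means it fails when its value is ABOVE the threshold.
FAIL_BELOW = {"QD": True, "ReadPosRankSum": True, "FS": False, "SOR": False}

def failed_filters(info, thresholds):
    """Return the tags whose threshold the variant fails, in threshold order."""
    failed = []
    for tag, threshold in thresholds.items():
        value = info.get(tag)
        if value is None:
            continue  # a missing annotation is not filtered in this sketch
        fails = value < threshold if FAIL_BELOW[tag] else value > threshold
        if fails:
            failed.append(tag)
    return failed

# Made-up annotation values for one variant.
variant = {"QD": 1.2, "ReadPosRankSum": -1.0, "FS": 80.0, "SOR": 2.5}
print(failed_filters(variant, DEFAULT))
print(failed_filters(variant, OPTIMIZED))
```

The same variant can fail different filters under the two configurations, which is why the filtering thresholds are treated as factors to optimize rather than fixed constants.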