Pasi K Korhonen, Babak Shaban, Noel G Faux, Liina Kinkar, Bill C H Chang, Daxi Wang, Bicheng Yang, Neil D Young, Robin B Gasser.
Abstract
The revolution in genomics has enabled large-scale population genetic investigations of a wide range of organisms, but there has been a relatively limited focus on improving analytical pipelines. To efficiently analyse large data sets, highly integrated and automated software pipelines, which are easy to use, efficient, reliable, reproducible and run in multiple computational environments, are required. A number of software workflows have been developed to handle and process such data sets for population genetic analyses, but effective, specialized pipelines for genetic and statistical analyses of nonmodel organisms are lacking. For most species, resources for variomes (sets of genetic variations found in populations of species) are not available, and/or genome assemblies are often incomplete and fragmented, complicating the selection of the most suitable reference genome when multiple assemblies are available. Additionally, the biological samples used often contain extraneous DNA from sources other than the species under investigation (e.g., microbial contamination), which needs to be removed prior to genetic analyses. For these reasons, we established a new pipeline, called Escalibur, which includes functionalities such as data trimming and mapping; selection of a suitable reference genome; removal of contaminating read data; recalibration of base calls; and variant-calling. Escalibur uses the proven GATK variant caller and the Workflow Description Language (WDL), and is, therefore, a highly efficient and scalable pipeline for the genome-wide identification of nucleotide variation in eukaryotes. This pipeline is available at https://gitlab.unimelb.edu.au/bioscience/escalibur (version 0.3-beta) and is essentially applicable to any prokaryote or eukaryote.
Keywords: bioinformatics/workflows; molecular evolution; parasitology; population genetics
Year: 2022 PMID: 35182034 PMCID: PMC9314989 DOI: 10.1111/1755-0998.13600
Source DB: PubMed Journal: Mol Ecol Resour ISSN: 1755-098X Impact factor: 8.678
FIGURE 1. Schematic representation of the genomic analysis pipeline, Escalibur. In step 1, paired‐end (PE) sequence reads in each genomic library are trimmed; resultant reads are mapped to all reference genomes; binary sequence alignment/map (BAM) files for each sample are combined; and PCR duplicates are marked. The optimum reference genome is established based on mapping‐rate and read‐coverage averages. In step 2, the mapping quality is assessed, extraneous “contaminating” sequences (originating, for example, from the environment, microbes and/or host species) are removed and BAM files are corrected accordingly. In step 3, using GATK, the corrected BAM files are optionally recalibrated; the base quality score recalibration (BQSR) is applied in two iterations against high‐quality variants; initial variants are called; variant call format (VCF) files are created; and variants are then filtered to retain only those which are statistically significant.
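The reference-selection idea in step 1 can be illustrated with a minimal sketch. The caption states only that the choice rests on mapping-rate and read-coverage averages; how Escalibur combines the two is not described here, so the ranking rule below (highest mean mapping rate, with mean coverage as a tie-breaker) is an assumption, and the assembly names, the `pick_reference` function and all numbers are hypothetical.

```python
from statistics import mean

# Hypothetical per-sample statistics for each candidate reference genome:
# (fraction of reads mapped, mean read coverage). Values are illustrative only.
stats = {
    "assembly_A": [(0.91, 34.2), (0.89, 31.0)],
    "assembly_B": [(0.95, 36.5), (0.94, 35.1)],
}

def pick_reference(per_reference_stats):
    """Return the reference with the highest average mapping rate,
    using average coverage as a tie-breaker (an assumed ranking rule)."""
    def score(name):
        rates, coverages = zip(*per_reference_stats[name])
        return (mean(rates), mean(coverages))
    return max(per_reference_stats, key=score)

best = pick_reference(stats)
print(best)
```

In practice the per-sample values would come from the mapping statistics of the merged, duplicate-marked BAM files rather than from a hand-written dictionary.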
Homozygous variants called in data representing strains CB4854 and CB4856 of Caenorhabditis elegans using Escalibur compared with variants recorded in the CeNDR database. Variants were called separately using calibrated and uncalibrated data
| Description | All SNPs (count) | Common SNPs (count; %) | Unique SNPs (count; %) | All indels (count) | Common indels (count; %) | Unique indels (count; %) |
|---|---|---|---|---|---|---|
| CB4854/uncalibrated data | | | | | | |
| CeNDR | 90,032 | 74,411; 82.6 | 15,621; 17.4 | 42,261 | 23,803; 56.3 | 18,458; 43.7 |
| Escalibur | 80,207 | 74,411; 92.8 | 5,796; 7.2 | 26,081 | 23,803; 91.3 | 2,278; 8.7 |
| CB4856/uncalibrated data | | | | | | |
| CeNDR | 214,914 | 189,597; 88.2 | 25,317; 11.8 | 95,513 | 59,628; 62.4 | 35,885; 37.6 |
| Escalibur | 211,244 | 189,597; 89.8 | 21,647; 10.2 | 68,982 | 59,628; 86.4 | 9,354; 13.6 |
| CB4854/calibrated data | | | | | | |
| CeNDR | 90,032 | 74,344; 82.6 | 15,688; 17.4 | 42,261 | 23,766; 56.2 | 18,495; 43.8 |
| Escalibur | 80,228 | 74,344; 92.7 | 5,884; 7.3 | 26,076 | 23,766; 91.1 | 2,310; 8.9 |
| CB4856/calibrated data | | | | | | |
| CeNDR | 214,914 | 187,784; 87.4 | 27,130; 12.6 | 95,513 | 58,554; 61.3 | 36,959; 38.7 |
| Escalibur | 210,552 | 187,784; 89.2 | 22,768; 10.8 | 68,216 | 58,554; 85.8 | 9,662; 14.2 |
| Effect of calibrated vs. uncalibrated data on variants called for CB4854 | | | | | | |
| Uncalibrated | 80,207 | 79,791; 99.5 | 416; 0.5 | 26,081 | 25,913; 99.4 | 168; 0.6 |
| Calibrated | 80,228 | 79,791; 99.5 | 437; 0.5 | 26,076 | 25,913; 99.4 | 163; 0.5 |
| Effect of calibrated vs. uncalibrated data on variants called for CB4856 | | | | | | |
| Uncalibrated | 211,244 | 207,472; 98.2 | 3,772; 1.8 | 68,982 | 67,259; 97.5 | 1,723; 2.5 |
| Calibrated | 210,552 | 207,472; 98.5 | 3,080; 1.5 | 68,216 | 67,259; 98.6 | 957; 1.4 |
NCBI accession identifiers PRJNA318647 and SAMN04902526.
NCBI accession identifiers PRJNA318647 and SAMN04902368.
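The "common" and "unique" columns in the table above follow from simple set arithmetic on two call sets, with each percentage taken relative to that row's own total. The sketch below shows one way to reproduce such figures; it is not the authors' comparison code. Representing a variant as a `(chromosome, position, ref, alt)` tuple is an assumption, and the function names `compare_call_sets` and `share` and the toy variants are hypothetical.

```python
def compare_call_sets(calls, other):
    """Compare one variant call set against another.

    Returns (total, common, unique, pct_common, pct_unique), where the
    percentages are relative to the size of `calls`.
    """
    total = len(calls)
    common = len(calls & other)   # variants present in both sets
    unique = len(calls - other)   # variants present only in `calls`
    return (total, common, unique,
            round(100 * common / total, 1),
            round(100 * unique / total, 1))

def share(count, total):
    """Percentage of `total` represented by `count`, to one decimal place."""
    return round(100 * count / total, 1)

# Toy call sets keyed by (chromosome, position, ref, alt); illustrative only.
pipeline_calls = {("I", 100, "A", "T"), ("I", 200, "G", "C"), ("II", 50, "T", "TA")}
database_calls = {("I", 100, "A", "T"), ("II", 50, "T", "TA"), ("X", 5, "C", "G")}

print(compare_call_sets(pipeline_calls, database_calls))

# Check against a table row: 74,411 common SNPs out of 80,207 called for
# CB4854 (uncalibrated) gives the reported 92.8% common and 7.2% unique.
print(share(74411, 80207), share(5796, 80207))
```

Because each row is normalised by its own total, the same common count yields different percentages for the CeNDR and Escalibur rows.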