Víctor Lorente-Leal, Damien Farrell, Beatriz Romero, Julio Álvarez, Lucía de Juan, Stephen V Gordon.
Abstract
Whole genome sequencing (WGS) and allied variant calling pipelines are a valuable tool for the control and eradication of infectious diseases, since they allow the assessment of the genetic relatedness of strains of animal pathogens. In the context of the control of tuberculosis (TB) in livestock, mainly caused by Mycobacterium bovis, these tools offer a high-resolution alternative to traditional molecular methods in the study of herd breakdown events. However, despite the increased use and efforts in the standardization of WGS methods in human tuberculosis around the world, the application of these WGS-enabled approaches to control TB in livestock is still in early development. Our study pursued an initial evaluation of the performance and agreement of four publicly available pipelines for the analysis of M. bovis WGS data (vSNP, SNiPgenie, BovTB, and MTBseq) on a set of simulated Illumina reads generated from a real-world setting with high TB prevalence in cattle and wildlife in the Republic of Ireland. The overall performance of the evaluated pipelines was high, with recall and precision rates above 99% once repeat-rich and problematic regions were removed from the analyses. In addition, when the same filters were applied, distances between inferred phylogenetic trees were similar and pairwise comparison revealed that most of the differences were due to the positioning of polytomies. Hence, under the studied conditions, all pipelines offer similar performance for variant calling to underpin real-world studies of M. bovis transmission dynamics.
Keywords: Bovine Tuberculosis (bTB); Mycobacterium bovis; Mycobacterium tuberculosis complex (MTBC); SNP analysis; bioinformatics; genomic epidemiology; variant calling pipeline; whole genome sequencing (WGS)
Year: 2021 PMID: 34970617 PMCID: PMC8712436 DOI: 10.3389/fvets.2021.780018
Source DB: PubMed Journal: Front Vet Sci ISSN: 2297-1769
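The abstract reports recall and precision after repeat-rich and problematic regions were removed. As a minimal sketch of how such rates can be computed from variant positions (this is not the authors' code; the function names, the toy positions, and the masked interval are all hypothetical), one can compare a pipeline's called positions against the simulated truth set before and after masking:

```python
def mask_positions(positions, regions):
    """Drop positions inside any (start, end) region (1-based, inclusive)."""
    return {p for p in positions if not any(s <= p <= e for s, e in regions)}


def recall_precision(called, truth):
    """Return (recall, precision) for two sets of variant positions."""
    tp = len(called & truth)   # variants found by the pipeline and present in the truth set
    fp = len(called - truth)   # variants called but absent from the truth set
    fn = len(truth - called)   # true variants the pipeline missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Hypothetical toy data: one disagreement falls inside a masked repeat region.
truth = {100, 250, 400, 900, 1500}
called = {100, 250, 400, 950, 1500}
repeat_regions = [(900, 1000)]

unfiltered = recall_precision(called, truth)
filtered = recall_precision(
    mask_positions(called, repeat_regions),
    mask_positions(truth, repeat_regions),
)
print("unfiltered:", unfiltered)
print("filtered:  ", filtered)
```

In this toy case both rates rise once the masked interval is excluded, because the only false positive and the only false negative lie inside it.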
Figure 1. Process summary. Raw VCF files obtained from a real-life phylogeny (22) were used to generate artificial genomes using SimuG and the M. bovis AF2122/97 reference genome. Artificial FASTQ Generator was then used to generate artificial reads, which were input into the evaluated pipelines. Identified variants and output phylogenetic trees were compared between pipelines and the simulation.
Pipeline properties of the different tools evaluated in this study.
| | vSNP | SNiPgenie | BovTB | MTBseq |
|---|---|---|---|---|
| Institution | USDA-APHIS | UCD | APHA | LLI – RCB |
| Language | Python | Python | Nextflow | Perl |
| Reference | NC_002945.4 | LT708304.1 | LT708304.1 | NC_002945.4 |
| Parameter setup | No | Yes | No | Yes |
| Deduplication | Picard | No | FastUniq | Picard |
| Trimming | None | Yes | Trimmomatic | None |
| Read aligner | BWA | BWA | BWA | BWA |
| SNP calling | FreeBayes | BCFtools | BCFtools | SAMtools + GATK |
| Phred base quality | 20 (Step 1) | User defined | 10 | 20 |
| Normalize | No | No | Yes | Yes |
| SNP quality threshold | 150 | ≥40 or User defined | None | None |
| Min. map quality | 56 | 60 | None | None |
| SNP coverage depth | None | 30 | 5 | 4F and 4R |
| Region filter | Excel file (validated problematic positions) | BED file (PE/PPE genes) | TSV (95% similarity self-BLAST) | TSV file (repetitive sequences) |
| Proximality filter | None | Yes | None | Yes |
| Allele frequency/fraction | 0.05 | DP4>4 | ≥ 0.8 | 75% |
| Considers as diploid | Yes | No | No | No |
| Low coverage positions | Reference if QUAL < 50; N if 50 < QUAL < 150 | Reference | Reference | Consensus base or ignore position if quality is below thresholds in >5% of samples |
| Alignment file | Core SNPs (polymorphic) | Core SNPs (polymorphic) | Consensus genome | Core SNPs (all) |
| Spoligotyping | Yes | Yes | No | No |
| Tree building | RAxML | RAxML | No | No |
| GUI | No | Yes | No | No |
| Other analyses | Lineage classification | INDEL analysis | Lineage classification | Lineage classification, antibiotic resistance annotation |
Only allows for minor parameter settings, such as reference file or type of analysis in step 2.
Deactivated by default.
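The table lists per-pipeline thresholds for coverage depth and allele fraction. The sketch below illustrates how thresholds of this kind act as a hard filter on candidate variants. It is an assumption-laden illustration, not the implementation of any of the four pipelines: the `Candidate` record and `passes_filter` function are hypothetical, and the example values (depth 5, allele fraction 0.8) are simply borrowed from the BovTB column.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one candidate SNP position."""
    pos: int        # position on the reference genome
    depth: int      # total reads covering the position
    alt_reads: int  # reads supporting the alternate allele


def passes_filter(c, min_depth, min_alt_fraction):
    """Keep a candidate only if it meets both the depth and allele-fraction thresholds."""
    if c.depth == 0 or c.depth < min_depth:
        return False
    return c.alt_reads / c.depth >= min_alt_fraction


candidates = [
    Candidate(pos=100, depth=30, alt_reads=29),  # well supported
    Candidate(pos=200, depth=4, alt_reads=4),    # too shallow
    Candidate(pos=300, depth=20, alt_reads=12),  # mixed support (fraction 0.6)
]

# Example thresholds taken from the BovTB column of the table above.
kept = [c.pos for c in candidates if passes_filter(c, min_depth=5, min_alt_fraction=0.8)]
print(kept)
```

Real pipelines apply such thresholds alongside base quality, mapping quality, and region filters, so the same reads can yield different call sets depending on which column of the table is used.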
Figure 2. Effect of hard filtering on the performance of the evaluated pipelines when compared to the simulated dataset, as indicated by the (A) recall and (B) precision rates. Asterisks indicate filters in MTBseq for which the minimum coverage threshold was adjusted.
Figure 3. Effect of filtering on the total number of homoplasic positions per pipeline identified by HomoplasyFinder. The percentage represents the proportion of homoplasic positions out of the total number of positions identified using each specific filter.
Figure 4. Venn diagram showing the agreement between the positions identified by the different pipelines and the simulated dataset (A) before and (B) after hard filters (filter F) were applied.
Figure 5. MCA analysis of the RF distances (first two dimensions) between maximum likelihood (ML) trees obtained from core SNP multi-FASTA alignments produced by (A) vSNP, (B) SNiPgenie, (C) BovTB, and (D) MTBseq. Shapes without outlines correspond to bootstrap replicates, whereas bold shapes correspond to the best ML trees output by RAxML. Color shading corresponds to the hard filtering approach used.
Figure 6. Pairwise comparison of filtered (filter F) simulated trees (left) and trees obtained from the evaluated pipelines (right): (A) vSNP, (B) SNiPgenie, (C) BovTB, and (D) MTBseq.
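The tree comparisons in Figures 5 and 6 rest on RF (Robinson-Foulds) distances. As a minimal, self-contained sketch of that metric (not the software used in the study; trees are represented here as hypothetical nested tuples of leaf names rather than Newick files), the distance can be computed as the number of non-trivial bipartitions present in one unrooted tree but not the other:

```python
def leaf_set(tree):
    """Return all leaf names under a nested-tuple tree (a leaf is a str)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*[leaf_set(child) for child in tree])


def splits(tree):
    """Collect the non-trivial bipartitions induced by the internal edges."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)  # reference leaf used to pick a canonical side
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*[visit(child) for child in node])
        # A split is informative only if both sides hold at least two leaves.
        if 1 < len(clade) < len(all_leaves) - 1:
            found.add(clade if ref in clade else all_leaves - clade)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: size of the symmetric difference of split sets."""
    return len(splits(tree_a) ^ splits(tree_b))


resolved = (("A", "B"), ("C", "D"), "E")
conflicting = (("A", "C"), ("B", "D"), "E")
polytomy = ("A", "B", ("C", "D"), "E")  # A, B and E left unresolved

print(rf_distance(resolved, conflicting))
print(rf_distance(resolved, polytomy))
```

The `polytomy` tree contradicts nothing in `resolved`; it only lacks one split, yet the distance is still non-zero. This is one way in which differences in the placement of polytomies, as noted in the abstract, can contribute to RF distances between otherwise compatible trees.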