| Literature DB >> 36093336 |
Lingzi Xiaoli1, Jill V Hagey1, Daniel J Park2, Christopher A Gulvik1, Erin L Young3, Nabil-Fareed Alikhan4, Adrian Lawsin1, Norman Hassell1, Kristen Knipe1, Kelly F Oakeson3, Adam C Retchless1, Migun Shakya5, Chien-Chi Lo5, Patrick Chain5, Andrew J Page4, Benjamin J Metcalf1, Michelle Su1, Jessica Rowell6, Eshaw Vidyaprakash6, Clinton R Paden1, Andrew D Huang6, Dawn Roellig1, Ketan Patel1, Kathryn Winglee1, Michael R Weigand1, Lee S Katz1.
Abstract
Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of coronavirus disease 2019 (COVID-19), has spread globally and is being surveilled through an international genome sequencing effort. Surveillance consists of sample acquisition, library preparation, and whole-genome sequencing. This has necessitated a classification scheme detailing Variants of Concern (VOC) and Variants of Interest (VOI), and the rapid expansion of bioinformatics tools for sequence analysis. These bioinformatics tools support major actionable results: maintaining quality assurance and checks, defining population structure, performing genomic epidemiology, and inferring lineage to allow reliable and actionable identification and classification. Additionally, the pandemic has required public health laboratories to rapidly reach high-throughput proficiency in sequencing library preparation and downstream data analysis. However, both processes can be limited by the lack of a standardized sequence dataset.
Keywords: Benchmarking; COVID-19; Standardization; WGS; sha256
Year: 2022 PMID: 36093336 PMCID: PMC9454940 DOI: 10.7717/peerj.13821
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 3.061
Summary descriptions of the six datasets. Each dataset is numbered, named, and described; its intended use and reference are also listed.
| Dataset | Name | Description | Intended use | Reference |
|---|---|---|---|---|
| 1 | Boston outbreak | A cohort of 63 samples from a real outbreak with three introductions; metagenomic approach | To understand the features of virus transmission during a real outbreak setting | |
| 2 | CoronaHiT rapid | A cohort of 39 samples prepared by an 18 h wet-lab protocol and sequenced on two platforms (Illumina vs MinION); amplicon-based approach | To verify that a bioinformatics pipeline finds virtually no differences between sequences from the same genome run on different platforms | |
| 3 | CoronaHiT routine | A cohort of 69 samples prepared by a 30 h wet-lab protocol and sequenced on two platforms (Illumina vs MinION); amplicon-based approach | To verify that a bioinformatics pipeline finds virtually no differences between sequences from the same genome run on different platforms | |
| 4 | VOI/VOC lineages | A cohort of 16 samples from 11 representative CDC-defined VOIs/VOCs | To benchmark lineage-calling bioinformatics software, especially for VOIs/VOCs | This study |
| 5 | Non-VOI/VOC lineages | A cohort of 39 samples from representative non-VOI/VOC lineages | To benchmark lineage-calling bioinformatics software, not specific to VOIs/VOCs | This study |
| 6 | Failed QC | A cohort of 24 samples that failed basic QC metrics, covering 8 possible failure scenarios; amplicon-based approach | To serve as controls to test bioinformatics QC cutoffs | This study |
Notes.
VOI, variant of interest; VOC, variant of concern
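The "sha256" keyword indicates that dataset integrity is tracked with SHA-256 checksums, so downloaded read files can be verified against the recorded digests. A minimal sketch of such a verification, assuming the expected digest is obtained from the dataset listing (function names and paths are illustrative, not from the publication):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks
    so large FASTQ files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: str, expected_digest: str) -> bool:
    """Compare a downloaded file's digest against the recorded checksum."""
    return sha256_of_file(path) == expected_digest.lower()
```

A failed comparison would indicate a corrupted or truncated download rather than a biological difference, which is why checksum verification belongs before any QC step.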
Figure 1: Automated workflow for identifying representative sequences for datasets.
Sequences go through several quality checks before being considered as part of a benchmark dataset. These checks include lineage agreement with Pangolin, a minimum Phred score, a minimum depth of coverage, a check with the software TheiaCov, a check of the amplicon strategy, minimization of the SNP count relative to a reference genome, and a check against the spike region's mutations. Asterisks denote steps taken with in-house Python scripts.
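The depth-of-coverage check described above (Step 2 in the QC table) can be computed from per-position depth output such as that produced by `samtools depth -a` (tab-separated reference, position, depth). A minimal sketch, assuming that input format; the function name and returned field names are illustrative:

```python
import statistics

def depth_summary(depth_lines, min_depth=10):
    """Summarize per-nucleotide depth from tab-separated
    (reference, position, depth) lines, e.g. `samtools depth -a` output."""
    depths = [int(line.split("\t")[2]) for line in depth_lines if line.strip()]
    mdn = statistics.mean(depths)  # mean depth per nucleotide (MDN)
    stdev = statistics.stdev(depths) if len(depths) > 1 else 0.0
    return {
        "mean_depth": mdn,
        "stdev": stdev,
        "cv": stdev / mdn if mdn else 0.0,       # coefficient of variation
        "bases_below_min": sum(d < min_depth for d in depths),
    }
```

Passing `min_depth=20` would reproduce the Nanopore variant of the check, since the table applies a stricter per-base depth threshold to Nanopore data.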
QC metrics.
QC metrics are shown with their thresholds, which bioinformatics tool we used, and the QC category.
| No. | QC Metrics | Cutoff | Tool (version) | Category |
|---|---|---|---|---|
| 1 | total reads | NC | FastQC | Step 1: Fastq quality check |
| 2 | read length | NC | FastQC | Step 1: Fastq quality check |
| 3 | average Phred score | >25 | FastQC | Step 1: Fastq quality check |
| 4 | mean depth per nucleotide (MDN) | >10 | Samtools | Step 2: Depth check |
| 5 | standard deviation for MDN | NC | Samtools | Step 2: Depth check |
| 6 | coefficient of variation for MDN | NC | Samtools | Step 2: Depth check |
| 7 | number of nucleotides with depth <10 (for Illumina) | <3000 | Samtools | Step 2: Depth check |
| 8 | number of nucleotides with depth <20 (for Nanopore) | <3000 | Samtools | Step 2: Depth check |
| 9 | number of paired-end reads | NC | Titan 1.4.4 | Step 3: Bioinformatics workflow check |
| 10 | assembly total length | >29400 | Titan 1.4.4 | Step 3: Bioinformatics workflow check |
| 11 | ambiguous Ns | <10% | Titan 1.4.4 | Step 3: Bioinformatics workflow check |
| 12 | assembly mean coverage | >25 | Titan 1.4.4 | Step 3: Bioinformatics workflow check |
| 13 | % mapped to the Wuhan reference | >65% | Titan 1.4.4 | Step 3: Bioinformatics workflow check |
| 14 | VADR alert number | <=1 | Titan 1.4.4 | Step 3: Bioinformatics workflow check |
| 15 | nextclade_aa_dels | NC | Titan 1.4.4 | Step 3: Bioinformatics workflow check |
| 16 | nextclade_aa_subs | NC | Titan 1.4.4 | Step 3: Bioinformatics workflow check |
| 17 | nextclade_version | NC | Titan 1.4.4 | Step 3: Bioinformatics workflow check |
| 18 | pango_lineage | NC | Titan 1.4.4 | Step 3: Bioinformatics workflow check |
| 19 | pangolin_version | NC | Titan 1.4.4 | Step 3: Bioinformatics workflow check |
Notes.
NC, not a criterion; these values are reported but not used as criteria for passing or failing a sample.
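The hard cutoffs in the QC table (ignoring the NC metrics, which are reported only) can be applied programmatically. A minimal sketch of such a pass/fail check for Illumina data; the metric key names are illustrative and do not correspond to the Titan output schema:

```python
# Cutoffs from the QC table; each maps a metric name to a predicate
# that returns True when the value passes.
ILLUMINA_CUTOFFS = {
    "average_phred": lambda v: v > 25,            # Step 1: Fastq quality check
    "mean_depth": lambda v: v > 10,               # Step 2: Depth check (MDN)
    "bases_below_min_depth": lambda v: v < 3000,  # Step 2: bases with depth <10
    "assembly_length": lambda v: v > 29400,       # Step 3: assembly total length
    "percent_ambiguous_n": lambda v: v < 10,      # Step 3: ambiguous Ns <10%
    "assembly_mean_coverage": lambda v: v > 25,   # Step 3
    "percent_mapped": lambda v: v > 65,           # Step 3: vs Wuhan reference
    "vadr_alerts": lambda v: v <= 1,              # Step 3: VADR alert number
}

def qc_pass(metrics: dict) -> tuple:
    """Return (overall_pass, failed_metric_names) for the metrics provided;
    metrics absent from the input are not evaluated."""
    failed = [name for name, ok in ILLUMINA_CUTOFFS.items()
              if name in metrics and not ok(metrics[name])]
    return (not failed, failed)
```

A check structured this way makes the Failed QC dataset (dataset 6) directly usable as a negative control: each of its eight failure scenarios should trip at least one predicate.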