Wolfgang Maier, Simon Bray, Marius van den Beek, Dave Bouvier, Nathaniel Coraor, Milad Miladi, Babita Singh, Jordi Rambla De Argila, Dannon Baker, Nathan Roach, Simon Gladman, Frederik Coppens, Darren P Martin, Andrew Lonie, Björn Grüning, Sergei L Kosakovsky Pond, Anton Nekrutenko.
Abstract
The COVID-19 pandemic is the first global health crisis to occur in the age of big genomic data. Although data generation capacity is well established and sufficiently standardized, analytical capacity is not. To establish analytical capacity, it is necessary to pull together global computational resources and deliver the best open source tools and analysis workflows within a ready-to-use, universally accessible resource. Such a resource should not be controlled by a single research group, institution, or country. Instead, it should be maintained by a community of users and developers who ensure that the system remains operational and populated with current tools. A community is also essential for facilitating the types of discourse needed to establish best analytical practices. Bringing together public computational research infrastructure from the USA, Europe, and Australia, we developed a distributed data analysis platform that accomplishes these goals. It is immediately accessible to anyone in the world and is designed for the analysis of rapidly growing collections of deep sequencing datasets. We demonstrate its utility by detecting allelic variants in high-quality existing SARS-CoV-2 sequencing datasets and by continuous reanalysis of COG-UK data. All workflows, data, and documentation are available at https://covid19.galaxyproject.org.
Year: 2021 PMID: 33791701 PMCID: PMC8010728 DOI: 10.1101/2021.03.25.437046
Source DB: PubMed Journal: bioRxiv
Figure 1. Analysis flow in our analysis system. VCF = variant call format, TSV = tab separated values, JSON = JavaScript Object Notation.
Description of analysis workflows. PE = paired-end; SE = single-end; AV = allelic variant; ONT = Oxford Nanopore Technologies.
| # | Workflow | Input | Read aligner | AV caller |
|---|---|---|---|---|
| 1 | Illumina RNAseq SE | SE fastq | bowtie2 | lofreq |
| 2 | Illumina RNAseq PE | PE fastq | bwa-mem | lofreq |
| 3 | Illumina ARTIC | PE fastq | bwa-mem | lofreq |
| 4 | ONT Artic | ONT fastq/fasta | minimap2 | medaka |
| 5 | Reporting | Output of any of the above workflows | | |
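The choice of workflow in the table above depends only on the sequencing platform, the library strategy, and the read layout. A minimal Python sketch of that lookup follows. The aligner and caller names are copied from the table; the key names and the `select_workflow` helper are hypothetical and are not part of the Galaxy workflows themselves.

```python
# Aligner/caller pairs copied from the workflow table.
# The dictionary keys and this helper are illustrative assumptions.
WORKFLOWS = {
    ("illumina", "rnaseq", "SE"): {
        "workflow": "Illumina RNAseq SE", "aligner": "bowtie2", "caller": "lofreq"},
    ("illumina", "rnaseq", "PE"): {
        "workflow": "Illumina RNAseq PE", "aligner": "bwa-mem", "caller": "lofreq"},
    ("illumina", "artic", "PE"): {
        "workflow": "Illumina ARTIC", "aligner": "bwa-mem", "caller": "lofreq"},
    ("ont", "artic", None): {
        "workflow": "ONT Artic", "aligner": "minimap2", "caller": "medaka"},
}


def select_workflow(platform, library, layout=None):
    """Return the table row matching a dataset description."""
    platform = platform.lower()
    library = library.lower()
    # The table lists no read layout for ONT data, so it is ignored there.
    layout = None if platform == "ont" or layout is None else layout.upper()
    key = (platform, library, layout)
    if key not in WORKFLOWS:
        raise ValueError(f"no workflow in the table for {key}")
    return WORKFLOWS[key]


if __name__ == "__main__":
    print(select_workflow("Illumina", "ARTIC", "PE"))
    print(select_workflow("ONT", "ARTIC"))
```

Combinations that the table does not list (for example, single-end ARTIC data) raise an error rather than falling back to a default.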
Jupyter and ObservableHQ notebooks used for Stage 2 of the analysis
| Notebook framework | Figure from this paper generated using this notebook | Link |
|---|---|---|
| Jupyter | 4, S1– S6 | |
| ObservableHQ | 3 (also see |
Figure 2. Number of SRA accessions for each sequencing technology and library preparation strategy.
Allelic-variant (AV) counts pre/post filtering. AVs = number of all detected AVs; Sites = number of distinct variable sites across the SARS-CoV-2 genome; Samples = number of samples in the corresponding dataset. These datasets will also be available from the Viral Beacon project at https://covid19beacon.crg.eu/.
| Dataset | Links | AVs | Sites | Samples |
|---|---|---|---|---|
| “Boston” | | 9,249/8,492 | 1,027/315 | 639 |
| “COG-Pre” | | 7,338/4,761 | 2,747/287 | 503 |
| “COG-Post” | | 38,919/38,813 | 5,760/1,795 | 1,818 |
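To make the effect of filtering in the table above easier to compare across datasets, the pre/post counts can be turned into retained fractions. The sketch below is plain arithmetic on the numbers in the table. It assumes the first number in each pair is the pre-filtering count and the second the post-filtering count, as the caption states; the variable and function names are illustrative.

```python
# (pre, post) counts copied from the table; names are illustrative.
COUNTS = {
    "Boston": {"avs": (9249, 8492), "sites": (1027, 315), "samples": 639},
    "COG-Pre": {"avs": (7338, 4761), "sites": (2747, 287), "samples": 503},
    "COG-Post": {"avs": (38919, 38813), "sites": (5760, 1795), "samples": 1818},
}


def retained_fraction(pre, post):
    """Fraction of items that survive filtering."""
    if pre <= 0:
        raise ValueError("pre-filtering count must be positive")
    return post / pre


if __name__ == "__main__":
    for name, row in COUNTS.items():
        av_kept = retained_fraction(*row["avs"])
        site_kept = retained_fraction(*row["sites"])
        # Post-filtering AVs divided by the number of samples.
        avs_per_sample = row["avs"][1] / row["samples"]
        print(f"{name}: {av_kept:.1%} of AVs and {site_kept:.1%} of sites "
              f"retained; {avs_per_sample:.1f} AVs per sample after filtering")
```

In all three datasets, filtering removes a larger share of distinct sites than of AVs.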
Figure 3. Dot plots of allelic variants (AVs) across samples in the “Boston” dataset. X-axis: genome position; Y-axis: samples (rows are samples, columns are genomic coordinates). Colors correspond to functional classes of AVs. Samples are arranged by hierarchical clustering using cosine distances on mean allele frequencies of all AVs. A. All AVs that occur in at least 4 samples. B. AVs that appear only at allele frequency (AF) ≤ 10% and occur in at least 4 samples each. These variants are partitioned into 10 clusters by K-medoids with the Hamming distance on AF vectors; the cluster with 8 variants is highlighted in orange.
An interactive version is available at https://observablehq.com/@spond/intrahost-variant-exploration-landing
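The Figure 3 caption names two distance measures: cosine distance between samples and Hamming distance between variants. The sketch below shows both in plain Python on made-up allele frequencies. It is not the notebook code behind the figure. In particular, treating an AF above zero as "present" before taking the Hamming distance is an assumption, since the caption does not say how AF vectors were compared.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length AF vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here for all-zero vectors
    return 1.0 - dot / (norm_u * norm_v)


def hamming_distance(u, v, present=0.0):
    """Number of positions where presence/absence differs.

    Assumption: a variant counts as present when its AF exceeds `present`.
    """
    return sum((a > present) != (b > present) for a, b in zip(u, v))


# Hypothetical AF matrix: one row per sample, one column per variant.
af = {
    "sample_a": [0.06, 0.07, 0.00, 0.95],
    "sample_b": [0.05, 0.08, 0.00, 0.00],
    "sample_c": [0.00, 0.00, 0.09, 0.97],
}

if __name__ == "__main__":
    samples = list(af)
    # Sample-to-sample cosine distances (input to hierarchical clustering).
    for i, s in enumerate(samples):
        for t in samples[i + 1:]:
            print(f"cosine({s}, {t}) = {cosine_distance(af[s], af[t]):.3f}")
    # Variant-to-variant Hamming distances use the columns of the matrix.
    variants = list(zip(*af.values()))
    print("hamming(variant 0, variant 1) =",
          hamming_distance(variants[0], variants[1]))
```

A full pairwise distance matrix built this way could then be passed to any hierarchical clustering or K-medoids implementation.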
Eight low-frequency allelic-variants co-occurred in 8 samples.
| Nucleotide Variant | Sample count | Effect |
|---|---|---|
| 4,338:C→T | 29 | nsp3/S540F |
| 6,604:A→G | 54 | nsp3/L1295 |
| 9,535:C→T | 37 | nsp4/T327 |
| 12,413:A→C | 63 | nsp8/N108H |
| 13,755:A→C | 53 | RdRp/R105S |
| 14,304:A→C | 30 | RdRp/K288N |
| 17,934:C→A | 30 | helicase/T566 |
| 20,716:A→T | 37 | MethTr/M20L |
| 26,433:A→C | 35 | E/K63N |
Figure 4. Intersection between allelic-variants (AVs) reported here and variants of concern (VOC). The large marker in the “COG-Post” dataset corresponds to the L18F change in gene S. Marker size is proportional to the fraction of samples containing the variant. [min;max] gives the minimum and maximum counts of samples containing the variants shown in this figure; e.g., in “Boston” the largest marker corresponds to an AV shared by 7 samples and the smallest to an AV shared by 3 samples.
“Boston” (https://covid19.galaxyproject.org/genomics/interactive_images/voc_Boston.html)
Allelic-variants (AVs) with maximum allele frequency < 80% overlapping with codons under selection. % samp = percentage of samples containing a given AV in each dataset.
| POS | R | A | Gene | AA | % samp | mean | min | max | codon |
|---|---|---|---|---|---|---|---|---|---|
| “Boston” | |||||||||
| 25,842 | A | C | orf3a | T151P | 6.259 | 0.083 | 0.051 | 0.180 | ACT |
| 22,254 | A | G | S | I231M | 2.034 | 0.066 | 0.050 | 0.102 | ATA |
| “COG-Pre” | |||||||||
| 21,637 | C | T | S | P26S | 0.397 | 0.520 | 0.260 | 0.779 | CCT |
| 22,343 | G | T | S | G261V | 0.397 | 0.425 | 0.091 | 0.758 | GGT |
| 28,825 | C | T | N | R185C | 0.397 | 0.619 | 0.528 | 0.709 | CGT |
| “COG-Post” | |||||||||
| 1,463 | G | A | nsp2 | G220D | 0.715 | 0.737 | 0.669 | 0.767 | GGT |
| 22,343 | G | T | S | G261V | 0.660 | 0.703 | 0.672 | 0.737 | GGT |
| 25,217 | G | T | S | G1219V | 0.330 | 0.752 | 0.746 | 0.756 | GGT |
| 21,845 | C | T | S | T95I | 0.275 | 0.466 | 0.069 | 0.731 | ACT |
| 29,252 | C | T | N | S327L | 0.220 | 0.196 | 0.083 | 0.359 | TCG |
| 29,170 | C | T | N | H300Y | 0.165 | 0.314 | 0.076 | 0.735 | CAT |
| 4,441 | G | T | nsp3 | V575L | 0.110 | 0.725 | 0.712 | 0.739 | GTG |
| 21,829 | G | T | S | V90F | 0.110 | 0.508 | 0.242 | 0.774 | GTT |
| 28,394 | G | A | N | R41Q | 0.110 | 0.318 | 0.054 | 0.582 | CGG |
| 29,446 | G | T | N | V392L | 0.110 | 0.770 | 0.765 | 0.774 | GTG |
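The summary columns in the table above can be derived from per-sample allele frequencies of a single AV. The sketch below assumes that `mean`, `min`, and `max` are taken over the samples in which the AV was called, and that `% samp` is 100 × (samples with the AV) / (samples in the dataset). Neither definition is spelled out in the caption. The function name and the example input are illustrative: the two AF values are chosen to match the min and max reported for 21,637 C→T in “COG-Pre” (503 samples), not taken from the underlying data.

```python
from statistics import mean


def summarize_av(afs, n_samples, max_af_cutoff=0.80):
    """Summarize one AV from the AFs observed in samples that carry it.

    afs           -- allele frequencies, one per sample containing the AV
    n_samples     -- total number of samples in the dataset
    max_af_cutoff -- AVs are kept only if their maximum AF is below this
    """
    if not afs:
        raise ValueError("at least one allele frequency is required")
    if n_samples < len(afs):
        raise ValueError("n_samples cannot be smaller than len(afs)")
    return {
        "pct_samp": 100.0 * len(afs) / n_samples,
        "mean": mean(afs),
        "min": min(afs),
        "max": max(afs),
        "below_cutoff": max(afs) < max_af_cutoff,
    }


if __name__ == "__main__":
    # Illustrative input, assuming the AV was called in exactly two samples.
    row = summarize_av([0.260, 0.779], n_samples=503)
    print({k: round(v, 3) if isinstance(v, float) else v
           for k, v in row.items()})
```

Under these assumptions the sketch gives a `% samp` of about 0.398 and a mean of about 0.520, close to the 0.397 and 0.520 reported in the table. The small difference in `% samp` may reflect truncation rather than rounding in the source.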