A Sina Booeshaghi, Nathan B Lubock, Aaron R Cooper, Scott W Simpkins, Joshua S Bloom, Jase Gehring, Laura Luebbert, Sri Kosuri, Lior Pachter.
Abstract
Scalable, inexpensive, and secure testing for SARS-CoV-2 infection is crucial for control of the novel coronavirus pandemic. Recently developed highly multiplexed sequencing assays (HMSAs) that rely on high-throughput sequencing can, in principle, meet these demands, and present promising alternatives to currently used RT-qPCR-based tests. However, reliable analysis, interpretation, and clinical use of HMSAs require overcoming several computational, statistical, and engineering challenges. Using recently acquired experimental data, we present and validate a computational workflow based on kallisto and bustools that utilizes robust statistical methods and fast, memory-efficient algorithms to quickly, accurately, and reliably process high-throughput sequencing data. We show that our workflow is effective at processing data from all recently proposed SARS-CoV-2 sequencing-based diagnostic tests, and is generally applicable to any diagnostic HMSA.
Year: 2020 PMID: 33303831 PMCID: PMC7730459 DOI: 10.1038/s41598-020-78942-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Massively parallel diagnostic testing by high-throughput sequencing. Workflow of a high-throughput sequencing-based diagnostic test. (1) Samples are collected and prepared. (2) Samples are barcoded and amplified. (3) Multiplexed samples are pooled and sequenced using a high-throughput sequencer. (4) Sequencing data is aligned to a set of genes, (5) sample indices are error corrected, (6) counts are computed, and (7) diagnostic results are obtained.
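Steps (5) and (6) of the workflow can be illustrated with a minimal sketch. The caption does not specify how sample indices are corrected, so this example assumes a common approach: a barcode is accepted if it is on a whitelist of known sample indices, or rescued if exactly one whitelisted barcode lies within one substitution of it. The function names, the toy whitelist, and the toy reads are all hypothetical and are not taken from the kallisto or bustools code.

```python
from collections import Counter

def hamming_neighbors(barcode, alphabet="ACGT"):
    """Yield every sequence exactly one substitution away from barcode."""
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]

def correct_barcode(barcode, whitelist):
    """Return the matching whitelisted barcode, or None if unmatched or ambiguous.

    Assumption: correction is limited to a single substitution.
    """
    if barcode in whitelist:
        return barcode
    hits = {n for n in hamming_neighbors(barcode) if n in whitelist}
    return hits.pop() if len(hits) == 1 else None

def count_reads(reads, whitelist):
    """Count reads per (sample barcode, gene) after barcode correction.

    reads: iterable of (sample_barcode, gene) pairs, assumed already aligned.
    """
    counts = Counter()
    for barcode, gene in reads:
        fixed = correct_barcode(barcode, whitelist)
        if fixed is not None:
            counts[(fixed, gene)] += 1
    return counts

# Hypothetical toy data: two sample indices, four aligned reads.
whitelist = {"AAAA", "CCCC"}
reads = [("AAAA", "S"), ("AAAT", "S"), ("CCCC", "RPP30"), ("ACAC", "S")]
counts = count_reads(reads, whitelist)
```

In this toy input, the read with barcode `AAAT` is rescued to `AAAA`, while `ACAC` is two substitutions away from both whitelisted barcodes and is discarded.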
Figure 2. Sample classification, viral load prediction and limit of detection. (a) Positive and negative samples from the Plate 2 S ATCC RNA experiment can be effectively separated using logistic regression. Points correspond to samples and are colored by the known amount of viral RNA per sample. The probability of each sample having a non-zero amount of viral RNA is given by the logistic function and is painted orthogonal to the logistic regression boundary. The shape of the point indicates whether the sample was predicted to be positive for viral RNA (circle) or negative (square). (b) The standard curve measuring spike-in and virus versus the known amount of viral RNA per sample with optimal exponential coefficients determined by logistic regression; samples are colored by their predicted classification. (c) The limit of detection as estimated from 99 rounds of split/test and logistic regression to classify samples with a non-zero amount of viral RNA. The limit of detection is defined as the number of RNA molecules for which the recall is greater than 19/20 (= 0.95). (d) The viral load per sample can be predicted with a weighted linear regression using the log counts from each gene. Each point is a sample, with perfect predictions lying on the diagonal line. The size of the points represents their weight, with points weighted so that each titer is represented with equal weight. The code to reproduce each figure is here: https://github.com/pachterlab/BLCSBGLKP_2020/blob/master/notebooks/diagnostic.ipynb (a) and (b), https://github.com/pachterlab/BLCSBGLKP_2020/blob/master/notebooks/lod_fda.ipynb (c), https://github.com/pachterlab/BLCSBGLKP_2020/blob/master/notebooks/viral_load.ipynb (d).
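The limit-of-detection criterion in (c) and the weighting scheme in (d) can be sketched as follows. This is an illustration only, not the code from the linked notebooks: it assumes the limit of detection is taken as the smallest titer whose recall exceeds 0.95, and that "equal weight per titer" means each sample's weight is inversely proportional to the number of samples sharing its titer. All function names and data values are hypothetical.

```python
from collections import Counter, defaultdict

def recall_by_titer(titers, predicted_positive):
    """Recall per non-zero titer: fraction of true positives called positive."""
    hits, totals = defaultdict(int), defaultdict(int)
    for titer, called in zip(titers, predicted_positive):
        if titer > 0:  # only samples that truly contain viral RNA
            totals[titer] += 1
            hits[titer] += int(called)
    return {t: hits[t] / totals[t] for t in totals}

def limit_of_detection(recalls, threshold=0.95):
    """Smallest titer whose recall exceeds the threshold (assumed reading)."""
    passing = [t for t, r in recalls.items() if r > threshold]
    return min(passing) if passing else None

def equal_titer_weights(titers):
    """Per-sample weights such that every titer carries the same total weight."""
    per_titer = Counter(titers)
    return [1.0 / (len(per_titer) * per_titer[t]) for t in titers]

# Hypothetical data: known RNA molecules per sample and classifier calls.
titers = [0, 10, 10, 10, 10, 100, 100, 1000]
calls = [False, True, True, False, True, True, True, True]

recalls = recall_by_titer(titers, calls)
lod = limit_of_detection(recalls)
weights = equal_titer_weights(titers)
```

With these made-up values, recall at 10 molecules is 0.75, so the sketch reports 100 molecules as the limit of detection. The weights sum to 1 and could be passed as sample weights to any weighted least-squares fit of log counts against titer.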
Figure 3. Orthogonal validation by read clustering. Scatter plots between the kallisto|bustools and the starcode workflow show near identical results on the genes targeted by the SwabSeq protocol: (a) RPP30, (b) S, and (c) S spike-in. Each point is a sample and the Pearson correlation is determined for the counts for a gene for all samples between kallisto|bustools and starcode. The code to reproduce this figure is here: https://github.com/pachterlab/BLCSBGLKP_2020/blob/master/notebooks/kb_v_starcode.ipynb.
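The per-gene comparison described in this caption amounts to a Pearson correlation between two vectors of per-sample counts. A minimal sketch follows; the count values are invented for illustration and do not come from the paper's data, and the variable names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical counts for one gene across four samples from each workflow.
kb_counts = [120, 0, 45, 300]
starcode_counts = [118, 0, 47, 298]

r = pearson(kb_counts, starcode_counts)
```

A value of `r` close to 1 for each targeted gene would correspond to the near identical results the caption reports.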