Avraam Tapinos, Bede Constantinides, My V T Phan, Samaneh Kouchaki, Matthew Cotten, David L Robertson.
Abstract
Advances in DNA sequencing technology are facilitating genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and fully exploit biological sequence data. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing, in which dimensionality reduction (i.e., compression) methods are routinely used to lessen the computational burden of analyses. In this work, we explored the application of dimensionality reduction methods to numerically represent high-throughput sequence data for three important biological applications of virus sequence data: reference-based mapping, short sequence classification and de novo assembly. Leveraging highly compressed sequence transformations to accelerate sequence comparison, our approach yielded comparable accuracy to existing approaches, further demonstrating its suitability for sequences originating from diverse virus populations. We assessed the application of our methodology using both synthetic and real viral pathogen sequences. Our results show that the use of highly compressed sequence approximations can provide accurate results, with analytical performance retained and even enhanced through appropriate dimensionality reduction of sequence data.
Keywords: DFT; DWT; PAA; alignment; assembly; compressive genomics; data compression; data transformation; taxonomic classification; time series
Year: 2019 PMID: 31035503 PMCID: PMC6563281 DOI: 10.3390/v11050394
Source DB: PubMed Journal: Viruses ISSN: 1999-4915 Impact factor: 5.048
Numerical nucleotide sequence representation methods.
| Method | Numerical Representation |
|---|---|
Figure 1. A numerically represented DNA sequence transformed at various levels of spatial resolution using the discrete Fourier transform (DFT) of the whole sequence (A), the Haar discrete wavelet transform (DWT) (B) and piecewise aggregate approximation (PAA) (C). A 30-nucleotide sequence (x-axis) is represented as a numerical sequence (black lines) using the real number representation method (y-axis, where T = 1.5, C = 0.5, G = −0.5 and A = −1.5). Shown are DFT approximations of the sequence with 5 (red), 3 (blue) and 1 (green) Fourier frequencies (A); DWT approximations of the same sequence with 8-level (red), 4-level (blue) and 2-level (green) wavelets (B); and PAA approximations of the same sequence with 8 (red), 5 (blue) and 3 (green) coefficients (C).
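The real number representation and the PAA step described in this caption can be illustrated with a minimal sketch. The nucleotide values are those given in the caption; the function names `encode` and `paa`, and the rule used to split a sequence into near-equal segments, are assumptions made for illustration and are not taken from the authors' implementation.

```python
# Real number representation from the Figure 1 caption.
REAL_MAP = {"T": 1.5, "C": 0.5, "G": -0.5, "A": -1.5}


def encode(seq):
    """Map a nucleotide string to a list of real numbers (hypothetical helper)."""
    return [REAL_MAP[base] for base in seq.upper()]


def paa(signal, n_segments):
    """Piecewise aggregate approximation: the mean of each of n_segments
    contiguous, near-equal-length pieces of the signal.
    The segment boundary rule below is an assumption."""
    n = len(signal)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(signal)")
    coefficients = []
    for i in range(n_segments):
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        piece = signal[start:end]
        coefficients.append(sum(piece) / len(piece))
    return coefficients


if __name__ == "__main__":
    signal = encode("AACCGGTT")
    print(paa(signal, 4))  # one mean per pair of nucleotides
    print(paa(signal, 2))  # coarser approximation, one mean per four nucleotides
```

Lowering `n_segments` gives a coarser, more compressed approximation, which is the trade-off between resolution and size that the three coloured PAA curves illustrate.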
Figure 2. Overview of our proposed methodology using time series transformation/approximation methods. The steps are, in order: creation of numerical representations of input sequences; application of an appropriate signal decomposition method to transform sequences into their feature space; use of approximated transformations to perform rapid data analysis in lower dimensional space; and validation of inferences against original, full-resolution input sequences. In the case of reference-based alignment and taxonomic classification, approximated read transformations were compared with a reference sequence. In our de novo implementation, pairwise comparisons were performed between all of the approximated read transformations.
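The four steps in this overview can be sketched for the reference-based case as follows. This is an illustrative sketch only: the choice of PAA as the transform, Euclidean distance in the reduced space, a brute-force scan over reference windows, and a mismatch count for the final validation are all assumptions, and every function name is hypothetical.

```python
import math

REAL_MAP = {"T": 1.5, "C": 0.5, "G": -0.5, "A": -1.5}


def encode(seq):
    """Step 1: numerical representation of a nucleotide sequence."""
    return [REAL_MAP[base] for base in seq.upper()]


def paa(signal, n_segments):
    """Step 2: reduce the signal to n_segments segment means."""
    n = len(signal)
    means = []
    for i in range(n_segments):
        piece = signal[i * n // n_segments:(i + 1) * n // n_segments]
        means.append(sum(piece) / len(piece))
    return means


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def candidate_positions(read, reference, n_segments, top=3):
    """Step 3: rank reference windows by distance in the reduced space."""
    read_approx = paa(encode(read), n_segments)
    scored = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        scored.append((euclidean(read_approx, paa(encode(window), n_segments)), pos))
    scored.sort()
    return [pos for _, pos in scored[:top]]


def validate(read, reference, positions):
    """Step 4: check candidates against the full-resolution sequences
    by counting mismatches, and keep the best one."""
    def mismatches(pos):
        window = reference[pos:pos + len(read)]
        return sum(a != b for a, b in zip(read, window))
    return min(positions, key=mismatches)


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTACCGATGCATCG"
    read = reference[8:16]
    candidates = candidate_positions(read, reference, n_segments=4)
    print(validate(read, reference, candidates))
```

The design point is that the expensive full-resolution comparison is applied only to the few candidates that survive the cheap comparison in the reduced space.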
Simulated read data. Each row contains details for each simulated dataset (i.e., virus family, virus, GenBank ID, variation type, variation level, number of reads and simulator used to generate data). Abbreviations: Ins, insertions; Del, deletions and Sub, substitutions.
| Family | Virus | GenBank Genome ID | Ins (%) | Del (%) | Sub (%) | Reads | Simulator |
|---|---|---|---|---|---|---|---|
| HIV | HXB2 | K03455 | 0.0 | 0.0 | 0.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 0.0 | 0.0 | 1.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 0.0 | 0.0 | 2.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 0.0 | 0.0 | 3.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 0.0 | 0.0 | 4.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 0.0 | 0.0 | 5.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 0.5 | 0.5 | 0.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 1.0 | 1.0 | 0.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 1.5 | 1.5 | 0.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 2.0 | 2.0 | 0.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 2.5 | 2.5 | 0.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 0.5 | 0.5 | 1.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 1.0 | 1.0 | 2.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 1.5 | 1.5 | 3.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 2.0 | 2.0 | 4.0 | 2133 | CuReSim |
| HIV | HXB2 | K03455 | 2.5 | 2.5 | 5.0 | 2133 | CuReSim |
| Mixed Viruses: Caliciviridae, Filoviridae, Pneumoviridae | Norovirus, Ebola virus, RSV | KM198529, KM198528, KM198511, KM198500, KM198486, KU296608, KU296553, KU296549, KU296528, KU296416, KP317952, KP317946, KP317934, KP317923, KP317922 | 0.0 | 0.0 | 0.0 | 200,000 | WGSIM |
| Mixed Viruses: Caliciviridae, Filoviridae, Pneumoviridae | Norovirus, Ebola virus, RSV | KM198529, KM198528, KM198511, KM198500, KM198486, KU296608, KU296553, KU296549, KU296528, KU296416, KP317952, KP317946, KP317934, KP317923, KP317922 | 1.0 | 1.0 | 1.0 | 200,000 | WGSIM |
| Mixed Viruses: Caliciviridae, Filoviridae, Pneumoviridae | Norovirus, Ebola virus, RSV | KM198529, KM198528, KM198511, KM198500, KM198486, KU296608, KU296553, KU296549, KU296528, KU296416, KP317952, KP317946, KP317934, KP317923, KP317922 | 3.33 | 3.33 | 3.33 | 100,000 | WGSIM |
| Mixed Viruses: Caliciviridae, Filoviridae, Pneumoviridae | Norovirus, Ebola virus, RSV | KM198529, KM198528, KM198511, KM198500, KM198486, KU296608, KU296553, KU296549, KU296528, KU296416, KP317952, KP317946, KP317934, KP317923, KP317922 | 6.66 | 6.66 | 6.66 | 200,000 | WGSIM |
Real short read data. Each row contains details for one real read dataset (i.e., virus family, virus, amplicon or random primer preparation, genome strain GenBank ID, ENA/SRA run ID, number of reads and sequencing technology). SRA: Sequence Read Archive; ENA: European Nucleotide Archive.
| Family | Virus | Amplicon/Random Primer | GenBank Genome ID | ENA/SRA_ID | Reads | Sequencing Technology |
|---|---|---|---|---|---|---|
| Caliciviridae | Norovirus | Amplicon | KM198486 | ERR225628 | 2126502 | Illumina MiSeq |
| Caliciviridae | Norovirus | Amplicon | KM198500 | ERR225629 | 3037674 | Illumina MiSeq |
| Caliciviridae | Norovirus | Amplicon | KM198511 | ERR225631 | 3285078 | Illumina MiSeq |
| Caliciviridae | Norovirus | Amplicon | KM198528 | ERR225632 | 4361884 | Illumina MiSeq |
| Caliciviridae | Norovirus | Amplicon | KM198529 | ERR225633 | 5187234 | Illumina MiSeq |
| Filoviridae | Ebola virus | Amplicon | KU296608 | SRR3107337 | 522968 | Ion Torrent PGM |
| Filoviridae | Ebola virus | Amplicon | KU296549 | SRR3107338 | 771031 | Ion Torrent PGM |
| Filoviridae | Ebola virus | Amplicon | KU296416 | SRR3107340 | 186657 | Ion Torrent PGM |
| Filoviridae | Ebola virus | Amplicon | KU296553 | SRR3107342 | 478346 | Ion Torrent PGM |
| Filoviridae | Ebola virus | Amplicon | KU296528 | SRR3107343 | 42410 | Ion Torrent PGM |
| Pneumoviridae | RSV | Amplicon | KP317934 | ERR303259 | 7275032 | Illumina MiSeq |
| Pneumoviridae | RSV | Amplicon | KP317922 | ERR303260 | 9278070 | Illumina MiSeq |
| Pneumoviridae | RSV | Amplicon | KP317946 | ERR303261 | 11111114 | Illumina MiSeq |
| Pneumoviridae | RSV | Amplicon | KP317923 | ERR303262 | 13293226 | Illumina MiSeq |
| Pneumoviridae | RSV | Amplicon | KP317952 | ERR303263 | 15237848 | Illumina MiSeq |
Reference genomes used during classification and reference-based alignment.
| Family | Virus | GenBank ID | Length (nt) |
|---|---|---|---|
| Retroviridae | Human immunodeficiency virus 1 (HXB2) | K03455 | 9179 |
| Caliciviridae | Norovirus | KM198509.1 | 7425 |
| Filoviridae | Zaire ebolavirus | KM034562.1 | 18957 |
| Pneumoviridae | Human orthopneumovirus (Respiratory Syncytial Virus) | KP317934.1 | 15233 |
Figure 3. Accuracy of our prototype classification implementation and two established tools on HIV-1 HXB2 simulated datasets. All plots illustrate the F-measures obtained on the 16 different HIV datasets. The y-axis indicates the F-measure score, and the x-axis depicts the read data files. Plot 3-i depicts the F-measures obtained for each classifier on the simulations with substitution rates of 0% to 5%. Plot 3-ii illustrates the F-measures obtained for each classifier on the simulations with 0% to 5% uniform insertion/deletion variation, and plot 3-iii illustrates the F-measures obtained for each tool on simulations with 0% to 10% uniform combined insertion/deletion and substitution variation.
Figure 4. Accuracy of our prototype classification implementation and two established tools on mixed virus simulated datasets. The y-axis indicates the F-measure score, and the x-axis depicts the read data files. The plot depicts the F-measures obtained for each classifier on the mixed virus simulations. DFT: discrete Fourier transform; DWT: discrete wavelet transform; PAA: piecewise aggregate approximation.
Figure 5. Accuracy of our prototype classification implementation and two established tools on real sequences. The y-axis indicates the F-measure score, and the x-axis depicts the read data files. Plot 5-i depicts the F-measures obtained for each classifier on the Norovirus sequence data. Plot 5-ii illustrates the F-measures obtained for each classifier on the Ebola virus sequence data. Plot 5-iii illustrates the F-measures obtained for each tool on the respiratory syncytial virus (RSV) sequence data. DFT: discrete Fourier transform; DWT: discrete wavelet transform; PAA: piecewise aggregate approximation.
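For readers unfamiliar with the score plotted in these figures, the sketch below computes an F-measure from counts of true positives, false positives and false negatives. It assumes the balanced form (the harmonic mean of precision and recall, often written F1); the captions do not state which weighting was used, and the function name `f_measure` is hypothetical.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure: harmonic mean of precision and recall.
    Assumes equal weighting of precision and recall (an assumption,
    since the captions only say "F-measure")."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Example counts chosen for illustration only, not taken from the study.
    print(round(f_measure(true_pos=90, false_pos=10, false_neg=30), 3))
```

A score near 1 means that a tool assigned nearly all reads correctly while making few incorrect assignments.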
Figure 6. Pseudocode for the alignment procedure.
Figure 7. Accuracy of our prototype reference alignment implementation and four established tools on HIV-1 HXB2 simulated datasets. The plots illustrate the F-measures obtained on the 16 different HIV datasets. Plot 7-(i) depicts the F-measures obtained for each aligner on the simulations with substitution rates of 0% to 5%. Plot 7-(ii) illustrates the F-measures obtained for each aligner on the simulations with 0% to 5% uniform insertion/deletion variation, and plot 7-(iii) illustrates the F-measures obtained for each tool on simulations with 0% to 10% uniform combined insertion/deletion and substitution variation. DFT: discrete Fourier transform; DWT: discrete wavelet transform; PAA: piecewise aggregate approximation.
Figure 8. Accuracy of our prototype aligner implementation and four established tools on mixed virus simulated datasets. The y-axis indicates the F-measure score, and the x-axis depicts the read data files. The plot depicts the F-measures obtained for each aligner on the mixed virus simulations. DFT: discrete Fourier transform; DWT: discrete wavelet transform; PAA: piecewise aggregate approximation.
Figure 9. Accuracy of our prototype aligner implementation and four established tools on real sequence datasets. The y-axis indicates the F-measure score, and the x-axis depicts the read data files. Plot 9-(i) depicts the F-measures obtained for each aligner on the Norovirus sequence data. Plot 9-(ii) illustrates the F-measures obtained for each aligner on the Ebola virus sequence data. Plot 9-(iii) illustrates the F-measures obtained for each tool on the respiratory syncytial virus (RSV) sequence data. DFT: discrete Fourier transform; DWT: discrete wavelet transform; PAA: piecewise aggregate approximation.
Figure 10. A de novo assembly methodology for numerically represented nucleotide reads. All-against-all sequence comparison (A) enables the construction of a read graph with weighted edges. The weight assigned to each edge is the smallest pairwise distance obtained between every possible k-mer representation of the two reads. In this example, a 5-mer was used. The smallest distance between every possible k-mer can be obtained either by using a sliding window approach or by breaking reads into every possible subsequence of length k. (B) The shortest path in the graph is identified with a breadth-first search algorithm (red edges), thereby (C) enabling read alignment. A DNA walk representation of the overlapped reads (D) may subsequently be used as a three-dimensional graphical portrayal of the reads, illustrating alignment characteristics.
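The edge-weighting rule in panel (A) can be sketched as follows: slide a window of length k over each numerically represented read and take the smallest distance between any pair of windows. The use of the real number representation from Figure 1, Euclidean distance, and the function names `kmers`, `edge_weight` and `read_graph` are assumptions for illustration. The sketch works on full-resolution signals and omits both the dimensionality reduction step and the path search in panel (B).

```python
import math
from itertools import combinations

REAL_MAP = {"T": 1.5, "C": 0.5, "G": -0.5, "A": -1.5}


def encode(seq):
    """Numerical representation of a read (values from the Figure 1 caption)."""
    return [REAL_MAP[base] for base in seq.upper()]


def kmers(signal, k):
    """All length-k windows of a signal, obtained with a sliding window."""
    return [signal[i:i + k] for i in range(len(signal) - k + 1)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def edge_weight(read_a, read_b, k):
    """Smallest pairwise distance between any k-mer of read_a and any k-mer of read_b."""
    windows_a = kmers(encode(read_a), k)
    windows_b = kmers(encode(read_b), k)
    return min(euclidean(x, y) for x in windows_a for y in windows_b)


def read_graph(reads, k):
    """All-against-all comparison: map each unordered pair of read indices to its edge weight."""
    return {
        (i, j): edge_weight(reads[i], reads[j], k)
        for i, j in combinations(range(len(reads)), 2)
    }


if __name__ == "__main__":
    reads = ["ACGTTGCA", "TGCAAGGC", "TTTTTTTT"]
    for pair, weight in read_graph(reads, k=4).items():
        print(pair, round(weight, 3))
```

In this toy input the first two reads share the 4-mer `TGCA`, so their edge weight is zero, whereas the third read shares no 4-mer with either and receives larger weights. The all-against-all loop is quadratic in the number of reads, which is why comparing compressed representations matters at scale.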
Figure 11. Accuracy of our prototype de novo assembly implementation and two established tools on HIV-1 HXB2 simulated datasets. The contigs obtained from each assembler were evaluated against the reference genome used to generate the simulated data (K03455). BLASTn was used to align all contigs to the HIV-1 HXB2 reference genome and determine genome coverage. The y-axis indicates the number of gaps and mismatches in the contigs obtained from each tool, and the x-axis depicts the length of the genome covered by the reported contigs. Plot 11-i illustrates results obtained from all assemblers on variation-free data. Plots 11-ii to 11-vi illustrate results obtained from all assemblers on data with different levels of substitution variation. Plots 11-vii to 11-xi illustrate results obtained from all assemblers on data with different levels of insertion/deletion variation. Plots 11-xii to 11-xvi illustrate results obtained from all assemblers on data with different levels of combined insertion/deletion and substitution variation.
Figure 12. Accuracy of our prototype de novo assembly implementation and two established tools on mixed virus simulated datasets. The contigs obtained from each assembler were evaluated against the reference genomes used to generate the simulated data: KM198529, KM198528, KM198511, KM198500, KM198486, KU296608, KU296553, KU296549, KU296528, KU296416, KP317952, KP317946, KP317934, KP317923 and KP317922. BLASTn was used to align all contigs to these reference genomes and determine how much of each genome they cover. The y-axis indicates the number of gaps and mismatches in the contigs obtained from each tool, and the x-axis depicts the length of the genome covered by the reported contigs. Plots 12-i to 12-iv illustrate results obtained from all assemblers on data with 0%, 3%, 10% and 20% variation levels, respectively.