Christopher Riccardi, Gabriel Innocenti, Marco Fondi, Giovanni Bacci.
Abstract
Advances in Next Generation Sequencing (NGS) technologies allow us to inspect and unlock the genome to a level of detail that was unimaginable only a few decades ago. Omics-based studies are shedding light on the patterns and determinants of disease conditions in populations, as well as on the influence of microbial communities on human health, to name a few. Through increasing volumes of sequencing information, for example, it is possible to compare genomic features and analyze the modulation of the transcriptome under different environmental stimuli. Although protocols for NGS preparation are intended to leave little to no room for contamination of any kind, a noticeable fraction of sequencing reads may still not uniquely represent what was intended to be sequenced in the first place. While the natural next step after sequencing a sample is to assess the presence of features of interest by mapping the obtained reads to a reference genome, it is sometimes useful to determine the fraction of reads that do not map, or that map discordantly, and to store this information in a new file for subsequent analyses. Here we propose a new mapper, called Squid, that, among other accessory functionalities, finds and returns sequencing reads that match or do not match a reference sequence database in any orientation. We encourage the use of Squid prior to any quantification pipeline to assess, for instance, the presence of contaminants, especially in RNA-Seq experiments.
Keywords: dynamic programming; mapping; quality check; rna-seq
Year: 2022 PMID: 35564837 PMCID: PMC9103773 DOI: 10.3390/ijerph19095442
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 4.614
Figure 1. Read orientation modes handled by Squid. Library types ISF, ISR (A) and IU (C) model fragments in which the sequencing reads are oriented towards each other. Read A and B represent any R1–R2 pair, as long as mutual exclusivity and orientation are conserved. Library types OSF, OSR (B) and OU (C) instruct Squid about the opposite case, in which the sequencing reads do not face one another. In the current implementation, a fragment is never modelled in a matching library protocol (both reads mapping to the same strand).
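The six library-type codes in the caption can be decoded mechanically. The sketch below assumes the codes follow the Salmon-style convention (first letter I or O for inward or outward, then S or U for stranded or unstranded, then F or R for the strand of read 1). The names `LibraryType`, `parse_library_type`, and `pair_is_compatible` are hypothetical and are not part of Squid, and the inward test is a simplification that compares only the leftmost mapped positions of the two reads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LibraryType:
    orientation: str             # "inward" or "outward"
    stranded: bool               # True for S, False for U
    read1_strand: Optional[str]  # "+" (F), "-" (R), or None when unstranded


def parse_library_type(code: str) -> LibraryType:
    """Decode a code such as 'ISF' or 'OU' (assumed Salmon-style convention)."""
    if code not in {"ISF", "ISR", "IU", "OSF", "OSR", "OU"}:
        raise ValueError(f"unsupported library type: {code!r}")
    orientation = "inward" if code[0] == "I" else "outward"
    stranded = code[1] == "S"
    read1_strand = {"F": "+", "R": "-"}[code[2]] if stranded else None
    return LibraryType(orientation, stranded, read1_strand)


def pair_is_compatible(lib, strand1, start1, strand2, start2) -> bool:
    """Check one mapped read pair against a library type.

    strand1/strand2 are '+' or '-'; start1/start2 are leftmost mapped positions.
    """
    if strand1 == strand2:
        # Same-strand ("matching") pairs are never modelled.
        return False
    # Simplified geometry: the pair is inward when the forward-strand read
    # starts at or before the reverse-strand read.
    fwd_start = start1 if strand1 == "+" else start2
    rev_start = start2 if strand1 == "+" else start1
    observed = "inward" if fwd_start <= rev_start else "outward"
    if observed != lib.orientation:
        return False
    # Unstranded libraries accept either read as the forward one.
    return (not lib.stranded) or strand1 == lib.read1_strand
```

Under these assumptions, `pair_is_compatible(parse_library_type("IU"), "+", 100, "-", 300)` accepts a pair whose reads face each other, while the same call with the strands swapped is accepted only by `OU`.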
Squid’s performance compared to Bowtie2. The average reported speedup is 6.24; both programs were set to operate with a k-mer length of 9 and a step size of 1 nucleotide. Bowtie2 was set to ignore quality scores, since Squid does not consider them in the current implementation. Moreover, a reporting value of k = 1 was selected in Bowtie2 so that only the first alignment found is reported and multiple alignments are not searched for. The Squid library string “-l IU” ensured that all inward orientations of the read pairs were detected.
| Sample | Reads | Time (s), Squid | Time (s), Bowtie2 | Percentage Mapped, Squid | Percentage Mapped, Bowtie2 | Speedup |
|---|---|---|---|---|---|---|
| S_01 | 294,551 | 18.14 | 98.91 | 100 | 100 | 5.45 |
| S_02 | 428,090 | 25.00 | 149.04 | 100 | 100 | 5.96 |
| S_03 | 433,494 | 24.61 | 156.52 | 100 | 100 | 6.36 |
| S_04 | 477,233 | 24.16 | 163.3 | 100 | 100 | 6.75 |
| S_05 | 568,939 | 33.70 | 216.28 | 100 | 100 | 6.41 |
| S_06 | 822,335 | 50.38 | 280.41 | 100 | 100 | 5.56 |
| S_07 | 1,056,705 | 52.42 | 365.25 | 100 | 100 | 6.96 |
| S_08 | 1,611,954 | 75.84 | 550.62 | 100 | 99.99 | 7.26 |
| S_09 | 1,933,648 | 97.18 | 647.1 | 100 | 100 | 6.65 |
| S_10 | 2,261,861 | 111.65 | 755.5 | 100 | 100 | 6.76 |
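The Speedup column can be recomputed from the two timing columns. The sketch below assumes that speedup means Bowtie2 time divided by Squid time, truncated (not rounded) to two decimals; this definition is inferred from the table values and is not stated in the source. The names `TIMINGS` and `speedup` are illustrative.

```python
import math

# (sample, Squid seconds, Bowtie2 seconds), copied from the table above.
TIMINGS = [
    ("S_01", 18.14, 98.91),
    ("S_02", 25.00, 149.04),
    ("S_03", 24.61, 156.52),
    ("S_04", 24.16, 163.3),
    ("S_05", 33.70, 216.28),
    ("S_06", 50.38, 280.41),
    ("S_07", 52.42, 365.25),
    ("S_08", 75.84, 550.62),
    ("S_09", 97.18, 647.1),
    ("S_10", 111.65, 755.5),
]


def speedup(squid_s: float, bowtie2_s: float) -> float:
    """Bowtie2 time / Squid time, truncated to two decimals (assumed definition)."""
    return math.floor(bowtie2_s / squid_s * 100) / 100


if __name__ == "__main__":
    for sample, squid_s, bowtie2_s in TIMINGS:
        print(sample, speedup(squid_s, bowtie2_s))
```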
Figure 2. Scatterplot of transcript quantification. Each circle represents the raw counts of a gene in Salmon (y axis) and Squid (x axis). Squid was run using exhaustiveness 0 (A) and 15 (B), respectively. The per-gene ratio was calculated by dividing Salmon’s raw counts by Squid’s raw counts (extracted from the BEDPE output file). Note that the ratio scale differs by an order of magnitude between (A) and (B), indicating that mapping accuracy is affected when no additional cycles are performed. Coefficients of determination were 0.97 and 0.99 in (A) and (B), respectively. The regression line was calculated using a generalized additive model (GAM) through the R package ggplot2.
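The per-gene ratio described in the caption is a simple element-wise division, sketched below with hypothetical helper names. The caption does not state how the coefficient of determination was obtained, so `r_squared` uses the squared Pearson correlation (the R² of a simple linear fit) as a stand-in; it is not the GAM fit used for the plotted regression line. Genes with a Squid count of zero are given a ratio of `None`, which is also an assumption.

```python
def per_gene_ratio(salmon_counts: dict, squid_counts: dict) -> dict:
    """Salmon raw count / Squid raw count per gene; None where Squid has no count."""
    return {
        gene: (salmon_counts[gene] / squid_counts[gene])
        if squid_counts.get(gene)
        else None
        for gene in salmon_counts
    }


def r_squared(xs, ys) -> float:
    """Squared Pearson correlation between two equal-length count vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


if __name__ == "__main__":
    # Made-up counts for illustration only; not data from the study.
    salmon = {"geneA": 12, "geneB": 20, "geneC": 3}
    squid = {"geneA": 10, "geneB": 20, "geneC": 0}
    print(per_gene_ratio(salmon, squid))
    print(r_squared(list(squid.values()), list(salmon.values())))
```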