Mara Sangiovanni, Ilaria Granata, Amarinder Singh Thind, Mario Rosario Guarracino.
Abstract
BACKGROUND: Next Generation Sequencing (NGS) experiments produce millions of short sequences that, mapped to a reference genome, provide biological insights at the genomic, transcriptomic and epigenomic levels. Typically, the fraction of reads that correctly maps to the reference genome ranges between 70% and 90%, leaving in some cases a substantial fraction of unmapped sequences. This 'misalignment' can be ascribed to low-quality bases or to sequence differences between the sample reads and the reference genome. Investigating the source of the unmapped reads is important to better assess the quality of the whole experiment and to check for possible downstream or upstream 'contamination' from exogenous nucleic acids.
Keywords: Contamination; Next generation sequencing; Unmapped reads
Year: 2019 PMID: 30999839 PMCID: PMC6472186 DOI: 10.1186/s12859-019-2684-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The DecontaMiner pipeline. The tools with their respective functions, and the input and output file formats, are shown. The outputs are grouped in three main directories: ‘Low quality’, ‘Ambiguous’ and ‘Valid’, which collect the result files of each analyzed sample
DecontaMiner, TruePure and FastQScreen feature comparison
| Feature | DecontaMiner | TruePure | FastQScreen |
|---|---|---|---|
| Input type: bam | ✓ | × | × |
| Input type: fastq | ✓ | ✓ | ✓ |
| Input type: fasta | ✓ | ✓ | × |
| Multiple samples processing | ✓ | × | × |
| Paired end processing | ✓ | × | × |
| Unlimited input | ✓ | × | × |
| User defined databases | ✓ | × | ✓ |
| Read tracking | ✓ | × | × |
| Parameter tuning | ✓ | × | × |
| Runs on HPC | ✓ | × | ✓ |
| Visual output | ✓ | ✓ | ✓ |
Synthetic reads compared to the reads detected by DecontaMiner, TruePure and FastQScreen
| | Simulated data: read counts | Simulated data: % | DecontaMiner: valid reads | DecontaMiner: % | TruePure (a): sequences found | TruePure (a): % | FastQScreen (b): one hit/one genome | FastQScreen (b): % |
|---|---|---|---|---|---|---|---|---|
| Bacteria | 99816 | 80.28 | 80527 | 92.73 | 975 | 80.25 | 13986 | 86.41 |
| Fungi | 2011 | 1.62 | 47 | 0.05 | 0 | 0 | 3 | 0.02 |
| Viruses | 22504 | 18.1 | 6263 | 7.22 | 240 | 19.75 | 2196 | 13.57 |
| Total | 124331 | 100 | 86837 | 100 | 1215 | 100 | 16185 | 100 |
(a) These are the results on the manually curated input file.
(b) For FastQScreen, only the hits mapping on a single genome are shown.
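Each percentage column in the table above is a count divided by its column total. As a minimal sketch, the simulated-data percentages can be recomputed from the read counts reported in the table; the function name `percentages` is illustrative and not part of any of the compared tools.

```python
def percentages(counts):
    """Return each count as a percentage of the column total (2 decimals)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}

# Simulated read counts per group, copied from the table above.
simulated = {"Bacteria": 99816, "Fungi": 2011, "Viruses": 22504}

simulated_pct = percentages(simulated)
for name, pct in simulated_pct.items():
    print(f"{name}: {pct}%")
```

Independently rounded values need not sum to exactly 100, so published percentage columns are sometimes adjusted in the last digit.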
Species detection: precision and recall for DecontaMiner, TruePure and FastQScreen
| | DecontaMiner: Precision | DecontaMiner: Recall | TruePure (a): Precision | TruePure (a): Recall | FastQScreen (b): Precision | FastQScreen (b): Recall |
|---|---|---|---|---|---|---|
| Bacteria | 0.82 | 0.9 | 0.67 | 1 | NA | NA |
| Fungi | 1 | 1 | 0 | 0 | NA | NA |
| Viruses | 0.92 | 1 | 0.53 | 0.89 | NA | NA |
(a) These are the results on the manually curated input file.
(b) FastQScreen does not provide a detailed report on the distribution of the hits found.
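Species-level precision and recall compare the set of species a tool reports against the set actually present in the simulated data. The sketch below uses the standard definitions; the species names are hypothetical placeholders, and returning 0 when a tool detects nothing is a convention assumed here, since the source does not state how such cases were scored.

```python
def precision_recall(detected, expected):
    """Species-level precision and recall from two collections of names."""
    detected, expected = set(detected), set(expected)
    true_pos = len(detected & expected)
    # Assumed convention: score 0.0 when a denominator would be zero.
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(expected) if expected else 0.0
    return precision, recall

# Hypothetical species names, for illustration only (not the study's data).
expected = {"species_A", "species_B", "species_C", "species_D"}
detected = {"species_A", "species_B", "species_C", "species_X"}

p, r = precision_recall(detected, expected)
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision falls when a tool reports species that were never simulated, while recall falls when simulated species go undetected.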
Fig. 2 Overall read mapping rate distribution (GSE68086). Area chart showing the mapping rate of the GSE68086 dataset samples. The amount of mapped reads ranges from 45.5% to 94.2%, indicating a great variability among samples
Fig. 3 Bacterial abundance in the tumor samples. The heatmaps show the relative abundance of bacterial species in 4 tumor types: breast cancer (a), GBM (b), lung (c) and digestive system cancers (d). Bacteria with a match count ≥ 100 and a relative abundance ≥ 5% in at least one sample/group are shown. P. acnes, the most abundant contaminant, is highlighted by a dot in all groups. The heatmaps are generated by the DecontaMiner offline HTML page and online website
Number of contaminating genera and species with at least 100 matches and a relative abundance ≥ 5% in at least one sample/group, for each of the three kingdoms considered, in tumor and control samples
| | Tumors: Genus | Tumors: Species | Controls: Genus | Controls: Species |
|---|---|---|---|---|
| Bacteria | 100 | 199 | 57 | 91 |
| Fungi | 12 | 15 | 8 | 10 |
| Viruses | / | 9 | / | 7 |
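The selection rule used for the counts above (at least 100 matches and a relative abundance ≥ 5% in at least one sample/group) can be sketched as a simple filter. This is an illustration under stated assumptions, not DecontaMiner's own code: relative abundance is assumed to be a taxon's share of all matches in the same sample, both thresholds are assumed to apply within the same sample, and the taxon names and counts are hypothetical.

```python
def filter_taxa(samples, min_matches=100, min_rel_abundance=5.0):
    """Keep taxa that meet both thresholds in at least one sample.

    samples: {sample_name: {taxon: match_count}}
    Assumption: relative abundance is the taxon's percentage of all
    matches found in that same sample.
    """
    kept = set()
    for counts in samples.values():
        total = sum(counts.values())
        if total == 0:
            continue  # nothing matched in this sample
        for taxon, n in counts.items():
            if n >= min_matches and 100 * n / total >= min_rel_abundance:
                kept.add(taxon)
    return sorted(kept)

# Hypothetical match counts, for illustration only.
samples = {
    "sample_1": {"taxon_A": 900, "taxon_B": 60, "taxon_C": 40},
    "sample_2": {"taxon_A": 5000, "taxon_B": 300, "taxon_C": 120},
}

kept = filter_taxa(samples)
print(kept)
```

Here `taxon_B` is kept because it passes both thresholds in `sample_2`, even though it has too few matches in `sample_1`; `taxon_C` never reaches 5% in a sample where it has at least 100 matches, so it is dropped.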
Fig. 4 Bacterial abundance in the healthy samples. The dot chart shows the relative abundance of bacteria, grouped by genus. Only genera with a match count ≥ 100 and a relative abundance ≥ 5% in at least one sample are shown. The dot size is proportional to the abundance. The most relevant bacteria belong to the Paeniglutamicibacter and Propionibacterium genera. The dot charts are generated by the DecontaMiner offline HTML page and online website
Fig. 5 Fungal contamination in tumor and healthy groups. The stacked bar chart shows the fungal genera having an average value in all groups ≥ 10%, considering tumor and healthy (HC) groups. Bars are stacked by the group in which the contaminating organism (x-axis) has been detected. The y-axis scale reports the sum of the values in all the samples. Groups are ranked in increasing order of contaminant abundance, from bottom to top. The stacked bar charts are automatically generated by the DecontaMiner online companion website