Markus Lux, Jan Krüger, Christian Rinke, Irena Maus, Andreas Schlüter, Tanja Woyke, Alexander Sczyrba, Barbara Hammer.
Abstract
BACKGROUND: A major obstacle in single-cell sequencing is sample contamination with foreign DNA. To guarantee clean genome assemblies and to prevent the introduction of contamination into public databases, considerable quality control efforts are put into post-sequencing analysis. Contamination screening generally relies on reference-based methods such as database alignment or marker gene search, which limits the set of detectable contaminants to organisms with closely related reference species. As genomic coverage in the tree of life is highly fragmented, there is an urgent need for a reference-free methodology for contaminant identification in sequence data.
Keywords: Binning; Clustering; Contamination detection; Machine learning; Quality control; Single-cell sequencing
Year: 2016 PMID: 27998267 PMCID: PMC5168860 DOI: 10.1186/s12859-016-1397-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Acdc contamination detection pipeline: Results from both reference-free and reference-based techniques are fused and post-processed to end up with a clean sample
Description of parameters for various techniques used in acdc
| Method | Parameter description |
|---|---|
| Data pre-processing | Given a target of |
| BH-SNE | The parameter |
| DIP | The significance level which is uncritical as it is |
| CC | The number of clusters found depends on the underlying graph. In acdc, the graph is constructed by connecting each data point to its |
| Bootstrapping | We set the number of bootstraps |
| Kraken | The only parameter required by Kraken is the database to be used. It can be specified as a parameter to acdc as well. |
| RNAmmer | 16S rRNA gene sequence prediction using RNAmmer does not require any parameters. |
Fig. 2 Data pre-processing that transforms a sequential data representation into vectorial data using a sliding window technique: For example, with k=4, each shift generates a 256-dimensional vector by counting the occurrences of all 4^4 = 256 possible 4-mers over the four bases
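The sliding-window transformation described in the Fig. 2 caption can be sketched as follows. This is a minimal illustration, not acdc's implementation: the function names, the `window_size` and `step` values, and the choice to skip k-mers containing symbols other than A, C, G, T are all assumptions made for the example.

```python
from itertools import product

BASES = "ACGT"

def kmer_index(k):
    """Map every possible k-mer over ACGT to a position in the count vector."""
    return {"".join(p): i for i, p in enumerate(product(BASES, repeat=k))}

def kmer_vector(window, k=4):
    """Count each k-mer in one window; returns a list of length 4**k.

    Assumption: k-mers containing other symbols (e.g. N) are skipped.
    """
    index = kmer_index(k)
    counts = [0] * len(index)
    for i in range(len(window) - k + 1):
        pos = index.get(window[i:i + k])
        if pos is not None:
            counts[pos] += 1
    return counts

def sliding_kmer_vectors(sequence, window_size, step, k=4):
    """Slide a window over the sequence and emit one k-mer count vector per shift."""
    sequence = sequence.upper()
    vectors = []
    for start in range(0, len(sequence) - window_size + 1, step):
        vectors.append(kmer_vector(sequence[start:start + window_size], k))
    return vectors

# Hypothetical toy input; real contigs are far longer than this.
vectors = sliding_kmer_vectors("ACGTACGTAC", window_size=8, step=2, k=4)
```

Each resulting vector has 4^k entries (256 for k=4), so every window becomes one point in a fixed-dimensional space regardless of where it lies in the sequence.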
Fig. 3 Illustration of the complementary detection capabilities of DIP and CC using two different contaminated samples. Left: Using a mutual 9-nearest-neighbor graph, CC identifies two clusters (very small contamination), while DIP is unable to detect multimodality, as seen in the distribution of pairwise distances below. Right: Two overlapping clusters prevent CC from detecting two components, while DIP detects significant multimodality in the distribution of pairwise distances
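The CC side of this comparison, finding connected components in a mutual k-nearest-neighbor graph, can be sketched as below. This is an assumption-laden illustration rather than acdc's code: the function names are hypothetical, Euclidean distance is assumed, and the brute-force neighbor search is only suitable for small inputs. The DIP test on pairwise distances is not shown.

```python
import math

def knn(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors

def mutual_knn_components(points, k):
    """Label each point with the connected component of the mutual k-NN graph."""
    nbrs = knn(points, k)
    # Keep an edge i-j only if each point lists the other among its k nearest.
    adj = [{j for j in nbrs[i] if i in nbrs[j]} for i in range(len(points))]
    labels = [-1] * len(points)
    n_components = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        # Depth-first traversal assigns one label per component.
        labels[start] = n_components
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if labels[nb] == -1:
                    labels[nb] = n_components
                    stack.append(nb)
        n_components += 1
    return labels

# Hypothetical toy data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = mutual_knn_components(points, k=2)
```

Requiring edges to be mutual is what lets a very small, well-separated group form its own component. Overlapping groups stay connected, which is the case the caption attributes to DIP.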
Fig. 4 Acdc result interface. For each sample shown in the left-hand side table, visualizations are shown on the right-hand side. Individual clusters can be exported in fasta format by clicking on the respective cluster color on the bottom right
Description and availability of the mix data set. A detailed description of these data can be found in Additional file 1. Non-available references are denoted by ’NA’
| Species name | Ref. | Strain availability |
|---|---|---|
| | [ | Prof. Dr. W. Schwarz, Prof. Dr. W. Liebel, Dr. V. Zverlov, Dr. D. Koeck, Technische Universität München, Institute for Microbiology, Munich, Germany |
| | NA | |
| | [ | |
| | NA | Prof. Dr. H. König, Dr. K.G. Cibis, Johannes Gutenberg-University, Institute for Microbiology and Wine Research, Mainz, Germany |
| | [ | |
| | [ | |
| | NA | |
| | NA | Dr. M. Klocke and Dr. S. Hahnke, Leibniz-Institut für Agrartechnik Potsdam-Bornim e.V. (ATB), Department of Bioengineering, Potsdam, Germany |
| | NA | Prof. Dr. Scherer, Dr. S. Off, Dr. Y.S. Kim, University of Applied Sciences Hamburg (HAW), Faculty Life Sciences/Research Center ’Biomass Utilization Hamburg’, Hamburg, Germany |
Acdc evaluation of contamination detection performance. Entries depict the number of correctly identified clean and contaminated samples with additional information about false predictions in parentheses
| Data set | Identified clean samples | Identified contaminated samples |
|---|---|---|
| simulated | 4/7 (3 warnings) | 10/11 (1 warning) |
| mix | 0/0 | 8/9 (1 warning) |
| benchmark | 0/0 | 22/30 (6 warnings, 2 clean) |
| mdm | 150/201 (39 warnings, 12 contaminated) | 0/0 |
Precision, recall and F1-scores of predicted clean base pairs for both ProDeGe and acdc on the simulated and benchmark data sets
| Data set | ProDeGe precision | ProDeGe recall | ProDeGe F1 | acdc precision | acdc recall | acdc F1 |
|---|---|---|---|---|---|---|
| simulated (kingdom) | No result | No result | No result | 1.00 | 1.00 | |
| simulated (phylum) | No result | No result | No result | 0.99 | 0.98 | |
| simulated (class) | No result | No result | No result | 1.00 | 0.99 | |
| simulated (order) | No result | No result | No result | 0.99 | 0.98 | |
| simulated (family) | No result | No result | No result | 1.00 | 1.00 | |
| simulated (genus) | 0.22 | 0.32 | 0.22 | 0.95 | 0.97 | |
| simulated (species) | 0.50 | 0.33 | 0.36 | 0.38 | 0.77 | |
| benchmark ( | 1.00 | 0.88 | 0.93 | 0.97 | 0.99 | |
| benchmark ( | 1.00 | 0.73 | 0.83 | 0.99 | 0.99 | |
| benchmark ( | 1.00 | 0.70 | 0.81 | 1.00 | 1.00 | |
Each row contains average values of the given sub data set. Bold values depict the best performing entry. Entries marked as “no result” either produced an empty clean fasta file or did not finish computation
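
As a rough illustration of how precision, recall and F1 over clean base pairs might be computed for a single sample, consider the sketch below. It assumes contig-level clean/contaminated labels weighted by contig length. The excerpt does not spell out the exact counting procedure, so the function name and input layout are hypothetical.

```python
def clean_base_pair_scores(contig_lengths, true_clean, predicted_clean):
    """Precision, recall and F1 of predicted clean base pairs.

    contig_lengths:  dict mapping contig id -> length in base pairs
    true_clean:      set of contig ids that are truly clean
    predicted_clean: set of contig ids a tool reported as clean
    Assumption: every base pair of a contig shares the contig's label.
    """
    tp = fp = fn = 0
    for contig, length in contig_lengths.items():
        is_clean = contig in true_clean
        is_predicted = contig in predicted_clean
        if is_clean and is_predicted:
            tp += length  # clean base pairs correctly kept
        elif is_predicted:
            fp += length  # contaminant base pairs wrongly kept
        elif is_clean:
            fn += length  # clean base pairs wrongly discarded
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical sample with three contigs.
lengths = {"c1": 1000, "c2": 500, "c3": 500}
precision, recall, f1 = clean_base_pair_scores(lengths, {"c1", "c2"}, {"c1", "c3"})
```

Because each table row is an average over the samples of a sub data set, the reported F1 need not equal the harmonic mean of the reported average precision and recall.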