Tiziano Dallavilla, Giuseppe Marceddu, Arianna Casadei, Luca De Antoni, Matteo Bertelli.
Abstract
BACKGROUND AND AIM: Next generation sequencing (NGS) is becoming the standard for clinical diagnosis. Different steps of NGS, such as DNA extraction, fragmentation, library preparation and amplification, require handling of samples, making the process susceptible to contamination. In diagnostic environments, sample contamination with DNA from the same species can lead to errors in diagnosis. Here we propose a simple method to detect within-sample contamination based on analysis of the allele ratio (AR) of heterozygous single nucleotide polymorphisms (SNPs).
Year: 2020; PMID: 33170178; PMCID: PMC8023143; DOI: 10.23750/abm.v91i13-S.10531
Source DB: PubMed; Journal: Acta Biomed; ISSN: 0392-4203
Table 1. Summary of artificially contaminated samples. Starting with three Coriell samples, we generated nine samples contaminated at different levels. ‘Sample name’ indicates the name given to the generated sample, ‘Mixed samples’ indicates the Coriell samples used to generate it, ‘Contamination %’ indicates the percentage of contamination, and ‘Volume of principal sample’ and ‘Volume of contaminant’ indicate the proportions used to generate the contaminated sample.
| Sample name | Mixed samples | Contamination % | Volume of principal sample | Volume of contaminant |
|---|---|---|---|---|
| C210 | NA20828+NA20582 (contaminant) | 10 | 9 | 1 |
| C27 | NA20828+NA20582 (contaminant) | 7 | 9.3 | 0.7 |
| C25 | NA20828+NA20582 (contaminant) | 5 | 9.5 | 0.5 |
| C22 | NA20828+NA20582 (contaminant) | 2 | 9.8 | 0.2 |
| C320 | NA20582+NA20763 (contaminant) | 20 | 8 | 2 |
| C310 | NA20582+NA20763 (contaminant) | 10 | 9 | 1 |
| C37 | NA20582+NA20763 (contaminant) | 7 | 9.3 | 0.7 |
| C35 | NA20582+NA20763 (contaminant) | 5 | 9.5 | 0.5 |
| C32 | NA20582+NA20763 (contaminant) | 2 | 9.8 | 0.2 |
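The contamination percentages in the table above follow from the mixing volumes. The sketch below reproduces that arithmetic. It assumes both DNA stocks had the same concentration, since the table reports volumes only, and the helper name `contamination_percent` is hypothetical.

```python
def contamination_percent(principal_volume, contaminant_volume):
    """Share of the mix contributed by the contaminant, in percent.

    Assumes both stocks have the same DNA concentration, so volume
    fractions equal DNA fractions (an assumption; the table gives volumes only).
    """
    total = principal_volume + contaminant_volume
    if total <= 0:
        raise ValueError("total volume must be positive")
    return 100.0 * contaminant_volume / total


# (volume of principal sample, volume of contaminant), copied from the table.
MIXES = {
    "C210": (9, 1), "C27": (9.3, 0.7), "C25": (9.5, 0.5), "C22": (9.8, 0.2),
    "C320": (8, 2), "C310": (9, 1), "C37": (9.3, 0.7), "C35": (9.5, 0.5),
    "C32": (9.8, 0.2),
}

for name, (principal, contaminant) in MIXES.items():
    print(name, round(contamination_percent(principal, contaminant), 1))
```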
Table 2. Summary of the z-score percentages of contaminated samples and controls used in validation. The z-score % of a sample is the percentage of SNPs in the sample with a z-score outside the expected region of -1.96 to +1.96. ‘Sample’ indicates the name of the sample, ‘z-score %’ the sample score, ‘Contamination %’ the percentage of contamination in the sample, ‘Number of SNPs’ the number of variants in the sample in the VCF file after filtering, and ‘Total SNPs outside threshold’ the number of SNPs with an unexpected z-score.
| Sample | z-score % | Contamination % | Number of SNPs | Total SNPs outside threshold |
|---|---|---|---|---|
| C320 | 39.3 | 20 | 346 | 136 |
| C310 | 13.6 | 10 | 279 | 38 |
| C210 | 12.6 | 10 | 294 | 37 |
| C27 | 11.3 | 7 | 275 | 31 |
| C37 | 9.9 | 7 | 262 | 26 |
| NA20763 | 9.5 | 0 | 284 | 27 |
| NA20828 | 9.2 | 0 | 271 | 25 |
| C25 | 9.2 | 5 | 272 | 25 |
| C35 | 8.8 | 5 | 263 | 23 |
| C32 | 7.7 | 2 | 260 | 20 |
| C22 | 6.9 | 2 | 274 | 19 |
| NA20582 | 5.8 | 0 | 271 | 15 |
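The z-score % in the table above is a simple ratio: SNPs whose z-score falls outside ±1.96, divided by all SNPs kept after filtering. The sketch below shows one way to compute it. It assumes each AR is standardised against the mean and standard deviation of an uncontaminated reference dataset. The function name and the example AR values are hypothetical.

```python
Z_LIMIT = 1.96  # two-sided 95% limit quoted in the table caption


def z_score_percent(allele_ratios, ref_mean, ref_sd, z_limit=Z_LIMIT):
    """Return (percent, count) of SNPs whose AR z-score lies outside +/- z_limit.

    ref_mean and ref_sd describe the AR distribution of an uncontaminated
    reference dataset; standardising against them is an assumption here.
    """
    if not allele_ratios:
        raise ValueError("no SNPs left after filtering")
    if ref_sd <= 0:
        raise ValueError("reference standard deviation must be positive")
    outside = sum(
        1 for ar in allele_ratios if abs((ar - ref_mean) / ref_sd) > z_limit
    )
    return 100.0 * outside / len(allele_ratios), outside


# Hypothetical values, for illustration only (not taken from the study).
example_ars = [0.50, 0.48, 0.52, 0.35, 0.10]
percent, count = z_score_percent(example_ars, ref_mean=0.5, ref_sd=0.05)
print(percent, count)

# The same ratio applied to the C320 row: 136 of 346 SNPs outside the limits.
print(round(100.0 * 136 / 346, 1))
```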
Figure 1. Distribution of allele ratios (AR) in the reference dataset and in contaminated samples. (A) Distribution of AR in the reference dataset. The red line is the mean of the distribution, the violet line is the median, and the blue lines define the 95% confidence interval (CI95). The blue areas outside the CI95 define the region of unexpected AR: whatever falls outside the CI95 is considered unexpected. The distribution is normal, with minimal tails in the unexpected regions. (B) Distribution of AR in the artificially contaminated samples. The red line is the mean of the distribution, the violet line is the median, the blue lines define the CI95 of the reference dataset, and the green dotted line defines the CI95 of the contaminated dataset. The distribution is no longer normal, and a greater percentage of data falls in the unexpected regions than in the reference dataset, showing the effect of contamination on the AR of SNPs.
Figure 2. Percentage of SNPs with a z-score outside the defined thresholds (-1.96/+1.96) for samples of the validation set. Black: scores obtained by sample C3 (see Table 1) at different contamination ratios. Red: scores obtained by sample C2 (see Table 1) at different contamination ratios. The blue line is the mean z-score % obtained by the reference dataset, and the green dotted line is the upper limit of the CI95 of the reference dataset. The graph suggests that our method can detect contamination down to 10-20%. The threshold for discriminating between contaminated and non-contaminated samples should be chosen according to how many false positives (FPs) can be tolerated. Contamination around 20% would probably generate no FPs. Lower contamination levels can be detected, but with more FP calls. More experiments are needed to estimate the number of FPs at different contamination percentages.
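To make the threshold choice concrete, the sketch below derives an upper 95% limit from control scores and flags samples above it. The reference dataset itself is not reproduced here, so the three uncontaminated Coriell controls from the z-score table stand in for it. That substitution, the normal approximation (mean + 1.96 × sample SD) and the helper name `upper_ci95` are all assumptions, and with only three controls the resulting limit is illustrative at best.

```python
from statistics import mean, stdev

# z-score % values copied from the table of validation samples.
CONTROLS = {"NA20763": 9.5, "NA20828": 9.2, "NA20582": 5.8}
CONTAMINATED = {
    "C320": 39.3, "C310": 13.6, "C210": 12.6, "C27": 11.3, "C37": 9.9,
    "C25": 9.2, "C35": 8.8, "C32": 7.7, "C22": 6.9,
}


def upper_ci95(scores):
    """Upper 95% limit under a normal approximation: mean + 1.96 * sample SD."""
    return mean(scores) + 1.96 * stdev(scores)


# Stand-in threshold: the controls replace the (unavailable) reference dataset.
threshold = upper_ci95(list(CONTROLS.values()))
flagged = [name for name, score in CONTAMINATED.items() if score > threshold]
print(round(threshold, 1), flagged)
```

With this stand-in limit only the 10% and 20% samples exceed the threshold. A lower cut-off would also catch the 7% samples, but it would then sit close to the scores of two uncontaminated controls (9.5 and 9.2), which is the FP trade-off the caption describes.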
Figure 3. Comparison of AR distributions of the Coriell samples used to generate contaminated samples and of the resulting contaminated samples. Blue: the sample used as principal. Orange: the sample used as contaminant. Green: the resulting contaminated sample. The black dotted line is the line of best fit of the reference sample and the red line is that of the contaminated sample. (A) Results for 5% contamination. The best fit lines show that it is difficult to distinguish the AR distributions, making it impossible for our method to detect contamination at such a low percentage. (B) Results for 10% contamination. The algorithm is able to distinguish the two distributions, but since the score obtained by the non-contaminated sample is close to that of the 10% contaminated sample, we cannot exclude the presence of FPs if the threshold is chosen to detect 10% contamination. (C) Results for 20% contamination. In this case the contaminated sample has an almost trimodal distribution, which makes it extremely easy to distinguish from the reference distribution.
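One possible reading of the near-trimodal shape at 20% is that a heterozygous SNP in the principal sample is pulled towards a different AR depending on the contaminant's genotype at that site. The sketch below works through that expectation. It assumes AR is the alternate-allele read fraction (this excerpt does not define it) and that reads are drawn in proportion to DNA input. The function names are hypothetical.

```python
def expected_ar(principal_alt_fraction, contaminant_alt_fraction, contamination):
    """Expected alternate-allele read fraction at one SNP in a two-sample mix.

    Each *_alt_fraction is the genotype's alt-allele share: 0.0 (hom ref),
    0.5 (het) or 1.0 (hom alt). contamination is the contaminant's DNA
    fraction, between 0 and 1. Sampling noise is ignored.
    """
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must be between 0 and 1")
    return ((1.0 - contamination) * principal_alt_fraction
            + contamination * contaminant_alt_fraction)


def het_modes(contamination):
    """Expected AR of a principal-sample het SNP for each contaminant genotype."""
    return [round(expected_ar(0.5, g, contamination), 3) for g in (0.0, 0.5, 1.0)]


for level in (0.05, 0.10, 0.20):
    print(level, het_modes(level))
```

Under these assumptions the three expected values sit 0.1 apart at 20% contamination but only 0.025 apart at 5%. This fits the captions' observation that the 5% distributions are hard to tell apart, although read-depth noise is not modelled here.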