Konrad Zych, Basten L Snoek, Mark Elvin, Miriam Rodriguez, K Joeri Van der Velde, Danny Arends, Harm-Jan Westra, Morris A Swertz, Gino Poulin, Jan E Kammenga, Rainer Breitling, Ritsert C Jansen, Yang Li.
Abstract
In high-throughput molecular profiling studies, genotype labels can be wrongly assigned at various experimental steps; the resulting mislabeled samples seriously reduce the power to detect the genetic basis of phenotypic variation. We have developed an approach to detect potential mislabeling, recover the "ideal" genotype and identify "best-matched" labels for mislabeled samples. On average, we identified 4% of samples as mislabeled in eight published datasets, highlighting the necessity of applying a "data cleaning" step before standard data analysis.
Year: 2017 PMID: 28192439 PMCID: PMC5305221 DOI: 10.1371/journal.pone.0171324
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Graphical summary of the reGenotyper algorithm.
reGenotyper uses a data perturbation strategy and exploits the highly parallel nature of molecular profiling in modern genetic studies. 1) Observed genotype data and QTLs. The data matrix in the middle contains the genotype information at each marker position (row) for each sample (column), where orange and blue represent two different genotypes. The observed QTL significance of Phenotypes 1–4 from a standard QTL mapping technique is shown in graded shades of purple, with a darker color representing a stronger QTL significance. 2) Perturbation of true genotypes. Perturbing the genotype at a particular marker of a correctly labeled sample (correct → wrong) leads to a decreased QTL significance for all molecular traits mapping to a QTL near that marker. In this panel, the genotype of the 3rd sample at the 5th marker position (indicated by an arrow circle) is randomly perturbed (changed from the orange to the blue allele). We then re-map the QTLs using the perturbed genotype data and the unchanged phenotype matrix and observe that the significance decreases for most of the QTLs (dark color changes to light color), i.e. a QTL loses significance if noise is added. 3) Correction of wrong genotypes. The genotype of the 2nd sample at the 5th marker position (indicated by an arrow circle) is randomly perturbed (changed from the blue to the orange allele). We then re-map the QTLs using the perturbed genotype data and the unchanged phenotype matrix. In this case, the significance increases for most of the QTLs after perturbation (light color changes to dark color), because the original genotype was wrong. When such an increase is observed for a number of phenotype-marker pairs, it suggests that the genotype of this sample was mislabeled.
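The perturbation idea in this caption can be illustrated with a small simulation. The following is a minimal sketch, not the reGenotyper implementation: it assumes a single biallelic marker coded 0/1, uses the absolute Pearson correlation as a stand-in for QTL significance, and runs on simulated data. The function names `qtl_score` and `flip_delta` are hypothetical and chosen for illustration only.

```python
import numpy as np

def qtl_score(genotypes, phenotypes):
    """Absolute Pearson correlation between one marker and each phenotype.

    Used here as a simple proxy for QTL significance (an assumption of this sketch).
    genotypes: shape (n_samples,), coded 0/1
    phenotypes: shape (n_traits, n_samples)
    """
    g = genotypes - genotypes.mean()
    p = phenotypes - phenotypes.mean(axis=1, keepdims=True)
    denom = np.sqrt((g @ g) * (p * p).sum(axis=1))
    return np.abs(p @ g) / denom

def flip_delta(genotypes, phenotypes, sample):
    """Mean change in the QTL score over all traits after flipping one sample's genotype."""
    perturbed = genotypes.copy()
    perturbed[sample] = 1 - perturbed[sample]
    return float(np.mean(qtl_score(perturbed, phenotypes) - qtl_score(genotypes, phenotypes)))

# Simulated data: 50 molecular traits that all map to the same marker.
rng = np.random.default_rng(0)
n_samples, n_traits = 40, 50
true_geno = np.tile([0, 1], n_samples // 2)
effects = rng.normal(2.0, 0.2, size=n_traits)
phenotypes = effects[:, None] * true_geno[None, :] + rng.normal(0.0, 1.0, size=(n_traits, n_samples))

# Mislabel one sample (index 1) in the observed genotype data.
observed = true_geno.copy()
observed[1] = 1 - observed[1]

# Perturb each sample in turn: a positive mean change means that flipping the
# genotype strengthens the QTLs, which points to a wrong original label.
deltas = [flip_delta(observed, phenotypes, s) for s in range(n_samples)]
suspect = int(np.argmax(deltas))
print("most suspicious sample index:", suspect)
```

Averaging the change over many traits is what makes the signal usable: for any single trait the effect of flipping one sample is small and noisy, but across traits sharing a QTL the direction of the change is consistent.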
Fig 2. Individual evidence of samples (arranged around the circle) being detected as potentially mislabeled samples (MS) across seven different experiments (each represented by a circle) from C. elegans studies using the reGenotyper method.
Different shades of green represent the mislabeling score, with a darker color corresponding to a higher score, and magenta indicating that the sample has been detected as MS with a score larger than 0.9 (90%). White indicates that the sample was not used in this experiment (as not all samples were used in all experiments). Samples with consistently strong evidence of being potentially mislabeled across experiments (i.e., showing high scores in multiple experiments) are more likely to be truly mislabeled. Note that sample WN53 shows a mislabeling score larger than 0.9 in four independent experiments, making it very likely that it was indeed mislabeled, as confirmed by subsequent experiments.