Alexander M Morin1, Evan Gatev1, Lisa M McEwen1, Julia L MacIsaac1, David T S Lin1, Nastassja Koen2, Darina Czamara3, Katri Räikkönen4, Heather J Zar5, Karestan Koenen6, Dan J Stein2, Michael S Kobor1,7, Meaghan J Jones1.
Abstract
BACKGROUND: Cord blood is a commonly used tissue in environmental, genetic, and epigenetic population studies due to its ready availability and potential to inform on a sensitive period of human development. However, the introduction of maternal blood during labor or cross-contamination during sample collection may complicate downstream analyses. After discovering maternal contamination of cord blood in a cohort study of 150 neonates using Illumina 450K DNA methylation (DNAm) data, we used a combination of linear regression and random forest machine learning to create a DNAm-based screening method. We identified a panel of DNAm sites that could discriminate between contaminated and non-contaminated samples, then designed pyrosequencing assays to pre-screen DNA prior to being assayed on an array.
Keywords: 450K; Blood banking; Contamination; Cord blood; DNA methylation; Genotyping; Maternal blood
Year: 2017 PMID: 28770015 PMCID: PMC5526324 DOI: 10.1186/s13148-017-0370-2
Source DB: PubMed Journal: Clin Epigenetics ISSN: 1868-7075 Impact factor: 6.551
Fig. 1 Principal component and X chromosome DNA methylation (DNAm) patterns revealed maternal blood contamination in cord blood. a Plotting the first two principal components of 450K DNAm data identified a number of male samples with DNAm patterns similar to female participants or intermediate between male and female. b Examining the distribution of X chromosome DNAm beta values in these samples revealed that the intermediate male samples clearly showed patterns indicative of a mixture of male (top) and female (bottom) distributions
Fig. 2 DNA methylation at 10 autosomal CpGs was sufficient to correctly identify all known contaminated male samples, and found 13 contaminated female samples. a The 10 CpGs selected by the random forest method clearly separate cord and adult samples, and also clearly discriminate non-contaminated (N) from contaminated (C) male samples, and divide unknown (U) samples into two groups. b Counting the number of sites over thresholds per sample (x axis), contamination was called if at least 5 of the 10 CpGs were above the threshold. Unclear males were all non-contaminated, and 13 females were identified as being contaminated. c A subset of 3 out of the 10 CpGs can be used for pyrosequencing screening. Two thresholds are shown: one requiring two of the three CpGs to be above the threshold to be called contaminated (yellow), and one requiring all three (red)
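The counting rule described in Fig. 2b can be illustrated with a minimal Python sketch. It is not the authors' code: the CpG identifiers (`cpg_1` to `cpg_10`), the beta values, and the per-site cut-off of 0.2 are hypothetical placeholders, since the actual sites and thresholds are not given in this record. Only the decision rule (call contamination when at least 5 of the 10 CpGs are above their thresholds) comes from the caption.

```python
from typing import Mapping


def count_sites_over_threshold(
    betas: Mapping[str, float], thresholds: Mapping[str, float]
) -> int:
    """Count panel CpGs whose beta value is above that site's cut-off."""
    return sum(1 for cpg, cutoff in thresholds.items() if betas[cpg] > cutoff)


def is_contaminated(
    betas: Mapping[str, float],
    thresholds: Mapping[str, float],
    min_sites: int = 5,
) -> bool:
    """Call contamination when at least `min_sites` CpGs exceed their cut-offs."""
    return count_sites_over_threshold(betas, thresholds) >= min_sites


# Hypothetical panel: placeholder CpG names and an assumed cut-off of 0.2 per site.
thresholds = {f"cpg_{i}": 0.2 for i in range(1, 11)}

# Hypothetical sample: the first six sites sit above the cut-off, the rest below.
sample = {f"cpg_{i}": (0.35 if i <= 6 else 0.05) for i in range(1, 11)}

print(count_sites_over_threshold(sample, thresholds))
print(is_contaminated(sample, thresholds))
```

Under the same assumptions, the three-CpG pyrosequencing screen in Fig. 2c would correspond to passing a three-site `thresholds` mapping with `min_sites=2` (yellow threshold) or `min_sites=3` (red threshold).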
Fig. 3 Summary of performance of all methods used to predict cord blood contamination. Each column represents the same participant across each method. The 10 CpG method using 450K array data was the most reliable, but using a subset of three CpGs was sufficient to identify at least 82% of contaminated samples
Fig. 4 Pre-screening using the pyrosequencing method correctly identified contaminated male samples. a Applying a cut-off of 2 CpGs above the threshold (yellow line) to the 3 CpG pyrosequencing method on validation data, 18 males and 15 females were identified as contaminated. b Principal component plot of EPIC DNA methylation data on all non-contaminated samples with two male samples that had been called contaminated by pyrosequencing showed that contaminated male samples had been correctly identified. c Using the 10 CpG method from EPIC data, only the 2 male samples known to be contaminated had more than 5 CpGs above the threshold (red line)
Fig. 5 Identification of studies with significant contamination levels in public data. Using available data, we examined the 10 CpGs chosen to identify contamination, though some studies had previously filtered their data and some CpGs were not available. We called maternal contamination of samples if more than 50% of the available CpGs were above our contamination thresholds, and identified two studies (GSE54399 and PREDO) with contaminated samples
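For public datasets where some panel CpGs have been filtered out, the rule in Fig. 5 uses a fraction of the available sites rather than a fixed count. The sketch below shows one way this could be expressed. The CpG names, beta values, and the 0.2 cut-off are again hypothetical, and treating a sample with no available panel CpGs as "not callable" is an assumption of this sketch, not something stated in the source.

```python
from typing import Mapping, Optional


def fraction_over_threshold(
    betas: Mapping[str, Optional[float]], thresholds: Mapping[str, float]
) -> Optional[float]:
    """Fraction of available panel CpGs above their cut-offs.

    A CpG is treated as unavailable when it is absent from `betas` or is None.
    Returns None when no panel CpG is available (an assumption of this sketch).
    """
    available = [cpg for cpg in thresholds if betas.get(cpg) is not None]
    if not available:
        return None
    over = sum(1 for cpg in available if betas[cpg] > thresholds[cpg])
    return over / len(available)


def is_contaminated_public(
    betas: Mapping[str, Optional[float]],
    thresholds: Mapping[str, float],
    min_fraction: float = 0.5,
) -> bool:
    """Call contamination when more than `min_fraction` of available CpGs are over."""
    fraction = fraction_over_threshold(betas, thresholds)
    return fraction is not None and fraction > min_fraction


# Hypothetical panel with placeholder names and an assumed cut-off of 0.2 per site.
panel = {f"cpg_{i}": 0.2 for i in range(1, 11)}

# Hypothetical filtered dataset: only 8 of the 10 panel CpGs survive, 5 are over.
filtered_sample = {f"cpg_{i}": (0.4 if i <= 5 else 0.1) for i in range(1, 9)}

print(fraction_over_threshold(filtered_sample, panel))
print(is_contaminated_public(filtered_sample, panel))
```

The comparison is strict (`>`), following the caption's wording of "more than 50%", so a sample with exactly half of its available CpGs over threshold is not called contaminated in this sketch.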