DRAMS: A tool to detect and re-align mixed-up samples for integrative studies of multi-omics data.

Yi Jiang^1,2,3, Gina Giase^4, Kay Grennan^5, Annie W Shieh^5, Yan Xia^1,5, Lide Han^3, Quan Wang^2,3, Qiang Wei^2,3, Rui Chen^2,3, Sihan Liu^1, Kevin P White^6,7, Chao Chen^1,8, Bingshan Li^2,3, Chunyu Liu^1,5,9.

Abstract

Studies of complex disorders benefit from integrative analyses of multiple omics data. Yet sample mix-ups frequently occur in multi-omics studies, weakening statistical power and risking false findings. Accurately aligning sample information, genotype, and corresponding omics data is critical for integrative analyses. To address the sample mix-up problem, we developed DRAMS (https://github.com/Yi-Jiang/DRAMS), a tool to Detect and Re-Align Mixed-up Samples. It uses a logistic regression model followed by a modified topological sorting algorithm to identify the potential true IDs based on data relationships across multiple omics types. In tests using simulated data, DRAMS performed better as more omics types were used and as the proportion of mix-ups decreased. Applying DRAMS to real data from the PsychENCODE BrainGVEX project, we detected and corrected 201 mix-ups (12.5% of total data generated). Of the 21 mix-ups involving errors of racial identity, DRAMS re-assigned all data to the correct racial group in the 1000 Genomes project. After correction, the number of quantitative trait loci (QTLs) detected (FDR<0.01) increased by an average of 1.62-fold. The use of DRAMS in multi-omics studies will strengthen statistical power and improve the quality of results. Although few studies currently have multi-omics data in place, we expect such data, and with them the need for DRAMS, to increase quickly.

Substances：
Chromatin

Year: 2020 PMID： 32282793 PMCID： PMC7179940 DOI： 10.1371/journal.pcbi.1007522

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.475

This is a PLOS Computational Biology Methods paper.

Introduction

Investigation of complex traits and disorders can use multiple omics data to systematically explore regulatory networks and causal relationships. Sample mix-ups can occur in omics experiments during sample collection, handling, genotyping, and data management. As the number of datasets to be integrated increases, so does the likelihood of error. Sample mix-ups reduce statistical power and generate false findings. Detecting and re-aligning errors in data identifications (IDs) is therefore critical to ensuring accurate findings in integrative studies, and such corrections can increase statistical power and thus the number of positive findings [1]. For multi-omics data, the sample re-alignment procedure can generally be divided into two steps: first, estimate genetic relatedness among the data of different omics types and group together all the data from the same individual; then, assign potential IDs to each data group. Genetic information from the same individual should be identical regardless of the omics type from which it originated. Using genotype data as a mediator, data originating from the same individual can therefore be grouped together. Several tools have been developed to estimate genetic relatedness for multi-omics data in different ways, such as genotype concordance [2, 3], correlation of different quantifications [1, 2], correlation of variant allele fractions [4], and concordance of sequencing reads [4]. These methods make it possible to compare different data types, such as DNA sequencing, RNA sequencing, and SNP array data. However, these tools mainly implement the first step of sample re-alignment, that is, estimating genetic relatedness among multi-omics data. None of the existing tools offers a systematic solution for determining the potential ID of each data. After grouping the highly related data, some data IDs can be corrected empirically with a “majority vote” strategy [1, 2, 5]. 
However, without a systematic solution, this strategy only works for small, low-dimensional data, is difficult to scale up, and yields results that lack statistics-based confidence. As multi-omics data with more dimensions are expected in the near future, identifying the potential ID becomes much more challenging, especially when more than one data is mislabeled for a single individual. At the same time, taking full advantage of information from more omics types makes data ID correction more accurate, turning a challenge into an opportunity. Here, we describe DRAMS, a tool to Detect and Re-Align Mixed-up Samples that leverages sample relationships in multi-omics data by directly comparing genotype data. DRAMS uses a logistic regression model followed by a modified topological sorting algorithm to systematically re-align misclassifications. The tool integrates sample relationships among different omics types, concordance between genetics-based sexes and reported sexes, and user-defined omics priority. With this design, DRAMS can be applied to as many omics types as are available. Using both simulated and real data, we demonstrated the accuracy and power of DRAMS in studies involving multi-omics data.

Results

Design of DRAMS

The goal of DRAMS is to detect and re-align mix-ups on the grounds that all omics data originating from the same individual should have matching genotypes. DRAMS operates in two steps: first, all omics data from the same sample are grouped together by their genotypic relatedness; then, the potential true ID of each data group is determined.

Illustration of key steps.

The workflow has two steps: 1) extracting highly related data pairs; and 2) assigning potential data IDs. For the first step, genotypes were called for data from each omics type using GATK HaplotypeCaller. Contaminated data were identified and removed based on VerifyBamID results and heterozygous rates. DRAMS then estimated genetic relatedness among all available omics data by comparing genotypes using GCTA. Any omics type that contains genetic information can be used here to call genotypes and compare genetic relatedness. Normally, the genetic relatedness scores for all data pairs follow a bimodal distribution. Highly related data pairs were extracted based on the distribution of relatedness scores and connected to create multiple groups. Based on whether the data IDs in each highly related data pair match, the pairs are classified as “matched data pairs” or “mismatched data pairs”. Some data may be unrelated to any other data. For the second step, the groups were visualized as multiple independent sub-networks. In each sub-network, each node represents a data in the group and each edge represents a highly related data pair. The text in each node is the data ID. Different colors represent different omics types. Parallel lines connect matched data, whereas a single line connects mismatched data. After applying a logistic regression model, DRAMS estimated switch directions and probabilities for each mismatched data pair. The arrow denotes the possible switch direction, and the line thickness reflects the switch probability. The final IDs for the data in each group are determined by sorting the nodes.

For the first step, we extracted highly related data pairs. To accomplish this, we called genotypes from each omics dataset and estimated genetic relatedness among these data. After that, we grouped the data according to the highly related data pairs. 
A “group” was defined as a set of data in which each data is highly related to at least one other data. Based on the grouping results, we classify relationships among data into three types. The first type contains highly related data pairs that have the same individual IDs (matched pairs). These are the least likely to be mis-assigned. The second type contains highly related data pairs with different individual IDs (mismatched pairs), in which some individual IDs may have been swapped. The third type contains data that are unrelated to any other data; these are discarded since they are unassignable. For the second step, we connect all the highly related data pairs to produce multiple independent groups. Each node within a group represents one data, and each edge connects a highly related data pair. Then, we use a multi-step approach that combines prior knowledge and statistics to search for the potential IDs in each of the groups, which can contain both matched and mismatched data pairs. To estimate which data ID in a mismatched data pair is more likely to be correct, we first use a logistic regression model. The model estimates the switch direction and probability based on three pieces of information: 1) data relationships among multiple omics types: a data ID that matches across a greater number of omics types is more likely to be the true ID than one that matches across only a few; 2) sex concordance: a data whose reported sex agrees with its genetics-based sex is more likely to carry the correct ID; 3) user-defined priority of omics data: the user’s confidence in the correctness of each omics type is recorded as ranks. We manually identified 44 high-confidence mismatched data pairs with well-defined switch directions from the PsychENCODE BrainGVEX project [6] and used them to train the logistic regression model and establish its parameters (Methods). In the last step, we determine the final ID for all omics data within each group. 
We sort all data points in each group using a modified topological sorting algorithm that is based on the switch probabilities obtained from the logistic regression above (Methods). The data ID with the highest value in each group will be selected as the final ID. In this study, we used simulation data and the PsychENCODE BrainGVEX data to evaluate the performance of DRAMS.
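
As a rough illustration of how the three pieces of information could feed a switch probability, the sketch below scores one mismatched pair with a logistic function and then picks a final ID by a simple weighted vote. This is not the DRAMS implementation: the coefficients, the feature names (`n_matched_omics`, `sex_match`, `priority_rank`), and the vote that stands in for the modified topological sort are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical coefficients, for illustration only. DRAMS fits its own
# parameters on the 44 high-confidence training pairs.
COEF = {"intercept": 0.0, "omics_support": 1.2, "sex_match": 0.8, "priority": 0.5}


def switch_probability(a, b):
    """Probability that the ID carried by `b` is the true one (a -> b switch).

    Each data is a dict with hypothetical keys:
      n_matched_omics: number of omics types in which its ID is matched
      sex_match:       1 if reported sex agrees with genetics-based sex, else 0
      priority_rank:   user-defined rank of its omics type (1 = most trusted)
    """
    z = (COEF["intercept"]
         + COEF["omics_support"] * (b["n_matched_omics"] - a["n_matched_omics"])
         + COEF["sex_match"] * (b["sex_match"] - a["sex_match"])
         + COEF["priority"] * (a["priority_rank"] - b["priority_rank"]))
    return 1.0 / (1.0 + math.exp(-z))


def pick_final_id(mismatched_pairs):
    """Simplified stand-in for the sorting step: a weighted vote over IDs."""
    scores = defaultdict(float)
    for a, b in mismatched_pairs:
        p = switch_probability(a, b)
        scores[b["id"]] += p          # evidence that b's ID is correct
        scores[a["id"]] += 1.0 - p    # evidence that a's ID is correct
    if not scores:
        return None
    return max(scores, key=scores.get)
```

With an intercept of zero the two directions of a pair are complementary, so `switch_probability(a, b) + switch_probability(b, a)` equals 1; a fitted model need not have this property.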

Performance of DRAMS using simulation data

To test the performance of DRAMS in correcting data IDs, we generated multiple datasets of highly related data pairs in which a proportion of samples (ranging from 5% to 30%) was randomly shuffled to simulate sample mix-ups. The variable parameters for the simulation data include sample size, number of omics types, and percentage of mix-ups (Methods). The stringent mode of DRAMS was used. In the stringent mode, we discarded data groups with fewer than three data or with no shared IDs (i.e. all data IDs in a group are different), as they are unlikely to be corrected. We found that when a larger proportion of samples is mixed up, fewer samples can be successfully corrected (S3 Table). Taking datasets with five omics types and a sample size of 300 as examples, when 5% of the samples are mis-assigned, all samples can be successfully corrected; when 30% of the samples are mis-assigned, we can successfully correct an average of 91.5% (SD: 2.30%) of errors, with an average of 0.481% (SD: 0.734%) of overcorrected items.
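
The stringent-mode rule described above can be written as a small filter. This is a minimal sketch under the assumption that a group is represented simply as the list of IDs labelled on its data; the function name is hypothetical and not part of DRAMS.

```python
from collections import Counter


def keep_in_stringent_mode(group_ids):
    """Return True if a group survives the stringent-mode filter.

    group_ids: list of the data IDs labelled on each data in one group.
    A group is discarded when it has fewer than three data, or when
    no ID is shared by at least two data (all IDs are different).
    """
    if len(group_ids) < 3:
        return False
    most_common_count = Counter(group_ids).most_common(1)[0][1]
    return most_common_count >= 2
```

For example, a group labelled `["A", "A", "B"]` is kept, while `["A", "B"]` and `["A", "B", "C"]` are discarded.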

Performance of DRAMS in simulation data.

We simulated sample mix-ups and used DRAMS to correct data IDs. The simulation data include a range of sample sizes (50, 100, 150, 200, 250, and 300, each with 50% females and 50% males) and a range of numbers of omics types (3, 4, 5, and 6 for panels a, b, c, and d, respectively). To simulate sample mix-ups, we randomly shuffled a graded proportion of data for each omics type: 5%, 10%, 15%, 20%, 25%, and 30%. The simulation process was repeated 100 times. For each repeat, we used DRAMS (the stringent mode) to correct mix-ups. The colored violin plots show the proportion of mix-ups successfully corrected (different colors indicate different sample sizes). The grey violin plots show the proportion of mix-ups overcorrected (simulations of different sample sizes were combined). The process of generating simulation data is described in S1 Fig.

The number of omics types involved also greatly influences the performance of DRAMS. The more omics types we have, the better the chance that we can recover the true identity of each erroneous data. Taking datasets with a sample size of 300 and 15% of mixed-up samples as examples, with three omics types (the minimum number required for DRAMS input), an average of 72.2% (SD: 4.62%) of mix-ups can be successfully corrected, with an average of 0.0207% (SD: 0.146%) of overcorrected items. With six omics types, an average of 99.6% (SD: 0.815%) of mix-ups can be successfully corrected, with an average of 0.119% (SD: 0.479%) of overcorrected items.

We also examined the effect of sample size. Sample size has no significant influence on the average performance of DRAMS when the proportion of sample mix-ups and the number of omics types are held constant (Pearson correlation between sample size and average proportion of successfully corrected mix-ups: -0.0105; P value: 0.901). However, larger sample sizes seem to stabilize correction results. As sample size increases, the standard deviation of the proportion of successfully corrected mix-ups decreases (Pearson correlation: -0.395; P value: 9.50 × 10^-7).
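
To make the shuffling step concrete, the sketch below permutes the ID labels of a chosen proportion of data within each omics type. It is an assumption-laden illustration rather than the authors' simulation code (which is described in S1 Fig): the function name, the data layout, and the use of a fixed seed are all hypothetical.

```python
import random


def simulate_mixups(sample_ids, n_omics, proportion, seed=0):
    """Return {omics_index: labels}, where labels[i] is the ID attached to
    the data that truly comes from sample_ids[i].

    In each omics type, `proportion` of the data are picked at random and
    their labels are permuted among themselves. A random permutation can
    leave some labels in place, so the realized number of mislabeled data
    is at most the number picked.
    """
    rng = random.Random(seed)
    n_shuffle = int(round(proportion * len(sample_ids)))
    labels_by_omics = {}
    for omics in range(n_omics):
        labels = list(sample_ids)
        picked = rng.sample(range(len(sample_ids)), n_shuffle)
        targets = picked[:]
        rng.shuffle(targets)
        for src, dst in zip(picked, targets):
            labels[dst] = sample_ids[src]
        labels_by_omics[omics] = labels
    return labels_by_omics
```

Because the labels are only permuted, every ID still appears exactly once per omics type, which mirrors a swap-style mix-up rather than a duplicated or lost label.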

Performance of DRAMS using real data from the PsychENCODE BrainGVEX project

Data summary

The PsychENCODE BrainGVEX project[6] generated six types of omics data (S1 Table), including low-depth Whole Genome Sequencing (WGS), RNA sequencing (RNA-Seq), Assay for Transposase-Accessible Chromatin using Sequencing (ATAC-Seq), Ribosome Sequencing (Ribo-Seq), and SNP array data from two platforms, including Affymetrix 5.0 450K (Affymetrix) and Psych v1.1 beadchips (PsychChip). We called a total of 19,242,755 SNPs from WGS data on autosomes, 17,786,350 SNPs from RNA-Seq data, 10,571,742 SNPs from ATAC-Seq data, 156,354 SNPs from Ribo-Seq data, 10,891,109 SNPs from Affymetrix data, and 13,589,867 SNPs from PsychChip data. We used two methods to check sample contamination: using VerifyBamID[7], and calculating heterozygous rates (S2 Fig). We defined the samples with both FREEMIX >0.3 and heterozygous rates >0.3 as contaminated samples. We removed the sample “2015–916” in WGS from the subsequent analyses for being contaminated.

Check sample alignment based on genetics-based sex

We checked sample alignment by comparing genetics-based sexes with reported sexes for WGS, PsychChip, ATAC-Seq, and RNA-Seq data. Based on X chromosome heterozygosity and Y chromosome call rate, we calculated an F-value (a Plink-derived metric to distinguish males from females; details in Methods) for each data in each omics type. The genetics-based sexes were then inferred from the distribution of F-values (called SNP-inferred sexes). By comparing reported sex with SNP-inferred sex, we identified a total of 74 data with mismatched sexes and 1174 data with matched sexes, indicating that some samples might have been mixed up. Of the 426 RNA-Seq samples, only three had mismatched sexes, indicating that the RNA-Seq data may have good overall quality in terms of sample matching. For Ribo-Seq data, we did not estimate SNP-inferred sexes, since a threshold to separate males and females could not be defined from the distribution of F-values. Nor did we estimate SNP-inferred sexes for Affymetrix 5.0 SNP array data, since no genotypes on sex chromosomes were available. As complementary evidence, we also used the XIST gene expression level (XIST-inferred sexes) to infer genetics-based sexes for RNA-Seq and Ribo-Seq data (Methods). Note: “Number of sex-matched samples” indicates the number of samples with the same reported sex and SNP-inferred sex.

Detect and correct data IDs

We calculated genetic relatedness scores among all the data in the six omics types using GCTA[3]. Based on the distribution of genetic relatedness scores, we extracted the highly related data pairs using a threshold of 0.65 (S3 Fig). We identified a total of 1971 matched pairs and 518 mismatched pairs. We also found eight data that were not related to any other data, including three ATAC-Seq data, four Ribo-Seq data, and one Affymetrix data (S4 Table).
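
The extraction and grouping described above can be sketched in a few lines: threshold the relatedness scores, split the surviving pairs into matched and mismatched by comparing IDs, and connect pairs into groups. This is an illustrative sketch, not DRAMS code; the function names and the `(omics_type, sample_id)` representation of a data are assumptions.

```python
from collections import defaultdict

THRESHOLD = 0.65  # threshold applied to the BrainGVEX relatedness scores


def extract_pairs(scores, threshold=THRESHOLD):
    """scores: dict mapping (data_a, data_b) -> relatedness score, where each
    data is a tuple (omics_type, sample_id). Returns (matched, mismatched)."""
    matched, mismatched = [], []
    for (a, b), score in scores.items():
        if score > threshold:
            (matched if a[1] == b[1] else mismatched).append((a, b))
    return matched, mismatched


def connected_groups(all_data, pairs):
    """Connect highly related pairs into groups using union-find.
    Groups of size one are data unrelated to any other data."""
    parent = {d: d for d in all_data}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for d in all_data:
        groups[find(d)].add(d)
    return list(groups.values())
```

For instance, if a WGS and an RNA-Seq data both labelled S1 are highly related to an ATAC-Seq data labelled S2, the three form one group with one matched and two mismatched pairs.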

Summary of highly related data pairs.

Summary of highly related data pairs among the six omics types. A highly related data pair was defined as a data pair with genetic relatedness score > 0.65. M: Matched pairs (highly related data pairs that have the same individual IDs), Mis: Mismatched pairs (highly related data pairs with different individual IDs).

Of the 518 mismatched data pairs, 44 pairs have well-defined switch directions. We used these to train the logistic regression model (Methods). For the remaining 474 mismatched data pairs, we used the logistic regression model to predict switch directions and probabilities. Because WGS and PsychChip data were generated from the same sources of DNA, we considered them as one omics type in the regression step. Similarly, we considered ATAC-Seq and Ribo-Seq as one omics type in the regression, as they were processed from the same original tissues. Based on the proportion of samples with concordant SNP-inferred sex and reported sex, as well as on our prior knowledge about the data processing of each omics type, we assigned the omics priority as RNA-Seq > Affymetrix > WGS & PsychChip > ATAC-Seq & Ribo-Seq. After using the logistic regression to predict the switch directions and probabilities for the 474 mismatched data pairs, we connected all highly related data pairs to create groups of paired data. Then, we used the modified topological sorting algorithm to sort the nodes in each group and picked the node with the highest score as the final ID for all data in each group. In the end, we corrected 201 (12.5%) IDs for data of the six omics types (S5 Table). After correcting data IDs, eighteen data still had mismatched SNP-inferred sexes and reported sexes, including 15 ATAC-Seq data and three RNA-Seq data (S4 Fig). For ATAC-Seq, mismatches are not unexpected, since accurately inferring genetics-based sex for some samples is difficult due to their low sequencing coverage on the X and Y chromosomes. 
For RNA-Seq, we found that the XIST-inferred sexes were consistent with reported sexes for all the samples (S5 Fig), indicating that all the RNA-Seq samples might have been assigned their true IDs. Because three samples nevertheless had mismatched SNP-inferred sexes and reported sexes, we inferred that estimating genetics-based sexes from genotypes called from RNA-Seq data may be inaccurate. For Ribo-Seq, both XIST-inferred sexes and F-value-based sexes were less consistent with reported sexes than for RNA-Seq or DNA-based data. This may suggest that neither sex chromosome genotypes nor XIST gene expression works well to infer genetics-based sexes for Ribo-Seq data.

Validate data ID corrections by race group assignment

To confirm that the 201 data IDs were correctly assigned, we used race as an independent validation. We performed Principal Component Analysis (PCA) on our BrainGVEX data together with data of the four major racial groups (European, Asian, African, and African American) from the 1000 Genomes Project (1000G)[8] (S2 Table). PCA separated the 1000G and BrainGVEX data into four racial groups. Before correcting data IDs, twenty-one data were classified into the wrong racial groups. After correcting data IDs, all the data had races concordant with 1000G (S5 Table). For WGS, PsychChip, ATAC-Seq, and Ribo-Seq data that had race-switched data IDs, all were switched back into the correct PCA groups, indicating that those samples were likely to have been mislabeled and were successfully corrected by DRAMS.

Validation of cross-races switched data.

PCA results for BrainGVEX samples (grey dots) and 1000G samples (colored dots) are shown. All data with switched races are marked with their original IDs and new IDs, as well as the corresponding race. The correspondence between BrainGVEX and 1000G races are shown in S2 Table. PCA results for BrainGVEX (a) WGS, (b) PsychChip, (c) ATAC-Seq, and (d) Ribo-Seq samples are shown. RNA-Seq and Affymetrix data are not shown as there is no data with switched races. Proportions of switched races data per dataset: WGS: 0.12; PsychChip: 0.12; ATAC-Seq: 0.19; Ribo-Seq: 0.21. CAUC: Caucasian, HiSP: Spanish, AA: African American, AS: Asian American, CEU: Utah Residents with Northern and Western European Ancestry, ASW: African Ancestry in Southwest US, CHB: Han Chinese in Beijing, China, YRI: Yoruba in Ibadan, Nigeria.

Increased number of cis-QTLs after correcting data IDs

We mapped four sets of cis-QTLs based on different data combinations (WGS with RNA-Seq, WGS with Ribo-Seq, PsychChip with RNA-Seq, and PsychChip with Ribo-Seq) for BrainGVEX data before and after correcting data IDs using DRAMS. After correcting data IDs, although the sample sizes were reduced slightly due to the removal of a few unresolved samples, the numbers of cis-QTLs increased by an average of 1.62-fold at the FDR<0.01 cutoff and 1.54-fold at the FDR<0.05 cutoff. We also tested the proportion of novel and discarded eQTLs replicated in the Genotype-Tissue Expression (GTEx) project [9]. Around 50% of the novel eQTLs can be replicated in GTEx (denoted by π1). For the discarded eQTLs, only around 20% can be replicated. This difference in replication rates demonstrates the power and importance of correcting data IDs in QTL mapping. Note: Only chromosome 1 was used to save computing time. * π1 was estimated using the “qvalue” package in R.

Sensitivity and specificity in extracting highly related data pairs

To determine the minimum number of SNPs needed to accurately distinguish highly related data pairs from random pairs, we randomly selected eight subsets of SNPs, ranging in size from 200 to 10,000, from the BrainGVEX data and re-calculated pair-wise genetic relatedness scores. Five types of omics data were used for this estimation: WGS, PsychChip, ATAC-Seq, RNA-Seq, and Ribo-Seq. Only the samples with fully matched data IDs among all omics types were used. Since both the WGS and the PsychChip platforms cover most of the common SNPs in the whole genome, we were able to identify all the highly related data pairs without any false positives using as few as 200 common SNPs (MAF > 0.1) (S6 Table, S6 Fig). When comparing DNA-based genotypes with data from other platforms, such as RNA-Seq, Ribo-Seq, or ATAC-Seq, only a proportion of genotyped loci, ranging from 74% to 85%, could be compared, due to these platforms’ limited coverage of genomic regions. To ensure that enough genotypes are compared, at least 1000 SNP loci should be used to estimate genetic relatedness. Comparisons among RNA-Seq, Ribo-Seq, and ATAC-Seq data can be problematic, since each platform captures a somewhat different set of genomic regions. As a result, fewer genotyped loci can be compared, which reduces sensitivity and specificity. For example, for the comparison between RNA-Seq and ATAC-Seq data, the proportion of shared SNPs ranges from 51% to 55%. Even when 1000 common SNPs were used, nine false positive pairs and three false negative pairs were still found (sensitivity: 0.9796; specificity: 0.9996). We therefore recommend using 2000 or more common SNPs to estimate genetic relatedness. We also performed the same analysis using rare SNVs (MAF < 0.1). Rare SNVs are less powerful than common SNPs for distinguishing highly related data pairs from random pairs (S7 Table, S7 Fig). 
It is difficult to identify all highly related data pairs even when 10,000 rare SNVs are used. However, rare SNVs have good specificity. When 1000 rare SNVs were used, only two false positives for WGS versus ATAC-Seq (specificity: 0.9999) and five false positives for RNA-Seq versus ATAC-Seq (specificity: 0.9998) were found. No false positives were found for any of the other comparisons.
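
For reference, the sensitivity and specificity quoted above follow the usual definitions over data pairs. The sketch below computes both from a set of predicted highly related pairs and a set of true pairs; the function name and the set-based representation are assumptions for illustration, and the toy numbers in the usage note are not taken from the study.

```python
def sensitivity_specificity(predicted, truth, all_pairs):
    """Sensitivity and specificity of highly related pair calls.

    predicted: set of pairs called highly related (score > threshold)
    truth:     set of pairs that truly come from the same individual
    all_pairs: set of every pair that was compared
    """
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    tn = len(all_pairs) - tp - fp - fn
    sensitivity = tp / (tp + fn)   # share of true pairs that were recovered
    specificity = tn / (tn + fp)   # share of random pairs correctly rejected
    return sensitivity, specificity
```

As a toy example, with 16 compared pairs of which 4 are true, calling 3 true pairs plus 1 random pair gives a sensitivity of 3/4 and a specificity of 11/12.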

Discussion

To reduce the influence of sample mix-ups on multi-omics integrative studies, we developed DRAMS, a tool to detect and correct mixed-up data IDs. The principle of DRAMS is that genotypes of all omics data assayed on the same individual should be identical. We directly call genotypes and estimate pair-wise genetic relatedness by calculating genotype concordance rates among all data to be checked. Therefore, any omics type that contains genotype information can be used for DRAMS correction. DRAMS groups together the data potentially originating from the same individual and determines the potential IDs for the data within each group. The group size is influenced by the number of omics types. Having more omics types increases the information available to determine the potential IDs, but it also increases complexity. DRAMS uses a logistic regression model followed by a modified topological sorting algorithm to systematically integrate genetic relationships, sex concordance, and omics priority to determine the potential ID for each data. The tool performs well on both simulation data and BrainGVEX data. With this design, DRAMS is not limited to a fixed number of omics types. According to our simulation data, DRAMS performs better as more omics types are included. This is a major advantage of our framework over existing tools.

Since sex information is valuable for verifying data IDs, we employed it in the design of DRAMS to increase reliability. When applying DRAMS to BrainGVEX data, we used two strategies: we inferred genetics-based sexes from sex chromosome genotypes (SNP-inferred sex) and, alternatively, from XIST gene expression (XIST-inferred sex). For RNA-Seq data, all data had consistent XIST-inferred sexes and reported sexes after correcting the data IDs. 
However, three out of 426 data had inconsistent SNP-inferred sexes and reported sexes. For Ribo-Seq data, as XIST is a non-coding RNA, estimating genetics-based sex from XIST expression is not accurate. Also, due to the low coverage of Ribo-Seq data, genetics-based sex estimated from sex chromosome genotypes is not reliable. Therefore, it is reasonable that we did not see an obvious consistency among XIST-inferred sexes, SNP-inferred sexes, and reported sexes.

We used ethnic information as a validation step for BrainGVEX data ID correction. After correcting data IDs, all of the samples grouped into the correct race with the 1000G reference samples. Although a matched race does not necessarily mean a correct data ID, it is strong evidence that the data IDs were switched in the correct direction for the mismatched data pairs. We also identified more QTLs after data ID correction. Although this is not direct proof of correct data ID assignment either, it is likely the result of better alignment of different omics types and, consequently, increased statistical power for QTL analyses.

The threshold used to extract highly related data pairs should be selected very carefully. A loose threshold can lead to a large proportion of overcorrection (S8 Table). For data from related individuals, such as family data, related individual pairs may have a relatedness score > 0.65. In this situation, we recommend using a stringent threshold to extract highly related data pairs with GCTA or NGSCheckMate. For GCTA, we recommend inspecting the distribution of genetic relatedness scores and choosing a higher threshold. For NGSCheckMate, the tool provides an “-f” parameter that defines highly related sample pairs in a stringent mode, which reduces the probability of mislabeling among relatives. However, even in stringent mode, it is still possible that some related individuals are identified as the same individual. 
In other words, DRAMS needs to be used with great caution when processing data from family members. This will be a major challenge for the algorithm, particularly for data with fewer usable genotypes or more noise.

We used the BrainGVEX data to assess the minimum number of SNPs needed to identify highly related data pairs. We found that common SNPs have greater power to distinguish highly related data pairs from random pairs. Nonetheless, rare SNVs are also useful, as they are unlikely to produce false positive findings. We also found that when comparing data from different platforms (which cover different genomic regions), a smaller proportion of genotyped loci can be compared, indicating that additional SNPs should be used. Based on the BrainGVEX data, we recommend using 2000 or more common SNPs to extract highly related data pairs for most platforms. Nonetheless, since genotype quality may differ substantially among platforms, we recommend including as many variants as possible, even rare variants, to improve the sensitivity and specificity of DRAMS.

Matching omics data is only the first step in the process. Assigning the correct data ID, typically associated with sample demographic information and phenotypic data, is another important step. Conceivably, some analyses are more sensitive than others to sample information and covariates (sex, diagnosis, etc.). Mis-assigned sample information severely affects some analyses, such as differential gene expression and case-control comparison. DRAMS can correct mix-ups and identify the correct labels and associated sample information before integrative analyses are conducted.

Currently, only a few projects have produced multi-omics data, such as the data generated in the labs of Drs. Gilad and Pritchard [10], data from GTEx [9], and ROSMAP [11, 12]. As more large-scale multi-omics data are generated in the near future to address questions about regulatory networks and causal relationships, we expect that DRAMS will be helpful for those studies.

Materials and methods

Sample resources

A total of 440 individuals from the PsychENCODE BrainGVEX study[6] with six types of omics data were used to validate data IDs and assign the potential IDs. BrainGVEX was part of the PsychENCODE project, focusing on gene expression regulation in the human brain frontal cortex (FC). The samples include 420 Caucasians, 2 Hispanic, 1 African American, 3 Asian American, and 14 unknown (S1 Table). The six omics types included 1) 285 samples (176 males, 106 females, and 3 unknown-sex samples) of low-depth Whole Genome Sequencing (WGS, average depth: 5×) data; 2) 426 samples (274 males, 152 females) of RNA-Seq data; 3) 295 samples (180 males, 112 females, and 3 unknown-sex samples) of Assay for Transposase-Accessible Chromatin using Sequencing (ATAC-Seq) data; 4) 197 samples (122 males, 70 females, and 5 unknown-sex samples) of Ribosome Sequencing (Ribo-Seq) data; and SNP array data from two platforms: 5) 137 samples (92 males, 45 females) of Affymetrix 5.0 450K (Affymetrix) data; and 6) 263 samples (163 males, 100 females) of Psych v1.1 beadchips (PsychChip) data.

Genotype calling from data of each omics type

We used the same pipeline to call genotypes for all sequencing data. For each dataset in FASTQ format, all reads were mapped to the human reference genome (hg19) using BWA [13] after sequencing adapters and low-quality bases were removed. PCR duplicates were removed using the MarkDuplicates tool in Picard (http://broadinstitute.github.io/picard/). Then, GATK IndelRealigner and BaseRecalibrator were used for local realignment around indels and base quality score recalibration, respectively [14]. For each omics type, genotypes were called using GATK HaplotypeCaller for all samples jointly. Each set of omics data was processed separately.

Estimation of sample contamination

Two methods were used to check sample contamination. The first is VerifyBamID[7], which requires both BAM files and VCF files as input and can therefore only be applied to sequencing data. Its results include a parameter “FREEMIX” (0–1 scale), which indicates the proportion of non-reference bases observed in reference sites. This parameter can be used as an indicator of sample contamination. As an alternative method, we wrote a Linux script to directly calculate the heterozygous rate based on genotypes. This ran faster than the VerifyBamID approach. We defined samples with both FREEMIX >0.3 in VerifyBamID and heterozygous rate >0.3 as contaminated samples, which were removed from subsequent analyses. Only heterozygous rates were calculated for PsychChip and Affymetrix samples, as they are not supported by VerifyBamID.
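
The heterozygous-rate check is simple enough to sketch. The code below is not the authors' Linux script: it assumes genotypes are given as VCF-style strings (such as `0/1`), ignores missing calls, and applies the 0.3 cutoffs to a FREEMIX value supplied by the caller. The function names are hypothetical.

```python
def heterozygous_rate(genotypes):
    """Fraction of called genotypes that are heterozygous.

    genotypes: list of VCF-style genotype strings, e.g. '0/0', '0/1',
    '1|1'. Calls containing '.' are treated as missing and skipped.
    """
    called = [g.replace("|", "/") for g in genotypes if "." not in g]
    if not called:
        return 0.0
    het = sum(1 for g in called if len(set(g.split("/"))) > 1)
    return het / len(called)


def is_contaminated(freemix, het_rate, cutoff=0.3):
    """Flag a sample only when both indicators exceed the cutoff."""
    return freemix > cutoff and het_rate > cutoff
```

How array samples, which have no FREEMIX value, should be judged is not specified here, so this sketch covers only the sequencing case where both values are available.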

Infer genetics-based sexes

We used the Plink software (“--check-sex” module in “ycount” mode) to calculate F-values for the WGS, PsychChip, ATAC-Seq, RNA-Seq, and Ribo-Seq data [15]. This method is mainly based on X chromosome heterozygosity, and it also uses the Y chromosome call rate to improve the accuracy of sex estimates. The F-values generally followed a bimodal distribution. Based on the distribution, we selected a threshold for each omics type and inferred the sex (SNP-inferred sex) of each data. Data with an F-value larger than the threshold were classified as male, and the others were classified as female. We did not assign SNP-inferred sexes for Ribo-Seq data, since its F-values did not show an obvious bimodal distribution. For RNA-Seq and Ribo-Seq data, we also inferred sexes based on XIST (X-inactive specific transcript) gene expression levels (XIST-inferred sex). XIST is a noncoding RNA that is only expressed in cells containing at least two X chromosomes [16]. Normally, the XIST gene is only expressed in female samples. In this study, we considered samples with XIST expression larger than 2 TPM (Transcripts Per Million) as females. We compared the reported sex and genetics-based sex of each data in each omics type and calculated the sex concordance rate for each omics type. Samples with unknown SNP-inferred sexes were not included when calculating the sex concordance rate. The sex concordance rate can serve as the parameter indicating omics priority in the logistic regression model (see “Estimate switch directions and probabilities for mismatched data pairs”).
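
The two sex-inference rules and the concordance rate can be sketched in Python as follows. This is a minimal illustration rather than the DRAMS code: the function names are hypothetical, the F-value threshold is chosen per omics type by the user, and skipping data with unknown reported sex is an assumption (the text only states that unknown SNP-inferred sexes are excluded).

```python
from typing import Optional, Sequence

XIST_TPM_CUTOFF = 2.0  # from the text: XIST expression > 2 TPM is called female


def sex_from_f(f_value: Optional[float], threshold: float) -> Optional[str]:
    """SNP-inferred sex: F-value above the per-omics threshold means male."""
    if f_value is None:
        return None  # sex could not be inferred
    return "M" if f_value > threshold else "F"


def sex_from_xist(xist_tpm: float) -> str:
    """XIST-inferred sex for RNA-Seq or Ribo-Seq data."""
    return "F" if xist_tpm > XIST_TPM_CUTOFF else "M"


def sex_concordance(reported: Sequence[Optional[str]],
                    inferred: Sequence[Optional[str]]) -> float:
    """Proportion of data whose reported and genetics-based sexes agree.

    Data with an unknown sex on either side are left out of the calculation.
    """
    pairs = [(r, i) for r, i in zip(reported, inferred)
             if r is not None and i is not None]
    if not pairs:
        return float("nan")
    return sum(r == i for r, i in pairs) / len(pairs)
```

The resulting concordance rate per omics type is the quantity the text proposes as the default omics priority.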

Estimate genetic relatedness and extract highly-related data pairs

Two tools were used to estimate genetic relatedness among data of multiple omics types and extract highly related data pairs: GCTA[3] and NGSCheckMate[4]. For GCTA, the GRM module was used. The genetic relatedness scores are generally distributed bimodally: one peak with higher scores indicates highly related data pairs, while the other peak indicates random (unrelated) data pairs. One can therefore “eyeball” the distribution to determine the threshold of genetic relatedness scores between every two omics datasets. Using that method, we applied a threshold of 0.65 for our BrainGVEX data and extracted highly related data pairs across omics types. For NGSCheckMate, we ran the software in VCF mode with the “-f” parameter to apply a strict VAF correlation filter. We calculated the concordance rates of the two tools based on BrainGVEX data. To determine the minimum number of variants required to extract highly related data pairs from all combinations of data pairs, we re-calculated genetic relatedness scores using the BrainGVEX data based on subsampled SNPs. Data from five omics types were used: WGS, PsychChip, ATAC-Seq, RNA-Seq, and Ribo-Seq. We used only the samples that fully matched in all five omics types. For each comparison between two omics types, we randomly selected 200, 400, 600, 800, 1000, 2000, 5000, and 10,000 SNPs that were called in both omics types and calculated pair-wise genetic relatedness scores using GCTA. Data pairs with genetic relatedness scores larger than 0.65 were classified as highly related data pairs, and those pairs with smaller scores were classified as unrelated data pairs. For each comparison, we calculated the true positive rate and false negative rate. We did the same analysis for common (MAF>0.1) and for rare (MAF<0.1) variants.
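
The thresholding step and the evaluation on fully matched samples can be sketched as below. The sketch assumes that pair-wise scores (for example, from a GCTA GRM) have already been loaded into a dictionary keyed by the two data IDs; the function names and the data layout are illustrative only.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (data ID in omics type 1, data ID in omics type 2)


def highly_related_pairs(scores: Dict[Pair, float],
                         threshold: float = 0.65) -> List[Pair]:
    """Return the data pairs whose relatedness score exceeds the threshold."""
    return [pair for pair, score in scores.items() if score > threshold]


def tp_fn_rates(scores: Dict[Pair, float],
                threshold: float = 0.65) -> Tuple[float, float]:
    """True positive and false negative rates among pairs sharing an ID.

    Intended for the subsampling experiment, where only fully matched
    samples are used, so a shared ID means the same individual.
    """
    same_id = [s for (a, b), s in scores.items() if a == b]
    if not same_id:
        return float("nan"), float("nan")
    tp = sum(s > threshold for s in same_id)
    fn = len(same_id) - tp
    return tp / len(same_id), fn / len(same_id)
```

A pair such as `("S2", "S1")` with a score above the threshold is a highly related pair with different IDs, which is what the later steps treat as a mismatched data pair.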

Estimate switch directions and probabilities for mismatched data pairs

To estimate the possible switch direction and its probability for a mismatched data pair, the key is to determine which data is more likely to bear the true ID. We used a logistic regression model (Formula 1) to compare the two data in each mismatched data pair and to estimate the possible switch direction and its probability. Assuming a mismatched data pair “A” and “B”, if “B” is more likely to have the true ID, then the appropriate switch direction is from “A” to “B”. Three parameters (x1, x2, and x3, with values ranging from 0 to 1) were used in this model. The first parameter, x1, indicates which data matched with more data in other omics types (Formula 2). The second parameter, x2, indicates which data is more likely to have the correct reported sex (Formula 3); and the third, x3, is a user-defined parameter indicating the rank of omics priority. If two or more omics types were processed under the same conditions or procedures, they may share the same sample mislabeling; in that case, a user could combine these omics types into one type in the regression. If x3 is not specified, the rank of omics priority will be defined based on the proportion of data that have matched reported sex and genetics-based sex for each omics type (Formula 4). In formula 2, N is the total number of omics types, and na and nb are the numbers of matched data from other omics types for data “A” and “B”, respectively. In formula 3, Sa and Sb indicate the sex matching level for data “A” and “B”, respectively. Taking Sa as an example, it reflects 1) whether the reported sex and genetics-based sex of data “A” match, and 2) whether the reported sex of data “A” and the genetics-based sex of data “B” would match if ID “A” were assigned to data “B”. We assign a score of 0.5 for each of the two conditions that is met. In formula 4, Pa and Pb represent the proportion of data with matched reported sex and genetics-based sex in the omics types of data “A” and “B”, respectively.
A set of hand-picked high-confidence mismatched data pairs with well-defined switch directions was used as a training set for the logistic regression model. The directions of the high-confidence mismatched data pairs were determined based on sample relationships and sex matching. We connected all the highly related data pairs among multiple omics types and created multiple groups. We extracted the high-confidence mismatched data pairs that met the following conditions: 1) only one data in the group had an ID that differed from the others and needed to be corrected; and 2) the reported sex and genetics-based sex matched after correcting that ID based on the other data in the group. Training determined the values of β0, β1, β2, and β3. Then, the model was used to predict switch directions and probabilities for the rest of the data. For the results, if p > 0.5, the switch direction will be from “A” to “B”; conversely, if p < 0.5, the switch direction will be from “B” to “A”; and, if p = 0.5, the switch direction will be uncertain. The values of p for both directions (p > 0.5, p < 0.5) were normalized (each to a range from 0 to 1) to indicate the probabilities of switch directions.
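
How a fitted model turns the three parameters into a switch direction can be sketched as follows. The sketch assumes that Formula 1 has the standard logistic form p = 1 / (1 + exp(-(β0 + β1·x1 + β2·x2 + β3·x3))). The coefficient values used in any call are placeholders, not the trained values, and the rescaling 2·|p − 0.5| is only one possible reading of the normalization described above.

```python
import math
from typing import Sequence, Tuple


def switch_probability(x: Sequence[float], beta: Sequence[float]) -> float:
    """Logistic model with intercept beta[0] and one weight per parameter.

    x holds (x1, x2, x3); beta holds (beta0, beta1, beta2, beta3).
    """
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def switch_direction(p: float) -> Tuple[str, float]:
    """Map p to a switch direction and a confidence between 0 and 1.

    The rescaling 2 * |p - 0.5| is an assumption of this sketch; the text
    only states that p was normalized to a 0-1 range for each direction.
    """
    if p > 0.5:
        return "A->B", 2.0 * (p - 0.5)
    if p < 0.5:
        return "B->A", 2.0 * (0.5 - p)
    return "uncertain", 0.0
```

With placeholder coefficients `beta = (0.0, 2.0, 1.0, 0.5)` and `x = (1.0, 1.0, 1.0)`, the linear term is 3.5, so p is above 0.5 and the suggested switch is from “A” to “B”.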

A modified topological sorting algorithm to determine the potential IDs

We connected all highly related data pairs and generated multiple groups. In each group, each data was represented by a node, and each data pair was represented by an edge. Based on the logistic regression model results, the switch direction and probability for each mismatched data pair corresponded to the direction and weight of each edge. For each matched data pair, we did not assign a direction or weight to the edge. Since all the data in a group were supposed to share one unique ID, we used a modified topological sorting algorithm to sort all nodes in each group and determine the potential ID for each data. The modified topological sorting algorithm was based on indegrees and outdegrees weighted by the switch probabilities calculated by the logistic regression model. For each node in each group, we calculated the difference between its weighted indegree and weighted outdegree, and then sorted the nodes in the group by this difference. For each group, we used the data ID with the highest priority as the final ID for all data in the group.
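
The ranking step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DRAMS implementation: an edge from “A” to “B” is taken to mean that “B” is more likely to carry the true ID, so the node with the largest weighted indegree minus weighted outdegree is given the highest priority. The node names and the helper `potential_id` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str, float]  # (from_node, to_node, switch probability)


def rank_nodes(nodes: Iterable[str],
               edges: Iterable[Edge]) -> List[Tuple[str, float]]:
    """Sort nodes by weighted indegree minus weighted outdegree, descending."""
    score: Dict[str, float] = {node: 0.0 for node in nodes}
    for src, dst, weight in edges:
        score[dst] += weight  # contributes to the weighted indegree of dst
        score[src] -= weight  # contributes to the weighted outdegree of src
    return sorted(score.items(), key=lambda item: item[1], reverse=True)


def potential_id(node_ids: Dict[str, str], edges: Iterable[Edge]) -> str:
    """Return the data ID of the top-ranked node in one group."""
    top_node, _ = rank_nodes(node_ids.keys(), edges)[0]
    return node_ids[top_node]
```

For a group with nodes `{"WGS:A": "A", "RNA:A": "A", "ATAC:B": "B"}` and edges `("ATAC:B", "WGS:A", 0.8)` and `("ATAC:B", "RNA:A", 0.7)`, the ATAC-Seq node ranks last and the group is assigned ID “A”. Matched pairs contribute no edge, as in the text.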

Simulation data

We generated simulation data to test the performance of DRAMS in correcting data IDs (S1 Fig). First, we generated data IDs for different samples and omics types. A range of sample sizes (50, 100, 150, 200, 250, and 300) was used, with each dataset having 50% females and 50% males. The number of omics types ranged from three to six. Then, we randomly shuffled a portion (5%, 10%, 15%, 20%, 25%, or 30%) of the data IDs. In total, we generated 144 simulated datasets (six sample sizes × four numbers of omics types × six proportions of mix-ups). To mimic reality, we randomly introduced mislabeled sexes for 2% of samples in our simulation data. In addition, since in reality samples are often swapped within the same batch, we divided the samples into several batches of 25 samples each, and all data ID shuffling occurred within batches. For each set of simulation data, we corrected data IDs using DRAMS (stringent mode) and calculated the proportion of mix-ups that were successfully corrected or overcorrected. In stringent mode, we discarded the data groups with fewer than three data or with no shared IDs (all data IDs in a group are different). We simulated the process 100 times for each dataset.
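
The within-batch ID shuffling can be sketched as follows. This is a simplified stand-in for the authors' simulation: it mislabels data by swapping pairs of IDs inside a batch, applies the same mix-up proportion to every omics type independently, and leaves out the 2% sex mislabeling. All names and these design choices are assumptions of the sketch.

```python
import random
from typing import Dict, List, Tuple


def simulate_mixups(n_samples: int, n_omics: int, mixup_fraction: float,
                    batch_size: int = 25,
                    seed: int = 0) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return the true IDs and, per omics type, the (partly swapped) labels.

    labels[omics][i] is the ID attached to the data of sample i.
    """
    rng = random.Random(seed)
    true_ids = [f"S{i:03d}" for i in range(n_samples)]
    n_swaps = round(mixup_fraction * n_samples / 2)  # each swap mislabels two data
    labels: Dict[str, List[str]] = {}
    for k in range(n_omics):
        omics_labels = list(true_ids)
        # Indices still untouched, grouped by the batch they belong to.
        free = {start: list(range(start, min(start + batch_size, n_samples)))
                for start in range(0, n_samples, batch_size)}
        for _ in range(n_swaps):
            usable = [b for b, idx in free.items() if len(idx) >= 2]
            if not usable:
                break
            batch = rng.choice(usable)
            i, j = rng.sample(free[batch], 2)
            free[batch].remove(i)
            free[batch].remove(j)
            # Swap the two labels; both data now carry a wrong ID.
            omics_labels[i], omics_labels[j] = omics_labels[j], omics_labels[i]
        labels[f"omics{k + 1}"] = omics_labels
    return true_ids, labels
```

Calling `simulate_mixups(100, 3, 0.10)` yields three omics types with ten mislabeled data each, and every wrong ID comes from the same batch of 25 samples.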

Validate data ID corrections by racial assignment

To validate the tool, we started with genotypes at loci shared between BrainGVEX and the 1000 Genomes project (1000G)[8]. For 1000G, we used only the following populations: ASW (African Ancestry in Southwest US), CEU (Utah Residents with Northern and Western European Ancestry), CHB (Han Chinese in Beijing, China), and YRI (Yoruba in Ibadan, Nigeria). We first removed genotypes with MAF < 0.01. Then, we used GCTA[3] to estimate genetic relationships among individuals and performed PCA for the BrainGVEX and 1000G samples jointly. We used the 1000G samples located in the same PCA group as the reference, and compared the races of the BrainGVEX samples with this reference before and after correcting data IDs. Since BrainGVEX and 1000G used different nomenclatures for races, we used analogous names or close races between BrainGVEX and 1000G to align the data (S2 Table).

Detect QTLs before and after correcting mix-ups

We used FastQTL[17] to map QTLs within the BrainGVEX samples both before and after correcting data IDs. We defined the cis-QTL region as within 1 million base pairs between the SNP marker and the gene body. Since we intended only to test whether the number of cis-QTLs increased, only chromosome 1 was used to save computing time. The R package “qvalue”[18] was used for multiple testing correction. We tested four types of cis-QTL calculations: WGS with RNA-Seq, WGS with Ribo-Seq, PsychChip with RNA-Seq, and PsychChip with Ribo-Seq. For RNA-Seq and Ribo-Seq samples, we used log2-transformed CPM quantification data calculated by VOOM[19]. We selected 30 hidden factors as covariates using the PEER software[20] for RNA-Seq and Ribo-Seq samples. As one or more of the hidden factors estimated by PEER were significantly associated with known covariates (age of death, diagnosis, brain bank, ethnicity, and sex) (S8 Fig), we only included the 30 PEER factors in QTL analyses. We used two cutoffs (FDR < 0.05, FDR < 0.01) for the cis-QTL results.
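
The cis-window definition can be made concrete with a small Python check. It follows the wording above (distance between the SNP and the gene body) and is not FastQTL's own implementation; the function name and the coordinates in the example are hypothetical, and the SNP and gene are assumed to lie on the same chromosome.

```python
def in_cis_window(snp_pos: int, gene_start: int, gene_end: int,
                  window: int = 1_000_000) -> bool:
    """True if the SNP lies within `window` bp of the gene body."""
    if gene_start > gene_end:
        gene_start, gene_end = gene_end, gene_start
    if snp_pos < gene_start:
        distance = gene_start - snp_pos  # SNP upstream of the gene body
    elif snp_pos > gene_end:
        distance = snp_pos - gene_end    # SNP downstream of the gene body
    else:
        distance = 0                     # SNP inside the gene body
    return distance <= window
```

For a gene spanning positions 1,600,000 to 1,650,000, a SNP at 700,000 is 900 kb away and is tested, whereas a SNP at 500,000 is 1.1 Mb away and is not.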

Code availability

DRAMS was implemented in Python3 and the code was deposited in GitHub (https://github.com/Yi-Jiang/DRAMS). We released the data preprocessing code, including genotype calling, sample contamination checking, genetics-based sex inference, genetic relatedness score calculation, and extraction of highly related data pairs. We also provided a guideline for using Cytoscape to visualize sample relationships within networks [21]. The approximate computing resources required by each step are as follows. Step 1, map reads to reference genome (FASTQ to BAM, for one sample, 4Gb data size): 4 CPU, 15G memory, 1.5 hours. Step 2, call genotypes (BAM to VCF, for one sample, 4Gb data size): 12 CPU, 20G memory, 7 hours. Step 3, infer genetics-based sexes (for one sample): 1 CPU, 1G memory, 1 minute. Step 4, calculate genetic relatedness scores: 1 CPU, 1G memory, 10 minutes. Step 5, correct data IDs: 1 CPU, 1G memory, 5 minutes.

A flowchart of testing DRAMS in simulation data.

Simulation data were generated to test the performance of DRAMS. Step 2 to step 4 were repeated four times. (TIF)

Heterozygous proportion and VerifyBamID results implied possible contaminated samples.

Two methods (Heterozygous proportion and VerifyBamID) were used to estimate sample contamination for WGS, ATAC-Seq, RNA-Seq, and Ribo-Seq data. For VerifyBamID results, “FREEMIX” (0–1 scale) was used to indicate possible sample contamination. For PsychChip and Affymetrix samples, we calculated only heterozygous proportions. (TIF)

Distribution of genetic relatedness scores among omics types.

Genetic relatedness scores were calculated by GCTA. (TIF)

Increased concordance of reported sex and genetics-based sex in corrected data IDs.

Genetics-based sexes were inferred using Plink. Larger F-value indicated that the sample is more likely to be male. Ribo-Seq and Affymetrix samples were not shown since we were not able to infer the genetics-based sexes. (TIF)

Comparison of genetics-based sexes inferred from sex chromosome genotypes and XIST expression.

For sex chromosomes-inferred sexes, larger F-value indicates that the sample is more likely to be male. For XIST expression-inferred sex, the samples with XIST expression larger than zero are more likely to be female. The reported sexes were based on samples before ID correction. (TIF)

Genetic relatedness scores calculated using subsampled common SNPs.

Genetic relatedness scores were calculated based on common SNPs (MAF>0.1) randomly selected from BrainGVEX data. Only the samples that matched well among all omics types were used. We only showed a subset of randomly selected mismatched data pairs according to the number of matched data pairs. (TIF)

Sample relatedness scores calculated using subsampled rare SNVs.

Genetic relatedness scores were calculated based on randomly selected rare SNVs (MAF<0.1) from BrainGVEX data. Only the samples that matched well among all omics types were used. We only showed a subset of randomly selected mismatched data pairs according to the number of matched data pairs. (TIF)

Correlation between known covariates and PEER factors.

Spearman correlation tests were performed between PEER factors and ageDeath. One-way ANOVA tests were performed between PEER factors and Diagnosis, BrainBank, Ethnicity, and Sex. P values were marked for the cells with significant correlation (P value < 0.05). (TIF)

Sample list in BrainGVEX.

A total of 440 samples from BrainGVEX were used to correct data IDs using our DRAMS. The number of data produced from different omics types are shown for each sample. (XLSX)

Correspondence between BrainGVEX and 1000G races.

(XLSX)

Proportion of successfully corrected mix-ups in simulation data.

(XLSX)

Highly related data pairs in BrainGVEX.

Genetic relatedness scores were calculated using GCTA. The highly related data pairs were extracted using a 0.65 cutoff. (XLSX)

Corrected sample IDs in BrainGVEX.

We corrected a total of 201 sample IDs using DRAMS. PCA was performed for both BrainGVEX samples and 1000G samples using GCTA based on genetic relationships among the samples. The races in BrainGVEX samples and corresponding groups of 1000G were compared. The PCA results were shown in Fig 3.

Summary of highly related data pairs.

(XLSX)

Sensitivity and specificity in extracting highly related data pairs with subsampled common variants.

Only variants with MAF>0.1 were used. TP: true positive pairs, indicating data pairs with genetic relatedness score > 0.65 and with the same ID. TN: true negative pairs, indicating data pairs with genetic relatedness score < 0.65 and with different IDs. FP: false positive pairs, indicating data pairs with genetic relatedness score > 0.65 and with different IDs. FN: false negative pairs, indicating data pairs with genetic relatedness score < 0.65 and with the same ID. (XLSX)

Sensitivity and specificity in extracting highly related data pairs with subsampled rare variants.

Only variants with MAF<0.1 were used. (XLSX)

The effect of different genetic relatedness thresholds on the performance of DRAMS.

The PsychENCODE BrainGVEX data were used to assess the effect of different genetic relatedness thresholds on the performance of DRAMS. Only the samples with reported sex and race information and genetics-based sexes were used (WGS: 280, PsychChip: 263, RNA-Seq: 417, Ribo-Seq: 152, ATAC-Seq: 288). Gradient thresholds were used to extract highly related data pairs. (XLSX)

Comparison of GCTA and NGSCheckMate in estimating genetic relatedness.

(XLSX)

(DOCX)

27 Nov 2019

Dear Dr Liu,

Thank you very much for submitting your manuscript 'DRAMS: A Tool to Detect and Re-Align Mixed-up Samples for Integrative Studies of Multi-omics Data' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.

Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days.
We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.

Sincerely,

Ilya Ioshikhes, Associate Editor, PLOS Computational Biology

Thomas Lengauer, Methods Editor, PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: Jiang et al. developed a tool to detect and correct mixed-up samples in multi-omics data. The mix-up refers to switched IDs during the data generation process. Through analyzing the PsychENCODE BrainGVEX data, they corrected 12.5% mixed-up IDs, and this correction is shown to improve the discovery of cis-QTL. I think the paper addressed an important and interesting research question, and it was written in a convincing way. Below I list some comments that may improve the manuscript.

Major comments:

“We can cluster all the highly related data and consider that the data from one cluster have only one potential ID.” It was not sure how the clustering was done. Is it based on a clustering algorithm? What is the resulted cluster size? Is it restricted to be the number of data types? Is it necessary that each cluster has each data type? More details are expected to explain the clustering procedure.

In Fig 1 and S2, the range of genetic relatedness score exceeds 1. Is it really possible to exceed one or just because of density curve smoothing? If the latter, the authors may consider using a way to estimate the density within a restricted range.

Additionally, in Fig 1, “The text in each node represents the data ID. Different colors represent different omics types.” The pair of A-A in Step 1 Type 1 are in the same color. Should A-A be in different colors?

In Fig 2, the authors assessed the proportions of successfully corrected mix-ups. I am wondering if there are non-mix-ups are wrongly classified as mix-ups and then overcorrected. For example, if using the cutoff of 0.65 of genetic relatedness scores for highly related data pairs, are there any non-mix-ups wrongly classified as mismatched data pairs? It may be related to the section of "Sensitivity and specificity in extracting highly related data pairs," but they are in different simulation settings.

In the simulation, how are the different data types simulated? How is the logistic regression fitted? Is it the same fitted model based on 44 data pairs as in the real data analysis?

The overcorrection may be an issue in the analysis of PsychENCODE data. “We identified a total of 1971 matched pairs and 518 mismatched pairs… In the end, we corrected 201 (12.5%) IDs for data of the six omics types.” There percentage of mismatched pairs and corrected IDs seems high.

Minor comments:

Fig 4 has a lot of abbreviations. Providing the full names in the caption would help reading.

The Supplementary Figures need captions.

“Since we intend to test whether the number of cis-QTLs increased, only chromosome 1 was used to save computing time.” This should be at least noted in the caption of Table 2 to help interpret the number of eQTLs.

Page 19, “In formula 3, Sa and Sb indicate the sex matching level for data “A” and “B”, respectively. Taking Sa as an example, **the original value of Sa would be 0**. If the reported sex and genetics-based sex in data “A” are matched, Sa would be **plus 0.5**. If the reported sex of data “A” and the genetics-based sex of data “B” are matched, which means that the reported sex and genetics-based sex are matched after switch ID from “B” to “A”, then Sa would be **plus 0.5**.” Could you explain more why the values of Sa are set in this way? It seems hard to understand and confusing.

Grammar error. Page 2: “As the number of datasets to be integrated increaseS” Page 3: “all omics data of the same samples were clusterED together” Page 4: “For the first step is to build highly related data pairs.” Page 18: “those pairS with smaller scores were classified”

Reviewer #2: Comments to authors

The manuscript by Yi and colleagues, “DRAMS: A Tool to Detect and Re-Align Mixed-up Samples for Integrative Studies of Multi-omics Data”, presents a new method to detect and re-align mixed-up samples in multi-omics studies. The authors calibrated DRAMS using simulations and applied their method to the data from PsychENCODE BrainGVEX project to correct 201 sample IDs. They further tried to validated the results by comparing PCA and eQTL results before and after correcting sample IDs. The experiment is well designed and the analyses were carefully conducted. As the increasing emergency of multi-omics data, this method would be useful.

Major comments:

1. The authors should compare DRAMS with state-of-the-art methods of this kind in both simulation and real data analyses, such as MixupMapper.

2. DRAMS needs a set of hand-picked high-confidence mismatched data pairs as a training set. How many pairs with certain switch directions are required in the training process? What if there are no such mismatched data pairs?

3. The authors used a relatedness score of 0.65 to define highly related data pairs. Is it possible that the related individual pairs (due to relatedness rather than the same individual in different data types) have the relatedness score > 0.65? More justification is required here.

4. The authors claimed that more types of omics data were included, the power is higher. However, it would increase the computational complexity. In this case, how about the computational burden? It would be useful to summary the resources requirement (e.g., running time, memory usage) under different simulation scenarios.

5. In the eQTL analysis, did the authors correct for ancestry and sex? In addition, as discussed by the authors, large number of QTLs does not necessarily mean correct data ID. It would be useful to replicate the new discovery of eQTLs in an independent data set.

Minor comments:

1. Figure 1 is quite busy. It is difficult for the readers to follow. Detailed legend is needed.

2. Is it a typo of “404” mismatched data pairs, given 518 mismatched pairs in total and 44 pairs with certain switch directions?

3. In Table S4, there were 80 data pairs with the same ID were defined as “mismatched pairs”. It is not clear of the definition of “matched pairs” and “mismatched pairs” in Fig. 3. The authors also mentioned that eight data were not related to any other data. It does not make sense that only eight data are not related to any other data among all samples.

4. In the validation analysis using PCA, after correcting for mismatched IDS, the ancestry of all the other samples were matched correctly?

**********

Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: None

Reviewer #2: Yes

**********

Do you want your identity to be public for this peer review?

Reviewer #1: No

Reviewer #2: No

23 Jan 2020

Submitted filename: Responce to reviewers.docx
9 Feb 2020

Dear Liu,

Thank you very much for submitting your manuscript "DRAMS: A Tool to Detect and Re-Align Mixed-up Samples for Integrative Studies of Multi-omics Data" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations. Please prepare and submit your revised manuscript within 30 days.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Ilya Ioshikhes, Associate Editor, PLOS Computational Biology

Thomas Lengauer, Methods Editor, PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: My comments have been addressed.

Reviewer #2: The authors addressed most of my concerns except the my question #3. It is nice that the authors added a paragraph (lines 276-385) to suggest the use of stringent relatedness score in family data. However, it is still not clear how the relatedness score will affect the result and how to choose the relatedness score in the real data analysis. It is better to test the method using a variety of relatedness score.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: None

Reviewer #2: Yes

**********

Do you want your identity to be public for this peer review?

Reviewer #1: No

Reviewer #2: No

16 Feb 2020

Submitted filename: Responce to reviewers.docx

28 Feb 2020

Dear Liu,

We are pleased to inform you that your manuscript 'DRAMS: A Tool to Detect and Re-Align Mixed-up Samples for Integrative Studies of Multi-omics Data' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Ilya Ioshikhes, Associate Editor, PLOS Computational Biology

Thomas Lengauer, Methods Editor, PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #2: My comments have been addressed.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #2: Yes

**********

Do you want your identity to be public for this peer review?

Reviewer #2: No

24 Mar 2020

PCOMPBIOL-D-19-01762R2

DRAMS: A Tool to Detect and Re-Align Mixed-up Samples for Integrative Studies of Multi-omics Data

Dear Dr Liu,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology.
Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Bailey Hanna PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Table 1

Summary of samples and sample corrections in BrainGVEX.

Data type	Number of samples	Number of contaminated samples	Number of sex-matched samples*	Number of sex-mismatched samples	Number of samples missing sex information	Proportion of sex-matched samples	Number of samples with switched IDs	Number of samples unrelated to any other	Number of sex-mismatched samples after correcting IDs
WGS	285	1	256	24	5	0.914	54 (19.0%)	0	0
PsychChip	263	0	244	19	0	0.928	43 (16.3%)	0	0
Affymetrix	137	0	-	-	-	-	0	1	-
ATAC-Seq	295	0	260	28	7	0.903	55 (18.6%)	3	15
RNA-Seq	426	0	414	3	9	0.993	3 (0.7%)	0	3
Ribo-Seq	197	0	-	-	-	-	50 (25.4%)	4	-
Total	1603	1	1174	74	21	-	201 (12.5%)	8	18

Note: “Number of sex-matched samples” indicates the number of samples with the same reported sex and SNP-inferred sex.
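The "Proportion of sex-matched samples" column can be reproduced from the counts in Table 1. The sketch below is a minimal check, assuming the proportion is matched / (matched + mismatched) with samples missing sex information excluded; that formula is inferred from the table values rather than stated in the table, and the names `TABLE1_SEX_COUNTS` and `sex_match_proportion` are hypothetical.

```python
# Counts copied from Table 1: (sex-matched, sex-mismatched) per data type.
# Affymetrix and Ribo-Seq are omitted because the table reports "-" for them.
TABLE1_SEX_COUNTS = {
    "WGS": (256, 24),
    "PsychChip": (244, 19),
    "ATAC-Seq": (260, 28),
    "RNA-Seq": (414, 3),
}


def sex_match_proportion(matched: int, mismatched: int) -> float:
    """Assumed formula: share of samples whose reported sex agrees with the
    SNP-inferred sex, among samples that have sex information."""
    return matched / (matched + mismatched)


# Rounded to three decimals, as in the table.
proportions = {
    data_type: round(sex_match_proportion(m, mm), 3)
    for data_type, (m, mm) in TABLE1_SEX_COUNTS.items()
}

for data_type, p in proportions.items():
    print(f"{data_type}\t{p:.3f}")
```

Under this assumption the rounded values agree with the table for all four data types that report sex information.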

Table 2

Increased number of cis-QTLs after correcting data IDs.

QTL type	Category	Before correcting IDs	After correcting IDs	Fold change	Novel QTLs after correcting IDs (π₁ in GTEx*)	Discarded QTLs after correcting IDs (π₁ in GTEx*)
WGS vs. RNA-Seq	Sample size	278	273	-	-	-
	#QTLs (FDR<0.01)	57,209	96,242	1.68	43,266 (0.608)	4,233 (0.246)
	#QTLs (FDR<0.05)	90,231	147,942	1.64	66,993 (0.475)	9,282 (0.213)
WGS vs. Ribo-Seq	Sample size	191	187	-	-	-
	#QTLs (FDR<0.01)	18,178	31,345	1.72	-	-
	#QTLs (FDR<0.05)	30,641	48,306	1.58	-	-
PsychChip vs. RNA-Seq	Sample size	259	253	-	-	-
	#QTLs (FDR<0.01)	48,742	76,995	1.58	31,801 (0.638)	3,548 (0.581)
	#QTLs (FDR<0.05)	77,925	117,711	1.51	49,682 (0.519)	9,896 (0.246)
PsychChip vs. Ribo-Seq	Sample size	177	172	-	-	-
	#QTLs (FDR<0.01)	15,028	20,447	1.36	-	-
	#QTLs (FDR<0.05)	26,350	32,209	1.22	-	-
Total	#QTLs (FDR<0.01)	139,157	225,029	1.62	59,399	17,020
Total	#QTLs (FDR<0.05)	225,147	346,168	1.54	95,495	34,684

Note: Only chromosome 1 was used to save computing time.

* π₁ was estimated using the “qvalue” package in R.
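The fold-change column in Table 2 appears to be the ratio of QTL counts after versus before ID correction. The sketch below recomputes it for the FDR<0.01 rows, assuming that the "Total" row is a pooled ratio (sum of after counts divided by sum of before counts) rather than a mean of the per-type ratios; this reading is inferred from the numbers, and the names `QTL_FDR01` and `fold_change` are hypothetical.

```python
# cis-QTL counts at FDR<0.01 copied from Table 2: (before, after) correcting IDs.
QTL_FDR01 = {
    "WGS vs. RNA-Seq": (57209, 96242),
    "WGS vs. Ribo-Seq": (18178, 31345),
    "PsychChip vs. RNA-Seq": (48742, 76995),
    "PsychChip vs. Ribo-Seq": (15028, 20447),
}


def fold_change(before: int, after: int) -> float:
    """Ratio of QTL counts after vs. before correcting data IDs."""
    return after / before


# Per-comparison fold changes, rounded to two decimals as in the table.
per_type = {
    name: round(fold_change(before, after), 2)
    for name, (before, after) in QTL_FDR01.items()
}

# Pooled fold change: total QTLs after correction over total before.
total_before = sum(before for before, _ in QTL_FDR01.values())
total_after = sum(after for _, after in QTL_FDR01.values())
pooled = round(fold_change(total_before, total_after), 2)

for name, fc in per_type.items():
    print(f"{name}\t{fc:.2f}")
print(f"Total\t{total_before}\t{total_after}\t{pooled:.2f}")
```

With these inputs the pooled ratio matches the 1.62 reported in the Total row, which suggests the summary figure is computed over summed counts.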


Introduction

Results

Design of DRAMS

Illustration of key steps.

Performance of DRAMS using simulation data

Performance of DRAMS in simulation data.

Performance of DRAMS using real data from the PsychENCODE BrainGVEX project

Data summary

Check sample alignment based on genetics-based sex

Detect and correct data IDs

Summary of highly related data pairs.

Validate data ID corrections by race group assignment

Validation of cross-races switched data.

Increased number of cis-QTLs after correcting data IDs

Sensitivity and specificity in extracting highly related data pairs

Discussion

Materials and methods

Sample resources

Genotype calling from data of each omics type

Estimation of sample contamination

Infer genetics-based sexes

Estimate genetic relatedness and extract highly-related data pairs

Estimate switch directions and probabilities for mismatched data pairs

A modified topological sorting algorithm to determine the potential IDs

Simulation data

Validate data ID corrections by racial assignment

Detect QTLs before and after correcting mix-ups

Code availability

A flowchart of testing DRAMS in simulation data.

Heterozygous proportion and VerifyBamID results implied possible contaminated samples.

Distribution of genetic relatedness scores among omics types.

Increased concordance of reported sex and genetics-based sex in corrected data IDs.

Comparison of genetics-based sexes inferred from sex chromosome genotypes and XIST expression.

Genetic relatedness scores calculated using subsampled common SNPs.

Sample relatedness scores calculated using subsampled rare SNVs.

Correlation between known covariates and PEER factors.

Sample list in BrainGVEX.

Correspondence between BrainGVEX and 1000G races.

Proportion of successfully corrected mix-ups in simulation data.

Highly related data pairs in BrainGVEX.

Corrected sample IDs in BrainGVEX.

Sensitivity and specificity in extracting highly related data pairs with subsampled common variants.

Sensitivity and specificity in extracting highly related data pairs with subsampled rare variants.

The effect of different genetic relatedness thresholds on the performance of DRAMS.

Comparison of GCTA and NGSCheckMate in estimating genetic relatedness.
