Brent S Pedersen, Preetida J Bhetariya, Joe Brown, Stephanie N Kravitz, Gabor Marth, Randy L Jensen, Mary P Bronner, Hunter R Underhill, Aaron R Quinlan.
Abstract
BACKGROUND: When interpreting sequencing data from multiple spatial or longitudinal biopsies, detecting sample mix-ups is essential, yet more difficult than in studies of germline variation. In most genomic studies of tumors, genetic variation is detected through pairwise comparisons of the tumor and a matched normal tissue from the sample donor. In many cases, only somatic variants are reported, which hinders the use of existing tools that detect sample swaps solely based on genotypes of inherited variants. To address this problem, we have developed Somalier, a tool that operates directly on alignments and does not require jointly called germline variants. Instead, Somalier extracts a small sketch of informative genetic variation for each sample. Sketches from hundreds of germline or somatic samples can then be compared in under a second, making Somalier a useful tool for measuring relatedness in large cohorts. Somalier produces both text output and an interactive visual report that facilitates the detection and correction of sample swaps using multiple relatedness metrics.
Year: 2020 PMID: 32664994 PMCID: PMC7362544 DOI: 10.1186/s13073-020-00761-2
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Fig. 1. Comparing genotype sketches to compute relatedness measures for pairs of samples. a Observed counts for the reference (Ref.) and alternate (Alt.) allele at each of the 17,766 tested loci are converted into genotypes (see main text for details) to create a “sketch” for each sample. b The genotypes for each sample are then converted into three bit vectors: one for homozygous reference (HOMREF) genotypes, one for heterozygous (HET) genotypes, and one for homozygous alternate (HOMALT) genotypes. The length of each vector, counted in 64-bit words, is the total number of autosomal variants in the sketch (i.e., 17,384) divided by 64, and each bit is set to 1 if the sample has the particular genotype at the given variant site. For example, four variant sites are shown in b: the hypothetical individual has a homozygous alternate genotype for the second variant (the corresponding bit is set to 1), but is not homozygous for the alternate allele at the other three variant sites (the corresponding bits are set to 0). c The bit vectors for a pair of samples can be compared to calculate relatedness measures such as identity-by-state zero (IBS0, where zero alleles are shared between two samples) through efficient bitwise operations on the bit arrays for the relevant genotypes.
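The bit-vector comparison described in the Fig. 1 caption can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Somalier's actual implementation: the function names, the integer genotype encoding (0, 1, 2, with -1 for unknown), the least-significant-bit-first layout, and rounding the word count up are all hypothetical choices made for the example.

```python
from typing import Dict, List

WORD_BITS = 64  # the caption divides the site count by 64, suggesting 64-bit words


def n_words(n_sites: int) -> int:
    """Words needed to hold n_sites bits (rounding up is an assumption)."""
    return (n_sites + WORD_BITS - 1) // WORD_BITS


def to_bit_vectors(genotypes: List[int]) -> Dict[str, List[int]]:
    """Build HOMREF/HET/HOMALT bit vectors from genotypes coded 0/1/2 (-1 = unknown)."""
    names = {0: "HOMREF", 1: "HET", 2: "HOMALT"}
    vectors = {name: [0] * n_words(len(genotypes)) for name in names.values()}
    for site, gt in enumerate(genotypes):
        if gt in names:  # unknown genotypes set no bit in any vector
            vectors[names[gt]][site // WORD_BITS] |= 1 << (site % WORD_BITS)
    return vectors


def popcount_and(a: List[int], b: List[int]) -> int:
    """Count bits set in both vectors, one word at a time."""
    return sum(bin(x & y).count("1") for x, y in zip(a, b))


def ibs0(s1: Dict[str, List[int]], s2: Dict[str, List[int]]) -> int:
    """Sites where one sample is HOMREF and the other is HOMALT."""
    return (popcount_and(s1["HOMREF"], s2["HOMALT"])
            + popcount_and(s1["HOMALT"], s2["HOMREF"]))


# Four hypothetical sites; sample_a is HOMALT only at the second site, as in panel b.
sample_a = to_bit_vectors([0, 2, 1, 0])
sample_b = to_bit_vectors([2, 0, 1, 0])
print(sample_a["HOMALT"])        # one word with only bit 1 set
print(ibs0(sample_a, sample_b))  # opposite homozygotes at the first two sites
```

Because each comparison reduces to a bitwise AND followed by a bit count over a few hundred words, comparing all pairs of sketches stays cheap even for large cohorts.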
Fig. 2. Glioma samples before and after correction. a A comparison of the IBS0 (number of sites where one sample is homozygous reference and the other is homozygous alternate) and IBS2 (number of sites where the samples have the same genotype) metrics for 15 samples. Each point is a pair of samples. Points are positioned by the values calculated from the alignment files (observed relatedness) and colored by whether they are expected to be identical (expected relatedness), as indicated from the command line. In this case, sample swaps are visible as orange points that cluster with green points, and vice versa. The user can hover over each point to see the sample pair involved and can change the X and Y axes to any of the metrics calculated by Somalier. b An updated version of the plot in a, produced by re-running Somalier after the sample identities were corrected in the manifest (per the information provided by a).
Fig. 3. Relatedness plot for 1000 Genomes samples. Each dot represents a pair of samples. IBS0, on the x-axis, is the number of sites where one sample is homozygous for the reference allele and the other is homozygous for the alternate allele. IBS2, on the y-axis, is the number of sites where a pair of samples were both homozygous or both heterozygous. Points with IBS0 of 0 are parent-child pairs. The four points with 0 < IBS0 < 450 are sibling pairs. There are also several more distantly related sample pairs.
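The reason parent-child pairs sit at IBS0 = 0 is that a child inherits one allele from each parent, so, barring genotyping errors, a parent and child can never be opposite homozygotes. The toy example below illustrates this with plain lists. The genotype coding (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate), the function name, and the six-site trio are hypothetical, and IBS2 is taken here to mean "same genotype", following the Fig. 2 caption.

```python
from typing import List, Tuple


def ibs_counts(g1: List[int], g2: List[int]) -> Tuple[int, int]:
    """Return (IBS0, IBS2) for two genotype lists coded 0/1/2."""
    # IBS0: one sample homozygous reference, the other homozygous alternate.
    ibs0 = sum(1 for a, b in zip(g1, g2) if {a, b} == {0, 2})
    # IBS2: both samples have the same known genotype.
    ibs2 = sum(1 for a, b in zip(g1, g2) if a == b and a in (0, 1, 2))
    return ibs0, ibs2


# Hypothetical trio: every child genotype is consistent with Mendelian inheritance.
parent1 = [0, 1, 2, 2, 0, 1]
parent2 = [2, 1, 0, 2, 1, 0]
child = [1, 2, 1, 2, 0, 1]

print(ibs_counts(parent1, child))    # IBS0 is 0 for a parent-child pair
print(ibs_counts(parent2, child))    # IBS0 is 0 for a parent-child pair
print(ibs_counts(parent1, parent2))  # unrelated parents can have IBS0 > 0
```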
Fig. 4. Sex quality control on 1000 Genomes samples. Each point is a sample, colored orange if the sample is indicated as female and green if it is indicated as male; site counts are for the X chromosome. a The number of homozygous alternate sites (x-axis) versus the number of heterozygous sites (y-axis). Males and females separate with few exceptions. b The number of homozygous alternate sites (x-axis) versus the mean depth on the Y chromosome (y-axis). Males and females reported in the manifest separate perfectly, indicating that some females may have experienced a complete loss of the X chromosome.
Speed comparison to KING. The extract step consists of conversion to a sketch for Somalier and of conversion to a plink binary bed file for KING. The relate step is the time spent measuring kinship between all pairs of samples. Times shown reflect the wall time required for completion
| Step | Somalier (wall time) | KING/plink (wall time) |
|---|---|---|
| Extract | 17 min 48 s | 812 min 40 s |
| Relate | 8 s | 31 min 34 s |
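The speedups implied by the table can be worked out directly from the reported wall times. The short script below is only arithmetic on the figures above; the `to_seconds` helper and the dictionary layout are hypothetical conveniences for the example.

```python
import re


def to_seconds(text: str) -> int:
    """Parse wall times written like '17 min 48 s' or '8 s' into seconds."""
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"(\d+)\s*s\b", text)
    total = int(minutes.group(1)) * 60 if minutes else 0
    total += int(seconds.group(1)) if seconds else 0
    return total


# Wall times copied from the table: (Somalier, KING/plink).
wall_times = {
    "Extract": ("17 min 48 s", "812 min 40 s"),
    "Relate": ("8 s", "31 min 34 s"),
}

speedups = {
    step: to_seconds(king) / to_seconds(somalier)
    for step, (somalier, king) in wall_times.items()
}

for step, ratio in speedups.items():
    print(f"{step}: Somalier is about {ratio:.0f}x faster")
```

On these numbers, the extract step is roughly 46 times faster and the relate step roughly 237 times faster with Somalier.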