Hana Rozhoňová, Daniel Danciu, Stefan Stark, Gunnar Rätsch, André Kahles, Kjong-Van Lehmann.
Abstract
MOTIVATION: Several recently developed single-cell DNA sequencing technologies enable whole-genome sequencing of thousands of cells. However, the ultra-low coverage of the sequenced data (<0.05× per cell) mostly limits their use to the identification of copy number alterations in multi-megabase segments. Many tumors are not copy number-driven, and thus single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view of intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Thus, we have developed a new approach that efficiently clusters tumor cells by applying a Bayesian filter to select relevant loci and by exploiting read overlap and phasing.
Year: 2022 PMID: 35900151 PMCID: PMC9477524 DOI: 10.1093/bioinformatics/btac510
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. The SECEDO pipeline. After sequencing, reads are piled up per locus and a Bayesian filter eliminates loci that are unlikely to carry a somatic SNV. For each pair of reads, SECEDO compares the filtered loci and updates the likelihoods of having the same genotype and of having different genotypes for the corresponding cells. The similarity matrix, computed as described in Section 2, is then used to cluster the cells into 2–4 groups (the number of groups depends on the data and is determined automatically by SECEDO) using spectral clustering. The algorithm is then recursively applied to each cluster until a termination criterion is reached.
Fig. 2. Illustration of an overlap between two reads. The shaded positions are those chosen as informative. In this example, the length of the overlap is 3, the number of positions where the bases are the same, x_s, is 2, and the number of positions where they differ, x_d, is 1. For our purposes, an overlap is fully described by the tuple (x_s, x_d).
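The overlap tuple described in Fig. 2 can be illustrated with a minimal sketch. The function name `overlap_counts`, the representation of a read as a dictionary from genomic position to base, and the example positions and bases are all assumptions made for illustration; they are not taken from the SECEDO implementation.

```python
def overlap_counts(read_a, read_b, informative):
    """Return (x_s, x_d) for two reads.

    read_a, read_b: dict mapping genomic position -> observed base
                    (a hypothetical representation, for illustration only).
    informative:    iterable of positions kept by the filtering step.
    """
    x_s = 0  # informative positions where both reads show the same base
    x_d = 0  # informative positions where the reads show different bases
    for pos in informative:
        # Only positions covered by both reads belong to the overlap.
        if pos in read_a and pos in read_b:
            if read_a[pos] == read_b[pos]:
                x_s += 1
            else:
                x_d += 1
    return x_s, x_d


# Hypothetical reads mirroring the figure: three shared informative positions,
# two with matching bases and one with differing bases.
read_a = {100: "A", 105: "C", 110: "G", 120: "T"}
read_b = {105: "C", 110: "G", 120: "A", 130: "T"}
informative = [105, 110, 120]

print(overlap_counts(read_a, read_b, informative))  # (2, 1)
```

Positions covered by only one of the two reads (100 and 130 here) do not contribute to either count.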
Fig. 3. Clustering a synthetic dataset with nine unequally sized subclones totaling 7250 cells. Top: Theoretical phylogenetic tree of the dataset. Edge labels indicate the number of additional SNVs in each subclone relative to the parent; node labels indicate the number of cells in each subclone. Bottom: Recursive clustering by SECEDO. Each node corresponds to one SECEDO clustering step; the first row indicates the subclones assigned to that node, the second row the number of recovered cells out of the total, and the third row the clustering precision (correctly clustered cells relative to total cells in the cluster). The scatter plots above parent nodes depict the second and third eigenvectors of the similarity matrix Laplacian. For leaf nodes, SECEDO correctly determined that further clustering is not desirable.
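The two per-node quantities reported in Fig. 3 (recovered cells out of the total, and precision) can be computed as in the following sketch. The function name `node_precision_recall`, the dictionary-based inputs, and the toy cell labels are hypothetical and chosen only to make the definitions concrete.

```python
def node_precision_recall(assigned, truth, cluster, subclones):
    """Precision and recall for one cluster of a clustering step.

    assigned:  dict cell id -> cluster id produced by the clustering
    truth:     dict cell id -> true subclone label
    cluster:   the cluster id to evaluate
    subclones: set of true subclones that this cluster should contain
    (All names are illustrative assumptions.)
    """
    in_cluster = [cell for cell, k in assigned.items() if k == cluster]
    # Cells in the cluster that truly belong to one of the expected subclones.
    correct = sum(1 for cell in in_cluster if truth[cell] in subclones)
    # All cells that truly belong to the expected subclones.
    total_true = sum(1 for label in truth.values() if label in subclones)
    precision = correct / len(in_cluster) if in_cluster else 0.0
    recall = correct / total_true if total_true else 0.0
    return precision, recall


# Toy example: subclone "A" has three cells, two of which land in cluster 0,
# together with one cell from subclone "B".
truth = {"c1": "A", "c2": "A", "c3": "A", "c4": "B", "c5": "B"}
assigned = {"c1": 0, "c2": 0, "c3": 1, "c4": 1, "c5": 0}

precision, recall = node_precision_recall(assigned, truth, 0, {"A"})
```

Here both values equal 2/3: two of the three cells in cluster 0 are correct, and two of the three true "A" cells were recovered.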
Fig. 4. Minimum required coverage for successful clustering (>90% precision and recall) of sub-clones differing in the given number of SNVs, in three scenarios: clustering 1000 cells, with a split, with an equal split, and clustering 2000 cells with an equal split. The shaded area marks the coverage currently achievable in practice. The top labels indicate the cancer type with median mutation rate closest to the given SNV density [cancer mutation rates according to Lawrence ].
Fig. 5. Clustering of the five tumor sections in the 10x Genomics ductal carcinoma dataset. The first row in each node denotes the cluster name; for consistency, we used the same cluster numbering as CHISEL (https://github.com/raphael-group/chisel-data/). The second row denotes the number of cells recovered by SECEDO versus the total number of cells as identified by CHISEL. The last row denotes the precision of the clustering, i.e. the percentage of cells in the SECEDO cluster that match the originally reported cluster. The lower precision values are due to the fact that cells categorized by CHISEL as ‘None’ based on the CNV signature are assigned a category by SECEDO based on the genomic signature. The first section (SliceA) consists mainly of healthy cells, as reflected by the scatter plot of the second and third eigenvectors of the similarity matrix Laplacian.
Fig. 6. Adjusted Rand Index (ARI) scores for the SECEDO and SBMClone clusterings of Slice B and of all slices of the breast cancer dataset, at coverage ranging from 0.03× to 0.3×. The shaded area marks the average per-cell coverage achievable with current technology. Note that, due to various factors such as cell merging and the lack of ground truth, the accuracy is not expected to increase monotonically with coverage.
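For readers unfamiliar with the score used in Fig. 6, the following sketch computes the Adjusted Rand Index from two label vectors using the standard contingency-table formula. It is a generic, standard-library illustration and not the evaluation code used for the figure; the function name and the toy labels are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no cell pairs to compare; treat as perfect agreement

    # Number of cell pairs placed together in both clusterings,
    # in the first clustering, and in the second clustering.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2          # upper bound of the index
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both clusterings have a single cluster
    return (together_both - expected) / (maximum - expected)


# The score ignores how clusters are named: a relabeled but identical
# partition scores 1, while a partition that cuts across the other scores
# below 0 (worse than chance).
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is corrected for chance, values near 0 indicate agreement no better than random assignment, which makes it suitable for comparing methods that return different numbers of clusters.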