Yixin Chen, Sijie Chen, Xuegong Zhang.
Abstract
BACKGROUND: High-throughput single-cell transcriptomic technology produces massive high-dimensional data, enabling high-resolution cell type definition and identification. To uncover the expression patterns beneath these data, an algorithm that searches the transcriptional landscape at the single-cell level is desirable.
Keywords: Cell searching; DenseFly; Locality sensitive hashing; scRNA-seq
Year: 2020 PMID: 33327944 PMCID: PMC7739457 DOI: 10.1186/s12864-020-6651-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1. Results of the cell type identification experiment. The x-axis is the hash length and the y-axis is the average Cohen’s Kappa score from five-fold cross-validation. The performance of all algorithms improves as the hash length increases. DenseFly outperforms FlyHash and SimHash at every hash length tested and still reaches a high score with a relatively short hash length. The experiment demonstrates the feasibility of DenseFly for similarity search on scRNA-seq data.
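
The score plotted in Fig. 1 is Cohen’s Kappa, which corrects the raw agreement between predicted and true cell types for the agreement expected by chance. The following is a minimal sketch of the standard formula, kappa = (p_o - p_e) / (1 - p_e); the function name `cohen_kappa` and the toy labels are illustrative assumptions, not code from the paper.

```python
from collections import Counter


def cohen_kappa(true_labels, predicted_labels):
    """Cohen's Kappa between two label sequences of equal length."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(true_labels)

    # Observed agreement: fraction of cells whose predicted type matches.
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n

    # Chance agreement: product of the marginal label frequencies, summed
    # over labels (a Counter returns 0 for labels it has never seen).
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)

    if expected == 1.0:
        # Both sequences use a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example with hypothetical cell type labels.
truth = ["T", "T", "B", "B"]
perfect = ["T", "T", "B", "B"]
chance = ["T", "B", "T", "B"]
print(cohen_kappa(truth, perfect))
print(cohen_kappa(truth, chance))
```

A score of 1 means perfect agreement and a score near 0 means agreement no better than chance, which is why Kappa is more informative than raw accuracy when cell types are imbalanced.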
Time consumption per hundred queries (unit: seconds)
| Reference size | SimHash | FlyHash | DenseFly |
|---|---|---|---|
| 200 | 0.012 | 0.012 | 0.012 |
| 400 | 0.017 | 0.018 | 0.018 |
| 800 | 0.025 | 0.026 | 0.024 |
| 1000 | 0.026 | 0.029 | 0.028 |
| 2000 | 0.040 | 0.042 | 0.041 |
| 4000 | 0.097 | 0.105 | 0.099 |
| 8000 | 0.156 | 0.157 | 0.158 |
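
To make concrete what a query against a hashed reference involves, here is a minimal sketch of SimHash-style search: each cell is projected onto random Gaussian directions, the signs of the projections form a binary hash, and a query is matched to the reference cell with the smallest Hamming distance. This is the generic random-hyperplane scheme written in plain Python for illustration, not the implementation timed in the table; all function names and the toy data are assumptions.

```python
import random


def make_projections(n_genes, hash_length, seed=0):
    """Draw one random Gaussian direction per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for _ in range(hash_length)]


def simhash(cell, projections):
    """Binary hash: one bit per projection, set when the dot product is >= 0."""
    return tuple(
        1 if sum(w * x for w, x in zip(row, cell)) >= 0 else 0
        for row in projections
    )


def hamming(a, b):
    """Number of positions at which two hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest(query_hash, reference_hashes):
    """Index of the reference hash closest to the query (linear scan)."""
    return min(range(len(reference_hashes)),
               key=lambda i: hamming(query_hash, reference_hashes[i]))


# Toy reference of 5 hypothetical cells with 20 genes each.
rng = random.Random(1)
reference = [[rng.random() for _ in range(20)] for _ in range(5)]
projections = make_projections(n_genes=20, hash_length=32)
reference_hashes = [simhash(cell, projections) for cell in reference]

# Querying with a reference cell itself gives Hamming distance 0.
query_hash = simhash(reference[0], projections)
print(nearest(query_hash, reference_hashes))
```

In this sketch the linear scan makes the cost of one query grow with the number of reference cells and with the hash length.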
Fig. 2. Bar plot of the batch effect experiment. The x-axis shows the experiment designs and the y-axis is the Cohen’s Kappa score. Blue bars represent DenseFly, yellow bars FlyHash, and gray bars SimHash. DenseFly is the most robust to batch effects among the three methods, and SimHash the least. FlyHash achieves performance similar to that of DenseFly, but only with large hash lengths. In general, cross-batch mappings score lower than no-batch mappings, except in DenseFly’s results. There is no significant difference between “1 to 2” and “2 to 1”, which is consistent with our simulation settings.
Fig. 3. Results of the dropout experiments. Panels a-d show how the Cohen’s Kappa score changes with the dropout rate at different hash lengths (32, 64, 128, 256). In each panel, the x-axis is the dropout rate and the y-axis is the Cohen’s Kappa score. As expected, the performance of all three algorithms decreases as the dropout rate increases. DenseFly always outperforms the others and has a stable plateau where the Cohen’s Kappa score decreases slowly while the dropout rate is small, particularly when the hash length is 256. The experiments show that DenseFly is robust to dropout events. Note that the original data without dropout already has 45% zero elements in the expression matrix, so the SIM III-5 dataset (dropout rate = 53.6%) is extremely sparse (over 98% of its elements are zeros); all algorithms perform poorly on it because little information remains in the dataset.
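
The sparsity figure quoted in the Fig. 3 caption can be checked with simple arithmetic. The sketch below assumes, as an interpretation not stated in this record, that the dropout rate is expressed as a fraction of all matrix entries and therefore adds to the 45% baseline zeros; the alternative reading, in which dropout removes a fraction of the nonzero entries only, is computed for comparison. Variable names are illustrative.

```python
baseline_zero_fraction = 0.45   # zeros in the original data, from the caption
dropout_rate = 0.536            # SIM III-5 dropout rate, from the caption

# Reading A (assumed): dropout rate is a fraction of ALL matrix entries,
# so the zeros simply add up.
additive = baseline_zero_fraction + dropout_rate

# Reading B (for comparison): dropout removes a fraction of NONZERO entries.
conditional = baseline_zero_fraction + (1.0 - baseline_zero_fraction) * dropout_rate

print(f"additive reading:    {additive:.1%} zeros")
print(f"conditional reading: {conditional:.1%} zeros")
```

Only the additive reading gives a value above 98%, so it is the one consistent with the caption's statement.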
Summary of simulation parameters
| Dataset | Note | # cells | # genes | # cell types | Dropout rate |
|---|---|---|---|---|---|
| SIM I | – | 2000 | 10,000 | 5 | 0% |
| SIM II | Batch 1 | 1000 | 10,000 | 5 | 0% |
| SIM II | Batch 2 | 1000 | 10,000 | 5 | 0% |
| SIM III | SIM III-0 | 2000 | 10,000 | 5 | 0% |
| SIM III | SIM III-1 | 2000 | 10,000 | 5 | 15.44% |
| SIM III | SIM III-2 | 2000 | 10,000 | 5 | 25.15% |
| SIM III | SIM III-3 | 2000 | 10,000 | 5 | 35.28% |
| SIM III | SIM III-4 | 2000 | 10,000 | 5 | 43.41% |
| SIM III | SIM III-5 | 2000 | 10,000 | 5 | 53.60% |
Simulation parameters of SIM I, SIM II, and SIM III, covering the numbers of cells, genes, and cell types and the dropout rate. More details about the datasets can be found in Additional file 1.