Anindya Bhattacharya, Yan Cui.
Abstract
In the analysis of large-scale gene expression data, it is important to identify groups of genes with common expression patterns under certain conditions. Many biclustering algorithms have been developed to address this problem. However, comprehensive discovery of functionally coherent biclusters from large datasets remains challenging. Here we propose a GPU-accelerated biclustering algorithm based on searching for the largest Condition-dependent Correlation Subgroups (CCS) for each gene in the gene expression dataset. We compared CCS with thirteen widely used biclustering algorithms. CCS consistently outperformed all thirteen algorithms on both synthetic and real gene expression datasets. As a correlation-based biclustering method, CCS can also be used to find condition-dependent coexpression network modules. We implemented the CCS algorithm in C and the parallelized CCS algorithm in CUDA C for GPU computing. The source code of CCS is available from https://github.com/abhatta3/Condition-dependent-Correlation-Subgroups-CCS.
Year: 2017 PMID: 28646174 PMCID: PMC5482832 DOI: 10.1038/s41598-017-04070-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Summary of the synthetic datasets.
| Dataset | Type | Rows | Columns | True biclusters |
|---|---|---|---|---|
| CNST.100.3 | Constant | 100 | 75 | 3 |
| SS.150.4 | Shift and Scale | 150 | 100 | 4 |
| SS.200.5 | Shift and Scale | 200 | 120 | 5 |
| SS.200.6 | Shift and Scale | 200 | 120 | 6 |
| SS.250.7 | Shift and Scale | 250 | 120 | 7 |
Summary of the gene expression datasets and number of CCS biclusters obtained for θ = 0.8.
| Dataset | Rows | Columns | Description | Number of CCS biclusters |
|---|---|---|---|---|
| GDS531 | 12625 | 173 | Bone marrow plasma cells of multiple myeloma patients | 13 |
| GDS589 | 8799 | 122 | Peripheral and brain regions in rat strains | 20 |
| GDS3603 | 12625 | 79 | Advanced renal cancer peripheral blood mononuclear cells | 14 |
| GDS3966 | 22283 | 83 | Melanoma clinical samples | 19 |
| GDS4794 | 54675 | 65 | Small cell lung cancer (SCLC) samples and normal samples | 19 |
Figure 1. Recovery and Relevance scores on five synthetic datasets for CCS and the next-best performing algorithm.
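The record does not define the Recovery and Relevance scores. The following is a minimal sketch assuming one commonly used definition, in which each bicluster is matched to its best counterpart by the Jaccard index over matrix cells; the function names and the toy biclusters are hypothetical and are not taken from the CCS source code.

```python
def jaccard(a, b):
    """Jaccard index of two biclusters, each given as a (rows, columns) pair of sets."""
    cells_a = {(r, c) for r in a[0] for c in a[1]}
    cells_b = {(r, c) for r in b[0] for c in b[1]}
    union = cells_a | cells_b
    return len(cells_a & cells_b) / len(union) if union else 0.0


def match_score(source, target):
    """Average, over biclusters in `source`, of the best Jaccard match found in `target`."""
    if not source or not target:
        return 0.0
    return sum(max(jaccard(s, t) for t in target) for s in source) / len(source)


def recovery(true_bics, found_bics):
    """Assumed definition: how well the true biclusters are recovered by the found ones."""
    return match_score(true_bics, found_bics)


def relevance(true_bics, found_bics):
    """Assumed definition: how well the found biclusters correspond to true ones."""
    return match_score(found_bics, true_bics)


# Hypothetical toy example: two planted biclusters, only the first one is found.
true_bics = [({0, 1}, {0, 1}), ({2, 3}, {2, 3})]
found_bics = [({0, 1}, {0, 1})]

print(recovery(true_bics, found_bics))   # half of the planted biclusters are recovered
print(relevance(true_bics, found_bics))  # every found bicluster matches a planted one
```

Under this assumed definition, an algorithm that reports few but exact biclusters scores high on relevance and lower on recovery, which is why the two scores are reported together.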
Figure 2. Average number of enriched terms on five gene expression datasets. All enriched gene ontology terms with Benjamini-Hochberg FDR less than 0.01 were considered.
Figure 3. Percentage of biclusters from five gene expression datasets that have at least one enriched term. All enriched gene ontology terms with Benjamini-Hochberg FDR less than 0.01 were considered.
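Both captions filter terms at a Benjamini-Hochberg FDR below 0.01. As a rough illustration of that filter, the sketch below computes Benjamini-Hochberg adjusted p-values in plain Python and applies the 0.01 cutoff. The p-values are hypothetical, and the enrichment tool actually used in the paper is not stated in this record.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four gene ontology terms tested on one bicluster.
pvals = [0.0001, 0.004, 0.03, 0.5]
qvals = benjamini_hochberg(pvals)

# A term counts as enriched when its adjusted p-value is below 0.01.
enriched = [q < 0.01 for q in qvals]
print(enriched)
```

A bicluster would then contribute to Figure 3 if at least one entry of `enriched` is true, and the count of true entries is the per-bicluster quantity averaged in Figure 2.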
Figure 4. Comparison of execution time for the CPU and GPU implementations of CCS on gene expression dataset GDS531. The x-axis shows the number of base genes used for the bicluster search. The y-axis shows the GPU speedup, computed from the CPU and GPU execution times.
Figure 5. The coexpression network related to two CCS biclusters from GDS589. The blue, green and white nodes represent the genes from bicluster 1, bicluster 2 and the neighboring genes, respectively. Red nodes are genes common to biclusters 1 and 2. Edges represent correlation ≥ 0.8 or ≤ −0.8. (A) Coexpression network based on correlations over all the samples in GDS589. (B) Coexpression network based on correlations over the samples of bicluster 1. (C) Coexpression network based on correlations over the samples of bicluster 2.
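The contrast between panel (A) and panels (B) and (C) can be sketched as follows: an edge is drawn when the correlation between two genes is at least 0.8 in absolute value, computed either over all samples or only over a bicluster's samples. This is a minimal illustration that assumes Pearson correlation; the gene names, expression values and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def coexpression_edges(expr, samples, threshold=0.8):
    """Gene pairs whose correlation over `samples` is >= threshold or <= -threshold."""
    samples = list(samples)
    edges = set()
    for g1, g2 in combinations(sorted(expr), 2):
        x = [expr[g1][s] for s in samples]
        y = [expr[g2][s] for s in samples]
        if abs(pearson(x, y)) >= threshold:
            edges.add((g1, g2))
    return edges


# Hypothetical expression values: the two genes move together only in samples 0-3.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 4.0, 2.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 3.0, 7.0, 1.0, 5.0],
}

print(coexpression_edges(expr, range(8)))      # all samples: no edge
print(coexpression_edges(expr, [0, 1, 2, 3]))  # bicluster samples only: edge appears
```

In this toy case the edge is absent over all samples and present over the sample subset, which is the kind of condition-dependent coexpression the figure illustrates.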
| 1. S ← NULL |
| 3. I ← NULL |
| 4. J ← NULL |
| 6. apply Rules(1,2,3) on {gi, gj} for sample sets Jk (k=1,2,3) |
| 9. Ii ← {gi, gj} |
| 12. Ii ← Ii ∪ gp |
| 17. I ← Ii |
| 18. J ← Ji |
| 23. bicluster(gi) ← {I, J} |
| 29. bicluster({I, J}) ← {I ∪ |
| 30. bicluster({K, L}) ← NULL |