István P Sugár, Stuart C Sealfon.
Abstract
BACKGROUND: There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets such as flow cytometry data, prove unsatisfactory in terms of speed, problems with local minima, or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model-based clustering requires serial clustering for all cluster numbers within a user-defined interval; the final cluster number is then selected by various criteria. These supervised serial clustering methods are time-consuming, and different criteria frequently yield different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed, such as affinity propagation, are too expensive to be applied to datasets on the order of $10^6$ points, which are often generated by high-throughput experiments.
Year: 2010 PMID: 20932336 PMCID: PMC2967560 DOI: 10.1186/1471-2105-11-502
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Misty Mountain clustering of simulated FCM data. a) Simulated two-dimensional FCM data. A sum of four Gaussian distributions is simulated using Monte Carlo techniques (see Methods). The center of each Gaussian is marked by a colored arrow. b) Two-dimensional histogram, H(I,J), of the simulated data, created using an optimal 58 × 58 equally spaced mesh. c) Projection of the two-dimensional histogram onto the (J,FREQUENCY) plane. Blue lines: levels of the histogram intersections shown in subfigures d)-j). The frequencies at the intersections are: d) 5000, e) 2900, f) 2000, g) 757, h) 756, i) 103, j) 0.
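The steps named in this caption (Monte Carlo sampling, binning on an equally spaced mesh, and cutting the histogram at a frequency level) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' program: the function names, the isotropic Gaussians, the 4-neighbor connectivity, and the "frequency at least `level`" criterion are all choices made here, and the mesh size is taken as an input rather than optimized.

```python
import random
from collections import deque


def simulate_points(centers, sigma, n_per_center, seed=0):
    """Draw n_per_center points per center from an isotropic 2D Gaussian (assumed shape)."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, sigma), rng.gauss(cy, sigma))
            for cx, cy in centers for _ in range(n_per_center)]


def histogram2d(points, n_bins, lo, hi):
    """Count points on an n_bins x n_bins equally spaced mesh over [lo, hi) x [lo, hi)."""
    hist = [[0] * n_bins for _ in range(n_bins)]
    width = (hi - lo) / n_bins
    for x, y in points:
        i, j = int((x - lo) // width), int((y - lo) // width)
        if 0 <= i < n_bins and 0 <= j < n_bins:  # points outside the mesh are dropped
            hist[i][j] += 1
    return hist


def count_aggregates(hist, level):
    """Count 4-connected groups of bins whose frequency is at least `level` (assumed rule)."""
    n = len(hist)
    seen = [[False] * n for _ in range(n)]
    count = 0
    for i in range(n):
        for j in range(n):
            if hist[i][j] >= level and not seen[i][j]:
                count += 1  # found a bin of a not-yet-visited aggregate
                seen[i][j] = True
                queue = deque([(i, j)])
                while queue:  # breadth-first flood fill over the aggregate
                    a, b = queue.popleft()
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        u, v = a + da, b + db
                        if (0 <= u < n and 0 <= v < n
                                and hist[u][v] >= level and not seen[u][v]):
                            seen[u][v] = True
                            queue.append((u, v))
    return count


# Hypothetical usage: four Gaussians on a 58 x 58 mesh, cut at a few levels.
points = simulate_points([(2, 2), (6, 2), (2, 6), (6, 6)], 0.6, 5000)
hist = histogram2d(points, 58, 0.0, 8.0)
for level in (50, 10, 1):
    print(level, count_aggregates(hist, level))
```

Lowering `level` lets separate aggregates grow and eventually coalesce, which is the behavior the cross sections in subfigures d)-j) depict.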
Table 1. Characteristics of the clusters assigned to data in Figure 1a
| Color code | $L_p$ | $L_s$ | C | f |
|---|---|---|---|---|
| green | 3385 | 756 | 313369 | 0.777 |
| red | 10706 | 0 | 300000 | 1 |
| pink | 2493 | 756 | 143539 | 0.697 |
| blue | 1911 | 102 | 94930 | 0.947 |
$L_p$: height of the peak.
$L_s$: height of the highest saddle next to the peak.
C: number of data points in the cluster.
f [= $(L_p - L_s)/L_p$]: measure of separateness of the peak from nearby peak(s). The parameter estimates the reliability that an element of the cluster belongs to the respective population.
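As a quick check of the separateness measure, the sketch below recomputes f from the peak height and the height of the highest neighboring saddle listed in the table above. The function and variable names are illustrative, not taken from the Misty Mountain program.

```python
def separateness(peak_height, saddle_height):
    """Return f = (peak height - saddle height) / peak height."""
    if peak_height <= 0:
        raise ValueError("peak height must be positive")
    return (peak_height - saddle_height) / peak_height


# (peak height, highest neighboring saddle height) for each cluster in the table above
table1 = {
    "green": (3385, 756),
    "red": (10706, 0),
    "pink": (2493, 756),
    "blue": (1911, 102),
}

for color, (peak, saddle) in table1.items():
    print(color, round(separateness(peak, saddle), 3))
# green 0.777
# red 1.0
# pink 0.697
# blue 0.947
```

A peak whose nearest saddle lies at zero frequency, such as the red one, is fully separated and gets f = 1.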
Figure 2. Side scattering and forward scattering of U937 cells. a) Experimental data. Side scattering is plotted against forward scattering. b) Result of cluster analysis using K-median clustering and spectral clustering, assuming 2 centers. c) Result of cluster analysis using the Misty Mountain method. Table 2 lists the characteristics of the resulting clusters. The data points assigned to the two clusters are marked by red and blue symbols.
Table 2. Characteristics of the clusters assigned to data in Figure 2a
| Color code | $L_p$ | $L_s$ | C | f |
|---|---|---|---|---|
| red | 430 | 5 | 8338 | 0.988 |
| blue | 529 | 5 | 804 | 0.991 |
(see legends to Table 1)
Table 3. Summary of the comparison of Misty Mountain with state-of-the-art flow-cytometry-specific clustering methods
| Method | Metric | Manually gated 2D barcoding (&) | Simulated 5D Gaussians | Simulated 2D non-convex | 3D rituximab | 4D GvHD | Manually gated 4D |
|---|---|---|---|---|---|---|---|
| Misty Mountain | accuracy, sens (%) | 100 | 100 | 100 | - | - | 100 |
| Misty Mountain | accuracy, spec (%) | 100 | 100 | 100 | - | - | 100 |
| Misty Mountain | CPU (sec) | 10 | 196 | 6 | 0.3 | 0.8 | 3.6 |
| FLAME | accuracy, sens (%) | 20 (a); 60 (b) | - | 0 (d*); 100 (d) | - | - | - |
| FLAME | accuracy, spec (%) | 33 (a); 50 (b) | - | 0 (d*); 100 (d) | - | - | - |
| FLAME | CPU (sec) | $5 \times 10^4$ | $> 3 \times 10^5$ | $1 \times 10^4$ | 10 | 360 | $1.4 \times 10^4$ |
| flowClust | accuracy, sens (%) | 45 (a*); 60 (b*) | 100 (c) | 0 (c*); 100 (d) | - | - | 60 (d*); 60 (*) |
| flowClust | accuracy, spec (%) | 60 (a*); 55 (b*) | 100 (c) | 0 (c*); 100 (d) | - | - | 75 (d*); 38 (*) |
| flowClust | CPU (sec) | $5 \times 10^4$ | $4 \times 10^4$ | 7200 | 43 | 480 | 3660 |
| flowMerge | accuracy, sens (%) | 25 | 100 | 0 | - | - | 80 |
| flowMerge | accuracy, spec (%) | 45 | 100 | 0 | - | - | 57 |
| flowMerge | CPU (sec) | $1.3 \times 10^5$ | $1.27 \times 10^5$ | 7200 | 124 | 1020 | 8400 |
| flowJo | accuracy, sens (%) | 45 | - | - | - | - | - |
| flowJo | accuracy, spec (%) | 47 | - | - | - | - | - |
| flowJo | CPU (sec) | 1-10 | - | - | 1-10 | 1-10 | - |
(a) optimal cluster number: 12
(b) optimal cluster number: 24
(a*) optimal cluster number: 15
(b*) optimal cluster number: 22
(c) optimal cluster number: 5
(c*) optimal cluster number: 2
(d) optimal cluster number: 1
(d*) optimal cluster number: 4
(*) optimal cluster number: 8
(&) To save CPU time, a data set reduced by 80% was analyzed by FLAME, flowClust and flowJo.
sens (sensitivity) = (# of correctly assigned clusters)/(# of clusters in gold standard)
spec (specificity) = (# of correctly assigned clusters)/(total # of assigned clusters)
Gold standards were independent expert manual clustering for experimental data and specified clusters for simulated data.
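The two accuracy measures defined above reduce to simple ratios of cluster counts. A minimal sketch follows, using hypothetical counts chosen only for illustration; the function name and the input validation are assumptions, not part of the original comparison.

```python
def sensitivity_specificity(n_correct, n_gold, n_assigned):
    """Return (sens, spec) in percent, following the cluster-count definitions above.

    n_correct  - number of correctly assigned clusters
    n_gold     - number of clusters in the gold standard
    n_assigned - total number of clusters assigned by the method
    """
    if n_gold <= 0 or n_assigned <= 0:
        raise ValueError("cluster counts must be positive")
    if not 0 <= n_correct <= min(n_gold, n_assigned):
        raise ValueError("n_correct cannot exceed either cluster count")
    sens = 100.0 * n_correct / n_gold
    spec = 100.0 * n_correct / n_assigned
    return sens, spec


# Hypothetical example: 4 of 5 gold-standard clusters recovered, 7 clusters assigned.
sens, spec = sensitivity_specificity(n_correct=4, n_gold=5, n_assigned=7)
print(f"sens = {sens:.0f}%, spec = {spec:.0f}%")  # sens = 80%, spec = 57%
```

Under these definitions, a method that splits true populations into extra clusters loses specificity, while one that misses populations loses sensitivity.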
Figure 3. Two-dimensional FCM data. 853,674 U937 cells are stained with two fluorescent dyes, Pacific Blue and APC-Cy7-A. a) The fluorescence intensity of APC-Cy7-A is plotted against the fluorescence intensity of Pacific Blue. b) Result of the cluster analysis using the Misty Mountain method. Each cluster is marked by a code number. The table in Additional File 2 lists the characteristics of the resulting clusters.
Figure 4. Four-dimensional FCM data from the graft-versus-host-disease data set. 10,463 peripheral blood mononuclear cells are stained with four fluorescent dyes: 1) CD4-FITC, 2) CD122-PE, 3) CD3-PerCP, 4) CD8-APC. On each axis of the plots, the code number of the respective fluorescent stain is shown. Six 2D projections of the 4D data set are shown.
Figure 5. Misty Mountain clustering of the graft-versus-host-disease data set. 2D projections of the 4D clustering result are shown. Code numbers of clusters assigned by the Misty Mountain algorithm: 1 (red); 2 (blue); 3 (green); 4 (black); 5 (rose); 6 (light blue). Table 4 lists the characteristics of the resulting clusters.
Table 4. Characteristics of clusters assigned by Misty Mountain to the 4D GvHD data in Figure 4.
| Code # | $L_p$ | $L_s$ | C | Bin # | f |
|---|---|---|---|---|---|
| 1 | 1541 | 1033 | 1542 | 1 | 0.33 |
| 2 | 1115 | 1033 | 1116 | 1 | 0.074 |
| 4 | 230 | 25 | 1011 | 11 | 0.891 |
| 3 | 889 | 804 | 890 | 1 | 0.096 |
| 5 | 175 | 30 | 858 | 8 | 0.829 |
| 6 | 132 | 30 | 265 | 3 | 0.773 |
(see legends to Table 1)
Bin #: number of histogram bins containing the points of a cluster
Figure 6. Flow chart of the main part of the Misty Mountain program. IDIM - dimension of the data space. N - the number of equidistant meshes along each coordinate for creating the optimal histogram. LEVELMAX - highest frequency of the histogram. LEVEL - frequency at which the current cross section is created. NCL - current number of (major and small) histogram peaks. NSTRA and NSTRB - matrices of labeled aggregates of two consecutive cross sections. NSTRF - structure matrix that stores the largest separate aggregates belonging to the major peaks (see e.g. the colored aggregates in Figure 1d-j).
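The descending-level loop described in this legend can be illustrated on a one-dimensional histogram. The sketch below is a simplification under stated assumptions, not the published program. It treats a cross section at LEVEL as the bins with frequency at least LEVEL, records a new peak whenever an aggregate contains no known peak, and records the current level as the saddle when still-single peaks end up in a shared aggregate. The distinction between major and small peaks and the NSTRF bookkeeping are omitted, and all function names are hypothetical.

```python
def runs_at_level(hist, level):
    """Return (start, end) index pairs of maximal runs of bins with frequency >= level."""
    runs, start = [], None
    for i, h in enumerate(hist):
        if h >= level and start is None:
            start = i
        elif h < level and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(hist) - 1))
    return runs


def sweep_levels(hist):
    """Descend from the highest frequency to 0; return (position, top, saddle) per peak."""
    peaks = []  # each entry: {"pos": bin index, "top": level, "saddle": level or None}
    for level in range(max(hist), -1, -1):
        for start, end in runs_at_level(hist, level):
            inside = [p for p in peaks if start <= p["pos"] <= end]
            if not inside:
                if level > 0:
                    # An aggregate with no known peak: a new peak emerges at this level.
                    peaks.append({"pos": start, "top": level, "saddle": None})
            elif len(inside) > 1:
                # Several peaks share one aggregate: still-single peaks coalesce here.
                for p in inside:
                    if p["saddle"] is None:
                        p["saddle"] = level
    for p in peaks:
        if p["saddle"] is None:
            p["saddle"] = 0  # a lone peak never coalesces; assume a saddle at 0
    return [(p["pos"], p["top"], p["saddle"]) for p in peaks]


print(sweep_levels([0, 3, 5, 2, 4, 1, 0]))  # [(2, 5, 2), (4, 4, 2)]
```

In the example, the peaks of height 5 and 4 first share an aggregate at level 2, so both receive a saddle level of 2. A multidimensional version would replace `runs_at_level` with connected-component labeling of the mesh.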
Figure 7. Flow chart of the ANALYZE routine of the Misty Mountain program. ISZM - the largest label of the aggregates in a cross section. (For simplicity this flowchart assumes that ISZM is also the number of aggregates. In reality ISZM is frequently larger than the number of aggregates.) ICLU - code number of a peak. ILAB - label of an aggregate. IP - counter of peaks belonging to the same aggregate. CP - characteristic position of a peak. IIPOINT1(ICLU) - characteristic position of the ICLU-th peak. IIPOINT2(ICLU) - label of the aggregate at the ICLU-th peak. IIPOINT3(ICLU) = T - the type of the ICLU-th peak: T = 1 - single peak, T = 0 - merged small peak, T = -1 - merged major peak. The values of the IIPOINT2 and IIPOINT3 vector elements are updated at each level. IILEVEL1(ICLU) - level at the top of the ICLU-th peak. IILEVEL2(ICLU) - level of the saddle where the single ICLU-th peak coalesces with another peak. IHELP(ILAB) - number of characteristic peak positions falling into the aggregate labeled by ILAB. IHELP(ILAB,IP) - the code number of the peak belonging to the IP-th characteristic peak position in the aggregate labeled by ILAB. IFINAL(ICLU) = 0 - when the ICLU-th peak is eliminated from the analysis. IFINAL(ICLU) = 1 - when the aggregate belonging to the ICLU-th peak is copied into NSTRF. ISZVEC(IP) = -1 - the IP-th peak merged with another peak at a higher level. ISZVEC(IP) = 0 - the IP-th peak was a small single peak at the previous level. ISZVEC(IP) = 1 - the IP-th peak was a major single peak at the previous level. INEW - number of single peaks merging with each other at the current level. IHIGH - number of major peaks among the INEW single peaks. Other notations are given in the legend to Figure 6.