Hung-Chia Chen, Wen Zou, Yin-Jing Tien, James J Chen.
Abstract
Biclustering has emerged as an important approach to the analysis of large-scale datasets. A biclustering technique identifies a subset of rows that exhibit similar patterns on a subset of columns in a data matrix. Many biclustering methods have been proposed, and most, if not all, of these algorithms were developed to detect regions of "coherence" patterns. These methods perform unsatisfactorily if the purpose is to identify biclusters of a constant level. This paper presents a two-step biclustering method to identify constant-level biclusters for binary or quantitative data. The algorithm identifies the maximal dimensional submatrix such that the proportion of non-signals is less than a pre-specified tolerance δ. In the analysis of two synthetic datasets, the proposed method had much higher sensitivity and slightly lower specificity than several prominent biclustering methods. It was further compared with the Bimax method on two real datasets. The proposed method was shown to be the most robust in terms of sensitivity, number of biclusters, and number of serotype-specific biclusters identified. However, dichotomization using different signal level thresholds usually leads to different sets of biclusters; this also occurs in the present analysis.
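The tolerance criterion in the abstract (proportion of non-signals below δ) can be illustrated with a small sketch. This is not the paper's two-step method: the function names `non_signal_proportion` and `trim_to_tolerance` are hypothetical, the data are assumed to be already dichotomized (1 = signal, 0 = non-signal), and the greedy trimming shown here is a simple heuristic that does not guarantee a maximal submatrix.

```python
def non_signal_proportion(matrix, rows, cols):
    """Fraction of 0 entries in the submatrix defined by rows x cols."""
    total = len(rows) * len(cols)
    if total == 0:
        return 0.0
    zeros = sum(1 for r in rows for c in cols if matrix[r][c] == 0)
    return zeros / total


def trim_to_tolerance(matrix, rows, cols, delta):
    """Greedily drop the row or column with the most non-signals until the
    non-signal proportion is below delta (illustrative heuristic only)."""
    rows, cols = list(rows), list(cols)
    while rows and cols and non_signal_proportion(matrix, rows, cols) >= delta:
        # Non-signal fraction of every remaining row and column.
        row_bad = {r: sum(matrix[r][c] == 0 for c in cols) / len(cols) for r in rows}
        col_bad = {c: sum(matrix[r][c] == 0 for r in rows) / len(rows) for c in cols}
        worst_row = max(rows, key=row_bad.get)
        worst_col = max(cols, key=col_bad.get)
        if row_bad[worst_row] >= col_bad[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols


# Toy binary matrix (made up for illustration, not from the paper).
binary = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]
kept_rows, kept_cols = trim_to_tolerance(binary, range(4), range(4), delta=0.2)
print(kept_rows, kept_cols)  # [0, 1, 2] [0, 1, 2]
```

In this toy case the retained 3 x 3 block contains one non-signal out of nine cells, which is below the tolerance of 0.2. A smaller δ would force further trimming, consistent with the abstract's remark that results depend on the chosen thresholds.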
Year: 2013 PMID: 23940779 PMCID: PMC3733970 DOI: 10.1371/journal.pone.0071680
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. A synthetic data matrix ordered according to the magnitudes of the singular vectors of the second principal component.
The two figures above and to the right of the ordered data matrix are the plots of the corresponding values of the singular vectors.
Figure 2. The plot of the same synthetic data matrix after the data are dichotomized.
The boundaries between the signals (red) and non-signals (green) are apparent.
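The two operations described in the captions of Figures 1 and 2 (ordering by singular vectors, then dichotomizing) can be sketched with NumPy. This is a minimal illustration under stated assumptions: the helper names `order_by_singular_vectors` and `dichotomize` are hypothetical, the matrix is decomposed without centering, rows and columns are sorted by the signed entries of the chosen singular vectors, and the threshold of 1.0 is arbitrary. The source does not specify any of these details.

```python
import numpy as np


def order_by_singular_vectors(data, component=1):
    """Reorder rows and columns by the entries of one pair of singular
    vectors. component=1 selects the second component (0-based index)."""
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    row_order = np.argsort(u[:, component])
    col_order = np.argsort(vt[component, :])
    return data[np.ix_(row_order, col_order)], row_order, col_order


def dichotomize(data, threshold):
    """Map values above the threshold to 1 (signal) and the rest to 0."""
    return (data > threshold).astype(int)


# Toy data (made up): low-level noise plus one planted constant-level block.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 0.1, size=(8, 6))
data[2:5, 1:4] += 2.0

ordered, row_order, col_order = order_by_singular_vectors(data)
signals = dichotomize(ordered, threshold=1.0)
print(signals)
```

Reordering only permutes rows and columns, so the dichotomized matrix contains the same number of signals as the unordered one. Which component separates a given block best depends on the data, so `component` is left as a parameter.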
Figure 3. Heatmap of the original synthetic dataset biclustered in Figure 1.
There are 4 biclusters colored as red, green, pink and blue in rows and columns; the brown and orange columns represent the overlapped columns of the biclusters.
Figure 4. Heatmap of the randomly permuted synthetic data from Figure 3.
This reflects the real data collected from an experiment.
Performance of SVD-Bin(δ), Bin-SVD(δ), NMF-Bin(f, δ), Bimax(k), CC(k), xMotif(k), Spectral(f), and FABIA(k) on a quantitative synthetic dataset where δ is the tolerance threshold, f is the number of factorization ranks, and k is the number of clusters.
| Method | Sensitivity | Specificity | Proportion of perfect identification | Number of clusters |
| SVD-Bin(0.3) | 0.961 | 0.984 | 0.02 | 6.67 |
| SVD-Bin(0.2) | 0.952 | 0.996 | 0.11 | 5.37 |
| SVD-Bin(0.1) | 0.915 | 1.000 | 0.25 | 4.41 |
| Bin-SVD(0.3) | 1.000 | 0.980 | 0 | 8.98 |
| Bin-SVD(0.2) | 0.999 | 0.994 | 0.03 | 7.08 |
| Bin-SVD(0.1) | 0.975 | 0.998 | 0.20 | 5.40 |
| NMF-Bin(3,0.3) | 0.874 | 0.996 | 0 | 3.67 |
| NMF-Bin(4,0.3) | 0.986 | 0.997 | 0.43 | 4.71 |
| NMF-Bin(5,0.3) | 0.988 | 0.995 | 0.14 | 5.41 |
| NMF-Bin(3,0.2) | 0.871 | 0.999 | 0 | 3.53 |
| NMF-Bin(4,0.2) | 0.983 | 0.999 | 0.62 | 4.36 |
| NMF-Bin(5,0.2) | 0.978 | 0.998 | 0.29 | 5.1 |
| NMF-Bin(3,0.1) | 0.852 | 1.000 | 0 | 3.42 |
| NMF-Bin(4,0.1) | 0.960 | 1.000 | 0.58 | 4.13 |
| NMF-Bin(5,0.1) | 0.948 | 1.000 | 0.46 | 4.31 |
| Bimax(4) | 0.387 | 1.000 | 0 | 4 |
| Bimax(5) | 0.423 | 1.000 | 0 | 5 |
| Bimax(6) | 0.475 | 1.000 | 0 | 6 |
| Bimax(7) | 0.508 | 1.000 | 0 | 7 |
| Bimax(8) | 0.544 | 1.000 | 0 | 7.92 |
| CC(4) | 0.544 | 0.110 | 0 | 3.75 |
| CC(5) | 0.546 | 0.109 | 0 | 3.82 |
| CC(6) | 0.546 | 0.109 | 0 | 3.82 |
| CC(7) | 0.546 | 0.109 | 0 | 3.82 |
| CC(8) | 0.546 | 0.109 | 0 | 3.82 |
| xMotif(4) | 0.056 | 0.225 | 0 | 3.44 |
| xMotif(5) | 0.069 | 0.231 | 0 | 3.45 |
| xMotif(6) | 0.057 | 0.217 | 0 | 3.52 |
| xMotif(7) | 0.072 | 0.227 | 0 | 3.5 |
| xMotif(8) | 0.067 | 0.221 | 0 | 3.52 |
| FABIA(4) | 0.862 | 0.988 | 0.01 | 3.99 |
| FABIA(5) | 0.828 | 0.982 | 0.07 | 4.44 |
| FABIA(6) | 0.794 | 0.975 | 0.01 | 5.11 |
| FABIA(7) | 0.707 | 0.959 | 0.02 | 5.86 |
| FABIA(8) | 0.713 | 0.950 | 0 | 6.69 |
| Spectral(3) | 0.005 | 0.996 | 0 | 0.48 |
| Spectral(4) | 0.001 | 0.996 | 0 | 0.53 |
| Spectral(5) | 0.000 | 0.991 | 0 | 1.84 |
| Plaid | 0.508 | 0.913 | 0 | 4.76 |
The results are the average over 100 simulated datasets.
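For readers who want to reproduce metrics of this kind, the sketch below computes sensitivity and specificity at the level of matrix cells. The source does not state how the tabulated values were computed, so the definitions here are assumptions: a cell counts as positive if it lies in any true bicluster, and as detected if it lies in any reported bicluster. The function names are hypothetical.

```python
def cells(rows, cols):
    """All (row, column) index pairs covered by one bicluster."""
    return {(r, c) for r in rows for c in cols}


def sensitivity_specificity(true_biclusters, found_biclusters, n_rows, n_cols):
    """Cell-level sensitivity and specificity (assumed definitions)."""
    true_cells = set().union(*(cells(r, c) for r, c in true_biclusters))
    found_cells = set().union(*(cells(r, c) for r, c in found_biclusters))
    tp = len(true_cells & found_cells)   # true cells that were recovered
    fn = len(true_cells - found_cells)   # true cells that were missed
    fp = len(found_cells - true_cells)   # background cells reported as signal
    tn = n_rows * n_cols - tp - fn - fp  # background cells left alone
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Toy example (made up): a 6 x 6 matrix with one true 3 x 3 bicluster,
# and a reported bicluster that includes one extra column.
truth = [(range(0, 3), range(0, 3))]
found = [(range(0, 3), range(0, 4))]
sens, spec = sensitivity_specificity(truth, found, n_rows=6, n_cols=6)
print(sens, round(spec, 3))  # 1.0 0.889
```

Here all nine true cells are recovered, and three of the 27 background cells are wrongly included, giving a specificity of 24/27.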
Performance of SVD(δ) and Bimax(k) on a binary synthetic dataset, where δ is the tolerance threshold and k is the number of clusters. The results are the average over 100 simulated datasets.
| Method | Sensitivity | Specificity | Proportion of perfect identification | Number of clusters |
| SVD(0.3) | 0.995 | 0.9669 | 0 | 10.13 |
| SVD(0.2) | 0.9599 | 0.9844 | 0.01 | 6.07 |
| SVD(0.1) | 0.8354 | 0.9967 | 0.02 | 4 |
| Bimax(4) | 0.2499 | 0.9972 | 0 | 4 |
| Bimax(5) | 0.2849 | 0.9964 | 0 | 5 |
| Bimax(6) | 0.3063 | 0.9959 | 0 | 6 |
| Bimax(7) | 0.3317 | 0.9952 | 0 | 7 |
| Bimax(8) | 0.3561 | 0.9946 | 0 | 8 |
| Bimax(9) | 0.3714 | 0.9942 | 0 | 9 |
| Bimax(10) | 0.3874 | 0.9938 | 0 | 10 |
| Bimax(20) | 0.5144 | 0.9898 | 0 | 20 |
| Bimax(50) | 0.7541 | 0.9801 | 0 | 50 |
| Bimax(100) | 0.8831 | 0.9698 | 0 | 95 |
Performance of SVD-Bin(δ), Bin-SVD(δ), Bin-NMF(f, δ), and Bimax(k) on the Saccharomyces cerevisiae dataset where δ is the tolerance threshold, f is the number of factorization ranks, and k is the number of clusters.
| Method | Sensitivity | Specificity | Number of clusters (k) | Number of significant clusters | Processing time |
| SVD-Bin(0.3) | 0.167 | 0.990 | 26 | 20 | 0.734 |
| SVD-Bin(0.2) | 0.082 | 0.997 | 14 | 14 | 0.657 |
| SVD-Bin(0.1) | 0.013 | 1.000 | 5 | 5 | 0.322 |
| Bin-SVD(0.3) | 0.410 | 0.957 | 56 | 48 | 1.825 |
| Bin-SVD(0.2) | 0.085 | 0.996 | 19 | 19 | 0.441 |
| Bin-SVD(0.1) | 0.015 | 1.000 | 4 | 4 | 0.228 |
| Bin-NMF(70,0.3) | 0.653 | 0.955 | 55 | 55 | 8.443 |
| Bin-NMF(70,0.2) | 0.269 | 0.991 | 21 | 19 | 8.466 |
| Bin-NMF(70,0.1) | No bicluster is found | | | | 5.073 |
| Bimax(5) | 0.027 | 1 | 5 | 5 | 0.007 |
| Bimax(10) | 0.036 | 1 | 10 | 10 | 0.007 |
| Bimax(15) | 0.083 | 1 | 15 | 15 | 0.010 |
| Bimax(20) | 0.084 | 1 | 20 | 20 | 0.008 |
| Bimax(50) | 0.094 | 1 | 50 | 50 | 0.009 |
| Bimax(100) | 0.168 | 1 | 100 | 100 | 0.011 |
| Bimax(200) | 0.175 | 1 | 200 | 200 | 0.011 |
The number of the factors in NMF is the rank of the data matrix f = 70.
Performance of SVD(δ), NMF(f, δ), and Bimax(k) on the PFGE dataset consisting of the 4 serotypes, Heidelberg (n = 322), Javiana (n = 150), Newport (n = 91), and Typhimurium (n = 135), for a total of 698 isolates with 71 bands.
| Method | Sensitivity, based on observed data | Specificity, based on observed data | Number of clusters | Number of significant clusters | Processing time |
| SVD(0.3) | 0.632 | 0.961 | 38 | 35 | 1.042 |
| SVD(0.2) | 0.439 | 0.980 | 14 | 14 | 0.379 |
| SVD(0.1) | No bicluster is found | | | | 0.193 |
| NMF(71, 0.3) | 0.425 | 0.997 | 19 | 19 | 16.833 |
| NMF(71, 0.2) | 0.478 | 0.991 | 16 | 16 | 12.974 |
| NMF(71, 0.1) | No bicluster is found | | | | 15.496 |
| Bimax(4) | 0.029 | 1 | 4 | 4 | 0.009 |
| Bimax(5) | 0.029 | 1 | 5 | 5 | 0.009 |
| Bimax(6) | 0.038 | 1 | 6 | 6 | 0.009 |
| Bimax(7) | 0.040 | 1 | 7 | 7 | 0.009 |
| Bimax(8) | 0.040 | 1 | 8 | 8 | 0.008 |
| Bimax(9) | 0.043 | 1 | 9 | 9 | 0.008 |
| Bimax(10) | 0.048 | 1 | 10 | 10 | 0.008 |
| Bimax(100) | 0.104 | 1 | 100 | 100 | 0.014 |
The number of the factors in NMF is the rank of the data matrix f = 71.
The ten largest biclusters out of 14 biclusters identified by SVD(0.2).
| Majority | Sensitivity | Specificity | Isolates | Bands | log10p |
| Heidelberg | 0.9845 | 0.8883 | 359 | 16 | |
| cluster 1 | 0.9783 | 0.9814 | 322 | 12 | ≤−2402.38 |
| cluster 2 | 0.9068 | 1 | 292 | 9 | ≤−1545.73 |
| cluster 3 | 0.0807 | 0.9362 | 50 | 4 | ≤−106.22 |
| cluster 4 | 0.0776 | 0.9707 | 36 | 5 | ≤−124.87 |
| Javiana | 0.6067 | 0.8595 | 168 | 8 | |
| cluster 1 | 0.4267 | 0.8796 | 130 | 2 | ≤−146.11 |
| cluster 2 | 0.1933 | 0.9818 | 39 | 3 | ≤−69.97 |
| cluster 3 | 0.0733 | 0.9982 | 12 | 4 | ≤−42.18 |
| cluster 4 | 0.02 | 0.9964 | 5 | 3 | −6.75 |
| Newport | 0.0879 | 0.9951 | 11 | 3 | |
| cluster 1 | 0.0879 | 0.9951 | 11 | 3 | ≤−12.37 |
| Typhimurium | 0.0296 | 0.9982 | 5 | 2 | |
| cluster 1 | 0.0296 | 0.9982 | 5 | 2 | −5.03 |
The ten largest biclusters identified by Bimax(100).
| Majority | Sensitivity | Specificity | Isolates | Bands | log10p |
| Typhimurium | 0.8148 | 0.8686 | 184 | 5 | |
| cluster 1 | 0.3801 | 0.9539 | 90 | 2 | ≤−166.96 |
| cluster 2 | 0.3416 | 1 | 55 | 3 | ≤−153.05 |
| cluster 3 | 0.4144 | 0.9942 | 78 | 2 | ≤−144.70 |
| cluster 4 | 0.3908 | 0.9904 | 73 | 2 | ≤−135.42 |
| cluster 5 | 0.2429 | 0.9315 | 72 | 2 | ≤−133.57 |
| cluster 6 | 0.3026 | 0.9904 | 51 | 2 | ≤−94.61 |
| cluster 7 | 0.2535 | 0.9773 | 48 | 2 | ≤−89.05 |
| cluster 8 | 0.2482 | 0.9792 | 46 | 2 | ≤−85.34 |
| Newport | 0.2967 | 0.9489 | 58 | 2 | |
| cluster 1 | 0.2328 | 0.9451 | 58 | 2 | ≤−107.60 |
| Javiana | 0.2 | 0.9672 | 48 | 2 | |
| cluster 1 | 0.2482 | 0.9792 | 48 | 2 | ≤−89.05 |