Djork-Arné Clevert1, Thomas Unterthiner2, Gundula Povysil2, Sepp Hochreiter2.
Abstract
MOTIVATION: Biclustering has become a major tool for analyzing large datasets given as a matrix of samples times features. It has been successfully applied in the life sciences for drug design and in e-commerce for recommender systems. Factor Analysis for Bicluster Acquisition (FABIA), one of the most successful biclustering methods, is a generative model that represents each bicluster by two sparse membership vectors: one for the samples and one for the features. However, FABIA is restricted to about 20 code units because of the high computational complexity of computing the posterior. Furthermore, code units are sometimes insufficiently decorrelated, and sample membership is difficult to determine. We propose to use the recently introduced unsupervised Deep Learning approach Rectified Factor Networks (RFNs) to overcome the drawbacks of existing biclustering methods. RFNs efficiently construct very sparse, non-linear, high-dimensional representations of the input via their posterior means. RFN learning is a generalized alternating minimization algorithm based on the posterior regularization method, which enforces non-negative and normalized posterior means. Each code unit represents a bicluster: the samples for which the code unit is active belong to the bicluster, as do the features that have activating weights to the code unit.
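The read-out rule in the last sentence of the abstract can be illustrated with a minimal sketch. It assumes a matrix of non-negative posterior means `H` (samples by code units) and a weight matrix `W` (features by code units) are already available. The function name, the variable names, and the reading of "activating weights" as weights above a threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def biclusters_from_codes(H, W, weight_threshold=0.0):
    """Read one bicluster per code unit (hypothetical helper).

    H: (n_samples, n_units) non-negative posterior means, assumed given.
    W: (n_features, n_units) weight matrix, assumed given.
    weight_threshold: assumption; weights above it count as "activating".
    """
    biclusters = []
    for k in range(H.shape[1]):
        # Samples for which code unit k is active (non-zero posterior mean).
        samples = np.flatnonzero(H[:, k] > 0)
        # Features whose weight to code unit k exceeds the threshold.
        features = np.flatnonzero(W[:, k] > weight_threshold)
        biclusters.append((samples, features))
    return biclusters

# Toy values, made up for illustration only.
H = np.array([[0.9, 0.0],
              [0.0, 0.0],
              [0.4, 0.7]])
W = np.array([[0.8, 0.0],
              [0.0, 0.0],
              [0.0, 0.5],
              [0.3, 0.2]])
result = biclusters_from_codes(H, W)
```

Because the posterior means are rectified, sample membership follows from a simple non-zero test in this sketch; the threshold on the weights is a free choice here.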
Year: 2017 PMID: 28881961 PMCID: PMC5870657 DOI: 10.1093/bioinformatics/btx226
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Left: Factor analysis model with hidden units (factors), visible units, a weight matrix and noise. Right: The outer product of two sparse vectors results in a matrix with a bicluster. Note that the non-zero entries in the vectors are adjacent to each other for visualization purposes only.
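The right-hand panel of Fig. 1 can be reproduced numerically with a short sketch. The vector names `w` and `h`, the sizes, and the values are illustrative assumptions. The point is only that the outer product of two sparse vectors is non-zero exactly on the rows and columns where both vectors are non-zero, and that these positions need not be adjacent.

```python
import numpy as np

n_features, n_samples = 6, 5

# Sparse membership vectors (toy values, chosen for illustration).
w = np.zeros(n_features)   # feature membership
h = np.zeros(n_samples)    # sample membership
w[[1, 4]] = [2.0, -1.0]
h[[0, 3]] = [1.5, 0.5]

# The outer product is a rank-one matrix containing a single bicluster.
X = np.outer(w, h)

# Rows and columns with at least one non-zero entry form the bicluster.
rows = np.flatnonzero(np.any(X != 0, axis=1))
cols = np.flatnonzero(np.any(X != 0, axis=0))
```

Here the bicluster spans features 1 and 4 and samples 0 and 3, which shows that membership is defined by the non-zero pattern rather than by contiguity.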
Results are the mean of 100 instances for each simulated dataset
| Method | M1 (mult. model) | A1 (add. model) | A2 (add. model) | A3 (add. model) |
|---|---|---|---|---|
| RFN | ||||
| FABIA | 0.478 ± 1e-2 | 0.109 ± 6e-2 | 0.196 ± 8e-2 | 0.475 ± 1e-1 |
| FABIAS | 0.564 ± 3e-3 | 0.150 ± 7e-2 | 0.268 ± 7e-2 | 0.546 ± 1e-1 |
| SAMBA | 0.006 ± 5e-5 | 0.002 ± 6e-4 | 0.002 ± 5e-4 | 0.003 ± 8e-4 |
| xMOTIF | 0.002 ± 6e-5 | 0.002 ± 4e-4 | 0.002 ± 4e-4 | 0.001 ± 4e-4 |
| MFSC | 0.057 ± 2e-3 | 0.000 ± 0e-0 | 0.000 ± 0e-0 | 0.000 ± 0e-0 |
| Bimax | 0.004 ± 2e-4 | 0.009 ± 8e-3 | 0.010 ± 9e-3 | 0.014 ± 1e-2 |
| plaid_ss | 0.045 ± 9e-4 | 0.039 ± 2e-2 | 0.041 ± 1e-2 | 0.074 ± 3e-2 |
| CC | 0.001 ± 7e-6 | 4e-4 ± 3e-4 | 3e-4 ± 2e-4 | 1e-4 ± 1e-4 |
| plaid_ms | 0.072 ± 4e-4 | 0.064 ± 3e-2 | 0.072 ± 2e-2 | 0.112 ± 3e-2 |
| plaid_t_ab | 0.046 ± 5e-3 | 0.021 ± 2e-2 | 0.005 ± 6e-3 | 0.022 ± 2e-2 |
| plaid_ms5 | 0.083 ± 6e-4 | 0.098 ± 4e-2 | 0.143 ± 4e-2 | 0.221 ± 5e-2 |
| plaid_t_a | 0.037 ± 4e-3 | 0.039 ± 3e-2 | 0.010 ± 9e-3 | 0.051 ± 4e-2 |
| FLOC | 0.006 ± 3e-5 | 0.005 ± 9e-4 | 0.005 ± 1e-3 | 0.003 ± 9e-4 |
| ISA | 0.333 ± 5e-2 | 0.039 ± 4e-2 | 0.033 ± 2e-2 | 0.140 ± 7e-2 |
| spec | 0.032 ± 5e-4 | 0.000 ± 0e-0 | 0.000 ± 0e-0 | 0.000 ± 0e-0 |
| OPSM | 0.012 ± 1e-4 | 0.007 ± 2e-3 | 0.007 ± 2e-3 | 0.008 ± 2e-3 |
Dataset M1 contains multiplicative biclusters and datasets A1–A3 contain additive biclusters. The numbers denote average consensus scores with the true biclusters, followed by their standard deviations (after ±). The best results are printed bold and the second best in italics (‘better’ means significantly better according to both a paired t-test and a McNemar test of correct elements in biclusters).
Fig. 2. Runtime comparison of FABIA and RFN for 10, 30, 100, 300 and 500 biclusters on synthetic inputs of n = 500 features and l = 1000 samples for 100 iterations each. Values shown are the median of five measurements; error bars are standard errors of the mean.
Results on the (A) breast cancer, (B) multiple tissue samples, (C) DLBCL datasets measured by the consensus score
| method | (A) sco | (A) #bc | (A) #g | (A) #s | (B) sco | (B) #bc | (B) #g | (B) #s | (C) sco | (C) #bc | (C) #g | (C) #s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RFN | | 3 | 73 | 31 | | 5 | 75 | 33 | | 2 | 59 | 72 |
| FABIA | | 3 | 92 | 31 | 0.53 | 5 | 356 | 29 | | 2 | 59 | 62 |
| FABIAS | | 3 | 144 | 32 | 0.44 | 5 | 435 | 30 | | 2 | 104 | 60 |
| MFSC | 0.17 | 5 | 87 | 24 | 0.31 | 5 | 431 | 24 | 0.18 | 5 | 50 | 42 |
| plaid_ss | 0.39 | 5 | 500 | 38 | 0.56 | 5 | 1903 | 35 | 0.30 | 5 | 339 | 72 |
| plaid_ms | 0.39 | 5 | 175 | 38 | 0.50 | 5 | 571 | 42 | 0.28 | 5 | 143 | 63 |
| plaid_ms5 | 0.29 | 5 | 56 | 29 | 0.23 | 5 | 71 | 26 | 0.21 | 5 | 68 | 47 |
| ISA_1 | 0.03 | 25 | 55 | 4 | 0.05 | 29 | 230 | 6 | 0.01 | 56 | 26 | 8 |
| OPSM | 0.04 | 12 | 172 | 8 | 0.04 | 19 | 643 | 12 | 0.03 | 6 | 162 | 4 |
| SAMBA | 0.02 | 38 | 37 | 7 | 0.03 | 59 | 53 | 8 | 0.02 | 38 | 19 | 15 |
| xMOTIF | 0.07 | 5 | 61 | 6 | 0.11 | 5 | 628 | 6 | 0.05 | 5 | 9 | 9 |
| Bimax | 0.01 | 1 | 1213 | 97 | 0.10 | 4 | 35 | 5 | 0.07 | 5 | 73 | 5 |
| CC | 0.11 | 5 | 12 | 12 | nc | nc | nc | nc | 0.05 | 5 | 10 | 10 |
| plaid_t_ab | 0.24 | 2 | 40 | 23 | 0.38 | 5 | 255 | 22 | 0.17 | 1 | 3 | 44 |
| plaid_t_a | 0.23 | 2 | 24 | 20 | 0.39 | 5 | 274 | 24 | 0.11 | 3 | 6 | 24 |
| spec | 0.12 | 13 | 198 | 28 | 0.37 | 5 | 395 | 20 | 0.05 | 28 | 133 | 32 |
| FLOC | 0.04 | 5 | 343 | 5 | nc | nc | nc | nc | 0.03 | 5 | 167 | 5 |
An ‘nc’ entry means that the method did not converge for this dataset. The best results are in bold and the second best in italics (‘better’ means significantly better according to a McNemar test of correct samples in clusters). The columns ‘sco’, ‘#bc’, ‘#g’ and ‘#s’ give the consensus score, the number of biclusters, their average number of genes and their average number of samples, respectively. RFN is the best method on two datasets and the second best on one.
Fig. 3. Example of an IBD segment matching the Neanderthal genome shared among Africans and Admixed Americans. The rows represent all individuals that have the IBD segment, and the columns represent consecutive SNVs. Major alleles are shown in yellow, minor alleles of tagSNVs in violet, and minor alleles of other SNVs in cyan. The row labeled model L indicates tagSNVs identified by RFN in violet. The rows Ancestor, Neanderthal and Denisova show bases of the respective genomes in violet if they match the minor allele of the tagSNVs (in yellow otherwise). For the Ancestor genome we used the reconstructed common ancestor sequence that was provided as part of the 1000 Genomes Project data.