Xihui Lin, Paul C Boutros.
Abstract
BACKGROUND: Non-negative matrix factorization (NMF) is a technique widely used in various fields, including artificial intelligence (AI), signal processing and bioinformatics. However, existing algorithms and R packages cannot be applied to large matrices because of their slow convergence, nor to matrices with missing entries. In addition, most NMF research focuses only on blind decompositions, that is, decompositions that do not use prior knowledge. Finally, the lack of well-validated methodology for choosing the rank hyperparameters also raises concerns about derived results.
Keywords: Deconvolution; Imputation; Non-negative matrix factorization
Year: 2020 PMID: 31906867 PMCID: PMC6945623 DOI: 10.1186/s12859-019-3312-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Comparison of different algorithms in convergence
Comparing performance of different algorithms on a subset of a non-small cell lung cancer dataset, with k=15
| | SCD-MSE | LEE-MSE | LEE-MSE-1 | SCD-MKL | LEE-MKL |
|---|---|---|---|---|---|
| MSE | 0.155 | 0.1565 | 0.1557 | 0.1574 | 0.1579 |
| MKL | 0.01141 | 0.01149 | 0.01145 | 0.01119 | 0.01122 |
| Rel. tol. | 1.325e-05 | 0.0001381 | 0.000129 | 6.452e-08 | 9.739e-05 |
| Total epochs | 5000 | 5000 | 5000 | 5000 | 5000 |
| Time (Sec.) | 1.305 | 1.35 | 8.456 | 49.17 | 41.11 |
MSE = mean square error; MKL = mean KL divergence; Rel. tol. = relative tolerance; Time = elapsed (actual running) time. SCD-MSE = SCD algorithm with MSE loss and 50 inner iterations; LEE-MSE-1 = Lee’s algorithm with MSE loss and 1 inner iteration, i.e., the original multiplicative algorithm.
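The notes above describe LEE-MSE-1 as the original multiplicative algorithm and report MSE and MKL as the two losses. As a rough illustration only, the following Python sketch implements Lee-and-Seung-style multiplicative updates for the MSE loss, together with the two metrics. The function names, the random initialization, the `eps` guard and the use of the generalized KL form for MKL are assumptions made for this sketch; they are not taken from the paper or its R package, and the SCD algorithm is not shown.

```python
import numpy as np

def mse(a, a_hat):
    """Mean square error over all entries."""
    return float(np.mean((np.asarray(a) - np.asarray(a_hat)) ** 2))

def mkl(a, a_hat, eps=1e-16):
    """Mean KL divergence, assumed here to be the generalized form
    mean(a * log(a / a_hat) - a + a_hat)."""
    a = np.asarray(a, dtype=float)
    a_hat = np.asarray(a_hat, dtype=float)
    # Where a == 0 the term a * log(a / a_hat) is taken as 0.
    ratio = np.where(a > 0, a / (a_hat + eps), 1.0)
    return float(np.mean(a * np.log(ratio) - a + a_hat))

def nmf_multiplicative(a, k, n_iter=200, seed=0, eps=1e-16):
    """Factor a non-negative matrix a (n x m) as w @ h with rank k,
    using multiplicative updates for the squared-error loss.
    Hypothetical helper; initialization and eps are assumptions."""
    rng = np.random.default_rng(seed)
    n, m = a.shape
    w = rng.random((n, k)) + 0.1
    h = rng.random((k, m)) + 0.1
    for _ in range(n_iter):
        # One update of h, then one update of w, per iteration.
        h *= (w.T @ a) / (w.T @ w @ h + eps)
        w *= (a @ h.T) / (w @ h @ h.T + eps)
    return w, h
```

Calling `nmf_multiplicative(a, k=15)` on a non-negative array `a` returns the two factors; `mse(a, w @ h)` and `mkl(a, w @ h)` then give the two rows reported in the table for a given fit.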
Fig. 2 Comparison of imputation methods. k=2 is used for NMF
A comparison of different imputation methods
| | Baseline | Medians | MICE | MissForest | NMF |
|---|---|---|---|---|---|
| MSE | 4.4272 | 0.5229 | 0.9950 | 0.4175 | 0.4191 |
| MKL | 0.3166 | 0.0389 | 0.0688 | 0.0298 | 0.0301 |
| Time (Sec.) | 0.0000 | 0.0000 | 90.2670 | 42.4010 | 0.1400 |
Imputations on a subset of NSCLC microarray data comprising 200 genes and 100 samples. 30% of the entries are randomly deleted, i.e., treated as missing. MSE = mean square error; MKL = mean KL-divergence distance; Time = user time.
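One common way to impute with NMF is to fit the factorization on the observed entries only and read the missing values off the reconstruction. The Python sketch below does this with masked multiplicative updates for the MSE loss. This is a swapped-in technique chosen for brevity, not necessarily the algorithm benchmarked in the table; `nmf_impute`, its defaults and the NaN encoding of missing entries are assumptions of the sketch.

```python
import numpy as np

def nmf_impute(a, k, n_iter=500, seed=0, eps=1e-16):
    """Fill NaN entries of a non-negative matrix using a rank-k NMF
    fitted on the observed entries only (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    mask = ~np.isnan(a)              # True where an entry is observed
    a0 = np.where(mask, a, 0.0)      # missing entries contribute nothing
    rng = np.random.default_rng(seed)
    w = rng.random((a.shape[0], k)) + 0.1
    h = rng.random((k, a.shape[1])) + 0.1
    for _ in range(n_iter):
        # Masked multiplicative updates: the reconstruction is zeroed
        # at missing positions so they do not enter the loss.
        h *= (w.T @ a0) / (w.T @ (mask * (w @ h)) + eps)
        w *= (a0 @ h.T) / ((mask * (w @ h)) @ h.T + eps)
    # Keep observed values, use the reconstruction elsewhere.
    filled = np.where(mask, a, w @ h)
    return filled, w, h
```

To mirror the setup in the table note, one would set roughly 30% of the entries of a complete matrix to `np.nan`, call `nmf_impute(a, k=2)`, and compare the filled values with the deleted originals.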
Fig. 3 Determine optimal rank k in NMF using imputation
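The idea of choosing the rank through imputation can be sketched as follows: hide a random subset of entries, fit NMF on the remaining entries for each candidate k, and prefer the k whose reconstruction best predicts the hidden entries. The Python code below is a minimal illustration under stated assumptions; `fit_masked_nmf`, `choose_rank`, the 30% hold-out fraction and the masked multiplicative updates are hypothetical choices for this sketch and may differ from the procedure behind the figure.

```python
import numpy as np

def fit_masked_nmf(a0, mask, k, n_iter=500, seed=0, eps=1e-16):
    """Rank-k NMF fitted on entries where mask is True; returns w @ h.
    Hypothetical helper using masked multiplicative updates (MSE loss)."""
    rng = np.random.default_rng(seed)
    w = rng.random((a0.shape[0], k)) + 0.1
    h = rng.random((k, a0.shape[1])) + 0.1
    for _ in range(n_iter):
        h *= (w.T @ a0) / (w.T @ (mask * (w @ h)) + eps)
        w *= (a0 @ h.T) / ((mask * (w @ h)) @ h.T + eps)
    return w @ h

def choose_rank(a, ranks, holdout=0.3, seed=0):
    """Return (best_k, errors), where errors[k] is the MSE on held-out
    entries for each candidate rank k."""
    a = np.asarray(a, dtype=float)
    rng = np.random.default_rng(seed)
    held = rng.random(a.shape) < holdout   # entries hidden from the fit
    mask = ~held
    a0 = np.where(mask, a, 0.0)
    errors = {}
    for k in ranks:
        a_hat = fit_masked_nmf(a0, mask, k, seed=seed)
        errors[k] = float(np.mean((a[held] - a_hat[held]) ** 2))
    best = min(errors, key=errors.get)
    return best, errors
```

A call such as `choose_rank(a, ranks=range(1, 8))` returns the selected rank and the per-rank held-out errors. Because a single random hold-out can be noisy, repeating the procedure over several seeds and averaging the errors may give a more stable choice.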
Fig. 4 Comparing NMF to ISOpure [16] for tumour content deconvolution