Abstract
In high-dimensional multivariate regression problems, enforcing low rank in the coefficient matrix offers effective dimension reduction, which greatly facilitates parameter estimation and model interpretation. However, commonly used reduced-rank methods are sensitive to data corruption, as the low-rank dependence structure between response variables and predictors is easily distorted by outliers. We propose a robust reduced-rank regression approach for joint modelling and outlier detection. The problem is formulated as a regularized multivariate regression with a sparse mean-shift parameterization, which generalizes and unifies some popular robust multivariate methods. An efficient thresholding-based iterative procedure is developed for optimization. We show that the algorithm is guaranteed to converge and that the coordinatewise minimum point produced is statistically accurate under regularity conditions. Our theoretical investigations focus on non-asymptotic robust analysis, demonstrating that joint rank reduction and outlier detection leads to improved prediction accuracy. In particular, we show that redescending $\psi$-functions can essentially attain the minimax optimal error rate, and in some less challenging problems convex regularization guarantees the same low error rate. The performance of the proposed method is examined through simulation studies and real-data examples.
Keywords: Low-rank matrix approximation; Non-asymptotic analysis; Robust estimation; Sparsity
Year: 2017 PMID: 29430036 PMCID: PMC5793675 DOI: 10.1093/biomet/asx032
Source DB: PubMed Journal: Biometrika ISSN: 0006-3444 Impact factor: 2.445
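The abstract describes a multivariate regression with a low-rank coefficient matrix, a sparse mean-shift term that absorbs outlying samples, and a thresholding-based iterative fit. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the model Y = XB + C + E with rank(B) at most `rank` and only a few nonzero rows in C, identity weighting of the residuals, and row-wise hard thresholding as the thresholding rule. The function names, the threshold `lam`, and the stopping rule are all hypothetical choices made for illustration.

```python
import numpy as np


def reduced_rank_fit(X, Y, rank):
    """Rank-constrained least squares (identity weighting).

    Computes the ordinary least-squares fit, then projects it onto the
    leading right singular vectors of the fitted values.
    """
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V = Vt[:rank].T
    return B_ols @ V @ V.T


def row_hard_threshold(R, lam):
    """Keep rows whose Euclidean norm exceeds lam; set all other rows to zero."""
    norms = np.linalg.norm(R, axis=1)
    return R * (norms > lam)[:, None]


def robust_rrr(X, Y, rank, lam, n_iter=100, tol=1e-8):
    """Alternate between a reduced-rank fit and a row-sparse mean-shift update.

    Returns the coefficient estimate B, the mean-shift matrix C, and the
    indices of rows flagged as outliers (nonzero rows of C).
    """
    C = np.zeros_like(Y, dtype=float)
    for _ in range(n_iter):
        # Fit the low-rank coefficients on the outlier-adjusted responses.
        B = reduced_rank_fit(X, Y - C, rank)
        # Rows with large residual norm are absorbed into the mean shift.
        C_new = row_hard_threshold(Y - X @ B, lam)
        converged = np.linalg.norm(C_new - C) <= tol
        C = C_new
        if converged:
            break
    outliers = np.flatnonzero(np.linalg.norm(C, axis=1) > 0)
    return B, C, outliers
```

Calling `robust_rrr(X, Y, rank=2, lam=10.0)` on an n-by-p predictor matrix `X` and an n-by-m response matrix `Y` returns the low-rank coefficient estimate together with the indices of the flagged samples. Hard thresholding is used here because it is the simplest rule to state. Other thresholding rules, including convex ones such as row-wise soft thresholding, would slot into `row_hard_threshold` without changing the loop.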
Fig. 1. Arabidopsis thaliana data: outlier detection paths obtained by the robust reduced-rank regression. Sample 3 and sample 52 are captured as outliers, whose paths are shown as a dotted line and a dashed line, respectively. The path plot also suggests sample 27 as a potential outlier.
Fig. 2. Arabidopsis thaliana data: factor coefficients of the 62 response genes from plastoquinone, carotenoid, phytosterol, and chlorophyll pathways. From left to right the panels correspond to the top three factors estimated by the robust reduced-rank regression. For each of the three factors, two horizontal reference lines are plotted, and three vertical lines separate the genes into four different pathways.
Arabidopsis thaliana data: percentage of genes on each response pathway that show significance of a given factor, with the familywise error rate controlled at level
| Pathway | Number of genes | Factor 1 | Factor 2 | Factor 3 |
|---|---|---|---|---|
| Carotenoid | 11 | 55% | 73% | 9% |
| Phytosterol | 25 | 20% | 48% | 32% |
| Chlorophyll | 24 | 75% | 21% | 0% |