Computes a robust version of PCA (principal component analysis) on data.

function [U, cost] = robust_pca(X, d)

Given a matrix X of size p by n, such that each column represents a point in R^p, this computes U: an orthonormal basis of size p by d such that the column space of U captures the points X as well as possible. More precisely, the function attempts to compute U as the minimizer over the Grassmann manifold (the set of linear subspaces) of:

    f(U) = (1/n) Sum_{i = 1:n} dist(X(:, i), the space spanned by U)
         = (1/n) Sum_{i = 1:n} || U*U'*X(:, i) - X(:, i) ||

The output cost is the average distance achieved with the returned U. Notice that the norms are not squared, for robustness.

In practice, because this function is nonsmooth, it is smoothed with a pseudo-Huber loss function of parameter epsilon (noted e for short), and the smoothing parameter is iteratively reduced (with warm starts):

    f_e(U) = (1/n) Sum_{i = 1:n} l_e(|| U*U'*X(:, i) - X(:, i) ||)

with l_e(x) = sqrt(x^2 + e^2) - e (for e = 0, this is the absolute value). The intermediate optimization of the smooth cost over the Grassmann manifold is performed using the Manopt toolbox.

Ideally, the non-outlier data should be centered. If it is not, the following pre-processing centers all the data, but bear in mind that outliers shift the center of mass too:

    X = X - repmat(mean(X, 2), [1, size(X, 2)]);

There are no guarantees that this code returns the optimal U. It is distributed to illustrate one possible way of optimizing a nonsmooth cost function over a manifold, using Manopt with smoothing. For practical use, the constants in the code would need to be tuned.
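To see what the pseudo-Huber smoothing does, here is a small sketch in Python/NumPy (rather than MATLAB, purely for illustration; the function name is mine, not Manopt API). It evaluates l_e(x) = sqrt(x^2 + e^2) - e and checks that it stays within e of |x| everywhere, so shrinking e recovers the nonsmooth absolute value while keeping the loss smooth at 0:

```python
import numpy as np

def pseudo_huber(x, e):
    """Pseudo-Huber loss l_e(x) = sqrt(x^2 + e^2) - e."""
    return np.sqrt(x**2 + e**2) - e

x = np.linspace(-2.0, 2.0, 401)
for e in [1.0, 0.5, 0.1, 0.01]:
    # Deviation from |x| is largest for large |x| and never exceeds e,
    # since |x| <= sqrt(x^2 + e^2) <= |x| + e.
    gap = np.max(np.abs(pseudo_huber(x, e) - np.abs(x)))
    print(f"e = {e}: max deviation from |x| = {gap:.4f}")
```

The loss is exactly 0 at x = 0 for every e, which is why reducing e with warm starts is a natural graduated-smoothing scheme.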

- grassmannfactory  Returns a manifold struct to optimize over the space of vector subspaces.
- trustregions      Riemannian trust-regions solver for optimization on manifolds.
- multiprod         Multiplies 1-D or 2-D subarrays contained in two N-D arrays.
- multiscale        Multiplies the 2-D slices in a 3-D matrix by individual scalars.
- multitransp       Transposes arrays of matrices.
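For readers porting this example, the multiprod / multitransp / multiscale idioms are batched ("sliced") matrix operations. A hedged NumPy sketch of the same pattern (Python rather than MATLAB; note NumPy stacks matrices along the leading axis, whereas these MATLAB helpers stack along the third dimension; all variable names here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 5, 4, 2
A = rng.standard_normal((n, p, 1))   # stack of n column vectors (p x 1)
B = rng.standard_normal((n, d, 1))   # stack of n column vectors (d x 1)
w = rng.standard_normal(n)           # one scalar per slice

# multiprod(A, multitransp(B)): per-slice product A_i @ B_i'  -> (n, p, d)
prods = A @ np.transpose(B, (0, 2, 1))

# multiscale(w, prods): scale slice i by w[i]
scaled = w[:, None, None] * prods

# mean over the stack dimension, as in the gradient code of robust_pca
G = scaled.mean(axis=0)              # shape (p, d)
```

The `@` operator broadcasts over the leading axis, so no explicit loop over the n slices is needed.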

function [U, cost] = robust_pca(X, d)
% Computes a robust version of PCA (principal component analysis) on data.
%
% function [U, cost] = robust_pca(X, d)
%
% Given a matrix X of size p by n, such that each column represents a
% point in R^p, this computes U: an orthonormal basis of size p by d such
% that the column space of U captures the points X as well as possible.
% More precisely, the function attempts to compute U as the minimizer
% over the Grassmann manifold (the set of linear subspaces) of:
%
%    f(U) = (1/n) Sum_{i = 1:n} dist(X(:, i), the space spanned by U)
%         = (1/n) Sum_{i = 1:n} || U*U'*X(:, i) - X(:, i) ||
%
% The output cost represents the average distance achieved with the
% returned U. Notice that the norms are not squared, for robustness.
%
% In practice, because this function is nonsmooth, it is smoothed with a
% pseudo-Huber loss function of parameter epsilon (noted e for short), and
% the smoothing parameter is iteratively reduced (with warm starts):
%
%    f_e(U) = (1/n) Sum_{i = 1:n} l_e(|| U*U'*X(:, i) - X(:, i) ||)
%
% with l_e(x) = sqrt(x^2 + e^2) - e (for e = 0, this is the absolute value).
%
% The intermediate optimization of the smooth cost over the Grassmann
% manifold is performed using the Manopt toolbox.
%
% Ideally, the non-outlier data should be centered. If not, this
% pre-processing centers all the data, but bear in mind that outliers
% will shift the center of mass too:
%
%    X = X - repmat(mean(X, 2), [1, size(X, 2)]);
%
% There are no guarantees that this code will return the optimal U.
% This code is distributed to illustrate one possible way of optimizing
% a nonsmooth cost function over a manifold, using Manopt with smoothing.
% For practical use, the constants in the code would need to be tuned.

% This file is part of Manopt and is copyrighted. See the license file.
%
% Main authors: Nicolas Boumal and Teng Zhang, May 2, 2014
% Contributors:
%
% Change log:
%
%   March 4, 2015 (NB):
%       Uses a pseudo-Huber loss rather than a Huber loss: this has the
%       nice advantage of being smooth and simpler to code (no if's).
%
%   April 8, 2015 (NB):
%       Built-in test data for quick tests; added comment about centering.


    % If no inputs are provided, generate random data for illustration.
    if nargin == 0
        % Generate some data points aligned on a subspace
        X = rand(2, 1)*(1:30) + .05*randn(2, 30).*[(1:30) ; (1:30)];
        % And add some random outliers to the mix
        P = randperm(size(X, 2));
        outliers = 10;
        X(:, P(1:outliers)) = 30*randn(2, outliers);
        % Center the data
        % X = X - repmat(mean(X, 2), [1, size(X, 2)]);
        d = 1;
    end

    % Prepare a Manopt problem structure for optimization of the given
    % cost (defined below) over the Grassmann manifold.
    [p, n] = size(X);
    manifold = grassmannfactory(p, d);
    problem.M = manifold;
    problem.cost = @robustpca_cost;
    problem.egrad = @robustpca_gradient;

    % Do classical PCA for the initial guess.
    % This is just one idea: it is not necessarily useful or ideal.
    % Using a random initial guess, and starting over for a few different
    % ones, is probably much better. For this example, we keep it simple.
    [U, ~, ~] = svds(X, d);

    % Iteratively reduce the smoothing constant epsilon and optimize
    % the cost function over the Grassmann manifold.
    epsilon = 1;
    n_iterations = 6;
    reduction = .5;
    options.verbosity = 2;   % Change this number for more or less output
    warning('off', 'manopt:getHessian:approx');
    for iter = 1 : n_iterations
        U = trustregions(problem, U, options);
        epsilon = epsilon * reduction;
    end
    warning('on', 'manopt:getHessian:approx');

    % Return the cost as the actual average of distances, not smoothed.
    epsilon = 0;
    cost = robustpca_cost(U);

    % If working with the auto-generated input, plot the results.
    if nargin == 0
        scatter(X(1, :), X(2, :));
        hold on;
        plot(U(1)*[-1, 1]*100, U(2)*[-1, 1]*100, 'r');
        hold off;
        % Compare to a standard PCA
        [Upca, ~, ~] = svds(X, 1);
        hold on;
        plot(Upca(1)*[-1, 1]*100, Upca(2)*[-1, 1]*100, 'k');
        hold off;
        xlim(1.1*[min(X(1, :)), max(X(1, :))]);
        ylim(1.1*[min(X(2, :)), max(X(2, :))]);
        legend('data points', 'Robust PCA fit', 'Standard PCA fit');
    end

    % Smoothed cost
    function value = robustpca_cost(U)
        vecs = U*(U'*X) - X;
        sqnrms = sum(vecs.^2, 1);
        vals = sqrt(sqnrms + epsilon^2) - epsilon;
        value = mean(vals);
    end

    % Euclidean gradient of the smoothed cost (it will be transformed
    % into the Riemannian gradient automatically by Manopt).
    function G = robustpca_gradient(U)
        % Note that the computation of vecs and sqnrms is redundant with
        % their computation in the cost function. To speed up the code,
        % it would be wise to use the caching capabilities of Manopt
        % (the store structure); see the online documentation. It is not
        % done here to keep the code a bit simpler.
        UtX = U'*X;
        vecs = U*UtX - X;
        sqnrms = sum(vecs.^2, 1);
        % An explicit loop as below is a bit slow; the vectorized code
        % that follows computes the same gradient faster.
        % G = zeros(p, d);
        % for i = 1 : n
        %     G = G + (1/sqrt(sqnrms(i) + epsilon^2)) * vecs(:, i) * UtX(:, i)';
        % end
        % G = G/n;
        G = mean(multiscale(1./sqrt(sqnrms + epsilon^2), ...
                            multiprod(reshape(vecs, [p, 1, n]), ...
                                      multitransp(reshape(UtX, [d, 1, n])))), 3);
    end

end
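As a sanity check on the Euclidean gradient formula used above, here is a hedged NumPy re-implementation (Python rather than MATLAB; variable names are mine) of the smoothed cost and its gradient, compared against central finite differences. At an orthonormal U the term involving U'*vecs vanishes exactly (U'(UU'x - x) = 0 when U'U = I), so the simple formula agrees with the ambient derivative there:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, d, eps = 5, 20, 2, 0.5
X = rng.standard_normal((p, n))

def cost(U):
    """Smoothed cost f_e(U), as in robustpca_cost."""
    vecs = U @ (U.T @ X) - X               # residuals, p x n
    sqnrms = np.sum(vecs**2, axis=0)
    return np.mean(np.sqrt(sqnrms + eps**2) - eps)

def egrad(U):
    """Euclidean gradient, as in robustpca_gradient (vectorized form)."""
    UtX = U.T @ X
    vecs = U @ UtX - X
    sqnrms = np.sum(vecs**2, axis=0)
    w = 1.0 / np.sqrt(sqnrms + eps**2)     # per-point weights
    return (vecs * w) @ UtX.T / n          # mean of w_i * vecs_i * (U'x_i)'

# Compare against central finite differences at a random orthonormal U.
U, _ = np.linalg.qr(rng.standard_normal((p, d)))
G = egrad(U)
h = 1e-6
num = np.zeros_like(U)
for i in range(p):
    for j in range(d):
        E = np.zeros_like(U)
        E[i, j] = h
        num[i, j] = (cost(U + E) - cost(U - E)) / (2 * h)
print(np.max(np.abs(G - num)))   # should be tiny
```

This kind of check is what Manopt's own checkgradient tool automates; the loop here is the finite-difference analogue of the commented-out explicit gradient loop in the MATLAB listing.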

Generated on Tue 19-May-2020 18:46:12