An iterative greedy algorithm for sparsity-constrained optimization - Nguyen Ngoc Minh


AN ITERATIVE GREEDY ALGORITHM FOR SPARSITY-CONSTRAINED OPTIMIZATION

Nguyen Ngoc Minh (1*), Nguyen Nam (2)

Journal of Military Science and Technology (Tạp chí Nghiên cứu KH&CN quân sự), No. 40, 12-2015.

Abstract: This paper presents an iterative greedy algorithm, namely gradient matching pursuit (GradMP), to solve a general sparsity-constrained optimization problem. Our algorithm generalizes the idea of the CoSaMP algorithm to handle the non-quadratic structure of the objective function together with the general form of the sparse constraint. The algorithm can therefore handle a broader class of objective functions and sparsity constraints. In addition, our simulations on various problems demonstrate that our algorithm usually leads to performance comparable to, if not better than, the convex minimization approach.

Keywords: Matching pursuit, GradMP, CoSaMP.

1. INTRODUCTION

Over the past decade, the problem of recovering low-dimensional structured models from highly incomplete information has attracted increasing attention from researchers in fields as diverse as signal processing, machine learning and statistics. Work in compressed sensing has shown that a sparse signal can be precisely reconstructed from a small number of nonadaptive linear measurements, proportional to the signal's sparsity level multiplied by a logarithmic factor. In high-dimensional statistics, it has been demonstrated that a linear regression parameter can be statistically well estimated from a small set of observations, provided that the parameter vector is sufficiently sparse [1]. Recently, these ideas have been extended to other models such as low-rank structures, where it has been shown that a low-rank data matrix can be recovered from a number of samples proportional to the degrees of freedom of the matrix. In another direction, Gaussian graphical model estimation can be formulated as a sparse recovery problem with the log-likelihood loss, and it has been shown that the graph can be reconstructed provided that the underlying graph is sufficiently sparse.

A common theme of these problems is a low-dimensional structure imposed on the signal or the parameters to be estimated. In general, in order to recover the signal or parameter of interest, the problems mentioned above can be formulated as minimizing a loss function subject to a sparsity constraint:

$$\min_x \mathcal{L}(x) \quad \text{subject to} \quad \|x\|_{0,D} \le k, \qquad (1)$$

where $\|x\|_{0,D}$ is the norm which measures the "sparsity" of $x$ with respect to a given set $D$. For example, if the set $D$ consists of the $n$ standard basis vectors in $n$ dimensions, then $\|x\|_{0,D}$ is the $\ell_0$-norm of $x$, which counts the number of nonzero coefficients of $x$. In this optimization, $k$ is a user-defined parameter which controls the sparsity level of the estimator, and $\mathcal{L}(x)$ is the objective function which measures the consistency between the observations and the parameter to be estimated. In this paper, $\mathcal{L}(x)$ is a smooth, but not necessarily convex, function.

This estimation problem is sufficiently general to cover various problems of interest. For instance, in compressed sensing and sparse linear regression, we would like to recover a sparse signal $x^* \in \mathbb{R}^n$ from limited linear observations $y := Ax^* + e$, where $A$ is the $m \times n$ measurement matrix and $e$ is the noise vector. Naturally, the objective function $\mathcal{L}(x)$ is the quadratic loss $\|y - Ax\|_2^2$ and the constraint is the $\ell_0$-norm. Similarly, in matrix recovery, where one would like to recover a low-rank matrix $X^*$ from linear observations $y = \mathcal{A}(X^*) + e$, the loss function is again the quadratic $\|y - \mathcal{A}(X)\|_2^2$ and the constraint is now the matrix rank.
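To make the compressed-sensing instance concrete, the following minimal NumPy sketch sets up the loss and the sparsity measure of (1) with $D$ the standard basis (so $\|x\|_{0,D}$ is the usual $\ell_0$ count). All problem sizes and function names here are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 40, 100, 5                  # measurements, ambient dimension, sparsity

x_star = np.zeros(n)                  # k-sparse ground truth x*
x_star[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # m x n measurement matrix
e = 0.01 * rng.standard_normal(m)              # noise vector
y = A @ x_star + e                             # observations y = A x* + e

def loss(x):
    """Quadratic loss L(x) = ||y - A x||_2^2."""
    r = y - A @ x
    return float(r @ r)

def l0_norm(x, tol=1e-12):
    """||x||_{0,D} for D the standard basis: the number of nonzero entries."""
    return int(np.sum(np.abs(x) > tol))

print(loss(x_star), l0_norm(x_star))  # near-zero loss, exactly k nonzeros
```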
Recent years have witnessed a surge of algorithms proposed to solve (1) or variants of this optimization with specific loss functions and constraints. In general, these algorithms fall into two main approaches. The first approach recasts this nonconvex program into a convex counterpart by using the $\ell_1$-norm to relax its $\ell_0$ counterpart. In particular, if $\mathcal{L}(x)$ is a convex function and $\|x\|_{0,D}$ is the standard $\ell_0$-norm of $x$, the regularized convex relaxation of (1) is

$$\min_x \mathcal{L}(x) + \lambda \|x\|_1, \qquad (2)$$

where $\lambda$ is a positive regularization parameter and the $\ell_1$-norm $\|x\|_1$ is simply the sum of the absolute values of the coefficients of $x$ (we note that when $x$ is a matrix, the $\ell_0$-norm is replaced by the matrix rank and the $\ell_1$-norm of $x$ by the nuclear norm, the sum of the singular values of $x$). The optimization (2) is convex and can be solved efficiently by many proposed algorithms.
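As one concrete illustration of this first approach (not taken from the paper), the quadratic instance of (2) can be solved by iterative soft thresholding (ISTA), one of the many algorithms alluded to above. A minimal sketch, written with the conventional 1/2 factor on the loss; the step size and iteration count are illustrative assumptions:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1: coefficient-wise soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, y, lam, n_iter=500):
    """Minimize (1/2) * ||y - A x||_2^2 + lam * ||x||_1 by proximal gradient."""
    step = 1.0 / np.linalg.norm(A, ord=2) ** 2   # 1/L, L = Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                 # gradient of the smooth quadratic part
        x = soft_threshold(x - step * grad, step * lam)
    return x
```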
Alongside the $\ell_1$-minimization approach, iterative methods such as greedy pursuit [1], [2] and iterative hard thresholding [3] have been shown to be very efficient in recovering sparse signals from a very limited number of observations. These methods are usually easier to implement and in many cases computationally faster than their $\ell_1$ counterparts. The greedy methods were initially introduced under the names of matching pursuit (MP) [1] and orthogonal matching pursuit (OMP). While MP and OMP are simple and easy to implement, their lack of provable reconstruction quality might prevent them from being widely applied in practice. Recently, several researchers have developed extensions of the MP and OMP algorithms that significantly improve the recovery performance and provide optimal theoretical guarantees. Among them are compressive sampling matching pursuit (CoSaMP) [2] and subspace pursuit [4]. However, these algorithms only consider situations in which the loss function is quadratic.

In this paper, we propose an iterative greedy algorithm to solve the optimization (1). Our algorithm generalizes the idea of the CoSaMP algorithm to handle non-quadratic structures of the objective function as well as general forms of the constraint. Our contribution is to provide strong theoretical bounds for the recovery error as well as the linear convergence rate of the algorithm. Leveraging the notions of D-restricted strong convexity (D-RSC) and D-restricted strong smoothness (D-RSS) enforced on the loss function, we show that the $\ell_2$-norm error between the optimal solution $\hat{x}$ of (1) and the iterate $x^t$ obtained by the algorithm at the $t$-th iteration decays exponentially. These two properties, D-RSC and D-RSS, are analogous to the RSC and RSS conditions, which have been essential tools in establishing efficient estimation and fast convergence for convex programming [5], [6]. We apply this general theory to estimate the recovery error for various standard recovery problems such as compressed sensing, matrix/tensor recovery, and generalized linear regression. When the loss function is quadratic and the estimator has vector (matrix) form, we obtain a theoretical guarantee similar to that of CoSaMP [2] (ADMiRA [7]). In addition, our simulations on several problems in statistics demonstrate that our algorithm usually leads to comparable performance, if not better, than the convex minimization approach.

2. GRADIENT MATCHING PURSUIT ALGORITHM

We present in this section the algorithm to solve (1). The algorithm, called gradient matching pursuit (GradMP), is described in Algorithm 1. The name "gradient" comes from the first step of the algorithm: at each iteration, the signal support is estimated by taking the gradient of the loss function and choosing the support associated with the directions along which the gradient is largest. In general, the algorithm follows a flow similar to CoSaMP. However, we have made several changes to address the non-quadratic structure of the loss function and the general form of the sparse constraint. The algorithm can be seen as a generalization of CoSaMP [2] for sparse vector recovery and of ADMiRA [7] for low-rank matrix recovery to nonlinear settings. In particular, compared with CoSaMP and ADMiRA, GradMP differs in the proxy step: instead of taking the correlation between the sensing matrix and the error residual, we take the gradient of the loss function. In addition, in the identification step, the signal support is estimated by projecting the gradient onto the spaces spanned by subsets of the set D. In fact, when the loss function is quadratic and the set D consists of the standard basis, Algorithm 1 reduces to CoSaMP. Despite the algorithmic similarity, the analysis of this more general algorithm is considerably different, because we do not have the luxury of a closed-form solution of the restricted optimization in the estimation step, as the least-squares step provides in CoSaMP.

Intuitively, GradMP proceeds as follows: instead of solving the difficult nonconvex problem (1) directly, at each iteration GradMP looks for a subspace based on the previous estimate and then seeks a new solution by solving a sub-optimization problem in the estimation step. This subproblem is often much easier to optimize due to its low-dimensional structure. In particular, the optimization at the estimation step can be written as

$$\min_{\alpha \in \mathbb{R}^{|\hat\Gamma|}} \mathcal{L}(D_{\hat\Gamma} \alpha).$$

Provided $k$ is considerably smaller than the ambient dimension $n$, this optimization is over a space of dimension at most $3k$ (since $|\hat\Gamma| \le 2k + k$), as opposed to $n$ in the original problem (1). In addition, in many practical problems in statistics, although $\mathcal{L}(x)$ is nonconvex, it is often convex on a restricted set.

Throughout the paper, we denote by $\hat{x}$ the feasible solution of the minimization (1):

$$\hat{x} = \arg\min_x \mathcal{L}(x) \quad \text{subject to} \quad \|x\|_{0,D} \le k. \qquad (3)$$

We take $T$ to be the support set of $\hat{x}$ with respect to D; that is, $\hat{x}$ can be decomposed as $\hat{x} = \sum_{i \in T} \alpha_i d_i$. For a given vector $x$, let $P_{D_T} x$ be the orthogonal projection of $x$ onto the space spanned by the columns of the matrix $D_T$, defined by

$$P_{D_T} x = D_T \arg\min_u \|x - D_T u\|_2, \qquad (4)$$

where the norm is the $\ell_2$-norm for vectors or the Frobenius norm for matrices. We also denote by $C_{D_\Gamma}$ the space spanned by the columns of $D_\Gamma$:

$$C_{D_\Gamma} = \{x : x = D_\Gamma \alpha, \ \alpha \in \mathbb{R}^{|\Gamma|}\}. \qquad (5)$$

It is clear that $C_{D_\Gamma}$ is a convex set.

Algorithm 1. GradMP algorithm
1. input: $k$ and stopping criterion
2. initialize: $\Lambda = \emptyset$, $x^0 = 0$, and $t = 0$
3. repeat
     proxy: $u = \nabla \mathcal{L}(x^t)$
     identify: $\Gamma = \arg\max_{\Omega : |\Omega| \le 2k} \|P_{D_\Omega} u\|_2$
     merge: $\hat\Gamma = \Gamma \cup \Lambda$
     estimate: $b = \arg\min_x \mathcal{L}(x)$ s.t. $x \in C_{D_{\hat\Gamma}}$
     prune: $\Lambda = \arg\max_{\Omega : |\Omega| \le k} \|P_{D_\Omega} b\|_2$
     update: $x^{t+1} = P_{D_\Lambda} b$; $t = t + 1$
4. until: halting criterion true
5. output: $x^t$
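To make the flow of Algorithm 1 concrete, here is a minimal sketch for the special case noted above: quadratic loss and D the standard basis, where GradMP reduces to CoSaMP and the estimation step is a plain least-squares restricted to the merged support. The function name and iteration count are our illustrative assumptions:

```python
import numpy as np

def gradmp_quadratic(A, y, k, n_iter=30):
    """Sketch of Algorithm 1 for L(x) = ||y - A x||_2^2 with D the standard
    basis; in this case GradMP reduces to CoSaMP and the estimation step is
    an ordinary restricted least-squares."""
    n = A.shape[1]
    x = np.zeros(n)
    Lam = np.array([], dtype=int)                 # pruned support Lambda
    for _ in range(n_iter):
        u = 2.0 * A.T @ (A @ x - y)               # proxy: gradient of the loss
        Gamma = np.argsort(np.abs(u))[-2 * k:]    # identify: 2k largest |gradient| entries
        merged = np.union1d(Gamma, Lam)           # merge: Gamma united with Lambda
        b = np.zeros(n)                           # estimate: min L(x) over the merged support
        b[merged] = np.linalg.lstsq(A[:, merged], y, rcond=None)[0]
        Lam = np.argsort(np.abs(b))[-k:]          # prune: keep the k largest entries of b
        x = np.zeros(n)
        x[Lam] = b[Lam]                           # update: x^{t+1} = P_{D_Lambda} b
    return x

# Usage with the compressed-sensing data from the earlier sketch (A, y, k):
# x_hat = gradmp_quadratic(A, y, k); print(np.linalg.norm(x_hat - x_star))
```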
3. SIMULATION AND RESULTS

3.1. Covariance matrix estimation

In this section, we apply the proposed algorithm to some important problems in statistics in which the underlying loss functions are generally not quadratic. Our first consideration is covariance matrix estimation. We demonstrate via extensive simulations that GradMP is a promising algorithm for this problem.

Covariance matrix estimation is a long-standing problem in statistics (e.g. [8]). Given a Gaussian random vector $x = (x^{(1)}, \dots, x^{(p)})^T \in \mathbb{R}^p$ with mean $\mu$ and unknown covariance $\Sigma$, the goal is to estimate $\Sigma$ from $n$ random samples $\{x_i\}_{i=1}^n$ of $x$. If we denote a Gaussian graphical model by $G = (V, E)$, where the $i$-th vertex is associated with the random variable $x^{(i)}$, then the structure of the graph is encoded in the covariance matrix. In particular, there is no edge between vertices $i$ and $j$ if the $(i, j)$ entry of the concentration matrix $\Sigma^{-1}$ is zero. Using maximum likelihood estimation, we need to solve the optimization

$$\widehat{\Sigma^{-1}} = \arg\max_{\Theta \succ 0} \ \log\det(\Theta) - \mathrm{trace}(S\Theta), \qquad (6)$$

where $S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$ is the empirical covariance matrix. Without imposing any constraint on the covariance matrix, the solution of (6) might not be consistent, especially in scenarios in which the number of observations $n$ is significantly less than the data dimension $p$. Notice that for a multivariate Gaussian distribution, if the $(i, j)$ component of $\Sigma^{-1}$ is zero, then the two variables $x^{(i)}$ and $x^{(j)}$ are conditionally independent given the other variables. Thus, it makes sense to enforce sparsity on the concentration matrix $\Theta = \Sigma^{-1}$ via the constrained problem

$$\hat{\Theta} = \arg\min_{\Theta \succ 0} \ \mathrm{trace}(S\Theta) - \log\det(\Theta) \quad \text{s.t.} \quad \|\Theta\|_{0,\mathrm{off}} \le k, \qquad (7)$$

where $\|\Theta\|_{0,\mathrm{off}}$ denotes the number of off-diagonal nonzero entries of $\Theta$. Over the last few years, there has been an extensive amount of work on recasting this problem as a convex program (e.g. [9], [10]) and on establishing strong statistical guarantees for the resulting estimator (e.g. [11], [12]). In [9], Yuan and Lin proposed to solve (7) by using an $\ell_1$ penalty:

$$\hat{\Theta} = \arg\min_{\Theta \succ 0} \ \mathrm{trace}(S\Theta) - \log\det(\Theta) + \lambda \|\Theta\|_{1,\mathrm{off}}, \qquad (8)$$

where the off-diagonal $\ell_1$ regularizer is $\|\Theta\|_{1,\mathrm{off}} = \sum_{i \ne j} |\Theta_{ij}|$. Based on the work of Banerjee [13], Friedman [10] developed a very efficient coordinate descent algorithm to solve (8). On the theoretical side, Rothman [11] provided an upper bound on the $\ell_2$-norm of the estimation error. Under some tight conditions on the population covariance matrix, Ravikumar [12] showed that (8) can recover the exact graph structure when the number of observations is of the order $O(k^2 \log p)$, where $k$ is the maximum number of edges at any vertex. Very recently, leveraging the forward-backward greedy algorithm of [14], Johnson [15] proposed an efficient greedy approach to solve (7). They showed that, under a weak assumption on $\Sigma$, only $O(k \log p)$ samples are necessary to recover the exact graph structure. This is a significant improvement over previous algorithms and results in the literature. Despite all these advantages, the computational complexity of the algorithm is considerable, because at each iteration it must minimize $\mathrm{trace}(S\Theta) - \log\det(\Theta)$ over a fixed support set of $\Theta$ twice. Furthermore, at each iteration, the algorithm has to solve two other minimizations to select the best entry to add to, or the unnecessary entry to remove from, the current estimate.

In this section, we propose to solve (7) using our GradMP algorithm. Denote $\mathcal{L}(\Theta) = \mathrm{trace}(S\Theta) - \log\det(\Theta)$. It is clear that $\nabla \mathcal{L}(\Theta) = S - \Theta^{-1}$.
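These two ingredients translate directly into code. A minimal sketch, with function names of our own choosing:

```python
import numpy as np

def empirical_covariance(X):
    """S = (1/n) * sum_i (x_i - xbar)(x_i - xbar)^T for the rows x_i of X."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

def neg_log_likelihood(Theta, S):
    """L(Theta) = trace(S Theta) - log det(Theta), for positive definite Theta."""
    sign, logdet = np.linalg.slogdet(Theta)
    if sign <= 0:
        raise ValueError("Theta must be positive definite")
    return np.trace(S @ Theta) - logdet

def grad_neg_log_likelihood(Theta, S):
    """Gradient of L at Theta: S - Theta^{-1}, used in the proxy step of Algorithm 2."""
    return S - np.linalg.inv(Theta)
```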
In Algorithm 2 below, $U_{\mathrm{off}}$ denotes the matrix $U$ whose diagonal entries are set to zero; similar notation is used for $B_{\mathrm{off}}$. Furthermore, in the identification step, $\mathrm{supp}_{2k}(U_{\mathrm{off}})$ is the operator that returns the set $\Gamma$ containing the locations of the $2k$ largest-magnitude entries of the matrix $U_{\mathrm{off}}$. As can be seen, at each iteration our GradMP requires only one minimization, in the estimation step; thus, the computational complexity of our algorithm is significantly less than that of [15]. The optimization in the estimation step is

$$\min_{\Theta \succ 0} \ \mathrm{trace}(S\Theta) - \log\det(\Theta) \quad \text{s.t.} \quad \mathrm{supp}(\Theta) \subseteq \hat\Gamma, \qquad (9)$$

which can be seen as a restricted version of (7). However, this optimization is essentially easier to solve since the support $\hat\Gamma$ of $\Theta$ is already fixed. An algorithm for the minimization (9) can be found in Section 17.3.1 of [8].

Algorithm 2. Covariance matrix estimation algorithm
  initialize: $\Theta^0$, $\Lambda = \emptyset$, and $t = 1$
  repeat
    proxy: $U = S - (\Theta^{t-1})^{-1}$
    identify: $\Gamma = \mathrm{supp}_{2k}(U_{\mathrm{off}})$
    merge: $\hat\Gamma = \Gamma \cup \Lambda$
    estimate: $B = \arg\min_\Theta \mathcal{L}(\Theta)$ subject to $\mathrm{supp}(\Theta) \subseteq \hat\Gamma$
    prune: $\Lambda = \mathrm{supp}_k(B_{\mathrm{off}})$
    update: $\Theta^t_\Lambda = B_\Lambda$, $\Theta^t_{\Lambda^c} = 0$; $t = t + 1$
  until halting criterion true
  output: $\hat\Theta = \Theta^t$
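A compact NumPy rendering of Algorithm 2 follows. This is a sketch under stated assumptions: the restricted problem (9) is handled by a crude backtracking projected-gradient stand-in rather than the coordinate solver of [8], Section 17.3.1; the diagonal is always kept in the support; and the pruned iterate is assumed to remain positive definite:

```python
import numpy as np

def top_off_support(U, k):
    """supp_k(U_off): index pairs (i, j), i < j, of the k largest |off-diagonal| entries."""
    iu, ju = np.triu_indices_from(U, k=1)
    order = np.argsort(np.abs(U[iu, ju]))[::-1][:k]
    return [(int(iu[t]), int(ju[t])) for t in order]

def solve_restricted(S, mask, n_steps=200, step=0.1):
    """Stand-in for (9): backtracking projected-gradient descent on
    trace(S Theta) - log det(Theta) with support fixed by the boolean `mask`."""
    Theta = np.eye(S.shape[0])
    for _ in range(n_steps):
        G = (S - np.linalg.inv(Theta)) * mask      # gradient, zeroed off the support
        t, cand = step, Theta - step * G
        while np.any(np.linalg.eigvalsh((cand + cand.T) / 2) <= 0):
            t /= 2                                 # backtrack to stay positive definite
            cand = Theta - t * G
        Theta = (cand + cand.T) / 2
    return Theta

def gradmp_covariance(S, k, n_iter=15):
    """Sketch of Algorithm 2; the diagonal of Theta is always kept free."""
    p = S.shape[0]
    Theta, Lam = np.eye(p), []
    for _ in range(n_iter):
        U = S - np.linalg.inv(Theta)               # proxy: gradient of L
        Gamma = top_off_support(U, 2 * k)          # identify: supp_{2k}(U_off)
        mask = np.eye(p, dtype=bool)               # merge: Gamma united with Lambda
        for i, j in set(Gamma) | set(Lam):
            mask[i, j] = mask[j, i] = True
        B = solve_restricted(S, mask)              # estimate: restricted problem (9)
        Lam = top_off_support(B, k)                # prune: supp_k(B_off)
        keep = np.eye(p, dtype=bool)
        for i, j in Lam:
            keep[i, j] = keep[j, i] = True
        Theta = np.where(keep, B, 0.0)             # update: keep B on Lambda and the diagonal
    return Theta
```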
3.2. Results

In this section, we compare the graph recovery performance of our GradMP algorithm with the algorithm proposed in [15]. As shown in the experiments of [15], the greedy algorithm significantly outperforms the $\ell_1$ approach in estimating Gaussian graph structures. In particular, the authors of [15] demonstrated that the probability of success in recovering the graph with the forward-backward greedy approach is much higher than that of its $\ell_1$ counterpart. In the following experiments, we show that our GradMP offers the same, if not better, graph recoverability while the computational complexity is substantially reduced.

We conduct experiments similar to those in [15], in which three different graph structures are studied: chain, star and grid structures, as illustrated in Fig. 1. For these graphs, the number of nodes varies over $p \in \{36, 64, 100\}$; the maximum node degree is set to $k = 2$ for the chain graph, and to $k = 4$ and $k = 0.1p$ for the grid and star graphs, respectively. We varied the number of samples via a scaling parameter $x \in (0, 3)$, with $n = 70\,x\,k \log p$. For each fixed $n$ and $p$, we perform 50 trials, and in each trial we declare success if the algorithm recovers the exact graph edges. A more detailed description of the experimental setup and of the construction of the covariance matrices can be found in [15].

As demonstrated in [15], the greedy forward-backward (Fo-Ba) algorithm outperforms the graphical lasso (glasso), which implements the optimization (8), in terms of the sample size needed for perfect edge recovery. In this simulation, we show that our proposed GradMP has similar performance, if not better, than the Fo-Ba algorithm while its computational complexity is significantly smaller.

Fig. 2 compares our proposed GradMP algorithm with the greedy forward-backward (Fo-Ba) algorithm investigated in [15]. The figures in the left column show the probability of success with respect to the scaling parameter $x$ controlling the sample size $n$; the figures in the right column show the average computational time used to estimate the graph over the 50 trials. As can be seen from Fig. 2, our proposed GradMP attains recoverability similar to the greedy Fo-Ba algorithm for the chain and grid graphs, while GradMP performs better on the star graph. Furthermore, our proposed algorithm requires significantly less computation. For all three graph structures, the computational time of GradMP does not increase considerably with the growth of the sample size $n$ and the graph size $p$, whereas it increases significantly for the algorithm in [15].

Figure 1. Graph structures. Left: chain graph, middle: star graph, and right: grid graph.

Figure 2. Probability of success and computational complexity. Top: chain graph, middle: star graph, and bottom: grid graph.

4. CONCLUSION

In this paper, an iterative greedy algorithm, namely gradient matching pursuit (GradMP), has been presented and analyzed. This algorithm can be used to solve a general sparsity-constrained optimization problem. Our algorithm generalizes the idea of the CoSaMP algorithm to handle the non-quadratic structure of the objective function together with the general form of the sparse constraint. Simulations on several problems in statistics demonstrate that our algorithm usually leads to comparable performance, if not better, than the convex minimization approach. The algorithm can also be extended for use in various other fields.

REFERENCES

[1] S. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Process., 41(12):3397-3415, Dec. 1993.
[2] D. Needell and J. A. Tropp, "CoSaMP: Iterative signal recovery from incomplete and inaccurate samples," Appl. Comput. Harmon. Anal., 26:301-321, 2008.
[3] T. Blumensath and M. E. Davies, "Iterative hard thresholding for compressed sensing," Appl. Comput. Harmon. Anal., 27(3):265-274, 2009.
[4] W. Dai and O. Milenkovic, "Subspace pursuit for compressive sensing signal reconstruction," IEEE Trans. Inf. Theory, 55(5):2230-2249, May 2009.
[5] A. Agarwal, S. Negahban, and M. J. Wainwright, "Fast global convergence of gradient methods for high-dimensional statistical recovery," Ann. Statist., 40(5):2452-2482, 2012.
[6] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu, "A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers," Statist. Sci., 27(4):538-557, Dec. 2012.
[7] K. Lee and Y. Bresler, "ADMiRA: Atomic decomposition for minimum rank approximation," IEEE Trans. Inf. Theory, 56(9):4402-4416, 2010.
[8] T. Hastie, R. Tibshirani, and J. Friedman, "The Elements of Statistical Learning: Data Mining, Inference, and Prediction," Springer, 2nd edition, 2009.
[9] M. Yuan and Y. Lin, "Model selection and estimation in the Gaussian graphical model," Biometrika, 94(1):19-35, 2007.
[10] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, 9(3):432-441, 2008.
[11] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu, "Sparse permutation invariant covariance estimation," Electron. J. Statist., 2:494-515, 2008.
[12] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu, "High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence," Electron. J. Statist., 5:935-980, 2011.
[13] O. Banerjee, L. El Ghaoui, and A. d'Aspremont, "Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data," J. Mach. Learn. Res., 9:485-516, 2008.
[14] T. Zhang, "Adaptive forward-backward greedy algorithm for learning sparse representations," IEEE Trans. Inf. Theory, 57(7):4689-4708, July 2011.
[15] C. C. Johnson, A. Jalali, and P. Ravikumar, "High-dimensional sparse inverse covariance estimation using greedy methods," in Proc. 15th Int. Conf. Artificial Intelligence and Statistics (AISTATS), La Palma, Canary Islands, Apr. 2012.

SUMMARY

AN ITERATIVE GREEDY ALGORITHM FOR SPARSITY-CONSTRAINED OPTIMIZATION

This paper presents an iterative greedy algorithm, namely gradient matching pursuit (GradMP), to solve a general sparsity-constrained optimization problem. Our algorithm generalizes the idea of the CoSaMP algorithm to handle the non-quadratic structure of the objective function together with the general form of the sparse constraint. The algorithm can therefore handle a broader class of objective functions and sparsity constraints. In addition, our simulations on various problems also show that our algorithm usually achieves performance comparable to, if not better than, the convex minimization approach.

Keywords: Matching pursuit, GradMP, CoSaMP.

Received 29 September 2015
Revised 13 November 2015
Accepted for publication 25 December 2015

Address: (1) Department of Electronic Engineering, Posts and Telecommunications Institute of Technology, 122 Hoang Quoc Viet, Hanoi, Vietnam; (2) Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, United States.
*Email: minhnn@ptit.edu.vn
