where n_c = Σ_{i=1}^n Z_{ic} and μ_c = (1/n_c) Σ_{i=1}^n Z_{ic} ϕ(x_i) are the size and centroid of the c-th cluster.
The optimization problem in Eq.(1) can be rewritten as the following matrix-vector form,
min_Z Tr(K) − Tr(L^(1/2) Z^⊤ K Z L^(1/2))  s.t.  Z ∈ {0, 1}^{n×k},  Z 1_k = 1_n,   (2)
where K is a kernel matrix with K_ij = ϕ(x_i)^⊤ ϕ(x_j), L = diag([n_1^(−1), ⋯, n_k^(−1)]), and 1_ℓ is a column vector of length ℓ with all elements being 1.
The variable Z in Eq.(2) is discrete, and this makes the optimization problem difficult to solve. A common approach is to relax Z to take real values. Specifically, by defining H = Z L^(1/2) and letting H take real values, a relaxed version of the above problem can be obtained as
min_{H ∈ R^{n×k}} Tr(K (I_n − H H^⊤))  s.t.  H^⊤ H = I_k,   (3)
where I_k is an identity matrix of size k × k. The optimal H for Eq.(3) can be obtained by taking the k eigenvectors corresponding to the k largest eigenvalues of K (Jegelka et al. 2009).
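For concreteness, the following is a minimal sketch (not the authors' code) of this relaxed kernel k-means step: H stacks the k eigenvectors of K associated with its k largest eigenvalues, after which discrete labels can be recovered by running standard k-means on the rows of H. All function names are illustrative.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def kernel_kmeans_H(K, k):
    """Relaxed kernel k-means (Eq.(3)): H = eigenvectors of K for the k largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(K)   # eigenvalues returned in ascending order
    return eigvecs[:, -k:]

def discretize(H, k):
    """Recover discrete cluster labels by running k-means on the rows of H."""
    _, labels = kmeans2(H, k, minit='++')
    return labels
```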
In a multiple kernel setting, each sample has multiple feature representations defined by a group of feature mappings {ϕ_p(·)}_{p=1}^m. Specifically, each sample is represented as ϕ_β(x) = [β_1 ϕ_1(x)^⊤, ⋯, β_m ϕ_m(x)^⊤]^⊤, where β = [β_1, ⋯, β_m]^⊤ consists of the coefficients of the m base kernels. These coefficients will be optimized during learning. Based on the definition of ϕ_β(x), a kernel function can be expressed as
κ_β(x_i, x_j) = ϕ_β(x_i)^⊤ ϕ_β(x_j) = Σ_{p=1}^m β_p^2 κ_p(x_i, x_j).   (4)
By replacing the kernel matrix K in Eq.(3) with K_β computed via Eq.(4), the objective of MKKM can be written as
min_{H, β} Tr(K_β (I_n − H H^⊤))  s.t.  H ∈ R^{n×k},  H^⊤ H = I_k,  β^⊤ 1_m = 1,  β_p ≥ 0, ∀p.   (5)
This problem can be solved by alternately updating H and β: i) Optimizing H given β. With the kernel coefficients β fixed, H can be obtained by solving the kernel k-means clustering problem in Eq.(3); ii) Optimizing β given H. With H fixed, β can be optimized via solving the following quadratic programming with linear constraints,
min_β Σ_{p=1}^m β_p^2 Tr(K_p (I_n − H H^⊤))  s.t.  β^⊤ 1_m = 1,  β_p ≥ 0, ∀p.   (6)
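A small sketch of this alternation is given below (not the authors' released code). It assumes the QP in Eq.(6) is solved with its simplex-constrained closed form β_p ∝ 1 / Tr(K_p (I_n − H H^⊤)), which holds whenever each trace term is positive; function names are illustrative.

```python
import numpy as np

def combine_kernels(kernels, beta):
    """K_beta = sum_p beta_p^2 * K_p, as in Eq.(4)."""
    return sum(b ** 2 * K for b, K in zip(beta, kernels))

def update_H(K_beta, k):
    """Step i): relaxed kernel k-means; H holds the top-k eigenvectors of K_beta."""
    _, eigvecs = np.linalg.eigh(K_beta)
    return eigvecs[:, -k:]

def update_beta(kernels, H):
    """Step ii): closed-form solution of the simplex-constrained QP in Eq.(6)."""
    U = np.eye(H.shape[0]) - H @ H.T
    c = np.array([np.trace(K @ U) for K in kernels])   # c_p = Tr(K_p (I - H H^T)) >= 0
    inv = 1.0 / np.maximum(c, 1e-12)
    return inv / inv.sum()                             # beta_p proportional to 1 / c_p
```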
As noted in (Yu et al. 2012; Gönen and Margolin 2014), replacing K_β in Eq.(5) with a convex combination of kernels is not a viable option, because it could cause only a single kernel to be activated while all the others are assigned zero weights. Other recent work using the ℓ_2-norm combination can be found in (Kloft et al. 2011; 2009; Cortes, Mohri, and Rostamizadeh 2009; Liu et al. 2013).
Let s_p (1 ≤ p ≤ m) denote the sample indices for which the p-th view is present, and let K_p^(cc) denote the kernel sub-matrix computed with these samples. Note that this setting is consistent with the literature, and it is even more general because it does not require any view to be complete across all the samples, as assumed in (Trivedi et al. 2010).
The absence of rows and columns from base kernels makes clustering challenging. Existing two-stage approaches first impute these base kernels and then apply a conventional clustering algorithm to them. We make the following two arguments. Firstly, although such imputation is sound from a general-purpose perspective, it may not be optimal when it is already known that the imputed kernels will be used for clustering. For most, if not all, practical tasks it is believed that the pre-selected base kernels or views (in their complete form) should, more or less, be able to serve clustering. However, this belief is not exploited by two-stage approaches as prior knowledge to guide the imputation process. Secondly, since the ultimate goal is to appropriately cluster the data, we should directly pursue the clustering result, treating the absent kernel elements as auxiliary unknowns during this process. In other words, the imputed kernels can be viewed merely as by-products of clustering.
These two arguments motivate us to seek a more natural and reasonable way to deal with absence in multiple kernel clustering, namely performing imputation and clustering jointly: 1) impute the absent kernels under the guidance of clustering; and 2) update the clustering with the imputed kernels. In this way, the two learning processes are seamlessly coupled and can negotiate with each other to achieve better clustering. Specifically, we propose the multiple kernel k-means algorithm with incomplete kernels as follows,
min_{H, β, {K_p}_{p=1}^m} Tr(K_β (I_n − H H^⊤))  s.t.  H^⊤ H = I_k,  β^⊤ 1_m = 1,  β_p ≥ 0,  K_p(s_p, s_p) = K_p^(cc),  K_p ⪰ 0, ∀p,   (7)
The only difference between the objective function in Eq.(7) and that of traditional MKKM in Eq.(5) lies in the incorporation of optimizing the incomplete base kernels {K_p}_{p=1}^m. Note that the constraint K_p(s_p, s_p) = K_p^(cc) is imposed to ensure that K_p maintains its known entries during optimization. Though the model in Eq.(7) is simple, it has the following advantages: 1) Our objective function is more direct and well targets the ultimate goal, i.e., clustering, by integrating kernel completion and clustering into one unified learning framework, where kernel imputation is treated as a by-product; 2) Our algorithm works in an MKL scenario (Rakotomamonjy et al. 2008), and can thus naturally deal with a large number of base kernels and adaptively combine them for clustering; 3) Our algorithm does not require any base kernel to be completely observed, which is necessary for some existing imputation algorithms such as (Trivedi et al. 2010). Besides, our algorithm is parameter-free once the number of clusters to form is specified.
Although Eq.(7) is not difficult to understand, the positive semi-definite (PSD) constraints on {K_p}_{p=1}^m make it difficult to optimize. In the following, we design an efficient three-step algorithm that solves this problem in an alternate manner:
Given β and {K_p}_{p=1}^m, the optimization in Eq.(7) for H reduces to a standard kernel k-means problem, which can be efficiently solved as in Eq.(3);
Algorithm 1: Proposed Multiple Kernel k-means with Incomplete Kernels

1: Input: {K_p^(cc)}_{p=1}^m, {s_p}_{p=1}^m and ϵ_0.
2: Output: H, β and {K_p}_{p=1}^m.
3: Initialize β^(0) = 1_m / m, {K_p^(0)}_{p=1}^m and t = 1.
4: repeat
5:   Compute K_β^(t) = Σ_{p=1}^m (β_p^(t−1))^2 K_p^(t−1).
6:   Update H^(t) by solving Eq.(3) with K_β^(t).
7:   Update {K_p^(t)}_{p=1}^m with H^(t) by Eq.(12).
8:   Update β^(t) by solving Eq.(6) with H^(t) and {K_p^(t)}_{p=1}^m.
9:   t = t + 1.
10: until (obj^(t−1) − obj^(t)) / obj^(t) ≤ ϵ_0
Given β and H, the optimization in Eq.(7) with respect to {K_p}_{p=1}^m is equivalent to the following optimization problem,
min_{{K_p}_{p=1}^m} Tr(K_β (I_n − H H^⊤))  s.t.  K_p(s_p, s_p) = K_p^(cc),  K_p ⪰ 0, ∀p,   (8)
Directly solving the optimization problem in Eq.(8) appears computationally intractable because it involves multiple kernel matrices. Looking into the problem, we find that the constraints are defined separately on each K_p and that the objective function is a sum over the K_p. Therefore, we can equivalently rewrite Eq.(8) as m independent sub-problems, as stated in Eq.(9),
min_{K_p} Tr(K_p U)  s.t.  K_p(s_p, s_p) = K_p^(cc),  K_p ⪰ 0,   (9)
where U = I_n − H H^⊤ and p = 1, ⋯, m.
Considering that K_p is PSD, we can decompose it as K_p = A_p A_p^⊤. Inspired by the work in (Trivedi et al. 2010), we write A_p = [A_p^(c)⊤, A_p^(m)⊤]^⊤, where A_p^(c) and A_p^(m) correspond to the samples whose p-th view is present and absent, respectively. In this way, the optimization problem in Eq.(9) can be rewritten as
min_{A_p^(m)} Tr(A_p^⊤ U A_p),   (10)
where the matrix U is expressed in a blocked form as
U = [ U^(cc), U^(cm) ; U^(mc), U^(mm) ],
with the blocks partitioned according to the samples whose p-th view is present (c) and absent (m).
By taking the derivative of Eq.(10) with respect to A_p^(m) and letting it vanish, we can obtain an analytical solution to the optimal A_p^(m) as
A_p^(m) = −(U^(mm))^(−1) U^(mc) A_p^(c).   (11)
Correspondingly, we have a closed-form expression for the optimal K_p in Eq.(9):
K_p = [ K_p^(cc),  −K_p^(cc) U^(cm) (U^(mm))^(−1) ;  −(U^(mm))^(−1) U^(mc) K_p^(cc),  (U^(mm))^(−1) U^(mc) K_p^(cc) U^(cm) (U^(mm))^(−1) ].   (12)
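A sketch of this analytical imputation step is given below, assuming U^(mm) is invertible (a small ridge term can be added otherwise); the function and variable names are illustrative rather than taken from the authors' release.

```python
import numpy as np

def impute_kernel(K_cc, U, obs, mis):
    """Fill the missing rows/columns of one base kernel (Eqs.(10)-(12)).

    K_cc : observed sub-matrix K_p^(cc) over the indices in `obs`
    U    : I_n - H H^T from the current clustering
    obs  : indices s_p of samples whose p-th view is present
    mis  : indices of samples whose p-th view is absent
    """
    # Factorize the observed block: K_cc = A_c A_c^T (K_cc is PSD).
    w, V = np.linalg.eigh(K_cc)
    A_c = V * np.sqrt(np.clip(w, 0.0, None))

    # Closed-form minimizer of Tr(A^T U A): A_m = -(U^(mm))^{-1} U^(mc) A_c.
    A_m = -np.linalg.solve(U[np.ix_(mis, mis)], U[np.ix_(mis, obs)] @ A_c)

    # Assemble the completed kernel K_p = A A^T in the original sample ordering.
    A = np.zeros((U.shape[0], A_c.shape[1]))
    A[obs], A[mis] = A_c, A_m
    return A @ A.T
```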
Given H and {K_p}_{p=1}^m, the optimization in Eq.(7) for β is a quadratic program with linear constraints, which can be efficiently solved as in Eq.(6).
In sum, our algorithm for solving Eq.(7) is outlined in Algorithm 1, where the absent elements of {K_p^(0)}_{p=1}^m are initially imputed with zeros and obj^(t) denotes the objective value at the t-th iteration. It is worth pointing out that the objective of Algorithm 1 is guaranteed to decrease monotonically when optimizing one variable with the others fixed at each iteration. At the same time, the objective is lower-bounded by zero. As a result, our algorithm is guaranteed to converge. As shown in the experimental study, it usually converges in fewer than 30 iterations. Like MKKM, our algorithm solves an eigen-decomposition and a QP problem per iteration, which adds little extra computation since the imputation is done analytically via Eq.(12).
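Putting the three steps together, the following is a self-contained sketch of Algorithm 1 under the same assumptions as the snippets above (zero-filled initialization, the closed-form β update, and the analytical imputation with invertible U^(mm)); it is illustrative rather than the authors' implementation.

```python
import numpy as np

def _impute(K_cc, U, obs, mis):
    """Analytical imputation of one base kernel (Eqs.(11)-(12))."""
    w, V = np.linalg.eigh(K_cc)
    A_c = V * np.sqrt(np.clip(w, 0.0, None))            # K_cc = A_c A_c^T
    A_m = -np.linalg.solve(U[np.ix_(mis, mis)], U[np.ix_(mis, obs)] @ A_c)
    A = np.zeros((U.shape[0], A_c.shape[1]))
    A[obs], A[mis] = A_c, A_m
    return A @ A.T

def mkkm_incomplete(K_obs, index_sets, k, eps=1e-4, max_iter=100):
    """Joint imputation and clustering (Algorithm 1).

    K_obs[p]      : observed sub-matrix K_p^(cc) of the p-th base kernel
    index_sets[p] : indices s_p of samples whose p-th view is present
    k             : number of clusters
    """
    m = len(K_obs)
    n = max(max(s) for s in index_sets) + 1
    beta = np.full(m, 1.0 / m)                            # beta^(0) = 1_m / m

    # Zero-fill the absent entries for initialization.
    kernels = []
    for K_cc, s in zip(K_obs, index_sets):
        K = np.zeros((n, n))
        K[np.ix_(s, s)] = K_cc
        kernels.append(K)

    prev_obj = np.inf
    for _ in range(max_iter):
        # Step 1: update H by relaxed kernel k-means on K_beta (Eq.(3)).
        K_beta = sum(b ** 2 * K for b, K in zip(beta, kernels))
        _, V = np.linalg.eigh(K_beta)
        H = V[:, -k:]
        U = np.eye(n) - H @ H.T

        # Step 2: impute each base kernel analytically (Eq.(12)).
        for p, (K_cc, s) in enumerate(zip(K_obs, index_sets)):
            obs = np.asarray(s)
            mis = np.setdiff1d(np.arange(n), obs)
            if mis.size:
                kernels[p] = _impute(K_cc, U, obs, mis)

        # Step 3: update beta by the simplex-constrained QP in Eq.(6).
        c = np.array([np.trace(K @ U) for K in kernels])
        inv = 1.0 / np.maximum(c, 1e-12)
        beta = inv / inv.sum()

        # Objective of Eq.(7): sum_p beta_p^2 Tr(K_p U); stop on small relative decrease.
        obj = float(np.sum(beta ** 2 * c))
        if np.isfinite(prev_obj) and (prev_obj - obj) / max(obj, 1e-12) <= eps:
            break
        prev_obj = obj

    return H, beta, kernels
```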
Table 1: Datasets used in our experiments.

Dataset | #Samples | #Kernels | #Classes
---|---|---|---
Flower17 | 1360 | 7 | 17
Flower102 | 8189 | 4 | 102
Caltech102 | 3060 | 10 | 102
CCV | 6773 | 6 | 20
The proposed algorithm is experimentally evaluated on four widely used MKL benchmark data sets shown in Table 1: Oxford Flower17, Oxford Flower102, Columbia Consumer Video (CCV) and Caltech102. For the Flower17, Flower102 and Caltech102 data sets, all kernel matrices are pre-computed and can be publicly downloaded from the respective dataset websites. For Caltech102, we use its first ten base kernels for evaluation. For the CCV data set, we generate six base kernels by applying both a linear kernel and a Gaussian kernel to its SIFT, STIP and MFCC features, where the width of each of the three Gaussian kernels is set as the mean of all pairwise sample distances.
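As an illustration of this kernel construction, the sketch below builds a linear kernel and a Gaussian kernel per feature type, with the Gaussian width set to the mean pairwise sample distance. The exact Gaussian parameterization, the function names and the feature arrays are assumptions for illustration, not the authors' preprocessing code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def linear_kernel(X):
    return X @ X.T

def gaussian_kernel(X):
    D = squareform(pdist(X))                              # pairwise Euclidean distances
    sigma = D[np.triu_indices_from(D, k=1)].mean()        # mean pairwise distance
    return np.exp(-D ** 2 / (2.0 * sigma ** 2))

def build_base_kernels(feature_list):
    """feature_list: e.g. [X_sift, X_stip, X_mfcc]; returns two kernels per feature type."""
    kernels = []
    for X in feature_list:
        kernels.append(linear_kernel(X))
        kernels.append(gaussian_kernel(X))
    return kernels
```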
We compare the proposed algorithm with several commonly used imputation methods, including zero filling (ZF), mean filling (MF), k-nearest-neighbor filling (KNN) and the alignment-maximization filling (AF) proposed in (Trivedi et al. 2010). The algorithms in (Xu, Tao, and Xu 2015; Shao, He, and Yu 2015; Zhao, Liu, and Fu 2016) are not included in our comparison since they only consider the absence of input features rather than missing rows/columns of base kernels. Compared with (Bhadra, Kaski, and Rousu 2016), the imputation algorithm in (Trivedi et al. 2010) is much simpler and more computationally efficient; we therefore choose (Trivedi et al. 2010) as a representative algorithm to demonstrate the advantages and effectiveness of jointly optimizing imputation and clustering. The widely used MKKM (Gönen and Margolin 2014) is applied with these imputed base kernels. These two-stage methods are termed ZF+MKKM, MF+MKKM, KNN+MKKM and AF+MKKM in this experiment, respectively. We do not include the EM-based imputation algorithm due to its high computational cost, even for small sample sizes. The Matlab code of kernel k-means and MKKM is publicly available at https://github.com/mehmetgonen/lmkkmeans .
Following the literature (Cortes, Mohri, and Rostamizadeh 2012), all base kernels are centered and scaled so that κ_p(x_i, x_i) = 1 for all i and p. For all data sets, the true number of clusters is assumed to be known and is set as the true number of classes. To generate incomplete kernels, we create the index vectors {s_p}_{p=1}^m as follows. We first randomly select round(ε · n) samples, where round(·) denotes a rounding function. For each selected sample, a random vector v = (v_1, ⋯, v_m) ∈ [0, 1]^m and a scalar v_0 ∈ [0, 1] are generated. The p-th view is present for this sample if v_p ≥ v_0. In case none of v_1, ⋯, v_m satisfies this condition, we generate a new v to ensure that at least one view is available for the sample. Note that this does not mean that we require a view that is complete across all the samples. After this step, we obtain the index vector s_p listing the samples whose p-th view is present. The parameter ε, termed the missing ratio in this experiment, controls the percentage of samples with absent views and affects the performance of the compared algorithms. Intuitively, the larger ε is, the poorer the clustering performance an algorithm can achieve. To show this in depth, we compare these algorithms with respect to ε; specifically, ε is set as [0.1 : 0.1 : 0.9] on all four data sets.
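The missing-pattern generation described above can be sketched as follows (the function name and random-number interface are illustrative):

```python
import numpy as np

def generate_index_sets(n, m, eps, seed=None):
    """Return index vectors s_p (p = 1..m): samples whose p-th view is present."""
    rng = np.random.default_rng(seed)
    present = np.ones((n, m), dtype=bool)                    # start fully observed
    incomplete = rng.choice(n, size=round(eps * n), replace=False)
    for i in incomplete:
        while True:
            v, v0 = rng.random(m), rng.random()
            if np.any(v >= v0):                               # keep at least one view
                present[i] = v >= v0
                break
    return [np.flatnonzero(present[:, p]) for p in range(m)]
```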
The widely used clustering accuracy (ACC), normalized mutual information (NMI) and purity are applied to evaluate the clustering performance. For all algorithms, we repeat each experiment 50 times with random initialization to reduce the effect of the randomness caused by k-means, and report the best result. Meanwhile, we randomly generate the "incomplete" patterns 30 times in the above-mentioned way and report the statistical results. The aggregated ACC, NMI and purity are used to evaluate the quality of the compared algorithms. Taking the aggregated ACC as an example, it is obtained by averaging, over different ε, the average ACC achieved by an algorithm.
Figure 1 presents the ACC, NMI and purity comparison of the above algorithms under different missing ratios on the four data sets. To help interpret the performance achieved by our algorithm, we also provide MKKM as a reference; note that there is no absence in the base kernels of MKKM. As observed: 1) the proposed algorithm (in red) consistently demonstrates the best overall performance among the MKKM methods with absent kernels in all sub-figures; 2) the improvement of our algorithm becomes more significant as the missing ratio increases; for example, it improves on the second best algorithm (AF+MKKM) by nearly five percentage points on Flower102 in terms of clustering accuracy when the missing ratio is 0.9 (see Figure 1(c)); 3) the variation of our algorithm with respect to the missing ratio is relatively small compared with the other algorithms, demonstrating its stability under intensive absence; and 4) the performance of our algorithm is the closest to, or even better than, that of MKKM (in green) in multiple cases.
Figure 1: Clustering accuracy, NMI and purity comparison with the variation of missing ratios on the four data sets. Note that MKKM (in green) is provided as a reference; there is no absence in its base kernels.
We attribute the superiority of our algorithm to its joint optimization of imputation and clustering. On one hand, the imputation is guided by the clustering results, which makes it more directly targeted at the ultimate goal. On the other hand, this meaningful imputation helps to refine the clustering results. These two learning processes negotiate with each other, leading to improved clustering performance. In contrast, the ZF+MKKM, MF+MKKM, KNN+MKKM and AF+MKKM algorithms do not fully take advantage of the connection between the imputation and clustering procedures. This can produce imputation that does not serve the subsequent clustering as well as expected, hurting the clustering performance. The aggregated ACC, NMI and purity, together with the standard deviation, are reported in Table 2, where the highest performance is shown in bold. Again, we observe that the proposed algorithm significantly outperforms the ZF+MKKM, MF+MKKM, KNN+MKKM and AF+MKKM algorithms, which is consistent with our observations in Figure 1.
Table 2: Aggregated ACC, NMI and purity comparison (mean ± std) of different clustering algorithms on four data sets.
Datasets | ZF+MKKM | MF+MKKM | KNN+MKKM | AF+MKKM (Trivedi et al. 2010) | Proposed
---|---|---|---|---|---
ACC | | | | |
Flower17 | 37.09 ± 0.42 | 36.93 ± 0.48 | 37.88 ± 0.62 | 42.46 ± 0.59 | **44.56 ± 0.61**
Flower102 | 17.95 ± 0.15 | 17.92 ± 0.16 | 18.26 ± 0.14 | 19.09 ± 0.17 | **21.40 ± 0.18**
Caltech102 | 23.10 ± 0.26 | 23.15 ± 0.24 | 23.87 ± 0.26 | 26.56 ± 0.22 | **28.22 ± 0.27**
CCV | 14.80 ± 0.16 | 15.03 ± 0.16 | 14.73 ± 0.19 | 16.51 ± 0.25 | **19.91 ± 0.32**
NMI | | | | |
Flower17 | 37.40 ± 0.35 | 37.38 ± 0.40 | 38.36 ± 0.46 | 41.85 ± 0.42 | **43.50 ± 0.42**
Flower102 | 37.39 ± 0.08 | 37.39 ± 0.08 | 37.83 ± 0.09 | 38.32 ± 0.11 | **39.55 ± 0.10**
Caltech102 | 44.90 ± 0.15 | 44.94 ± 0.14 | 45.67 ± 0.18 | 47.74 ± 0.14 | **49.10 ± 0.18**
CCV | 10.11 ± 0.13 | 10.23 ± 0.13 | 10.25 ± 0.16 | 11.76 ± 0.19 | **14.80 ± 0.20**
Purity | | | | |
Flower17 | 38.61 ± 0.40 | 38.49 ± 0.48 | 39.38 ± 0.56 | 43.96 ± 0.54 | **45.92 ± 0.53**
Flower102 | 22.44 ± 0.12 | 22.43 ± 0.11 | 22.82 ± 0.14 | 23.63 ± 0.15 | **25.95 ± 0.14**
Caltech102 | 24.62 ± 0.25 | 24.66 ± 0.26 | 25.44 ± 0.27 | 28.15 ± 0.22 | **29.87 ± 0.25**
CCV | 18.26 ± 0.15 | 18.48 ± 0.16 | 18.33 ± 0.20 | 19.83 ± 0.26 | **23.79 ± 0.28**
Besides comparing the above-mentioned algorithms in terms of clustering performance, we would like to gain more insight into how close the imputed base kernels (a by-product of our algorithm) are to the ground truth, i.e., the original, complete base kernels. To do this, we calculate the alignment between the ground-truth kernels and the imputed ones. Kernel alignment, a widely used criterion for measuring the similarity of two kernel matrices (Cortes, Mohri, and Rostamizadeh 2012), serves this purpose. We compare the alignment resulting from our algorithm with those from the existing imputation algorithms. The results under various missing ratios are shown in Figure 2. As observed, the kernels imputed by our algorithm align with the ground-truth kernels much better than those obtained by the existing imputation algorithms. In particular, our algorithm outperforms the second best one (KNN+MKKM) by more than 22 percentage points on Caltech102 when the missing ratio is 0.9. The aggregated alignment and the standard deviation are reported in Table 3. We once again observe the significant superiority of our algorithm over the compared ones. These results indicate that our algorithm not only achieves better clustering performance, but also produces better imputation by exploiting the prior knowledge that the kernels should serve clustering.
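For reference, the alignment between two kernel matrices can be computed as their normalized Frobenius inner product, A(K1, K2) = ⟨K1, K2⟩_F / (‖K1‖_F ‖K2‖_F); the sketch below uses this basic (uncentered) form, whereas the centered variant of (Cortes, Mohri, and Rostamizadeh 2012) differs in detail.

```python
import numpy as np

def kernel_alignment(K1, K2):
    """Alignment A(K1, K2) = <K1, K2>_F / (||K1||_F * ||K2||_F)."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))
```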
Figure 2: Kernel alignment between the original kernels and the kernels imputed by different algorithms under different missing ratios.
Table 3: Aggregated alignment between the original kernels and the imputed kernels (mean ± std) on four data sets.
Datasets | ZF+MKKM | MF+MKKM | KNN+MKKM | AF+MKKM (Trivedi et al. 2010) | Proposed
---|---|---|---|---|---
Flower17 | 80.07 ± 0.08 | 80.05 ± 0.08 | 81.45 ± 0.06 | 86.50 ± 0.08 | 88.70 ± 0.12
Flower102 | 75.55 ± 0.05 | 75.55 ± 0.05 | 73.35 ± 0.04 | 76.71 ± 0.05 | 79.20 ± 0.06
Caltech102 | 74.40 ± 0.05 | 74.43 ± 0.05 | 83.32 ± 0.05 | 80.00 ± 0.05 | 95.99 ± 0.03
CCV | 75.03 ± 0.07 | 76.60 ± 0.07 | 79.15 ± 0.06 | 77.10 ± 0.07 | 85.11 ± 0.24
From the above experiments, we conclude that the proposed algorithm: 1) effectively addresses the issue of row/column absence in multiple kernel clustering; 2) consistently achieves performance superior to the compared ones, especially under intensive absence; and 3) better recovers the incomplete base kernels by taking the goal of clustering into account. In short, our algorithm makes good use of the connection between the imputation and clustering procedures, bringing significant improvements in clustering performance. In addition, our algorithm is theoretically guaranteed to converge to a local minimum according to (Bezdek and Hathaway 2003). In the above experiments, we observe that the objective value of our algorithm monotonically decreases at each iteration and that it usually converges in fewer than 30 iterations. Two examples of the evolution of the objective value, on Flower17 and Flower102, are shown in Figure 3.
While MKC algorithms have recently demonstrated promising performance in various applications, they cannot effectively handle the scenario where base kernels are incomplete. This paper proposes to jointly optimize kernel imputation and clustering to address this issue, making the two learning procedures seamlessly integrated to achieve better clustering. The proposed algorithm effectively solves the resultant optimization problem and demonstrates clearly improved clustering performance in extensive experiments on benchmark data sets, especially when the missing ratio is high. In the future, we plan to further improve the clustering performance by considering the correlations among different base kernels (Bhadra, Kaski, and Rousu 2016).
This work was supported by the National Natural Science Foundation of China (project No. U1435219, 61403405, 61672528 and 61232016). The authors would like to thank Dr. Luping Zhou for discussions on the application of our algorithm to Alzheimer's disease prediction.
Xinwang Liu, School of Computer, National University of Defense Technology, Changsha, China, 410073.
Miaomiao Li, School of Computer, National University of Defense Technology, Changsha, China, 410073.
Lei Wang, School of Computer Science and Software Engineering, University of Wollongong, NSW, Australia, 2522.
Yong Dou, School of Computer, National University of Defense Technology, Changsha, China, 410073.
Jianping Yin, School of Computer, National University of Defense Technology, Changsha, China, 410073.
En Zhu, School of Computer, National University of Defense Technology, Changsha, China, 410073.