GAN-based clustering solution generation and fusion of diffusion

In this paper, we propose a framework to generate diverse clustering solutions and conduct solution retrieval to improve performance. Specifically, we first project unlabelled data from multiple domains into a shared space while preserving their respective semantics. This space allows representations of samples in a hard domain to be recovered by a linear combination of those of other samples in the easy domains. Meanwhile, a clustering algorithm is adopted to provide pseudo labels for a conditional generative adversarial network to synthesize representations that in turn promote the learning of the above space. Second, we jointly learn feature projection and partition matrices on batches of representations, where the feature projection matrices are considered as clustering solutions and input into another generative adversarial network to generate more solutions. Third, we utilize the fusion of diffusion to effectively retrieve and extract the knowledge in multiple solutions and obtain the final clustering. We perform comparative experiments against other methods on multiple benchmark data sets. Experimental results demonstrate the effectiveness and superiority of our proposed method.


Introduction
Clustering splits unlabelled data into groups, so that data in the same group are more similar to each other than to those in different groups (Gu et al., 2011, August; Gu & Zhou, 2009, December; Yang et al., 2015). Clustering has been demonstrated by successful applications including image analysis (Hershey et al., 2016, March; Shaham et al., 2018; Yang et al., 2016), information retrieval (Bai et al., 2019; Hsu & Lin, 2018), data compression (Jiang et al., 2017, August), bio-informatics (Niu et al., 2016), etc. Conventional clustering approaches have considered that features play a key role in improving performance and have focused on designing sophisticated feature-selection methods to learn discriminative features (Zhang, 2015, January; Zhang & Zhang, 2010; Zhang et al., 2016), on which clustering algorithms are adopted. However, when no label is available for conducting feature selection (Vega-Pons & Ruiz-Shulcloper, 2011), this can lead to information loss and performance deterioration. Consensus clustering (Gionis et al., 2007; Shi et al., 2018; Yu et al., 2012, May) has provided a promising alternative by combining the advantages of multiple feature transformations to explore underlying structures of data from different perspectives. Besides, consensus clustering has adopted voting schemes to fuse predictions from various clustering algorithms into a unified result (Fan et al., 2017, August; Rathore et al., 2018, June). Although consensus clustering has achieved remarkable advancements, its solutions are obtained by various clustering methods, and the diversity of solutions is derived from transformations conducted on features or random feature subspace techniques (Gionis et al., 2007; Yu et al., 2012, May). Since the performance of consensus clustering depends on the diversity of clustering solutions, how to design feature transformations or delicate clustering algorithms has become a focal point.
Another scenario is performing clustering on multiple related tasks in real-world applications. When handling them individually, performance is extremely poor, especially when the data are high-dimensional. To address this issue, multi-task clustering methods have been proposed to cluster tasks jointly (Gu et al., 2011, August; Yang et al., 2015), where the key lies in searching a shared subspace (Gu & Zhou, 2009, December; Zhang, 2015, January) for related tasks while simultaneously retaining specific information about individual tasks. The shared subspace captures interdependencies among related tasks to guide the knowledge transfer (Zhang et al., 2016), while the task-specific information is exploited by clustering in the original space. In light of consensus clustering, multi-task consensus clustering (Niu et al., 2016) has been proposed in the context of bioinformatics, which can be considered a combination of multi-task clustering and consensus clustering. However, limited by the disadvantages of consensus clustering, the diversity of clustering solutions still needs to be investigated from a new perspective, beyond feature transformations or clustering algorithms.
Recently, deep clustering approaches (Kamran et al., 2017, October; Wang et al., 2016; Xie et al., 2016; Yang et al., 2016) have been proposed, which can be divided into two stages: learning deep features with auto-encoders or convolutional kernels, and employing a clustering algorithm on the learnt features (Caron et al., 2018; Chang et al., 2017, October; Ji et al., 2017, December; Jiang et al., 2017, August; Peng et al., 2016). A similar yet different research field is deep unsupervised domain adaptation (Ganin & Lempitsky, 2015, July; Long et al., 2015, 2016; Tzeng et al., 2017). The difference between deep clustering and deep unsupervised domain adaptation (Hu et al., 2018, June; Long et al., 2018; Volpi et al., 2018) lies in that the former focuses on a single task, while the latter utilizes the knowledge from a labelled source task to handle an unlabelled target task by using deep shared features or aligning their marginal distributions (Carlucci et al., 2017; Long et al., 2017; Sener et al., 2016; Swami et al., 2017). Furthermore, deep unsupervised domain adaptation (Herath et al., 2017; Luo et al., 2017, December; Murez et al., 2018) is differentiated from multi-task clustering in that the former exploits the label information from the source task, while the latter has no label information available for any of the tasks.
In this paper, we focus on solving a more challenging problem in which there are multiple related clustering tasks and the clustering solutions on some tasks are difficult to obtain, or their quality falls far below our tolerance. Consequently, multi-task clustering and deep unsupervised domain adaptation methods fail to obtain satisfactory clustering results. To address this issue, as shown in Figure 1, we have two important considerations. First, a shared space can be learnt based on compact representations of all the tasks, on which clustering algorithms are adopted to provide pseudo labels for conditional generative adversarial networks (GANs) (Mirza & Osindero, 2014) to generate more diverse representations that in turn aid the search for the shared space. Second, by combining the original and synthetic representations, we conduct the joint learning of feature projection and partition matrices on batches, and input the feature projection matrices into another GAN (Goodfellow et al., 2014, December) to synthesize more feature projection matrices; this reduces the dependence on clustering algorithms and feature transformations. We adopt the fusion of diffusion to integrate the knowledge of the feature projection matrices, which are regarded as clustering solutions, into a unified one, then transform the original representations and conduct clustering algorithms to obtain the final results.
The contributions of this paper are summarized as follows. (1) We propose a framework to synthesize clustering solutions using GANs, which exploits the knowledge from other tasks to generate sufficiently diverse feature projection matrices for the difficult task as its clustering solutions.
(2) We design a fusion strategy of diffusion to perform consensus clustering by integrating discriminative and complementary knowledge in the affinity graphs based on feature projection matrices. (3) We provide theoretical convergence analysis for the fusion strategy of diffusion and conduct experiments on real-world data sets to verify the effectiveness of our method.
The remainder of this paper is organized as follows. In Section 2, we review the literature on multitask clustering, consensus clustering, deep clustering and deep unsupervised domain adaptation. In Section 3, we elaborate our proposed approach. Section 4 conducts theoretical convergence analysis for the fusion of diffusion. In Section 5, we present the comparative results between our proposed approach and other counterparts over multiple real-world data sets. Section 6 draws the conclusion for this paper.

Related work
In this section, we review the literature concerning multitask clustering, consensus clustering, deep clustering and deep unsupervised domain adaptation.
Many multi-task clustering approaches (Gu et al., 2011, August; Gu & Zhou, 2009, December; Yang et al., 2015; Zhang, 2015, January; Zhang & Zhang, 2010, 2013; Zhang et al., 2016) have been proposed to improve performance through jointly learning related tasks, in which the shared knowledge is explored and transferred among them. Generally, this shared knowledge is mined by searching a shared low-dimensional subspace (Gu & Zhou, 2009, December) or a reproducing kernel Hilbert space (Gu et al., 2011, August), followed by conducting single-task clustering and multi-task clustering iteratively. In Yang et al. (2015), the relatedness among tasks is dissected by exploring inter-task and intra-task correlations with a designed ℓ2,p-norm regularization, in which the feature projection and partition matrices are learnt simultaneously. Since tasks have different sensitivities to the shared knowledge, it is necessary to consider the negative effects brought about by the knowledge transfer. A self-adapted multi-task clustering method (Zhang et al., 2016) is proposed, in which reusable instances are identified from related clusters to construct subtasks, followed by transferring the information from the subtasks to the original tasks to avoid negative transfer. Convex discriminative multi-task feature clustering and multi-task relationship clustering methods (Zhang, 2015, January) explore, respectively, shared features embedded in a covariance matrix that models relationships between features, and a covariance matrix that models relationships between task-specific models under a Gaussian prior. Consensus clustering (Vega-Pons & Ruiz-Shulcloper, 2011) reconciles the solutions obtained on the same data set with different clustering algorithms or different runs of the same one. Existing consensus clustering is divided into clustering ensemble (Shi et al., 2018; Yu et al., 2012, May) and clustering aggregation (Fan et al., 2017, August; Gionis et al., 2007; Rathore et al., 2018, June).
The former considers the clustering problem under various formulations and casts it as a hyper-graph partition or correlation clustering problem, so the solution is to compute the best k-partition of the graph. The latter fuses a collection of soft clustering solutions generated by random projections, in which each instance is denoted by a vector whose entries represent posterior probabilities.
Deep learning has promoted the emergence of deep clustering (Ji et al., 2017, December; Jiang et al., 2017, August; Kamran et al., 2017, October; Peng et al., 2016; Xie et al., 2016; Yang et al., 2016). Specifically, a multi-layer autoencoder (Xie et al., 2016) provides the initialization to learn effective latent representations for clustering. Yang et al. (2016) build successive clustering operations on top of representations learnt by a convolutional neural network (CNN), where image clustering and representation learning are conducted in the forward and backward passes, respectively. Kamran et al. (2017, October) propose a clustering objective function with a logistic regression imposed on the deep embedding learnt by a CNN. Jiang et al. (2017, August) propose a variational deep embedding method, where a variational autoencoder performs unsupervised generative clustering using a Gaussian mixture model. Ji et al. (2017, December) utilize a fully-connected layer to learn a self-expressive matrix for subspace clustering. Peng et al. (2016) propose to integrate prior information about the local and global structure into learning latent subspaces with a deep autoencoder, where the locality for reconstructing the input and a structured global prior are integrated into deep subspace clustering. Chang et al. (2017, October) propose deep adaptive image clustering, which formulates the clustering problem as binary pairwise classification, where similarities are measured by the cosine distances between label features of images. Caron et al. (2018) present a deep clustering method that jointly learns the parameters of a neural network and the clustering result on the obtained features.
Deep domain adaptation methods have also emerged. Specifically, Long et al. (2016) propose to learn adaptive classifiers and transferable features from a labelled source domain to an unlabelled target domain, where features are embedded into a reproducing kernel Hilbert space (RKHS) to match distributions. Ganin and Lempitsky (2015, July) consider the discrimination and domain-invariance of adaptive features by jointly optimizing a label predictor and a domain classifier. Long et al. (2015) generalize CNNs to domain adaptation, in which all the latent task-specific representations are embedded into an RKHS, where the mean embeddings of distributions of different domains are matched. Tzeng et al. (2017) propose the adversarial discriminative domain adaptation framework by combining discriminative modelling, untied weight sharing and a GAN loss. Long et al. (2017) match the joint distributions of domain-specific layers over multiple domains using the joint maximum mean discrepancy. Sener et al. (2016) propose a unified framework for unsupervised domain adaptation to learn transferable representations, in which feature representations and the domain transformation are alternately optimized. Gebru et al. (2017) design an adaptation multi-task loss based on attributes for fine-grained recognition by leveraging labels of the source domain, where the loss is composed of a supervised loss, an unsupervised adaptation loss and an attribute consistency loss. Xu et al. (2018) propose a multi-source domain adaptation method for mitigating both domain and class shifts, in which the distribution of the target domain is represented by a combination of source domain distributions. Zhao et al. (2018) provide new generalization bounds for unsupervised multi-source domain adaptation in classification and regression, where the H-divergence is utilized to measure the distance between two distributions. Unlike deep unsupervised domain adaptation, our method does not exploit any labels from source domains.
Additionally, our proposed method generates not only representations of real-like data but also feature projection matrices as clustering solutions.

Proposed methodology
In this section, we elaborate our clustering solution generation and retrieval approach. The overview framework of the proposed approach is in Figure 2, which includes five stages: the learning of compact representations, the generation of diverse representations, the enhancement of representations guided by clustering, the generation of clustering solutions and the fusion of solutions using the diffusion.

Problem formulation
Suppose that there are data samples from one or more easy tasks and one difficult task, denoted as {X^s ∈ R^{d_s × n_s}}_{s=1}^{m} and X^t ∈ R^{d_t × n_t}, where d_s and d_t denote the dimensions of samples in the s-th easy task and the difficult task, and n_s and n_t denote the corresponding numbers of samples. m denotes the number of easy tasks available. The goal is to explore the feasibility of retrieving the diverse clustering solutions obtained from the easy tasks and employing them on X^t, when the solutions directly obtained from X^t are extremely poor or cannot be obtained at all.

Motivation
The motivation is that the knowledge embedded in the solutions obtained from multiple unlabelled data sets can be transferred to other related unlabelled data sets, on which the directly obtained solutions are extremely poor or cannot be accessed at all. This concept is derived from domain adaptation. There are two key differences between our approach and domain adaptation. First, our approach does not rely on any labelled information, while domain adaptation needs labelled information about the source domain or both domains. Second, our approach focuses on both the latent compact representations and the task-specific information. Besides, our approach is capable of generating clustering solutions using generative adversarial networks, which provide sufficiently diverse solutions. With these synthetic solutions, we adopt the fusion of diffusion to integrate multiple solutions into one that achieves satisfactory performance on both data sets. Compared to clustering ensemble using the majority voting mechanism, the fusion of diffusion takes multiple metrics into account, as described below.

The learning of compact representations for all the tasks
The learning of compact representations of data samples from multiple tasks is the cornerstone on which our method works. It needs to meet the following two requirements: (1) compact representations can recover the original input in their respective tasks and (2) compact representations of data samples in a difficult task can be reconstructed by a linear combination of those of other samples in easy tasks. It is worth mentioning that when reconstructing the representation of a sample in a difficult task with those of other samples, we do not enforce that these samples come from the same task. This means that our method is capable of dealing with a difficult task using easy tasks whose semantics are not necessarily the same, as long as the union of the semantics of the easy tasks completely overlaps with that of the given difficult task. The loss of this learning is composed of the loss functions corresponding to the above requirements, defined by

L = L_r + α L_m,

where α is a trade-off parameter to control the relative importance of L_r and L_m, and E(·) and D(·) are the encoder and decoder. L_r penalizes the reconstruction error between each input and D(E(·)) of that input in its own task, while L_m penalizes the distance between the compact representation of each difficult-task sample and the mean of its k-nearest neighbouring representations drawn from the easy tasks. Here N(x, X^s) denotes the set of samples in the s-th easy task whose compact representations lie in the k-nearest neighbourhood of the difficult-task sample x, and N̄ denotes the mean of each sample set in N.
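As a concrete illustration of the two requirements, the combined loss can be sketched with a linear encoder/decoder standing in for the autoencoder; the task sizes, dimensions, neighbourhood size k and trade-off α below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two "easy" tasks and one "difficult" task (hypothetical sizes).
X_easy = [rng.normal(size=(100, 20)) for _ in range(2)]
X_hard = rng.normal(size=(30, 20))

# A linear encoder/decoder stands in for the paper's autoencoder networks.
W_enc = rng.normal(scale=0.1, size=(20, 5))
W_dec = rng.normal(scale=0.1, size=(5, 20))
encode = lambda X: X @ W_enc
decode = lambda Z: Z @ W_dec

def loss(alpha=0.5, k=5):
    # L_r: representations must recover the original input in each task.
    X_all = np.vstack(X_easy + [X_hard])
    L_r = np.mean((X_all - decode(encode(X_all))) ** 2)
    # L_m: a difficult-task representation should be close to the mean of
    # its k nearest neighbours among the easy tasks' representations.
    Z_easy = encode(np.vstack(X_easy))
    Z_hard = encode(X_hard)
    d = np.linalg.norm(Z_hard[:, None, :] - Z_easy[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    L_m = np.mean((Z_hard - Z_easy[nn].mean(axis=1)) ** 2)
    return L_r + alpha * L_m

print(loss())
```

In a full implementation both terms would be minimized jointly over the encoder/decoder parameters; here the weights are fixed only to show how the two terms are assembled.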

Clustering on low-dimensional mutual representations
After obtaining the latent compact representations E(X_1) of X_1, we perform clustering via the objective function below:

min_{M_1, P_1} ‖E(X_1) − P_1 M_1‖_F^2,

where M_1 and P_1 are the cluster centre and partition matrices on the subspace spanned by the low-dimensional representations, respectively, and ‖·‖_F is the Frobenius norm. For P_1 ∈ R^{N×K} (N and K are the numbers of data samples and clusters, respectively), the index of the largest value in each row denotes which cluster the corresponding data sample belongs to. For example, if the index of the largest value in the ith row vector P_1^i ∈ R^{1×K} of P_1 is k, we consider that the ith data sample belongs to the kth cluster. We can thus easily obtain a binary matrix with the same size as the partition matrix. A one-hot vector can be used to denote a cluster, with the index of the element 1 corresponding to a specific cluster; each pair of clusters is distinguished by the indices of the element 1 in their one-hot vectors.
Considering that the auto-encoder network pretrained on the input data samples may fail to evenly cover the original space, we would utilize the pseudo-labels obtained in this stage as the condition to generate more diverse representations.
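The clustering on the latent representations and the conversion of the partition matrix into one-hot pseudo-labels can be sketched as follows; the toy two-blob data, hard assignments and deterministic seeding are illustrative simplifications of the objective above.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated toy blobs standing in for E(X_1).
Z = np.vstack([rng.normal(0.0, 0.1, (50, 5)), rng.normal(3.0, 0.1, (50, 5))])
K = 2

M = Z[[0, 50]].copy()          # one seed per blob, for a deterministic sketch
for _ in range(10):
    d = np.linalg.norm(Z[:, None] - M[None], axis=2)   # N x K distances
    labels = d.argmin(axis=1)                          # row-wise argmax rule
    P = np.eye(K)[labels]      # binary partition matrix with one-hot rows
    M = P.T @ Z / np.maximum(P.sum(0), 1)[:, None]     # recompute centres

print(P.shape)
```

The rows of `P` are exactly the one-hot pseudo-labels that condition the generator in the next stage.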

Generating low-dimensional mutual representations
After clustering on E(X_1), we use a conditional generative adversarial network (CGAN) to synthesize latent representations with the same dimension as E(X_1), where the predicted labels are used as conditions. The loss function of the CGAN is defined by

min_G max_D V(D, G) = E_{e_x}[log D(e_x | ỹ)] + E_z[log(1 − D(G(z | ỹ) | ỹ))],

where e_x denotes a real sample following the distribution of E(X_1) and z is a noise variable drawn from a uniform distribution. ỹ is a one-hot vector of size K derived from the clustering partition matrix, which is fed into the first generator as pseudo-label/pseudo-supervised information to generate cluster-specific representations. Initially, the clustering performance can be poor when the learned representations are not satisfactory, resulting in undesirable synthetic representations. However, the learned representations become more discriminative as iterations increase, which improves both the performance of clustering and the quality of the synthetic representations.
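A minimal numpy sketch of evaluating this conditional value function follows; the linear generator and discriminator, the sizes and the Monte-Carlo batch are illustrative stand-ins for the actual networks and training loop.

```python
import numpy as np

rng = np.random.default_rng(2)
K, d_z, d_e = 3, 4, 5            # clusters, noise dim, representation dim

# Hypothetical linear G and D; the paper's models would be multi-layer nets.
G_w = rng.normal(scale=0.1, size=(d_z + K, d_e))
D_w = rng.normal(scale=0.1, size=(d_e + K, 1))
sigmoid = lambda x: 1 / (1 + np.exp(-x))

def G(z, y):                      # condition by concatenating one-hot labels
    return np.concatenate([z, y], axis=1) @ G_w

def D(e, y):
    return sigmoid(np.concatenate([e, y], axis=1) @ D_w)

# Monte-Carlo estimate of V = E[log D(e_x|y)] + E[log(1 - D(G(z|y)|y))].
e_real = rng.normal(size=(64, d_e))                  # stands in for E(X_1)
y = np.eye(K)[rng.integers(0, K, 64)]                # clustering pseudo-labels
z = rng.uniform(-1, 1, size=(64, d_z))
V = np.mean(np.log(D(e_real, y))) + np.mean(np.log(1 - D(G(z, y), y)))
print(V)
```

Training would alternate gradient ascent on V for D with descent for G; the sketch only shows how the conditioning enters both players.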

Learning the feature projection matrix on each batch
We shuffle and combine the synthetic latent representations S_{E(X_1)} and E(X_1) to form a set of batches {B_j}_{j=1}^{n_b}, where n_b is the number of batches. For each batch, we learn the feature projection and partition matrices simultaneously as follows:

min_{W_j, P_j} ‖B_j W_j − P_j‖_F^2 + λ tr(P_j^T L_j P_j) + β ‖W_j‖_F^2,

where P_j and W_j are the partition and feature projection matrices learnt on B_j, respectively, L_j is the Laplacian matrix for B_j, and λ and β are two trade-off parameters. It is worth mentioning that we consider W_j rather than P_j as the clustering solution, because the latter depends more strongly on the specific data than the former; as a result, it would be necessary to align data samples to the clusters delicately.
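One plausible realization of this joint learning alternates a ridge-regression closed form for W_j with nearest-centre assignment for P_j; the batch size, dimensions, λ and β below are assumptions, and the Laplacian term is only evaluated in the objective here rather than driving the updates.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(40, 10))     # one batch of latent representations
K, lam, beta = 3, 0.1, 0.01       # assumed number of clusters and weights

# Gaussian-affinity graph Laplacian L = D - S on the batch.
d2 = ((B[:, None] - B[None]) ** 2).sum(-1)
S = np.exp(-d2)
L = np.diag(S.sum(1)) - S

# Alternate: W by a ridge closed form given P, P by nearest centre given W.
P = np.eye(K)[rng.integers(0, K, len(B))]
for _ in range(5):
    W = np.linalg.solve(B.T @ B + beta * np.eye(B.shape[1]), B.T @ P)
    proj = B @ W
    M = P.T @ proj / np.maximum(P.sum(0), 1)[:, None]
    P = np.eye(K)[np.linalg.norm(proj[:, None] - M[None], axis=2).argmin(1)]

obj = ((B @ W - P) ** 2).sum() + lam * np.trace(P.T @ L @ P) \
      + beta * (W ** 2).sum()
print(W.shape, obj)
```

The matrix `W` produced per batch is what the next stage treats as a clustering solution and feeds into the second GAN.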

Generating the feature projection matrix
After completing the computation on all the batches, we obtain a series of feature projection matrices, which are considered as the ground truth and fed into another GAN to generate more diverse ones. Specifically, we generate the synthetic feature projection matrices via

min_G max_D E_{w∼p_w(w)}[log D(w)] + E_{t∼p_t(t)}[log(1 − D(G(t)))],   (7)

where p_w(w) denotes the distribution of the feature projection matrices, and p_t(t) denotes a uniform distribution from which the noise variable t is sampled.

The fusion of clustering solutions
With the synthetic feature projection matrices and the ones learnt on batches, we adopt the fusion of diffusion strategy to merge the information in them via Equation (8), where A denotes the optimal similarity matrix that fuses all the information in the M feature projection matrices W^(1), W^(2), . . . , W^(M), and I is an identity matrix. S^(m) is the weight matrix of the graph obtained after transforming the learned representations E(X_1) with the feature projection matrix W^(m), and S^(m)_ij is the weight of the edge connecting the ith and jth samples.
ω_mn denotes the weight that connects the mth and nth affinity graphs concerning W^(m) and W^(n). The first term controls the smoothness of pairwise product graphs. More specifically, if x_i is similar to x_j in the mth similarity space spanned by W^(m), i.e. S^(m)_ij is large, and x_k is similar to x_l in the nth similarity space spanned by W^(n), i.e. S^(n)_kl is large, then the learned A_ki and A_lj should differ little. The second term indicates that the learned optimal similarity should preserve self-similarity, i.e. I_kk. τ is the weight regularization parameter, controlling the distribution of ω_mn. If the fusion from S^(m) to S^(n) is non-smooth, the value of the first term will be large, and a small weight will be assigned to ω_mn to ensure that the objective value decreases during the iterations.
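The smoothness term for one pair (m, n) can be written both element-wise and through the Kronecker product of the two affinity matrices acting on vec(A), which is the form the optimization below exploits; the toy sizes and random affinities are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8

def row_norm(X):
    # symmetric non-negative affinities, rows normalized to sum to 1
    X = np.abs(X) + np.abs(X.T)
    return X / X.sum(1, keepdims=True)

S_m = row_norm(rng.normal(size=(n, n)))
S_n = row_norm(rng.normal(size=(n, n)))
A = np.eye(n)

# Element-wise smoothness for the pair (m, n): if x_i ~ x_j under S^(m) and
# x_k ~ x_l under S^(n), then A_ki and A_lj should be close.
smooth = sum(
    S_m[i, j] * S_n[k, l] * (A[k, i] - A[l, j]) ** 2
    for i in range(n) for j in range(n) for k in range(n) for l in range(n)
)

# The same quantity through the Kronecker product S^(m) (x) S^(n) and the
# column-major vectorization vec(A).
a = A.flatten(order="F")
S_kron = np.kron(S_m, S_n)
diff = a[:, None] - a[None, :]
smooth_kron = (S_kron * diff ** 2).sum()

print(smooth, smooth_kron)
```

The two computations agree exactly because the Kronecker index (i·n + k, j·n + l) carries the weight S^(m)_ij S^(n)_kl while vec(A) places A_ki at position i·n + k.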

Optimization process for Equation (8)
In this part, we present the process of optimizing Equation (8), which we solve by alternately updating A and ω (Equation (9)).

Update A when ω is fixed
In this setting, the third term in Equation (9) is a constant, and Equation (9) can be transformed into Equation (10), where S^(m,n) = S^(m) ⊗ S^(n). The partial derivative of Equation (10) with respect to Ã is given by Equation (11). By setting Equation (11) to 0, we obtain the closed-form solution for Ã in Equation (12). Thus the optimal A is obtained by vec^(−1)(Ã).

Update ω when A is fixed
In this setting, the second term in Equation (9) is a constant, and Equation (9) reduces to a problem in ω alone, which is solved by using coordinate descent. In each iteration, we update two elements with the remaining ones fixed, which leads to the update scheme in Equation (15). The optimization pseudo-code is shown in Algorithm 1.

Theoretical analysis
In this section, we perform analysis for the fusion of diffusion, which plays the key role in the performance improvement. First, we prove that the closed-form solution of the components in the square brackets of Equation (8) can be expressed by

A* = vec^(−1)((1 − α)(I − α S)^(−1) vec(I)),   (16)

where S denotes the Kronecker product of S^(1) and S^(2), i.e. S = S^(1) ⊗ S^(2), and S^(1) and S^(2) are the transition matrices of two kinds of similarities. vec(·) denotes the vectorization of a matrix by concatenating its columns one by one, and vec^(−1) denotes its inverse function. α ∈ (0, 1) is a trade-off parameter.
Proof: When we set all the entries in ω to 1, the third term is a constant. In this setting, the first term in the square brackets of Equation (8) can be rewritten in terms of S^(m,n) and Ã (Equation (17)); expanding Equation (17) over the indices i, j, k, l yields Equation (19). The second term in the bracket can be rewritten analogously (Equation (20)), and the addition of the two terms gives the objective in Equation (22). To find the optimal solution of Equation (22), we set ∂J_0/∂Ã = 0 and solve for Ã; by setting μ = 1/α − 1, we obtain an expression for Ã that equals Equation (16) after applying vec^(−1)(·) to both sides.

The overall optimization pseudo-code (Algorithm 1) proceeds as follows: cluster the latent representations to produce pseudo-labels via solving Equation (4); update the generator and the discriminator in the first GAN; learn the feature projection matrices as ground-truth clustering solutions on batches via solving Equation (6); and update the generator and the discriminator in the second GAN.

Lemma 4.1. With sufficient iterations, any fusion Ã_{t+1} = α S Ã_t + (1 − α) vec(I), which is derived from such a diffusion, converges to Equation (16).
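This convergence can be checked numerically: assuming the closed form takes the standard diffusion shape vec(A*) = (1 − α)(I − αS)^{−1} vec(I) with S = S^(1) ⊗ S^(2), the iteration is a contraction for α < 1 and reaches the closed form from any start (the sizes and α below are illustrative).

```python
import numpy as np

rng = np.random.default_rng(5)
n, alpha = 6, 0.7

def transition(X):
    # non-negative row-stochastic transition matrix
    X = np.abs(X) + np.abs(X.T)
    return X / X.sum(1, keepdims=True)

S1 = transition(rng.normal(size=(n, n)))
S2 = transition(rng.normal(size=(n, n)))
S = np.kron(S1, S2)                      # S = S^(1) (x) S^(2), row-stochastic
vI = np.eye(n).flatten(order="F")        # vec(I), column-major

# Closed form: vec(A*) = (1 - alpha)(I - alpha S)^{-1} vec(I).
a_star = (1 - alpha) * np.linalg.solve(np.eye(n * n) - alpha * S, vI)

# Diffusion iteration a_{t+1} = alpha S a_t + (1 - alpha) vec(I); since S is
# row-stochastic and alpha < 1, the error contracts by at least alpha per step.
a = rng.normal(size=n * n)
for _ in range(200):
    a = alpha * S @ a + (1 - alpha) * vI

A_star = a_star.reshape(n, n, order="F")  # vec^{-1}
print(np.abs(a - a_star).max())
```

The fixed point satisfies (I − αS)a* = (1 − α)vec(I), which is exactly the closed form, so iterative diffusion and the direct solve are interchangeable in practice.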

Experiment
In this section, we first conduct comparative experiments, where accuracy (AC) and normalized mutual information (NMI) are adopted to quantify the clustering quality. Second, we divide our method into five stages and analyse their contributions to the improvements. Third, we explore the effects of the network architecture of the feature extractor and of the trade-off parameters in the fusion of diffusion on performance. Finally, we explore a situation where the label spaces of the easy data sets do not fully overlap while the union of their labels is the same as that of the difficult data set.
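For reference, AC (with the optimal Hungarian matching between cluster ids and true labels) and a sqrt-normalized NMI can be computed as follows; this is a minimal sketch, and the specific NMI normalization is an assumption since the paper does not state which variant it uses.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """AC: best one-to-one mapping of cluster ids to labels (Hungarian)."""
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)   # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)

def nmi(y_true, y_pred):
    """NMI normalized by sqrt(H(U) H(V))."""
    n = len(y_true)
    c = np.zeros((y_true.max() + 1, y_pred.max() + 1))
    for t, p in zip(y_true, y_pred):
        c[t, p] += 1
    pu, pv, puv = c.sum(1) / n, c.sum(0) / n, c / n
    mask = puv > 0
    mi = (puv[mask] * np.log(puv[mask] / np.outer(pu, pv)[mask])).sum()
    h = lambda p: -(p[p > 0] * np.log(p[p > 0])).sum()
    return mi / np.sqrt(h(pu) * h(pv))

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_perm = np.array([1, 1, 1, 2, 2, 2, 0, 0, 0])   # same partition, ids permuted
print(clustering_accuracy(y_true, y_perm), nmi(y_true, y_perm))
```

Both metrics are invariant to relabelling of clusters, which is why a permuted but otherwise identical partition scores 1.0 on each.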

Data set
In this part, we introduce the real-world data sets used to evaluate the effectiveness of our approach. These data sets include Amazon reviews, news texts, travel reviews, autistic spectrum disorder screening data, KEGG metabolic relation network data, letter recognition and EMNIST. Their descriptions are in Table 1. Specifically, the letter recognition data set has 20,000 samples with 16 features and EMNIST has 145,600 samples with 256 features.

Parameter setting
In this part, we describe the parameter settings of our approach and the comparative counterparts. Concerning Amazon Review and News Texts, the feature extractor consists of four hidden fully-connected layers with 2500, 1000, 500 and 100 neurons, respectively; the generator in the CGAN consists of four layers with 10, 25, 50 and 100 neurons, respectively, while the discriminator in the CGAN consists of four layers. The generator and the discriminator in the second GAN, used to synthesize the feature projection matrices, have symmetrical structures, where the former consists of three layers with 40, 80 and 200 neurons and the latter mirrors this structure (Figure 3). The trade-off parameter in LSSMTC for controlling the relative importance between single-task clustering and multi-task clustering is in the range {0.1, 0.2, . . . , 0.9}, and the dimension of the shared embedding space is searched across {2, 2^2, . . . , 2^6}. The number of neighbours for spectral clustering is in the range {3, 6, . . . , 15}. The number of clustering solutions in consensus clustering is in the range {2, 4, . . . , 8}. For Amazon Review and News Texts, the sample rate of features selected in consensus clustering is in the range {0.1, 0.2, . . . , 0.9}.
Before illustrating the experimental results, we first show the trend of the cost function value with respect to the iterations. From Figure 4, we observe that the cost function converges quickly, within 10 iterations, over all the data sets. This indicates that the fusion-of-diffusion stage costs little running time to obtain the final clustering result.

Comparative experiments with other related methods
In this section, we present and analyse the comparative experiments on the above data sets against related methods including K-means, spectral clustering, consensus clustering, multi-task clustering, deep clustering and adversarial multi-source domain adaptation (MDAN) (Zhao et al., 2018). We choose LSSMTC (Gu & Zhou, 2009, December) and deep embedding clustering (DEC) (Xie et al., 2016) as multi-task clustering and deep clustering, respectively. The comparison results are shown in Tables 3 and 4. From these tables, we have the following observations: (1) Consensus clustering has achieved better performances than K-means and spectral clustering since it can combine and fuse the multi-view informative knowledge in multiple clustering solutions. The solutions are obtained by conducting random subspace tricks or random transformation on the original features.
(2) LSSMTC performs better than consensus clustering, since knowledge is explored and transferred among tasks. This indicates that, compared with the informative knowledge in multiple solutions, the knowledge discovered from other tasks can play a more important role in performance improvement. (3) DEC achieves better or comparable performance compared with LSSMTC, although it utilizes neither multiple solutions nor multiple tasks. This indicates that a neural network with an appropriate structure has enough capability to learn effective information beneficial to downstream tasks. (4) Both MDAN-hard and MDAN-soft significantly outperform LSSMTC and DEC, which indicates that combining the knowledge from other tasks with the powerful learning capability of neural networks can further enrich the discriminativeness and complementarity of features. (5) Our proposed method achieves better performance than MDAN-hard and MDAN-soft in the majority of cases. The advantage of our method over MDAN-hard (soft) can be attributed to the clustering-guided feature extractor for learning discriminative information, the exploration of the common knowledge of related tasks and the fusion of the clustering solutions generated by GANs.
From the comparison results, we observe that although consensus clustering can generate diverse clustering solutions by adopting random subspaces or transformations on the original features, it may suffer from information loss and ignores the different robustness of tasks to the transfer of the mined common information. LSSMTC achieves performance improvement by searching a shared subspace across related tasks; however, it works under the assumption that related tasks share the same semantic space, which limits its applications. DEC tries to identify a clustering-friendly embedding space by optimizing the representations and the clustering assignment alternately; however, its capability of mining transferable knowledge among tasks may be restricted by the lack of generating high-dimensional samples, especially for a difficult task. In contrast, MDAN adopts adversarial domain adaptation to learn domain-invariant and task-discriminative features under multiple source domains; however, it neither embraces the capability of generative models nor dynamically adjusts the balance between domain-invariant and task-discriminative features.

Explore effects of components on clustering performance
In this part, we divide our proposed method into five stages and explore their contributions to the performance improvement. Specifically, in the first stage, we denote the clustering result obtained from the learnt mutual representations by S1. In the second stage, we denote the clustering result obtained with the help of synthetic mutual representations by S2, since the clustering provides pseudo-labels based on which more representations can be synthesized to in turn improve the learning of representations. In the third stage, we denote by S3 the clustering result obtained by considering the feature projection matrices learnt on batches of mutual representations. In the fourth stage, we denote by S4 the clustering result of combining the feature projection matrices learnt on batches and those generated by GANs using a certain voting scheme. In the fifth stage, we denote by S5 the clustering result obtained by fusing the feature projection matrices with diffusion. Experimental results are shown in Tables 5-6. From these tables, we observe that all five stages bring about positive effects, and that there exist contribution differences among the stages. For example, on the Amazon Review data set, the performance improvement is mainly attributed to S2 and S4. This means that the increasing amount of latent representations does help the learning of discriminative knowledge, whether the representations come from the original input or from GANs; this is key to why S4 and S5 work, since both of them need more diverse and discriminative information. However, on the travel review data sets TrRe-1 and TrRe-2, the stages play approximately the same role in the enhancement. Besides, S5 outperforms S4 on all the data sets, indicating that the fusion of diffusion can better integrate effective knowledge from multiple affinity matrices into a unified solution than the voting scheme (S4) can.
The reason is that the order of information fusion makes a difference, while the order has little effect on the voting scheme.
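The paper does not specify which voting scheme S4 uses; one common, order-invariant instance is co-association (evidence accumulation) voting, sketched below. The function name and the toy label vectors are illustrative assumptions; the sketch also demonstrates why shuffling the solution order has no effect on a voting-style combination:

```python
import numpy as np

def co_association(solutions):
    """Order-invariant voting: entry (i, j) is the fraction of clustering
    solutions that place samples i and j in the same cluster."""
    n = len(solutions[0])
    C = np.zeros((n, n))
    for labels in solutions:
        labels = np.asarray(labels)
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / len(solutions)

# Three toy clustering solutions over four samples; note that solutions
# 1 and 2 are the same partition under different label names.
solutions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
C = co_association(solutions)
assert C[0, 1] == 2 / 3                  # samples 0 and 1 co-cluster in 2 of 3 solutions
# Reversing the solution order leaves the matrix unchanged (order-invariant).
assert np.allclose(C, co_association(solutions[::-1]))
```

Because each solution contributes an identical additive term, the result is invariant to the fusion order, unlike an iterative diffusion process whose intermediate state depends on which affinity matrices were merged first.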
Apart from dissecting the contribution of each stage, we investigate the effect of the number of clustering solutions on performance over the Amazon Review data set by comparing consensus clustering with our proposed method, as shown in Figure 5. From this figure, we observe that as the number of clustering solutions increases, the performances of both consensus clustering and our proposed method show upward trends. However, the improvement gain brought by additional clustering solutions diminishes in the case of consensus clustering, whereas for our method the gain persists over a larger range of solution numbers. In addition, for the same increment in clustering solutions, our method enjoys a larger improvement than consensus clustering. We attribute this to the fact that the generated diverse representations in the latent space make it feasible to analyse the data from multiple perspectives. Meanwhile, more clustering solutions are obtained before being fed into the second GAN, which synthesizes solutions containing complementary information.
Besides, we explore the effect of the number of easy data sets on clustering performance over Amazon Review and News Texts, as shown in Figure 6. It can be observed that as the number of easy data sets increases, the performance improves significantly in all cases. This indicates that with more easy data sets available, much richer complementary information can be provided, so the feature extractor can learn more informative and discriminative representations that benefit downstream tasks. As a result, more diverse clustering solutions can be generated.

Effects of the feature extractor's architectures and trade-off parameters
In this section, we first investigate the effects of the feature extractor's architecture, namely the network depth and width, on clustering performance. Second, we explore the effects of the trade-off parameters used in the fusion of diffusion.

Effects of the feature extractor's architecture
We have explored the effects of the network depth and width of the feature extractor on clustering performance in terms of AC and NMI over the data sets Amazon Review and News Texts, as shown in Figures 7 and 8. Here the depth denotes the number of hidden layers, and the width denotes the number of neurons or the number of convolutional layers. From these figures, we observe that as the depth increases, performances show a rapid upward trend before levelling off. The performance gains diminish once the number of hidden layers exceeds a certain value. Similar observations hold for the network width. This indicates that as the network structure becomes more complex, an increasing amount of data is required to train the network to achieve satisfactory performance.
Figure 6. Effects of the number of data sets on performance in terms of AC and NMI over Amazon Review and News Texts, where RC, NG, Reu and TD stand for RCV1-4, 20NG-4, Reuters-4 and TDT2-4, respectively.
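The depth/width experiment above can be made concrete with a minimal sketch of a fully connected feature extractor whose depth (number of hidden layers) and width (neurons per hidden layer) are configurable. This is an illustrative toy in plain NumPy, not the paper's actual architecture; the function names and initialization are assumptions:

```python
import numpy as np

def build_extractor(in_dim, width, depth, out_dim, seed=0):
    """Randomly initialised MLP: `depth` hidden layers of `width` neurons each,
    He-style scaling for the ReLU nonlinearity."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [width] * depth + [out_dim]
    return [rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in)
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def extract(weights, x):
    """Forward pass: ReLU hidden layers, linear output representation."""
    for W in weights[:-1]:
        x = np.maximum(x @ W, 0.0)
    return x @ weights[-1]

params = build_extractor(in_dim=300, width=128, depth=3, out_dim=64)
z = extract(params, np.ones((5, 300)))
assert z.shape == (5, 64)
# Capacity grows linearly with depth and roughly quadratically with width,
# which is why wider/deeper variants demand more training data.
n_params = sum(W.size for W in params)
assert n_params == 300 * 128 + 2 * 128 * 128 + 128 * 64
```

The parameter count makes the observed trend plausible: each extra hidden layer adds width² weights, so gains saturate once the data no longer suffices to fit the added capacity.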

Effects of trade-off parameters in the fusion of diffusion
We have explored the effects of μ and τ on Amazon Review and News Texts. μ controls the balance between the smoothness of pairwise affinity graphs and the preservation of self-similarity, while τ controls the distribution of the weights of the pairwise affinity graphs. Figure 9 shows the performance trends when μ varies between 0.1 and 0.9. Generally, as μ increases, performances show upward trends before either levelling off or declining to various extents. The optimal μ for Amazon Review is 0.5, which indicates that the smoothness of affinity graphs and the self-similarity preservation play approximately equal roles in the performance improvement. The optimal μ for News Texts is 0.6, which means that the self-similarity preservation makes a larger contribution than the above smoothness. Figure 10 displays the effects of τ on performance in terms of AC and NMI, where τ varies from 15 to 135. From this figure, we observe that when τ is smaller than 75 on Amazon Review or 60 on News Texts, clustering performances remain steady with slight fluctuations. However, as τ continues to increase, the performances become significantly poorer. This indicates that, within a relatively wide range, τ can not only preserve the discriminative knowledge in the informative affinity graphs but also suppress the negative effects of the non-informative ones. Overall, performance is more robust to τ than to μ. Besides, we have conducted experiments on determining the trade-off parameter α. The detailed experimental results are shown in Figure 11. As described in Equation (1), α controls the relative importance between the learning of reconstructive representations in the respective tasks and the learning of common representations.
From Figure 11, one can observe that as α increases, the performance shows an obvious upward trend, which indicates that the learning of common representations is beneficial to knowledge mining and transfer from easy to difficult tasks. Once α exceeds a certain value, performance exhibits various trends as α increases further. For example, on the Amazon Review data set, when α is larger than 0.6, the performances of the cases 'D, E, K→B' and 'B, E, K→D' show significant downward trends, while those of the cases 'B, D, K→E' and 'B, D, E→K' remain relatively steady. This indicates that different tasks exhibit varying robustness to knowledge transfer. By contrast, on the News Texts data set, when α is larger than 0.6, the performances of all cases become much worse. Overall, a large α encourages the learning of common representations, from which the shared knowledge is mined and transferred. However, a large α also impairs the learning of reconstructive representations, degrading the performance of each clustering task. Since tasks have various sensitivities to knowledge transfer, it is reasonable to initialize α with a small value and increase it carefully as iterations proceed.
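Neither the exact diffusion update nor the α schedule is given in this section, so the roles of the three trade-off parameters can only be illustrated with a hypothetical sketch. Below, τ is modelled as a softmax temperature over per-graph quality scores, μ balances a cross-graph diffusion term against the initial (self-similarity) affinity, and α is ramped linearly as suggested above. All function names, the specific update rule, and the quality scores are illustrative assumptions, not the paper's formulas:

```python
import numpy as np

def fuse_affinities(graphs, quality, mu=0.5, tau=50.0, iters=20):
    """Hypothetical fusion-of-diffusion sketch.
    tau: temperature of the softmax over per-graph quality scores, i.e. how
         peaked the weights of the pairwise affinity graphs are.
    mu : balances cross-graph smoothing against self-similarity preservation."""
    w = np.exp(np.asarray(quality, dtype=float) / tau)
    w /= w.sum()                                   # graph weights (tau-controlled)
    A0 = sum(wk * G for wk, G in zip(w, graphs))   # weighted initial affinity
    A = A0.copy()
    for _ in range(iters):
        S = A / A.sum(axis=1, keepdims=True)       # row-stochastic transition matrix
        A = mu * (S @ A0 @ S.T) + (1 - mu) * A0    # diffuse vs. keep self-similarity
    return A

def alpha_schedule(step, total_steps, alpha_init=0.1, alpha_max=0.6):
    """Start with a small alpha (favour per-task reconstruction) and raise it
    linearly, matching the suggestion to increase alpha over iterations."""
    frac = min(step / total_steps, 1.0)
    return alpha_init + (alpha_max - alpha_init) * frac

rng = np.random.default_rng(0)
G1, G2 = rng.random((6, 6)), rng.random((6, 6))
G1, G2 = (G1 + G1.T) / 2, (G2 + G2.T) / 2          # symmetric toy affinity graphs
A = fuse_affinities([G1, G2], quality=[1.0, 0.5])
assert A.shape == (6, 6) and np.allclose(A, A.T)   # fusion preserves symmetry
assert abs(alpha_schedule(0, 100) - 0.1) < 1e-12   # small alpha early on
assert abs(alpha_schedule(100, 100) - 0.6) < 1e-12 # capped near the 0.6 knee
```

In this sketch a large τ flattens the graph weights (non-informative graphs leak in), while a small τ concentrates weight on high-quality graphs, consistent with the observed degradation once τ grows too large.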

Extendable exploration
We explore the feasibility of our method in a more challenging application where the label spaces of the easy data sets do not completely overlap, but the union of the unique classes of these data sets fully covers the classes of the hard data set. To verify this extendability, we construct the following scenarios: (1) easy data sets Letter1 and EMNIST1 are generated from Letter Recognition and EMNIST Letter by randomly selecting 20 classes each, with 26 unique classes in total; the hard data set is Chars74K-E, which contains 26 classes. (2) Easy data sets Pen-based1 and Optical1 are generated from Pen-based Recognition and Optical Recognition by randomly selecting eight classes each, with 10 unique classes in total; the hard data set is Semeion Handwritten Digit. (3) Easy data sets USPS1 and MNIST1 are generated from USPS and MNIST by randomly selecting eight classes each, with 10 unique classes in total; the hard data set is SVHN. The experimental results in the above scenarios are shown in Figure 12, which compares our proposed method with DEC, MDAN-hard and MDAN-soft. From this figure, we observe that DEC achieves the worst performance. The reasons are that (1) DEC fails to learn discriminative features when the label spaces of the data sets do not overlap, and (2) DEC lacks an adversarial mechanism to generate diverse representations for information enrichment. By contrast, our method outperforms the other competitors in all cases. This indicates that our method can not only utilize the knowledge from multiple data sets by exploring the common features, as MDAN-hard and MDAN-soft do, but also better integrate the knowledge from the multiple affinity graphs obtained from the feature projection matrices.
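The scenario construction above amounts to sampling a class subset for each easy data set until the union of the subsets covers the hard set's label space. A minimal sketch (the function name and the rejection-sampling loop are illustrative assumptions; the paper does not describe how the random selection was enforced to cover all classes):

```python
import random

def make_easy_label_spaces(all_classes, subset_size, n_easy, seed=0):
    """Randomly pick `subset_size` classes for each easy data set, resampling
    until the union of the easy label spaces covers every hard-set class."""
    rng = random.Random(seed)
    while True:
        subsets = [sorted(rng.sample(all_classes, subset_size))
                   for _ in range(n_easy)]
        if set().union(*map(set, subsets)) == set(all_classes):
            return subsets

# Scenarios (2) and (3): two easy sets of eight classes covering 10 digit classes.
subsets = make_easy_label_spaces(list(range(10)), subset_size=8, n_easy=2)
assert all(len(s) == 8 for s in subsets)
assert set(subsets[0]) | set(subsets[1]) == set(range(10))
```

With two subsets of eight out of ten classes, most random draws already cover the full label space, so the rejection loop terminates quickly; scenario (1) (two subsets of 20 out of 26) behaves similarly.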

Conclusion
In this paper, we propose a framework for generating diverse clustering solutions, followed by solution retrieval to improve the clustering performance. More specifically, we first utilize a feature extractor to learn compact representations for multiple domains, so that each representation in a difficult domain can be represented by a combination of representations from easy domains. Second, we conduct clustering on these representations, based on which a conditional GAN is introduced to generate more diverse representations conditioned on the clustering pseudo-labels. The synthetic representations facilitate finding a better embedding space. Third, we learn the feature projection and partition matrices on batches of the above representations and treat the feature projection matrices as clustering solutions, which are input into another GAN to generate more clustering solutions. Finally, these solutions are merged by the fusion of diffusion, followed by performing a transformation on the latent representations. We conduct extensive comparative experiments against state-of-the-art methods on multiple real-world applications, and the experimental results demonstrate the effectiveness of our proposed framework. Since the proposed approach is not an end-to-end clustering solution generation scheme, which lowers its efficiency, we will design an end-to-end scheme to mitigate this. Meanwhile, we shall develop a better fusion strategy to integrate the information of clustering solutions and promote knowledge transfer from easy source tasks to hard ones.

Disclosure statement
No potential conflict of interest was reported by the author(s).