Towards semi-supervised ensemble clustering using a new membership similarity measure

Hierarchical clustering is a common family of clustering methods in which the dataset is divided hierarchically and represented by a dendrogram. Agglomerative Hierarchical Clustering (AHC) is a widely used variant in which clusters are built bottom-up. Semi-supervised clustering, a relatively recent direction in machine learning, combines supervised and unsupervised learning: a small amount of labelled data guides the unsupervised process and can effectively improve clustering performance. Meanwhile, ensemble clustering, which combines the results of several individual clustering methods, can achieve better performance than each of the individual methods. Combining AHC with semi-supervised learning in an ensemble clustering configuration has received little attention in the literature. To achieve better clustering results, we propose a semi-supervised ensemble clustering framework built on AHC-based methods. We develop a flexible weighting mechanism together with a new membership similarity measure that establishes compatibility between the semi-supervised clustering methods. We evaluated the proposed method against several comparable methods on a wide variety of UCI datasets. Experimental results show the effectiveness of the proposed method in terms of NMI, ARI and accuracy.


Introduction
Currently, machine learning systems are commonly classified into four general groups: supervised learning, unsupervised learning, Semi-Supervised Learning (SSL), and reinforcement learning [1,2]. In supervised learning, the class labels of the data are known and available during the learning phase; classification is one of the typical problems in this setting. Some of the most common classification algorithms are logistic regression, k-nearest neighbours, support vector machines, neural networks, decision trees and random forests [3,4]. In unsupervised learning, class labels are not available and the learning process seeks to assign an appropriate label to each object; clustering, in which groups of similar objects must be identified, is a typical problem in this setting. Some of the most common clustering algorithms are K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Hierarchical Cluster Analysis (HCA) and Fuzzy C-Means (FCM) [5].
One of the successful approaches of recent years for improving clustering performance is ensemble clustering [6,7]. The main idea is to combine the prediction results of different individual clustering models: combining the output partitions of several base models can yield higher-quality clusters than any single model. In this regard, using ensemble clustering in the context of hierarchical clustering can be expected to produce a higher-quality final partition [8]. According to recent studies, the problem of ensemble hierarchical clustering has received little attention so far. Hence, we draw inspiration from hierarchical clustering and SSL to develop an efficient ensemble clustering framework [9,10].
In this paper, a flexible weighting mechanism is developed to describe the consistency between the semi-supervised clustering models used to generate base partitions. The proposed algorithm consists of three main steps: creating primary clusters with different Agglomerative Hierarchical Clustering (AHC) methods [11], computing a new membership similarity measure between objects, and finally re-clustering the primary clusters to create the final clusters. We generate primary clusters with four linkage-based AHC methods. The results are evaluated at the cluster and partition levels using a robustness measure, and the weight of each primary cluster is derived from its robustness. The primary clusters with the highest weights are selected for the final consensus that forms the final partition. The consensus function is based on the meta-clustering technique (i.e. re-clustering of the primary clusters), and the final partition is created by assigning each object to the meta-cluster with which it has the highest similarity.
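Purely as an illustration, the three steps above can be sketched in Python with toy stand-ins. All function names are hypothetical, and a simple pairwise-agreement score stands in for the paper's robustness weighting; this is a sketch of the pipeline's shape, not the actual method:

```python
def step1_base_partitions(data):
    # Stand-in for the four linkage-based AHC runs: here, just two
    # hard-coded partitions (label vectors) of the same four objects.
    return [[0, 0, 1, 1], [0, 1, 1, 1]]

def step2_weight(partition, reference):
    # Stand-in for the NMI-based robustness weight: fraction of object
    # pairs on which the partition agrees with the reference about
    # whether the two objects share a cluster.
    n = len(reference)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum((partition[i] == partition[j]) == (reference[i] == reference[j])
                for i, j in pairs)
    return agree / len(pairs)

def step3_consensus(partitions, reference, threshold=0.5):
    # Stand-in for consensus: keep only base partitions whose weight
    # meets the threshold (the real method re-clusters primary clusters).
    return [p for p in partitions if step2_weight(p, reference) >= threshold]

reference = [0, 0, 1, 1]
kept = step3_consensus(step1_base_partitions(None), reference, threshold=0.6)
```

Here the second base partition scores 0.5 and is filtered out, mirroring how low-robustness clusters are excluded from the final consensus.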
The main contributions of this paper are as follows:
• Configuration of a new membership similarity measure between objects, inspired by the evaluation of clusters and partitions.
• Development of a flexible weighting mechanism to generate consistent base partitions.
• Improvement of the learning process in ensemble clustering using semi-supervised hierarchical clustering.
The outline of the rest of the paper is as follows: Related works are reviewed in Section 2. General concepts related to clustering are given in Section 3. Section 4 explains the details of the proposed algorithm. Section 5 reports the experiments and evaluations. Finally, Section 6 concludes the paper.

Related works
This section reviews the literature on ensemble clustering and the related concepts of the semi-supervised framework [12-14]. A summary of the reviewed studies is given in Table 1.
Zhang et al. [15] presented a Two-Stage approach for Semi-supervised Ensemble Clustering based on constraint weight (TSSEC). The authors identify several shortcomings of earlier approaches: the supervised data is used only for the ensemble process, the final clusters are formed without considering redundancy, and the influence of different clusters is ignored when forming the final clusters. To address these shortcomings, TSSEC selects appropriate clusters and assigns cluster weights, using pairwise constraints for both tasks. TSSEC selects a subset of primary clusters based on the quality and diversity of the supervised data, computes the quality of the selected clusters from both unsupervised and supervised data, and finally uses a weighted correlation matrix to generate the final clusters.
Yang et al. [16] proposed a semi-supervised consensus clustering approach using closed patterns. The authors extended their earlier MultiCons multiple-consensus clustering work and presented the Semi-MultiCons approach. Semi-MultiCons does not depend on the number of clusters and creates the final clusters based on different pairwise constraints. In addition, the approach can reduce the negative effects of integrating constraints into the clustering process.
Kadhim et al. [17] presented an ensemble clustering approach based on the Self-Directed Learning (SDL) framework, which helps the consensus function achieve the best evaluation under the performance measures. SDL combines Predicting Test-set Labels (PTL) and Detecting Best Results (DBR). PTL combines clustering results sequentially to predict labels and produce satisfactory results, while DBR finds the correct result when several different results are predicted for the same model. The authors also introduced new performance measures for clustering validation, the most important of which is the Correction Ratio (CR).
Li et al. [18] proposed a new ensemble clustering algorithm for data with different scales. The authors introduce the Meta-Clustering Ensemble method based on Model Selection (MCEMS), a multi-step approach to data clustering. MCEMS estimates the similarity between objects by considering several primary clusters from different models, and is equipped with a clustering-model selection technique that accounts for quality and diversity.

Proposed algorithm
Ensemble clustering is proven to be an ideal alternative to an individual clustering algorithm in terms of robustness and stability [19]. The aim of this paper is to combine the advantages of SSL and ensemble clustering to improve clustering performance. Figure 1 describes the general framework of the proposed algorithm. First, the dataset is clustered by several semi-supervised AHC-based models. Two sources of supervision are considered: information based on pairwise constraints and information based on metric constraints. This information can capture different aspects of the dataset and provides more flexibility for clustering. In both cases, we use four linkage-based AHC methods: single, centroid, average, and complete. Meanwhile, we present an innovative approach to measuring the distance between objects, based on the Euclidean distance and the cluster size.

System model
Any clustering method can be applied to a given dataset and returns a partition as output [20]. Let X = {x_1, x_2, ..., x_N} be a dataset with N objects, where x_i ∈ X represents the i-th object of X and x_i = [x_i^1, x_i^2, ..., x_i^M] is its vector of M features. Let π be an individual clustering method and π(x_i) the label of the cluster to which x_i belongs. In the ensemble clustering problem, X is clustered by a set of P individual methods. Let Π = {π_1, π_2, ..., π_P} be this set, where each method provides a partition as its clustering output. Each partition contains several clusters, which can differ according to the clustering method used. Let π_k = [c_1^k, c_2^k, ..., c_{|π_k|}^k] denote the set of primary clusters generated by the k-th ensemble member, where |π_k| is the partition size (the number of clusters generated).

Semi-supervised AHC clustering based on pairwise constraints
Basically, constraint-based knowledge can improve clustering performance because it is easier to obtain than object labels. Pairwise constraints indicate whether a pair of objects may be placed in the same group or not [21]. In general, pairwise constraints comprise must-link and cannot-link constraints. Let ML = {(x_i, x_j)} denote the must-link set, where x_i and x_j may be grouped into the same cluster, and let CL = {(x_i, x_j)} denote the cannot-link set, where x_i and x_j must be grouped into different clusters. Both must-link and cannot-link constraints have the properties of symmetry and transitivity: symmetry states that (x_i, x_j) ∈ ML implies (x_j, x_i) ∈ ML (and likewise for CL), while transitivity states that (x_i, x_j) ∈ ML and (x_j, x_k) ∈ ML together imply (x_i, x_k) ∈ ML. Let x_i, x_j and x_k be three objects of X; the symmetry and transitivity properties are then defined by Equations (1) and (2), respectively.
Let d_{i,j} ∈ D be the distance between x_i and x_j in the distance matrix D. The distance matrix is adjusted according to the pairwise constraints: if (x_i, x_j) ∈ ML, then d_{i,j} = 0, and if (x_i, x_j) ∈ CL, then d_{i,j} = ∞. Meanwhile, let s_{i,j} ∈ S be the similarity between x_i and x_j in the similarity matrix S, defined by Equation (3) as a Gaussian kernel, s_{i,j} = exp(−||x_i − x_j||² / (2 σ_i σ_j)), where ||x_i − x_j|| is equivalent to d_{i,j}, and σ_i and σ_j are the scale parameters corresponding to x_i and x_j, respectively.
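The constraint-adjusted distance matrix described above can be sketched as follows. This is a minimal illustration: the function name is ours, and plain Euclidean distance stands in for the paper's full distance computation.

```python
import math

def constrained_distances(points, must_link, cannot_link):
    """Build a Euclidean distance matrix, then overwrite entries
    according to the pairwise constraints: must-link pairs get
    distance 0, cannot-link pairs get distance infinity."""
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    for (i, j) in must_link:
        d[i][j] = d[j][i] = 0.0       # symmetry of must-link
    for (i, j) in cannot_link:
        d[i][j] = d[j][i] = math.inf  # symmetry of cannot-link
    return d

D = constrained_distances([(0, 0), (1, 0), (5, 5)],
                          must_link=[(0, 1)], cannot_link=[(0, 2)])
```

Setting must-link distances to 0 guarantees those objects merge first in agglomerative clustering, while the infinite cannot-link distances keep their clusters from merging under single linkage.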
Finally, the dataset X is clustered using the similarity matrix S. We use four linkage-based AHC methods, namely single, centroid, average, and complete linkage, to perform the clustering and create partitions. Each method presents its clustering result as a dendrogram, and each level of the dendrogram is considered a partition. In this paper, Bayesian PAC learning [22] is used to select the appropriate level of the dendrogram and thus the appropriate partition. By determining the appropriate partition, the number of clusters K for each method is determined automatically.
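For intuition, single linkage (one of the four AHC variants used here) can be sketched with a naive pure-Python implementation. This is illustrative only: production AHC implementations use far more efficient algorithms, and the dendrogram-level selection via Bayesian PAC learning is not shown (the target number of clusters k is passed in directly).

```python
import math

def single_linkage(dist, k):
    """Naive agglomerative clustering: repeatedly merge the two
    clusters whose closest pair of members is nearest (single
    linkage) until only k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = (math.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between clusters is the
                # minimum distance over all cross-cluster pairs.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
dist = [[math.dist(p, q) for q in points] for p in points]
result = single_linkage(dist, 2)
```

Swapping `min` for `max` in the cross-cluster distance gives complete linkage; averaging gives average linkage.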

Semi-supervised AHC clustering based on metric constraints
Huang et al. [23] proposed the Large Margin Nearest Cluster (LMNC) distance metric for semi-supervised clustering. LMNC is inspired by the Mahalanobis metric and realizes the min-max principle, which states that robust clustering is achieved by minimizing the distances between objects in the same cluster and maximizing the distances between objects in different clusters [24]. Let {(x_i, y_i)}_{i=1}^{N} be a dataset with N objects, where x_i ∈ R^M refers to the objects and y_i ∈ {1, 2, ..., K} refers to the class labels. Also, let M be a symmetric matrix of size M × M. The squared distance between each pair of objects x_i and x_j in R^M is formulated by Equation (5) as d_M²(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j).
Basically, M is a positive semi-definite matrix, M ⪰ 0. LMNC learns M by minimizing a cost function, as shown in Equation (6),
where a_{i,j} ∈ {0, 1} is the pairing weight of x_i and x_j; a_{i,j} = 1 means that the class labels y_i and y_j of x_i and x_j are the same. Moreover, c > 0 is a positive constant, z_j is the centre of cluster j, and [f]_+ = max(f, 0) is the hinge loss. LMNC formulates the metric learning as an optimization problem that realizes the min-max principle, as shown in Equation (7),
where ξ_{i,j,l} is a slack variable that relaxes the loss function. This optimization problem is solved in LMNC by a gradient projection algorithm. Finally, the dataset X is clustered using the distance matrix D. Here too, the four linkage-based AHC methods (single, centroid, average, and complete) are used for clustering, and the Bayesian PAC learning technique is used to determine the appropriate dendrogram level and the optimal partition.
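The squared metric distance of Equation (5) can be illustrated directly. This is a minimal sketch in which M is assumed to be already given, whereas LMNC actually obtains it by solving the optimization problem above:

```python
def mahalanobis_sq(x, y, M):
    """Squared metric distance (x - y)^T M (x - y) for a given
    symmetric positive semi-definite matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    # Matrix-vector product M @ diff, written out element-wise.
    Md = [sum(M[r][c] * diff[c] for c in range(len(diff)))
          for r in range(len(diff))]
    # Dot product diff . (M @ diff).
    return sum(d * m for d, m in zip(diff, Md))
```

With M equal to the identity matrix, this reduces to the squared Euclidean distance; a learned M stretches or shrinks directions of the feature space to satisfy the min-max principle.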

Weighting mechanism
In general, a partition may be evaluated as weak overall while still containing one or more high-quality clusters. Therefore, it is not advisable to use all partitions, or all generated primary clusters, in the final consensus [5,25]; doing so may even decrease ensemble clustering performance and increase the computational complexity of the consensus function. Normalized Mutual Information (NMI) is a common metric for evaluating clustering. NMI evaluates the agreement of labels between two partitions, as shown in Equation (8). Measuring the NMI between an output partition and a reference partition evaluates the quality of the clustering method. Therefore, the robustness of each partition created in Π can be measured by Weight_NMI(π_γ) = NMI(π_γ, π*), where π* represents the reference partition; we take the robustness of a partition as its weight. By converting a cluster into a partition, NMI can also be used to evaluate clusters. Let Weight_NMI(c_i) be the weight of cluster c_i, and let AC be the set of all primary clusters of the P available partitions. The goal is to select a high-quality subset of AC to participate in the consensus function. Let SC = [c_1, c_2, ..., c_|SC|] be the set of clusters selected from AC for the final consensus. A cluster c_i belongs to SC if it satisfies a predefined threshold, Weight_NMI(c_i) ≥ θ, where θ is a fixed parameter that determines the merit of the clusters. Experimentally, θ is set to 0.35.
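The NMI weighting and threshold-based selection can be sketched as follows. Note the normalization here is the common geometric-mean form, NMI = I(A;B) / sqrt(H(A)·H(B)), which may differ from the exact Equation (8) used in the paper; the selection helper is a hypothetical simplification.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two label vectors,
    NMI = I(A;B) / sqrt(H(A) * H(B))."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())
    mi = sum(c / n * math.log(c * n / (ca[a] * cb[b]))
             for (a, b), c in cab.items())
    denom = math.sqrt(entropy(ca) * entropy(cb))
    return mi / denom if denom > 0 else 0.0

def select_clusters(weights, theta=0.35):
    """Keep the indices of clusters whose NMI-based weight meets theta."""
    return [i for i, w in enumerate(weights) if w >= theta]
```

Identical groupings under relabeling score 1, and independent groupings score 0, which is why NMI is a natural robustness weight here.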

Results and discussion
In this section, we evaluate the proposed algorithm and its results. All experiments were performed in MATLAB R2021a on a desktop with an Intel® Core™ i7-2600 processor (8 MB cache, up to 3.80 GHz), 32 GB of DDR4 RAM and 64-bit Windows 10. We use several evaluation metrics to demonstrate the superiority of the proposed algorithm: NMI, Adjusted Rand Index (ARI), accuracy and running time.

Datasets
To evaluate the proposed algorithm against existing clustering methods, several datasets from the UCI machine learning repository were used. Table 2 shows the characteristics of these datasets.

Discussion and comparisons
This section evaluates and validates the proposed algorithm in terms of different performance metrics. We compare the proposed algorithm, based on NMI, ARI, accuracy and running time, with several equivalent methods: TSSEC [15], Semi-MultiCons [16], SDL [17] and MCEMS [18]. The clustering accuracy of the proposed algorithm compared to the existing methods is shown in Figure 2, where the proposed algorithm and each method are compared in a subplot. The results show the superiority of the proposed algorithm on most of the datasets. The proposed algorithm outperforms TSSEC and MCEMS on all datasets, with average margins over all datasets of 13.41% and 15.18%, respectively. Compared to SDL, the proposed algorithm is superior on all datasets except Voice_9 and Secom; the accuracy results show that, on average, the proposed algorithm exceeds SDL by more than 6.5%. The results of the proposed algorithm are competitive with Semi-MultiCons, with an average accuracy gain of 3.34% over this method.
Table 3 shows the mean and standard deviation of performance under the ARI metric; the corresponding NMI results are reported in Table 4, and the running time of each method in Table 5. The bold entries in these tables represent the best values. The results clearly demonstrate the better performance of the proposed algorithm, and its advantage over existing methods is especially evident on large-scale datasets, as the results for the BNG Spect and BNG Vote datasets show. Compared to TSSEC, MCEMS, SDL and Semi-MultiCons, the proposed algorithm is superior in the ARI metric by 20.49%, 12.68%, 8.75% and 1.69%, respectively; for the NMI metric, the corresponding margins are 7.97%, 11.27%, 4.39% and 1.76%. In terms of running time, the proposed algorithm has the lowest complexity on average.

Conclusions
In this paper, we developed an AHC-based ensemble clustering framework inspired by SSL. We developed a flexible weighting mechanism that describes the consistency between the semi-supervised clustering methods used to generate base partitions, and we presented a new membership similarity measure that computes the similarity between objects using the results of evaluating clusters and partitions simultaneously. Evaluations on datasets from the UCI repository show that the proposed algorithm is significantly superior to equivalent methods across several performance metrics, including NMI, ARI and accuracy.
For future work, we plan to extend the proposed algorithm towards a model-based formulation that avoids re-processing the entire dataset on each run.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 1. Framework of the proposed clustering algorithm.
Here, π_k denotes the primary clusters generated by the k-th member of the ensemble, and |π_k| represents the partition size (the number of clusters generated). The consensus function in ensemble clustering provides the final partition by merging the generated partitions: it takes the P partitions of Π and produces the final partition π* = [c*_1, c*_2, ..., c*_K] with K clusters obtained from the consensus of the results in Π.
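The final assignment step, each object to its most similar meta-cluster, reduces to a row-wise argmax over an object-to-meta-cluster similarity matrix. A minimal sketch (the input layout, one row per object and one column per meta-cluster, is our assumption):

```python
def assign_to_meta_clusters(similarity_to_meta):
    """Assign each object to the meta-cluster with which it has the
    highest similarity (ties resolved toward the lowest index)."""
    return [row.index(max(row)) for row in similarity_to_meta]

# Three objects scored against two meta-clusters.
labels = assign_to_meta_clusters([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
```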

Figure 2. Comparison of different methods based on clustering accuracy.

Table 1. A summary of the reviewed studies.

Table 2. Details of datasets used in the experiments.

Table 3. ARI results for different methods.

Table 4. NMI results for different methods.

Table 5. Running time (s) results for different methods.