A semi-supervised multi-label classification framework with feature reduction and enrichment*

ABSTRACT Multi-label classification (MLC) has drawn much attention thanks to its usefulness and omnipresence in real-world applications in which objects may be characterized by more than one label, rather than a single label as in the traditional approach. Obtaining multi-label examples is costly and time-consuming; therefore, a semi-supervised learning approach should be considered to take advantage of both labelled and unlabelled data. In this work, we propose a semi-supervised MLC algorithm that exploits the specific features of the prominent class label(s) chosen by a greedy approach, as an extension of the LIFT algorithm, together with the unlabelled-data consumption mechanism of the TESC algorithm. We also build a semi-supervised MLC application framework for Vietnamese texts with several feature enrichment steps, including (a) a stage of enriching features by adding hidden topic features and (b) a stage of dimensionality reduction for removing irrelevant features. Experimental results on a dataset of hotel reviews (for tourism) indicate that a reasonable amount of unlabelled data helps to increase the F1 score. Interestingly, with a small amount of labelled data, our algorithm can reach performance comparable to the case of using a larger amount of labelled data.

Problem transformation is one of the two common approaches in MLC: it transforms the multi-label problem into one or more single-label problems that are solved with a single-label classification algorithm. It is worth noting that this method assumes the labels are independent, which ignores the correlation among labels and may therefore, in certain situations, degrade system performance. In this approach, all labels are discriminated based on an identical feature representation of the instance. This seems suboptimal, as each label is supposed to possess specific characteristics of its own. Therefore, much work has been proposed to exploit label-specific features by dividing the training dataset into two groups based on positive and negative instances and building features in different ways (Huaqiao, Zhang, Liu, & Zhao, 2011; Ju-Jie, Fang, & Li, 2015; Zhang & Wu, 2015). For example, Min-Ling Zhang and Lei Wu (2015) proposed an intuitively effective multi-label learning with Label specIfic FeaTures algorithm (named LIFT), which constructs the specific features of each label by conducting clustering analysis on its positive and negative instances, and then carries out training and testing by querying the clustering results. Similarly, Zhang, Tang, and Yoshida (2015) proposed to employ spectral clustering to find the closely located local structures between positive and negative instances, which are assumed to carry more discriminative information, and then to transform the original dataset by consulting the clustering results in classification. Huaqiao et al. (2011) compute feature density and then select features of high density on the positive and negative instance sets for each label, respectively; the label-specific features are constructed by taking the intersection on the corresponding label. Finally, each label is classified based on its specific features.
These approaches used clustering, a fundamentally unsupervised learning technique, to group a set of objects in such a way that objects in the same group are more similar (with regard to a similarity/distance measure) to each other than to those in other groups (Basu, 2005). Clustering can aid classification by discovering the kinds of structure present in training examples. Since manual labelling is time-consuming and expensive, combining labelled and unlabelled data in a semi-supervised classification framework provides a more effective and cheaper approach to enhance performance.
Recently, semi-supervised clustering techniques have been applied to single-label classification (Dara, Kermer, & Stacey, 2002; Demirez, Bennett, & Embrechts, 1999; Luo, Liu, Yang, Wang, & Zhou, 2015; Tian, 2014). The self-organizing map (SOM) technique (Dara et al., 2002) is used to cluster unlabelled instances and to infer possible labellings from the clusters, which are presented to a multi-layer perceptron along with labelled instances. This method significantly improves classification performance on all the experimental datasets. Demirez et al. (1999) take the approach of incorporating label information into an unsupervised learning method to group both labelled and unlabelled instances into clusters, where each cluster is as pure as possible in terms of the class distribution provided by the labelled instances. The clusters help characterize segments of the population likely or unlikely to possess the target characteristic represented by the label. This method can be used for both inductive and transductive inference. Tian (2014) introduced a novel semi-supervised learning method, called TExt classification using Semi-supervised Clustering (TESC). In the clustering process, TESC uses labelled texts to capture the silhouettes of text clusters; next, the unlabelled texts are added to the corresponding clusters to adjust their centroids. These clusters are used to build the model in the classification phase. Given a new unlabelled text, its label is taken from the label of its nearest text cluster. For MLC, Kong, Ng, and Zhou (2013) proposed a model of transductive multi-label learning by label set propagation (TRAM) to assign a set of multiple labels to each instance. Firstly, TRAM formulates transductive multi-label learning as an optimization problem of estimating label concept compositions. Then, it derives a closed-form solution to this optimization problem and develops an algorithm to assign label sets to unlabelled instances.
In this paper, we propose a framework for MLC whose key contribution is a semi-supervised classification algorithm (called MULTICS, see Note 1) which exploits specific features of a label/label set based on a semi-supervised clustering technique to extract useful information from labelled and unlabelled instances together. In addition, we extend the framework for MLC with several steps of feature reduction and feature enrichment to evaluate the proposed algorithm and boost the performance of the overall system. The rest of this paper is organized as follows. In the next section, we describe the proposed semi-supervised learning framework for MLC, focusing on the details of the MULTICS algorithm. An application framework using the proposed algorithm in MLC for Vietnamese texts is described in Section 3, where we also discuss the experimental results and evaluate the performance of the system. The last section presents conclusions and a summary.

Problem setting
Let D = D_L ∪ D_U be an instance collection, where D_L and D_U are the collections of labelled and unlabelled instances, respectively, and let L be the set of q labels. The task of semi-supervised MLC is to construct the classification function f : D_L ∪ D_U → 2^L. In our proposed semi-supervised MLC algorithm, MULTICS, the goal in building the classifier is to find a partition C derived from D such that all instances in cluster C_i are given the same non-empty label set (called the cluster label) l_Ci.
In the traditional unsupervised clustering method, the number of clusters is often predefined and chosen manually. However, in our algorithm, the number of clusters (m) is automatically identified based on the label set in combination with the labelled and unlabelled dataset.
After we have obtained the partition C, given a new unlabelled instance d_u ∈ D_U, f employs the 1-nearest-neighbour rule to get the nearest cluster C_j = argmin_{C_p} dis(d_u, c_p), where c_p is the centroid of cluster C_p and dis(.) is the distance between two instances; then the cluster label of C_j is assigned to d_u, i.e. l(d_u) = l_Cj. Our contribution is to consume both labelled and unlabelled instances to find the partition C forming the classification model f, which can predict the associated label(s) of the unlabelled instances in D_U.
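As a small sketch, the nearest-centroid classification rule above can be written in a few lines of Python. The data layout (clusters as dicts with an "instances" list and a "label" set) and the function names are our own illustrative choices, not the paper's:

```python
import math

def centroid(points):
    # Mean of the instance vectors in a cluster, i.e. c_p.
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def dis(a, b):
    # Euclidean distance, the dis(.) measure used in the paper's experiments.
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def classify(d_u, clusters):
    # 1-nearest-neighbour over cluster centroids: return the cluster label
    # l_Cj of the cluster C_j whose centroid is closest to d_u.
    best = min(clusters, key=lambda c: dis(d_u, centroid(c["instances"])))
    return best["label"]
```

For example, with two clusters labelled {"service"} and {"food", "room standard"}, an instance near the second cluster's centroid receives the label set {"food", "room standard"}.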

Brief summary of LIFT and TESC
LIFT algorithm
LIFT was proposed for enhancing the performance of supervised MLC using label-specific features. Concretely, in its first step, LIFT aims at figuring out features with label-specific characteristics, so as to provide appropriately discriminative information to facilitate learning as well as classification. For each class label l_k ∈ L, the sets of positive and negative training instances are formed as the sets of training instances with and without label l_k, respectively. After that, clustering analysis is performed on the positive and negative sets to extract the features specific to l_k. In the second step, q binary classifiers, one for each class label l_k using the l_k-specific features, are used to check whether a new instance has the label l_k. LIFT is a supervised method, in which the input is a labelled dataset for the training process and the output is a classification model comprising the family of q classifiers corresponding to the q labels. Given an unseen instance, its associated label set is predicted by going through the q classifiers to get a prediction for each label.
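The first step of LIFT can be sketched as below, assuming a minimal k-means over the positive and negative instances of one label; the helper names (`kmeans`, `lift_mapping`) are illustrative only, not identifiers from the paper or the LIFT implementation:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    # Minimal k-means: returns the k cluster centers of a point set.
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            groups[j].append(p)
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def lift_mapping(x, pos_centers, neg_centers):
    # Label-specific representation of instance x for one label l_k:
    # its distances to the centers clustered from the positive and
    # negative training instances of l_k.
    d = lambda c: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))
    return [d(c) for c in pos_centers] + [d(c) for c in neg_centers]
```

A binary classifier for l_k is then trained on these mapped vectors instead of the original features.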

TESC algorithm
TESC was proposed for single-label classification, where each instance is associated with only a single label. The construction of the classification model is based on a semi-supervised clustering technique. The basic assumption is that the data samples come from multiple components. Therefore, in the training step, TESC uses clustering to identify components from both labelled and unlabelled instances. The labelled instances are clustered to find the silhouettes of the instances; then, the unlabelled instances are added to adjust the clusters. The label of a cluster is assigned to the unlabelled instances newly added to it.
Let D = D_L ∪ D_U be an instance collection, where D_L and D_U are the collections of (single-label) labelled and unlabelled instances, respectively. Let L be the label set on D_L including q labels, i.e. L = {l_1, l_2, ..., l_q}, and let C be the partition derived from D after semi-supervised clustering (i.e. the training phase). After this process, the derived cluster set C is regarded as the model of the classification function. In the classification step, the predicted label of a new instance is that of its nearest cluster, i.e. given an unseen example, the label of its nearest cluster c_j ∈ C is assigned to it.
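The TESC-style consumption of unlabelled instances described above might look like the sketch below; the cluster layout (a dict with "centroid", "members" and "label") is an assumed representation, not taken from the paper:

```python
def add_unlabelled(clusters, unlabelled):
    # TESC-style step: each unlabelled instance joins its nearest cluster
    # (by centroid distance) and shifts that cluster's centroid; it then
    # carries the cluster's label.
    for x in unlabelled:
        c = min(clusters,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, c["centroid"])))
        n = len(c["members"])
        # Incremental centroid update: new mean over n + 1 members.
        c["centroid"] = [(m * n + xi) / (n + 1)
                         for m, xi in zip(c["centroid"], x)]
        c["members"].append(x)
    return clusters
```

Note that the centroid is updated incrementally as each unlabelled instance is absorbed, so later instances see the adjusted silhouettes.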

The proposed framework
Our proposed framework using a semi-supervised learning approach for MLC is described in Figure 1. In this framework, we combine several techniques to enhance the performance of the classifier.
The proposed framework for MLC using the semi-supervised algorithm consists of several steps: building and/or enriching features, selecting features and applying the proposed MULTICS algorithm.
We build the MULTICS algorithm for MLC, which constructs the specific features for each label/label set based on the idea proposed in LIFT (Zhang & Wu, 2015), with several improvements. In LIFT, the features specific to each label were built in the same manner. In our model, the first step is to find the prominent label in a cluster following the greedy approach, i.e. to select the best choice at the moment (local optimization) in the hope that this choice leads to a globally optimal solution. Since labels with few occurrences are not sufficient to form a cluster, we propose to select the label with the maximum number of occurrences (the prominent label) as the clue to build clusters. If several labels have the maximum number of occurrences in D_L, the one with the smallest index is chosen as the prominent label. The next step in LIFT is to extract features specific to each label by running the k-means clustering algorithm on its positive and negative samples. Our model makes some important changes at this stage. We divide the instance collection into three different subsets: (1) instances labelled with only the prominent label l, (2) instances whose label set includes l together with other labels and (3) instances without l. After that, we perform semi-supervised clustering analysis on these three subsets to get a partition derived from the collection of unlabelled and labelled instances. The semi-supervised clustering technique of TESC is applied in our model to consume unlabelled instances, i.e. an unlabelled instance is added to its nearest cluster and its label set is the same as that cluster's label. Finally, the partition derived from both labelled and unlabelled instances is used as the classification model. No additional classification algorithm is used in our approach. This differs from LIFT, which uses q (i.e. the cardinality of the label set) binary classifiers with label-specific features in the classification phase.
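The prominent-label choice and the three-way split described above might be sketched as below; instances are assumed to be (feature_vector, label_set) pairs, and both function names are our own illustrative choices:

```python
from collections import Counter

def prominent_label(labelled, label_order):
    # Greedy choice: the most frequent label among the labelled instances;
    # ties are broken by the smallest index in label_order (the label list L).
    counts = Counter(l for _, labels in labelled for l in labels)
    best = max(counts.values())
    return next(l for l in label_order if counts.get(l, 0) == best)

def three_way_split(labelled, l):
    # The three subsets clustered separately in MULTICS:
    only_l  = [d for d in labelled if d[1] == {l}]                # (1) only l
    l_plus  = [d for d in labelled if l in d[1] and d[1] != {l}]  # (2) l and others
    without = [d for d in labelled if l not in d[1]]              # (3) without l
    return only_l, l_plus, without
```

Each subset is then handed to the semi-supervised clustering step, which also absorbs the unlabelled instances.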
The proposed algorithm includes two phases: the learning phase, which uses clustering to identify the components (i.e. clusters) from both labelled and unlabelled instances based on the prominent label, and the classification phase, which identifies the nearest instance cluster to assign labels to an unseen instance in D_U.
In the learning phase, we use the semi-supervised clustering method in Tian (2014) to take advantage of the TESC algorithm and obtain the partition derived from the instance collection D. The recursive training procedure is called MULTICSLearn(.), whose pseudo-code is shown in Figure 2.
In order to find the partition C (i.e. the model of our classification algorithm), we first initialize C = {} and then call MULTICSLearn(D, {}, L, C). The resulting set of clusters C is regarded as the set of components and is used to predict the labels of unlabelled instances in the classification phase, as shown in Figure 3.
In the classification process, the input includes the collection C of labelled data clusters (resulting from MULTICSLearn(.)) and an unlabelled instance d_u to be labelled. The output is a predicted label set l_u assigned to d_u. We calculate the distances from the unlabelled instance to the centroids of all clusters to find the nearest centroid. Then, the unlabelled instance is assigned the label set of its nearest cluster.

An application framework for semi-supervised MLC
For the application, we applied this framework to multi-label text classification with the following detailed stages: (a) a stage of enriching features by using the hidden topic model (Latent Dirichlet Allocation, LDA) (Blei, 2012; Blei, Ng, & Jordan, 2003) to exploit the semantic meaning of the text representation; (b) a stage of selecting features with Mutual Information (MI) (Doquire & Verleysen, 2011) to keep the most relevant features and remove irrelevant ones and (c) a stage of applying the proposed semi-supervised MLC algorithm.
First of all, we make use of the hidden topic model of LDA to build the features of hidden topic probabilities for each document. This kind of features provides a much richer semantic meaning of text representation.
Next, a feature selection method based on MI is applied to improve the features for the classifier.
In the last step, a multi-label classifier is built based on the MULTICS algorithm. This classifier will be used to classify new documents (Figure 4).

Building features by applying the LDA model
Hidden topic probability models (Blei, 2012; Blei et al., 2003) have also been adapted to multi-label corpora. In L-LDA, the training of the LDA model is adapted by putting 'topics' in one-to-one correspondence with labels and then restricting the sampling of topics for each document to the set of labels assigned to that document. Rubin et al. (2012) proposed some more flexible LDA-based models, including the Prior-LDA model, which takes into account prior label frequencies, and Dependency-LDA, which can additionally account for label dependencies. These LDA-based models were applied to multi-label documents, associating individual word tokens with different labels. In this work, we follow the idea of Pham et al. (2013) and Trejo et al. (2015) to enrich features for Vietnamese short documents by combining various features such as traditional TFIDF, bigrams, unigrams and LDA features.
The hidden topic models of LDA are built by applying the GibbsLDA++ tool (Trejo et al., 2015) on a universal dataset with different numbers of topics. Given the number of classes, we select hidden topic numbers of 10, 15, 25, 50 and 100 for evaluation. These models are applied to the training/testing data to generate the probabilities of assigning hidden topics to each document.
Let p(d, j) denote the probability that a review d belongs to hidden topic j (j = 1, ..., k, where k is the number of hidden topics). The vector (p(d,1), p(d,2), ..., p(d,k)) is called the hidden topic feature vector of the review. These features are combined with the other features of the documents to build the feature set for the classifier.
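Assuming the per-document topic assignment counts come from a Gibbs sampler (as GibbsLDA++ would report them), the hidden topic feature vector and the enriched feature set could be built roughly as follows; the smoothing hyperparameter alpha is an assumed value, not one reported in the paper:

```python
def topic_probabilities(topic_counts, alpha=0.1):
    # Smoothed estimate of p(d, j) from a document's per-topic assignment
    # counts: (n_dj + alpha) / (n_d + k * alpha). alpha is an assumed
    # Dirichlet hyperparameter.
    k = len(topic_counts)
    total = sum(topic_counts)
    return [(n + alpha) / (total + k * alpha) for n in topic_counts]

def enrich(base_features, topic_vector):
    # Concatenate the hidden topic feature vector (p(d,1), ..., p(d,k))
    # with the document's other features (TFIDF, n-grams, ...).
    return list(base_features) + list(topic_vector)
```

The resulting vector is what the classifier consumes in the experiments that combine TFIDF with hidden topic features.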

Selecting features based on the MI
Feature selection is one of the fundamental steps in machine learning and data mining: it reduces dimensionality by choosing a small subset of relevant features from the original ones, which usually leads to better learning performance, lower computational cost and better model interpretability.
In this work, we apply the MI method (Doquire & Verleysen, 2011) to perform feature selection in MLC problems. MI measures the amount of information a variable X carries for predicting a variable Y under any kind of relation, not only linear ones. In addition, the MI concept is directly applicable to groups of variables. The MI is given in Gómez-Verdejo, Verleysen, and Fleury (2007) as

I(X; Y) = ∫∫ p_x,y(x, y) log [ p_x,y(x, y) / (p_x(x) p_y(y)) ] dx dy,

where p_x(x) and p_y(y) are the marginal probability density functions of X and Y, respectively, and p_x,y(x, y) is their joint probability density function. The method of feature selection based on MI (Doquire & Verleysen, 2011) includes the following steps. First, the multi-label problem is transformed using the pruned problem transformation method (Read, 2008). Then, for each single-label classification task, a forward/backward selection algorithm based on MI is employed to choose the 'optimal' feature subset. The search strategy used in this model is backward elimination, which starts with the set of all features and recursively removes them one at a time; the procedure ends when the predefined number of features has been reached. Another search strategy, greedy forward selection, begins with an empty set of features and first selects the feature whose MI with the class vector is highest. Then, the algorithm sequentially selects the not-yet-selected feature whose addition to the current subset leads to the set with the highest MI with the output.
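A minimal sketch of MI estimation and greedy forward selection for discrete features is given below. Note one simplification: each step here scores candidates by their individual MI with the labels, whereas Doquire and Verleysen score the MI of the whole candidate subset; the function names are our own:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # Sample estimate of I(X; Y) = sum p(x,y) log(p(x,y) / (p(x) p(y)))
    # for two discrete variables given as parallel lists.
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def forward_select(feature_columns, labels, k):
    # Greedy forward selection: repeatedly add the remaining feature with
    # the highest (individual) MI with the labels.
    remaining = list(range(len(feature_columns)))
    chosen = []
    for _ in range(k):
        best = max(remaining,
                   key=lambda i: mutual_information(feature_columns[i], labels))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

A constant feature has zero MI with the labels, so it is selected last, while a feature identical to the label vector is selected first.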

The datasets
We applied the proposed framework to a set of customers' reviews about Vietnamese hotels retrieved from several well-known Vietnamese websites on tourism and hotels. We built labelled, unlabelled and testing datasets with different numbers of documents to evaluate the effect of labelled and unlabelled data on the model. After some preprocessing steps on these datasets, i.e. main text extraction, word segmentation and stop word removal, we obtained about 1800 reviews. Of these, 1500 reviews were manually tagged to create the labelled dataset of 1250 reviews and the testing set of 250 reviews. The remaining 300 reviews were left intact to create the unlabelled dataset. We considered reviews on five aspects: (a) location and price, (b) service, (c) facilities, (d) room standard and (e) food.
In order to train the LDA model for generating the hidden topic models, the universal dataset of 24,000 articles, introductions and comments about hotels in Vietnam (from the above sources) was also crawled. The preprocessing step is applied to all datasets for LDA construction and classification.

Experimental results
We conducted several experiments with different configurations to evaluate the effect of the proposed algorithm. In order to analyse the contribution of the labelled datasets, we generated subsets of 500, 750 and 1000 reviews. The contribution of the unlabelled datasets is also evaluated in each category with sizes of 0, 50, 100, 200 and 300 reviews, where the unlabelled dataset size of 0 is used as the baseline.
We performed four groups of experiments with different settings to evaluate the effectiveness of the proposed framework: (1) binary features (BN); (2) binary features with MI feature selection (BN + MI); (3) TFIDF features combined with hidden topic probability features and (4) TFIDF and hidden topic probability features with MI feature selection. For the hidden topic probability features, we build the hidden topic model of LDA (Phan & Nguyen, 2007) with different topic numbers, i.e. 10, 15, 25, 50 and 100.
The step of calculating the distance between instances can employ any of a range of distance measures; in the experiments below, we choose the popular Euclidean distance. Other measures will be tried in future work.
We used the label-based measures (Tsoumakas et al., 2010) to evaluate the performance of the proposed model. For each class label y_j, the numbers of true positive, false positive, true negative and false negative test samples, TP_j, FP_j, TN_j and FN_j, were recorded. Let B(TP_j, FP_j, TN_j, FN_j) be a specific binary classification measure (e.g. B(.) ∈ {P, R, F1}, where P = TP_j/(TP_j + FP_j), R = TP_j/(TP_j + FN_j) and F1 = 2PR/(P + R)).
The micro-averaging measures are calculated as follows:

B_micro = B( Σ_{j=1..q} TP_j, Σ_{j=1..q} FP_j, Σ_{j=1..q} TN_j, Σ_{j=1..q} FN_j ),

where q is the total number of labels. For these metrics, the bigger the value, the better the classification performance. In Tables 1 and 2, the worst cases are formatted in italics and the best cases in bold. The results of Experiments 1 and 2 are described in Table 1, in which we compare the performance of using binary features (BN) and binary features with MI feature selection. The results show the contribution of the unlabelled datasets, i.e. the performance of all the cases using unlabelled datasets is better than that of the baseline method. As for the contribution of the labelled datasets, the model gets better results as the amount of labelled data increases. In terms of features, the performance of binary features without feature selection is consistently better than that of binary features combined with MI feature selection. In our opinion, the way of building specific features for the prominent label/label set is already a kind of feature selection. Thus, Experiment 2 (BN + MI) has two feature selection steps, which does not help to increase the performance.
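The micro-averaged measures can be computed directly from the summed per-label counts; a small sketch follows, with an assumed (TP_j, FP_j, FN_j) input layout (TN_j is not needed for P, R or F1):

```python
def micro_measures(per_label_counts):
    # per_label_counts: list of (TP_j, FP_j, FN_j) tuples over the q labels.
    # Micro-averaging applies P, R and F1 to the summed counts.
    tp = sum(c[0] for c in per_label_counts)
    fp = sum(c[1] for c in per_label_counts)
    fn = sum(c[2] for c in per_label_counts)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1
```

For instance, two labels with counts (8, 2, 2) and (6, 4, 2) pool to TP = 14, FP = 6, FN = 4, giving a micro-averaged precision of 0.7.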
In the experiments with hidden topic probabilities, we observed that combining binary features with hidden topic features does not increase the performance of the classifier. The reason may be the mix of discrete features (binary representation) and real-valued features (hidden topic probabilities). Therefore, in Experiments 3 and 4, we combine TFIDF features with the hidden topic probability features to build continuous features. Because the number of combinations of unlabelled dataset sizes (5), numbers of hidden topics (5) and labelled dataset sizes (3) is large, we only selected the best result per labelled dataset for comparison. Table 2 shows the best results of Experiments 3 and 4, in which the labelled dataset size for both experiments is 1000 documents. Comparing the results in Tables 1 and 2, we see that the performance with features enriched by hidden topic probabilities in Experiments 3 and 4 is better than that of Experiments 1 and 2. The best F1 scores in Experiments 3 and 4 are 85.3% and 84%, respectively, while the best F1 in Experiment 1 is 83.2%. Additionally, Table 2 shows that Experiment 3 outperforms Experiment 4. In other words, Experiment 4, which applies the feature selection technique to the feature set of TFIDF and hidden topic probabilities, is not as good as Experiment 3, which uses the same feature set without feature selection. These results confirm that MULTICS builds features that are good enough for the classifier; we may apply MULTICS directly to the original feature set without a feature selection step.

Conclusions and future work
In this paper, we build a framework for MLC including a feature enrichment process, a feature selection process and the classification process of MULTICS, an approach for semi-supervised MLC that exploits label-specific features. Based on two assumptions, namely the effect of label-specific features in the learning phase and the presence of multiple components in each label that can be identified by clustering, our proposed model makes a major contribution in building label-specific features for multi-label learning through a semi-supervised clustering technique. The experiments show that MULTICS gives promising results for MLC, and that the combination with hidden topic probability features also contributes to better performance.
As future work, we plan several improvements, e.g. a method to effectively select unlabelled instances and a post-processing step that prunes the resulting clusters to remove outliers, together with further evaluation of the proposed approach.

Note
1. MULTICS: a novel semi-supervised MULTI-label ClaSsification algorithm which can exploit both unlabelled data and specific features to enhance the performance.

Disclosure statement
No potential conflict of interest was reported by the authors.