A semi-supervised multi-label classification framework with feature reduction and enrichment*

ABSTRACT Multi-label classification (MLC) has drawn much attention thanks to its usefulness and omnipresence in real-world applications, in which objects may be characterized by more than one label rather than exactly one as in the traditional approach. Obtaining multi-label examples is costly and time-consuming; therefore, a semi-supervised learning approach should be considered to take advantage of both labelled and unlabelled data. In this work, we propose a semi-supervised MLC algorithm that exploits the specific features of the prominent class label(s) chosen by a greedy approach, as an extension of the LIFT algorithm, together with the unlabelled-data consumption mechanism of the TExt classification using Semi-supervised Clustering (TESC) algorithm. We also build a semi-supervised MLC application framework for Vietnamese texts with several feature-processing steps, including (a) a stage of enriching features by adding hidden topic features and (b) a stage of dimensionality reduction for removing irrelevant features. Experimental results on a data set of hotel reviews (for tourism) indicate that a reasonable amount of unlabelled data helps to increase the F1 score. Interestingly, with a small amount of labelled data, our algorithm can reach a performance comparable to the case of using a larger amount of labelled data.

Problem transformation is one of the two common approaches in MLC, which transforms the multi-label problem into one or more single-label problems that are solved with a single-label classification algorithm. It is worth noting that this method assumes that the labels are independent, which ignores the correlations among labels; thus, in certain situations, it may degrade the performance of the system. In this approach, all labels are discriminated based on an identical feature representation of the instance. This seems suboptimal, as each label is supposed to possess specific characteristics of its own. Therefore, much work has been proposed to exploit specific features by dividing the training data set into two groups based on positive and negative instances and building features in different ways (Huaqiao, Zhang, Liu, & Zhao, 2011; Zhang, Fang, & Li, 2015; Zhang & Wu, 2015). For example, Zhang and Wu (2015) proposed an intuitively effective multi-label learning with Label-specIfic FeaTures algorithm (named LIFT), which constructs the specific features of each label by conducting clustering analysis on its positive and negative instances, and then carries out training and testing by querying the clustering results. Similarly, Zhang, Tang, and Yoshida (2015) proposed to employ spectral clustering to find the closely located local structures between positive and negative instances, which are assumed to carry more discriminative information, and then to transform the original data set by consulting the clustering results in classification. Huaqiao et al. (2011) compute feature density and then select features of high density on the positive and negative instance sets for each label, respectively; the label-specific features are constructed by taking the intersection on the corresponding label. Finally, each label is classified based on its specific features.
These approaches use clustering, a fundamentally unsupervised learning technique, to group a set of objects in such a way that objects in the same group are more similar (i.e. with regard to a similarity/distance measure) to each other than to those in other groups (Basu, 2005). Clustering can be used to aid classification by discovering the kinds of structure present in the training examples. Since manual labelling is time-consuming and expensive, the combination of both labelled and unlabelled data in a semi-supervised classification framework provides a more effective and cheaper way to enhance the performance.
Recently, semi-supervised clustering techniques have been applied to single-label classification (Dara, Kermer, & Stacey, 2002; Demirez, Bennett, & Embrechts, 1999; Luo, Liu, Yang, Wang, & Zhou, 2015; Tian, 2014; Zhang, Tang, et al., 2015). The self-organizing mapping technique (Dara et al., 2002) is used to cluster unlabelled instances and to infer possible labellings from the clusters, which are presented to a multi-layer perceptron along with labelled instances. This method significantly improves the classification performance on all the experimental data sets. Demirez et al. (1999) incorporate label information into an unsupervised learning approach to group both labelled and unlabelled instances into clusters, where each cluster is as pure as possible in terms of the class distribution provided by the labelled instances. The clusters help characterize segments of the population likely or unlikely to possess the target characteristic represented by the label. This method can be used for both inductive and transductive inference. Zhang, Tang, et al. (2015) introduced a novel semi-supervised learning method, called TExt classification using Semi-supervised Clustering (TESC). In the clustering process, TESC uses labelled texts to capture the silhouettes of text clusters; next, the unlabelled texts are added to the corresponding clusters to adjust their centroids. These clusters are used for building the model in the classification phase: given a new unlabelled text, the label assigned to it is that of its nearest text cluster. For MLC, Xiangnan, Ng, and Zhou (2013) proposed a transductive multi-label learning model using label set propagation, TRansductive Multi-label classification (TRAM), to assign a set of multiple labels to each instance. Firstly, TRAM formulates transductive multi-label learning as an optimization problem of estimating label concept compositions.
Then it derives a closed-form solution to this optimization problem and develops an algorithm to assign label sets to unlabelled instances.
In this paper, we propose a framework for MLC with a key contribution of a semi-supervised classification algorithm (called MULTICS 1 ) which exploits specific features of a label/label set based on a semi-supervised clustering technique to extract useful information from labelled and unlabelled instances together. In addition, we extend the framework for MLC with several steps of feature reduction and feature enrichment to evaluate the proposed algorithm and boost the performance of the overall system. The rest of this paper is organized as follows. In the next section, we describe the proposed semi-supervised learning framework for MLC, focusing on the details of the MULTICS algorithm. An application framework using the proposed algorithm in MLC for Vietnamese texts is shown in Section 3, where we also discuss the experimental results and evaluate the performance of the system. The last section presents conclusions and a summary.

Let D = D_L ∪ D_U be an instance collection, where D_L and D_U are the collections of labelled and unlabelled instances, respectively. The task of semi-supervised MLC is to construct the classification function f : D_L ∪ D_U → 2^L. In our proposed semi-supervised MLC algorithm, MULTICS, the goal in building the classifier is to find a partition C = {C_1, C_2, ..., C_m} derived from D such that D = C_1 ∪ C_2 ∪ ... ∪ C_m. All instances in cluster C_i are given the same non-empty label set (called the cluster label) l_{C_i}. In traditional unsupervised clustering, the number of clusters is often predefined and chosen manually. In our algorithm, however, the number of clusters m is identified automatically based on the label set in combination with the labelled and unlabelled data.
After we have obtained the partition C, given a new unlabelled instance d_u ∈ D_U, f employs the 1-nearest-neighbour rule to get the nearest cluster C_j = argmin_{C_p ∈ C} dis(d_u, c_p), where c_p is the centroid of cluster C_p and dis(.) is the distance between two instances; the cluster label of C_j is then assigned to d_u, i.e. l(d_u) = l_{C_j}. Our contribution is to consume both labelled and unlabelled instances to find the partition C that forms the classification model f, which can predict the associated label(s) of the unlabelled instances in D_U.
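As an illustration, the 1-nearest-cluster rule above can be sketched as follows. This is a minimal sketch in Python; the function and variable names are ours, not from the paper, and Euclidean distance is assumed (as in the experiments later in the paper).

```python
import numpy as np

def predict_label(d_u, centroids, cluster_labels):
    """Assign the unlabelled instance d_u the label set of its nearest
    cluster, measured by Euclidean distance to the cluster centroids."""
    dists = [np.linalg.norm(np.asarray(d_u, float) - c) for c in centroids]
    return cluster_labels[int(np.argmin(dists))]
```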

Brief summary of LIFT and TESC
2.2.1. LIFT algorithm
LIFT was proposed for enhancing the performance of supervised MLC using label-specific features. Concretely, in its first step, LIFT aims at figuring out features with label-specific characteristics, so as to provide appropriately discriminative information to facilitate learning as well as classification. For each class label l_k ∈ L, the sets of positive and negative training instances are formed as the training instances with and without label l_k, respectively. After that, clustering analysis is performed on the positive and negative sets to extract the features specific to l_k. In the second step, q binary classifiers, one for each class label l_k using l_k-specific features, are used to check whether a new instance has the label l_k. LIFT is a supervised method, in which the input is a labelled data set for the training process and the output is a classification model comprising a family of q classifiers corresponding to the q labels. Given an unseen instance, its associated label set is predicted by going through the q classifiers to get a prediction for each label.
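The LIFT-style construction of label-specific features can be sketched as follows. This is a simplified illustration with a toy k-means and hypothetical helper names, not the authors' implementation: an instance is re-represented by its distances to the cluster centers obtained from one label's positive and negative instances.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """A tiny k-means for illustration only (random init, fixed iterations)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each point to its nearest center, then recompute centers
        assign = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers

def lift_features(x, pos_centers, neg_centers):
    """LIFT-style mapping: represent x by its distances to the cluster
    centers built from the positive and negative instances of one label."""
    centers = np.vstack([pos_centers, neg_centers])
    return np.linalg.norm(centers - np.asarray(x, dtype=float), axis=1)
```

A binary classifier for the label would then be trained on these mapped vectors rather than on the original features.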

2.2.2. TESC algorithm
TESC was proposed for single-label classification, where each instance is associated with only a single label. The construction of the classification model is based on a semi-supervised clustering technique, under the basic assumption that the data samples come from multiple components. In the training step, TESC therefore uses clustering to identify components from both labelled and unlabelled instances: the labelled instances are clustered to find the silhouettes of the instances, and then the unlabelled instances are added to adjust the clusters. The label of a cluster is assigned to its newly added unlabelled instances.
Let D = D_L ∪ D_U be an instance collection, where D_L and D_U are the collections of (single-label) labelled and unlabelled instances, respectively. Let L be the label set on D_L including q labels, i.e. L = {l_1, l_2, ..., l_q}, and let C be the partition derived from D after semi-supervised clustering (i.e. the training phase). After this process, the derived cluster set C is regarded as the model of the classification function. In the classification step, the predicted label of a new instance is the label of its nearest cluster; that is, given an unseen example, the label of its nearest cluster c_j ∈ C is assigned to it.
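The TESC training idea can be sketched as follows, assuming for simplicity one cluster per label (the actual algorithm discovers several clusters per label); the function names are hypothetical.

```python
import numpy as np

def tesc_sketch(labeled, labels, unlabeled):
    """Simplified TESC-like training: one cluster per label from the
    labelled instances; each unlabelled instance then joins (and shifts
    the centroid of) its nearest cluster. Returns a label -> centroid map."""
    classes = sorted(set(labels))
    centroids = {c: np.mean([x for x, y in zip(labeled, labels) if y == c], axis=0)
                 for c in classes}
    counts = {c: labels.count(c) for c in classes}
    for x in unlabeled:
        x = np.asarray(x, float)
        # the unlabelled instance is absorbed by its nearest cluster
        c = min(classes, key=lambda k: np.linalg.norm(x - centroids[k]))
        counts[c] += 1
        centroids[c] = centroids[c] + (x - centroids[c]) / counts[c]  # incremental mean
    return centroids
```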

The proposed framework
Our proposed framework using a semi-supervised learning approach for MLC is described in Figure 1. In this framework, we combine several techniques to enhance the performance of the classifier. The framework consists of several steps of building and/or enriching features, selecting features, and applying the proposed MULTICS algorithm.
We build the MULTICS algorithm for MLC, which constructs the specific features for each label/label set based on the idea proposed in LIFT (Zhang & Wu, 2015) with several improvements. In LIFT, the features specific to each label were built in the same manner. In our model, the first step is to find the prominent label in a cluster following the greedy approach, that is, to select the best choice at the moment (a local optimization) in the hope that this choice leads to a globally optimal solution. Since labels with few occurrences are not sufficient to form a cluster, we propose to select the label with the maximum number of occurrences (the prominent label) as the clue to build clusters. If several labels have the maximum number of occurrences in D_L, the one with the smallest index is chosen as the prominent label. The next step in LIFT is to extract features specific to each label by applying the k-means clustering algorithm to its positive and negative samples. Our model makes some important changes at this stage. We divide the instance collection into three different subsets: (1) instances with only the prominent label l, (2) instances with a label set including l, and (3) instances without l. After that, we perform semi-supervised clustering analysis on these three subsets to get a partition derived from the collection of unlabelled and labelled instances. The semi-supervised clustering technique in TESC (Zhang, Tang, et al., 2015) is applied in our model to consume unlabelled instances, that is, an unlabelled instance is added to its nearest cluster, and its label set is the same as the cluster label. Finally, the partition derived from both labelled and unlabelled instances is used as the classification model. No additional classification algorithm is used in our approach. This differs from LIFT, which uses q (i.e. the cardinality of the label set) binary classifiers with label-specific features in the classification phase.
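The prominent-label selection and the three-way split described above can be sketched as follows; the helper names are hypothetical and this is a simplification of the full MULTICS procedure.

```python
from collections import Counter

def prominent_label(label_sets, label_order):
    """The label with the most occurrences in the labelled data; ties are
    broken in favour of the smallest index in label_order."""
    counts = Counter(l for s in label_sets for l in s)
    return max(label_order, key=lambda l: (counts[l], -label_order.index(l)))

def split_by_prominent(instances, label_sets, l):
    """Three subsets used for clustering: instances labelled with only l,
    instances whose label set includes l among others, and the rest."""
    only_l, with_l, without_l = [], [], []
    for x, s in zip(instances, label_sets):
        (only_l if s == {l} else with_l if l in s else without_l).append(x)
    return only_l, with_l, without_l
```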
The proposed algorithm includes two phases: the learning phase, which uses clustering to identify the components (i.e. clusters) from both labelled and unlabelled instances based on the prominent label; and the classification phase, which identifies the nearest instance cluster to assign labels to an unseen instance in D_U.
In the learning phase, we use the semi-supervised clustering method in Tian (2014) to take advantage of the TESC algorithm and obtain the partition derived from the instance collection D. The recursive training procedure is called MULTICSLearn(.), whose pseudo-code is shown in Figure 2.
In order to find the partition C (i.e. the model of our classification algorithm), we first initialize C = {}, then call MULTICSLearn(D, {}, L, C). The resulting set of clusters C is regarded as the components and is used to predict the labels of unlabelled instances in the classification phase, as shown in Figure 3.
In the classification process, the input includes the collection C of labelled data clusters (resulting from MULTICSLearn(.)) and an unlabelled instance d_u to be labelled. The output is a predicted label set l_u assigned to d_u. We calculate the distances from the unlabelled instance to the centroids of all clusters to find the nearest centroid; the unlabelled instance is then assigned the label set of its nearest cluster.

An application framework for semi-supervised MLC
In application, we applied this framework to multi-label text classification with the following detailed stages: (a) a stage of enriching features by using the hidden topic model (latent Dirichlet allocation, LDA) (Blei, 2012; Blei, Ng, & Jordan, 2003) to exploit the semantic meaning of the text representation; (b) a stage of selecting features with mutual information (MI) (Doquire & Verleysen, 2011) to keep the most relevant features and remove irrelevant ones; and (c) a stage of applying the proposed semi-supervised MLC algorithm. The application framework for multi-label text classification is described in Figure 4.
First of all, we make use of the hidden topic model of LDA to build the features of hidden topic probabilities for each document. This kind of features provides a much richer semantic meaning of text representation.
Next, a feature selection method based on MI is applied to improve the features for the classifier.
In the last step, a multi-label classifier is built based on the MULTICS algorithm. This classifier will be used to classify new documents.

Building features by applying the LDA model
Hidden topic probability models (Blei, 2012; Blei et al., 2003) are useful for enriching the semantic meaning of text representations (Blei, 2012; Pham, Phan, Nguyen, & Ha, 2013; Ramage, Hall, Nallapati, & Manning, 2009; Rubin, Chambers, Smyth, & Steyvers, 2012; Trejo, Sidorov, Miranda-Jiménez, Ibarra, & Martínez, 2015). Ramage et al. (2009) proposed the labelled LDA (L-LDA) model specifically for multi-label settings. In L-LDA, the training of the LDA model is adapted to account for multi-label corpora by putting 'topics' in 1-1 correspondence with labels and then restricting the sampling of topics for each document to the set of labels assigned to that document. Rubin et al. (2012) proposed more flexible LDA-based models, including the prior-LDA model, which takes into account prior label frequencies, and dependency-LDA, which can additionally account for label dependencies. These LDA-based models were applied to multi-label documents that associate individual word tokens with different labels. In this work, we follow the idea of Pham et al. (2013) and Trejo et al. (2015) to enrich features for Vietnamese short documents by combining various features such as traditional Term Frequency Inverse Document Frequency (TFIDF), bigrams, unigrams and LDA features. The hidden topic model of LDA was built by applying the GibbsLDA++ tool (Trejo et al., 2015) to a universal data set with different numbers of topics. Considering the number of classes, we select 10, 15, 25, 50 and 100 hidden topics for evaluation. These models are applied to the training/testing data to generate the probabilities of assigning hidden topics to each document.
Let p(d, j) denote the probability that a review d belongs to the hidden topic j (j = 1, ..., k, where k is the number of hidden topics). The vector (p(d,1), p(d,2), ..., p(d,k)) is called the hidden topic feature vector of the review. These features will be combined with other features of the documents to build the feature set for the classifier.
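The combination of the hidden topic feature vector with the other document features can be sketched as follows (a trivial but illustrative helper; the name is ours, and the topic probabilities are assumed to come from an already trained LDA model).

```python
import numpy as np

def add_topic_features(base_features, topic_probs):
    """Append the hidden topic vector (p(d,1), ..., p(d,k)) of a document
    to its existing feature vector (e.g. TFIDF values)."""
    p = np.asarray(topic_probs, dtype=float)
    # the k topic probabilities of one document form a distribution
    assert np.isclose(p.sum(), 1.0)
    return np.concatenate([np.asarray(base_features, dtype=float), p])
```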

Selecting features based on the MI
Feature selection is one of the fundamental steps in machine learning and data mining: by reducing dimensionality and choosing a small subset of relevant features from the original ones, it usually leads to better learning performance, lower computational cost and better model interpretability.
In this work, we apply the MI method (Doquire & Verleysen, 2011) to perform feature selection in MLC problems. MI measures the amount of information contained in a variable X for predicting a variable Y under any kind of relation, not only linear ones. In addition, the MI concept is directly applicable to groups of variables. The MI is given in Gómez-Verdejo, Verleysen, and Fleury (2007) as follows:

I(X; Y) = ∬ p_{x,y}(x, y) log ( p_{x,y}(x, y) / (p_x(x) p_y(y)) ) dx dy,

where p_x(x) and p_y(y) are the marginal probability density functions of X and Y, respectively, and p_{x,y}(x, y) is their joint probability density function. The method of feature selection based on MI (Doquire & Verleysen, 2011) includes the following steps. First, the multi-label problem is transformed using the pruned problem transformation method (Read, 2008). Then, for each single-label classification task, a forward/backward selection algorithm based on MI is employed to choose the 'optimal' feature subset. Backward elimination starts with the set of all features and recursively removes them one at a time; the procedure ends when the predefined number of features has been reached. The greedy forward selection strategy begins with an empty set of features and first selects the feature whose MI with the class vector is the highest; the algorithm then sequentially adds the not-yet-selected feature whose addition to the current subset leads to the set with the highest MI with the output.
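For discrete variables, the integral above reduces to a sum over observed value pairs. A sketch of the empirical MI estimate and a simplified greedy forward step might look as follows; note that the cited method scores the MI of whole candidate subsets, whereas this simplification ranks features by their individual MI, and the helper names are hypothetical.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical MI (in nats) of two discrete variables:
    I(X;Y) = sum_{x,y} p(x,y) * log(p(x,y) / (p(x) p(y)))."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def forward_select(features, y, k):
    """Simplified greedy forward selection: keep the k features with the
    highest individual MI with the label vector y."""
    return sorted(features, key=lambda f: -mutual_information(features[f], y))[:k]
```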

The data sets
We applied the proposed framework to a set of customers' reviews about Vietnamese hotels retrieved from several well-known Vietnamese websites on tourism and hotels. We built labelled, unlabelled and testing data sets with different numbers of documents to evaluate the effect of labelled and unlabelled data on the model. After some pre-processing steps on these data sets, that is, main text content extraction, word segmentation, and stop word removal, we obtained about 1800 reviews. One thousand five hundred reviews were manually tagged to create the labelled data set of 1250 reviews and the testing set of 250 reviews. The remaining 300 reviews were left intact to create the unlabelled data set. We considered reviews on five aspects: (a) location and price, (b) service, (c) facilities, (d) room standard, and (e) food.
In order to train the LDA model for generating the hidden topic models, a universal data set of 24,000 articles, introductions and comments about hotels in Vietnam (from the above sources) was also crawled. The pre-processing step was applied to all data sets for both LDA construction and classification.

Experimental results
We conducted several experiments with different configurations to evaluate the effect of the proposed algorithm. To analyse the contribution of the labelled data sets, we also generated subsets with sizes of 500, 750 and 1000 reviews. The contribution of the unlabelled data sets is also evaluated in each category with different sizes of 0, 50, 100, 200 and 300 reviews, where the unlabelled data set size of 0 is used as the baseline.
We performed four groups of experiments with different settings to evaluate the effectiveness of the proposed framework: Experiment 1 uses binary features (BN); Experiment 2 uses binary features with MI feature selection (BN + MI); Experiment 3 uses TFIDF features combined with hidden topic probability features; and Experiment 4 adds MI feature selection to the feature set of Experiment 3. For the hidden topic probability features, we build the hidden topic model of LDA (Phan & Nguyen, 2007) with different topic numbers, that is, 10, 15, 25, 50 and 100. To calculate the distance between instances, different distance measures could be applied; in the experiments below, we chose the Euclidean distance, a popular measure, and leave other measures for future work.
We used the label-based measures (Tsoumakas et al., 2010) for evaluating the performance of the proposed model. For each class label y_j, the numbers TP_j, FP_j, TN_j and FN_j of true-positive, false-positive, true-negative and false-negative test samples were recorded. Let B(TP_j, FP_j, TN_j, FN_j) be a specific binary classification measure (e.g. B ∈ {P, R, F1}, where P = TP_j/(TP_j + FP_j), R = TP_j/(TP_j + FN_j) and F1 = 2PR/(P + R)). The micro-averaged measures pool the counts over all labels:

B_micro = B( Σ_{j=1}^q TP_j, Σ_{j=1}^q FP_j, Σ_{j=1}^q TN_j, Σ_{j=1}^q FN_j ),

where q is the total number of labels. For these metrics, the bigger the value, the better the classification performance. In Tables 1 and 2, the worst cases are formatted in italic, and the best cases in bold.
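Micro-averaged precision, recall and F1 can be computed as in the following sketch (the helper name is hypothetical): pool the per-label counts first, then apply the binary formulas once.

```python
def micro_prf(stats):
    """Micro-averaged P, R, F1 from per-label (TP_j, FP_j, FN_j) counts:
    P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R) on the pooled counts."""
    tp = sum(t for t, _, _ in stats)
    fp = sum(f for _, f, _ in stats)
    fn = sum(f for _, _, f in stats)
    p, r = tp / (tp + fp), tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)
```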
The results of Experiments 1 and 2 are described in Table 1, in which we compare the performance of using binary features (BN) and binary features with MI feature selection. The results show the contribution of the unlabelled data sets: the performance in all cases using unlabelled data is better than that of the baseline. The labelled data sets also contribute: when the amount of labelled data increases, the model gets better results. In terms of features, the performance of binary features without feature selection is consistently better than that of binary features with MI feature selection. In our opinion, the way of building specific features for the prominent label/label set is already a kind of feature selection; thus, Experiment 2 (BN + MI) has two feature selection steps, which do not help to increase the performance.
In the experiments with hidden topic probabilities, we observe that the combination of binary features with hidden topic features does not increase the performance of the classifier. The reason may be the mixture of different kinds of features: discrete (binary representation) and real-valued (hidden topic probabilities). Therefore, in Experiments 3 and 4, we combine TFIDF features with hidden topic probability features to build continuous features. Because the number of combinations of unlabelled data set sizes (5), numbers of hidden topics (5) and labelled data set sizes (3) is large, we only selected the best result per labelled data set for comparison. Table 2 shows the best results of Experiments 3 and 4, in which the labelled data set size of both experiments is 1000 documents. Comparing the results in Tables 1 and 2, we see that the performance of the features enriched with hidden topic probabilities in Experiments 3 and 4 is better than that of Experiments 1 and 2. The best F1 in Experiments 3 and 4 is 85.3% and 84%, respectively, while the best F1 in Experiment 1 is 83.2%.
Additionally, Table 2 shows that Experiment 3 outperforms Experiment 4. In other words, Experiment 4, which applies the feature selection technique to the feature set of TFIDF and hidden topic probabilities, is not as good as Experiment 3, which uses the same feature set without feature selection. These results confirm that MULTICS builds features that are good enough for the classifier; we may apply MULTICS directly on the original feature set without a feature selection step.

Conclusions and future work
In this paper, we build a framework for MLC including the process of enriching features, the process of feature selection, and the process of classification with MULTICS, an approach for semi-supervised MLC that exploits label-specific features. Using two basic assumptions, namely the effect of label-specific features in the learning phase and the multiple components in each label which can be identified by clustering, our proposed model makes a major contribution in building label-specific features for multi-label learning with a semi-supervised clustering technique. The experiments show that MULTICS gives promising results for MLC, and that the combination with the hidden topic probability features also contributes to better performance. As future work, we are making further improvements, for example, methods to effectively select unlabelled instances, or post-processing to prune the resulting clusters and remove outliers, to evaluate the proposed approach.

Note
1. A novel semi-supervised MULTI-label ClaSsification algorithm which can exploit both unlabelled data and specific features to enhance the performance.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work was supported in part by Vietnam National University Grant QG-15-22.

Notes on contributors
Thi-Ngan Pham received her bachelor's degree in Faculty of Mathematics and Information, the Vietnam People's Security Academy in 2006, and a master's degree in Faculty of Information Technology (FIT), Vietnam National University (VNU), Hanoi, in 2011. She is now a PhD student in FIT, VNU, Hanoi. She has been working as a lecturer in Faculty of Mathematics and Information, the Vietnamese People's Police Academy since 2006.
Van-Quang Nguyen is a fourth-year student in an honors program in FIT, VNU, Hanoi.
Van-Hien Tran received his bachelor's degree in FIT, VNU, Hanoi, in 2014. He is also a master's student in FIT. He is now an assistant lecturer in FIT, VNU, Hanoi.