Information granulation construction and representation strategies for classification in imbalanced data based on granular computing

ABSTRACT Imbalanced datasets arise in many real-world applications, such as medical diagnosis, business risk management, and abnormal product testing and evaluation; in these cases, the minority classes are usually the important ones. Granular computing has been developed and effectively applied to many problems, including imbalanced data classification. In this paper, we propose a new strategy that builds information granules (IGs) for each class separately and represents them as sub-attributes based on categorical values (including discretized values of the numerical attributes) to resolve the overlapping among IGs. This strategy reduces computational time, improves classification performance and yields well-balanced accuracy across classes. Experimental results on several datasets demonstrate the effectiveness of our proposal.


Introduction
In imbalanced datasets, one class is represented by a large number of samples while the others are represented by only a few (Su, Chen, & Yih, 2006). Traditional classification methods are not well suited to imbalanced datasets (Batista, Prati, & Monard, 2004) because they tend to assign samples to the majority classes (Chen, Chen, Hsu, & Zeng, 2008). Consequently, these algorithms often achieve high accuracy on the majority classes but low accuracy on the minority classes (Chen et al., 2008).
Many algorithms have been proposed to solve the problem of classification in imbalanced data. They fall into two major groups: (1) data-level (external) techniques add a pre-processing step in which the data distribution is rebalanced in order to decrease the effect of the skewed class distribution on the learning process (Galar et al., 2012); (2) algorithm-level (internal) approaches create new algorithms or modify existing ones to take into account the significance of minority examples (Galar et al., 2012). A recently emerging approach is the use of information granulation, which tries to establish higher-level concepts via the construction of information granules (López, Fernández, Moreno-Torres, & Herrera, 2012).
The approach based on granular computing (GrC), which mimics the information-processing capability of humans, increases efficiency and improves classification under imbalanced class distributions, and it is well suited to information that is vague, unclear or incomplete (Chen et al., 2008). Advantages and disadvantages of GrC methods compared with other approaches were outlined by Chen et al. (2008). Su et al. applied GrC to the cellular phone test process (Su, Chen, & Chiang, 2006) and proposed the KAIG model (Su, Chen, & Yih, 2006), which effectively solved classification problems with imbalanced data. Fuzzy ART (adaptive resonance theory) was applied to build the IGs, with two indexes, Purity and Centrality, used to measure the granule information (Su, Chen, & Chiang, 2006), or with the H-index and U-ratio employed to determine the appropriate level of granularity (Su, Chen, & Yih, 2006). They used the concept of 'sub-attributes' to represent the IGs and thereby resolve the overlapping between them. Then, three methods, namely decision trees, rough sets and neural networks (BP), were applied for feature selection to extract knowledge from the IGs for the classification target (Su, Chen, & Yih, 2006).
Chen et al. (2008) proposed a model for imbalanced data classification comprising three steps: (1) IG construction, (2) IG representation and (3) knowledge acquisition. First, the K-means algorithm is used to build the IGs, with the H-index and U-ratio measuring the appropriate level of granularity. Second, the concept of sub-attributes is used to handle the overlapping between IGs in the IG representation step; all continuous attributes are discretized before the sub-attributes are created. Third, latent semantic indexing (LSI) is used to reduce the number of sub-attributes before a neural network with back-propagation (BP) performs the classification task. This approach has the advantages of reducing both the size of the data and the dimensionality of the features.
In addition, Chen et al. (2008) proposed a strategy (as shown in Figure 1) to build the IGs for datasets with a highly skewed class distribution, where the number of minority class samples is less than 10% of the total number of samples in the dataset. In this strategy, IGs are built from the majority class (C_ma) samples, while the minority class (C_mi) samples are retained as they are, each sample being considered as an IG (Chen et al., 2008). In Figure 1, the small squares represent the majority class samples, the ellipses represent the minority class samples, and the big rectangles represent the IGs created from the majority class samples.
In the above GrC approaches, the authors used the H-index and U-ratio (Su, Chen, & Yih, 2006) or the Purity and Centrality indexes (Su, Chen, & Chiang, 2006), and they had to set optimal thresholds that depend on the dataset (on the distribution of its values). Moreover, the IG construction strategy from the majority class samples (Chen et al., 2008) cannot be applied to datasets with a large number of samples, even when the number of minority class samples is below 10% of the total. In fact, if the number of IGs grows too large, classification performance is affected. For example, a dataset of 2000 records with class 0 holding 200 samples (10%) and class 1 holding 1800 samples (90%) would create more than 200 IGs, which degrades classifier performance.
In this study, we improve Chen's IG construction strategy (Chen et al., 2008) by constructing IGs for each class separately, without homogeneity and indistinguishability checking (i.e. without calculating the H-index and U-ratio), to reduce computational time. Besides, we apply the equal-interval binning technique (Witten & Frank, 2000) to discretize continuous values. In addition, to represent IGs in the form of sub-attributes, we consider only the appearance of the categorical values (including the discretized values of continuous attributes) in each IG, which significantly reduces computational time.
In the remainder of this paper, we introduce some basic concepts and review GrC models, especially the model of Chen et al. (2008). We then propose our new strategy, present our experimental results, and close with conclusions and future work.

Standardization of numeric data
All numeric attributes are brought onto the same scale either by normalizing to a fixed range from zero to one, or by standardizing to a statistical variable with mean zero and standard deviation one (Witten & Frank, 2000). We use the first technique to standardize the data in this work.
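The min-max normalization described above can be sketched as follows (a minimal illustration, not the library routine used in our experiments; the sample values are hypothetical):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each numeric column of X to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (X - lo) / span

# two hypothetical numeric attributes
X = np.array([[5.7, 2.8],
              [4.3, 2.0],
              [7.9, 4.4]])
Xn = min_max_normalize(X)  # every column now spans [0, 1]
```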

Data discretization
Attributes in real datasets can be classified into two types: continuous data (numeric data) and discrete data (categorical or nominal data); many datasets are mixed, combining both types. To handle datasets containing mixed attributes, numerous discretization techniques have been discussed by Witten and Frank (2000), including equal-interval binning, equal-frequency binning, entropy-based methods, etc. (Chen et al., 2008).
The discretization process leads to a loss of information (Ahmad & Dey, 2007; Witten & Frank, 2000); however, it is consistent with the GrC model. In this study, we use the equal-interval binning technique (Witten & Frank, 2000) to discretize numeric attributes. This technique is applied not only for IG construction using mixed K-means (Ahmad & Dey, 2007), but also to represent the sub-attributes that deal with overlapping between the IGs.
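Equal-interval binning on normalized values can be sketched as below (a simple illustration assuming the values have already been scaled to [0, 1]; the letter labels mirror the a, b, c, ... convention used later in the worked example):

```python
import numpy as np

def equal_interval_bins(values, n_bins):
    """Discretize values in [0, 1] into n_bins equal-width intervals, indexed 0..n_bins-1."""
    idx = np.floor(np.asarray(values, dtype=float) * n_bins).astype(int)
    return np.clip(idx, 0, n_bins - 1)  # the value 1.0 falls into the last bin

bins = equal_interval_bins([0.0, 0.19, 0.2, 0.55, 1.0], n_bins=5)
labels = [chr(ord('a') + b) for b in bins]  # bin 0 -> 'a', 1 -> 'b', ...
```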

Latent semantic indexing
A dataset with many features, often containing sparse information, can reduce the performance of the classifier. Feature selection and feature extraction techniques are used, individually or in combination, to reduce the dimensionality of the feature space (Chen et al., 2008). LSI is a feature extraction technique that has been combined with information granulation to address the class imbalance problem, reduce the number of sub-attributes, shorten the implementation time and increase classifier performance (Chen et al., 2008).
Concepts of singular value decomposition (SVD) and LSI are summarized in Deerwester, Dumais, Landauer, Furnas, and Harshman (1990). The SVD of A expresses A = U S V^T, where S = diag(s_1, ..., s_r), U = (u_1, ..., u_r) with U^T U = I, and V = (v_1, ..., v_r) with V^T V = I. Then A_k = U_k S_k V_k^T, a matrix of rank k, is the approximation of A, where k is the dimension of the low-dimensional space and should be small enough for fast retrieval yet large enough to adequately capture the structure of the corpus (Chen et al., 2008; Deerwester et al., 1990).
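The rank-k approximation A_k = U_k S_k V_k^T can be computed directly with a standard SVD routine; a small numpy sketch (the matrix is a hypothetical toy example):

```python
import numpy as np

A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [1., 1., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of A
```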

Granular computing
When describing a problem, we tend to avoid numbers and use combinations to formulate alternative questions (Chen et al., 2008). This is especially true when the problem involves incomplete, uncertain or vague information. Zadeh (1979) coined the term 'information granulation' and stressed that a plethora of detailed knowledge is not necessary (Chen et al., 2008; Su, Chen, & Yih, 2006). The IG construction process can be viewed as information splitting, which groups elements together based on their distinguishability, similarity, proximity or functionality; it can be regarded as clustering (Batista et al., 2004). People tend to make the same decision under the same conditions: the granules are generated by the similarity of objects, so the objects in the same granule share the same class (Su, Chen, & Yih, 2006). Four reasons/situations in which perception-based information needs to be processed are summarized in Zadeh (2005).
Researchers have paid much attention to the uncertainty/ambiguity of human decisions, through fuzzy sets, rough sets, GrC, etc. (Chen et al., 2008). Many works have used GrC to solve the imbalanced data classification problem, such as Su, Chen, and Yih (2006), Kaburlasos (2007), Chen et al. (2008), Fernández, Jesus, and Herrera (2009), and López et al. (2012). Chen et al. (2008) provided a general model for knowledge discovery from IGs which consists of three steps: IG construction, IG representation and knowledge acquisition. According to this model, different strategies can be used at each step to achieve optimal classification performance. In our work, we use the three steps of this model to classify imbalanced data, proposing a new strategy for IG construction and a strategy to represent the IGs in the form of sub-attributes. Details of the three-step model follow.

IG construction
IGs exist at different levels of granularity, so granules of similar 'size' (that is, granularity) are often grouped into a single level. The more detailed the required processing, the smaller the IGs selected (Su, Chen, & Yih, 2006); by changing granularity, we can reveal or hide detail (Chen et al., 2008). Many methods have been proposed to construct IGs, such as the Self Organizing Map (SOM) network, Fuzzy C-means (FCM), rough sets, shadowed sets, Fuzzy ART (Kaburlasos, 2007; Su, Chen, & Chiang, 2006) and K-means (Chen et al., 2008). Su, Chen, and Chiang (2006) used the Fuzzy ART network to construct IGs and proposed the H-index and U-ratio, and the Purity and Centrality indexes, to measure the appropriate level of granularity. Kaburlasos (2007) proposed a method that uses Fuzzy ART to select a level of granularity (López et al., 2012). Chen et al. (2008) used K-means (an unsupervised, simple and very widely used clustering algorithm) combined with the H-index and U-ratio to construct IGs; they also proposed a strategy for constructing IGs in highly skewed situations. Fernández et al. (2009) proposed applying a thicker granularity to generate the initial rule base and to reinforce the problematic subspaces.

IG representation
An IG often contains more than one object. The upper and lower limits of the attribute values are used to represent all objects in an IG. However, overlapping between IGs frequently appears. Such IGs are difficult to handle with traditional data mining algorithms, because these algorithms are not designed for IGs, especially when overlapping occurs. The concept of 'sub-attributes' was introduced to solve this problem (Chen et al., 2008; Su, Chen, & Chiang, 2006; Su, Chen, & Yih, 2006).
Consider the numeric attribute X_i of two IGs (I_A and I_B) as shown in Figure 2. I_A and I_B are described by the intervals [a_min, a_max] and [b_min, b_max], respectively. The sub-intervals [a_min, b_min], [b_min, a_max] and [a_max, b_max] are called sub-attributes. A binary variable serves as the value of each sub-attribute, indicating whether an IG contains that sub-interval or not (Chen et al., 2008). From Figure 2, it is easy to see that the number of input variables has tripled (from 1 to 3); with continuous data, the growth is even worse.
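This splitting of one numeric attribute into sub-attributes can be sketched as follows (a hypothetical pair of overlapping ranges standing in for I_A and I_B in Figure 2):

```python
def sub_intervals(ig_a, ig_b):
    """Split two overlapping ranges into the sub-intervals delimited by all four
    endpoints; each sub-interval becomes one binary sub-attribute per IG."""
    points = sorted({ig_a[0], ig_a[1], ig_b[0], ig_b[1]})
    intervals = list(zip(points, points[1:]))

    # an IG 'contains' a sub-interval if the sub-interval lies inside its range
    def bits(ig):
        return [int(ig[0] <= lo and hi <= ig[1]) for lo, hi in intervals]

    return intervals, bits(ig_a), bits(ig_b)

# one numeric attribute becomes three binary sub-attributes
ivs, bits_a, bits_b = sub_intervals((0.2, 0.6), (0.4, 0.9))
```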
Many methods have been proposed to reduce the number of sub-attributes, such as feature extraction, feature selection or a combination of both. Feature selection selects a subset of the most representative features from the original feature space; one such method is the rough set method, which can remove superfluous sub-attributes (Su, Chen, & Yih, 2006). Feature extraction transforms the original feature space into a smaller one to reduce dimensionality (Chen et al., 2008).
The process of knowledge acquisition from IGs is interpreted as follows. Let A be the matrix of IGs, in the form of sub-attributes, constructed from the training set. After applying SVD to A, we obtain three matrices U, S, V^T with A = U S V^T. We have A_k ≈ A (Deerwester et al., 1990) after performing LSI to reduce the dimensionality of the data. Then U_k = A V_k S_k^(-1) is the final matrix of A when the feature extraction step finishes; U_k is then fed into the neural network training step. Let B be the test set represented in the form of sub-attributes. W_k = B V_k S_k^(-1) is the resulting matrix of B, and W_k is then fed into the neural network to determine the output values.
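The projections U_k = A V_k S_k^(-1) and W_k = B V_k S_k^(-1) can be sketched with numpy (the matrices are hypothetical toy data; note that U_k coincides with the first k left singular vectors of A):

```python
import numpy as np

A = np.array([[1., 0., 1., 0.],   # training IGs in sub-attribute form (toy data)
              [0., 1., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 1.],
              [1., 1., 1., 0.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
V_k = Vt[:k, :].T                 # n_sub_attributes x k
S_k_inv = np.diag(1.0 / s[:k])

U_k = A @ V_k @ S_k_inv           # training IGs in the k-dimensional LSI space
B = np.array([[1., 0., 1., 1.]])  # a test sample in sub-attribute form
W_k = B @ V_k @ S_k_inv           # test sample folded into the same space
```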

Strategy for constructing IGs separated by class
Based on the strategy proposed by Chen et al. (2008), we propose a new strategy that can be applied to datasets with any class distribution ratio. Samples belonging to different classes are first separated into sub-datasets, one per class, and a clustering strategy is then applied to each sub-dataset. Depending on the dataset, the number of minority samples and the sample distribution of the classes, we choose the number of IGs for each class. Figure 3 illustrates the idea: small squares represent majority class samples; small ellipses represent minority class samples; big rectangles represent IGs of the majority class and big ellipses represent minority class IGs. If the number of records of a class is small, the number of IGs constructed for this class equals the number of samples of the class, exactly as in the strategy proposed by Chen et al. Because the dataset is separated by class before IG construction (clustering), we do not need to test whether IGs are homogeneous or indistinguishable (i.e. we do not calculate the H-index and U-ratio or set their thresholds as in Chen's approach). This leads to a reduction in computational time.
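A minimal sketch of this per-class construction (numpy only; the data, class sizes and IG counts are hypothetical, and a plain k-means stands in for the mixed k-means used for mixed-type data):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means (numpy only); returns the k cluster centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return centres

def build_igs_per_class(X, y, igs_per_class):
    """Cluster each class separately; a class with fewer samples than its
    requested IG count keeps every sample as its own IG."""
    igs = {}
    for c, k in igs_per_class.items():
        Xc = X[y == c]
        igs[c] = Xc.copy() if len(Xc) <= k else kmeans(Xc, k)
    return igs

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(4, 1, (6, 2))])
y = np.array([0] * 90 + [1] * 6)
igs = build_igs_per_class(X, y, {0: 5, 1: 10})  # minority class kept sample-by-sample
```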

Sub-attributes representation
In this study, we propose a specific strategy, explained below, for representing the IGs in the form of sub-attributes after discretizing numeric values (if any), in place of the one described by Chen et al. (2008). Numeric attributes are discretized into S_e equal intervals over the whole input dataset, so that the entire range of values of each attribute is covered.
As mentioned before, to resolve the overlap between IGs, we need to represent the IGs in the form of sub-attributes (Figure 2). If the number of IGs is large, many bits are needed to represent the sub-attributes: with 30 IGs, the number of bits in the worst case is 30 * 2 − 1 = 59 per attribute. This can be avoided with the strategy in Figure 4(a), where the sub-attributes use a small, fixed number of bits (for example, S_e = 10). According to this strategy, the IG I_A is represented by 1s from position a'_min to a'_max (a'_min and a'_max being the discretized values of a_min and a_max in Figure 2), with the remaining positions set to 0; the IG I_B is represented similarly. For a categorical attribute such as X_j in Figure 4(b), the positions corresponding to the categorical values appearing in the IG I_C are set to 1 and the remaining positions to 0; the IG I_D is represented similarly. This strategy has the following advantages: (1) it reduces computational time, because we use only discretized and categorical values for IG representation, without computing overlapping intervals between IGs; (2) it fixes (and reduces) the number of bits for sub-attributes, because the number of discretization intervals for numeric attributes and the number of distinct values of each categorical attribute are fixed.
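The fixed-length representation can be sketched as below (hypothetical bin positions and a hypothetical colour attribute; S_e bits per numeric attribute, one bit per distinct categorical value):

```python
def range_bits(lo_bin, hi_bin, n_bins):
    """Numeric attribute: set positions lo_bin..hi_bin (the IG's discretized range) to 1."""
    return [1 if lo_bin <= i <= hi_bin else 0 for i in range(n_bins)]

def categorical_bits(values_present, domain):
    """Categorical attribute: one bit per distinct value, 1 if it occurs in the IG."""
    return [1 if v in values_present else 0 for v in domain]

ig_numeric = range_bits(1, 3, n_bins=5)
ig_colour = categorical_bits({'red', 'blue'}, domain=['red', 'green', 'blue'])
```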

Algorithm
Note that, in order to compare with the method proposed by Chen et al. (2008), we use K-means to construct the IGs, LSI to reduce the sub-attributes and a neural network for the classification task. Generally, the algorithm proceeds as follows. First, we split the whole dataset into one sub-dataset per class and determine the number of IGs for each class. Then K-means is run for IG construction. After that, we represent the IGs in the form of sub-attributes and apply SVD. We then use LSI for feature extraction, with K_LSI increasing from the number of dataset attributes to the number of attributes + 9. Finally, we use the neural network for classification and evaluate the accuracy of the learned classifier for each K_LSI; the optimal K_LSI corresponds to the maximum G-mean value.
Particularly, our algorithm includes the following steps:
Step 1: Split the training dataset, after numeric data discretization (if needed), into sub-datasets, one per class.
Step 2: Determine the number of IGs for each class; different classes may have different numbers of IGs.
Step 3: Run K-means on each sub-dataset.
Step 4: Represent the original attributes of the training dataset in the form of sub-attributes after numeric data discretization, producing a matrix A.
Step 5: Apply SVD to matrix A.
Step 6: Initialize K_LSI to the number of attributes of the dataset.
Step 7: Extract features using LSI with K_LSI.
Step 8: Train the neural network on the training set and calculate the classification accuracy on the test set.
Step 9: Determine the optimal accuracy corresponding to the largest G-mean. If K_LSI is less than the number of attributes + 10, increase K_LSI by 1 and repeat steps 7-9; otherwise, go to step 10.
Step 10: Terminate the procedure with the optimal K_LSI.
We give an example to better understand the algorithm. Table 1 shows one of the IGs for the Iris versicolor class of the Iris dataset. Matrix A is created from all sub-attributes of all IGs; each row is an IG in the form of sub-attributes. The number of rows is the number of IGs and the number of columns is N_r * S_e = 4 * 5 = 20 bits, where N_r is the number of numeric attributes (in this example). We then perform SVD on matrix A and apply LSI with K_LSI = 4 (the number of attributes). Finally, we run the neural network training process. A test sample with the values (5.7, 2.8, 4.5, 1.3, versicolor) corresponds to the standardized values (0.388889, 0.333333, 0.59322, 0.5) and the discretized values denoted (b, b, c, c). The sub-attributes of this sample are 01000 01000 00100 00100. These sub-attributes are then reduced by feature extraction and fed into the learned neural network to determine the output value. If the output for this sample is 0.9, which rounds to 1.0 (i.e. versicolor), the test sample is regarded as correctly classified. The value |output − class| = |0.9 − 1| = 0.1 is the classification error for this sample.
Steps 7-9 are repeated, increasing K_LSI by 1 each time (from 4 to 13), to find the optimal accuracy.
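The encoding of the test sample above can be reproduced as follows (the attribute minima and maxima are the standard per-attribute ranges of the UCI Iris data, assuming min-max standardization and S_e = 5):

```python
import numpy as np

S_e = 5
mins = np.array([4.3, 2.0, 1.0, 0.1])  # per-attribute minima of the Iris data
maxs = np.array([7.9, 4.4, 6.9, 2.5])  # per-attribute maxima of the Iris data

sample = np.array([5.7, 2.8, 4.5, 1.3])
standardized = (sample - mins) / (maxs - mins)
bins = np.clip(np.floor(standardized * S_e).astype(int), 0, S_e - 1)
letters = [chr(ord('a') + b) for b in bins]  # bin 0 -> 'a', 1 -> 'b', ...
sub_attrs = ' '.join(''.join('1' if i == b else '0' for i in range(S_e)) for b in bins)
```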

Experiments
We use K-means to construct the IGs for numerically homogeneous data; the K-means, SVD and neural network implementations are obtained from the open source library ALGLIB (alglib.net). For datasets with mixed data types, we use the K-means variant proposed by Ahmad and Dey (2007).

Implementation environment
The configuration of the testing platform is as follows: Windows 7, 4 GB RAM, Intel(R) Core™ i3-2310M and Visual Studio C++ 2008.

Datasets
Six experimental datasets are taken from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html). The first four, Credit screening, Contraceptive, Wine and Pima, were used for imbalanced data classification by Chen et al. (2008); we use them for comparison purposes. For further validation, we add the Heart and Vote datasets. Table 3 shows the information on the experimental datasets. We remove records with missing data because LSI cannot work with missing values. We use 10-fold cross-validation to compute the accuracy.

Evaluation measures
As mentioned above, the performance of a classifier on the test set of an imbalanced dataset cannot be evaluated by the Overall Accuracy alone: a high Overall Accuracy may be meaningless if the accuracy on the minority class is very low. Therefore, to evaluate classifier performance, we use the Overall Accuracy, Negative Accuracy, Positive Accuracy and G-mean, as used by Batista et al. (2004) and Chen et al. (2008). The geometric mean (G-mean) of the Positive Accuracy and Negative Accuracy is defined as G-mean = sqrt(Positive Accuracy × Negative Accuracy).
Table 2. The ranges of values of the IGs' discretized attributes and the corresponding sub-attribute representation.
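These measures can be sketched as follows; the confusion-matrix counts are hypothetical and illustrate why a classifier that ignores a 10% minority class still reaches 90% Overall Accuracy but a G-mean of zero:

```python
import math

def accuracies(tp, fn, tn, fp):
    """Positive/Negative Accuracy are the per-class recalls."""
    return tp / (tp + fn), tn / (tn + fp)

def g_mean(pos_acc, neg_acc):
    return math.sqrt(pos_acc * neg_acc)

# a classifier that labels everything as the majority (negative) class
pos_acc, neg_acc = accuracies(tp=0, fn=20, tn=180, fp=0)
overall = (0 + 180) / 200  # 90% overall accuracy despite missing every minority sample
```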

Experimental results
For each class in the training set, samples are clustered using K-means, which is repeated 10 times to select the optimal K_K-means, i.e. the clustering result with the lowest error. We apply LSI to the set of IGs just constructed, after their representation in the form of sub-attributes, with K_LSI increasing from the number of attributes of the dataset to the number of attributes plus 10. The optimal K_LSI is the one whose classification result (using neural networks) has the highest accuracy. Each training set has a different optimal K_LSI and different optimal K_K-means values for each class; therefore, we do not report K_K-means and K_LSI individually.
In addition, the parameters of the algorithm proposed by Chen et al. (2008) were not published in detail, so we implemented this algorithm with different parameter settings to find the highest accuracy.
The parameters given in Table 4, such as the network structure, the number of iterations for the NN, and the number of intervals for the discretization process (S_e), are used in our implementation for both algorithms; the H-index and U-ratio thresholds are employed only for Chen's algorithm, since we do not use these indexes in ours. As mentioned initially, the minority class is often the important one, so for each dataset we compare the accuracy of each class; Table 5 illustrates this comparison. We do not report the number of sub-attributes, because it differs between training sets when searching for the optimal accuracy. The number of bits used to represent the IGs in the form of sub-attributes is S_e * N_r + sum over i = 1..N_c of n_c,i, where N_r is the number of numeric attributes, N_c is the number of categorical attributes, S_e is the number of discretization intervals and n_c,i is the number of distinct values of the i-th categorical attribute.
The table compares the accuracy of our algorithm and Chen's. Our method achieves higher accuracy for each class, especially for the minority class, and higher overall accuracy on some datasets but lower on others. First, on the Heart, Vote and Credit screening datasets, the accuracy for each class is higher, and on the other datasets the accuracy for the minority class is also higher. Second, the first three datasets show higher overall accuracy while the others are lower. Furthermore, the G-mean of our method is always higher, showing that the per-class accuracies of our method are more balanced. On average, our method's overall accuracy, G-mean and per-class accuracy are higher (increases of 0.35%, 5.47% and 4.2%, respectively).
For the Contraceptive dataset, the overall accuracy is lower than with Chen's method, but the accuracy of the two minority classes is higher. In addition, Chen's method yields a very low G-mean (7.66%) on Contraceptive, so the per-class accuracy on the test sets is also very low; the experimental results show many folds (test sets) with a class accuracy of 0, which drags the average G-mean down. Besides, our implementation of the algorithm proposed by Chen et al. (2008) obtains 5.73% higher accuracy on Contraceptive than the value reported by Chen et al. (2008), which confirms that our discretization and sub-attribute representation strategies suit this dataset. Table 6 shows the average computational time of our method and Chen's. The computational time for each dataset is lower with our method because it does not calculate the H-index and U-ratio and does not check their thresholds. Moreover, the equal-interval binning discretization used for sub-attribute representation also helps to reduce the running time. However, for some datasets, such as Credit approval, Contraceptive and Pima, the computational time of Chen's method in this study is higher than that reported by Chen et al. (2008), because the mixed K-means algorithm (Ahmad & Dey, 2007) used in our implementation requires more computation when the number of constructed IGs is large.

Conclusion and future work
Our proposed strategies improve IG construction (performed per class) and sub-attribute representation, thereby reducing computational time. In particular, we do not calculate the H-index and U-ratio (i.e. we do not test homogeneity or handle indistinguishable IGs). Besides, we feed the discretized values (for numeric attributes) directly into the sub-attribute representation without considering overlap between the IGs. However, our method has disadvantages. First, the min and max of each attribute domain of each IG are mapped to discretized bins that may slightly under- or over-cover the true range, which influences the training and test processes. Second, since we use K-means to construct the IGs, the number of IGs must be determined before K-means is executed.
The experimental results confirm that our proposal improves classification performance for the imbalanced data classification problem. The accuracy across classes is balanced because we choose the maximum G-mean. We implemented the algorithm proposed by Chen et al. (2008) with different parameters and our own discretization strategy, so our results do not exactly reproduce those in Chen's publication. In the future, we will study further IG construction strategies and imbalanced data classification methods.

Disclosure statement
No potential conflict of interest was reported by the authors.

Notes on contributors
Lai Duc Anh received his Master's degree in Computer Science from the University of Science, Vietnam National University of Ho Chi Minh, Viet Nam in 2015. His research interests include association rules, classification.
Bay Vo received his PhD degree in Computer Science from the University of Science, Vietnam National University of Ho Chi Minh, Viet Nam in 2011. His research interests include association rules, classification, mining in incremental database, distributed databases and privacy preserving in data mining.
Witold Pedrycz is Professor and Canada Research Chair (CRC) in Computational Intelligence in the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada. He is also with the Systems Research Institute of the Polish Academy of Sciences, Warsaw, Poland. He holds an appointment of special professorship in the School of Computer Science, University of Nottingham, UK. In 2009 Dr. Pedrycz was elected a foreign member of the Polish Academy of Sciences. In 2012 he was elected a Fellow of the Royal Society of Canada. Witold Pedrycz has been a member of numerous program committees of IEEE conferences in the area of fuzzy sets and neurocomputing. In 2007 he received a prestigious Norbert Wiener award from the IEEE Systems, Man, and Cybernetics Council. He is a recipient of the IEEE Canada Computer Engineering Medal 2008. In 2009 he received a Cajastur Prize for Soft Computing from the European Centre for Soft Computing for "pioneering and multifaceted contributions to Granular Computing". In 2013 he was awarded a Killam Prize. In the same year he received a Fuzzy Pioneer Award 2013 from the IEEE Computational Intelligence Society. His main research directions involve Computational Intelligence, fuzzy modeling and Granular Computing, knowledge discovery and data mining, fuzzy control, pattern recognition, knowledge-based neural networks, relational computing, and Software Engineering. He has published numerous papers in this area. He is also an author of 15 research monographs covering various aspects of Computational Intelligence, data mining, and Software Engineering.