Basis exchange and learning algorithms for extracting collinear patterns

ABSTRACT Understanding large data sets is one of the most important and challenging problems of our time. Exploration of genetic data sets composed of high-dimensional feature vectors can be treated as a leading example in this context. A better understanding of large, multivariate data sets can be achieved through exploration and extraction of their structure. Collinear patterns can be an important part of a given data set's structure. Collinear (flat) patterns exist in a given set of feature vectors when many of these vectors are located on (or near) certain planes in the feature space. Discovered flat patterns can reflect various types of interaction in an explored data set. This paper compares basis exchange algorithms with learning algorithms in the task of flat pattern extraction.


Introduction
Large data sets might contain very useful information. Models and computational tools for extracting new knowledge from large data sets are constantly being developed within pattern recognition and data mining methods (Duda, Hart, & Stork, 2001; Hand, Smyth, & Mannila, 2001). Along with the rapidly increasing size of databases grows the interest in extracting knowledge that can be really useful (Jaseena & Julie, 2014; Julie & Kannan, 2011). However, the discovery of dependencies is not easy, and there are many challenges associated with mining Big Data (Albert, 2013). The contemporary problem of understanding large data sets is crucial for many practical issues (Ian & Eibe, 2005). A better understanding of genetic data sets composed of high-dimensional feature vectors is a leading example in this context (Wong, 2016). Research conducted in this way allows one to prepare better diagnoses, prognoses, and therapies (Clarke et al., 2008).
We assume that the explored data sets are composed of feature vectors which represent particular objects (events, phenomena) in a standardized way in a particular feature space. Data mining tools are used for discovering useful patterns in large data sets (Bobrowski, 2005). These tools let us analyse the data in an automatic way (Sun-Mi & Patricia, 2003). The term pattern denotes a subset (cluster) of feature vectors characterized by a certain type of regularity (Duda et al., 2001). More specifically, the term collinear (flat) pattern means a large number of feature vectors situated on some planes in the feature space. The algorithms presented in this article can find such dependencies in multidimensional data and, as a result, return patterns extracted from given data sets.
Flat patterns can be extracted from large data sets through minimization of a special type of the convex and piecewise linear (CPL) criterion function (Bobrowski, 2005). Basis exchange algorithms allow one to efficiently find the global minimum of the CPL criterion function even in the case of large, multidimensional data sets. Learning algorithms provide another possibility of minimizing CPL functions (Bobrowski & Zabielski, 2016b).
This paper is an extended version of the ACIIDS 2016 paper (Bobrowski & Zabielski, 2016b). The authors describe an additional algorithm and the possibility of discovering collinear patterns based on basis exchange. Furthermore, a comparison of a selected learning algorithm with a basis exchange algorithm in the task of flat pattern discovery is demonstrated. The results of the experimental comparison of these two types of algorithms are discussed.

Dual hyperplanes and vertices in the parameter space
Let us consider the data set C composed of m feature vectors x_j:

C = {x_j = [x_j,1, … , x_j,n]^T : j = 1, … , m} (1)

Feature vectors x_j belong to the n-dimensional feature space F[n] (x_j ∈ F[n]). Components x_j,i of the jth feature vector x_j can be treated as numerical results of n standardized examinations of the jth object O_j, where x_j,i ∈ R or x_j,i ∈ {0, 1}. Each of the m feature vectors x_j from the set C (1) defines the following dual hyperplane h_j in the parameter space R^n (Kushner & Yin, 1997):

h_j = {w : x_j^T w = 1} (2)

where w = [w_1, … , w_n]^T is the parameter (weight) vector (w ∈ R^n).
Unit vectors e_i = [0, … , 1, … , 0]^T define the following hyperplanes h_i^0 in the n-dimensional parameter space R^n:

h_i^0 = {w : e_i^T w = 0} (3)

Consider the set S_k of n linearly independent feature vectors x_j (1) and unit vectors e_i:

S_k = {x_j : j ∈ J_k} ∪ {e_i : i ∈ I_k} (4)

The set S_k contains r_k feature vectors x_j (j ∈ J_k) and n − r_k unit vectors e_i (i ∈ I_k). The kth vertex w_k = [w_k,1, … , w_k,n]^T in the parameter space R^n is described on the basis of the set S_k (4) as the intersection point of the r_k hyperplanes h_j (2) defined by the feature vectors x_j (j ∈ J_k) and the n − r_k hyperplanes h_i^0 (3) defined by the unit vectors e_i (i ∈ I_k). The vertex w_k can be defined by the following set of n linear equations:

(∀j ∈ J_k) x_j^T w_k = 1 (5)

and

(∀i ∈ I_k) e_i^T w_k = 0 (6)

Equations (5) and (6) can be represented as the matrix equation:

B_k w_k = 1'_k (7)

where 1'_k is the vector with r_k components equal to one (for the rows from J_k) and n − r_k components equal to zero, and the square, nonsingular matrix B_k is the kth basis linked to the vertex w_k:

B_k = [x_j(1), … , x_j(r_k), e_i(r_k+1), … , e_i(n)]^T (8)

The feature vector x_j(l) (j(l) ∈ J_k) constitutes the lth row of the matrix B_k. The vertex w_k can be computed from (7) in the following manner:

w_k = B_k^{-1} 1'_k (9)

The following three definitions are taken from the work of Bobrowski (2014):

DEFINITION 1: The rank r_k (1 ≤ r_k ≤ n) of the kth vertex w_k = [w_k,1, … , w_k,n]^T (9) is defined as the number of its non-zero components w_k,i (w_k,i ≠ 0).
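As an illustration, the vertex computation (9) can be sketched in a few lines of Python; the `solve` helper and the toy basis below are our own, not part of the original implementation.

```python
# Sketch of computing a vertex w_k from the basis B_k (hypothetical toy data).
# Rows are r_k feature vectors x_j followed by n - r_k unit vectors e_i; the
# right-hand side has ones for feature-vector rows and zeros otherwise.

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Example in R^3: two feature vectors and one unit vector e_3 (r_k = 2).
x1, x2, e3 = [1.0, 1.0, 0.0], [1.0, 3.0, 0.0], [0.0, 0.0, 1.0]
B_k = [x1, x2, e3]
w_k = solve(B_k, [1.0, 1.0, 0.0])
# w_k satisfies x1^T w_k = 1, x2^T w_k = 1, and w_k[2] = 0
```

Here the vertex comes out as w_k = [1, 0, 0]: both dual hyperplanes (2) pass through it, and the e_3 row forces its third component to zero.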

DEFINITION 2:
The degree of degeneration d_k of the vertex w_k (9) of the rank r_k is defined as the number d_k = m_k − r_k, where m_k is the number of such feature vectors x_j from the set C (1) which define hyperplanes h_j (2) passing through this vertex (w_k^T x_j = 1). The vertex w_k (9) is degenerated if its degree of degeneration d_k is greater than zero (d_k > 0).
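A minimal sketch of computing the degeneration degree d_k from Definition 2; the helper name and the toy data are hypothetical:

```python
# Counting the degeneration degree d_k = m_k - r_k of a vertex w_k:
# m_k is the number of feature vectors whose dual hyperplanes w^T x_j = 1
# pass through w_k (toy data in R^2).

def dual_passes_through(w, x, tol=1e-9):
    return abs(sum(wi * xi for wi, xi in zip(w, x)) - 1.0) < tol

w_k = [1.0, 0.0]                       # vertex of rank r_k = 1
C = [[1.0, 0.5], [1.0, -2.0], [0.5, 0.5], [1.0, 7.0]]
m_k = sum(dual_passes_through(w_k, x) for x in C)
r_k = sum(1 for wi in w_k if wi != 0)  # rank = number of non-zero components
d_k = m_k - r_k
# Here m_k = 3 and r_k = 1, so the vertex is degenerated with d_k = 2
```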

DEFINITION 3:
The basis B_k (8) is regular if and only if the number n − r_k of the unit vectors e_i (i ∈ I_k (6)) in the matrix B_k is equal to the number of components w_k,i of the vector w_k = [w_k,1, … , w_k,n]^T (9) which are equal to zero (w_k,i = 0).

Vertexical planes and collinear patterns in feature space
The hyperplane H(w, θ) in the feature space F[n] is defined as follows (Bobrowski, 2005):

H(w, θ) = {x : w^T x = θ} (10)

where w is the weight vector (w ∈ R^n) and θ is the threshold (θ ∈ R^1). The kth vertex w_k (9) of the rank r_k (r_k > 1) (Definition 1) allows us to define the (r_k − 1)-dimensional vertexical plane P_k(x_j(1), … , x_j(r_k)) in the feature space F[n] as the linear combination of the r_k linearly independent feature vectors x_j(i) (j(i) ∈ J_k) (5) from the set S_k (4):

P_k(x_j(1), … , x_j(r_k)) = {x : x = α_1 x_j(1) + ⋯ + α_r_k x_j(r_k)} (11)

where j(i) ∈ J_k (5) and the parameters α_i (α_i ∈ R^1) satisfy the following condition:

α_1 + ⋯ + α_r_k = 1 (12)

Two linearly independent vectors x_j(1) and x_j(2) from the set C (1) support the following straight line l(x_j(1), x_j(2)) in the feature space F[n] (x ∈ F[n]):

l(x_j(1), x_j(2)) = {x : x = α x_j(1) + (1 − α) x_j(2)} (13)

where α ∈ R^1.
LEMMA 1: The vertexical plane P_k(x_j(1), … , x_j(r_k)) (11) based on the vertex w_k (9) with the rank r_k greater than 1 (r_k > 1) is equal to the hyperplane H(w_k, 1) (10) in the n-dimensional feature space F[n] defined by this vertex.
THEOREM 1: The jth feature vector x_j (x_j ∈ C (1)) is located on the vertexical plane P_k(x_j(1), … , x_j(r_k)) (11) if and only if the jth dual hyperplane h_j (2) passes through the vertex w_k (9) of the rank r_k.
The proof of the above lemma and theorem was provided in the article by Bobrowski (2014).
The flat pattern F_k is formed by m_k (m_k > r_k) feature vectors x_j located on the vertexical plane P_k(x_j(1), … , x_j(r_k)) (11). Lemma 1 shows that the flat pattern F_k is based on the vertex w_k (9): the flat pattern F_k of the dimension r_k − 1 is composed of those m_k feature vectors x_j whose dual hyperplanes h_j (2) pass through the vertex w_k (9) of the rank r_k:

F_k = {x_j ∈ C : w_k^T x_j = 1} (14)

The kth vertex w_k = [w_k,1, … , w_k,n]^T (9) has n − r_k components w_k,i equal to zero:

(∀i ∈ I_k) w_k,i = 0 (15)

CPL collinearity functions
Let us define the collinearity penalty functions φ_j^1(w) linked to the feature vectors x_j (1) (Figure 1):

φ_j^1(w) = 0 if |w^T x_j − 1| ≤ ε, and φ_j^1(w) = |w^T x_j − 1| − ε if |w^T x_j − 1| > ε (16)

Remark 1: The collinearity penalty function φ_j^1(w) (16) is equal to zero if and only if the distance between the point w and the dual hyperplane h_j (2) is not greater than ε / ||x_j||:

φ_j^1(w) = 0 ⇔ |w^T x_j − 1| ≤ ε (17)

The kth collinearity criterion function F_k^1(w) is defined here as the weighted sum of the penalty functions φ_j^1(w) (16) linked to the feature vectors x_j from the subset C_k (C_k ⊂ C (1)):

F_k^1(w) = Σ_{j ∈ J_k} β_j φ_j^1(w) (18)

where J_k = {j : x_j ∈ C_k ⊂ C (1)} and the positive parameters β_j (β_j > 0) can be treated as the prices of particular feature vectors x_j. The standard choice of the parameter values β_j is one ((∀j ∈ {1, … , m}) β_j = 1.0).
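Assuming the piecewise-linear form implied by Remark 1 (the penalty is zero inside the ε-margin around the dual hyperplane and grows linearly outside it), the penalty and criterion functions can be sketched as follows; the function names and the toy data are our own:

```python
# A minimal sketch of the CPL collinearity penalty and criterion functions.

def penalty(w, x, eps=0.0):
    """phi_j(w) = max(0, |w^T x_j - 1| - eps): zero inside the eps-margin."""
    return max(0.0, abs(sum(wi * xi for wi, xi in zip(w, x)) - 1.0) - eps)

def criterion(w, C, eps=0.0, beta=None):
    """F_k(w) = sum_j beta_j * phi_j(w) over the subset C_k."""
    beta = beta or [1.0] * len(C)
    return sum(b * penalty(w, x, eps) for b, x in zip(beta, C))

C_k = [[1.0, 0.0], [1.0, 1.0], [1.0, -3.0]]  # all satisfy w^T x = 1 at w = [1, 0]
print(criterion([1.0, 0.0], C_k))             # 0.0 - a flat-pattern vertex
print(criterion([0.5, 0.5], C_k))             # 2.5 - positive away from the vertex
```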

The minimal value of the collinearity criterion function
The collinearity criterion function F_k^1(w) (18), like the penalty functions φ_j^1(w) (16), is convex and piecewise linear (CPL). It has been proved that the minimal value F_k^1(w_k^*) of the CPL criterion function F_k^1(w) (18) can be found in one of the vertices w_k (9):

F_k^1(w_k^*) = min_w F_k^1(w) (19)

Let us introduce the following layer L_ε′(w, θ) in the feature space F[n], which is based on the hyperplane H(w, θ) (10):

L_ε′(w, θ) = {x : |w^T x − θ| ≤ ε′} (20)

where ε′ ≥ 0 is the margin (ε′ ∈ R^1). The layer L_ε′(w, θ) (20) contains those feature vectors x whose distance from the hyperplane H(w, θ) (10) is no greater than ε′ / ||w||.

THEOREM 2: If all the feature vectors x_j from the subset C_k (C_k ⊂ C (1)) are located inside the layer L_ε′(w, θ) (20) with θ ≠ 0, then the minimal value F_k^1(w_k^*) (19) of the collinearity criterion function F_k^1(w) (18) is equal to zero for the margin ε (16) no less than ε′ / |θ| (ε ≥ ε′ / |θ|).
Proof: Let us suppose first that θ > 0. All the feature vectors x_j from the subset C_k are located inside the layer L_ε′(w, θ) (20), so

(∀x_j ∈ C_k) |w^T x_j − θ| ≤ ε′ (21)

Dividing both sides by the positive number θ gives

(∀x_j ∈ C_k) |(w / θ)^T x_j − 1| ≤ ε′ / θ (22)

In accordance with the definition (16) of the collinearity penalty function φ_j^1(w), each penalty φ_j^1(w / θ) with the margin ε = ε′ / θ is therefore equal to zero, and hence (18):

F_k^1(w*) = 0, where w* = w / θ (23)

Let us suppose now that θ < 0. By dividing both sides of the inequality (21) by the negative number θ and taking absolute values, we obtain the analogous bound with ε′ / |θ| and, as a result, an equation similar to (23). The minimal value F_k^1(w_k^*) (19) of the collinearity criterion function F_k^1(w) (18) with ε = ε′ / |θ| is thus equal to zero. An increase in the margin ε in the penalty functions φ_j^1(w) (16) preserves the location of all vectors x_j inside the layer L_ε′(w, θ) (20), so the minimal value F_k^1(w_k^*) (19) remains equal to zero. Since the minimal value F_k^1(w_k^*) (19) of the CPL criterion function F_k^1(w) (18) can be found in one of the vertices w_k (9) (Simonnard, 1966), we infer that a vertex w_k^* (9) exists for which F_k^1(w_k^*) = 0.
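Theorem 2 can be checked numerically on toy data; all names and values below are our own choices, not from the paper:

```python
# Numeric illustration of Theorem 2: vectors inside the layer |w^T x - theta| <= eps'
# give a zero criterion value at w* = w/theta once the margin is eps = eps'/|theta|.

def penalty(w, x, eps):
    return max(0.0, abs(sum(wi * xi for wi, xi in zip(w, x)) - 1.0) - eps)

w, theta, eps_prime = [2.0, -1.0], 2.0, 0.1
# Three points with |w^T x - theta| <= eps', i.e. inside the layer:
C_k = [[1.0, 0.0], [1.05, 0.0], [1.0, 0.08]]
w_star = [wi / theta for wi in w]    # w* = w / theta
eps = eps_prime / abs(theta)         # eps = eps' / |theta|
F = sum(penalty(w_star, x, eps) for x in C_k)
# F is (numerically) zero, as the theorem predicts
```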

Basis exchange algorithm
The basis exchange algorithm was initially proposed and developed as an efficient tool for designing linear classifiers and examining linear separability of large, multidimensional data sets (Bobrowski, 2005). The first version of the basis exchange algorithm was described by Bobrowski and Niemiro (1984) and Bobrowski (1991). The first basis exchange algorithms were aimed at minimizing the perceptron criterion function. The CPL perceptron criterion function is linked to the linear separability concept, which is fundamental in the perceptron theory of neural networks (Duda et al., 2001). The CPL criterion functions were specified later for a variety of goals, and different types of basis exchange algorithms were proposed for particular CPL criterion functions (Bobrowski, 2005). The basis exchange algorithms are based on the Gauss-Jordan transformation; the famous Simplex algorithm used in linear programming is also based on this transformation (Simonnard, 1966). Initial procedures of flat pattern extraction and their implementation are described in Bobrowski and Zabielski (2016a). A brief description of the basis exchange algorithms aimed at the minimization of the function F_k^1(w) (18) is presented below.
Let us assume that the kth basis B_k (8) is a square matrix composed of n linearly independent feature vectors x_j(i) from the subset C_k (C_k ⊂ C (1)) (27). The inverse matrix B_k^{-1} during the kth stage can be represented by its columns r_i(k):

B_k^{-1} = [r_1(k), … , r_n(k)] (28)

The vectors x_j(l) and r_i(k) fulfil the orthogonality conditions

(∀i, l) x_j(l)^T r_i(k) = δ_l,i (29)

where δ_l,i is the Kronecker delta. During the kth stage of the algorithm, the basis B_k (27) is changed into the basis B_k+1. The matrix B_k+1 is created from the matrix B_k (27) through replacing the lth row x_j(l) by a new vector x_k taken from the given data subset C_k.
The exchange of the lth basis vector x_j(l) for the new vector x_k results in the new basis B_k+1 and the following Gauss-Jordan update of the columns of the inverse matrix:

r_l(k + 1) = r_l(k) / (x_k^T r_l(k)), and (∀i ≠ l) r_i(k + 1) = r_i(k) − (x_k^T r_i(k)) r_l(k + 1) (30)

The basis exchange algorithms are based on this Gauss-Jordan transformation (30). The index l of the vector x_j(l) which leaves the basis B_k (27) during the kth stage is determined by the exit criterion of a particular basis exchange algorithm. The entry criterion determines which vector x_k from the data subset C_k (1) enters the new basis B_k+1. The stop criterion determines the final stage K of the basis exchange algorithm. The selected criteria (exit, entry, and stop) should ensure that the collinearity function F_k^1(w) (18) decreases during each stage k.
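The exchange step can be sketched as an O(n^2) update of the inverse-matrix columns instead of a full re-inversion. The code below is a generic sketch of this standard Gauss-Jordan update, not the authors' implementation; variable names are our own:

```python
# Gauss-Jordan inverse update for basis exchange: when the l-th basis row is
# replaced by a new vector x, the columns r_i of B^{-1} are updated in place.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def exchange(inv_cols, x, l):
    """inv_cols[i] is the i-th column of B^{-1}; returns the columns of B_new^{-1}."""
    g = dot(x, inv_cols[l])
    assert abs(g) > 1e-12, "x would make the new basis singular"
    r_l = [v / g for v in inv_cols[l]]           # r_l(k+1) = r_l(k) / (x^T r_l(k))
    new_cols = []
    for i, r_i in enumerate(inv_cols):
        if i == l:
            new_cols.append(r_l)
        else:                                     # r_i(k+1) = r_i(k) - (x^T r_i(k)) r_l(k+1)
            f = dot(x, r_i)
            new_cols.append([a - f * b for a, b in zip(r_i, r_l)])
    return new_cols

# Example: start from B = I (inverse columns = unit vectors) and replace row 0
# with x = [2, 1]; then B_new = [[2, 1], [0, 1]], whose inverse has columns
# [0.5, 0] and [-0.5, 1].
cols = exchange([[1.0, 0.0], [0.0, 1.0]], [2.0, 1.0], 0)
```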
The exchange of the kth basis B_k (27) into the basis B_k+1 causes the move from the vertex w_k to w_k+1 (9). After a finite number K of steps, the optimal vertex w_k^* (19) is reached in this way (Bobrowski, 2005).

Learning algorithm
The optimal vertex w_k^* (9) which constitutes the minimum F_k^1(w_k^*) (19) of the collinearity criterion function F_k^1(w) (18) can be approximated using learning algorithms (Kushner & Yin, 1997). The Robbins-Monro procedure of stochastic approximation has been used by us in designing a learning algorithm aimed at the minimization of the criterion function F_k^1(w) (18) (Bobrowski & Zabielski, 2016b). For this purpose, it is useful to represent the criterion function F_k^1(w) (18) as the regression function F_k(w) in the following manner:

F_k(w) = Σ_j φ_j^1(w) p(x_j) (31)

where φ_j^1(w) is the penalty function (16) and p(x_j) is the uniform probability distribution defined on the m_k elements x_j of the kth data subset C_k (C_k ⊂ C (1)):

(∀x_j ∈ C_k) p(x_j) = 1 / m_k (32)

The uniform probability distribution p(x_j) was used to generate the following learning sequence {x(n)} from the m_k elements x_j of the kth data subset C_k:

x(1), x(2), … , x(n), … (33)
The elements x(n) of the above learning sequence were generated in each step n independently, in accordance with the probability distribution p(x_j) (32):

(∀j ∈ J_k) P{x(n) = x_j} = 1 / m_k (34)

where P{x(n) = x_j} is the probability of obtaining the jth vector x_j during the nth learning step. The Robbins-Monro procedure aimed at the minimization of the regression function F_k(w) (31) can be given in the following manner (Kushner & Yin, 1997):

w(n + 1) = w(n) − α_n ∇_w φ^1(x(n); w(n)) (35)

where w(n) is the value of the weight vector during the nth learning step and {α_n} is a sequence of decreasing, positive parameters α_n (e.g. α_n = 1/n) which fulfil the following conditions:

Σ_n α_n = ∞ and Σ_n α_n^2 < ∞ (36)
The Robbins-Monro procedure (35) applied to the regression function F_k(w) (31) with the penalty functions φ_j^1(w) (16) allows us to specify the following learning algorithm:

if w(n)^T x(n) < 1 − ε, then w(n + 1) = w(n) + α_n x(n),
if w(n)^T x(n) > 1 + ε, then w(n + 1) = w(n) − α_n x(n), and
if 1 − ε ≤ w(n)^T x(n) ≤ 1 + ε, then w(n + 1) = w(n), (37)

where x(n) = x_j if the feature vector x_j was generated (34) during the nth learning step. The learning algorithm (37) can be treated as a modification of the famous error-correction algorithm used in the Perceptron (Duda et al., 2001).
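A runnable sketch of the learning rule (37) with steps α_n = 1/n follows; the synthetic data, the seed, and the fixed step count are our own choices, not the authors' setup:

```python
# Error-correction learning sketch: corrections of +/- a_n x(n) move w toward
# the margin band 1 - eps <= w^T x <= 1 + eps around each dual hyperplane.

import random

def learn(C, eps=0.0, n_steps=1000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(C[0])
    for n in range(1, n_steps + 1):
        x = rng.choice(C)                 # uniform p(x_j) over the subset C_k
        a_n = 1.0 / n                     # sum a_n diverges, sum a_n^2 converges
        s = sum(wi * xi for wi, xi in zip(w, x))
        if s < 1.0 - eps:                 # below the dual hyperplane: move up
            w = [wi + a_n * xi for wi, xi in zip(w, x)]
        elif s > 1.0 + eps:               # above it: move down
            w = [wi - a_n * xi for wi, xi in zip(w, x)]
        # otherwise w is inside the margin and stays unchanged
    return w

# Collinear points on the line x_1 = 1; their dual hyperplanes all pass
# through w* = [1, 0], which zeroes the collinearity criterion.
C_k = [[1.0, -1.0], [1.0, 0.0], [1.0, 2.0]]
w = learn(C_k, eps=0.05)
```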
THEOREM 3: If the learning sequence {x(n)} (33) is generated in accordance with the stationary probability distribution p(x_j) (32), then the sequence {w(n)} of the weight vectors w(n) generated by the learning algorithm (37), (36) converges to the optimal vector w* which constitutes the minimal value F_k^1(w_k^*) (19) of the criterion function F_k^1(w) (18). The proof of this theorem can be given using theorems of the stochastic approximation method (Kushner & Yin, 1997).

Experimental results
The compared basis exchange algorithm (30) and the iterative learning algorithm (37) were implemented in C++ and run on an Intel Core i7 CPU at 1.73 GHz (turbo boost to 2.93 GHz).
Three synthetic data sets D_1, D_2, and D_3, each with a different level of outliers, were created for the computational experiments. The term outliers means here additional feature vectors x_j which were not located on the vertexical line l(x_j(1), x_j(2)) (13). The outlier feature vectors x_j were generated in accordance with the normal distribution N_2(0, I) with the unit covariance matrix I in the two-dimensional feature space R^2. The level of outliers was set as follows: D_1 -> 33%, D_2 -> 50%, and D_3 -> 66%. One of these data sets is presented in Figure 2. Figure 3 shows the results of the learning process with the iterative learning algorithm as the distance ρ(w(n), w_k^*) between the result w(n) (37) during the nth step and the optimal solution w_k^* (19), on data sets D_k with different levels of outliers.

Figure 3. Distance ρ(w(n), w_k^*) between the result w(n) during the nth step and the optimal solution w_k^* (19) on the two-dimensional learning sets D_1 (squares), D_2 (circles), and D_3 (triangles) with different levels of outliers (33%, 50%, and 66%). Experiments were performed using the iterative learning algorithm (37).
As shown in these figures, the iterative algorithm converges after about n = 200 steps. However, in the case of a higher level of outliers the solution is worse: the distance between the result and the optimal solution w_k^* (19) is greater. The algorithm was stopped after n = 1000 steps for each data set D_k.
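For reproduction purposes, a data set like D_1 can be sketched as points on a line plus a fraction of N(0, I) outliers; the sizes, the line, and the seed below are our own assumptions, not the authors' exact setup:

```python
# Sketch of generating a synthetic set like D_1-D_3: collinear points in R^2
# plus a chosen fraction of standard-normal outliers.

import random

def make_data(n_line=40, outlier_frac=0.33, seed=1):
    rng = random.Random(seed)
    # collinear part: points on the line x_1 = 1 (dual vertex w* = [1, 0])
    data = [[1.0, rng.uniform(-3.0, 3.0)] for _ in range(n_line)]
    # outliers drawn from N_2(0, I), sized to reach the requested fraction
    n_out = int(n_line * outlier_frac / (1.0 - outlier_frac))
    data += [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n_out)]
    rng.shuffle(data)
    return data

D1 = make_data(outlier_frac=0.33)   # ~33% outliers, as for D_1
```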
As a comparison, Figure 4 presents the results of the basis exchange algorithm; the plot shows the distance ρ(w(n), w_k^*) between the result w(n) during the nth step and the optimal solution w_k^*, on data sets D_k with different levels of outliers.
As shown in this figure, the solution was found in each case, for data with different levels of outliers, and the distance ρ(w(n), w_k^*) between the result w(n) during the nth step and the optimal solution w_k^* (19) was equal to zero. At the beginning of the execution (in the first few steps) the basis exchange algorithm gave worse results than the iterative algorithm, but after a few steps, and at the end, the basis exchange algorithm gives better results. Moreover, the iterative algorithm takes longer (both in the number of iterations and in execution time) than the basis exchange algorithm.
The next experiments were related to the role of the margin parameter ε. The iterative learning algorithm (37) and the basis exchange algorithm (30) with different values of the parameter ε were tested on the data set shown in Figure 5.
The results of the learning process based on the iterative learning algorithm (37) are presented in Figure 6. The algorithm (37) was stopped when the condition 1 − ε ≤ w(n)^T x(n) ≤ 1 + ε was reached.
These figures show that the iterative learning algorithm converges after about 20 steps. The solution is worse, and the distance between the result and the defined solution is greater, for greater values of the tested parameter ε. When a margin greater than zero (ε > 0) is selected, the algorithm ends faster than without any margin (ε = 0).

Figure 4. Distance ρ(w(n), w_k^*) between the result w(n) during the nth step and the optimal solution w_k^* (19) defined on the two-dimensional learning data sets D_1 (squares), D_2 (circles), and D_3 (triangles) with different levels of outliers. Experiments were performed using the basis exchange algorithm.
Similarly, Figure 7 presents the results of the basis exchange algorithm: the distance ρ(w(n), w_k^*) between the result w(n) during the nth step and the optimal solution w_k^* (19) defined on the two-dimensional learning data set D_4 with different values of the margin ε.

Figure 6. Distance ρ(w(n), w_k^*) between the result w(n) during the nth step and the optimal solution w_k^* (19) defined on the two-dimensional learning data set D_4 with different values of the margin ε (ε = 0 represented by squares, ε = 0.5 × δ by circles, and ε = δ by triangles). Experiments were performed using the iterative learning algorithm (37).
The final values of the criterion function F_k^1(w) (18) for the basis exchange algorithm are equal to zero or near this value. It gives better results and works faster than the iterative algorithm. The computation ends earlier for the margin ε > 0. Figures 8-10 show how the result depends on the value of the margin ε: with an increase in the margin ε, more feature vectors lie in the solution plane, and in the case of ε = 0.1 most of them are covered by the result.

Figure 7. The distance ρ(w(n), w_k^*) between the result w(n) during the nth step and the optimal solution w_k^* (19) defined on the two-dimensional learning data set D_4 with different values of the parameter ε (ε = 0 represented by squares, ε = 0.5 × δ by circles, and ε = δ by triangles). Experiments were performed using the basis exchange algorithm.

Concluding remarks
A better understanding of large, multivariate data sets can be achieved through the identification of collinear (flat) patterns F_k (14). Extraction of collinear patterns F_k can be performed through the minimization of the collinearity criterion function F_k^1(w) (18) defined on various data subsets C_k (C_k ⊂ C (1)).
Both the basis exchange algorithm (30) and the iterative learning algorithm (37) can be used in the minimization of the collinearity criterion function F_k^1(w) (18). The usability of both types of algorithms was tested experimentally; selected results of these experiments are shown in the figures. Particular attention in the article was paid to the role of the margin ε (16) in the extraction of flat patterns F_k (14). A margin greater than zero (ε > 0) allows us to discover not only collinear patterns F_k as subsets of points x_j located on vertexical planes P_k(x_j(1), … , x_j(r_k)) (11), but also flat patterns located in layers L_ε′(w, θ) (20) of different thicknesses. Noise reduction may be based on such layered aggregation of data subsets C_k. Further capacity in this area should be achieved by applying the relaxed linear separability method of feature subset selection (Bobrowski & Łukaszuk, 2011).
The proposed method of discovering collinear patterns on the basis of the CPL criterion functions F_k^1(w) (18) can be treated as a generalization of the methods based on the Hough transformation. Hough transformation techniques are used in computer vision for detecting lines and curves in pictures (Ballard, 1981; Duda & Hart, 1972).

Disclosure statement
No potential conflict of interest was reported by the authors.

Notes on contributors
Professor Leon Bobrowski is the head of the Software Department at the Faculty of Computer Science, Bialystok University of Technology. Additionally, he works in the Laboratory of Biomedical Data Analysis at the Institute of Biocybernetics and Biomedical Engineering of the Polish Academy of Sciences in Warsaw. The research interests of Leon Bobrowski include specific methods of data mining, pattern recognition, and medical diagnosis support systems which are based on the minimization of the convex and piecewise linear (CPL) criterion functions defined on data sets. The basis exchange algorithms, which are similar to linear programming, have been developed and implemented; they allow one to find the minimum of the CPL functions efficiently, even in the case of large, multidimensional data sets. This approach is used for designing medical diagnosis support systems (Hepar), hierarchical neural networks, multivariate decision trees, and visualizing transformations. The most recent research topics involve designing prognostic models by using concepts of ranked regression, interval regression, and the relaxed linear separability (RLS) method of feature (gene) subset selection. His teaching experience relates to statistical models and algorithms, multivariate data analysis, decision support systems, and exploratory analysis of large data sets. Since 1994, Professor Leon Bobrowski has been a co-organizer of eleven seminars "Statistics and Clinical Practice" held in Warsaw in the framework of the International Centre of Biocybernetics (ICB) of the Polish Academy of Sciences.
Paweł Zabielski was born in 1988 in Białystok (Poland). He studied at the Faculty of Computer Science at the Bialystok University of Technology, where he obtained a bachelor degree in Computer Science in 2010 and a Master degree in Computer Science in 2011, both with distinction.
Paweł Zabielski has been working at the Faculty of Computer Science at the Białystok Technical University since 1st March 2011, at first as a research assistant and, from 1st October 2011, as an assistant professor. The research interests of Paweł Zabielski include methods of processing and analysis of biomedical data sets and applications to medical diagnosis support. He works on developing methods of knowledge discovery based on artificial neural network models.