Basis exchange and learning algorithms for extracting collinear patterns

ABSTRACT Understanding large data sets is one of the most important and challenging problems of the modern era. Exploration of genetic data sets composed of high-dimensional feature vectors can be treated as a leading example in this context. A better understanding of large, multivariate data sets can be achieved through exploration and extraction of their structure. Collinear patterns can be an important part of a given data set's structure. A collinear (flat) pattern exists in a given set of feature vectors when many of these vectors are located on (or near) some plane in the feature space. Discovered flat patterns can reflect various types of interactions in an explored data set. This paper compares basis exchange algorithms with learning algorithms in the task of flat pattern extraction.


Introduction
Large data sets might contain very useful information. Models and computational tools for extracting new knowledge from large data sets are constantly being developed within pattern recognition and data mining methods (Duda et al., 2001; Hand et al., 2001). The contemporary problem of understanding large data sets is crucial for many practical issues. A better understanding of genetic data sets composed of high-dimensional feature vectors is a leading example in this context (Wong, 1973).
We assume that explored data sets are composed of feature vectors which represent particular objects (events, phenomena) in a standardized way in a particular feature space. Data mining tools are used for discovering useful patterns in large data sets (Bobrowski, 2005). The term pattern denotes a subset (cluster) of feature vectors characterized by a certain type of regularity (Duda et al., 2001). More specifically, the term collinear (flat) pattern means a large number of feature vectors situated on some plane in the feature space.
Flat patterns can be extracted from large data sets through minimization of a special type of convex and piecewise linear (CPL) criterion function (Bobrowski, 2005). Basis exchange algorithms make it possible to efficiently find the global minimum of the CPL criterion function even in the case of large, multidimensional data sets. Learning algorithms provide another possibility of minimizing CPL functions (Bobrowski & Zabielski, 2016).
A comparison of a selected learning algorithm with a basis exchange algorithm in the task of discovering flat patterns is demonstrated in this paper. The results of the experimental comparison of these two types of algorithms are discussed.

Dual hyperplanes and vertices in the parameter space
Let us consider the data set C composed of m feature vectors x_j = [x_j,1, …, x_j,n]^T:

C = {x_j}, where j = 1, …, m (1)

Feature vectors x_j belong to the n-dimensional feature space F[n] (x_j ∈ F[n]). Components x_j,i of the j-th feature vector x_j can be treated as numerical results of n standardized examinations of the j-th object O_j, where x_j,i ∈ R or x_j,i ∈ {0, 1}. Each of the m feature vectors x_j from the set C (1) defines the following dual hyperplane h_j^1 in the parameter space R^n (Bobrowski, 2014):

h_j^1 = {w: w^T x_j = 1} (2)

where w = [w_1, …, w_n]^T is the parameter (weight) vector (w ∈ R^n).
Unit vectors e_i = [0, …, 1, …, 0]^T define the below hyperplanes h_i^0 in the n-dimensional parameter space R^n (Bobrowski, 2005):

h_i^0 = {w: w^T e_i = 0} (3)

Consider the set S_k of n linearly independent feature vectors x_j (1) and unit vectors e_i:

S_k = {x_j: j ∈ J_k} ∪ {e_i: i ∈ I_k} (4)

The set S_k contains r_k feature vectors x_j (j ∈ J_k) and n − r_k unit vectors e_i (i ∈ I_k). The k-th vertex w_k = [w_k,1, …, w_k,n]^T of the rank r_k in the parameter space R^n is defined on the basis of the set S_k (4) as the intersection point of the r_k hyperplanes h_j^1 (2) defined by the feature vectors x_j (j ∈ J_k) and the n − r_k hyperplanes h_i^0 (3) defined by the unit vectors e_i (i ∈ I_k). The vertex w_k can be defined by the following linear equations:

(∀j ∈ J_k) w_k^T x_j = 1 (5)

and

(∀i ∈ I_k) w_k^T e_i = 0 (6)

The equations (5) and (6) can be represented as the matrix equation:

B_k w_k = 1′ = [1, …, 1, 0, …, 0]^T (7)

where the square, nonsingular matrix B_k is the k-th basis linked to the vertex w_k:

B_k = [x_j(1), …, x_j(rk), e_i(rk+1), …, e_i(n)]^T (8)

The feature vector x_j(l) (j(l) ∈ J_k) constitutes the l-th row of the matrix B_k. The vertex w_k can be computed from (7) in the below manner:

w_k = B_k^−1 1′ (9)

The below three definitions are taken from the work (Bobrowski, 2014):

Definition 2.1: The rank r_k (1 ≤ r_k ≤ n) of the k-th vertex w_k = [w_k,1, …, w_k,n]^T (9) is defined as the number of the non-zero components w_k,i (w_k,i ≠ 0).
Definition 2.2: The degree of degeneration d_k of the vertex w_k (9) of the rank r_k is defined as the number d_k = m_k − r_k, where m_k is the number of such feature vectors x_j from the set C (1) which define the hyperplanes h_j^1 (2) passing through this vertex (w_k^T x_j = 1). The vertex w_k (9) is degenerated if the degree of degeneration d_k is greater than zero (d_k > 0).

Definition 2.3: The basis B_k (8) is regular if and only if the number n − r_k of the unit vectors e_i (i ∈ I_k (6)) in the matrix B_k is equal to the number of such components w_k,i of the vector w_k = [w_k,1, …, w_k,n]^T (9) which are equal to zero (w_k,i = 0).
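As a sketch of how a vertex can be obtained from its basis (9), the fragment below solves the matrix equation B_k w_k = 1′ for a small, hypothetical basis in R^3 (two feature vectors plus the unit vector e_3). The Gaussian-elimination helper and all numbers are our own illustration, not the authors' implementation.

```python
# Illustrative sketch: computing the vertex w_k = B_k^{-1} 1' (9)
# for a hypothetical basis in R^3 with r_k = 2 feature vectors
# x_j(1) = (2, 1, 0), x_j(2) = (1, 2, 0) and one unit vector e_3.

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Rows of the basis B_k (8): feature vectors first, then unit vectors.
B_k = [[2, 1, 0],
       [1, 2, 0],
       [0, 0, 1]]
# Right-hand side 1' (7): 1 for each feature-vector row, 0 for each e_i row.
rhs = [1, 1, 0]

w_k = solve(B_k, rhs)                          # vertex, here (1/3, 1/3, 0)
rank = sum(1 for w in w_k if abs(w) > 1e-12)   # rank r_k (Def. 2.1)
```

With one unit vector e_3 in the basis and exactly one zero component w_k,3 in the resulting vertex, this basis is regular in the sense described above.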

Vertexical planes and collinear patterns in feature space
The hyperplane H(w, θ) in the feature space F[n] is defined as follows (Bobrowski, 2005):

H(w, θ) = {x: w^T x = θ} (10)

where w is the weight vector (w ∈ R^n) and θ is the threshold (θ ∈ R^1). The k-th vertex w_k (9) of the rank r_k (r_k > 1) (Def. 2.1) allows the (r_k − 1)-dimensional vertexical plane P_k(x_j(1), …, x_j(rk)) to be defined in the feature space F[n] as the linear combination of r_k linearly independent feature vectors x_j(i) (j(i) ∈ J_k) (5) from the set S_k (4) (Bobrowski, 2014):

P_k(x_j(1), …, x_j(rk)) = {x: x = α_1 x_j(1) + … + α_rk x_j(rk)} (11)

where j(i) ∈ J_k (5) and the parameters α_i (α_i ∈ R^1) satisfy the following condition:

α_1 + … + α_rk = 1 (12)

Two linearly independent vectors x_j(1) and x_j(2) from the set C (1) support the below straight line l(x_j(1), x_j(2)) in the feature space F[n] (x ∈ F[n]):

l(x_j(1), x_j(2)) = {x: x = (1 − α) x_j(1) + α x_j(2)} (13)

where α ∈ R^1. The straight line l(x_j(1), x_j(2)) (13) can be treated as the vertexical plane P_k(x_j(1), x_j(2)) (11) spanned by two supporting vectors x_j(1) and x_j(2) with α_1 = 1 − α and α_2 = α. In this case, the regular basis B_k (8) contains only two basis feature vectors x_j(1) and x_j(2) (r_k = 2) and n − 2 unit vectors e_i (i ∈ I_k). As a result, the vertex w_k = [w_k,1, …, w_k,n]^T (9) contains only two nonzero components w_k,i (w_k,i ≠ 0).
Lemma 3.1: The vertexical plane P_k(x_j(1), …, x_j(rk)) (11) based on the vertex w_k (9) with the rank r_k greater than 1 (r_k > 1) is equal to the hyperplane H(w_k, 1) (10) in the n-dimensional feature space F[n] defined by this vertex.
Theorem 3.1: The j-th feature vector x_j (x_j ∈ C (1)) is located on the vertexical plane P_k(x_j(1), …, x_j(rk)) (11) if and only if the j-th dual hyperplane h_j^1 (2) passes through the vertex w_k (9) of the rank r_k.
The proofs of the above lemma and theorem were provided in the article (Bobrowski, 2014).
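Theorem 3.1 can be checked numerically: points on the vertexical plane (here, the straight line (13) through two supporting vectors) satisfy w_k^T x = 1, while a point off the plane does not. The vectors below are hypothetical illustrations, not data from the paper.

```python
# Hypothetical 2-D example: basis vectors x1 = (2, 1), x2 = (1, 2)
# give the vertex w_k = (1/3, 1/3), since w_k^T x1 = w_k^T x2 = 1.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x1, x2 = (2.0, 1.0), (1.0, 2.0)
w_k = (1.0 / 3.0, 1.0 / 3.0)

def on_plane(x, w, tol=1e-9):
    """Theorem 3.1: x lies on P_k  <=>  the dual hyperplane h_j^1
    of x passes through the vertex, i.e. w_k^T x = 1."""
    return abs(dot(w, x) - 1.0) <= tol

# Every affine combination (1 - a) x1 + a x2 (13) stays on the plane ...
line_ok = all(
    on_plane(tuple((1 - a) * u + a * v for u, v in zip(x1, x2)), w_k)
    for a in (-1.0, 0.0, 0.5, 2.0)
)
# ... while a point off the line is rejected.
off_ok = not on_plane((0.0, 0.0), w_k)
```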
Definition 3.4: The collinear (flat) pattern F_k of the dimension r_k − 1 in the feature space F[n] is formed by m_k (m_k > r_k) feature vectors x_j located on the vertexical plane P_k(x_j(1), …, x_j(rk)) (11). Lemma 3.1 shows that the flat pattern F_k is based on the vertex w_k (9):

F_k = {x_j ∈ C: w_k^T x_j = 1} (14)

The flat pattern F_k of the dimension r_k − 1 is composed of such m_k feature vectors x_j that the dual hyperplanes h_j^1 (2) pass through the vertex w_k (9) of the rank r_k. The k-th vertex w_k = [w_k,1, …, w_k,n]^T (9) has n − r_k components w_k,i equal to zero:

(∀i ∈ I_k) w_k,i = 0 (15)

Convex and piecewise linear (CPL) collinearity functions
Let us define the collinearity penalty functions φ_j^ε(w) linked to feature vectors x_j (1):

φ_j^ε(w) = 0, if |w^T x_j − 1| ≤ ε
φ_j^ε(w) = |w^T x_j − 1| − ε, if |w^T x_j − 1| > ε (16)

Remark 4.1: The collinearity penalty function φ_j^ε(w) (16) is equal to zero if and only if the distance between the point w and the dual hyperplane h_j^1 (2) is not greater than ε / ||x_j||:

φ_j^ε(w) = 0 ⇔ δ(w, h_j^1) ≤ ε / ||x_j|| (17)

The k-th collinearity criterion function Φ_k^ε(w) is defined here as the sum of the penalty functions φ_j^ε(w) (16) linked to the feature vectors x_j from the subset C_k (C_k ⊂ C (1)):

Φ_k^ε(w) = Σ_{j ∈ J_k} β_j φ_j^ε(w) (18)

where J_k = {j: x_j ∈ C_k ⊂ C (1)} and the positive parameters β_j (β_j > 0) can be treated as the prices of particular feature vectors x_j. The standard choice of the parameter values β_j is one ((∀j ∈ {1, …, m}) β_j = 1.0). The optimal vertex w_k* constitutes the minimal value of the criterion function Φ_k^ε(w):

Φ_k^ε(w_k*) = min_w Φ_k^ε(w) (19)
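The penalty function (16) and the criterion function (18) translate directly into code. A minimal sketch, with our own function names and the standard choice β_j = 1:

```python
def phi(x, w, eps):
    """Collinearity penalty (16): zero inside the eps-band around
    the dual hyperplane w^T x = 1, linear outside it."""
    residual = abs(sum(wi * xi for wi, xi in zip(w, x)) - 1.0)
    return max(0.0, residual - eps)

def criterion(C_k, w, eps, beta=None):
    """Collinearity criterion Phi_k^eps (18): weighted sum of penalties."""
    beta = beta or [1.0] * len(C_k)
    return sum(b * phi(x, w, eps) for b, x in zip(beta, C_k))

# Hypothetical subset: three vectors on the plane w^T x = 1 for
# w = (1/3, 1/3), plus one outlier at the origin.
w = (1.0 / 3.0, 1.0 / 3.0)
C_k = [(2.0, 1.0), (1.0, 2.0), (1.5, 1.5), (0.0, 0.0)]

phi_outlier = phi((0.0, 0.0), w, eps=0.0)   # |0 - 1| - 0 = 1
total = criterion(C_k, w, eps=0.0)          # only the outlier contributes
```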

Theorem 4.1: If all the feature vectors x_j from the subset C_k (C_k ⊂ C (1)) are located inside the layer L_ε(w, θ) = {x: θ − ε ≤ w^T x ≤ θ + ε} (20) with θ ≠ 0, then there exists such a vertex w_k* (9) that the minimal value of the collinearity criterion function Φ_k^ε′(w) (18) with the margin ε′ = ε / |θ| is equal to zero (Φ_k^ε′(w_k*) = 0).

Proof:
If the vector x_j from the subset C_k is located inside the layer L_ε(w, θ) (20), then

θ − ε ≤ w^T x_j ≤ θ + ε (21)

Let us suppose that θ > 0. Then

1 − ε/θ ≤ (w/θ)^T x_j ≤ 1 + ε/θ (22)

In accordance with the definition (16) of the collinearity penalty function φ_j^ε(w):

φ_j^(ε/θ)(w/θ) = 0 (23)

Because all the feature vectors x_j from the subset C_k are located inside the layer L_ε(w, θ) (20) with θ > 0, then (18):

Φ_k^(ε/θ)(w/θ) = 0 (24)

The last equation means that the minimal value of the collinearity criterion function Φ_k^ε′(w) (18) with the margin ε′ = ε / θ is equal to zero at the point w* = w / θ. Let us suppose now that θ < 0. By dividing both sides of the inequality (21) by the negative number θ we get the below result:

1 + ε/θ ≥ (w/θ)^T x_j ≥ 1 − ε/θ (25)

As a result, we obtain the below equation similar to (24):

Φ_k^(ε/|θ|)(w/θ) = 0 (26)

The minimum of the convex and piecewise linear criterion function Φ_k^ε′(w) (18) can be found in one of the vertices w_k (9) (Simonnard, 1966). So, taking into account the equation (26), we infer that such a vertex w_k* (9) exists for which Φ_k^(ε/|θ|)(w_k*) = 0.
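The scaling step in the proof above can be verified numerically: vectors inside a layer L_ε(w, θ) incur zero penalty (16) once w is divided by θ and the margin by |θ|. All numbers below are hypothetical.

```python
def phi(x, w, eps):
    """Collinearity penalty (16)."""
    residual = abs(sum(wi * xi for wi, xi in zip(w, x)) - 1.0)
    return max(0.0, residual - eps)

# Hypothetical layer L_eps(w, theta) with w = (1, 1), theta = 2, eps = 0.4:
# every x below satisfies theta - eps <= w^T x <= theta + eps.
w, theta, eps = (1.0, 1.0), 2.0, 0.4
layer = [(1.0, 1.0), (0.8, 1.0), (1.2, 1.2), (0.7, 1.0)]

# Rescaled point w* = w / theta and margin eps' = eps / |theta|,
# mirroring the division by theta in the proof: each penalty vanishes.
w_star = tuple(wi / theta for wi in w)
eps_prime = eps / abs(theta)
penalties = [phi(x, w_star, eps_prime) for x in layer]
```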

Basis exchange algorithm
The basis exchange algorithm was initially proposed and developed as an efficient tool for designing linear classifiers and examining the linear separability of large, multidimensional data sets (Bobrowski, 2005). The first version of the basis exchange algorithm was described in the papers (Bobrowski, 1991; Bobrowski & Niemiro, 1984). The first basis exchange algorithms were aimed at minimizing the perceptron criterion function. The convex and piecewise linear (CPL) perceptron criterion function is linked to the linear separability concept, which is fundamental in the perceptron theory of neural networks (Duda et al., 2001). The CPL criterion functions were later specified for a variety of goals. Different types of basis exchange algorithms were proposed for particular CPL criterion functions (Bobrowski, 2005).
The basis exchange algorithms are based on the Gauss-Jordan transformation. The famous Simplex algorithm used in linear programming is also based on this transformation (Simonnard, 1966). A brief description of the basis exchange algorithm aimed at the minimization of the function Φ_k^ε(w) (18) is presented below. Let us assume that the k-th basis B_k (8) is the square matrix composed of n linearly independent feature vectors x_j(i) from the subset C_k (C_k ⊂ C (1)):

B_k = [x_j(1), …, x_j(n)]^T (27)

The inverse matrix B_k^−1 during the k-th stage can be represented in the below manner:

B_k^−1 = [r_1(k), …, r_n(k)] (28)

The vectors x_j(i) and r_i′(k) fulfil the below equations:

(∀i ∈ {1, …, n}) x_j(i)^T r_i(k) = 1

and

(∀i′ ≠ i) x_j(i)^T r_i′(k) = 0 (29)

During the k-th stage of the algorithm the basis B_k (27) is changed into the basis B_k+1. The matrix B_k+1 is created from the matrix B_k (27) by replacing the l-th row x_j(l) with a new vector x_k taken from the given data subset C_k (C_k ⊂ C (1)).
The exchange of the l-th basis vector x_j(l) for the new vector x_k results in the new basis B_k+1 (27) and the new inverse matrix B_k+1^−1 = [r_1(k+1), …, r_n(k+1)] (28). The Gauss-Jordan transformation allows the columns r_i(k+1) of the matrix B_k+1^−1 to be computed efficiently on the basis of the columns r_i(k) (28) of the inverse matrix B_k^−1, where k = 0, 1, …, K (Bobrowski, 1991; Bobrowski & Niemiro, 1984):

r_l(k+1) = r_l(k) / (x_k^T r_l(k)), and
(∀i ≠ l) r_i(k+1) = r_i(k) − (x_k^T r_i(k)) r_l(k+1) (30)

The basis exchange algorithms are based on the Gauss-Jordan transformation (30). The index l of the vector x_j(l) which leaves the basis B_k (27) during the k-th stage is determined by the exit criterion of a particular basis exchange algorithm. The entry criterion determines which vector x_k from the data subset C_k (1) enters the new basis B_k+1 (27). The stop criterion allows the final stage K of the basis exchange algorithm to be determined. The selected criteria (exit, entry, and stop) should ensure that the collinearity function Φ_k^ε(w) (18) decreases during each stage k.
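The column-update step of the Gauss-Jordan transformation can be sketched on a tiny example. Starting from the identity basis in R^2 and exchanging its first row for a hypothetical new vector x_k = (2, 1), the updated columns of the inverse can be compared against the directly computed inverse of the new basis. This is our own minimal sketch, not the authors' implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def exchange(r_cols, x_new, l):
    """Gauss-Jordan update (30) of the inverse-matrix columns after
    the l-th basis row is replaced by the vector x_new."""
    d = dot(x_new, r_cols[l])          # pivot x_k^T r_l(k); must be nonzero
    r_l_new = [ri / d for ri in r_cols[l]]
    new_cols = []
    for i, r_i in enumerate(r_cols):
        if i == l:
            new_cols.append(r_l_new)
        else:
            c = dot(x_new, r_i)
            new_cols.append([a - c * b for a, b in zip(r_i, r_l_new)])
    return new_cols

# Basis B_0 = identity, so its inverse columns are the unit vectors.
r = [[1.0, 0.0], [0.0, 1.0]]
# Replace row 0 by x_k = (2, 1): the new basis is [[2, 1], [0, 1]],
# whose inverse is [[0.5, -0.5], [0, 1]] (columns (0.5, 0) and (-0.5, 1)).
r_new = exchange(r, (2.0, 1.0), 0)
```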
The exchange of the k-th basis B_k (27) into the basis B_k+1 causes a move from the vertex w_k to the vertex w_k+1 (9). After a finite number K of steps the optimal vertex w_k* (19) is reached in this way (Bobrowski, 2005).

Learning algorithm
The optimal vertex w_k* (9) which constitutes the minimum Φ_k^ε(w_k*) (19) of the collinearity criterion function Φ_k^ε(w) (18) can be approximated using learning algorithms (Kushner & Yin, 1997). The Robbins-Monro procedure of stochastic approximation has been used by us in designing a learning algorithm aimed at the minimization of the criterion function Φ_k^ε(w) (18) (Bobrowski & Zabielski, 2016). For this purpose it is useful to represent the criterion function Φ_k^ε(w) (18) as the regression function Φ_k(w) with the uniform probability distribution p(x_j) (Tsypkin, 1973):

Φ_k(w) = Σ_{j ∈ J_k} p(x_j) φ^ε(x_j; w) (31)

where φ^ε(x_j; w) = φ_j^ε(w) (16), and p(x_j) is the uniform probability distribution defined on the m_k elements x_j of the k-th data subset C_k (C_k ⊂ C (1)) as follows:

(∀x_j ∈ C_k) p(x_j) = 1 / m_k (32)

The uniform probability distribution p(x_j) (32) was used to generate the below learning sequence {x(l)} from the m_k elements x_j of the k-th data subset C_k:

x(1), x(2), …, x(n), … (33)
The elements x(l) of the above learning sequence were generated in each step l independently in accordance with the probability distribution p(x_j) (32):

(∀x_j ∈ C_k) P{x(l) = x_j} = p(x_j) = 1 / m_k (34)

where P{x(l) = x_j} is the probability of obtaining the j-th vector x_j during the l-th learning step.
The Robbins-Monro procedure aimed at the minimization of the regression function Φ_k(w) (31) can be given in the below manner (Tsypkin, 1973):

w(l + 1) = w(l) − α_l ∇_w φ^ε(x(l); w(l)) (35)

where w(l) is the value of the weight vector during the l-th learning step, and {α_l} is a sequence of decreasing, positive parameters α_l, e.g. α_l = 1/l, which fulfil the below conditions:

Σ_l α_l = ∞ and Σ_l α_l² < ∞ (36)

The Robbins-Monro procedure (35) applied to the regression function Φ_k(w) (31) with the penalty functions φ^ε(x_j; w) = φ_j^ε(w) (16) allows the below learning algorithm to be specified:

if w(l)^T x(l) < 1 − ε, then w(l + 1) = w(l) + α_l x(l),
if w(l)^T x(l) > 1 + ε, then w(l + 1) = w(l) − α_l x(l), and
otherwise w(l + 1) = w(l) (37)

where x(l) = x_j if the feature vector x_j was generated (34) during the l-th learning step.

Figure 3. Distance ρ(w(n), w_k*) between the result w(n) during the n-th step and the optimal solution w_k* (19) on the two-dimensional learning sets D_1 (squares), D_2 (circles) and D_3 (triangles) with different levels of outliers (33%, 50%, and 66% respectively). The experiments were carried out using the iterative learning algorithm (35).
The learning algorithm (37) can be treated as a modification of the famous error-correction algorithm used in the Perceptron (Duda et al., 2001).

Theorem 7.3: If the learning sequence {x(l)} (33) is generated in accordance with the stationary probability distribution p(x_j) (34), then the sequence {w(l)} of the weight vectors w(l) generated by the learning algorithm (37) converges to the optimal vector w_k* which constitutes the minimal value Φ_k^ε(w_k*) (19) of the criterion function Φ_k^ε(w) (18).
The proof of this theorem can be given using theorems of the stochastic approximation method (Kushner & Yin, 1997;Tsypkin, 1973).
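The update rule (37) is simple to sketch in code. In the fragment below the learning sequence cycles deterministically through a small, hypothetical set of collinear points (the paper samples them randomly with the uniform distribution (32); cycling is used here only for reproducibility), with α_l = 1/l and a margin ε = 0.1.

```python
# Sketch of the learning algorithm (37) with alpha_l = 1/l.
# Hypothetical data: four points on the line w*^T x = 1 for w* = (1, 1).
points = [(0.0, 1.0), (1.0, 0.0), (2.0, -1.0), (-1.0, 2.0)]
eps = 0.1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

w = [0.0, 0.0]
for l in range(1, 2001):
    x = points[(l - 1) % len(points)]     # deterministic learning sequence
    alpha = 1.0 / l
    activation = dot(w, x)
    if activation < 1.0 - eps:            # below the band: move towards h_j^1
        w = [wi + alpha * xi for wi, xi in zip(w, x)]
    elif activation > 1.0 + eps:          # above the band: move away
        w = [wi - alpha * xi for wi, xi in zip(w, x)]
    # otherwise: inside the eps-band, no correction (37)

residuals = [abs(dot(w, x) - 1.0) for x in points]
```

After the loop every residual |w^T x_j − 1| falls within (or very close to) the margin ε, i.e. each dual hyperplane h_j^1 passes near the learned weight vector.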

Experimental results
The compared basis exchange algorithm (30) and the iterative learning algorithm (35) were implemented in C++ and run on an Intel Core i7, 1.73 GHz processor (turbo boost to 2.93 GHz).

Figure 4. Distance ρ(w(n), w_k*) between the result w(n) during the n-th step and the optimal solution w_k* (19) defined on the two-dimensional learning data sets D_1 (squares), D_2 (circles) and D_3 (triangles) with different levels of outliers. Experiments were performed using the basis exchange algorithm (30).
For the computational experiments, three synthetic data sets D_1, D_2 and D_3 were created, each with a different level of outliers. The term outliers here denotes such additional feature vectors x_j which were not located on the line l(x_j(1), x_j(2)) (13). The outlier feature vectors x_j were generated in the two-dimensional feature space R^2 according to the normal distribution N_2(0, I) with the unit covariance matrix I. The levels of outliers were established as follows: D_1 - 33%, D_2 - 50% and D_3 - 66%. One of these data sets is shown in Figure 2. Figure 3 shows the results of the learning process with the iterative learning algorithm as the distance ρ(w(n), w_k*) between the result w(n) (35) during the n-th step and the optimal solution w_k* (19) defined on the data sets D_k with different levels of outliers.
As shown in Figure 3, the iterative algorithm converges after approximately n = 200 steps. However, in the case of a higher level of outliers, the solution is worse and the distance between the result and the optimal solution w_k* (19) is greater. The algorithm ends after n = 1000 steps in the case of each data set D_k.
For comparison, Figure 4 shows the results of the basis exchange algorithm, where the plot demonstrates the distance ρ(w(n), w_k*) between the result w(n) during the n-th step and the optimal solution w_k* (19).
As shown in this figure, a solution has been found in each case for data with different levels of outliers. The distances ρ(w(n), w_k*) between the result w(n) during the n-th step and the optimal solution w_k* (19) became equal to zero.
The basis exchange algorithm performed worse in the first few steps than the iterative algorithm, but after a few steps the basis exchange algorithm improved and achieved better results than the iterative algorithm. Moreover, the iterative algorithm needs more time to find a solution than the basis exchange algorithm.

Figure 5. The two-dimensional learning set D_4 with the margin δ = 0.1.
The next experiments concerned the role of the margin parameter ε (16). The iterative learning algorithm (35) and the basis exchange algorithm (30) with different values of the parameter ε were tested on the data set shown in Figure 5.
The results of the learning process based on the iterative learning algorithm (35) are presented in Figure 6. The algorithm (35) was stopped when the condition 1 − ε ≤ w(n)^T x(n) ≤ 1 + ε was reached.
These results show that the iterative learning algorithm converges after about 20 steps. The solution becomes worse for higher values of the examined parameter ε. If a margin greater than zero (ε > 0) is selected, the algorithm ends faster than in the case of the algorithm without a margin (ε = 0).
Similarly, Figure 7 shows the results of the basis exchange algorithm (30) with different values of the margin ε (16).
The final values of the criterion function Φ_k^ε(w) (18) for the basis exchange algorithm are equal to or close to zero. The basis exchange algorithm gives better results and runs faster than the iterative algorithm. The computation ends earlier for a margin ε greater than zero. Figures 8-10 show the dependence of the result on the value of the margin ε. As the value of the margin ε increases, more feature vectors x_j (1) fall inside the solution layer L_ε(w, θ) (20).

Figure 6. Distance ρ(w(n), w_k*) between the result w(n) during the n-th step and the optimal solution w_k* (19) defined on the two-dimensional learning data set D_4 with different margins ε (ε = 0 represented by squares, ε = 0.5δ represented by circles, ε = δ represented by triangles). Experiments were carried out using the iterative learning algorithm (35).
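The observation that a larger margin pulls more vectors into the solution layer can be illustrated directly: counting, for a fixed weight vector, how many points satisfy the layer condition as ε grows. The points and residuals below are hypothetical, not the paper's data sets.

```python
def in_layer(x, w, theta, eps):
    """Membership in the layer L_eps(w, theta) (20):
    theta - eps <= w^T x <= theta + eps."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return theta - eps <= s <= theta + eps

# Hypothetical 2-D points with residuals 0, 0, 0.05, 0.3 and 0.6
# relative to the plane w^T x = 1 for w = (1, 1).
w, theta = (1.0, 1.0), 1.0
points = [(0.5, 0.5), (0.0, 1.0), (0.55, 0.5), (0.7, 0.6), (0.2, 0.2)]

# The count of vectors inside the layer grows monotonically with eps.
counts = [sum(in_layer(x, w, theta, eps) for x in points)
          for eps in (0.0, 0.1, 0.4, 0.7)]
```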

Concluding remarks
A better understanding of large, multivariate data sets can be achieved through identification of collinear (flat) patterns F k (14). Extraction of collinear patterns F k (14) can be performed through the minimization of the collinear criterion function Φ k ε (w) (18) defined on various data subsets C k (C k ⊂ C (1)).
Both the basis exchange algorithm (30) and the iterative learning algorithm (35) make it possible to minimize the collinearity criterion function Φ_k^ε(w) (18) and, in consequence, to extract flat patterns F_k (14).

Figure 7. The distance ρ(w(n), w_k*) between the result w(n) during the n-th step and the optimal solution w_k* (19) defined on the two-dimensional learning data set D_4 with different values of the parameter ε (ε = 0 represented by squares, ε = 0.5δ represented by circles, ε = δ represented by triangles). Experiments were carried out using the basis exchange algorithm (30).
Particular attention in this article was paid to the role of the margin ε (16) in the extraction of flat patterns F_k (14). A margin greater than zero (ε > 0) allows the discovery not only of collinear patterns F_k (14) as subsets of points x_j located on vertexical planes P_k(x_j(1), …, x_j(rk)) (11), but also of flat patterns located in layers L_ε′(w, θ) (20) of different thicknesses. Noise reduction may be based on such layered aggregation of data subsets C_k. Further capability in this area could be achieved by applying the RLS method of feature subset selection (Bobrowski & Łukaszuk, 2011).
The proposed method of discovering collinear patterns on the basis of the CPL criterion functions Φ_k^ε(w) (18) can be treated as a generalization of the methods based on the Hough transformation. The Hough transformation techniques are used in computer vision for the detection of lines and curves in pictures (Ballard, 1981; Duda & Hart, 1972).

Disclosure statement
No potential conflict of interest was reported by the author(s).

Notes on contributors
Professor Leon Bobrowski works at the Computer Science Faculty of Bialystok University of Technology and, additionally, in the Laboratory of Biomedical Data Analysis at the Institute of Biocybernetics and Biomedical Engineering of the Polish Academy of Sciences in Warsaw. His research interests include data exploration, bioinformatics, and machine learning. He is developing data mining techniques based on the minimization of the convex and piecewise linear (CPL) criterion functions. The CPL criterion functions are linked to the concept of the linear separability of large, multivariate data sets and to the perceptron model of neural networks. The relaxed linear separability (RLS) method of feature subset selection has been developed on the basis of the CPL functions and has been used for searching for specific genes. The basis exchange algorithms have been developed and are used for efficient and precise minimization of the CPL criterion functions. These algorithms are based on the Gauss-Jordan transformation and, for this reason, are similar to the Simplex algorithm from linear programming.

Figure 10. The result for the data set D_4 with the value of the parameter ε = δ.
Paweł Zabielski graduated from the Faculty of Computer Science of the Białystok University of Technology in Poland with an excellent result. He teaches the following subjects: basics of programming, object-oriented programming, programming applications in JavaScript, programming, algorithms and data structures, advanced programming techniques, building applications in WPF technology, and analysis and testing of information systems. Under the supervision of Professor Leon Bobrowski, he started research work related to the detection of flat patterns. His research is focused on the analysis of large data sets and the detection of certain patterns. He develops methods and algorithms for acquiring knowledge. The results of his work have been published in scientific articles.