Feature Selection for Supervised Learning and Compression

ABSTRACT Supervised feature selection aims to find the signals that best predict a target variable. Typical approaches use measures of correlation or similarity, as seen in filter methods, or predictive power in learned models, as seen in wrapper methods. In both approaches, the selected features often have high entropies and are not suitable for compression. This is a particular drawback in the automotive domain where fast communication and archival of vehicle telemetry data is increasingly important, especially with technologies such as V2V and V2X (vehicle-to-vehicle and vehicle-to-everything communication). This paper aims to select features with good predictive performances and good compression by introducing a compressibility factor into several existing feature selection approaches. Where appropriate, performance guarantees are provided for greedy searches based on monotonicity and submodularity. Using the language of entropy, the relationship between relevance, redundancy, and compressibility is discussed from the perspective of signal selection. The approaches are then demonstrated in selecting features from a vehicle Controller Area Network for use in SVMs in a regression task, namely predicting fuel consumption, and a classification task, namely identifying Points of Interest. We show that while predictive performance is slightly lower when compression is considered, the compressibility of the selected features is significantly improved.


Introduction
Cities are becoming increasingly smart, with more sensors producing larger amounts of data. Sensors are deployed in all aspects of city infrastructure, and in particular, they are necessary for intelligent transportation systems. Collecting, storing, and analyzing the data from these sensors presents a huge challenge for researchers. Vehicles increasingly communicate with other vehicles (V2V) and traffic management systems (V2X) to enhance both efficiency and safety (Harding et al. 2014; Weiß 2011). For example, it is possible for one vehicle to broadcast that its brakes have been applied to warn others that they are slowing, reducing the reliance on a human driver seeing the brake light.
Modern roadside infrastructure is also laden with sensors that measure traffic flow, pollution, and weather among other things and use transmitters to provide useful information for vehicles and their drivers (Guerrero-Ibáñez, Zeadally, and Contreras-Castillo 2018). For example, traffic flow beginning to slow in one section of a highway may be eased by lowering the speed limit temporarily in another, reducing the level of traffic at the bottleneck. These examples all require large amounts of telemetry data to be communicated over networks. Existing communications infrastructure cannot satisfy the required bandwidth, necessitating new communication protocols and novel approaches to data analysis.
Supervised machine learning aims to build models that map inputs to output predictions for a target variable (Witten et al. 2016). For example, an automotive manufacturer may wish to learn a model that estimates whether or not a vehicle is likely to have a fault in the near future. Such a model may take input data from on-board vehicular sensors but also require data from many vehicles to learn a model with sufficient accuracy. Similarly, a highways agency may wish to locate pot-holes or predict traffic jams using road sensors and vehicle telemetry data. The number of possible input features in such domains is large, and computationally, it is typically impractical to consider them all. Moreover, many features may be irrelevant to the target variable or redundant with respect to other features. Both irrelevance and redundancy can increase model complexity or cause incorrect mappings to be learned, which in turn leads to lower predictive performance. To overcome this, supervised feature selection can be used to find a subset of the features that are most related to the target variable (i.e., high relevance) but least related to each other (i.e., low redundancy) (Peng, Long, and Ding 2005) or have the best performance when combined in a model (Aboudi and Benhlima 2016; Guyon and Elisseeff 2003). However, feature selection approaches generally have a bias for features that carry more information about a target variable, as these typically are the most relevant. This high information content often coincides with high entropies and poor compression, meaning that these signals are likely to be expensive to communicate and difficult to store in data centers. This paper considers compression as a cost in the feature selection process using ideas from cost-based feature selection (Bolón-Canedo et al. 2014) and aims to choose features with good predictive performance as well as high compression.
In particular, the paper:
• introduces a compressibility factor into a variety of filter and wrapper approaches to feature selection,
• performs an entropy analysis of introducing the compressibility factor into feature selection schemes based on mutual information (MI), and
• assesses the selection of compressible features from vehicle telemetry data in a regression and a classification task.
Specifically, a compressibility factor is introduced into the maximum relevance (MR), minimal redundancy maximal relevance (MRMR) (Peng, Long, and Ding 2005), and goodness of fit (GOF) (Das and Kempe 2011) filter approaches, and into the wrapper approach (Kohavi and John 1997) using linear regression (LRr), logistic regression (LCc), and support vector machines (SVM). Relevant performance guarantees for greedy searches, based on monotonicity, submodularity, or weak submodularity, are provided for each of the filter methods (Nemhauser, Wolsey, and Fisher 1978). Although such theoretical guarantees do not extend to wrapper approaches, related assurances can be derived from those of model performance measures (May, Dandy, and Maier 2011). As such, wrapper methods typically achieve good performance when a suitable evaluation is used (Kohavi and John 1997). The proposed methods are then demonstrated in selecting features to predict instantaneous fuel consumption (FC) and to detect points of interest (PoIs) (Van Hinsbergh et al. 2018) using the telemetry signals available via a vehicle controller area network (CAN).
The remainder of this paper is structured as follows. Section 2 discusses related work in feature selection and compression. Section 3 details the performance measures and selection algorithms, along with any relevant performance guarantees. This is followed by an entropy analysis of MR, MRMR, and compressibility in Section 4. The selection methods are then demonstrated on collected vehicle telemetry data in Section 5. Finally, Section 6 concludes this paper and outlines future work.

Related Work
Feature selection aims to find a subset of all possible features to reduce their number while still sufficiently describing the data with respect to a particular task (Chandrashekar and Sahin 2014). Unsupervised feature selection generally aims to find the features that best differentiate the samples (Dy and Brodley 2004; Mitra, Murthy, and Sankar 2002). Typically, this relies on transforming the feature space, assessing clustering properties, or other discriminatory properties measured using heuristics (Alelyani, Tang, and Liu 2013).
In supervised learning, which is the focus of this paper, the three main approaches to feature selection are embedded, wrapper, and filter methods (Guyon and Elisseeff 2003; Jain and Singh 2018). Embedded methods perform feature selection as part of the learning algorithm (Lal et al. 2006). For example, in decision tree induction, the choice of variable on which to split nodes can be seen as feature selection. The nodes used, and their associated features, are then the selected features. Trees in random forests may also be removed if their estimated performance is poor, which may mean poor-performing features are never used in the learned model. Wrapper methods assess feature subsets via the performances of models that use them as inputs (Kohavi and John 1997). This is done by comparing the performances of models learned using the same process and training samples but with different sets of input features. Wrapper methods are typically computationally expensive, as they require a machine learning algorithm to build a model for each feature subset evaluation. Filter methods, on the other hand, generally assess the performances of feature subsets using heuristics that are typically less computationally expensive. Some commonly used heuristics include similarity measures such as Pearson's correlation coefficient (PCC) or MI, which can be used to estimate both the relevance and redundancy of a feature set (Jain and Singh 2018; Witten et al. 2016).
In general, selecting up to l features from a set of features, X, can be represented as an optimization problem (Kumar and Minz 2014),

argmax_{S ⊆ X, |S| ≤ l, l ≤ |X|} P(S, y),   (1)

where P(S, y) estimates the performance of a subset of these features, S, with respect to predicting the target variable, y. Ideally, all possible subsets would be evaluated to find the feature set that provides the highest performance, but the number of feature subsets of size l is the binomial coefficient (|X| choose l), and this exhaustive search is infeasible. In practice, therefore, a more efficient combinatorial search algorithm is applied. Possibly the most common search strategy is the forward greedy search (Kumar and Minz 2014), which selects the feature, s_k ∈ X∖S_{k−1}, that maximizes the marginal gain,

P(S_{k−1} ∪ {s_k}, y) − P(S_{k−1}, y),   (2)

in each of k = 1, 2, …, l iterations. The search begins with no selected features, S_0 = ∅, and so initially the feature with the highest individual performance is selected. It is typical that during this first iteration, P(∅, y) = 0. In the subsequent l − 1 iterations, P(S_{k−1}, y) is the performance of the already selected features, and the feature that increases the performance most is selected. The search stops when a stopping criterion is met, such as when a given number of features are selected or if the performance score decreases after selecting a new feature.
If the performance measure is both monotone (i.e., non-decreasing with |S|), P(S, y) ≤ P(S′, y) for all S ⊆ S′ ⊆ X, and submodular (i.e., has diminishing returns), P(S ∪ {x}, y) − P(S, y) ≥ P(S′ ∪ {x}, y) − P(S′, y) for all S ⊆ S′ ⊆ X and x ∈ X∖S′, then the solution of the forward greedy search is within 1 − e^(−1) (≈ 63%) of the optimal. The consequences of submodularity are also used in the lazy greedy search algorithm (as described in Section 3.6), which maintains a sorted list of upper bounds on the marginal gains to reduce the number of evaluations of P(S, y) (Minoux 1978). If the performance measure is submodular but non-monotone, alternative greedy algorithms with different performance guarantees can be used (Buchbinder et al. 2015; Feige, Mirrokni, and Vondrak 2011), and the submodularity ratio (Das and Kempe 2018) can be used for set functions that are monotone but not submodular. Two advances to the greedy algorithm are the distributed greedy and stochastic greedy searches (Khanna et al. 2017; Mirzasoleiman et al. 2015). In each of these, multiple searches are applied to different subsets of the full feature set. In distributed greedy search, the subsets are generated by sampling the features without replacement, whereas in stochastic greedy search, the sampling is performed with replacement. The results of the multiple searches are combined into an interim result set, on which another greedy search is applied to find the final result. Alternatively, the feature subsets can be combined by taking their intersection (Nagpal and Singh 2018).
Other optimization approaches include particle swarm optimization (PSO), which iteratively updates candidate vectors in a population based on both local and global information (Kennedy and Eberhart 1995). For feature selection using binary PSO, a candidate vector is a probabilistic representation of each feature being selected, which is collapsed onto either 0 or 1 for feature set evaluation using a random threshold (Chuang et al. 2008). Similarly, brain storm optimization iteratively clusters candidates, generates new ones, and selects those with the best performance for the next iteration (Papa et al. 2018). Differential evolution approaches also iteratively create new populations through a process of mutation, crossover, and selection (Zhang et al. 2020).
Feature selection is often characterized as a compression method, as it reduces the data that must be processed by machine learning algorithms and models. This has several advantages, including reducing model complexity and improving performance, but selected features often have high variances and entropies that typically coincide with good predictive performance (Guyon and Elisseeff 2003). This bias toward high entropy features limits the data compression that can be achieved with those that are selected. It may be the case, particularly with the high redundancies found with vehicle telemetry and city data, that some features with lower entropies and better compression may provide comparable predictive performances (Taylor, Griffiths, and Mouzakitis 2018).
To achieve both of these competing aims, namely selecting high-quality features that have good compression, this paper adopts ideas from cost-based feature selection. A simple approach is to introduce a cost factor, C(S), into the optimization problem (Equation 1) as follows (Bolón-Canedo et al. 2014),

argmax_{S ⊆ X, |S| ≤ l} P(S, y) − ω · C(S),

where ω is a cost weighting parameter. A similar approach is to sum the costs of individual features to estimate the cost of the feature set (Jiang, Kong, and Li 2021). Costs of features can arise from observing or measuring them (e.g., in using expensive medical machinery) or from computational costs (e.g., in extracting features from large images). Other approaches include searching for a Pareto-optimal solution, where no criterion can improve without worsening another (e.g., Hancer, Xue, and Zhang 2018; Zhang et al. 2019, 2020). Typically, such multiobjective optimization approaches focus on the competing aims of minimizing the number of selected features vs. maximizing predictive performance, but Hancer, Xue, and Zhang (2018) demonstrated the approach for multiple feature performance measures (including MI relevance and redundancy). While such approaches may be applicable to the objectives of predictive performance and compression, they do not enable an investigation of the trade-off between them. As such, this paper employs a cost-weighting parameter that enables full control over the relative importance of compression, measured using a compressibility factor outlined in Section 3.1.
Data compression aims to remove redundancy, thus making the data representation smaller, in such a way that it can be restored (Salomon and Motta 2010). Characteristics of city and vehicle telemetry data, including temporal consistency, noise, and signal redundancy, all support good compression. While signal redundancy is removed in performing feature selection, noise and temporal consistency are not removed and must be considered using other kinds of data compression. There are two broad categories of data compression, namely lossless and lossy.
Lossless compression aims to compress the data in such a way that the decompressed version is indistinguishable from the original (Salomon and Motta 2010). Typically, lossless compression inspects the frequencies of symbols and looks for repeating symbols or sequences of symbols in the data stream. Perhaps the simplest method of compression is run-length encoding, in which symbols are encoded along with their number of consecutive repetitions. For example, the string 'AAAABBA' can be encoded as 'A4B2A1.' Two other notable lossless compression algorithms are LZ77 dictionary encoding (Ziv and Lempel 1977) and Huffman coding (Huffman 1952). LZ77 uses a sliding window and searches for repeating sequences, each of which is encoded as its length and the location of its earlier occurrence in the window.
Huffman coding produces a variable-length prefix code defining the path to the encoded symbol in a Huffman tree. Symbols that occur with higher frequencies are located closer to the root node in the tree and thus have shorter Huffman codes. Taken together, LZ77 and Huffman coding make up the DEFLATE compression algorithm (Deutsch 1996), which is the basis of the ZIP file format.
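As an illustrative sketch, the run-length encoder below reproduces the 'AAAABBA' example, and DEFLATE itself is available through Python's standard zlib module (run_length_encode is our own helper name, not a library function).

```python
import zlib

def run_length_encode(s: str) -> str:
    """Encode runs of repeated symbols as symbol + count: 'AAAABBA' -> 'A4B2A1'."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                       # extend the current run
        out.append(f"{s[i]}{j - i}")     # emit symbol and run length
        i = j
    return "".join(out)

print(run_length_encode("AAAABBA"))  # A4B2A1

# DEFLATE (LZ77 + Huffman coding) via zlib, as used by the ZIP format:
# repetitive telemetry-like data shrinks dramatically.
data = b"AAAABBA" * 1000
print(len(data), "->", len(zlib.compress(data)))
```

Run-length encoding is only effective on long runs; DEFLATE also exploits repeated sequences and skewed symbol frequencies, which is why it is used for the compressibility measurements later in the paper.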
While lossless compression guarantees that the decompressed stream is the same as the original, lossy compression relaxes this constraint and aims only to minimize information loss. In particular, lossy compression aims to keep information where it is important and lose information where its loss will not be noticed. In MP3 audio compression, for example, the frequencies above the human hearing range are removed. For vehicle telemetry data, similar components of the signals can be removed if they are not useful to further analysis. Some signals, such as vehicle speed, contain noise that may even be detrimental to analyses. Approaches to separating the informative and noise components of signals include the discrete wavelet transform (DWT) (Addison 2017), the discrete Fourier transform (DFT) (Bailey and Swarztrauber 1994), and the discrete cosine transform (DCT) (Ahmed, Natarajan, and Rao 1974).
The DWT operates by extracting two signals that are each half the length of the original (Addison 2017). The first represents an approximation of the original signal and is referred to as the low frequency (LF) component. The second is the high frequency (HF) component and is a representation of the detail in the original signal. The LF and HF components are produced by convolving the original signal with wavelet kernels, followed by downsampling by a factor of 2. Typically, the HF component contains many small values and can be considered the noise component of the original signal. By quantizing this component and applying a threshold, many of these small values become zero and can be encoded very efficiently using lossless compression methods such as run-length encoding. This quantization introduces errors into the signal when it is reconstructed, but the error is minimized because only the HF (detail) component is affected. This process can be applied recursively to the LF component, further increasing the potential for lossless compression of the coefficients. This paper focuses primarily on lossless compression, but the methods are also applicable to lossy compression approaches such as the DWT.
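A minimal single-level sketch of this decomposition, using the Haar wavelet in NumPy (a production codec would use longer wavelet kernels, quantize more carefully, and entropy-code the thresholded coefficients):

```python
import numpy as np

def haar_dwt(x):
    """One DWT level with the Haar kernel: convolve and downsample by 2,
    giving an LF (approximation) and an HF (detail) component, each half
    the length of the input."""
    x = np.asarray(x, dtype=float)
    lf = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    hf = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return lf, hf

def haar_idwt(lf, hf):
    """Invert one Haar DWT level, reconstructing the full-length signal."""
    x = np.empty(2 * len(lf))
    x[0::2] = (lf + hf) / np.sqrt(2.0)
    x[1::2] = (lf - hf) / np.sqrt(2.0)
    return x

signal = np.array([2.0, 2.1, 6.0, 5.9, 5.0, 1.0, 0.1, 0.0])
lf, hf = haar_dwt(signal)
hf[np.abs(hf) < 0.2] = 0.0   # threshold small detail coefficients to zero
# The zeroed HF run now compresses well losslessly; any reconstruction error
# is confined to the discarded detail.
approx = haar_idwt(lf, hf)
```

Applying haar_dwt again to lf gives the recursive decomposition described above.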

Feature Selection
To consider the compression of features as a cost during their selection (Bolón-Canedo et al. 2014), a compressibility factor is introduced into the feature selection process alongside the performance metric, P(S, y), of a feature set, S. This compressibility factor is detailed in Section 3.1, followed by its introduction to filter methods in Sections 3.2-3.4, and the wrapper method in Section 3.5. These feature selection approaches were chosen because of their simplicity and existing adoption in the literature and practice. Their range also demonstrates that compression can be considered generally in feature selection. Finally, Section 3.6 describes the forward greedy and lazy forward greedy searches used to find the subset of features that maximizes the performance measure, P(S, y).

Compressibility Factor
The compressibility factor, Com(S), is defined as the mean compressibility over a set of features, S,

Com(S) = (1 / |S|) Σ_{x_i ∈ S} β(x_i),

where β(x_i) is the compressibility of the individual feature, x_i. The compressibility of an individual feature can typically be assessed using entropy, but this is difficult to compute for vehicle telemetry signals, which are not independent and identically distributed. We therefore adopt a more direct approach and measure the space saving achieved by a compression algorithm on training samples,

β(x_i) = 1 − B_comp(x_i) / B_orig(x_i),

where B_comp is the number of bits in the compressed data and B_orig is the number of bits in the original. This measure is larger for variables that are more easily compressed than for those that do not compress well, and it typically has a range of [0, 1]. Therefore, this compressibility measure should be maximized to minimize the impact of poor feature compression. Due to the pigeon-hole principle, a lossless compression algorithm may increase the number of bits in the data in some circumstances, leading to β(x) taking a negative value. To mitigate this, a compression algorithm can output the original data if its size would be increased, prepending a single bit to signify whether or not this is the case. If the bit has a value of 1, all subsequent data is equal to the original (i.e., it was not compressed); otherwise, the subsequent data was compressed by the algorithm and should be decompressed. Therefore, in general, β(x) has a range of [−ε, 1], where ε = 1 / B_orig(x). In the simulations performed for this paper, the compressibility is measured using DEFLATE for all features, which in the worst case increases the required storage by at most three bits per block. Different features are more suited to different compression methods, however, so it may be beneficial to employ a compression strategy targeted toward the particular features being considered.
For example, DEFLATE may be applied to some features and DWT to others, or groups of features may be considered together in a more comprehensive compression strategy. These approaches incur higher computational costs during feature selection, however, compared to applying the same compression method to each feature individually. We therefore leave the development of a more sophisticated estimate for the impact of poor compression to future work.
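A minimal measurement of the compressibility factor using Python's zlib (DEFLATE) can be sketched as follows, under the assumption that each feature arrives as a raw byte string; byte counts stand in for bit counts, which leaves the ratio unchanged. The names beta and com are our own stand-ins for β(·) and Com(·).

```python
import zlib
import numpy as np

def beta(x: bytes) -> float:
    """Space saving of DEFLATE on one feature: 1 - B_comp(x) / B_orig(x)."""
    return 1.0 - len(zlib.compress(x)) / len(x)

def com(features) -> float:
    """Compressibility factor Com(S): the mean space saving over the set."""
    return float(np.mean([beta(x) for x in features]))

smooth = bytes(10_000)  # constant signal: highly compressible, beta near 1
noisy = np.random.default_rng(0).integers(0, 256, 10_000, dtype=np.uint8).tobytes()
print(round(beta(smooth), 3), round(beta(noisy), 3))
```

On the incompressible noise, DEFLATE's framing overhead can push beta slightly below zero, which is exactly the [−ε, 1] behavior discussed above.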

Maximum Relevance (MR)
To assess the relevance of a feature set, the individual feature relevancies can be aggregated as a mean,

P_MR(S, y) = (1 / |S|) Σ_{x_i ∈ S} I(x_i; y),

where I(x_i; y) is the MI between a feature, x_i, and the target, y. Unfortunately, the mean is not monotone and is not submodular in general. To overcome this, MR is redefined as a summation, which is both monotone and modular,

P_ΣMR(S, y) = Σ_{x_i ∈ S} I(x_i; y).

In a forward greedy search, this is equivalent to selecting the feature, s_k, that maximizes I(s_k; y) in each iteration.
Proof 1. I(x; y) ≥ 0, and any summation of a set of non-negative numbers is monotone and modular.
Theorem 2. The final set, S, found using a stable forward greedy search is the same when using either the mean of individual signal performances, P_μ(S, y) = P_MR(S, y), or their summation, P_Σ(S, y) = P_ΣMR(S, y).
Corollary 1. Using a stable forward greedy search to maximize the summation, P_Σ(S, y), or the mean, P_μ(S, y), yields the same optimal result for MR.
To maximize the compression of selected features, the compressibility factor must also be maximized. The combination of MR and the compressibility factor (MR+C) can be defined as,

P_MR+C(S, y) = P_MR(S, y) + ω_com · Com(S),

where the weighting parameter, ω_com, allows the level to which compression is considered during the selection process to be varied. The smaller the value of ω_com, the less compression is considered. In particular, a value of ω_com = 0 means that MR+C is equivalent to MR selection. As with MR, MR+C can be written as a monotone and modular summation, which results in the same optimal final set as P_MR+C(·) when using a stable forward greedy search (see Proof 2). Such a forward greedy search selects the feature, s_k, that maximizes,

I(s_k; y) + ω_com · β(s_k),

in each iteration.
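A sketch of MR+C selection, with MI estimated from a two-dimensional histogram (an assumption for illustration; any MI estimator could be substituted). Because the summation form is modular, the greedy search reduces to ranking features by their individual scores. The names mutual_information and mr_plus_c_select are our own.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(x; y) in nats for two continuous signals."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                       # joint distribution over bins
    px = pxy.sum(axis=1, keepdims=True)    # marginals
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mr_plus_c_select(X, y, betas, k, w_com=0.5):
    """MR+C: the score I(x_i; y) + w_com * beta(x_i) is modular, so greedy
    selection is simply the top-k features by individual score."""
    scores = np.array([mutual_information(X[:, i], y) + w_com * betas[i]
                       for i in range(X.shape[1])])
    return sorted(np.argsort(scores)[-k:].tolist())

# Toy illustration: feature 0 tracks y closely, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.normal(size=400)
X = np.column_stack([y + 0.1 * rng.normal(size=400), rng.normal(size=400)])
print(mr_plus_c_select(X, y, betas=[0.0, 0.0], k=1))  # -> [0]
```

Raising w_com shifts the ranking toward more compressible features, trading relevance for compressibility exactly as the ω_com weighting above describes.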

Minimum Redundancy Maximum Relevance
The redundancy of a feature set can be assessed as the mean of all pairwise feature similarities, measured using MI,

Red(S) = (1 / |S|²) Σ_{x_i, x_j ∈ S} I(x_i; x_j),

which, in MRMR, is to be minimized in the performance function, for example,

P_MRMR(S, y) = P_MR(S, y) − Red(S).

As with MR, MRMR can be written as a summation, which is equivalent to selecting the feature, s_k, that maximizes,

I(s_k; y) − (1 / |S_{k−1}|) Σ_{x_i ∈ S_{k−1}} I(s_k; x_i),

in each iteration of a stable forward greedy search. In this form, MRMR is neither submodular nor monotone. Having said this, the MRMR criterion is equivalent to the max-dependency criterion when one feature is selected at a time (Peng, Long, and Ding 2005), as in forward greedy search. While MRMR aims to minimize redundancy between features and maximize their relevance to a target variable, it does not consider the joint probability distribution of the variables. Max-dependency aims to directly maximize the joint dependency between the features and the target variable (Peng, Long, and Ding 2005). Direct analysis of the joint probability distributions is computationally expensive, however, so MRMR is a useful alternative when using forward greedy search. Furthermore, max-dependency is easily shown to be monotone (i.e., adding an input variable does not decrease the MI dependency with the target). Therefore, MRMR results in the same selected features as max-dependency, which provides an approximation guarantee of 1 − e^(−1).
Finally, the compression factor can again be introduced as,

P_MRMR+C(S, y) = P_MRMR(S, y) + ω_com · Com(S),

which is equivalent to selecting the feature, s_k, that maximizes,

I(s_k; y) − (1 / |S_{k−1}|) Σ_{x_i ∈ S_{k−1}} I(s_k; x_i) + ω_com · β(s_k),

in each iteration of a stable forward greedy search.
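The per-iteration MRMR+C rule can be sketched as follows, again using a histogram MI estimate, with the redundancy term averaged over the already selected features. The names mi and mrmr_plus_c_select are our own illustrative choices.

```python
import numpy as np

def mi(a, b, bins=8):
    """Histogram estimate of mutual information between two signals (nats)."""
    pab, _, _ = np.histogram2d(a, b, bins=bins)
    pab /= pab.sum()
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    nz = pab > 0
    return float((pab[nz] * np.log(pab[nz] / (pa @ pb)[nz])).sum())

def mrmr_plus_c_select(X, y, betas, k, w_com=0.5):
    """Forward greedy MRMR+C: pick the unselected feature maximizing relevance
    minus mean redundancy with the selected set, plus w_com * beta."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        def gain(i):
            red = (np.mean([mi(X[:, i], X[:, j]) for j in selected])
                   if selected else 0.0)
            return mi(X[:, i], y) - red + w_com * betas[i]
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

# Feature 1 duplicates feature 0, so after picking 0 the search prefers the
# independent (if weakly relevant) feature 2 over the redundant copy.
rng = np.random.default_rng(1)
y = rng.normal(size=300)
x0 = y + 0.1 * rng.normal(size=300)
X = np.column_stack([x0, x0, rng.normal(size=300)])
print(mrmr_plus_c_select(X, y, betas=[0.0, 0.0, 0.0], k=2))  # -> [0, 2]
```

The redundancy penalty is what separates this from MR+C: a duplicated feature scores highly on relevance alone but is rejected once a copy is already selected.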

Goodness of Fit
If two features are poor individual predictors (i.e., they each have low relevance to the target) but can be combined in a model to achieve good performance, they should be selected. MR and MRMR are both unable to capture these kinds of feature interactions (although adaptations using conditional MI exist (Herman et al. 2013)). Alternatively, such feature interactions are captured by assessing the GOF, G, of features in a model,

M = L_M(S, y),

produced by fitting the features, S, to the target, y, using the learning algorithm, L_M. Goodness of fit measures how closely the model maps the input data, S, to the target variable, y. In regression tasks, a linear regression model is often used alongside the R² goodness-of-fit measure. In general, any model and GOF measure can be used. For example, in a classification task, a naïve Bayes or SVM classifier can be used alongside the accuracy or the area under the receiver operating characteristic curve (AUC) as a measure of fit (Witten et al. 2016). The performance of a feature set using GOF then has the form,

P_{M,G}(S, y) = G(y, M(S)).

If the measure of GOF is a non-negative monotone set function, an approximation guarantee is provided via the submodularity ratio, γ_{S,l}, which is the minimum ratio of the increase in performance when adding individual elements of one subset, x ∈ M, to another, L, compared to adding all elements of M to L at once (Das and Kempe 2011, 2018). Specifically, with linear regression and R², forward greedy stepwise selection provides a lower bound approximation of 1 − e^(−γ_{S,l}(P)). A special case of this approximation is when the GOF is measured using a submodular function, in which case γ_{S,l} = 1 and the lower bound approximation is 1 − e^(−1). For the regression task investigated, a linear regression model, M_LRr, is used, of the form,

y = Σ_{x_i ∈ S} β_i · x_i + ε,

where β_i is the linear coefficient of feature x_i and ε is the error term. GOF is measured using R²,

R² = 1 − E[(y − M_LRr(S))²] / E[(y − E[y])²],

where E[·] is the expectation, or mean, of the variable.
In this GOF selection for regression (GOFr), therefore, the performance measure is

P_{M_LRr, G_R²}(S, y) = G_R²(y, M_LRr(S)).

For the classification task, binomial logistic models (Cox 1958) are used, which output the likelihood of the target variable having a value of 1. Using this likelihood, the AUC, G_AUC, is calculated as a measure of goodness of fit. Thus, goodness of fit for classification (GOFc) uses the performance measure

P_{M_LCc, G_AUC}(S, y) = G_AUC(y, M_LCc(S)).

While this suffices for the binary classification task used in the evaluation for this paper, performing stepwise selection with multi-class target variables requires a multi-class learning algorithm, such as a decision tree or naïve Bayes. Another possible approach is to transform the target into multiple binary variables and learn multiple binary classification models to be combined in an ensemble (Fürnkranz 2002).
To consider compression in both the regression and classification tasks, the compressibility factor is again added to the performance metrics. The GOFr and compressibility (GOFr+C) performance metric is then

P_{M_LRr, G_R²+C}(S, y) = G_R²(y, M_LRr(S)) + ω_com · Com(S),

and for GOFc and compressibility (GOFc+C), it is

P_{M_LCc, G_AUC+C}(S, y) = G_AUC(y, M_LCc(S)) + ω_com · Com(S).
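A sketch of GOFr(+C) stepwise selection using ordinary least squares and R² (r2_fit and gofr_plus_c_select are our own names; setting w_com = 0 recovers plain GOFr):

```python
import numpy as np

def r2_fit(X_sub, y):
    """R^2 goodness of fit of a least-squares linear model (with intercept)."""
    A = np.column_stack([X_sub, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    tot = y - y.mean()
    return 1.0 - (resid @ resid) / (tot @ tot)

def gofr_plus_c_select(X, y, betas, k, w_com=0.0):
    """Forward greedy GOFr+C: add the feature whose inclusion maximizes
    R^2 plus w_com times the mean compressibility of the chosen set."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        def score(i):
            cols = selected + [i]
            return (r2_fit(X[:, cols], y)
                    + w_com * np.mean([betas[j] for j in cols]))
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Neither x0 nor x1 alone fully explains y, but the pair does; stepwise GOF
# selection picks them both ahead of the irrelevant third feature.
rng = np.random.default_rng(2)
x0, x1, x2 = rng.normal(size=(3, 200))
y = 2.0 * x0 + x1 + 0.05 * rng.normal(size=200)
X = np.column_stack([x0, x1, x2])
print(gofr_plus_c_select(X, y, betas=[0.0, 0.0, 0.0], k=2))  # -> [0, 1]
```

Unlike MR and MRMR, the model refit at every step lets the score reflect how features combine, which is the interaction effect the text describes.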

Wrapper (W)
The wrapper approach to feature selection is similar to GOF, but goodness of fit is replaced by model performance, measured on data not used when fitting the models (Aboudi and Benhlima 2016; Guyon and Elisseeff 2003). As feature selection is performed using the training data, the training data is further split into a selection-training portion, ⟨X_tr, y_tr⟩, used to learn models with different feature sets, and a selection-testing portion, ⟨X_te, y_te⟩, used to assess their performance. It is typical for this selection-testing portion to be referred to as hold-out or validation data. A selection-training portion that is too small may lead to poorly fit models that are not indicative of the performance of the feature set, while one that is too large will leave too few samples in the selection-testing portion for a reliable performance estimate. Alternatively, the training data can be split into several folds for a k-fold cross-validation of each feature set, but of course this increases the computational requirements of the already computationally expensive wrapper approach. In this paper, the temporally first half of the training samples makes up the selection-training portion, and the remainder makes up the selection-testing portion. Any learning algorithm, L(S, y), can be used in a wrapper, although usually it is the same as the model that is to be used on the final set of features. For example, if features are being selected for use in an SVM, then an SVM is the most appropriate method for use within the wrapper during the selection process. A large number of models are typically learned during wrapper selection, and some learning algorithms require a large amount of computation to learn each model. Therefore, the computational requirements of a learning algorithm may also be a consideration when adopting the wrapper selection approach, in particular with large numbers of features from which to select.
In some cases, a two-stage feature selection process may be performed that uses a filter to reduce the number of features significantly before applying a wrapper to reduce the feature set further (Peng, Long, and Ding 2005).
To assess the performance of a set of features, S, a model, M, must first be learned using the selection-training portion, ⟨S_tr, y_tr⟩,

M = L_M(S_tr, y_tr).

This model is then applied to the selection-testing portion, ⟨S_te, y_te⟩, to produce a set of outputs, which are compared to the observed values, y_te, to estimate the performance using a performance measure,

P_{M,P}(S, y) = P(M(S_te), y_te).

Alternatively, this can be written by defining the model outputs using a partial function of the learning algorithm,

P_{M,P}(S, y) = P(L_M(⟨S_tr, y_tr⟩)(S_te), y_te).

Again, compression can be considered by adding the compressibility factor to the performance measure,

P_{M,P+C}(S, y) = P(M(S_te), y_te) + ω_com · Com(S) = P(L_M(⟨S_tr, y_tr⟩)(S_te), y_te) + ω_com · Com(S).

The submodularity of the wrapper approach cannot be guaranteed, in general, because the model is learned and evaluated using different sets of data. If the training data were independent and identically distributed (IID) and sampled in a stratified manner, a performance guarantee similar to that provided by the submodularity ratio could be achieved when using linear regression. Time series data, however, is not IID, and so no such guarantees can be provided for the data used in this paper. In our evaluation, R² is used as the model performance measure for the regression task with both linear regression (i.e., P_{M_LRr, P_R²}), referred to as WLRr, and SVM regression (i.e., P_{M_SVMr, P_R²}), referred to as WSVMr. For the classification task, AUC is used along with logistic regression (i.e., P_{M_LCc, P_AUC}), referred to as WLCc, and SVM classification (i.e., P_{M_SVMc, P_AUC}), referred to as WSVMc.
In all cases, compression is included by adding the compression factor. For linear regression and $R^2$ with compression (WLRr + C), the performance measure is
$$P_{M_{LRr}, P_{R^2}+C}(S, y) = P_{R^2}(M_{LRr}(S_{te}), y_{te}) + \omega_{com} \cdot Com(S),$$
and for SVM regression with compression (WSVMr + C), it is defined as
$$P_{M_{SVMr}, P_{R^2}+C}(S, y) = P_{R^2}(M_{SVMr}(S_{te}), y_{te}) + \omega_{com} \cdot Com(S).$$
The performance measure for logistic regression with compression (WLCc + C) is defined as
$$P_{M_{LCc}, P_{AUC}+C}(S, y) = P_{AUC}(M_{LCc}(S_{te}), y_{te}) + \omega_{com} \cdot Com(S),$$
and finally, for SVM classification with compression (WSVMc + C), it is defined as
$$P_{M_{SVMc}, P_{AUC}+C}(S, y) = P_{AUC}(M_{SVMc}(S_{te}), y_{te}) + \omega_{com} \cdot Com(S).$$
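As a concrete illustration, the wrapper measure with the compressibility term can be sketched as follows. This is a minimal sketch assuming scikit-learn for the model, $R^2$ as the performance measure, and a DEFLATE compression ratio as a stand-in for $Com(S)$; the helper names are ours, not the authors' implementation.

```python
# Sketch of a wrapper performance measure with a compressibility term.
import zlib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def compressibility(S):
    """Com(S): DEFLATE compression ratio of the feature matrix.
    Higher values mean the features compress better."""
    raw = np.ascontiguousarray(S).tobytes()
    return 1.0 - len(zlib.compress(raw, 6)) / len(raw)

def wrapper_performance(S_tr, y_tr, S_te, y_te, w_com=0.1):
    """P_{M,R^2+C}(S, y): fit on the selection-training portion,
    score on the selection-testing portion, add the compression term."""
    model = LinearRegression().fit(S_tr, y_tr)   # M = L_M(S_tr, y_tr)
    return r2_score(y_te, model.predict(S_te)) + w_com * compressibility(S_te)
```

In practice the same skeleton applies to the SVM and logistic-regression wrappers by swapping the model and scoring function.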

Search
In feature selection, the aim is to find the subset of features that maximizes the performance measure, $P(S, y)$. Naively, this is done by evaluating the performance measure for all possible subsets of features, but in practice there are too many subsets for this to be feasible. For instance, there are $\binom{|X|}{k} = \frac{|X|!}{k!(|X| - k)!}$ different subsets of size $k$ when selecting from $|X|$ features. The forward greedy search is outlined in Algorithm 1 and iteratively selects the feature that maximizes the marginal gain in performance until $k$ features have been selected. In each iteration, $P(S, y)$ is computed for each unselected feature, meaning that $\frac{1}{2}k(2|X| - k - 1)$ feature set evaluations are performed when selecting $k$ features from a set of $|X|$ features. While this is usually computationally feasible for filter approaches, the feature set evaluation in the wrapper approach is significantly more expensive, and so a more efficient search is required. Algorithm 2 outlines the lazy forward greedy search algorithm (Minoux 1978). To avoid computing $P(S, y)$ for every unselected feature in each iteration, lazy forward greedy search prioritizes the features with good performance in previous iterations. Specifically, the feature, $s$, with the maximum observed marginal gain, $\Delta_s$, is tentatively selected and evaluated to recompute and update its marginal gain, $\Delta_s = P(S \cup \{s\}, y) - P(S, y)$. It is possible that this new marginal gain is no longer the maximum observed. In this case, the next feature, which now maximizes the marginal gain, is tentatively selected and evaluated. This process is repeated until a feature is tentatively selected for a second time, at which point it is selected and added to $S$. During the first iteration, when $S = \emptyset$, the algorithm requires that $P(\emptyset, y) = 0$. In subsequent iterations, $P(S, y)$ is the performance of the already selected features.
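The lazy forward greedy procedure described above can be sketched with a priority queue of stale marginal gains. This is a minimal sketch, assuming a generic performance function `perf` with `perf([]) == 0`; it is not the paper's exact Algorithm 2.

```python
# Lazy forward greedy search (after Minoux 1978): re-evaluate only the
# feature whose previously observed marginal gain is largest.
import heapq

def lazy_forward_greedy(features, perf, k):
    selected, base = [], 0.0
    # Max-heap (via negation) of observed marginal gains; start optimistic.
    heap = [(-float("inf"), f) for f in features]
    heapq.heapify(heap)
    while len(selected) < k and heap:
        _, f = heapq.heappop(heap)
        # Tentatively select f and recompute its marginal gain.
        gain = perf(selected + [f]) - base
        if not heap or gain >= -heap[0][0]:
            # Still the best observed gain: select f.
            selected.append(f)
            base += gain
        else:
            # No longer the maximum: push back with the updated gain.
            heapq.heappush(heap, (-gain, f))
    return selected
```

For a modular (additive) `perf`, this returns the same set as full forward greedy while evaluating far fewer candidate sets.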
where $H(x)$ and $H(y)$ are the marginal entropies of $x$ and $y$, $H(x, y)$ is their joint entropy, and $H(x|y)$ and $H(y|x)$ are conditional entropies. The marginal entropy, $H(x)$, also provides a lower bound on the number of bits required to represent some data, $x$, i.e., the best compression achievable. Although compression algorithms such as DEFLATE often require a larger number of bits than this bound in practice, $H(x)$ is still a useful theoretical tool for assessing compressibility. In particular, $H(x)$ is lower when $x$ can be represented by fewer bits when compressed than when $x$ is difficult to compress, and thus it has similar properties to the compression ratio. In this section, we use entropies to analyze the relationships between relevance, redundancy, and compression in MR, MR + C, MRMR, and MRMR + C. MR selects the feature that satisfies the condition, $C_{MR}$, in each iteration, which is defined using entropy as
$$C_{MR} = H(y) - H(y|s_k).$$
Similarly, $C_{MR+C}$ is defined as
$$C_{MR+C} = H(y) - H(y|s_k) - \omega_{com} H(s_k),$$
$C_{MRMR}$ as
$$C_{MRMR} = H(y) - H(y|s_k) + \frac{1}{|S|} \sum_{x \in S} H(s_k|x),$$
and $C_{MRMR+C}$ as
$$C_{MRMR+C} = H(y) - H(y|s_k) - \omega_{com} H(s_k) + \frac{1}{|S|} \sum_{x \in S} H(s_k|x). \quad (51)$$
Each of these equations is a linear sum of up to four terms, of which some are maximized (i.e., they are positive) and some are minimized (i.e., they are negative). Table 1 shows the coefficients of these terms in each of the four information-based feature selection methods. While $H(y)$ is always maximized with a coefficient of 1, it can be ignored: the target, $y$, is fixed, so $H(y)$ is constant throughout the selection process. In all the equations relevance is maximized, represented in each equation by $-H(y|s_k)$ (i.e., the conditional entropy of $y$ given $s_k$ is minimized). Intuitively, this is minimizing the unknown quantity of $y$ given the information in $s_k$, represented by the intersections of $s_1$ and $s_2$ with $y$ in the Venn diagram in Figure 1(a).
In this example, $s_1$ has a larger intersection with $y$ than $s_2$ (i.e., more of $y$ is known given $s_1$ than given $s_2$).
In MRMR and MRMR + C, redundancy is minimized using the term $\frac{1}{|S|} \sum_{x \in S} H(s_k|x)$ (i.e., the mean conditional entropy of $s_k$ given the already selected features is maximized). This term aims to minimize the information overlap, or redundancy, of the selected features. Figure 1(b) shows three variables, $s_1$, $s_2$, and $s_3$, each with the same size of intersection (relevance) with the target variable, $y$. After selecting $s_1$, $s_2$ is preferred over $s_3$ because its intersection (redundancy) with the already selected features, $S = \{s_1\}$, is smaller.
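For discrete features, these criteria can be computed directly from empirical entropies. The following is a minimal sketch assuming integer-coded features; the helper names are illustrative, not the authors' implementation.

```python
# Empirical entropy helpers and the C_{MRMR+C} criterion for discrete data.
import numpy as np

def entropy(*cols):
    """(Joint) entropy in bits of one or more discrete columns."""
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def cond_entropy(a, b):
    """H(a|b) = H(a, b) - H(b)."""
    return entropy(a, b) - entropy(b)

def c_mrmr_c(s_k, y, S, w_com):
    """C_{MRMR+C}: relevance, minus the compressibility penalty,
    plus the mean conditional entropy against already selected features."""
    score = entropy(y) - cond_entropy(y, s_k) - w_com * entropy(s_k)
    if S:
        score += np.mean([cond_entropy(s_k, x) for x in S])
    return score
```

Setting `w_com = 0` and `S = []` recovers the plain MR relevance term, $H(y) - H(y|s_k)$.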
Finally, in MR + C and MRMR + C, compressibility is maximized by $-\omega_{com} H(s_k)$ (i.e., a portion of the marginal entropy of $s_k$ is minimized). This effectively minimizes the amount of information a feature carries. The Venn diagram in Figure 1(c) shows two variables, $s_1$ and $s_2$, with equal overlaps with the target variable (relevance) and no redundancy. The smaller set, $s_1$, representing lower entropy, is more compressible than the larger set, $s_2$, representing higher entropy, and so in this case $s_1$ is preferred over $s_2$.

Table 1. The quantities of entropy that are maximized in each of the entropy-based criteria.

The effect of $\omega_{com}$ is demonstrated using simulated data in Figure 2, for which 100 samples were generated, with the first 50 samples having a target value of 1 and the remaining 50 samples having a target value of 0. Nine features were then generated, each with fifty zero values and fifty non-zero values, so as to have different relevancies, redundancies, and entropies. In the first set of three features, the non-zero values were at samples 5 to 55; in the second set of three features, the non-zero values were at samples 10 to 60; and in the final set of three features, they were at samples 15 to 65. This ensures that the features have different relevancies to the target and different redundancies to one another. In each set, the non-zero values of the first feature were either 1 or 2, in the second feature they were either 1, 2, or 3, and in the third feature they were either 1, 2, 3, or 4, each chosen uniformly at random. In general, the more values a feature takes, the more entropy it has.
Figure 2(a) shows the mean MR and MR + C (with $\omega_{com} = 0.1$) scores over ten generated datasets. The features with the most non-zero values when $y = 1$ (features 1, 2, and 3) had the highest relevancies (measured using MR), whereas those with the fewest non-zero values when $y = 1$ (features 7, 8, and 9) had the lowest relevancies. Features with the same set of non-zero samples but that took more values also had higher MR (e.g., feature 3 had a higher relevancy than feature 1), even though they contain the same information about the target variable, $y$. With MR + C, those features that had more values but the same set of non-zero values were considered of lower importance, since the entropy of the features is also considered. Figure 2(b) shows the same results for MRMR and MRMR + C (with $\omega_{com} = 0.1$). Redundancy for each feature was computed as its total MI with the other eight features. Here, MRMR also prefers features with fewer values, as they have less chance of having values in common. This is, however, not true in general, and $\omega_{com}$ still had an effect in reducing the MRMR + C scores of the features that took more values. In fact, a feature with more non-zero samples when $y = 1$ and higher relevance (feature 6) had a slightly lower score than a feature with fewer non-zero samples (feature 7) once compression was considered. This means that a feature with lower predictive performance may be selected over a feature that may provide better predictions, but to the benefit of compression.

Results
In this section, the results for the feature selection approaches described in Section 3 are provided. The results were obtained using the Location Extraction Dataset (LED) (Van Hinsbergh et al. 2018). The LED consists of over 1900 vehicle telemetry signals collected over 72 journeys in an urban environment while performing various pick-up and drop-off scenarios. Data was sampled at 10 Hz; the mean length of each journey was 19.7 minutes, the standard deviation of journey lengths was 8.2 minutes, and the range was 29.1 minutes. All signals with names containing the strings 'time' and 'minutes' were removed prior to any feature selection or model learning, as they were found to be detrimental to the results due to each sample having a unique value.
One regression task, namely estimating the instantaneous Fuel Consumption (FC), and one classification task, namely detecting Points of Interest (PoI), were constructed using the LED. For FC, all signals containing the strings 'fuel' or 'torq' were removed from the data, because these signals can be used to achieve high performances in simple models using a single feature. In both tasks, the training data was made up of the samples from ten journeys sampled randomly without replacement, and the testing data was made up of ten different journeys, again sampled randomly without replacement. The data was split temporally into equally sized blocks of samples, from which the mean (if continuous) or modal (if categorical) value was extracted. The block size for the FC task was 10 samples (1 second), and for the PoI task a block size of 50 samples (5 seconds) was used. These block sizes were chosen because they provided the most interesting trade-off between predictive performance and feature compression, and provided training and testing datasets with a mean number of samples of about 12000 for the FC task and 5200 for the PoI task.
Signals were selected from the training data using incremental search, maximizing the performance measures listed in Table 2. MR þ C and MRMR þ C are applied to both the regression (FC) and classification (PoI) tasks, whereas GOFr þ C, WLRr þ C, and WSVMr þ C are used for the regression task only, and GOFc þ C, WLCc þ C, and WSVMc þ C are used for the classification task only. In both tasks, each performance measure was applied in both forward greedy and lazy forward greedy searches, and the number of feature set evaluations (applications of the performance measure) was recorded.
For the MR + C and MRMR + C methods, the variables were discretised in order to estimate the integrals when computing MI (Equation 43). Specifically, the features in both tasks were discretised by learning a decision tree for each input feature using Gini's impurity as the split criterion (Breiman 2017) and taking the cut points of the nodes in the tree as discretisation intervals. To obtain a discrete target variable in the regression task, Matlab's histcounts function was used to find suitable discretisation bins. Following feature selection, the original continuous features and target variables were used in model learning and evaluation.
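The tree-based discretisation can be sketched as follows, using scikit-learn's decision tree in place of the implementation described in the text; the function and parameter names here are ours, and a discrete (or pre-binned) target is assumed.

```python
# Discretise a continuous feature by the split thresholds of a shallow
# Gini decision tree fit on that feature alone.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discretize_feature(x, y, max_leaf_nodes=8):
    """Fit a tree predicting y from the single feature x, and use the
    thresholds of its internal nodes as discretisation cut points."""
    tree = DecisionTreeClassifier(
        criterion="gini", max_leaf_nodes=max_leaf_nodes
    ).fit(x.reshape(-1, 1), y)
    # Internal nodes have a feature index >= 0; leaves are marked -2.
    cuts = np.sort(tree.tree_.threshold[tree.tree_.feature >= 0])
    return np.digitize(x, cuts), cuts
```

The returned integer codes can then be fed to the empirical entropy estimators used by MR + C and MRMR + C, while the original continuous values are kept for model learning.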
The selected features were used to train an SVM that predicted the target variable. Predictions were then made for each sample in the testing data and compared to the observed values to assess the model performances. $R^2$ scores are presented for the regression task, and AUC is used for the classification task. Mean Absolute Error (MAE) and F1 score results are available in the supplementary material. Finally, the overall compression of the selected features was measured on the testing data using DEFLATE and the DWT. The results presented are the mean performances over ten train-test cycles, and error bars show the standard errors.
Figure 3 shows the maximum $R^2$ performances for the regression task when selecting up to (a) 5, (b) 10, and (c) 15 features with different values of the compressibility factor weight, $\omega_{com}$. With up to five features, MR + C and MRMR + C had poor performances, with $R^2 < 0.5$, but performed significantly better when selecting up to ten or fifteen features. In fact, with up to ten or fifteen features, MR + C had significantly the best performances compared to the other selection methods for any value of $\omega_{com}$. The $R^2$ performances of GOFr + C, WLRr + C, and WSVMr + C either did not change or increased only marginally when selecting up to ten or fifteen features compared to five features. In almost all cases, giving greater consideration to compression (i.e., larger values of $\omega_{com}$) led to the selection of features with worse performances. One exception was MR + C, where $R^2$ performances remained the same for different values of $\omega_{com}$. MRMR + C, GOFr + C, and WLRr + C, on the other hand, were affected significantly by the value of $\omega_{com}$, with $R^2$ performances decreasing by up to 0.3 when comparing $\omega_{com} = 0$ and $\omega_{com} = 1$ for the same number of features. WSVMr + C was less affected by increasing $\omega_{com}$, with the maximum achieved $R^2$ performance decreasing only slightly between $\omega_{com} = 0$ and $\omega_{com} = 1$.
Interestingly, each selection method was affected over a different range of $\omega_{com}$. For instance, WLRr + C decreased in performance for low values, $\omega_{com} \le 0.2$, and was unaffected by increasing $\omega_{com}$ further, whereas performance was maintained by MRMR + C and GOFr + C until $\omega_{com} \approx 0.6$. This may be a result of the different ranges and distributions of the feature set performance measures, indicating that the effect of $\omega_{com}$ depends on the performance measure being used.

Predictive Performance
The maximum AUC performances for the classification task, PoI, are shown in Figure 4 when selecting up to (a) 5, (b) 10, and (c) 15 features with different values of $\omega_{com}$. As with the regression task, the performances when predicting PoIs improved with the number of features selected for MR + C and MRMR + C. For GOFc + C, WLCc + C, and WSVMc + C, however, the AUC performances did not increase when selecting more than five features. The highest AUC performances were attained by GOFc + C, WLCc + C, and WSVMc + C, followed by MRMR + C. Unlike in the regression task, the consideration of compression actually improved the performance of the features selected for the PoI task in some cases. In particular, the AUC performance increased significantly when selecting up to 5 features with MRMR + C with $\omega_{com} = 0.1$ compared to $\omega_{com} = 0$. Likewise, AUC increased for larger values, $\omega_{com} \ge 0.3$, for MR + C, which had the worst performance of all selection methods investigated. This is indicative of overfitting due to their use of MI, which is prone to selecting high entropy features at the cost of their generalizability to new data (Taylor, Griffiths, and Bhalerao 2015). The consideration of compression likely reduces this bias in MI, leading to the selection of features that are less specific to the training data.
The GOFc + C and WLCc + C selection methods, which do not rely on MI, were less affected by the value of $\omega_{com}$, but still improved for increased values of $\omega_{com}$. The AUC performance of WLCc + C increased from 0.875 (with $\omega_{com} = 0$) to 0.939 (with $\omega_{com} \ge 0.2$), and for GOFc + C it increased from 0.922 when $\omega_{com} = 0$ to 0.953 when $\omega_{com} = 0.1$. For WSVMc + C, however, the AUC performances decreased slightly at higher values of $\omega_{com}$.
Figure 5 (FC task) and Figure 6 (PoI task) show the number of bytes required to represent the best performing features when selecting up to (a) 5, (b) 10, and (c) 15 features, using the DEFLATE compression algorithm with a compression level of 6. For the FC task, the number of bytes is shown for the feature set that achieved the highest $R^2$ in each case, and for the PoI task the feature set with the highest AUC was used. In both tasks, more features required more bytes to represent them. In many cases, however, the maximum predictive performances were achieved by fewer than five or ten features, so the number of bytes remained the same when selecting up to ten and up to fifteen features. In particular, when $\omega_{com} \ge 0.3$, the number of bytes required to represent features selected using WLCc + C was the same when selecting up to five, ten, or fifteen features, as were their performances.

Compression Performance
In the PoI task, the number of bytes required decreased as $\omega_{com}$ increased for MR + C. Interestingly, this was predominantly when $0 \le \omega_{com} \le 0.5$, which is where the predictive performance also improved (Figure 4). This is further evidence that MR + C was overfitting to the training data in the PoI task, due to selecting high entropy features.
Although MR + C achieved the best performances in the FC task (Figure 3), the features it selected had significantly the worst compression in either task. For example, selecting ten features using MR + C when $\omega_{com} = 0$ required 341.7 KB when using DEFLATE, whereas selecting ten features using WSVMr + C required only 82.4 KB. Similarly, for the PoI task, MR + C with $\omega_{com} = 0$ required 87.5 KB to represent the top ranked ten features, whereas WSVMc + C required only 28.9 KB. Features selected by MRMR + C also required significantly more bytes than did the goodness of fit and wrapper approaches. Furthermore, the number of bytes required to represent features selected by MR + C and MRMR + C for the FC task was not affected significantly by the value of $\omega_{com}$, whereas the number of bytes required decreased as $\omega_{com}$ increased for all other selection methods (GOFr + C, WLRr + C, and WSVMr + C). In particular, WSVMr + C, which had the second best performances for the FC task, required significantly fewer bytes when $\omega_{com} = 1$ than when $\omega_{com} = 0$, with only a marginal decrease in predictive performance. The decrease in bytes was largest for WLRr + C and WSVMr + C between $\omega_{com} = 0$ and $\omega_{com} \approx 0.4$, and when $\omega_{com} \ge 0.5$ the number of bytes did not change significantly. As in the FC task, the number of bytes required by features selected using GOFc + C and WLCc + C decreased as $\omega_{com}$ increased when $\omega_{com} \le 0.3$, and with $\omega_{com} > 0.3$ the number of bytes required did not change significantly. The number of bytes required by features selected using WSVMc + C continued to decrease with increased values of $\omega_{com}$ in all cases investigated, becoming similar to that required by GOFc + C.
The numbers of bytes required to represent the selected features using the DWT are shown in Figure 7 for the FC task and in Figure 8 for PoI. The DWT had two levels, each using the db5 wavelet, with a quantization level of 8 and a threshold of 0.5. The resulting coefficients were then compressed using DEFLATE with a compression level of 6. The results in both FC and PoI reflect those found when compressing the features using DEFLATE only, but the DWT achieved better compression in all cases. In both tasks, more bytes were required when selecting more features. In the FC task, MR + C was again unaffected by the value of $\omega_{com}$, and GOFr + C and WSVMr + C had better compression with higher values of $\omega_{com}$. In the PoI task, the number of bytes required by features selected using GOFc + C decreased as $\omega_{com}$ increased for small values of $\omega_{com}$. In this case, however, the number of bytes required by features selected by MR + C increased for $\omega_{com} \le 0.3$, and only decreased again for $\omega_{com} \ge 0.5$.
Table 3 presents the maximum performances (by $R^2$) of the selection methods when applied in (a) forward greedy and (b) lazy forward greedy searches to select features for the FC task. The performances for MR + C are not displayed in these tables, as it selects the same set of features when using either forward greedy or lazy forward greedy search. Features selected using the filter methods, MRMR + C and GOFr + C, tended to have worse performances when lazy forward greedy search was employed than when using forward greedy search. Although for some values of $\omega_{com}$ the $R^2$ performances for GOFr + C with lazy forward greedy were higher than with forward greedy, the MAEs were still higher. For MRMR + C, the decrease in $R^2$ performance and the increase in MAE from forward greedy to lazy forward greedy were significant. As a result, even though fewer bytes were required, lazy forward greedy is not appropriate when using MRMR + C to select features for the FC task.
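The DWT compression pipeline described above can be sketched as follows, assuming the PyWavelets library; the signed 8-bit scaling used here is our illustrative reading of "a quantization level of 8", not necessarily the authors' exact scheme.

```python
# Two-level db5 DWT, threshold small coefficients, quantize, then DEFLATE.
import zlib
import numpy as np
import pywt  # PyWavelets

def dwt_deflate_bytes(x, wavelet="db5", level=2, q_bits=8, thresh=0.5):
    """Return the compressed size in bytes of a signal's thresholded,
    quantized DWT coefficients under DEFLATE (compression level 6)."""
    coeffs = pywt.wavedec(np.asarray(x, dtype=float), wavelet, level=level)
    flat = np.concatenate(coeffs)
    flat[np.abs(flat) < thresh] = 0.0              # thresholding step
    scale = np.abs(flat).max() or 1.0              # avoid divide-by-zero
    q = np.round(flat / scale * (2 ** (q_bits - 1) - 1)).astype(np.int8)
    return len(zlib.compress(q.tobytes(), 6))
```

Smooth, low-entropy signals leave long runs of zeroed and repeated coefficients, which DEFLATE then compresses far more effectively than the coefficients of a noisy signal.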
Whereas the performances of GOFr + C when using lazy forward greedy tended to be lower than for forward greedy, the features selected required fewer bytes to achieve the maximum $R^2$ score. When $\omega_{com} = 0.4$, for example, 32.3 KB were required to achieve an $R^2$ of 0.722 with forward greedy, and 26.7 KB to achieve 0.714 with lazy forward greedy. This particular difference was predominantly due to forward greedy selecting six features whereas lazy forward greedy selected five; but to select five features with $\omega_{com} = 0.4$, forward greedy still required 30.7 KB, and it also achieved a lower $R^2$ performance of 0.704. This is likely a result of extra redundancy in features selected using lazy forward greedy, which ranks features based on their performance in previous rounds to expedite the search, rather than assessing all features in each round as in forward greedy. Because features that are promising in one round are prioritized in the next, features selected in consecutive rounds may be good in similar ways and thus redundant, making their combination more easily compressible. While this is accounted for in the performance measures, such as $P_{MRMR}$ and $P_{MRMR+C}$, it is more challenging due to the reduced lazy search.

Lazy Evaluations
The performances of the wrapper methods, WLRr + C and WSVMr + C, were again generally worse when lazy forward greedy was used, especially for small values of $\omega_{com}$. In these cases, too, fewer bytes were typically required by the selected features. For larger values, $\omega_{com} \ge 0.2$, however, WLRr + C had better performances and required more bytes with lazy forward greedy than with forward greedy. This was likely because the peak performance when using forward greedy was achieved with only one feature in these cases, which is a clear sign of overfitting during feature selection.
Table 3. Best performances (by $R^2$) in the FC task selecting up to 15 features using (a) forward greedy and (b) lazy forward greedy search.
Features selected by WSVMr + C also required more bytes when lazy forward greedy was used than when using forward greedy, and also achieved marginally worse performances. The maximum performances (by AUC) for the PoI task are presented in Table 4 for (a) forward greedy search and (b) lazy forward greedy search. Again, fewer bytes were typically required to represent features selected using lazy forward greedy than when using forward greedy. In most cases, lower AUC and F1 performances were also achieved by lazy forward greedy. For example, MRMR + C had worse performance in both AUC and F1 for any value of $\omega_{com}$, and GOFc + C and WLCc + C had lower AUCs. However, GOFc + C and WLCc + C did have slightly higher F1 performances in some cases. WSVMc + C, on the other hand, improved in both AUC and F1 performances while also requiring fewer bytes.
In general, lazy forward greedy selected worse features than did forward greedy; the advantage of lazy forward greedy lies in its computational efficiency while still selecting adequate features. Figures 9 and 10 show the number of feature set evaluations performed when using lazy forward greedy with each of the selection methods when selecting (a) 5, (b) 10, and (c) 15 features. For all selection methods except MR + C, the number of feature set evaluations increases with $\omega_{com}$. The simplest way to implement MR + C is to evaluate each feature individually and produce a single ranking, from which the top features are selected. In both the FC and PoI tasks, the methods using linear regression had the most feature set evaluations, in particular for larger values of $\omega_{com}$. While MRMR + C had the fewest feature set evaluations, WSVMr + C had a similar number, and WSVMc + C required only around twice as many.

Summary
In summary, the following conclusions can be drawn from these results.
• Considering compression led to the selection of features with worse predictive performances in general, but ones that required fewer bytes when compressed.
• In the FC task, MR + C had the best performances, but needed up to 10 or 15 features to achieve them and required the most bytes to represent them.
• The next best performances in the FC task were achieved by GOFr + C and WSVMr + C, which also required the fewest bytes.
• In the PoI task, GOFc + C and WSVMc + C achieved the best performances and also required the fewest bytes.
• Using forward greedy generally provided better performances than lazy forward greedy, but in some cases the selected features required more bytes when compressed.
When considering compression of selected features, there is a trade-off against performance. In some cases, the lower predictive performance of a feature set may be acceptable given the significant improvement in compression. For example, in the FC task, when selecting features with GOFr + C it is possible to reduce the compressed size of the features from 96 KB to 32 KB while only reducing the $R^2$ performance by 0.035 and increasing the MAE by 0.002. In some applications, the 3× improvement in compression may warrant the small increase in error.
MR + C required the selection of more than five features for good predictive performances, and those features had the poorest compression (i.e., they required the most bytes) of all the methods. The features selected using MR + C for the FC regression task had higher $R^2$ performances than those selected by any other method when more than five features were selected. The compression performance of the features selected by MR + C was also unaffected by $\omega_{com}$ in this task. For the PoI classification task, the features selected by MR + C had good AUC performances when compression was considered with small (but greater than 0) values of $\omega_{com}$, but poor performances with larger values. MR + C also required the fewest feature set evaluations of any method investigated when using a lazy search. In regression tasks where compression is not important, therefore, MR + C may be most appropriate, especially when the number of feature set evaluations is to be minimized. However, this is unlikely to be the case in general, as it ignores feature redundancy (Guyon and Elisseeff 2003).
MRMR + C required more features for good performance than did most other methods, only having comparable performance when 15 features were selected. The selected features also had poorer compression, requiring almost 10× more bytes than all other methods except MR + C when compression was considered during selection. When using lazy evaluations, however, MRMR + C required fewer than half the number of evaluations of the other methods except MR + C.
The linear regression wrapper had poor performance in the FC regression task for any value of $\omega_{com}$. In the PoI classification task, the AUC performances achieved by features selected using the logistic regression wrapper, WLCc + C, improved when compression was considered during selection. With $\omega_{com} > 0.2$, AUC performance was unaffected, and the selected features required the fewest bytes of all methods when compressed.
The two remaining approaches, based on GOF and the SVM wrappers, had similar results in general, differing only in the FC regression task for larger values of $\omega_{com}$. For such values, $\omega_{com} > 0.6$, the $R^2$ performances of GOFr + C decreased significantly as $\omega_{com}$ increased, whereas those of WSVMr + C were much less affected. The features selected by either of these methods had the highest predictive performances (except MR + C in FC) of any method investigated, and the effect of small values of $\omega_{com}$ was small. Their compression performances were also similar, and both methods were outperformed only by WLRr + C in the FC regression task when assessed using DEFLATE (they were similar when using the DWT). Under lazy evaluation, however, both methods required the largest number of feature set evaluations.
In general, our findings suggest that the GOF and WSVM based methods are best for selecting feature subsets with both high predictive performance and good compression. Wrapper methods are computationally expensive, however, and even with lazy search the number of feature set evaluations may be too high when using GOFr + C or WSVMr + C for regression tasks, or GOFc + C or WSVMc + C for classification tasks. In such cases, it may be preferable to employ a filter method such as MRMR + C, which selected feature sets with similar predictive performances, albeit ones that required more bytes when compressed.

Conclusion
This paper investigated the compressibility of features selected using three filter methods and one wrapper method. We introduced a compressibility factor into feature set performance measures in order to improve this compression. Performance guarantees based on monotonicity and submodularity were given for the $P_{MR}$ and $P_{MR+C}$ performance measures, after showing their equivalence to a sum of non-negative numbers. The submodularity ratio can be used to provide performance guarantees for GOFr, GOFr + C, GOFc, and GOFc + C. The $P_{MRMR}$ and $P_{MRMR+C}$ performance measures are not submodular, but are equivalent to max dependency (Peng, Long, and Ding 2005) when used in forward greedy search. Similarly, under the assumption of a strong evaluation procedure, the wrapper methods are known to provide good performances when used in greedy searches (Kohavi and John 1997).
An entropy analysis was then provided for the criteria defined by the performance measures based on MI, namely $C_{MR}$, $C_{MR+C}$, $C_{MRMR}$, and $C_{MRMR+C}$. This analysis provided a useful perspective, identifying the entropy components that are minimized and those that are maximized in each of the performance measures. Since MI, used to assess relevance and redundancy in MR and MRMR, and compressibility can both be defined using entropy, the analysis also provides insight into the impact of compression on the selection process.
Finally, the feature selection performance measures were demonstrated by selecting features using forward greedy and lazy forward greedy search for one regression task, FC, and one classification task, PoI. We found that considering compression (i.e., $\omega_{com} > 0$) led to lower performances of the selected features, but improved compression. In the regression task, MR had the best performances, but selected features with significantly worse compression. Forward greedy search also had better performances than lazy forward greedy search in general, but in some cases the selected features required more bytes when compressed. Overall, the wrapper approaches had the best performances while selecting compressible features. When wrapper approaches are not possible due to computational limitations, MRMR + C provides a suitable alternative, although the selected features generally had poorer compression.
As future work, different compressibility factors should be investigated. First, as different features achieve different compressions with different algorithms, a more comprehensive approach that accommodates this is likely to improve accuracy. This is especially the case if specialized compression approaches are to be applied to different features. Also, with a more comprehensive and accurate measure of feature set compression, cost-sensitive searches for a feature set that satisfies a budget defining a maximum number of bytes could be applied (Leskovec et al. 2007). Alternatively, a multiobjective optimization search could be employed, such as one that searches for a Pareto-optimal solution (e.g., Hancer 2018; Zhang 2020).