Using pseudometrics in kernel density estimation

Common kernel density estimators (KDE) are generalised so that assumptions on the kernel of the distribution can be supplied. Instead of using metrics as input to the kernels, the new estimators use parameterisable pseudometrics. In general, the volumes of balls in pseudometric spaces depend on both the radius and the location of the centre. To enable constant smoothing, the volumes of the balls need to be calculated, and analytical expressions are preferred for computational reasons. Two suitable parametric families of pseudometrics are identified, one of which has the common KDE as special cases. In a few experiments, the proposed estimators show increased statistical power when proper assumptions are made. This paper therefore describes an approach where partial knowledge about the distribution can be used effectively. Furthermore, it is suggested that the new estimators are suitable for statistical learning algorithms such as regression and classification.


Introduction
The multivariate kernel density estimator (KDE) with a fixed bandwidth is given by
$$\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} K_H(x - x_i), \qquad (1)$$
where $K_H(x): \mathbb{R}^m \to \mathbb{R}$ is a unimodal, symmetric, nonnegative and zero-mean function that integrates to 1. The bandwidth matrix H describes the level of smoothing. It is convenient to write $H = h\Lambda$, where h is a positive scalar and $|\det(\Lambda)| = 1$. With this parametrisation, the shape is determined by $\Lambda$ and the scale is determined by h. This gives
$$\hat f(x) = \frac{1}{n h^m}\sum_{i=1}^{n} K\!\big(\Lambda^{-1}(x - x_i)/h\big), \qquad (2)$$
where K is a kernel that is independent of H. The first univariate version of Equation (2) was proposed by Fix and Hodges (1951), while the general class was investigated by Rosenblatt (1956) and Parzen (1962). The multivariate extension was outlined by Cacoullos (1966) and Epanechnikov (1969).

Several authors have suggested a variable bandwidth version of Equation (2), where the two main approaches are the balloon estimator and the sample smoothing estimator. A comparison between KDE with fixed and variable bandwidths is given by Terrell and Scott (1992).
The balloon estimator lets h(x) equal the distance to the kth nearest neighbour of x. The estimator is given by
$$\hat f(x) = \frac{1}{n\, h(x)^m}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h(x)}\right). \qquad (3)$$
The first version of this estimator was published by Loftsgaarden and Quesenberry (1965) and has later been known as the k-nearest neighbour estimator. It is given by
$$\hat f(x) = \frac{k}{n\, V_m\, r^m}, \qquad (4)$$
where n is the sample size, $V_m$ the volume of the m-dimensional Euclidean unit ball and r the Euclidean distance between x and its kth nearest neighbour. Equation (4) can be derived from Equation (3) if we choose H to be the identity and a kernel that is a uniform density on the m-dimensional Euclidean unit ball. In Loftsgaarden and Quesenberry (1965), it was also proved that the estimator in Equation (4) is consistent for increasing n. The sample smoothing estimator was suggested by Breiman, Meisel, and Purcell (1977). Here, the bandwidth parameter is independent of the query point, but dependent on the sample points. The estimator is
$$\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h(x_i)^m}\, K\!\left(\frac{x - x_i}{h(x_i)}\right),$$
where $h(x_i)$ is the distance to the kth nearest neighbour of $x_i$.
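For concreteness, the following is a minimal numerical sketch (not taken from the paper) of two of the classical estimators above: the fixed-bandwidth KDE of Equations (1)-(2) with a Gaussian kernel and H = hI, and the k-nearest neighbour estimator of Equation (4). The function names and the choice of kernel are illustrative assumptions.

```python
# Sketch: fixed-bandwidth KDE with H = h*I and the k-nearest neighbour estimator.
import numpy as np
from scipy.special import gamma

def kde_fixed(x, X, h):
    """Fixed-bandwidth KDE, Equations (1)-(2), with a Gaussian kernel and H = h * I."""
    n, m = X.shape
    u = (x - X) / h                                      # scaled differences
    k = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi)**(m / 2)
    return k.sum() / (n * h**m)

def knn_density(x, X, k):
    """k-nearest neighbour estimator, Equation (4): k / (n * V_m * r^m)."""
    n, m = X.shape
    r = np.sort(np.linalg.norm(X - x, axis=1))[k - 1]    # distance to the kth neighbour
    V_m = np.pi**(m / 2) / gamma(m / 2 + 1)              # volume of the Euclidean unit m-ball
    return k / (n * V_m * r**m)

X = np.random.default_rng(0).normal(size=(500, 2))
print(kde_fixed(np.zeros(2), X, h=0.5), knn_density(np.zeros(2), X, k=20))
```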
To improve readability, we recall the following definitions.
Definition 1.1 The kernel of a function $f: X \to Y$ is an equivalence relation $=_f$ given by $x =_f y \Leftrightarrow f(x) = f(y)$, where x and y are elements of X.
Definition 1.2 A pseudometric on a set X is a function (distance function) $d: X \times X \to \mathbb{R}$ which, for all x, y, z in X, satisfies the following conditions: $d(x, x) = 0$, $d(x, y) \geq 0$, $d(x, y) = d(y, x)$ and $d(x, z) \leq d(x, y) + d(y, z)$. This definition of a pseudometric is exactly the same as the definition of a semimetric in the field of nonparametric functional data analysis (Ferraty and Vieu 2006). To avoid confusion, we will keep referring to pseudometrics in this paper, even though most of the literature in nonparametric functional data analysis uses the term semimetric.
In nonparametric functional data analysis, the explanatory variables are functional and most of the work has been done on regression and classification. An overview of the field is given in the book by Ferraty and Vieu (2006). In the case of density estimation, Delaigle and Hall (2010) proved that a probability density function generally does not exist for functional data, but developed notions of density and mode, when functional data are considered in the space determined by the eigenfunctions of the principal components. In the paper by Ferraty, Kudraszow, and Vieu (2012), an infinite dimensional analogue of a density estimator is discussed, when the small ball probability associated with the functional data can be approximated as a product of two independent functions; one depending on the centre of the ball and one depending on the radius.
In this paper, we consider only finite dimensional pseudometric spaces, where balls with finite radius have finite volumes. Such spaces exist as a specialisation of the theory of nonparametric functional data analysis (Ferraty et al. 2012). The theory in this paper is distinguished from that of nonparametric functional data analysis because we consider pseudometric spaces where the volumes of balls with fixed radius are allowed to depend on the locations of the centres. In all previous work on using pseudometrics for functional data, the volumes of the balls are considered independent of the locations of the centres. This generalisation opens up applications that are less linked to functional data and more related to restricted nonparametric and semiparametric estimators on real (non-functional) data.
To be more accurate, the parameters of the pseudometric precisely describe the kernel, or more roughly the level sets of the distribution. This is important for visualisation and interpretation of the estimated distribution. For instance, the complete multi-dimensional density that is estimated with an estimator using a pseudometric of family one (described later in this paper) can be described by a two-dimensional plot only.
The choice of the family of pseudometrics describes the constraints on the level sets of the distributions that can be modelled. When fitting parameters, it is therefore important that the level sets of the actual distribution do not violate these constraints. This is somewhat analogous to parametric models, where serious misbehaviour can be observed if the distribution has a different shape than any of the members in the family of models.
An example of a similar approach can be found in a paper by Liescher (2005), where the level sets of the distributions were constrained to be of elliptic shape. The elliptic shape of the level sets was enforced by a transformation that was performed prior to applying the nonparametric density estimator. It was shown that the convergence rates were independent of the number of dimensions, except in the neighbourhood of the mean, thereby overcoming the so-called curse of dimensionality. The estimators in this paper can be seen as a generalisation of the estimators of Liescher (2005), because a wider family of associated distributions is allowed. Convergence rates are not explicitly given in this paper, but simulations show that the performance close to the mode coincides with that of the common KDE and improves quickly away from the mode, which is similar to the results obtained by Liescher (2005).
In the context of reducing the curse of dimensionality of nonparametric density estimators, it is also worth mentioning the projection pursuit density estimator. This method was first introduced in Friedman, Stuetzle, and Schroeder (1984), and a parametric extension was given by Welling, Zemel, and Hinton (2003). In projection pursuit, one projects the explanatory variables onto principal directions and fits one-dimensional smooth density functions to these projections. The resulting density is the product of these densities. A special case is the flexible Bayesian estimator, where the projection matrix is the identity. In essence, the flexible Bayesian method assumes independence between the explanatory variables. The estimator will be used later in this paper and is
$$\hat f(x) = \prod_{j=1}^{m}\frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x_j - x_{ij}}{h}\right),$$
where K is a univariate kernel, the bandwidth is h, $x_j$ is the jth coordinate of x and $x_{ij}$ is the jth coordinate of sample i. Consistency is proved under the assumption of independence (John and Langley 1995). This is a nonparametric variation of the old and well-known naive Bayesian estimator, where the individual distributions are assumed Gaussian. A similarity between the estimators in this paper and the projection pursuit density estimators is that the resulting density can be visualised in lower dimensional spaces. In projection pursuit, it is enough to show the one-dimensional (ridge) functions along with the principal directions to understand the full distribution. The proposed estimators have less in common with semiparametric models, where part of the model is fully parametric, for instance models which are a product of a parametric and a nonparametric component, such as in Naito (2004) and references therein. Another example is when some of the explanatory variables are modelled nonparametrically, while the rest are modelled parametrically, such as in Hoti and Holmström (2003).
In the next section, the underlying assumptions of the KDE with fixed and variable bandwidths are discussed. This is followed by formal definitions of the new estimators. A framework for finding suitable parametric families of pseudometrics is outlined in Section 3 and two families are identified. Some experiments are given in Section 4, where the new KDE are compared with common KDE on samples drawn from common distributions. In Section 5, open questions and ideas of how to use the new estimators on various problems are elaborated on, before the work is concluded.

Assumptions for kernel density estimation and definition of new estimators
It is convenient to start our discussion by investigating the k-nearest neighbour method, because it is conceptually easier than the other KDE. If we generalise Equation (4) to use any metric d, then
$$\hat f(x) = \frac{k}{n\, V_{B_d[x,\, d(x, x_k)]}}, \qquad (5)$$
where $x_k$ is the kth nearest neighbour of x and $V_{B_d[x, r]}$ is the volume of the ball with centre x and radius r in the metric d.

For the sake of discussion, a random variable Y is defined as a k-nearest neighbour estimate at a particular x. Each k-nearest neighbour estimate requires drawing a new training sample, and this is why Y is a random variable. The variance of Y is a consequence of which sample is chosen (because this determines the location of the kth nearest neighbour), while a great portion of the bias can be explained by the shape of the ball. This is potentially clearer if we realise that Equation (5) can be derived from two assumptions that were not stated in Loftsgaarden and Quesenberry (1965). The first assumption is that
$$\int_{B_d[x,\, d(x, x_k)]} f(y)\,\mathrm{d}y = \frac{k}{n}. \qquad (6)$$
The second assumption is that $\hat f(x)$ is equal to the mean of f over the ball, which is how Equation (6) implies Equations (5) and (4). A measure of how much the second assumption is violated is identified as
$$\mathrm{AV} = \left| f(x) - \frac{1}{V_{B_d[x,\, d(x, x_k)]}}\int_{B_d[x,\, d(x, x_k)]} f(y)\,\mathrm{d}y \right|. \qquad (7)$$
The parameter AV is therefore highly dependent on which metric is chosen, because the metric determines the shape and the size of the region of integration. A great portion of the bias of Y is a consequence of the second assumption. Also, we recognise that there is a connection between f and the choice of metric: no single metric is optimal for every f, and vice versa.

This argumentation can be transferred to all KDE with both fixed and variable bandwidths, because these methods essentially weight the contribution of each sample $x_i$ as a function of $V_{B_d[x,\, d(x, x_i)]}$. For the KDE with fixed bandwidth, the balloon estimator and the sample smoothing estimator, respectively, we could write
$$\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} K_H\big(V_{B_d[x,\, d(x, x_i)]}\big), \quad
\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} K_{H(x)}\big(V_{B_d[x,\, d(x, x_i)]}\big), \quad
\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} K_{H(x_i)}\big(V_{B_d[x,\, d(x, x_i)]}\big),$$
where $K_H$, $K_{H(x)}$ and $K_{H(x_i)}$ are nonnegative, strictly decreasing, one-sided univariate kernels that integrate to one when integrated over x. If we let the random variable Y be defined as any of the kernel density estimates at a particular x, then the bias of Y is highly related to the shape of the balls in combination with the shape of f near x.
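The dependence of Equation (5) on the chosen metric through the ball volume can be illustrated with a short sketch. A linearly transformed Euclidean metric d(x, y) = ||A(x − y)|| is assumed purely for illustration; for this metric the ball B_d[x, r] is an ellipsoid with volume V_m r^m / |det(A)|, so the estimate changes with the shape matrix A.

```python
# Sketch: Equation (5) with the illustrative metric d(x, y) = ||A(x - y)||.
import numpy as np
from scipy.special import gamma

def knn_density_metric(x, X, k, A):
    n, m = X.shape
    d = np.linalg.norm((X - x) @ A.T, axis=1)       # d(x, x_i) = ||A(x - x_i)||
    r = np.sort(d)[k - 1]                           # distance to the kth nearest neighbour
    V_m = np.pi**(m / 2) / gamma(m / 2 + 1)
    vol_ball = V_m * r**m / abs(np.linalg.det(A))   # volume of the ellipsoidal ball
    return k / (n * vol_ball)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) * np.array([1.0, 0.2])                     # anisotropic sample
print(knn_density_metric(np.zeros(2), X, k=20, A=np.eye(2)))             # Euclidean ball
print(knn_density_metric(np.zeros(2), X, k=20, A=np.diag([1.0, 5.0])))   # ball adapted to the shape
```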

Definition of the k-nearest pseudo-neighbour estimator
A generalisation of the k-nearest neighbour estimator is proposed as
$$\hat f(x) = \frac{k}{n\, V_{B_{d_p}[x,\, d_p(x, x_k)]}},$$
where $d_p$ is a pseudometric and $x_k$ is the kth nearest neighbour of x in the sense of $d_p$. Figure 1 illustrates the principles of the k-nearest neighbour estimator of Loftsgaarden and Quesenberry (1965) together with the k-nearest pseudo-neighbour estimator. This particular pseudometric is a special case of the parametric families of pseudometrics that are outlined later in this paper. It is postulated that a proper choice of pseudometric can reduce Equation (7) far more than a metric, because balls in pseudometric spaces span a richer variety of shapes. The key idea is to find a parametric family of pseudometrics and select parameters by either Bayesian inference or learning.
It is a crucial point that the parameters of a pseudometric family are typically associated with various sample statistics and associations between the explanatory variables. In the example shown in Figure 1, the centre of the dashed circles is a parameter of the pseudometric. This parameter can be estimated by the sample mean.

Definition of pseudometric KDE
A challenge with using pseudometrics is that the volumes of balls with constant radius generally depend on the locations of the centres. This effect is neutralised by defining the equivalent Euclidean distance, which is the function
$$e_{d_p}(x, y) = \left(\frac{V_{B_{d_p}[x,\, d_p(x, y)]}}{V_m}\right)^{1/m}, \qquad (8)$$
where $V_m$ is the volume of the Euclidean unit m-ball. The equivalent Euclidean distance from x to y in a pseudometric space $(\mathbb{R}^m, d_p)$ is the radius of the Euclidean m-ball that has the same volume as $B_{d_p}[x, d_p(x, y)]$. Note that $e_{d_p}$ is generally not a metric, nor a pseudometric, because the symmetry criterion is broken.

Figure 1. Description of the k-nearest pseudo-neighbour estimator. The scatter plot shows a training population with 100 points that are i.i.d. and drawn from a two-dimensional normal distribution. A cross is marked at a query point; in k-nearest neighbour estimation, the kth nearest neighbour is found by comparing the Euclidean distance to all other points. The smallest Euclidean ball that encapsulates the 5th nearest neighbour is shown as a solid circle. Clearly, f is not very constant inside this ball. In k-nearest pseudo-neighbour estimation, the kth nearest neighbour is found by comparing the distance in the pseudometric space from the query point to all other points. The interior between the two dashed lines represents a ball in this pseudometric space, where the query point is in the centre. The radius of the ball (not of the dashed circles) is the distance to the 5th pseudo-nearest neighbour. Obviously, f is far more constant inside this ball. Consequently, improved statistical power is achieved when the equivalence classes with respect to the kernel of the distribution are assumed to be circles around the mean.
The idea is to weight the contributions of the x i 's according to the equivalent Euclidean distances from the query point. This ensures that the level of smoothing is constant with respect to volume, which is the case for KDE with fixed bandwidths.
The pseudometric kernel density estimator (PKDE) is suggested as
$$\hat f(x) = \int K_h(u)\,\mathrm{d}F(u) = \frac{1}{n}\sum_{i=1}^{n} K_h\big(e_{d_p}(x, x_i)\big), \qquad (9)$$
where $K_h(u)$ is the kernel with bandwidth h and $\mathrm{d}F/\mathrm{d}e_{d_p}$ is equal to 1/n at the $e_{d_p}(x, x_i)$'s and 0 elsewhere. An example of a kernel is the one based on the normal distribution,
$$K_h(u) = \frac{1}{(2\pi)^{m/2}\, h^m}\exp\!\left(-\frac{u^2}{2h^2}\right),$$
which integrates to 1 when integrated over u as a distance in $\mathbb{R}^m$. Notice that the kernel itself is not dependent on the choice of pseudometric; only the number of dimensions is necessary input.
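A minimal sketch of Equations (8) and (9) for the circular-shell pseudometric of Figure 1 in two dimensions follows. The shell-volume helper, the scalar normal kernel and all names are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of Equations (8)-(9) with d_p(x, y) = | ||x - c|| - ||y - c|| | in m = 2,
# whose ball around x is an annulus centred at c.
import numpy as np

def shell_ball_volume(rho, r):
    """Area of the annulus {y : | ||y - c|| - rho | <= r} in the plane."""
    inner = max(rho - r, 0.0)
    return np.pi * ((rho + r)**2 - inner**2)

def pkde(x, X, c, h):
    n, m = X.shape                                     # here m = 2
    V_m = np.pi                                        # volume of the Euclidean unit 2-ball
    rho = np.linalg.norm(x - c)
    d_p = np.abs(np.linalg.norm(X - c, axis=1) - rho)             # pseudometric distances
    vols = np.array([shell_ball_volume(rho, r) for r in d_p])     # volumes of the shells
    e = (vols / V_m)**(1.0 / m)                        # equivalent Euclidean distances, Eq. (8)
    K = np.exp(-0.5 * (e / h)**2) / ((2 * np.pi)**(m / 2) * h**m) # scalar-argument normal kernel
    return K.sum() / n                                 # Equation (9)

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
print(pkde(np.array([1.5, 0.0]), X, c=np.zeros(2), h=0.5))
```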
If we choose the pseudometric $d_p(x, y) = \|\Lambda^{-1}(x - y)\|$, where $|\det(\Lambda)| = 1$, and a bandwidth h, then Equation (1) becomes a special case of Equation (9) when $H = h\Lambda$ and normal kernels are used. The only difference is that the kernel in Equation (9) takes a scalar argument, while the kernel in Equation (1) takes a vector. A similar argumentation can be made for any kernel, and this is why the PKDE is a generalisation of the standard KDE.
In the case of variable bandwidths, the pseudometric balloon estimator is suggested as
$$\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} K_{h(x)}\big(e_{d_p}(x, x_i)\big),$$
where $h(x) = e_{d_p}(x, x_k)$ and $x_k$ is the kth nearest pseudo-neighbour of x (nearest in the sense of $d_p$).
Notice that if we choose $d_p$ to be Euclidean and a kernel that is a uniform density on the Euclidean ball, the k-nearest neighbour method falls out as a special case. Moreover, the pseudometric sample smoothing estimator is suggested as
$$\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} K_{h(x_i)}\big(e_{d_p}(x, x_i)\big),$$
where $h(x_i)$ is the equivalent Euclidean distance from $x_i$ to its kth nearest pseudo-neighbour.
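The special case mentioned above can be checked numerically: with a Euclidean pseudometric and a uniform kernel on the ball, the pseudometric balloon estimator coincides with the k-nearest neighbour estimator of Equation (4). The sketch below assumes exactly that setting; function names are illustrative.

```python
# Sketch: Euclidean d_p plus a uniform kernel on the ball reduces the pseudometric
# balloon estimator to the k-nearest neighbour estimator of Equation (4).
import numpy as np
from scipy.special import gamma

def balloon_uniform(x, X, k):
    """Pseudometric balloon estimator with Euclidean d_p and a uniform kernel."""
    n, m = X.shape
    d = np.linalg.norm(X - x, axis=1)      # Euclidean = equivalent Euclidean distance here
    h = np.sort(d)[k - 1]                  # bandwidth = distance to the kth neighbour
    V_m = np.pi**(m / 2) / gamma(m / 2 + 1)
    K = (d <= h) / (V_m * h**m)            # uniform kernel on the ball of radius h
    return K.sum() / n

def knn_density(x, X, k):
    n, m = X.shape
    r = np.sort(np.linalg.norm(X - x, axis=1))[k - 1]
    V_m = np.pi**(m / 2) / gamma(m / 2 + 1)
    return k / (n * V_m * r**m)

X = np.random.default_rng(3).normal(size=(200, 3))
x = np.array([0.3, -0.2, 0.1])
print(balloon_uniform(x, X, 10), knn_density(x, X, 10))   # the two agree
```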

Choosing appropriate parametric families of pseudometrics
A clear challenge with using the new estimators is proposing suitable families of pseudometrics. It is therefore instructive to define the equivalence relation $\sim_{d_p}$, where $x \sim_{d_p} y$ if and only if $d_p(x, y) = 0$. While discussing the usefulness of a family of pseudometrics, it makes sense to ask which density functions are invariant to $\sim_{d_p}$, and the next two definitions are motivated by this.
Definition 3.1 A probability density function is said to be an associated distribution of a pseudometric (or metric) $d_p$, denoted $f_{\sim d_p}: X \to \mathbb{R}_+$, when $f_{\sim d_p}$ is invariant to $\sim_{d_p}$. Invariance to $\sim_{d_p}$ means that $x \sim_{d_p} y$ implies $f_{\sim d_p}(x) = f_{\sim d_p}(y)$ for all x, y in X.

Volumes of generalised balls
Two new families of pseudometrics are proposed, together with their associated distributions and the derivations of the analytical expressions for the volumes of their balls. Before this, a definition of the closed linearly transformed generalised balls and a derivation of an analytical expression for their volumes are needed. Table 1 contains a list of symbols that are used consistently in the rest of this paper.

Here, p is a vector of the positive $p_i$'s, while $\bar p$ is the harmonic average of the $p_i$'s. Moreover, $x' = Ax$ and $y' = Ay$, where A is a square transformation matrix with nonzero determinant.
In general, $d_{s,A,p}$ is a semimetric, because the triangle inequality does not hold when at least one of the $p_i$'s is less than 1. Depending on the $p_i$'s, the generalised balls describe a wide range of multi-dimensional geometrical objects, where spheres, cubes, cylinders and stars are special cases. A few three-dimensional examples are shown in Figure 2.
In articles by Wang (2005) and Gao (2013), the volume of the generalised unit ball was derived.
Theorem 3.4 The volume of the closed generalised unit ball $B_{d_{s,I,p}}[0, 1]$ is
$$V_p = \frac{2^m \prod_{i=1}^{m}\Gamma(1 + 1/p_i)}{\Gamma\!\left(1 + \sum_{i=1}^{m} 1/p_i\right)}, \qquad (10)$$
where I is the identity matrix.
The proof by Wang used induction and was motivated by an idea from Folland (2001), while the proof by Gao used properties of the exponential distribution. An alternative proof of Equation (10) is included in the appendix.
Theorem 3.5 The volume of the closed linearly transformed generalised $\varepsilon$-ball that is centred in a is
$$V_{B_{d_{s,A,p}}[a,\,\varepsilon]} = \frac{V_p}{|\det(A)|}\,\varepsilon^{m}.$$
This means that $V_p$ and the absolute value of the determinant of A determine the overall scaling of the volume. This is a consequence of choosing the exponent in $d_{s,A,p}$ to be $1/\bar p$.
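Theorems 3.4 and 3.5 translate directly into code. The sketch below evaluates $V_p$ and the volume of the transformed $\varepsilon$-ball, with sanity checks against the disc (area $\pi$) and the diamond $|x|+|y|\le 1$ (area 2) in two dimensions; function names are illustrative.

```python
# Sketch of Theorems 3.4 and 3.5: V_p and the volume V_p * eps^m / |det(A)|.
import numpy as np
from scipy.special import gamma

def generalised_unit_ball_volume(p):
    """V_p for exponents p = (p_1, ..., p_m), Equation (10)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    return 2.0**m * np.prod(gamma(1.0 + 1.0 / p)) / gamma(1.0 + np.sum(1.0 / p))

def transformed_ball_volume(p, A, eps):
    """Volume of the closed linearly transformed generalised eps-ball (Theorem 3.5)."""
    m = len(p)
    return generalised_unit_ball_volume(p) * eps**m / abs(np.linalg.det(A))

print(generalised_unit_ball_volume([2, 2]))                            # ~3.1416 (disc)
print(generalised_unit_ball_volume([1, 1]))                            # 2.0 (diamond)
print(transformed_ball_volume([2, 2], np.diag([1.0, 2.0]), eps=1.0))   # pi / 2
```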

Pseudometric family of type one
Definition 3.6 The pseudometric family of type one $d_{p1,A,c,p}(x, y): \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ is given by
$$d_{p1,A,c,p}(x, y) = \big| d_{s,A,p}(x, c) - d_{s,A,p}(y, c) \big|,$$
where c is a chosen point (for instance a mean or a median).
The parameters that can be trained for this pseudometric are A, c and p. In the family of pseudometric spaces $(\mathbb{R}^3, d_{p1,A,c,p})$, the balls can be viewed as generalisations of the spherical shell, and two examples are shown in Figure 3.
Theorem 3.7 The parametric family of associated distributions $f_{\sim d_{p1,A,c,p}}(x)$ is equal to a function g(u), which satisfies
$$\frac{m\, V_p}{\bar p\, |\det(A)|}\int_0^{\infty} g(u)\, u^{m/\bar p - 1}\,\mathrm{d}u = 1, \qquad (11)$$
where $u = d_{s,A,p}(x, c)^{\bar p}$. Moreover, $f_{\sim d_{p1,A,c,p}}(x)$ satisfies the strong condition.
If we choose g(u) to be an exponential function $a\exp(-u)$, then the coefficient a can be found from Equation (11). Inserting for g(u), the integral part of Equation (11) becomes $a\,\Gamma(m/\bar p)$, so that
$$a = \frac{\bar p\, |\det(A)|}{m\, V_p\, \Gamma(m/\bar p)} = \frac{|\det(A)|}{V_p\, \Gamma(1 + m/\bar p)}.$$
The parametric family of associated distributions is therefore a linear transformation of the m-dimensional generalised normal distribution (also known as the exponential power distribution),
$$f_{\sim d_{p1,A,c,p}}(x) = \frac{|\det(A)|}{V_p\, \Gamma(1 + m/\bar p)}\exp\!\big(-d_{s,A,p}(x, c)^{\bar p}\big).$$
To interpret this result, a description of the univariate case is given by
$$f(x) = \frac{1}{2\alpha\,\Gamma(1 + 1/p)}\exp\!\left(-\left|\frac{x - c}{\alpha}\right|^{p}\right),$$
where more details are given in Figure 4. Also, if we choose A to be diagonal with $1/\alpha_i$'s on the diagonal, then
$$f_{\sim d_{p1,A,c,p}}(x) = \prod_{i=1}^{m}\frac{1}{2\alpha_i\,\Gamma(1 + 1/p_i)}\exp\!\left(-\left|\frac{x_i - c_i}{\alpha_i}\right|^{p_i}\right).$$
In this case all dimensions are independent, the $\alpha_i$'s are the scale factors and the $p_i$'s are the shape factors.
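As a quick check of the diagonal case above, the following sketch evaluates the product of univariate generalised normal densities and verifies numerically that a marginal integrates to 1; the parameter values are arbitrary and the function name is an illustrative choice.

```python
# Sketch: product of univariate generalised normal (exponential power) densities.
import numpy as np
from scipy.special import gamma
from scipy.integrate import quad

def gen_normal_pdf(x, c, alpha, p):
    """Diagonal-case associated distribution: independent generalised normal marginals."""
    x, c, alpha, p = map(np.asarray, (x, c, alpha, p))
    coef = 1.0 / (2.0 * alpha * gamma(1.0 + 1.0 / p))
    return np.prod(coef * np.exp(-np.abs((x - c) / alpha)**p))

# Each marginal integrates to 1 (checked here for alpha = 1.5, p = 0.8).
val, _ = quad(lambda t: gen_normal_pdf([t], [0.0], [1.5], [0.8]), -50, 50)
print(val)   # ~1.0
```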

Volumes of the balls
Theorem 3.8 The volumes of the balls in the family of pseudometric spaces $(\mathbb{R}^m, d_{p1,A,c,p})$ are given by
$$V_{B_{d_{p1,A,c,p}}[a,\,\varepsilon]} = \frac{V_p}{|\det(A)|}\times
\begin{cases}
2\sum_{i\in O}\binom{m}{i}\,\varepsilon^{i}\, d_{s,A,p}(a, c)^{m-i}, & \varepsilon < d_{s,A,p}(a, c),\\[4pt]
\sum_{i=0}^{m}\binom{m}{i}\,\varepsilon^{i}\, d_{s,A,p}(a, c)^{m-i}, & \varepsilon \geq d_{s,A,p}(a, c),
\end{cases}$$
where O is the set of positive odd numbers that are smaller than or equal to m.
The continuity of the ball volume at $\varepsilon = d_{s,A,p}(a, c)$ is seen by noting that $\sum_{i=0}^{m}\binom{m}{i} = 2^m$ and $\sum_{i\in O}\binom{m}{i} = 2^{m-1}$. In the case when $\varepsilon < d_{s,A,p}(a, c)$, the volume is a polynomial with respect to either $\varepsilon$ or $d_{s,A,p}(a, c)$. The same is true when $\varepsilon \geq d_{s,A,p}(a, c)$, but the polynomial is different. This demonstrates how the volumes of the balls are dependent on the locations of the centres, and this is why it was necessary to define the equivalent Euclidean distance in Equation (8).
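Under the shell reading of the type-one balls used in the statement of Theorem 3.8 above, the volume formula can be evaluated and its continuity at $\varepsilon = d_{s,A,p}(a, c)$ checked numerically. The sketch below assumes that reading; names are illustrative.

```python
# Sketch of Theorem 3.8 as stated above: the ball of d_p1 around a is the shell
# rho - eps <= d_{s,A,p}(x, c) <= rho + eps with rho = d_{s,A,p}(a, c).
import numpy as np
from scipy.special import gamma, comb

def V_p(p):
    p = np.asarray(p, dtype=float)
    return 2.0**p.size * np.prod(gamma(1.0 + 1.0 / p)) / gamma(1.0 + np.sum(1.0 / p))

def type_one_ball_volume(eps, rho, p, A):
    """Volume of B_{d_p1,A,c,p}[a, eps], with rho = d_{s,A,p}(a, c)."""
    m = len(p)
    scale = V_p(p) / abs(np.linalg.det(A))
    if eps < rho:                                  # shell does not reach the centre c
        odd = range(1, m + 1, 2)
        return scale * 2.0 * sum(comb(m, i) * eps**i * rho**(m - i) for i in odd)
    return scale * sum(comb(m, i) * eps**i * rho**(m - i) for i in range(m + 1))

p, A, rho = [2.0, 2.0, 2.0], np.eye(3), 1.0
print(type_one_ball_volume(0.999999, rho, p, A))   # continuous at eps = rho ...
print(type_one_ball_volume(1.000001, rho, p, A))   # ... the two branches agree
```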

Pseudometric family of type two
Definition 3.9 The pseudometric family of type two $d_{p2,A,c,p,q}(x, y): \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ treats the first $m_1$ transformed coordinates radially around a chosen point, as in family one, and the remaining coordinates directly, where q is an integer larger than or equal to one, I is the identity matrix, $x' = Ax$, $y' = Ay$ and $c'$ is the transformed chosen point. If $m_1$ is less than or equal to 1, $B_{d_{p2,A,c,p,q}}[a, \varepsilon]$ is the familiar m-dimensional ball. The familiar torus and a deformation of a torus are other special cases that are shown in Figure 5.

Associated distributions
In terms of discussing the associated distributions for $d_{p2,A,c,p,q}$, it is an important observation that the equivalence classes of $\mathbb{R}^m/\sim_{d_{p2,A,c,p,q}}$ are all topological manifolds of dimension $m_1 - 1$, while the equivalence classes of $\mathbb{R}^m/=_{f_{\sim d_{p2,A,c,p,q}}}$ are $(m-1)$-manifolds. This means that $\mathbb{R}^m/\sim_{d_{p2,A,c,p,q}}$ can never be equal to $\mathbb{R}^m/=_{f_{\sim d_{p2,A,c,p,q}}}$ unless $m_1$ is equal to m. This dimension reduction may be desirable with regard to improving robustness, because the family of associated distributions spans wider. On the other hand, it may reduce statistical power because AV, defined in Equation (7), may not be minimised properly.
Theorem 3.10 The parametric family of associated distributions $f_{\sim d_{p2,A,c,p,q}}(x)$ is equal to a function $g(u, x_{m_1+1:m})$, which satisfies a normalisation constraint. If we also assume that $x_1, x_2, \ldots, x_{m_1}$ are all independent of $x_{m_1+1}, x_{m_1+2}, \ldots, x_m$ (and vice versa), then $g(u, x_{m_1+1:m}) = g_1(u)\, g_2(x_{m_1+1:m})$. Hence, $g_1(u)$ is subject to a corresponding constraint, while $g_2(x_{m_1+1:m})$ is a probability density function that integrates to 1 when integrated over $x_{m_1+1:m}$. Clearly, the family of associated distributions $f_{\sim d_{p2,A,c,p,q}}(x)$ has the family of associated distributions $f_{\sim d_{p1,A,c,p}}(x)$ and the family of all distributions as special cases.

Volumes of the balls
Theorem 3.11 The volumes of the balls in the family of pseudometric spaces $(\mathbb{R}^m, d_{p2,A,c,p,q})$ are given by piecewise expressions in $\varepsilon$ and $d_{s,I,p}(a'_{1:m_1}, c')$ that involve the incomplete beta function. Here, $B_t(a, b)$ is the incomplete beta function, $a' = Aa$ and O is the set of positive odd numbers that are smaller than or equal to $m_1$.
In the case when $\varepsilon < d_{s,I,p}(a'_{1:m_1}, c')$, the volume is a polynomial with respect to either $\varepsilon$ or $d_{s,I,p}(a'_{1:m_1}, c')$. The same is true when $\varepsilon \geq d_{s,I,p}(a'_{1:m_1}, c')$, but the polynomial is different.
Theorem 3.12 The volumes of the balls in the family of pseudometric spaces $(\mathbb{R}^m, d_{p2,A,c,p,q})$ in the limiting case when $q \to \infty$ are given by piecewise polynomials in $\varepsilon$ and $d_{s,I,p}(a'_{1:m_1}, c')$, where O is the set of positive odd numbers that are smaller than or equal to $m_1$.
The continuity of the ball volume at $\varepsilon = d_{s,I,p}(a'_{1:m_1}, c')$ is seen by noting that $\sum_{i=0}^{m_1}\binom{m_1}{i} = 2^{m_1}$ and $\sum_{i\in O}\binom{m_1}{i} = 2^{m_1-1}$. This formula is simpler than the general formula given in Theorem 3.11, and it may be desirable in the context of reducing computational time.

Bandwidth selection
In Rosenblatt (1956), approximations for the bias, the variance, the mean squared error and the mean integrated squared error were derived in the univariate case. It was shown that by minimising the asymptotic mean integrated squared error (AMISE), an optimal bandwidth can be found when $\int_{-\infty}^{\infty} |\mathrm{d}^2 f/\mathrm{d}x^2|^2\,\mathrm{d}x$ exists. It is not possible to make such an elegant argument for pseudometric kernel density estimation. However, an expression for the AMISE of $\hat f_{d_{p1,A,0,p}}$, involving one-dimensional integrals, is derived when f is equal to $f_{\sim d_{p1,A,0,p}}$.
Theorem 3.13 The asymptotic mean squared error $\mathrm{AMSE}(x)$ of $\hat f_{d_{p1,A,0,p}}(x)$ and the AMISE of $\hat f_{d_{p1,A,0,p}}$ are given in terms of $E(\hat f_{d_{p1,A,0,p}}(x))$ and $\sigma^2(\hat f_{d_{p1,A,0,p}}(x))$, which are expressed through one-dimensional integrals.

Experiments
In this section, a few experiments are presented where various PKDE are compared with the standard KDE and the flexible Bayesian estimator. The experiments involve repeatedly drawing an ordered pair (x, X), $n_{rep}$ times. Here, X is a sample of n points that are drawn from a known distribution f, and x is a query point that is drawn randomly from an equivalence class $[(r_i, 0, 0, \ldots, 0)]_{=_f}$. The sequence $(r_i)_{i=1}^{n_{eqc}}$ starts at 0 and is uniformly spaced over the relevant domain of f. For every $r_i$, the mean and the quartiles are computed and compared with the known analytic solution. When plotted as a function of $r_i$, the variances and the biases of the estimators are easily seen.
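A schematic version of this protocol is sketched below, with `estimator`, `sample` and `true_density` as hypothetical callables supplied by the user; the representative point (r, 0, ..., 0) is used for each equivalence class.

```python
# Sketch of the experimental protocol: repeat (draw X, evaluate at query points),
# then summarise the normalised error over the radii r_i.
import numpy as np

def run_protocol(estimator, true_density, sample, radii, n_rep, m, rng):
    errors = np.zeros((len(radii), n_rep))
    for j in range(n_rep):
        X = sample(rng)                             # fresh training sample of size n
        for i, r in enumerate(radii):
            x = np.zeros(m)
            x[0] = r                                # representative of the class [(r, 0, ..., 0)]
            errors[i, j] = estimator(x, X) / true_density(x) - 1.0   # normalised error
    return {"mean": errors.mean(axis=1),
            "q25": np.quantile(errors, 0.25, axis=1),
            "q75": np.quantile(errors, 0.75, axis=1)}
```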
Seven experiments are conducted, where PKDE, KDE and flexible Bayesian are compared on a set of distributions, including the standard normal, the uniform, the one-sided normal and a one-sided normal that is split into two. Variable PKDE and variable KDE are compared only on the standard normal distribution. The parameters are summarised in Tables 2 and 3.

Results
In experiment one (Figure 6), the estimator is shown in various dimensions while the bandwidth is kept constant. The bandwidth is chosen to be 0.5, which is the bandwidth that minimises the AMISE (Equation (12)) in six dimensions. The estimator seems to be unbiased in the tails, while it underestimates near the centre, where the level of underestimation increases with the number of dimensions. The tails seem fairly unbiased, but as we move away from the centre, the mean normalised experimental error starts to fluctuate. From the quartile plots, it is seen that the skewness of the normalised experimental error increases with distance from the centre. In experiment two (Figure 7), we see the effect of increasing and decreasing the bandwidth in six dimensions. When the bandwidth is increased, the underestimation near the centre increases while the variance decreases everywhere. When the bandwidth is decreased, the bias is reduced at the cost of a higher variance.
In experiment three (Figure 8), KDE with a fixed bandwidth and the flexible Bayesian estimator are shown for comparison with the PKDE of experiment one. Since the bandwidth in KDE is the same as for PKDE, the bias and the variance are the same near the centre, while KDE continues to underestimate much further away from the centre. Also, once the analytical graph is crossed, KDE starts to overestimate. Obviously, an optimal bandwidth for KDE is smaller than for PKDE, since the bias in the tail is larger. This underestimation and overestimation pattern is also seen in the flexible Bayesian, but to a lower extent. In the case of the flexible Bayesian, we have chosen a bandwidth so that the variance near the centre is approximately the same as for the KDE and the PKDE. This means that it is easier to compare the biases.
In experiment four (Figure 9), PKDE and KDE with variable bandwidths are shown for comparison. The number of neighbours is chosen so that the variances near the centres are approximately the same as in Figures 6 and 8, where the bandwidths are fixed. It is interesting to see that the variance of the normalised experimental error is fairly stable and not skewed compared to the fixed bandwidth graphs. The bias in the tails is clear in these plots. Notice also that the PKDE with variable bandwidth has a similar, or smaller, variance and bias than the flexible Bayesian inside the shaded area.
In experiment five (Figure 10), PKDE, KDE and the flexible Bayesian are compared on a uniform population. The bandwidth of the flexible Bayesian is adjusted so the variance near the centre is similar for all estimators. Clearly, PKDE has the sharpest border, while KDE is outperformed by the other two. In experiment six (Figure 11), the estimators are evaluated on a one-sided normal distribution. When the bandwidth is half the bandwidth used in experiment one, PKDE performs identically to PKDE in experiment one. KDE, on the other hand, underestimates far more than in the two-sided case in experiment three. This is particularly the case near the centre. The same effect is present for the flexible Bayesian.
In experiment seven (Figure 12), the one-sided normal distribution is split into two equally sized populations. The left population contains points where the radius is less than 2.3126 and the right population contains points where the radius is more than 2.3126. The bandwidth settings were the same as in experiment six, which is probably not optimal for classification. However, it is still clear that PKDE would be favourable in a classification setting. This is because, in the case of PKDE and KDE, there is a trade-off in how well the means are fitted on the two populations: the better the fit on one population, the worse the fit on the other. Here, PKDE fits much better on both populations and is therefore favourable. The flexible Bayesian estimator obviously performs very poorly on the right population, because of the severe violation of the independence assumption.

Discussion of the experiments
KDE and variable KDE are known to be consistent, but in the above experiments the estimators are clearly biased. This is because the sample size is low compared to the number of dimensions. In a practical setting, this is often the case and as a result, nonparametric estimators may suffer from poor statistical power.
In all experiments, the estimators are tested on associated distributions that satisfy the strong condition. In a practical setting, the knowledge about the parameters is less exact, where for instance only the priors are given. In a fair comparison, estimation of parameters should be part of the experiment, but this topic has not been treated in this manuscript.
It is worth noting that PKDE is fairly biased near the centre on the normal distribution. The reason for this is that the radii of the balls with constant volume are larger near the centre. Large contributions from points that are far away (in the pseudometric space), together with a high curvature of the distribution, explain this effect. The underestimation is therefore also dependent on the number of dimensions and the choice of bandwidth. The bias near the centre is not seen in the uniform distribution, because the curvature is 0.
In the case of using a variable bandwidth, it seems that the skewness of the normalised experimental error is reduced everywhere at the cost of overestimation in the tails. The reason for the overestimation in the tails is that a query point that is far away from any point in the sample will always receive contributions because of the increasing bandwidth. When choosing between variable and fixed PKDE (and also KDE), one should consider the need for accuracy in the tails. In the case of classification, it may be desirable to detect very small densities when the query points are far from any of the points in the population.

The above experiments have also revealed instances where KDE and the flexible Bayesian perform poorly. In the case of the uniform population, KDE struggles with the sharp border, which is heavily smoothed to decrease variance. This is a great example where PKDE promises great potential over KDE. Notice also that PKDE performs even better than the flexible Bayesian. In the case of a one-sided population, it is clearly seen that KDE and the flexible Bayesian underestimate severely near the centre and continue to underestimate far into the tails. The reason for this is that the estimators are smoothing over orthants where the distribution is 0 by definition. This does not happen in PKDE, because the volume calculation is only done on the relevant orthant.
The one-sided normal distribution that is split in two is added to symbolise a two-class classification problem of practical interest. The flexible Bayesian struggles with the right population because of the obvious dependency between explanatory variables. KDE performs poorly on the left population as a result of the effect described in experiment six. It also performs poorly on the right population because the boundary is sharp. Clearly, PKDE would perform well if the density estimates were used as inputs to a classification rule.
To understand the full extent of experiment seven, it is important to note that similar results would be evident for a great variety of distributions. Each explanatory variable could be a generalised normal, which could also be either one-sided left, one-sided right or two sided. In addition, the whole distribution could be linearly transformed.
As a concluding remark of the experiments, it is clear that increased statistical power is observed for PKDE with pseudometric of type one, when proper assumptions about the distributions are made. Obviously, if poor assumptions were made, the estimator would suffer from poor robustness. PKDE with a pseudometric of type two has pseudometrics of type one and KDE as special cases. If care is taken when choosing parameters (or priors for parameters), it is therefore possible to increase statistical power at little expense of robustness.

Asymptotic argumentation
The literature on kernel density and kernel regression is full of asymptotic argumentation for any variation of estimators. Description of consistency and rates of convergence started with the early works of Fix and Hodges (1951), Rosenblatt (1956) and Parzen (1962) in the one-dimensional case. Optimal rates of convergence for nonparametric estimators were described in Stone (1980). It is an interesting question whether it is possible to outline an argument for the rates of convergence and the AMISE for the estimators described in this paper, provided that the data sample is drawn from an associated distribution. This seems difficult, since the derivations of the bias and the AMISE in Section 3.4 contain integrals that seem difficult to treat. However, it is possible that there are approximations that have not been seen by the author of this manuscript.

Parameter selection
In kernel density estimation, the parameters, usually the bandwidth matrix, have typically been selected using either plug-in or cross-validation methods. An introduction can be found in Wand and Jones (1994), while more recent articles are given by Hazelton (2003, 2005). In short, the plug-in selector is found by minimising an analytic expression of the AMISE. Since analytical expressions for the AMISE do not exist yet, this path for fitting parameters seems difficult. It is far more likely that a cross-validation algorithm is a better route, because it minimises the mean integrated squared error directly. This is a very important question, but its scope is so large that it is left as an open question in this paper. If we succeed with fitting parameters in the future, the parameters that give the best fit describe the level sets of the estimated density. This is useful for exploring data and interpreting results. As an example, it was enough to describe the density as a function of distances from the centre in the experiments in Section 4.1.
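One concrete cross-validation route, given here only as a sketch, is to maximise a leave-one-out log-likelihood over a grid of bandwidths; `density_loo` is a hypothetical callable that evaluates any of the estimators at $x_i$ with $x_i$ removed from the sample.

```python
# Sketch: leave-one-out likelihood cross-validation over a grid of bandwidths.
import numpy as np

def select_bandwidth(X, bandwidths, density_loo):
    scores = []
    for h in bandwidths:
        # Sum of log leave-one-out densities; the floor avoids log(0).
        loglik = sum(np.log(max(density_loo(i, h, X), 1e-300)) for i in range(len(X)))
        scores.append(loglik)
    return bandwidths[int(np.argmax(scores))]
```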

Kernel regression
The new nonparametric density estimators can be used on various problems. In terms of regression, most kernel smoothing algorithms can be generalised by using pseudometrics because they are essentially derived from the probability densities (Watson 1964). An overview of kernel smoothing algorithms is given by Simonoff (1996) and Wand and Jones (1994). A generalisation of the Nadaraya-Watson kernel-weighted average is proposed as
$$\hat Y(x) = \frac{\sum_{i=1}^{n} K_h\big(e_{d_p}(x, x_i)\big)\, Y(x_i)}{\sum_{i=1}^{n} K_h\big(e_{d_p}(x, x_i)\big)}, \qquad (15)$$
where the $Y(x_i)$'s are the outputs related to the inputs $x_i$'s in the training set. Notice that $\hat Y(x)$ is constant on subsets of the level sets of the estimated density. If the level of smoothing, that is h, is decided, parameters can be found by minimising the squared residuals. In this case, the parameters that give the best fit describe the level sets of $\hat Y(x)$ (and generally not the level sets of the density, since the parameters have been fitted to minimise the residuals). The bandwidth parameter is really a question of how closely one wants to fit the data; obviously, minimising the residuals will favour very small bandwidths. It is left as an open problem how to integrate the bandwidth selection with fitting all the other parameters. A potential path is to penalise roughness of $\hat Y(x)$, similar to, for instance, the method laid out in Hurvich and Simonoff (1998).

The model above fits somewhat into the plethora of restricted nonparametric and semiparametric regression methods that reduce the effect of high dimensionality. The additive model has the form of a linear combination of one-dimensional smoothers. This method is strong when the explanatory variables are independent, while dependencies are only dealt with through the use of the backfitting algorithm. In projection pursuit, the additive model is generalised by allowing a projection of the data matrix of explanatory variables before application of the one-dimensional smoothers. The additive models and the projection pursuit models were first defined in Friedman and Stuetzle (1981), with the functional extension given in Ferraty, Goia, Salinelli, and Vieu (2013). The projection pursuit regressor has in common with Equation (15) that visualisation and interpretation of the whole distribution are possible in lower dimensional spaces. However, the regressors will generally have very different forms, since in projection pursuit the regressor is a sum of one-dimensional curves, while Equation (15) has parts of the level sets described by the pseudometric.
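A minimal sketch of Equation (15) follows, with `e_dp` as a placeholder for the equivalent Euclidean distance of Equation (8) and a normal kernel assumed for the weights; with a Euclidean `e_dp` it reduces to the classical Nadaraya-Watson regressor, which the toy example uses.

```python
# Sketch of Equation (15): kernel-weighted average with pseudometric-based weights.
import numpy as np

def pseudometric_nw(x, X, Y, h, e_dp):
    e = np.array([e_dp(x, xi) for xi in X])     # equivalent Euclidean distances
    w = np.exp(-0.5 * (e / h)**2)               # normal kernel weights (any kernel works)
    return np.sum(w * Y) / np.sum(w)

# Toy example with a Euclidean e_dp, i.e. the classical Nadaraya-Watson regressor.
rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, size=(200, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
euclid = lambda a, b: np.linalg.norm(a - b)
print(pseudometric_nw(np.array([1.0]), X, Y, h=0.3, e_dp=euclid))   # ~ sin(1)
```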
In partially linear models, the explanatory variables are divided into a linear part and a nonparametric part. The general principles are described in the book by Härdle, Liang, and Gao (2000), and the functional extension is described in Aneiros-Pérez and Vieu (2006). This method has less in common with Equation (15), since $\hat Y(x)$ is generally not linearly dependent on any of the explanatory variables.

Classification
The classification rule can use density estimates for each subpopulation (Taylor 1997). More precisely, the classification rule is
$$C(x) = \arg\max_{i}\; \pi_i\, \hat f_{d_i}(x),$$
where the $\pi_i$'s are the prior probabilities and the $\hat f_{d_i}$'s are the density estimates for each subpopulation. Recognise that different pseudometrics can be used for each subpopulation.
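A sketch of this rule, assuming the standard form of maximising prior times estimated density over subpopulations, is given below; the two toy densities are hypothetical stand-ins for any of the estimators in this paper.

```python
# Sketch: assign x to the subpopulation with the largest prior times estimated density.
import numpy as np

def classify(x, priors, density_estimates):
    """priors: list of pi_i; density_estimates: list of callables f_hat_i(x)."""
    scores = [pi * f(x) for pi, f in zip(priors, density_estimates)]
    return int(np.argmax(scores))

# Toy usage with two hypothetical density estimates (bivariate normals).
f0 = lambda x: np.exp(-0.5 * np.sum(x**2)) / (2 * np.pi)
f1 = lambda x: np.exp(-0.5 * np.sum((x - 2.0)**2)) / (2 * np.pi)
print(classify(np.array([1.8, 2.1]), [0.5, 0.5], [f0, f1]))   # -> 1
```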

Statistical learning
Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis as described in Mohri, Rostamizadeh, and Talwalkar (2012). The theory deals with the problem of finding a predictive function based on a set of data. The problems that can be solved through this framework are many, including density estimation, classification and regression as discussed above. Another problem that is worth mentioning is data clustering. It is possible that the proposed estimators can be used in density-based clustering (Kriegel, Kröger, Sander, and Zimek 2011).

Some practical considerations related to training parameters for pseudometric kernel density estimation
Clearly, it is not a trivial challenge to find a recipe for training parameters on a wide range of distributions (or problems). Both searching the parameter space and avoiding overfitting are challenges. Limiting the hypothesis space is an important tool to improve computational time and avoid overfitting. An interesting idea is to use a hypothesis test to find which explanatory variables are dependent on each other and rearrange them so they are grouped together. A semi-naive Bayesian estimator is proposed as
$$\hat f(x) = \prod_{j} \hat f_j\big(x_{o_j+1:o_{j+1}}\big),$$
where each factor $\hat f_j$ is a density estimate over one group of explanatory variables, $(o_j)$ is the sequence of the last element in each group and $o_1$ equals 0.
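A possible reading of this idea is sketched below: group variables by a pairwise dependence test (here a Pearson correlation test, purely as an assumption) and multiply one density estimate per group; the grouping heuristic and `group_density` are illustrative placeholders.

```python
# Sketch of a semi-naive grouping: variables flagged as dependent share a group,
# and the joint density is the product of one estimate per group.
import numpy as np
from scipy.stats import pearsonr

def dependent_groups(X, alpha=0.05):
    """Greedy grouping: column j joins an existing group if correlated with any member."""
    m = X.shape[1]
    groups = []
    for j in range(m):
        for g in groups:
            if any(pearsonr(X[:, j], X[:, k])[1] < alpha for k in g):
                g.append(j)
                break
        else:
            groups.append([j])
    return groups

def semi_naive_density(x, X, groups, group_density):
    """Product over groups of a density estimate on the group's coordinates."""
    return np.prod([group_density(x[g], X[:, g]) for g in groups])
```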
Often, the bottleneck in computational time for kernel density estimation is the training of the parameters. When cross-validation is part of the training, the computational time is dominated by the calculations that are $O(n^2)$. In pseudometric kernel density estimation, the distance, volume and kernel function calculations are all $O(n^2)$. However, all $d_{s,A,p}(x_i, c)$ calculations and all coefficients in the volume polynomials can be precomputed before the cross-validation starts. This means that the new estimators can have a computational performance that is comparable to common KDE.

Conclusion
An idea for optimising kernel density estimation using pseudometrics is described, where assumptions on the kernel of the distribution can be given. In simulations, the proposed estimators show increased statistical power when proper assumptions are made. As a consequence, this paper describes an approach where partial knowledge about the distribution can be used effectively. It is also argued that the new estimators may have potential in statistical learning algorithms such as classification, regression and data clustering.

A.1 Proof of Theorem 3.4
We use that the m-dimensional generalised normal distribution f(x) can be expressed as a function g(t), where $t = \sum_{i=1}^{m}|x_i|^{p_i}$. Since f(x) integrates to 1, a corresponding normalisation also holds for g(t), where the ball with radius t is defined as $\{x \in \mathbb{R}^m \mid \sum_{i=1}^{m}|x_i|^{p_i} \leq t\}$. Let B be a diagonal matrix, where the ith diagonal element is $b_{ii} = t^{-1/p_i}$; then we recognise that B maps this ball onto the generalised unit ball.

A.3 Proof of Theorem 3.7

We recognise that $\mathbb{R}^m/\sim_{d_{p1,A,c,p}} = \big\{\{x \in \mathbb{R}^m \mid d_{s,A,p}(x, c) = \varepsilon\}\ \forall\, \varepsilon \in \mathbb{R}_+\big\}$.
Each equivalence class in $\mathbb{R}^m/\sim_{d_{p1,A,c,p}}$ is therefore uniquely determined by u, thus $f_{\sim d_{p1,A,c,p}}(x) = g(u)$. Moreover, $f_{\sim d_{p1,A,c,p}}(x)$ must integrate to 1, which gives the constraint on g(u) stated in Equation (11). If we substitute for $\beta$, we see that the inner integral is a function of $\alpha$ of the form $\int (\varepsilon^q - \alpha^q)^{i/q}\,\alpha^{m_2-1}\,\mathrm{d}\alpha$.
If we change the variables in the integrals to $\alpha^q/\varepsilon^q$, we see that the integral parts can be expressed with incomplete beta functions.