Approximation by finite mixtures of continuous density functions that vanish at infinity

It is often claimed that, given sufficiently many components, finite mixture models can approximate any other probability density function (pdf) to an arbitrary degree of accuracy. Unfortunately, the nature of this approximation result is often left unclear. We prove that finite mixture models constructed from pdfs in $\mathcal{C}_{0}$ can approximate various classes of approximands in a number of different modes. That is, we prove that approximands in $\mathcal{C}_{0}$ can be uniformly approximated, approximands in $\mathcal{C}_{b}$ can be uniformly approximated on compact sets, and approximands in $\mathcal{L}_{p}$ can be approximated with respect to the $\mathcal{L}_{p}$ norm, for $p\in\left[1,\infty\right)$. Furthermore, we also prove that measurable functions can be approximated almost everywhere.

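1. Introduction. Let $x$ be an element of the Euclidean space defined by $\mathbb{R}^{n}$ and the norm $\|\cdot\|_{2}$, for some $n\in\mathbb{N}$. Let $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ be a function such that $f\ge0$ everywhere and $\int f\,\mathrm{d}\lambda=1$, where $\lambda$ is the Lebesgue measure. We say that $f$ is a probability density function (pdf) on the domain $\mathbb{R}^{n}$ (a qualifier that we drop from here on). Let $g:\mathbb{R}^{n}\rightarrow\mathbb{R}$ be another pdf, and for each $m\in\mathbb{N}$, define the functional class

$$\mathcal{M}_{m}^{g}=\left\{ h:h(x)=\sum_{i=1}^{m}c_{i}\frac{1}{\sigma_{i}^{n}}g\left(\frac{x-\mu_{i}}{\sigma_{i}}\right),\ \mu_{i}\in\mathbb{R}^{n},\ \sigma_{i}\in\mathbb{R}_{+},\ c\in\mathbb{S}^{m-1},\ i\in[m]\right\},$$

where $c^{\top}=(c_{1},\dots,c_{m})$, $\mathbb{S}^{m-1}=\{c\in\mathbb{R}^{m}:\sum_{i=1}^{m}c_{i}=1,\ c_{i}\ge0,\ i\in[m]\}$, $\mathbb{R}_{+}=(0,\infty)$, $[m]=\{1,\dots,m\}$, and $(\cdot)^{\top}$ is the matrix transposition operator. We say that any $h\in\mathcal{M}_{m}^{g}$ is an $m$-component location-scale finite mixture of the pdf $g$.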
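As a concrete illustration of the class $\mathcal{M}_{m}^{g}$, the following minimal sketch evaluates a three-component location-scale mixture in the case $n=1$, with $g$ taken to be the standard normal pdf, and checks numerically that it integrates to approximately one. The function names, the particular weights, locations, and scales, and the trapezoidal integration are illustrative assumptions and are not part of the original text.

```python
import math


def std_normal_pdf(x: float) -> float:
    """Component pdf g: the standard normal density on R (the case n = 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def mixture_pdf(x, weights, locations, scales, g=std_normal_pdf):
    """Evaluate h(x) = sum_i c_i * (1 / sigma_i) * g((x - mu_i) / sigma_i) for n = 1."""
    if any(c < 0 for c in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be non-negative and sum to one")
    if any(s <= 0 for s in scales):
        raise ValueError("scales must be strictly positive")
    return sum(c / s * g((x - mu) / s)
               for c, mu, s in zip(weights, locations, scales))


def trapezoid(func, a: float, b: float, steps: int = 20000) -> float:
    """Composite trapezoidal rule for the integral of func over [a, b]."""
    width = (b - a) / steps
    total = 0.5 * (func(a) + func(b))
    for j in range(1, steps):
        total += func(a + j * width)
    return total * width


# Hypothetical parameters, chosen only for illustration.
weights, locations, scales = [0.3, 0.5, 0.2], [-2.0, 0.0, 3.0], [0.5, 1.0, 2.0]
mass = trapezoid(lambda x: mixture_pdf(x, weights, locations, scales), -30.0, 30.0)
print(f"integral of h over [-30, 30] = {mass:.6f}")
```

Because the weights lie in the simplex and each rescaled component integrates to one, any such $h$ is again a pdf, which is what the numerical check reflects.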
Much of the popularity of finite mixture models stems from the folk theorem, which states that for any density $f$, there exists an $h\in\mathcal{M}_{m}^{g}$, for some sufficiently large number of components $m\in\mathbb{N}$, such that $h$ approximates $f$ arbitrarily closely, in some sense. Examples of this folk theorem come in statements such as: "provided the number of component densities is not bounded above, certain forms of mixture can be used to provide arbitrarily close approximation to a given probability distribution" (Titterington, Smith and Makov, 1985, p. 50), "the [mixture] model forms can fit any distribution and significantly increase model fit" (Walker and Ben-Akiva, 2011, p. 173), and "a mixture model can approximate almost any distribution" (Yona, 2011, p. 500). Other statements conveying the same sentiment are reported in Nguyen and McLachlan (2019). The reported statements are vague, and little is ever made clear regarding the technical nature of the folk theorem.
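In order to proceed, we require the following definitions. We say that $f$ is compactly supported on $K\subset\mathbb{R}^{n}$ if $K$ is compact and if $\mathbf{1}_{K^{\complement}}f=0$, where $\mathbf{1}_{X}$ is the indicator function that takes value 1 when $x\in X$ and 0 elsewhere, and $(\cdot)^{\complement}$ is the set complement operator (i.e., $X^{\complement}=\mathbb{R}^{n}\setminus X$). Here, $X$ is a generic subset of $\mathbb{R}^{n}$. Furthermore, we say that $f\in\mathcal{L}_{p}(X)$ for $1\le p<\infty$ if

$$\|f\|_{\mathcal{L}_{p}(X)}=\left(\int\mathbf{1}_{X}|f|^{p}\,\mathrm{d}\lambda\right)^{1/p}<\infty,$$

and for $p=\infty$, if

$$\|f\|_{\mathcal{L}_{\infty}(X)}=\inf\left\{ a\ge0:\lambda\left(\{x\in X:|f(x)|>a\}\right)=0\right\} <\infty,$$

where we call $\|\cdot\|_{\mathcal{L}_{p}(X)}$ the $\mathcal{L}_{p}$-norm on $X$. When $X=\mathbb{R}^{n}$, we shall write $\|\cdot\|_{\mathcal{L}_{p}(\mathbb{R}^{n})}=\|\cdot\|_{\mathcal{L}_{p}}$. In addition, we define the so-called Kullback-Leibler divergence, see Kullback and Leibler (1951), between any two pdfs $f$ and $g$ on $X$ as

$$\mathrm{KL}_{X}(f,g)=\int\mathbf{1}_{X}f\log\left(\frac{f}{g}\right)\mathrm{d}\lambda.$$

In Nguyen and McLachlan (2019), the approximation of pdfs $f$ by the class $\mathcal{M}_{m}^{g}$ was explored in a restrictive setting. Let $\{h_{m}^{g}\}$ be a sequence of functions that draw elements from the nested sequence of sets $\{\mathcal{M}_{m}^{g}\}$ (i.e., $h_{1}^{g}\in\mathcal{M}_{1}^{g}$, $h_{2}^{g}\in\mathcal{M}_{2}^{g}$, and so on). The following result of Zeevi and Meir (1997) was presented in Nguyen and McLachlan (2019), along with a collection of its implications, such as the results from Li and Barron (1999) and Rakhlin, Panchenko and Mukherjee (2005).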
Theorem 1 (Zeevi and Meir, 1997). If $f$ and $g$ are pdfs, $K$ is compact, $f\in\mathcal{L}_{2}(K)$, and $f\ge\beta$ on $K$ for some $\beta>0$, then there exists a sequence $\{h_{m}^{g}\}$ such that

$$\lim_{m\rightarrow\infty}\|f-h_{m}^{g}\|_{\mathcal{L}_{2}(K)}=0\quad\text{and}\quad\lim_{m\rightarrow\infty}\mathrm{KL}_{K}(f,h_{m}^{g})=0.$$

Although powerful, this result is restrictive in the sense that it only permits approximation in the $\mathcal{L}_{2}$ norm on compact sets $K$, and only allows for approximation of functions $f$ that are strictly positive on $K$. In general, other modes of approximation are desirable. In particular, approximation in the $\mathcal{L}_{p}$-norm for $p=1$ or $p=\infty$ is of interest, where the latter case is generally referred to as uniform approximation. Furthermore, the strict-positivity assumption and the restriction to compact sets limit the scope of applicability of Theorem 1. An example of an interesting application of extensions beyond Theorem 1 is within the $\mathcal{L}_{1}$-norm approximation framework of Devroye and Lugosi (2000).
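Let $g:\mathbb{R}^{n}\rightarrow\mathbb{R}$ again be a pdf. Then, for each $m\in\mathbb{N}$, we define

$$\mathcal{N}_{m}^{g}=\left\{ \eta:\eta(x)=\sum_{i=1}^{m}c_{i}\frac{1}{\sigma_{i}^{n}}g\left(\frac{x-\mu_{i}}{\sigma_{i}}\right),\ \mu_{i}\in\mathbb{R}^{n},\ \sigma_{i}\in\mathbb{R}_{+},\ c\in\mathbb{R}^{m},\ i\in[m]\right\},$$

which we call the set of $m$-component location-scale linear combinations of the pdf $g$. In the past, results regarding approximations of pdfs $f$ via functions $\eta\in\mathcal{N}_{m}^{g}$ have been more forthcoming. For example, consider the case of $g=\phi$, where

$$\phi(x)=(2\pi)^{-n/2}\exp\left(-\|x\|_{2}^{2}/2\right)\tag{1}$$

is the standard normal pdf. Denote the class of continuous functions with support on $\mathbb{R}^{n}$ by $\mathcal{C}$. We have the result that for every pdf $f\in\mathcal{C}$, compact set $K\subset\mathbb{R}^{n}$, and $\epsilon>0$, there exists an $m\in\mathbb{N}$ and $h\in\mathcal{N}_{m}^{\phi}$, such that $\|f-h\|_{\mathcal{L}_{\infty}(K)}<\epsilon$ (Sandberg, 2001, Lem. 1). Furthermore, upon defining the set of continuous functions that vanish at infinity by

$$\mathcal{C}_{0}=\left\{ f\in\mathcal{C}:\forall\epsilon>0,\ \exists\text{ a compact }K\subset\mathbb{R}^{n}\text{ such that }\|f\|_{\mathcal{L}_{\infty}(K^{\complement})}<\epsilon\right\},$$

we also have the result: for every pdf $f\in\mathcal{C}_{0}$ and $\epsilon>0$, there exists an $m\in\mathbb{N}$ and $h\in\mathcal{N}_{m}^{\phi}$, such that $\|f-h\|_{\mathcal{L}_{\infty}}<\epsilon$ (Sandberg, 2001, Thm. 2). Both of the results from Sandberg (2001) are simple implications of the famous Stone-Weierstrass theorem (cf. Stone, 1948 and De Branges, 1959).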
To the best of our knowledge, the strongest available claim that is made regarding the folk theorem, within a probabilistic or statistical context, is that of DasGupta (2008, Thm. 33.2). Let $\{\eta_{m}^{g}\}$ be a sequence of functions that draw elements from the nested sequence of sets $\{\mathcal{N}_{m}^{g}\}$, in the same manner as $\{h_{m}^{g}\}$. We paraphrase the claim without loss of fidelity, as follows.
Claim 1. If $f,g\in\mathcal{C}$ are pdfs and $K\subset\mathbb{R}^{n}$ is compact, then there exists a sequence $\{\eta_{m}^{g}\}$, such that

$$\lim_{m\rightarrow\infty}\|f-\eta_{m}^{g}\|_{\mathcal{L}_{\infty}(K)}=0.$$

Unfortunately, the proof of Claim 1 is not provided within DasGupta (2008). The only reference for the result is to an undisclosed location in Cheney and Light (2000), which, upon investigation, can be inferred to be Theorem 5 of Cheney and Light (2000, Ch. 20). It is further notable that there is no proof provided for the theorem. Instead, it is stated that the proof is similar to that of Theorem 1 in Cheney and Light (2000, Ch. 24), which is a reproduction of the proof for Xu, Light and Cheney (1993, Lem. 3.1).
There is a major problem in applying the proof technique of Xu, Light and Cheney (1993, Lem. 3.1) in order to prove Claim 1. The proof of Xu, Light and Cheney (1993, Lem. 3.1) critically depends upon the statement that "there is no loss of generality in assuming that $f(x)=0$ for $x\in\mathbb{R}^{n}\setminus2K$". Here, for $a\in\mathbb{R}_{+}$, $aK=\{x\in\mathbb{R}^{n}:x=ay,\ y\in K\}$. The assumption is necessary in order to write any convolution with $f$ and an arbitrary continuous function as an integral over a compact domain, and then to use a Riemann sum to approximate such an integral. Consequently, such a proof technique does not work outside the class of continuous functions that are compactly supported on $aK$. Thus, one cannot verify Claim 1 from the materials of Xu, Light and Cheney (1993), Cheney and Light (2000), and DasGupta (2008) alone.
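Some recent results in the spirit of Claim 1 have been obtained by Nestoridis and Stefanopoulos (2007) and Nestoridis, Schmutzhard and Stefanopoulos (2011), using methods from the study of universal series (see, e.g., Nestoridis and Papadimitropoulos, 2005). Let

$$\mathcal{W}=\left\{ f\in\mathcal{C}_{0}:\sum_{y\in\mathbb{Z}^{n}}\sup_{x\in[0,1]^{n}}|f(x+y)|<\infty\right\}$$

denote the so-called Wiener's algebra (see, e.g., Feichtinger, 1977) and let

$$\mathcal{V}=\left\{ f\in\mathcal{C}:|f(x)|\le\beta\left(1+\|x\|_{2}\right)^{-n-\theta}\text{ for all }x\in\mathbb{R}^{n},\text{ for some }\beta,\theta\in\mathbb{R}_{+}\right\}$$

be a class of functions with tails decaying at a faster rate than $o(\|x\|_{2}^{-n})$. In Nestoridis, Schmutzhard and Stefanopoulos (2011), it is noted that $\mathcal{V}\subset\mathcal{W}$. Further, let

$$\mathcal{C}_{c}=\left\{ f\in\mathcal{C}:\exists\text{ a compact }K\subset\mathbb{R}^{n}\text{ such that }\|f\|_{\mathcal{L}_{\infty}(K^{\complement})}=0\right\}$$

denote the set of compactly supported continuous functions. The following theorem was proved in Nestoridis and Stefanopoulos (2007).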

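Theorem 2 (Nestoridis and Stefanopoulos, 2007). If $g\in\mathcal{V}$, then the following statements are true.
(a) For any $f\in\mathcal{C}_{c}$, there exists a sequence $\{\eta_{m}^{g}\}$, such that $\lim_{m\rightarrow\infty}\left(\|f-\eta_{m}^{g}\|_{\mathcal{L}_{1}}+\|f-\eta_{m}^{g}\|_{\mathcal{L}_{\infty}}\right)=0$.
(b) For any $f\in\mathcal{C}_{0}$, there exists a sequence $\{\eta_{m}^{g}\}$, such that $\lim_{m\rightarrow\infty}\|f-\eta_{m}^{g}\|_{\mathcal{L}_{\infty}}=0$.
(c) For any $1\le p<\infty$ and $f\in\mathcal{L}_{p}$, there exists a sequence $\{\eta_{m}^{g}\}$, such that $\lim_{m\rightarrow\infty}\|f-\eta_{m}^{g}\|_{\mathcal{L}_{p}}=0$.
(d) For any measurable $f$, there exists a sequence $\{\eta_{m}^{g}\}$, such that $\lim_{m\rightarrow\infty}\eta_{m}^{g}=f$, almost everywhere.
(e) If $\nu$ is a $\sigma$-finite Borel measure on $\mathbb{R}^{n}$, then for any $\nu$-measurable $f$, there exists a sequence $\{\eta_{m}^{g}\}$, such that $\lim_{m\rightarrow\infty}\eta_{m}^{g}=f$, almost everywhere, with respect to $\nu$.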
The result was then improved upon in Nestoridis, Schmutzhard and Stefanopoulos (2011), whereupon the more general space $\mathcal{W}$ was taken as a replacement for $\mathcal{V}$ in Theorem 2. Denote the class of bounded continuous functions by $\mathcal{C}_{b}=\mathcal{C}\cap\mathcal{L}_{\infty}$. The following theorem was proved in Nestoridis, Schmutzhard and Stefanopoulos (2011).

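Theorem 3 (Nestoridis, Schmutzhard and Stefanopoulos, 2011). If $g\in\mathcal{W}$, then the following statements are true.
(a) The conclusion of Theorem 2(a) holds, with $\mathcal{C}_{c}$ replaced by $\mathcal{C}_{0}\cap\mathcal{L}_{1}$.
(b) The conclusions of Theorem 2(b)-(e) hold.
(c) For any $f\in\mathcal{C}_{b}$ and compact $K\subset\mathbb{R}^{n}$, there exists a sequence $\{\eta_{m}^{g}\}$, such that $\lim_{m\rightarrow\infty}\|f-\eta_{m}^{g}\|_{\mathcal{L}_{\infty}(K)}=0$.

Utilizing the techniques from Nestoridis and Stefanopoulos (2007), Bacharoglou (2010) proved a similar set of results to Theorem 2, under the restriction that $f$ is a non-negative function with support $\mathbb{R}$, using $g=\phi$ (i.e., $g$ has form (1), where $n=1$) and taking $\{h_{m}^{\phi}\}$ as the approximating sequence, instead of $\{\eta_{m}^{g}\}$. That is, the following result is obtained.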

Theorem 4 (Bacharoglou, 2010). Let $n=1$ and let $f$ be non-negative. Then the following statements are true.
(a) For any pdf $f\in\mathcal{C}_{c}$, there exists a sequence $\{h_{m}^{\phi}\}$, such that $\lim_{m\rightarrow\infty}\left(\|f-h_{m}^{\phi}\|_{\mathcal{L}_{1}}+\|f-h_{m}^{\phi}\|_{\mathcal{L}_{\infty}}\right)=0$.
Furthermore, for any measurable $f$, there exists a sequence $\{h_{m}^{\phi}\}$, such that $\lim_{m\rightarrow\infty}h_{m}^{\phi}=f$, almost everywhere, and for any pdf $f\in\mathcal{C}$, there exists a sequence $\{h_{m}^{\phi}\}$, such that $\lim_{m\rightarrow\infty}\|f-h_{m}^{\phi}\|_{\mathcal{L}_{1}}=0$.

To the best of our knowledge, Theorem 4 is the most complete characterization of the approximating capabilities of the mixture of normal distributions. However, it is restrictive in two ways. First, it does not permit characterization of approximation via the class $\mathcal{M}_{m}^{g}$ for any $g$ except the normal pdf $\phi$. Although $\phi$ is traditionally the most common choice for $g$ in practice, the modern mixture model literature has seen the use of many more exotic component pdfs, such as the Student-t pdf and its skew and modified variants (see, e.g., Peel and McLachlan, 2000, Forbes and Wraith, 2013, and Lee and McLachlan, 2016). Thus, its use is somewhat limited in the modern context. Second, modern applications tend to call for $n>1$, further restricting the impact of the result as a theoretical bulwark for finite mixture modeling in practice. A remark in Bacharoglou (2010) states that the result can be generalized to the case where $g\in\mathcal{V}$ instead of $g=\phi$. However, no suggestions were proposed regarding the generalization of Theorem 4 to the case of $n>1$.
In this article, we prove a novel set of results that largely generalize Theorem 4. Using techniques inspired by Donahue et al. (1997) and Cheney and Light (2000), we are able to obtain a set of results regarding the approximation capability of the class of $m$-component mixture models $\mathcal{M}_{m}^{g}$, when $g\in\mathcal{C}_{0}$ or $g\in\mathcal{V}$, and for any $n\in\mathbb{N}$. By definition of $\mathcal{V}$, the majority of our results extend beyond the proposed possible generalizations of Theorem 4. The remainder of the article is devoted to proving the following theorem.
Theorem 5 (Main result). If we assume that $f$ and $g$ are pdfs and that $g\in\mathcal{C}_{0}$, then the following statements are true.

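(a) For any $f\in\mathcal{C}_{0}$, there exists a sequence $\{h_{m}^{g}\}$, such that $\lim_{m\rightarrow\infty}\|f-h_{m}^{g}\|_{\mathcal{L}_{\infty}}=0$.
(b) For any $f\in\mathcal{C}_{b}$ and compact $K\subset\mathbb{R}^{n}$, there exists a sequence $\{h_{m}^{g}\}$, such that $\lim_{m\rightarrow\infty}\|f-h_{m}^{g}\|_{\mathcal{L}_{\infty}(K)}=0$.
(c) For any $1<p<\infty$ and $f\in\mathcal{L}_{p}$, there exists a sequence $\{h_{m}^{g}\}$, such that $\lim_{m\rightarrow\infty}\|f-h_{m}^{g}\|_{\mathcal{L}_{p}}=0$.
(d) For any measurable $f$, there exists a sequence $\{h_{m}^{g}\}$, such that $\lim_{m\rightarrow\infty}h_{m}^{g}=f$, almost everywhere.
(e) If $\nu$ is a $\sigma$-finite Borel measure on $\mathbb{R}^{n}$, then for any $\nu$-measurable $f$, there exists a sequence $\{h_{m}^{g}\}$, such that $\lim_{m\rightarrow\infty}h_{m}^{g}=f$, almost everywhere, with respect to $\nu$.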
If we assume instead that g ∈ V, then the following statement is also true.
(f) For any $f\in\mathcal{C}$, there exists a sequence $\{h_{m}^{g}\}$, such that $\lim_{m\rightarrow\infty}\|f-h_{m}^{g}\|_{\mathcal{L}_{1}}=0$.

The article proceeds as follows. The separate parts of Theorem 5 are proved in the subsections of Section 2. Comments and discussion are provided in Section 3. Necessary technical lemmas and results are also included, for reference, in Appendix A.

2. Main result.
2.1. Technical preliminaries. Before we begin to present the main theorem, we establish some technical results regarding our class of component densities $\mathcal{C}_{0}$. Let $f,g\in\mathcal{L}_{1}$ and denote the convolution of $f$ and $g$ by $f\star g=g\star f$. Further, we denote the sequence of dilates of $g$ by $\{g_{k}:g_{k}(x)=k^{n}g(kx),\ k\in\mathbb{N}\}$. The following result is an alternative to Lemma 5 and Corollary 1. Here, we replace the boundedness assumption on the approximand in the aforementioned lemma by a vanishing-at-infinity assumption.
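Lemma 1. Let $g$ be a pdf and $f\in\mathcal{C}_{0}$, such that $\|f\|_{\mathcal{L}_{\infty}}>0$. Then,

$$\lim_{k\rightarrow\infty}\|g_{k}\star f-f\|_{\mathcal{L}_{\infty}}=0.$$

Proof. It suffices to show that for any $\epsilon>0$, there exists a $k(\epsilon)\in\mathbb{N}$, such that $\|g_{k}\star f-f\|_{\mathcal{L}_{\infty}}<\epsilon$, for all $k\ge k(\epsilon)$. By Lemma 6, $f\in\mathcal{C}_{b}$, and thus $\|f\|_{\mathcal{L}_{\infty}}<\infty$. By making the substitution $z=kx$, we obtain, for any $\delta>0$,

$$\int\mathbf{1}_{\{x:\|x\|_{2}>\delta\}}g_{k}\,\mathrm{d}\lambda=\int\mathbf{1}_{\{z:\|z\|_{2}>k\delta\}}g\,\mathrm{d}\lambda$$

for each $k$. By Corollary 1, we obtain $\lim_{k\rightarrow\infty}\int\mathbf{1}_{\{x:\|x\|_{2}>\delta\}}g_{k}\,\mathrm{d}\lambda=0$ and thus we can choose a $k(\epsilon)$, such that for all $k\ge k(\epsilon)$,

$$\int\mathbf{1}_{\{x:\|x\|_{2}>\delta\}}g_{k}\,\mathrm{d}\lambda<\frac{\epsilon}{4\|f\|_{\mathcal{L}_{\infty}}}.$$

Since $g$ is a pdf, we have

$$\left|(g_{k}\star f)(x)-f(x)\right|=\left|\int g_{k}(y)\left[f(x-y)-f(x)\right]\mathrm{d}\lambda(y)\right|\le\int\mathbf{1}_{\{\|y\|_{2}\le\delta\}}g_{k}(y)\left|f(x-y)-f(x)\right|\mathrm{d}\lambda(y)+\int\mathbf{1}_{\{\|y\|_{2}>\delta\}}g_{k}(y)\left|f(x-y)-f(x)\right|\mathrm{d}\lambda(y).$$

By uniform continuity, for any $\epsilon>0$, there exists a $\delta(\epsilon)>0$ such that $|f(x-y)-f(x)|<\epsilon/2$, for any $x,y\in\mathbb{R}^{n}$, such that $\|y\|_{2}<\delta(\epsilon)$ (Lemma 6). Thus, on the one hand, for any $\delta(\epsilon)$, we can pick a $k(\epsilon)$ such that

$$\int\mathbf{1}_{\{\|y\|_{2}>\delta(\epsilon)\}}g_{k}(y)\left|f(x-y)-f(x)\right|\mathrm{d}\lambda(y)\le2\|f\|_{\mathcal{L}_{\infty}}\int\mathbf{1}_{\{\|y\|_{2}>\delta(\epsilon)\}}g_{k}\,\mathrm{d}\lambda<\frac{\epsilon}{2},\tag{2}$$

and on the other hand

$$\int\mathbf{1}_{\{\|y\|_{2}\le\delta(\epsilon)\}}g_{k}(y)\left|f(x-y)-f(x)\right|\mathrm{d}\lambda(y)\le\frac{\epsilon}{2}\int g_{k}\,\mathrm{d}\lambda=\frac{\epsilon}{2},\tag{3}$$

where continuity of $f$ extends the bound $|f(x-y)-f(x)|\le\epsilon/2$ to $\|y\|_{2}=\delta(\epsilon)$. The proof is completed by summing (2) and (3).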
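The behaviour described by Lemma 1 can be checked numerically in a simple case. The sketch below is an illustration under stated assumptions and is not taken from the original text: it uses $n=1$, takes $g$ to be the standard normal pdf and $f$ to be the Laplace pdf (which is continuous and vanishes at infinity), evaluates $(g_{k}\star f)(x)=\int g(z)f(x-z/k)\,\mathrm{d}z$ by a truncated trapezoidal rule, and reports the largest error over a finite grid as a proxy for the supremum norm.

```python
import math


def g(z: float) -> float:
    """Component pdf: standard normal density (n = 1)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def f(x: float) -> float:
    """Approximand: Laplace density, continuous and vanishing at infinity."""
    return 0.5 * math.exp(-abs(x))


# Trapezoidal nodes on [-8, 8]; the Gaussian mass outside is negligible.
HALF_WIDTH, STEPS = 8.0, 1600
STEP = 2.0 * HALF_WIDTH / STEPS
NODES = []
for j in range(STEPS + 1):
    z = -HALF_WIDTH + j * STEP
    end_weight = 0.5 if j in (0, STEPS) else 1.0
    NODES.append((z, end_weight * g(z) * STEP))


def smoothed(x: float, k: float) -> float:
    """(g_k * f)(x) = int g(z) f(x - z / k) dz, after substituting z = k * y."""
    return sum(w * f(x - z / k) for z, w in NODES)


def sup_error(k: float, grid) -> float:
    """Largest |g_k * f - f| over the grid, a proxy for the sup-norm error."""
    return max(abs(smoothed(x, k) - f(x)) for x in grid)


grid = [i / 4.0 for i in range(-40, 41)]  # includes the kink of f at x = 0
errors = {k: sup_error(k, grid) for k in (1, 2, 4, 8)}
for k, err in errors.items():
    print(f"k = {k}: max grid error = {err:.4f}")
```

The grid maximum shrinks as $k$ grows, which is consistent with the uniform convergence asserted by the lemma; a finite grid can only suggest, not prove, a supremum-norm statement.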
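Lemma 2. If $f\in\mathcal{C}_{0}$ is such that $f\ge0$, and $\epsilon>0$, then there exists an $h\in\mathcal{C}_{c}$, such that $0\le h\le f$, and

$$\|f-h\|_{\mathcal{L}_{\infty}}<\epsilon.$$

Proof. Since $f\in\mathcal{C}_{0}$, there exists a compact $K\subset\mathbb{R}^{n}$ such that $\|f\|_{\mathcal{L}_{\infty}(K^{\complement})}<\epsilon/2$. By Lemma 7, there exists some $g\in\mathcal{C}_{c}$, such that $0\le g\le1$ and $\mathbf{1}_{K}g=\mathbf{1}_{K}$. Let $h=gf$, which implies that $0\le h\le f$. Furthermore, notice that $\mathbf{1}_{K}(f-h)=0$ and $\|h\|_{\mathcal{L}_{\infty}}\le\|f\|_{\mathcal{L}_{\infty}}$, by construction. The proof is completed by observing that

$$\|f-h\|_{\mathcal{L}_{\infty}}=\|f-h\|_{\mathcal{L}_{\infty}(K^{\complement})}\le\|f\|_{\mathcal{L}_{\infty}(K^{\complement})}+\|h\|_{\mathcal{L}_{\infty}(K^{\complement})}\le2\|f\|_{\mathcal{L}_{\infty}(K^{\complement})}<\epsilon.$$

For any $\delta>0$ and uniformly continuous function $f$, let

$$w(f,\delta)=\sup\left\{ |f(x)-f(y)|:x,y\in\mathbb{R}^{n},\ \|x-y\|_{2}\le\delta\right\}$$

denote the modulus of continuity of $f$. Furthermore, define the diameter of a set $X\subset\mathbb{R}^{n}$ by $\mathrm{diam}(X)=\sup_{x,y\in X}\|x-y\|_{2}$ and denote an open ball, centered at $x\in\mathbb{R}^{n}$ with radius $r>0$, by $B(x,r)=\{y\in\mathbb{R}^{n}:\|x-y\|_{2}<r\}$.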
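The truncation used in the proof of Lemma 2 can be sketched as follows for $n=1$. This is a hedged illustration, not the authors' construction: the approximand is assumed to be the Laplace pdf, the compact set is taken as $K=[-R,R]$ with the hypothetical choice $R=\log(2/\epsilon)$, and the cutoff function is a piecewise-linear stand-in for the function supplied by Lemma 7.

```python
import math


def f(x: float) -> float:
    """Approximand: Laplace density, non-negative, continuous, vanishing at infinity."""
    return 0.5 * math.exp(-abs(x))


def cutoff(x: float, radius: float) -> float:
    """Continuous cutoff: equals 1 on [-radius, radius] and 0 outside [-radius - 1, radius + 1]."""
    return min(1.0, max(0.0, radius + 1.0 - abs(x)))


def truncate(eps: float):
    """Return (h, radius), where h = cutoff * f is continuous and compactly supported."""
    # For the Laplace density, f(x) <= eps / 4 < eps / 2 whenever |x| >= log(2 / eps).
    radius = max(0.0, math.log(2.0 / eps))
    return (lambda x: cutoff(x, radius) * f(x)), radius


eps = 0.01
h, radius = truncate(eps)
grid = [i / 100.0 for i in range(-2000, 2001)]
max_gap = max(abs(f(x) - h(x)) for x in grid)
print(f"radius = {radius:.3f}, max |f - h| on grid = {max_gap:.5f}")
```

On the grid, $h$ stays between $0$ and $f$, vanishes outside $[-R-1,R+1]$, and differs from $f$ by less than $\epsilon$, mirroring the three properties the lemma requires.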
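Notice that the class $\mathcal{M}_{m}^{g}$ can be parameterized as

$$\mathcal{M}_{m}^{g}=\left\{ h:h(x)=\sum_{i=1}^{m}c_{i}k_{i}^{n}g\left(k_{i}x-z_{i}\right),\ z_{i}\in\mathbb{R}^{n},\ k_{i}\in\mathbb{R}_{+},\ c\in\mathbb{S}^{m-1},\ i\in[m]\right\},$$

where $k_{i}=1/\sigma_{i}$ and $z_{i}=\mu_{i}/\sigma_{i}$. The following result is the primary mechanism that permits us to construct finite mixture approximations for convolutions of form $g_{k}\star f$. The argument is motivated by the approaches taken in Theorem 1 in Cheney and Light (2000, Ch. 24), Nestoridis and Stefanopoulos (2007, Lem. 3.1), and Nestoridis, Schmutzhard and Stefanopoulos (2011, Thm. 3.1).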
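Lemma 3. Let $f\in\mathcal{C}$ and $g\in\mathcal{C}_{0}$ be pdfs. Furthermore, let $K\subset\mathbb{R}^{n}$ be compact and $h\in\mathcal{C}_{c}$, where $\mathbf{1}_{K^{\complement}}h=0$ and $0\le h\le f$. Then for any $k\in\mathbb{N}$, there exists a sequence $\{h_{m}^{g}\}$, such that

$$\lim_{m\rightarrow\infty}\|g_{k}\star h-h_{m}^{g}\|_{\mathcal{L}_{\infty}}=0.$$

Proof. It suffices to show that for any $k\in\mathbb{N}$ and $\epsilon>0$, there exists a sufficiently large $m(\epsilon)\in\mathbb{N}$ so that for all $m\ge m(\epsilon)$, there is an $h_{m}^{g}\in\mathcal{M}_{m}^{g}$ such that

$$\|g_{k}\star h-h_{m}^{g}\|_{\mathcal{L}_{\infty}}<\epsilon.\tag{4}$$

For any $k\in\mathbb{N}$, we can write

$$(g_{k}\star h)(x)=\int g_{k}(x-y)h(y)\,\mathrm{d}\lambda(y)=\int\mathbf{1}_{\{y\in K\}}k^{n}g(kx-ky)h(y)\,\mathrm{d}\lambda(y)=\int\mathbf{1}_{\{z\in kK\}}g(kx-z)h(z/k)\,\mathrm{d}\lambda(z).$$

Here, $kK$ is a continuous image of a compact set, and hence is compact (cf. Rudin, 1976, Thm. 4.14). By Lemma 8, for any $\delta>0$, there exist $\kappa_{i}\in\mathbb{R}^{n}$ ($i\in[m-1]$, for some $m\in\mathbb{N}$), such that $kK\subset\bigcup_{i=1}^{m-1}B(\kappa_{i},\delta/2)$. Writing $B_{i}^{\delta}=kK\cap B(\kappa_{i},\delta/2)$, we can obtain a disjoint covering of $kK$ by taking $A_{1}^{\delta}=B_{1}^{\delta}$ and $A_{i}^{\delta}=B_{i}^{\delta}\setminus\bigcup_{j=1}^{i-1}B_{j}^{\delta}$ (for $i\in[m-1]\setminus\{1\}$), and noting that $kK=\bigcup_{i=1}^{m-1}A_{i}^{\delta}$, by construction (cf. Cheney and Light, 2000, Ch. 24). Furthermore, each $A_{i}^{\delta}$ is a Borel set and $\mathrm{diam}(A_{i}^{\delta})\le\delta$. Let $\Pi_{m}^{\delta}=\{A_{1}^{\delta},\dots,A_{m-1}^{\delta}\}$ denote the disjoint covering, or partition, of $kK$. We seek to show that there exists an $m\in\mathbb{N}$ and $\Pi_{m}^{\delta}$, such that

$$\left\Vert g_{k}\star h-\sum_{i=1}^{m}c_{i}k_{i}^{n}g(k_{i}x-z_{i})\right\Vert _{\mathcal{L}_{\infty}}<\epsilon,$$

where $k_{i}=k$, $c_{i}=k^{-n}\int\mathbf{1}_{\{z\in A_{i}^{\delta}\}}h(z/k)\,\mathrm{d}\lambda(z)$, and $z_{i}\in A_{i}^{\delta}$, for $i\in[m-1]$. Further, $z_{m}=0$, $c_{m}=1-\sum_{i=1}^{m-1}c_{i}$, and $k_{m}\in\mathbb{R}_{+}$ is chosen small enough that $c_{m}k_{m}^{n}\|g\|_{\mathcal{L}_{\infty}}<\epsilon/2$, which is possible since $\|g\|_{\mathcal{L}_{\infty}}<\infty$ (Lemma 6). Since $c_{i}\ge0$ and $\sum_{i=1}^{m-1}c_{i}=\int\mathbf{1}_{K}h\,\mathrm{d}\lambda\le\int f\,\mathrm{d}\lambda=1$, we have $0\le c_{m}\le1$. Thus, $0\le c_{m}\le1$, and our construction implies that $h_{m}^{g}\in\mathcal{M}_{m}^{g}$, where $h_{m}^{g}(x)=\sum_{i=1}^{m}c_{i}k_{i}^{n}g(k_{i}x-z_{i})$.

We can bound the left-hand side of (4) as follows:

$$\|g_{k}\star h-h_{m}^{g}\|_{\mathcal{L}_{\infty}}\le\left\Vert g_{k}\star h-\sum_{i=1}^{m-1}c_{i}k^{n}g(kx-z_{i})\right\Vert _{\mathcal{L}_{\infty}}+c_{m}k_{m}^{n}\|g\|_{\mathcal{L}_{\infty}}<\sup_{x\in\mathbb{R}^{n}}\sum_{i=1}^{m-1}\int\mathbf{1}_{\{z\in A_{i}^{\delta}\}}\left|g(kx-z)-g(kx-z_{i})\right|h(z/k)\,\mathrm{d}\lambda(z)+\frac{\epsilon}{2}.\tag{5}$$

Since $g\in\mathcal{C}_{0}$ is uniformly continuous (Lemma 6), $\lim_{\delta\rightarrow0}w(g,\delta)=0$ (cf. Makarov and Podkorytov, 2013, Thm. 4.7.3), and we may choose a $\delta(\epsilon)>0$ so that $w(g,\delta(\epsilon))<\epsilon/(2k^{n})$. We may proceed from (5) as follows:

$$\sup_{x\in\mathbb{R}^{n}}\sum_{i=1}^{m-1}\int\mathbf{1}_{\{z\in A_{i}^{\delta(\epsilon)}\}}\left|g(kx-z)-g(kx-z_{i})\right|h(z/k)\,\mathrm{d}\lambda(z)\le w(g,\delta(\epsilon))\int\mathbf{1}_{\{z\in kK\}}h(z/k)\,\mathrm{d}\lambda(z)=w(g,\delta(\epsilon))\,k^{n}\int\mathbf{1}_{K}h\,\mathrm{d}\lambda\le k^{n}w(g,\delta(\epsilon))<\frac{\epsilon}{2}.\tag{6}$$

To conclude the proof, it suffices to choose an appropriate sequence of partitions $\Pi_{m}^{\delta(\epsilon)}$, $m\ge m(\epsilon)$, for some large but finite $m(\epsilon)$, so that (5) and (6) hold, which is possible by Lemma 8.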
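The Riemann-sum construction behind Lemma 3 can be sketched numerically for $n=1$. Everything specific in the sketch is an assumption made for illustration: $g$ is the standard normal pdf, $h=f$ is the triangular pdf on $K=[-1,1]$, the partition of $kK$ consists of equal-width intervals, and the points $z_{i}$ are taken as interval midpoints. Because this $h$ integrates to one, the leftover weight $c_{m}$ is zero and the extra component is omitted.

```python
import math


def g(u: float) -> float:
    """Component pdf: standard normal density (n = 1)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def h(y: float) -> float:
    """Triangular density on K = [-1, 1]; continuous with compact support."""
    return max(0.0, 1.0 - abs(y))


def h_cdf(y: float) -> float:
    """Distribution function of the triangular density."""
    if y <= -1.0:
        return 0.0
    if y <= 0.0:
        return 0.5 * (1.0 + y) ** 2
    if y <= 1.0:
        return 1.0 - 0.5 * (1.0 - y) ** 2
    return 1.0


def build_mixture(k: float, cells: int):
    """Partition kK = [-k, k] into equal cells A_i; return weights c_i and points z_i."""
    width = 2.0 * k / cells
    weights, centres = [], []
    for i in range(cells):
        left = -k + i * width
        right = left + width
        # c_i = (1 / k) * int_{A_i} h(z / k) dz = int_{A_i / k} h(y) dy
        weights.append(h_cdf(right / k) - h_cdf(left / k))
        centres.append(0.5 * (left + right))
    return weights, centres


def mixture(x: float, k: float, weights, centres) -> float:
    """h_m(x) = sum_i c_i * k * g(k * x - z_i)."""
    return sum(c * k * g(k * x - z) for c, z in zip(weights, centres))


def smoothed(x: float, k: float, steps: int = 2000) -> float:
    """(g_k * h)(x) = int_{kK} g(k * x - z) h(z / k) dz, by the trapezoidal rule."""
    width = 2.0 * k / steps
    total = 0.0
    for j in range(steps + 1):
        z = -k + j * width
        end_weight = 0.5 if j in (0, steps) else 1.0
        total += end_weight * g(k * x - z) * h(z / k)
    return total * width


k = 2.0
grid = [i / 20.0 for i in range(-60, 61)]
reference = [smoothed(x, k) for x in grid]
errors = {}
for cells in (4, 16, 64):
    weights, centres = build_mixture(k, cells)
    errors[cells] = max(abs(mixture(x, k, weights, centres) - r)
                        for x, r in zip(grid, reference))
    print(f"cells = {cells}: max grid error = {errors[cells]:.6f}")
```

Refining the partition reduces the grid error, in line with the modulus-of-continuity bound used in the proof; the equal-width cells here are simply the easiest partition with small diameter in one dimension.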
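For any $r\in\mathbb{N}$, let $\bar{B}_{r}=\{x\in\mathbb{R}^{n}:\|x\|_{2}\le r\}$ be a closed ball of radius $r$, centered at the origin.
Lemma 4. If $f\in\mathcal{L}_{1}$, then $\lim_{r\rightarrow\infty}\|f-\mathbf{1}_{\bar{B}_{r}}f\|_{\mathcal{L}_{1}}=0$.
Proof. For every $r$, $|f-\mathbf{1}_{\bar{B}_{r}}f|\le|f|$, and $\lim_{r\rightarrow\infty}|f-\mathbf{1}_{\bar{B}_{r}}f|=0$, point-wise. We obtain our conclusion via the Lebesgue dominated convergence theorem.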

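2.2. Proof of Theorem 5(a). We now proceed to prove each of the parts of Theorem 5. To prove Theorem 5(a), it suffices to show that for every $\epsilon>0$, there exists an $h_{m}^{g}\in\mathcal{M}_{m}^{g}$, such that $\|f-h_{m}^{g}\|_{\mathcal{L}_{\infty}}<\epsilon$. Start by applying Lemma 2 to obtain $h\in\mathcal{C}_{c}$, such that $0\le h\le f$ and $\|f-h\|_{\mathcal{L}_{\infty}}<\epsilon/2$. Then, we have

$$\|f-h_{m}^{g}\|_{\mathcal{L}_{\infty}}\le\|f-h\|_{\mathcal{L}_{\infty}}+\|h-h_{m}^{g}\|_{\mathcal{L}_{\infty}}<\frac{\epsilon}{2}+\|h-h_{m}^{g}\|_{\mathcal{L}_{\infty}}.\tag{7}$$

The goal is to find an $h_{m}^{g}$, such that $\|h-h_{m}^{g}\|_{\mathcal{L}_{\infty}}<\epsilon/2$. Since $h\in\mathcal{C}_{c}\subset\mathcal{C}_{0}$, Lemma 1 yields a $k(\epsilon)\in\mathbb{N}$, such that $\|h-g_{k}\star h\|_{\mathcal{L}_{\infty}}<\epsilon/4$, for all $k\ge k(\epsilon)$. With a fixed $k=k(\epsilon)$, apply Lemma 3 to show that there exists an $h_{m}^{g}\in\mathcal{M}_{m}^{g}$, such that $\|g_{k}\star h-h_{m}^{g}\|_{\mathcal{L}_{\infty}}<\epsilon/4$. By the triangle inequality, we have

$$\|h-h_{m}^{g}\|_{\mathcal{L}_{\infty}}\le\|h-g_{k}\star h\|_{\mathcal{L}_{\infty}}+\|g_{k}\star h-h_{m}^{g}\|_{\mathcal{L}_{\infty}}<\frac{\epsilon}{2}.\tag{8}$$

The proof is complete by substitution of (8) into (7).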

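2.3. Proof of Theorem 5(b). For any $\epsilon>0$ and compact $K\subset\mathbb{R}^{n}$, it suffices to show that there exists a sufficiently large $m(\epsilon)\in\mathbb{N}$ so that for all $m\ge m(\epsilon)$, there is an $h_{m}^{g}\in\mathcal{M}_{m}^{g}$, such that $\|f-h_{m}^{g}\|_{\mathcal{L}_{\infty}(K)}<\epsilon$. By Lemma 5, we can find a $k(\epsilon,K)\in\mathbb{N}$, such that

$$\|f-g_{k}\star f\|_{\mathcal{L}_{\infty}(K)}<\frac{\epsilon}{3}\tag{9}$$

for every $k\ge k(\epsilon,K)$. Since $g\in\mathcal{C}_{0}$, $\|g\|_{\mathcal{L}_{\infty}}\le C<\infty$ for some positive $C$, by Lemma 6. For any $k,r\in\mathbb{N}$,

$$\|g_{k}\star f-g_{k}\star(\mathbf{1}_{\bar{B}_{r}}f)\|_{\mathcal{L}_{\infty}(K)}\le\|g_{k}\star(f-\mathbf{1}_{\bar{B}_{r}}f)\|_{\mathcal{L}_{\infty}}\le\|g_{k}\|_{\mathcal{L}_{\infty}}\|f-\mathbf{1}_{\bar{B}_{r}}f\|_{\mathcal{L}_{1}}\le k^{n}C\|f-\mathbf{1}_{\bar{B}_{r}}f\|_{\mathcal{L}_{1}}.\tag{10}$$

For fixed $k$, we may choose $r(\epsilon,K)\in\mathbb{N}$, using Lemma 4, so that $\|f-\mathbf{1}_{\bar{B}_{r}}f\|_{\mathcal{L}_{1}}\le\epsilon/(3k^{n}C)$ and thus the final term of (10) is bounded from above by $\epsilon/3$ for all $r\ge r(\epsilon,K)$. Thus, for $k=k(\epsilon,K)$ and $r\ge r(\epsilon,K)$,

$$\|g_{k}\star f-g_{k}\star(\mathbf{1}_{\bar{B}_{r}}f)\|_{\mathcal{L}_{\infty}(K)}\le\frac{\epsilon}{3}.\tag{11}$$

Using Lemma 3, with approximand $\mathbf{1}_{\bar{B}_{r(\epsilon,K)}}f$, component density $g$, compact set $\bar{B}_{r(\epsilon,K)}$, $h=\mathbf{1}_{\bar{B}_{r(\epsilon,K)}}f$, and with $k=k(\epsilon,K)$ fixed, we have the existence of a density $h_{m}^{g}\in\mathcal{M}_{m}^{g}$, such that

$$\|g_{k}\star(\mathbf{1}_{\bar{B}_{r(\epsilon,K)}}f)-h_{m}^{g}\|_{\mathcal{L}_{\infty}(K)}\le\|g_{k}\star(\mathbf{1}_{\bar{B}_{r(\epsilon,K)}}f)-h_{m}^{g}\|_{\mathcal{L}_{\infty}}<\frac{\epsilon}{3}.\tag{12}$$

We obtain the desired result by combining (9), (11), and (12), via the triangle inequality.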

2.4. Proof of Theorem 5(c). The technique used to prove Theorem 5(c) is different to those used in the previous sections. Here, we use a result of Donahue et al. (1997) that generalizes the classic Barron-Jones Hilbert space approximation result (cf. Jones, 1992 and Barron, 1993) to Banach spaces.
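To prove Theorem 5(c), it suffices to show that for every $\epsilon>0$, there exists a sufficiently large $m(\epsilon)\in\mathbb{N}$ so that for all $m\ge m(\epsilon)$, there is an $h_{m}^{g}\in\mathcal{M}_{m}^{g}$ such that $\|f-h_{m}^{g}\|_{\mathcal{L}_{p}}<\epsilon$. Begin by applying Corollary 1 to obtain a $k(\epsilon)$, such that

$$\|f-g_{k}\star f\|_{\mathcal{L}_{p}}<\frac{\epsilon}{2}\tag{13}$$

for all $k\ge k(\epsilon)$. For some pdf $g$ and fixed $k\in\mathbb{N}$, let us define the class $\mathcal{G}_{g}^{k}$. Thus, $g\in\mathcal{L}_{p}$, for any $1<p<\infty$, by Lemma 9. Since $g$ is a pdf and $f\in\mathcal{L}_{p}$, we have the existence of $g_{k}\star f$ and the fact that $\|g_{k}\star f\|_{\mathcal{L}_{p}}$ is finite.
Furthermore, for any $\psi\in\mathcal{G}_{g}^{k}$, since $g\in\mathcal{L}_{p}$ and by definition of $\mathcal{G}_{g}^{k}$, the norm $\|\psi\|_{\mathcal{L}_{p}}$ is bounded by a finite multiple of $\|g\|_{\mathcal{L}_{p}}$ that depends only on $k$, $n$, and $p$. Thus, we have a finite bound (14) on $\|g_{k}\star f-\psi\|_{\mathcal{L}_{p}}$ that holds for all $\psi\in\mathcal{G}_{g}^{k}$. Following van de Geer (2003), we can write the closure of the convex hull of $\mathcal{G}_{g}^{k}$ in a form that immediately shows that it contains $g_{k}\star f$. Combined with (14), we can apply Lemma 11 to obtain the conclusion that there exists a function $h_{m}^{g}\in\mathcal{M}_{m}^{g}$ whose $\mathcal{L}_{p}$ distance from $g_{k}\star f$ is bounded by a constant multiple of $C_{p}/m^{1-1/\alpha}$, where $\alpha=\min\{p,2\}$ and $C_{p}$ is a finite constant. Since $p>1$, $m^{1-1/\alpha}$ is strictly increasing, and hence we can choose an $m(\epsilon)\in\mathbb{N}$, such that for all $m\ge m(\epsilon)$,

$$\|g_{k}\star f-h_{m}^{g}\|_{\mathcal{L}_{p}}<\frac{\epsilon}{2}.\tag{15}$$

The proof is then completed by combining (13) and (15) via the triangle inequality.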
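The last step of the argument, choosing $m(\epsilon)$ from a rate of order $m^{-(1-1/\alpha)}$, can be made concrete with a small helper. The sketch assumes a bound of the form $C_{p}K/m^{1-1/\alpha}$ with $\alpha=\min\{p,2\}$; the arguments `c_p` and `big_k` are hypothetical placeholders for constants whose values are not specified here.

```python
import math


def rate_bound(m: int, p: float, c_p: float, big_k: float) -> float:
    """Assumed bound c_p * K / m^(1 - 1/alpha), with alpha = min(p, 2)."""
    alpha = min(p, 2.0)
    return c_p * big_k / m ** (1.0 - 1.0 / alpha)


def smallest_m(eps: float, p: float, c_p: float = 1.0, big_k: float = 1.0) -> int:
    """Smallest m whose assumed bound is strictly below eps (requires p > 1)."""
    if p <= 1.0 or eps <= 0.0:
        raise ValueError("need p > 1 and eps > 0")
    alpha = min(p, 2.0)
    # Closed-form starting guess, then corrected to guard against rounding.
    m = max(1, math.floor((c_p * big_k / eps) ** (alpha / (alpha - 1.0))))
    while m > 1 and rate_bound(m - 1, p, c_p, big_k) < eps:
        m -= 1
    while rate_bound(m, p, c_p, big_k) >= eps:
        m += 1
    return m


for p in (1.5, 2.0, 4.0):
    print(f"p = {p}: m(0.1) = {smallest_m(0.1, p)}")
```

The exponent $1-1/\alpha$ shrinks as $p$ approaches one, so the number of components required by this bound grows quickly in that regime, which matches the exclusion of $p=1$ from this part of the theorem.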
2.5. Proof of Theorem 5(d) and Theorem 5(e). By Theorem 5(a), there exists a sequence $\{h_{m}^{g}\}$ that uniformly converges to $f$, as $m\rightarrow\infty$. Thus, by Lemma 12, $\{h_{m}^{g}\}$ almost uniformly converges to $f$ and also converges almost everywhere to $f$, with respect to any measure $\nu$. We prove Theorem 5(d) by setting $\nu=\lambda$, and we prove Theorem 5(e) by not specifying $\nu$.
2.6. Proof of Theorem 5(f). It suffices to show that for any $\epsilon>0$, there exists a sufficiently large $m(\epsilon)\in\mathbb{N}$ so that for all $m\ge m(\epsilon)$, there is an $h_{m}^{g}\in\mathcal{M}_{m}^{g}$, where $g\in\mathcal{V}$, such that $\|f-h_{m}^{g}\|_{\mathcal{L}_{1}}<\epsilon$. Begin by applying Lemma 4 in order to find an $r(\epsilon)\in\mathbb{N}$, for any $\epsilon>0$, such that for all $r\ge r(\epsilon)$,

$$\|f-\mathbf{1}_{\bar{B}_{r}}f\|_{\mathcal{L}_{1}}<\frac{\epsilon}{2},\tag{16}$$

where $0\le\mathbf{1}_{\bar{B}_{r}}f\le f$, and $\mathbf{1}_{\bar{B}_{r}}f\in\mathcal{C}_{c}$ with compact support $\bar{B}_{r}$. Let $K=\bar{B}_{r}$ and apply the triangle inequality to obtain

$$\|f-h_{m}^{g}\|_{\mathcal{L}_{1}}\le\|f-\mathbf{1}_{K}f\|_{\mathcal{L}_{1}}+\|\mathbf{1}_{K}f-h_{m}^{g}\|_{\mathcal{L}_{1}}.$$

Hence we need to show that there exists a function $h_{m}^{g}\in\mathcal{M}_{m}^{g}$, such that $\|\mathbf{1}_{K}f-h_{m}^{g}\|_{\mathcal{L}_{1}}<\epsilon/2$. Since $g\in\mathcal{V}$ and $g_{k}(x)=k^{n}g(kx)$, by substitution, we have

$$g_{k}(x)\le\beta k^{n}\left(1+k\|x\|_{2}\right)^{-n-\theta},\tag{17}$$

where $\beta,\theta>0$ are independent of $k$. By Lemma 5 and Corollary 1, we can obtain a $k_{1}(\epsilon)$, such that for all $k\ge k_{1}(\epsilon)$, Suppose that $\gamma>1$ and let To do so, firstly, for any $x\in\mathbb{R}^{n}$, To obtain a Riemann sum approximation of $g_{k}\star(\mathbf{1}_{K}f)$, we use an argument analogous to that of Lemma 3. That is, we partition $kK$ into $m-1$ disjoint Borel sets $\Pi_{m}=\{A_{1},\dots,A_{m-1}\}$, and we approximate by (16). Then, by a similar argument to Lemma 3, $c_{i}\ge0$ for all $i\in[m]$ and $\sum_{i=1}^{m}c_{i}=1$. Thus, we may define an element $h_{m}^{g}\in\mathcal{M}_{m}^{g}$ via the parameters above. For sufficiently large $k\ge k_{2}$, we use Lemma 3 to show that , which implies and thus (19) is proved. Using (19), we write Using polar coordinates and (17), we have where $A_{n}$ is the surface area of a unit sphere embedded in $\mathbb{R}^{n}$. We then have which implies that we can choose a $k_{3}\in\mathbb{N}$, such that for all $k\ge k_{3}$, (22) Lastly, we write which implies that we can choose the same $k_{3}$ as above to obtain the bound for any $k\ge k_{3}$. Thus, we obtain the bound $\|\mathbf{1}_{K}f-h_{m}^{g}\|_{\mathcal{L}_{1}}<\epsilon/2$, for all $k\ge\max\{k_{1},k_{2},k_{3}\}$, by combining (18), (19), (20), (21), (22), and (23), via the triangle inequality. The result is proved by combining the bound above with (16), for an appropriately large $r(\epsilon)\in\mathbb{N}$.

3. Comments and discussion.
3.1. Relationship to Theorem 1. In the proof of Theorem 1, the famous Hilbert space approximation result of Jones (1992) and Barron (1993) was used to bound the $\mathcal{L}_{2}$ norm between any approximand $f\in\mathcal{L}_{2}$ and a convex combination of bounded functions in $\mathcal{L}_{2}$. This approximation theorem is exactly the $p=2$ case of the more general theorem of Donahue et al. (1997), as presented in Lemma 11. Thus, one can view Theorem 5(c) as the $p\in(1,\infty)$ generalization of Theorem 1.

3.2. The class $\mathcal{W}$ is a proper subset of the class $\mathcal{C}_{0}$. Here, we comment on the nature of class $\mathcal{W}$, which was investigated by Bacharoglou (2010) and Nestoridis, Schmutzhard and Stefanopoulos (2011). We recall that Bacharoglou (2010) conjectured that Theorem 4 generalizes from $g=\phi$ to $g\in\mathcal{V}$. In Theorem 5(a)-(e), we assume that $g\in\mathcal{C}_{0}$. We can demonstrate that $g\in\mathcal{C}_{0}$ is a strictly weaker condition than $g\in\mathcal{V}$ or $g\in\mathcal{W}$.
For example, consider the function in g : R → R: and note that For x ≤ 0, we observe that g = 0 and thus the left limit is satisfied.On the right, for any 1/ǫ > 0, we have x (ǫ) ≥ ⌈ǫ⌉ − 1/2, so that g (x) < 1/ǫ, for all x > x (ǫ), where ⌈•⌉ is the ceiling operator.Therefore, g ∈ C 0 .
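Within each interval $i-1\le x<i$, we observe that $g$ is locally maximized at $x=i-1/2$. The local maximum corresponding to each of these points is $1/i$. Thus $g\notin\mathcal{W}$, since

$$\sum_{y\in\mathbb{Z}}\sup_{x\in[0,1]}|g(x+y)|=\sum_{i=1}^{\infty}\frac{1}{i},$$

where $\sum_{i=1}^{\infty}(1/i)=\infty$. Furthermore, $g\notin\mathcal{V}$ since $\mathcal{V}\subset\mathcal{W}$.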
3.3. Convergence in measure. Along with the conclusions of Theorem 5(d) and (e), Lemma 12 also implies convergence in measure. That is, if $\nu$ is a $\sigma$-finite Borel measure on $\mathbb{R}^{n}$, then for any $\nu$-measurable $f$, there exists a sequence $\{h_{m}^{g}\}$, such that for any $\epsilon>0$, $\lim_{m\rightarrow\infty}\nu\left(\{x\in\mathbb{R}^{n}:|f(x)-h_{m}^{g}(x)|\ge\epsilon\}\right)=0$.

APPENDIX A: TECHNICAL RESULTS
Throughout the main text, we utilize a number of established technical results. For the convenience of the reader, we append these results within this Appendix. Sources from which we draw the unproved results are provided at the end of the section.
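Lemma 5. Let $\{g_{k}\}$ be a sequence of pdfs such that, for every $\delta>0$, $\lim_{k\rightarrow\infty}\int\mathbf{1}_{\{x:\|x\|_{2}>\delta\}}g_{k}\,\mathrm{d}\lambda=0$. Then, for all $f\in\mathcal{L}_{p}$ and $1\le p<\infty$, $\lim_{k\rightarrow\infty}\|g_{k}\star f-f\|_{\mathcal{L}_{p}}=0$. Furthermore, for all $f\in\mathcal{C}_{b}$ and any compact $K\subset\mathbb{R}^{n}$, $\lim_{k\rightarrow\infty}\|g_{k}\star f-f\|_{\mathcal{L}_{\infty}(K)}=0$.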
The sequences $\{g_{k}\}$ from Lemma 5 are often called approximate identities or approximations of the identity. A simple construction of approximate identities is by taking dilations $g_{k}(x)=k^{n}g(kx)$, which yields the following corollary.
Corollary 1. Let $g$ be a pdf. Then the sequence of dilations $\{g_{k}:g_{k}(x)=k^{n}g(kx)\}$ satisfies the hypothesis of Lemma 5 and hence permits its conclusion.
Lemma 6. The class $\mathcal{C}_{0}$ is a subset of $\mathcal{C}_{b}$. Furthermore, if $f\in\mathcal{C}_{0}$, then $f$ is uniformly continuous.
Lemma 8. If $X\subset\mathbb{R}^{n}$ is bounded, then for any $r>0$, $X$ can be covered by $\bigcup_{i=1}^{m}B(x_{i},r)$ for some finite $m\in\mathbb{N}$, where $x_{i}\in\mathbb{R}^{n}$ and $i\in[m]$.
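Lemma 10. If $f\in\mathcal{L}_{p}$ and $g\in\mathcal{L}_{1}$, for $1\le p\le\infty$, then $f\star g$ exists and we have $\|f\star g\|_{\mathcal{L}_{p}}\le\|g\|_{\mathcal{L}_{1}}\|f\|_{\mathcal{L}_{p}}$.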
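Lemma 11. Let $G\subset\mathcal{L}_{p}$, for some $1\le p<\infty$, and let $f\in\overline{\mathrm{Conv}}(G)$. For any $K>0$, such that $\|f-\alpha\|_{\mathcal{L}_{p}}<K$, for all $\alpha\in G$, there exists an $h_{m}\in\mathrm{Conv}_{m}(G)$, such that

$$\|f-h_{m}\|_{\mathcal{L}_{p}}\le\frac{C_{p}K}{m^{1-1/\alpha}},$$

where $\alpha=\min\{p,2\}$, and $C_{p}$ is a finite constant that depends only on $p$.
Lemma 12. In any measure $\nu$, uniform convergence implies almost uniform convergence, and almost uniform convergence implies almost everywhere convergence and convergence in measure, with respect to $\nu$.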