SHAH: SHape-Adaptive Haar Wavelets for Image Processing

We propose the shape-adaptive Haar (SHAH) transform for images, which results in an orthonormal, adaptive decomposition of the image into Haar-wavelet-like components, arranged hierarchically according to decreasing importance, whose shapes reflect the features present in the image. The decomposition is as sparse as it can be for piecewise-constant images. It is performed via a stepwise bottom-up algorithm with quadratic computational complexity; however, nearly linear variants also exist. SHAH is rapidly invertible. We show how to use SHAH for image denoising. Having performed the SHAH transform, the coefficients are hard- or soft-thresholded, and the inverse transform taken. The SHAH image denoising algorithm compares favorably to the state of the art for piecewise-constant images. A clear asset of the methodology is its very general scope: it can be used with any images or more generally with any data that can be represented as graphs or networks.


INTRODUCTION
The contribution of this work is two-fold: first, we introduce a new transform for images, based on new shape-adaptive Haar (SHAH) wavelets from which it takes its name, and second, we propose a methodology for image denoising based on the SHAH transform.
The SHAH transform of an image results in its orthonormal decomposition into a ranked collection of weighted level differences between pairs of zones in the image, the "most informative" such contrasts being ranked first. It thus provides a natural decomposition of the image into a set of features ordered according to their importance for the image description. The transform identifies the edges and other prominent features of the image, and the decomposition is as sparse as it can be for piecewise constant images. The SHAH transform is performed via a stepwise bottom-up algorithm with quadratic linear complexity, but nearly linear variants also exist. It might be viewed as the selection of a particular image-driven orthonormal basis (hence the term "shape-adaptive") and the projection of the image onto the selected basis. Due to its shape-adaptivity, the transform bypasses the classical notion of dyadic wavelet scales. It can be viewed as a two-dimensional extension of the unbalanced Haar wavelet transform of a curve (Fryzlewicz 2007).
The SHAH transform produces sparse representations of images, especially (nearly)piecewise-constant ones, and hence can be used in conjunction with soft-or hardthresholding operations with the purpose of removing noise from the input image. This results in a "highly nonlinear" operation on the image, being a superposition of two nonlinear operations: SHAH and thresholding. The resulting image denoising technique is shown to perform well, in particular for piecewise-constant images. Its performance can be improved further via linear averaging.
Although this article focuses on image analysis, it is worth emphasizing that the methodology we propose applies to more general data structures. Indeed, the SHAH transform can be applied to any data that can be encoded as a graph whose nodes are associated with a given intensity and are embedded in a normed, not necessarily two-dimensional, space.

RELATED WORK
This section aims to situate our work among the variety of available methods. Multiscale image representation. The SHAH transform falls into the category of "multiscale representation of images." (Nonadaptively selected) wavelet bases are a canonical example of a tool used to achieve such representations, and a survey of their use in image processing can be found in Mallat (2009a). Wavelets, although widely used and relatively well understood, suffer from inefficacies in capturing nonhorizontal or nonvertical features in an image; curvelets (Candès and Donoho 2001) attempt to remedy this by using a more flexible family of building blocks, which are also not selected adaptively.
Adaptive image representation and processing. In contrast to wavelets or curvelets, the building blocks of the SHAH transform are selected adaptively from the data. A review of adaptive image representations can be found in Peyré (2011). The principle of adaptivity (although not the particular construction used in SHAH) is shared by a number of "-let" transforms, including bandlets of Le Pennec and Mallat (2005) (see also Mallat and Peyré 2008 for a review of related techniques), wedgelets (Donoho 1999;Claypoole and Baraniuk 2000), tetrolets (Krommweh 2010), the easy path wavelet transform (Plonka 2009), edge-adapted nonlinear multiresolution techniques (Arandiga et al. 2008), and directed trees (Narendra and Goldberg 1980). Heijmans and Goutsias (2000) provided, through morphological wavelets, a framework for describing nonlinear liftingbased wavelets decompositions. Grouplets (Mallat 2009b) preserve the classical notion of scale and grid subdivision present in the Haar or lifting transforms (see below for references to lifting), but equip the standard Haar transform with an "association field" that groups together points that are not necessarily neighbors. This leads, in a context different from that in SHAH, to similar Haar-like filtering operations with weights not necessarily equal to those in SHAH. We emphasize that in contrast to grouplets, SHAH does not follow the dyadic scale structure of the classical wavelet transform. Other ap-proaches to image processing (in this case, denoising), which can be viewed as adaptive but do not use the notion of decomposition or hierarchy are, for example, adaptive weight smoothing (Polzehl and Spokoiny 2000) and penalized regression on a graph (Kovac and Smith 2011). A recent review of image denoising techniques can be found in Milanfar (2013).
Wavelet-like methods on graphs outside of the image context. Hammond, Vandergheynst, and Gribonval (2009) and Antoine, Rosca, and Vandergheynst (2010) defined wavelets on graphs by studying eigenvalues of the graph Laplacian; the latter takes the form of a matrix encoding the connectivity of each node and edge. Coifman and Maggioni (2006) used the powers of a diffusion operator as the scaling tool leading to multiscale analysis. Several variants of their ideas Szlam et al. 2005) lead to different wavelet constructions. Crovella and Kolaczyk (2003) used the n-hop distance (the minimal number of edges one has to travel to go from the central node to another) to define wavelets on the graph. Jansen, Nason, and Silverman (2009) used the lifting algorithm akin to that of Sweldens (1996) to construct wavelets on graphs using a bottom-up approach where wavelets between the nearest nodes get constructed first. Some authors also have defined wavelet transforms specifically designed for the dendrogram: Murtagh (2007) used Haar bases, while Gavish, Nadler, and Coifman (2010) generalized to unbalanced Haar. Singh, Nowak, and Calderbank (2010) iteratively reduced the graph by replacing two (groups of) nodes by a single one, but unlike in SHAH, the graph structure is not used in the reduction process. The latter method is closely related to the idea behind treelets (Lee, Nadler, and Wasserman 2008), defined for unordered data. We end by mentioning that SHAH can be viewed as a contiguity-constrained agglomerative clustering technique, a broad class of methods described generically in chap. 5 of Murtagh (1985).
Relationship to Swelden's lifting transform. "Lifting" (Sweldens 1996) is a device for designing iterative data transformations whereby (transformed) data points get "predicted" using neighboring values and, once the prediction error has been recorded, the predicted coefficient is removed from the system to reduce its complexity. It is a nonadaptive transformation in the sense that its form does not depend on the values of the data being processed, and it is a linear transformation of the data. In its original version cited above, each iterative stage involves predicting and removing half of all available coefficients. Versions for data on more complex domains also exist, for example, the "lifting one coefficient at a time" scheme of Jansen, Nason, and Silverman (2009), which is also nonadaptive and linear.
In contrast to these, SHAH, which also uses the notion of predicting data points or their clustered regions using neighboring values, and then successively removing them (either "one coefficient at a time," or "a small subset of coefficients at a time"), is an adaptive and nonlinear transform of the data. The adaptivity and nonlinearity arise as a result of SHAH choosing, in a data-dependent way, which part of the data to operate on in each stage of the transform.
To give but one example of the consequences of these properties, we remark that image denoising via SHAH, described later in Section 3 is an operation, which belongs to the class of methods described by DeVore (1998) as "highly nonlinear," since it involves a nonlinear operation (thresholding) performed on an adaptively (and hence nonlinearly) chosen basis. This is part of the reason why linear averaging of SHAH image reconstructions can bring improvements in their quality, as described in that section.
Finally, in contrast to classical lifting, the SHAH transform is conditionally orthonormal, by which we mean "orthonormal given the selected basis." This property is important, among others, in the application of SHAH to image denoising where it leads to a fast algorithm for threshold selection, and in fast computation of the inverse SHAH transform.

ORGANIZATION OF THE ARTICLE
The article is organized as follows. Section 2 defines the SHAH algorithm and describes some of its properties. Section 3 shows how to apply SHAH to image denoising. Section 4 concludes.

CORE IDEAS
The shape-adaptive Haar (SHAH) transform encodes images in an invertible, datadriven, hierarchical, and sparse way. It requires three pieces of information to describe an image: the intensities of the pixels, a notion of neighborhood between the pixels, as well as the spatial location of the pixels in the two-dimensional space. The object describing an image in this way is termed an intensity network (IN). We describe below how to define it for a given image. The SHAH transform is a data-driven procedure for dimension reduction, with minimum loss of information at each step. It can be interpreted as an agglomerative-type algorithm, where pixels of an image, each initially forming a separate zone, get progressively grouped into contiguous zones according to a specific criterion. We now describe the core ideas of the SHAH transform.
Defining the IN (intensity network). The IN associated with an image is constructed as follows. Consider a gray level image, stored as a real-valued matrix of dimensions N × M. Then, draw a network on this image. Each pixel is a node of the network, and each node is related by edges to its four nearest neighbors (in the left (or west, W), right (east, E), top (north, N), and bottom (south, S) directions, respectively). This graph structure mathematically encodes the idea of neighborhood between the pixels of the image. More complex topologies are possible; we do not pursue them in this work but implement some in our software (more details below). Assign unique labels l 1 , l 2 , . . . , l NM to each node. Associate an orientation with the edges so that each of them consists of an input node l i and an output node l j with i < j. (We only use the terms "input" and "output" to facilitate references to the oriented edge (l i , l j ).) Store the mapping relating those labels to the Cartesian coordinates of the pixels in a codebook. Moreover, associate uniform weights to all the nodes of the network, as the information they store (i.e., the value of the related pixel) is a priori equally important in the image description. The object comprising the pixel values (the NM real values stored in an N × M matrix) and the graph structure (the NM nodes and 2NM − N − M edges) embedded in the space through the codebook is termed an intensity network (IN). An example of an IN can be found in Figure 1.
Choice of image topology. Throughout this article, we work with four-element neighborhoods (W, N, E, S). These are, arguably, the simplest reasonable neighborhoods, which also offer the fastest computation. More complex neighborhood structures are clearly possible, most notably eight-element neighborhoods (W, NW, N, NE, E, SE, S, SW). Although we do not pursue the latter in this work because of the increased computation times, we do implement the SHAH transform with eight-element neighborhoods in the R code provided at http://stats.lse.ac.uk/fryzlewicz/shah/shah_code.R. One attractive feature of the SHAH algorithm is that it always proceeds in the same way once the initial edge topology has been defined. In particular, this is true of the three-dimensional version of SHAH, also implemented in our software.
Smoothing the image. The idea of the SHAH transform is to progressively smooth the image in a data-adaptive way, while retaining as much information as possible about the current image in each smoothing step. In practice, compute (weighted) differences between pairs of neighbor nodes along each edge. Those differences are referred to as details. Identify the smallest detail (in absolute value) and replace the values of the corresponding linked nodes by their (weighted) average. Then, reduce those two nodes to a single (linked, merged) node in the network, which is given a larger weight due to the increased number of pixels it encodes. Finally, update the graph structure of the network by removing the edge between the linked nodes. Since the detail being replaced is the smallest one, the loss of information is the smallest possible. This reduction process is iterated NM − 1 times, up to the point at which the image is finally reduced to a single node. Figure 2 shows an example of how the graph structure might evolve during the reduction process.
Encoding the transform. At each iteration of the algorithm, store the labels of the nodes that are removed, as well as the (weighted) difference between them. Thus, each iteration returns three values: the input node label, the output node label, and the selected detail, the latter being the (weighted) value at the output node minus the (weighted) value at the input node of the edge. There are NM − 1 iterations for reducing an N × M image to a single node associated with a unique real value for the reduced image. The complete reduction process can thus be stored in two column vectors: one of them encodes the (NM − 1) edges and the other encodes the (NM − 1) detail coefficients, which can be interpreted as intensity differences. Both of the vectors are constructed element by element, from bottom to top. In addition, the (very) top element of either vector stores, respectively, a degenerate edge linking the remaining node to itself, and the associated value of intensity. Those two vectors combined with the spatial information stored in the codebook define the SHAH transform of the image, an illustration of which can be found in Figure 3. The output of the SHAH transform will also be referred to as the SHAH signature of the input image, see Figure 4 for an example.
Alternatively, the SHAH transform can be viewed as the projection of the image on a particular image-adapted orthonormal basis in which the basis functions are arranged hierarchically (in a multiscale way) and encode the image sparsely (see Figure 5 for an example). Overview of the key properties. The SHAH transform is a one-to-one transformation of the input image. It provides a data-driven encoding of images, in which both the pixel intensities and the image topology are accounted for. It describes the image as a linear combination of simple, regionwise-constant basis images, hierarchically organized according to what can be viewed as the importance of the image feature they encode. If SHAH is applied to a noiseless image with edges, then the edges and the regions of constant intensity they delimit are captured in the basis elements, which leads to sparsity in the description of the image. For noisy images, the SHAH transform also attempts to concentrate, in a greedy fashion, as much energy of the image in as few coefficients as possible. The algorithm can be applied to more general geometries than a rectangular image with a grid.

THE SHAH ALGORITHM
In this section, we provide the algorithmic details of the SHAH transform. The input and output of the algorithm are defined in a formal general way. The one-dimensional version of SHAH, termed unbalanced Haar (UH) was introduced in Fryzlewicz (2007) and applied to curve classification in Timmermans and von Sachs (2015).
Input • E IN is a graph. It is a ranked set of E oriented edges l = (j, k), l = 1 . . . E, with j, k ∈ {1, . . . , p}, j = k, identifying the linked nodes. In the case when no natural orientation exists for the edges, any choice is equally convenient but an orientation is required for the transform to be invertible.
• X (p) is a vector of intensities. It is a real-valued vector of length p encoding the intensities of the image I at the successive points defined in D (p) .
A typical example is as follows.
• The image I is a gray level image of N × M pixels encoded as a matrix A. • E OUT is a graph. It is a ranked set of p oriented edges l = (j, k), l = 0 . . . p − 1, with j, k ∈ {1, . . . , p}, j = k identifying the linked nodes. Edge 0 links to itself and j = k for this edge.
• d is a vector of intensity differences. It is a real-valued vector of length p encoding the intensity differences associated with the edges successively defined in E OUT . The value d 0 is an intensity instead of an intensity difference.
As an example, the output of the SHAH transform of the IN from Figure 1 is in Figure 3. The algorithm. The algorithm, detailed below, is also illustrated in Figure 2.
The SHAH algorithm Iteration #i: 1. Compute detailsd l along each of the edges l = (j, k) in E (i) : 2. Select an edge l * with the minimum absolute value of detail: l * := arg min l |d l |.
In case of multiple equal minimum values |d l |, select the smallest index l. Note l * = (j * , k * ).

Smooth:
4. Encode the partial value of SHAH: 5. Reduce the network and prepare next iteration: Update E i by replacing all indexes k * by j * . Discard duplicate edges in E i , retaining only the first occurrence of each edge. This defines E (i+1) .

Final step:
Some remarks are in order. The filter taps d l = (− w k √ w 2 j +w 2 k , w j √ w 2 j +w 2 k ) used in computing the detail coefficient d l are always chosen so that, if the original image were constant over the region which the detail coefficient corresponds to, the value of the detail d l would be zero. This is a consequence of the fact that the starting values of the weights w j at the beginning of the algorithm are equal to 1. The form of the update to the weight vector w (i+1) is designed to preserve this property as the algorithm progresses. The property that the detail equals zero over constant regions is a natural requirement, which causes the SHAH algorithm to offer sparse representations for piecewise-constant images, in a similar vein to standard Haar wavelets that also produce zero detail coefficients in regions of constancy. This, and the requirement that d l 2 2 = 1, uniquely determines the values of the taps applied to compute the detail coefficients, up to sign flips.
The smooth weights s l = ( ) are chosen so that the filters d l * and s l * are orthonormal. This implies that the SHAH transform is conditionally orthonormal, by which we mean "orthonormal given the selected basis." This property is important, among others, in the application of SHAH to image denoising where it leads to a fast algorithm for threshold selection, and in fast computation of the inverse SHAH transform. The SHAH basis selection takes place iteratively, via the minimization of |d l | in Step 2 of the algorithm. This is a greedy procedure that ensures that each consecutive detail coefficient encodes as little variation of the image as possible, thereby attempting to concentrate as much signal as possible in the latter stages of the algorithm, in the hope of obtaining a sparse representation of the image. This is in contrast to the standard nonadaptive Haar transform for images, where no basis selection takes place, and implies, in particular, that SHAH is a nonlinear transformation.

COMPUTATIONAL COMPLEXITY AND VARIANTS OF THE ALGORITHM
In the version described above, the computational complexity of the SHAH algorithm is quadratic in the number of pixels, that is, is of computational order p 2 . This is because at each iteration i, all the edges are examined. However, other variants of the SHAH algorithm are possible, with substantially reduced computational complexity. We outline some ideas below.
• Examination of a fixed number of edges. Substantial computational cost can be saved if only a preset number of edges (not exceeding a constant), are examined at each iteration i. The edges can be selected in a deterministic or random way. This potentially results in an algorithm of computational order p, that is, linear in the number of pixels, depending on how the edges are selected.
• Two-or multi-stage algorithm. For an image of size N × N , first, divide the image into (N/k) 2 nonoverlapping sub-images, each of size k × k. Execute the algorithm on each sub-image separately (stage 1), then execute it on the resulting N/k × N/k matrix of coefficients d 0 from each sub-image (stage 2). The computational complexity is then (N/k) 2 k 4 + (N/k) 4 , which attains its minimum when k = N 1/3 , resulting in the complexity of N 8/3 = p 4/3 . The algorithm can be executed similarly in more stages than one, bringing the computational complexity arbitrarily close to linear, if the number of stages is large enough.
• Removal of multiple nodes at once. In the version described above, one pair of nodes is merged at each iteration (this can be viewed as the "removal" of one of the nodes and updating of the other). An alternative might be to merge multiple pairs of nodes, corresponding to a number of smallest detail values. Merging a fixed proportion ρ ∈ (0, 1) of the node pairs in each iteration results in an algorithm of computational order p log p. Pairs of nodes can be merged simultaneously in a single iteration if, out of the set of pairs of nodes to be merged, no node belongs to more than one pair.
If, in addition to the output described in Section 2.2, the SHAH algorithm stores the filter coefficient w j * / w 2 j * + w 2 k * used at each iteration i, the inverse SHAH transform is performed by simply reversing the steps of the SHAH algorithm. The computational complexity of the inverse SHAH transform is then linear in the number of pixels.
We now briefly discuss how the different variants of the algorithm compare in terms of execution times. Table 1 shows times obtained for 128 × 128 and 256 × 256 images. Computational savings will differ depending on the fixed number of edges examined in the "fixed number of edges" version, on the k parameter in the two-stage algorithm and on the ρ parameter in the "removal of multiple nodes at once" version. Clearly, the standard version, implemented in R, is unacceptably slow and one of the faster versions needs to be used in practice.  Figure 6 shows the compression capabilities of the different version of the algorithm on the noisy images from Section 3, Examples 1 and 2. The steeper the curve at the start, the larger the proportion of the variance of the image explained by the same number of the largest SHAH coefficients. The curves corresponding to the standard SHAH, the "fixed number of edges," and the "removal of multiple nodes at once" versions are practically indistinguishable. Understandably, the two stage version is a less good image compressor, because of its region constraints.
It is also worth noting that the image from Example 2 is represented more sparsely due to its much higher signal-to-noise ratio than the image in Example 1. This is despite the fact that the noise-free image from Example 1 is piecewise-constant, and therefore it would be represented (much) more sparsely via SHAH than the noise-free image from Example 2, which is not piecewise-constant.

PROPERTIES OF SHAH
In this section, we briefly summarize the key mathematical properties of SHAH. The proofs are straightforward, so we omit them. Figure 6. Proportion of image variance (y-axis) explained by each given number of the largest SHAH coefficients (x-axis). Black: standard SHAH; blue: "fixed number of edges" version with M = 1000 edges chosen at random each time; green: "removal of multiple nodes at once" version with ρ = 0.01; red: two-stage version with k = 2. The black, blue, and green lines virtually overlap.
1. SHAH as a data-driven orthonormal decomposition of the image. At iteration i of the SHAH algorithm, each d p−i can be represented as the inner product of the original image X and an image p−i , where • p−i is selected in a data-driven way at each iteration i, • p−i has mean zero, except when i = p, k=0 is an orthonormal basis and Further, due to the Parseval identity, the total energy (i.e., the sum of squares) of X equals p−1 k=0 d 2 k . An example of the basis k is provided in Figure 5. The orthonormality of k is a simple consequence of the orthonormality of the detail and smooth filters used at each iteration of the algorithm. SHAH is an invertible transform.
2. Hierarchical nature and Haar-like character of the basis k . Let supp( k ) denote the support of k , that is, the domain on which it is nonzero.
• For each k = 1, . . . , p − 1, supp( k ) consists of two contiguous adjacent zones such that k is constant positive on one and constant negative on the other. 0 is positive and constant on the entire domain.
• The structure of the basis k is hierarchical is the sense that if the supports of l and k overlap and l < k, then supp( k ) must be contained either within the zone where l is positive or the zone where it is negative.
These properties are reminiscent of the Haar wavelet basis. However, here, the key difference is that the supports of k are determined by the data and can have arbitrary contiguous shapes, as is apparent from the example in Figure 5. This is because the basis images k are chosen adaptively from the data at each iteration of the algorithm.

Sparsity of representation and energy concentration.
• For each k = 1, . . . , p − 1, if supp( k ) is contained within a region where X is constant, then the corresponding d k = 0. This is a consequence of the mean-zero property of k .
• Consequently, by the construction of the SHAH algorithm, for a piecewiseconstant image X, the only nonzero elements of the vector (d 0 , d 1 , . . . , d p−1 ), besides possibly d 0 , will be d 1 , . . . d Z−1 , where Z is the number of zones of contiguous identical values in X, the notion of contiguity being defined by the linkage structure of the network. Therefore, SHAH encodes the edges of such an image in the sparsest possible way.
• For nonpiecewise-constant (e.g., noisy) images, the SHAH algorithm is an attempt to achieve the same effect, that is, to concentrate as much energy of the image X in as few initial coefficients d 1 , d 2 , . . . as possible, and therefore to represent its significant features sparsely.

IMAGE DENOISING USING SHAH
The SHAH wavelet transform can be used for image denoising in a similar process to any other wavelet transform, whether adaptive or not. The usual procedure in nonlinear wavelet-based image denoising is to take the wavelet transform of the image, perform a shrinkage/thresholding operation on the wavelet coefficients (in the hope of thresholding out the typically large number of coefficients that carry mostly noise, but retaining most of those carrying signal) and take the inverse wavelet transform. The statistical model we consider in this section is X u,v = f u,v + ε u,v , u, v = 1, . . . , N, where X u,v is the observed noisy image, f u,v is the unknown true image, and ε u,v is iid noise distributed as N (0, σ 2 ).
At the transform stage, in the case of SHAH, we have a number of options for speeding up computation for large images, as described in Section 2.3. Empirically, we have found that the two-stage algorithm with k = 2 or k = 4 often leads to the best denoising, especially for noisier images, and this is the version we focus on here. It may come as a surprise that the two-stage algorithm is able to beat the various one-stage versions, despite its worse compression capabilities, as shown in Section 2.3. This, we believe, is because the two-stage algorithm is "less greedy" than the one-stage versions because of its region constraints, which may be advantageous for processing noisier images, in which the onestage algorithms may have more scope for making globally significant basis choice mistakes because of their lack of region constraints.
In the thresholding step, we pursue two strategies: apply either soft, or hard thresholding to each SHAH coefficient d i , for i = 1, . . . , p − 1. This results in the following operations: where λ S and λ H are thresholds used in soft and hard thresholding, respectively, and I() is the indicator function. Motivated by the choice of the regularization parameter for image smoothing in Kovac and Smith (2011), we choose the threshold λ as follows (our strategy applies to both soft and hard thresholding and therefore we write, generically, λ for either λ H or λ S ). For each candidate λ, we compute the reconstructed imagef λ u,v and estimate the variance of the empirical residuals asσ 2 λ = N −2 N u,v=1 (X u,v −f λ u,v ) 2 . By construction,σ 2 0 = 0 andσ 2 ∞ is the empirical variance of X u,v , which is typically larger than σ 2 . We then select the largest λ for whichσ 2 λ ≤σ 2 , whereσ 2 is the median-absolute-deviation-based estimate of σ 2 used in Kovac and Smith (2011).
By choosing the largest possible value of λ, which leads to "reasonable" residuals from the fit in the sense of (2), we ensure that the reconstructed image is "as simple as possible" in the sense of being composed of the smallest possible number of wavelet coefficients, under the constraint (2). We also note that thanks to the conditional orthonormality of SHAH (i.e., orthonormality given the selected SHAH basis), the operation of checking all possible values of λ can be performed quickly in the SHAH coefficient domain, and is implemented in the code provided in this fast way. We illustrate the potential of the above SHAH-based image denoising procedure on two examples.
Example 1. We use the cartoon medical image, of size 256 × 256, investigated in Polzehl and Spokoiny (2000) and Kovac and Smith (2011). The clean and noisy images are shown in the top left and top middle plots of Figure 7. This is a piecewise-constant image, for which we expect SHAH to perform well due to the piecewise-constant nature of the SHAH basis functions. The top right plot shows the reconstruction obtained by the adaptive weight smoothing (AWS) technique of Polzehl and Spokoiny (2000), this was produced by the aws routine from the aws R package (version 1.9-4, dated 2014-03-05), executed with its default parameters.
We process the image via the SHAH denosing procedure described earlier, used here with k = 4 and both hard and soft thresholding. The execution of the code, written in R, took under 10 sec on a standard PC. The reconstructions, shown in the bottom left and bot- Figure 7. Clean, noisy, and denoised image using adaptive weight smoothing and SHAH with hard and soft thresholding as well as SHAH-avg with hard thresholding. Table 2. Empirical mean square errors (MSE) and estimates of total variation (TV) for the various reconstruction methods of the image from Example 1. The value for the starred method is taken from Kovac and Smith (2011). AWS refers to adaptive weight smoothing from Polzehl and Spokoiny (2000). AWS-avg is constructed like SHAH-avg but with SHAH replaced by AWS with default parameters. Boxed value in the MSE column is the lowest MSE across methods. Boxed values in the TV column are those within 5% of the TV for the clean image. Wavelet thresholding uses the Daubechies' least asymmetric filter indexed 10, combined with universal hard thresholding (default option in the R package wavethresh). Gaussian kernel estimate is an unattainable Gaussian kernel smoother in which the bandwidth was chosen by minimizing the MSE with respect to the true image (execution: routine kernsm from the R package aws) tom middle plot of Figure 7, respectively, appear mostly satisfactory but the reconstructed circle is "jagged" in appearance. To remedy this, significant improvements are available by performing the following procedure: (a) for i = 1, . . . , m, add further iid N (0, σ 2 1 ) noise to image X u,v to obtain Y (i) u,v , (b) perform SHAH denoising on each image Y (i) u,v , (c) average the results over i. We call the thus-constructed procedure SHAH-avg. The averaging introduces an extra smoothing effect that tends to alleviate the jaggedness of the individual reconstructions. We note again that the SHAH denoising procedure is highly nonlinear, and it should be expected that different SHAH bases are selected for each i; therefore the individual reconstructions can be expected to differ enough for each i for the averaging effect to be helpful in removing spurious artifacts present in the individual reconstructions. Throughout this section, we demonstrate SHAH-avg with σ 1 =σ /2 and m = 10; these parameters have not been optimized in any way. Table 2 lists the mean square errors (MSEs) of the various reconstructions, and estimates of their total variation, computed as in Kovac and Smith (2011). SHAH-avg with hard thresholding is by far the best in terms of the MSE. Apart from this method, also AWS-avg (constructed like SHAH-avg but with SHAH replaced by AWS with default parameters) and SHAH with hard thresholding lead to total variation values close to those of the clean image. Importantly, we note that AWS-avg does not offer a significant MSE improvement over AWS, due to the latter reconstruction already being smooth, and perhaps even overly so. SHAH-avg offers very significant MSE improvement over SHAH.
We end this example by noting that SHAH with hard thresholding retains 58 nonzero SHAH coefficients for this image, which is fewer than 0.1% of the total number of SHAH coefficients (the latter being equal to the number of pixels). This can be interpreted to mean that the reconstructed image is composed of 58 features, each of which is of the form of a difference between two consecutive regions of the image.
Example 2. We consider the teddy image from the R package wavethresh. The size is 256 × 256. Unlike the previous two examples, this image is not piecewise constant. The purpose of this example is to investigate how SHAH handles the task of denoising nonpiecewise-constant images. The clean and noisy images are shown in the top left and top middle plots of Figure 8.
The AWS and AWS-avg reconstructions are slightly more appealing visually than those produced by SHAH and SHAH-avg (here with hard thresholding and k = 2), which is unsurprising given the nonpiecewise-constant character of the image. However, the visual difference does not appear to be large. The MSEs for the various methods tested are in Table 3. SHAH retains 387 nonzero coefficients.
The two examples considered provide evidence for the unsurprising tendency of SHAH to perform better on piecewise-constant images than on general smooth ones. The fundamental reason for this is that the SHAH building blocks are themselves piecewise-constant.

CONCLUSION
In this article, we have proposed the SHAH (shape-adaptive Haar) transform for images, which results in an orthonormal, adaptive decomposition of the image into Haarlike components, arranged hierarchically according to decreasing importance, whose shapes reflect the features present in the image. The decomposition is extremely sparse Figure 8. Clean, noisy, and denoised image using AWS, AWS-avg, SHAH with hard thresholding and SHAH-avg with hard thresholding. Table 3. Empirical mean square errors (MSE, divided by 10 4 and rounded) for the various reconstruction methods of the image from Example 2. AWS refers to adaptive weight smoothing from Polzehl and Spokoiny (2000). AWSavg is constructed like SHAH-avg but with SHAH replaced by AWS with default parameters. Boxed value in the MSE column is the lowest MSE across methods. Wavelet thresholding uses the Daubechies' least asymmetric filter indexed 10, combined with universal hard thresholding (default option in the R package wavethresh). Gaussian kernel estimate is an unattainable Gaussian kernel smoother in which the bandwidth was chosen by minimizing the MSE with respect to the true image (execution: routine kernsm from the R package aws) for piecewise-constant images. It is performed via a stepwise greedy bottom-up algorithm with quadratic computational complexity; however, nearly linear variants also exist. SHAH is rapidly invertible. We have shown how to use SHAH in conjunction with thresholding for the purpose of image denoising. SHAH is general in scope and can be used not only with images but also with any data that can be described as graphs or networks.

MSE
One interesting open question is that of the applicability of SHAH to the decomposition of color images, for example, those using the RGB color space. In the RGB case, depending on the application, one would entertain the possibility of selecting the SHAH basis either independently for each color band (e.g., if one wished to remove noise from each band separately), or jointly across the bands. Similar basis choice considerations would apply to multispectral or hyperspectral images. We leave this for future research.