Path histogram distance and complete subtree histogram distance for rooted labelled caterpillars

ABSTRACT A rooted labelled caterpillar (a caterpillar, for short) is a rooted labelled unordered tree that is transformed to a path after removing all of its leaves. In this paper, we discuss two histogram distances between caterpillars. One is the path histogram distance, an $L_1$-distance between the histograms of paths from the root to every leaf, and the other is the complete subtree histogram distance, an $L_1$-distance between the histograms of complete subtrees for every node. While the latter is always a metric for general trees, the former is not. In this paper, we show that, for caterpillars, the path histogram distance is always a metric, is computable in linear time and is incomparable with the edit distance. Furthermore, we give experimental results for caterpillars in real data, comparing the path histogram distance and the complete subtree histogram distance with the isolated-subtree distance, the most general tractable variation of the edit distance.


Introduction
An efficient comparison of tree-structured data such as HTML and XML data for web mining, DNA and glycan data for bioinformatics, phrase trees for natural language processing, and so on, is one of the important tasks for data mining. The tree-structured data are often regarded as rooted labelled unordered trees (trees, for short). The most famous distance measure between trees is the edit distance (Tai, 1979).
The edit distance is formulated as the minimum cost of edit operations, consisting of a substitution, a deletion and an insertion, applied to transform one tree into another. Unfortunately, it is known that the problem of computing the edit distance between trees is MAX SNP-hard (Zhang & Jiang, 1994). Furthermore, this statement holds even if the trees are binary (Hirata, Yamamoto, & Kuboyama, 2011).
In order to achieve an efficient comparison of trees, many variations of the edit distance have been developed as more structurally sensitive distances (cf., Kuboyama, 2007; Yoshino & Hirata, 2017). Almost all of the variations are metrics, and the problem of computing them is tractable, that is, cubic-time computable (Yamamoto, Hirata, & Kuboyama, 2013; Yoshino & Hirata, 2017).
Furthermore, we give experimental results for caterpillars in real data, such as N-glycans and all of the glycans (which we refer to as all-glycans) from KEGG, CSLOGS, dblp, and SwissProt, TPC-H, Auction and Nasa from the UW XML Repository, which are listed in Table 1 in Section 4. Then, we compare the path histogram distance and the complete subtree histogram distance with the isolated-subtree distance with respect to running time, distributions and scatter charts.

Preliminaries
A tree T is a connected graph (V, E) without cycles, where V is the set of vertices and E is the set of edges. We denote V and E by V(T) and E(T). The size of T is |V| and is denoted by |T|. We sometimes write v ∈ T instead of v ∈ V(T). We denote an empty tree (∅, ∅) by ∅. A rooted tree is a tree with one vertex r chosen as its root. We denote the root of a rooted tree T by r(T).
Let T be a rooted tree such that r = r(T) and u, v, w ∈ T. We denote the unique path from r to v by $UP_r(v)$. The parent of v (≠ r) is its adjacent vertex on $UP_r(v)$ and the ancestors of v (≠ r) are the vertices on $UP_r(v)$ − {v}. We say that u is a child of v if v is the parent of u, and that u is a descendant of v if v is an ancestor of u. We call a vertex with no children a leaf and denote the set of all the leaves in T by lv(T).
Let T be a tree (V, E) and v a vertex in T. A complete subtree of T at v is the subtree of T rooted at v that consists of v and all of its descendants. We use the ancestor orders < and ≤, that is, u < v if v is an ancestor of u, and u ≤ v if u < v or u = v. We say that w is the least common ancestor of u and v, denoted by u ⊔ v, if u ≤ w, v ≤ w and there exists no vertex w′ ∈ T such that w′ < w, u ≤ w′ and v ≤ w′.
For v ∈ T, pre(v) (resp., post(v)) denotes the occurrence order of v in the preorder (resp., postorder) traversal of all the vertices in T. Then, we say that u is to the left of v in T if pre(u) ≤ pre(v) and post(u) ≤ post(v). We say that a rooted tree is ordered if a left-to-right order among siblings is given, and unordered otherwise. We say that a rooted tree is labelled if each vertex is assigned a symbol from a fixed finite alphabet Σ. For a vertex v, we denote the label of v by l(v), and sometimes identify v with l(v). In this paper, we simply call a rooted labelled unordered tree a tree.
We say that a tree C such that r = r(C) is a caterpillar (cf., Gallian, 2007) if C is transformed to a path $UP_r(v)$ for some v ∈ C after removing lv(C). For a caterpillar C, we call the remaining path $UP_r(v)$ the backbone of C and denote it by bb(C). It is obvious that V(C) = V(bb(C)) ∪ lv(C). We can determine whether or not a tree T is a caterpillar in O(|T|) time.
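The linear-time test mentioned above can be sketched as follows. This is a minimal sketch and not taken from the paper: it assumes that a tree is encoded as a pair (label, children), and the name is_caterpillar is hypothetical. It walks down from the root and checks that every visited vertex has at most one non-leaf child. A single-vertex tree is treated as a caterpillar here, which is a convention assumed only for this sketch.

```python
def is_caterpillar(tree):
    """Return True if removing all leaves of `tree` leaves a path from the root.

    Assumed encoding (not from the paper): a tree is a pair (label, children),
    where children is a list of trees and a leaf has an empty list.
    """
    node = tree
    while True:
        _, children = node
        # Non-leaf children must continue the backbone, so at most one is allowed.
        internal = [child for child in children if child[1]]
        if len(internal) > 1:
            return False   # the remaining vertices would branch
        if not internal:
            return True    # reached the last backbone vertex
        node = internal[0]
```

Every vertex is inspected once as a child of a backbone vertex, so the running time is linear in the number of vertices.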
Next, we introduce an edit distance and a Tai mapping.
Definition 2.2 (Edit distance (Tai, 1979)): For a cost function $\gamma$, the cost of an edit operation $e = l_1 \mapsto l_2$ is given by $\gamma(e) = \gamma(l_1, l_2)$. The cost of a sequence $E = e_1, \ldots, e_k$ of edit operations is given by $\gamma(E) = \sum_{i=1}^{k} \gamma(e_i)$. Then, an edit distance $t_{TAI}(T_1, T_2)$ between trees $T_1$ and $T_2$ is defined as follows:
$$t_{TAI}(T_1, T_2) = \min\{\gamma(E) \mid E \text{ is a sequence of edit operations transforming } T_1 \text{ to } T_2\}.$$
Definition 2.3 (Tai mapping (Tai, 1979)): Let $T_1$ and $T_2$ be trees. We say that a triple $(M, T_1, T_2)$ is a Tai mapping (a mapping, for short) from $T_1$ to $T_2$ if $M \subseteq V(T_1) \times V(T_2)$ and every pair $(v_1, w_1)$ and $(v_2, w_2)$ in $M$ satisfies the following conditions: (1) $v_1 = v_2$ iff $w_1 = w_2$ (one-to-one condition). (2) $v_1 \le v_2$ iff $w_1 \le w_2$ (ancestor condition).
We will use $M$ instead of $(M, T_1, T_2)$ when there is no confusion and denote it by $M \in \mathcal{M}_{TAI}(T_1, T_2)$. Let $M$ be a mapping from $T_1$ to $T_2$. Then, the cost $\gamma(M)$ of $M$ is the sum of the costs of the substitutions given by the pairs in $M$, the deletions of the vertices of $T_1$ not appearing in $M$ and the insertions of the vertices of $T_2$ not appearing in $M$. Trees $T_1$ and $T_2$ are isomorphic, denoted by $T_1 \equiv T_2$, if there exists a mapping $M \in \mathcal{M}_{TAI}(T_1, T_2)$ such that $M|_1 = M|_2 = \emptyset$, that is, no vertex of $T_1$ or $T_2$ is left unmapped, and $\gamma(M) = 0$.
Theorem 2.5 (1) For trees $T_1$ and $T_2$, the problem of computing $t_{TAI}(T_1, T_2)$ is MAX SNP-hard (Zhang & Jiang, 1994). This statement holds even if both $T_1$ and $T_2$ are binary, the maximum height of $T_1$ and $T_2$ is at most 3, or the cost function is the unit cost function (Akutsu et al., 2013; Hirata et al., 2011).
Finally, we introduce an isolated-subtree mapping and an isolated-subtree distance as the variations of the Tai mapping and the edit distance.
Definition 2.6 (Isolated-subtree mapping and distance (Zhang, 1996)): Let $T_1$ and $T_2$ be trees and $M \in \mathcal{M}_{TAI}(T_1, T_2)$. We say that $M$ is an isolated-subtree mapping, denoted by $M \in \mathcal{M}_{ILST}(T_1, T_2)$, if $M$ satisfies the following condition for all $(v_1, w_1), (v_2, w_2), (v_3, w_3) \in M$:
$$v_3 < v_1 \sqcup v_2 \iff w_3 < w_1 \sqcup w_2.$$
Furthermore, we define an isolated-subtree distance $t_{ILST}(T_1, T_2)$ as follows:
$$t_{ILST}(T_1, T_2) = \min\{\gamma(M) \mid M \in \mathcal{M}_{ILST}(T_1, T_2)\}.$$
It is obvious that $\mathcal{M}_{ILST}(T_1, T_2) \subseteq \mathcal{M}_{TAI}(T_1, T_2)$ and hence $t_{TAI}(T_1, T_2) \le t_{ILST}(T_1, T_2)$. In contrast to Theorem 2.5, the following theorem holds.
Theorem 2.7 (cf., Yamamoto et al., 2013): Let $T_1$ and $T_2$ be trees. Then, we can compute $t_{ILST}(T_1, T_2)$ in polynomial time.
It is known that $t_{ILST}$ is the most general tractable variation of $t_{TAI}$ (Yoshino & Hirata, 2017).

Path histogram distance and complete subtree histogram distance for caterpillars
In this section, we formulate a path histogram distance and a complete subtree histogram distance between caterpillars.
First, we introduce a path histogram distance. Let $T$ be a tree such that $r = r(T)$. Then, for $v \in lv(T)$, we regard the path $UP_r(v)$ consisting of the vertices $v_1, \ldots, v_k$ such that $v_1 = r$ and $v_k = v$ as a string $l(v_1) \cdots l(v_k)$ on $\Sigma$ and denote it by $s(r, v)$. Also we say that a string $s \in \Sigma^*$ occurs in $T$ if there exists a leaf $v \in lv(T)$ such that $s = s(r, v)$, and denote the number of occurrences of $s$ in $T$ by $f(s, T)$. Furthermore, we denote the set of all strings occurring in $T$ by $S(T)$ and call $H_P(T) = \{\langle s, f(s, T) \rangle \mid s \in S(T)\}$ the path histogram of $T$.
For trees $T_1$ and $T_2$, a path histogram distance $d_P(T_1, T_2)$ between $T_1$ and $T_2$ is defined as an $L_1$-distance between $H_P(T_1)$ and $H_P(T_2)$, that is,
$$d_P(T_1, T_2) = \sum_{s \in S(T_1) \cup S(T_2)} |f(s, T_1) - f(s, T_2)|.$$
For $l = |lv(T)|$, it is obvious that $|H_P(T)| \le l$ and $\sum_{s \in S(T)} f(s, T) = l$.
Next, we introduce a complete subtree histogram distance. We denote the set of all the complete subtrees of $T$, identified up to isomorphism, by $C(T)$ and the number of occurrences of $c \in C(T)$ in $T$ by $f(c, T)$, and call $H_{CS}(T) = \{\langle c, f(c, T) \rangle \mid c \in C(T)\}$ the complete subtree histogram of $T$.
For trees $T_1$ and $T_2$, a complete subtree histogram distance $d_{CS}(T_1, T_2)$ between $T_1$ and $T_2$ is defined as an $L_1$-distance between $H_{CS}(T_1)$ and $H_{CS}(T_2)$, that is,
$$d_{CS}(T_1, T_2) = \sum_{c \in C(T_1) \cup C(T_2)} |f(c, T_1) - f(c, T_2)|.$$
For $n = |T|$, it is obvious that $|H_{CS}(T)| \le n$ and $\sum_{c \in C(T)} f(c, T) = n$. As for the properties of $d_{CS}$, the following theorem holds.
(1) $d_{CS}$ is a metric.
In the remainder of this section, we discuss the properties of the path histogram distance $d_P$. We start with the following example.
Example 3.4 For trees, $d_P$ is not a metric in general. For trees $T_1$ and $T_2$ in Figure 2, it holds that $d_P(T_1, T_2) = 0$ (because $H_P(T_i) = \{\langle aba, 2 \rangle\}$ for $i = 1, 2$) but $T_1 \not\equiv T_2$. In Figure 2, $T_1$ is a caterpillar but $T_2$ is not. If both trees are caterpillars, then the following theorem holds.
Theorem 3.5 For caterpillars, $d_P$ is always a metric.
Proof. By the definition, it is sufficient to show that $d_P(C_1, C_2) = 0$ iff $C_1 \equiv C_2$ for caterpillars $C_1$ and $C_2$. In other words, it is sufficient to show that we can reconstruct $C$ from $H_P(C)$ uniquely.
Let $C$ be a caterpillar such that $r = r(C)$. Suppose that the backbone $bb(C)$ of $C$ consists of vertices $v_1, \ldots, v_k$ such that $v_1 = r$. We denote the string $l(v_1) \cdots l(v_k)$ representing $bb(C)$ by $s(C)$. Then, for every leaf $v \in lv(C)$, $s(r, v)$ is of the form $l(v_1) \cdots l(v_i) l(v)$ for some $i$ ($1 \le i \le k$), where $v$ is a child of $v_i$ and $l(v_1) \cdots l(v_i)$ is a prefix of $s(C)$. Hence, consider the following procedure: First, select a longest string $s = l(v_1) \cdots l(v_n)$ in $H_P(C)$ and set $v_1, \ldots, v_{n-1}$ to a backbone. Next, for every $\langle s', f(s', C) \rangle \in H_P(C)$ such that $s' = l(v_1) \cdots l(v_i) a$ for $a \in \Sigma$, add $f(s', C)$ leaves labelled by $a$ as children of $v_i$. Since every caterpillar has just one backbone, the above procedure constructs a caterpillar $C$ from $H_P(C)$ uniquely.
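To make the two definitions concrete, the following sketch computes both histograms and their $L_1$-distances for small trees. It is illustrative only: the encoding of a tree as a pair (label, children), the assumption that labels are strings, and all function names are choices of this sketch rather than of the paper. Complete subtrees are identified up to isomorphism by sorting the encodings of the children, so this version is simple rather than linear-time, and the recursion is only suitable for trees of moderate height.

```python
from collections import Counter

def path_histogram(tree):
    """Map each root-to-leaf label sequence s to its frequency f(s, T)."""
    hist = Counter()
    stack = [(tree, (tree[0],))]
    while stack:
        (_, children), path = stack.pop()
        if not children:
            hist[path] += 1          # a leaf closes one root-to-leaf path
        for child in children:
            stack.append((child, path + (child[0],)))
    return hist

def complete_subtree_histogram(tree):
    """Map each complete subtree (up to isomorphism) to its frequency f(c, T)."""
    hist = Counter()

    def encode(node):
        label, children = node
        # Sorting makes the encoding independent of the order of siblings.
        code = (label, tuple(sorted(encode(child) for child in children)))
        hist[code] += 1
        return code

    encode(tree)
    return hist

def l1_distance(h1, h2):
    """L1-distance between two histograms; absent keys count as frequency 0."""
    return sum(abs(h1[key] - h2[key]) for key in h1.keys() | h2.keys())

def d_p(t1, t2):
    return l1_distance(path_histogram(t1), path_histogram(t2))

def d_cs(t1, t2):
    return l1_distance(complete_subtree_histogram(t1),
                       complete_subtree_histogram(t2))

# Two hypothetical trees whose path histograms are both {<aba, 2>}.
t1 = ("a", [("b", [("a", []), ("a", [])])])            # a caterpillar
t2 = ("a", [("b", [("a", [])]), ("b", [("a", [])])])   # not a caterpillar
print(d_p(t1, t2))    # 0, although the trees are not isomorphic
print(d_cs(t1, t2))   # 5
```

The two sample trees are non-isomorphic but share the same path histogram, so $d_P$ cannot separate them, whereas $d_{CS}$ does.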
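The reconstruction procedure in the proof can be sketched as follows. This is an illustrative sketch under assumptions that are not in the paper: the histogram is given as a dictionary from tuples of labels to frequencies, a tree is encoded as a pair (label, children), and the function name is hypothetical. The input is assumed to be the path histogram of some caterpillar; inconsistent input raises an error.

```python
def caterpillar_from_path_histogram(hist):
    """Rebuild a caterpillar (label, children) from its path histogram."""
    longest = max(hist, key=len)
    if len(longest) == 1:
        return (longest[0], [])              # single-vertex tree
    backbone = longest[:-1]                  # labels of v_1, ..., v_{n-1}
    # leaves[i] collects the leaf children of the backbone vertex v_{i+1}.
    leaves = [[] for _ in backbone]
    for s, count in hist.items():
        i = len(s) - 2                       # index of the parent of the leaf
        if i < 0 or s[:-1] != backbone[:i + 1]:
            raise ValueError("not the path histogram of a caterpillar")
        leaves[i].extend((s[-1], []) for _ in range(count))
    # Link the backbone vertices from the bottom up.
    node = None
    for label, kids in zip(reversed(backbone), reversed(leaves)):
        if node is not None:
            kids = kids + [node]             # the next backbone vertex is a child
        node = (label, kids)
    return node

print(caterpillar_from_path_histogram({("a", "b", "a"): 2}))
# ('a', [('b', [('a', []), ('a', [])])])
```

Since the trees are unordered, the position of the backbone child among its siblings in the output carries no meaning.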
Theorem 3.6 For caterpillars, $d_P$ is computable in linear time.
Proof. Let $C$ be a caterpillar such that $r = r(C)$ and suppose that $bb(C)$ consists of vertices $r = v_1, \ldots, v_n$. Then, repeat the following procedure for $i = 1$ to $n$: For every leaf $v$ which is a child of $v_i$, store $s = l(v_1) \cdots l(v_i) l(v)$ and, as $f(s, C)$, the number of leaves with the label $l(v)$ among the children of $v_i$.
Since $V(C) = V(bb(C)) \cup lv(C)$ and $V(bb(C)) \cap lv(C) = \emptyset$, the repetition traverses every vertex in $C$ just once, so we can compute $H_P(C)$ in $O(|C|)$ time.
Theorem 3.7 There exist caterpillars $C_1$ and $C_2$ satisfying the following statements. These statements also hold even if $t_{TAI}$ is replaced with $t_{ILST}$. (1) $t_{TAI}(C_1, C_2) < d_P(C_1, C_2)$. (2) $d_P(C_1, C_2) < t_{TAI}(C_1, C_2)$.
Proof. Caterpillars $C_1$ and $C_2$ that are isomorphic except that the labels of their roots are different satisfy statement 1. On the other hand, paths $C_1$ and $C_2$ of the same length such that all the labels of vertices in $C_1$ are $a$ and all the labels of vertices in $C_2$ are $b$ satisfy statement 2.
Hence, $t_{TAI}$ (or $t_{ILST}$) and $d_P$ are incomparable metrics between caterpillars.
Theorem 3.9 There exist caterpillars $C_1$ and $C_2$ satisfying the following statements. (1) $d_{CS}(C_1, C_2) < d_P(C_1, C_2)$. (2) $d_P(C_1, C_2) < d_{CS}(C_1, C_2)$.
Proof. Let $C_1$ be a star, that is, $bb(C_1)$ consists of a single vertex, and $C_2$ the caterpillar obtained by adding a new vertex as the parent of the root of $C_1$, so that the new vertex is the root of $C_2$. Then, since $S(C_1) \cap S(C_2) = \emptyset$, it holds that $d_P(C_1, C_2) = O(l)$. On the other hand, it is obvious that $d_{CS}(C_1, C_2) = 1$, so statement 1 holds. Furthermore, the same caterpillars $C_1$ and $C_2$ as in the proof of Theorem 3.7.2 also satisfy statement 2. Hence, $d_{CS}$ and $d_P$ are also incomparable metrics between caterpillars.
Since $0 \le d_P(C_1, C_2) \le l_1 + l_2$ and $0 \le d_{CS}(C_1, C_2) \le n_1 + n_2$, where $l_i = |lv(C_i)|$ and $n_i = |C_i|$, we use the normalized versions $d_P^*$ and $d_{CS}^*$ of $d_P$ and $d_{CS}$, that is, $d_P^*(C_1, C_2) = d_P(C_1, C_2)/(l_1 + l_2)$ and $d_{CS}^*(C_1, C_2) = d_{CS}(C_1, C_2)/(n_1 + n_2)$. Additionally, since $0 \le t_{ILST}(C_1, C_2) \le n_1 + n_2$, we use the normalized version $t_{ILST}^*$ of $t_{ILST}$, that is, $t_{ILST}^*(C_1, C_2) = t_{ILST}(C_1, C_2)/(n_1 + n_2)$.
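One possible realisation of this backbone traversal is sketched below. It is not the authors' implementation. Storing every string explicitly would cost more than linear space, so this sketch makes an assumption of its own: each string is represented by the pair (i, a), meaning the first i backbone labels followed by the leaf label a. Two such strings from different caterpillars are then equal exactly when the pairs agree and i does not exceed the length of the longest common prefix of the two backbones. The tree encoding (label, children) and all names are hypothetical.

```python
from collections import Counter

def backbone_histogram(cat):
    """Return (backbone labels, Counter keyed by (prefix length, leaf label))."""
    label, children = cat
    if not children:                      # single vertex: the root is the only leaf
        return [], Counter({(0, label): 1})
    backbone, hist = [], Counter()
    node = cat
    while node is not None:
        label, children = node
        backbone.append(label)
        nxt = None
        for child in children:
            if child[1]:                  # a non-leaf child continues the backbone
                if nxt is not None:
                    raise ValueError("not a caterpillar")
                nxt = child
            else:
                hist[(len(backbone), child[0])] += 1
        node = nxt
    return backbone, hist

def caterpillar_path_distance(c1, c2):
    """Path histogram distance between two caterpillars in expected linear time."""
    b1, h1 = backbone_histogram(c1)
    b2, h2 = backbone_histogram(c2)
    p = 0                                 # length of the longest common backbone prefix
    while p < len(b1) and p < len(b2) and b1[p] == b2[p]:
        p += 1
    dist = 0
    for key in h1.keys() | h2.keys():
        if key[0] <= p:                   # same prefix, hence the same string
            dist += abs(h1[key] - h2[key])
        else:                             # different prefixes, hence different strings
            dist += h1[key] + h2[key]
    return dist
```

Each vertex is visited once and each histogram key is handled once, so the work is linear in $|C_1| + |C_2|$ up to the cost of hashing.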
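The two constructions in the proof can be checked by a short calculation. The following worked example is a sketch under explicit assumptions that are not stated in this form in the text: the cost function is the unit cost function, each caterpillar in the first case has $l$ leaves, and each path in the second case has $n \ge 3$ vertices.

```latex
% Case 1: C_1 and C_2 are isomorphic except for their root labels.
% One substitution at the root suffices, while every root-to-leaf string
% starts with the root label, so S(C_1) and S(C_2) are disjoint.
\begin{align*}
t_{TAI}(C_1, C_2) &= 1, &
d_P(C_1, C_2) &= l + l = 2l > 1 .
\end{align*}
% Case 2: C_1 is a path of n vertices labelled a, C_2 a path of n vertices labelled b.
% Each path has exactly one leaf, so each histogram holds one string with frequency 1,
% while all n labels must be substituted.
\begin{align*}
d_P(C_1, C_2) &= 1 + 1 = 2, &
t_{TAI}(C_1, C_2) &= n > 2 .
\end{align*}
```

In both cases the mapping that realises the edit distance preserves the whole structure, which is consistent with the remark that the statements carry over to the isolated-subtree distance.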
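As a worked instance of the star construction and of the normalization, suppose that the star $C_1$ has $l$ leaves and that the new vertex of $C_2$ is placed above the root of $C_1$. The placement of the new vertex and the concrete value $l = 5$ are assumptions of this example.

```latex
% Sizes: C_1 has l leaves and l + 1 vertices; C_2 has l leaves and l + 2 vertices.
\begin{align*}
l_1 = l_2 &= l, & n_1 &= l + 1, & n_2 &= l + 2 .
\end{align*}
% Every string of C_2 is one symbol longer than every string of C_1,
% and C_2 itself is the only complete subtree not shared by both trees.
\begin{align*}
d_P(C_1, C_2) &= l + l = 2l, & d_{CS}(C_1, C_2) &= 1, \\
d_P^*(C_1, C_2) &= \frac{2l}{l_1 + l_2} = 1, &
d_{CS}^*(C_1, C_2) &= \frac{1}{n_1 + n_2} = \frac{1}{2l + 3} .
\end{align*}
% For l = 5: d_P = 10, d_CS = 1, d_P^* = 1 and d_CS^* = 1/13.
```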

Experimental results
In this section, we give experimental results of comparing $d_P$ and $d_{CS}$ with $t_{ILST}$ (and comparing $d_P^*$ and $d_{CS}^*$ with $t_{ILST}^*$) for caterpillars.
We use N-glycans and all of the glycans (which we refer to as all-glycans) from KEGG, CSLOGS, dblp, and SwissProt, TPC-H, Auction and Nasa from the UW XML Repository as datasets. Table 1 shows the number of caterpillars in the datasets, denoted by #cat, and the number of data, denoted by #data.
Note that, in the dataset TPC-H, the number of different caterpillars is just 8, so we use these 8 caterpillars, referred to as TPC-H$^\bullet$, in the following. Furthermore, for $D \in \{\text{Auction}, \text{Nasa}\}$, $D^-$ denotes the set of trees obtained by deleting the root of every tree in $D$. Since one tree in $D$ produces several trees in $D^-$, the total number of trees in $D^-$ is greater than that of $D$.
We deal with caterpillars for N-glycans, all-glycans, CSLOGS, the 0.1% (= 5154) of caterpillars selected from the largest ones in dblp (which we refer to as dblp$_{0.1}$), SwissProt, TPC-H$^\bullet$, Auction$^-$ and Nasa$^-$. Table 2 shows the information of these caterpillars. Here, ([a, b]; c) means that a, b and c are the minimum, the maximum and the average values, respectively.
In the following, we denote the number of caterpillars (#cat), the number of vertices (#vertex), the degree, the height, the number of leaves (#leaves) and the number of labels (#labels) by c, n, d, h, λ and β, respectively. Table 3 shows the running time to compute $d_P$, $d_{CS}$, $t_{ILST}$ and $t_{TAI}$ for all the pairs of caterpillars in Table 2. Here, we assume that the cost function in $t_{ILST}$ and $t_{TAI}$ is the unit cost function. The running time of $t_{TAI}$ is taken from the result in Muraka, Yoshino, and Hirata (2019). The computer environment is as follows: the CPU is an Intel Xeon E5-1650 v3 (3.50 GHz), the RAM is 1 GB and the OS is Ubuntu Linux 14.04 (64 bit). Table 3 shows that, except for TPC-H$^\bullet$, the running times are ordered as $d_P < d_{CS} < t_{ILST} < t_{TAI}$. The running time of $d_P$ and $d_{CS}$ is large if both n and c are large. On the other hand, the running time of $t_{ILST}$ (and hence of $t_{TAI}$) is large if both n and d (or λ), rather than c, are large.
For the ratios $t_{ILST}/d_P$ and $t_{ILST}/d_{CS}$ of the running times, $t_{ILST}/d_P$ is smaller than 10 and $t_{ILST}/d_{CS}$ is smaller than 7 for N-glycans, all-glycans, CSLOGS, Auction$^-$ and Nasa$^-$. On the other hand, the ratios are much larger for dblp$_{0.1}$ and SwissProt than for the others. The reason is that the caterpillars in dblp$_{0.1}$ and SwissProt have much larger values of n, d and λ and a smaller value of h than those in the other datasets.
Next, we discuss the distributions and scatter charts of the normalized distances $d_P^*$, $d_{CS}^*$ and $t_{ILST}^*$. Figure 3 illustrates the distributions of the normalized distances $d_P^*$ (long dashed line), $d_{CS}^*$ (solid line) and $t_{ILST}^*$ (short dashed line) for pairs of caterpillars in N-glycans, all-glycans, CSLOGS, dblp$_{0.1}$, SwissProt and Nasa$^-$. Here, we omit the distributions for TPC-H$^\bullet$ and Auction$^-$, because $d_P^*$ and $d_{CS}^*$ take just one value, 1, in TPC-H$^\bullet$ and just two values, 0 and 1, in Auction$^-$. Figure 3 shows that (1) the distribution of $d_P^*$ is larger than that of $d_{CS}^*$ for N-glycans and all-glycans, (2) the distributions of all the distances concentrate near 1 for CSLOGS and dblp$_{0.1}$ and (3) the distributions of $d_P^*$, $d_{CS}^*$ and $t_{ILST}^*$ are independent. Figures 4 and 5 illustrate the scatter charts of the normalized distances $d_P^*$, $d_{CS}^*$ and $t_{ILST}^*$ for N-glycans, all-glycans and CSLOGS, and for dblp$_{0.1}$, SwissProt and Nasa$^-$, respectively.
Here, $d_1/d_2$ ($d_1, d_2 \in \{d_P^*, d_{CS}^*, t_{ILST}^*\}$) denotes the scatter chart of the pairs of caterpillars with the value of the distance $d_2$ on the x-axis and the value of the distance $d_1$ on the y-axis. Also, cc denotes the correlation coefficient between $d_1$ and $d_2$. Figures 4 and 5 show that the plots concentrate at x = 1 for the cases of $t_{ILST}^*/d_P^*$ and $t_{ILST}^*/d_{CS}^*$ in N-glycans, all-glycans, CSLOGS and Nasa$^-$ and for the case of $t_{ILST}^*/d_P^*$ in dblp$_{0.1}$, and that the plots concentrate at y = 1 for the cases of $d_P^*/d_{CS}^*$ in N-glycans, all-glycans, CSLOGS, dblp$_{0.1}$ and Nasa$^-$. Only in SwissProt do the plots concentrate on no axis.
Also, the correlation coefficient of $t_{ILST}^*/d_{CS}^*$ has the highest value in N-glycans, all-glycans and dblp$_{0.1}$, whereas the correlation coefficient of $d_P^*/d_{CS}^*$ has the highest value in CSLOGS, SwissProt and Nasa$^-$. Furthermore, consider the correlation coefficients in Figures 4 and 5. (1) The correlation coefficient of $d_P^*/d_{CS}^*$ in N-glycans and those of all the cases in all-glycans are small, that is, less than 0.5. Also, the correlation coefficient of neither $t_{ILST}^*/d_P^*$ nor $t_{ILST}^*/d_{CS}^*$ in N-glycans is large, that is, both are less than 0.6. Hence, no pairs of distances are interrelated in these datasets. (2) The correlation coefficient of $d_P^*/d_{CS}^*$ in CSLOGS is much larger than those of $t_{ILST}^*/d_P^*$ and $t_{ILST}^*/d_{CS}^*$, where the former is greater than 0.7 but the latter are less than 0.3. Also, the correlation coefficient of $t_{ILST}^*/d_{CS}^*$ in dblp$_{0.1}$ is much larger than those of $t_{ILST}^*/d_P^*$ and $d_P^*/d_{CS}^*$, where the former is greater than 0.9 but the latter are less than 0.6. Hence, the distances $d_P^*$ and $d_{CS}^*$ in CSLOGS and the distances $t_{ILST}^*$ and $d_{CS}^*$ in dblp$_{0.1}$ are interrelated. (3) All of the correlation coefficients in SwissProt are large, that is, greater than 0.95. Also, all of the correlation coefficients in Nasa$^-$ are large, that is, greater than 0.85. Hence, all the distances $d_P^*$, $d_{CS}^*$ and $t_{ILST}^*$ are highly interrelated in these datasets.
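The text does not state which correlation coefficient is used. Assuming the usual Pearson coefficient, a value of cc for one pair of normalized distances could be obtained as in the sketch below. The function name and the sample values are hypothetical and are not taken from the datasets.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of distances.

    Returns NaN if one list is constant, since the coefficient is undefined then.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)

# Hypothetical normalized distances for four pairs of caterpillars.
d_p_star = [1.0, 0.5, 1.0, 0.0]
d_cs_star = [0.9, 0.4, 0.8, 0.1]
print(pearson(d_p_star, d_cs_star))
```

The undefined case matters in practice: a distance that takes a single value on a dataset has zero variance, so no coefficient can be reported for it.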

Conclusion and future works
In this paper, we have introduced the path histogram distance and shown that, for caterpillars, it is a metric, is computable in linear time and is incomparable with the edit distance and its variations. Also, by introducing the complete subtree histogram distance as another metric, we have given experimental results for caterpillars in real data to compare running time, distributions and scatter charts between the path histogram distance $d_P$, the complete subtree histogram distance $d_{CS}$ and the isolated-subtree distance $t_{ILST}$. As a result, whereas $d_P$ and $t_{ILST}$ are incomparable by Theorem 3.7 and $d_P$ and $d_{CS}$ are incomparable by Theorem 3.9, the experimental results show that there exist cases in which these distances are interrelated.
In this paper, we have just referred to the result of computing $t_{TAI}$ for caterpillars (Muraka et al., 2019) in Table 3. Hence, it is a future work to compare $d_P$ and $d_{CS}$ with $t_{TAI}$, instead of $t_{ILST}$, in more detail. It is also an important future work to compare $d_P$ and $d_{CS}$ with the vertical and horizontal distances (Muraka et al., 2019), which approximate $t_{TAI}$ for caterpillars.
As another metric between trees, Kawaguchi, Yoshino, and Hirata (2018) have introduced an earth mover's distance (Rubner, Tomasi, & Guibas, 2007) for trees. Since their formulation is based on the complete subtree histogram, it is a future work to formulate the earth mover's distance for caterpillars based on the path histogram, if possible, and to compare it with the path histogram distance.