DW-PathSim: a distributed computing model for topic-driven weighted meta-path-based similarity measure in a large-scale content-based heterogeneous information network

ABSTRACT Most earlier studies in information network mining were designed for networks with a single type of object and link, called homogeneous information networks (HoINs). These HoIN-based approaches are unsuitable for networks with multi-typed objects and links, known as heterogeneous information networks (HINs). Most real-world networks are not only heterogeneous in structure but also extremely large, and their size directly affects system performance. In this paper, we focus on improving topic-driven weighted similarity measurement between same-typed objects in an HIN using a meta-path-based mechanism, called W-PathSim. We also optimize the performance of W-PathSim on very large-scale HINs by combining it with distributed graph computing using GraphFrames on Spark; the resulting model is called DW-PathSim. DW-PathSim addresses both weighted meta-path-based similarity search in HINs and distributed computation over big networked data. We test the DW-PathSim model on the real-world DBLP dataset to demonstrate the effectiveness of the proposed models.


Introduction
This paper is an extended version of the ACIIDS 2018 conference paper (Pham, Do, & Ta, 2018). Building on the results of the proposed W-PathSim model, this paper concentrates on improving the model's capability to handle large-scale networks. The new model, called DW-PathSim, enables topic-driven similarity measurement between nodes in large-scale networks by using Spark GraphFrames.
Most real-world data entities are interconnected via multiple types of relationships within large-scale networks, called information networks (Chang et al., 2015; Shi et al., 2017; Sun & Han, 2012, 2013). Real information networks, such as social networks, biological gene networks and the WWW, consist of a large number of interacting, multi-typed components. In the past, most studies treated information networks as homogeneous, that is, as containing only one type of object and link. These homogeneous information network (HoIN)-based studies are unsuitable for heterogeneous information network (HIN)-based mining tasks because they fail to distinguish the different types of objects and links in the network. Because of these drawbacks, more and more researchers have recently begun to consider the differences in the types of objects and relationships in HINs and to develop suitable mechanisms for HIN-based mining tasks. Moreover, for big networked data, the developed approaches also need to work on distributed computing infrastructure. Recent real-world networks, such as the common social networks (Facebook, Twitter, etc.), have billions of nodes and a vast number of relationships, so standalone computing approaches are impractical in this case.

Preliminaries and backgrounds
In information network mining (see Definition 1), evaluating relationships and computing relevance scores between same-typed objects are fundamental problems. Similarity measurement is a baseline task for other important mining tasks in information networks. Moreover, evaluating the similarity between objects also supports the construction of object-based information retrieval and recommendation systems (Yu et al., 2014). Most approaches to measuring the similarity between two same-typed objects are based on evaluating the relationships between linked objects.
Definition 1 (Information Network (IN)) (Sun & Han, 2012, 2013). An information network is defined as a directed or undirected graph, denoted as $G = (V, E)$. Each vertex $v \in V$ represents an object, and each edge/link $e \in E$ represents a relationship between objects in $V$ (as illustrated in Figure 1(A)). An IN has a set of object types $A$ and a set of relation types $R$, with two mapping functions: $f: V \to A$, $f(v) \in A$, and $c: E \to R$, $c(e) \in R$. In general, an IN belongs to one of two main types:
• Definition 1a. Homogeneous Information Network (HoIN). A HoIN contains only one type of object and one type of relationship between objects ($|A| = 1$ and $|R| = 1$).
• Definition 1b. Heterogeneous Information Network (HIN). An HIN is a more complex IN in which the number of object types and the number of relation types are both greater than 1 ($|A| > 1$ and $|R| > 1$).
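Definition 1 can be illustrated with a small data structure. The following is a minimal Python sketch, not the authors' implementation: the class name `InformationNetwork`, its methods and the toy DBLP-like nodes are hypothetical, and the two dictionaries simply play the roles of the mappings $f$ and $c$.

```python
from dataclasses import dataclass, field


@dataclass
class InformationNetwork:
    """Toy container for G = (V, E) with the type mappings f and c of Definition 1."""
    node_type: dict = field(default_factory=dict)  # f: vertex -> object type in A
    edge_type: dict = field(default_factory=dict)  # c: (source, target) -> relation type in R

    def add_node(self, node, obj_type):
        self.node_type[node] = obj_type

    def add_edge(self, source, target, relation):
        self.edge_type[(source, target)] = relation

    def is_heterogeneous(self):
        # Definition 1b: |A| > 1 and |R| > 1.
        object_types = set(self.node_type.values())
        relation_types = set(self.edge_type.values())
        return len(object_types) > 1 and len(relation_types) > 1


# Hypothetical DBLP-like fragment: two authors, one paper, one venue.
dblp = InformationNetwork()
for node, obj_type in [("a1", "author"), ("a2", "author"), ("p1", "paper"), ("v1", "venue")]:
    dblp.add_node(node, obj_type)
dblp.add_edge("a1", "p1", "write")
dblp.add_edge("a2", "p1", "write")
dblp.add_edge("p1", "v1", "submit")

# Hypothetical co-author graph: one object type and one relation type (a HoIN).
coauthor = InformationNetwork()
coauthor.add_node("a1", "author")
coauthor.add_node("a2", "author")
coauthor.add_edge("a1", "a2", "co-author")

print(dblp.is_heterogeneous(), coauthor.is_heterogeneous())
```

The first network has three object types and two relation types, so it counts as an HIN; the co-author graph has one of each and is a HoIN.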
Definition 2 (Network Schema (NS)) (Sun & Han, 2012, 2013). Similar to the schema of a relational database, the network schema is defined as the 'meta template' or 'network structural schema' of an information network (as shown in Figure 1(B)). The network schema of a given network is denoted as $NS_G = (A, R)$ and represents how a specific object/entity type is connected with other objects of the same or different types.
Multi-typed object and link in HIN mining. In an HIN, the network schema (see Definition 2) specifies the type constraints on the sets of existing objects and their relationships. Similarity measurement is only meaningful when evaluated on a set of same-typed objects; we cannot meaningfully calculate similarity scores for different-typed pairs of objects, such as 'author' vs. 'paper' or 'paper' vs. 'venue'. Moreover, the similarity score between a pair of same-typed objects is computed by evaluating how they are linked to each other via different paths in the given network. However, the paths that link same-typed objects may differ in the relationships and the sequential order of the linked nodes within the path. For example, in the DBLP bibliographic network, two 'author' objects might be linked via different types of paths, such as author -(work at)- affiliation -(work at)- author or author -(write)- paper -(write)- author. These types of paths cannot be treated as the same because they differ in meaning: the first indicates a co-worker relationship between two authors, while the second represents a co-authorship relation. Therefore, Sun et al. (2011) proposed using the 'meta-path' (see Definition 3) to indicate the semantic relations between pairs of objects, which supports similarity measurement as well as other mining tasks in HINs.
Definition 3 (Meta-path). In HIN mining, a meta-path $P$ of length $l$ is defined as a sequence of object types and relations, denoted as $P = A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_{n-1}} A_n$ (with $n = l + 1$), or $P = (A_1, A_2, \ldots, A_n)$ for short.
In HIN-based mining techniques, the meta-path is the principal paradigm for discovering relationships in a multi-typed object network. In contrast to HoIN-based similarity search approaches, two same-typed objects in an HIN can be linked via different meta-paths, and every meta-path has a different semantic meaning (see the examples in Figure 2 and Table 1). Following a defined meta-path $P$, there may be one or several concrete paths $p = (a_1 a_2 \ldots a_l a_{l+1})$, each linking two specific objects $a_1$ and $a_{l+1}$; each such path is called a 'path instance' of meta-path $P$.
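The relation between a meta-path and its path instances can be made concrete with a small check. The sketch below is illustrative only: the function `is_path_instance`, the single-letter type codes and the toy nodes are assumptions, and links are treated as undirected for simplicity.

```python
def is_path_instance(path, meta_path, node_type, edges):
    """Check whether the node sequence `path` is a path instance of `meta_path`."""
    if len(path) != len(meta_path):
        return False
    # Every node must have the object type required at its position.
    types_match = all(node_type.get(n) == t for n, t in zip(path, meta_path))
    # Every pair of consecutive nodes must actually be linked in the network.
    linked = all(frozenset((u, v)) in edges for u, v in zip(path, path[1:]))
    return types_match and linked


# Hypothetical toy network: A = author, P = paper, V = venue.
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "v1": "V"}
edges = {frozenset(e) for e in [("a1", "p1"), ("p1", "v1"), ("v1", "p2"), ("p2", "a2")]}

apvpa = ["A", "P", "V", "P", "A"]  # l = 4 relations, l + 1 = 5 object types
apa = ["A", "P", "A"]              # co-authorship meta-path, l = 2

print(is_path_instance(["a1", "p1", "v1", "p2", "a2"], apvpa, node_type, edges))
print(is_path_instance(["a1", "p1", "a2"], apa, node_type, edges))
```

In this toy network the two authors share a venue but not a paper, so the first sequence is a path instance of A-P-V-P-A while the second is not a path instance of A-P-A.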

Remaining issues and challenge statements
Currently, most HIN-based approaches to similarity measurement between same-typed nodes encounter three main problems.
Binary relation-based similarity evaluation challenge. No prior method considers the weighted attributes of the relations between same-typed objects when evaluating similarity. The PathSim model (Sun et al., 2011) is the best-known meta-path-based binary link evaluation approach. In it, the similarity between two same-typed objects depends mostly on how strongly the two objects are connected, following the paradigm that the more they are linked, the more similar they are. Such methods do not consider the rich attribute values of nodes and links in the similarity evaluation process. For instance, in the DBLP bibliographic network, several object types have their own attributes: 'author' objects have attributes such as 'gender', 'age', 'home-town' and 'education background', while 'paper' objects have attributes such as 'topic' and 'category'. These attributes play an important role in extracting the relevant features that serve as the basis for computing the similarity between same-typed objects.
Non/less-connected objects similarity evaluation challenge. As a consequence of the link-based similarity evaluation approach, the relevance between objects is obtained by counting the number of paths that connect them; hence, less-linked objects are considered less relevant. This is not correct in some cases. For example, assume that two 'authors' work in different affiliations and contribute most of their work to the 'bioinformatics' field, but never or rarely submit their papers to the same venues or journals. With the link-based approach, the similarity score between these two authors is very low or even zero, which is unreasonable in this case.

Figure 2. Examples of common meta-paths that link two 'author' objects in the DBLP bibliographic network.

Table 1. Examples of common meta-paths on the DBLP bibliographic network; every meta-path carries a different semantic meaning.
Meta-path schema | Description of semantic meaning
author -(write)- paper -(write)- author | Co-authorship between two authors working on the same paper
author -(cite)- paper -(cite)- author | Citation relationship between two authors who cite the same set of papers
author -(write)- paper -(submit)- venue -(submit)- paper -(write)- author | Relationship between two authors who commonly submit their work to the same venues/journals

Big networked data processing challenge. Most real-world information networks, such as social networks, the WWW, biological gene databases, chemical compound databases and bibliographic networks, are very large, with nodes and edges numbering in the billions. Traditional standalone approaches are therefore impractical for large-scale information networks because of low performance, long running times and the difficulty of upgrading computing resources. New methods that can handle large-scale information networks are needed.

Proposed solutions and contributions
To address the aforementioned problems, our contributions concentrate on the limitations of binary link-based similarity evaluation in previous approaches as well as on big networked data handling. We propose the DW-PathSim model, an extension of the previous W-PathSim (Pham et al., 2018) content-based HIN similarity measurement model to the distributed computing framework of Apache Spark, with GraphFrames for large-scale graph processing. The W-PathSim model evaluates the topic similarity of content-based objects, such as 'paper' objects in a bibliographic network, and combines it with the classic meta-path-based approach of PathSim. This combination helps to overcome the mentioned challenges of link-based similarity measurement. Overall, our work in this paper is three-fold:
• First, we use the LDA model to extract topic distributions from the content-based objects within the network. These topic distributions are then used to compute the topic similarity between content-based objects via the cosine similarity metric.
• Second, we present the methodology of the W-PathSim and DW-PathSim models, which combine meta-path-based similarity measurement with the topic similarity between content-based objects along predefined meta-paths. The DW-PathSim model is designed to work in the distributed computing environment of Apache Spark in order to handle large-scale heterogeneous networks.
• Finally, we conduct empirical studies on our proposed models, comparing them with previous approaches to both HoIN-based and HIN-based similarity measurement. The experimental outputs demonstrate the effectiveness of our proposed models in terms of both similarity measurement on schema-rich information networks and execution time.
The rest of the paper is structured in four main sections. The second section discusses previous work and our motivations. The third section presents the related background concepts, our methodology and the implementation of the proposed W-PathSim and DW-PathSim models. The fourth section describes our experimental environment, set-up and evaluation metrics, and discusses the experimental outputs in detail. The last section concludes our work and outlines future improvements.

Related works and motivations
With the development of networked data mining, HIN mining has recently attracted researchers from multiple disciplines because of its wide applications. Building on the heterogeneous structure mining problem first introduced by J. Han and Y. Sun, the HIN mining problem has become increasingly popular.
In the context of complicated, dynamic real-world networks, HIN-based mining tasks need to develop rapidly to meet complex knowledge mining requirements. Similarity search is a primitive task underlying the other techniques used to discover knowledge from information networks, such as object/node clustering, object/node classification, link prediction and recommendation. The key principle of HIN-based mining is the use of 'meta-paths', also called 'semantic paths', to guide the evaluation of similarity between pairs of nodes in an HIN. Several approaches have been developed under the paradigm of meta-path-based mining, such as the well-known RankClus model, which is an in-cluster ranking model for bi-typed networks, and NetClus (Sun, Yu, & Han, 2009), which solves the star-network clustering problem. Subsequently, the PathSim (Sun et al., 2011) and PathSelClus algorithms were considered principal baselines for meta-path-based similarity measurement, with strengthened theories of meta-path-based random walks in HINs. However, as mentioned above, most previous models are link-based similarity measurement techniques, which rely mostly on the relationships between objects to compute the level of relevance. Failing to consider the attributes of objects or relations when evaluating similarity may reduce the accuracy of the output. Moreover, most contemporary HIN-based similarity measurement approaches are suitable only for small-scale information networks and cannot be applied in the parallel and distributed computing environments commonly used to handle big networked datasets.

Methodology and system design
In this section, we present the W-PathSim and DW-PathSim models. W-PathSim is designed to run in a standalone environment, while DW-PathSim is intended for a distributed computing environment that supports very large-scale information networks. Both models combine meta-path-based similarity with topic attribute-based similarity between same-typed objects, obtained with the LDA topic model, in a given content-based heterogeneous network.

Topic similarity evaluation between content-based objects in HIN
Within a content-based HIN such as the DBLP bibliographic network, there are rich-text objects such as 'papers' from which latent topic distributions can be extracted with topic modelling techniques such as LSI, pLSI and LDA. In this paper, we select the LDA topic model (Blei, 2012; Blei, Ng, & Jordan, 2003; Steyvers & Griffiths, 2007) to obtain the topic distributions of content-based objects in an HIN. In general, the LDA model extracts the distributions of latent topics $T$ over content-based objects, denoted as $d$. The probability of the $j$th topic for a specific content-based object $d_i$ is denoted as $P(t_j \mid d_i) = \theta_{d_i}^{z_j}$, with $t_j \in T$. Given the probabilities of the set of latent topics $T$, we can represent the content-based object $d_i$ as a $|T|$-dimensional vector $\vec{d_i}$ (as shown in Equation (1)):
$$\vec{d_i} = \left[ P(t_1 \mid d_i),\ P(t_2 \mid d_i),\ \ldots,\ P(t_{|T|} \mid d_i) \right] \tag{1}$$
where
• $\vec{d_i}$ is the set of topic probabilities of the content-based object $d_i$, represented as a $|T|$-dimensional feature vector.
• $P(t_j \mid d_i)$, with $j \in \{1, \ldots, |T|\}$, is the probability of the $j$th topic for the content-based object $d_i$.
Finally, we use these latent topic proportions as the weights between the documents/papers in HINs. To obtain the similarity score between two same-typed content-based objects, denoted as $d_x$ and $d_y$, we use the cosine similarity, denoted as $cos\_sim(d_x, d_y)$ and computed as shown in Equation (2):
$$cos\_sim(d_x, d_y) = \frac{\vec{d_x} \cdot \vec{d_y}}{\lVert \vec{d_x} \rVert \, \lVert \vec{d_y} \rVert} = \frac{\sum_{i=1}^{|T|} d_x^i \, d_y^i}{\sqrt{\sum_{i=1}^{|T|} (d_x^i)^2} \, \sqrt{\sum_{i=1}^{|T|} (d_y^i)^2}} \tag{2}$$
where
• $\vec{d_x}$ and $\vec{d_y}$ are the $|T|$-dimensional feature vectors of the content-based objects $x$ and $y$, respectively.
• $d_x^i$ and $d_y^i$ are the components of the feature vectors $\vec{d_x}$ and $\vec{d_y}$, respectively, that is, the probability of the $i$th topic for each object.
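The cosine similarity between two topic vectors can be sketched in a few lines of Python. This is an illustrative example rather than the authors' code: the three-topic vectors below are invented, whereas in the paper they would come from the trained LDA model.

```python
import math


def cos_sim(d_x, d_y):
    """Cosine similarity between two |T|-dimensional topic vectors."""
    dot = sum(a * b for a, b in zip(d_x, d_y))
    norm_x = math.sqrt(sum(a * a for a in d_x))
    norm_y = math.sqrt(sum(b * b for b in d_y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_x * norm_y)


# Hypothetical topic distributions over |T| = 3 latent topics.
paper_dm_1 = [0.7, 0.2, 0.1]  # mostly "data mining"
paper_dm_2 = [0.6, 0.3, 0.1]  # mostly "data mining"
paper_ai = [0.1, 0.1, 0.8]    # mostly "artificial intelligence"

print(round(cos_sim(paper_dm_1, paper_dm_2), 3))
print(round(cos_sim(paper_dm_1, paper_ai), 3))
```

The two papers dominated by the same topic obtain a score close to 1, while the pair with different dominant topics scores much lower, which is the behaviour the topic weighting relies on.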
Following the given equation, the topic relevance between two content-based objects, such as papers in DBLP, is calculated with the cosine similarity metric. Normally, most content-based objects in HINs are not directly connected. Their relationships are represented by meta-paths; for example, two 'paper' objects are indirectly linked within common meta-paths such as A-P-V-P-A and V-P-A-P-V. In the next section, we define the combination of meta-path-based similarity and the topic similarity of content-based objects within a meta-path to support object similarity measurement.

W-PathSim model: the combination of meta-path-based and topic attribute-based similarity measurement in content-based HIN
Having obtained the topic similarity between content-based objects, such as 'paper' in the DBLP network, we can use it as a weighted attribute when evaluating the similarity between objects of other types, such as 'author' or 'venue', following predefined meta-paths such as A -(write)- P -(submit)- V -(submit)- P -(write)- A, A -(write)- P -(cite)- V -(cite)- P -(write)- A, V -(publish)- P -(written by)- A -(written by)- P -(publish)- V, etc. Consider a general symmetric meta-path $P$ of length $l$, $P = A_1 \xrightarrow{R_1} \cdots A_c \cdots \xrightarrow{R_{l/2}} A_{l/2+1} \cdots A_c \cdots \xrightarrow{R_{l-1}} A_l$, that contains at least one content-based object type, denoted as $A_c$. The W-PathSim model is restricted to meta-paths that contain a content-based object. This limitation is not a considerable problem, for the following reasons. First, most rich-text HINs have a significant number of content-based objects, such as 'comment', 'post' and 'news' objects on social networks such as Facebook and Twitter, or 'research paper/article' objects in bibliographic networks such as DBLP and DBIS. Moreover, these content-based objects appear in most of the common meta-paths and have considerable influence on the other object types.
Therefore, our proposed models can be widely applied in many cases of HIN mining. In the proposed W-PathSim model, the weighted similarity score of two same-typed objects, denoted as $x$ and $y$, follows the meta-path $P$. We redefine the PathCount function of the PathSim model (Sun et al., 2011) as a topic-driven weighted path counting mechanism, denoted as W-PathCount (as shown in Equation (3)):
$$W\text{-}PathCount_P(x, y) = \sum_{p=1}^{|P|} w_p(x, y) \times cos\_sim(x_c, y_c) \tag{3}$$
where
• $p$ and $|P|$ are a specific path instance and the total number of path instances between the two same-typed objects $x$ and $y$, respectively.
• $w_p(x, y)$ is the total weight of the link that connects the two same-typed objects $x$ and $y$ following the meta-path $P$; because most relation types in an HIN are binary relations, its value is often set to 1.
• $x_c$ and $y_c$ are the content-based objects of $x$ and $y$ following the meta-path $P$, respectively. Both $x_c$ and $y_c$ are of the same type, with $f(x_c) = A_c$ and $f(y_c) = A_c$.
In the previous PathSim approach, the path count function, which calculates the total number of path instances between two objects, does not evaluate the paths individually. For a specific meta-path, every path instance is assigned a weight of 1 (binary relation). The traditional PathSim model therefore depends mostly on the number of paths that connect two same-typed objects and does not investigate the importance of each path: the more two objects are connected, the more similar they are. This assumption is not correct in some cases. For example, as illustrated in Figure 3, with the predefined meta-path A-P-V-P-A, there are two path instances between authors (1) and (2), starting from author (1): (path 1) [author 1 → paper 1 → venue 1 → paper 3 → author 2] and (path 2) [author 1 → paper 2 → venue 1 → paper 3 → author 2]. The edges in path 1 and path 2, such as author-paper and paper-venue, are binary relations, so the overall weights of these paths are all 1. Hence, the link-based approach cannot distinguish between these path instances. In the example of Figure 3, the link-based approach indicates that author (3) is more similar to author (1) than author (2) is, because the number of path instances between authors (1) and (3) is higher than that between authors (1) and (2). However, looking at the content of their submitted papers, most of the papers of author (3) cover topics/subtopics of 'artificial intelligence and robotics', while authors (1) and (2) share interests in the 'data mining and retrieval' fields. Consequently, concluding that author (3) is more similar to author (1) than author (2) is does not make sense in this case. Tables 1 and 2 show examples of path instances for meta-paths A-P-V-P-A and V-P-A-P-V, respectively.
Therefore, we need to add topic weighting attributes of pairs of content-based objects to control the output according to topic relevance. The final W-PathSim function is obtained by substituting the path-counting function described in Equation (3) (as shown in Equation (4); see also Table 3):
$$W\text{-}PathSim(x, y) = \frac{2 \times W\text{-}PathCount_P(x, y)}{W\text{-}PathCount_P(x, \neg x) + W\text{-}PathCount_P(y, \neg y)} \tag{4}$$
where
• $W\text{-}PathCount_P(x, \neg x)$ is computed over all path instances, following meta-path $P$, that start from object $x$ and end at other objects of the same type as $x$, with $f(x) = f(\neg x)$.
• $W\text{-}PathCount_P(y, \neg y)$ is computed over all path instances, following meta-path $P$, that start from object $y$ and end at other objects of the same type as $y$, with $f(y) = f(\neg y)$.
Over the whole information network, the W-PathSim model captures a similarity score between two same-typed objects that depends not only on how strongly the two objects are connected but also on how similar their interlinked objects are to each other. This is reasonable in practice: in a bibliographic network, the most closely related authors often research and publish papers in the same topic fields, and a set of friends on a social network such as Facebook often post comments or feeds about the same social topics. Therefore, the combination of the link-based and topic attribute-based approaches in the W-PathSim model supports the similarity search task in the context of a heterogeneous network (Tables 4 and 5).
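The effect of topic weighting on the Figure 3 scenario can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation: W-PathCount is read here as the sum, over path instances, of the link weight times the cosine similarity of the two content objects on the path, and W-PathSim as a PathSim-style normalisation over the weighted counts from each object to all other same-typed objects. The function names, the toy authors and the topic vectors are all hypothetical.

```python
import math


def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def w_path_count(instances, topics, x, y, link_weight=1.0):
    """Assumed reading of W-PathCount: topic-weighted count of path instances x -> y."""
    total = 0.0
    for path in instances:
        if path[0] == x and path[-1] == y:
            x_c, y_c = path[1], path[-2]  # content objects next to x and y on the path
            total += link_weight * cos_sim(topics[x_c], topics[y_c])
    return total


def w_path_sim(instances, topics, x, y):
    """Assumed reading of W-PathSim: normalise by weighted counts to all other objects."""
    same_typed = {p[0] for p in instances} | {p[-1] for p in instances}
    from_x = sum(w_path_count(instances, topics, x, z) for z in same_typed if z != x)
    from_y = sum(w_path_count(instances, topics, y, z) for z in same_typed if z != y)
    denom = from_x + from_y
    return 2.0 * w_path_count(instances, topics, x, y) / denom if denom else 0.0


# Hypothetical data: all papers appear in venue v1; a1 and a2 write on data mining, a3 on AI.
writes = {"a1": ["p1", "p2"], "a2": ["p3"], "a3": ["p4", "p5"]}
topics = {"p1": [0.8, 0.1, 0.1], "p2": [0.7, 0.2, 0.1], "p3": [0.75, 0.15, 0.1],
          "p4": [0.1, 0.1, 0.8], "p5": [0.05, 0.15, 0.8]}
# All A-P-V-P-A path instances between distinct authors.
instances = [(x, px, "v1", py, y)
             for x in writes for y in writes if x != y
             for px in writes[x] for py in writes[y]]

plain_12 = sum(1 for p in instances if p[0] == "a1" and p[-1] == "a2")
plain_13 = sum(1 for p in instances if p[0] == "a1" and p[-1] == "a3")
print(plain_12, plain_13)
print(w_path_sim(instances, topics, "a1", "a2") > w_path_sim(instances, topics, "a1", "a3"))
```

With plain counting, a1 has more path instances to a3 than to a2; once each instance is weighted by the topic similarity of its papers, a1 comes out closer to a2, which matches the intuition described for Figure 3.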

Meta-path-based traversal via BFS for information network discovering
Definition 4 (Commuting matrix $M_P$ (Sun et al., 2011)). For a specific meta-path $P = (A_1, A_2, \ldots, A_{l+1})$, with $l$ being the meta-path's length, the commuting matrix $M_P$ is defined as $M_P = W_{A_1 A_2} W_{A_2 A_3} \cdots W_{A_l A_{l+1}}$, where $W_{A_i A_j}$ is the adjacency matrix between object types $A_i$ and $A_j$, and $M_P(i, j)$ is the number of path instances between the two root objects $x_i \in A_1$ and $y_j \in A_{l+1}$.
Obtaining the number of path instances between two same-typed nodes in a given information network is quite complex in the heterogeneous context of multi-typed objects and relationships. When dealing with a very large network, we cannot know in advance how many relation types each object has, nor the types of the target objects linked to the object currently being investigated. Therefore, we need to use the meta-path to guide the system along specific routes only. In this section, we define a meta-path-based traversal that applies the BFS strategy to compute the number of path instances between two same-typed nodes and to construct the commuting matrix (Sun et al., 2011) (see Definition 4).
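Definition 4 can be illustrated by multiplying the adjacency matrices along a meta-path. The following is a small pure-Python sketch with hypothetical toy matrices; the `path_sim` helper uses the standard PathSim normalisation $2 M_P(i, j) / (M_P(i, i) + M_P(j, j))$ of Sun et al. (2011) and is included only for comparison with the unweighted baseline.

```python
def mat_mul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def commuting_matrix(adjacency_chain):
    """M_P = W_{A1 A2} W_{A2 A3} ... for the adjacency matrices along the meta-path."""
    result = adjacency_chain[0]
    for w in adjacency_chain[1:]:
        result = mat_mul(result, w)
    return result


def path_sim(m, i, j):
    """Unweighted PathSim score from a commuting matrix of a symmetric meta-path."""
    denom = m[i][i] + m[j][j]
    return 2 * m[i][j] / denom if denom else 0.0


# Hypothetical toy network: 3 authors, 5 papers, 1 venue.
w_ap = [[1, 1, 0, 0, 0],   # a1 wrote p1, p2
        [0, 0, 1, 0, 0],   # a2 wrote p3
        [0, 0, 0, 1, 1]]   # a3 wrote p4, p5
w_pv = [[1], [1], [1], [1], [1]]  # every paper appears in venue v1

# Meta-path A-P-V-P-A: the return half reuses the transposed matrices.
m_apvpa = commuting_matrix([w_ap, w_pv, transpose(w_pv), transpose(w_ap)])
print(m_apvpa)
print(path_sim(m_apvpa, 0, 1), path_sim(m_apvpa, 0, 2))
```

Here the unweighted score ranks the third author above the second for the first author purely because of the larger number of path instances, which is the limitation that topic weighting is meant to address.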
Overall processes are described in Algorithms 1 and 2.
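The meta-path-guided BFS traversal can be sketched in Python as follows. This is an illustrative reading of the idea, not a transcription of Algorithms 1 and 2: the function names, the undirected adjacency lists and the toy network are assumptions, and the sketch presumes a symmetric meta-path so that rows and columns index the same candidate set.

```python
from collections import defaultdict, deque


def build_neighbors(edges):
    """Undirected adjacency lists from a list of (u, v) links."""
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    return neighbors


def commuting_matrix_bfs(neighbors, node_type, meta_path):
    """Count path instances between same-typed root objects by type-guided BFS."""
    candidates = sorted(n for n, t in node_type.items() if t == meta_path[0])
    index = {n: i for i, n in enumerate(candidates)}
    m = [[0] * len(candidates) for _ in candidates]
    for start in candidates:
        queue = deque([(start, 0)])  # (current node, position in the meta-path)
        while queue:
            node, depth = queue.popleft()
            if depth == len(meta_path) - 1:
                if node in index:  # reached the end of the meta-path
                    m[index[start]][index[node]] += 1
                continue
            # Only follow links whose target has the next type in the meta-path.
            for nxt in neighbors.get(node, ()):
                if node_type[nxt] == meta_path[depth + 1]:
                    queue.append((nxt, depth + 1))
    return m


# Hypothetical toy network: A = author, P = paper, V = venue.
node_type = {"a1": "A", "a2": "A", "a3": "A",
             "p1": "P", "p2": "P", "p3": "P", "p4": "P", "p5": "P", "v1": "V"}
edges = [("a1", "p1"), ("a1", "p2"), ("a2", "p3"), ("a3", "p4"), ("a3", "p5"),
         ("p1", "v1"), ("p2", "v1"), ("p3", "v1"), ("p4", "v1"), ("p5", "v1")]

m_bfs = commuting_matrix_bfs(build_neighbors(edges), node_type, ["A", "P", "V", "P", "A"])
print(m_bfs)
```

The traversal never expands a neighbour whose type does not match the next position of the meta-path, so the search stays on the routes the meta-path prescribes.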
Algorithm 1. Pseudo code for computing the number of path instances of a specific object in a given heterogeneous network via BFS following defined meta-path (P).
Output: the commuting matrix (M_P).
Function Constructing_M_P_BFS(G, P):
    Create: candidates = list_node_by_type(G, f(A_1))
    Create: M_P = [|candidates|][|candidates|]
    For candidate in candidates:
        Set: next = candidate
        MetaPath_BFS_Traversal(G, P, M_P, candidate, next)
    End for
    Return M_P
End function

DW-PathSim: the distributed computing framework for W-PathSim in Spark and the GraphFrames environment
It is well known that most real-world information networks are very complicated in structure as well as huge in size: more than 1.8 billion linked websites on the WWW, approximately 500 million users and companies in the LinkedIn network, and over 1.74 billion users along with countless relationships in the Facebook social network. Therefore, most traditional standalone processing methods are unsuitable for knowledge discovery from such large-scale networks. To overcome the challenges of big networked data, distributed parallel computing approaches are taken into consideration. One trusted solution for handling scalable distributed graph-based data structures is Apache Spark, first developed by Matei Zaharia at the AMPLab of the University of California, Berkeley. Two Spark components support graph-based distributed computation: GraphX and GraphFrames. Both are designed for distributed storage and efficient graph processing on Spark's distributed data abstractions (resilient distributed datasets (RDDs) for GraphX, DataFrames for GraphFrames), which makes them suitable for scalable networked data computation. In this section, we describe how we use Spark's graph-based distributed computing components for meta-path traversal via the motif finding mechanism, which helps us to quickly discover structural patterns of a given heterogeneous network following defined meta-paths.
A motif is defined as a sequence of nodes and edges within the network $G = (V, E)$, expressed as a tuple that follows a specific structural pattern. The general form of a motif is shown in Equation (5):
$$motif = (s) - [e] \rightarrow (t) \tag{5}$$
where
• $s$ and $t$ are the source and target nodes of the sequence, respectively, including constraints on the attributes of the source and target nodes.
• $e$ represents the direction of, as well as constraints on the attributes of, the relationship between the source ($s$) and target ($t$) nodes.
To represent a longer sequential relation, such as a meta-path, we can combine multiple motifs to meet the search requirement. In this paper, we define a method for constructing the motif query pattern that finds all paths between two same-typed nodes $x$ and $y$, depending on the structure of a specific meta-path $P$ of length $l$, denoted as $motif_P(x, y)$, following the general formulation in Equation (6):
$$motif_P(x, y) = \sigma\big( (A_1) - [R_1] \rightarrow (A_2);\ (A_2) - [R_2] \rightarrow (A_3);\ \ldots;\ (A_l) - [R_l] \rightarrow (A_{l+1}) \big) \tag{6}$$
where
• $A$ and $R$ denote the object type and the relationship between two linked objects, respectively.
• $\sigma$ is the constraint function that restricts the attributes of links and objects.
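The construction of a chained motif pattern from a meta-path can be sketched as a small string builder. This Python sketch is hypothetical and is not the authors' Scala code: the vertex attribute name `type`, the placeholder names `n0`, `e0`, ... and the assumption that every link is stored in the direction of traversal are all illustrative choices. The produced pattern follows the GraphFrames motif syntax `(a)-[e]->(b); (b)-[e2]->(c)`.

```python
def build_motif(meta_path):
    """Return (motif, type_filter) strings for a meta-path given as a list of object types."""
    length = len(meta_path) - 1  # l relations connect l + 1 object types
    tuples = [f"(n{i})-[e{i}]->(n{i + 1})" for i in range(length)]
    motif = "; ".join(tuples)
    # Type constraints on the matched vertices, assuming a vertex column named `type`.
    type_filter = " AND ".join(f"n{i}.type = '{t}'" for i, t in enumerate(meta_path))
    return motif, type_filter


motif, type_filter = build_motif(["author", "paper", "venue", "paper", "author"])
print(motif)
print(type_filter)
```

With a GraphFrame `g`, such strings would typically be used as `g.find(motif).filter(type_filter)`; further constraints, for example fixing the first and last vertices to a given pair of objects, could be appended to the filter in the same way.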
The number of tuples used to compose a specific meta-path is equal to the length $l$ of that meta-path. For example, the meta-path A-P-V-P-A (author -(write)- paper -(submit)- venue -(submit)- paper -(write)- author, l = 4) is composed of four tuples.
Algorithm 3. Pseudo code for the overall processes of constructing the commuting matrix via Spark GraphFrames motif finding, following the meta-path (P) in the DW-PathSim model.
Output: the commuting matrix (M_P).
    ...
    Mapping: GraphFrame_G ← G
    For i in range(0, length(candidates)):
        For j in range(0, length(candidates)):
            Define: the motif for (candidates[i], candidates[j]) (Equation (6))
            Set: paths = GraphFrame_G.find(motif)
            Update: M_P[i][j] = count(paths)
        End for
    End for
    Return M_P
End function
In the beginning, the system generates the set of same-typed candidate nodes and creates a commuting matrix M_P whose size equals the number of candidates. After that, the information network is mapped to a GraphFrame. Based on the predefined meta-path (P), the motifs are created with the structure given in Equation (6). Finally, the motifs are matched against the GraphFrame to find all matching path instances, and the number of paths is written to the commuting matrix. We use the Scala programming language with GraphFrames as follows: gf

Experimental studies and discussions
In this section, we present experimental studies that evaluate the effectiveness of our proposed models. For the proposed W-PathSim model, which combines meta-path-based and topic-driven attribute-based similarity measurement, we compare its accuracy with that of previous HoIN-based and HIN-based similarity measurement approaches. We also compare the execution time of the DW-PathSim model, implemented in the distributed computing environment of Apache Spark, with that of the standalone W-PathSim model to demonstrate the advantages of DW-PathSim in handling large-scale information networks.

Experimental dataset and evaluation metric usage
In this paper, we use the real-world DBLP bibliographic network as the main dataset for all experiments. The DBLP bibliographic network contains over 2M authors, 4.1M papers and over 5.3K venues/journals. For the text content of published papers in DBLP, we use the abstract text dataset provided by Aminer. The Aminer corpus contains over 1.5M abstracts of published DBLP papers. From these 1.5M abstracts, we randomly selected 100K papers as the main corpus for training the LDA topic model. Using LDA, we extract 10 latent topics and 20 keywords for each topic from that text corpus. To evaluate the accuracy of the outputs, we use the nDCG (normalized Discounted Cumulative Gain) metric (Järvelin & Kekäläinen, 2002). Each output result is evaluated manually and assigned a relevance score from 0 to 3 (higher is better), where 0 means the object is not relevant to the target object in the user's query, 1 means it is somewhat relevant, 2 means it is closely relevant and 3 means it is highly relevant.
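The nDCG computation over graded relevance scores can be sketched as follows. This is a minimal Python sketch that assumes the commonly used discount $rel_i / \log_2(i + 1)$ and an ideal ranking derived from the judged list itself; the exact variant used in the experiments is not specified here, and the example judgements are invented.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with the log2(rank + 1) discount (assumed variant)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    """nDCG of a ranked list of graded relevance scores (0 to 3), optionally cut at k."""
    ranked = relevances[:k] if k else relevances
    ideal = sorted(relevances, reverse=True)[:len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg else 0.0


# Hypothetical relevance judgements for a top-5 result list.
judged = [3, 2, 3, 0, 1]
print(round(ndcg(judged), 4))
print(ndcg([3, 3, 2, 1, 0]))
```

A list that is already sorted by relevance yields a score of 1.0, and any misordering lowers the score.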

Experimental set-up, results and discussions
Model accuracy evaluation. In order to evaluate the accuracy of the W-PathSim model, we compare its output with that of HoIN-based models, such as Personalized PageRank (PPR) and SimRank, and with the HIN-based PathSim model. Compared with the HoIN-based approaches, W-PathSim achieves better accuracy: over 14.09% higher than PPR and approximately 13.71% higher than SimRank, for both top-k author and top-k venue similarity search. The results also show that W-PathSim achieves slightly higher accuracy than the traditional HIN-based PathSim model (2.23% for top-k author search and 1.85% for top-k venue search).
Standalone vs. distributed environment evaluation. To evaluate the effectiveness of graph-based distributed computing for the similarity measurement task in a large-scale network, we implemented the DW-PathSim model, which combines the Spark-based distributed processing environment with the W-PathSim model using the motif-finding mechanism, and ran it on different scales of the experimental dataset (increasing gradually in the number of nodes). The overall execution times (in seconds) are logged for comparison with the traditional standalone W-PathSim approach. The details of the infrastructure set-up for both the W-PathSim and DW-PathSim models are described in Table 10.
To test the performance of the standalone and distributed mechanisms of similarity search in a given network, we split the dataset into 10 parts by size (number of nodes), from 10% to 100% (Figures 6 and 7). Both mechanisms are implemented and executed on the same dataset size, and the overall completion times (in seconds) of each mechanism at each dataset size are collected and compared. The experimental results for meta-path A-P-V-P-A are shown in Table 11 and those for meta-path A-P-A in Table 12.
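The scaling protocol above can be sketched as a small timing harness. This is only an illustration: the paper does not say how the 10% to 100% parts are drawn, so the code assumes nested prefixes of a node list, and `time_on_growing_subsets` and `toy_pairwise_workload` are hypothetical names standing in for the actual similarity search.

```python
import time

def time_on_growing_subsets(nodes, run, steps=10):
    """Run `run` on 10%, 20%, ..., 100% of `nodes` and log wall-clock seconds."""
    results = []
    for step in range(1, steps + 1):
        size = len(nodes) * step // steps
        subset = nodes[:size]  # assumption: nested prefixes of the node list
        start = time.perf_counter()
        run(subset)
        elapsed = time.perf_counter() - start
        results.append((100 * step // steps, size, elapsed))
    return results

def toy_pairwise_workload(subset):
    """Placeholder workload: count all unordered node pairs."""
    return sum(1 for i in range(len(subset)) for j in range(i + 1, len(subset)))

timings = time_on_growing_subsets(list(range(200)), toy_pairwise_workload)
```

Each entry of `timings` is a (percentage, number of nodes, seconds) triple, which is the shape of data needed to compare two mechanisms size by size.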
The experimental results show that the distributed mechanism of DW-PathSim achieves better time performance than the standalone W-PathSim model. For large-scale networks, at around 80% of the experimental dataset size, the DW-PathSim model takes less time than the W-PathSim model to complete the meta-path-based traversal and the construction of the commuting matrix.
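To make the commuting-matrix step concrete, the following sketch builds the commuting matrix for the meta-path A-P-A on a toy network and evaluates the standard unweighted PathSim score, s(x, y) = 2·M[x][y] / (M[x][x] + M[y][y]). It is not the topic-weighted W-PathSim measure and it does not use Spark; the toy author-paper matrix and the helper names are assumptions for illustration only.

```python
def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def pathsim(commuting, x, y):
    """Unweighted PathSim between objects x and y for a symmetric meta-path."""
    denom = commuting[x][x] + commuting[y][y]
    return 2 * commuting[x][y] / denom if denom else 0.0

# Toy author-paper adjacency matrix: rows are authors, columns are papers.
author_paper = [
    [1, 1, 0],  # author 0 wrote papers 0 and 1
    [1, 1, 0],  # author 1 wrote papers 0 and 1
    [0, 1, 1],  # author 2 wrote papers 1 and 2
]

# Commuting matrix for A-P-A: entry [x][y] counts path instances between x and y.
apa = matmul(author_paper, transpose(author_paper))

same_papers = pathsim(apa, 0, 1)  # authors with identical paper sets
one_shared = pathsim(apa, 0, 2)   # authors sharing one of two papers each
```

In this toy case the two authors with identical paper sets score 1.0 and the pair sharing one paper scores 0.5. The cost of the matrix product grows quickly with the number of nodes, which is the part the distributed implementation targets.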

Table (number and model column headers not preserved): overall execution time in seconds by dataset size (% of nodes).
Dataset size (%)    Time (s)    Time (s)
10                  78          198
20                  246         567
30                  426         786
40                  652         1123
50                  874         1326
60                  1427        1572
70                  2178        1927
80                  2982        2362
90                  3678        2872
100                 3927        3121

Conclusion and future works
In this paper, we present two novel models, W-PathSim and DW-PathSim, which combine topic attribute-based and meta-path-based similarity searching in HINs. In the context of heterogeneous network mining, the proposed W-PathSim model evaluates not only the semantic relationships between linked objects along defined meta-paths but also their topic relevance, which depends on the attributes extracted from content-based objects via the LDA topic model. In the experimental studies, the proposed W-PathSim model has shown that it can improve similarity measurement in HINs. We also extend the W-PathSim model to the handling of large-scale networked data; the extended model is called DW-PathSim. The DW-PathSim model is designed to work in a Spark-based distributed computing environment that is capable of handling large-scale HINs. In the experiments on large-scale networks, DW-PathSim outperforms the standalone W-PathSim model in execution time.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCMC) under the grant number B2017-26-02.

Notes on contributors
Phuc Do is currently an Associate Professor at the Faculty of Information Systems, University of Information Technology (UIT), VNU-HCM, Vietnam. His research interests include data mining, knowledge-based systems, web information exploration, and big data analysis and applications.
Phu Pham is currently a Ph.D. student at the Faculty of Information Systems, University of Information Technology (UIT), VNU-HCM, Vietnam. His research interests include data mining, information network mining, big data analysis, distributed and parallel computing and business information systems.