Smart interactive search for Vietnamese disease by using data mining-based ontology

ABSTRACT In disease information retrieval, users usually know only a few basic facts about a disease, such as headache and high fever symptoms. Most search engines rely on keyword search and do not return exact results because they only literally match records against the keywords of the query. In this work, we propose a novel interactive search for disease information by using data mining-based ontology. In particular, a human disease ontology is used as a knowledge base to semantically recognize the input for the search engine. Association rule mining is applied to generate associated relations among keywords (including symptoms, derives from, located in …) for interactive query refinement. A Bayesian-based ranking algorithm is also proposed to arrange the search results. Prospectively, our approach is valuable not only for increasing the accuracy of human disease search, but also as a significant approach in data mining ontology.


Introduction
The availability of treatment and healthcare information on the Internet has helped people increase their knowledge about disease prevention and even treatment for themselves and others. As a consequence, research institutions, clinics and hospitals have successively provided medical knowledge online. Currently, users look up a disease hierarchically by index and must read a considerable amount of information to find the desired reference. Because a vast majority of information sources are unverified, Internet users without adequate medical expertise tend to apply treatments for other, familiar diseases to their own cases, which can lead to mistreatment.
Most search engines rely on keyword search and do not return exact results because they only literally match records against the keywords of the query. For example, a user may not know of 'Hemorrhagic fever' but only its symptoms, such as headache and high fever. These keywords are then entered as a query into the search box. […] named entity recognition tool that is specifically trained for biomedical text. The MetaMap tool parses the surgical pathology reports of the repository and produces a set of UMLS concepts. Queries perform a free text search based on concept identifiers and return relevant bio-specimens via the concept mapping location information. The limitation of this search engine is that the non-natural language patterns of text in the repository might affect how the text is parsed, how noun phrases are identified and how concepts are mapped to text segments. Furthermore, it is difficult to experiment with different mapping tools since doing so frequently requires a database schema modification.2 Another concept-based semantic search engine in the medical domain is that of Koopman et al. (Koopman, Bruza, Sitbon, & Lawley, 2012). Its data source is composed of the SNOMED-CT (see Note 2) ontology and a real-world collection of medical records. To extract concepts, it uses MetaMap to transform the original term-based documents into UMLS concepts as defined by the SNOMED-CT ontology. In some cases, the UMLS concepts have no equivalent definition in SNOMED-CT.

Disease ontology
Scientists have studied and documented the variation of human health to unravel the mystery of human disease (Schriml et al., 2012). Data from multiple sources are unified and integrated into the Disease Ontology (DO) on the semantic web.3 DO facilitates the evaluation of diagnosis and treatment and the comparison of data between studies based on semantic annotations of diseases. The initial version of DO was created in 2003 and 2004 using ICD-9 as the foundation for the DO vocabulary. Later versions were improved and reorganized based on UMLS disease concepts, including the term concepts of SNOMED-CT and ICD-9 (Schriml et al., 2012). DO has since become a community-driven, open and extensible framework using semantic relationships, and it is continuously being improved and extended with new terms. It currently has a single structure for disease classification and provides a clear definition for each disease. However, the relationships between diseases and their causes, symptoms and signs have not yet been systematically built in this ontology. These relationships will be exploited to create the interactivity between the user and the disease in the new semantic search engine.
An ontology-based search engine can infer and provide users with more relevant results (Stojanovic, Studer, & Stojanovic, 2014). It interprets the keyword queries entered by the user and translates them into semantic queries to access the ontology repository. The search engine proposed in this paper semantically translates search queries and interacts with users to capture their intent. In addition, association rules (Neesha & Chander, 2014) are used to infer a list of questions for the user. The answers then iteratively help the search engine understand the user's intent and narrow down the search scope.

Data mining-based ontology for interactive search on disease domain
We propose a new interactive search engine based on the DO. It infers concepts from a query and suggests the inferred concepts to the user, which makes the search interactive. The disease database is extracted from DO by using a semantic query language. Association rules based on the Apriori algorithm exploit the disease data to support interaction between users and the system. A Bayesian-based ranking algorithm is also provided to arrange the search results. Figure 1 shows the system architecture.
DO is a hierarchy of diseases. Each disease may have multiple levels of subclasses or superclasses, and every subclass or superclass is itself a disease. A disease may also have no subclass. In Figure 2, 'Bệnh truyền nhiễm' (disease by infectious agent) has multiple levels of subclasses. 'Bệnh nhiễm khuẩn do nấm' (fungal infectious disease) is one of its subclasses; equivalently, 'Bệnh truyền nhiễm' is a superclass of the fungal infectious disease. The category also has further subclass levels such as 'Bệnh nấm da' (cutaneous mycosis) and 'Bệnh nấm ngoài da' (dermatophytosis). 'Bệnh nấm da' and 'Bệnh nấm ngoài da' are at the same level and are also subclasses of 'Bệnh truyền nhiễm'. Finally, 'Bệnh nấm da chân tay' (basidiobolomycosis) is a disease that has no subclass.
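As a minimal sketch of this hierarchy, the subclass relation can be held as a child-to-parent map and walked upward to list every superclass of a disease. The direct parent links below are assumptions made for illustration (the text only states which diseases are subclasses at some level), and the function names are hypothetical.

```python
# Illustrative child -> parent links; the exact direct parents are assumed,
# since the text only says these diseases sit below 'Bệnh truyền nhiễm'.
ROOT = "Bệnh truyền nhiễm"                 # disease by infectious agent
FUNGAL = "Bệnh nhiễm khuẩn do nấm"         # fungal infectious disease
CUTANEOUS = "Bệnh nấm da"                  # cutaneous mycosis
DERMATOPHYTOSIS = "Bệnh nấm ngoài da"      # dermatophytosis
LEAF = "Bệnh nấm da chân tay"              # basidiobolomycosis (no subclass)

PARENT = {
    FUNGAL: ROOT,
    CUTANEOUS: FUNGAL,
    DERMATOPHYTOSIS: FUNGAL,
    LEAF: CUTANEOUS,  # assumed placement, for illustration only
}


def superclasses(disease, parent=PARENT):
    """Return all superclasses of a disease, nearest first."""
    chain = []
    while disease in parent:
        disease = parent[disease]
        chain.append(disease)
    return chain


def subclasses(disease, parent=PARENT):
    """Return the direct subclasses of a disease (empty for a leaf)."""
    return sorted(child for child, p in parent.items() if p == disease)
```

With this map, `superclasses(CUTANEOUS)` walks up through the fungal infectious disease to the root, while `subclasses(LEAF)` is empty because the leaf disease has no subclass.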

DO extraction
The semantic search engine is a mechanism that queries an ontology repository, so it must use the syntax of semantic queries to access the ontology. Users tend to search with natural language queries rather than semantic queries, because a semantic query requires complicated knowledge of the ontology and the repository schema (Reichert, Linckels, Meinel, & Engel, 2005). Several complicated mechanisms have recently been proposed for converting natural language queries into semantic ones (Cimiano, Haase, & Heizmann, 2007; Lopez, Pasin, & Motta, 2005). For example, the QUICK search engine (Zenz, Enrico, Wolf, & Wolfgang, 2012) generates a complete semantic query space, that is, the set of all possible semantic queries for a given set of keywords. It first identifies all possible query patterns, called query templates, and then generates semantic queries by binding keywords to the templates. The SemSearch engine (Lei et al., 2006) uses another approach: instances, concepts and properties are determined from the keyword queries and then combined to produce appropriate formal queries. This is time-consuming when the natural language query is complex, since SemSearch may create an explosive number of semantic entity match combinations. To solve this problem, we extract all definitions, properties and relationships from DO and load them into a MySQL database. Accessing this database with MySQL queries is faster and more flexible for our semantic search engine.
The Jena framework is used to extract data from DO. Jena is a semantic web framework; it provides a programmatic environment for RDF, RDFS and OWL, which are data representation models for ontologies.4 The SPARQL language is used to retrieve and manipulate data stored in RDF and OWL format. SPARQL allows a query to consist of triple patterns, conjunctions, disjunctions and optional patterns.5 The process of disease factor extraction is illustrated in Figure 4.
Secondly, all disease properties, or disease factors, including symptom, cause and location are derived. In the DO, these disease factors appear in the disease definition, so we first create a table 'DiseaseDefinition' and extract its records from DO. The table 'DiseaseDefinition' includes 'DiseaseId' and 'DiseaseDef', which is the definition of the disease identified by 'DiseaseId'. In the disease definition, the label of a symptom is 'có_triệu_chứng' (has_symptom), the label of a cause is 'vật_gây_bệnh' (literally 'disease-causing agent') and the label of a location is 'lưu_trú_tại' (located_in) (see Figure 3). Subsequently, we extract all disease factor values from all disease definitions and insert them into the disease factor tables 'Symptom', 'Cause' and 'Location'. Each disease factor table contains 'FactorId' and 'FactorValue', which is the value of the factor identified by 'FactorId' (Figure 5).
Thirdly, the Id of each record in a disease factor table and the Id of the disease containing that disease factor are inserted into the table 'Factor-Disease'. Thus, the table 'Factor-Disease' includes 'DiseaseId' and 'FactorId' (Figure 6).
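The factor extraction in the second step can be sketched as a label scan over a definition string. The three labels come from the text; the layout of the definition (each label followed by its value, with values separated by commas) and the example definition are assumptions, because the actual format is only shown in Figure 3. The function name is hypothetical.

```python
import re

# Labels named in the text, mapped to the factor table that receives the value.
LABELS = {
    "có_triệu_chứng": "Symptom",   # has_symptom
    "vật_gây_bệnh": "Cause",       # cause label
    "lưu_trú_tại": "Location",     # located_in
}

# Assumption: a value runs from its label up to the next comma or full stop.
_PATTERN = re.compile(r"(%s)\s+([^,.]+)" % "|".join(map(re.escape, LABELS)))


def extract_factors(definition):
    """Group the factor values found in one disease definition by table name."""
    factors = {table: [] for table in LABELS.values()}
    for label, value in _PATTERN.findall(definition):
        factors[LABELS[label]].append(value.strip())
    return factors


# Hypothetical definition written only to exercise the parser.
EXAMPLE_DEFINITION = (
    "Bệnh X lưu_trú_tại da, vật_gây_bệnh virus Y, "
    "có_triệu_chứng sốt, có_triệu_chứng đau đầu."
)
EXAMPLE_FACTORS = extract_factors(EXAMPLE_DEFINITION)
```

Each extracted value would then receive a 'FactorId' in its factor table, and the pair ('DiseaseId', 'FactorId') would go into 'Factor-Disease' as described in the third step.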

Retrieval algorithm
A user's query may contain one or more disease factors, and the search returns the diseases matching these factors. The retrieval algorithm joins multiple tables to obtain enough fields for the matching. The system also creates full-text indexes on several fields of these tables to increase search speed. Search results are ranked by the support and confidence of the association rule algorithm and by the probability of the Bayesian algorithm.

Hint retrieval
This procedure is performed while users enter a query. Each press of the space key calls a function that accesses the 'DiseaseFactor' tables to suggest similar disease factors. The resulting list of hints is shown in the drop-down list of the search box. At the same time, the hints are stored in the cache for later retrievals.

Disease retrieval
Once users have completed the query and clicked the search button, the system executes the query against the disease database to retrieve the matching results. The query is performed on multiple joined tables such as 'DiseaseIndex', 'Factor-Disease', 'Disease-Factor', 'DiseaseHierarchy' and 'DiseaseClass'. The probability of each result is then calculated with the Bayesian algorithm, and the results are sorted and displayed in descending order of probability.
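The join-based matching can be sketched with an in-memory SQLite database standing in for the MySQL database of the paper. The single 'DiseaseFactor' table, the sample rows and the rule that a disease must contain every queried factor are assumptions for illustration; `find_diseases` is a hypothetical helper.

```python
import sqlite3


def build_demo_db():
    """Create a tiny stand-in for the factor tables with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE DiseaseFactor (FactorId INTEGER PRIMARY KEY, FactorValue TEXT);
        CREATE TABLE "Factor-Disease" (DiseaseId TEXT, FactorId INTEGER);
    """)
    conn.executemany("INSERT INTO DiseaseFactor VALUES (?, ?)",
                     [(1, "fever"), (2, "headache"), (3, "rash")])
    conn.executemany('INSERT INTO "Factor-Disease" VALUES (?, ?)',
                     [("D1", 1), ("D1", 2), ("D2", 1), ("D2", 3), ("D3", 2)])
    return conn


def find_diseases(conn, factors):
    """Return the ids of diseases that contain every factor in the query."""
    factors = sorted(set(factors))
    if not factors:
        return []
    placeholders = ", ".join("?" for _ in factors)
    sql = f"""
        SELECT fd.DiseaseId
        FROM "Factor-Disease" AS fd
        JOIN DiseaseFactor AS f ON f.FactorId = fd.FactorId
        WHERE f.FactorValue IN ({placeholders})
        GROUP BY fd.DiseaseId
        HAVING COUNT(DISTINCT f.FactorValue) = ?
        ORDER BY fd.DiseaseId
    """
    return [row[0] for row in conn.execute(sql, (*factors, len(factors)))]
```

The `HAVING` clause keeps only diseases that matched all queried factors; relaxing it to a lower count would return partial matches instead, which is a design choice the text does not settle.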

Disease factor recommendation retrieval
In addition to retrieving the diseases matching the query, the system accesses the 'DiseaseRule' table to obtain the disease factors related to the query and recommends them to the user. The 'DiseaseRule' table is generated beforehand by mining the disease database with the Apriori algorithm. Disease factor recommendation retrieval gets the rules in this table whose 'LeftSide' field contains the user's query and recommends their 'RightSide' field to the user. However, the order of disease factors in the 'LeftSide' field of the 'DiseaseRule' table may differ from the order of disease factors in the user's query. Therefore, to match the 'LeftSide' field with the query more easily, we extract the disease factors of each 'LeftSide' field and insert them into the 'LeftRule' table. The 'LeftRule' table includes 'RuleId', which is the id of the rule that the 'LeftSide' belongs to, and 'FactorValue', which is a disease factor in that 'LeftSide'. Confidence and support are also calculated for each rule, and the ranking of recommended disease factors is based on these values.
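A minimal sketch of the order-independent matching follows. Splitting 'LeftSide' on '&&' into a set plays the role of the 'LeftRule' table. The sample rules are made up, `recommend` is a hypothetical name, and reading "contains the user's query" as "every queried factor appears in 'LeftSide'" is one interpretation of the text.

```python
def recommend(rules, query_factors):
    """Rank 'RightSide' factors of rules whose 'LeftSide' contains the query.

    rules: iterable of (RuleId, LeftSide, RightSide, Confidence, Support),
    with the factors of LeftSide separated by '&&'.
    """
    query = frozenset(query_factors)
    matches = []
    for _rule_id, left_side, right_side, confidence, support in rules:
        # Comparing sets makes the match independent of factor order.
        left = frozenset(part.strip() for part in left_side.split("&&"))
        if query <= left and right_side not in query:
            matches.append((confidence, support, right_side))
    # Higher confidence first, then higher support, then name for stability.
    matches.sort(key=lambda m: (-m[0], -m[1], m[2]))
    ranked, seen = [], set()
    for _, _, factor in matches:
        if factor not in seen:
            seen.add(factor)
            ranked.append(factor)
    return ranked


# Made-up rows in the shape of the 'DiseaseRule' table.
DEMO_RULES = [
    (1, "fever&&headache", "nausea", 0.8, 0.10),
    (2, "headache&&fever", "rash", 0.9, 0.05),
    (3, "cough", "fever", 0.7, 0.20),
]
```

Here `recommend(DEMO_RULES, ["headache", "fever"])` matches rules 1 and 2 even though the factor order differs, and ranks 'rash' ahead of 'nausea' because of its higher confidence.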

Hint suggestion
Hint suggestion increases the interactivity between the users and the system.6 When users type a keyword into the search box, the query autocomplete function prompts them with similar queries to choose from. This feature helps users select a query that matches their intent more precisely, which makes the results more relevant. It not only helps them fill out the query faster by selecting it from the hints but also gives spelling suggestions. When users remember only the first letters of a keyword and enter them into the search box, the system suggests the keywords that contain these letters, and the users can choose the most suitable keyword from the suggestions. Query autocomplete is equally useful when users remember only a part of the query. In addition, the hints not only help users choose the correct queries for their intent but also ensure that the queries belong to the dictionary of the system. This gives high confidence in the query, so the system returns more accurate results. Figure 7 shows the process of getting hint suggestions from the system back-end and returning them to the user. First, the user enters the keyword into the search box. After each key press event, the system takes the query in the search box and checks whether the cache contains it. If so, the system gets the list of hint suggestions for this query from the cache and suggests them to the user. If the query is not in the cache, the system queries the table 'DiseaseFactor' of the database to select the records containing the query and saves them with the query in the cache. These records are then returned to the user as hint suggestions.
Moreover, in each hint suggestion, the query is highlighted to help users recognize more easily the hints that are closest to the query.
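The cache-then-database flow described above can be sketched as follows. A plain list stands in for the 'DiseaseFactor' table, substring matching stands in for the database query, and the class name, the `lookups` counter and the `<b>` highlight tags are all assumptions for illustration.

```python
class HintSuggester:
    """Suggest disease factors containing the query, caching each answer."""

    def __init__(self, factor_values):
        self._factors = list(factor_values)  # stand-in for 'DiseaseFactor'
        self._cache = {}
        self.lookups = 0  # number of simulated database lookups

    def suggest(self, query):
        key = query.strip().lower()
        if not key:
            return []
        if key not in self._cache:
            # Cache miss: "query the database" and remember the result.
            self.lookups += 1
            self._cache[key] = [f for f in self._factors if key in f.lower()]
        return self._cache[key]


def highlight(hint, query, open_tag="<b>", close_tag="</b>"):
    """Wrap the first occurrence of the query inside the hint with tags."""
    needle = query.strip().lower()
    start = hint.lower().find(needle)
    if not needle or start < 0:
        return hint
    end = start + len(needle)
    return hint[:start] + open_tag + hint[start:end] + close_tag + hint[end:]
```

Repeating the same query is served from the cache, so `lookups` stays at one; a production cache would also need an eviction policy, which the text does not describe.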

Disease rule extraction
The diseases in the disease database contain disease factors. Each disease factor is a property of a disease, such as a symptom, a cause or a location. In this database, we can identify the factors of a disease, but the relationships between these factors are still unknown. These relationships are also important for diagnosis and for finding the cause of a disease. For example, a user with a very painful bump on his back does not know the cause of this symptom and wants to find it. Based on the relationships of this bump symptom, we can determine that it was caused by a spider or a particular ant appearing nearby. As another example, a user has a fever and a headache but does not know exactly what illness he has. The relationship between these two symptoms can reveal other related symptoms that the user may have, and together these symptoms help him identify his disease more easily with this semantic search engine. To exploit the relationships, or association rules, between the disease factors, we rely on the frequent co-occurrence of disease factors in the disease set.
The well-known approach for mining association rules from frequent item sets is Apriori (Han & Kamber, 2006). This algorithm is often used to analyse market basket databases. This kind of database stores a set of transactions, each containing many items. The items in a transaction are the products that a customer bought at the same time. In our context, the items are the disease factors and a transaction is a disease, which contains several disease factors.

Apriori algorithm
Applying the Apriori algorithm to DO generates a set of association rules (Figure 8). A rule A ⇒ B in this set means that A is the set of disease factors in the query the user enters into the search engine and B is the disease factor inferred from A. The disease factors in B are recommended to the user following his query. These association rules are stored in the table 'DiseaseRule' (Figure 9). This table includes 'RuleId', 'LeftSide' (the disease factors in A, separated by the '&&' notation), 'RightSide' (the disease factor in B), 'Confidence' (the confidence of the rule) and 'Support' (the support of the rule).
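A compact sketch of Apriori over diseases-as-transactions is given below. It follows the textbook level-wise scheme (frequent itemsets, subset pruning, then rules with a single factor on the right side, joined with '&&' on the left as in 'DiseaseRule'). The thresholds, the sample transactions and the function name are assumptions, not the settings used in the paper.

```python
from itertools import combinations


def apriori_rules(transactions, min_support, min_confidence):
    """Return rules as (LeftSide, RightSide, confidence, support) tuples."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single factors.
    items = {item for t in transactions for item in t}
    level = [s for s in (frozenset([i]) for i in items) if support(s) >= min_support]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        level = [c for c in candidates if support(c) >= min_support]

    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for right in itemset:
            left = itemset - {right}
            confidence = supp / frequent[left]
            if confidence >= min_confidence:
                rules.append(("&&".join(sorted(left)), right, confidence, supp))
    return sorted(rules, key=lambda r: (-r[2], -r[3], r[0], r[1]))


# Made-up diseases, each listed as its set of disease factors.
DEMO_DISEASES = [
    {"fever", "headache", "rash"},
    {"fever", "headache"},
    {"fever", "cough"},
    {"headache", "nausea"},
]
```

Each returned tuple corresponds to one row of 'DiseaseRule' apart from 'RuleId'. Support here is the fraction of diseases containing all factors of the rule, and confidence is that support divided by the support of the left side.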

Bayesian ranking
The user's queries return the diseases matching the disease factors in the queries. However, the degree of specificity of the queried disease factors differs for each retrieved disease (a retrieved disease is a disease matching the queries), and the results are ranked by this degree of specificity. We calculate it based on the probability of the Bayesian algorithm (Figure 10). Bayesian probability is a quantity that is assigned to represent a state of knowledge (Jaynes, 1986). The Bayesian approach specifies some prior probability, which is then updated in the light of new, relevant data.7 The prior probabilities are the probability of a class and the probability of an attribute within a given class. In our context, the superclasses of the retrieved diseases are the classes and the disease factors in the queries are the attributes. In the returned result of our semantic search engine, diseases are grouped by their superclasses, and the superclasses are ranked by their probabilities for the set of disease factors in the queries.
Let $NC_i$ be the number of diseases belonging to disease superclass $C_i$. Let $NS_i$ be the number of diseases containing disease factor $S_i$ and belonging to disease superclass $C_i$. Let $NTS_i$ be the total number of occurrences of disease factor $S_i$, and let $NT$ be the total number of diseases. The Bayesian probability (Han & Kamber, 2006) of the disease factor $S_i$ within a given class $C_i$ is calculated by the equation:
$$P(S_i \mid C_i) = \frac{NS_i}{NC_i}$$
When the probability is calculated for multiple attributes within a given class, the Bayesian algorithm assumes that the attributes have independent distributions and therefore estimates it as follows (Han & Kamber, 2006):
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$
In the above formula, $X$ is a set of $n$ attributes. In our context, let $S$ be the set of disease factors in the queries, with $n$ disease factors. The Bayesian probability of a set of disease factors within a given class $C_i$ is calculated by the equation:
$$P(S \mid C_i) = P(S_1 \mid C_i) \cdot P(S_2 \mid C_i) \cdots P(S_n \mid C_i)$$
$P(S \mid C_i)$ is the probability of the superclass of a retrieved disease for the disease factors in the queries. To calculate $P(S \mid C_i)$ faster when the retrieved diseases are received, we extract all the superclasses of all diseases, calculate their quantities in advance and store them in the table 'DiseaseClass'. The table 'DiseaseClass' includes 'SuperClassId', which must be unique, and 'Quantity', which is the quantity of the 'SuperClassId'. The quantity of a superclass is the number of diseases belonging to this superclass. Figure 11 shows the process of extracting all superclasses of all diseases. First, we get the superclass of each disease from the 'DiseaseHierarchy' table. Then we select the distinct superclasses and count the quantity of each.
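The ranking can be sketched as a naive Bayes product per superclass. The sketch assumes that the per-factor estimate is the share of a superclass's diseases that contain the factor, and it counts class sizes up front in the way the 'DiseaseClass' table would. The sample diseases and the function name are made up for illustration.

```python
from collections import Counter
from math import prod


def rank_superclasses(diseases, query_factors):
    """Rank superclasses by the product of per-factor probabilities.

    diseases: iterable of (disease_id, superclass_id, set_of_factors).
    Returns (superclass_id, probability) pairs, highest probability first.
    """
    diseases = list(diseases)
    # Number of diseases per superclass, like 'Quantity' in 'DiseaseClass'.
    class_size = Counter(superclass for _, superclass, _ in diseases)
    scores = {}
    for superclass, size in class_size.items():
        per_factor = []
        for factor in query_factors:
            # Diseases of this superclass that contain the queried factor.
            hits = sum(1 for _, c, factors in diseases
                       if c == superclass and factor in factors)
            per_factor.append(hits / size)
        # Independence assumption: multiply the per-factor probabilities.
        scores[superclass] = prod(per_factor)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up diseases: (id, superclass, factors).
DEMO = [
    ("D1", "viral", {"fever", "headache"}),
    ("D2", "viral", {"fever", "rash"}),
    ("D3", "fungal", {"itching"}),
    ("D4", "fungal", {"fever", "rash"}),
]
```

For the query `["fever", "rash"]`, the 'viral' class scores 1.0 × 0.5 and the 'fungal' class 0.5 × 0.5, so 'viral' is ranked first. A factor that never occurs in a class drives its product to zero; smoothing would avoid that, but the text does not mention it.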

Interface
The interface of the interactive disease search includes three sections. The first section is the search box with an autocomplete drop-down list (Figure 7). This drop-down list shows the disease factors similar to the user's query. These disease factors help the user complete the query faster, and when the user remembers only a part of a keyword, they serve as hints for recovering it.
The second section presents a list of disease factors inferred from the user's query. The list consists of checkboxes so that the user can choose multiple disease factors suggested by the system (Figure 12).
The final section shows the list of diseases returned for the user's query. Each disease is shown with its Bayesian probability, which expresses the probability of the relationship between the disease and the user's query.

Experiment
To evaluate the accuracy of the interactive disease search, we calculate precision and recall for the disease factor suggestions returned by the system. We randomly select five diseases: 'Sốt xuất huyết Korean' (Korean hemorrhagic fever), 'Bệnh thủy đậu' (chickenpox), 'Quai bị' (mumps), 'Cúm gia cầm' (avian influenza) and 'Bệnh dại' (rabies). Each disease is assumed to be the search intent of one user. Each user starts the search with a disease factor related to the disease that he or she is looking for. The system suggests other disease factors related to the query and, at the same time, returns the diseases matching the query. When users choose disease factors suggested by the system and click search, the system offers new suggestions and newly returned results. This selection is considered the second time that user feedback is put into the system (the first time is the query with which users start searching). Each user performs five such communications with the system. The precision and recall of the suggestions returned by the system in each communication are calculated as the averages of the precisions and recalls of the five users. Precision is measured from the number of suggested disease factors that belong to the disease the user is looking for. Data noise cannot be completely removed because of the irregular data structure. Noisy data with the same content as the disease factor of the query is counted toward recall. Figure 13 plots the precision and recall of the disease factor suggestions returned by the system over the five communications. Table 1 shows the number of retrieved diseases in the five communications.
According to the graph in Figure 13, the precision tends to increase gradually from the first communication to the fifth (from 15% to 65%). This means that as the system interacts with the user more times, it becomes progressively more certain about the disease that the user wants to find, which makes its recommendations more relevant to the user's intent. Similarly, in Table 1, the number of retrieved diseases tends to decrease (from 84 to 3) as the user interacts with the system several times. This narrows the search scope for the disease the user is looking for, so the user can more easily identify the disease relevant to his or her intent.
In addition to the accuracy evaluation, we measure the efficiency of this interactive disease search engine by comparing it with the Google search engine. We search for the five diseases above (the queries are the symptoms of these diseases) on both the Google search engine and the interactive disease search engine. The result of the test is shown in Table 2.
According to the table, the Google search engine takes longer to return a result than the interactive disease search engine (the cases of 'Quai bị' and 'Thủy đậu'). In the cases of 'Bệnh dại', 'Sốt xuất huyết Korean' and 'Cúm gia cầm', the Google search engine either cannot return the relevant diseases or returns a huge number (millions) of results, and the user has to spend much time refining the results to find the relevant diseases. The test showed that the Google search engine can only find diseases that have specific or recognizable features. Unfamiliar diseases, or diseases having several common features, cannot be found by the Google search engine because it does not have the inference capability to interact with users and capture more detail about the disease they are looking for. In contrast, the interactive disease search engine has these capabilities and returns the relevant diseases of the test in a short time.
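The averaging used in this evaluation can be sketched as follows, using the standard set-based definitions of precision and recall. The sample suggestion and relevance sets are made up and do not reproduce the measurements of the paper; the function names are hypothetical.

```python
def precision_recall(suggested, relevant):
    """Precision and recall of one list of suggested disease factors."""
    suggested, relevant = set(suggested), set(relevant)
    hits = len(suggested & relevant)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_over_users(rounds):
    """Average precision and recall over users for one communication.

    rounds: list of (suggested_factors, relevant_factors), one pair per user.
    """
    pairs = [precision_recall(s, r) for s, r in rounds]
    n = len(pairs)
    return (sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n)


# Two hypothetical users in the same communication round.
DEMO_ROUND = [
    ({"a", "b", "c", "d"}, {"a", "b"}),   # precision 0.5, recall 1.0
    ({"a", "x"}, {"a", "b", "c", "d"}),   # precision 0.5, recall 0.25
]
```

Calling `average_over_users` once per communication, with one pair per user, yields one precision value and one recall value per communication, which is the shape of the series plotted in Figure 13.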

Conclusion
In this work, a novel interactive search for disease information using data mining-based ontology is proposed. The search engine understands user input by using a vocabulary from the disease ontology. The associated pieces of information of diseases (including symptoms, derives from, located in …) are used for interactively refining the query. A Bayesian-based ranking algorithm is also proposed to arrange the search results. In the experiment, the precision of the suggested disease factors rose from 15% to 65% over five interactions, while the number of retrieved diseases fell from 84 to 3.

Notes