Matching biomedical ontologies through compact differential evolution algorithm

ABSTRACT Although biomedical ontologies have been widely used in the life science domain, the heterogeneity problem among biomedical ontologies hampers their inter-operability. Thus, the establishment of meaningful links between heterogeneous biomedical ontologies, so-called biomedical ontology matching, is critical to the success of biomedical ontology engineering. To determine a high-quality biomedical ontology alignment, in this work, a Hybrid Compact Differential Evolution (HCDE) algorithm-based biomedical ontology matching technique is proposed. In particular, we propose a similarity metric on biomedical concepts, construct an optimal model for the biomedical ontology matching problem, and introduce a binomial crossover into CDE's evolving process to enhance its performance. The experiments are carried out on the Disease and Phenotype track and the Biodiversity and Ecology track from the Ontology Alignment Evaluation Initiative (OAEI 2018). The experimental results show that HCDE significantly improves on CDE in terms of alignment quality, and that the alignments obtained by HCDE are also better than those of OAEI 2018's participants.


Introduction
In recent years, various biomedical ontologies, such as FMA (Detwiler, Mejino, & Brinkley, 2016) and SNOMED-CT (Filice & Kahn, 2019), have been widely used in the life science domain (Faria et al., 2018). However, existing biomedical ontologies that cover overlapping domains are mostly developed independently, and the different ways of defining the same biomedical concept give rise to the heterogeneity problem among biomedical ontologies, which hampers their inter-operability. Thus, the establishment of meaningful links between biomedical concepts, so-called biomedical ontology matching, is critical to the success of biomedical ontology engineering (Oliveira & Pesquita, 2018).
Usually, a biomedical ontology possesses tens of thousands of biomedical concepts, which are semantically ambiguous and have complex relationships with each other. Thus, it is a big challenge to match them effectively. Inspired by the success of Evolutionary Algorithms (Acampora, Loia, & Vitiello, 2013; Ginsca & Iftene, 2010; Martinez-Gil & Montes, 2011; Naya, Romero, & Loureiro, 2010; Wang, Ding, & Jiang, 2006; Xue & Pan, 2018; Xue & Wang, 2015) in the ontology matching domain, in this work, we propose to use the Differential Evolution (DE) algorithm (Storn & Price, 1997) to determine a high-quality biomedical ontology alignment. Since matching biomedical ontologies is a large-scale problem, we utilize a compact DE (CDE) (Mininno, Neri, Cupertino, & Naso, 2011) to avoid memory overflow and introduce a local search strategy into CDE's evolving process to improve its converging speed. In particular, the contributions made in this paper are as follows:
• a similarity metric on biomedical concept pairs is proposed to calculate the similarity value of two biomedical concepts,
• an optimal model for biomedical ontology matching is constructed,
• a problem-specific Hybrid CDE (HCDE) is presented to effectively solve the biomedical ontology matching problem.
CONTACT Junfeng Chen chen-1997@163.com
The rest of the paper is organized as follows: Section 2 presents the biomedical ontology matching problem and biomedical concept similarity metric; Section 3 presents in detail the HCDE-based biomedical ontology matching technique; Section 4 shows the experimental results; and finally, Section 5 draws the conclusion.

Biomedical ontology matching
Biomedical ontology matching aims to determine an ontology alignment, which consists of biomedical concept correspondences. Usually, some external resources, such as electronic dictionaries and knowledge bases, are required in the matching process. Since the quality of a biomedical ontology alignment can be measured by the number of correspondences found and the mean similarity value of the biomedical concept correspondences, in this work, we utilize the following equation to measure a biomedical alignment's quality: where |C_1|, |C_2| and |A| are, respectively, the cardinalities of the two biomedical concept sets C_1 and C_2 and of an alignment A between them, and simValue_i is the similarity value of the ith correspondence of A.
Further, an optimal model for the biomedical ontology matching problem is constructed as follows: where |C_1| and |C_2|, respectively, represent the cardinalities of the two biomedical concept sets C_1 and C_2, x_i, i = 1, 2, ..., |C_1|, represents the ith correspondence, and in particular, x_i = 0 means that the ith source concept is not mapped to any target concept. Supposing A is the alignment corresponding to X, the objective function F(X) is equal to f(A).
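The two ingredients named above, the number of correspondences found and their mean similarity value, can be combined in several ways. As a minimal sketch, and assuming a simple product of a coverage ratio and the mean similarity (this form, the helper names `decode_solution` and `alignment_quality`, and the toy similarity matrix are illustrative assumptions, not the paper's equation), decoding a solution X and scoring its alignment A might look like this in Python:

```python
def decode_solution(x, sim_matrix):
    """Turn a solution X into an alignment A.

    x[i] is the 1-based index of the target concept mapped to source
    concept i; x[i] == 0 means source concept i is mapped to none.
    """
    alignment = []
    for source_idx, target_code in enumerate(x):
        if target_code != 0:
            target_idx = target_code - 1
            alignment.append(
                (source_idx, target_idx, sim_matrix[source_idx][target_idx])
            )
    return alignment


def alignment_quality(alignment, n_source, n_target):
    """ASSUMED quality measure: coverage ratio times mean similarity."""
    if not alignment:
        return 0.0
    coverage = len(alignment) / max(n_source, n_target)
    mean_sim = sum(sim for _, _, sim in alignment) / len(alignment)
    return coverage * mean_sim


# Toy data: 3 source concepts, 2 target concepts (hypothetical values).
sim_matrix = [[0.1, 0.9], [0.2, 0.3], [0.8, 0.4]]
x = [2, 0, 1]  # source 0 -> target 1, source 1 unmapped, source 2 -> target 0
alignment = decode_solution(x, sim_matrix)
quality = alignment_quality(alignment, n_source=3, n_target=2)
```

Any other monotone combination of the two ingredients would fit the same decode-then-score structure.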

Similarity metric
A biomedical similarity metric is the foundation of a biomedical ontology matching technique (Cross, 2018). In this work, we utilize a profile-based similarity metric to measure to what extent two biomedical concepts are similar to each other. Given the concept hierarchies of two biomedical ontologies, for each biomedical concept we construct a profile by collecting the name, property name, and method name from the concept itself and from all its direct ascendants and descendants. Then, the similarity value of two biomedical concepts c_1 and c_2 is calculated through their profiles p_1 and p_2: where
• |p_1| and |p_2| are the cardinalities of p_1 and p_2, respectively,
• p_{1,i} and p_{2,j} are, respectively, the ith element of p_1 and the jth element of p_2,
• sim() computes the similarity value between p_{1,i} and p_{2,j} by SMOA (Stoilos, Stamou, & Kollias, 2005) and UMLS (Bodenreider, 2004), which is defined as follows:
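As an illustration of the profile-based idea only (not the paper's exact formula), the Python sketch below builds a profile from a concept and its direct ascendants and descendants, then averages the best element-to-element matches in both directions. The symmetric best-match average is an assumption, and `difflib.SequenceMatcher` is used purely as a stand-in for the SMOA and UMLS based sim(); the function and variable names are hypothetical.

```python
from difflib import SequenceMatcher


def build_profile(concept, labels, parents, children):
    """Collect the labels of a concept and of its direct ascendants/descendants."""
    members = [concept] + parents.get(concept, []) + children.get(concept, [])
    profile = set()
    for member in members:
        profile.update(label.lower() for label in labels.get(member, []))
    return profile


def element_sim(a, b):
    """Placeholder string similarity in [0, 1]; NOT SMOA or UMLS."""
    return SequenceMatcher(None, a, b).ratio()


def profile_similarity(p1, p2):
    """ASSUMED aggregation: symmetric average of best matches."""
    if not p1 or not p2:
        return 0.0
    best_1 = sum(max(element_sim(a, b) for b in p2) for a in p1)
    best_2 = sum(max(element_sim(a, b) for a in p1) for b in p2)
    return (best_1 + best_2) / (len(p1) + len(p2))


# Toy hierarchy with hypothetical labels.
labels = {"A": ["Heart"], "B": ["Organ"], "C": ["Left Ventricle"]}
parents = {"A": ["B"]}
children = {"A": ["C"]}
profile_a = build_profile("A", labels, parents, children)
score = profile_similarity(profile_a, {"heart", "cardiac organ"})
```

Swapping `element_sim` for a lexical measure backed by a thesaurus lookup would bring the sketch closer to the setting described above without changing its structure.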

Hybrid compact differential evolution algorithm
To reduce memory consumption and improve the converging speed, in this work, we utilize the compact version of DE, i.e. CDE, and combine CDE (global search) with a binomial crossover (local search) to address the biomedical ontology matching problem. In the following, the three kernel components of HCDE are presented, i.e. the encoding mechanism, the mutation operator and the local search, and finally the pseudocode of HCDE is presented.

Encoding mechanism
In our proposal, a Probability Vector (PV) (Xue & Pan, 2018) is utilized to characterize the entire population, and each of its elements stands for the probability that a biomedical concept correspondence holds true. We utilize the Gray encoding mechanism to encode each biomedical concept mapping. When decoding, the number obtained represents the index of a target biomedical concept, and in particular, the value 0 means that a source concept is not mapped to any target concept.
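The encoding can be made concrete with a short Python sketch. The Gray code conversions are the standard reflected binary code; the fixed number of bits per source concept, the treatment of out-of-range indices as "unmapped", and all function names are assumptions made for illustration.

```python
import random


def gray_encode(n):
    """Standard reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_decode(g):
    """Inverse of gray_encode."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def int_to_bits(value, width):
    """Most-significant-bit-first list of bits."""
    return [(value >> (width - 1 - k)) & 1 for k in range(width)]


def bits_to_int(bits):
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def sample_individual(pv, rng):
    """Draw one bit string from the probability vector."""
    return [1 if rng.random() < p else 0 for p in pv]


def decode_individual(bits, bits_per_gene, n_target):
    """Decode each gene to a target index; 0 means 'mapped to none'.

    ASSUMPTION: indices larger than n_target are treated as unmapped.
    """
    mapping = []
    for start in range(0, len(bits), bits_per_gene):
        index = gray_decode(bits_to_int(bits[start:start + bits_per_gene]))
        mapping.append(index if index <= n_target else 0)
    return mapping


rng = random.Random(0)
pv = [0.5] * 6                      # two source concepts, 3 bits each
individual = sample_individual(pv, rng)
mapping = decode_individual(individual, bits_per_gene=3, n_target=5)
```

A useful property of the Gray code here is that consecutive target indices differ in exactly one bit, so a single bit flip can move a mapping to a neighbouring index.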

Mutation operator
In HCDE, three solutions, namely ind_r, ind_s, and ind_t, are sampled from the PV, and an offspring ind_off is generated as follows: where F = 1.5 is a scale factor that determines how far the generated offspring is from ind_t. Since the biomedical ontology matching problem is a discrete problem, we introduce the edit distance to measure the distance between two individuals. The edit distance between two individuals ind_r and ind_s is calculated as follows: where |ind_r| is the cardinality of ind_r, and ind_{r,i} and ind_{s,i} are, respectively, the ith elements of ind_r and ind_s. Next, the offspring ind_off is generated by flipping some of the elements of ind_t, and the number of altered elements is determined by a random number in [0, 1] and editDistance(ind_r, ind_s). For the sake of clarity, the pseudocode of the mutation operator is given in Algorithm 1.
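A minimal Python sketch of such a discrete mutation step is given here under stated assumptions: individuals are equal-length bit lists, the edit distance is taken to be the number of positions at which two individuals differ, and the number of flipped elements is the rounded product of F, a random number in [0, 1) and that distance, capped at the individual's length. These choices and the function names are illustrative, not a transcription of Algorithm 1.

```python
import random


def edit_distance(ind_r, ind_s):
    """ASSUMED distance: count of positions where the two individuals differ."""
    if len(ind_r) != len(ind_s):
        raise ValueError("individuals must have the same length")
    return sum(1 for a, b in zip(ind_r, ind_s) if a != b)


def mutate(ind_t, ind_r, ind_s, scale=1.5, rng=random):
    """Flip a random subset of ind_t's bits.

    The subset size grows with the scale factor and with the distance
    between ind_r and ind_s, mirroring the role of F * (ind_r - ind_s)
    in continuous differential evolution.
    """
    distance = edit_distance(ind_r, ind_s)
    n_flip = min(len(ind_t), int(round(scale * rng.random() * distance)))
    offspring = list(ind_t)
    for position in rng.sample(range(len(ind_t)), n_flip):
        offspring[position] ^= 1
    return offspring


rng = random.Random(42)
ind_r = [0, 1, 1, 0, 1, 0]
ind_s = [1, 1, 0, 0, 1, 1]
ind_t = [0, 0, 0, 1, 1, 1]
ind_off = mutate(ind_t, ind_r, ind_s, scale=1.5, rng=rng)
```

When ind_r and ind_s are identical the distance is zero and ind_off is simply a copy of ind_t, which matches the intuition that a converged PV produces small perturbations.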

Local search
To improve CDE's converging speed, a local search strategy is introduced, which searches for the optimal solution in the neighbourhood of the elite solution. In particular, we execute the local search, which is implemented with the binomial crossover, in each generation. For the sake of clarity, the pseudocode of the binomial crossover is given in Algorithm 2.

Algorithm 2 Binomial Crossover
The binomial crossover randomly copies a sequential fragment of ind_off's genes into the corresponding positions of ind_elite, to generate ind_elite's neighbour solution. This procedure is similar to the two-point crossover, where the first cut point is randomly selected from {1, 2, ..., n}, with n the length of a solution, and the second point is determined such that L consecutive genes (counted in a circular manner) are taken from ind_off. In this work, since ind_off and ind_elite are both generated through the PV, they are similar in terms of the chromosome's information, i.e. many of their gene bit values are the same. Therefore, even when the crossover probability is large, the neighbour generated by the crossover operator only changes a few gene bit values of ind_elite. In this sense, this variation operator can be considered fairly exploitative.
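The fragment-copying behaviour described above can be sketched in Python as follows. How the fragment length L is chosen is not stated in the text, so the sketch assumes that L starts at 1 and keeps growing while a random number stays below the crossover probability p_c; this rule and the function name `fragment_crossover` are assumptions.

```python
import random


def fragment_crossover(ind_elite, ind_off, pc, rng=random):
    """Copy L circularly consecutive genes of ind_off into a copy of ind_elite.

    ASSUMPTION: L grows from 1 while rng.random() < pc, up to the full length.
    """
    n = len(ind_elite)
    start = rng.randrange(n)            # first cut point
    length = 1
    while length < n and rng.random() < pc:
        length += 1
    neighbour = list(ind_elite)
    for k in range(length):
        position = (start + k) % n      # wrap around the chromosome
        neighbour[position] = ind_off[position]
    return neighbour


rng = random.Random(7)
ind_elite = [0, 0, 0, 0, 0, 0, 0, 0]
ind_off = [1, 1, 1, 1, 1, 1, 1, 1]
neighbour = fragment_crossover(ind_elite, ind_off, pc=0.6, rng=rng)
```

With the deliberately opposite toy individuals above, the number of ones in `neighbour` equals the fragment length; when the two parents are both sampled from a nearly converged PV, most copied genes already agree and the neighbour stays close to the elite.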

The pseudocode of hybrid compact differential evolution algorithm
Given the length of a solution (or PV) length, the maximum number of generations maxGen, the binomial crossover probability p_c, and the step length for updating the PV step, the pseudocode of HCDE is given in Algorithm 3.
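To show how these parameters might interact, the following Python sketch assembles a compact loop from the components described in this section. It is an assumed reading rather than a reproduction of Algorithm 3: the PV is initialised to 0.5, the neighbour of the elite competes with the elite, and the PV is shifted by step towards the winner wherever the two differ. The toy fitness (the number of ones) and every function name are placeholders.

```python
import random


def sample(pv, rng):
    return [1 if rng.random() < p else 0 for p in pv]


def mutate(ind_t, ind_r, ind_s, scale, rng):
    # Assumed discrete mutation: flip a distance-dependent number of bits.
    distance = sum(a != b for a, b in zip(ind_r, ind_s))
    n_flip = min(len(ind_t), round(scale * rng.random() * distance))
    offspring = list(ind_t)
    for position in rng.sample(range(len(offspring)), n_flip):
        offspring[position] ^= 1
    return offspring


def crossover(elite, offspring, pc, rng):
    # Copy a circular fragment of the offspring into a copy of the elite.
    n = len(elite)
    start, length = rng.randrange(n), 1
    while length < n and rng.random() < pc:
        length += 1
    child = list(elite)
    for k in range(length):
        child[(start + k) % n] = offspring[(start + k) % n]
    return child


def update_pv(pv, winner, loser, step):
    # Move the PV towards the winner where winner and loser disagree.
    for i, (w, l) in enumerate(zip(winner, loser)):
        if w != l:
            delta = step if w == 1 else -step
            pv[i] = min(1.0, max(0.0, pv[i] + delta))


def hcde(fitness, length, max_gen, pc, step, scale=1.5, seed=None):
    rng = random.Random(seed)
    pv = [0.5] * length
    elite = sample(pv, rng)
    elite_fit = fitness(elite)
    for _ in range(max_gen):
        ind_r, ind_s, ind_t = (sample(pv, rng) for _ in range(3))
        offspring = mutate(ind_t, ind_r, ind_s, scale, rng)
        neighbour = crossover(elite, offspring, pc, rng)
        neighbour_fit = fitness(neighbour)
        if neighbour_fit > elite_fit:
            update_pv(pv, neighbour, elite, step)
            elite, elite_fit = neighbour, neighbour_fit
        else:
            update_pv(pv, elite, neighbour, step)
    return elite, elite_fit


# Toy run: maximise the number of ones in a 20-bit string.
best, best_fit = hcde(sum, length=20, max_gen=200, pc=0.6, step=0.05, seed=1)
```

Only the PV, the elite and a handful of sampled individuals are held in memory at any time, which is the property that makes the compact variant attractive for large ontologies.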

Experimental setup
In order to study the effectiveness of HCDE, we exploit the Disease and Phenotype track and the Biodiversity and Ecology track from the Ontology Alignment Evaluation Initiative (OAEI 2018). The Disease and Phenotype track is composed of two tasks which, respectively, require matching the Human Phenotype Ontology (HP) to the Mammalian Phenotype Ontology (MP), and the Human Disease Ontology (DOID) to the Orphanet Rare Disease Ontology (ORDO). The Biodiversity and Ecology (Biodiv) track requires matching the Environment Ontology (ENVO) to the Semantic Web for Earth and Environment Technology Ontology (SWEET), and the Flora Phenotype Ontology (FLOPO) to the Plant Trait Ontology (PTO). All these ontologies are semantically rich and contain tens of thousands of classes; they are particularly useful for biodiversity and ecology research and have been used in various projects.
In order to compare HCDE with the CDE-based biomedical ontology matcher, which is the version of HCDE without local search, and with the participants of OAEI 2018, we evaluate the obtained alignments with the traditional recall, precision and f-measure (Van Rijsbergen, 1986). The results of HCDE and CDE in the tables are mean values over 30 independent executions, and the symbols P, R and F in the tables stand for precision, recall and f-measure, respectively. HCDE uses the following parameters, which represent a trade-off setting obtained empirically to achieve the highest average alignment quality on all exploited datasets.
We run the Disease and Phenotype track on an Intel Core i9-8950HK CPU @ 2.90 GHz × 12 with 25 GB of allocated RAM, and the Biodiversity and Ecology track on an Intel Core i5-7500 CPU @ 3.40 GHz with 15.7 GB of allocated RAM, which are the same as OAEI's hardware configurations.
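For reference, the precision, recall and f-measure used in this evaluation follow their standard definitions with respect to a reference alignment. The Python sketch below computes them for alignments represented as sets of (source, target) pairs; the representation and the toy pairs are assumptions made for illustration.

```python
def evaluate_alignment(found, reference):
    """Return (precision, recall, f_measure) of a found alignment."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical correspondences, identified by concept labels.
found = {("a", "x"), ("b", "y"), ("c", "z")}
reference = {("a", "x"), ("b", "y"), ("d", "w"), ("e", "v")}
p, r, f = evaluate_alignment(found, reference)
```

In this toy case two of the three found correspondences are correct and two of the four reference correspondences are recovered, so precision is 2/3, recall is 1/2 and the f-measure is 4/7.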

Experimental results
As can be seen from Tables 1 and 2, HCDE's f-measure is the best in all testing cases. In general, our precision value is high, which shows the effectiveness of the proposed concept similarity metric. Compared with CDE, which is the version of HCDE without local search, both recall and precision are improved significantly, which further shows the effectiveness of the local search algorithm. To conclude, HCDE can effectively match the biomedical ontologies.

Conclusion
To efficiently match biomedical ontologies, an HCDE-based technique is proposed to determine the identical biomedical concepts. In particular, HCDE combines CDE (global search) and binomial crossover (local search) to address the biomedical ontology matching problem. In addition, a novel biomedical ontology similarity metric is presented to distinguish the heterogeneous concepts, and an optimal model of the biomedical ontology matching problem is constructed. The experimental results show that the introduction of a local search strategy can significantly improve the converging speed of CDE, and that HCDE outperforms the state-of-the-art ontology matchers of OAEI 2018.