A Multilayered Clustering Framework to build a Service Portfolio using Swarm-based algorithms

ABSTRACT In this paper, a multilayered clustering framework is proposed to build a service portfolio from which web services of choice can be selected. In the absence of public UDDI registries, every service provider needs a service portfolio so that a requester can readily obtain the desired service. To address this problem, a multilayered clustering approach is applied to a variety of data pertaining to web services in order to filter and group services of a similar kind, which in turn eases the process of service selection. The advantages of the layered approach are a reduced search space, a combination of incremental and competitive learning strategies, reduced computational effort, scalability, robustness and fault tolerance. The results are subjected to cluster analysis to verify their degree of compactness and isolation, using appropriate evaluation indices. The results were found passable, with an improved degree of similarity.


Introduction
In a conventional approach, a service provider generally has to do the following to make a service available for use. First, the provider decides on the service to be provided and then chooses one or more registries for uploading information about the service. Next, the provider decides how to list the service at the registry and finally provides an explicit definition of how a user can connect to the service.
However, as public registries are closed, it becomes essential for service providers to make all published web service descriptions available, preferably in a proprietary portal, so that a web service can be searched for or chosen. The service requester initiates a query by specifying his/her requirements. There must therefore exist a service matchmaker, a broker that compares the request with the published services and provides a recommendation containing a set of web services that match the needs, identified by their degree of similarity. The success of the discovery process thus depends on the maturity and capability of the matching process. Therefore, in order to make the search process easier, arranging and organizing the services by grouping them on certain relevancy factors is indispensable.
In this context two issues are of primary concern. The first is a procedure for categorizing services through efficient clustering techniques, so that any matchmaking process can fetch the right service based on the requirements. The second is the data pertaining to the services that are to be considered for categorizing them.
The present research summarizes three approaches addressing the first issue and four categories of data in order to address the second issue.

Common approaches for web services clustering
Five different approaches for web services clustering are presented in this survey.
(a) Syntactic vs. Semantic Clustering
One of the popular strategies used for web services analysis is through syntactic structure [1]. Some research has sought to improve the discovery of web services with search engines by recommending proven approaches that cluster WSDL descriptions into functionally similar sets before responding to discovery requests [2]. This methodology mines the WSDL to extract attributes that define the semantics and performance of the web service and reveal its functionality, after which suitable text mining strategies are applied [3].

(b) Functional vs. Non-Functional Clustering
There may be several service providers offering similar functionalities defined in a service interface [4]. Identifying and choosing the best service is an important task for service requesters. The particulars in WSDL descriptions are not sufficient for ranking the best services [5]. Non-functional properties, together with descriptions of the cost, performance, security, and trustworthiness of a service, are used for computing the Quality of Service (QoS) [6]. There are numerous QoS attributes, which can be organized into categories with a set of measurable parameters. The "best" service may mean different things to different requesters. One may prefer security to cost, while another may prefer lower cost to performance [7]. Measurements of these non-functional properties can be obtained using statistical analysis, data mining, and text mining technologies [8]. They are generally prepared by a third party by collecting subjective evaluations from requesters. This data changes dynamically over time [9].

(c) Biologically Inspired Clustering
In addition to the classical and conventional approaches, researchers have considered the use of biologically inspired methods for clustering [10]. It has been stated that clustering based on Particle Swarm Optimization is better than partitioning clustering, as it avoids the problem of stagnation in local optima [11]. Ant Colony Optimization (ACO) has been reported to yield better results than classical clustering methods: ACO-based clustering does not require the number of clusters to be known in advance, and the clusters obtained are of higher quality [12]. A hybrid technique, Tree Traversing Ant (TTA), combines features of ant-based clustering with features of classical clustering techniques [13]. In the case of service clustering, a method based on TTA is applied that uses the services' syntactic descriptions as clustering criteria.

(d) Taxonomic Clustering
Certain research works recommend the use of a taxonomic clustering algorithm that groups web services based on their functional similarity [14]. This clustering scheme takes into account not only discrete factors such as the input or output of service operations, but also the latent inter-relationships among these factors [15]. Given a set of services that may or may not have been categorized, individual methods are adopted to determine their classification labels in terms of a common (given) taxonomy, such as UNSPSC [16]. When a new service description is published, the unclassified service is compared with the classified ones, and the likelihood that the new service description belongs to each cluster is calculated [17]. Based on this calculation, the service is assigned to a suitable category.

(e) Fuzzy Clustering
Taking a different perspective on web services clustering, a proposal for fuzzy clustering of web services based on quality of service is presented in [18]. It describes how the quality of service data of web services can be clustered fuzzily using unsupervised methods. Fuzzy clustering of web services could help requestors who have limited technical knowledge to understand the realistic quality of a service [19]. They could then subscribe to services that give the best value for their money. It could further be used as a reference for requestors in the process of negotiating and specifying service requirements. It could also provide an alternative to approaches that depend on expert knowledge, as it requires less time and offers better accuracy and wider availability [20].

Criteria for choosing a clustering algorithm
Some of the concerns to be taken into account are:
(1) The algorithm should be able to handle non-linear data.
(2) The algorithm should be able to handle voluminous data.
(3) Algorithms that use distance calculations for separating clusters are very sensitive to the ranges of variables. For example, "age" in general ranges from 0 to 100, while "salary" can extend from 0 to 100,000. When both variables are used jointly, the distance contributed by salary can overwhelm that of age.
(4) The formation of outliers should be kept under control.
(5) Handling categorical data (also called non-numeric or nominal data) is a major concern.
(6) An effective clustering practice should support the automatic discovery of clusters in various subspaces of a higher-dimensional space.
(7) Depending on the situation, it may be necessary to switch between supervised and unsupervised approaches.
(8) When handling time-variant data, capturing hidden patterns becomes a challenge.
(9) A vigilance value is needed to decide the significance and threshold of matchmaking.
As an outcome of the survey, a detailed study of the following two major categories of algorithms is made.
(1) Neural network based algorithms
(2) Swarm-based algorithms

Artificial neural networks
ANN models for learning may be categorized as supervised learning, unsupervised learning and reinforcement learning. A supervised learning system assumes the availability of a supervisor who classifies the training instances into groups, and it uses the class membership of each training example. An unsupervised learning scheme, in contrast, identifies the pattern class information heuristically, and reinforcement learning learns by means of trial-and-error interaction with its environment. Some of the primary advantages of choosing ANNs are listed below:
(1) ANNs have the capability to learn and model non-linear and complex relationships, which is significant because in real life many relationships between inputs and outputs are non-linear as well as complex.
(2) ANNs can generalize. After learning from the initial inputs and their relationships, ANNs can infer unseen relationships on unseen data as well, which allows the model to generalize and predict on unseen data.
(3) Unlike several other forecasting techniques, an ANN does not impose any restrictions on the input variables (such as how they should be distributed). Moreover, many studies have shown that ANNs can better model heteroskedasticity, i.e. data with high volatility and non-constant variance, given their ability to learn hidden relationships in the data without imposing any fixed relationships on it.
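Concern (3) in the list of criteria above, the sensitivity of distance-based algorithms to variable ranges, can be illustrated with a small sketch. The records and the choice of min-max scaling are assumptions made for illustration only; the paper does not state which rescaling, if any, was used.

```python
import math

def min_max_scale(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical (age, salary) records, not taken from the paper's data set.
people = [[25, 30000], [60, 31000], [26, 90000]]

# Raw distances: the salary difference dominates completely.
raw_d01 = euclidean(people[0], people[1])  # large age gap, small salary gap
raw_d02 = euclidean(people[0], people[2])  # small age gap, large salary gap

# After scaling, both variables contribute on a comparable footing.
scaled = min_max_scale(people)
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])
```

On the raw values the second distance is dozens of times larger than the first, even though the first pair differs by 35 years of age; after scaling, the two distances are of similar magnitude.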

Swarm-based algorithms
Swarm Intelligence (SI) is a fairly new interdisciplinary field of research that has gained wide acceptance. Algorithms in this domain draw inspiration from the collective intelligence that emerges from the behaviour of groups of social insects such as bees, termites and wasps. When acting as a community, these insects, each with limited individual ability, can cooperatively perform many complex tasks essential for their existence. Problems such as finding and storing food, or selecting and picking up materials for future use, require detailed planning, yet insect colonies solve them without any kind of supervisor or controller. An example of a particularly successful research direction in swarm intelligence is Ant Colony Optimization (ACO), which focuses on discrete optimization problems and has been applied effectively to a large number of NP-hard discrete optimization problems, including the travelling salesman problem, quadratic assignment, scheduling and vehicle routing, as well as routing in telecommunication networks. Particle Swarm Optimization (PSO) is another very popular SI algorithm for global optimization over continuous search spaces. Since its introduction in 1995, PSO has attracted the attention of numerous researchers all over the world, resulting in a vast number of variants of the basic algorithm as well as various parameter automation strategies.
A detailed study has been made on the applications and merits of the following algorithms.

Proposed Work
The proposed work applies the ART algorithm for primary clustering with functional data, followed by sub-clustering through swarm-based algorithms that employ non-functional data such as metadata, QoS data and domain log information. The cluster results are then compared for quality.

Algorithms used
Upon studying the two broad categories of algorithms, the ART (Adaptive Resonance Theory) network, a learning algorithm well suited to incremental learning, is chosen for the first iteration of clustering. Three specialties of the ART algorithm relevant to the context are: (1) It handles the 'stability-plasticity dilemma' effectively. Plasticity is required for acquiring new knowledge, while stability has to be preserved in order to retain previously acquired knowledge. The ART network takes care of both. (2) It works on binary input, which helps in accommodating voluminous data. (3) A vigilance parameter, which is a threshold of recognition, is pre-set. A vigilance value of 'zero' imposes no similarity requirement, so data items are merged into very coarse clusters, whereas when vigilance is set to 'one', two data items fall in the same cluster if and only if there is a 100% match. In the current experiment, average values between 0.5 and 0.7 are set.
In the second iteration of clustering, which is applied to the results of the first iteration through ART, swarm-based algorithms are experimented with. This is because all biologically inspired algorithms draw their inspiration from animals, birds or insects of the same kind forming clusters. No two different kinds of birds, fishes, animals or insects cluster together. Hence it is appropriate to apply swarm-based algorithms to the results of the first cycle, where a certain amount of homogeneity already exists.

Functional data
Functional data are the basic query data used by the client to search for a service. They may take the form of a keyword search or a collection of functional attributes that characterize the service.
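Since the ART network used in the first phase accepts only binary input, functional data of this kind have to be turned into bit patterns. The following is a minimal sketch of one way this could be done, assuming a simple keyword-presence encoding; the service names and keywords are hypothetical, and the paper does not specify its exact encoding.

```python
def encode_services(services, vocabulary=None):
    """Map each service's functional keywords to a binary feature vector.

    `services` maps a service name to a set of keywords. The vocabulary is
    the sorted union of all keywords unless one is supplied explicitly.
    """
    if vocabulary is None:
        vocabulary = sorted({kw for kws in services.values() for kw in kws})
    vectors = {
        name: [1 if term in kws else 0 for term in vocabulary]
        for name, kws in services.items()
    }
    return vocabulary, vectors

# Hypothetical services and keywords, for illustration only.
services = {
    "svc_register": {"domain", "register", "payment"},
    "svc_renew": {"domain", "renew", "payment"},
    "svc_whois": {"domain", "lookup"},
}

vocabulary, vectors = encode_services(services)
# vocabulary is ['domain', 'lookup', 'payment', 'register', 'renew']
# vectors['svc_register'] is [1, 0, 1, 1, 0]
```

Each position in a vector answers a yes/no question about one functional attribute, which is the form of input the ART clustering step expects.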

Meta data
The following are the sources of the metadata of a web service

Non-Functional data (QoS parameters)
QoS (Quality of Service) attributes of a web service can be understood as the capacity of a web service to respond to expected invocations. In this work, only numeric QoS values are taken into consideration for experimentation, and these can be either positive or negative. A positive QoS attribute is one whose higher values indicate better quality; for instance, throughput, reliability and availability are positive QoS attributes. A negative QoS attribute is one whose higher values indicate poorer quality; for example, response time and price are negative QoS attributes.
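The distinction between positive and negative QoS attributes matters when the values are placed on a common scale. A minimal sketch follows, assuming min-max normalization in which 1.0 always means "best"; this normalization and the sample values are assumptions for illustration, not the procedure reported in the paper.

```python
def normalize_qos(values, positive=True):
    """Scale QoS values to [0, 1] so that a larger score is always better.

    positive=True  : higher raw value is better (e.g. throughput).
    positive=False : lower raw value is better (e.g. response time).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All services are identical on this attribute.
        return [1.0 for _ in values]
    if positive:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]

# Hypothetical measurements for three services.
throughput = [10.0, 20.0, 40.0]      # positive attribute
response_time = [0.2, 0.5, 0.8]      # negative attribute

throughput_score = normalize_qos(throughput, positive=True)
response_score = normalize_qos(response_time, positive=False)
```

With this convention the service with the highest throughput and the service with the lowest response time both receive a score of 1.0, so the two attributes can be combined in one feature set.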

Service generated data
These are data recorded over a period of time after the service becomes completely functional. The reasons for choosing service generated data are:
• Core functional data is not sufficient by itself for the analysis.
• The availability of metadata documents with all attributes filled in cannot be guaranteed.
• QoS parameters observed in one system differ from the QoS parameters observed in another system for the same service, due to varying network conditions.
Hence a service is allowed to function for a period of time, and the generated historical values are then observed and taken for analysis. In this perspective, several kinds of service generated Big Data (in a relative sense) can be considered for web service recommendation, including trace logs, QoS and service relationships. Here QoS refers to personalized QoS data.

Clustering schemes
The layered clustering framework is presented in Figure 1. In this layered architecture, the functional data, i.e. the core information used to search for a web service, are converted into binary matrices representing the service features and given as input to the ART network. The vigilance, or threshold for clustering, is set and the services are clustered. This constitutes the first phase of clustering. The clustered services are then subjected to sub-clustering through three different swarm algorithms. During this second phase, non-functional data of the already shortlisted services are considered. The non-functional data used are metadata extracted from the documents associated with the web services, QoS data, and domain logs registered over a period of time. With these three types of data, three different swarm-based clustering approaches are used, namely the BOIDS (birds flocking) algorithm, the ABC (Artificial Bee Colony) algorithm and the PSO (Particle Swarm Optimization) algorithm. Cluster analysis is then carried out using two metrics, namely intra-cluster and inter-cluster distances. Evaluation indices such as the Dunn index are also used to check the quality of the clusters.
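The cluster-analysis metrics named above can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance, intra-cluster distance taken as the mean pairwise distance between members, inter-cluster distance taken as the mean distance between centroids, and the Dunn index in its common form (smallest distance between points of different clusters divided by the largest cluster diameter). The paper does not state which exact variants it used, and the sample points are hypothetical.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Component-wise mean of the points in one cluster."""
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]

def mean_intra_distance(clusters):
    """Average pairwise distance between members of the same cluster."""
    dists = [math.dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(dists) / len(dists) if dists else 0.0

def mean_inter_distance(clusters):
    """Average distance between the centroids of different clusters."""
    cents = [centroid(c) for c in clusters]
    dists = [math.dist(a, b) for a, b in combinations(cents, 2)]
    return sum(dists) / len(dists) if dists else 0.0

def dunn_index(clusters):
    """Smallest inter-cluster point distance over largest cluster diameter."""
    min_inter = min(
        math.dist(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p in c1
        for q in c2
    )
    max_diameter = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    return min_inter / max_diameter if max_diameter > 0 else float("inf")

# Two hypothetical, well-separated clusters.
clusters = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(5.0, 0.0), (5.0, 1.0)],
]
intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
dunn = dunn_index(clusters)
```

Compact, well-separated clusters give a low intra-cluster distance, a high inter-cluster distance and a high Dunn index, which is the pattern the comparison of schemes looks for.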

Clustering using ART and birds flocking algorithm (Scheme 1)
Core functional data are taken as the input to the ART clustering algorithm, and the feature set given as input to ART is a binary sequence. The threshold value for clustering, i.e. the vigilance parameter, is set between 0.5 and 0.8. In the second iteration, the output of ART is taken as the input and the feature sets are updated with values parsed from metadata documents. The birds flocking algorithm then performs the clustering. The output clusters undergo a cluster analysis, using inter-cluster and intra-cluster distances as metrics. The results are found passable.

Clustering using ART and ABC algorithm (Scheme 2)
Core Functional Data are again taken as the input to the ART Clustering algorithm and the feature set given as input to ART is a binary sequence as in the previous scheme. The threshold value for clustering i.e., the vigilance parameter is set between 0.5 and 0.8. In the second iteration, the output of ART is taken as the input and the feature sets are updated with QoS information. For our study, only one positive QoS value which is throughput and one negative QoS value which is response time are taken into account. Now instead of birds flocking algorithm, Artificial Bee Colony (ABC) algorithm does the clustering work. The output clusters undergo a cluster analysis. The metrics used are inter-cluster distance and intra-cluster distances. Mean and Standard deviations are also computed. The results are still found passable with improved closeness within clusters.

Clustering using ART and PSO optimization (Scheme 3)
Core Functional Data are again taken as the input to the ART Clustering algorithm and the feature set given as input to ART is a binary sequence as in the previous two schemes. The threshold value for clustering i.e., the vigilance parameter is set between 0.5 and 0.7. In the second iteration, the output of ART is taken as the input and the feature sets are updated with domain log information. This information is retrieved from the services over a specific period of time, preprocessed and then quantified appropriately to fit in to the feature set. In the third scheme, Particle Swarm Optimization algorithm does the clustering work. The output clusters undergo a cluster analysis. The metrics used are inter-cluster distance and intra-cluster distances. Mean and Standard deviations are also computed. The results are still found passable with improved closeness within clusters and inter-cluster distances are also passable.
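To make the sub-clustering step of Scheme 3 more concrete, the following is a minimal sketch of PSO-based clustering. It assumes a common formulation in which each particle encodes k candidate centroids and the fitness is the sum of distances from every point to its nearest centroid. The parameter values (`w`, `c1`, `c2`, swarm size, iteration count), the function names and the sample points are illustrative assumptions, not the configuration used in the experiments.

```python
import math
import random

def split(position, k, dim):
    """Cut a flat particle position into k centroids of `dim` coordinates."""
    return [position[i * dim:(i + 1) * dim] for i in range(k)]

def fitness(position, points, k):
    """Sum of distances from each point to its nearest centroid (lower is better)."""
    centroids = split(position, k, len(points[0]))
    return sum(min(math.dist(p, c) for c in centroids) for p in points)

def pso_cluster(points, k, n_particles=10, n_iter=30,
                w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    swarm = []
    for _ in range(n_particles):
        # Initialise each particle with k randomly chosen data points.
        pos = [float(x) for p in rng.sample(points, k) for x in p]
        swarm.append({"pos": pos, "vel": [0.0] * len(pos),
                      "best": list(pos), "best_fit": fitness(pos, points, k)})
    leader = min(swarm, key=lambda s: s["best_fit"])
    g_pos, g_fit = list(leader["best"]), leader["best_fit"]

    for _ in range(n_iter):
        for s in swarm:
            for d in range(len(s["pos"])):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + cognitive pull + social pull.
                s["vel"][d] = (w * s["vel"][d]
                               + c1 * r1 * (s["best"][d] - s["pos"][d])
                               + c2 * r2 * (g_pos[d] - s["pos"][d]))
                s["pos"][d] += s["vel"][d]
            fit = fitness(s["pos"], points, k)
            if fit < s["best_fit"]:
                s["best"], s["best_fit"] = list(s["pos"]), fit
                if fit < g_fit:
                    g_pos, g_fit = list(s["pos"]), fit

    centroids = split(g_pos, k, dim)
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels, centroids, g_fit

# Hypothetical two-dimensional feature vectors forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids, best_fit = pso_cluster(points, k=2)
```

In the layered framework, a routine of this kind would be run separately inside each ART cluster, with the feature vectors built from the quantified domain log information rather than from the toy points used here.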
The above three schemes are illustrated using the following data set. Web services that deal with the online purchase of domain names in India through the web portal www.databasereseller.com are considered for the examination. The functional characteristics of the web services are first transformed into bit patterns so that they can be given as input to the ART algorithm, as explained for all three schemes. In the second phase, the three kinds of data associated with the web services, namely metadata, QoS data and service generated data, are used with the birds flocking, ABC and PSO algorithms respectively. Though the first two schemes produced quality outputs, it is suggested to use domain logs, which are a kind of service generated data observed over a period of time. Domain logs are promising because QoS values keep varying from one machine to another due to network conditions, and the availability of metadata attributes cannot be guaranteed at all times for all services.
The dataset is taken from www.databasereseller.com and consists of the domain logs of different web services, recorded over a period of time, pertaining to services that offer domain name purchase in India.

ART algorithm and its features
The main features and functioning of the ART algorithm are presented in this section.
An ART network usually comprises a comparison field and a recognition field made up of neurons, a vigilance parameter, and a reset module. Higher vigilance yields highly detailed memories (many fine-grained categories), while lower vigilance results in more general memories. The comparison field takes an input vector (a one-dimensional array of values) and transfers it to its best match in the recognition field. The best match is the single neuron whose set of weights (weight vector) most closely matches the input vector. Each recognition field neuron outputs a negative signal (proportional to that neuron's quality of match to the input vector) to each of the other recognition field neurons and inhibits their output accordingly.
ART supports self-stabilized learning and incremental learning, and it handles the stability-plasticity dilemma effectively. This means it can learn new things while retaining old information.
Vigilance Parameter is a threshold value that helps to set the degree of similarity expected. It assigns bottom up weights (activation) and top down weights (expectations). ART networks consist of an input layer and an output layer. Bottom-up weights are used to determine output-layer candidates that may best match the current input. Top-down weights represent the "prototype" for the cluster defined by each output neuron. A close match between input and prototype is necessary for categorizing the input. Finding this match can require multiple signal exchanges between the two layers in both directions until "resonance" is established or a new neuron is added.
ART networks tackle the stability-plasticity dilemma as follows.
Plasticity: They can always adapt to unknown inputs (by creating a new cluster with a new weight vector) if the given input cannot be classified by the existing clusters.
Stability: Existing clusters are not deleted by the introduction of new inputs (new clusters are simply created in addition to the old ones).
Clusters are of fixed size, depending on ρ (the vigilance parameter). Also, the algorithm accepts only binary input. The basic functionality of the ART1 algorithm is presented in Figure 2.
Training: There are two elementary approaches to training ART-based neural networks: slow and fast. In the slow learning technique, the degree of training of the recognition neuron's weights towards the input vector is computed in continuous values with differential equations and is consequently dependent on the length of time for which the input vector is presented. In the fast learning approach, algebraic equations are used to compute the degree of weight adjustment to be made, and binary values are used. While fast learning is effective and efficient for a variety of tasks, the slow learning method is more biologically plausible and can be used with continuous-time networks (i.e. when the input vector can vary continuously).
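The interplay of bottom-up activation, top-down prototypes, the vigilance test and fast learning described in this section can be sketched as follows. This is a simplified, single-pass illustration of ART1-style clustering under stated assumptions: the choice parameter `L`, the function names and the sample patterns are hypothetical, and the sketch is not the implementation used in the experiments.

```python
def overlap(x, w):
    """Number of positions where both binary vectors are 1 (|x AND w|)."""
    return sum(a & b for a, b in zip(x, w))

def art1_cluster(patterns, vigilance=0.7, L=2.0):
    """Single-pass ART1-style clustering of binary patterns (fast learning).

    Returns one cluster label per pattern and the learned prototypes.
    """
    prototypes = []  # top-down weights: one binary prototype per cluster
    labels = []
    for x in patterns:
        norm_x = sum(x)
        if norm_x == 0:
            raise ValueError("each input needs at least one active bit")
        # Bottom-up activation ranks the candidate clusters.
        ranked = sorted(
            range(len(prototypes)),
            key=lambda j: L * overlap(x, prototypes[j])
            / (L - 1.0 + sum(prototypes[j])),
            reverse=True,
        )
        winner = None
        for j in ranked:
            # Vigilance test: fraction of the input matched by the prototype.
            if overlap(x, prototypes[j]) / norm_x >= vigilance:
                # Resonance: fast learning shrinks the prototype to x AND w.
                prototypes[j] = [a & b for a, b in zip(x, prototypes[j])]
                winner = j
                break
            # Otherwise the candidate is reset and the next one is tried.
        if winner is None:
            # Plasticity: no cluster matched, so a new one is created.
            prototypes.append(list(x))
            winner = len(prototypes) - 1
        labels.append(winner)
    return labels, prototypes

# Hypothetical binary feature vectors.
patterns = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
]
labels_loose, _ = art1_cluster(patterns, vigilance=0.6)
labels_strict, _ = art1_cluster(patterns, vigilance=0.9)
```

On these four patterns the looser vigilance groups them into two clusters, while the stricter vigilance splits off the last pattern into a third cluster, which illustrates how the vigilance parameter controls cluster granularity.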

Parameters used
Following parameters are used.
• n − Number of components in the input vector

Relevancy factor
The percentage of relevancy indicates that a particular service fits into the same group irrespective of the change in vigilance value. The average relevancy percentage is 83.28% for domain log based clustering, 80.42% for the QoS based clustering and for metadata based clustering it is 74.14%. The details are tabulated in Table 1.
The relevancy factor reflects the fact that a service falls in the same cluster even when the vigilance values are varied. It means that the cluster has a high level of cohesion among the web services contained in it. The vigilance parameter can be set between 0 and 1. If it is 0, no similarity is demanded and web services are merged into very coarse clusters. On the other hand, if the vigilance value is 1, the members (web services) within a cluster must match by 100%. In the experiment conducted, middle values of vigilance ranging between 0.5 and 0.8 are considered and the observations are recorded. A higher relevancy percentage in the table indicates that the particular approach has given better clusters that are cohesive and unique. Thus, as tabulated in Table 1, the ART and PSO algorithms applied to domain logs observed over a period of time gave passable results, forming more cohesive and unique clusters. Though a phenomenal improvement is not seen in comparison with the ART and ABC algorithms, there is a steady increase in the relevancy percentage.

Intercluster and intracluster distances
The compactness of the data items within each cluster and the intercluster distances are observed and compared for all three schemes. ART with PSO (domain logs) gave passable results with a high degree of similarity, and the clusters are well separated from each other. The comparison of intercluster and intracluster distances for all three approaches is tabulated in Table 2. It tabulates the readings observed in 4 consecutive experiments in which the number of services is increased steadily. For each of the three combinations of clustering algorithms, the number of clusters and the average intercluster and intracluster distances of the clusters formed are noted. The vigilance parameter for the ART algorithm is set to 0.8 in all cases. The values indicate that intercluster distances are higher and intracluster distances are lower in the 'domain logs' scheme, which employed the PSO algorithm for sub-clustering. In all three approaches the ART algorithm is used for phase I. ART with PSO, which employs periodically recorded domain log data, has produced more cohesive and distinctive clusters.

Key merits of the approaches
Initially, the ART algorithm, being an unsupervised algorithm, gave better results than K-means, where the value of k has to be fixed in advance. When non-functional data, i.e. data extracted from metadata documents, were used in addition to functional data and the birds flocking algorithm was employed, better results were produced upon sub-clustering.
In birds flocking algorithm, different flocks represent different service clusters. Similar to another bioinspired clustering algorithm, the ant colony algorithm, flocking algorithm does not need initial partitions or the prior knowledge about the class number for each dataset. The advantage of the flocking clustering algorithm is the heuristic principle of the flock's searching mechanism. This heuristic searching mechanism helps them quickly form a flock. Thus the flocking clustering algorithm can generate a better clustering result with fewer iterations than that of the ant clustering algorithm which is widely used. The clustering results generated by the flocking algorithm can be easily visualized and recognized by an untrained human user.
Though the above method was efficient, the data extracted from metadata documents were inconsistent and many values were missing in many instances. Therefore an attempt was made to use numeric QoS attributes by employing the ART and ABC algorithms. The problems with the artificial bee colony algorithm, however, are (1) there is no centralized processor to guide ABC towards good solutions, and (2) convergence is slower than with other heuristics. Moreover, QoS values differed from one client machine to another due to fluctuating network conditions in different instances.
So in the third approach, service generated data such as domain logs, which are historical in nature, i.e. observed over a period of time after a web service has been instantiated, are used so that the data are more consistent. To rectify the earlier issues, the PSO algorithm is proposed in the final stage. Its merits are (1) simple implementation, (2) easy parallelization for concurrent processing, and (3) few algorithm parameters. The results through ART and PSO were more passable; optimality is established through query time, and a credit-based ranking technique is applied to present a better result.

Conclusion
In this paper, users are given recommendations about the web services available for a specific application by analysing service generated data such as the domain logs of web services, performing parallel PSO-based clustering using the MapReduce technique, and then extracting reports through further refinement to achieve optimality. The application of the ART algorithm helped significantly in handling different classes of unfragmented data, and because ART takes binary input, the representation of multidimensional nonlinear data was simplified. A medium range of vigilance values between 0.5 and 0.8 is used to set the threshold for the degree of similarity, and it is observed that the clusters increase markedly in both number and quality.
The use of the PSO algorithm helped to achieve optimality at a faster rate, and the computational overhead is controlled significantly due to the two-phase approach. Optimality is achieved more readily in PSO as there is no overlapping and mutation. The formation of outliers was also avoided. The self-organizing and scalable nature of swarm algorithms is well suited to complement the unsupervised clustering of ART.

Disclosure statement
No potential conflict of interest was reported by the authors.