Nature-inspired remora optimisation algorithm for enhancement of density peak clustering

Abstract Density peak clustering (DPC) has shown promising results for many complex problems when compared with other existing clustering techniques. In spite of its many advantages, DPC suffers from difficulties in identifying cluster centroids and the cut-off distance. The cut-off distance is the prominent parameter used in the calculation of local density, and an improper choice of cut-off distance leads to improper clustering results. Currently, the cut-off distance is selected using a decision graph, delta density, knee point detection, silhouette score or kernel functions. The main problem with these approaches is that they often rely on heuristic or visually subjective criteria, making the choice of the optimal cut-off distance challenging and potentially sensitive to data characteristics. By leveraging metaheuristic optimisation algorithms, the selection of the cut-off distance becomes less subjective and more data-driven, potentially leading to improved clustering results in DPC. This motivated us to work on the choice of cut-off distance using the remora optimisation algorithm (ROA). The clustering results are improved by using the remora in selecting a reliable cut-off distance ($d_c$). The effectiveness of the updated DPC with ROA is evaluated on eight datasets and compared with K-means, traditional DPC, and DPC combined with other optimisation algorithms. The three parameters used to check cluster quality are homogeneity, completeness and silhouette analysis. ROA is a recent algorithm inspired by the remora, which moves from place to place by attaching itself to marine animals such as sharks, whales and swordfish. It is clear from the results that DPC with ROA produced a better homogeneity value of 0.807, completeness of 0.699 and silhouette score of 0.79 than the other clustering algorithms.


Introduction
Searching through and analysing a sizable collection of unstructured data in order to find patterns and extract important details is known as data mining. Classification and clustering are the two most widely used data mining methods. Classification is used to assign classes to items. Clustering is similar to classification; clustering, however, finds similarities between objects in order to understand how they differ. With the help of clustering, it is possible to identify inherent patterns or classes in the data without any need for prior knowledge. In situations where labelled information might be scarce or unavailable, clustering is more adaptable and suitable for a wide range of situations. The selection of the proper class labels, moreover, can be difficult, arbitrary or context-dependent in real-world situations. Using clustering, we can avoid having to prejudge the class labels. This advantage of clustering motivated us to work on various clustering techniques (Aggarwal, 2015; Halkidi et al., 2001; Han et al., 2011; Jain et al., 1999a; Xu & Wunsch, 2005a). The majority of researchers consider both exterior separation and internal homogeneity when describing a cluster (Hansen & Jaumard, 1997; Jain & Dubes, 1988).
Clustering algorithms that were created to address a specific issue typically include favourable assumptions about the application of interest. Success on other problems that fail to meet these premises will always be affected by these biases. For instance, the Euclidean measure is the foundation of the K-means algorithm, which produces hyperspherical clusters by default. However, if the actual clusters take on different geometrical shapes, K-means might no longer be useful, and we will need to turn to alternative methods. This circumstance also applies to mixture-model clustering, where a model is fitted to the data beforehand. Based on the characteristics of the clusters that are produced, hierarchical clustering and partitional clustering are two widely accepted general categories (Everitt et al., 2001; Jain et al., 1999). Partitional clustering immediately divides data objects into a predetermined number of clusters, while hierarchical clustering (Jain et al., 1999b; Zhao et al., 2005) groups data objects with a series of partitions, varying from standalone clusters to a single cluster consisting of all samples, or vice versa.
The resulting hierarchy of partitions resembles a tree, such that $C_i \in H_m$ and $C_j \in H_l$ with $m > l$ imply $C_i \subset C_j$ or $C_i \cap C_j = \emptyset$. Hierarchical clustering comes in two main flavours: (i) Agglomerative clustering creates the appropriate number of clusters by merging the nearest pairs of clusters after each data point is first treated as a separate cluster. It is vulnerable to the initial conditions, such as the choice of distance measure or linkage criterion, and has a high time complexity of O(n³), where n represents the total number of data points. (ii) Divisive clustering begins with all of the data points in the same cluster and repeatedly splits the clusters until each data point forms its own cluster, but this method is challenging (Jain, 2010; Kaufman & Rousseeuw, 1990a) and tends to be more complicated and computationally costly than agglomerative clustering. Due to the lack of an ideal way to calculate the number of clusters, it is extremely sensitive to noise and outliers in the dataset. Partitional clustering can similarly be divided into two categories. (i) K-means clustering (Bishop, 2006; Hastie et al., 2009; Lloyd, 1982; MacQueen, 1967) seeks to divide the data into K clusters, although this depends on the centroids chosen at the beginning, and the user must provide the number of clusters K in advance. It struggles with datasets that have irregular or erratic cluster shapes. (ii) A variant of K-means clustering called K-medoids (Jain & Dubes, 1988; Kaufman & Rousseeuw, 1990b; Park & Jun, 2009; Yang & Yin, 2005) chooses actual data points as cluster representatives (medoids) rather than centroids. Although it has a greater processing cost than K-means clustering, it is less susceptible to noise and outliers, and it minimises the total dissimilarity between data points and medoids. Another form of clustering technique is density-based clustering (Ankerst et al., 1999; Ester et al., 1996), which clusters data points according to their density within the dataset rather than distance
measures. Density-based clustering often relies on the density of the data points, whereas partitional or hierarchical clustering relies on the distance between the data points to determine similarity and dissimilarity. Minkowski, Euclidean, city-block, sup, Mahalanobis and point-symmetric distance are a few of the widely used distance measures, whereas density indicates how many neighbouring data points lie within the cut-off distance. Density-based clustering has a number of benefits over distance-based clustering, such as: (i) Effective handling of noise and outliers. Instead of squeezing noisy data points into clusters, density-based techniques can recognise and label them as outliers. (ii) The ability to find clusters of any size and shape. They can adjust to various cluster shapes, including elongated or irregular clusters, and this adaptability lets them recognise complex data structures. (iii) Unlike distance-based techniques, which require the number of clusters to be predefined, they can automatically determine the number of clusters in the dataset. When the dataset has varying cluster densities or when the number of clusters is uncertain, this feature is especially helpful. (iv) In comparison with distance-based algorithms like K-means, they make fewer assumptions about the distribution of the underlying data. They are better suited to datasets with a variety of structures since they do not presuppose that clusters have a given size, shape or variance. In situations where distance-based algorithms can fail, such as when working with noise, outliers, varying cluster shapes and unknown cluster counts, density-based clustering algorithms offer a greater degree of versatility, reliability and flexibility, and for this reason they are especially helpful. We started working on density-based clustering because of these benefits (Campello et al., 2013; Saeed & Ahmed, 2017). Despite its many benefits, density-based clustering struggles with the proper choice of cut-off distance.
A key factor in determining the local density is the cut-off distance. Incorrect local density computation due to inappropriate cut-off distance selection leads to inaccurate clustering results. In this research, a metaheuristic optimisation technique is used to demonstrate the best procedure for choosing the cut-off distance. In density-based clustering, the cut-off distance can be chosen using a variety of conventional techniques. The first is the elbow approach, which involves ranking each data point's distance from its kth nearest neighbour. An appropriate value for epsilon might be taken as the location on the plot where the rate at which the distance increases changes significantly (the elbow point). The second is the reachability plot. This plot displays the reachability distance, the shortest distance needed to go between two points of differing densities. When the reachability plot indicates large variations in reachability distances, an acceptable epsilon value can be identified. The final category is optimisation algorithms, which test a variety of epsilon values and assess their fitness. Finding the epsilon value that maximises clustering performance is the goal of the optimisation procedure. In comparison with the elbow method and the reachability plot, using optimisation methods for the selection of the cut-off distance in density-based clustering algorithms has various benefits. Automating the cut-off distance selection process using optimisation algorithms eliminates the need for manual inspection or arbitrary plot interpretation. Optimisation can investigate a wider range of cut-off distance values, including smaller and larger values, which the elbow method or reachability plot might find challenging to represent. This broader search assists in locating potential ideal values that human inspection may have missed. It may also incorporate prior knowledge of the dataset, clustering objectives or domain expertise. As optimisation goals or constraints, one might include restrictions on the cluster sizes or the desired degree of granularity. This makes it possible to choose the cut-off distance in a more specialised and situation-specific manner. The numerous benefits of optimisation algorithms encouraged their employment here.
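As an illustration, the elbow approach described above can be sketched in a few lines of Python. The two-blob dataset, the choice of k = 4, and the chord-based knee heuristic are assumptions made for this example, not part of the original method descriptions.

```python
import numpy as np

def kth_neighbour_distances(X, k=4):
    """Sorted distance from each point to its k-th nearest neighbour."""
    # Pairwise Euclidean distances (O(n^2); fine for a sketch).
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    d.sort(axis=1)               # column 0 is each point's self-distance (0)
    return np.sort(d[:, k])     # ascending k-th neighbour distances

def elbow_point(dists):
    """Pick the value farthest from the chord joining the first and last
    points of the sorted curve (a simple knee-detection heuristic)."""
    n = len(dists)
    x = np.arange(n)
    line = dists[0] + (dists[-1] - dists[0]) * x / (n - 1)
    return float(dists[int(np.argmax(np.abs(dists - line)))])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
eps = elbow_point(kth_neighbour_distances(X, k=4))
```

The returned `eps` is then the candidate cut-off (epsilon) that a density-based clusterer would use.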
Numerous optimisation algorithms exist. Compared to other optimisation techniques, metaheuristic optimisation algorithms (Glover & Kochenberger, 2019a; Raidl & Hajdu, 2019) have several benefits. They have the ability to efficiently explore a variety of search space regions, enabling them to locate globally optimal or nearly optimal solutions even in complex situations (Blum & Roli, 2003; Glover, 2003; Talbi, 2009). By utilising techniques like randomisation, diversification and intensification, they are efficient at escaping local optima. Metaheuristics can explore and exploit many regions of the search space, which helps them avoid being caught in local optima and locate superior solutions. These algorithms frequently operate on a number of potential solutions concurrently, enabling parallel exploration of the search space and speeding up convergence to feasible solutions (Glover & Kochenberger, 2019b; Michalewicz & Fogel, 2004).
The Genetic Algorithm (GA) (Deb et al., 2000; Haupt & Haupt, 2004) is a metaheuristic optimisation technique that draws inspiration from genetics and natural evolution. In order to iteratively search for optimal or nearly optimal solutions, GA mimics the processes of natural selection, reproduction and mutation, but in order to work well, GAs often require careful adjustment of parameters like population size, crossover rate and mutation rate. If the population diversity declines too quickly or the operators are not balanced, GAs may converge too early, trapping the search in unfavourable regions. Particle Swarm Optimisation (PSO) is based on the collective behaviour of fish schooling or bird flocking (Eberhart & Shi, 2001). Particularly in heterogeneous or misleading fitness environments, PSO can become trapped in local optima and find it difficult to escape from them. Parameters such as the inertia weight and acceleration coefficients can have a significant impact on PSO performance, and obtaining the ideal parameter values may necessitate lengthy testing. PSO is prone to premature convergence, just like GAs. Ant Colony Optimisation (ACO) was inspired by ant foraging behaviour (Dorigo et al., 1996, 2006). Particularly for large-scale problems with complex search spaces, ACO has slower convergence rates compared to other metaheuristics. Additionally, ACO frequently requires the storage of pheromone trails, which can lead to high memory utilisation for problems with vast solution spaces. The Firefly Algorithm (FA) was developed from the study of firefly behaviour (Yang, 2010). Depending on the features of the problem and its computational complexity, especially for large-scale optimisation tasks, the performance of the Firefly Algorithm may vary dramatically. The algorithm's execution time can grow unreasonably high as the number of fireflies rises. Cuckoo Search (CS) was inspired by cuckoo bird behaviour (Yang & Deb, 2009). As the
problem size or complexity rises, the CS algorithm's performance can suffer. Due to the increased search space and computational demands, it may find it difficult to handle high-dimensional or large-scale optimisation problems, and the CS algorithm may have trouble efficiently exploring and exploiting multimodal optimisation problems (Yadav & Saini, 2017; Zhan et al., 2009) that have many peaks or solutions. It may have trouble maintaining diversity and thoroughly searching the search space, which can result in poor outcomes. The Bee Algorithm (BA) grew out of research into honeybee foraging behaviour (Pham & Castellani, 2005; Pham et al., 2006). The BA, like many other optimisation algorithms, may struggle with high-dimensional optimisation problems (Luo & Cai, 2010). It becomes more difficult for the algorithm to successfully explore the space and settle on optimal solutions within an acceptable amount of time as the dimensionality rises. The remora optimisation algorithm (ROA) can be used to address the problems raised by other metaheuristic algorithms. The cooperative relationship between remoras (fish) and sharks served as the inspiration for this relatively new metaheuristic optimisation technique (Li et al., 2019, 2021). Remora optimisation uses adaptive mechanisms and a balanced exploration-exploitation strategy. Compared to other optimisation techniques, this helps achieve nearly optimal outcomes with a faster convergence time. Additionally, it may be used to address difficult optimisation problems with wide solution spaces. The approach is appropriate for high-dimensional or multimodal optimisation problems because it efficiently explores and exploits the search space. During the optimisation process, remora optimisation uses adaptation mechanisms to dynamically change the algorithm's parameters. By responding to changes in the problem environment, the algorithm is able to adapt, and this adaptation enhances its
performance in dynamic or uncertain optimisation tasks. The earlier optimisation approaches do not show much evidence of this parameter adjustment. As a result, the ROA, which is motivated by this advantage, is used to define the cut-off distance in density peak clustering (DPC) rather than choosing the cut-off distance at random. The choice of the cut-off distance is subjective and potentially biased because the current methods for selecting it rely on heuristics. The cut-off distance must be manually adjusted, which can take time and may not produce the best results for all datasets, and these approaches may not adapt well to datasets with fluctuating densities or complicated structures, leading to inferior clustering results in some circumstances. Traditional approaches may result in inferior clustering solutions since they do not directly optimise clustering quality indicators. These techniques require choosing the cut-off distance by visually examining plots or graphs, which can be difficult for large or high-dimensional datasets. The ineffectiveness of current methods in handling noisy data can result in poor cluster identification. Our proposed work of using ROA for selecting the cut-off distance reduces the need for manual and subjective decisions. This data-driven approach enhances objectivity and adaptability. Using the ROA, our method is designed to optimise clustering quality metrics, potentially leading to improved cluster identification compared to heuristic-based approaches. Our approach can adapt to datasets with varying densities and structures, making it versatile and suitable for a wide range of clustering tasks. The algorithm's optimisation process can take into account domain-specific knowledge or requirements, allowing for customisation of the clustering process. Our approach introduces a novel and innovative way to tackle the cut-off distance selection problem in DPC, potentially advancing the state of the art in clustering methodology.
To assess the effectiveness of the remora optimisation technique, 23 benchmark functions were taken into consideration. Mathematical benchmark functions are frequently used to assess how well optimisation techniques perform. In order to assess the efficacy and efficiency of various optimisation strategies, these functions offer a standardised set of problems with known optimal solutions. They are frequently employed to evaluate the accuracy and convergence speed of optimisation methods. The 23 values returned by the remora optimisation algorithm on the benchmark functions are used as the 23 candidate threshold distances for DPC. Clustering is performed with all 23 threshold distances to pick the best clustering result and its corresponding cut-off distance (fitness value). The cluster formation is assessed by the homogeneity and completeness checks. The best-performing of the 23 cluster results is used on eight datasets for prediction. The eight datasets are Heart failure, Drug review, 3W, Alcohol QCM sensor, Perfume, TV News Channel Commercial Detection, Facebook Live Sellers in Thailand, and Stock Details. Accurate and fast prediction is necessary in key contexts like stocks, health and social media. Decision-making and prediction in these areas are affected by many factors, which motivated us to build a model with high accuracy. Metaheuristic algorithms can produce more optimised solutions and have shown promising results in many areas and on various problems like disease clustering, forecasting and tweet clustering. In this regard, we used one of the metaheuristic algorithms discussed in Section IV to compute the cut-off distance in DPC. This paper shows the performance of DPC with remora optimisation on the eight datasets, compared with traditional DPC, K-means clustering, and DPC with other optimisation algorithms.
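The selection loop described above — cluster with each candidate cut-off distance, score the partition, keep the best — can be sketched with scikit-learn's quality metrics. The synthetic data, the candidate values, and the use of DBSCAN as a stand-in density-based clusterer (rather than the paper's DPC implementation) are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import (homogeneity_score, completeness_score,
                             silhouette_score)

# Synthetic two-cluster data with known ground-truth labels.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (60, 2)), rng.normal(4, 0.3, (60, 2))])
y_true = np.repeat([0, 1], 60)

def fitness(d_c):
    """Score one candidate cut-off distance by clustering quality."""
    labels = DBSCAN(eps=d_c, min_samples=4).fit_predict(X)
    if len(set(labels)) < 2:                 # silhouette needs >= 2 labels
        return -1.0
    return (homogeneity_score(y_true, labels)
            + completeness_score(y_true, labels)
            + silhouette_score(X, labels)) / 3

# Stand-ins for the 23 candidate distances returned by the optimiser.
candidates = np.linspace(0.1, 1.5, 23)
best_dc = max(candidates, key=fitness)
```

The candidate with the highest combined score plays the role of the chosen cut-off distance (fitness value) in the procedure above.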
The performance of ROA is tested on the 23 CEC2005 benchmark functions (Jia et al., 2021). There are many more optimisation algorithms (Challapalli & Devarakonda, 2022; Eluri & Devarakonda, 2022a, 2022b; Papasani et al., 2022), but we have limited our study in this paper to only three optimisation algorithms. The remainder of the text is organised as follows: Section II provides an explanation of the relevant works on DPC and its variations. Sections III and IV go into detail about the techniques employed in the proposed strategy. The proposed method's idea and its results are presented in Section V.

Motivation
Choosing the appropriate class labels in practical settings can be challenging, arbitrary, or context-dependent. By employing clustering, we may avoid needing to make assumptions about the class labels. These advantages motivated us to work on clustering.
When compared to other available clustering algorithms, DPC has demonstrated excellent outcomes for several complicated problems. Despite its many benefits, DPC struggles with cut-off distance identification and the lack of predefined cluster centroids. The key variable used to determine local density is the cut-off distance, and poor cut-off distance selection causes inappropriate clustering results. The decision graph, delta density, knee point detection, silhouette score, and kernel functions are currently used to choose the cut-off distance. The fundamental issue with these methods is that they frequently rely on heuristic or subjective visual criteria, making the selection of the ideal cut-off distance difficult and potentially sensitive to data properties. Utilising metaheuristic optimisation techniques makes choosing the cut-off distance less arbitrary and more data-driven, which may lead to better clustering outcomes in DPC. This inspired us to use the ROA for the selection of the cut-off distance.

Benefits
Our method automates the process of choosing the cut-off distance, which eliminates the need for manual intervention and arbitrary judgements. This improves the objectivity of the clustering process. Our method has the potential to produce higher-quality clusters than heuristic-based methods by optimising clustering quality metrics through the ROA. As a result, the outcomes are more precise and meaningful. The approach is adaptable to datasets with various densities and structures, which makes it useful for a variety of clustering applications, and it addresses the difficulty of locating clusters with different properties. Users can adapt the clustering to their needs by including domain-specific knowledge or criteria in the optimisation process. This adaptability is useful in practical applications. The automated method avoids the time- and labour-intensive manual adjustment of the cut-off distance, which is extremely helpful when dealing with huge datasets or challenging clustering jobs. The cut-off distance selection issue in DPC is addressed here in a fresh and creative way; by combining established methods in a novel way, it advances the clustering process. The optimisation procedure may also make the clustering algorithm less sensitive to data noise, resulting in more reliable cluster identification. Our concept has benefits for real-world uses like brand monitoring, crisis management, political analysis, and social research. Decision-making and insights in these areas may benefit from improved clustering findings. Remora optimisation and DPC are combined in our technique to provide a flexible methodology that can be used for various data analysis and clustering tasks, increasing the utility and relevance of the approach.

Related work
Rodriguez and Laio (Rodriguez & Laio, 2014) proposed DPC. Based on the local density and separation from all other data points, this technique spots cluster centres. The data points with high local density and high separation distance are called cluster centroids. Based on how close they are to the centres, the remaining points are distributed across clusters. DPC has shown an equal share of advantages and limitations. The cut-off distance, also known as the density threshold in DPC, must be determined in advance and can be difficult to determine accurately. Only an appropriate cut-off distance leads to an accurate local density.
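The two quantities DPC computes for every point — the local density ρ (here, the number of neighbours within the cut-off distance d_c) and the distance δ to the nearest point of higher density — can be sketched as follows. The synthetic data, the value of d_c, and selecting centres by the product ρ·δ are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def dpc_decision_values(X, d_c):
    """Return (rho, delta): local density and the distance to the
    nearest higher-density point, in the spirit of DPC."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    rho = (d < d_c).sum(axis=1) - 1          # exclude the point itself
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        # By convention, a point with no denser neighbour gets the max distance.
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    return rho, delta

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(4, 0.3, (40, 2))])
rho, delta = dpc_decision_values(X, d_c=0.5)
centres = np.argsort(rho * delta)[-2:]       # two largest rho * delta products
```

Points with both large ρ and large δ stand out on the decision graph; the remaining points would then be assigned to the cluster of their nearest higher-density neighbour.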
Keeping these limitations in view, many improved DPCs have been produced; one of them is FKNN-DPC (Xie et al., 2016). It showed two approaches for assigning cluster members to their respective cluster centres; however, choosing cluster centres still remains an unsolved issue. The time complexity of distance calculation between points is another challenge of the existing DPC, which can be addressed by CFSFDP+A (Bai et al., 2017). Two data points are considered similar if they share a maximum number of neighbouring points; this is the idea behind the shared-nearest-neighbour-based clustering algorithm SNN-DPC (Liu et al., 2018), but it leads to an increase in complexity. DPC merged with KNN, and DPC-KNN-PCA, both provide an allocation strategy, and the calculation of local density is discussed in DPC-KNN (Du et al., 2016a; Shlens, 2014). This approach generates better results by adopting dimensionality reduction, and its metric for the calculation of local density brings positive results. Fuzzy-weighted K-Nearest Neighbours DPC (FKNN-DPC) (Xu & Wunsch, 2005b) is used in the calculation of local density, which produced better results than DPC but with a higher rate of complexity and manual tuning. Local density calculation for a small dataset is another challenging task, addressed by Rodriguez and Laio (2014a). Equation (1) shows the calculation of local density for a small dataset:

$$\rho_i = \sum_{j \ne i} \exp\!\left(-\left(\frac{d_{ij}}{d_c}\right)^{2}\right) \quad (1)$$
However, there is no specific technique to identify whether a dataset is small or large. The metric for the calculation of local density for any size of dataset is given in DPC-KNN (Du et al., 2016b). Equations (2) and (3) show the calculation of local density:

$$\rho_i = \exp\!\left(-\frac{1}{K}\sum_{x_j \in \mathrm{KNN}_i} d(x_i, x_j)^{2}\right) \quad (2)$$

$$\mathrm{KNN}_i = \{x_j : d(x_i, x_j) \text{ is among the } K \text{ smallest distances from } x_i\} \quad (3)$$

where the user-defined parameter is represented by $K$ and the set of K nearest neighbours of point $i$ is represented by $\mathrm{KNN}_i$. DPC with kernel density estimation (KDE) (Mehmood et al., 2016) is another solution for defining the local density in DPC.
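The two density definitions discussed above can be written compactly in code. The Gaussian-kernel and KNN-based forms below follow the commonly cited formulations of Rodriguez and Laio (2014a) and DPC-KNN respectively; the data, the value of d_c, and the value of K are illustrative assumptions.

```python
import numpy as np

def gaussian_density(d, d_c):
    """Gaussian-kernel local density:
    rho_i = sum over j != i of exp(-(d_ij / d_c)^2)."""
    k = np.exp(-(d / d_c) ** 2)
    return k.sum(axis=1) - 1.0          # subtract exp(0) for the j == i term

def knn_density(d, K=5):
    """DPC-KNN-style local density:
    rho_i = exp(-(1/K) * sum of squared distances to the K nearest neighbours)."""
    d_sorted = np.sort(d, axis=1)[:, 1:K + 1]   # drop the self-distance column
    return np.exp(-(d_sorted ** 2).sum(axis=1) / K)

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
diff = X[:, None, :] - X[None, :, :]
d = np.sqrt((diff ** 2).sum(-1))        # pairwise distance matrix
rho_g = gaussian_density(d, d_c=0.8)
rho_k = knn_density(d, K=5)
```

Note that the KNN form depends only on K, not on the cut-off distance, which is why DPC-KNN applies uniformly to datasets of any size.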
By taking into account both density and distance information for cluster centre determination, Li and Ding (Li & Ding, 2015) proposed an extension to DPC. To assess a point's eligibility as a cluster centre, it provides a density-based metric. The algorithm ranks the points using a density-distance product and chooses the centres based on a user-defined threshold, although this method still necessitates choosing a threshold, and the performance may differ depending on the threshold selection. Due to the curse of dimensionality, it has trouble handling high-dimensional datasets. To automate the selection of the density threshold for DPC, Cao et al. offer a self-adaptive threshold technique. It estimates the ideal cut-off based on the variation of distances using statistical analysis. This strategy seeks to increase the adaptability and applicability of DPC; however, the self-adaptive threshold method depends on specific data distribution assumptions, and if those assumptions are not met, performance may suffer. Additionally, it is sensitive to dataset noise. Another variant defines cluster centres as points with both a high neighbourhood density and a large separation distance, and it also provides a new "neighbourhood density" measure. The algorithm assigns points to clusters based on how close they are to the updated cluster centres, although the new definition of a cluster centre may produce different clustering outcomes than the original DPC. The selection of acceptable parameters and sensitivity to data properties may be difficult with the updated approach. The DPC technique is modified by Feng (Feng et al., 2020) to include a dynamic spatial density threshold. In the original DPC algorithm, each point is given a density value according to the quantity of points within a given radius, and the density peaks are then determined by manually setting the density threshold. In this variant, the authors dynamically alter the density threshold based on the spatial distribution of the data points. The variation may demand fine-tuning of other parameters, such as the scaling factor or the algorithm used to calculate the dynamic density threshold, and it can be difficult to choose the right parameter values, which could affect the clustering outcomes.
By combining K-means clustering and the K-nearest neighbour technique, Guo (Guo et al., 2022) suggests an improved version of the DPC algorithm. It begins by using the K-means clustering approach to divide the data into an initial set of clusters. This helps give a preliminary estimate of the cluster centres, which can guide the ensuing DPC process. The authors introduce a K-nearest-neighbours search in the density peak selection stage to identify the density peaks. The approach dynamically modifies the threshold based on the local density and distance to the K nearest neighbours rather than relying on a preset density threshold. As a result, it is possible to respond better to varying cluster densities. After locating the density peaks, the method then goes through a refining stage using K-means clustering. This improves the overall clustering accuracy by more accurately assigning the remaining data points to the appropriate clusters, but the performance of the algorithm can also be affected by the selection of parameters, such as the number of clusters in the initial K-means step and the value of K in the K-nearest-neighbours search. Poor parameter values may yield poor clustering results. To enhance clustering performance and work around some of its drawbacks, DPC can be combined with metaheuristic algorithms. Here are a few sources that discuss how DPC can be combined with metaheuristic algorithms. DPC and the Artificial Bee Colony (ABC) algorithm are combined in Li's (Li et al., 2019) proposal for a hybrid clustering algorithm. It can be difficult to choose the right parameters for DPC and ABC, and adjustment may be necessary to get the best results. The incorporation of two different algorithms increases computational complexity; large datasets may require more memory and runtime as a result. A hybrid DPC approach is put forth by Zhang et al.
(Zhang et al., 2016) that combines the Gravitational Search Algorithm (GSA) and the Artificial Bee Colony (ABC) optimisation process. The population-based optimisation algorithm ABC was motivated by honey bee foraging behaviour, while GSA is a metaheuristic algorithm inspired by the law of gravity. The proposed method is specific to the ABC and GSA pairing. While the solo DPC algorithm may benefit from this hybridisation approach's enhancements, other metaheuristic algorithms or density-based clustering methods may not be as easily adaptable to it. The characteristics and behaviours of ABC and GSA may have a significant impact on the algorithm's efficiency and performance. Moth-Flame Optimisation (MFO) and GSA, two metaheuristic algorithms, are combined with DPC in Jia et al.'s (Jia et al., 2021) hybrid technique. The initial cluster centre design and MFO and GSA's exploration of the search space may have an impact on how well the hybrid algorithm performs. It is crucial to guarantee robustness against various initialisations and dataset properties. Another hybrid clustering approach combines Artificial Fish Swarm Optimisation (AFSO) with DPC. The suggested algorithm combines the exploration and exploitation skills of AFSO with the density and distance measurements of DPC.
Based on the density and distance information, it employs AFSO to optimise the choice of cluster centres and the assignment of data points to clusters. For AFSO to perform at its best, certain parameters must be carefully adjusted, much like with other metaheuristic algorithms. It might be difficult to choose the right parameter settings for AFSO, and doing so requires a lot of testing. The initial setup of cluster centres can affect clustering techniques, including DPC and its variants. Depending on how cluster centres are first assigned and how AFSO explores the search space, the hybrid algorithm's performance may change. The Aquila Optimiser (AO), a novel population-based optimisation technique (Abualigah et al., 2021b), draws its inspiration from the natural behaviour of aquilas as they pursue their prey. In that study, a series of experiments was undertaken to validate the novel optimiser's capacity to identify the best solution for various optimisation problems, demonstrating the AO algorithm's strength. The Reptile Search Algorithm (RSA) (Abualigah et al., 2022), inspired by the crocodile's hunting strategy, is another metaheuristic optimiser. The RSA is demonstrably superior to the compared approaches according to the findings of the Friedman ranking test, and the outcomes of the analysed engineering problems showed that the RSA outperformed various other approaches. The distribution behaviour of the basic arithmetic operators (Abualigah et al., 2021c) is used to propose a novel metaheuristic technique dubbed the Arithmetic Optimisation Algorithm (AOA). To demonstrate the applicability of AOA, its performance was evaluated against 29 benchmark functions and several real engineering design problems. Different scenarios have been used to test the performance analysis, convergence behaviour, and computational complexity of the proposed AOA. According to experimental findings, the
AOA outperforms 11 other well-known optimisation algorithms in solving difficult optimisation problems. The survey in (Abualigah et al., 2021a) gives a comprehensive overview of the Internet of Drones (IoD) landscape, covering a wide range of topics, including the practical applications of IoD technology in domains such as agriculture, healthcare, surveillance, and logistics.
The choice of the cut-off distance for DPC is not studied in any of the research publications mentioned above, and there has been little research on selecting the right cut-off distance using an optimisation method. Our novel approach uses the remora optimisation algorithm to choose the cut-off distance in DPC, which brings benefits including automation, improved clustering quality, adaptability, and customisation. Meanwhile, the subjectivity, lack of adaptability, and manual adjustment in present approaches can result in less successful clustering solutions, particularly on complex or noisy datasets. Our method eliminates these drawbacks and offers a viable way to enhance DPC.
The present methods for choosing the cut-off distance rely on heuristics, which makes the decision subjective and potentially biased. These methods may not adapt well to datasets with variable densities or complex structures, leading to worse clustering results in some situations. The cut-off distance must also be modified manually, which takes time and may not yield the optimal result for every dataset.
Since conventional methods do not explicitly optimise clustering quality indicators, they may produce poor clustering solutions. These methods call for selecting the cut-off distance by visual inspection of plots or graphs, which can be challenging for large or high-dimensional datasets, and current approaches handle noisy data poorly, which can lead to poor cluster identification. Our suggested approach, in contrast, avoids manual and arbitrary selection by applying the ROA to choose the cut-off distance. This data-driven strategy improves adaptability and objectivity. By optimising clustering quality metrics with the ROA, our approach is intended to improve cluster identification compared to heuristic-based methods. The method is flexible and appropriate for a variety of clustering applications, since it can adapt to datasets with different densities and topologies, and the clustering can be customised by including domain-specific knowledge or requirements in the algorithm's optimisation phase. With our method, the cut-off distance selection problem in DPC is approached in a fresh way, potentially advancing the state of the art in clustering methods and resolving the issues discussed above.

About density peak clustering
The three well-known clustering algorithms are K-means clustering (Hartigan & Wong, 1979), the affinity propagation algorithm (Frey & Dueck, 2007), and density peaks clustering (Rodriguez & Laio, 2014b). DPC works on two assumptions: first, the distance between two cluster centres (higher-density datapoints) is high; second, cluster centres are surrounded by datapoints of lower local density. A local density ρ_i and a separation distance δ_i are calculated initially for every datapoint. The datapoint with the highest local density is assigned the maximum separation distance, while every other datapoint takes as δ_i its minimum distance to any point of higher local density. In the local density calculation, shown below, ρ_i counts the points that are closer to datapoint i than the cut-off distance d_c; in traditional DPC this cut-off value is selected randomly, which may affect the clustering negatively. The datapoints with the highest local density and separation distance are chosen as the cluster centroids, and the remaining datapoints are assigned to the nearest cluster centroids. Unlike K-means clustering, DPC need not iterate to obtain accurate clusters. It is therefore robust and effective, works well even when the datapoints are distributed in circular, spherical, curvilinear, or linear shapes, and has shown effective solutions in almost every area. Despite these advantages, DPC lacks a principled selection of the cut-off distance; this paper puts forward a method of utilising ROA in the selection of d_c. The experimental section shows the accuracy of the refined DPC on eight different datasets.
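As a concrete illustration of the two quantities above, the sketch below computes ρ_i and δ_i for a toy two-blob dataset. Breaking density ties by rank order is one common implementation choice, not necessarily the exact procedure of the cited paper.

```python
import numpy as np

def dpc_densities(points, d_c):
    """Local density rho_i: number of points closer than the cut-off d_c.
    Separation distance delta_i: distance to the nearest denser point;
    the densest point receives the maximum pairwise distance instead."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    rho = (dist < d_c).sum(axis=1) - 1          # exclude the point itself
    order = np.argsort(-rho)                    # densest first (ties by index)
    delta = np.empty(len(points))
    delta[order[0]] = dist[order[0]].max()
    for k in range(1, len(points)):
        i = order[k]
        delta[i] = dist[i, order[:k]].min()     # nearest point ranked denser
    return rho, delta

# two well-separated blobs: exactly one point per blob stands out with a
# large delta, marking the two cluster centroids
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
rho, delta = dpc_densities(pts, d_c=0.5)
```

Points with both high ρ and high δ are exactly the centroid candidates read off the decision graph.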

Remora optimisation algorithm
This algorithm is built on the inspiration of the remora fish, which sails from one place to another by attaching to swordfish, whales, and sailfish. Whales hunt alone using the bubble-net feeding method, while sailfish hunt in groups at speeds of up to 100 km/h. The remora are parasites that feed on the leftover food present on the surface of whales and sailfish; if a remora is particularly hungry, it moves at lightning speed to snag a meal. The framework has two movement modes, corresponding respectively to the exploration and exploitation stages. In the exploration phase, a "remora factor" is proposed, which can enhance the accuracy of optimisation and successfully produce convergence. As a result of these efficient procedures, ROA is able to produce promising research findings. The processes of exploration and exploitation are named "Free travel" and "Eat thoughtfully", with the "experience" of the remora serving as the primary basis for phase switching. "Free travel" is further separated into the "SFO strategy" and the "experience attack", which correspond to "large global" and "small global" movements in the algorithm; the "experience attack" is also the primary way for the remora to gain experience. Two feeding modes, the "WOA strategy" and "host feeding", are included.
Different convergence techniques improve the algorithm's stability. By gradually converging on a solution area around the host, "host feeding" enhances the capability of local optimisation. During the exploration part of ROA, the remora performs a global movement called "free travel". In free travel, the movement of the remora on the swordfish is defined using equation 12. Whether to change host or to stay with the same host is decided by calculating R_att: if the fitness value of R_att is greater than the fitness value of R_i^t, the remora switches to another host; otherwise it continues with the same host. If the host is a whale, the movement of the remora along with the whale is given in equation 15. The decision to switch to another host is made by calculating H(i).

If H(i) is zero, the remora switches to the swordfish; otherwise it continues with the whale. The small steps taken on the whale to eat dead skin are shown in equation 20. To increase the accuracy of the optimisation, a "remora factor" is generated in the "host feeding" module based on the various placements of the other whale sections. Figure 4 shows this process more vividly: in the host-feeding stage, the remora looks for food around the host, while in the experience attack the remora takes a small step and decides, via R_att, whether to change host or stay. If the fitness value of R_att is greater than the fitness value of R_i^t, the remora switches to another host; otherwise it keeps the same host. That is why, in the experience attack shown in the figure, the remora changes host from whale to swordfish; the change of host is represented with the yellow arrow. As the whale is also one of the hosts, the WOA strategy is followed and shown in the figure.
The mathematical models of "Free travel" and "Eat thoughtfully" are provided in the subsections below.

Initialisation
The remora position is represented by R_i as shown in equation 8, where "i" represents the remora index and "d" represents the dimension.
The optimal solution (target) is represented by R_best as shown in equation 9. Candidate solutions are evaluated using the fitness function given in equation 10. Similarly, applying the fitness function to R_best gives the best fitness value, shown in equation 11.

Free travel (exploration)
The change in the position of the remora when the swordfish is the host is defined by equation 12, where "t" represents the iteration number and R_rand a random location.
A small step around the host helps decide whether to change host. This scenario is represented in equation 13,
where R_pre represents the position of the previous generation and R_att indicates a tentative step. The factor "randn" is chosen so that the remora makes an active step, which is a "small global" movement.
The fitness value of the current solution, f(R_i^t), is then compared with that of the small step, f(R_att). The judgement of the remora is shown in equations 14 and 15.

Eat thoughtfully (exploitation)
The change in the position of the remora when the whale is the host is defined by equations 15-18,
where "D" is the distance between the remora (on the whale) and the prey based on the current optimal solution, α is a random number in [−1, 1], and a decreases linearly over [−2, −1].
The formula for switching host is shown in equation 19.
Here, H(i) determines the host to which the remora is adsorbed, and its initial value is 0 or 1. If H(i) equals 0, the remora attaches to the sailfish; if H(i) equals 1, it attaches to the whale.

Host feeding
In the host-feeding stage, the remora feeds on the host like a parasite, moving around the body of the whale to eat dead skin and particles. This is shown mathematically in equations 20-23,
where "A" represents the movement of the remora around the host (i.e. small steps) and "C" represents the narrowed position of the remora, reflecting the difference in location between the whale and the remora.
The host-switching stage and tentative steps in ROA allow the algorithm to investigate a wider range of potential cut-off distances, which is helpful when choosing the cut-off distance in DPC. During the switching stage, each remora (solution) is randomly allocated a new host (cut-off distance), which enables the algorithm to bypass local optima and locate more accurate cut-off distances.
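One full ROA iteration, as described in this section, can be sketched compactly. The code below is a minimal, hedged sketch: the SFO-strategy, experience-attack, WOA-strategy, and host-feeding updates follow the forms described above, but the exact coefficients should be taken from equations 12-23 of the original paper, and names such as `roa_minimise` are ours.

```python
import numpy as np

rng = np.random.default_rng(42)

def roa_minimise(f, dim=2, n=20, T=200, lb=-10.0, ub=10.0):
    """Sketch of ROA: free travel (SFO strategy + experience attack)
    and eat thoughtfully (WOA strategy + host feeding), switched by H(i)."""
    R = rng.uniform(lb, ub, (n, dim))      # population (equation 8)
    R_pre = R.copy()                       # previous-generation positions
    H = rng.integers(0, 2, n)              # host flag (equation 19)
    fit = np.array([f(x) for x in R])      # fitness (equation 10)
    best, f_best = R[fit.argmin()].copy(), fit.min()
    for t in range(T):
        for i in range(n):
            if H[i] == 0:                  # swordfish host: SFO strategy
                r = R[rng.integers(n)]
                cand = best - (rng.random() * (best + r) / 2 - r)
            else:                          # whale host: WOA strategy
                alpha = rng.uniform(-1, 1)
                d = np.abs(best - R[i])
                cand = d * np.exp(alpha) * np.cos(2 * np.pi * alpha) + R[i]
            # experience attack: a tentative step decides host switching
            r_att = R[i] + (R[i] - R_pre[i]) * rng.standard_normal(dim)
            if f(r_att) > f(R[i]):
                H[i] = 1 - H[i]            # worse tentative step: change host
            # host feeding: small parasitic move around the host
            v = 2 * (1 - (t + 1) / T)      # step scale shrinks over the run
            b = 2 * v * rng.random() - v
            cand = np.clip(cand + b * (cand - 0.1 * best), lb, ub)
            R_pre[i] = R[i].copy()
            fc = f(cand)
            if fc < fit[i]:                # greedy replacement (our choice)
                R[i], fit[i] = cand, fc
            if fc < f_best:
                best, f_best = cand.copy(), fc
    return best, f_best

# sphere function: global minimum 0 at the origin
best, f_best = roa_minimise(lambda x: float(np.sum(x * x)))
```

The greedy per-remora replacement is our simplification; the published algorithm keeps the whole population moving and only tracks R_best.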

Methodology
The proposed system is built to refine the cut-off (threshold) distance of the existing DPC. The random cut-off distance is a major drawback of traditional DPC; this paper addresses it, to a considerable extent, by using the ROA to select the cut-off distance value.

Novelty
The novelty of our work lies in the integration of the ROA for the automated selection of the cut-off distance in DPC, which has not been previously explored. We demonstrate the effectiveness of this modified DPC approach by achieving superior results compared to both traditional DPC and a range of other clustering algorithms on eight diverse datasets. This combination offers a new solution for enhancing the adaptability and performance of DPC in real-world data analysis, addressing the improper selection of the cut-off distance that otherwise leads to poor clustering results.

Position updation of remora and fitness values
ROA begins by creating a set of search agents at random. Each search agent is evaluated by determining its fitness value using equation 10, and the 23 fitness functions are then applied to check whether the search agent's movement falls within the range of fitness values. After the first pass over all search agents, equation 11 is used to determine the search agent with the best fitness value; a new solution is calculated for each search agent in every cycle. In each subsequent iteration, the remora decides via equation 14 whether to transfer to a new host or stay on the same host, and moves onto the swordfish or the whale using equations 12 and 15, respectively. The remora then uses equations 13 and 20 to update its position. The best position is tracked through the fitness functions until the final iteration.

Optimal threshold value
The best remora positions returned by the fitness functions serve as candidate cut-off distances for DPC. To evaluate the effectiveness of the clustering process, the 23 cut-off distances are substituted in turn, and the improved DPC is applied to the eight datasets. The best threshold value is the one that produces the highest homogeneity and completeness.

Homogeneity and completeness
Homogeneity refers to the uniformity or coherence within each cluster: in a homogeneous cluster, the data points inside the cluster are more similar to one another than to data points in other clusters. Completeness measures whether all data points belonging to the same class or category are placed in the same cluster.
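Both scores have standard entropy-based definitions (as in Rosenberg & Hirschberg's V-measure, which scikit-learn also implements as `homogeneity_score` and `completeness_score`). A small self-contained version, for intuition:

```python
import numpy as np

def _entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def _conditional_entropy(labels, given):
    # H(labels | given): entropy of `labels` inside each group of `given`
    n = len(labels)
    return sum((given == g).sum() / n * _entropy(labels[given == g])
               for g in np.unique(given))

def homogeneity_completeness(true, pred):
    """homogeneity = 1 - H(C|K)/H(C);  completeness = 1 - H(K|C)/H(K)."""
    true, pred = np.asarray(true), np.asarray(pred)
    hc, hk = _entropy(true), _entropy(pred)
    hom = 1.0 if hc == 0 else 1.0 - _conditional_entropy(true, pred) / hc
    com = 1.0 if hk == 0 else 1.0 - _conditional_entropy(pred, true) / hk
    return hom, com

# a clustering that matches the classes up to relabelling is perfect
hom, com = homogeneity_completeness([0, 0, 1, 1], [1, 1, 0, 0])
# splitting one class across two clusters stays homogeneous but loses
# completeness
hom2, com2 = homogeneity_completeness([0, 0, 1, 1], [0, 1, 2, 2])
```

Note that the two scores pull in opposite directions: putting every point in its own cluster is perfectly homogeneous, while one giant cluster is perfectly complete, which is why both are reported together.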

Objective function
The goal of remora-optimisation-based DPC is to find a set of ideal positions with a small solution size and excellent clustering accuracy. The objective function aims to maximise clustering accuracy while minimising the clustering error; a second objective takes the size of the solution into account. The solutions are evaluated using DPC, where X_error is the number of wrongly clustered samples, X_all is the total number of instances, p is the position (solution) of the remora, and s is the total number of positions (solutions).
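With X_error and X_all as defined above, the first objective reduces to the misclustering fraction. A toy version, which assumes the cluster labels have already been aligned with the reference classes:

```python
def clustering_error(predicted, actual):
    """First objective: f1 = X_error / X_all, the fraction of wrongly
    clustered samples (labels assumed aligned with the reference)."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

err = clustering_error([0, 0, 1, 1], [0, 1, 1, 1])  # one of four wrong
```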
The performance of ROA is tested on 23 benchmark functions (fitness functions), which yields the 23 values shown in Table 4; their fitness values are shown in Table 6. The 23 values and fitness values of the whale optimisation algorithm (WOA) are shown in Tables 5 and 7, and those of the Marine Predator Algorithm (MPA) in Tables 5 and 8. These 23 values are used as the cut-off distance for DPC, which produces 23 clustering results. The performance of the clustering results is tested by homogeneity and completeness on the eight datasets, as shown in Tables 9 and 10. The cut-off distance (fitness function value) that yields the best homogeneity and completeness is considered the optimal solution. The eight benchmark datasets are shown in Table 1 and the parameters for the optimisation algorithms in Table 2. The list of companies considered in stock prediction is shown in Table 3.

Steps of proposed system
Step 1: The performance of the remora optimisation algorithm is tested with 23 benchmark functions.
Step 2: The 23 best-fitness values returned by the benchmark functions are collected as candidate cut-off distances.
Step 3: Each value is substituted in turn, instead of a random cut-off distance, into DPC for the calculation of local density.
Step 4: Step 3 produces 23 clustering results, which are evaluated by homogeneity and completeness on the 8 datasets.
Step 5: The clustering result with the best homogeneity and completeness is the optimal solution.
Step 6: The benchmark function value that produced the optimal solution is the optimal cut-off distance.
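The selection loop in the steps above can be sketched end to end. To stay self-contained, the snippet uses a simple gap-threshold grouping as a stand-in for DPC and the Rand index as a stand-in for the homogeneity/completeness scores; the short candidate list stands in for the 23 benchmark-function values.

```python
import numpy as np

def threshold_clusters(points, d_c):
    """Stand-in for DPC on 1-D data: a gap larger than d_c between
    consecutive sorted points starts a new cluster."""
    order = np.argsort(points)
    labels = np.empty(len(points), dtype=int)
    lab = 0
    labels[order[0]] = 0
    for prev, cur in zip(order, order[1:]):
        if points[cur] - points[prev] > d_c:
            lab += 1
        labels[cur] = lab
    return labels

def rand_index(true, pred):
    """Fraction of point pairs on which the clustering agrees with the
    reference (same cluster vs. different cluster)."""
    n, agree, pairs = len(true), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            agree += (true[i] == true[j]) == (pred[i] == pred[j])
    return agree / pairs

pts = np.array([0.0, 0.2, 0.4, 5.0, 5.2, 5.4])
truth = [0, 0, 0, 1, 1, 1]
candidates = [0.1, 0.3, 1.0, 6.0]   # stand-ins for the 23 benchmark values
best_dc = max(candidates,
              key=lambda dc: rand_index(truth, threshold_clusters(pts, dc)))
```

Too small a cut-off fragments the blobs and too large a one merges them; the loop keeps the candidate whose clustering scores best, mirroring steps 3-6.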

Pseudo code of proposed system
(1) Set the population size to N and the total number of iterations to T
(2) Initialise the population R_i (i = 1, 2, ..., N)
(3) While (t < T)
(4) Calculate the fitness value of each remora using equation 10
(5) Calculate the 23 fitness values for R_best, where R_best is the position with the lowest fitness value
(6) For every remora, calculate H(i) using equation 19
(19) Pick the datapoints x_i with the highest δ_i and ρ_i, by drawing the decision graph, as the cluster centroids
(20) Assign the remaining points to the cluster centroids based on distance measure.
(21) Return clusters along with homogeneity and completeness of the clusters.
(22) Repeat steps 3 to 6, replacing the cut-off distance with the F2 to F23 benchmark function values in turn.
(23) Return the benchmark function value that resulted in the optimal homogeneity and completeness of the clusters.

Distance Calculation (DPC)
Distance Metric: d_ij represents the distance between data points x_i and x_j.

DPC Algorithm
Core Points: Core points are those with a local density above a threshold ρ_min.
Reachability Distance: reach_dist_ij is the reachability distance from data point x_i to core point x_j.
Cut-off Distance: cut_off_dist_i represents the novel concept of the cut-off distance for data point x_i.

Remora Optimisation Algorithm (ROA)
Optimisation Function: The objective function for ROA is f(x), which evaluates the clustering quality.

Final Clustering
Clustering Assignment: Assign each data point to clusters based on reachability distances and the chosen cut-off distance cut_off_dist_i*.
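The final assignment step described here attaches every remaining point to the nearest chosen centre; a minimal sketch (variable names are ours):

```python
import numpy as np

def assign_to_centroids(points, centroid_idx):
    """Assign each datapoint to the nearest chosen cluster centre,
    using plain Euclidean distance."""
    centres = points[centroid_idx]
    d = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    return d.argmin(axis=1)

pts = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = assign_to_centroids(pts, [0, 2])  # centroids: points 0 and 2
```

Note that some DPC implementations instead propagate labels from each point's nearest higher-density neighbour; nearest-centroid assignment is the variant stated in this paper.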

Performance measure
Among all the homogeneity and completeness values present in Table 4, Homogeneity = 0.807 and Completeness = 0.699 form the optimal solution; this result comes from fitness function 14. Figure 1 shows the homogeneity and completeness values computed by DPC with ROA on all eight datasets listed in Table 1. Table 10 shows the homogeneity and completeness, along with the corresponding cut-off distance and fitness value, for all eight datasets.
Table 11 shows the Homogeneity, Completeness, Silhouette analysis, and Calinski-Harabasz Index of K-means, DPC, DPC with the remora optimisation algorithm, and DPC with other optimisation algorithms. These results show clearly that DPC+ROA outperforms the other compared algorithms.
Figure 3 shows a comparative study of the Calinski-Harabasz Index (CH-Index).

Strength and weakness of the proposed work
By automating the selection of the cut-off distance, our method eliminates the need for manual decision-making; as a result, parameter adjustment takes less time and effort while objectivity improves. By optimising clustering quality metrics, the method can produce more accurate results and improved cluster identification, and it is flexible enough to handle datasets of various densities and structures. Although the suggested system has numerous benefits, there are some drawbacks. The ROA, like many optimisation techniques, can be computationally demanding: depending on the size and complexity of the dataset, the algorithm's runtime could be a constraint, and running the optimisation procedure may require large computational resources, such as memory and processing power.

Conclusion
In conclusion, this study presents a novel approach that significantly contributes to the field of DPC by introducing the concept of defining the cut-off distance through the ROA. The determination of the cut-off distance is critical in DPC, and this work fills a notable gap in the literature by addressing this key aspect. The primary novelty of our work lies in the establishment of a systematic method for defining the cut-off distance in DPC. By incorporating the ROA, we provide a data-driven and adaptive approach to calculating this crucial parameter, which has not been previously utilised in this context. This methodology enhances the accuracy and effectiveness of DPC, ultimately leading to improved clustering results. To validate the effectiveness of our approach, we conducted comprehensive experiments on 23 benchmark functions and 8 diverse datasets. Our findings indicate that selecting the benchmark function value that yields the best clustering outcome as the optimal cut-off distance significantly enhances clustering performance, showcasing the practical utility of our method across various data domains. However, the reliance on benchmark functions to determine the optimal cut-off distance may introduce sensitivity to the choice of functions: different functions may yield distinct results, potentially limiting the generalisability of the approach. Future work should focus on developing more robust and standardised techniques for cut-off distance determination.
Steps 7-18 of the pseudo code of the proposed system:
(7) if H(i) == 0 then
(8) Update the position of the remora with the swordfish as host using equation 12
(9) Take a tentative step on the swordfish using equation 13
(10) if f(R_i^t) > f(R_att), stay on the same host
(11) else jump to the other host
(12) else if H(i) == 1 then
(13) Update the position of the remora with the whale as host using equation 15
(14) Take small steps on the whale using equation 20
(15) Return the 23 R_best values, one for each of the 23 chosen benchmark functions
(16) Build a matrix of the Euclidean distances between every pair of datapoints
(17) Calculate the local density ρ_i for every datapoint x_i using the F1 benchmark function value as the cut-off distance
(18) Identify the highest distance δ_i for every datapoint x_i from the matrix built in step 16

Table 7 . Fitness function of WOA