A new framework for GEOBIA: accurate individual plant extraction and detection using high-resolution RGB data from UAVs

ABSTRACT Citrus (Citrus reticulata), which is an important economic crop worldwide, is often managed in a labor-intensive and inefficient manner in developing countries, thereby necessitating more rapid and accurate alternatives to field surveys for improved crop management. In this study, we propose a novel method for individual tree segmentation from unmanned aerial vehicle remote sensing (RS) using a combination of geographic object-based image analysis (GEOBIA) and layer-adaptive Euclidean distance transformation-based watershed segmentation (LAEDT-WS). First, we use a GEOBIA support vector machine classifier that is optimized for features and parameters to identify the boundaries of citrus tree canopies accurately by generating mask images. Thereafter, our LAEDT workflow separates connected canopies and facilitates the accurate segmentation of individual canopies using WS. Our method exhibited an F1-score improvement of 10.75% compared to the traditional WS method based on the canopy height model. Furthermore, it achieved 0.01% and 1.38% higher F1-scores than the state-of-the-art deep learning detection networks YOLOX and YOLACT, respectively, on the test plot. Our method can be extended to detect larger-scale or more complex structured crops or economic plants by introducing more finely detailed and transferable RS images, such as high-resolution or LiDAR-derived images, to improve the mask base map.


Introduction
Accurate crop identification and quantity detection play an important role in sustainable agriculture management (Hu, Li, & Hall, 2021). Citrus (Citrus reticulata), which is a globally significant economic crop, is a major fruit crop in China and is important for agricultural economic growth. In developing countries such as China, mobile ground positioning systems are commonly used for individual tree positioning and ground surveys of tree species (Donmez, Villi, Berberoglu, & Cilek, 2021). However, this method is heavily reliant on manual measurements and statistics and is often inefficient and difficult to implement over large areas. Remote sensing (RS) technology can acquire extensive feature information and is frequently applied in agricultural production and monitoring processes owing to its wide spatial coverage, high efficiency, and non-destructiveness (Padua et al., 2022). In particular, the development of unmanned aerial vehicle (UAV) RS technology in recent years has significantly promoted the timely acquisition and dynamic monitoring of crops, water, soil, and other agricultural and forestry ecological elements (Utebayeva, Ilipbayeva, & Matson, 2022). UAVs are more flexible and can use different coupled sensors to obtain high-resolution (HR) or very-high-resolution (VHR) image data at a low cost compared to satellite and airborne RS data acquisition methods (Iqbal, Riaz, Zhao, Barthelemy, & Perez, 2023). Moreover, many studies have demonstrated that UAVs offer significant advantages in crop management, tree species classification, and biomass estimation (Navarro et al., 2020; Maimaitijiang et al., 2020; Hu, Wang, Zhong, & Zhang, 2022). For example, Qin, Zhou, Yao, and Wang (2022) classified 18 tree species in subtropical broad-leaved forests in southern China using UAV-based LiDAR, hyperspectral, and visible-light RGB data. Their method accurately depicted the crown contours of individual broad-leaved trees, achieving an overall accuracy rate of 91.8%. Mao, Ding, Jiang, Qin, and Qiu (2022) used UAV technology to achieve large-scale and multiscale monitoring of above-ground biomass in shrublands by effectively bridging the scale mismatch and spectral bias between satellite observations and ground surveys. Therefore, the utilization of UAV RS for the extraction and detection of citrus trees in VHR images at the landscape scale plays a significant role in achieving intelligent precision agriculture (Zheng et al., 2020).
This study primarily involves two parts: (1) citrus tree crown (CTC) extraction and (2) individual tree crown (ITC) detection. Previous studies have employed independent methods in addressing either one of these aspects, resulting in a more complex hierarchical program structure that may limit the transferability of models. Moreover, modifying the parameter values of existing methods, such as segmentation parameters or machine learning hyperparameters, can lead to significantly different results. However, when processing images from different scenes, there is no clear standard for determining the optimal parameter configuration. Therefore, it is necessary to establish a simpler, coherent, and unified framework that exhibits good mapping capabilities in certain scenarios, such as land cover in agricultural contexts.
Currently, RS data extraction (or classification) methods are typically divided into pixel- and object-based methods. Pixel-based (PB) methods are preferred for low- to medium-resolution satellite RS images such as Sentinel and Landsat multispectral images (Liu & Abd-Elrahman, 2018). However, PB methods commonly suffer from salt-and-pepper noise when dealing with high-resolution data, and they fail to utilize features effectively, which limits their ability to meet the demand for more detailed geospatial information (Lou et al., 2023). Geographic object-based image analysis (GEOBIA) is an alternative to PB methods that first segments the various bands of an RS image and merges homogeneous pixels into several non-overlapping objects in a bottom-up manner (Lourenço et al., 2021; Feng & Fan, 2021). Subsequently, the spectral, textural, and terrain features are extracted from these homogeneous objects, following which classification algorithms such as the support vector machine (SVM), random forest (RF), and k-nearest neighbors (KNN) are used as supervised classifiers to obtain mask images that are close to the true canopy surface (Guo et al., 2021). VHR images contain rich textures; that is, similar patterns with strong or weak regularity, which can improve the separability between objects (Luo & Ji, 2022). Yang, Wang, Cao, Wu, and Zhang (2023) used GF-2 images to estimate soil salinity by combining spectral and textural features. The experimental results demonstrated that the combination of features with textural factors yielded higher determination coefficients (R²) and lower root mean square errors in different classifiers than spectral features alone. Moreover, Yang, Zhang, Wang, and Lai (2022) introduced a canopy height model (CHM) into the process of extracting olive trees and found that the CHM made the greatest contribution to the accuracy of the extraction results. Their results revealed that the CHM could accurately reflect the true height of ground objects and could quantify the vertical structure of forests or other vegetation areas. These studies demonstrate that the GEOBIA method can smooth image noise and use spectral, textural, and contextual features in image objects more effectively. However, as the feature space dimension increases, the secondary features may act as noise (Zhou et al., 2021). Therefore, feature space dimension reduction is a key issue to be addressed when using the GEOBIA method.
Watershed segmentation (WS) is one of the most widely used individual tree crown delineation algorithms, and it is effective in separating tree crown structures with shape regularity and height differences between individuals (Lassalle, Ferreira, La Rosa, & de Souza Filho, 2022). WS is an edge-based unsupervised segmentation algorithm that converts an image into a gradient resembling a terrain surface and identifies the influence region of each local minimum in the image as a single object for segmentation (Yun et al., 2021). The CHM is commonly used for tree crown detection to enhance WS delineation (Qin et al., 2022). However, the following factors result in the poor performance and generalization ability of traditional WS methods in ITC detection (Lassalle et al., 2022): (1) as morphological differences and height variations exist in tree crowns, the CHM that is generated from sparse point clouds cannot accurately delineate the true edges of tree crowns, (2) the presence of fallen leaves and understory vegetation may cause over-segmentation, and (3) the overlap of two or more crown layers leads to under-segmentation of connected crown individuals. In recent years, object detection or instance segmentation algorithms in deep learning (DL) have made significant progress in ITC detection tasks owing to the widespread use of VHR images (Liu, Zhang, & Yang, 2023). Compared with traditional image processing methods, DL methods can overcome environmental noise to a certain extent and improve detection accuracy (Zhou et al., 2022). Although the robustness and noise resistance of DL methods are superior to those of traditional image processing methods, the lack of interpretability of the internal network operation is a major challenge in understanding how the model makes decisions or predictions (Nguyen, Kellenberger, & Tuia, 2022). Furthermore, the training of a complex network model may consume considerable time and computational resources (Liu et al., 2023). Therefore, traditional computer vision algorithms (including WS and unsupervised image processing algorithms such as mathematical morphology) that operate directly on grayscale images are a simpler, more flexible, and resource-efficient option when the computing resources are limited (Qin et al., 2022).
The remainder of this paper is organized as follows: The data acquisition, pre-processing, and detailed experimental procedures relating to the study area are described in Section 2. Section 3 presents the experimental results and corresponding discussions. Finally, the conclusions are provided in Section 4. The experiments were conducted on a computer system that was equipped with an AMD Ryzen R7-3700X processor (3.60 GHz), 64 GB of DDR4 memory running the Windows 10 operating system, and an NVIDIA GeForce RTX 3060 graphics card with 12 GB of memory.

Workflow overview
In this study, we used RGB data acquired from a UAV in conjunction with an improved GEOBIA framework for CTC extraction and ITC detection. Figure 1 illustrates the detailed workflow of this study.
The framework comprises the following steps:
(1) Data acquisition and pre-processing: This step encompasses field data acquisition through photogrammetric surveying in the study area. Subsequently, fine-resolution CHM and CTC datasets are created, serving respectively as input features in the machine learning (ML) model within the GEOBIA workflow and for training the DL neural network model.
(2) Improving the GEOBIA workflow for CTC extraction: The workflow is enhanced by implementing multi-resolution segmentation to generate geographically matched objects from the imagery and compiling feature attributes such as spectral, textural, geometric, and topographic characteristics. An optimized heuristic strategy is employed to adjust the parameter configuration of the ML model, facilitating training and achieving optimal performance evaluation. Additionally, feature selection is conducted on the model's input features to eliminate redundant attributes and generate more robust and accurate CTC results.
(3) ITC detection using the combined GEOBIA and layer-adaptive Euclidean distance transformation-based watershed segmentation (LAEDT-WS) method: Following the acquisition of accurate crown images through GEOBIA, further morphological post-processing is performed on the output images, involving operations such as opening, closing, filtering, and hole filling to refine the boundaries of tree crowns. Subsequently, the LAEDT-WS algorithm developed in this study is utilized to obtain the final ITC detection results.

Study area and data collection
The study site was located in a citrus plantation area in Yong'an City, Fujian Province, China (117.4°E, 26.0°N, 147 to 219 m a.s.l.) (Figure 2). The area is within a subtropical monsoon climate zone, with abundant rainfall (average annual precipitation of 1,574 mm), ample heat resources (average annual temperature of 19.1 °C), and long sunshine hours (average annual sunshine hours of 1,766.1 h), thereby providing favorable climate conditions for the growth of citrus trees. The citrus trees in the plantation were in the early growth stage, with an average tree height of approximately 1.46 m and an average distance between trees of approximately 3.32 m (based on data from 200 citrus trees). Aerial photogrammetry using a UAV was conducted on August 1, 2022, from 11:00 to 11:30 am, under clear, cloudless conditions with a relatively high solar altitude angle to minimize the effect of shadows on the RGB data (Liu & Abd-Elrahman, 2018). We used a consumer-grade UAV (DJI Phantom 4 Multispectral) with a color sensor (1/2.9-inch CMOS, effective pixels: 2.08 million) for visible light imaging to capture the RGB images. DJI GS PRO software (DJI Technology Co., Ltd., Shenzhen, China) was used for flight planning and automatic flight tasks. The flight task was performed in two stages with both the forward and lateral overlaps set at 80% and a height of 80 m. All parameters were maintained constant and 1,257 aerial images with a resolution of 1,600 × 1,300 were collected during the flight task. Table 1 presents detailed information on the UAV and flight parameters.

Data pre-processing and data augmentation
We used the Pix4Dmapper software (Prilly, Switzerland) based on the structure from motion algorithm to reconstruct the 3D structure of the scene from the acquired RGB images (Meinen & Robinson, 2020). The specific steps included image import, feature point matching, and 3D point cloud construction, which generated a 3D model with an HR mesh texture. Finally, a digital surface model (DSM) and digital orthophoto map (DOM) were constructed based on the 3D model of the scene, with a ground sample distance of 0.033 m per pixel. All processing steps in the Pix4Dmapper software were executed with the quality parameter set to high to achieve a balance between output accuracy and processing efficiency. Furthermore, the CHM was generated by subtracting the digital terrain model that was interpolated from dense ground points (approximately 3.9 points per square meter) from the DSM (Meddens et al., 2018).
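The CHM computation described above is a cell-wise raster difference. The following is a minimal sketch, assuming the DSM and the interpolated terrain model are already co-registered grids of equal shape; the function name `canopy_height_model`, the `nodata` handling, and the clipping of negative heights to zero are illustrative assumptions and not part of the described workflow.

```python
def canopy_height_model(dsm, dtm, nodata=None):
    """Return CHM = DSM - DTM per cell (hypothetical helper).

    Negative differences are clipped to 0.0; this clipping is an
    assumption made for the sketch, not a step stated in the paper.
    """
    if len(dsm) != len(dtm) or any(len(a) != len(b) for a, b in zip(dsm, dtm)):
        raise ValueError("DSM and DTM must share the same grid shape")
    chm = []
    for surface_row, terrain_row in zip(dsm, dtm):
        row = []
        for surface, terrain in zip(surface_row, terrain_row):
            if nodata is not None and (surface == nodata or terrain == nodata):
                row.append(nodata)  # propagate missing cells unchanged
            else:
                row.append(max(surface - terrain, 0.0))
        chm.append(row)
    return chm


# Toy 2 x 2 grids in metres above sea level (made-up values).
dsm = [[150.0, 151.5], [150.2, 149.9]]
dtm = [[150.0, 150.0], [150.0, 150.0]]
chm = canopy_height_model(dsm, dtm)
```

In practice the same subtraction would typically be applied to whole raster arrays rather than nested lists; the list-based form is used here only to keep the sketch dependency-free.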
We produced 200 images of 600 × 600 pixels from aerial photographs and DOM slices, and manually drew the regions of interest (ROIs) on these images using the open-source software Labelme (https://github.com/wkentaro/labelme) to obtain a set of samples for training the DL model. To prevent overfitting owing to the small sample size and to improve the robustness and generalization ability of the model, we performed image augmentation, including rotation (of only the rectangular boxes), brightness changes, and mirroring of the original images, thereby expanding the number of images in the dataset to 1,500. Moreover, we selected test images only from the original dataset and divided the augmented dataset into training and validation sets with a 9:1 ratio to prevent augmented images from appearing in the test set. The final dataset contained 1,350 images for training, 150 images for validation, and 50 images for testing.
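The split described above can be sketched as follows. This is an illustration only: the helper `split_dataset`, the tile names, and the factor of ten augmented copies per remaining original are assumptions chosen so that the counts match the reported 1,350/150/50 split; the text does not state how many variants were generated per image.

```python
import random


def split_dataset(originals, n_test, copies_per_image, val_fraction, seed=0):
    """Hold out original images for testing, then augment and split the rest.

    Augmentation is represented only by name suffixes; a real pipeline
    would apply rotation, brightness changes, and mirroring here.
    """
    rng = random.Random(seed)
    shuffled = originals[:]
    rng.shuffle(shuffled)
    test = shuffled[:n_test]            # originals only, never augmented
    remaining = shuffled[n_test:]
    augmented = [f"{name}_aug{k}" for name in remaining
                 for k in range(copies_per_image)]
    rng.shuffle(augmented)
    n_val = round(len(augmented) * val_fraction)
    return augmented[n_val:], augmented[:n_val], test


originals = [f"tile_{i:03d}" for i in range(200)]  # hypothetical file stems
train, val, test = split_dataset(originals, n_test=50,
                                 copies_per_image=10, val_fraction=0.1)
```

Holding out the test tiles before augmentation is what keeps augmented variants of a test image from leaking into training or validation.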

Image segmentation
Image segmentation is a fundamental step in GEOBIA, which involves transforming an image into homogeneous objects based on specific features (such as spectral and shape features) and scale parameters. The mainstream methods in the field of RS include (1) region-based segmentation: multi-resolution segmentation (MRS), mean shift segmentation (MSS), etc., (2) edge-based algorithms: primarily using WS, and (3) PB methods. Table 2 describes these methods. Among them, MRS is the most commonly used method for addressing the scale invariance problem in GEOBIA (Dao, Mantripragada, He, & Qureshi, 2021). The segmentation scale is the most important parameter for controlling the MRS process (Hossain & Chen, 2019). Over-segmentation may occur when the segmentation scale is excessively small, thereby increasing the computational burden and potentially negatively affecting the quality of the features that are extracted for each object, such as the textural features. Conversely, an overly large segmentation scale may lead to under-segmentation, which can cause a lack of consistency in the pixels within individual objects. The proprietary software eCognition v9.0 (Trimble Germany GmbH, Munich, Germany) provides a tool for the automatic determination of the segmentation scale parameter known as 'ESP2'; guided by this tool in combination with expert visual judgement, the scale parameter was set to 10.

Feature extraction and object-based training
We extracted 45 features from each segmented object, including spectral, geometric, and textural features, to separate the CTC and background from the segmented images accurately. In particular, we created a CHM and spectral indices to enhance the object feature information. Detailed information regarding these features is provided in Table 3. All compiled features were input into parameterized ML models for the CTC extraction. Three ML models (KNN, SVM, and RF) were selected for this study because they are widely used in tree species classification and image recognition tasks. A total of 148,082 objects were created after segmentation in the tested area and 27,793 objects were randomly selected as samples (positive and negative samples at a ratio of 13,951:13,842) for the model training.

Model parameter optimization
The selection of model parameters can significantly affect the model performance (Gonçalves, Pôças, Marcos, Mücher, & Honrado, 2019). In this study, we used a heuristic algorithm known as particle swarm optimization (PSO) to optimize the hyperparameters of the ML model. PSO is a metaheuristic optimization algorithm that simulates the movement and social behavior of particles to search for a global optimal solution of the model parameters (Li et al., 2018). We set the PSO parameters and iteration numbers to a conservative level (as shown in Table 4) to reduce the computation time for each task; however, this was sufficient to find an approximate solution to the optimal configuration.
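To make the search procedure concrete, the following is a minimal PSO sketch using the Table 4 values (velocity weight 0.9, `cogncomp` 2, `soccomp` 2, `maxiter` 100). Several parts are assumptions for illustration: the argument name `inertia`, the swarm size of 20, the velocity clamping, and the toy objective, which stands in for the cross-validated classification error of an ML model over two log-scaled hyperparameters.

```python
import random


def pso(objective, bounds, n_particles=20, maxiter=100,
        inertia=0.9, cogncomp=2.0, soccomp=2.0, seed=0):
    """Minimise `objective` inside box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    history = [gbest_val]  # best value found so far, one entry per iteration
    for _ in range(maxiter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                v = (inertia * vel[i][d]
                     + cogncomp * r1 * (pbest[i][d] - pos[i][d])
                     + soccomp * r2 * (gbest[d] - pos[i][d]))
                vmax = 0.5 * (hi - lo)  # velocity clamp (assumption)
                vel[i][d] = max(-vmax, min(vmax, v))
                pos[i][d] = max(lo, min(hi, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history


def toy_error(params):
    """Stand-in for a validation error surface; minimum at (1, -2)."""
    log_c, log_gamma = params
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2


best, best_val, history = pso(toy_error, bounds=[(-3.0, 3.0), (-5.0, 1.0)])
```

In an actual hyperparameter search, `toy_error` would be replaced by a function that trains the classifier with the candidate parameters and returns its validation error, which is why a small iteration budget matters for the computation time.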
Table 4. PSO parameter settings (recovered rows):
Weight assigned to the particle velocity: 0.9
cogncomp (step size of particle flight towards its personal best): 2
soccomp (step size of particle flight towards its global best): 2
maxiter (maximum number of iterations): 100
Table 2 (recovered fragment, PB methods): avoids the computational complexity introduced by image segmentation; the effective utilization of features is not achieved (Ye, Pontius, & Rakshit, 2018).
We used a tree-based method to calculate the feature importance and provide the corresponding feature importance measures (FIMs) for the feature parameter optimization. This was achieved by calculating the average reduction in the Gini impurity (i.e., the degree of disorder in the data resulting from the tree partitioning) of the features during the model prediction process, which measures the impact of the input features on the model prediction accuracy (Eq. (1)) (Ling et al., 2022). In Eq. (1), FIM represents the average decrease in the Gini impurity of the j-th feature in the decision tree and n represents the number of decision trees in the RF. Subsequently, we ranked the features in descending order of importance and performed feature optimization using a wrapper approach: the number of input features was increased incrementally and the ML models were constructed repeatedly to determine the performance of the models under different feature sets. The feature selection module of the scikit-learn library in Python was employed in this study.
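The wrapper procedure described above (rank by importance, then grow the feature set one feature at a time and re-score the model) can be sketched as follows. This is an assumption-laden illustration rather than the study's code: `incremental_feature_selection` is a hypothetical helper, the feature names and importance values are invented apart from the fact that a mean CHM feature ranks first, and `toy_score` replaces the cross-validated accuracy of a real classifier.

```python
def incremental_feature_selection(importances, score_fn):
    """Evaluate the top-k ranked features for k = 1..N and keep the best k.

    importances: dict mapping feature name -> importance measure (e.g. FIM)
    score_fn:    callable taking a list of feature names, returning a score
    """
    ranked = sorted(importances, key=importances.get, reverse=True)
    best_subset, best_score = [], float("-inf")
    scores = []
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        score = score_fn(subset)
        scores.append((k, score))
        if score > best_score:  # strict ">" keeps the smallest best subset
            best_subset, best_score = subset, score
    return best_subset, best_score, scores


# Hypothetical importance values and a toy scoring rule: three features
# carry signal, the others only add a small penalty (acting as noise).
importances = {"mean_chm": 0.18, "exg": 0.10, "mean_red": 0.07,
               "glcm_entropy": 0.04, "shape_index": 0.01}
useful = {"mean_chm": 0.30, "exg": 0.10, "mean_red": 0.05}


def toy_score(subset):
    gain = sum(useful.get(name, 0.0) for name in subset)
    penalty = 0.01 * sum(1 for name in subset if name not in useful)
    return 0.5 + gain - penalty


selected, best_score, scores = incremental_feature_selection(importances, toy_score)
```

With scikit-learn, the importance dictionary could be filled from the `feature_importances_` attribute of a fitted `RandomForestClassifier`, which reports impurity-based importances, and `score_fn` could wrap `cross_val_score` on the candidate columns.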

Automated ITC detection of citrus trees
Automated ITC detection is a popular research topic in intelligent agricultural resource monitoring (Murray, Gullick, Blackburn, Whyatt, & Edwards, 2019). This section primarily encompasses two research topics: (1) the proposal of layer-adaptive Euclidean distance transformation-based WS (LAEDT-WS) and (2) a comparison with DL convolutional neural network (DL-CNN) models.

Morphological operation
The output images of GEOBIA processes require post-processing through morphological operations. Morphological operations typically involve image filtering, morphological opening and closing, and hole filling, among other compound operations, which are primarily achieved by computing the spatial distance or grayscale difference between each pixel and its neighboring pixels, thereby adding or removing certain key foreground or background pixels (Wang et al., 2021). These morphological methods not only smooth the object edge details and fill the gaps but can also eliminate isolated noise that may result from image classification.
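As an illustration of how opening and closing act on a binary crown mask, the sketch below implements both with a 3 × 3 square structuring element on a small grid. The structuring element, the order of operations (closing first, then opening), and the toy mask are assumptions; the text does not specify the kernels or the sequence that were used.

```python
def _neighbourhood(img, r, c):
    """Yield the 3x3 neighbourhood of (r, c); outside the image counts as 0."""
    rows, cols = len(img), len(img[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            yield img[rr][cc] if 0 <= rr < rows and 0 <= cc < cols else 0


def erode(img):
    """Keep a pixel only if its whole 3x3 neighbourhood is foreground."""
    return [[int(all(_neighbourhood(img, r, c))) for c in range(len(img[0]))]
            for r in range(len(img))]


def dilate(img):
    """Set a pixel if any pixel in its 3x3 neighbourhood is foreground."""
    return [[int(any(_neighbourhood(img, r, c))) for c in range(len(img[0]))]
            for r in range(len(img))]


def opening(img):
    return dilate(erode(img))   # removes small isolated foreground specks


def closing(img):
    return erode(dilate(img))   # fills small holes and gaps


# Toy mask: a 3x3 crown with a one-pixel hole, plus one isolated noise pixel.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = opening(closing(mask))
```

The order matters in this toy case: applying the opening first would erase the ring-shaped crown entirely, because no pixel of the ring has a fully foreground neighbourhood until the hole has been filled.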

LAEDT-WS
The combination of the EDT with image thresholding is a viable solution for addressing the under-segmentation of connected objects in the WS approach (Xue, Zhao, & Zhang, 2021). This method generates an unweighted distance map by calculating the minimum Euclidean distance between the foreground and background pixels in a binary image. Subsequently, the edge pixels are eliminated using a manually set threshold. However, it is often challenging to determine this threshold, and incorrect threshold values may result in either under-segmentation or a loss of foreground pixels. To address this issue, an LAEDT-WS workflow was developed for accurate individual tree separation and counting (Figure 3). First, the global mean contour area $\mathrm{Area}_{mean}$ of the image is calculated and the rectangular coordinate range $\mathrm{ROI}(C_i)$ of each contour with an area greater than the mean area is recorded:
$$\mathrm{Area}_{mean} = \frac{1}{|C|} \sum_{C_i \in C} \mathrm{Area}(C_i)$$
where $C$ represents the set of all contours, $|C|$ is the number of contours, each contour is represented by $C_i$, and $\mathrm{Area}(C_i)$ represents the area of contour $C_i$.
Then, the EDT algorithm is employed to calculate the distance map for the contours marked as ROI:
$$\mathrm{EDT}(x, y) = \min_{i} \sqrt{(x - x_i)^2 + (y - y_i)^2}$$
where $\mathrm{EDT}(x, y)$ is the minimum distance between each pixel point $(x, y)$ and the background points $(x_i, y_i)$.
Meanwhile, the distance map generated after applying the distance transform is subjected to a small segmentation threshold (in this study, this threshold was set to 0.1 × the maximum grayscale value) to eliminate edge pixels (Eq. (5)), which minimizes the disappearance of crown individuals.
The above steps are repeated iteratively to gradually refine the crown layer through multiple thresholding and ROI adjustment. This loop continues until the maximum number of iterations 'n' is reached. To reduce the redundant computation incurred by always running to the maximum iteration, we also set a global mean area reduction threshold 'k' that terminates the loop early once $\mathrm{Area}_{mean}$ no longer changes. Finally, the watershed algorithm is employed to identify the local minima of the crown layer gradient, 'watershed lines' are constructed between adjacent crowns, and the contours and bounding rectangles are recorded to achieve ITC detection.
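The iterative part of the workflow above can be sketched on a toy binary mask. This is a simplified illustration under several assumptions, not the authors' implementation: contours are approximated by 4-connected components, the EDT is computed by brute force against all background pixels of the image rather than inside each ROI rectangle, the final watershed step is omitted so the function returns only the separated crown markers, and the names `components`, `laedt_markers`, and `make_mask` are hypothetical.

```python
from collections import deque
from math import hypot


def components(mask):
    """Return 4-connected foreground components as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen, comps = set(), []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                seen.add((r, c))
                queue, comp = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
                comps.append(comp)
    return comps


def laedt_markers(mask, ratio=0.1, max_iter=10, k=1e-6):
    """Erode only larger-than-average components by EDT thresholding."""
    mask = [row[:] for row in mask]
    prev_mean = None
    for _ in range(max_iter):
        comps = components(mask)
        background = [(r, c) for r, row in enumerate(mask)
                      for c, v in enumerate(row) if not v]
        if not comps or not background:
            break
        mean_area = sum(map(len, comps)) / len(comps)
        if prev_mean is not None and abs(prev_mean - mean_area) < k:
            break  # mean area stopped changing: terminate early
        prev_mean = mean_area
        to_clear = []
        for comp in comps:
            if len(comp) <= mean_area:
                continue  # small crowns are left untouched
            dist = {(r, c): min(hypot(r - br, c - bc) for br, bc in background)
                    for r, c in comp}
            cutoff = ratio * max(dist.values())
            to_clear += [px for px, d in dist.items() if d < cutoff]
        for r, c in to_clear:
            mask[r][c] = 0
    return components(mask)


def make_mask():
    """Two 5x5 crowns joined by a one-pixel bridge, plus one 2x2 crown."""
    mask = [[0] * 17 for _ in range(7)]
    for r in range(1, 6):
        for c in list(range(1, 6)) + list(range(7, 12)):
            mask[r][c] = 1
    mask[3][6] = 1
    for r in (1, 2):
        for c in (14, 15):
            mask[r][c] = 1
    return mask


markers = laedt_markers(make_mask(), ratio=0.5)
```

Because the toy crowns are only five pixels wide, the sketch is run with a ratio of 0.5; with the 0.1 ratio reported in the text, no pixel of such a small crown falls below the cutoff, whereas crowns spanning tens of pixels at a 0.033 m ground sample distance would lose a thin rim per pass. The 2 × 2 crown is never thresholded because its area stays below the mean, which is the behaviour the layer-adaptive design aims for.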

DL-CNN
The DL-CNN has become a research hotspot in tree crown delineation in recent years owing to its high robustness and generalization performance (Zhang et al., 2021). In this study, we selected four object detection networks: YOLOX (Ge, Liu, Wang, Li, & Sun, 2021), YOLOv7 (Wang, Bochkovskiy, & Liao, 2022), Faster R-CNN (Ren, He, Girshick, & Sun, 2017), and SSD (Liu et al., 2016), as well as two instance segmentation networks, namely YOLACT (Bolya, Zhou, Xiao, & Lee, 2019) and Mask R-CNN (He, Gkioxari, Dollar, & Girshick, 2020). We set the same number of iterations, optimizer, and learning rate adjustment strategies for these models (Table 5), and the remaining parameters (such as the initial learning rate and batch size) were determined through repeated engineering experiments based on different model fitting conditions. Furthermore, we used pretrained weights from an open-source dataset (MS COCO) for transfer learning (Lin et al., 2014), which could enable the model to learn the general underlying features and accelerate network convergence. The complex-structured neural networks were compared with the proposed model to provide a more comprehensive evaluation and comparative analysis of the models.

Assessment
We designated three plots in the study area and used multiple complementary indicators to evaluate the different tasks (Figure 4) for feature extraction and detection. First, we calculated the general indicators, namely the accuracy and kappa coefficient (kappa) (Eqs. (7) to (9)), as measures for the initial evaluation of the accuracy and consistency of the ML classifier. Subsequently, the accuracies of the CTC extraction and ITC detection results were evaluated using recall and precision (Eqs. (10) and (11)). These indicators were calculated using the binary confusion matrix presented in Table 6. In CTC extraction tasks, the recall represents the ratio between the true positives and the total number of true CTC pixels, whereas the precision represents the ratio between the true positives and the number of pixels that are predicted as CTC by the classifier.
The pixel statistics of the test plots were obtained using the pixel-level annotation of CTC by experienced experts. In ITC detection tasks, the recall represents the ratio between the number of true positive detections and the total number of actual trees, whereas the precision represents the ratio between the number of true positive detections and the total number of predicted trees.
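The indicators defined above follow directly from the confusion-matrix counts. The sketch below uses the standard definitions of precision, recall, the F1-score (their harmonic mean), accuracy, and Cohen's kappa; the function names are hypothetical. In the detection example, TP = 770 and FN = 3 follow from the 770 correctly detected trees out of 773 reported later in the paper, whereas FP = 10 is a placeholder assumed purely for illustration. The pixel counts in the kappa example are invented.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1-score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy_and_kappa(tp, tn, fp, fn):
    """Overall accuracy and Cohen's kappa from a binary confusion matrix."""
    total = tp + tn + fp + fn
    observed = (tp + tn) / total
    # Agreement expected by chance from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# ITC detection example: FP = 10 is an assumed placeholder value.
precision, recall, f1 = detection_metrics(tp=770, fp=10, fn=3)

# CTC extraction example with invented pixel counts.
accuracy, kappa = accuracy_and_kappa(tp=40, tn=45, fp=5, fn=10)
```

Reporting the F1-score alongside precision and recall is useful here because over-segmentation mainly inflates false positives while under-segmentation mainly inflates false negatives, and the harmonic mean penalizes either failure mode.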
The actual number of citrus trees was obtained from a field survey of the three plots. Moreover, we used the F1-score (Eq. (12)), which is a comprehensive metric that considers both the precision and recall.
Figure 5 presents a visual representation of the cross-evaluation results for three segmentation methods (MRS, MSS, and WS) and one non-segmentation method (PB) with three machine learning classifiers. Figure 6 shows the pixel-level evaluation results in a confusion matrix, comparing the methods used in Figure 5 with the manually annotated ground truth labels. Moreover, to emphasize the advantages of object-based analysis, Figure 7 presents visual results when applying the KNN, RF, and SVM algorithms at the single-tree scale using both PB and GEOBIA approaches. The data from Figures 5-7 demonstrate that the employed method achieved highly satisfactory results in distinguishing CTC and non-CTC regions in the test images. These data indicate a common conclusion that the segmentation algorithm selection is crucial for any object-based analysis method. The visual results in Figure 5 and the confusion matrix in Figure 6 reveal that the use of MRS yielded significantly higher accuracy for the SVM, RF, and KNN algorithms compared to the other segmentation methods (overall accuracy: KNN, 93.83%; RF, 94.55%; SVM, 94.72%). These statistics are based on a total of 2,856,131 pixels (including TP, TN, FP, and FN) in the test images, consisting of 1,178,819 CTC pixels and 1,677,312 non-CTC pixels.
The data in Figure 6 also indicate that region-based segmentation in GEOBIA achieves higher accuracy compared to edge-based segmentation. It can be observed that the edge-based WS algorithm tends to exhibit over-segmentation (seen in Figure 5(c), (g), (k)). This characteristic is further reflected in the classification results of the WS algorithm, which show a high number of FP (KNN: 124,910 pixels; RF: 128,004 pixels; SVM: 129,523 pixels). This can be attributed to the complex texture or noise patterns present in HR RS images, which make it challenging for edge-based segmentation algorithms to accurately capture optimal object boundaries. Although some research has improved the WS by generating gradient images or pre-identifying edges (Wang & Li, 2014), improving the over-segmentation results of WS is still challenging and introduces additional computational complexity.
Another noteworthy point is that pixel-based classification does not necessarily produce the worst accuracy. This is because object-based analysis methods, influenced by the segmentation quality, may amplify errors within individual objects, leading to error accumulation. However, pixel-based methods are prone to salt-and-pepper noise. By conducting pixel-level evaluations, we can quantitatively detect this effect (Figure 7). The data from Figures 5-7 consistently demonstrate that the combination of high-quality segmentation objects provided by MRS and the automated scale parameter optimization tool ESP2 yields the best results. Additionally, in this case, we observe that the SVM classifier outperforms the other two classifiers in all segmentation modes. Therefore, in our example, the combination of MRS and SVM is the best solution for GEOBIA.

GEOBIA based on parameter optimization
Table 7 presents the quantitative evaluation results of the model performance under the default and optimized parameter settings.
As shown in Table 7, the ML algorithms using the optimal parameter settings outperformed those using the default parameter settings for all accuracy evaluation metrics. Specifically, the KNN and RF exhibited improvements of 0.62% and 0.42% in accuracy, and 0.83% and 0.64% in the F1-score, respectively, with the optimized parameter settings. However, the performance of the SVM model was poor under the default parameters, mainly because of the choice of the kernel function. The linear kernel struggles to make decisions on complex nonlinear data, which resulted in a lower classification accuracy (accuracy = 86.32%, F1-score = 83.56%). After optimizing the parameters, especially the kernel function, the SVM achieved the highest classification accuracy (accuracy = 94.72%, F1-score = 93.44%). The results indicate that the SVM exhibited higher potential and flexibility in this study, and because fewer parameters needed to be searched compared to when using the RF, the SVM was the best option for the GEOBIA.

Feature optimization and importance ranking
The accuracy and processing time are important indicators for evaluating the model performance in image classification (Lan et al., 2020). We selected the SVM, which exhibited the best performance with the optimized parameter settings, for additional feature optimization to improve the performance of the classification algorithm further. Figure 8(a) presents the FIM for the top 25 features (out of 45), including nine vegetation indices (9/9), eight spectral features (8/8), four textural features (4/8), two terrain features (2/2), and two geometric features (2/18). The introduction of the mean CHM feature that was derived from the point cloud reconstruction provided the highest classification contribution (approximately 18%). This can be attributed to the fact that this feature is not affected by image noise at high resolutions and is more robust to environmental changes (such as lighting conditions) during data collection. This result is consistent with the findings of previous studies (Yang et al., 2022; Liu, Coops, Aven, & Pang, 2017). Moreover, the construction of spectral and textural features provided more comprehensive feature attributes for the CTC extraction.
We tested the relationships between (1) the number of features and the SVM model accuracy, and (2) the number of features and the time required for the SVM to compute these features under the optimized parameter settings with five-fold cross-validation.
It can be observed from Figure 8(b) that the model achieved the highest accuracy on the validation set when the number of features was 25 (accuracy = 99.07%), which was slightly higher (less than 1%) than the accuracy that was achieved when all 45 features were used, thereby indicating the presence of redundant input features that had a negative impact on the classification. Furthermore, the computation time of the SVM algorithm based on the RBF kernel did not increase linearly with the number of features. This may be owing to the addition of several effective features that enhanced the separability between different features and reduced the number of support vectors that were required for decision-making, thereby reducing the time that was required for the SVM optimization/inference. Thus, the selected feature set saved 29.82% of computation time and achieved a performance improvement of 0.07% for the SVM model. This means that feature optimization eliminates redundant features and significantly reduces the computational cost of the model.

ITC detection based on LAEDT-WS
A new ITC detection method was developed that combines high-quality base maps generated using GEOBIA with LAEDT-WS. The effectiveness of this method is demonstrated in Figure 9, which shows the detection results and the TP, FP, and FN trees in an example study plot.
The figure shows that the overlapping edges of tree crowns in plantations, which arise from differences in planting time, water and heat conditions, and planting spacing, pose a significant challenge for accurate ITC detection. The EDT method can eliminate overlapping pixels at the crown edges by thresholding the distance map. However, in previous studies, it has often been difficult to determine an accurate segmentation threshold for separating tree crowns because of the varying degrees of crown overlap and irregular crown shapes. This may result in the loss of small crown areas or the inadequate segmentation of densely growing tree crowns. To address this deficiency, the LAEDT-WS method generates a separate distance map for each crown, sized according to the pixel area that the crown occupies, and generates the constrained boundaries of the crown layer using the WS algorithm. Compared to applying the EDT method once or stacking it multiple times, this method can effectively and accurately separate the crown layer while retaining individuals with smaller crown diameters.
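The idea of adapting the distance threshold to the area of each crown can be sketched as follows. This is an illustrative simplification, not the LAEDT-WS implementation of this study: the rule that sets the threshold to a fraction (`ratio`) of the equivalent-circle radius of each connected component is an assumption made here, the function names are hypothetical, the distance transform is a brute-force version suitable only for tiny masks, and the mask is assumed to contain at least one background pixel. The resulting seeds would then serve as markers for a watershed constrained to the canopy mask, which is not shown.

```python
import math
from collections import deque


def label_components(mask):
    """Return the 4-connected foreground components as lists of (row, col)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(pixels)
    return components


def adaptive_seeds(mask, ratio=0.45, fixed_threshold=None):
    """Keep pixels whose distance to the background exceeds a per-component threshold."""
    h, w = len(mask), len(mask[0])
    background = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    seeds = [[0] * w for _ in range(h)]
    for pixels in label_components(mask):
        if fixed_threshold is None:
            # Assumed rule: threshold scales with the equivalent-circle radius.
            threshold = ratio * math.sqrt(len(pixels) / math.pi)
        else:
            threshold = fixed_threshold
        for y, x in pixels:
            # Brute-force Euclidean distance to the nearest background pixel.
            d = min(math.hypot(y - by, x - bx) for by, bx in background)
            if d >= threshold:
                seeds[y][x] = 1
    return seeds


# Toy mask: two 5 x 5 crowns joined by a one-pixel bridge, plus a small 2 x 2 crown.
H, W = 7, 19
mask = [[0] * W for _ in range(H)]
for r in range(1, 6):
    for c in list(range(1, 6)) + list(range(9, 14)):
        mask[r][c] = 1
for c in range(6, 9):
    mask[3][c] = 1
for r in (2, 3):
    for c in (16, 17):
        mask[r][c] = 1

n_regions = len(label_components(mask))
n_adaptive = len(label_components(adaptive_seeds(mask)))
n_fixed = len(label_components(adaptive_seeds(mask, fixed_threshold=1.8)))
```

In this toy example, the area-dependent threshold splits the two connected crowns and still keeps a seed for the small crown, whereas a single fixed threshold of 1.8 pixels removes the small crown entirely.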

Comparison with previous studies
The WS method relies on the quality of the underlying base map to a certain extent (La Rosa et al., 2021). Previous studies have typically used the CHM as the initial tree map for watershed normalization segmentation (Yin & Wang, 2019; Yun et al., 2021). In this study, three methods were compared: (1) the traditional WS method using the CHM as the base map; (2) a combination of the CHM and LAEDT-WS; and (3) a combination of GEOBIA and LAEDT-WS. The comparison was performed using data from three sites. Table 8 displays the quantitative detection results, with a total of 773 trees in the three sample plots. The combination of GEOBIA and LAEDT-WS correctly detected 770 of the trees, with an F1-score of 99.16%. This is an improvement of 10.75 percentage points over the traditional approach of the CHM combined with WS. Furthermore, the use of the LAEDT-WS method instead of traditional WS on the CHM base map increased the F1-score from 88.41% to 93.26%. The graphical segmentation results of the three methods for the three sample plots are depicted in Figure 10. Our method significantly reduced the under-segmentation rate and significantly improved the detection results, which verifies the conclusion in Section 3.2.
Table 8 and Figure 10 demonstrate the overall quality of ITC detection, and the tree crown overlap data in Figure 11 provide a more detailed quantitative evaluation of segmentation quality. The data in Figure 11 demonstrate that the combination of GEOBIA and LAEDT-WS achieves a tree crown matching rate of 99.6% (i.e. a one-to-one correspondence between predicted and ground truth tree crowns in 99.6% of detection results) across the three sample plots (Figure 11(l)). However, when the CHM is used as the base map, this matching rate decreases to 89.4% (Figure 11(g)). This indicates that GEOBIA effectively improves ITC depiction through more stable and efficient classification results. Furthermore, with the same base map, LAEDT-WS significantly improved the matching rate (from 79.3% to 89.4%) and reduced the omission rate (from 15.5% to 9.5%). The tree crown overlap analysis in Figure 11 allows for a more accurate understanding of the segmentation quality of TP trees, because an accurately predicted tree crown intersects a single manually delineated crown. Conversely, an intersection ratio of tree crowns that is greater or less than 1 indicates under-segmentation or over-segmentation of tree crowns, respectively. The results of this analysis further demonstrate that our proposed GEOBIA + LAEDT-WS framework can better detect the ITC of citrus trees with mild to moderate adhesion.
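The detection metrics reported here follow from the counts of TP, FP, and FN trees, as the short sketch below shows. The TP and FN counts come from the text (770 of 773 trees correctly detected). The FP count of 10 is not stated in this section; it is an assumed value, chosen here only because it is consistent with the reported F1-score, and the function name is hypothetical.

```python
def detection_scores(tp: int, fp: int, fn: int):
    """Precision, recall, and F1-score from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# tp and fn follow from the text; fp = 10 is an assumption (see above).
precision, recall, f1 = detection_scores(tp=770, fp=10, fn=3)
recall_pct = round(recall * 100, 2)
f1_pct = round(f1 * 100, 2)
```

With these counts, the recall and F1-score round to 99.61% and 99.16%, respectively.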

Comparison with DL-CNN
As a cutting-edge artificial intelligence model, the DL-CNN has made significant progress in crop detection. It can mine deep features and effectively fit nonlinear data. It has a different implementation paradigm and a more complex network layer structure compared to our method. Therefore, it is meaningful and challenging to compare our method with DL-CNN models. We compared the detection results of several mainstream networks using the test set. The single-class average precision at an intersection-over-union (IoU) threshold of 0.5 (AP50) and gigaflops (GFLOPs) were used to evaluate the accuracy and complexity of the models (Table 9). Table 9 shows that YOLOX (AP50 = 99.65%, GFLOPs = 26.76) and YOLACT (AP50 = 98.95%, GFLOPs = 94.89) achieved the highest overall performance (the highest AP and smallest GFLOPs) in the object detection and instance segmentation categories, respectively. This indicates that YOLOX and YOLACT had higher detection speeds and accuracies and that their architectures were more adaptable to the data in the test set. Therefore, we subsequently compared the detection results of YOLOX, YOLACT, and our method in the three study plots (Table 10 and Figure 12).
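For reference, the IoU criterion underlying AP50 can be sketched for axis-aligned bounding boxes as follows. The function name, the box format (x_min, y_min, x_max, y_max), and the example boxes are assumptions for illustration only; for instance segmentation models, the same ratio would be computed on masks rather than boxes.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
good_prediction = (2, 0, 12, 10)   # overlaps most of the reference box
poor_prediction = (6, 0, 16, 10)   # overlaps less than half of it

# A prediction counts as a true positive at the 0.5 threshold if IoU >= 0.5.
is_tp_good = box_iou(ground_truth, good_prediction) >= 0.5
is_tp_poor = box_iou(ground_truth, poor_prediction) >= 0.5
```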
According to Table 10, our method achieved the highest recall (99.61%) and F1-score (99.16%), whereas the YOLOX network had the highest precision (99.22%) and a similar F1-score (with a difference of 0.01% compared to our method).
These data indicate that both our method and the DL-CNN models could accurately detect citrus trees in the study area. However, a more important consideration is the time complexity of the two approaches. We use the lightweight YOLOX as the DL-CNN model and present the corresponding data in Table 11. When many samples are available, the DL-CNN model demonstrates excellent accuracy. However, this comes at the cost of the considerable time required to train the model weights. For instance, in this case, training the DL-CNN model on a GeForce RTX 3060 GPU requires approximately 5 h, and training such a model on a CPU is even more challenging. In contrast, the GEOBIA model completes the segmentation, feature extraction, training, application, and LAEDT-WS processes using only the CPU in 1 h 31 min. Although the training process of the DL-CNN model is slow, it exhibits excellent inference speed. To address the potential loss of smaller objects when directly scaling large images to the DL-CNN model's input size, we sliced the original image with a 15% overlap, generating a total of 637 sub-images of the same size as the network's input. Consequently, the average inference speed is approximately 0.2 s per image. Furthermore, YOLOX and YOLACT are end-to-end DL-CNN models that can automatically extract features from images through convolutional operations, thereby significantly reducing the time and cost of manual feature extraction (Puliti & Astrup, 2022). However, these models require large amounts of training data to provide detection capabilities across scales or scenes in different experimental environments (such as varying solar elevations or flight altitudes). In contrast, our optimized GEOBIA can fuse multiple features, including texture and the CHM, and obtain more information regarding the surface of target objects with less manual annotation and lower computational costs. In combination with our improved LAEDT-WS algorithm, it can provide fast and accurate detection.
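The slicing step can be sketched as follows. This is a minimal illustration rather than the procedure used in this study: the function name, the 2000 x 1500 pixel image, and the 640-pixel tile size are hypothetical, and only the 15% overlap, the 637 sub-images, and the approximate 0.2 s per image are taken from the text.

```python
def tile_origins(length: int, tile: int, overlap_pct: int = 15):
    """Start offsets of tiles of size `tile` that cover `length` with overlap."""
    if length <= tile:
        return [0]
    # Integer arithmetic keeps the stride exact (e.g. 640 * 85 // 100 = 544).
    stride = max(1, tile * (100 - overlap_pct) // 100)
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # the last tile sits flush with the image edge
    return starts


# Hypothetical image and tile sizes, used only to exercise the function.
xs = tile_origins(2000, 640)
ys = tile_origins(1500, 640)
tiles = [(x, y) for y in ys for x in xs]

# Rough total inference time implied by the figures reported in the text.
total_inference_s = 637 * 0.2
```

At roughly 0.2 s per sub-image, inference over the 637 sub-images would take on the order of two minutes, which is short compared with the reported training time.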

Conclusions and summary
In this study, we developed a new method for CTC extraction and ITC detection using a combination of GEOBIA and LAEDT-WS. First, we improved the workflow of the GEOBIA method. Specifically, we compiled 45 features, including spectral, textural, geometric, and CHM features. We used a heuristic parameter search method to optimize the hyperparameters of three classifiers, namely the SVM, KNN, and RF, and further optimized the feature parameters through forward feature selection. Second, we used the mask image that was generated by GEOBIA as the base map and developed the LAEDT method to separate connected crowns to facilitate the automated delineation of individual trees using the WS algorithm.
This research has improved the potential for implementing related software or algorithms and has achieved promising results in typical manually planted citrus orchards, and the approach should be readily applicable to a broader range of applications. However, due to limitations in the dataset, we have not implemented this framework in a larger set of systems or scenarios. Furthermore, the proposed framework is constrained by the limited accessibility of eCognition, which hinders the development of an integrated GEOBIA workflow. Future research directions include the development of a unified workflow using open-source object-based analysis tools (e.g. Orfeo Toolbox, RSGISLib).
Moreover, although unsupervised image processing methods offer computational efficiency, the complex hierarchical structure in more challenging scenarios (such as fragmented tree crowns or canopy gaps covered by understory vegetation) complicates the mathematical definition of individual crown features during decision-making. Therefore, developing a higher-level segmentation framework that applies to a broader range of environments is essential. These techniques would find more effective and meaningful applications in agriculture or forestry, such as forest management, biomass estimation, determination of forest structure parameters (e.g. crown area, tree height estimation), and construction of thinning frameworks.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 1. Workflow of this study.

Figure 2. Geographical location of the study area, where (a) is the three-dimensional (3D) topographical map of the area.

Figure 4. Locations of three study plots. The data from plot 1 were used for the CTC extraction accuracy evaluation, and the data from all three plots were used for the ITC detection accuracy evaluation.

Figure 5. Visual extraction results of CTC using pairwise combinations of four segmentation methods (MRS, MSS, WS, PB) and three ML algorithms (KNN, SVM, RF) in test plots.

Figure 6. Confusion matrix accuracy evaluation results of extracting CTC using pairwise combinations of four segmentation methods (MRS, MSS, WS, PB) and three machine learning algorithms (KNN, SVM, RF) in test plots.

Figure 7. Comparison of GEOBIA and PB methods using KNN, RF, and SVM algorithms at the individual tree scale.

Figure 8. Feature set optimization for ML models: (a) FIM of optimized feature set and (b) relationship between the number of features and model accuracy and time consumption.

Figure 9. Detection results of our method in example plot.

Figure 10. Comparison of detection results with previous studies on three study plots.

Table 1. UAV and flight parameters.

Table 2. Introduction to commonly used segmentation methods in GEOBIA.
Requires fewer parameter settings for image segmentation.
border length, length, length/width, number of pixels, relative border to image border, volume, width 18 Shape Asymmetry, border index, compactness, density, elliptic fit, rectangular fit, roundness, shape index, radius of largest/

Table 5. Common parameters of DL-CNN models.

Table 6. Example of confusion matrix.

Table 7. Accuracy evaluation of ML models under different parameter settings.

Table 8. Quantitative evaluation compared to previous research.

Table 9. Evaluation results of DL-CNN models on test set.

Table 10. Quantitative detection results of our method compared to YOLOX and YOLACT on test sites.