Estimation of log-gripping position using instance segmentation for autonomous log loading

ABSTRACT Autonomous forestry machinery is necessary both to ensure safety and improve productivity. Previous research related to automation technology for forestry machinery has mainly focused on autonomous driving; research on log loading/unloading is still in progress. To automate the loading and unloading of logs, it is necessary to evaluate the errors of several processes quantitatively: detecting logs in the environment, estimating the gripping position, and controlling the machine. This paper focuses on the development of an autonomous log loading operation. This study aims to propose an estimation method for log gripping position based on log detection using instance segmentation. Evaluation of the proposed system shows that the root mean square errors in the radial, axial, and vertical directions are 0.162, 1.526, and 0.140 m for sparse logs, 0.384, 0.271, and 0.119 m for dense logs, and 0.764, 1.022, and 0.194 m for unorganized logs, respectively. Our results demonstrate that the proposed method is sufficiently accurate to achieve gripping of a single log; however, the accuracy is insufficient for gripping one in a dense group of logs accurately.


Introduction
Demand for autonomous forestry machinery has emerged from the need both to ensure safety and to improve productivity at industrial scale. From a safety perspective, automating forest machinery keeps people out of the forest, where heavy trees are handled in an unstructured and unstable environment. From a productivity perspective, it has been noted that operators may become a bottleneck (Hellström et al. 2009). Replacing manual operations with automated machinery can lead to faster completion of work. Thus, automation of forestry machinery should be promoted.
In a cut-to-length (CTL) system, machines that can benefit from automation include harvesters and forwarders. Harvesters perform felling, delimbing, and driving; forwarders perform forwarding, i.e. log loading, driving, and unloading. These machine operations within a CTL system are inherently complex, regardless of the machine type or operation. To improve productivity and safety by introducing autonomous machinery into a CTL system, it is necessary to automate all operations of each specific machine. Felling requires gripping trees, cutting standing trees, moving logs, and driving, which involves handling long, heavy trees, which can exceed 20 m in length, in a forest environment that includes densely packed trees, uneven terrain, and restricted spaces. On the other hand, forwarding requires only gripping logs, driving, and unloading logs, which typically range from 2 m to 6 m in length. Thus, the forwarder is less complex than the harvester and is therefore a more promising candidate for automation. Additionally, both forwarding and felling include gripping and handling logs, and log loading can also be applied to felling operations and to log transportation by log trucks and trailers. This study has thus focused on the autonomous log loading operation.
Following the early success of autonomous driving of forestry machinery (Hellström et al. 2006; Ringdahl et al. 2011), several examples of autonomous driving or navigation have been reported in unmanned aerial vehicle flights of small robots (Smolyanskiy et al. 2017), vegetation removal (Mowshowitz et al. 2018), and wildfire prevention (Couceiro et al. 2019). However, research on autonomous driving and loading operations for forestry machinery has been limited compared to other automated systems, such as machinery used for construction or agriculture. Further, Visser and Obi (2021) noted that automation of forestry equipment lags behind larger industries such as agriculture, mining, or the military, and that one limitation for more extensive use of autonomous equipment is the lack of larger-scale market demand for harvesting machinery in the forest industry.
Compared with driving, few previous reports have addressed automated log loading; however, some recent research has reported progress in autonomous log loading, including the detection of logs in images (Usui et al. 2019; Usui 2021; Fortin et al. 2022; da Silva et al. 2022), the automation of the forwarder (Geiger et al. 2020, 2021), and the development of new machinery for unmanned operations (Lulea university of technology 2021) by integrating technologies for automated driving and log loading. Such research has the potential to produce significant industrial breakthroughs. To apply these automated forwarders in practical forestry operations, however, certain technical issues need to be addressed. In a CTL system, forwarders usually collect multiple logs near a spur road. During transportation by trucks, logs are loaded from log stacks onto the cargo bed of the truck. In this process, machinery is required to detect individual logs and choose the most appropriate log for loading. In addition, machinery often grabs multiple logs at the same time to reduce the operating time. To grab logs effectively with a swing-mounted grapple, determining the most appropriate position and orientation for gripping is critical. Our study, therefore, concentrates on estimating the gripping positions and orientations of the logs.
Automated object gripping/handling has been widely used in other industries (Tai et al. 2016), e.g. manufacturing (Fantoni et al. 2014), warehouse automation (Azadeh et al. 2019), and agriculture (Zhou et al. 2022). Specifically, the loading/harvesting operations of agriculture share many similarities with forestry operations because both handle natural objects in an uncontrolled environment. Kootstra et al. (2021) noted that variation, incomplete information, and safety are the main challenges for selective harvesting robotics in agriculture. These challenges are also common in forestry. Many harvesting robots have been proposed for agriculture (Zhou et al. 2022), and their general operations have been reported (Jia et al. 2020) as follows: the harvesting robot approaches the plant through the walking device, the robot's vision system identifies and locates the target fruit, the robotic arm guides the end effector to avoid the obstacle and approaches the harvesting target, the end effector harvests the target fruit, and the robot arm and the end effector store the harvested fruit. The methods of estimating the harvesting and gripping positions are usually similar in different harvesting robots, and many robots adopt the two-step process of (1) object detection and (2) gripping position estimation. This approach is also considered to be effective for automating log loading in forestry; in fact, similar approaches have been reported in several forestry studies (Usui et al. 2019; Geiger et al. 2020, 2021; Usui 2021; Fortin et al. 2022; da Silva et al. 2022). In particular, Geiger et al. (2021) reported autonomous log loading with a control error of less than 5 cm using instance segmentation. However, few studies have investigated the accuracy of each process in autonomous log loading, e.g. environmental recognition, path planning for the grapple head, and control of the machinery. Thus, it is crucial to clarify the errors in each process to ensure more precise control, such as unloading from a forwarder or log stacking.
Focusing on each process of autonomous loading, investigators have proposed the detection of individual logs and trees from images using object detection (Usui et al. 2019; da Silva et al. 2022) and instance segmentation (Usui 2021; Fortin et al. 2022). Although these methods are capable of detecting individual logs, they use only the coordinates of the image coordinate system and not 3D coordinates. For example, da Silva et al. (2022) proposed a standing tree mapping system based on object detection using an OAK-D camera; however, the positional accuracy of this system has not been reported.
A similar method using instance segmentation has been proposed by Grondin et al. (2022, 2023). Grondin et al. (2022) reported systems for detecting standing trees in scenes generated by simulation. In addition, they proposed a dataset with keypoints and estimations based on deep learning, wherein trained models could estimate the cut position for felling a tree with an average error of less than 7 cm. Their dataset was focused on harvesting operations. However, applying this keypoint-based method to log loading requires the labor-intensive annotation of a dataset of bucked logs with keypoints. Geiger et al. (2020) showed that it is possible to estimate the length and diameter of a log with errors of 7% and 18%, respectively, using a 3D point cloud created from segmented stereo images. However, the precision of the global coordinates of this system has not been reported. Similar methods for estimating log shapes using 3D sensors have been widely used for forest resource estimations. These approaches include laser scanning with TLS, hand-held LiDAR, and SfM-based image data acquisition (Bauwens et al. 2016; Wallace et al. 2016; Iglhaut et al. 2019; Hunčaga et al. 2020; Hyyppä et al. 2020). These methods, which rely on high-precision points, accurately estimate tree shape parameters such as the diameter at breast height; however, they acquire large point clouds and extract the features of the trees from the data obtained. Collecting a sufficient number of points can therefore require multiple perspectives for each individual tree. Additionally, the large number of points increases the processing time.
Autonomous log loading requires real-time processing for the detection of individual trees because the environment, including the positions and numbers of logs, changes dynamically during the log loading process. Hence, it is difficult to obtain large point clouds from multiple or backside viewpoints of the logs because the sensors installed on the loading machinery can only acquire single-view data. Consequently, a method is required for estimating the gripping positions of multiple logs in real-time using data from a single viewpoint.
The appropriate gripping position of the log is also important. This position can be selected from several candidates: the center of gravity of the log, the lengthwise center of the log, or the end of the log. Operators of forestry machinery typically grip logs horizontally using a swinging grapple, which is commonly equipped on forwarders. To achieve a horizontal grip, the center of gravity of the log must be selected as the gripping position. In addition, the path of the grapple head with the gripped log must be planned appropriately. If the log is gripped at an inclination, rather than in a horizontal orientation, the path must raise the grapple head and log to avoid the forwarder stakes. If the logs cannot be gripped horizontally during loading and unloading, the planned path can increase the operating time. Consequently, gripping the logs horizontally at their center of gravity improves the efficiency of loading and unloading by reducing the machine operating time. In this study, we have therefore adopted the center of gravity of each log as the gripping position. Herein, we propose a method by which the gripping positions and orientations of logs can be estimated.

System overview
Determining the gripping position of a log first requires detecting it within the environment. Generally, this process uses "object detection" methods that detect a target with a rectangular shape. Recently, most object detection methods have been based on deep learning. Various object detection methods, such as R-CNN (Girshick et al. 2014), Fast R-CNN (Girshick 2015), Faster R-CNN (Ren et al. 2017), and YOLO (Redmon et al. 2016), have been proposed. Specifically, YOLO has been reported to achieve a balance between accuracy and real-time operation. Creating a training dataset for it is also relatively easy because the supervised labels are rectangular.
Conversely, the rectangles produced by object detection methods contain not only the target log but also other features, such as the ground or vegetation. Usui (2023) proposed a method for estimating the gripping position of a log using rectangle detection but reported that the root mean square error (RMSE) remained at approximately 0.6 m due to the detection of objects other than logs. For these reasons, it is appropriate to use instance segmentation, which enables a more precise, pixel-level detection of individual logs, to estimate log-gripping positions accurately.
In addition, for this research, it is necessary to detect logs, estimate their gripping positions, and update the estimated gripping position as the chassis moves in real-time during autonomous log loading. For this purpose, we focused on YOLACT (Bolya et al. 2019) and YOLACT++ (Bolya et al. 2022), which are instance segmentation methods that perform real-time segmentation while maintaining accuracy. General instance segmentation methods such as Mask R-CNN (He et al. 2020) rely on feature localization and employ a two-stage detector. Two-stage detectors sequentially process the information from a bounding-box region and predict masks based on feature localization, which is difficult to speed up through parallelization. Rather than using feature localization, YOLACT generates a set of prototype masks and predicts a set of linear-combination coefficients for each instance, which can be accelerated by parallel processing. YOLACT++, an improved model based on YOLACT, incorporates deformable convolution into the backbone network and optimizes the prediction head with improved anchor scales and aspect ratios. It also introduces a novel, fast, mask-rescoring branch. These features are refined to improve the model accuracy. As YOLACT++ can achieve an mAP of 34.1 on MS COCO at 33.5 fps, it is both fast and accurate, and we adopted it as the instance segmentation method for the logs in the present study.
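The mask assembly step of YOLACT described above can be illustrated with a minimal sketch. It assumes, following the YOLACT formulation, that an instance mask is the sigmoid of a linear combination of the prototype masks; the function name `assemble_mask` and the nested-list data layout are hypothetical and chosen only for readability, not taken from the YOLACT code base.

```python
import math


def assemble_mask(prototypes, coefficients):
    """Combine k prototype masks (each H x W) with k coefficients for one instance.

    Each output pixel is sigmoid(sum_k coefficient_k * prototype_k[pixel]).
    """
    height, width = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for r in range(height):
        row = []
        for c in range(width):
            activation = sum(
                coef * proto[r][c] for proto, coef in zip(prototypes, coefficients)
            )
            row.append(1.0 / (1.0 + math.exp(-activation)))
        mask.append(row)
    return mask


# Two 1 x 2 prototypes; a positive and a negative coefficient select the left pixel.
example = assemble_mask([[[1.0, 0.0]], [[0.0, 1.0]]], [2.0, -2.0])
```

Because every pixel is computed independently from the same prototypes, the operation maps naturally onto a single matrix multiplication, which is the property that makes the approach amenable to parallel processing.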
The 3D coordinates of the gripping position were estimated using the following process. First, an image of the environment around the machinery was acquired using a camera. Next, each log was detected at the pixel level in the obtained images by using instance segmentation. Then, the 3D points of the whole image were acquired from the stereo camera, and the 3D points corresponding to the logs were extracted using the segmentation results. Finally, the gripping position for each log was obtained by estimating the center of gravity of the detected points corresponding to that log. An outline of the processing algorithm is shown in Figure 1. If an appropriate gripping position could not be determined in the above process, the estimated gripping position was not output. In this study, to evaluate the accuracy of the estimated gripping position for autonomous loading, the operation was assumed to consist of gripping a single log for loading, and the gripping positions for each detected log were estimated one by one. Each process is described in detail in the following sections.
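The extraction of the 3D points belonging to one log can be sketched as follows. This is an illustrative sketch only: it assumes that the stereo camera provides one (x, y, z) value per pixel, aligned with the image, and that pixels without a valid depth are marked with non-finite values. The function name `extract_log_points` and the data layout are hypothetical.

```python
import math


def extract_log_points(mask, xyz_image):
    """Return the 3D points whose pixels lie inside the mask and have finite coordinates.

    mask      : H x W nested list of truthy/falsy values for one detected log
    xyz_image : H x W nested list of (x, y, z) tuples from the stereo camera (assumed layout)
    """
    points = []
    for mask_row, xyz_row in zip(mask, xyz_image):
        for is_log, xyz in zip(mask_row, xyz_row):
            if is_log and all(math.isfinite(v) for v in xyz):
                points.append(tuple(xyz))
    return points


nan = float("nan")
mask = [[1, 0], [1, 1]]
xyz_image = [[(1.0, 2.0, 3.0), (0.0, 0.0, 0.0)], [(nan, 0.0, 0.0), (4.0, 5.0, 6.0)]]
log_points = extract_log_points(mask, xyz_image)
```

Dropping invalid pixels at this stage keeps later steps, such as the center-of-gravity estimate, from being distorted by missing depth values.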

Datasets for instance segmentation
Images including logs and supervised labels are required for learning instance segmentation. Images for segmentation were acquired between September 2018 and September 2020 in Gunma Prefecture and between September 2020 and January 2022 in Ibaraki Prefecture. Both areas are located in the central part of Japan. The images from Gunma Prefecture included logs of the species Japanese cedar (Cryptomeria japonica), Hinoki cypress (Chamaecyparis obtusa Sieb. et Zucc.), and fir (Abies firma), while the images from Ibaraki Prefecture included only Japanese cedar. Stereo cameras (ZED and ZED2i, Stereolabs) were used to collect images of bucked logs at a resolution of either 1280 × 720 (ZED) or 2209 × 1242 pixels (ZED2i). The labels were created in MS COCO format, with 7466 log annotations for a total of 3262 images. Some examples of labeled images are shown in Figure 2. The annotation class is "log" only; tree species were not classified. All annotations were labeled manually. Only logs that are shown in their entirety within the image were annotated.
Empirically, when creating a machine learning model, the model is optimized by using 80% of the dataset as training data (Gholamy et al. 2018). Therefore, for evaluation, the dataset obtained for the logs was first divided into training/validation and test datasets at a 90/10 ratio. Subsequently, the training/validation portion was randomly divided into training and validation datasets at an 80/20 ratio. Thus, three datasets for training, validation, and testing, with 2348, 588, and 326 images, respectively, were obtained.
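The two-stage split can be sketched as below. The sketch assumes that fractional sizes are rounded down, which happens to reproduce the reported dataset sizes; the function name `split_dataset`, the fixed seed, and the rounding rule are illustrative assumptions rather than the procedure actually used.

```python
import random


def split_dataset(items, test_ratio=0.10, val_ratio=0.20, seed=0):
    """Split items into (train, validation, test) lists in two stages."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # Stage 1: hold out the test set (10% of all images).
    n_test = int(len(shuffled) * test_ratio)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    # Stage 2: split the remainder 80/20 into training and validation sets.
    n_train = int(len(rest) * (1.0 - val_ratio))
    return rest[:n_train], rest[n_train:], test


train, val, test = split_dataset(range(3262))
print(len(train), len(val), len(test))
```

With 3262 images, this yields 2348 training, 588 validation, and 326 test images, matching the counts given above.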

Tracking logs using multiple object tracking
Because the instance segmentation model processes each image independently, it is not possible to determine whether the same log is being detected continuously, even in temporally continuous images.
It is desirable to detect the logs to be grappled continuously during camera and machine movement in order to update their position coordinates. In addition, momentary misdetections of logs were anticipated. Such temporally discontinuous detections can be excluded from the gripping position candidates by tracking objects continuously over time. Multiple object tracking (MOT) was therefore adopted to estimate the gripping position. For the MOT scheme, this study uses simple online and real-time tracking (SORT, Bewley et al. 2016), which does not use any appearance information about the tracked object. Instead, SORT handles motion prediction and data association using a Kalman filter (Kalman 1960) and the Hungarian algorithm (Kuhn 1955) based on the bounding boxes provided by an external object detector. These simple algorithms enable fast tracking. The tracking accuracy of SORT depends on the accuracy of the detected bounding box. Thus, high-accuracy detection makes SORT both fast and accurate. In this study, log tracking was performed by using SORT to process bounding boxes that encompass the mask of each log after segmentation. Logs detected in both the previous and the current frames were selected as candidates, and their gripping positions were then estimated. Other detected logs were considered misdetections, and their results were discarded.
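Two small steps of this procedure can be sketched in code: deriving the bounding box that encloses a segmentation mask, and keeping only the tracks that appear in both the previous and the current frame. The sketch assumes that the tracker has already assigned an integer ID to each detection; the tracker itself (Kalman filter and Hungarian assignment) is not reproduced here, and the function names are hypothetical.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the truthy pixels, or None for an empty mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for row in mask for c, value in enumerate(row) if value]
    return (min(cols), min(rows), max(cols), max(rows))


def persistent_tracks(previous_ids, current_ids):
    """Keep the track IDs of the current frame that were also present in the previous frame."""
    previous = set(previous_ids)
    return [track_id for track_id in current_ids if track_id in previous]


bbox = mask_to_bbox([[0, 0, 0], [0, 1, 1], [0, 1, 0]])
candidates = persistent_tracks([1, 2, 3], [2, 3, 4])
```

In this example, track 4 appears only in the current frame and is therefore treated as a possible misdetection until it is confirmed in the next frame.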

Acquisition of 3D points and estimation of the gripping position and orientation
First, a mask image was created from the pixels that included the logs detected by instance segmentation and tracking. Second, the mask image was integrated with 3D points created from the stereo images using a stereo camera software development kit (Stereolabs 2023). As a result, 3D points were obtained that corresponded to the detected logs. It was predicted that estimating the center of gravity of the log would bias the estimate toward the visible part of the log. To overcome this potential bias, the gripping position was selected to be the log-surface point (g_s) nearest to the log's center of gravity (g), since log-surface points are easy to measure directly. This point g_s, obtained from the 3D points of the detected log, was taken as a candidate for the gripping position of the log. Figure 3 shows an outline of this gripping position. This process for a single log was repeated until all the logs in the image had been processed. After searching for the point closest to the center of gravity for all of the logs detected in the image, the candidate closest to the camera was used as the estimated gripping position. 3D points were created only from stereo images obtained at the same time as the instance segmentation. Consequently, only 3D points within the range visible from the stereo images were used for processing. The estimation of the log's orientation was carried out simultaneously with the estimation of the gripping position. Principal component analysis was performed on the detected 3D points of the log. The eigenvector corresponding to the largest eigenvalue was selected as the gripping orientation of the log. Each estimation used only the 3D points obtained from a single pair of stereo images. The proposed method for estimation of the log gripping position does not output gripping positions for the simultaneous gripping of multiple logs, since the primary objective in this paper is to evaluate the gripping position for a single log.
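The estimation of the gripping point and orientation for one log can be sketched as follows. The sketch takes the mean of the detected 3D points as the center-of-gravity estimate, selects the detected point nearest to it as g_s, and obtains the first principal axis of the points. To keep the example free of external libraries, the dominant eigenvector is found by power iteration, which is a stand-in for a library eigen-decomposition and not necessarily what the original implementation used; all function names are hypothetical.

```python
import math


def centroid(points):
    """Mean of the 3D points, used here as the center-of-gravity estimate g."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def gripping_point(points):
    """Detected surface point g_s nearest to the centroid of the detected points."""
    g = centroid(points)
    return min(points, key=lambda p: math.dist(p, g))


def principal_axis(points, iterations=200):
    """Unit eigenvector of the covariance matrix with the largest eigenvalue."""
    g = centroid(points)
    centered = [[p[i] - g[i] for i in range(3)] for p in points]
    cov = [
        [sum(q[i] * q[j] for q in centered) / len(points) for j in range(3)]
        for i in range(3)
    ]
    # Power iteration; the asymmetric start vector makes an exactly
    # orthogonal start unlikely but does not rule it out.
    v = [1.0, 0.7, 0.3]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return tuple(v)


log_points = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.0, 0.0), (3.0, -0.1, 0.0), (4.0, 0.0, 0.0)]
g_s = gripping_point(log_points)
axis = principal_axis(log_points)
```

The sign of the returned axis is arbitrary, so a consumer of the orientation would need to treat a vector and its negation as the same log direction.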
During the processes described above, it is important to prevent the outputting of coordinates with huge errors due to misdetections of the logs.Therefore, candidates for the gripping point were filtered to lie within the range 2.7 to 6.0 m, which lies within the working range of the grapple loader (SK50SR, Kobelco) used in this study.
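The range filter and the choice of the candidate closest to the camera can be sketched together. The sketch assumes that the range is the Euclidean distance from the origin of the coordinate system in which the candidates are expressed; the text does not state the reference point, so this is an assumption, as is the function name `select_gripping_candidate`.

```python
import math


def select_gripping_candidate(candidates, min_range=2.7, max_range=6.0):
    """Return the in-range candidate nearest to the origin, or None if none qualifies."""

    def distance(p):
        return math.sqrt(sum(v * v for v in p))

    in_range = [p for p in candidates if min_range <= distance(p) <= max_range]
    return min(in_range, key=distance, default=None)


candidates = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 5.0, 0.0), (7.0, 0.0, 0.0)]
selected = select_gripping_candidate(candidates)
```

Returning None when no candidate lies within range mirrors the behavior described earlier, in which no gripping position is output if an appropriate one cannot be determined.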
To evaluate the accuracy of the proposed system for estimating the gripping position, we implemented it on the Robot Operating System (ROS), a software platform for robots. The data obtained from the stereo camera were saved in a log file and processed separately after the experiments on a computer (CPU Intel Core i9-10900K @ 3.70 GHz, memory 128 GB, GPU NVIDIA GeForce RTX 3090, OS Ubuntu 20.04).

Field experiments
To evaluate the proposed system, three types of field experiments resembling realistic operations were conducted in Ibaraki Prefecture: (1) an experiment on 14 July 2022 in which the logs were arranged separately; (2) an experiment on 11 July 2022 in which the logs were arranged densely and close to each other, assuming stacking; and (3) an experiment on 20 July 2023 in which the logs were arranged in a disorderly manner.
It is important to note that during these experiments, only data acquisition was carried out, and real-time estimations of log gripping positions were not performed. Consequently, to evaluate the real-time capability of the gripping position estimations necessary for automated log loading, the processing time required for each gripping position estimation was also recorded.
For the assessment of instance segmentation during the experiments, 10 images from each experiment were labeled, resulting in a total of 30 images evaluated for feasibility. In an assumed automated operation, the logs move out of the camera's field of view after being gripped. Therefore, the gripping positions closest to the camera are presented sequentially. However, in the experiments conducted for this study, the logs were not moved. Thus, the log closest to the camera was identified as the gripping position while the grapple chassis was in motion.

Experiment 1: sparse logs
This first experiment assumed a situation in which the logs were placed separately. This corresponds to a case in which logs are left spaced relatively far apart in the forest, assuming forwarding in a CTL system. Five Cryptomeria japonica logs, each 4 m in length, were placed on a flat surface, spaced apart.
Figure 3. The outline of the estimated gripping position and evaluation axes in the log coordinate system. Here, l is the length of the log, c is the lengthwise center of the log, g is the center of gravity of the log, and g_s is the gripping point in this study.
The large ends of the logs were arranged in front of the grapple loader at the beginning of the experiment. A grapple loader equipped with a stereo camera (ZED2i, Stereolabs) was moved around the logs, and images of the logs were collected continuously. The camera was installed pointing at a downward angle of 0.504 radians.
By rotating the upper swing body of the grapple loader 90 degrees against the direction of movement, the captured images always included the logs. During 170 s of video recording, 2289 images were captured. Figure 4 shows the experimental setup, and Figure 5(a) shows the trajectory of the chassis and the positions of the logs during the experiment. The shapes of the logs used in this experiment are presented in Table 1.

Experiment 2: dense logs
Next, an experiment was performed for a situation in which logs were positioned densely next to each other. This situation corresponds to operations such as unloading from the forwarder bed and loading from a log stack in a CTL system. Ten Cryptomeria japonica logs, each 4 m in length, were placed next to each other on a flat surface. Images were collected by driving around the logs using the same method as described above for the sparse logs experiment, except that the camera was installed on the grapple loader pointing at a downward angle of 0.241 radians. Here, 524 images were obtained over 59 seconds.

Experiment 3: unorganized logs
In the third experiment, the logs were arranged in a disorderly manner to simulate the real environment in the forest. Sixteen Cryptomeria japonica logs, each 4 m in length, were placed in the forest. The experimental equipment and method were the same as described above for the sparse and dense logs experiments, with the camera installed pointing downward at an angle of 0.346 radians. Here, 2259 images were obtained over 600 seconds. Figure 5(c) shows the trajectory of the chassis and the position of the logs during the experiment. In this third experiment only, it was anticipated that the positions derived from the self-localization of the camera could result in errors in the estimations of the gripping positions due to the considerable driving distance covered during the experiment. Therefore, the position of the grapple chassis was measured using a total station (TS, IS303, TOPCON).

Evaluation of instance segmentation
Generally, precision and recall are used to evaluate machine learning models. To evaluate the log detection accuracy, recall, average precision (AP; Everingham et al. 2010), and mean AP (mAP) in the COCO evaluation metrics, which are commonly used indices for evaluating instance segmentation, were used. Because mAP requires a threshold for determining a successful detection, the intersection over union (IoU), which indicates the overlap ratio between the estimated and true areas, is used as the index threshold. The indices recall_50, AP_50, and mAP were used; recall_50 and AP_50 denote the values of recall and AP for a threshold of IoU = 0.50, and mAP considers values of IoU in the range 0.50-0.95.
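The IoU criterion and a recall computation at a fixed threshold can be sketched as follows. The sketch represents each mask as a set of (row, column) pixels and matches each true mask greedily to its best remaining prediction; the official COCO evaluation matches detections in order of confidence, so this is a simplified illustration rather than a reimplementation. The IoU thresholds from 0.50 to 0.95 in steps of 0.05 follow the COCO convention. Function names are hypothetical.

```python
# IoU thresholds averaged by the COCO mAP metric: 0.50, 0.55, ..., 0.95.
COCO_IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def mask_iou(predicted, truth):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(predicted | truth)
    return len(predicted & truth) / union if union else 0.0


def recall_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of true masks matched one-to-one by a prediction with IoU >= threshold."""
    unused = list(predictions)
    hits = 0
    for truth in ground_truths:
        best = max(unused, key=lambda p: mask_iou(p, truth), default=None)
        if best is not None and mask_iou(best, truth) >= threshold:
            unused.remove(best)  # each prediction may match only one true mask
            hits += 1
    return hits / len(ground_truths) if ground_truths else 0.0


a = {(0, 0), (0, 1)}
b = {(0, 1), (0, 2)}
recall_50 = recall_at_iou([a], [a, b], threshold=0.5)
```

Here the single prediction matches the first true mask exactly and is then unavailable for the second, so only one of the two true masks counts as detected.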

Evaluation of gripping positions and orientations
The process for evaluating gripping positions and orientations is explained below. First, the direction of the axis on the surface of each log was measured in advance, using a TS to measure the large end and the center of each log, and the coordinates of the estimated gripping position were converted into the coordinate system of the TS. To evaluate the errors along the directions that matter for gripping, the errors in the gripping position were transformed into the radial, axial, and vertical directions of the log coordinate system. Each evaluation axis is shown in Figure 3. Similar to the evaluation of the gripping position, the rotational errors were transformed into the radial, axial, and vertical directions in the log coordinate system. In this study, the rotations around the radial, axial, and vertical axes of the log were referred to as roll, pitch, and yaw, respectively.
The true value of the gripping position of each log in the TS coordinate system was calculated from the measured lengthwise center of the log on its surface and the calculated center of gravity of the log. For each log, the point c_s on the surface at the lengthwise center (c) of the log and the point p_sle on the surface at the large end of the log were measured using the TS. The length between the small end of the log and its center of gravity, l/2 + l_g, was calculated from the length l of the log and the diameters (d_1 and d_2) of the small and large ends of the log, assuming the log to be a truncated cone. Here, l_g is the length from the center of the log to its center of gravity in the axial direction along the log. The axial vector on the surface of the log was calculated from the difference between c_s and p_sle.
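The offset l_g can be computed from the standard centroid formula for a truncated cone. The sketch below assumes uniform density and takes d_1 and d_2 as the small-end and large-end diameters; the exact expression used in the study is not given in the text, so this is one consistent way of obtaining l/2 + l_g, and the function name is hypothetical.

```python
def cog_offset_from_center(length, d_small, d_large):
    """Axial distance l_g from the lengthwise center to the center of gravity.

    For a uniform truncated cone, the centroid lies at
        length * (d1^2 + 2*d1*d2 + 3*d2^2) / (4 * (d1^2 + d1*d2 + d2^2))
    from the small end, where d1 and d2 are the small- and large-end diameters.
    A positive result means the center of gravity is shifted toward the large end.
    """
    s = d_small**2 + d_small * d_large + d_large**2
    from_small_end = (
        length * (d_small**2 + 2 * d_small * d_large + 3 * d_large**2) / (4 * s)
    )
    return from_small_end - length / 2


# Hypothetical 4 m log with end diameters of 0.20 m and 0.26 m.
l_g = cog_offset_from_center(4.0, 0.20, 0.26)
```

Two limiting cases give a quick check: for a cylinder (equal diameters) the offset is zero, and for a full cone (zero small-end diameter) the center of gravity lies three quarters of the length from the tip.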
Based on the measurements and calculations described above, the true value of the gripping position was obtained by offsetting c_s by l_g along the axial vector on the surface of the log, toward the large end. The true value of the orientation was set to be the axial vector on the surface of the log in the TS coordinate system. Because the log tapers along its axis, the axial vector on the log's surface calculated in this way differs in the vertical direction from the internal axial vector (from c to g). However, this bias was not removed. For one log in the sparse logs experiment, the 3D coordinates of the center of gravity on the log were obtained from the stereo camera images and used as the target gripping position because the coordinates could not be obtained from the TS. The errors were calculated as the difference between the estimated gripping position and the true gripping position closest to it. The errors were evaluated as RMSEs.
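The construction of the true gripping position and the error evaluation can be sketched as follows. The axis convention is an assumption made for illustration, since the evaluation axes are defined only in Figure 3: the vertical axis is taken as the z-axis of the TS frame, the axial direction as the horizontal projection of the log axis, and the radial direction as the horizontal direction perpendicular to it. All function names are hypothetical.

```python
import math


def unit(v):
    """Normalize a vector; raises ValueError for a zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero-length vector")
    return tuple(x / norm for x in v)


def true_gripping_position(c_s, p_sle, l_g):
    """Shift the surface center c_s by l_g toward the large-end surface point p_sle."""
    direction = unit([e - c for e, c in zip(p_sle, c_s)])
    return tuple(c + l_g * d for c, d in zip(c_s, direction))


def log_frame_error(estimate, truth, axis):
    """Split (estimate - truth) into (radial, axial, vertical) components."""
    ax = unit((axis[0], axis[1], 0.0))  # horizontal axial direction (assumed convention)
    radial_dir = (-ax[1], ax[0])        # horizontal, perpendicular to the log axis
    diff = [e - t for e, t in zip(estimate, truth)]
    radial = diff[0] * radial_dir[0] + diff[1] * radial_dir[1]
    axial = diff[0] * ax[0] + diff[1] * ax[1]
    return radial, axial, diff[2]


def rmse(values):
    """Root mean square of a sequence of error values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


truth = true_gripping_position((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), 0.1)
radial, axial, vertical = log_frame_error((1.0, 2.0, 0.5), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

Collecting the per-frame radial, axial, and vertical components into separate lists and passing each list to `rmse` gives one RMSE per direction, in the same form as the values reported in the abstract.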

Evaluation of the graspable range
The most critical factor when gripping a log is ensuring that the gripping position falls within the graspable range of the grapple head. If there is an error in the estimated gripping position along the log's axial direction, it is still feasible to grip other parts of the log. In the case of a swinging grapple, the log can become tilted, depending on the gripping position. Consequently, it is conceivable that automated control of the grapple head could be negatively influenced by swinging caused by gripping away from the center of gravity. In contrast, a fixed grapple could be less influenced by such tilting, as it can maintain its hold on the log regardless. Therefore, it is considered that errors in the axial direction are acceptable for a fixed grapple; however, log gripping could fail if an error in the radial direction or the vertical direction were to leave the log outside of the graspable range of the grapple head.
Similarly, the orientation of the grapple head needs to be generally aligned with the roll and yaw of the log for gripping to be possible. Conversely, the pitch orientation does not affect the gripping process. Therefore, this study evaluated the roll and yaw directions of the log. The errors in the roll and yaw orientations were considered to be acceptable if they fell within the maximum graspable range of the grapple head.
The conditions necessary for gripping logs are therefore as follows. The grapple coordinate system is established as shown in Figure 6. The estimated gripping position is assumed to be aligned with the center of the graspable range of the grapple head. Moreover, the Z-axis of the grapple coordinate system is assumed to be aligned with the direction of the estimated gripping orientation. When the gripping position is accurately determined, it is presumed that gripping can be achieved. Since the gripping position was evaluated in the log coordinate system, the graspable range was evaluated using the transformed points of the estimated gripping position in the grapple coordinate system, as described below.
Let P be defined as a set of points that have the positions x, y, z, and orientations θ, φ, ψ (Equation 1). Here, θ, φ, and ψ represent the roll, pitch, and yaw of a given log, respectively. From the aspect of the gripping position, the gripping range is confined within a circle defined by the trajectory of the grapple arm's tip, centered on the rotation center of the grapple arm. Consequently, considering both grapple arms, the following conditions are set (Equations 2 and 3), assuming that c_1(a_1, b_1) and c_2(a_2, b_2) are the xy coordinates of the rotation centers of the grapple arms (a_1 < a_2). Here, R represents the length of the grapple arms. As the rotation center of the grapple arms is located above the upper limit of the graspable range of the grapple head, b_1 and b_2 can be written as follows (Equation 4). Here, H is the maximum gripping height, where the left and right arms overlap at the center of the grapple head, H_c is the distance from the upper limit of the graspable range of the grapple head to the rotation center of the grapple arms, and x_c and y_c are the coordinates of the center of the graspable area. Furthermore, considering that the gripping range is situated below the rotation center of the grapple arms and is limited by the maximum length L of a grapple arm, the following conditions are imposed (Equations 5 and 6).
If the point p_1 lies on the circle with center c_1 and has the smallest value of x satisfying condition C, and the point p_2 lies on the circle with center c_2 and has the largest value of x satisfying condition C, then their coordinates are p_1(x_c − l/2, y_c − H/2) and p_2(x_c + l/2, y_c − H/2). Additionally, if the points p_3 and p_4 have the smallest values of y satisfying conditions A and B, respectively, they can be represented as p_3(a_1, b_1 − R) and p_4(a_2, b_2 − R). Since p_1 and p_2 correspond to the coordinates of the tips of the grapple arms at the maximum open position, the graspable range is defined as a polygon with the coordinates c_1, c_2, p_1, p_2, and the trajectory of the tips of the grapple arms. That is, if we define p_5(x_c − L/2, b_1 − R) and p_6(x_c + L/2, b_2 − R), then Equation 7 holds. Here, P_vertex is defined as follows (Equation 8).
From the aspect of gripping orientation, it is imperative that the rotated log remain within the graspable area. This establishes the following (Equations 9 and 10). Here, W is the width between the grapple arms. Overall, the conditions that define the graspable area are as follows (Equation 11). Here, P_in defines the set of points that are within the graspable area of the grapple head.
The specifications of the grapple head used in this study (BHS10MMR-3, Nansei Machinery) are as follows: $R$ = 0.660 m, $L$ = 1.430 m, $H$ = 0.458 m, $H_c$ = 0.035 m, and $W$ = 0.402 m.
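As a worked illustration of the geometry described above, the height of the rotation centers of the grapple arms can be evaluated from the listed specifications. This is a reconstruction from the prose definitions, not the typeset form of Equation 4; it assumes that the upper limit of the graspable range lies at $y_c + H/2$, which is consistent with $p_1$ and $p_2$ lying at $y_c - H/2$.

```latex
\begin{align}
b_1 = b_2 &= y_c + \frac{H}{2} + H_c = y_c + 0.229 + 0.035 = y_c + 0.264\ \text{m}, \\
b_1 - R &= y_c + 0.264 - 0.660 = y_c - 0.396\ \text{m}.
\end{align}
```

Under this assumption, the lowest points of the arm-tip trajectories, $p_3$ and $p_4$, lie about 0.396 m below the center of the graspable area.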

Alignment of the coordinate systems
The estimated gripping positions and orientations were output in the ROS coordinate system. To compare these positions with the true, measured values of the gripping positions of the logs in the TS coordinate system, it is necessary to align the ROS coordinate system with the TS coordinate system. The alignment procedure is detailed below. For the experiments with both sparse and dense logs, the initial camera position and orientation in the ROS coordinate system were aligned with the TS coordinate system at the start of the measurements. This process established a transformation between the initial positions of the TS and ROS coordinate systems. The gripping positions estimated during camera movements were subsequently transformed into the TS coordinate system. During this process, the positions and orientations of the camera in the ROS coordinate system were calculated using simultaneous localization and mapping (SLAM), which outputs the map and the position and orientation of the camera. Finally, all the estimated gripping positions were transformed into the TS coordinate system by applying coordinate transformations using the initial position of the camera and the positions and orientations of the camera during each experiment.
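The chain of transformations described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names and the frame labels `T_ros_cam` (camera pose in the ROS frame, from SLAM) and `T_ts_ros` (ROS frame expressed in the TS frame, from the initial alignment) are hypothetical, and rotations are restricted to yaw only for brevity, whereas real camera poses carry a full 3D rotation.

```python
import numpy as np

def make_transform(yaw_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about Z, then translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = translation
    return T

def camera_point_to_ts(p_cam, T_ros_cam, T_ts_ros):
    """Map a 3D point from the camera frame to the TS frame."""
    # Homogeneous coordinates let rotation and translation be applied together.
    p_h = np.append(np.asarray(p_cam, dtype=float), 1.0)
    return (T_ts_ros @ T_ros_cam @ p_h)[:3]

# Hypothetical poses: camera 1 m ahead of the ROS origin; ROS frame rotated
# 90 degrees and shifted 10 m in the TS frame.
T_ros_cam = make_transform(0.0, [1.0, 0.0, 0.0])
T_ts_ros = make_transform(np.pi / 2, [10.0, 0.0, 0.0])
grip_ts = camera_point_to_ts([1.0, 0.0, 0.0], T_ros_cam, T_ts_ros)
```

Composing the two matrices in this order applies the SLAM pose first and the initial ROS-to-TS alignment second, so any error in either transform propagates directly into the TS-frame gripping position.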
For the unorganized logs, it was anticipated that significant positional and orientational errors could arise using the aforementioned approach due to the extended measurement times and travel distances. Therefore, the transformation between the TS and ROS coordinate systems was performed via a point cloud map. The coordinates were transformed by aligning the large end of each log in both coordinate systems. The positions of the large end of each log in the ROS coordinate system were obtained manually from the points acquired from SLAM. The positions and orientations of the camera in the ROS coordinate system at each time were estimated using the same approach as for the sparse and dense logs.
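One common way to align two coordinate systems from corresponding landmarks, such as the large ends of the logs, is a least-squares rigid fit (the Kabsch algorithm). The text does not state which fitting method was used, so the following is only a sketch under that assumption; `fit_rigid_transform` is a hypothetical name.

```python
import numpy as np

def fit_rigid_transform(src, dst):
    """Least-squares rotation R and translation t such that dst ~ R @ src + t.

    src, dst: (N, 3) arrays of corresponding points (N >= 3, not collinear).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

With `src` holding the log-end positions picked from the SLAM point cloud and `dst` the same log ends measured by the TS, the returned `R` and `t` map ROS coordinates into TS coordinates.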

Results using instance segmentation
The results of training the log detection model using instance segmentation are shown below. On the test dataset, recall$_{50}$ was 46.61, mAP was 62.01, and AP$_{50}$ was 74.37. The segmentation accuracies in the three experiments were recall$_{50}$ of 33.12, 14.59, and 22.73; mAP of 45.59, 10.75, and 24.93; and AP$_{50}$ of 59.41, 22.90, and 40.59 for the sparse logs, dense logs, and unorganized logs, respectively. Examples of detections and misdetections in the test dataset and the experiments are shown in Figures 7 and 8.
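The subscript 50 in these metrics refers to an intersection-over-union (IoU) threshold of 0.5 between a predicted mask and a ground-truth mask. The matching criterion can be sketched as follows; the function names are hypothetical, and the sketch covers only the per-mask test, not the full averaging over confidence levels that AP and mAP involve.

```python
import numpy as np

def mask_iou(pred_mask, true_mask):
    """Intersection over union of two boolean masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 0.0  # two empty masks: treat as no overlap
    return float(np.logical_and(pred, true).sum() / union)

def is_match_at_50(pred_mask, true_mask, iou_threshold=0.5):
    """A prediction counts as a true positive if its IoU reaches the threshold."""
    return mask_iou(pred_mask, true_mask) >= iou_threshold
```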

Estimation of the gripping positions of the logs
Table 2 shows the errors in the estimated gripping positions. For the sparse logs, the RMSE was 0.162, 1.526, and 0.140 m in the radial, axial, and vertical directions, respectively. For the dense logs, the RMSE was 0.384, 0.271, and 0.119 m in the radial, axial, and vertical directions, respectively. For the unorganized logs, the RMSE was 0.764, 1.022, and 0.194 m in the radial, axial, and vertical directions, respectively. Table 3 shows the errors in the estimated gripping orientations. For the sparse logs, the RMSEs of the orientation errors were 0.249, 0.388, and 0.533 radians in roll, pitch, and yaw, respectively. For the dense logs, the RMSEs were 0.231, 0.594, and 0.617 radians in roll, pitch, and yaw, respectively. For the unorganized logs, the RMSEs were 0.274, 0.736, and 0.288 radians in roll, pitch, and yaw, respectively. The estimated positions and orientations in each experiment are illustrated in Figure 9. Additionally, to illustrate the estimated gripping position and orientation in 3D, the viewpoint was rotated by π/4 radians around both the X-axis and Y-axis in the TS coordinate system. These rotated viewpoints are shown in Figures 10 and 11. Note that not all estimated orientations of the logs are displayed in Figures 9-11 in order to enhance the visibility of the figures. The processing times for the entire set of experiments were as follows: instance segmentation and MOT took an average of 147 milliseconds, while the estimation of the gripping position and orientation, as described in the section "Acquisition of 3D points and estimation of the gripping position and orientation," took an average of 35 milliseconds. In total, an average of 183 milliseconds was required from image acquisition to the output of the gripping position.
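The per-axis RMSE reported above can be computed as in the following sketch. It assumes that estimated and measured gripping positions are already expressed in the log coordinate system with columns ordered as radial, axial, and vertical; the function name and the sample values are illustrative only and are not taken from the experiments.

```python
import numpy as np

def rmse_per_axis(estimated, measured):
    """Root mean square error along each column of two (N, 3) arrays."""
    err = np.asarray(estimated, dtype=float) - np.asarray(measured, dtype=float)
    return np.sqrt(np.mean(err ** 2, axis=0))

# Illustrative values only: two estimates against a reference at the origin.
estimated = [[0.3, 0.0, 0.1], [-0.3, 0.4, 0.1]]
measured = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
radial_rmse, axial_rmse, vertical_rmse = rmse_per_axis(estimated, measured)
```

Because the errors are squared before averaging, a few large outliers, such as a road surface misdetected as a log, can dominate the RMSE along an axis.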
The method proposed in this study estimates the gripping position from single-shot information within the camera's field of view, which can introduce errors due to occlusion. To analyze this, the rotational errors in the camera coordinate system were calculated as shown in Figure 12. The effect of log occlusion was evaluated by calculating the rotational error between the log's orientation and the camera's X-axis. In this study, the rotational errors around the Z-axis of the camera, corresponding to the log's yaw, were considered to be crucial indicators of the log's orientation relative to the camera. Because of how the camera was installed, the yaw axis of the log and the Z-axis of the camera do not coincide exactly. Figure 13 shows the absolute rotational error between the camera and the axis of the log, along with the positional errors of the log-gripping position in each axis. Figure 14 illustrates the absolute rotational error between the camera and the axis of the log, along with the rotational errors of the log-gripping position in each axis.
Note: Rotations around the radial, axial, and vertical axes of the log in Figure 3 were referred to as roll, pitch, and yaw, respectively.

Log detection using instance segmentation
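Detecting the shapes of logs is important for assuring the accuracy of the proposed estimation method. Logs were detected with AP$_{50}$ of 74.37 and mAP of 62.01 on the test set. Usui (2021) reported that logs were detected with mAP 81.1 by combining training using 295 images and RandAugment (Cubuk et al. 2020) using Mask R-CNN (He et al. 2020). Fortin et al. (2022) also reported that Mask2Former (Cheng et al. 2022) detected logs with mAP of 57.53 using the TimberSeg 1.0 dataset, which consists of 220 images containing 2500 instances of logs. In addition, Fortin et al. (2022) reported that both precision and recall reached 80% when considering the 0.5 IoU curve. The instance segmentation result on the test dataset of this study, AP$_{50}$ of 74.37 and mAP of 62.01 using YOLACT++, is slightly lower than, but of almost the same accuracy as, that of previous research. Compared with the test dataset, the mAP and AP$_{50}$ were lower in all experiments. As demonstrated in Figure 8, certain logs remained undetected in each experiment. Consequently, when applying these segmentation methods to autonomous loading, there is a risk of overlooking some logs and failing to load all of them. Therefore, the log segmentation accuracy must be improved before the proposed method can be applied in practice.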
To address this issue, two possible solutions are suggested: (1) the use of higher-precision segmentation and (2) the implementation of data augmentation. Regarding higher-precision segmentation, various methods, including Mask R-CNN (He et al. 2020) and Mask Scoring R-CNN (Huang et al. 2019), have shown higher accuracy than the YOLACT++ employed in this study. Although these methods run more slowly than YOLACT++, they are expected to improve the segmentation accuracy of the logs. Concerning data augmentation, research suggests that techniques such as the simple copy-paste method (Ghiasi et al. 2020) can enhance accuracy. Thus, future application of such methods could reduce the number of missed logs.

Trends of the estimated results
For the sparse and unorganized logs, errors in the axial direction tended to be higher than those along either the radial or vertical axes, reaching a maximum of 2.052 m for the sparse logs and 3.175 m for the unorganized logs. Conversely, the dense logs exhibited lower axial RMSEs. These error trends appear to be due to the log shapes and to misdetections. In general, significant errors were observed when a non-log point was erroneously identified as a log. In the case of the unorganized logs, notable errors in the estimated gripping positions were observed, primarily due to the misdetection of the road as logs. An example of such misdetection is shown in Figure 8(f). Furthermore, the gripping positions estimated in this study were extracted from points on the surfaces of the detected logs. Consequently, the estimated gripping position did not fall outside the log. Since the axial length of a log is greater than its extent in either the radial or the vertical direction, the range of estimated gripping positions in the axial direction was larger than that in either the radial or vertical directions.
In this study, all the detected logs were processed uniformly, regardless of their instance segmentation score, which indicates the reliability of the detection. Misdetections of the road as logs received low scores, whereas actual logs at close range were detected with scores above 0.9. This result indicates that a score threshold can help exclude misdetections. To validate this effect, score thresholds ranging from 0.0 to 1.0 were applied to the instance segmentation results of the experiment with unorganized logs. The evaluation metrics reached mAP of 28.63 with score thresholds of 0.4-0.9 and AP$_{50}$ of 44.55 with score thresholds of 0.2-0.3. Therefore, based on these scores, errors due to misdetections can be decreased by setting appropriate thresholds.
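Applying a score threshold before estimating gripping positions can be sketched as follows. The detection record layout (a dictionary with a `score` key) and the sample scores are hypothetical and are not the output format of the segmentation model used in this study.

```python
def filter_by_score(detections, threshold):
    """Keep only detections whose confidence score reaches the threshold."""
    return [d for d in detections if d["score"] >= threshold]

# Hypothetical detections: a close-range log and a low-score road misdetection.
detections = [
    {"label": "log", "score": 0.93},
    {"label": "log", "score": 0.25},  # e.g. road surface misdetected as a log
]
kept = filter_by_score(detections, threshold=0.4)
```

Raising the threshold trades missed logs for fewer false detections, so the value would need to be tuned against both the detection metrics and the resulting gripping-position errors.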
Based on Equation (11), there were 74 (52.1%) points in the graspable range of the grapple head in the sparse logs experiment, 50 (46.3%) points in the dense logs experiment, and 53 (61.6%) points in the unorganized logs experiment. The points in the graspable range in each experiment are depicted in Figure 15. Based on Equations (2) and (3), the rotation centers of the grapple arms and the trajectories of the grapple arm tips are also shown in Figure 15. In this study, the estimated gripping position was set on the log surface. Therefore, if the estimated gripping position was located near the lower limit of the Y-axis in the gripping range, it could result in a trajectory where the grapple merely contacts the surface of the log. Conversely, it is possible to achieve a successful grip with the grapple arms even when the gripping points are located above the upper part of the log, potentially causing the grapple arms to become embedded in the log's surface. To distinguish between these situations, the estimated gripping points were separated into two categories: (1) those for which $P_{\mathrm{in}}$ includes the center of the circular cross section of the log and (2) those for which it does not. In the first condition, $P_{\mathrm{in}}$ extends into the lower part of the log; this condition was called "graspable." In contrast, the second condition meant that $P_{\mathrm{in}}$ covered only the upper part of the log; this condition was called "on the border." Based on these categories, there were 74 (52.1%) "graspable" points in the sparse logs experiment, 39 (36.1%) points in the dense logs experiment, and 50 (58.1%) points in the unorganized logs experiment. Similarly, there were no "on the border" points in the sparse logs experiment, 11 (10.2%) points in the dense logs experiment, and 3 (3.5%) points in the unorganized logs experiment. These results indicate that roughly half of the points were within the graspable range across the experiments. Even when counting only the points categorized as "graspable," more than half of the points were within $P_{\mathrm{in}}$ for the sparse logs and unorganized logs. Consequently, more than half of the estimated points for the sparse logs and unorganized logs could be gripped successfully, assuming that the logs were gripped one by one. However, in dense log scenarios, such as loading or unloading from log stacks or from the platform of a forwarder, precise manipulation is required to insert the grapple arms into the spaces between logs. The estimation method proposed here exhibited an RMSE of 0.384 m in the radial direction for dense logs; hence, it is not appropriate for gripping a single log from a dense pile of logs.

Error factors
Several factors can be considered as causes of error. The first is the self-position estimation of the grapple chassis using SLAM. In our experiments, the gripping positions of logs were estimated during movements of the machine and camera in order to simulate a real operation. Therefore, an inaccurate position or orientation of the chassis results in an error in the global coordinate system even if the gripping positions are accurately estimated in the local coordinate system of the machine. The position and orientation of the machine were acquired from the stereo camera software development kit using SLAM, and the accuracy of self-position estimation from images, i.e., visual SLAM, depends on the situation. Merzlyakov and MacEnski (2021) reported a benchmark of stereo visual SLAM, observing a 0.04%-0.11% localization error on the outdoor KITTI dataset. In addition, Sharafutdinov et al. (2023) reported a benchmark of the major open-source visual SLAM systems, observing a localization error of 0.02%-0.11% on the same KITTI dataset.
The distances traveled in the sparse logs, dense logs, and unorganized logs experiments were calculated from the position coordinates to be 35.154, 6.893, and 288.231 m, respectively. Assuming an error in the range from 0.02% to 0.11%, localization errors of 0.007-0.039 m in the sparse logs experiment, 0.001-0.008 m in the dense logs experiment, and 0.058-0.317 m in the unorganized logs experiment may have occurred. The RMSE between the grapple chassis trajectory measured using the TS and that calculated from the positions and orientations obtained from SLAM was 0.237 m for the unorganized logs. This corresponds to an error of 0.082%, which is similar in accuracy to previous research (Merzlyakov and MacEnski 2021; Sharafutdinov et al. 2023). These errors may have exerted a limited influence on operations, since the coordinate values required for automated operations were those expressed in the local coordinate system of the machine.
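The drift estimates above follow from multiplying each travel distance by the assumed relative error range. The arithmetic can be reproduced with the short sketch below; the function names are illustrative.

```python
def localization_error_bounds(distance_m, low_pct=0.02, high_pct=0.11):
    """Expected drift range (m) for a travel distance and a relative error range (%)."""
    return distance_m * low_pct / 100.0, distance_m * high_pct / 100.0

def relative_error_pct(rmse_m, distance_m):
    """Trajectory RMSE expressed as a percentage of the distance traveled."""
    return rmse_m / distance_m * 100.0

# Travel distances reported for each experiment (m).
distances = {"sparse": 35.154, "dense": 6.893, "unorganized": 288.231}
bounds = {name: localization_error_bounds(d) for name, d in distances.items()}
unorganized_pct = relative_error_pct(0.237, distances["unorganized"])
```

Rounded to three decimals, the bounds are 0.007-0.039 m, 0.001-0.008 m, and 0.058-0.317 m, and the measured trajectory RMSE of 0.237 m corresponds to about 0.082% of the distance traveled.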
A second error factor is log occlusion. In our experiments, the gripping position was estimated individually from the point cloud acquired from the stereo camera at each time. That is, point clouds from preceding or subsequent frames were not used in estimating the gripping position. Thus, the only information used was that obtained from the range visible to the camera, and blind-spot information could not be used. The density and accuracy of point clouds decrease with increasing distance; in these situations, point clouds of sufficient quality and quantity cannot be obtained for analysis. It is possible that this factor significantly affects the estimation of the center of gravity from the point cloud. The positional errors in Figure 13 show that a smaller absolute rotational error between the camera and the log corresponded to an increased axial error in the dense logs and unorganized logs experiments (p < 0.001). The correlation coefficients between the absolute rotational error of the camera and the positional and orientational errors along each axis are listed in Table 4. A state with minimal absolute rotational error between the camera and the log orientation represents an alignment of the axis of the log with the X-axis of the camera, indicating that the log is oriented vertically to the camera. In this configuration, a smaller number of log points was obtained far from the camera, whereas a larger number of log points was obtained near the camera. The variability in the number of points depending on the distance from the camera is one of the factors contributing to the axial errors in the estimated gripping positions.
Lindroos et al. (2015) considered the positional accuracy of harvester heads required for various forestry operations and argued that centimeter-level accuracy is essential for machine automation. When forestry operations are automated, the lowest positional accuracy among the processes becomes the bottleneck for the accuracy of the entire operation. Therefore, the gripping position of the target tree or log, as well as the position of the harvester head, must be accurate at the centimeter scale.
In our experiments, the gripping positions were estimated with a radial RMSE of 0.162 m for the sparse logs, 0.384 m for the dense logs, and 0.764 m for the unorganized logs. These positions are of sub-meter accuracy, but they do not reach the centimeter-to-millimeter accuracy that Lindroos et al. (2015) have argued to be essential for machine automation. It is therefore concluded that, using the system proposed in this research for estimating the gripping position, it is possible to grip logs only when they are placed separately. However, it is not possible to satisfy the accuracy required for loading or unloading from stacked logs or from the loading platform of a forwarder. For operations that require precise control, it may be possible to improve the accuracy in estimating the gripping position by using the following methods: (1) stopping the movement and averaging the measurements when estimating the gripping position; (2) reducing blind spots due to occlusion by simultaneously mapping with technologies such as visual SLAM and using blind-spot information; or (3) equipping the grapple chassis with another sensor, such as LiDAR, that is capable of measuring more precise information. Since the first solution is effective only for reducing the variance of the estimated gripping position, the other two solutions are considered effective against biased estimation errors, e.g., when a point cloud of only a part of a log is acquired due to occlusion.
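The distinction between variance and bias drawn above can be illustrated with a small numerical sketch. The values are synthetic and chosen only for illustration: each sample carries zero-mean jitter plus a constant axial offset standing in for an occlusion-induced bias.

```python
import numpy as np

def average_estimates(samples):
    """Average repeated (N, 3) gripping-position estimates taken while stationary."""
    return np.mean(np.asarray(samples, dtype=float), axis=0)

true_position = np.array([0.0, 0.0, 0.0])      # radial, axial, vertical (m)
occlusion_bias = np.array([0.0, 0.30, 0.0])    # synthetic constant axial bias
jitter = np.array([
    [0.05, -0.05, 0.00],
    [-0.05, 0.05, 0.00],
    [0.00, 0.00, 0.02],
    [0.00, 0.00, -0.02],
])                                             # synthetic zero-mean noise
samples = true_position + occlusion_bias + jitter
residual = average_estimates(samples) - true_position
```

In this example the jitter cancels in the average while the 0.30 m axial offset remains, which is why averaging alone would not correct errors caused by seeing only part of a log.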
The system proposed in this study processes each acquired image independently. To estimate the gripping position accurately, the entire log must be detected within the camera frame, because the segmentation dataset includes only entire logs, not parts of logs. Nevertheless, information about the entire log could potentially be used even if a part of the log is outside the camera's field of view, provided that supplementary data, images, or points concerning the camera's blind spots are available, possibly through techniques such as SLAM-based mapping.
In the integrated system proposed in this study, the stereo camera-based system is used mainly to move the grapple head from a visible distance to the vicinity of the gripping position. A distance sensor such as LiDAR installed at the grapple head could then be used to correct the gripping position and plan the path of the tip of the head. Furthermore, the step toward autonomous forest machinery involves other problems, e.g., the complex control of hydraulic systems. The hydraulic systems used in forest machinery exhibit non-linearities due to actuators, cylinders, and delays. The same problem occurs in the automation of construction machinery (Dadhich et al. 2016). Precise control of the grapple head for autonomous log loading is therefore a significant challenge. Quantitative step-by-step analyses of the errors in each of these processes, including this study, could ultimately lead to precise autonomous control of forest machinery.
Although the estimation of gripping positions in this study was directed at individual logs, the methodology could be extended to estimate the gripping positions and orientations of multiple logs by processing closely segmented logs together. In actual loading operations, handling multiple logs simultaneously is common practice. Therefore, it is important for future studies to consider expanding the proposed methodology to encompass the estimation of gripping positions for handling multiple logs.

Conclusions
This study proposed a method for estimating the gripping positions of logs by integrating log detection with instance segmentation and evaluated its feasibility for use in a realistic log loading environment. The resulting estimations of the positions for gripping logs exhibited RMSEs in the radial, axial, and vertical directions of 0.162, 1.526, and 0.140 m for sparse logs; 0.384, 0.271, and 0.119 m for dense logs; and 0.764, 1.022, and 0.194 m for unorganized logs, respectively. From calculations based on the dimensions of the grapple head used in this study, 52.1% of the points were within the graspable range of the grapple head for the sparse logs, 36.1% for the dense logs, and 58.1% for the unorganized logs. The method presented herein is sufficiently accurate to enable the grapple arms to grip a single, sparse log; however, it did not have the accuracy required to grip dense logs. For operations that require precise control, such as loading log stacks, the accuracy of the proposed approach may be enhanced by combining the system presented in this research with visual SLAM-based mapping to reduce blind spots and using LiDAR for short-range sensing capabilities. This system for estimating the gripping positions could be employed in the development of autonomous forestry machinery.

Figure 1. The outline of the algorithm for estimating the gripping position and orientation of a log.
Figure 5(b) shows the trajectory of the chassis and the position of the logs during the experiment.
Experiment 3: unorganized logs
Finally, an experiment was performed for a situation with randomly placed, unorganized logs surrounded by trees to

Figure 4. Experimental setup. The grapple loader is equipped with a stereo camera to estimate the log gripping position. The total station was used to measure the positions and orientations of the logs.

Figure 5. Trajectory of the machine chassis in the experiment by using simultaneous localization and mapping (SLAM). (a) Sparse logs. (b) Dense logs. (c) Unorganized logs.

Figure 6.A three-dimensional diagram of the grapple head.

Figure 7. Examples of log detection in the test dataset.

Figure 8. Examples of log detection during the experiments. (a) Successful detection in sparse logs. (b) Misdetection in sparse logs. (c) Successful detection in dense logs. (d) Misdetection in dense logs. (e) Successful detection in unorganized logs. (f) Misdetection in unorganized logs.
using Mask R-CNN (He et al. 2020). Fortin et al. (2022) also reported that Mask2Former (Cheng et al. 2022) detected logs with mAP of 57.53 using the TimberSeg 1.0 dataset, which consists of 220 images containing 2500 instances of logs. In addition, Fortin et al. (2022) reported that both precision and recall achieved an accuracy of 80% when considering the 0.5 IoU curve. The result of instance segmentation in the test dataset of this study was 74.4 in AP$_{50}$ and 62.01 in mAP using YOLACT++; it is slightly lower but of almost the same accuracy as in previous research. Compared to the test dataset, during the experiments the mAP and AP$_{50}$ exhibited lower values across all experiments.

Figure 12. Evaluation outline of the log-camera rotational errors in the camera coordinate system. The rotational errors α between the X-axis of the camera and the orientation of the log were evaluated.

Figure 13. The positional relations of the absolute rotational error between the X-axis of the camera and the log orientation around the Z-axis of the camera. (a) Sparse logs. (b) Dense logs. (c) Unorganized logs.

Figure 14. The orientational relations of absolute rotational error between the X-axis of the camera and the log orientation around the Z-axis of the camera. (a) Sparse logs. (b) Dense logs. (c) Unorganized logs.