Automobile indexation from 3D point clouds of urban scenarios

In this paper, we introduce a methodology for the detection and segmentation of automobiles in urban scenarios. We use a LiDAR Velodyne HDL-64E to scan the surroundings. The method comprises three steps: (1) removal of facades and the ground plane and segmentation of the remaining unstructured objects, (2) smoothing of the data using robust principal component analysis (RPCA), and (3) modelling and indexing of the unstructured objects. The dataset is partitioned into a training set of 4500 objects and a test set of 3000 objects. The Mean Shift thresholds, the filter, the Delaunay parameters, and the histogram modelling are optimized via ROC analysis. We observe that the car scan quality affects our method to a lesser degree than it affects state-of-the-art methods.


Introduction
Object recognition from point clouds is a challenging computer vision problem due to noise, sparse data, and the wide variability of urban scenarios. Moreover, data acquired by the LiDAR Velodyne HDL-64E sensor contains partially scanned objects, making the problem more challenging; a common practice is to register multiple point clouds [1]. This paper proposes a method for the segmentation and recognition of automobiles in LiDAR-generated point clouds. Figure 1(a) shows our acquisition platform, capable of performing mobile mapping from static or dynamic positions. The platform consists of three data sources: a LiDAR, a panoramic camera, and a Global Positioning System (GPS). Mobile mapping refers to the collection of data from multiple georeferenced sources. Applications are numerous, including cartography, archaeology, geography, geomorphology, seismology, and atmospheric physics.
3D modelling of cities can help solve traffic problems, prevent disasters in mines, and support the design of cities with organized growth [2]. For example, in [3], the authors perform 3D building detection and modelling by processing airborne LiDAR point clouds. Among safety applications of airborne LiDAR scanning, the work of [4] stands out: the authors monitor power-line networks for vegetation clearance, noting that the safety of the electrical network infrastructure significantly affects our daily lives. In autonomous driving, the work presented in [5] stands out: the authors segment and classify objects from point clouds obtained with a LiDAR mounted on the roof of a vehicle, combining 2D and 3D techniques and reaching a processing rate of 0.1 FPS. Object segmentation in 3D point clouds is a growing field of study due to the need to characterize and recognize objects scanned with LiDAR and to segment sizeable 3D point clouds. Object segmentation is an early step towards more advanced robotic behaviours; for example, robots need to localize objects before attempting tasks such as grasping, manipulation, or path planning [6][7][8][9][10][11]. In [12], urban scene segmentation is proposed as a solution for automating mobile robots. The authors stored the buildings' facades and the ground in one point cloud and the foreground in another; finally, they grouped objects such as cars, people, and walls.
Google's autonomous driving cars [13] can detect and track obstacles on their way for safe driving. The equipment of this car includes multiple sensors and cameras: a LiDAR that generates a map of the environment; radars that detect the closeness of objects, allowing safe navigation in traffic; cameras located on the rear-view mirrors to detect traffic lights; and a GPS, an IMU (Inertial Measurement Unit), and an encoder on one wheel that determine the precise location of the car. Their system combines laser measurements with high-resolution maps to locate the car. Our approach is quite different: we aim to segment parked or moving cars in urban scenes to perform a 3D reconstruction of the scene without the cars, leaving the rest of the scene objects such as facades, trees, lampposts, etc. In [14], the authors propose a data fusion system based on a laser scanner and computer vision. Pedestrians are detected using a pattern matching approach on the LiDAR data and Histograms of Oriented Gradients (HOG) on the camera data. Both detections are fused, and the movement of the pedestrians is computed with Kalman Filter (KF) and Unscented Kalman Filter (UKF) approaches.
Regarding the indexing and matching of 3D point clouds, [15] introduced a recognition approach based on the indexing of 3D interest points. Each site is represented by a set of interest points, where each point carries a descriptor vector; a comparison between two sets of points decides the similarity between two places. In [16], the authors proposed a method to detect interest points in 3D meshes using a modified Harris detector. We take a different approach to indexing, using a histogram of the directions of the normal vectors to the object surface. To obtain a 3D reconstruction of the urban environment without obstacles such as pedestrians and parked cars, we need to detect and remove these obstacles automatically. This work introduces a new method for the segmentation, filtering, and detection of automobiles in point clouds.

Object segmentation
Point clouds of urban environments contain structured and unstructured objects. Structured objects are the ground and the facades; unstructured objects, such as trees, pedestrians, and cars, do not have simple shapes. Our segmentation method first detects and extracts the structured objects from the point cloud; the remaining points are then segmented using Mean Shift. In [17], the authors provide a detailed assessment of the Mean Shift algorithm for tree segmentation using airborne LiDAR data. Figure 1(b) shows the segmentation algorithm's modules and their place in our automobile indexing approach.

Planes extraction
Points from the ground plane are detected and extracted from the point cloud using the method proposed in [18]; then, we detect and extract the facades. The extraction of the points corresponding to the ground plane uses a threshold α. In this work, we derive the threshold from the measurement system's uncertainty using Equation (1) from [19]. This expression is commonly used to calculate the uncertainty of an instrument's measurements; we adapted it to obtain a reference value that discriminates planes from other objects. We use the calibration obtained in [20], where the variance of the sensor calibration is 2.22 cm² and the mean error is 1.56 cm.
Where U²_cal is the calibration variance of the LiDAR sensor, U²_p is the sum of the errors in the measurement process, U_w is the average size of a sidewalk (13 cm), k is the coverage factor used to obtain a confidence level p = 94.5% in the uncertainty, and |b| is the mean error. The threshold is set to α = U = 15.46 cm.
We use the normal vector to the ground to define a new coordinate system for the points. Since the facades are planes perpendicular to the ground, we discard the third coordinate and work with the 2D projection of the points. Then, using the modified Hough transform [18], we search for the sets of points that model a plane. The parametric space is given by (ρ, θ), the parameters of the normal vector through the origin in the modified Hough transform. Finally, we segment the unstructured objects using Mean Shift to obtain their locations inside the point cloud. This algorithm groups points in a d-dimensional space, associating each point with the mode, or peak, of the data set's probability density function. Our method is composed of three stages: segmentation, filtering, and indexing. The segmentation stage extracts the ground, the planes perpendicular to the ground, and the unstructured objects. Filtering is applied to the objects to remove outliers. The indexing stage models each filtered object with a Delaunay triangulation and matches its histogram of normals against a library of models.
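The Mean Shift grouping of the remaining points can be sketched with scikit-learn as follows; the synthetic clusters and the bandwidth value are illustrative assumptions, not the ROC-optimized thresholds used in this work.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Two synthetic "unstructured objects": well-separated point clusters
# of the kind left after ground and facade removal (illustrative data).
rng = np.random.default_rng(0)
cloud = np.vstack([
    rng.normal(loc=(0.0, 0.0, 1.0), scale=0.3, size=(200, 3)),  # object 1
    rng.normal(loc=(8.0, 5.0, 1.0), scale=0.3, size=(200, 3)),  # object 2
])

# The bandwidth plays the role of the kernel radius; 1.5 m is a guessed
# value, not a threshold optimized via ROC analysis.
labels = MeanShift(bandwidth=1.5).fit_predict(cloud)
print(len(set(labels)))  # number of segmented objects
```

Each resulting label then identifies one candidate object to be passed to the filtering stage.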

Filtering
After removing the ground plane and the facades from the point cloud, the remaining points correspond to planes composed of few points and to unstructured objects such as trees, pedestrians, cars, and lampposts.
In this section, we describe the filtering step applied to the segmented objects. Filtering removes points that reduce the accuracy of the estimated normal vectors and lower the detection performance. Following an approach similar to [21], we estimate the normal vector in two steps: first, we determine the neighbourhood size (r in Equation (6)) for each point; second, we correct the estimation on edges and corners. To determine an appropriate neighbourhood size, we choose an initial value of r and reduce it iteratively until the condition of Equation (6) holds. Once the size is defined, we estimate a plane tangent to the neighbourhood and its normal. As a starting point, we obtain the k-nearest neighbours of each point p_i. The next step is to fit a plane to the surface using RPCA [22]; see Equations (2) and (3). For neighbour estimation, we use the Mahalanobis distance (MD), which measures the dispersion of the data with respect to p_w.
where W = {√w_1, . . . , √w_n} are the weights associated with each point p_i in the neighbourhood; the filtered point and the local curvature value are then defined from this weighted fit. The curvature determines the neighbour distance for each point: we use small threshold distances when the curvature is large and large threshold distances when the curvature is slight.
where p_i ∈ neig(P) corresponds to all points at a distance less than or equal to r from point p_i. The filtering requires two parameters: the initial search distance r for neighbours and the minimum number of points in a neighbourhood. A neighbourhood must contain a minimum number of points to be considered part of an object; otherwise, its points are treated as noise. Both parameters are determined by ROC analysis.
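A minimal sketch of the neighbourhood-based normal estimation follows. Plain distance-weighted PCA stands in for the RPCA fit of Equations (2) and (3), and the Gaussian weighting scheme is an assumption; the defaults r = 0.8 m and 14 minimum points mirror the ROC-selected values reported in the Results.

```python
import numpy as np
from scipy.spatial import cKDTree

def estimate_normal(points, i, r=0.8, min_pts=14):
    """Estimate the surface normal at points[i] from its r-neighbourhood.

    Distance-weighted PCA stands in for the RPCA fit used in the paper;
    r and min_pts follow the ROC-selected values (80 cm, 14 points).
    """
    tree = cKDTree(points)
    idx = tree.query_ball_point(points[i], r)
    if len(idx) < min_pts:
        return None  # too few neighbours: treat the point as noise
    neigh = points[idx]
    # Gaussian weights: closer neighbours influence the plane fit more.
    d = np.linalg.norm(neigh - points[i], axis=1)
    w = np.exp(-(d / r) ** 2)
    centroid = (w[:, None] * neigh).sum(axis=0) / w.sum()
    diff = neigh - centroid
    cov = (w[:, None] * diff).T @ diff / w.sum()
    # The normal is the eigenvector of the smallest eigenvalue of the
    # weighted covariance (np.linalg.eigh sorts eigenvalues ascending).
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, 0]
```

On a planar patch, the estimated normal aligns with the plane's true normal up to sign; points whose neighbourhood is too sparse are rejected as noise, as in the filtering described above.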

Indexing
Different object instances and object classes often have different geometric shapes. Thus, a geometric descriptor encodes an object's shape through specific geometric features to index a particular object instance or object class. In [23], the authors propose an approach that includes a novel formulation of a disparity term that simultaneously considers the structural similarity index. The indexing step takes as input the segmented and filtered objects and our dataset of object models. During indexing, the 3D objects are modelled and compared against the models in the dataset.

Modeling
Indexing and searching 3D models in a database is a process of coding and describing the shapes of the 3D models. The approach proposed in [24] classifies moving objects into four classes: vehicle, pedestrian, bicycle, and crowd; the authors use LiDAR and camera data and model the information using four number-of-point-based features, eleven shape features, and nine statistical features. In our work, objects' surfaces are described by the orientation of the normal vector of the tangent plane at each point of the surface. The modelling is performed in three steps: (1) Orientation normalization. Each segmented object is composed of a set of points Z_i oriented according to the data distribution; for this, we calculate the principal axes of the point distribution.

Results
In this section, we present the results of applying our method to detect vehicles on point clouds of urban environments.

Plane extraction
Planes are projected into lines as described in Section 1.1. We used the modified Hough transform to search for large sets of points that model lines. Each edge point has an associated parametric curve, and the intersections of these curves indicate the existence and position of collinear points: the more collinear points, the higher the probability of finding a plane. In this stage, the extracted facades corresponded to parametric positions with more than 600 collinear points. This parameter prevents the side views of cars from being confused with facades; however, facades with fewer than 600 collinear points were not detected. Table 1 shows the ground and perpendicular plane extraction results obtained with the modified Hough transform; the table shows the extraction of five perpendicular planes and the ground plane.
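A minimal version of the rho-theta voting used to find facade lines can be sketched as follows; the accumulator resolutions are illustrative assumptions, while the 600-vote threshold follows the value used in this stage.

```python
import numpy as np

def hough_lines(xy, n_theta=180, n_rho=200, threshold=600):
    """Minimal rho-theta Hough accumulator for facade lines.

    Points projected onto the ground plane vote for the line
    rho = x*cos(theta) + y*sin(theta); accumulator peaks with at least
    `threshold` collinear votes are kept as facade candidates (600 in
    the paper). The resolutions n_theta and n_rho are assumed values.
    """
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    rho_max = np.linalg.norm(xy, axis=1).max()
    acc = np.zeros((n_rho, n_theta), dtype=int)
    rhos = xy @ np.vstack([np.cos(thetas), np.sin(thetas)])  # (N, n_theta)
    bins = np.round((rhos + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
    for t in range(n_theta):
        np.add.at(acc[:, t], bins[:, t], 1)  # one vote per point per theta
    peaks = np.argwhere(acc >= threshold)
    rho_axis = np.linspace(-rho_max, rho_max, n_rho)
    return [(rho_axis[r], thetas[t]) for r, t in peaks]
```

For example, 700 points lying along the x-axis produce a single strong peak near (rho, theta) = (0, pi/2), i.e. the line y = 0, while side views of cars with only a few hundred points stay below the 600-vote threshold.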
After removing the ground and the building facades, the remaining points contain unstructured objects and plane fragments.

Segmentation of unstructured objects
The vehicles in our dataset have 1100 points on average, while their side views have 300 points on average; therefore, the side views of cars are not detected as facades. Even if multiple cars were aligned, the thresholds used to detect the facades would prevent them from being segmented as a plane. Figure 2(a) shows a 3D point cloud from which the ground plane and facades were extracted. The remaining points correspond to trees, cars, persons, and small plane segments. Figure 2(b) shows the objects segmented using Mean Shift; points belonging to the same object are painted in the same colour and labelled automatically (colours are repeated to improve visibility). We used a threshold of 150 points to discard small objects, which usually correspond to noise or to far away, poorly defined objects. The segmentation result passes through the filtering stage and then to modelling and indexing.

Filtering
Once the unstructured objects such as pedestrians, trees, walls, lampposts, telephone booths, and cars are segmented, they pass through a filtering stage to reduce noise. The 3D points acquired by the LiDAR Velodyne HDL-64E are noisy; several factors introduce noise in the 3D points, among them the LiDAR-object distance, the incidence angle, the object colour, and the object material [27].
Using ROC analysis, we computed the starting distance r for neighbours and the minimal number of neighbourhood points. The initial distance r is 80 cm, and the minimal number of points permitted in a neighbourhood is 14. The filtering method produces a smooth surface and eliminates points that do not belong to the object. Figure 3(a,b) shows the objects before filtering and Figure 3(c,f) after filtering. We partition our dataset into a training set of 4500 objects and a test set of 3000 objects and optimize the Mean Shift thresholds, the filter, the Delaunay parameters, and the histogram modelling via ROC analysis.
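The ROC-based evaluation behind this parameter selection can be illustrated with scikit-learn; the scores below are synthetic stand-ins for the detector's outputs, not data from this work.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic similarity scores: cars tend to match the car models well,
# non-cars poorly (hypothetical values for illustration only).
rng = np.random.default_rng(3)
is_car = np.concatenate([np.ones(100), np.zeros(100)])
scores = np.concatenate([
    rng.normal(0.8, 0.10, 100),  # cars
    rng.normal(0.5, 0.15, 100),  # trees, pedestrians, lampposts, ...
])

auc = roc_auc_score(is_car, scores)
print(f"AUC = {auc:.2f}")
```

In the paper, the same analysis is repeated over candidate parameter values (r, minimum neighbourhood size, Mean Shift thresholds), keeping the setting that maximizes the AUC.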

Modeling
After filtering, we modelled the objects using histograms of the directions of the normal vectors. Using Delaunay triangulation, we extracted the normal vector of each triangle to build the histograms, and we optimized the radius of the kernel filter of the Mean Shift method. Figure 3(c,d) shows the filtered point clouds of a car and a tree, respectively, and Figure 3(e,f) shows the same objects after triangulation. After triangulation, we can define the unique characteristics of each object based on its shape. Finally, Figure 3(g,h) shows the histograms of the car and the tree.
We used four types of cars to improve detection: sedan, compact (size between 4 m and 4.7 m), SUV, and hatchback. Using different types of cars allowed us to optimize the algorithm and improve detection.

Matching
In this work, our interest is to detect cars. We partition our dataset into 4500 objects for training and 3000 objects for testing. The objects include cars, trees, lampposts, pedestrians, walls, and ground segments. Table 2 shows the confusion matrices obtained on the test set with χ², histogram intersection, Haussler distance, Euclidean distance, and Earth Mover's distance. Figure 4(a,b) shows the ROC curves for the five methods; we separated the curves into two graphs to improve visualization. The best method was histogram intersection, with an area under the curve (AUC) of 0.8501. Table 3 shows the AUC for each of the five methods.
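Two of the compared metrics, histogram intersection (a similarity) and the χ² distance, can be sketched as follows; the two three-bin histograms are toy stand-ins for the normal-direction histograms of a car and a tree.

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms in [0, 1]; 1 = identical."""
    return np.minimum(h1, h2).sum()

def chi2_distance(h1, h2, eps=1e-12):
    """Chi-squared distance; 0 = identical, larger = more dissimilar."""
    return 0.5 * ((h1 - h2) ** 2 / (h1 + h2 + eps)).sum()

# Toy normalized histograms (illustrative values, not measured data).
car = np.array([0.5, 0.3, 0.2])
tree = np.array([0.1, 0.2, 0.7])

print(histogram_intersection(car, car))   # 1.0
print(histogram_intersection(car, tree))  # 0.5 (= 0.1 + 0.2 + 0.2)
```

A test object is assigned to the model class with the highest intersection (or the lowest distance, for the distance-based metrics).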

Discussion
To evaluate our proposal, we first compare the final results while modifying the steps we consider most crucial, determining the best method for each step in the flowchart of Figure 1(b); second, we compare our results with other techniques proposed in the literature. The flowchart in Figure 1(b) shows three principal processes: segmentation, 3D processing, and indexing; the evaluation of the segmentation process is out of the scope of this paper. Table 3 shows the evaluation of the 3D processing and indexing steps. Comparing the second and third columns, we can see that the final results improve when we filter the 3D points of the segmented objects. We also observe in Table 3 that the best metrics for the matching step are χ² and histogram intersection.
In the context of 3D city reconstruction [1,28], we joined 3D data acquired at different positions to better define the objects in the urban environment. Several methods for urban object recognition use the KITTI database; according to the size, truncation, and occlusion of the objects, the authors classify the objects in the KITTI database into three difficulty classes: easy, moderate, and hard. Figure 5 shows car examples from our database classified as easy, moderate, and hard, and Table 4 shows the number of objects in our database for each class.

Table 3. AUC of the five similarity metrics for the matching step. The second column shows the results using the procedure proposed in Figure 1(b); the third column shows the results without the 3D processing (object filtering) step of Figure 1(b).
We compare our method with other state-of-the-art approaches on the car class of the KITTI validation set for 3D detection, using the easy car class and Average Precision (AP) at an Intersection-over-Union (IoU) threshold of 0.7. Table 5 shows that the car scan quality affects our method to a lesser degree than it affects the state-of-the-art methods.

Conclusion and future works
In this work, we developed a new method for car detection in LiDAR point clouds. Our method has three parts: segmentation, filtering, and indexing. Segmentation rules out points belonging to the facades and the ground, keeping the remaining objects such as cars, trees, pedestrians, and lampposts. Filtering improves the quality of the segmented objects by removing outliers. Indexing models the objects with histograms of normal directions. For training and testing, we used a dataset acquired with our acquisition platform mounted on top of a car while driving around the city. We partitioned the dataset into training and test sets, obtaining a detection rate of 85.01% on the test set using histogram intersection.
This research can be extended to the construction industry: we could capture 3D point cloud data of construction sites, works, or equipment to enable better decision making in construction project management. In future work, we will create semantic 3D models from point cloud data; for this, the objects in the point cloud must be recognized and labelled into classes, e.g. wall, roof, floor, column, beam, etc.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
We thank CONACYT for supporting project number 669. We also thank the Instituto Politecnico Nacional for supporting project SIP-20210280.