An improved volumetric grid deep network model for point cloud segmentation

Voxel grids are widely used in point cloud segmentation because of their regularity. However, the memory consumption caused by high resolution restricts their performance. This paper proposes an improved voxel grid deep network (IVDN) model that represents more comprehensive point cloud features at the same resolution, thus improving point cloud segmentation performance. Firstly, the point cloud data are structured within a voxel bounding box to correspond with the three-dimensional (3D) convolution kernel, and a fixed number of point coordinates are selected to generate the point feature vector. Then, to account for the distribution characteristics, a reliability coefficient is used as an equivalent descriptor of the point cloud distribution density. Finally, a corresponding deep network is constructed to process these features. Experimental results show that the proposed IVDN model improves the mean classification accuracy and the segmentation index mIoU (mean intersection over union) by 0.45% and 0.3% respectively on the ShapeNet16 dataset.


Introduction
A convolutional neural network is a deep learning framework based on multi-layer neural networks for image classification and recognition. In recent years, a great deal of in-depth research has been done on image classification, segmentation, detection, etc. (Chen et al., 2018; Krizhevsky et al., 2017; Zeng et al., 2016; Zeng et al., 2018; Zeng et al., 2019; Zeng et al., 2020). Meanwhile, spatial three-dimensional (3D) point cloud data obtained by 3D measurement techniques such as lidar and depth cameras have been widely used in 3D reconstruction (Mescheder et al., 2019), unmanned vehicles, simultaneous localization and mapping (SLAM) and other areas. The application of convolutional neural networks to the classification and segmentation of 3D point clouds has broad prospects (Yang et al., 2018).
Compared with two-dimensional (2D) images, 3D point cloud data has the advantages of efficient data processing, flexible structure and rich information description. However, because point clouds are unstructured (Thomas et al., 2019), deep learning methods cannot be applied to them directly, so solving the unstructured problem of point clouds is particularly critical. Charles et al. (2017) first obtained a transformation matrix with a neural network and then used this matrix to give the point cloud the structural characteristics required by deep learning operators. However, the learning process of this transformation matrix is complicated and lacks generality. Maturana and Scherer (2015) and Zhou and Tuzel (2018) put the point cloud in a fixed voxel frame, by analogy with the matrix arrangement of pixels in a two-dimensional image. In this way, each spatial 3D point is uniquely determined by its voxel index within the frame. Huang and You (2016) labelled point clouds of real-world scenes by combining voxelized point clouds with the 3D convolution operator, which performs well in large-scale point cloud segmentation. However, for high-resolution point cloud data, the processing time is long and the memory consumption is heavy. PointGrid (Le & Duan, 2018) selects the coordinates of a fixed number of points inside each voxel as the point feature to recover fine detail. This method balances resolution and memory consumption to some extent, but a fixed number of points cannot describe distribution characteristics. This paper structures the point cloud with the voxel method and describes its local details by selecting point coordinates within each voxel. A reliability coefficient is constructed to highlight the differences between voxels while describing the distribution density. The improved voxel grid deep network (IVDN) model can thus describe more comprehensive point cloud features at a given resolution.
CONTACT Xinliang Zhang zxldq@hpu.edu.cn
Our contributions are summarized as follows.
(1) The shortcomings of the voxelized point cloud method in describing point cloud features are analysed.
(2) On the basis of selecting a fixed number of point coordinates to describe the detail information of the point cloud, a reliability coefficient is constructed to describe the distribution information.
(3) An IVDN network model for point cloud segmentation is proposed.
The remainder of this paper is organized as follows. Section 2 summarizes the relevant literature and highlights the relevant issues. Section 3 details the input layer of the IVDN model and the reliability coefficient. Section 4 introduces the network structure of the model. Section 5 discusses the evaluation results of the model, and compares them with the existing network structure. Section 6 presents the conclusions and discusses possibilities for future work.

Problem statement
Point cloud data describe three-dimensional objects as sets of spatial coordinates. A point cloud containing the points $A(x_a, y_a, z_a)$, $B(x_b, y_b, z_b)$ and $C(x_c, y_c, z_c)$ can be described by the following matrix.
$$M = \begin{bmatrix} x_a & y_a & z_a \\ x_b & y_b & z_b \\ x_c & y_c & z_c \end{bmatrix} \tag{1}$$
An elementary row interchange of matrix $M$ corresponds to a different storage order of the same point cloud. According to the elementary transformation property of matrices, the transformed matrix is equivalent to the original matrix $M$. However, a convolution operator produces different output values for these different orderings of the same points. Therefore, point cloud data need to be structured so that each point corresponds to its features, indexes, and coordinates.
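The order sensitivity described above can be illustrated with a minimal sketch. The coordinates, the `weights` vector and the helper `weighted_response` are hypothetical; the position-dependent weighted sum merely stands in for a convolution kernel and is not the operator used in the paper.

```python
# Three points A, B, C stored as rows of M (illustrative coordinates).
M = [
    [0.0, 0.0, 1.0],   # A
    [1.0, 2.0, 0.0],   # B
    [3.0, 1.0, 2.0],   # C
]

# Same point set, different storage order (rows A and C interchanged).
M_swapped = [M[2], M[1], M[0]]

# Hypothetical fixed weights, one per row position, standing in for a kernel.
weights = [0.5, -1.0, 2.0]


def weighted_response(matrix, w):
    """Weighted sum over rows; depends on which point sits in which row."""
    return [sum(w[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(len(matrix[0]))]


out_original = weighted_response(M, weights)
out_swapped = weighted_response(M_swapped, weights)

same_point_set = sorted(map(tuple, M)) == sorted(map(tuple, M_swapped))
print(same_point_set, out_original, out_swapped)
```

Both matrices hold the same set of points, yet the responses differ, which is why a fixed structure is imposed before convolution.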
Among the methods for structuring point clouds, VoxNet (Maturana & Scherer, 2015) and its variants (Brock et al., 2016; Li et al., 2016; Mutz et al., 2016; Qi et al., 2016; Wang et al., 2017; Wang & Posner, 2015; Xiao et al., 2017) are the most direct way to convert 3D models into occupied voxel grids. Although this approach addresses the unstructured problem of point clouds, its main disadvantage is the loss of too much 3D information. Figure 1 shows how well a voxelized point cloud describes the target. It can be seen from Figure 1 that the fineness of the voxel grid's description of the target depends on the resolution of the grid. To obtain more 3D information, the most intuitive method is to increase the resolution of the grid. But for three-dimensional point clouds, the amount of data grows cubically with the resolution, so even a small increase is costly. In addition, a regular voxel frame inevitably contains a large number of empty voxels, which consume a large amount of unnecessary computing power.
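A short calculation makes the cost of resolution concrete: doubling the number of cells per axis multiplies the number of voxels by eight. The helper `voxel_count` and the storage figure of one 4-byte value per voxel are assumptions made for this sketch only.

```python
def voxel_count(resolution):
    """Number of cells in a cubic grid with `resolution` cells per axis."""
    return resolution ** 3


for n in (16, 32, 64, 128):
    cells = voxel_count(n)
    # Assumption: one 4-byte float per voxel (a plain occupancy value).
    mib = cells * 4 / 2**20
    print(f"{n:>4} per axis -> {cells:>9} voxels, {mib:8.2f} MiB")
```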
In this paper, a fixed number of points in each voxel are selected to describe the target details more accurately at the same resolution. In addition, a reliability coefficient that highlights the differences between voxels is added to the point feature vector to increase its ability to describe the target. Section 3 describes the preprocessing performed in the input layer in detail.

Generation of the point feature vector
By using a voxel grid to structure the point cloud data, the mapping relationship 'point - voxel - index' is established. The index $V_i$ of the voxel to which point $I(x_i, y_i, z_i)$ belongs can be described by the following formula.
The parameters of the formula are the voxel resolutions along the X, Y and Z axes, respectively. Points with the same index are assigned to the same voxel, and the number of points within a voxel is obtained by counting the points that share its index.
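The point-to-voxel mapping can be sketched as follows. This is one common convention and not necessarily the paper's exact formula: the function `voxel_index`, the bounding-box arguments and the clamping of boundary points are all assumptions made for illustration.

```python
from collections import Counter


def voxel_index(point, bbox_min, bbox_max, resolution):
    """Map a 3D point to integer voxel indices (ix, iy, iz).

    Assumed convention (not taken from the paper): the bounding box is split
    into resolution[a] equal cells along axis a, and points on the upper
    boundary are clamped into the last cell.
    """
    idx = []
    for p, lo, hi, n in zip(point, bbox_min, bbox_max, resolution):
        cell = int((p - lo) / (hi - lo) * n)
        idx.append(min(max(cell, 0), n - 1))
    return tuple(idx)


points = [(0.1, 0.1, 0.1), (0.2, 0.15, 0.05), (0.9, 0.9, 0.9), (1.0, 1.0, 1.0)]
bbox_min, bbox_max = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
resolution = (4, 4, 4)

indices = [voxel_index(p, bbox_min, bbox_max, resolution) for p in points]

# Counting identical indices gives the number of points per voxel.
points_per_voxel = Counter(indices)
print(points_per_voxel)
```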
On the basis of the voxelized point cloud, the coordinates of points within each voxel are selected and extended into feature vectors to better reflect the details of the target contour. Figure 2 shows how the point information describes the target. A placeholder filling method is used so that exactly K points are selected for every voxel. If the number of points contained in a voxel is greater than or equal to K, K points are randomly selected. If the number of points in the voxel is less than K, the missing points are filled with (0,0,0). If the voxel does not contain any real points, it is filled with K placeholders (0,0,0). Each voxel is thus represented by K points whose coordinates form a K × 3 feature vector. The method above selects a fixed number of points, although the number of real points differs from voxel to voxel. To highlight the differences between voxels and enhance the descriptive ability of the model, a corresponding quantization parameter is constructed in this paper.
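The selection rule above can be sketched in a few lines. The function names `select_points` and `flatten` are hypothetical, and placing the placeholders after the real points is an assumption of this sketch.

```python
import random


def select_points(voxel_points, k, rng=random):
    """Return exactly k points for one voxel.

    - at least k points: k are drawn at random without replacement
    - fewer than k points: all are kept and (0, 0, 0) placeholders appended
    - empty voxel: k placeholders
    """
    if len(voxel_points) >= k:
        return rng.sample(voxel_points, k)
    padding = [(0.0, 0.0, 0.0)] * (k - len(voxel_points))
    return list(voxel_points) + padding


def flatten(points):
    """Concatenate k points into a feature vector of length 3k."""
    return [c for p in points for c in p]


K = 4
rng = random.Random(0)
dense = [(0.1 * i, 0.2 * i, 0.3 * i) for i in range(1, 7)]   # 6 real points
sparse = [(0.5, 0.5, 0.5)]                                   # 1 real point

for pts in (dense, sparse, []):
    feat = flatten(select_points(pts, K, rng))
    print(len(feat), feat)
```

Every voxel yields a vector of the same length, which is what allows the grid to be fed to a 3D convolution.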

The reliability coefficient
Define the reliability coefficient of the voxel data as
$$C = \frac{P}{K}$$
where $P$ is the number of real points in the voxel and $K$ is the number of points selected per voxel. When $P = 0$, the reliability coefficient is 0. That is, the K points used are all placeholders rather than real points.
When $0 < P < K$, then $0 < C < 1$, and $K - P$ of the points are placeholders.
When $P \ge K$, the reliability coefficient satisfies $C \ge 1$, and all K points are real and reliable. The reliability coefficient $C$ thus expresses how dense the real points $P$ in a voxel are relative to the number $K$ selected.
At the same resolution, the voxel detail information varies linearly with the number of selected points. The reliability coefficient is therefore both defined by the reliability of the data and a measure of it. After the coefficient is amplified linearly by a factor of K, it represents the distribution density of the point cloud directly. In other words, the higher the density, the higher the reliability. Therefore, the amplified reliability coefficient is introduced into the voxel feature vector as the density, which makes the voxel descriptors fed into the network more complete. Figure 3 shows the feature composition of the input layer.
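A minimal sketch of the coefficient, assuming it is the ratio C = P / K, which matches the three cases described above. The function name `reliability_coefficient` is hypothetical.

```python
def reliability_coefficient(p, k):
    """C = P / K: real points P relative to the K points selected per voxel."""
    if k <= 0:
        raise ValueError("K must be positive")
    return p / k


K = 4
for P in (0, 2, 4, 10):
    C = reliability_coefficient(P, K)
    placeholders = max(K - P, 0)      # zero-filled slots among the K points
    density = C * K                   # amplifying C by K recovers P
    print(f"P={P:>2}  C={C:.2f}  placeholders={placeholders}  density={density:.1f}")
```

Under this reading, the amplified coefficient equals the number of real points in the voxel, so sparse and dense voxels that share the same K selected coordinates remain distinguishable.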
Deep networks have great advantages in feature extraction. The purpose of structuring the point cloud is to make point cloud data suitable for deep networks. The main work of this paper is to improve the ability of the feature descriptors to describe local details on the basis of the structured point cloud, and to construct the reliability coefficient to supplement the distribution information. Therefore, any deep network that can handle structured point clouds is applicable to the method in this paper.

The network structure of IVDN
For point clouds with structural information, the 3D convolution operator can be used for feature extraction. The network structure of the IVDN is shown in Figure 4. The forward network extracts point cloud features at different levels through convolution and max pooling operations. Finally, the classification prediction is completed through fully connected layers.
The output of the 3D convolution operation in layer $l$ is described as follows.
$$v_{lm}^{xyz} = b_{lm} + \sum_{q} \sum_{i=0}^{f-1} \sum_{j=0}^{f-1} \sum_{k=0}^{f-1} w_{lmq}^{ijk} \, v_{(l-1)q}^{(x+i)(y+j)(z+k)}$$
where $v_{lm}^{xyz}$ represents the m-th sub-block of the output feature map of layer $l$ with starting position $(x, y, z)$ and size $f \times f \times f$, $b_{lm}$ is the bias of the corresponding sub-feature block, $v_{(l-1)q}^{(x+i)(y+j)(z+k)}$ traverses the convolution output feature map of layer $l-1$, and $w_{lmq}^{ijk}$ is the weight of the q-th kernel at $(i, j, k)$ in $v_{(l-1)q}^{xyz}$. The size of the feature map is
$$f_i = \frac{f_{i-1} - F}{s} + 1$$
where $f_i$ is the size of the feature map at layer $i$, $F$ is the size of the convolution kernel, and $s$ is the stride. The pooling output of layer $l$ is
$$p_{lm}^{xyz} = \max_{0 \le i, j, k < g} p_{(l-1)m}^{(gx+i)(gy+j)(gz+k)}$$
where $p_{(l-1)m}^{(gx+i)(gy+j)(gz+k)}$ denotes the values traversed in the m-th feature map sub-block at position $(x, y, z)$. The size of the pooling kernel is $g \times g \times g$ and $p_{lm}^{xyz}$ is the maximum output.
Each convolutional layer in the IVDN consists of a 3 × 3 × 3 convolution kernel with stride 1, a batch normalization and a rectified linear unit (ReLU). The first block uses 32-filter convolutions, and the number of filters is doubled in each successive block. The IVDN has two fully connected layers, each of which contains a ReLU activation and a Dropout layer. The last fully connected layer is followed by a softmax that outputs the probability of each category. The number of categories in the dataset determines the number of nodes in this layer.
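The convolution, feature-map size and pooling relations described above can be sketched in plain Python. This is a single-channel illustration under stated assumptions (no padding, stride 1 for the convolution, non-overlapping pooling), and the function names are hypothetical. It omits batch normalization and ReLU and is not the IVDN implementation.

```python
def conv_output_size(f_in, kernel, stride):
    """Feature-map size after an unpadded convolution: (f_in - F) // s + 1."""
    return (f_in - kernel) // stride + 1


def conv3d_single(volume, kernel, bias=0.0):
    """Naive single-channel 3D convolution, stride 1, no padding."""
    n, f = len(volume), len(kernel)
    out_n = conv_output_size(n, f, 1)
    out = [[[bias for _ in range(out_n)] for _ in range(out_n)]
           for _ in range(out_n)]
    for x in range(out_n):
        for y in range(out_n):
            for z in range(out_n):
                # Sum of kernel weight times input over the f x f x f window.
                out[x][y][z] += sum(
                    kernel[i][j][k] * volume[x + i][y + j][z + k]
                    for i in range(f) for j in range(f) for k in range(f)
                )
    return out


def max_pool3d(volume, g):
    """Non-overlapping g x g x g max pooling."""
    out_n = len(volume) // g
    return [[[max(volume[g * x + i][g * y + j][g * z + k]
                  for i in range(g) for j in range(g) for k in range(g))
              for z in range(out_n)] for y in range(out_n)]
            for x in range(out_n)]


# 4x4x4 input whose value encodes its position, and a 3x3x3 kernel of ones.
vol = [[[float(x + y + z) for z in range(4)] for y in range(4)]
       for x in range(4)]
ones = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]

conv = conv3d_single(vol, ones)   # 2x2x2 output
pooled = max_pool3d(conv, 2)      # 1x1x1 output
print(conv_output_size(4, 3, 1), conv[0][0][0], pooled[0][0][0])
```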
Through training, the forward network obtains the weights and biases of each convolution layer as well as the classification result. The backward segmentation network relies on the classification predictions of the forward network, and the forward features are in turn connected to the segmentation network. Point cloud segmentation is then realized by deconvolution, which can be implemented by reversing the forward and backward passes of the convolution. For a single-label point cloud, only the forward network is needed to complete classification and segmentation, whereas for a multi-label point cloud the backward network is needed to predict point-by-point labels.
The segmentation task is completed by mapping the predicted point labels to all points in the point cloud model. The mapping method corresponds to the point feature construction. If the number of fixed points selected is K, the number of output channels of the segmentation network is K+1. K channels correspond to the label predictions of the K points, and the remaining channel is the overall label of the voxel, which is determined by majority vote among the K points.
K points are selected from each voxel to form the feature vector. Correspondingly, the decoded predictions are mapped back to each point when the segmentation task is completed. For empty voxels, the label is empty. For voxels containing no more than K points, each point receives its predicted point label. If the number of points is greater than K, the K selected points receive the K channel predictions, and all other points are assigned the overall voxel label, which completes the label prediction of all points.
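The mapping from the K+1 output channels back to the points of one voxel can be sketched as follows. The function names are hypothetical, and the sketch assumes that real points occupy the first slots of the K selected points. Majority voting is used here only as a stand-in for the output of the extra channel.

```python
from collections import Counter


def majority_vote(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def map_labels(num_real_points, point_labels, voxel_label):
    """Return one predicted label per real point of a voxel.

    point_labels: predictions of the K per-point channels.
    voxel_label:  prediction of the extra (K+1)-th channel.
    Assumption: real points occupy the first slots, placeholders the last.
    """
    k = len(point_labels)
    if num_real_points == 0:
        return []                                    # empty voxel
    if num_real_points <= k:
        return list(point_labels[:num_real_points])  # one label per real point
    # More than K points: the unselected points inherit the voxel label.
    return list(point_labels) + [voxel_label] * (num_real_points - k)


preds = ["neck", "neck", "body", "neck"]   # K = 4 channel predictions
overall = majority_vote(preds)             # stand-in for the (K+1)-th channel

print(map_labels(0, preds, overall))
print(map_labels(2, preds, overall))
print(map_labels(6, preds, overall))
```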

Evaluation
The ModelNet40 (Wu et al., 2015) dataset and the ShapeNet16 (Yi et al., 2016) dataset are used to verify the effectiveness of the proposed algorithm.
The ModelNet40 dataset is a point cloud deep learning dataset provided by Princeton University, mainly used to verify the classification of point clouds. This dataset contains 40 categories of objects, with a total of 12,311 CAD models. A virtual scanner is used to obtain surface point cloud data, and each point cloud model is fixed at 10,000 points. The ShapeNet16 dataset is a subset of the ShapeNet model collection that contains point cloud data and point-by-point labels. It comprises 16,881 point cloud models covering 16 kinds of objects, which are divided into 50 parts in total, with each model divided into 2-6 parts. For this dataset, the point clouds are augmented with random rotation and jitter, where the rotation matrix is constructed from a Gaussian function with zero mean and a standard deviation of 0.02. This paper uses the augmented point cloud dataset and randomly selects 80% of the models for training and 20% for testing.
MIoU (mean intersection over union) is used to evaluate the segmentation performance and is defined as
$$\mathrm{MIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{p_{ii}}{\sum_{j=1}^{N} p_{ij} + \sum_{j=1}^{N} p_{ji} - p_{ii}}$$
where $N$ is the number of classes and $p_{ij}$ is the number of points that belong to class $i$ but are predicted to be class $j$. In the same way, $p_{ji}$ is the number of points that belong to class $j$ but are predicted to be class $i$, and $p_{ii}$ is the number of correctly predicted points. After 200 iterations, the forward training model of the segmentation network was obtained. In the test with this model, the mean classification accuracy on the ModelNet40 dataset is 85.7%, while the mean classification accuracy on the ShapeNet16 dataset is 98.6% and the MIoU is 77.3%.
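The metric can be computed from a confusion matrix, as in the sketch below. It uses the standard definition of mean IoU. The function name `mean_iou`, the toy matrix and the choice to skip classes that appear in neither the labels nor the predictions are assumptions of this sketch.

```python
def mean_iou(confusion):
    """Mean IoU, where confusion[i][j] counts points of true class i
    that were predicted as class j."""
    n = len(confusion)
    ious = []
    for i in range(n):
        correct = confusion[i][i]
        true_i = sum(confusion[i])                        # sum over j of p_ij
        pred_i = sum(confusion[j][i] for j in range(n))   # sum over j of p_ji
        union = true_i + pred_i - correct
        if union > 0:   # assumption: ignore classes absent from both
            ious.append(correct / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example: rows are true classes, columns are predictions.
conf = [
    [8, 2],
    [1, 9],
]
print(round(mean_iou(conf), 4))
```

For the toy matrix, the per-class values are 8/11 and 9/12, and the metric is their average.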
To evaluate the performance of the IVDN algorithm, the PointGrid algorithm was used for classification and segmentation on the ModelNet40 and ShapeNet16 datasets. Table 1 compares the mean classification accuracy of the two methods on the ModelNet40 dataset. As the results show, the mean classification accuracy of IVDN on the single-label ModelNet40 dataset is 0.45% higher than that of PointGrid.
Furthermore, the mean classification accuracy and MIoU of the two methods on the ShapeNet16 dataset are compared in Table 2, and Table 3 compares the classification accuracy and IoU of each category for the two methods on the ShapeNet16 dataset.
According to Tables 2 and 3, the mean classification accuracy of IVDN on the ShapeNet16 dataset is improved by 0.45%, and the MIoU is improved by 0.3%. Moreover, the improvement for the Earphone and Knife categories is especially pronounced. For Earphone, the classification accuracy increased by 20% and the IoU increased by 0.38%. For Knife, the classification accuracy increased by 12.8% and the IoU increased by 14.5%. The segmentation results of the two methods after visual rendering are compared in Figure 5. Figure 5(a-d) shows, respectively, the original point cloud, the ground truth, the segmentation result of PointGrid and the segmentation result of IVDN. The different colours in Figure 5 are only used to distinguish different parts. The segmentation of point clouds can be understood as the point-by-point classification of point clouds, so an improvement in IoU indicates that more points are assigned the correct label. The IoU of Guitar is 0.855467 with PointGrid and 0.864712 with IVDN. In the rendering of the guitar in Figure 5, the two methods differ little in the segmentation of the head of the guitar, while there are obvious differences in the neck and body. The IoU of Knife is 0.648448 with PointGrid and 0.793500 with IVDN, and the segmentation of the points in the middle of the knife differs more markedly. For Earphone, which has three parts, the IoU is 0.434958 with PointGrid and 0.438768 with IVDN, and there are obvious differences in the headphone, body and wire parts. These are visualizations of sampled segmentation results. The detailed segmentation results are given in Table 3.
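The per-category gains can be checked directly from the IoU values quoted above. The sketch assumes that the quoted improvements are differences in percentage points, and it attributes the value 0.793500 to Knife under IVDN, which is consistent with the stated 14.5% gain.

```python
# Per-category IoU values quoted in the text: (PointGrid, IVDN).
iou = {
    "guitar": (0.855467, 0.864712),
    "knife": (0.648448, 0.793500),
    "earphone": (0.434958, 0.438768),
}

# Difference expressed in percentage points.
gains = {name: (ivdn - pointgrid) * 100
         for name, (pointgrid, ivdn) in iou.items()}

for name, gain in gains.items():
    print(f"{name:<9} +{gain:.2f} percentage points")
```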

Conclusion
This paper selects a fixed number of points in each voxel to achieve a more accurate description of the target details. In addition, a reliability coefficient that highlights the differences between voxels is added to the point feature vector to increase its ability to describe the target. Based on this voxel descriptor, an improved volumetric grid deep network model was constructed to complete the point cloud segmentation task. According to the experimental results, the IVDN model improves the classification accuracy and MIoU, and significantly improves the results for the Earphone and Knife categories of the ShapeNet16 dataset.
The proposed IVDN provides a richer information description at a given resolution and constructs a complete deep network on top of this feature description. Other network optimization algorithms can be applied to it directly.