M3R-CNN: on effective multi-modal fusion of RGB and depth cues for instance segmentation in bin-picking

Picking tasks in logistics warehouses require handling many objects of various types, and the number of items handled is increasing daily. Therefore, high generalization performance is required for object detection in bin-picking systems in logistics warehouses, but conventional methods have yet to meet this requirement. To this end, we propose a Multi-modal Mask R-CNN (M3R-CNN) and its training method. M3R-CNN is a network for the instance-segmentation task that takes RGB and depth as input and achieves high generalizability with small training data. We trained this network on 561 scenes of training data using our proposed method and obtained a recognition accuracy of F1-score = 0.631 and mAP = 0.958 for unknown objects. We also performed an object-grasping experiment with a robot using the M3R-CNN and obtained an availability score of 0.97.


Introduction
With the rapid spread of online sales, automation of various warehouse tasks in the logistics industry has become an urgent necessity in recent years. In particular, picking and packing the various items stored in warehouses in response to customer requests requires processing more than ten items per minute, amounting to as many as several tens of thousands of items, and this number is increasing daily. Therefore, we are currently working on the practical application of an automatic picking system in a logistics warehouse.
In automated warehouses targeting the business-to-customer (B2C) domain, items are often stored by type in an aligned or semi-aligned state, and only a single item type exists in the target container, as seen in Figure 1(a). In addition, objects can only be observed and picked from the top of the container. In the scope of this paper, only the computer vision part of our bin-picking system is discussed.
To date, there has been extensive research on the automation of bin-picking tasks [1-4]. Recent studies for bin-picking often use state-of-the-art learning-based object recognition methods such as instance segmentation [5-7]. However, these conventional methods are not designed for frequent object updates and have yet to achieve the required recognition performance for unlearned (unknown) objects.
With the recent commoditization of RGB-D cameras, depth information is used in many practical scenarios in factory automation [5,8-10]. Some bin-picking studies attempt object recognition or instance segmentation with RGB-D input [7,11].
In our method, in addition to using the RGB and depth information, we also use strong prior knowledge of the shapes of unknown objects to improve their recognition. Many items in B2C warehouses have similar shapes, such as boxes or bags. Adding learning from shape observations to learning from ordinary texture observations may therefore effectively generalize learned object recognition models.
For bin-picking scenarios, this paper looks at the effective fusion of RGB and depth cues for instance segmentation and proposes the Multi-Modal Mask R-CNN (M3R-CNN), a practical RGB-D extension of Mask R-CNN [12], a well-known instance segmentation model. Specifically, we compare several architectures and learning strategies to fuse RGB and depth channels for bin-picking scenarios. Our experiments show that our method achieves accurate instance segmentation for practical bin-picking scenarios, as shown in Figure 1(d). We also develop a working robot system to showcase the effectiveness of the M3R-CNN in practical use cases. The experimental results show that our picking system achieves high performance for standard warehouse items.

Contribution:
The main contributions of this paper are summarized as follows.
(1) We propose a practical extension of Mask R-CNN, i.e. M3R-CNN, that effectively fuses RGB and depth cues for bin-picking tasks.
(2) We show that an appropriate transfer learning strategy across different modalities works, enabling our model to train on small datasets.
(3) We analyze our M3R-CNN on a practical robotic picking system, which achieves high availability.

Related works
For bin-picking tasks, this paper proposes a multi-modal instance segmentation method using RGB-D input and a suitable learning strategy, based on transfer learning, for relatively small datasets. This section therefore describes the prior work on each of these topics.

Object picking
For bin-picking tasks in controlled environments, e.g. factory automation, many studies propose model-based algorithms that use template shapes of the target objects [1-3,8,13], which have been put into practical use in applications where the template 3D shape of an object can easily be obtained from CAD data. On the other hand, the automation of in-the-wild tasks, such as agricultural operations, requires the picking of objects with highly deformable shapes. For this problem, several studies estimate parameters for robot operation by extracting hand-crafted empirical or statistical features designed according to the nature of the target objects [14-16].
With the rapid development of deep learning, recent picking methods often use learning-based methods for finding target objects. Pose estimation of predefined template shapes is sometimes used for model-based approaches [17,18], while general-purpose object recognition methods, namely object detection and segmentation, are widely used for picking tasks [12,19,20]. In addition, several studies focus on directly estimating the optimal location for a robot to grasp [21].
A practical difficulty of object picking is the development cost. Traditional methods require template model creation or feature selection. Learning-based methods, on the other hand, brought accurate object detection, but at the cost of the burden of training data collection.

Using shape information for instance segmentation
Learning-based instance segmentation, especially using deep neural networks (DNNs), is widely used for object detection tasks that require accurate object regions. Although many typical instance segmentation methods (e.g. Mask R-CNN [12]) were originally proposed for RGB input, several studies attempt to use 3D shape or depth input together with the ordinary RGB cue.
One way to provide RGB and depth cues to a CNN is to concatenate the respective data in the channel direction, resulting in a 4-channel input [22]. Although this method is simple, it is challenging to obtain satisfactory generalization performance when the RGB and depth data are tightly combined, e.g. when their resolutions and fields of view differ.
One way to address this problem is to input the RGB and depth cues into separate feature extraction networks and combine them at the final layer. In indoor and outdoor environment recognition tasks, it is common to convert depth data into a format similar to RGB data and then feed RGB and depth into independent feature extraction networks [23-25]. In factory automation, an attempt has also been made to simultaneously perform object detection and pose estimation by passing the depth data through a network capable of estimating object pose based on the segmentation results obtained from RGB [11].
Although all of these methods have achieved some success, they require domain-specific depth data format conversion and additional information, and their application to different domains requires the same amount of trial and error as using hand-crafted features.

Learning from small dataset
While few-shot learning [26,27] aims to learn from a small amount of data, DNN-based methods often require large amounts of training data, creating nontrivial difficulties for DNN applications, especially in areas where no existing datasets exist, including most bin-picking scenarios.
A common approach to this problem is to increase the number of training samples in some way. Widely used is data augmentation, which applies transformations to the existing data, such as image rotation [28], noise, and exposure changes. Automatic training data generation is also an option, for example, using physical simulators [29,30] or even simple mathematical formulas [31].
Another group of methods is transfer learning from domains with large datasets [32]. In the computer vision domain, large datasets for general object recognition, such as ImageNet [33], are frequently used to pre-train models. A well-known problem in transfer learning is the domain gap, i.e. the semantic difference between the source and target domains. One report shows that learning from scratch is more efficient than transfer learning from RGB to depth for indoor environment recognition tasks [25]. It therefore needs to be determined whether pre-training using RGB-based datasets is helpful for the depth input in our task.

Proposed method
To solve this problem, we propose a multi-modal version of Mask R-CNN with a backbone network of three RGB channels and one depth channel, as shown in Figure 2. The area enclosed by the red dotted line is our extension.
The 3-channel RGB and 1-channel depth input images pass through FPNs [34] derived from ResNet-101 [35] for feature extraction, and a 1×1 full convolution (the feature mixing block shown in Figure 2) is then performed at the last stage to blend the RGB- and depth-derived features.
When a 4-channel RGB-D image is used as input, the single-modal method learns a feature extractor that combines all four channels. In contrast, by introducing the feature mixing block, our M3R-CNN can learn the feature extractors for the 3-channel RGB input, the 1-channel depth input, and their blending weights separately, which is expected to make it possible to use the RGB and depth features selectively for each scene.
In addition, this feature mixing block has the same input/output kernel size, number of channels, and number of layers as the conventional Mask R-CNN feature extractors. Thus, M3R-CNN can be implemented with minimal modification of Mask R-CNN.
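As an illustration, one plausible wiring of the feature mixing block is a concatenation of the per-level RGB and depth FPN features followed by a 1×1 convolution. The PyTorch sketch below uses assumed names (FeatureMixingBlock, rgb_p3, depth_p3) and the standard 256-channel FPN width; it is a minimal sketch of the idea, not the released implementation.

```python
import torch
import torch.nn as nn


class FeatureMixingBlock(nn.Module):
    """Blend RGB- and depth-derived FPN features with a 1x1 convolution.

    Minimal sketch: each FPN level yields 256-channel maps from the RGB and
    depth backbones; the two are concatenated and projected back to 256
    channels so that the rest of Mask R-CNN is unchanged.
    """

    def __init__(self, channels: int = 256):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        return self.mix(torch.cat([rgb_feat, depth_feat], dim=1))


# Hypothetical usage: fuse one pyramid level from the two backbones.
mix = FeatureMixingBlock()
rgb_p3 = torch.randn(1, 256, 64, 64)    # one FPN level from the RGB backbone
depth_p3 = torch.randn(1, 256, 64, 64)  # same level from the depth backbone
fused_p3 = mix(rgb_p3, depth_p3)        # (1, 256, 64, 64), fed to the RPN/heads
```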
The number of model parameters is 109.5 million for our M3R-CNN, compared to about 63 million for the conventional Mask R-CNN (Note 1).

Datasets
For depth and other non-RGB/grayscale modalities, no publicly available datasets comparable in scale to ImageNet or COCO exist. In addition, unlike RGB data, depth data has the property that its noise characteristics vary greatly depending on the input device. Therefore, it is necessary to train on relatively small datasets of several hundred to several thousand samples, which the user can prepare.
In this study, we built a dataset by annotating graspable regions for 17 item types across 561 original RGB + depth scenes, as shown in Figure 3. By applying the data-augmentation processes of background replacement, rotation around the Z axis, and random noise addition to RGB to each scene, 21,324 training samples and 10,662 validation samples were prepared.
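A minimal NumPy sketch of the augmentation pipeline described above is given below. The function and variable names are illustrative placeholders, the rotation is restricted to 90-degree steps for brevity, and the noise level is an assumed value; this is not the tooling actually used to build the dataset.

```python
import numpy as np


def augment_scene(rgb, depth, mask, bg_rgb, bg_depth, rng):
    """One augmented sample: background replacement, Z-axis rotation, RGB noise.

    rgb: (H, W, 3) uint8; depth: (H, W) float32; mask: (H, W) bool, True on
    annotated graspable regions; bg_rgb / bg_depth: a replacement background
    scene of the same size.
    """
    # Background replacement: keep the annotated foreground, swap everything else.
    out_rgb = np.where(mask[..., None], rgb, bg_rgb)
    out_depth = np.where(mask, depth, bg_depth)

    # Rotation around the camera Z axis (90-degree steps only, for brevity).
    k = int(rng.integers(0, 4))
    out_rgb, out_depth, out_mask = (np.rot90(a, k) for a in (out_rgb, out_depth, mask))

    # Random noise added to the RGB channels only.
    noise = rng.normal(0.0, 3.0, out_rgb.shape)
    out_rgb = np.clip(out_rgb.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    return out_rgb, out_depth, out_mask


# Example: rng = np.random.default_rng(0); augment_scene(rgb, depth, mask, bg_rgb, bg_depth, rng)
```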
By annotating each graspable area separately rather than the object itself, it is possible to prevent the detection of ungraspable objects and to greatly simplify the grasp point detection process.

Pre-training
Figure 4 shows the weights of the first convolution layer of the feature extractor of the single-modal Mask R-CNN obtained by training the RGB and depth backbones for 30 epochs each using the dataset created in Section 3.2. The figure also shows the transition of the training and validation loss.
This figure shows that the dataset prepared in this paper alone is not enough to provide good generalization performance, because no clear line- or corner-detection filters formed in the feature extractor. For the depth backbone in particular, the validation loss oscillates significantly, which suggests that there is not enough variation in the data and that overfitting is occurring.
Figure 5 shows saliency maps obtained with Eigen-CAM [36] after 30 epochs of M3R-CNN training on our dataset without pre-training. The figure shows a heat map of the per-pixel saliency, with respect to the input image, of the RGB and depth feature extraction blocks and the feature mixing block immediately after them, with saliency increasing in the order blue, green, and red.
It is clear that the RGB feature extraction block does not form a region of interest in the vicinity of the object, and that the depth feature extraction block only attends to the entire container, confirming that the model has not yet acquired sufficient generalization capability, as described earlier.
The above shows that pre-training of feature extractors is necessary for both RGB and depth to exploit the performance of M3R-CNN fully.
For RGB data, several previous studies have shown that feature detectors can achieve acceptable generalization performance by using large datasets such as ImageNet and MS-COCO for pre-training [37]. However, no large-scale depth dataset suitable for object grasping by robots had been published at the time of writing this paper.
On the other hand, it has long been known that there is a strong correlation between the edges of RGB images and those of depth images [38].
Figure 6 shows the RGB and depth edges for the same scene.
As is clear from this figure, the RGB and depth edges for the same scene do not match on the object's surface texture, while they generally match near the object's boundaries. This indicates that the texture information in the RGB image is lost in the depth image, while the boundaries of objects, which appear as height differences, are emphasized in the depth image. In addition, since the edges of object boundaries are the essential information in the depth image, the same feature extractor as for the RGB image can be used.
Therefore, in this study, we applied pre-training using ImageNet to the depth feature extractor as well as the RGB feature extractor to obtain generalization performance. Specifically, the G component of the ImageNet pre-training result for the RGB feature extractor is applied directly to the depth feature extractor. Figure 7 shows the pre-training results for the RGB and depth feature extractors. The figure shows that pre-training with a large dataset forms clear line and corner detection filters in the feature extractor. The validation loss for depth also decreased monotonically, indicating that overfitting is suppressed.
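The transfer of the ImageNet-pretrained G-channel weights into a 1-channel depth stem can be written compactly. The sketch below assumes a torchvision (>= 0.13) ResNet-101 and standard weight enums; it is our illustration of the idea, not the authors' exact code.

```python
import torch
from torchvision.models import resnet101, ResNet101_Weights

# RGB backbone: standard ImageNet-pretrained ResNet-101.
rgb_backbone = resnet101(weights=ResNet101_Weights.IMAGENET1K_V1)

# Depth backbone: same architecture, but its stem takes a single channel.
depth_backbone = resnet101(weights=ResNet101_Weights.IMAGENET1K_V1)
conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    # Copy only the G component (channel index 1) of the pretrained 7x7 filters.
    conv1.weight.copy_(rgb_backbone.conv1.weight[:, 1:2, :, :])
depth_backbone.conv1 = conv1
```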
Figure 8 shows the saliency-map visualization results for this model. Comparing this figure with Figure 5, we can see that a region of interest is formed around the target object in the RGB features, while the region of interest around the target object disappears in the depth features. This indicates that pre-training with ImageNet improves the performance of the feature extractor for RGB while it degrades it for depth.
We discuss this issue in more detail in the next section.

Fine-tuning
Networks that have been pre-trained using data from a domain different from the target environment need to undergo domain adaptation [39,40] (fine-tuning). However, the number of training cycles required for adaptation is expected to differ significantly between the RGB data, whose distribution is close to that of the pre-training data, and the depth data, whose distribution is very different. As described in the previous section, pre-training improves the performance of the feature extractor for RGB while it degrades it for depth. Therefore, we adopted the three-stage learning method described below in this study.
(1) Pre-training each backbone of RGB and depth using ImageNet
(2) Fine-tuning each backbone of RGB and depth individually on our own dataset
(3) Fine-tuning from the mixing block onward on our own dataset
A code-level sketch of this three-stage schedule is shown below.
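The following is a minimal sketch of the schedule, using requires_grad flags to switch which parts are trained. The names model, rgb_backbone, depth_backbone, mixing_block, heads, and train_one_stage are placeholders, the reading of "individually" as freezing the other branch is our assumption, and the epoch count is only indicative.

```python
import torch.nn as nn


def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag


def three_stage_finetune(model, dataset, train_one_stage, epochs: int = 30) -> None:
    """Stages 2 and 3 of the schedule above (stage 1, ImageNet pre-training, is assumed done).

    `model` is assumed to expose .rgb_backbone, .depth_backbone, .mixing_block and .heads;
    `train_one_stage(model, dataset, epochs)` stands in for an ordinary training loop.
    """
    # Stage 2: fine-tune each backbone individually on our own dataset,
    # keeping the other branch and the mixing block frozen.
    for active, frozen in [(model.rgb_backbone, model.depth_backbone),
                           (model.depth_backbone, model.rgb_backbone)]:
        set_trainable(active, True)
        set_trainable(frozen, False)
        set_trainable(model.mixing_block, False)
        train_one_stage(model, dataset, epochs)

    # Stage 3: freeze both backbones; fine-tune from the mixing block onward.
    set_trainable(model.rgb_backbone, False)
    set_trainable(model.depth_backbone, False)
    set_trainable(model.mixing_block, True)
    set_trainable(model.heads, True)
    train_one_stage(model, dataset, epochs)
```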
Figure 9 shows a comparison of the training progress between the proposed individual fine-tuning method (individual method; black line) and the conventional simultaneous fine-tuning method (simultaneous method; orange line), which is the tuning method used for the model shown in Figure 8 in the previous section. Figure 10 shows the results of applying Eigen-CAM to the model generated by the individual method.
The loss values at 30 epochs are, for the individual method, training = 0.020 and validation = 0.031; for the simultaneous method, training = 0.021 and validation = 0.033.
Figure 9 shows no significant difference between the individual method (shown in black) and the simultaneous method (shown in orange), except for the convergence speed.
However, comparing Figures 8 and 10, in the depth features the simultaneous method keeps the saliency around the object low, while the individual method increases the saliency near the object's edges. These results show that the individual method can effectively transfer learning across different modalities.

Experiments and results
To confirm the validity of the methods described in the previous section, we performed the following two experiments.
(A) Comparison of accuracy between conventional methods and our M3R-CNN using the same training and test data
(B) Evaluation of grasping availability with a robot using the method proposed in this paper
In this section, we first describe the experimental setup, followed by a description of each experiment and its results.

Hardware configurations
Figure 11 shows an overview of the experimental setup.
We set the depth and RGB cameras 1800 mm above the container holding the objects; their extrinsic and intrinsic parameters are known. The depth camera is a PhoXi L from Photoneo, and the RGB camera is an Ace 2 from Basler, each with an image resolution of 2048 × 1536 px.
The acquired depth image is re-projected onto the RGB camera frame using the PhoXi camera's built-in GPU to obtain a single RGB-D image from the depth and RGB images.
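In our system this re-projection is performed by the PhoXi camera's built-in GPU; purely for reference, the underlying geometric operation can be sketched in NumPy as below. All names (K_d, K_rgb, R, t) are assumed calibration quantities for illustration, not values from our setup, and the nearest-pixel splat omits interpolation and occlusion handling.

```python
import numpy as np


def reproject_depth(depth, K_d, K_rgb, R, t, rgb_shape):
    """Re-project a depth image from the depth camera into the RGB camera frame.

    depth:      (H, W) metric depth from the depth camera
    K_d, K_rgb: 3x3 intrinsic matrices; R (3x3), t (3,) map depth-cam to RGB-cam coords
    rgb_shape:  (H_rgb, W_rgb) of the target image
    Returns a depth map aligned with the RGB image (0 where no measurement lands).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])[:, valid]
    pts = np.linalg.inv(K_d) @ pix * z[valid]       # back-project to 3D (depth camera frame)
    pts = R @ pts + t.reshape(3, 1)                 # transform into the RGB camera frame
    proj = K_rgb @ pts                              # project with the RGB intrinsics
    u2 = np.round(proj[0] / proj[2]).astype(int)
    v2 = np.round(proj[1] / proj[2]).astype(int)
    out = np.zeros(rgb_shape, dtype=np.float32)
    ok = (u2 >= 0) & (u2 < rgb_shape[1]) & (v2 >= 0) & (v2 < rgb_shape[0])
    out[v2[ok], u2[ok]] = pts[2, ok]                # nearest-pixel splat, no occlusion test
    return out
```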
The robot arm unit, consisting of a 3-axis Cartesian slider and a 2-DOF end effector with a suction cup, is installed at a position outside the camera's field of view during image acquisition, and it performs the picking operation from above the container by suction.

Accuracy comparison
Using 399 test images (232 scenes of 24 box items and 167 scenes of 9 bag items) acquired with the device described above, we evaluated the object detection accuracy of the conventional Mask R-CNN with RGB (3ch), depth (1ch), and RGBD (4ch) input, and of our M3R-CNN. For M3R-CNN, to confirm the usefulness of depth variation in the training data, we also compared a model trained on the dataset that omitted the background replacement process, among the data augmentations described in Section 3.2, with one trained on the complete set.
Each method uses ImageNet for pre-training; the conventional Mask R-CNN is then fine-tuned for 30 epochs, and the M3R-CNN is fine-tuned with the individual learning method described in the previous section.
The Mask R-CNN implementation was that of TorchVision, and each method was optimized with momentum SGD [41] (learning rate = 0.001, momentum = 0.9, dampening = 0, weight decay = 0.0001).
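For reproducibility, the optimizer setting above corresponds to the following snippet. The model construction line is only a sketch using the stock TorchVision Mask R-CNN entry point (ResNet-50 FPN) and an assumed class layout; it does not reproduce the actual M3R-CNN construction.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Stock TorchVision Mask R-CNN (the paper's backbone is a ResNet-101 FPN; the
# ResNet-50 constructor is used here only because it is the built-in entry point).
model = maskrcnn_resnet50_fpn(num_classes=3)  # background + box + bag: an assumed layout

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9,
                            dampening=0, weight_decay=0.0001)
```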
The F1 score and Average Precision (AP) at IoU = 0.5, which are standard in object recognition tasks, were used for the evaluation. Table 1 shows the results. The M3R-CNN proposed in this paper scores higher than the other methods in both classes.
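As a reference for how these metrics are defined, the sketch below performs greedy mask matching at IoU >= 0.5 and computes the per-image F1; it is our own illustrative implementation, not the evaluation code used to produce Table 1.

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0


def f1_at_iou(pred_masks, gt_masks, thr: float = 0.5) -> float:
    """Greedy one-to-one matching of predicted to ground-truth boolean masks.

    pred_masks: list of (H, W) bool arrays, assumed sorted by detection score.
    gt_masks:   list of (H, W) bool arrays.
    """
    matched_gt = set()
    tp = 0
    for p in pred_masks:
        best_iou, best_j = 0.0, None
        for j, g in enumerate(gt_masks):
            if j in matched_gt:
                continue
            iou = mask_iou(p, g)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= thr:
            matched_gt.add(best_j)
            tp += 1
    precision = tp / len(pred_masks) if pred_masks else 0.0
    recall = tp / len(gt_masks) if gt_masks else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```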
Focusing on mAP, the difference in scores between Mask R-CNN (depth) and M3R-CNN is small, but this is due to their comparable performance in bag object detection; the clear difference in F1 scores and in AP scores for box objects indicates that M3R-CNN is able to dynamically adjust the weights of the RGB and depth features according to the scene.
On the other hand, comparing the results among the M3R-CNNs trained with different data, the model trained on the dataset without the background replacement process shows a significant performance loss in both the F1 score and mAP, especially for the box class. This model frequently failed to detect the boundaries of individual objects correctly when the objects were aligned. Figure 12 shows the results of applying Eigen-CAM to this model.
The model trained on the dataset without the background replacement process does not form a clear region of interest in either RGB or depth. The above confirms the importance of variation in the background data.

Evaluation of grasping availability by robot
We evaluated the system's availability [42], which indicates the continuous operation performance of the system, by performing object detection with the M3R-CNN on objects placed inside the container and repeating the operation of removing the detected objects with the robot arm.
Availability S indicates the availability over an arbitrary continuous operation time and is calculated by Formula (1), where
S: availability, 0 ≤ S ≤ 1;
F_n: number of items remaining after a failure on the n-th attempt;
T_n: total number of items on the n-th attempt.
In the container, we conducted trials for five box items, (a)-1 to (a)-5, and five bag items, (b)-1 to (b)-5, shown in Figure 13. However, only one item type can be present in a scene in a single attempt. None of the item types in Figure 13 are included in the training dataset.
In object grasping by suction, it is necessary to change the grasping strategy depending on the surface shape of the object. Therefore, in this experiment, we dynamically changed the self-interference check range and the pad push-in amount according to the class information obtained during object detection.
The results of the availability evaluation using this method are shown in Table 2.
For items (a)-3, (a)-4, and (b)-3, a grasping failure due to object detection was observed once each. These results show that the recognition success rate is 471/474 ≈ 0.994. In all failure cases, multiple overlapping objects were detected as a single object, and for the box items (a)-3 and (a)-4 the detected area was classified as a bag object. These failures may be due to the similarity between the learned bag-object patterns and the surface geometry, indicating a need for more training samples of bag-type objects.

Discussion
Our discussion thus far has revealed the usefulness of M3R-CNN and the importance of individual fine-tuning of each feature extractor in the bin-picking task, where obtaining a large amount of training data is difficult.
In recent years, large-scale depth datasets that can be used for object detection have gradually become available [43,44], and methods have been developed to automatically generate datasets from mathematical expressions [31,45]. However, it is well known that the quality of depth data varies greatly depending on the characteristics of the input device, and it is often difficult to apply such data as-is to actual environments. This problem can be addressed by fine-tuning each feature extractor individually, as described in this paper.
As mentioned in Section 4.2, it is essential to have a wide variety of backgrounds (the non-object parts of the scene) when creating a dataset for fine-tuning. However, we have not yet examined the differences in recognition results due to differences in depth input devices. As a test, we fed depth images taken by an IDS Ensenso N35, together with RGB images, into the trained model used in this paper. As shown in Figure 14, the results were similar to those obtained when training on the dataset without background replacement, i.e. with insufficient variation. This may be because the model learned the characteristics of the depth data specific to the PhoXi L input device. From the above, it is necessary to use as many different devices as possible when creating a dataset in order to build a model that can handle a variety of depth input devices.
We have yet to reach the stage of qualitative or quantitative evaluation, but we have confirmed that learning proceeds well with the method described in this paper, even on a dataset captured with the Ensenso N35.
On the other hand, in a real-world environment, a person experienced in DNN-based object detection is not always available to generate the dataset. It is difficult for an inexperienced person to read saliency maps or to judge the model's fitness from F1 scores or mAP. In this paper, we use the availability index to measure the model's suitability for the field, but obtaining it requires long hours of experimental work, so it is not well suited for day-to-day operation. There is therefore an urgent need to develop a method that derives bin-picking-specific evaluation indices from images alone.
In addition, we switched grasping strategies according to the object's class by selecting one from a list of pre-prepared strategies. Nevertheless, this approach requires preparing a strategy for every class to be grasped, which does not scale well. An automatic grasp strategy generation method that includes grasp point search [46-48] and collision avoidance will need to be put to practical use.
The Mask R-CNN used for instance segmentation in this study can be considered classic at the time of this writing. In recent years, methods that outperform Mask R-CNN, such as ViT-based models [49], have been proposed one after another.
These methods also employ the same encoder-decoder structure [50] as the M3R-CNN proposed in this paper. Therefore, the domain transfer method presented here, in which each encoder (i.e. feature extractor) and decoder is fine-tuned separately, is expected to work effectively in methods other than M3R-CNN and to contribute to further performance improvement.

Conclusion
This paper presents the application of computer vision and machine learning to the automation of picking tasks in a logistics warehouse, which reveals the following:
(1) The M3R-CNN, which uses RGB and depth cues in parallel, pre-trained with ImageNet and fine-tuned with a small dataset, resulted in a model with satisfactory accuracy for the bin-picking task.
(2) Since there is some correlation between depth and texture data, transfer learning worked effectively even across different modalities. However, when training multi-modal models, it is necessary to fine-tune each feature extractor individually.
(3) By using the 'availability' metric, which represents the sequential success rate of a task, we quantitatively demonstrate the adaptability of our proposed M3R-CNN and its learning method to practical applications.
Moreover, we note the need to develop a simple evaluation index suitable for the bin-picking task and the practical application of a method to generate grasping strategies automatically.
In the future, we plan to develop a practical method for automatically generating grasping strategies and a simple evaluation index suitable for bin-picking tasks.

Note
1. The number of parameters for Mask R-CNN is measured from the TorchVision implementation.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 1. Example input scenes and results from our Multi-Modal Mask R-CNN (M3R-CNN), compared with the conventional Mask R-CNN.

Figure 3. Example of our dataset.

Figure 4. The weights of the first convolution layer obtained by training the RGB and depth backbones without pre-training.

Figure 6. Correlation between RGB and depth data.


Figure 7. The weights of the first convolution layer obtained by training the RGB and depth backbones with ImageNet pre-training.

Figure 9. Comparison of the learning progress between the individual (black line) and simultaneous (orange line) fine-tuning methods.

Figure 11. Experimental environment.

Figure 13. List of experimental items.

Figure 14. Example of results using a depth image from the Ensenso as input.

Table 1. Results of the accuracy comparison.

Table 2. Results of the availability evaluation.
Notes on contributors

Kosuke Iewaki received his M.Eng. from Osaka University in 2018. He is now an invited visiting research scholar at the Graduate School of Engineering Science, Osaka University. His research interests include intelligent robotic systems.

Fumio Okura received his M.S. and Ph.D. degrees in engineering from the Nara Institute of Science and Technology in 2011 and 2014, respectively. He was an assistant professor with the Institute of Scientific and Industrial Research, Osaka University, until 2020. He is now an associate professor with the Graduate School of Information Science and Technology, Osaka University. His research interests include the boundary domain between computer vision and computer graphics. He is a member of IEEE, IEICE, IPSJ, and VRSJ.

Damien Petit received his M.Eng. and M.Sc. degrees in robotics and computer vision from Télécom Physique Strasbourg and Strasbourg University, France, in 2010. He received his Ph.D. in robotics from the University of Montpellier, France, in 2015. From 2012 to 2015, his research was conducted in large part at the CNRS-AIST Joint Robotics Laboratory in Tsukuba, Japan, and at the Interactive Digital Human group of LIRMM, Montpellier, in the frame of the European Commission Virtual Embodiment and Robotic re-Embodiment (VERE) project. Since 2016, he has been working as a researcher at the Graduate School of Engineering Science, Osaka University. His current research interests include vision and robot learning, robotic manipulation, and human-robot interaction.

Yoichi Takano completed the doctoral coursework at Osaka City University without obtaining a degree. He has been a guest professor with the Graduate School of Engineering Science, Osaka University. His research interests include computer vision and intelligent robotic systems. He is a member of RSJ.

Kensuke Harada received his B.S., M.S., and doctoral degrees in mechanical engineering from Kyoto University in 1992, 1994, and 1997, respectively. He worked as a research associate at Hiroshima University from 1997 to 2002. From 2002 to 2016, he worked as a research scientist at the National Institute of Advanced Industrial Science and Technology (AIST). From 2005 to 2006, he was a visiting scholar at the Computer Science Department of Stanford University. Since 2016, he has been a professor at the Graduate School of Engineering Science, Osaka University, and also a cross-appointment fellow at AIST. His research interests include mechanics and control of robot manipulators and robot hands, biped locomotion, and motion planning of robotic systems. He is a fellow of IEEE, RSJ, and JSME.