Prior knowledge-based deep learning method for indoor object recognition and application

ABSTRACT Indoor object recognition is a key task for indoor navigation by mobile robots. Although previous work has produced impressive results in recognizing known and familiar objects, research on indoor object recognition for robots remains insufficient. To improve detection precision, our study proposes a prior knowledge-based deep learning method that aims to enable a robot to recognize indoor objects on sight. First, we integrate the public Indoor dataset and the private frames-of-videos (FoVs) dataset to train a convolutional neural network (CNN). Second, mean images, which are used as a type of colour knowledge, are generated for all the classes in the Indoor dataset. The distance between every mean image and the input image produces the class weight vector. Scene knowledge, which consists of the frequencies of occurrence of objects in each scene, is then employed as another type of prior knowledge to determine the scene weight. Finally, when a detection request is launched, these two vectors, together with the vector of classification probabilities produced by the deep model, are multiplied to produce a decision vector for classification. Experiments show that detection precision can be improved by employing the prior colour and scene knowledge. In addition, we applied the method to object recognition in a video. The results show the potential of the method for robot vision.


Introduction
The detection and recognition of indoor objects is an essential task in robot vision. Real-time, highly accurate indoor object recognition can greatly assist in robot navigation and manipulation (Khan, Hayat, Bennamoun, Togneri, & Sohel, 2016). In fact, many tasks associated with robot navigation depend directly on the recognition of indoor objects (Collet, Berenson, Srinivasa, & Ferguson, 2009; Ramisa, Alenyà, Moreno-Noguer, & Torras, 2014; Srinivasa et al., 2010). To enhance the robot's performance during indoor navigation, it is therefore necessary to design a reliable recognition method.
CONTACT Xintao Ding xintaoding@163.com. Supplemental data for this article can be accessed here: https://doi.org/10.1080/21642583.2018.1482477

Other studies focussed on the design of statistical models to understand indoor geometry (Espinace, Kollar, Roy, & Soto, 2013; Pero et al., 2012; Wang, Gould, & Koller, 2013). However, these models lack sufficient precision (Pero et al., 2012; Wang et al., 2013). Because RGB-D sensors, such as Kinect, provide not only colour but also depth information about scenes, RGB-D cameras are being widely used to guide indoor robot navigation (Husain, Schulz, Dellen, Torras, & Behnke, 2017). Jiang, Koch, and Zell (2016) developed a real-time recognition system for fruit and small textured objects for a mobile robot equipped with the Kinect RGB-D sensor. Other studies have also contributed to the design of RGB-D descriptors for object recognition. Blum, Springenberg, Wülfing, and Riedmiller (2012) proposed a convolutional k-means descriptor for object recognition in RGB-D data. Chae, Park, Yu, and Song (2016) proposed a way to recognize objects for simultaneous localisation and mapping (SLAM) based on an object-level descriptor using a depth sensor. Bo, Ren, and Fox (2013) proposed an unsupervised feature learning method for RGB-D data, whose features were employed for object recognition using linear support vector machines. Using RGB-D data, Asif, Bennamoun, and Sohel (2017) employed convolutional neural networks (CNNs) to extract features for object recognition and grasp detection. Although the depth information contained in RGB-D data can produce more robust results, the relevant techniques are usually more complex and computationally expensive. Furthermore, because depth information is generally captured by infrared lasers, an RGB-D implementation involves a process of multimode optimization. For the sake of brevity, we focus on object recognition within the scope of the RGB mode.
To improve detection precision, our study carries out indoor object recognition using a prior knowledge-based deep learning method, which learns deep features from annotated objects and predicts unknown objects using those features. Generally, public deep learning datasets (e.g. ImageNet (Schwarz et al., 2015), ChaLearn's Looking at People dataset (Neverova et al., 2014), and Washington RGB-D Object (Eitel et al., 2015)) or private datasets (e.g. MIT campus buildings (Zhang et al., 2016)) are employed for training on indoor objects. In this study, we first combine the public Indoor dataset (Quattoni & Torralba, 2009) and the private frames-of-videos (FoVs) dataset to train a CNN model, because integrating datasets helps to improve detection precision (Ding et al., 2017). Second, because object colour may be helpful for object recognition, we employ colour as a type of prior knowledge to enhance the detection precision of the resulting deep model. Third, because particular objects tend to occur in certain scenes, we employ scene as another type of prior knowledge to further enhance the detection precision of the model.
The remainder of this paper is structured as follows. Section 2 describes our proposed method. Training and experiments are presented in Sections 3 and 4, respectively. Section 5 focuses on an application of the method for robot vision. Finally, some concluding remarks follow in Section 6.

Figure 1 shows the architecture of the method proposed in this paper. For the implementation of indoor object recognition, we propose deep features involving colour knowledge and scene knowledge for recognition (Figure 1). After combining the public Indoor dataset (Quattoni & Torralba, 2009) and the private FoVs dataset, we first train a CNN model (Figure 1a). Mean images, which are used for colour knowledge, are then generated for all the corresponding classes in the Indoor dataset (Figure 1b). After that, scene knowledge, which consists of the frequencies of occurrence of objects in each scene, is employed as another type of prior knowledge (Figure 1c). When a detection request is launched, as shown in Figure 1d, the input image is first forwarded to the deep learning model to produce a vector of classification probabilities p_CNN. Second, every mean image in Figure 1b is subtracted from the input image to produce a class weight vector. Similarly, the scene knowledge of the input is used to produce a scene weight vector. After that, the three vectors are multiplied to produce a decision vector, as shown in Figure 1e. Finally, the output classification of the input image is the index with the maximum value in the decision vector (Figure 1f).

Convolutional neural network
In order to implement recognition of indoor objects, we employ a CNN to train a deep model for classification. In detail, we use CaffeNet as our reference implementation, as shown in Figure 2. The images used for training consist of the public Indoor dataset (Quattoni & Torralba, 2009) and the private FoVs dataset. They were scaled to 256 × 256 pixels without regard for their original width-to-height ratio, since CaffeNet requires input images of this size. Every private video was recorded from the surroundings of an object. The Indoor dataset contains 481 categories, but the number of annotations varies among categories: over 300 categories contain no more than 100 objects. Because a category with a small number of objects cannot be used to train a deep model, and the number of samples used for training must be greater than the number of parameters, we rebuilt CaffeNet by setting the size of the fully connected layers, i.e. fc6 and fc7, to 2048 (Figure 2).
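As a concrete illustration of the preprocessing step, the following is a minimal NumPy sketch of the fixed-size rescaling applied to every image; it uses nearest-neighbour sampling and ignores the original aspect ratio, as described above. The function name is our own, and in practice a library routine such as OpenCV's resize would be used instead.

```python
import numpy as np

def resize_to_fixed(img: np.ndarray, size: int = 256) -> np.ndarray:
    """Nearest-neighbour rescale of an H x W x C image to size x size,
    ignoring the original aspect ratio (as CaffeNet's input requires)."""
    h, w = img.shape[:2]
    # Map each output row/column back to a source row/column index.
    rows = (np.arange(size) * h // size).clip(0, h - 1)
    cols = (np.arange(size) * w // size).clip(0, w - 1)
    return img[rows][:, cols]

# Example: a 720 x 1280 RGB frame becomes a 256 x 256 input image.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
print(resize_to_fixed(frame).shape)  # (256, 256, 3)
```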

Class weighting
Because object colour may be helpful for object recognition, we employ colour as a type of prior knowledge to enhance the hit rate of the resulting deep model. Let D be the Indoor dataset (Quattoni & Torralba, 2009) and D_k (k = 1, 2, ..., K) the set of images of the k-th class. Let the mean image of D_k be MI_{D_k}. Then, MI_{D_k} is given by equation (1):

MI_{D_k}(x, y) = (1 / card(D_k)) Σ_{I ∈ D_k} I(x, y),   (1)

where card(D_k) is the number of elements in D_k. An input image I is compared with MI_{D_k} to produce the class distance d_k, as shown in equation (2):

d_k = (1 / (M N)) Σ_{x=1}^{N} Σ_{y=1}^{M} |I(x, y) − MI_{D_k}(x, y)|,   (2)

where M and N are the number of rows and the number of columns of I, respectively, and I(x, y) is the intensity of I at the pixel lying in the x-th column and y-th row; MI_{D_k}(x, y) is defined similarly. The class weight vector w_c = (w_{c,1}, w_{c,2}, ..., w_{c,K}) is then defined as in equation (3), so that a smaller class distance yields a larger weight:

w_{c,k} = (1 / d_k) / Σ_{j=1}^{K} (1 / d_j).   (3)
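The colour-knowledge computation above can be sketched as follows. The inverse-distance normalisation in `class_weights` is an assumption, since only the qualitative behaviour (smaller distance, larger weight) is fixed by the text; all function names are illustrative.

```python
import numpy as np

def mean_images(classes):
    """classes: dict mapping class id -> list of equally sized images.
    Returns the per-class mean image MI_{D_k}, as in equation (1)."""
    return {k: np.mean(np.stack(imgs), axis=0) for k, imgs in classes.items()}

def class_weights(img, mis):
    """Class distance d_k = average absolute intensity difference between
    the input image and each mean image (equation (2)).  The class weight
    vector w_c is taken here as the normalised inverse distance, an
    assumed concrete form of equation (3)."""
    ks = sorted(mis)
    d = np.array([np.abs(img - mis[k]).mean() for k in ks])
    inv = 1.0 / (d + 1e-12)   # guard against a zero distance
    return inv / inv.sum()    # weights sum to 1
```

For example, an input identical to one class's mean image receives nearly all of the weight for that class.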

Scene weighting
Generally, particular objects are found in certain scenes; for example, a bed is usually found in a bedroom. Therefore, we also employ scene as another type of prior knowledge to help the deep model in decision-making. Let S_l (l = 1, 2, ..., L) be the l-th scene in D. The scene weight vector f_l of the l-th scene is defined as in equation (4):

f_l = (f_{l,1}, f_{l,2}, ..., f_{l,K}),   (4)

where f_{l,k} is the scene weight of the k-th class in the l-th scene, calculated as in equation (5):

f_{l,k} = n_{l,k} / Σ_{j=1}^{K} n_{l,j},   (5)

where n_{l,k} is the number of occurrences of objects of the k-th class in scene S_l.
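A minimal sketch of the scene-weighting step follows, assuming the weights are relative frequencies computed from per-scene object counts gathered from the annotation files (the count matrix and function name are our own):

```python
import numpy as np

def scene_weights(counts):
    """counts: L x K array with counts[l, k] = number of occurrences of
    class k in scene l.  Row l is the scene weight vector f_l of
    equation (4); each entry is the relative frequency of the class in
    that scene, as in equation (5)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Scenes with no annotated objects get an all-zero weight row.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
```

For instance, a scene containing four beds and no chairs would weight the bed class at 1.0 and the chair class at 0.0.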

Knowledge fusion
When a detection request is launched, the colour and scene knowledge are fused to assist detection. The detection image is first forwarded to the deep model to produce a vector of classification probabilities p_CNN. Then, its colour and scene weights are generated from equations (3) and (4), respectively. The output classification of image I, which comes from the l-th scene, is the index at which the decision vector attains its maximum, as shown in equation (6):

class(I) = argmax_k (p_CNN ∘ w_c ∘ f_l)_k,   (6)

where w_c ∘ f_l is the Hadamard product of w_c and f_l, which is defined as (w_c ∘ f_l)_k = w_{c,k} f_{l,k}.
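The fusion step can then be sketched in a few lines: the function below multiplies the three vectors element-wise (Hadamard products) and returns the arg-max index, as in equation (6). The function and argument names are illustrative.

```python
import numpy as np

def classify(p_cnn, w_c, f_l):
    """Fuse the CNN probability vector with the colour weight vector w_c
    and the scene weight vector f_l of the input's scene, then return
    the index of the maximum entry of the decision vector."""
    decision = p_cnn * w_c * f_l   # element-wise (Hadamard) products
    return int(np.argmax(decision))
```

Note that the fused decision can differ from the CNN's own top-1 choice: a class with a slightly lower CNN probability may win if its colour and scene weights are higher.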

Time analysis
For our knowledge-based method, there is no additional work involved in the training stage. During the detection stage, every mean colour image, each of size 256 × 256, is subtracted from the input image; therefore, K × 256 × 256 subtractions are required to use the colour knowledge. To use the scene knowledge, a loop over the L scenes is required to index the scene of the input. During the knowledge fusion step, two Hadamard products followed by normalization of the probability vector are required; the operations in this step total 6K. In all, an additional K × 256 × 256 + L + 6K operations are required compared with the knowledge-free method.
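The operation count above can be checked with a short calculation; the values K = 18 and L = 67 below are the class and scene counts used later in the experiments:

```python
# Rough count of the extra per-image operations introduced by the prior
# knowledge, following the analysis above (K classes, L scenes).
def extra_ops(K: int, L: int) -> int:
    colour = K * 256 * 256   # per-pixel subtractions against K mean images
    scene = L                # linear scan to index the scene knowledge
    fusion = 6 * K           # two Hadamard products plus normalization
    return colour + scene + fusion

# With K = 18 classes and L = 67 scenes:
print(extra_ops(18, 67))  # 1179823
```

The colour term dominates, which suggests that any future speed-up effort should target the mean-image subtractions.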

Training the deep model
Our pre-training module runs on a dual-core Intel Core i3-4160 CPU with 16 GB RAM, equipped with an NVIDIA GeForce GTX 1080 graphics card with 8 GB of memory. CaffeNet is compiled under Ubuntu 16.04 with CUDA Toolkit 8.0, the cuDNN 5.1 library, Anaconda3, and OpenCV 3.1. The protocol buffer version employed for Python 3.5 is 3.0.0. We use the Indoor dataset (Quattoni & Torralba, 2009) and the FoVs to train our CNN model. The experimental dataset consists of indoor objects belonging to 21 different classes. Because a category with a small number of objects cannot be used to train a deep model, an Indoor dataset class was retained only if it contained more than 500 members. Using this criterion, we obtained 18 object categories, as shown in Table 1. In order to take personalized objects into account, we extended the resulting dataset using the FoVs. The extension included 17 categories. Every category extension employed multiple videos, and each video was created from the surroundings of a particular object. The extended categories DP, screen, and TM are three new categories added to the categories from the Indoor dataset. The other fourteen extended categories correspond to existing categories from the Indoor dataset.

Test experiments
In this section, we describe the test experiments implemented against the test subset. After parsing all the object images from the Indoor dataset, all images annotated with the same object were placed in a folder. The mean images were generated by a folder scan using equation (1). Figure 3 shows the mean images of classes 0-17 in the Indoor dataset.
In this study, all the annotation files of the Indoor dataset were scanned to count the occurrences of the 18 classes. The images from the Indoor dataset were placed in 67 different folders. When the object images of the Indoor dataset were parsed, we counted their scene occurrences to find the scene weights. Figure 4 shows the resulting scene weights, in which the scenes are sorted by name in alphabetical order. The scene with tag 0 represents 'airport_inside' in the Indoor dataset. Figures 3 and 4 exhibit heterogeneity across classes and scenes, which may be helpful for object recognition. The mean image (MI) of plant, i.e. class 8, in Figure 3 appears green. The MI of painting, i.e. class 10, shows four borders like a frame around it. Class 1, lamp, presents high intensity at its centre. On the contrary, classes 5 and 11 show low intensity. In Figure 4, certain classes in certain scenes present high weights. Overall, the deep model may take advantage of prior knowledge, such as colour and scene, during detection.
After the acquisition of prior knowledge, we implement detection using equation (6). In order to evaluate our approach, we use top-1 precision and mean average precision (mAP) as two measures of performance, where mAP is the average of all the precisions obtained for all queries. Together with mAP, the resulting classes' top-1 precisions on the test subset are shown in Table 2.
Compared with CaffeNet, the prior knowledge-based method achieves a better result, with a mAP of 86.3%. The increases in detection precision for the wall, books, painting, chair occluded, and book categories are remarkable. Although the resulting precisions for the ceiling, door, table, and curtain categories are inferior, the small differences show the comparability of the method. The results demonstrate that detection precision can be improved by employing colour and scene knowledge. Table 3 shows some top-3 classifications together with their probabilities. The semantic classes marked with asterisks are the top-1 classifications. (F|T) denotes an example in which CaffeNet produces a false top-1 classification whereas our proposed method produces a correct one; (F|F) and (T|F) are defined similarly.
It is unavoidable for top-1 classification to incur misclassifications due to samples that are difficult to categorize, such as chair, wall, picture, and ceiling in Table 3. However, misclassification may be reduced if top-3 evaluation is implemented, as shown in Table 3. For both CaffeNet and our proposed method, chair and ceiling are correctly classified under top-3 evaluation. The wall is correctly detected by our proposed method. Neither CaffeNet nor the knowledge-based method is able to correctly detect picture in (F|F). However, painting, which occurs in the top-3 classifications, is a close classification of the object picture. It can be inferred that top-3 evaluation may alleviate misclassification.
In Table 4 we summarize the running time of the entire object detection on the test dataset using Python. Since the total number of samples in the test dataset is 18,133, the results reveal that our proposed method requires approximately 30 ms per input.

Comparison experiments
In this section, we describe the comparison experiments implemented on the NYU v2 dataset. The dataset consists of a total of 1449 samples of different indoor scenes. We parsed seven object classes from images in the labelled dataset based on the annotated labels. After resizing the parsed patches to 256 × 256 pixels, they were input into the proposed model for recognition. Couprie et al. (2013) applied a multiscale convolutional network to learn features combining the images and their depth information. Hermans, Floros, and Leibe (2014) proposed a fast 2D semantic segmentation approach based on a novel 2D-3D label transfer method. Table 5 shows the comparison results; the results of our method in Table 5 are transfer results from the Indoor dataset to the NYU dataset. The average accuracies of CaffeNet and the proposed method are 45.2% and 47.0%, respectively. Compared with CaffeNet, our proposed method achieves a better result overall.

Application
In this section, we present an application of our proposed method for robot vision. A video was created using a camera to test indoor object recognition. The scene is a typical indoor office environment. A video of the room was captured over a span of 31 s and then parsed into 940 frame images. The resolution of the FoVs is 1280 × 720. Figure 5 illustrates the overall structure of the application presented in this paper. To implement indoor object recognition, we parse the input video into frame images (Figure 5a). Then, the regions of interest (RoIs) are extracted using selective search (Uijlings et al., 2013) (Figure 5b). These RoIs are then resized to 256 × 256 pixels and classified into candidates using the proposed method (Figure 5c). Candidates that are in the same category and show an overlap greater than 0.5 between the nearest frames are fused into one classification (Figure 5d). Finally, the frames annotated with bounding boxes are concatenated into a video as the output (Figure 5e). The parameter k, which controls the size of segments in the initial segmentation, is set to 200 in this study. The number of RoIs extracted from the FoVs ranges from 45 to 217. The detection was implemented offline; each extracted RoI was resized into a 256 × 256 image and forwarded to our proposed model for classification.
Although our model (Figure 5c) may predict a classification for every RoI, misclassification is unavoidable. In order to reduce misclassification, detection fusion is employed in our design. We fused candidates derived from the deep model in the nearest frames when they were classified in the same category and their overlap was greater than 0.5. Figure 5d shows object fusion in two frames. The top-left object shown by a black line is indexed with category 12, whereas the top-left object shown by a red line is classified into category 15. As the two candidates are classified into different categories, they are regarded as misclassified and are not fused. On the contrary, both of the second candidates shown in the two frames are classified into category 1, and their overlap area is greater than 0.5. They are annotated in a fused state, and the annotation box is the minimum coverage of the two candidates. After the decision vector is normalized to a unit vector, the frame of the box is coloured red, yellow, green, or blue to show the probability of the prediction if its probability lies in the interval (0.9, 1], (0.75, 0.9], (0.5, 0.75], or (0, 0.5], respectively. Table 6 shows the counts of these intervals over all the prediction probabilities, which total 18,865. Table 6 shows that most predictions hit their classifications with a probability greater than 0.75; prediction probabilities below 0.5 are rare. After all of the FoVs are assigned the aforementioned probabilities, we merge the frames into an annotated video at a frame rate of 6 fps (Supplement 1). For convenience, some detection results are shown in Figure 6. Figure 6a is the first output frame, in which the recognized objects are not fused with other objects. The detection results of the 70th, 494th, and 808th frames are shown in Figure 6b-d, respectively.
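The cross-frame fusion rule can be sketched as follows, interpreting "overlap greater than 0.5" as intersection-over-union (an assumption) and representing boxes as (x1, y1, x2, y2) corner pairs (also an assumption); the merged box is the minimum coverage of the two candidates, as described above.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def fuse(prev, curr, thr=0.5):
    """Fuse candidates from two neighbouring frames: candidates with the
    same class label and overlap > thr are merged into the minimum
    covering box.  prev/curr: lists of (label, (x1, y1, x2, y2))."""
    fused = []
    for lbl_p, box_p in prev:
        for lbl_c, box_c in curr:
            if lbl_p == lbl_c and iou(box_p, box_c) > thr:
                fused.append((lbl_p, (min(box_p[0], box_c[0]),
                                      min(box_p[1], box_c[1]),
                                      max(box_p[2], box_c[2]),
                                      max(box_p[3], box_c[3]))))
    return fused
```

Candidates whose labels disagree between frames, like the category 12 versus category 15 pair above, produce no fused output under this rule.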
Our method may be applied to indoor object detection. Figure 6 shows that detection fusion is necessary in our pipeline. Although there are many detections in Figure 6a before fusion, many of them are misclassified. Furthermore, although the door of the cupboard is misclassified as 'floor' (Figure 6b), the misclassification may be corrected as the cupboard moves through the video and comes to the centre of the field of view. The video (Supplement 1) shows that the door of the cupboard is detected frequently and correctly from the 40th frame to the 108th frame. The real windows, floor, table, DP, TM, chair, and door are almost all correctly detected in Figure 6c and d. In the detection video, the objects labelled 'window', 'desk partition', 'TV monitor', and 'chair' are frequently and correctly detected most of the time while they are within the field of view.
The experimental results suggest that our proposed method may be applied to indoor object detection, and that the use of prior knowledge is helpful in enhancing robot vision for indoor object recognition.

Conclusion
In this paper, we proposed a prior knowledge-based deep model for indoor object recognition. In our design, prior knowledge of colour and scene is utilized to help the deep model make a decision when a detection request is launched. Both the test and comparison experiments demonstrate that the knowledge-based method enhances the hit rate of object detection. Based on our proposed model, we implemented an application for an indoor video. The application experiments show that this method may be applied to robot vision. The main contribution of this work is three-fold: (a) this work contributes to indoor object recognition for robot vision; (b) our study proposes a knowledge-based method; and (c) the proposed method improves detection precision. A potential future research project may concentrate on accelerating detection speed so that real-time detection may be implemented on a mobile robot.

Disclosure statement
No potential conflict of interest was reported by the authors.