Few shot object detection for headdresses and seats in Thangka Yidam based on ResNet and deformable convolution

To address the problems of few samples, variable target sizes and overlapping targets in the detection of headdresses and seats of Thangka Yidam, we propose an optimised few shot Thangka detection method based on ResNet and deformable convolution. Firstly, an optimised deep residual network is designed to address the problem of few categories and complicated composition in Thangka images. Then, we replace the 3×3 convolution of the optimised deep residual network with deformable convolution. By introducing the offset of deformable convolution, the receptive field can adapt to the different sizes and shapes of the detection targets of Thangka Yidam. Finally, the box regression is achieved through the multi-relation detector, where DT-NMS is proposed to reduce missed and repeated detections. Experimental results show that the proposed method performs better than the SOTA on the COCO dataset. In addition, the AP of 2-way 5-shot on the Thangka dataset is 33.3%, and the AP50 reaches 71.2%, which increases by 4.7% and 5.3%, respectively.


Introduction
As a unique 1300-year-old scroll painting, Thangka was included in the intangible cultural heritage of the United Nations in 2006 (Ge, 2018). The subject matter and content of Thangka are extremely rich, including Eikon, Mandala, Biography, Architectural Attraction, Tibetan Medical Calendar and so on. The Eikon is the most representative type, accounting for about 80%. When people first encounter Thangka art, they are often struck by its rich layout and strong artistic tension. However, a lot of religious knowledge is required to identify semantic objects and understand image contents. For instance, Lord Sakyamuni Buddha, with a blue mitral and a high bun, holds an alms bowl in the left hand and a flower in the right hand, and sits on the padmasana, as shown in Figure 1. The alms bowl and the flower symbolise different wishes and merits of the divine statues. By identifying the mitral and the padmasana, we can determine that the Thangka Yidam is Sakyamuni Buddha. A key question in computer vision is how to analyse static images in order to understand the scene. By detecting key objects in Thangka images, we can obtain important semantic information, interpret the image contents and experience the living Tibetan culture.
Traditional object detection was usually realised by statistical learning methods (Rama et al., 2018; Saha et al., 2021; Li et al., 2018). These algorithms were trained on manually designed features, resulting in high time complexity and poor robustness. To this end, efficient detection algorithms based on convolutional neural networks (CNNs) were proposed. According to their training methods, CNN-based object detection algorithms can be divided into two types: single-stage detection algorithms (Kim, 2020; Bochkovskiy et al., 2020; Liu et al., 2021) and two-stage detection algorithms (Girshick, 2015; Ren et al., 2017). However, all of them require a large amount of training data with accurate bounding-box annotations. The collected Thangka images suffer from unbalanced categories and few samples, so directly applying a data-driven detection model to Thangka images poses great challenges.
To address the problem of few samples, researchers proposed a new kind of machine learning theory, few shot learning, which can learn from limited data. Following the excellent achievements in few shot image classification (Gidaris & Komodakis, 2018; Dvornik et al., 2019; Cai et al., 2018), few shot object detection algorithms have also developed rapidly. However, recent works on few-shot object detection (Kang et al., 2019; Yan et al., 2019; Sun et al., 2021; Zhu et al., 2021) all require fine-tuning and cannot be directly applied to novel categories. To address this problem, Fan et al. (2020) proposed the Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector (FSOD) model, which introduces the attention mechanism and the information of the support set into the RPN. In addition, FSOD integrates the features of the query image and the support image to guide the network to generate more bounding boxes related to the support category and to suppress irrelevant bounding boxes. It can achieve significant detection performance without fine-tuning. Therefore, we perform Thangka image detection based on the FSOD model.
However, Thangka images are characterised by complex composition, few detection categories, and detection targets of different shapes and sizes. Directly applying the FSOD model to Thangka images is not ideal, as it tends to cause poor generalisation ability, a high missed detection rate and a high repeated detection rate. Therefore, we propose an improved method based on FSOD and name it RD-FSOD, where R represents the improved ResNet and D represents the deformable convolution. Our contributions are as follows: (1) We propose an improved backbone network to extract Thangka features, which addresses the problem of complicated Thangka images and few categories, thus effectively improving the detection performance of the model. (2) We introduce the deformable convolution (DC) (Dai et al., 2017) according to the geometric deformation characteristics of the seats in Thangka Yidam, replacing the 3×3 convolution of the improved backbone network with the deformable convolution. (3) We propose the double threshold non-maximum suppression (DT-NMS) algorithm to solve the problem of overlapping objects. By comparing different threshold combinations, we choose N_t = 0.45 and N_d = 0.90 as the final thresholds. Experiments show that the performance of our proposed method is better than that of the FSOD model in the detection of headdresses and seats of Thangka Yidam.
The rest of this article is arranged as follows: Section 2 introduces the related work of few shot object detection and the FSOD model. Section 3 is the detailed design and implementation of the model. In Section 4, the datasets and the experimental environment are introduced first. Then we carry out the comparison and ablation experiments, and explain and analyse the results. Section 5 is the conclusion.

Few shot object detection method
Few shot object detection refers to learning a detection model with good performance from few labelled samples. Karlinsky et al. (2019) converted the region of interest (RoI) into a feature vector as the input of a distance metric learning sub-network. The posterior probability of each RoI is calculated by comparing the distance between the embedded feature vector and the representation vector of each target, which achieves detection and recognition with few samples. In LSTD (Chen et al., 2018), which is based on fine-tuning, the authors proposed a few shot object detection network that can transfer knowledge from a large dataset to a smaller dataset by minimising the gap in classification posterior probability between the source domain and the target domain. Singh et al. (2020) also adopted fine-tuning, obtaining the detection results by fine-tuning the pre-trained model in the target domain. However, these methods ignore that potential bounding boxes can easily miss unseen categories and produce false detections in the background. To address this problem, Fan et al. proposed the FSOD model, a general few-shot object detection model that can be applied to detect novel objects without retraining or fine-tuning.

FSOD method
FSOD is a 'few shot object detection' model based on the model structure. Given a support image s_c with a close-up of the target and a query image q_c which potentially contains objects of the support category c, the central task of FSOD is to find all the objects belonging to the support category in the query image and label them with bounding boxes. If the support set contains N categories and K examples for each category, the problem is dubbed N-way K-shot detection. Figure 2 shows the overall architecture of FSOD. Fan et al. build a weight-shared framework consisting of multiple branches, where one branch is for the query set and the others are for the support set. The model first extracts features of s_c and q_c through the residual network (ResNet), and then performs the bounding box regression and localisation in the query branch. The query branch is the Faster R-CNN network containing the RPN and the detector. This framework is utilised to learn the matching relationship between support features and query features, so that the network can learn common features among the same categories. Since the RPN only classifies anchor boxes as foreground or background, all foreground anchor boxes are given high confidence, and the RoIs sent to the next stage contain many categories unrelated to the support image. Therefore, the authors introduce a novel Attention-RPN and a detector with multi-relation modules to produce an accurate parsing between the support and potential boxes in the query.
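The N-way K-shot setting described above can be illustrated with a small sampling routine. This is only a sketch under stated assumptions: the function name `sample_episode`, the dictionary layout and the toy file names are hypothetical and are not taken from the FSOD implementation.

```python
import random


def sample_episode(annotations, n_way, k_shot, seed=None):
    """Draw an N-way K-shot support set from {category: [example ids]}.

    Hypothetical helper: picks n_way categories that have at least
    k_shot examples, then k_shot examples from each chosen category.
    """
    rng = random.Random(seed)
    eligible = sorted(c for c, items in annotations.items() if len(items) >= k_shot)
    if len(eligible) < n_way:
        raise ValueError("not enough categories with k_shot examples")
    categories = rng.sample(eligible, n_way)
    return {c: rng.sample(annotations[c], k_shot) for c in categories}


# Toy data with made-up file names (not the real Thangka dataset).
toy = {
    "chignon": [f"chignon_{i}.jpg" for i in range(8)],
    "sumeruseat": [f"sumeruseat_{i}.jpg" for i in range(6)],
    "padmasana": [f"padmasana_{i}.jpg" for i in range(3)],  # too few for 5-shot
}

episode = sample_episode(toy, n_way=2, k_shot=5, seed=0)
for category, support_images in episode.items():
    print(category, len(support_images))
```

With these toy counts, a 2-way 5-shot episode can only be drawn from the two categories that have at least five examples each.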

Few shot object detection for headdresses and seats in Thangka Yidam based on ResNet and deformable convolution
Our research objects have the characteristics of complicated composition, few detection categories, different detection shapes and sizes and overlapping between detection objects. However, the detection object of FSOD is natural images, which makes the FSOD model unable to be directly applied to the detection of Thangka images. Therefore, we propose the RD-FSOD model based on the FSOD model in accordance with the above characteristics, as shown in Figure 3.
Firstly, in view of the complex composition and few detection categories, we improve ResNet50 to focus on the shallow features and decrease feature loss. In this work, we adjust the number of convolution layers of the conv3_x stage from 12 to 15 and that of the conv4_x stage from 18 to 6. Secondly, DC is introduced into the improved ResNet50 to address the problem of different detection shapes and sizes. We replace the 3×3 convolution of the improved ResNet50 with DC. By introducing the offset of DC, the receptive field can adapt to the different sizes and shapes of the detection targets of Thangka Yidam, so as to achieve better detection results. In addition, we propose a DT-NMS algorithm to solve the problem of overlapping objects in the bounding box regression stage.

Adjustment and optimisation of convolution layer structure
As a kind of deep network, ResNet can effectively address the gradient dispersion problem when the conventional network is stacked to a certain depth, and achieve a better feature extraction effect.
Thangka, as a unique religious art form, places particular emphasis on colour application and composition, which makes it difficult to extract features from Thangka images. In addition, the character attributes in Thangka images fall into only six categories. The Thangka images and character attributes are shown in Figure 4. In order to extract as many features as possible from Thangka images and improve the detection accuracy, the improved ResNet50 is designed.
Existing research (Lu, 2021; Shah et al., 2020) shows that the shallow network has a small receptive field, which can retain more image details and improve the detection accuracy. In contrast, the deep network extracts more abstract features and pays more attention to the semantic information of images, which is conducive to the detection of targets (Xu & Zhang, 2020). Therefore, we adjust the number of convolution layers of ResNet50 to better extract Thangka image features. We reduce the number of convolution layers of the deep network and increase the number of convolution layers of the shallow network. The improved network pays more attention to image detail and decreases image feature loss and network computation. We reduce the network from 50 layers to 41 layers, namely ResNet41: the convolution layers of the conv3_x stage are increased to 15 layers and the convolution layers of the conv4_x stage are reduced to 6 layers. The adjusted ResNet41 structure is shown in Figure 5. The conv3_x and conv4_x represent the 3rd and 4th stages of ResNet, respectively. In conv(256, 128, k = (1×1), s = 2, p = 0), k is the size of the convolution kernel, s is the stride, p is the padding, 256 is the number of convolution kernel channels, and 128 is the number of channels output by the upper convolution layer.
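The layer counts quoted above can be checked with simple arithmetic. The sketch below assumes the standard ResNet50 layout of bottleneck blocks (3, 4, 6 and 3 blocks in conv2_x to conv5_x, three convolutions per block, plus the stem convolution and the final fully connected layer), and assumes that conv2_x and conv5_x are left unchanged. The helper name `resnet_depth` is hypothetical.

```python
def resnet_depth(blocks_per_stage, layers_per_block=3):
    """Count weighted layers: stem conv + bottleneck convs + final FC layer."""
    return 1 + layers_per_block * sum(blocks_per_stage) + 1


# Standard ResNet50: conv2_x..conv5_x hold 3, 4, 6 and 3 bottleneck blocks,
# so conv3_x has 4 * 3 = 12 layers and conv4_x has 6 * 3 = 18 layers.
resnet50_blocks = [3, 4, 6, 3]

# Adjusted network: conv3_x grows to 15 layers (5 blocks),
# conv4_x shrinks to 6 layers (2 blocks); other stages assumed unchanged.
resnet41_blocks = [3, 15 // 3, 6 // 3, 3]

print(resnet_depth(resnet50_blocks))  # 50
print(resnet_depth(resnet41_blocks))  # 41
```

Under these assumptions, adding 3 layers to conv3_x and removing 12 from conv4_x gives 50 + 3 - 12 = 41, which matches the name ResNet41.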

Introduction of deformable convolution
In the 2D convolution, for each location p_0 on the output feature map, the feature value is calculated as

y(p_0) = \sum_{p_i \in F} \omega(p_i) \cdot f(p_0 + p_i), \quad F = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}    (1)

where i = 1, ..., N, N = |F|, ω represents the weight of the convolution kernel, f is the input feature map, y is the output feature map, F is the sampling grid that defines a 3×3 kernel with dilation 1, p_0 represents the central location and p_i enumerates the locations in F. Figure 6 shows the images of seats in Thangka Yidam. It can be seen that the appearance of the seats is variable and their sizes differ. Therefore, it is necessary to address the problem of geometric deformation of the target in the detection of Thangka images. However, according to Equation (1), the standard convolution samples the input feature map on a fixed regular grid and cannot adapt to such deformation. The DC therefore adds an offset Δp_i to each sampling location:

y(p_0) = \sum_{p_i \in F} \omega(p_i) \cdot f(p_0 + p_i + \Delta p_i)    (2)

where f(p_0 + p_i + Δp_i) represents the input feature value at the offset sampling location. DC uses a parallel convolution layer to learn the offsets so that the sampling points of the convolution kernel on the input feature map are shifted, that is, a direction vector is added to each sampling point, as shown in Figure 7. Numerous studies (Wei et al., 2020) have indicated that DC has excellent performance in the field of object detection. The application of DC can obtain position information of detection targets through continuous learning and extract more exact features of the detection targets of Thangka Yidam.
We introduce DC into all 3 × 3 convolution layers of conv3, conv4 and conv5 of ResNet41 to enhance the detection accuracy for Thangka images. Figure 8 shows the stacking sequence of conv3. By introducing deformable convolution layers, the model can obtain sampling points of more extensive hierarchical features, so that the network as a whole has the ability to learn the spatial support region more precisely.
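To make the offset sampling f(p_0 + p_i + Δp_i) concrete, the following sketch evaluates a single 3×3 deformable convolution output at one location in pure Python. It is an illustration under stated assumptions rather than the implementation used in this work: the function names are hypothetical, the offsets are passed in by hand instead of being learned by a parallel convolution layer, and fractional locations are read with bilinear interpolation as in Dai et al. (2017).

```python
import math


def bilinear(fmap, y, x):
    """Bilinearly interpolate fmap (list of rows) at fractional (y, x).

    Locations outside the feature map contribute zero.
    """
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * fmap[yy][xx]
    return value


# Regular 3x3 sampling grid with dilation 1, i.e. the set F.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def deform_conv_at(fmap, weights, p0, offsets):
    """Sum over p_i of w(p_i) * f(p0 + p_i + dp_i) for one output location."""
    y0, x0 = p0
    total = 0.0
    for (dy, dx), w, (oy, ox) in zip(GRID, weights, offsets):
        total += w * bilinear(fmap, y0 + dy + oy, x0 + dx + ox)
    return total


# Toy 4x4 feature map whose value is 4 * row + column.
fmap = [[4 * r + c for c in range(4)] for r in range(4)]
weights = [1.0] * 9

zero_offsets = [(0.0, 0.0)] * 9     # reduces to the standard convolution
shifted_offsets = [(0.5, 0.5)] * 9  # every sampling point moves by half a pixel

print(deform_conv_at(fmap, weights, (1, 1), zero_offsets))     # 45.0
print(deform_conv_at(fmap, weights, (1, 1), shifted_offsets))  # 67.5
```

With zero offsets the result equals the ordinary 3×3 convolution, which is a useful sanity check. In PyTorch, a trainable layer of this kind is provided by `torchvision.ops.DeformConv2d`, which takes the offset tensor as a second input.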

Double threshold non-maximum suppression
Non-maximum suppression (NMS) (Neubeck & Van Gool, 2006) is a key post-processing step in many computer vision fields. In object detection, it is utilised to convert a response map, which activates many imprecise object windows, into a single bounding box for each detection object. NMS is a greedy algorithm. It greedily selects the detection with the highest confidence score and deletes the less confident detections adjacent to it, to avoid the same object being detected repeatedly. The score reset function is

s_i = \begin{cases} s_i, & IoU(B_{max}, B_i) < N_t \\ 0, & IoU(B_{max}, B_i) \ge N_t \end{cases}    (3)

where B_max represents the bounding box with the highest confidence and B_i is the ith bounding box. IoU(B_max, B_i) is the Intersection-over-Union (IoU) ratio between each remaining bounding box and the bounding box with the highest confidence. N_t is the preset threshold of NMS, and s_i is the confidence. However, when two detection objects are adjacent in the image, the bounding box with lower confidence is likely to be directly suppressed, resulting in missed detection. As shown in Figure 9(a), the sumeruseat is close to the padmasana in the Thangka images. The NMS algorithm treats the sumeruseat and the padmasana as the same detection object. The padmasana bounding box with the highest score is retained, but the sumeruseat is missed.
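The greedy procedure and the missed-detection problem can be reproduced with a few lines of Python. This is a minimal sketch: the box coordinates and scores are invented for illustration, boxes are assumed to be in (x1, y1, x2, y2) format, and the threshold value 0.45 is used only as an example.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, n_t=0.45):
    """Greedy NMS: keep the best box, drop every box with IoU >= n_t to it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < n_t]
    return keep


# Made-up boxes: box 1 is an adjacent object that overlaps box 0 heavily.
boxes = [(0, 0, 10, 10), (0, 3, 10, 13), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]

print(nms(boxes, scores))  # [0, 2]
```

Box 1 has an IoU of about 0.54 with box 0, so it is discarded even if it belongs to a different, adjacent object. This is the behaviour that leads to the missed sumeruseat.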
The soft non-maximum suppression (soft-NMS) (Bodla et al., 2017) adopts the 'weight punish' strategy to address the missed detection problem. The specific method is shown in Equation (4), where the symbols are the same as in Equation (3). The soft-NMS algorithm designs an attenuation function for the overlapping part of adjacent bounding boxes between detection objects. It re-scores the boxes recursively according to the current confidence score rather than directly suppressing the adjacent bounding boxes with lower scores, thus retaining the detection bounding boxes of adjacent targets. Figure 9(b) shows the detection result of the soft-NMS algorithm, where the sumeruseat and the padmasana were identified simultaneously.
However, as the number of bounding boxes increases, the probability of target misclassification and repeated detection increases. To this end, we propose the DT-NMS algorithm, which combines the advantages of NMS and soft-NMS. The score reset function is shown in Equation (5), where the symbols are the same as in Equation (3).
We add a threshold N_d on the basis of the soft-NMS algorithm and set N_t < N_d. When IoU(B_max, B_i) is higher than N_d, we delete the bounding box B_i. This is because if IoU(B_max, B_i) is too large, B_max and B_i are very likely to be the same target, which will increase the probability of repeated detection. If N_t ≤ IoU(B_max, B_i) ≤ N_d, the same 'weight punish' strategy as in soft-NMS is applied to give a new score s_i. If IoU(B_max, B_i) < N_t, the overlap area between B_i and B_max is small, and the original score is retained.
The proposed DT-NMS algorithm can filter the bounding boxes again while decreasing the repeated detection rate. In the threshold selection process, we follow the 2-way 5-shot evaluation protocol to evaluate our DT-NMS. Table 1 shows the ablation study of the proposed DT-NMS under the naive 2-way 5-shot training strategy on the Thangka dataset. We adopt the same evaluation setting hereafter for all ablation studies on the Thangka dataset.
We choose N_t = 0.45 and N_d = 0.90 as the final thresholds, which give the highest AP of 33.34%.
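The three-way rule of DT-NMS can be sketched as follows, using the thresholds N_t = 0.45 and N_d = 0.90 chosen above. Several details are assumptions made for illustration and are not confirmed by the text: the penalty in the middle band is taken to be the linear decay s_i(1 - IoU) from soft-NMS, the final cut-off `score_min` is a hypothetical parameter, and the boxes and scores are invented.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def dt_nms(boxes, scores, n_t=0.45, n_d=0.90, score_min=0.001):
    """Double threshold NMS sketch; returns (index, final score) pairs."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        kept.append((best, scores[best]))
        survivors = []
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            if overlap > n_d:
                continue  # almost certainly the same target: delete
            if overlap >= n_t:
                scores[i] *= 1.0 - overlap  # assumed linear soft-NMS penalty
            # overlap < n_t: original score is retained
            if scores[i] >= score_min:
                survivors.append(i)
        remaining = survivors
    return kept


# Made-up boxes: 1 nearly duplicates 0, 2 is an adjacent object, 3 is far away.
boxes = [(0, 0, 10, 10), (0, 0, 10, 10.5), (0, 3, 10, 13), (20, 20, 30, 30)]
scores = [0.9, 0.85, 0.8, 0.7]

kept = dt_nms(boxes, scores)
print([i for i, _ in kept])  # [0, 3, 2]
```

In this toy case the near-duplicate box 1 (IoU of about 0.95) is deleted, while the adjacent box 2 (IoU of about 0.54) survives with a reduced score, which is the combination of behaviours the two thresholds are meant to provide.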

Experiment
In the experiments, we compare our approach with other state-of-the-art (SOTA) methods on different datasets. For fair comparison, we adopt the same train/test setting as these methods. In these cases, we use a multi-way few-shot training in the fine-tuning stage with more details to be described. At the same time, we carry out ablation experiments to verify the effectiveness of each component of the RD-FSOD model.

Experimental setting
The experiments are implemented in the PyTorch framework and carried out on the Ubuntu 16.04 system. The experimental platform is an Intel(R) Core(TM) i5-1035G1 CPU with a main frequency of 1.19 GHz and 16 GB of memory. A GPU is used to accelerate the calculation.

Thangka dataset
The research objects, headdresses and seats in Thangka Yidam, come from the Institute of Thangka, School of Art, Northwest University for Nationalities. The six categories are padmasana, chignon, cushion, mitral, sumeruseat and coronet, as shown in Figure 10. According to existing research, padmasanas have a high probability of appearing in any Thangka image. Hence, we need to balance the dataset by means of data augmentation to further improve the detection accuracy. Table 2 shows the number of samples in each category before and after the Thangka image augmentation.

MS COCO dataset
MS COCO (Lin et al., 2014) dataset is constructed by Microsoft, which contains more than 800 thousand pictures and 80 classes. It is mainly used for the task of detection and segmentation in natural scenes. MS COCO contains a complex image background, a large number of instances and small target size, so it is reasonable to verify the effectiveness of the proposed method on the COCO dataset.

Dataset split and results analysis
In order to verify the effectiveness of the method on the Thangka dataset, the experiments are carried out on the Thangka images with six categories. Padmasanas, cushions, coronets and mitrals are used as the training set, and chignons and sumeruseats as the testing set, which ensures that the training set and testing set belong to the same conceptual domain. The dataset partitions are shown in Table 3. We train the FSOD model and our model on padmasanas, cushions, coronets and mitrals. In the experiment, we set the learning rate to 10^{-4}, the number of iterations to 70,000 and the batch size to 4, and then evaluate the performance of the model on the new testing set. We adopt AP and AP50 for evaluation. Figure 11 shows the loss curves during model training. The training loss values of the two models show a decreasing trend and the models gradually converge during training. The total_loss of our method drops slowly after 30,000 iterations, tends to be stable after 50,000 iterations and finally stabilises at 0.5-0.6, while the total_loss of the FSOD model drops suddenly after 50,000 iterations and finally stabilises at 0.7-0.8, which indicates that our model converges to a lower training loss than the FSOD model.
We construct the 2-way K-shot few shot learning regime for the training of few shot data, where K ∈ {1, 5}. In the few shot testing stage, a batch of samples extracted from chignons and sumeruseats is sent to the network to obtain the predicted results. Results for the Thangka dataset are shown in Table 4. It can be seen from the experimental results in Table 4 that our model achieves 28.5% and 33.3% AP on 2-way 1-shot and 2-way 5-shot, exceeding FSOD by 4.7% and 7.1%, respectively. The AP50 of our model is also higher than that of the FSOD model on the Thangka dataset, which supports the effectiveness of our method in few shot object detection. We will further analyse the impact of our improved parts on model performance on the Thangka dataset and the COCO dataset in Section 4.3.2. Figure 12 shows the detection result on the 2-way 1-shot task. The two images on the left side are the new classes and the image on the right side is the image to be tested.
The network model has a certain influence on the performance. In Table 5, we compare our results with those of the state-of-the-art (SOTA) methods on the COCO dataset for the 20-way 10-shot detection task. We adopt the same data split as these methods and follow their evaluation protocol: we set 60 categories in MS COCO as training categories, and use the remaining 20 categories as novel categories for evaluation. In addition, we use the same backbone network for the SOTA models to carry out comparative experiments. The experimental analysis shows that our model, with the same MS COCO training dataset, outperforms SRR by 1.1%/2.2% on the AP/AP50 metrics. This indicates that our model obtains sufficient high-level features when ResNet41 and DC are utilised as the backbone network. At the same time, the application of the DT-NMS algorithm in the regression stage avoids missed or repeated detection between adjacent targets, thus increasing the detection accuracy.

Ablation experimental analysis
Based on the RD-FSOD model, we carry out ablation experiments on the Thangka and COCO datasets under different evaluation tasks. Table 6 shows the ablation results. When combining ResNet41 and DC, we obtain better performance than with the other combinations. When the ResNet41 structure is not introduced, the AP on the Thangka dataset is significantly lower than that on the COCO dataset, which further indicates that strengthening shallow features is useful for the detection of complex Thangka images. Moreover, DT-NMS has the most obvious effect on the detection of the Thangka dataset. We believe that the occlusion among Thangka targets makes the DT-NMS algorithm more effective. By combining all components, we achieve the best performance.

Conclusion
Thangka Yidam includes various characteristic attributes, such as seats, headdresses and handheld objects. The detection of headdresses and seats can not only promote the development of image understanding for Thangka Yidam, but also has a certain significance for the spread of Tibetan culture. In order to address the problems of few categories, different sizes and overlapping between target objects in Thangka image detection, we improve the FSOD model as follows: (1) A ResNet-based backbone network is designed to address the problem of few categories and complicated composition in Thangka images. The ablation experimental results verify the effectiveness of the improved ResNet in Thangka Yidam detection. (2) The deformable convolution is adopted in the improved ResNet to expand the receptive field of Thangka feature maps, which effectively strengthens the ability to extract effective Thangka features. (3) In the bounding box regression process, a DT-NMS algorithm is proposed to address the problem of overlapping between seats.
The experimental results show that the proposed method has higher detection accuracy than the SOTA models on the MS COCO dataset. In addition, the ablation experiments on the three improvements further support the effectiveness of our method.
In future, the effectiveness of our method will be further verified with the expansion of the Thangka categories, such as books and bottles in handheld objects.

Disclosure statement
No potential conflict of interest was reported by the author(s).