Instance Segmentation, Body Part Parsing, and Pose Estimation of Human Figures in Pictorial Maps

ABSTRACT In recent years, convolutional neural networks (CNNs) have been applied successfully to recognise persons, their body parts, and pose keypoints in photos and videos. The transfer of these techniques to artificially created images is rather unexplored, though challenging, since these images are drawn in different styles, with different body proportions, and at different levels of abstraction. In this work, we study these problems on the basis of pictorial maps, on which we identify depicted human figures with two consecutive CNNs: we first segment individual figures with Mask R-CNN, and then parse their body parts and estimate their poses simultaneously with four different UNet++ versions. We train the CNNs with a mixture of real persons and synthetic figures and compare the results with manually annotated test datasets consisting of pictorial figures. By varying the training datasets and the CNN configurations, we were able to improve the original Mask R-CNN model, and we achieved moderately satisfying results with the UNet++ versions. The extracted figures may be used for animation and storytelling and may be relevant for the analysis of historic and contemporary maps.


Introduction
Digital 'view-only' maps, such as scanned historical maps or modern maps made with graphic editors, include a multitude of information. To process this information in a machine-readable manner, the content of these maps needs to be extracted, in the best case fully automatically. At present, however, maps stored in a raster image format have mostly been manually annotated with metadata on social media websites and additionally georeferenced in digital map libraries, but information about their actual content is largely missing. In this work, we take a closer look at one particular content element of maps, namely human figures (Figure 1). This object type frequently occurs as a decoration in pictorial maps (Child, 1956). After a successful detection, the following two use cases are conceivable: Firstly, historians (e.g. Davies, 2016) may be interested in the ethnicity and clothing of figures, as well as in certain rites or common activities. Offering an additional search filter for human figures in digital map catalogues would therefore be highly beneficial. Secondly, figures could be animated to act as storytellers or guides in maps or paintings, for instance for museum visitors (e.g. Liu et al., 2020). For the latter usage scenario, it is additionally required to identify their body parts and pose keypoints.
A promising technology to tackle the above-mentioned tasks is convolutional neural networks (CNNs). Recent experiments have shown that it is feasible to extract labels (Weinman et al., 2019), road intersection points (Saeedimoghaddam & Stepinski, 2020), or building footprints (Heitzler & Hurni, 2020) from maps with CNNs. With our research, we would like to extend this list with human figures. To our knowledge, this object class has not yet been retrieved with CNNs from maps, only from natural images such as photos. Here, architectures like Mask R-CNN (He et al., 2017) or PANet (Liu et al., 2018) were developed to segment individual objects. The task of parsing object parts, such as body parts, was approached with configurations like an adapted fully convolutional network (FCN) (Oliveira et al., 2016) or DeepLab (Chen et al., 2018). Convolutional Pose Machines (Wei et al., 2016) and Stacked Hourglass (Newell et al., 2016) enable the detection of pose keypoints of single humans, whereas newer networks (e.g. Cao et al., 2017; Sun et al., 2020) are capable of registering multiple persons. Body part segmentation and keypoint detection have also been combined, for example, in an FCN including a conditional random field (Xia et al., 2017) or in JPPNet (Liang et al., 2019).

Data
Compared to real persons in photos, annotated datasets of human figures in maps are non-existent. Therefore, we created the data for training and testing our CNNs ourselves. The data partly originates from a Pictorial Map Classification Dataset.¹ This dataset consists of 3100 maps harvested from Pinterest in accordance with Art. 24d of the Swiss Copyright Act,² among which 1500 are pictorial maps. A manual classification revealed that roughly half of the pictorial maps include human figures. We included humanoid representations (e.g. Statue of Liberty, Christ the Redeemer, snowmen, robots) but excluded humanoid animals (e.g. King Kong) in our classification. Fifty-two of those maps, which include 387 larger figures, were manually annotated using a web application (Figure 2). The average size of the maps is 577 × 581px and that of the figures 44 × 78px. The low resolution of the map images to be fed into the CNNs is due to memory limitations of current graphics cards. Annotating one figure took about nine minutes (incl. corrections), resulting in a total working time of about 60 h. Six body parts (head, torso, two arms, and two legs) were masked, and skeletons consisting of 16 keypoints (head, neck, thorax, two shoulders, two elbows, two wrists, pelvis, two hip joints, two knees, and two ankles) were created. Since the number of annotated maps would be too small to train CNNs, these 52 maps serve as testing data for the CNNs.
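To make the annotation format concrete, the record below sketches what one annotated figure might look like; the field names and coordinate values are our own illustration and do not necessarily match the published dataset.

```python
# Hypothetical record for one annotated figure (names and values are ours).
# Body part masks are binary 2D arrays; keypoints are (x, y) map coordinates.
figure_annotation = {
    "map_id": "pictorial_map_0042.png",
    "bbox": [120, 85, 164, 163],              # x_min, y_min, x_max, y_max
    "body_parts": {                           # six binary masks
        "head": ..., "torso": ...,
        "left_arm": ..., "right_arm": ...,
        "left_leg": ..., "right_leg": ...,
    },
    "keypoints": {                            # 16 skeleton joints
        "head": (142, 90), "neck": (142, 105), "thorax": (142, 112),
        "left_shoulder": (130, 110), "right_shoulder": (154, 110),
        "left_elbow": (124, 130), "right_elbow": (160, 130),
        "left_wrist": (120, 148), "right_wrist": (164, 148),
        "pelvis": (142, 150), "left_hip": (134, 152), "right_hip": (150, 152),
        "left_knee": (133, 180), "right_knee": (151, 180),
        "left_ankle": (132, 205), "right_ankle": (152, 205),
    },
}
```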
For training the CNNs, we produced a synthetic dataset, which is a common approach in machine learning (e.g. Varol et al., 2017). It has the advantage that larger amounts of annotated data can be created in less time compared to manual annotation.
However, the variability of the data may not be as large (i.e. in our case, the styles of the figures) and training times may need to be reduced so that the CNN does not overfit to the synthetic data. Our synthetic data consists of background maps, on which persons and objects from photos, generated figures, and icon objects are placed (Figure 3). Through this mixture of real and abstract representations of humans, we hope that the CNN interpolates between them to recognise pictorial figures. Since not only human figures but also other objects (e.g. means of transportation, animals) are an integral part of pictorial maps, we included them as well so that the network learns to distinguish figures from objects.
The background maps were also derived from the Pictorial Map Classification Dataset by selecting only those without any human figures. Thirty-three anthropomorphic maps (e.g. Europa Prima Pars Terrae In Forma Virginis, The Avenger - An Allegorical War Map) were excluded because their content is interwoven with humanoid figures. One of them was added back for balancing reasons. The selection resulted in 2306 background maps, among which about one third are pictorial maps without any persons and two thirds are non-pictorial maps. To enrich these maps, 4558 real persons (Figure 4(a)), including skeletons and body part masks, and 5169 real objects (Figure 4(b)) from 19 different categories were firstly taken from the PASCAL-Part Dataset,³ which contains annotations for photos. Secondly, 4558 synthetic human figures (Figure 4(c)) were generated in a custom web application, initially as scalable vector graphics (SVG), which were finally rasterised to images. The selection of real persons and the generation of synthetic figures are based on the frequency of occurrence of certain body part configurations in the test dataset (Table 1). The postures of the synthetic figures are derived from skeleton annotations of the MPII Human Pose Dataset.⁴ At the corresponding joints of each skeleton, ellipses were drawn for the head, whereas polygons, partly with rounded corners, were drawn for the torso, arms, and legs. Shapes for hats, hair, glasses, eyebrows, eyes, noses, mouths, hands, and shoes were additionally attached to the synthetic figures. Shapes, colours, fill patterns, body part sizes, and stroke widths were randomly varied. Pose keypoints, originating from the skeletons, and body part masks, derived from the overlays, were generated alongside the synthetic figure images. Thirdly, 4759 medium-sized, non-circular icon objects (Figure 4(d)) from 44 categories were retrieved from Iconfinder.⁵ In a last, automated step, a random number of zero to 15 real and synthetic persons as well as objects were scaled randomly between 20 and 120px and placed randomly on the background maps. In the case of overlaps, person masks covering an area of less than 50px, together with their corresponding keypoints, were excluded.
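The last, automated placement step can be sketched as follows, assuming PIL-style RGBA cut-outs; the bookkeeping for masks and keypoints, including the exclusion of persons with less than 50px of visible mask area, is omitted for brevity.

```python
import random
from PIL import Image

def compose_training_map(background, entities, max_entities=15,
                         min_size=20, max_size=120):
    """Sketch of the compositing step: paste a random number (0-15) of
    person/object cut-outs, scaled to 20-120px, onto a background map."""
    canvas = background.copy()
    for entity in random.choices(entities, k=random.randint(0, max_entities)):
        scale = random.randint(min_size, max_size) / max(entity.size)
        resized = entity.resize((max(1, round(entity.width * scale)),
                                 max(1, round(entity.height * scale))))
        x = random.randint(0, canvas.width - resized.width)
        y = random.randint(0, canvas.height - resized.height)
        canvas.paste(resized, (x, y), resized)  # alpha channel acts as mask
    return canvas
```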
We tested how the following training datasets, varying in real and synthetic entities (i.e. persons and objects), affect the accuracy of the CNN targeted at instance segmentation:

- Real: 2304 maps with real entities.
- Synthetic: maps with synthetic entities only.
- Separated: real and synthetic entities placed on separate maps.
- Mixed: real and synthetic entities mixed on the same maps.
- Separated-Mixed: a combination of the Separated and Mixed datasets.

Methods
We follow a top-down approach (e.g. Lin et al., 2020) by first segmenting instances of human figures on maps and then their body parts and pose keypoints. In the first step, we try to identify silhouettes of individual figures on pictorial maps with the established Mask R-CNN architecture (He et al., 2017). This CNN is targeted at recognising objects, such as persons, in photos at pixel level. Mask R-CNN is an extension of Faster R-CNN (Ren et al., 2017), a network which is able to detect bounding boxes of objects. As in Faster R-CNN, a series of convolution and downscaling operations is initially applied inside a backbone network to extract specific image features. The outputs, so-called feature maps, are processed in two stages: firstly, a region proposal network predicts objectness scores, denoting the likelihood that a region contains an object, as well as offsets for anchors, i.e. rectangles differing in size and aspect ratio that are distributed equally in a grid covering the feature maps. Secondly, the regions of interest are further refined and a score for the potential object class is predicted. In addition to Faster R-CNN, Mask R-CNN predicts a binary mask in this second stage, where each pixel of the mask corresponds to a probability. Only one channel is required for the mask since the separation of a potentially contained object from the background is determined by a threshold.

Mask R-CNN is included in the TensorFlow Model Garden,⁶ where different CNN architectures are pre-implemented and accessible via a Python API. In our experiment, we retrained the model based on the COCO dataset (Lin et al., 2014), which comprises segmentations for 500,000 masked objects on photos, such as persons. For transfer learning, we use the five training datasets (i.e. Real, Synthetic, Separated, Mixed, Separated-Mixed) described in the previous chapter. As a backbone network for feature detection, we use ResNet with 101 layers (He et al., 2016) and atrous convolutions, which has a good accuracy-speed balance compared to the other three available TensorFlow models. Atrous (aka dilated) convolutions 'enlarge the field of view of filters to incorporate larger context, which [has been] shown to be beneficial' (Chen et al., 2018, p. 4). We set the anchor stride to eight, which has proven favourable for detecting smaller objects (e.g. Schnürer et al., 2020). We vary the four sizes of the anchors (minimum: 0.0625, maximum: 2.0) and retain their three aspect ratios (i.e. 2:1, 1:1, and 1:2). Since the architecture requires much graphics memory, images could only be fed in batches of one (i.e. single images) on an NVIDIA GeForce GTX 1080 Ti. On this graphics card, one epoch of training (i.e. 2304 steps) takes about 30 min. Other parameters were not modified from the given Mask R-CNN configuration file.

For evaluation, we calculated the average precision (AP) for masks⁷ with the COCO API. As given in the configuration of the original model, we set the threshold for confidence scores to 0.3, which means that detections below this threshold are discarded.
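To make the anchor configuration tangible, the following sketch generates anchor boxes on a regular grid, mirroring the settings described above; it is our own illustration rather than the TensorFlow Model Garden code, and the intermediate scale values and the base size are assumptions.

```python
import numpy as np

ANCHOR_STRIDE = 8                    # favourable for smaller objects
SCALES = [0.0625, 0.25, 1.0, 2.0]    # min/max from the text, middle values assumed
ASPECT_RATIOS = [2.0, 1.0, 0.5]      # 2:1, 1:1, and 1:2
BASE_SIZE = 256                      # hypothetical base anchor size in px

def generate_anchors(feature_extent):
    """Return (cx, cy, w, h) anchors distributed equally in a grid."""
    anchors = []
    for cy in range(ANCHOR_STRIDE // 2, feature_extent, ANCHOR_STRIDE):
        for cx in range(ANCHOR_STRIDE // 2, feature_extent, ANCHOR_STRIDE):
            for scale in SCALES:
                for ratio in ASPECT_RATIOS:
                    w = BASE_SIZE * scale * np.sqrt(ratio)
                    h = BASE_SIZE * scale / np.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

print(generate_anchors(64).shape)  # 8 × 8 grid cells × 4 scales × 3 ratios = 768
```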
After having recognised human silhouettes, we parse body parts and detect keypoints simultaneously with our own network, of which we test four different versions (Figure 5). Other networks performing both tasks at the same time have been proposed, but they either have no code available (e.g. Xia et al., 2017) or classify different types of body parts (e.g. Liang et al., 2019). Therefore, we cannot report any baselines from models pre-trained on real persons. Mask R-CNN would also be able to predict keypoints; however, this functionality was not part of the code. We thus decided to implement and test our own network configurations, inspired by a simple deconvolution head network (Xiao et al., 2018) and UNet++ (Zhou et al., 2020). In contrast to the simple deconvolution head network (called 'Simple Deconv' in the following) but similarly to UNet++, we use a decreasing number of filters for the deconvolution operations. Diverging from UNet++ but similarly to Simple Deconv, we do not include the loss of intermediate layers and do not perform any convolution operations after upsampling the feature maps. In contrast to both networks but similarly to other architectures like Stacked Hourglass (Newell et al., 2016), we perform an 'Add' operation to merge layers instead of concatenating them, and we do not upsample the image to the full resolution. More details on the architectural decisions and their alternatives are given in the Discussion chapter.
We implemented our body part parsing and pose estimation networks with TensorFlow's high-level Keras API.⁸ We use ResNet with 50 layers (He et al., 2016), pre-trained on ImageNet weights, as a backbone network. We feed square RGB images encoded in the JPEG format with a size of 128²px into ResNet, which is smaller than the default input size of 224²px, but the reduced size conforms better to our data. An additional one-strided convolution is performed after the first two-strided convolution in ResNet to adjust the number of channels of its output layer to 128. This step is not necessary for the ResNet output feature maps of the second to the fourth stage. We omit the lowest stage of ResNet because figures on maps do not have as many details as persons in photos. The results of the backbone network are passed to a head network, for which we test four different versions (Figure 5):

- Simple Deconv: The output feature map of the fourth ResNet stage (X_3,0) is upsampled three times (X_2,1, X_1,2, and X_0,3) by two-strided deconvolutions.
- Simple UNet: Supplementary to Simple Deconv, the outputs of the third (X_2,0), second (X_1,0), and first (X_0,0) ResNet stages are added to the upsampled feature maps, one at a time.
- Simple UNet+: The output feature maps of the third and second ResNet stages are upsampled and added to the outputs of the second and first stages. The first result (X_1,1) of the previous addition is upsampled and added to the second result (X_0,1). The output of the third ResNet stage, the first and the third result (X_1,2) are added to the upsampled feature maps of Simple Deconv.
- Simple UNet++: Supplementary to Simple UNet+, the following skip connections are inserted: X_1,0 before X_1,2, X_0,0 before X_0,2, X_0,0 before X_0,3, and X_0,1 before X_0,3.
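A condensed Keras sketch of the Simple UNet version is given below; it follows the description above, but the kernel sizes and filter counts are illustrative, and the variables x00 to x30 stand for the (channel-adjusted) ResNet stage outputs X_0,0 to X_3,0 explained in the next paragraph.

```python
from tensorflow.keras import layers

def bn_relu(x):
    # Each deconvolution and adding operation is followed by batch
    # normalisation and a ReLU activation (see text).
    return layers.ReLU()(layers.BatchNormalization()(x))

def upsample(x, filters):
    # Two-strided deconvolution with 'he_normal' kernel initialisation;
    # the kernel size of 4 is an assumption.
    return bn_relu(layers.Conv2DTranspose(
        filters, kernel_size=4, strides=2, padding="same",
        kernel_initializer="he_normal")(x))

def simple_unet_head(x00, x10, x20, x30, out_channels=24):
    # Upsample the deepest feature map three times and add the matching
    # ResNet stage output at each level (skip connections).
    x21 = bn_relu(layers.Add()([upsample(x30, 512), x20]))
    x12 = bn_relu(layers.Add()([upsample(x21, 256), x10]))
    x03 = bn_relu(layers.Add()([upsample(x12, 128), x00]))
    # Final convolution + sigmoid: a 64²px output with 7 body part channels
    # (6 parts + background) and 17 keypoint channels (16 + inverted sum).
    return layers.Conv2D(out_channels, kernel_size=1,
                         activation="sigmoid")(x03)
```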
X_m,n is the notation from UNet++: X symbolises the tensor, m corresponds to the downsampling level, and n denotes how many upsampling (i.e. deconvolution) operations have been performed. In our head networks, each deconvolution and adding operation is followed by a batch normalisation operation and a Rectified Linear Unit (ReLU) activation function. Kernels of convolutional layers are initialised by a truncated normal distribution centred at zero (i.e. 'he_normal'). The final convolution operation is followed by a sigmoid activation function, so that we obtain a 64²px image with 24 channels in the end. The channels correspond to six body parts plus one channel for the background, and to 16 keypoints plus one channel containing the inverted image of the summed keypoints. In the ground truth data, body part pixels have a value of one and all other pixels a value of zero. The keypoints are represented by a 2D Gaussian kernel, similarly to Convolutional Pose Machines (Wei et al., 2016), with a probability of one at the centre and gradually decreasing values around it (Figure 6). The loss is split between body parts and keypoints and is reduced in both cases using the categorical cross-entropy function and the RMSprop optimiser during training. The background channel for body parts is ignored in the loss function so that the network is not biased towards the white background of the images with human figures. We fed images in batches of 15 into the network and trained for 15 epochs, which took about 10 min in total.
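As an illustration of the ground truth encoding, the snippet below builds Gaussian keypoint channels and the inverted channel of the summed keypoints; the standard deviation is an assumption.

```python
import numpy as np

def keypoint_heatmap(cx, cy, size=64, sigma=2.0):
    """One ground-truth keypoint channel: a 2D Gaussian kernel with a
    probability of one at the joint (cx, cy) and gradually decreasing
    values around it. The sigma value here is an assumption."""
    xs, ys = np.meshgrid(np.arange(size), np.arange(size))
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# 17th keypoint channel: the inverted image of the summed keypoint channels.
heatmaps = np.stack([keypoint_heatmap(32, 10), keypoint_heatmap(32, 20)])
background = 1.0 - np.clip(heatmaps.sum(axis=0), 0.0, 1.0)
```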

Results
Figure 6. One-hot encoded masks for body parts (a) and keypoints (b) of a human figure from the test dataset. The first mask is the difference of the summed other masks; the other masks represent single body parts or pose keypoints.

We trained each Mask R-CNN configuration (i.e. same training data and same anchor scales) five times, disregarded the highest and lowest score, and averaged the remaining three scores to reduce the variability in the comparison. Our evaluation procedure is similar to Zhang et al. (2019), who additionally calculated the standard deviation but did not discard the extreme scores. Quantitative results (Table 2) show that the highest AP was achieved when training with the Separated dataset. Qualitative results (Figure 7) illustrate that more human figures could be identified on maps with the retrained model. However, the number of false positives also increased, as seen in the exemplary visual results. Still, not all figures, especially smaller ones, could be identified. APs were lower when training Mask R-CNN with synthetic or real entities only, and likewise when those entities were mixed on maps. The Separated-Mixed dataset resulted in an AP in the middle between the standalone datasets. The results vary slightly for different anchor scales; however, no clear trend could be observed as to whether smaller or larger values are favourable. For example, the best overall AP was achieved for the largest tested anchor scale values with the Separated dataset, whereas the highest AP on average was measured for the smallest anchor scale values when training with the Mixed dataset. For the other datasets, sometimes larger, intermediate, or smaller anchor scales led to higher accuracies on average.
As training times for our CNNs on body part parsing and pose estimation were lower, we could run each configuration (i.e. same dataset, same architecture) 20 times, disregarding the four highest and lowest scores and averaging the remaining 12 scores. Quantitative results (Table 3) indicate that the highest AP on average for both tasks and the best overall accuracy were achieved for Simple UNet trained with real and synthetic persons. Simple UNet+ and Simple UNet++, which are architectures with more connections, performed slightly worse, whereas the results of Simple Deconv are comparable to Simple UNet. Training the CNNs with real data only yielded similar results for parsing body parts, whereas the addition of synthetic data led to higher average accuracies for detecting pose keypoints. Training with synthetic figures only resulted in clearly lower accuracies. Qualitative results (Figure 8) demonstrate that common poses and some of the more difficult ones (e.g. side views, overlapping or missing body parts) could be identified satisfactorily; however, more challenging cases (e.g. persons viewed from behind, similar-looking body parts) sometimes caused classification errors. Unusual poses (e.g. during sports activities) or too small figures were not recognised very well.

Notes to Table 3: The four best and worst results of 20 runs have been removed from the calculation. The highest average scores for body parts and pose keypoints are marked in bold. The last row contains the highest achieved accuracy of all runs.
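In both evaluations, this averaging protocol amounts to a trimmed mean; a minimal sketch (our own helper, not code from the experiments):

```python
def trimmed_mean(scores, trim):
    """Average the scores after discarding the `trim` highest and lowest
    values: trim=1 of 5 runs for Mask R-CNN, trim=4 of 20 runs for the
    body part parsing and pose estimation networks."""
    kept = sorted(scores)[trim:-trim]
    return sum(kept) / len(kept)

print(trimmed_mean([0.31, 0.28, 0.35, 0.30, 0.29], trim=1))  # 0.30
```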

Discussion
While several training datasets are available for human parsing in photos, no datasets existed for pictorial maps to our knowledge. As supervised learning with CNNs requires many training samples, we decided to create a mixture of real persons extracted from photos and automatically generated abstract figures. Training a CNN with data from a different domain is usually less promising; however, it is not clear at this point whether a fully manually annotated dataset in our target domain would have led to better results, given the different drawing styles. In our case, the combination of real and synthetic data paid off for segmenting instances of human figures and their body parts and for estimating their poses. However, training the CNNs with synthetic data only would not be sufficient, as the result metrics show. The accuracy might be higher if more variations of shapes, body features, and clothes of the synthetic figures were included. We tried to minimise the manual effort, for example by generating a fill pattern with random polylines, but this is still quite different from the original folds and shadows of clothes and hair. Originally, 10 body parts (incl. lower/upper arms/legs) were classified in our training datasets, but as the CNNs already struggled to segment six body parts in moderately complicated poses, we trained them with this reduced number of categories. Therefore, a post-processing step (e.g. a Voronoi diagram) would be required to distinguish the upper and lower parts of the limbs, as sketched below. The achieved scores between 10% and 20% sound low, but the qualitative results already look reasonable. The original Mask R-CNN model had an AP of 33%⁹ for instance segmentation of different object categories on photos, but it is not clear where the score for persons alone would range. Our low scores for body parts and pose keypoints may be explained by the small image sizes, where already minor deviations have a large impact. Furthermore, the COCO metrics for keypoints are not directly comparable to those of real images since they contain pre-defined standard deviations for every pose keypoint.¹⁰ Body proportions of pictorial figures, however, can be largely distorted. Internally, we calculated a simpler error metric, but as these results correlate with the COCO scores, we report only the standard metric. For the comparison of the different CNN configurations, we relied on average results instead of giving only the best result. This methodology may be more robust concerning outliers since a configuration can be tested only a couple of times due to the long training times. When training more often, slightly higher APs than the reported ones can be achieved.
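For illustration, such a post-processing step could assign each limb pixel to the nearer of the two segment midpoints around the elbow or knee; this is a rough stand-in for a proper Voronoi separation and not part of our pipeline.

```python
import numpy as np

def split_limb(mask, upper_joint, middle_joint, lower_joint):
    """Split a limb mask (e.g. an arm) into upper and lower parts by
    assigning each pixel to the nearer of the two segment midpoints
    (shoulder-elbow vs. elbow-wrist). Joints are (x, y) coordinates."""
    upper_mid = (np.asarray(upper_joint) + np.asarray(middle_joint)) / 2
    lower_mid = (np.asarray(middle_joint) + np.asarray(lower_joint)) / 2
    ys, xs = np.nonzero(mask)
    pixels = np.stack([xs, ys], axis=1)
    is_upper = (np.linalg.norm(pixels - upper_mid, axis=1)
                <= np.linalg.norm(pixels - lower_mid, axis=1))
    upper, lower = np.zeros_like(mask, bool), np.zeros_like(mask, bool)
    upper[ys[is_upper], xs[is_upper]] = True
    lower[ys[~is_upper], xs[~is_upper]] = True
    return upper, lower
```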
The developed CNN configurations were already the result of many trials, but a detailed evaluation of every architectural decision would go beyond the scope of this article. Instead, we would like to give a brief overview of the different factors to consider: for both CNNs, it would be possible to use different backbone networks, image input sizes, learning rates, or numbers of learning steps and epochs. Besides these factors and the compared anchor scales, we relied on the default hyperparameters for Mask R-CNN because they have already been fine-tuned by the authors. For our CNN versions on body part parsing and pose estimation, we list the tested possibilities and our decisions in Table 4. The alternatives to our chosen parameters, such as other optimisers than RMSprop, led to worse or only marginally different outcomes. We reported the scores of Simple UNet+ and Simple UNet++ to demonstrate that increasing the architectural complexity did not help to solve our problem. However, our results suggest that we have not yet found an optimal solution, only a local maximum. Further investigations, as proposed in the next chapter, will be needed to improve the quality of the outcomes.

Conclusion and future work
We have made the following contributions within the scope of this article:

- Creation of publicly available training datasets including annotated body parts and skeletons of real persons and synthetic figures on maps and on single images
- Creation of publicly available test datasets including annotated body parts and pose keypoints of human figures on pictorial maps and on single images
- Application of a CNN for instance segmentation of human figures on pictorial maps
- Development of CNN architectures for the simultaneous prediction of body parts and pose keypoints of human figures on single images
- Qualitative and quantitative evaluation of the results of both CNNs

We measured an increased accuracy compared to the baseline model for identifying silhouettes of human figures on pictorial maps when training the CNN with real persons and synthetic figures on separate maps. The accuracy of the CNNs detecting body parts and joints of human figures simultaneously was slightly higher for the simpler architectures. Here, the combination of real and synthetic data led only to a small gain in accuracy. Qualitative results showed that many figures on maps can be found, but not all. Body parts and keypoints were satisfactorily recognised for common poses by our developed CNN architectures, but not for special cases.
Our work offers various potentials for improvement. Other datasets with real persons than PASCAL-Part could possibly be added; unfortunately, those of DensePose (Güler et al., 2018) and Look Into Person (Liang et al., 2019) were only partly compatible. The synthetic figure generator could be extended so that a larger range of appearances and clothes is supported. Adding a generative adversarial network (GAN) may help to reduce the domain gap between training and test data (e.g. Sankaranarayanan et al., 2018). The number of annotated maps in the test dataset could also be increased, and a distinction could be made between visible and hidden keypoints. Ideally, only one CNN would be needed to fulfil all three tasks (i.e. instance segmentation, body part parsing, pose estimation). Newer feature extractors than ResNet, for instance HR-Net, may lead to a higher accuracy. Body part predictions may be smoothed by the addition of conditional random fields (Arnab et al., 2018).
The results of our work may be transferable to recognising human figures in books, comics/manga, or paintings. Extracting additional properties of figures (e.g. age, gender, skin colour, pieces of clothing, performed activity, and spatial relation to other figures) would enable an even more fine-grained search. Beyond that, it would be interesting to identify other pictorial entities like animals or means of transportation. When figures are displaced from their original position, for example when being animated, an empty region remains in the background. This is a semantic image inpainting task that could be tackled by GANs (e.g. Yeh et al., 2017).

Notes on contributors

René Sieber is a senior scientist at the Institute of Cartography and Geoinformation of ETH Zurich. In 1995, he became project manager of the 'Atlas of Switzerland' and has administered four digital versions, of which the latest release was an online application. Since 2015, he has also been the chair of the ICA Commission on Atlases.