A feasibility study of applying generative deep learning models for map labeling

ABSTRACT The automation of map labeling is an ongoing research challenge. Currently, map labeling algorithms are based on rules defined by experts for optimizing the placement of text labels on maps. In this paper, we investigate the feasibility of using well-labeled map samples as a source of knowledge for automating the labeling process. The basic idea is to train deep learning models, specifically the generative models CycleGAN and Pix2Pix, on a large number of map examples. Then, the trained models are used to predict good locations of the labels given unlabeled raster maps. We compare the results obtained by the deep learning models to manual map labeling and a state-of-the-art optimization-based labeling method. A quantitative evaluation is performed in terms of legibility, association and map readability, together with a visual evaluation performed by three professional cartographers. The evaluation indicates that the deep learning models are capable of finding appropriate positions for the labels, but that they, in this implementation, are not well suited for selecting the labels to show or for determining the size of the labels. The results provide valuable insights into the current capabilities of generative models for such tasks, while also identifying the key challenges that will shape future research directions.


Introduction
Automated map labeling has been on the research agenda for decades and is today used in production environments. The current methods and tools are based on quantification of label placement rules defined by cartographic experts and well-known cartographic sources (e.g. Imhof, 1975). The question that we address in this study is whether cartographic map labeling examples, conducted by cartographic experts, can be utilized as an alternative or complementary source of information for automating the labeling process.
Map labeling consists of three processes: (a) selection of labels to be added to the (unlabeled) map, (b) determination of label size (dependent on length of text, use of abbreviations, division of the text label into several lines, etc.), and (c) finding appropriate positions of the labels. There are challenges in automating all three processes, and the interplay between them makes automation even harder. In this paper we study these challenges using a machine learning approach, more specifically generative deep learning models. These generative models are trained with an extensive number of map examples consisting of raster maps with manually placed labels represented as polygons. In addition, the deep learning (DL) model is trained with numerical information about the size of the label text and the position of its associated feature in the map examples. The intention is that the DL model should learn: (1) linkages between the numerical information about the size of the labels and the position of their associated features and the actual labels in the map, and (2) where it is suitable to place a label in the map. In the prediction phase of the generative models, where the map labeling is conducted, the following input data are used: raster map data (without any labels), and numerical information about the size of the labels and the position of the map feature to be labeled (taken from the map features and their attributes, not from any manually placed label). In an ideal case, the DL model should be able to learn patterns from the numerical information in order to decide which labels should be included in the map, to determine the size of the labels, and to find a suitable position for these labels in relation to their associated map features and to other map features, including other text labels.
The overall aim of this study is to take a first step toward answering the question of whether automation of map labeling could be based on map examples and generative deep learning. We acknowledge that, in the short term, the DL models will neither increase the quality of map labeling nor raise the automation level in the map production process. However, it is interesting to study the possibilities and limitations of using DL in order to guide future development, if the methodology shows potential. In the study we address the following research issues: (1) how to design and implement generative deep learning models for performing map labeling, and (2) whether the generative deep learning model is capable of selecting appropriate labels for the map (process a), learning the size of the label (process b) and determining suitable locations of the labels (process c). The second research issue is addressed by a visual evaluation (for smaller areas) by professional cartographers and a quantitative evaluation (for large areas) against manual map labeling and labeling performed by a state-of-the-art optimization-based method. The quantitative evaluation uses measures of labeling quality in terms of legibility, association and map readability.

Generative deep learning
In recent decades, machine learning has gone through a quick development and has provided successful solutions to complex problems in several disciplines. Particularly, it has succeeded and outperformed classical techniques in many tasks such as image recognition (Ohri & Kumar, 2021), image classification (Zhao & Du, 2016) and robot technology (Levine et al., 2018). It is noteworthy that machine learning has been successful in a wide range of vision-based applications since it is more flexible and adaptable to complex and diverse tasks than conventional techniques.
Classical machine learning techniques require substantial effort in feature engineering and preprocessing to extract meaningful features from data. In contrast, deep learning techniques do not require such engineering work; their layers have the ability to extract varied and rich features at different granularities. In addition, deep learning techniques (e.g. convolutional neural networks, CNNs) and learning mechanisms such as attention, adversarial learning and spatial transformation have provided new opportunities to learn from complex and big data.
A CNN model includes convolutional layers stacked on top of each other, and each layer is capable of learning various features and generating feature maps using filters of different sizes. Fully connected networks are prone to overfitting if not regularized, as each neuron in one layer is connected to all neurons in the next layer (Szegedy et al., 2015). With CNNs, regularization is achieved by exploiting the hierarchical patterns in the input data, employing increasingly complex filters or kernels on the data with increasing network depth. Much research has gone into optimizing the network design to increase the performance of learning specific tasks, to overcome challenges such as imbalanced data, lack of data and interpretability, and to tackle technical issues such as under-specification, overfitting, and the exploding/vanishing gradient problem (Alzubaidi et al., 2021). Building on top of the basic architecture, many efficient models such as Faster-R-CNN, U-Net, YOLO, SSD, FPN, and Inception have been developed (Dhillon & Verma, 2020).
Deep learning has been used for various purposes in cartography and the achieved results are promising. Multiple deep learning models have been used for identification and classification of road networks (Deng et al., 2021) as well as river networks (Yu et al., 2022, 2023). In addition, deep learning has been used to extract, synthesize and reconstruct features from historical maps (Andrade & Fernandes, 2020; Can et al., 2021; Uhl et al., 2020). Furthermore, deep learning has been applied in map generalization applications (Courtial et al., 2020; Feng et al., 2019; Touya et al., 2019).

Generative adversarial networks - GANs
One type of deep learning model increasingly applied in many image applications, and of potential interest to map labeling, is the generative adversarial network (GAN). A GAN includes two networks trained in contest: a generator that creates new samples and learns to map from a latent space to a given data distribution, and a discriminator that evaluates the generated samples and distinguishes them from the true data distribution (Goodfellow et al., 2014). The generator and discriminator play a minimax game which, if its equilibrium is reached, results in very good performance, e.g. generating highly realistic looking images. GANs have successfully been applied in other cartographic applications such as map style transfer and map generalization, e.g. Kang et al. (2019) trained GANs to transfer cartographic knowledge across multiple scales, including stylistic elements and generalization rules.
GANs are relevant in this context because the problem of placing labels on maps can be formulated as an image synthesis problem and can benefit from the advanced vision-based techniques developed so far. Map labeling can largely be considered a graphical problem, as most of the constraints that guide labeling can be captured by an image that includes the objects to be labeled, and deep learning is very effective in image processing. The two most interesting approaches for image synthesis are image composition and image translation. Image composition aims to synthesize new images by placing foreground objects into an existing background image. The foreground objects in our case are the labels that should be placed in the background image, i.e. the map, at semantically sensible regions. On the other hand, image-to-image translation aims to find a mapping from one visual representation to another and can be seen as an analogy to language translation (Isola et al., 2017).
The approach proposed for automated labeling in this study is based on generative models, which try to explicitly model the distribution of the data in multidimensional space. Hence, if we consider a vector x from a source domain X as input of the model and y belonging to a target domain Y as output, the generative model learns the joint probability distribution p(X, Y) while the discriminative model learns the conditional probability distribution p(Y|X=x). The basic idea of our approach is to train specific, well-designed GANs to learn the mapping from the non-labeled source domain (x) to the labeled target domain (y); similar approaches have been applied successfully in placing other types of objects in images (e.g. Lee et al., 2018). Given the image input representing the background of the map, which includes only the object features, the model should provide a labeled image by adding the labels as foreground. Thus, it can be considered an image-to-image translation task.

Map labeling
Automation of map labeling has been on the research agenda for several decades, following pioneering work by e.g. Yoeli (1972) and Freeman (1988). The most common approaches are rule-based techniques (e.g. Jones, 1989) and optimization techniques (e.g. Zoraster, 1997), both of which are based on quantifying map labeling rules found in seminal work such as Imhof (1975). These cartographic requirements consider a wide range of aspects, where the most important are linked to (for more details see Imhof, 1975; Rylov & Reimer, 2015; Van Dijk et al., 2002; Wood, 2000):
• Legibility: a label is not allowed to overlap with another label.
• Association: it should be easy to interpret which map object a label refers to, hence avoid placing labels too close to other objects.
Map labeling can be defined as an optimization problem that involves finding an optimal solution according to the cartographic requirements (quality factors) of the labels. The objective function (f) of the optimization problem can be defined in many ways; the example used here, modified from Zhang and Harrie (2006), combines the quality factors using weight factors $w_i$ (i = 1, 2, 3, 4). There are several methods to find the optimal solution of the objective function, such as simulated annealing (Zoraster, 1997), genetic algorithms (Yamamoto & Lorena, 2005) and integer programming (Haunert & Wolff, 2017).
One challenge with the optimization techniques is to define the optimization function in such a way that the most interesting cartographic requirements are considered (and not only the number of labels placed). This issue is addressed by Rylov and Reimer (2014), who specify and evaluate several requirements for label placement (of point objects). Other researchers have identified that the available optimization techniques, as well as other map labeling methods, cannot provide a solution comparable to manual labeling and have therefore developed combined automated and interactive solutions (see e.g. Klute et al., 2019). One knowledge source for the automation of map labeling is map examples generated by cartographic experts. The question is whether it is possible to derive useful information from these map examples. So far there are quite a few map labeling studies that have taken this machine learning approach. Pokonieczny and Borkowska (2019) utilized a neural network to determine feature labeling in topographic maps. They trained a network with input terrain coverage data and labels from several maps to determine in which rectangle around a feature a label should be placed. They achieved up to 80% correctly placed labels, which made it possible to reduce manual editing by 50%.
A common strategy in area feature labeling, implemented in several GIS programs, is to place the label on top of the centroid of the polygon that defines the area. However, for many polygonal shapes this strategy is not cartographically satisfying, and in map production cartographers manually select other positions. Furthermore, it is difficult to formalize what a good position is for an arbitrary polygon shape. Li et al. (2020) utilized data to train a stacked hourglass network to produce a heatmap that indicates a good position of the area label. The methodology was applied to map labeling of property units in a cadaster map and yielded relatively good results. Lan et al. (2022) performed a study on labeling schematic maps. They trained a neural network on the relationships between station attributes (foremost line relations) and placement of labels (position and direction) using existing schematic maps. The neural network was then used to perform placement of labels, with effective and satisfactory results for some test cases. In a previous study (Harrie et al., 2022), we identified map production challenges where current map labeling tools do not provide satisfactory results. In that study several deep learning approaches are discussed, including image translation, image compositional methods and keypoint detection models. This paper describes implementations of some of the ideas presented in that paper.
Evaluation of automated map labeling is often performed visually by cartographic experts. The quantitative evaluation of automated map labeling methods is often restricted to counting the number of placed (non-overlapping) labels: the larger the number of labels, the better. But the quality of map labeling cannot be based solely on the number of labels; the placement should also adhere to cartographic requirements (e.g. association, map readability and aesthetics). Van Dijk et al. (2002) formulated a large number of quantitative measures for these requirements, largely based on cartographic principles found in e.g. Imhof (1975) and Yoeli (1972). These measures were later extended and used for evaluation of automated point labeling by Rylov and Reimer (2014). A practical use of these types of measures was made by Kern and Brewer (2008). They compared the labeling conducted by two commercial labeling tools using the measures number of labels, no overlaps (legibility) and preferred positions (association).
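As an illustration of the simplest of these measures, the following minimal sketch counts pairs of overlapping labels, assuming each label is approximated by an axis-aligned bounding box. The names `Box`, `boxes_overlap` and `count_overlapping_pairs` are hypothetical and are not taken from the cited studies.

```python
from itertools import combinations
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box of a label (hypothetical representation)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the interiors of the two boxes intersect (touching edges do not count)."""
    return a.xmin < b.xmax and b.xmin < a.xmax and a.ymin < b.ymax and b.ymin < a.ymax


def count_overlapping_pairs(labels) -> int:
    """Number of unordered label pairs that overlap; 0 means no legibility conflicts."""
    return sum(1 for a, b in combinations(labels, 2) if boxes_overlap(a, b))


labels = [Box(0, 0, 10, 4), Box(8, 2, 18, 6), Box(20, 0, 30, 4)]
n_conflicts = count_overlapping_pairs(labels)
print(n_conflicts)
```

The pairwise test is quadratic in the number of labels, which is acceptable for a sketch; a spatial index would be a natural replacement for large maps.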

Study setup
The study workflow is divided into five phases, where each phase consists of several steps (Figure 1). The first phase is to select and design map data for the study (A in the workflow). This is followed by a phase where we generate input data to train the deep generative models as well as design and train the models (B). The input to this phase is a vector map where the labels are created manually, and the output is trained deep learning models. In the third phase, we predict the labels using the deep learning models (C); the outcome of this phase is a vector map with deep learning generated labels. Then, in the fourth phase we create labels using an optimization approach (D). Finally, in the evaluation phase (E) we compare the labels generated manually, by deep learning and by an optimization method. Details of the phases are provided below.
All the code developed for the steps is available on GitHub.¹ The deep learning model experiments are implemented in TensorFlow version 2.8.1 and run on 2 × NVIDIA Tesla A100 GPUs with 40 GB RAM. The batch size is 8. The map data are under a license that does not allow them to be distributed.

A: select and design map data for the study
In this study we chose to use the city wayfinding maps² of London (Figure 2). City wayfinding maps are information-dense and present information about complex urban environments, mainly for pedestrians and cyclists. The cartographic design of the map follows the design standards produced by Transport for London³ as well as internal production rules of the mapping company T-Kartor.⁴ The data were provided to us as vector data for several themes. We symbolized these data according to the rules provided by Transport for London and T-Kartor.
The city wayfinding maps include a large number of labels (cf. Figure 2). These labels are placed by T-Kartor using the ESRI ArcGIS Maplex label engine⁵ and substantial manual label (annotation) editing, both in the ArcGIS environment and in the publishing tool Adobe Illustrator. In the manual part of the process almost all labels are moved, which implies that we can regard the labels in the city wayfinding maps as manually placed.
In production, the labels were placed in accordance with instructions from Transport for London⁶ and internal rules at T-Kartor. These instructions can be summarized as:
• Line feature labeling: line feature labels, e.g., for roads, are to be placed within the road area. Straight parts of a road are preferable for labels due to readability; if this is not possible, the label shape needs to adapt to the shape of the feature. Labels can also be wrapped into two (or more) lines, or shortened, to make them fit. For long line features, labels are repeated.
• Area feature labeling: preferably, area labels should be completely placed within the polygon feature they represent, wrapping text into several lines if necessary. But if unavoidable, area labels may cross the polygon boundary. Labels should be horizontal and aligned according to their relation to the polygon feature (e.g., left alignment if placed more to the right of the feature).
• Label overlap and removal: In short, the first rule is that no text labels and icons may overlap, and the second rule is that it is not allowed to remove a text label or icon. Clearly, these rules often result in conflicts that require exceptions, e.g., for text labels to overlap icons or buildings they do not represent, as long as it is still clear which building each label corresponds to.

B: create input data for training as well as design and train the deep learning models
This section starts with a description of the input data used to train the generative deep learning models. Then follows a description of the two GAN models used: CycleGAN and Pix2Pix. These models were chosen because of their impressive performance in several image generation tasks, notably in image-to-image translation. The subsequent sections describe how the CycleGAN model is designed and trained in this study; the Pix2Pix model is used in a similar manner. In Appendix A a theoretical background to CycleGAN and Pix2Pix is provided.

Input data for training the deep learning models
We used data from the city wayfinding map of the London region as input data for training our deep learning model. We selected 80% of the total area for training, saving the remaining 20% for prediction, i.e., for conducting label placement (cf. Figure 3). The central part of the map is much denser in terms of features and labels. The density decreases with distance from the center until we reach areas without any labels, such as the corners, where only the dark blue background is displayed. To obtain balanced data, we removed areas without any labels.
To obtain training data for the deep learning models we rasterized the vector data, where each landmark and road label was represented by a rectangle (and all other labels and icons were removed). Then, we transformed the vector data into RGB images of size 512 × 512 with a ground resolution of 1 m (Figure 4). To guarantee that each (manually placed) label is completely within at least one raster image, we created two overlapping and complementary datasets (where the center of a raster map in one dataset coincides with the corner of a raster map in the other dataset). Figure 4 illustrates the importance of using two datasets. In total 9620 training maps were created (4810 in each dataset).
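The two complementary tile grids can be sketched as follows, under the assumption that the second grid is shifted by half a tile (256 pixels) in both directions so that tile centers of one grid coincide with tile corners of the other. The function names and the map extent in the example are hypothetical; the helper only reports which tiles of each grid fully contain a given label box.

```python
TILE = 512  # tile side in pixels (1 m ground resolution in the study)


def tile_origins(width, height, offset=0, tile=TILE):
    """Upper-left corners of a regular tile grid shifted by `offset` in x and y."""
    return [(x, y)
            for y in range(offset, height - tile + 1, tile)
            for x in range(offset, width - tile + 1, tile)]


def tile_contains(origin, box, tile=TILE):
    """True if the label box (xmin, ymin, xmax, ymax) lies completely inside the tile."""
    ox, oy = origin
    xmin, ymin, xmax, ymax = box
    return ox <= xmin and oy <= ymin and xmax <= ox + tile and ymax <= oy + tile


def covering_tiles(box, width, height):
    """Tiles of grid A (no shift) and grid B (half-tile shift) that contain the box."""
    grid_a = tile_origins(width, height, offset=0)
    grid_b = tile_origins(width, height, offset=TILE // 2)
    return {"A": [o for o in grid_a if tile_contains(o, box)],
            "B": [o for o in grid_b if tile_contains(o, box)]}


# A label that straddles the x = 512 boundary of grid A is cut there,
# but lies completely inside one tile of the shifted grid B.
cut_label = covering_tiles((490, 300, 540, 320), width=2048, height=2048)
inner_label = covering_tiles((10, 10, 60, 30), width=2048, height=2048)
print(cut_label, inner_label)
```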
The first input of the deep learning model consists of three channel RGB images showing the features in different colors: landmark buildings in canary yellow, streets in dark blue, green areas in green, background in sky blue, etc.
The second input of our models is the numerical data representing the text features. It is a vector that includes some attributes of the label text, namely the number of characters, the length of the text on the map, and the number of lines, so that the font size is included implicitly. As each label is associated with a specific map feature, it is necessary to incorporate spatial attributes within this vector. We opted to include the centroid coordinates of the landmark to be labeled and approximate coordinates along the road where the label should be placed. This additional data serves as input for the model to solve the selection problem, which is a challenge in all automated labeling methods. It will also contribute, to some extent, to learning the association between text and map objects.
The process above is repeated for maps without labels (cf. Figure 1).These map examples are necessary to train the deep learning models.
Each unlabeled raster map is accompanied by a corresponding set of vectors that includes the numerical data describing its label texts. This numerical data is transformed into a 2D image-like array, referred to as a channel, to enable parallel feeding with the raster data to the models, to leverage the capabilities of convolution layers for extracting relevant patterns, and to enable the attention layers to extract the contextual features.
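One possible way to turn the per-label vectors into image-like channels is sketched below. The encoding is an assumption made for illustration only (the text does not specify it): each attribute gets its own channel, and its value is written at the pixel of the associated feature position. `LabelInfo` and `rasterize_label_info` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LabelInfo:
    """Numerical description of one label text (hypothetical container)."""
    n_chars: int        # number of characters in the label text
    text_length: float  # length of the text on the map, in pixels
    n_lines: int        # number of text lines
    x: int              # column of the associated feature position in the tile
    y: int              # row of the associated feature position in the tile


def rasterize_label_info(labels, size=512):
    """Return three size x size channels holding n_chars, text_length and n_lines.

    Assumed encoding: each value is stored at the pixel (y, x) of the feature
    position; all other pixels stay 0. Positions outside the tile are skipped.
    """
    channels = [[[0.0] * size for _ in range(size)] for _ in range(3)]
    for lab in labels:
        if 0 <= lab.x < size and 0 <= lab.y < size:
            for c, value in enumerate((lab.n_chars, lab.text_length, lab.n_lines)):
                channels[c][lab.y][lab.x] = float(value)
    return channels


# Small 8 x 8 demonstration tile with one valid label and one outside the tile.
demo = rasterize_label_info(
    [LabelInfo(12, 48.0, 1, x=3, y=5), LabelInfo(7, 30.0, 2, x=20, y=1)], size=8)
print(demo[0][5][3], demo[1][5][3], demo[2][5][3])
```

Channels built this way have the same spatial size as the RGB raster and could be stacked with it before being passed to convolution or attention layers.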

Design the CycleGAN model
Our goal is to learn a mapping function between a source domain encompassing non-labeled images and text features and a target domain including labeled map images. We therefore set up two domains, data_unlabeled and data_labeled, denoted by X and Y, respectively. Here $(x_i, t_i) \in X$ are the samples belonging to the data_unlabeled domain, where $x_i$ is the raster image and $t_i$ is the numerical data describing the text attributes, whereas $y_i \in Y$ are the samples in the data_labeled domain representing the target raster images. Thus, $(x, t) \sim p_{data}(x, t)$ represents the joint probability distribution of unlabeled raster maps and text features, and $y \sim p_{data}(y)$ represents the data distribution of labeled raster maps. To achieve the translation from the unlabeled to the labeled domain, we use two CycleGAN generators, $G_{unlabeled \to labeled}$ and $G_{labeled \to unlabeled}$, simultaneously (Figure 5). In addition, we use two discriminators, $D_{labeled}$ and $D_{unlabeled}$, where $D_{unlabeled}$ judges the authenticity of $(x, t)$ and $G_{labeled \to unlabeled}(y)$, and $D_{labeled}$ judges the authenticity of $y$ and $G_{unlabeled \to labeled}(x, t)$.
The generator model is tailored for the intricate task of image translation with associated textual features by leveraging the attention mechanism and learning the interplay with textual cues. Through a blend of attentive information integration, multilayer processing, and convolutional operations, the model accomplishes the task of image translation while retaining fine-grained information. In detail, this generator is designed to work with inputs consisting of images with dimensions of 512 × 512 pixels, ensuring a substantial resolution for intricate transformations, and text features transformed to the same dimensional space. The model employs a multi-head attention mechanism, which serves as the cornerstone for integrating the textual and visual information. This mechanism fuses the provided text features with the image content, yielding a context vector that encapsulates the combined patterns. This context vector is then merged with the original images, setting the stage for subsequent transformations.
Next, the generator applies a sequence of down-sampling and up-sampling layers. The extracted feature maps are fed into the encoder, which is composed of three convolutional layers that distil their high-level details. The role of the encoder is to reduce the representation size while increasing the number of channels and extracting the most important features of the input. This is achieved by progressively down-sampling the input using a series of layers with the number of filters set to 32, 64, and 128, where each layer is followed by a normalization layer and a ReLU layer. The resulting output is then passed to the transformer, which tries to extract the structural information. Then comes the decoder, which expands the received batch back to its original dimension. It includes two transposed convolutions to enlarge the representation size, and one output layer to produce the final image. Skip connections are used to ensure that more features flow from input to output during forward propagation, and that loss gradients flow to the parameters during backpropagation. They share information between mirrored layers in the stacks, as shown in Figure 5. Specifically, as in the U-Net architecture, skip connections are added between each layer i and layer n - i, where n is the total number of layers, and the role of each skip connection is to concatenate all channels at layer i with those at layer n - i.
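The change in spatial size through the encoder and decoder can be traced with standard convolution arithmetic. In the sketch below, only the encoder filter counts (32, 64, 128), the number of layers and the 512 × 512 input come from the text; kernel sizes, strides, paddings and the decoder filter counts (64, 32) are assumptions borrowed from common CycleGAN-style generators.

```python
def conv_out(size, kernel, stride, pad):
    """Output side length of a convolution."""
    return (size + 2 * pad - kernel) // stride + 1


def tconv_out(size, kernel, stride, pad, out_pad=0):
    """Output side length of a transposed convolution."""
    return (size - 1) * stride - 2 * pad + kernel + out_pad


def trace_generator(size=512):
    """Return (layer type, side length, channels) after each assumed layer."""
    trace = []
    # Encoder: (filters, kernel, stride, pad); the geometry is assumed.
    for filters, kernel, stride, pad in [(32, 7, 1, 3), (64, 3, 2, 1), (128, 3, 2, 1)]:
        size = conv_out(size, kernel, stride, pad)
        trace.append(("conv", size, filters))
    # Decoder: two stride-2 transposed convolutions (filter counts assumed).
    for filters in (64, 32):
        size = tconv_out(size, kernel=3, stride=2, pad=1, out_pad=1)
        trace.append(("tconv", size, filters))
    # Output layer: stride-1 transposed convolution producing 3 (RGB) channels.
    size = tconv_out(size, kernel=7, stride=1, pad=3)
    trace.append(("output", size, 3))
    return trace


shapes = trace_generator()
for layer in shapes:
    print(layer)
```

Under these assumptions the representation shrinks from 512 to 128 pixels per side and is restored to 512 by the two transposed convolutions, so the mirrored layers have matching sizes for the skip connections.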
At the summit of this intricate architecture lies the ultimate transformational layer which generates the final translated image.This operation is performed by transposed convolution that endeavors to align the model's internal representation with the actual pixel structure of the image.The Hyperbolic Tangent (Tanh) activation function is used in this final layer to ensure that the pixel values of the generated image are bounded within a scaled range of −1 to 1.The number of output channels in this layer is 3 as the target image after scaling to the interval [0, 255] is RGB.
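The relation between the Tanh output range and RGB pixel values can be written as a simple linear rescaling. This is a minimal sketch of the usual convention; the function names are hypothetical and the exact rounding used in the study is not stated.

```python
import math


def to_uint8(v):
    """Map a Tanh output in [-1, 1] to an integer pixel value in [0, 255]."""
    return int(round((v + 1.0) * 127.5))


def to_tanh_range(p):
    """Map a pixel value in [0, 255] to the [-1, 1] range used by the generator."""
    return p / 127.5 - 1.0


darkest = to_uint8(math.tanh(-20.0))    # Tanh saturates close to -1
brightest = to_uint8(math.tanh(20.0))   # Tanh saturates close to +1
round_trip = to_uint8(to_tanh_range(200))
print(darkest, brightest, round_trip)
```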
The inverse generator reverses the image translation process by processing the 512 × 512 generated image through convolutions and multilayer processing.It employs reverse multi-head attention to recover textual cues from visual features.The textual cues and visual features are fused to recreate the original input representation.The ultimate goal is to minimize the loss between the recovered and original data, ensuring accurate reconstruction of both text and visual cues from the generated image.
For the two discriminators, a Markovian discriminator (PatchGAN) is used to distinguish whether the image patches are real or fake (Isola et al., 2017). It penalizes structure at the scale of image patches, i.e. pieces, in contrast to the regular GAN discriminator, which focuses on the entire image. The result is obtained by running the discriminator convolutionally across the image and averaging all the results obtained on the patches. The only assumption is that pixels separated by more than a patch diameter are independent. The advantage of using PatchGAN is that it reduces the number of model parameters and consequently decreases the computational cost, in addition to being able to handle arbitrary image sizes, in contrast to a full-image discriminator. Because it gives more stable and better training results, the least squares loss function is used rather than the conventional negative log likelihood function. The discriminator consists of six layers with the number of filters set to 32, 64, 128, 256, 512 and 1, respectively. The first five convolutional layers have filter size 4 × 4 and LeakyReLU as activation function to introduce a small positive gradient when a neuron is not active. The last layer ends with a sigmoid function.
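The least squares objective on patch-wise outputs can be sketched as below, where the discriminator output is a small grid of per-patch scores that is averaged. The target values 1 (real) and 0 (fake) and the factor 0.5 follow the common least squares GAN convention and are assumptions here, as are the function names.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least squares loss for D; each argument is a 2D grid of per-patch scores."""
    real_term = mean((s - 1.0) ** 2 for row in real_scores for s in row)
    fake_term = mean(s ** 2 for row in fake_scores for s in row)
    return 0.5 * (real_term + fake_term)


def lsgan_generator_loss(fake_scores):
    """Least squares loss for G: push patch scores of generated images toward 1."""
    return mean((s - 1.0) ** 2 for row in fake_scores for s in row)


# Hypothetical 2 x 2 patch score maps for one real and one generated image.
real = [[1.0, 0.8], [0.9, 1.0]]
fake = [[0.0, 0.2], [0.1, 0.0]]
d_loss = lsgan_discriminator_loss(real, fake)
g_loss = lsgan_generator_loss(fake)
print(d_loss, g_loss)
```

Averaging over the patch grid is what makes the loss independent of the image size, since a larger input simply yields a larger grid of scores.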
Specifically, cycle consistency loss aims to ensure that we get back to the original image if we translate from one domain to another and then back again. On a high level of the architecture, cycle consistency encourages generators to avoid unnecessary changes and thus to generate images that share structural similarity with the inputs. It is also a form of regularization, as it can prevent the generators from collapsing or overfitting. However, in the late stage of the training, it can prevent the generators from yielding realistic images. Thus, we gradually decay the weight of the cycle consistency loss, λ, during the training.
L1 loss is also used to compute the average absolute difference between the predicted and ground truth images at the pixel level. The L1 distance loss on the CNN features extracted by the corresponding discriminator is combined with the pixel-level cycle consistency using a weight factor γ. As the training progresses, γ should start low, as the discriminator features are not of high quality at the beginning, and increase gradually and linearly to a high value close to, but not equal to, 1, because some fraction of pixel-level consistency is required to prevent unwanted objects in the image or modifications of the background.
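The weighting described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the study: the combined cycle loss is assumed to be γ times the feature-level L1 distance plus (1 - γ) times the pixel-level L1 distance, both schedules are assumed to be linear in the epoch, and the start and end values (10 to 1 for λ, 0 to 0.9 for γ) are placeholders.

```python
def linear_schedule(start, end, step, total_steps):
    """Linearly interpolate from `start` (step 0) to `end` (last step)."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return start + frac * (end - start)


def l1(a, b):
    """Mean absolute difference between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_loss(pixels_in, pixels_rec, feats_in, feats_rec, gamma):
    """Assumed combination of feature-level and pixel-level cycle consistency."""
    return gamma * l1(feats_in, feats_rec) + (1.0 - gamma) * l1(pixels_in, pixels_rec)


n_epochs = 11                                       # placeholder value
lam_first = linear_schedule(10.0, 1.0, 0, n_epochs)  # lambda decays
lam_last = linear_schedule(10.0, 1.0, 10, n_epochs)
gamma_mid = linear_schedule(0.0, 0.9, 5, n_epochs)   # gamma increases, stays below 1
loss = cycle_loss([0.0, 1.0], [0.0, 0.5], [1.0, 1.0], [1.0, 0.0], gamma=0.5)
print(lam_first, lam_last, gamma_mid, loss)
```

In a training loop, the value returned by `cycle_loss` would be multiplied by the current λ before being added to the adversarial loss.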

Training the CycleGAN model
A single training iteration of a GAN involves three steps:
(1) The discriminator is shown a batch of real images (Figure 4) and its weights are optimized to classify these images as real images.
(2) Then, a batch of fake images is generated using the generator and shown to the discriminator. The weights of the discriminator are then optimized to classify these images as fake images.
(3) The third step involves training the generator. A batch of fake images is generated and passed to the discriminator, but instead of optimizing the discriminator to classify these images as fake images, we optimize the generator to push the discriminator to classify these fake images as real images.
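The three steps can be made concrete with a deliberately tiny toy model, which is not the network used in the study: a one-parameter generator G(z) = z + b, a linear discriminator D(x) = w·x + c, least squares losses, and hand-written gradients. All names and numbers are hypothetical; the point is only the order of the updates and which parameters each step touches.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def train_step(params, real_batch, noise_batch, lr=0.1):
    """One toy GAN iteration with G(z) = z + b and D(x) = w * x + c."""
    w, c, b = params["w"], params["c"], params["b"]

    # Step 1: update D so that real samples are scored as real (target 1).
    res = [w * x + c - 1.0 for x in real_batch]
    grad_w = mean(2 * r * x for r, x in zip(res, real_batch))
    grad_c = mean(2 * r for r in res)
    w, c = w - lr * grad_w, c - lr * grad_c

    # Step 2: update D so that generated samples are scored as fake (target 0).
    fake = [z + b for z in noise_batch]
    scores = [w * g + c for g in fake]
    grad_w = mean(2 * s * g for s, g in zip(scores, fake))
    grad_c = mean(2 * s for s in scores)
    w, c = w - lr * grad_w, c - lr * grad_c

    # Step 3: update G (only b) so that D scores its samples as real (target 1).
    fake = [z + b for z in noise_batch]
    res = [w * g + c - 1.0 for g in fake]
    grad_b = mean(2 * r * w for r in res)
    b = b - lr * grad_b

    return {"w": w, "c": c, "b": b}


start = {"w": 0.5, "c": 0.0, "b": 0.0}
after = train_step(start, real_batch=[2.0, 4.0], noise_batch=[0.0, 0.0])
print(after)
```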
To optimize our networks, we alternate between one gradient descent step on the discriminator (D) while keeping the generator (G) parameters fixed, and one step on G while keeping the D parameters fixed. During the discriminator step, the discriminator refines its ability to distinguish real from generated samples; during the generator step, the generator's parameters are adjusted to produce more persuasive samples that can better deceive the discriminator. It is necessary to obtain a generator with a clear generation goal in the early training stage and to improve its accuracy and stability in the later stage. During the training process, when the discriminator's loss reaches its minimum and tends to be stable, CycleGAN model training is completed.
In the training phase, we employ minibatch stochastic gradient-based optimization using the Adam optimizer, with a learning rate linearly decaying from 5e-3 to 5e-6 and momentum parameters β1 = 0.5 and β2 = 0.999. We train the models for 106 epochs, evaluating their performance on the validation set after each epoch. For each model, we retain its state from the epoch providing the best validation performance. Figure 6 shows the convergence of the total training losses for the generator and discriminator.
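The learning rate schedule can be sketched as a linear interpolation between the two stated values. Whether the decay is applied per epoch or per iteration is not stated; the sketch assumes per-epoch decay over the 106 epochs, and `linear_lr` is a hypothetical helper name.

```python
def linear_lr(epoch, n_epochs=106, lr_start=5e-3, lr_end=5e-6):
    """Learning rate for a given epoch (0-based), decaying linearly."""
    frac = epoch / (n_epochs - 1)
    return lr_start + frac * (lr_end - lr_start)


schedule = [linear_lr(e) for e in range(106)]
print(schedule[0], schedule[52], schedule[-1])
```

In TensorFlow, such a value could be passed to `tf.keras.optimizers.Adam(learning_rate=..., beta_1=0.5, beta_2=0.999)` at the start of each epoch; how the schedule was wired into the optimizer in the study is not specified.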

Input data for predicting labels using the deep learning models
As input data for the prediction of labels with the deep learning models, we used the city wayfinding map for the red region in Figure 3. The data was created in a similar way as the training data, with the important difference that no labels are present in the input data for the prediction (cf. Figure 1). In total, 2220 raster maps (of size 512 × 512) are used for prediction (1110 for each of the two overlapping data sets).
The numerical data used for the prediction is created in a similar manner as for the training data. It should be noted that none of the numerical input data for the prediction is based on the size and/or location of the manually placed labels.

Prediction of the labels
For the prediction of the labels using CycleGAN and Pix2Pix, we run the two generators (cf. Figure 5) in exactly the same manner as during the training phase. For each input image that includes only the features and the background, an output raster is generated such that the labels are shown as boxes with different colors. As seen in Figure 7, the labels are not perfect rectangles since the models learn and make the prediction in almost a pixel-wise manner.

Post-processing of the labels generated by CycleGAN and Pix2Pix
The output of the deep learning models is a set of map rasters highlighting possible positions of labels. It should be noted here that even though we use numerical input on the size of the labels in the prediction phase, the size of the labels cannot be perfectly predicted by the generative models. Therefore, the size of each label has to be adjusted in the postprocessing according to the length of the label text (which is also illustrated in Figure 1). Furthermore, the postprocessing includes a vectorization step of the labels as well as text string generation. Finally, these text strings are added to the original vector map. In more detail, the postprocessing includes the following steps (cf. Figure 8):
(1) A raster file that only contains the label pixels is created for each predicted file. Then the closing operator is applied to this raster file, i.e., the morphological operator expand followed by shrink (see e.g., Gonzales & Woods, 2018).
(2) All the raster files are merged into a single file that is stored as a georeferenced GeoTIFF file. This is performed utilizing the Python rasterio library.⁷
(3) Vectorization of the GeoTIFF raster file is performed using the GDAL tool integrated in QGIS. This process creates vector polygons for all connected regions of pixels in the raster sharing a common pixel value. Each polygon is created with an attribute indicating the pixel value of that polygon.
(4) Simplification of the vector data: as the obtained vector data has irregular shapes, simplification and bounding are necessary operations to obtain rectangular label shapes. The algorithm by Visvalingam and Whyatt (1993) is used to simplify a curve composed of line segments to a similar curve with fewer points.
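The closing operator in step (1) and the bounding in step (4) can be sketched on a small binary grid. This is an illustration only: it assumes a 3 × 3 structuring element (the size is not stated in the text), works on plain Python lists rather than georeferenced rasters, and uses hypothetical function names.

```python
def _dilate(mask):
    """Expand: a pixel becomes 1 if any pixel in its 3x3 neighbourhood is 1."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = int(any(
                mask[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ))
    return out


def _erode(mask):
    """Shrink: a pixel stays 1 only if its whole 3x3 neighbourhood is 1."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):  # border pixels stay 0 (outside = background)
        for c in range(1, cols - 1):
            out[r][c] = int(all(
                mask[rr][cc]
                for rr in range(r - 1, r + 2)
                for cc in range(c - 1, c + 2)
            ))
    return out


def closing(mask):
    """Morphological closing: expand followed by shrink, on a zero-padded copy."""
    cols = len(mask[0])
    padded = ([[0] * (cols + 2)]
              + [[0] + list(row) + [0] for row in mask]
              + [[0] * (cols + 2)])
    closed = _erode(_dilate(padded))
    return [row[1:-1] for row in closed[1:-1]]


def bounding_box(mask):
    """(row_min, col_min, row_max, col_max) of all 1-pixels, or None if empty."""
    cells = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not cells:
        return None
    rs, cs = [r for r, _ in cells], [c for _, c in cells]
    return (min(rs), min(cs), max(rs), max(cs))
```

Applied to a ring of label pixels with a one-pixel hole, `closing` fills the hole while leaving the outer extent unchanged, which is the effect wanted before vectorization.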
The labels created by the process above are denoted the CycleGAN and Pix2Pix original labels. One shortcoming of these labels is that the label size is not adjusted to the length of the landmark text. To account for that, additional steps are performed in the postprocessing:
(1) All labels that are smaller than a threshold value (set to 150 m²) are removed. This is done both for the road and landmark labels.
(2) A geometrical search is performed to determine which road/landmark the minimum bounding box is related to. This search provides information about which text should be placed and also which font type and size should be used; together this provides information about the size of the final text string.
(3) The text string is placed so that its centroid coincides with the centroid of the bounding box. The landmark labels are placed horizontally, while the road labels are formed to fit the direction and curvature of the road. The latter is implemented by adjusting the text to the center line of the road.
(4) The text string layer is added to the original vector map (Figure 8).
These final steps generate the CycleGAN and Pix2Pix processed labels, as shown in Figures 11-13.
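The size adjustment for horizontal landmark labels can be sketched as follows. The 150 m² threshold is taken from the text; everything else is an assumption made for illustration: labels are axis-aligned boxes `(xmin, ymin, xmax, ymax)` in meters, the label text is already known (the geometrical search of step 2 is skipped), and the text width is estimated with a made-up character width ratio.

```python
MIN_AREA_M2 = 150.0  # threshold stated in the text


def box_area(box):
    xmin, ymin, xmax, ymax = box
    return max(xmax - xmin, 0.0) * max(ymax - ymin, 0.0)


def text_box_size(text, font_height_m, char_width_ratio=0.6):
    """Rough (width, height) of a one-line string; the ratio is a hypothetical constant."""
    return len(text) * font_height_m * char_width_ratio, font_height_m


def resize_to_text(box, text, font_height_m):
    """Box of the text's size whose centroid coincides with the centroid of `box`."""
    xmin, ymin, xmax, ymax = box
    cx, cy = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    w, h = text_box_size(text, font_height_m)
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


def process_labels(predicted, font_height_m=5.0):
    """predicted: list of (box, text). Drops boxes below the threshold, resizes the rest."""
    return [
        (resize_to_text(box, text, font_height_m), text)
        for box, text in predicted
        if box_area(box) >= MIN_AREA_M2
    ]
```

Because the resized box is only centered on the predicted box, nothing in this step checks for overlap with other map objects.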

D: performing map labeling using an optimisation method in a map labeling tool
The map labels created by an optimization method are used for comparison in the evaluation step (cf. Figure 1).
The labels are placed using the open source GIS platform QGIS.⁸ Some years ago, QGIS integrated the open source map labeling library PAL.⁹ This library contains algorithms that use a combinatorial optimization method that includes two steps (Ertz et al., 2009). In the first step, several candidate positions are created for each point, line, and polygon label. In the next step, the optimization is performed. The placement cost is computed for each candidate label position based on: (1) label placement in relation to the feature (association); (2) possible overlap with another map feature, or the distance to another map feature (map readability); and (3) label suppression. For more details on PAL see Ertz et al. (2009). QGIS utilizes most of the functionality of PAL and has also extended the functionality substantially (QGIS Development Team, 2022).
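The two-step idea of generating candidates and then choosing among them by cost can be illustrated with a much simpler stand-in. The sketch below is not PAL's algorithm: it swaps the combinatorial optimization for a greedy pass, handles point features only, and uses label-label overlap as the only cost term. The four candidate positions, the preference order and all names are hypothetical.

```python
def overlap_area(a, b):
    """Overlap area of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def candidates(point, width, height):
    """Four candidate boxes: upper-right, upper-left, lower-right, lower-left."""
    x, y = point
    return [
        (x, y, x + width, y + height),
        (x - width, y, x, y + height),
        (x, y - height, x + width, y),
        (x - width, y - height, x, y),
    ]


def place_labels(features, max_cost=0.0):
    """features: list of (point, width, height). Returns one box or None per feature."""
    placed, result = [], []
    for point, width, height in features:
        scored = []
        for rank, box in enumerate(candidates(point, width, height)):
            cost = sum(overlap_area(box, other) for other in placed)
            scored.append((cost, rank, box))  # rank breaks ties by preference order
        cost, _, best = min(scored)
        if cost > max_cost:
            result.append(None)  # label suppression: no acceptable candidate
        else:
            placed.append(best)
            result.append(best)
    return result
```

A greedy pass depends on the order of the features, which is one reason why real labeling engines search over combinations of candidates instead.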
In the study, we generated labels for landmarks and roads for the city wayfinding maps in QGIS. The size, font type, type case etc. of the letters followed the rules provided in the design standard for city wayfinding maps produced by Transport for London. The size is set in map units so that the labeling is not affected by the zoom level in the practical evaluation of the label positions. For the landmarks, the text orientation is set to Horizontal and the text is wrapped to 12 characters. No callouts (i.e., lines between the feature and the label) are used, and the placement mode option is set to Around Centroid (where the centroid in this case is forced to be inside the landmark). For the road labels, the text mode option is set to Parallel with the allowed position On line. Connected line segments with the same road name are merged, and the repeating label parameter is set to 500 map units (in this case meters). It should be noted that QGIS allows a variety of options for label placement. For example, for polygonal objects (such as landmark objects) the labels can be set to allow non-horizontal labels (which makes it easier to fit the labels into objects and hence increases association), to have a certain distance from the centroid, and even to be outside the polygon (which might be good for map readability, but not for association). In our study, we set the labeling parameters in QGIS in such a way that the labels follow the rules for city wayfinding maps by Transport for London and T-Kartor (e.g., only using horizontal labels). We do not claim that these are the optimal parameters for good label placement (they are certainly not optimal for maximizing association since only horizontal labels are allowed), but the selected QGIS label parameters enable a comparison of the QGIS labels with the manually placed labels, and indirectly also with the labels placed by deep learning, since the latter are predicted based on examples of the manually placed labels.

Design and implement the evaluation metrics
In this study, we use metrics for legibility, association, and map readability for landmark labels. The legibility metric also works for the road labels, while the other two need to be adjusted to be applicable; however, since the evaluation below concentrates on the landmark labels, we have not designed any specific road label metrics. The metrics, including the notation, are inspired by Van Dijk et al. (2002), but the details of the definitions of the association and map readability metrics are our own.
Legibility. The legibility of landmark label $l_{lm,i}$ ($\mathrm{Legibility}_{lm,i}$) is defined as the degree of overlap between this label and all other labels (irrespective of type of label). In the defining equation, area(G) denotes the area of geometry G and L the set of all label geometries.
Association. To create a good association, the landmark label should substantially overlap the landmark feature it refers to but ideally no other landmarks. Therefore, we have used relative overlaps with landmarks in our definition of the landmark association metric ($\mathrm{Association}_{lm,i}$, cf. Figure 9). The definition of the association for the landmark label of feature $f_{lm,i}$ uses the following notation:
$F^{+}_{lm}$: all landmark geometries associated with a label
$F^{-}_{lm}$: all landmark geometries with no associated label
$\langle l_{lm,i}, f_{lm,i} \rangle$: the label $l_{lm,i}$ is associated with the feature $f_{lm,i}$
CH(G): convex hull of geometry G.
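To make the overlap-based idea of the legibility metric concrete, the sketch below computes one plausible form of it: one minus the share of the label area that is covered by other labels. This exact form is an assumption (chosen so that a label without overlaps scores one), not necessarily the definition used in the study, and labels are simplified to axis-aligned boxes `(xmin, ymin, xmax, ymax)` instead of general geometries. Function names are hypothetical.

```python
def clip(box, window):
    """Intersection of two axis-aligned boxes, or None if they do not overlap."""
    xmin, ymin = max(box[0], window[0]), max(box[1], window[1])
    xmax, ymax = min(box[2], window[2]), min(box[3], window[3])
    return (xmin, ymin, xmax, ymax) if xmin < xmax and ymin < ymax else None


def union_area(boxes):
    """Area of the union of axis-aligned boxes via coordinate compression."""
    xs = sorted({x for b in boxes for x in (b[0], b[2])})
    ys = sorted({y for b in boxes for y in (b[1], b[3])})
    total = 0.0
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            if any(b[0] <= x0 and x1 <= b[2] and b[1] <= y0 and y1 <= b[3]
                   for b in boxes):
                total += (x1 - x0) * (y1 - y0)
    return total


def legibility(label, others):
    """Assumed form: 1 - (area of label covered by other labels) / (label area)."""
    area = (label[2] - label[0]) * (label[3] - label[1])
    overlaps = [c for c in (clip(o, label) for o in others) if c is not None]
    return 1.0 - union_area(overlaps) / area
```

Taking the union of the overlaps, rather than their sum, avoids counting a region twice when several labels cover the same part of the label.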

Map readability.
Ideally, the label should be placed in a completely homogeneous area (e.g., completely within the feature it is associated with); if it overlaps other features, the map readability will decrease. In particular, if break points are overlapped, the interpretation of maps (and graphics in general) is impaired; see arguments and examples in Biederman (1985) and Harrie et al. (2004). Our definition of the map readability is expressed in terms of the following quantities: bp is the number of break points overlapped by the label $l_{lm,i}$; tl is the total length (in map units) of the line segments overlapped by the label $l_{lm,i}$; and $w_1$, $w_2$ are weight factors.
In our study we used the weight factors $w_1 = 0.01$ and $w_2 = 0.0005$ (where map units are in m).
It should be noted that geodata often contain break points where very little or no direction change of the line occurs. If the label overlaps these "unnecessary" breakpoints, the map readability is not severely affected. Therefore, these breakpoints should be removed before the map readability metric is computed. In our study, this is implemented by executing the line simplification algorithm by Douglas and Peucker (1973) with a threshold value of one meter.
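The removal of "unnecessary" breakpoints can be sketched with a compact recursive version of the Douglas-Peucker algorithm. The one-meter tolerance follows the text; the code itself is a generic illustration with hypothetical names, and it measures the distance to the base segment (a common variant) rather than reproducing the implementation used in the study.

```python
import math


def _point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance=1.0):
    """Simplify a polyline; points closer than `tolerance` to the base line are dropped."""
    if len(points) < 3:
        return list(points)
    dists = [_point_segment_distance(p, points[0], points[-1]) for p in points[1:-1]]
    idx = max(range(len(dists)), key=dists.__getitem__)
    if dists[idx] <= tolerance:
        return [points[0], points[-1]]
    split = idx + 1  # index of the farthest point in `points`
    left = douglas_peucker(points[:split + 1], tolerance)
    right = douglas_peucker(points[split:], tolerance)
    return left[:-1] + right  # avoid duplicating the split point
```

For a road that runs almost straight and then turns by 90 degrees, the nearly collinear vertex is removed while the corner is kept, so only the corner would count as a break point.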
Implementation of the metrics. The metrics in Equations (2-4) are implemented using the QGIS Python API, with spatial indexing to speed up the computations. The evaluation metric code is available under a BSD license and distributed on GitHub.¹⁰
Example of metric results. Figure 10 shows manually placed labels, and the metric values for those labels are given in Table 1. As shown in Figure 10, there are small overlaps between the landmark labels and some road labels, which is reflected in the legibility factors (which are somewhat less than one). The overlap between the top label and the landmark is greater than for the bottom label, which provides a slightly higher association factor.

Procedure for the visual evaluation
The visual evaluation is conducted by three cartographic experts: a cartographer at T-Kartor with long experience in labeling city wayfinding maps, a cartographer at the Swedish national mapping agency, and one person who teaches cartography at university level.
The experts were all provided with the maps in Figures 11-13 for the evaluation. They knew that we were working on a project concerning machine learning for label placement, but they did not know which methods were used for labeling the maps; that is, they did not know that one map was generated by QGIS-PAL and the other two by deep learning models. The maps were simply named MethodA_1 (labeled by "method A" for area 1), etc. The maps were sent to them as raster files together with the following questions: (1) Is the selection of road labels appropriate? Is the selection of landmark labels appropriate? (2) Is the placement of road labels appropriate? Is the placement of landmark labels appropriate? (3) Do you see any general bias in the positions of the labels (or other property) that decreases the overall quality of the map labelling?
Please provide some text about this, where you provide examples of good and bad placements in the map for the methods.
The experts responded with notes, and two of them included several map examples. These notes and examples were then summarized by the authors into the text provided in the result section. This summary text has been read and approved by the experts so that it reflects their view.

Quantitative result of the selection of labels
Table 2 shows the number of labels of each type. For the labels created by CycleGAN and Pix2Pix, both the original number (generated by the deep learning network) and the processed number of labels are specified. The CycleGAN model tends to label most road features and repeats the labels along the road, giving a total of 17,442 labels, not far from the number of manual labels. However, some of the obtained labels are very small and cannot fit the whole label text and are thus discarded in the post-processing. Both deep learning models tend to label most of the landmark features, with CycleGAN slightly ahead by initially labeling 4351 landmarks. The optimization method in QGIS creates labels for almost every landmark in the test area, in total a bit more than 6000 labels. For the road labels, QGIS generates one geometry for each letter in the label text, and it is therefore hard to count the actual number of labels. By visual inspection, we can state that the number of road labels is similar to the number of manual labels.
The number of landmarks labeled both manually and by CycleGAN (processed) is 2543, and the number labeled both manually and by Pix2Pix (processed) is 2494.

Quantitative result of the size of the labels
The performance of the DL models can be evaluated by comparing the average lengths of the manually placed labels and the labels generated by the DL models. On average, the manual labels have a length of 159.26 units, while the labels produced by the models, as shown in Table 3, have an average length of 120.91 and 123.62 for CycleGAN and Pix2Pix, respectively. This indicates that the models are underestimating the label sizes, as the predicted lengths tend to be shorter than the original lengths.
To measure the discrepancy between the predicted and original labels, we computed the mean absolute error (MAE):
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{s}_i - s_i \right|$$
where $s_i$ is the length of the $i$-th manually placed label, $\hat{s}_i$ is the corresponding predicted length, and $n$ is the number of compared labels, as well as the standard deviation of the error (MAE Std). The result, provided in Table 3, indicates that the models are not capable of perfectly estimating the size of the labels on an individual level.
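A minimal sketch of this computation is given below. It assumes that "MAE Std" refers to the (population) standard deviation of the absolute errors, which is one possible reading of the text; the function name and the sample values are hypothetical.

```python
import statistics


def mae_and_std(predicted, reference):
    """Mean absolute error and standard deviation of the absolute errors."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - r) for p, r in zip(predicted, reference)]
    return statistics.fmean(errors), statistics.pstdev(errors)


# Hypothetical label lengths (map units) for three matched labels.
mae, std = mae_and_std([100.0, 120.0, 150.0], [110.0, 150.0, 150.0])
```

A large standard deviation relative to the MAE would indicate that the size error varies strongly between individual labels.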

Quantitative result of the position of the landmark labels
Table 4 provides the quantitative result of the landmark label positions. The statistics are based on the landmarks that have a label for each of the four methods (cf. Table 2), which in total amounts to 2379 landmark buildings. For the readability evaluation, we also report the average number of points and the average line length hidden by the labels.

Visual evaluation by cartographic experts
The three cartographers provided several examples of good and bad labeling for the three labeling methods in Figures 11-13. A summary of the examples is provided below for the two deep learning labeling methods.

CycleGAN
The cartographers state that, in general, CycleGAN was more successful in selecting road labels than Pix2Pix and also that the positions of the labels are somewhat better (even though there are a few examples of missing labels, see e.g., Figure 14a). But there are still things that could be improved, e.g., labels should not start near road intersections (Figure 14b), and there are some unnecessary duplicates of labels, e.g., Bloomsbury Way in Figure 14c. There are also some labels that do not follow the curvature of the road (Figure 14g), which indicates a problem with the postprocessing of the labels.
Most landmark buildings do have labels, but there are examples of relatively large landmark buildings that lack a label (Figure 14b & 14h). Many landmark labels are well placed (e.g., Figure 14b), but there are several solutions that are not good. For example, the label in Figure 14d (as well as the southernmost landmark label in Figure 14h) should overlap the building more (to get a better association), and the landmark labels should not overlap any road object (as they do in Figure 14e & 14f).

Pix2Pix
In general, Pix2Pix does not provide enough road labels (Figure 15a). The placement of the road labels should be more centralized in many cases, and there are even some examples where the road labels extend outside the road (Figure 15b and Figure 15e; Denmark street). There are also some examples where Pix2Pix failed to find the best option to place a street label (see the placement of Soho street in Figure 15e, which would preferably have been placed on the street just south of the square).
Some landmark labels are placed nicely in/near the buildings (Figure 15a), but there are also examples where the labels overlap the roads even if there is other space to place the labels (Figure 15c & 15d). In some cases, the landmark labels fail to be placed inside the building even if there is space (see the St Giles hotel label in Figure 15a). There are also several important landmark buildings that lack labels in all three example areas (see e.g. Figure 15a).

Additional notes
The placement of both landmark and road labels is, according to some of the cartographers, somewhat better for CycleGAN than for Pix2Pix. But the variability in quality between the different parts of the maps is large.
Our aim here is not to evaluate the capability of the QGIS-PAL labeling algorithm, and therefore we do not provide any specific examples here. It should also be noted that the result is highly dependent on the parameter setting in QGIS-PAL. In short, one of the cartographers explicitly stated that the labels produced by QGIS-PAL were preferable, and one in general preferred CycleGAN. The third cartographer did not explicitly state which method is preferable. They all provided several examples where QGIS-PAL was better, but also some examples where the DL models provided better labels.

Selection of the landmark labels
The total number of landmarks existing in the test area is 6490, of which 4631 have labels placed manually. A challenge for all automated methods is to decide which landmarks are most important and should have a label. Several optimization methods, including the one in QGIS, simply add labels to all landmarks (if there is space), but this is not always wanted (as indicated by the fact that the cartographers who create labels manually chose not to label all landmarks). In our study, we used numerical input data and raster maps for training the model on which labels to include. But from this information, it is not completely possible to decide which labels a cartographer thinks are important.

Size of the landmark labels
The acquired MAE values, reported in Table 3, imply that the DL models do not accurately capture precise label sizes and that, in general, the sizes are underestimated. The notable absolute difference emphasizes the necessity for deeper exploration and potential enhancement of the text features provided to the models, as well as of methodologies for extracting contextual label information pertaining to the map background. Additionally, there is room for improvement in refining the model architecture and fine-tuning the hyperparameters. Based on our experiments and literature review, we can assert that the generative models could benefit from further improvement to effectively tackle well-defined controlled tasks involving specific spatial transformations, for instance, tasks like precisely drawing specific objects in specific locations under specific constraints. On the flip side, models based on Spatial Transformer Networks (STN) may excel at this, but they lack the innovation and ability to learn unrestricted transformations.

Position of the landmark labels
Table 4 reveals that the deep learning models provide somewhat lower values for legibility than the optimization method in QGIS-PAL. One reason for this might be the limitation that the DL models were not able to predict the size of the labels appropriately and therefore have little control over the overlap of the final labels. That the legibility is higher for the DL models than for the manual labels does not necessarily entail that the DL models (nor the optimization method) are superior in this context. The legibility measure (as well as most automated methods in map labeling) treats the labels as (bounding) boxes in the computations. However, overlapping bounding boxes do not always imply bad legibility (nor readability), since the actual text could be easy to read despite the bounding box overlap. This is a general limitation in most (all?) automated methods for map labeling and measures of legibility.
The DL models provide label positions with association values similar to those of the optimization method in QGIS-PAL, but somewhat lower than for the manual labels. Nevertheless, it seems that the DL models are capable of identifying label locations that are well associated with the landmark buildings in many cases. But, as also identified by the cartographers (cf. Figures 14 and 15), the labels are not optimally positioned.
When it comes to map readability, the optimization method in QGIS-PAL outperforms the DL models. The shortcomings of the DL models might, again, be linked to the limitation that the sizes of the labels are not well modeled by the DL models. This entails that when the labels are extended in the postprocessing, there is no control of overlap with other objects. This is likely also the reason why, e.g., road labels extend outside the roads in some cases (Figures 14c and 15b), as identified by the cartographers. It is interesting to note that the optimization method provides labels with higher values than the manually placed labels. A reason for this could potentially be that the manual labels were placed together with, e.g., icons (not used in this study) that sometimes affect the position of the manual labels. Another reason could be that the cartographer could (correctly or not) deliberately accept an overlap of the landmark label with other features.
The result reported in Table 4 and in this section is dependent on the quality of the evaluation metrics. Defining map labeling quality metrics is difficult, partly because there is no consensus about what a perfect labeling is, and hence there is no consensus about the quality metrics. The legibility metric used in this study is straightforward, under the assumption that legibility is defined as the avoidance of overlapping labels (represented as bounding boxes). For the association, we chose to use the amount of overlap with the referring object (in this case a landmark) and with other objects of the same type as the basis. The map readability metric is based on overlapped points and line segments, i.e., how much geometry the label obscures. Our experience is that these three metrics together provide a good indication of the quality of the labels on an aggregated level, but that one can find individual examples where they are not perfect. One weakness of the metrics is that they are dependent on the size of the labels; another shortcoming is that the aesthetic component is not regarded (e.g., the visual interplay between the label and other labels and features, for example alignments).

Limitation of the study and direction of future research
In this study we included road labels, mainly for studying the interaction between landmark and road labels. Therefore, we have not performed an analysis of the quality of the road label placement, which is a lengthy topic in itself (e.g., new metrics have to be developed).
GANs are effective with image and image-like data. Even if they learn spatial features in the input, they are just learning visual features and near-pixel level information. However, labeling in cartography implies georeferencing and vector data, which are richer. So, learning labeling from images only may lead to a loss of information. Thus, adding relevant vector data may improve the learning and help to obtain the accurate size of the labels. This has been utilized in this study, but more work is necessary to better utilize external information (attribute and vector information) and not rely too much on the raster input data. In addition, using a customized loss function that includes some cartographic objectives is a possible way to improve the performance of the model.
The results from CycleGAN and Pix2Pix are rather similar in terms of cartographic quality, even though some of the cartographers slightly preferred CycleGAN in the visual assessment. In general, CycleGAN creates more labels than Pix2Pix, with a significant difference in road labels. However, the training of Pix2Pix requires less time even if it needs more epochs to converge. Our conclusion from this is that the main issue here is not which DL technique is chosen, but how we model the labeling process and what additional data to learn from. Moreover, how to manipulate latent vectors in a controlled manner to effectively enforce certain constraints or transformations using generative models remains an ongoing research challenge.

Conclusions
In this study, we investigated the feasibility of generative models in the automation of map labeling. To realize that, we designed, implemented, and executed the generative deep learning models CycleGAN and Pix2Pix. A quantitative evaluation was performed by designing and implementing metrics for legibility, readability, and association. This evaluation indicates that the deep learning models do not provide good enough results for selecting which labels to include in the map or for determining the size of the labels. When it comes to the position of the labels, the deep learning models achieve comparable results in terms of legibility, but manually placed labels are superior in terms of association, and labels placed by the optimization routine provide better map readability scores. Furthermore, a visual assessment was carried out by three professional cartographers. This assessment indicates that the deep learning models provide good solutions for some map situations, but that there are problems, foremost with the selection of labels and sometimes also with the association.
There are several challenges for the proposed approach of using generative deep learning models, such as deciding which features are to be labeled and predicting the precise size of the label bounding boxes that fit the actual text strings. In the study, we used additional input data (besides the raster maps) to the model to facilitate the learning of text attributes, especially targeting the size of the labels. This part needs to be improved to enable the deep learning models to be used in a production environment.

Figure 1. Workflow of the study.

Figure 2. Snapshot of the city wayfinding maps of London which are used as test data. © Copyright Transport for London.

Figure 3. Overview map of the training and prediction data in the greater London area (around 1600 km²). The prediction area is marked with a red polygon; data from the remaining areas is used for training the deep learning models. The city wayfinding map includes around 18,000 landmark labels and 96,000 road labels, all placed manually. © Copyright Transport for London.

Figure 4. Example of input data for training the deep learning models (before split into channels). The landmark labels (yellow) and road labels (red) are split along the boundaries of neighboring raster maps.

Figure 5. Architecture of the used CycleGAN model, with two inputs representing the map raster and the text (label) features.

Figure 6. Training losses for generator and discriminator of CycleGAN.

Figure 7. Output of the CycleGAN model; the yellow boxes designate the landmark labels and the red boxes designate the road labels.

Figure 8. Illustration of the postprocessing steps. a) the output from the generative deep learning models; b) after the closing operator and extraction of landmark labels; c) after vectorization and simplification of the labels; d) text strings added to the original (unlabeled) vector map.

Figure 9. Computations of the association metric for a landmark label.

Figure 10. Examples of bounding boxes for manually placed labels: landmark labels in blue and road labels in red. Note that the landmark building has two separate parts, with one label for each part.

Figure 14. Examples of label placement performed by CycleGAN.

Figure 15. Examples of label placement performed by Pix2Pix.

Table 1. Evaluation metrics for the samples depicted in Figure 10.

Table 2. Number of labels created by each method. It should be noted that for the manually created maps some landmarks have icons and not text labels. These icons are not included in this study.

Table 3. Mean absolute error of the label size predicted by deep learning models and its standard deviation.

Table 4. Statistics for the four labeling methods. The legibility, association and map readability metrics are defined in Equations (2-4) and definitions of the last two columns are given in Equation 10.
The optimization tool in QGIS labels a major part of the landmarks in the test area, exceeding the manual method, which labeled about 71% of the landmarks (the landmarks labeled with icons not included). Concerning the deep learning (DL) models, Pix2Pix originally labeled about 65.7% of the landmarks and CycleGAN about 67%; after the postprocessing, Pix2Pix labeled 40.7% and CycleGAN 41.8%. We could have used a smaller threshold in the postprocessing to include more labels from the DL models. In general, if we consider the fact that the DL models are trained on the manually labeled data, which covers only 71.3% of the existing landmarks, we could say that CycleGAN achieves 58.6% and Pix2Pix achieves 57.1% in terms of label selection. Regarding this particular aspect, the cartographers, in the visual evaluation, pointed out that the DL methods sometimes fail to add labels to landmark buildings (e.g., Figures 14b & 14h and 15a & 15b).
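The relative selection rates quoted above can be reproduced by dividing the share of landmarks labeled by each DL model (after postprocessing) by the share labeled manually. The short sketch below uses only percentages and counts stated in the text; the function names are hypothetical.

```python
def share(count, total):
    """Percentage of `total` represented by `count`."""
    return 100.0 * count / total


def relative_selection(processed_share, manual_share):
    """DL-labeled share relative to the manually labeled share (both in percent)."""
    return 100.0 * processed_share / manual_share


cyclegan_original = share(4351, 6490)             # landmarks initially labeled by CycleGAN
cyclegan_rate = relative_selection(41.8, 71.3)    # CycleGAN, processed labels
pix2pix_rate = relative_selection(40.7, 71.3)     # Pix2Pix, processed labels
```

Rounded to one decimal, these ratios correspond to the 58.6% and 57.1% reported for label selection.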