Learning Bilingual Word Embedding Mappings with Similar Words in Related Languages Using GAN

ABSTRACT Cross-lingual word embeddings represent words from different languages in the same vector space. They support reasoning about semantics and comparing word meanings across languages and in multilingual contexts, which is necessary for bilingual lexicon induction, machine translation, and cross-lingual information retrieval. This paper proposes an efficient approach to learning a bilingual transformation mapping between the monolingual word embeddings of language pairs. We choose ten languages from three language families and download their latest Wikipedia dumps (https://dumps.wikimedia.org). Then, after some pre-processing steps, we use word2vec to produce word embeddings for them. We select seven language pairs from the chosen languages. Since the selected languages are related, they share thousands of identically spelled words with similar meanings. With these identically spelled words and the word embedding model of each language, we create training, validation, and test sets for the language pairs. We then use a generative adversarial network (GAN) to learn the transformation mapping between the word embeddings of the source and target languages. The average accuracy of our proposed method over all language pairs is 71.34%. The highest accuracy, 78.32%, is achieved for the Turkish-Azerbaijani language pair, which is noticeably higher than prior methods.


Introduction
Nowadays, there are several ways to represent the words of a language. One broadly used representation is word embeddings, which connect the human understanding of language to a machine and are crucial to solving many natural language processing (NLP) problems. Word embedding is a common method of learning word representations in which words with close meanings have close representations (Mikolov et al. 2013b; Pennington, Socher, and Manning 2014). Some traditional methods, such as one-hot encoding and bag of words, help with some machine learning (ML) tasks, but they are unordered. Prior works have focused on word embeddings trained independently for each language on monolingual corpora. They learn a linear transformation that maps the embeddings from the source language to the target language, using a small or medium-sized set of lexical matches as a bilingual seed dictionary (Artetxe, Labaka, and Agirre 2016). The ability to represent lexical items of two different languages in a shared cross-lingual space advances NLP research. Word-level connections between languages are used in transferred statistical parsing (Ammar et al. 2016; Zeman et al. 2018) or language understanding systems (Mrkšić et al. 2017), and later work uses only a tiny seed bilingual dictionary (Artetxe, Labaka, and Agirre 2016; Kondrak, Hauer, and Nicolai 2017). However, these methods do not achieve satisfactory accuracy and need more labeled data to obtain better results.
To address lexical matching between language pairs, we use words that are spelled identically in both languages (exact dictation words) in our experiments, which increases accuracy without labeled data. Notably, if the source and target languages are related or come from a common language family, they have considerable mutual intelligibility. By applying some pre-processing steps, we increase the lexical matching between the paired languages. Since the number of identically spelled words is significant, we use them to train a neural network that finds a nonlinear mapping between the word vectors of the languages.
This paper presents a new approach to learning bilingual word embedding mappings between related languages. First, we use the Wikipedia XML dump of each language as the text source and extract its tokens. Next, we use the Word2vec model to produce word embeddings. Then, we collect the identically spelled words of each language pair. Finally, we train our model on the words obtained in the previous step to find the mapping between the word embeddings. The contributions of this paper are:
• We improve the bilingual word embedding mapping method between languages.
• We find nonlinear transformation mappings, especially for low-resource and related languages.
• The proposed model builds on recent research that combines neural machine translation encoder-decoder and GAN models.
• The proposed model augments a 4-layer BLSTM encoder-decoder with an attention mechanism, taking context into the model to learn bilingual word mappings and complete bilingual word embeddings.
• A convolutional neural network implements the discriminator of the proposed model, which distinguishes real target vectors.
• We design a set of experiments on seven language pairs. Our experimental results demonstrate a significant advantage of learning word mappings in related languages.
The structure of our paper is illustrated in Figure 1. The rest of the paper proceeds as follows. Section 2 presents some essential points and the evolution of cross-lingual word embedding models. Section 3 details the data collection method and the experimental setup. Section 4 describes the implemented system. Section 5 presents our experimental results. Finally, Section 6 concludes the paper.

The Evolution of the Field
Most cross-lingual word embedding models are created and extended using monolingual word embedding models (Vulić and Moens 2015). First, the model learns word embedding vectors for the words of each language from a large monolingual corpus. Then, it learns a mapping from the source language word embeddings to the target language word embeddings. In the next section, we briefly review the monolingual word embedding models.

Word Embedding Models
Word2vec (Mikolov et al. 2013b) is a shallow, two-layer neural network that produces word embeddings for a language. It receives a massive corpus of text documents as input and creates a vector space in which each word in the corpus corresponds to a vector. Words that are semantically close in the corpus have close vectors in the space. Word2vec is implemented in two architectures: Skip-gram and continuous bag-of-words (CBOW). Skip-gram with negative sampling (Mikolov et al. 2013b) is a popular model due to its robustness and training performance (Levy, Goldberg, and Dagan 2015). It produces a language model by focusing on learning effective representations instead of modeling word probabilities accurately. Given a source word, it provides word vectors that are good at predicting the surrounding context words. The model minimizes the following skip-gram objective on the training data:
$$-\frac{1}{N} \sum_{t=1}^{N} \sum_{-C \le j \le C,\; j \ne 0} \log P(w_{t+j} \mid w_t)$$
where N is the number of words in the training corpus and C is the size of the context window. The reverse of Skip-gram is CBOW, which tries to predict a source word from its surrounding words. CBOW minimizes the following objective on the training data:
$$-\frac{1}{N} \sum_{t=1}^{N} \log P(w_t \mid w_{t-C}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+C})$$
The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors (Mikolov et al. 2013a), CBOW is faster, while skip-gram is slower but better for infrequent words. Both models are shown in Figure 2. Global Vectors for Word Representation (GloVe) (Pennington, Socher, and Manning 2014) extends the Word2vec method and learns word vectors efficiently. Word2vec and GloVe serve the same purpose and perform similarly in NLP tasks. The notable difference is the way they are built: Word2vec builds word embeddings using a predictive model, while GloVe is a count-based model. GloVe builds a co-occurrence matrix by counting how frequently a word appears in a context.
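To make the difference between the two Word2vec training setups concrete, the following minimal sketch enumerates the training examples each one would see for a toy sentence. The function names and the toy sentence are illustrative assumptions, and the sketch only builds the (input, target) examples; it does not train a model or apply negative sampling.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (center, context) pairs: the center word predicts each context word."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """Return (context_words, center) examples: the context jointly predicts the center."""
    examples = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != t]
        examples.append((context, center))
    return examples


# Hypothetical toy sentence with a context window of C = 1.
sentence = ["the", "cat", "sat", "down"]
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
```

Skip-gram produces one example per (center, context) pair, whereas CBOW produces one example per position, which is one intuition for why CBOW tends to train faster.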
FastText (Bojanowski et al. 2017), another extension of the Word2vec model, treats each word as a composition of character n-grams rather than as a single token. For example, from the separate representations of "school" and "house," we can build a representation for "schoolhouse," which would otherwise appear too infrequently to learn dictionary-level embeddings. This difference enables FastText to generate better word embeddings for rare and out-of-vocabulary words. Neither GloVe nor Word2vec can generate good word embeddings for rare words.
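The character n-gram decomposition can be illustrated with a short sketch. It assumes the word-boundary markers "<" and ">" and n-gram lengths from 3 to 6 described in the FastText paper; the function name is hypothetical, and the hashing of n-grams and the extra whole-word token used by the actual library are left out.

```python
from typing import List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """Return the character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


# n-grams shared between a rare compound and its more frequent parts
compound = set(char_ngrams("schoolhouse"))
shared_with_school = compound & set(char_ngrams("school"))
shared_with_house = compound & set(char_ngrams("house"))
```

Because a word vector is composed from the vectors of its n-grams, the overlap in `shared_with_school` and `shared_with_house` is what lets a rare word borrow information from frequent ones.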

Cross-Lingual Mapping-based Approaches
Mapping-based approaches try to learn a mapping between the monolingual representations of two languages. These approaches first train monolingual word embeddings on massive monolingual corpora. They then learn a transformation matrix between the monolingual representations of the different languages in order to map unknown words between the languages. They frequently use a list of word pairs between the source and the target languages that are translations of each other. Ruder, Vulić, and Søgaard (2019) distinguish four types of mapping-based word embedding approaches: regression methods, orthogonal methods, canonical methods, and margin methods. Regression methods are the most powerful methods for learning a linear transformation between the word embeddings of the source and target languages by maximizing their similarity. Mikolov, Le, and Sutskever (2013) noted that words and their translations have similar geometric relations in monolingual word embeddings if a suitable linear transformation is applied.
e2019885-1546 G. ALIPOUR ET AL.
Figure 2. The architecture of CBOW and Skip-gram as described in (Mikolov et al. 2013b).
Orthogonal methods apply orthogonality constraints to the transformation mapping matrix, which improves the performance of regression methods. Under this assumption, the transformation matrix $W$ is orthogonal ($W^{\top} W = I$). The solution is obtained from the singular value decomposition of $Y X^{\top}$.
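The step from the orthogonality constraint to the SVD solution can be spelled out as follows. This is a sketch under the assumption that $X$ and $Y$ hold the aligned source and target vectors as columns, so that the mapped source vectors are $WX$; the source does not state this layout explicitly.

```latex
\begin{aligned}
W^{*} &= \arg\min_{W^{\top}W = I} \lVert WX - Y \rVert_F^{2}
       = \arg\max_{W^{\top}W = I} \operatorname{tr}\!\left(W^{\top} Y X^{\top}\right), \\
Y X^{\top} &= U \Sigma V^{\top}
  \quad\Longrightarrow\quad
  \operatorname{tr}\!\left(W^{\top} U \Sigma V^{\top}\right)
  = \operatorname{tr}\!\left(V^{\top} W^{\top} U \,\Sigma\right) \le \operatorname{tr}(\Sigma), \\
W^{*} &= U V^{\top}.
\end{aligned}
```

The first equality holds because $\lVert WX \rVert_F = \lVert X \rVert_F$ for orthogonal $W$, so only the cross term depends on $W$. The bound is attained when $V^{\top} W^{\top} U = I$, which gives $W^{*} = U V^{\top}$.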
Canonical methods map the word embeddings of both languages to a new shared space using canonical correlation analysis (CCA), which maximizes their similarity. Faruqui and Dyer (2014) use CCA to map words from two languages into a shared embedding space. Margin methods map the source language's word embeddings so as to maximize the margin between correct translations and other candidates. Lazaridou, Dinu, and Baroni (2015) propose another objective for the linear transformation. They observe that using least squares as the objective for learning a projection matrix leads to hubness. To find the correct translation vector $y_i$ of a source word $x_i$, they train the model with a margin-based (max-margin) ranking loss (Collobert and Weston 2008). Jinsong Su et al. use graph-based semantic information to learn bilingual word embeddings (Jinsong et al. 2018a).
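As a rough illustration of the margin-based ranking idea, the sketch below computes a max-margin loss over already-mapped source vectors. It assumes the commonly cited form with cosine similarity, in which each wrong candidate contributes max(0, margin − cos(mapped, correct) + cos(mapped, wrong)). The function names and the margin value are illustrative, and every other target is used as a negative rather than a sampled subset.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_margin_loss(mapped: List[Sequence[float]],
                    targets: List[Sequence[float]],
                    margin: float = 0.5) -> float:
    """mapped[i] is the projected source vector whose correct translation is targets[i]."""
    loss = 0.0
    for i, y_hat in enumerate(mapped):
        positive = cosine(y_hat, targets[i])
        for j, y_neg in enumerate(targets):
            if j != i:
                # Penalize negatives that come within `margin` of the correct target.
                loss += max(0.0, margin - positive + cosine(y_hat, y_neg))
    return loss


targets = [[1.0, 0.0], [0.0, 1.0]]
loss_correct = max_margin_loss([[1.0, 0.0], [0.0, 1.0]], targets)
loss_swapped = max_margin_loss([[0.0, 1.0], [1.0, 0.0]], targets)
```

The loss is zero once every correct translation beats all other candidates by at least the margin, so training effort concentrates on the pairs that are still confused.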
Creating robust cross-lingual word representations with some parallel data (a seed lexicon) is an essential avenue of research. All references in Table 1 work with linear transformations.
Unfortunately, most linear transformation mapping approaches are not accurate enough and still have a long way to go before they are precise and reliable. In addition, there have been few efforts on nonlinear transformation mappings. Both Mikolov et al. (2013b) and Conneau et al. (2018) found that a linear transformation performs better than a nonlinear transformation learned via a feedforward neural network. Makhzani et al. (2016) use adversarial autoencoders to map word embeddings between languages; the reported performance is weak in comparison to other methods. Jinsong et al. (2018a) build a neural generative autoencoder to model bilingual semantics. Zhang et al. (2020) use a Wasserstein GAN (Arjovsky, Chintala, and Bottou 2017) for cross-lingual embedding mappings, combining back-translation with target-side and preliminary mapping learning. Their dataset was not large enough, and the model requires more iterations for the discriminator to converge, which makes it slower to train.
In brief, no neural network-based model has yet been shown to construct a more effective mapping model with feedforward neural networks. Early cross-lingual word embedding models relied on a large amount of parallel data (Artetxe, Labaka, and Agirre 2016; Mikolov et al. 2013b). More recent methods have tried to minimize the amount of supervision necessary (Artetxe, Labaka, and Agirre 2017; Levy, Søgaard, and Goldberg 2017; Smith et al.; Vulić and Korhonen 2016). Some researchers have presented almost unsupervised methods that do not use any form of cross-lingual supervision data (Shigeto et al. 2015; Valerio and Barone 2016; Zhang et al. 2017). Unsupervised cross-lingual word embeddings try to induce bilingual lexicons and machine translation models without parallel corpora and translations (Duong et al. 2016; Lample et al. 2018).
Recently, approaches have been proposed that learn an initial seed lexicon in a completely unsupervised way. All unsupervised cross-lingual word embedding methods are based on mapping approaches. Conneau et al. (2018) learn an initial mapping adversarially by training a discriminator to differentiate between projected and actual target language embeddings. Artetxe, Labaka, and Agirre (2018) propose an initialization method based on the heuristic that translations have similar similarity distributions across languages. Hoshen and Wolf (2018) introduce a method that first projects the vectors of the N most frequent words to a lower-dimensional space with PCA. Their approach minimizes the sum of Euclidean distances by learning $W_{s \to t}$ and $W_{t \to s}$, and enforces cyclical consistency constraints that force vectors projected to the other language space and back to remain unchanged.

Data Collection and Experimental Setup
Turkic languages are spoken across a wide area, stretching from the Balkans in Europe through Central Asia to northeast Siberia (Hammarström, Forkel, and Haspelmath 2017). Turkic languages use several alphabets. The Latin alphabet is well established among them today: it is currently used, in different versions, in Turkey, Uzbekistan, Azerbaijan, and Turkmenistan, and will be used in Kazakhstan.
Turkish, the official language of Turkey, is the most widely spoken Turkic language and has the largest set of articles in the family in the Wikipedia dumps. We use Turkish as the source language in our bilingual mapping experiments and Azerbaijani, Turkmen, and Uzbek as the target languages.
The Indo-European languages form one of the major language families and are mostly spoken in western and southern Eurasia. For our experiments, from the North Germanic branch of the family, we chose Swedish as the source language and Danish and Norwegian as the target languages; from the South Slavic branch, we selected Serbian as the source language and Croatian and Bosnian as the target languages.
One of the first requirements of NLP tasks is a corpus, that is, a collection of texts. Wikipedia is one of the richest sources of a vast amount of well-organized, non-adversarial textual data. It is freely and conveniently available online, which makes it a valuable resource for building NLP systems.
From the Wikipedia text dump (XML file) of each language, we prepared a monolingual corpus for all of the mentioned languages. For each language, the Wikipedia dump contains only the latest versions of the Wikipedia articles (November 2021). Table 2 shows the number of articles and tokens for each language.
To construct a text corpus from Wikipedia without article markup, punctuation, and links, we use the WikiCorpus tool from gensim, a Python library, which parses Wikipedia XML dump files and converts them to a text corpus. To pre-process the text corpus for the Word2vec model, we convert all the corpus text to lowercase and delete all special characters, digits, and extra spaces from the text. After that, we use the Word2vec implementation of the gensim library to build a monolingual embedding model for each language. As for the Word2vec parameters, no lemmatization was done, the window size was set to 5, and the output dimension was set to 768. We only estimated representation vectors for words that occurred five times or more in the monolingual corpus. Figure 3 shows the learning process for word vectors in each language. In our experiments, there are seven language pairs: Turkish-Azerbaijani, Turkish-Uzbek, Turkish-Turkmen, Swedish-Danish, Swedish-Norwegian, Serbian-Bosnian, and Serbian-Croatian. All language pairs are related and use the Latin alphabet, so many words have the same spelling and meaning.
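The cleaning step described above (lowercasing, then deleting special characters, digits, and extra spaces) might look like the following minimal sketch. The function name and the regular expression are assumptions, not the authors' code. The Word2vec call is only indicated in a comment, using the parameter names of gensim 4.x and the settings reported in the text.

```python
import re
from typing import List

# Matches punctuation and other special characters, digits, and underscores.
# Python's \w is Unicode-aware for str patterns, so letters such as "ö" or "ş" are kept.
_TO_DELETE = re.compile(r"[^\w\s]|[\d_]")


def preprocess(text: str) -> List[str]:
    """Lowercase the text, delete special characters and digits, collapse extra spaces."""
    lowered = text.lower()
    cleaned = _TO_DELETE.sub("", lowered)
    return cleaned.split()  # split() also discards repeated whitespace


tokens = preprocess("Kedi, 2 KÖPEK!!   gördü.")

# The token lists of a whole corpus could then be passed to gensim, for example:
#   Word2Vec(sentences=corpus, vector_size=768, window=5, min_count=5)
# (gensim >= 4.0 parameter names; shown as a comment so the sketch needs no extra packages.)
```

One caveat of plain `str.lower()` is that it is not locale-aware, so Turkish-specific casing such as "İ" and "I" may need dedicated handling before this step.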
We need a few thousand word pairs as a seed dictionary for a better and more accurate bilingual word embedding transformation. Preparing a seed dictionary between languages is usually not easy and requires considerable cost and effort. On the other hand, a seed dictionary of reasonable size makes the final word embedding mapping model more accurate.
We propose choosing the identically spelled (exact dictation) words as the bilingual seed dictionary. The underlying assumption is that word embeddings across related languages share similar local and global arrangements. For example, the distance between the words Kedi and Köpek in Turkish should be relatively similar to the distance between Pişik and İt in Azerbaijani. We try to recover the transformation between language pairs using the seed dictionaries. We split each seed dictionary into three parts: a training set, a test set, and a validation set. Table 3 shows the number of identically spelled tokens in each language pair and the sizes of the training, test, and validation sets.
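The construction of the seed dictionary from identically spelled words, and its split into three sets, can be sketched as below. The function names, the toy vocabularies, and the 80/10/10 split ratios are illustrative assumptions; the actual set sizes are the ones reported in Table 3.

```python
import random
from typing import Dict, Iterable, List


def build_seed_dictionary(source_vocab: Iterable[str],
                          target_vocab: Iterable[str]) -> List[str]:
    """Return the words that are spelled identically in both vocabularies."""
    return sorted(set(source_vocab) & set(target_vocab))


def split_seed(words: List[str], train: float = 0.8, valid: float = 0.1,
               seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle reproducibly and split into training, validation, and test sets."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


# Hypothetical toy vocabularies; real ones come from the trained Word2vec models.
seed_words = build_seed_dictionary(["su", "ana", "ev", "kedi"],
                                   ["su", "ana", "ev", "pişik"])
splits = split_seed([f"w{i}" for i in range(10)])
```

Each shared word then yields one training pair: its vector in the source model and its vector in the target model.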

Implemented System
In this section, we present our proposed network. A brief overview of the proposed network is illustrated in Figure 4. The network includes two main parts: a generator network that transfers a word vector from the source language to the target language, and a discriminator network that distinguishes real word vectors from fake ones.
A GAN (Goodfellow et al. 2014) comprises a generator model, G, and a discriminator model, D. The generator aims to learn a mapping function from a prior noise distribution $p_y$ to an unknown data distribution $p_x$ in the real data space. The discriminator tries to discern between generated and real data. Both networks are trained in competition with each other in a min-max game with value function $V(G, D)$:
$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_x}[\log D(x)] + \mathbb{E}_{y \sim p_y}[\log(1 - D(G(y)))]$$
During training, the generator learns to generate more realistic vectors to deceive the discriminator, while the discriminator improves itself to discern the real vectors from the generated ones. Our GAN model is mainly focused on learning one-to-one mappings from an input vector to an output vector.
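To give a feel for the min-max game, the sketch below estimates the standard GAN value function of Goodfellow et al. (2014), the mean of log D(real) plus the mean of log(1 − D(generated)), from a batch of discriminator outputs. The function name and the sample probabilities are illustrative assumptions, and no network is trained here.

```python
import math
from typing import Sequence


def gan_value(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Monte Carlo estimate of V(G, D) from discriminator outputs in (0, 1).

    d_real: D(x) for real target vectors; d_fake: D(G(.)) for generated vectors.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from generated vectors outputs 0.5 everywhere.
v_undecided = gan_value([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates them well scores real high and generated low.
v_confident = gan_value([0.9, 0.9], [0.1, 0.1])
```

The discriminator is trained to push this value up, while the generator is trained to push it down, toward the undecided value of −2 log 2.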
Let $X = \{x_1, x_2, \ldots, x_{|X|}\}$ be the vocabulary of a source language $S_i$ with $|X|$ words, and let $X \in \mathbb{R}^{|X| \times l}$ be the corresponding word embeddings of length $l$. Let $Y = \{y_1, y_2, \ldots, y_{|Y|}\}$ be the vocabulary of the target language $T_j$ with $|Y|$ words, and let $Y \in \mathbb{R}^{|Y| \times m}$ be the corresponding word embeddings of length $m$. We denote the word vector for a word $x$ by $\mathbf{x}$.
The source and target languages are aligned with a seed lexicon dictionary (binary matrix) $D$ such that $D_{ij} = 1$ if the i-th word in the source language is aligned with the j-th word in the target language. Our objective is to find the dictionary matrix $D$ by learning the mapping matrix $W$, which transforms the source language word embeddings $X$ into the target language word embeddings $Y$. Our bilingual word embedding training algorithm is as follows. The generator consists of an encoder-decoder architecture with an attention mechanism (Bahdanau, Cho, and Bengio 2016; Luong and Manning 2016), as shown in Figure 5.
In our experiments, the encoder and decoder networks are recurrent neural networks (RNNs) implemented by stacking multiple bidirectional long short-term memory (BLSTM) layers. The encoder reads the source word embedding vector $x_k$ and produces a high-level representation $h$. The decoder network reads the encoding and generates an output sequence in the target language word embedding space. Attention is a mechanism that gives a richer encoding of the source sequence by constructing a context vector used by the decoder. The decoder calculates the likelihood of the sequence, based on the conditional probability of $y_u$ given the input feature $h$ and the previous labels $y_{1:u-1}$, using the chain rule:
$$p(y \mid x) = \prod_{u} p(y_u \mid h, y_{1:u-1})$$
The input to the encoder is a word embedding with d = 768 elements. To implement each of the encoder and decoder models, we use a stack of 4 BLSTM layers. The encoder model's output is a fixed-size vector that is the internal representation of the input sequence. The number of memory cells in each layer is 256. Hence, we use the generator network to learn a mapping function from a real word vector sample $X$ to a generated sample $y_{gen}$ that corresponds to a real word vector $y_{real}$. The discriminator network D is a CNN used to evaluate how well the generator network generates fake samples. The discriminator takes all the generated vectors as input and tries to distinguish between the real and generated vectors.
The network's output is a 768-dimensional vector that is closely aligned with the model's input word vector. To learn the word embedding mapping, we use iterative refinement to find the final mapping. First, we produce the seed dictionary from the identically spelled words. Next, the system refines the dictionary until convergence. The proposed algorithm used to find the dictionary matrix D is shown below.
Input: X (source language word embeddings)
Input: Y (target language word embeddings)
Input: D (seed dictionary)
1: Until convergence:
  1.1: Mapping_GAN_Model ← LEARN_MAPPING(X, Y, D)
  1.2: D ← LEARN_DICTIONARY(X, Y, Mapping_GAN_Model)
  1.3: EVALUATE_DICTIONARY(D)
Output: D
We use the dot product as the similarity measure to learn a dictionary; it is roughly equivalent to the cosine similarity between the source language word embeddings and the target language word embeddings. We set $D_{ij} = 1$ if $j = \arg\max_k \left(y_{gen} \cdot Y_k\right)$, and $D_{ij} = 0$ otherwise.
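The dictionary update rule at the end of the algorithm can be sketched as follows: for each generated vector, the target word with the largest dot product gets a 1 in the corresponding row of D. The function names and toy vectors are illustrative assumptions, and the learned mapping model itself is not part of the sketch; the generated vectors are taken as given.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def induce_dictionary(generated: List[Sequence[float]],
                      targets: List[Sequence[float]]) -> List[List[int]]:
    """Return D with D[i][j] = 1 iff target j has the largest dot product with generated[i]."""
    dictionary = []
    for y_gen in generated:
        scores = [dot(y_gen, y_k) for y_k in targets]
        best = max(range(len(targets)), key=scores.__getitem__)
        row = [0] * len(targets)
        row[best] = 1
        dictionary.append(row)
    return dictionary


# Hypothetical generated vectors for two source words and two target word vectors.
D = induce_dictionary([[0.9, 0.1], [0.2, 0.8]], [[1.0, 0.0], [0.0, 1.0]])
```

If the embeddings are length-normalized beforehand, the dot product ranking coincides with the cosine similarity ranking, which is the sense in which the two measures are roughly equivalent.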

Results
To induce word embeddings, we use Wikipedia text dumps. We create independent monolingual word embeddings for each language using Word2vec in the gensim library. In our experiments, we set d = 768 for the number of dimensions of the word embeddings and w = 5 for the size of the context window. Each word embedding vector contains floating-point numbers in the range −8 to +8. Experiments are conducted on a Google Colab server. We implemented the model using TensorFlow and Keras. Backpropagation through time (BPTT) and the Adam optimizer with a learning rate of 0.001 are used to optimize the objective function. To find the best bilingual mapping model, we implemented four neural networks: a vanilla LSTM, an encoder-decoder, an encoder-decoder with attention, and our proposed model. All implemented models are trained for at least 1000 epochs with a batch size of 500. The models take around 8-10 hours to train on the Google Colab server, except for our proposed model, which takes approximately 11-12 hours. The similarity percentage between two vectors $y_{real}$ and $y_{gen}$ is computed using the similarity function $\mathrm{Sim}(y_{gen}, y_{real})$. The mean similarity of each language pair is the mean of all similarities in it. Table 4 summarizes the accuracy of our proposed model compared to the other implemented models. The results show that the proposed model achieves the highest performance. The impact of the size of the initial seed dictionary on the quality of the results is shown in Figure 6. For example, for the Azerbaijani column, we calculated the ratio of its words shared with Turkish to all of its words (82/140 = 46%). Our experiments show that larger seed dictionaries increase the quality of the mapping. In Figure 7, we show the difference between the real and generated vectors of three sample word vectors (the word vectors of şanslı, sevgi, and barış).
Figure 6. Initial seed dictionary impact on the bilingual transform mapping.
Previous works have used different methods to learn bilingual word embedding mappings; Table 5 reports previous methods' best results compared to the proposed method. These results demonstrate that our method produces better mappings than previous ones.