The solution of the problem of unknown words under neural machine translation of the Kazakh language

ABSTRACT The paper proposes a solution to the problem of unknown words in neural machine translation (NMT). The solution is demonstrated on NMT for the Kazakh-English language pair. The novelty of the proposed technology is an algorithm that searches for unknown words in the vocabulary of a trained NMT model and uses a dictionary of synonyms of the Kazakh language to replace each unknown word with a word that is close in meaning. Once the unknown words have been identified, the dictionary of synonyms is used to find words with a similar meaning. The synonyms found are then checked for presence in the vocabulary of the trained model. After that, the edited source-language sentence is translated again. A base of synonyms of the Kazakh language has been collected. Software solutions to the unknown word problem have been developed in Python. The proposed technology was tested on two parallel Kazakh-English corpora in two variants: baseline NMT and NMT with the proposed technology.


Introduction
The Kazakh language belongs to the Turkic group of languages. According to Wikipedia, about 13 million people use the Kazakh language; they live in Kazakhstan, Russia, China, Uzbekistan, Mongolia, and Turkmenistan. In terms of available linguistic resources, Kazakh is a low-resource language. Parallel corpora are especially scarce. This situation significantly affects the interaction of citizens of other countries with Kazakhstan.
For Kazakhstan, the problem of machine translation is very relevant: the country is actively integrating into the global space, and the need to translate current information in politics, economics, industry, and the social sphere grows rapidly every year. An urgent issue is the timely translation of modern textbooks and scientific and technical literature into Kazakh, as the Kazakh language begins to prevail in the educational sphere. On the one hand, the corps of translators in Kazakhstan is not large enough to cover the ever-increasing need for translation from the leading world languages into Kazakh; on the other hand, translation productivity has to be increased by using machine translation programs. High-quality machine translation systems for the Kazakh language are therefore important, especially for the languages most relevant to Kazakhstan, such as English and Russian. Since machine translation has not yet reached the level of professional translation, improving it remains a relevant problem. It should be noted that solving the problem of machine translation can open the way to solving other very important problems of artificial intelligence, such as natural language understanding.
Recently, the best machine translation results have been achieved by the approach based on neural networks, namely neural machine translation. However, the quality of neural machine translation does not yet approach that of professional translation. The main problem of neural machine translation is the need for large parallel corpora for training. This is especially true for low-resource languages, which include Kazakh. The problem can be addressed either by having professional translators create natural parallel corpora or by creating synthetic parallel corpora. The first option is very resource-intensive; for the second, various approaches to generating synthetic parallel corpora are possible. The quality of neural machine translation is also affected by the problem of unknown words, i.e. words that are outside the vocabulary of a machine translation system (out-of-vocabulary, OOV).
The quality of neural machine translation depends substantially on solving the problem of unknown words. This problem is associated with the concepts of 'in-domain' and 'out-of-domain'. The term 'in-domain' refers to the source data from the domain on which the neural machine translation model is trained. If, during testing or real translation, a source sentence contains words that did not appear in the 'in-domain' vocabulary, these words are unknown words. Machine translation systems either leave such unknown words untranslated, replace them with the token 'UNK', or translate them with words that are close in meaning. The last option, finding a word that is close in meaning, is itself a difficult task.
This work addresses the problem of unknown words. The paper proposes an approach to solving it in neural machine translation for the Kazakh-English language pair. The approach is applied to the input data at the pre-processing stage. It aims to replace an unknown word with its synonym. For this purpose, a database of synonyms of Kazakh words was collected. The corpus is segmented first. Then each unknown word is identified by searching the vocabulary of the trained neural machine translation model, and it is replaced with a word close in meaning using the dictionary of synonyms.
The main scientific contribution of this work is an algorithm for converting an out-of-domain word into an in-domain word: an unknown word is identified by searching the vocabulary of the trained model and is replaced by a word from the synonym dictionary that is close in meaning. This paper is an expanded version of (Turganbayeva & Tukeyev, 2020). For this version, additional experiments were carried out and each section was extended.

Related works
To solve the problem of unknown words in translation from the Kazakh language, we reviewed practical solutions to this problem for other, widely used languages.
In traditional machine translation, many words remain out of vocabulary during testing, which significantly reduces translation performance. In (Marton et al., 2009), the focus is on how to translate out-of-vocabulary words correctly. For this, additional resources such as comparable data and a thesaurus of synonyms are used. One notable exception is the work (Zhang et al., 2012; Zhang et al., 2013), which also focuses on the syntactic and semantic role of out-of-vocabulary words and suggests replacing them with similar words during testing.
In (Gulcehre et al., 2016), a method is proposed for handling rare and unknown words in neural network models that use the attention mechanism. Their model uses two softmax layers to predict the next word in conditional language models: one predicts the location of a word in the source sentence, and the other predicts a word in the shortlist vocabulary. At each time step, the decision about which softmax layer to use is made adaptively by a multilayer perceptron conditioned on the context.
To solve the problem of unknown words, a replacement-translation-recovery method is proposed in (Li et al., 2016). At the replacement stage, rare words in the test sentence are replaced by similar in-vocabulary words based on a similarity model obtained from monolingual data. At the translation and recovery stages, the sentence is translated with a model trained on new bilingual data in which rare words have been replaced, and finally the translations of the replaced words are substituted by the translations of the original words.
In (Li et al., 2017), a method for handling unknown words in NMT is proposed that is based on the semantic concepts of the source language. First, the authors use the semantic concepts of a source-language semantic dictionary to find candidate in-vocabulary words. Second, they propose a method for calculating semantic similarity that integrates the source language model and the semantic concept network in order to obtain a better replacement word.
In (Thi-Vinh et al., 2019), three solutions are proposed for handling rare words in neural machine translation systems. The first improves the source context for predicting target words by directly connecting the source embeddings to the output of the attention component of the NMT model. The second is an algorithm that learns the morphology of unknown English words in a supervised way in order to reduce the negative effect of the rare word problem. The third uses synonym relations from WordNet to overcome the out-of-vocabulary (OOV) problem of NMT. The proposed approaches are evaluated on two low-resource language pairs, English-Vietnamese and Japanese-Vietnamese, and show an improvement in BLEU.
The review of the literature showed that existing work can be divided into three categories. The first category aims at increasing the vocabulary and at speeding up computation so that a large vocabulary can be processed quickly. The second category uses information from the context: if the system cannot translate a word, it copies the word from the source sentence to the target sentence. The third category replaces the unknown word with a word similar in meaning. In the authors' opinion, the third approach is the most effective, because it addresses the unknown word problem directly.
Our proposed approach falls into the third category. Approaches of the third category rest on the synonymy problem, and for languages with a large amount of data for machine learning this problem has a good solution: for languages that have a large, clean text corpus, a base of words grouped by meaning has been created. For the Kazakh language, no clean corpora or bases of words with similar meanings are openly available, since Kazakh is a low-resource language. Therefore, we propose an approach to the unknown word problem in neural machine translation that works on the basis of a dictionary of synonyms.

Description of method
A technology (method) for solving the problem of unknown words in the neural machine translation of the Kazakh language has been developed. It consists of the following steps:
(1) Segmentation of the Kazakh source text.
(2) Search for unknown words in the vocabulary of the trained neural machine translation model.
(3) For each unknown word in the source text of the test corpus, search for its synonyms in the dictionary of synonyms.
(4) Replacement of the identified unknown words with synonyms.
(5) Repeated machine translation of the modified source text.
The base of words-synonyms of the Kazakh language, consisting of different parts of speech, is collected.
Figures 1 and 2 show an overview of the input/output data and of the workflow.
Below is a more detailed description of the stages of the proposed technology for solving the problem of unknown words in the neural machine translation of the Kazakh language.

Segmentation of the source text of the Kazakh language
Segmentation of the Kazakh source text is performed by the method proposed by the authors (Karibayeva et al., 2020). This segmentation method is based on the definition of the complete set of endings of the Kazakh language. The Kazakh ending system is divided into two groups: nominal endings (nouns, adjectives, numerals) and verb endings (verbs, participles, gerunds, moods and voices). In the Kazakh language, word endings are formed from four types of affixes: C (case), T (possessive), K (plural) and J (personal). Kazakh endings can be represented as all possible combinations of these basic affix types: combinations of one, two, three and four types. The number of combinations (placements) of $k$ types out of $n$ is determined by the formula $A_n^k = n!/(n-k)!$.
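The placement formula can be checked by direct enumeration. The following is a minimal sketch, not the authors' code: the function names are chosen for illustration, and the only inputs taken from the text are the four affix type labels C, T, K and J.

```python
from itertools import permutations
from math import factorial

# The four basic affix types named in the text:
# C = case, T = possessive, K = plural, J = personal.
AFFIX_TYPES = ["C", "T", "K", "J"]


def placement_count(n, k):
    """Number of ordered selections of k items out of n: n! / (n - k)!."""
    return factorial(n) // factorial(n - k)


def enumerate_placements(types, k):
    """All ordered sequences of k distinct affix types, e.g. 'KT' or 'KTC'."""
    return ["".join(p) for p in permutations(types, k)]


counts = {}
for k in range(1, len(AFFIX_TYPES) + 1):
    placements = enumerate_placements(AFFIX_TYPES, k)
    # The enumeration and the closed-form formula must agree.
    assert len(placements) == placement_count(len(AFFIX_TYPES), k)
    counts[k] = len(placements)
    print(k, counts[k], placements[:4])
```

The enumeration only lists candidate orderings of affix types. Which of them are semantically acceptable endings is a linguistic decision that this sketch does not model.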
With n = 4, the formula gives 4, 12, 24 and 24 placements for k = 1, 2, 3 and 4, respectively; only the semantically acceptable placements are kept as ending types. Similarly, all placements for endings with a verb stem were determined, which yielded 55 semantically acceptable types of endings. The total number of ending types for nominal stems plus the total number of ending types for words with a verb stem is 70. In accordance with these types of endings, finite sets of endings are constructed for all the main parts of speech of the Kazakh language. For parts of speech with nominal stems, the number of endings is 1213 (all plural variants are taken into account), and the numbers of endings for parts of speech with verb stems are: verbs 432, participles 1582, verbal adverbs (gerunds) 48, moods 240, voices 80. In total: 3565 (Tukeyev, 2015; Tukeyev et al., 2016). The morphological segmentation algorithm for Kazakh words includes two stages: (1) separation of the stem and the ending of a word; (2) segmentation of the word ending into suffix segments.
The stem and the ending of a word are separated using a stemmer that is also based on the complete Kazakh ending system. At the second stage, a simple transducer model is applied: it uses the table of the complete ending system and outputs the ending segmented into suffixes. The table contains two columns: the endings of Kazakh words and the sequence of suffixes corresponding to each ending. Table 1 shows a fragment of the table of Kazakh endings with segmented suffixes. The symbol @@ is the affix separator.
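The two stages can be sketched as a longest-match lookup in an ending table. This is only an illustration under stated assumptions: the table entries below are hypothetical examples and are not copied from Table 1, the minimum stem length of two characters is an arbitrary choice, and joining the stem to the first suffix with @@ is an assumed output format.

```python
# Hypothetical fragment of an ending table: ending -> suffixes joined by "@@".
# These rows are examples for the sketch, not rows of the authors' table.
ENDING_TABLE = {
    "лар": "лар",
    "дың": "дың",
    "лардың": "лар@@дың",
    "ларымыз": "лар@@ымыз",
}


def split_stem_and_ending(word, table=ENDING_TABLE, min_stem_len=2):
    """Stage 1: strip the longest ending that is listed in the table."""
    # Moving the cut to the right makes the candidate ending shorter,
    # so the first hit is the longest matching ending.
    for cut in range(min_stem_len, len(word)):
        ending = word[cut:]
        if ending in table:
            return word[:cut], ending
    return word, ""


def segment_word(word, table=ENDING_TABLE):
    """Stage 2: replace the ending by its suffix sequence from the table."""
    stem, ending = split_stem_and_ending(word, table)
    if not ending:
        return word  # no known ending: the word is left unsegmented
    return stem + "@@" + table[ending]


print(segment_word("балалардың"))
print(segment_word("кітап"))
```

A real stemmer would also need to handle stems that happen to end in a string that looks like an ending; the sketch ignores this case.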

An algorithm for searching for unknown words in the vocabulary of the trained model
An algorithm has been developed for searching for unknown words in the vocabulary of a trained neural machine translation model for the Kazakh-English language pair. Its main idea is as follows: for each target-language sentence that contains the symbol 'unk', all words of the corresponding source-language sentence are checked for absence from the vocabulary of the trained model, since it is assumed that a word that is not in the vocabulary is not translated. For each such word, a synonym, i.e. another word close in meaning, is then sought. For this, the dictionary of synonyms of the Kazakh language is used.
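The search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the whitespace tokenization and the spelling `<unk>` of the unknown-word symbol are assumptions, and the two-word vocabulary is a toy example.

```python
def find_unknown_words(source_sentence, translated_sentence, vocab,
                       unk_token="<unk>"):
    """Return the source words that are missing from the model vocabulary.

    The source sentence is inspected only if its translation contains the
    unknown-word token, following the idea described in the text.
    """
    if unk_token not in translated_sentence.split():
        return []
    unknown = []
    for word in source_sentence.split():
        if word not in vocab and word not in unknown:
            unknown.append(word)
    return unknown


# Toy source vocabulary (assumed): "үлкен" = big, "үй" = house.
vocab = {"үлкен", "үй"}

found = find_unknown_words("дәу үй", "<unk> house", vocab)
print(found)
```

Storing the vocabulary in a set keeps each membership test at constant expected cost, which matters when every word of the test corpus is checked.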

Define synonyms of an unknown word
To determine the synonyms of an untranslated (unknown) word, a dictionary of synonyms of the Kazakh language has been compiled manually. The synonym dictionary contains 1995 entries. Each word has at least one synonym and at most 35 synonyms. Since a word can have several synonyms, each of them has to be checked for presence in the vocabulary of the trained model. For this, an algorithm has been developed that sequentially checks for the presence of synonyms of an untranslated word in the vocabulary of the trained model.

Replace an unknown word by a found synonym
The synonym of the untranslated word found at the previous stage is substituted into the source text in place of the unknown word that was not translated, i.e. the word that was 'out-of-domain'. If none of the synonyms listed for the unknown word is found in the vocabulary of the trained model, or the unknown word has no synonym, the word remains unchanged in its original form (as shown in the FOR loop on lines 21-25 of Listing 1). For all the above stages of solving the unknown word problem, software solutions have been developed in the Python 3 programming language.
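The synonym check and the replacement rule can be sketched together. The code below is an illustration and does not reproduce Listing 1: the function names are hypothetical, the synonym dictionary is assumed to map a word to an ordered list of synonyms, and the entries are toy examples.

```python
def pick_known_synonym(word, synonyms, vocab):
    """Return the first listed synonym that the model vocabulary contains."""
    for candidate in synonyms.get(word, []):
        if candidate in vocab:
            return candidate
    return None


def replace_unknown_words(sentence, synonyms, vocab):
    """Replace each out-of-vocabulary word by an in-vocabulary synonym.

    A word stays unchanged if it has no synonym entry or if none of its
    synonyms is in the vocabulary.
    """
    result = []
    for word in sentence.split():
        if word in vocab:
            result.append(word)
            continue
        replacement = pick_known_synonym(word, synonyms, vocab)
        result.append(word if replacement is None else replacement)
    return " ".join(result)


# Toy data (assumed): "дәу", "алып" and "үлкен" are close in meaning.
synonyms = {"дәу": ["алып", "үлкен"]}
vocab = {"үлкен", "үй"}

edited_sentence = replace_unknown_words("дәу үй", synonyms, vocab)
print(edited_sentence)
```

Taking the first in-vocabulary synonym is the simplest policy; ranking the candidates, for example by frequency in the training corpus, would be a possible refinement that the text does not describe.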

Translation of the modified source text
The resulting adjusted source text is submitted to the machine translation stage.
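The overall two-pass flow can be sketched independently of any particular NMT toolkit. In the sketch below, `translate` and `fix_sentence` are hypothetical callables supplied by the caller: the first stands in for the trained model, the second for the synonym replacement step. The word-by-word toy translator exists only to make the example runnable.

```python
def translate_with_unknown_word_fix(sentences, translate, fix_sentence,
                                    unk_token="<unk>"):
    """Translate, edit the sentences whose output contains unk_token,
    then translate the edited source text again."""
    first_pass = translate(sentences)
    edited = [
        fix_sentence(src) if unk_token in hyp.split() else src
        for src, hyp in zip(sentences, first_pass)
    ]
    if edited == list(sentences):
        return edited, first_pass  # nothing changed, no second pass needed
    return edited, translate(edited)


# Toy stand-ins (assumptions, not a real NMT model).
TOY_LEXICON = {"үлкен": "big", "үй": "house"}


def toy_translate(sentences):
    return [" ".join(TOY_LEXICON.get(w, "<unk>") for w in s.split())
            for s in sentences]


def toy_fix(sentence):
    return " ".join("үлкен" if w == "дәу" else w for w in sentence.split())


edited, final = translate_with_unknown_word_fix(
    ["дәу үй", "үлкен үй"], toy_translate, toy_fix)
print(edited)
print(final)
```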
The novelty of the proposed technology for solving the problem of unknown words in the neural machine translation of the Kazakh language is the algorithm that searches for unknown words in the vocabulary of a trained neural machine translation model and uses the dictionary of synonyms of the Kazakh language to replace an unknown word with a word that is close in meaning. To find words that are close in meaning to an unknown word, a dictionary of synonyms is used. In addition, each synonym is checked for presence in the vocabulary of the trained model. These steps essentially convert the out-of-vocabulary words of the source text into in-vocabulary words, i.e. out-of-domain words are converted to in-domain words.

Experimental part
The main goal of the experiments is to show the effectiveness of the proposed approach to solving the problem of unknown words for neural machine translation of the Kazakh language. The experiments include: (1) the use of parallel corpora as input data: KAZNU Kazakh-English and WMT19 Kazakh-English; (2) the use of the following parameters: --num_train_steps = 100000 (number of training steps), --num_layers = 2 (number of neural network layers), --num_units = 1024 (number of units), --dropout = 0.2 (dropout rate), --metrics = bleu (evaluation metric).
A number of experiments were carried out to train the model with and without the proposed approach. The result is evaluated with the BLEU (bilingual evaluation understudy) metric, an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another. The results (see Tables 6 and 7) show that the proposed approach improves BLEU scores and reduces the number of unknown words.
Two Kazakh-English parallel corpora are used for training: the KAZNU Kazakh-English corpus with 143 262 sentences (Table 2) and the WMT19 Kazakh-English corpus with 140 870 sentences (Table 3). Before the split into training and test sets, the whole text was shuffled to prevent learning the same structures. Both corpora were checked for duplicate sentences; after that, the KAZNU Kazakh-English corpus contains 140 851 parallel sentences and the WMT19 Kazakh-English corpus contains 140 000 parallel sentences (Table 4).
The KAZNU Kazakh-English corpus was divided into a training set of 132 983 sentences, a development set of 4 868 sentences and a test set of 3 000 sentences.
The WMT19 Kazakh-English corpus was divided into a training set of 135 000 sentences, a test set of 3 500 sentences and a development set of 1 500 sentences.
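The corpus preparation described above (duplicate removal, shuffling, splitting) can be sketched as follows. This is an assumed reconstruction: the text does not say whether duplicates are detected on sentence pairs or on one side only, nor which random seed or order of operations was used, so pair-level deduplication, a fixed seed and the function names are choices made for this sketch.

```python
import random


def deduplicate_pairs(pairs):
    """Keep the first occurrence of every (source, target) sentence pair."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


def shuffle_and_split(pairs, dev_size, test_size, seed=0):
    """Shuffle the pairs, then cut off the test and development sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    test = pairs[:test_size]
    dev = pairs[test_size:test_size + dev_size]
    train = pairs[test_size + dev_size:]
    return train, dev, test


# Toy corpus of 10 pairs, 2 of which are duplicates.
toy_pairs = [(f"kk {i % 8}", f"en {i % 8}") for i in range(10)]
unique_pairs = deduplicate_pairs(toy_pairs)
train, dev, test = shuffle_and_split(unique_pairs, dev_size=2, test_size=2)
print(len(unique_pairs), len(train), len(dev), len(test))
```

For the KAZNU corpus, the same call would use dev_size=4868 and test_size=3000 on the 140 851 deduplicated pairs, which leaves the 132 983 training sentences reported in the text.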
The vocabulary is created from the frequently used words of the corpus (those occurring more than 3 times).
The vocabulary for the KAZNU Kazakh-English corpus training set is 16 878 words in Kazakh and 19 124 words in English (files: origvocab.kaz, origvocab.eng).
The vocabulary for the WMT19 Kazakh-English training set is 48 154 words in Kazakh and 20 957 words in English (files: vocab.kaz, vocab.eng).
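The frequency cut-off used to build the vocabulary can be sketched with a token counter. This is an illustration under assumptions: whitespace tokenization, the function name, and the three special tokens placed at the top of the list (a common convention in seq2seq toolkits) are not taken from the text.

```python
from collections import Counter


def build_vocab(sentences, min_count=4,
                special_tokens=("<unk>", "<s>", "</s>")):
    """Keep tokens that occur more than 3 times (count >= min_count)."""
    counts = Counter(token for s in sentences for token in s.split())
    frequent = [tok for tok, c in counts.most_common() if c >= min_count]
    return list(special_tokens) + frequent


toy_sentences = ["a a b", "a b c", "a b"]
toy_vocab = build_vocab(toy_sentences)
print(toy_vocab)
```

Every token below the threshold becomes a potential unknown word at translation time, which is why the cut-off directly affects how often the replacement step is triggered.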
To evaluate the results of the translation, the BLEU score was used.
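To make the metric concrete, the sketch below computes a simplified corpus-level BLEU: clipped n-gram precision up to 4-grams, uniform weights, one reference per sentence, a brevity penalty and no smoothing. It is not the scoring script used in the experiments, which the text does not specify, and its scores are not comparable with the reported tables.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU in the range 0..100 (single reference)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t)
                        for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


reference = ["the cat sat on the mat"]
score_exact = corpus_bleu(["the cat sat on the mat"], reference)
score_unk = corpus_bleu(["the cat sat on the <unk>"], reference)
print(score_exact, score_unk)
```

The second call shows why unknown-word tokens lower the score: a single `<unk>` breaks every n-gram that spans its position.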
Table 5 describes the source data for training and testing the Kazakh-English NMT with the proposed technology (method) for solving the problem of unknown words.
Tables 6 and 7 present the evaluation of Kazakh-English machine translation for the baseline version and for the version using the proposed method for solving the problem of unknown words.
The model is trained using a seq2seq model with attention in TensorFlow. TensorFlow is an end-to-end open-source platform for machine learning that makes it easy to create machine learning models. A sequence-to-sequence model receives a sequence of elements (words, letters, image attributes, etc.) as input and returns another sequence of elements. In neural machine translation, the sequence of elements is a sequence of words that are processed in turn (Tensorflow tutorials). (128, 256, 512, 1024). In all experiments, the layers were trained for 100 000 steps. Table 8 presents the number of unknown words in the text for the baseline NMT and for the NMT with the proposed method.
The application of this technology does not yield a large improvement, since only synonyms are used for rare (unknown) words, and the synonyms themselves may be rare words, or a rare (unknown) word may have no synonyms.
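The number of unknown words in the output can be counted directly from the translated text. The sketch below assumes that the model emits the token `<unk>` for unknown words and that the output is whitespace-tokenized; the sentences are toy data and not results from the experiments.

```python
def count_unknown_tokens(translations, unk_token="<unk>"):
    """Total number of unknown-word tokens in a list of output sentences."""
    return sum(sentence.split().count(unk_token) for sentence in translations)


# Toy outputs (not experimental data): baseline vs. edited source text.
baseline_output = ["<unk> house", "the <unk> <unk>", "big house"]
proposed_output = ["big house", "the <unk> river", "big house"]

baseline_unk = count_unknown_tokens(baseline_output)
proposed_unk = count_unknown_tokens(proposed_output)
print(baseline_unk, proposed_unk)
```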

Conclusion and future work
The Kazakh language is a low-resource language, and parallel corpora for machine translation are especially scarce. Therefore, it is very important to study problems whose solution improves the quality of machine translation, such as the problem of unknown words. This work proposes a technology for this problem: unknown words in the source text are identified by searching for them in the vocabulary of the trained model, their synonyms are then found in the dictionary of synonyms, each unknown word is replaced by a found synonym, and the translation is repeated.
Our approach differs from other approaches in its logical implementation. In other approaches of the third category (those that solve the unknown word problem by substituting words close in meaning), the corpus is first annotated to identify unknown words. This is a time-consuming task that requires additional effort to process the corpus. In our approach, the corpus is not annotated: unknown words are identified by searching for each word in the vocabulary of the trained model, since unknown words are exactly those words that are not covered by this vocabulary. This has a positive effect on the performance of the approach.
The experiments show that the number of unknown words in the output text decreases, but translation quality as measured by BLEU improves only slightly. In our opinion, these results are explained by the following: (1) the synonym dictionary is not large enough; (2) the meanings of the synonyms used to replace unknown words are not always suitable.
To improve the results of the proposed technology, we plan to extend the dictionary of synonyms. We also plan to use statistical methods to determine the position of unknown words in the source text. Since only a small share of the unknown words in the text is covered by synonyms, we further plan to apply the word2vec model to replace rare words with words that are close in meaning.