GSA-Net: gated scaled dot-product attention based neural network for reading comprehension

Reading Comprehension (RC) is concerned with building systems that automatically answer questions about a given context passage. The interactions between the context and the question are crucial for locating the correct answer. In this paper, we propose a Gated Scaled Dot-Product Attention based model for the RC task. The character-level embedding is incorporated into the word embedding, which is helpful for dealing with Out-of-Vocabulary (OOV) tokens. The attention distribution is obtained by scaled dot product, which captures the interaction between question and passage effectively. Further, a self-matching attention mechanism is adopted to resolve the problem of long-distance dependency. These components provide more information for the prediction of the starting and ending positions of the answer. We evaluate our method on the Stanford Question Answering Dataset (SQuAD) and the results show that the different components in the model boost the performance.


Introduction
Question Answering (QA) is the task of retrieving an answer to a given question. It can be seen as an intelligent search engine built on natural language processing and information retrieval techniques: the user asks questions in natural language and the system returns the corresponding answer directly. QA has been explored both in the open-domain field [1] and in domain-specific settings, such as BioASQ for the biomedical field [2]. The Reading Comprehension task restricts the candidate answer to a given passage.
Question Answering techniques have been advanced by official evaluation campaigns and the publication of open datasets. Wang [3] proposes a generative probability model to compute the matching degree between the dependency trees of question and answer. Heilman and Smith [4] use a conditional random field model to estimate the structural distance between question and answer in the dependency tree. Ko [5] puts forward a probability-based ranking model to select candidate answers, with logistic regression used to estimate the probability of the correct answer. Severyn and Moschitti [6] use an SVM tree kernel to learn shallow syntactic features for the classification of question-answer pairs.
Traditional methods have also been applied to biomedical datasets with good results. The OAQA (Open Advancement of Question Answering) system [7] combines biomedical resources, including domain-specific parsers and entity markers, to retrieve concepts and synonyms. Logistic regression classifiers are used for question classification and candidate answer scoring.
Neural network based Question Answering differs from the traditional methods. Usually, the neural model is trained end-to-end to produce an answer for a given question and passage. Yu [17] uses Convolutional Neural Networks (CNN) to model the distributions of question and answer. Feng [18] publishes a question-answer dataset for the insurance domain and proposes several CNN models based on it. Because of the superior performance of the attention mechanism [19] in sequence-to-sequence models, researchers have introduced attention into the question answering task. With the publication of the SQuAD dataset [8,9], techniques based on deep learning have been thoroughly tested. Wang and Jiang [13] propose two answer prediction models, the Sequence Model and the Boundary Model, with an interaction layer imported to compute the attention distribution. Seo [15] improves the model of Wang and Jiang [13] by adding a bidirectional attention mechanism. Though these approaches work well on the Question Answering task, in all but a few cases such methods demand substantial computing resources and have difficulty answering complex context-dependent questions.
Most question answering systems based on neural networks use an interactive attention mechanism or a bi-attention mechanism to obtain answers [20]. The existing methods mainly focus on the relationship between the question and the passage, and they pay little attention to the interactive verification between candidate answers. The problem is more pronounced in open-domain QA, where a question needs to be answered by considering candidate answers from multiple paragraphs. To resolve this problem, Wang et al. [21] propose a two-stage extraction process: first extract answer candidates from the passages, then select the final answer by combining information from all of the candidates. V-Net [22] adopts an end-to-end neural model that enables answer candidates from different passages to verify each other based on their content representations.
To obtain answers to complex questions, reviewing after reading the documents is necessary for further reasoning. This can be realized by a multi-round reasoning mechanism, which attempts to combine the information of the question with new information extracted in previous iterations [23-25]. The gated-attention reader [26] uses multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network reader, realized by feeding the question encoding into an attention-based gate in each iteration. Cui et al. [27] further propose that question-specific attention should be extended to a bi-attention mechanism, including both question-to-document and document-to-question attention. ReasonNet [28], unlike methods that use a fixed number of iterations, adds a termination module to decide whether to continue to the next inference step or to terminate the reasoning process once the information is sufficient.
To alleviate the problems above, our model allows for significantly more parallelization and achieves higher accuracy for answer prediction. We propose a Gated Scaled Dot-Product Attention based model for the Reading Comprehension task, which aims to answer questions from a given context passage. The character-level embedding is incorporated into the word embedding, which is helpful for dealing with Out-of-Vocabulary (OOV) tokens. The attention distribution is obtained by scaled dot product and a self-matching attention mechanism. Finally, a Pointer Network is used to predict the starting and ending positions of the answer. The rest of the paper is organized as follows. Section 2 introduces the hierarchical multi-layer mechanism of the proposed model. Section 3 introduces the different levels of encoding for question and passage. Section 4 describes the attention-based passage encoding through question interaction. Section 5 presents the Pointer Network for answer selection. Section 6 gives the experimental results and Section 7 concludes.

Hierarchical multi-layer reading comprehension model
We put forward a Gated Scaled Dot-Product Attention based model for RC task, which is represented as a hierarchical multi-layer mechanism shown in Figure 1. It consists of six components.

Character-Level Word Embedding Layer. Each character is mapped into a high-dimensional vector space, and the character-level word embedding for each word is generated by a Bi-GRU.
Word Embedding Layer. The word-level vector is concatenated with character-level vector. The distributed matrix representation of each word in question and passage is generated by the two-layer Highway Network.
Question and Passage Encoding Layer. It utilizes contextual cues from surrounding words to refine the embedding of each word in question and passage.
Gated Scaled Dot-Product Attention Layer. The representation of each word in the candidate passage is re-encoded in conjunction with a question-aware feature vector.
Self-matching Attention Layer. The representation of the passage is enriched by matching it against the output representation of the previous layer. It captures important long-distance cues.
Pointer Network for Answer Selection. For each question, the starting and ending positions of the answer in the passage are predicted by a Pointer Network.

Character-level embedding layer
The character-level embedding layer is responsible for mapping each word into a high-dimensional vector space, which has been shown to be helpful to deal with Out-Of-Vocabulary (OOV) tokens.
Question and contextual passage are represented as the word sets Q = {w_t^q}_{t=1}^m and P = {w_t^p}_{t=1}^n. We use the final hidden states of the Bi-GRU, u_k^q and u_k^p, to represent the character-level word embedding, as shown in Figure 2.
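As a minimal sketch of this layer, the snippet below runs a Bi-GRU over the characters of one word and concatenates the final forward and backward hidden states as its character-level embedding. The weights are randomly initialized and the dimensions are illustrative, not the trained values of the model.

```python
import numpy as np

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: x is the input vector, h the previous hidden state."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = sigmoid(Wz @ x + Uz @ h)            # update gate
    r = sigmoid(Wr @ x + Ur @ h)            # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

def char_word_embedding(word, char_emb, params_fwd, params_bwd, hidden):
    """Bi-GRU over characters; the two final hidden states form the word vector."""
    chars = [char_emb[c] for c in word]
    h_f = np.zeros(hidden)
    for x in chars:                         # forward direction
        h_f = gru_step(x, h_f, *params_fwd)
    h_b = np.zeros(hidden)
    for x in reversed(chars):               # backward direction
        h_b = gru_step(x, h_b, *params_bwd)
    return np.concatenate([h_f, h_b])       # 2 * hidden dimensions

rng = np.random.default_rng(0)
d_char, hidden = 4, 8
char_emb = {c: rng.normal(size=d_char) for c in "abcdefghijklmnopqrstuvwxyz"}
# Parameter tuple order: (Wz, Uz, Wr, Ur, Wh, Uh)
make = lambda: tuple(rng.normal(scale=0.1, size=(hidden, d)) for d in (d_char, hidden) * 3)
params_fwd, params_bwd = make(), make()
vec = char_word_embedding("tesla", char_emb, params_fwd, params_bwd, hidden)
print(vec.shape)  # (16,)
```

Because the vector is built from characters, any word, including an OOV token, receives a non-trivial representation.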

Word embedding layer
The Word Embedding Layer also maps each word to a high-dimensional vector space. We use the pre-trained word vector model GloVe [29] to obtain a fixed-dimension word vector {e_t} for each word. Further, the distributed matrix representation of each word w_t in question and passage is generated by the two-layer Highway Network [30], as shown in Equations 5-6.
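The highway computation can be sketched as follows: each layer mixes a transformed input with the raw input through a learned gate. This is a minimal NumPy illustration with random weights, not the paper's trained parameters.

```python
import numpy as np

def highway_layer(x, Wh, bh, Wt, bt):
    """Single highway layer: gate t mixes a non-linear transform with the input."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    h = np.tanh(Wh @ x + bh)          # candidate transform
    t = sigmoid(Wt @ x + bt)          # transform gate in (0, 1)
    return t * h + (1 - t) * x        # carry the rest of x through unchanged

rng = np.random.default_rng(1)
d = 6
x = rng.normal(size=d)                # e.g. concatenated word + char-level vector
# Two stacked highway layers, as in the model description
layers = [tuple(rng.normal(scale=0.1, size=s) for s in ((d, d), d, (d, d), d))
          for _ in range(2)]
y = x
for Wh, bh, Wt, bt in layers:
    y = highway_layer(y, Wh, bh, Wt, bt)
print(y.shape)  # (6,)
```

The carry term (1 - t) * x lets useful components of the raw embedding pass through unchanged, which stabilizes training of the stacked layers.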

Contextual encoding for question and passage
Based on the output of the previous layer, the question representation matrix Q = {y_t^q}_{t=1}^m and the passage representation matrix P = {y_t^p}_{t=1}^n, a Bi-directional GRU is used to model the temporal interaction between words, as shown in Equations 7-8.
The distributed representations of the word w_t in question and passage are denoted as u_t^q and u_t^p.

Gated scaled dot-product attention layer
An attention mechanism called the Gated Attention-based Recurrent Network has been proposed to generate a new passage representation aligned to the question [31]. We adopt the scaled dot product to implement question-aware passage encoding. Dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
An attention function maps a passage and a set of key-value pairs of the question to an output, where the passage (P), keys (Q_key), values (Q_value) and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the passage with the corresponding key of the question. The question-aware passage embedding is obtained by this particular attention mechanism, Scaled Dot-Product Attention [32], which is shown in Figure 3.
The input consists of the passage and the question's key-value pairs with dimension d_q. We compute the dot products of the passage with all keys of the question, divide each by √d_q, and apply a softmax function to obtain the weights on the values.
The attention function on a set of passage words is computed simultaneously, packed together into a matrix P. The keys and values are also packed together into matrices Q_key and Q_value. Given the question and passage representations U^q = {u_t^q}_{t=1}^m and U^p = {u_t^p}_{t=1}^n, the attention matrix is computed as

C = softmax(P Q_key^T / √d_q) Q_value

Here, P = σ(U^p) ∈ R^{n×d}, Q_key = σ(U^q) ∈ R^{m×d}, Q_value = U^q ∈ R^{m×d}, and σ(x) = ReLU(Wx + b) is a non-linear mapping function. The purpose of attention in an RC system is to read the passage by incorporating the question and to re-encode the passage with the question-aware information.
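The matrix form of this layer can be sketched in a few lines of NumPy. The weights W and b stand in for the learned parameters of the non-linear mapping σ; dimensions are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(P, Q_key, Q_value):
    """Each passage position attends over all question positions and
    takes a weighted sum of the question value vectors."""
    d_q = Q_key.shape[-1]
    scores = P @ Q_key.T / np.sqrt(d_q)       # (n, m) compatibility matrix
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ Q_value, weights

rng = np.random.default_rng(2)
n, m, d = 5, 3, 8                             # passage length, question length, dim
U_p, U_q = rng.normal(size=(n, d)), rng.normal(size=(m, d))
W, b = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)
relu = lambda x: np.maximum(0.0, x)
P, Q_key, Q_value = relu(U_p @ W + b), relu(U_q @ W + b), U_q
C, weights = scaled_dot_product_attention(P, Q_key, Q_value)
print(C.shape)  # (5, 8)
```

Because the whole layer is two matrix multiplications and a softmax, it parallelizes well on modern hardware, which is the efficiency argument made above.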
To make the attention focus on the important parts of the passage that are relevant to the question, we add another gate to the input of the Bi-GRU, and the recurrent input is updated as

g_t = sigmoid(W_g [u_t^p, c_t])
[u_t^p, c_t]* = g_t ⊙ [u_t^p, c_t]

Here, u_t^p comes from the previous encoding layer and serves as an additional input to the recurrent network, and c_t is the t-th vector of the attention matrix C.
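A small sketch of this gating idea, following the gated attention-based recurrent network style of [31]: the passage word vector and its attention vector are concatenated and scaled element-wise by a learned sigmoid gate before entering the recurrent layer. Weights here are random placeholders.

```python
import numpy as np

def attention_gate(u_t, c_t, Wg):
    """Gate the concatenated [u_t; c_t] so the recurrent network receives
    only the question-relevant components of the passage input."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    v = np.concatenate([u_t, c_t])
    g = sigmoid(Wg @ v)          # element-wise gate in (0, 1)
    return g * v                 # gated recurrent input

rng = np.random.default_rng(3)
d = 4
u_t = rng.normal(size=d)         # passage word encoding from the previous layer
c_t = rng.normal(size=d)         # t-th row of the attention matrix C
Wg = rng.normal(scale=0.1, size=(2 * d, 2 * d))
x_t = attention_gate(u_t, c_t, Wg)
print(x_t.shape)  # (8,)
```

Components with gate values near zero are effectively masked out, so the Bi-GRU state is dominated by the parts of the passage that match the question.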

Self-matching attention layer on passage
The question-aware passage encoding is generated by the Gated Scaled Dot-Product Attention Layer. A problem remains: it has limited understanding of the context and misses important cues outside its surrounding window, while passage context is necessary to infer the correct answer. To address this problem, we use the question-aware distributed representation directly to match the passage against itself. It dynamically collects evidence from the whole paragraph and encodes the evidence relevant to the current passage word. The encoding result u_t^p for word w_t in the passage is obtained by Equation (13). Here, P = P_key = σ(U^p) and P_value = U^p in the self-attention layer are obtained from the output of the previous layer of the encoder. Each position in the encoder can attend to all positions in the previous layer.
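Structurally this is the same scaled dot-product attention as before, with the passage serving as query, key, and value at once. A minimal NumPy sketch, assuming the same σ = ReLU(Wx + b) mapping with placeholder weights:

```python
import numpy as np

def self_matching_attention(U_p, W, b):
    """Passage attends to itself: queries and keys are sigma(U_p), values are
    U_p, so every position can collect evidence from any other position."""
    relu = lambda x: np.maximum(0.0, x)
    P = P_key = relu(U_p @ W + b)
    P_value = U_p
    scores = P @ P_key.T / np.sqrt(P_key.shape[-1])   # (n, n) over all pairs
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ P_value                          # (n, d) enriched passage

rng = np.random.default_rng(4)
n, d = 6, 8
U_p = rng.normal(size=(n, d))    # question-aware passage from the previous layer
W, b = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)
H = self_matching_attention(U_p, W, b)
print(H.shape)  # (6, 8)
```

The (n, n) score matrix is what gives every word a direct path to every other word, removing the long-distance dependency bottleneck of a purely recurrent encoder.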

Pointer network for answer selection
We use pointer networks [33] to predict the starting and ending positions of the answer in the passage. Attention pooling over the question representation is used to generate the initial hidden vector for the pointer network. Given the passage representation {u_t^p}_{t=1}^n, the attention mechanism is utilized to select the starting position p_Start and ending position p_End in the passage, which can be formulated as follows.
For each word w_t in the passage, we predict the probability of it being the starting (L = Start) or ending (L = End) word of the answer.
We use the question representation U^q = {u_t^q}_{t=1}^m, with attention pooling, to produce the initial hidden vector of the pointer network.
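The two-step prediction can be sketched as follows. Each step scores every passage position against the current hidden state with additive attention and returns a distribution over positions; the state update between steps is a simplified attention-pooling stand-in for the recurrent update, and all weights are random placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pointer_step(H_p, h, W1, W2, v):
    """One pointer-network step: additive attention scores over all
    passage positions, normalized into a probability distribution."""
    s = np.tanh(H_p @ W1 + h @ W2) @ v      # (n,) scores
    return softmax(s)

rng = np.random.default_rng(5)
n, d = 7, 8
H_p = rng.normal(size=(n, d))               # final passage representations
h0 = rng.normal(size=d)                     # init vector from question pooling
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))
v = rng.normal(scale=0.1, size=d)

p_start = pointer_step(H_p, h0, W1, W2, v)  # distribution over start positions
start = int(np.argmax(p_start))
h1 = p_start @ H_p                          # simplified state update (sketch)
p_end = pointer_step(H_p, h1, W1, W2, v)    # distribution over end positions
end = int(np.argmax(p_end))
print(start, end)
```

Because the output vocabulary is exactly the set of passage positions, the predicted answer is always a span of the given passage, matching the SQuAD setting.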

Dataset
We evaluate our model on the Stanford Question Answering Dataset (SQuAD) V1.1. It consists of questions posed on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets (Table 1).
A question-context sample from SQuAD is shown below.
Passage: In 1870, Tesla moved to Karlovac, to attend school at the Higher Real Gymnasium, where he was

Experimental setting and evaluation
We use the StanfordNLP tokenizer [34] to preprocess each passage and question. For word embedding, we use pre-trained case-sensitive GloVe embeddings [29] for both questions and passages, fine-tuned during training. All out-of-vocabulary words are represented by zero vectors. The length of the hidden vector is set to 128 for all layers, and the hidden size used to compute the attention score is also 128. We apply dropout between layers with a dropout rate of 0.5. The model is optimized with Adam with an initial learning rate of 0.5. BioASQ (http://participants-area.bioasq.org/) is a challenge providing training data for biomedical semantic indexing and question answering tasks. The 2017 BioASQ training dataset contains 1799 questions, of which 413 are factoid questions and 486 are list questions. These questions have about 20 snippets on average, and each snippet is about 34 tokens long.
Two metrics are utilized to evaluate the model performance on SQuAD: Exact Match (EM) and F1 score. EM measures the percentage of predictions that match any one of the ground-truth answers exactly. F1 measures the word overlap between the prediction and the ground-truth answer.
Here, M denotes the number of test samples. The predicted answer is represented as a i and ground-truth answer is denoted as a * i .
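The two metrics can be computed per sample as below, following the common SQuAD evaluation convention of lowercasing and stripping punctuation and articles before comparison (the exact normalization of the official script may differ in details).

```python
import re
import string
from collections import Counter

def normalize(s):
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-overlap F1 between prediction and ground truth."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Martin Sekulic", "Martin Sekulic"))              # 1.0
print(round(f1_score("a math teacher Martin Sekulic", "Martin Sekulic"), 2))  # 0.67
```

Per-dataset scores are then the averages of these per-sample values over the M test samples; when multiple ground-truth answers exist, the maximum over them is taken.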

Experimental results
In order to evaluate the performance impact of the different components in our hierarchical multi-layer model, we give a detailed comparison by removing them separately; the results are shown in Table 2. The scores on the DevSet are evaluated by the official script. As can be seen from Table 2, the performance declines when different components are removed from GSA-Net; these components play an important role in selecting the correct answer. To show the ability of the model to encode evidence from the passage, we draw the alignment of the passage against the question in the Gated Scaled Dot-Product Attention Layer. The attention weights are visualized in Figure 4. The darker the colour, the higher the attention weight of the word. For example, the answer "Martin Sekulic" in the passage receives more attention with respect to the question. Other words with darker colour, such as "was", "Karlovac" and "Tesla", overlap with the question.
The cross-entropy losses for the RC task of the GSA-Net and GSA-Net-without-Gate models are shown in Figure 5. The loss decreases gradually as the training step increases. The GSA-Net model has the lower loss and also converges faster.
We also compare the performance of our model GSA-Net with other related work on SQuAD as shown in Table 3.
LR Baseline [8]: A model based on Logistic Regression for RC task, which extracts several types of features for each candidate and computes the unigram/bigram overlap between the sentence and question.
Match-LSTM with Ans-Ptr [13]: It proposes two new end-to-end neural network models for machine comprehension task, which combine match-LSTM and Ptr-Net to handle the special properties of the SQuAD dataset.
Dynamic Co-attention Network [14]: A Model called Dynamic Co-attention Network (DCN) for question answering. The DCN firstly fuses co-dependent representations of the question and the document in order to focus on relevant parts of both. Then a dynamic pointing decoder iterates over potential answer spans.
RaSoR [36]: A model that efficiently builds fixed-length representations of all spans in the evidence document with a recurrent network. It explicitly computes embedding representations for candidate answer spans.
BiDAF [15]: A model named Bi-Directional Attention Flow (BiDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bidirectional attention flow mechanism to obtain a query-aware context representation without early summarization.
Fine-Grained Gating (ensemble) [37]: A model with a fine-grained gating mechanism to combine word-level and character-level representations dynamically. It further extends the idea of fine-grained gating to model the interaction between question and paragraph for reading comprehension.
The results in Table 3 show that our model GSA-Net performs best in both EM and F1. Our method clearly outperforms the baseline and several strong state-of-the-art systems, for both the single model and the ensemble.
We also compare the performance of our model with the participating systems in BioASQ. For each batch and each question type, the results of the top 2 competing systems and our model are shown in Table 4.
Our model has a strong ability to process the interactive encoding of questions and contexts, which yields a high-quality question-context alignment representation. A gated mechanism is adopted in our model to solve the problem of propagating dependencies over long distances. Moreover, the hierarchical attention mechanisms locate the segments related to the answer step by step, so the semantic representation ability of the model is enhanced.

Conclusion
Machine Reading Comprehension is an important task for natural language understanding. It evaluates the machine's ability to access knowledge and answer questions from a given passage. This paper proposes an end-to-end machine reading comprehension framework that understands questions and the relevant fragments well enough for answer prediction. We present a Gated Scaled Dot-Product Attention based Neural network (GSA-Net) for the Reading Comprehension task. The different components in this hierarchical multi-layer model play an important role in locating the correct answer. The gated scaled dot-product attention and self-matching attention mechanisms are used to obtain a suitable question-aware representation of the passage.
Further, the pointer network predicts the answer position effectively. Our model achieves an exact match (EM) score of 71.1% and an F1 score of 80.1% on SQuAD, which outperforms several strong competing systems. The model has modest memory requirements, and it performs even better than models that rely on more computing resources.
Although we have added two attention layers to the proposed model, its interpretability is still limited, which is a common problem in deep learning based natural language processing. In addition, the cross-paragraph reasoning ability of the model needs to be improved, which is important for answering complex questions. In future work, we will try to combine BERT with our method to improve the reasoning ability of the model. Meanwhile, how to learn the prior knowledge of human language expression from large-scale unstructured data and apply it to machine reading comprehension is also a significant goal of our work.