Multi-distribution noise quantisation: an extreme compression scheme for Transformers based on parameter distribution

With the development of deep learning, neural networks are widely used in various fields, and the improved model performance comes with a considerable number of parameters and computations. Model quantisation is a technique that turns floating-point computation into low-bit fixed-point computation, which can effectively reduce model computation, parameter size, and memory consumption, but often brings a considerable loss of accuracy. This paper mainly addresses the problem of parameter distributions becoming too concentrated during quantisation-aware training (QAT). In the QAT process, we use a piecewise function to estimate the parameter distributions and, based on the statistical results, simulate the effect of quantisation noise in each round of training. Experimental results show that by quantising the Transformer network, we lose little precision while significantly reducing the storage cost of the model; compared with a full-precision LSTM network, our model has higher accuracy at a similar storage cost. Meanwhile, compared with other quantisation methods on the language modelling task, our approach is more accurate. We validated the effectiveness of our policy on the WikiText-103 and Penn Treebank datasets. The experiments show that our method extremely compresses the storage cost while maintaining high model performance.


Introduction
Transformer-based models have surpassed most LSTM-based models in the field of natural language processing. Meanwhile, recent research on the Transformer network has been widely applied in the field of computer vision (Brock et al., 2021; Riquelme et al., 2021; Zhai et al., 2021), and its accuracy on some datasets has approached the performance of CNN-based networks (Qi et al., 2021). However, the Transformer architecture itself has a large number of parameters, which limits its use on low-compute devices. Therefore, in order to trade off the high precision and large storage cost of the Transformer network and enable it to be used on low-compute edge devices, many model compression methods for the Transformer network, including pruning (Dong & Yang, 2019; Wang, Wohlwend, et al., 2020; Zhu et al., 2021) and quantisation (Chung et al., 2020; Prato et al., 2020; Zhang et al., 2020), have been widely explored.
In this paper, we mainly explore quantisation under the Transformer framework. We examine the problem of decreased model accuracy caused by the over-concentration of model parameters around the mean after quantisation. In this work, we consider quantisation as a kind of noise added to the full-precision model parameters; we simulate such noise during Quantisation-Aware Training (QAT) to improve robustness during model training, as mentioned in Section 3.3. Inspired by Yan et al. (2021), Wang, Li, et al. (2021), Ning, Duan, Li, Shi, et al. (2020), Zhou et al. (2020) and Bai et al. (2018), we introduce a regularisation term into the loss function to quantify the performance of the model in the STE process, as mentioned in Section 3.4. A suitable quantisation method should simultaneously solve the problems of uneven weight distribution after quantisation and the significant performance gap between the full-precision model and the quantised model, which correspond to two contradictory goals in quantisation.
The distribution of the model's parameters during Quantisation-Aware Training has a significant impact on the accuracy of the model. Han et al. (2016) pointed out that the distribution of the model's parameters obeys a bell-shaped or long-tailed distribution rather than a binomial distribution; Miyashita et al. (2016) showed that the model's activations obey this distribution as well. Williams (1993) matches the distribution of the model parameters, assigning more quantisation levels (higher resolution) to the peak of the distribution and fewer levels (lower resolution) to the tails. However, during QAT the parameters of the model often follow different distributions at different training stages, as depicted in Figure 1. Therefore, the training method needs to be dynamically adjusted during the QAT process.
The second goal is reducing the accuracy gap between the quantised model and the corresponding full-precision model. In scalar quantisation (Williams, 1993), the floating-point parameters of the model are compressed into a low-dimensional space directly, with a significant decline in accuracy. The introduction of QAT-class methods such as Jacob et al. (2018) solves part of this accuracy degradation: quantisation is regarded as a kind of noise, and such noise is introduced into all full-precision parameters of the model to improve the robustness of the quantised model. However, the parameter distribution of the model is not taken into account in the above quantisation methods. Our method adds quantisation noise to QAT according to the distribution of model parameters, as shown in Figure 2. To this end, we propose Multi-Distribution Noise (MDN) quantisation to resolve these two contradictions; our contributions are as follows: (1) According to the weight distribution at different training stages, a piecewise function is used to quantise a subset of the model's parameters in the QAT process. In the quantisation process, Iterative Product Quantisation (IPQ) (Stock et al., 2020) is used to further avoid parameter concentration near the mean. We set different compression rates for different layers for a further accuracy increase. (2) We introduce a quantisation regularisation term into the loss function to make the performance of the quantised model approach that of the full-precision model during training.
(3) Experimental results show that our method dramatically reduces the storage cost of the model while losing only part of the accuracy under the Transformer network. Compared to a full-precision model, our approach compresses the model size significantly with only a partial loss of precision, as mentioned in Section 4.3. Meanwhile, our method achieves a better trade-off between storage cost and model accuracy than other quantisation methods on the language modelling task, as mentioned in Section 4.

Related work
In this section, we first analyse the technical development of quantisation and the limitations of various quantisation methods. We group quantisation techniques into Post-Training Quantisation (PTQ) and Quantisation-Aware Training (QAT) and review the works closest to ours in these two areas for a comprehensive overview.

Quantisation development and limitations
Due to the high storage cost of a full-precision model, it is difficult to migrate it to low-compute devices. In Jégou et al. (2011), lower-precision representations are used to replace the floating-point weights of a trained network. However, compressing model parameters to fixed-width integers is often accompanied by great model performance degradation: the errors made by these approximations accumulate in the computations of the forward pass, inducing a significant drop in performance (Stock et al., 2020). The Product Quantiser (PQ) is used to alleviate such error accumulation (Ge et al., 2013; Jegou et al., 2010; Norouzi & Fleet, 2013; Xu et al., 2021); the idea is to decompose the original high-dimensional space into a Cartesian product of subspaces that are quantised separately with a joint codebook (Stock et al., 2020). However, this type of PQ algorithm often suffers a large loss of accuracy in models of high complexity, such as the VGG network (Simonyan & Zisserman, 2015), the Transformer network (Vaswani et al., 2017), and other high-storage-cost networks (Ning, Ning, Li & Zhang, 2020; Ning, Nan et al., 2020; Wang, Bai et al., 2020). QAT (Jacob et al., 2018) solves this problem and makes quantisation schemes feasible on complex models. Quantisation-Aware Training simulates the low-precision representation in the forward pass while updating the full-precision representation of the model weights in backpropagation. This makes the model parameters more robust to quantisation and renders the quantisation process almost lossless. Due to the Transformer's high accuracy and high storage cost, research on Transformer quantisation has attracted more and more attention.

Post-training quantisation
Post-Training Quantisation has gained more and more attention from industry because it does not require retraining. Recent work (Nagel et al., 2019) proposed an offline 8-bit quantisation method that does not need extra data to fine-tune and recover accuracy; it makes full use of the scale-equivariance property of the ReLU function to adjust the weight ranges of different channels and, at the same time, corrects the bias introduced in the quantisation process. In Lee et al. (2018), a simple method for channel-level distribution recognition is proposed to reduce the accuracy loss caused by quantisation and minimise the data samples required for analysis. To solve the problem of outliers in the distribution of model parameters, Karayiannis (1999) proposes Outlier Channel Splitting, which can reduce the magnitude of outliers without retraining.

Quantisation-aware training
While Post-Training Quantisation can significantly reduce the quantisation time overhead of a model, it also comes with a significant accuracy decrease. To solve this problem, Han et al. (2016) proposed a new quantisation framework in which quantisation noise is introduced during training to simulate the error caused by the quantisation process. Because instability easily occurs during Quantisation-Aware Training, Stock et al. (2021) quantised a random subset of the weights, so that most of the weights are updated with unbiased gradients. Zhang et al. (2018) developed LQ-Nets, in which the quantisation model and the quantisation levels are trained jointly instead of using fixed-point quantisation; this considerably narrowed the gap between the quantised model and the full-precision model. Li et al. (2020) assigned higher resolution around the mean by adding a clipping function so that the quantisation levels can match the distributions of weights and activations dynamically.

Preliminaries
Quantising neural networks. In our work, we split the weights of the model into fixed-size blocks and use a codebook to represent the quantised vectors. For a weight matrix W ∈ R^(l×o), we split W into k × n blocks b_ip:

W = (b_11, ..., b_1n; ...; b_k1, ..., b_kn).    (1)

We adopt the algorithm of Carreira-Perpiñán and Idelbayev (2017) to map the weight matrix into the codebook. The weight matrix is first divided into subvectors; then a codebook with K centroids, C = {c[1], ..., c[K]}, is computed by the k-means algorithm, so that a total of K codewords represent the weight matrix, where k-means finds the cluster centres present in the weight matrix. Each subvector of W is represented by its nearest codeword from the codebook:

I_ip = argmin_k ||b_ip − c[k]||²_2.    (2)

After quantisation, each weight block of the weight matrix is represented by a codebook index I_ip, and the blocks of the model are reconstructed from the corresponding subvectors of the codebook:

b_ip = C[I_ip].    (3)

Iterative PQ. To prevent the accumulation of errors across layers, Stock et al. (2020) minimise the reconstruction error for in-domain inputs. The method focuses on the Euclidean distance between the activations before and after quantisation, instead of the distance between the weights before and after quantisation as in traditional PQ (Jégou et al., 2011). Each codeword c is a training parameter updated by

c ← c − η (1/|J_c|) Σ_{(i,p)∈J_c} ∂Loss/∂b_ip,    (4)

where J_c = {(i, p) | C[I_ip] = c}, Loss is the loss function, and η is the learning rate. By introducing the codebook vectors into the loss function, the codebook is updated iteratively during model backpropagation.
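The codebook construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation (which builds on Carreira-Perpiñán and Idelbayev, 2017); the function names and the plain Lloyd-style k-means are our assumptions:

```python
import numpy as np

def quantise_layer(W, block_size=8, n_centroids=256, n_iter=20, seed=0):
    """Split a weight matrix into fixed-size blocks b_ip, cluster them with
    k-means, and keep only the codebook C and the per-block indices I."""
    rng = np.random.default_rng(seed)
    blocks = W.reshape(-1, block_size)               # subvectors b_ip
    # initialise centroids from randomly chosen blocks
    C = blocks[rng.choice(len(blocks), n_centroids, replace=False)].copy()
    for _ in range(n_iter):
        # assignment step: nearest codeword for each block (Eq. 2)
        d = ((blocks[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        I = d.argmin(1)
        # update step: each centroid becomes the mean of its members
        for k in range(n_centroids):
            members = blocks[I == k]
            if len(members):
                C[k] = members.mean(0)
    return C, I

def dequantise(C, I, shape):
    """Reconstruct the weight matrix as b_ip = C[I_ip] (Eq. 3)."""
    return C[I].reshape(shape)
```

After quantisation, only the codebook and the (small-integer) indices need to be stored, which is where the compression comes from.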

Overview
In the training phase of MDN, we follow the IPQ algorithm to quantise the embedding layer, Transformer layers, and fully connected layer of the model, layer by layer from bottom to top, as shown in Algorithm 1. At the beginning of each training iteration, we construct the codewords of the weight blocks of the unquantised layers and calculate the weight distribution of the codewords corresponding to these blocks. According to this distribution, we introduce a piecewise function that achieves local quantisation under different distributions by adding distribution noise to the unquantised weights, as shown in Equation (5). A simple visualisation of this computational flow is provided in Figure 3. Meanwhile, we add a regularisation term to the loss function that introduces the Euclidean distance between the quantised and the pre-trained full-precision weights, as shown in Equation (7).
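The bottom-to-top schedule of Algorithm 1 can be sketched as follows; the group names, start epochs, and the two callbacks are illustrative stand-ins for the IPQ quantisation step and the MDN noise step described above, not the authors' code:

```python
# Layer groups and the epoch at which each one is frozen into codewords.
SCHEDULE = [("embedding", 0), ("transformer", 20), ("fully_connected", 40)]

def quantisation_step(epoch, quantise_group, add_mdn_noise):
    """At each epoch, groups whose start epoch has passed are quantised;
    the remaining groups stay full-precision but receive MDN noise."""
    for name, start in SCHEDULE:
        if epoch >= start:
            quantise_group(name)    # map weights to codewords (IPQ)
        else:
            add_mdn_noise(name)     # simulate quantisation noise (Eq. 5)
```

Running this once per epoch reproduces the layer-by-layer progression: the embedding layer is quantised first, and the fully connected layer last.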

Multi-distribution noise
Model weight distribution varies considerably at different training stages, and this affects quantisation performance (Zhang et al., 2018). We tackle this problem by introducing a piecewise function to judge the different distributions of the weights and then add noise to the full-precision parameters of the model according to the distribution:

w_MDN = m ⊙ b + (1 − m) ⊙ w,   m ∼ Bernoulli(p),    (5)

Model training is divided into a weight pre-processing phase and a forward-propagation phase. The pre-processing phase uses the IPQ algorithm to quantise the weights of the specified layers according to the training dataflow, while the codebooks of the still-unquantised weights are computed without mapping those weights to codewords. Instead, the MDN algorithm is used to add quantisation noise to the full-precision weights, so that these layers contain both code blocks and the weights themselves.

In Equation (5), b represents the code blocks calculated by Equation (3) and p denotes the dropout rate of the Bernoulli function. The model is first initialised and obeys a uniform distribution; the Hopkins statistic is used as the test for uniformity. After that, the model gradually moves from a binomial distribution towards the long-tailed and bell-shaped distributions (Han et al., 2016). We introduce confidence intervals as a judgment condition: when the confidence level lies between 0.9544 and 1.0, the weights are too concentrated near the mean, and the MDN algorithm is used to generate the quantisation noise. The hyperparameter τ controls the quantisation ratio under the long-tailed and bell-shaped distributions; in our training strategy, we use a larger random quantisation ratio in the long-tailed case to improve the model's ability to fit outlier factors.
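A minimal sketch of this piecewise noise, assuming a Bernoulli mask over weights and a concentration test based on the two-sigma (0.9544) interval; the exact rate adjustment by τ is our assumption, not the paper's published rule:

```python
import numpy as np

def mdn_noise(w, codeword, p=0.2, tau=0.4, rng=None):
    """Swap weights for their codewords under a Bernoulli mask whose rate
    depends on the empirical weight distribution (cf. Eq. 5). The
    concentration statistic (share of weights within two standard
    deviations; 0.9544 for a Gaussian) and the tau-scaled rate for the
    long-tailed case are illustrative assumptions."""
    rng = rng or np.random.default_rng()
    mu, sigma = w.mean(), w.std()
    within = np.mean(np.abs(w - mu) < 2 * sigma)   # concentration statistic
    if within < 0.9544:       # heavier tails than a Gaussian: larger ratio
        rate = min(p * (1.0 + tau), 1.0)
    else:                     # bell-shaped or over-concentrated: base rate
        rate = p
    mask = rng.random(w.shape) < rate
    return np.where(mask, codeword, w)
```

Each call replaces a random subset of the full-precision weights with their quantised values, so the forward pass sees the kind of noise that post-training quantisation would introduce.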

Regular setting
In backpropagation, we update the codebooks of the model. Meanwhile, the STE algorithm is used to update the full-precision weights of the unquantised layers. In order to reduce the performance gap between the quantised model and the pre-trained full-precision model, we introduce a regularisation term into the loss function:

R = η Σ_layer ||w_fp − w_q||²_2,    (6)

L = CE(y, ŷ) + R.    (7)

We use cross-entropy CE as the base loss, where N denotes the length of the sentence, y ∈ R^(V×N) and ŷ ∈ R^N represent the label and the output, respectively, and V is the vocabulary size; layer indexes the weights of the layers to be quantised, w_fp and w_q represent the pre-trained full-precision weights and the unquantised model weights, respectively, and η weights the regulariser. In the backpropagation of the model, the quantised weights are fixed, and the unquantised full-precision weights are iterated to simulate the impact of the quantisation process on model performance.
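The regularised objective can be sketched as follows. The squared-Euclidean form of the penalty and the function name are reconstructed from the text, not copied from the paper's code:

```python
import numpy as np

def mdn_loss(logits, targets, w_fp_list, w_q_list, eta=1e-5):
    """Token-level cross-entropy plus eta times the squared Euclidean
    distance between pre-trained full-precision weights (w_fp) and the
    current unquantised weights (w_q). A sketch under stated assumptions."""
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(targets)), targets].mean()
    # regulariser: one term per quantised layer
    reg = sum(np.sum((fp - q) ** 2) for fp, q in zip(w_fp_list, w_q_list))
    return ce + eta * reg
```

With eta = 0 this reduces to ordinary cross-entropy; increasing eta pulls the unquantised weights back towards the pre-trained full-precision checkpoint.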

Experiment
In this section, we validate our proposed Multi-Distribution Noise method on the WikiText-103 (Bradbury et al., 2016) and Penn Treebank (Marcus et al., 1994) language modelling benchmarks. WikiText-103 contains more than 100 million tokens and is widely used in natural language processing: 103,227,021 training, 217,646 validation, and 245,569 test tokens. Meanwhile, to verify our method's performance in small-sample learning, we use Penn Treebank, which includes 929,590 training, 73,761 validation, and 82,431 test tokens from the Wall Street Journal. The word sequences are fed into the model without any preprocessing of the dataset. We extensively compare against full-precision modelling methods with the same storage cost, as well as against other quantisation methods, on the two datasets mentioned above, and then conduct an ablation study of the important components. Results of all the language modelling experiments are presented in Tables 1-3.
Note that, although our algorithm needs to estimate the weight distribution during the model training phase, which incurs additional computational cost, the model does not require statistics on the weight distribution in the validation phase. It takes 0.0072 s per word sequence in the training stage and only 0.0029 s per word sequence in the validation phase.

Implementation details
We trained for a total of 60 epochs on each dataset, compressing the embedding layer of the model at epoch 0 with block size 8 and 256 centroids, the Transformer layers at epoch 20 with block size 4 and 256 centroids, and the fully connected layer at epoch 40 with block size 4 and 256 centroids. We used p = 0.2 to quantise the blocks of all model layers. Meanwhile, we used τ = 0.4, thus introducing a larger outlier factor into model training. We trained a 16-layer Transformer (Vaswani et al., 2017) using the Fairseq framework (Ott et al., 2019).

Evaluation metric
Perplexity (PPL) is a metric used in natural language processing to measure how well a model fits the data; we use perplexity when evaluating language models to estimate the training effect and to make judgments and analyses. We use the same loss computation as Stock et al. (2021). PPL is defined as

PPL = exp( −(1/N) Σ_{i=1}^{N} log p(w_i | w_1 w_2 ... w_{i−1}) ),

where N denotes the length of the whole sentence, w_i denotes the ith token of the sentence, and p(w_i | w_1 w_2 ... w_{i−1}) denotes the probability of the ith token given the preceding i−1 tokens. Table 1 shows the difference in accuracy as well as storage cost between our method and the full-precision methods under each model structure.
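As a concrete check of this definition, a small helper (hypothetical, not from the paper's code) computes PPL from per-token log-probabilities:

```python
import math

def perplexity(token_log_probs):
    """PPL from per-token log-probabilities log p(w_i | w_1 ... w_{i-1}):
    PPL = exp(-(1/N) * sum_i log p(w_i | ...)), N = number of tokens."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)
```

For example, a model that assigns every token a probability of 1/4 has a perplexity of exactly 4, matching the intuition that PPL is the effective branching factor.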

Compare with full-precision language modelling results
We compare with full-precision Transformer-based methods. On the WikiText-103 dataset, our accuracy changes by +16%/+12%/−10% relative to kNN-LM/Transformer-XL/BERT-Large-CAS, with a 6.69×/6.96×/10.69× reduction in storage cost. On the Penn Treebank dataset, our accuracy decreases by 19% relative to BERT-Large-CAS with a 10.69× storage compression. The results show that we significantly reduce model storage cost compared to other full-precision Transformer-based models, with only a partial loss of precision.
Meanwhile, we significantly improve model performance compared to full-precision LSTM architectures with comparable storage costs. On the WikiText-103 dataset, our quantisation method achieves a 1.71×/1.72× accuracy increase over LSTM-RMC/char3-MS-vec with less storage cost. Meanwhile, on the Penn Treebank dataset, our approach increases the storage cost by only 1.42×/1.05×/1.53× relative to char3-MS-vec/AWD-LSTM/Mogrifier-LSTM while outperforming all of these methods. The results show that, at the storage scale of LSTM architectures, our approach delivers better performance.

Compare with quantisation language modelling results
We compare with other work on model quantisation for the language modelling task; the results are presented in Table 2. On WikiText-103, compared with the Transformer-based model Quant-Noise (Stock et al., 2021), our method achieves an accuracy increase of 9.9% with a nominal storage cost. Meanwhile, compared with the Transformer-based model DiffQ (Défossez et al., 2021), although the accuracy of our model decreases by 4.5%, its storage overhead is one-fifth of DiffQ's. Both Quant-Noise (Stock et al., 2021) and DiffQ (Défossez et al., 2021) are quantisation methods within the Transformer structure, making them the works closest to ours. On Penn Treebank, compared to the LRLSTM-1500 method, our method increases the storage overhead by 55.6% while improving model accuracy by 2×.

Ablation study
Our approach consists of two techniques: MDN, which fits the bell-shaped and long-tailed distributions, and a regularisation term, which narrows the gap between the full-precision model and the quantised model. In this section, we conduct an ablation study of these two techniques. We compare the effect of quantising a subset of the model parameters under different τ in formulation (5). We also use different values of η to investigate the effect of this hyperparameter on model performance in formulation (6). To investigate the effect of the outlier factor on model accuracy during quantisation, we set different values of τ, as well as different p without the regularisation term, as shown in Table 3. Karayiannis (1999) observed that unless outliers are identified and suppressed or eliminated, they can influence the formation of clusters by competing with the rest of the feature vectors to attract the prototypes. Inspired by this, we use a larger τ to quantise the outlier points in the model, thus eliminating the effect of the outlier factor on model accuracy. The experimental results show that the MDN method gives the most significant performance improvement when p = 0.2, τ = 0.5.
To verify the effectiveness of the regularisation term, we conducted ablation experiments. The results show that the addition of the regularisation term improves model performance (a 4% increase, as shown in Table 1). Meanwhile, to investigate the effect of our proposed regularisation term on model performance, we conducted extensive experiments with multiple different values of η, as shown in Figure 4. The model achieves higher accuracy when η is set to 0.00001.

Discussion
In this paper, we use Multi-Distribution Noise (MDN) to quantise model parameters into codewords. During Quantisation-Aware Training (QAT), we use a piecewise function to randomly quantise the model parameters under different distributions, simulating the noise generated by the quantisation operation. Meanwhile, a regularisation term is introduced into the loss function to make the accuracy of the quantised model approximate that of the full-precision model. Our approach can compress a complex (Transformer-based) model to the same storage cost as a simple one while taking fuller advantage of the complex structure (with higher accuracy). Meanwhile, the experiments show that our approach achieves an excellent trade-off between model accuracy and storage overhead compared with other state-of-the-art quantisation methods.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work is supported by the National Natural Science Foundation of China [grant number 61901436].