Neural architectures for gender detection and speaker identification

Abstract In this paper, we investigate two neural architectures for gender detection and speaker identification tasks, using Mel-frequency cepstral coefficient (MFCC) features, which by themselves do not cover voice-related characteristics. One of our goals is to compare different neural architectures, multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs), on both tasks under various settings, and to learn gender/speaker-specific features automatically. The experimental results reveal that models using the z-score and Gramian matrix transformation obtain better results than models that only use max-min normalization of MFCC. In terms of training time, the MLP requires more training epochs to converge than the CNN. Further experiments show that MLPs outperform CNNs on both tasks in terms of generalization error.


PUBLIC INTEREST STATEMENT
Voice gender detection and speaker identification are interesting and challenging tasks. In this paper, we present two approaches for both tasks using two neural network architectures: the multi-layer perceptron (MLP) and the convolutional neural network (CNN).
These models are compared with each other in terms of training time, model performance, model size, and generalization error. People generally expect CNNs to perform better than MLPs because the CNN was invented later than the MLP. Indeed, in a single-run experiment, we see that the CNN can outperform the MLP in prediction accuracy. But across 500 repeated experiments, we notice that the MLP is much better than the CNN in terms of generalization error. In this work, we also applied different techniques for audio signal transformation and tested them on both tasks.

Introduction
Automatically detecting gender and identifying speakers from a speaker's voice is an important task in the audio signal processing area. Gender detection deals with finding out whether an utterance was spoken by a male or a female. This task is crucial for gender-dependent automatic speech recognition (ASR), which lets the ASR system be more accurate than gender-independent systems. Speaker recognition is the process of automatically recognizing speakers on the basis of individual information carried in the speech wave; it can be categorized into speaker identification and speaker verification. Speaker verification is the process of accepting or rejecting the identity claim of a speaker. Speaker identification, on the other hand, is the process of determining which registered speaker provides the input speech.
In general, gender detection and speaker identification can be viewed as classification problems, in which the former classifies the input audio into two categories and the latter classifies the input audio into as many categories as there are registered speakers. Many approaches have been proposed for gender classification; the most commonly used methods are decision trees (Naeem et al., 2013), support vector machines (SVM) (Lian & Lu, 2006), Bayesian networks (Darwiche, 2010), K-nearest neighbors (Cunningham & Delany, 2007), and random forests (Ho, 1995).
In this paper, we use different neural architectures for both gender detection and speaker identification. We apply multi-layer perceptrons (MLPs) (Li, Chen, Shi, Tang, & Wang, 2017; Youse, Youse, Fathi, & Fogliatto, 2019) and convolutional neural networks (CNNs) (Liew, Khalil-Hani, Radzi, & Bakhteri, 2016) in order to learn gender/speaker-specific features from the original MFCC vectors, which do not cover specific voice-related characteristics of the speech signal. Another aim of this work is to compare the performance of the two neural architectures on both tasks under different settings: 1) different feature transformations, 2) different model sizes, and 3) adding a noise signal to the test set to measure the models' generalization error. We evaluate our methods on a Kazakh speech corpus. The experimental results show that MLPs outperform CNNs in terms of generalization error and that the MLP requires more training epochs to converge than the CNN on both tasks.
The rest of the paper is organized as follows: Section 2 introduces the task description with notations. Section 3 illustrates the signal feature extraction process. Section 4 describes our approach for both tasks. Section 5 reports the experimental outcomes. Section 6 summarizes the results of this work as a conclusion.

Task description: gender detection and speaker identification
Voice gender detection aims to automatically detect the speaker's gender from audio signals. Similarly, speaker identification is to determine the speaker's identity (name or ID) by analyzing his/her audio.
Let X = {x_1, x_2, ..., x_n} denote a series of audio signals given as input. G = {g_1, g_2, ..., g_n} is a binary 0/1 vector of gender categories corresponding to the audio signals X; here, we use 1 to denote Female and 0 for Male. S = {s_1, s_2, ..., s_n} denotes the speakers' IDs, where a unique number distinguishes each speaker. The training pairs for the two tasks can be defined as follows: i) (X, G) = (x_1, g_1), ..., (x_n, g_n) for gender detection; ii) (X, S) = (x_1, s_1), ..., (x_n, s_n) for speaker identification. For both tasks, we use X as input, extract the corresponding signal features, and then train models with different neural networks for gender detection and speaker identification.

Signal feature extraction
As in many speech processing tasks (speech recognition, etc.), the first step is to extract features that can be used for identifying the linguistic content contained in the audio signals and for discarding the background noise. Mel-frequency cepstral coefficients (MFCC) (Sahidullah & Saha, 2012) are state-of-the-art features widely used in many speech processing applications. Before describing the MFCC, consider the original audio signal shown in Figure 1. An original signal consists of thousands or millions of numbers; it can be considered a very long vector which contains both linguistic content and noise.
In this work, we use MFCC features to perform gender detection and speaker identification; the details of extracting MFCC features are not the focus of this paper. In practice, we apply LibROSA, a Python package for audio signal analysis, whose librosa.feature.mfcc function was used to extract the MFCC. The extracted features are shown in Figure 2.
In practice, we set the number of MFCC features to 40, so the MFCC matrix of an audio is M ∈ R^(40×n_j), where n_j is the number of frames of audio j. Max-min normalization is applied to each MFCC feature, and the result is referred to as the original MFCC in the following. We also tried an alternative normalization of the MFCC features, the z-score, computed as

z = (M − μ) / std(M)  (1)

where μ is the mean and std(M) the standard deviation of M. One standard way to handle variable-length input is to find the maximum length among the audios and pad the MFCC features with zeros whenever an audio is shorter than that maximum. A more efficient way to solve the variable-length issue is the following transformation:

M_g = M M^T ∈ R^(40×40)  (2)

where 40 is the number of MFCC features. We can then apply the flatten operation to M_g or use its 2-D form.
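A sketch of these two transformations in NumPy; the random matrix stands in for a real 40 x n_j MFCC matrix, and applying the Gramian transformation to the z-scored features (rather than the raw ones) is an assumption about the paper's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(40, 123))  # 40 coefficients, 123 frames (placeholder)

# z-score normalization (equation 1)
M_z = (M - M.mean()) / M.std()

# Gramian matrix transformation (equation 2): a fixed-size 40 x 40
# representation regardless of the audio's length.
M_g = M_z @ M_z.T
print(M_g.shape)  # (40, 40)
```

Because M_g is always 40 x 40, no zero-padding to a maximum length is needed.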

Models
Neural networks (NNs) can be considered as parameterized classifier functions, and a neural network with several layers can be seen as a composition of functions:

f(x; θ) = f_l(f_{l−1}(⋯ f_1(x) ⋯))

where θ denotes the parameters of the NN and l is the number of layers. In the following, we describe our two neural network architectures for the gender and speaker detection tasks: i) a feed-forward NN, i.e., a multi-layer perceptron, referred to as MLP; ii) a convolutional NN, referred to as CNN.

Feed-forward neural networks
To better describe the model, let us start with a simple neural network. As is known, a single-layer perceptron (Auer, Burgsteiner, & Maass, 2008; Freund & Schapire, 1999) is an NN with no hidden units, containing only an input layer and an output layer. There is no non-linear feature extraction: the outputs are computed directly from the weighted sum of the inputs. We use the MLP, an NN composed of many perceptrons, which can learn or extract non-linear features. Generally speaking, an MLP consists of an input layer, some number of hidden layers, and an output layer.
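The forward pass of such an MLP can be sketched in NumPy as below; the hidden sizes (128, 64, 32), the input dimension 1600 (a flattened 40 x 40 matrix), and the 2-class output are illustrative choices, not the exact training code of the study:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, weights, biases):
    """MLP forward pass: ReLU hidden layers, softmax output."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])

rng = np.random.default_rng(1)
sizes = [1600, 128, 64, 32, 2]  # flattened 40x40 input -> 2 genders
Ws = [rng.normal(scale=0.05, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

probs = mlp_forward(rng.normal(size=(5, 1600)), Ws, bs)
print(probs.shape)  # (5, 2)
```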

Convolutional neural networks
Convolutional neural networks (CNNs) (Collobert & Weston, 2008; Krizhevsky, Sutskever, & Hinton, 2012; Van den Oord, Dieleman, & Schrauwen, 2013) are a specialized kind of neural network for processing data with a 2-D grid-like topology. CNNs have been tremendously successful in practical applications. Unlike MLPs, which use fully-connected layers to extract features, CNNs leverage two important ideas that help improve the model: sparse interactions and parameter sharing. The former is a feature extraction process with a kernel smaller than the input. For example, when processing audio, the input signal might contain thousands or millions of numbers; instead of feeding such a long vector to the NN, a CNN can detect small, meaningful features by capturing local information. Parameter sharing refers to using the same parameters for the small kernel as it slides over a 2-D input.
A typical CNN consists of three stages: i) convolution layers perform a set of linear activations; ii) each linear activation runs through a non-linear activation function; iii) a pooling function further modifies the output of the layer.
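These three stages can be sketched directly in NumPy; the 40 x 40 input (e.g. a Gramian-transformed MFCC matrix), the 3 x 3 kernel, and the 2 x 2 pooling size are illustrative assumptions:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution (cross-correlation) with one shared kernel."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling over size x size blocks."""
    H2, W2 = x.shape[0] // size, x.shape[1] // size
    return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

rng = np.random.default_rng(2)
x = rng.normal(size=(40, 40))  # e.g. a Gramian-transformed input
k = rng.normal(size=(3, 3))    # small shared kernel (sparse interactions)

h = conv2d(x, k)               # stage i: linear activation
h = np.maximum(0.0, h)         # stage ii: non-linearity (ReLU)
h = max_pool(h)                # stage iii: pooling
print(h.shape)  # (19, 19)
```

Note how one 3 x 3 kernel (9 weights) is reused at every position, in contrast to a fully-connected layer over the same input.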

Experiments
We conducted a series of experiments to evaluate the MLP and CNN models for the gender and speaker detection tasks: • the first experiment is designed to analyze the effectiveness of the features extracted from MFCC, which are tested with both models. As described in Section 3, we use and compare two types of features: i) the normalized, flattened original MFCC as a long feature vector, and ii) the z-scored MFCC turned into the Gram matrix (equation 2).
• in order to compare the two models effectively, we test the two trained model types 500 times, adding different noise (normally distributed with zero mean and unit variance) to the test set each time.
• visualization of the audios after the model training for both tasks.
For model evaluation, we report precision, recall, and F1-score under the different model configurations.
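The repeated noise-injection protocol from the second bullet can be sketched as follows; the random features and the linear scorer are placeholders standing in for the real test set and a trained MLP/CNN:

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder test set (570 audios, flattened features) and a random
# linear "model"; in the paper these are the real data and trained nets.
X_test = rng.normal(size=(570, 1600))
y_test = rng.integers(0, 2, size=570)
W = rng.normal(scale=0.01, size=(1600, 2))

def accuracy(X):
    pred = (X @ W).argmax(axis=1)
    return float((pred == y_test).mean())

# 500 runs: fresh N(0, 1) noise is added to the test set each time,
# and the score of the fixed model on the noisy set is recorded.
scores = [accuracy(X_test + rng.normal(size=X_test.shape)) for _ in range(500)]
print(len(scores), round(float(np.mean(scores)), 3))
```

The spread of the 500 scores is what Figures 6 and 7 visualize for the real models.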

Data sets
We use the data set from the study . Tables 1 and 2 show the statistics of the data sets for gender and speaker recognition. It can be seen that there are 855 audios in the training set and 570 in the test set for the gender detection task. The numbers of audios for Male and Female in the training set are 448 and 407, respectively; the remaining 570 audios serve as the test set. Table 2 shows the training and test sets for speaker identification. There are 19 speakers in total, and each of them has around 60 audios for training. This is quite a small training set, not large enough to train data-hungry deep learning models well; nevertheless, in this work we use it to train and evaluate the MLP and CNN models for speaker identification.

Model setup
The neural architectures of MLP and CNN with small and large models are tested in the experiments. We train two versions of our models to assess the trade-off between performance and size. The architectures of the small and large models are summarized in Tables 3 and 4. The main hyperparameter is the number of hidden units, which we set to 128, 64, 32 for the small models and 512, 256, 128 for the large models, respectively. We use ReLU as the activation function for all layers, and the dropout rate is set to 0.15.
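As a back-of-the-envelope illustration of the size trade-off, the dense-layer parameter counts of the two MLP variants can be computed from the hidden sizes above; the input dimension 1600 (a flattened 40 x 40 Gramian matrix) and the 2-class output are assumptions for this sketch:

```python
def mlp_params(sizes):
    """Weights plus biases of consecutive fully-connected layers."""
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

small = mlp_params([1600, 128, 64, 32, 2])   # hidden sizes 128, 64, 32
large = mlp_params([1600, 512, 256, 128, 2]) # hidden sizes 512, 256, 128
print(small, large)  # 215330 984194
```

Most parameters sit in the first layer, so the large model is roughly 4.6 times bigger despite only the hidden widths changing.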

Results
Figure 3 displays the distribution of audios with the MFCC features, as well as the features after z-score normalization and the Gramian matrix transformation. Figure 3(a,c,e) show the distribution of audios with the original MFCC, after z-score normalization, and after the Gramian matrix transformation for the gender data set, respectively. Similarly, Figure 3(b,d,f) show the audio distributions for the speaker data set. It can be seen that the distribution of the original MFCC for the gender and speaker audios is not separated by gender or speaker; they are mixed together, making the gender or speaker audios hard to distinguish. After the z-score transformation, the distribution of audios becomes denser than the original one. More unexpectedly, after the z-score and Gramian matrix transformations, the distributions of audios for genders and speakers change: from Figure 3(e,f), we can observe that the audio signals for genders and speakers become more distinguishable, with only a few audios mixed. It turns out that if we train/test models on this data set with such a distribution, it is easy to reach 100% accuracy.
Table 5 lists the results for gender and speaker recognition with the different feature (L and G) and model (small and large) settings. It can be observed that the models trained/tested after the Gramian transformation (denoted as G in the table) achieved a 100% F1-score, which confirms the observation above. Comparing the results of models that use the long, flattened MFCC vector with max-min normalization (denoted as L in Table 5), we can see that the fully-connected MLP outperforms the CNN for gender detection. Figure 4 shows the training curves of those models for gender detection; it can be seen that the MLP tends to need more training epochs to converge than the CNN (Figure 4(a,b)). Here, we only show the small-sized models; the situation for the large ones is similar.
From Figure 4, it can also be seen that the MLP and CNN with the Gramian matrix (G) take relatively few training epochs compared to L.
Let us move on to the results of speaker identification, for which we also used the different feature settings and model sizes. Table 6 shows the results; we can see that the models using the feature L obtained relatively poor outcomes no matter which model we use, with F-scores in the range of 19% to 36%. The Gramian matrix transformation raised model performance distinctly: the F-scores of both models almost reach 99%, much higher than the models trained with L. Figure 5 shows the training curves for speaker identification. As we can see, the model with L takes many training epochs to converge; in contrast, the model with G takes fewer training steps. Another issue visible in Figure 5(a,b) is that while the training accuracy of the model with L gradually reaches 99%, the validation accuracy remains at the 20% level.

Generalization ability
In order to compare the models' generalization error, we conducted another experiment: we added noise signals to the test set, each normally distributed with zero mean and unit variance, and tested the already trained models with G 500 times. Figures 6 and 7 depict how the F-scores of the models are distributed. It can be seen that, for both gender and speaker recognition, the CNNs cannot outperform the MLPs when different normally distributed noise signals are added to the test set. One possible reason is that each layer of an MLP is fully connected, so the inputs interact with each other in more dimensions. In contrast, the CNNs have convolution layers, commonly treated as region feature extractors that slide a 2-D window with a specified step and a shared weight over an input to extract region features, so the inputs interact only within regions. The interaction between one input region and another is not captured, except when we choose the step size small enough. Another possible reason is that the MLPs have more trainable parameters than the CNNs, since MLPs have fully connected layers while CNNs share weights across their convolution layers. As a result, the CNNs show a larger generalization error than the MLPs for gender and speaker identification.
Figure 8 shows the visualization of the test set after model training for gender and speaker identification with both models and both feature forms (L and G). We run the trained models on the test set, take the output of an intermediate layer as the representation of each audio, and then use t-SNE (van der Maaten & Hinton, 2008) to visualize it. Figure 8(a-d) shows the visualization results for gender detection; it can be seen that the MLP seems better at separating the audios into the two classes, male and female. One criterion to evaluate the clustered results is to look at two distances: the intra-class and the inter-class distance. The former measures the distance between elements within a class, where a smaller distance indicates a better result. The latter measures the distance between different classes, where a larger distance indicates a better result.
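These two clustering criteria can be computed directly on the embedded points; the 2-D Gaussian blobs below are synthetic stand-ins for the t-SNE embeddings of the two gender classes:

```python
import numpy as np

rng = np.random.default_rng(4)
# Two synthetic, well-separated classes in 2-D embedding space.
male = rng.normal(loc=(0, 0), scale=0.5, size=(50, 2))
female = rng.normal(loc=(5, 5), scale=0.5, size=(50, 2))

def intra_class(points):
    """Mean distance of points to their class centroid (smaller is better)."""
    return float(np.linalg.norm(points - points.mean(axis=0), axis=1).mean())

def inter_class(a, b):
    """Distance between the two class centroids (larger is better)."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

print(intra_class(male), inter_class(male, female))
```

Denser clusters (as observed for the MLP) show up as a small intra-class distance relative to the inter-class distance.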

Visualization
A comparison of Figure 8(c,d) shows that the MLP classifies the audios into denser classes than the CNN; it can be seen that its intra-class distance is smaller than the CNN's. Figure 8(e-h) shows the visualization results for speaker identification. Figure 8(g,h) show the results for the models with G; as we can see, the MLP for speaker identification has a larger inter-class distance than the CNN.

Conclusion
In this paper, we applied different neural network architectures to the gender detection and speaker identification tasks. The two neural networks were compared in different ways: 1) various feature transformations (L and G), 2) different model sizes (small and large), and 3) adding noise signals to the test set to measure the models' generalization error (tested 500 times). The results show that, for both tasks, the two types of neural networks obtain better results after applying the z-score and Gramian matrix transformations. In terms of training time, the MLP requires more training epochs to converge when only max-min normalization is applied to the MFCC features. The model size does not have a significant effect on performance; the large models give only a minor improvement. Another comparison shows that the MLPs outperform the CNNs in these experiments in terms of generalization error. The visualization results show that MLPs separate the audios into denser classes than CNNs for both tasks.