Gender and region detection from human voice using the three-layer feature extraction method with 1D CNN

ABSTRACT Analysing the human voice has always been a challenge to the engineering community for various purposes, such as product review, emotional state detection, and developing AI. Two basic goals of voice or speech analysis are to detect a speaker's gender and the geographical region the voice belongs to, based on accent. This study presents a three-layer feature extraction method that works on the raw human voice to detect the gender, male or female, as well as the region the voice belongs to. Fundamental frequency, spectral entropy, spectral flatness, and mode frequency are calculated in the first layer of feature extraction; Mel Frequency Cepstral Coefficients are used to extract the features in the second layer, and linear predictive coding in the third. Regular voice recordings contain noise, which has been removed with multiple audio data filtering processes to get noise-free, smooth data. A multi-output-based 1D Convolutional Neural Network has been used to recognize gender and region from a combined dataset consisting of the TIMIT, RAVDESS, and BGC datasets. The model predicted gender with 93.01% and region with 97.07% accuracy. This method outperforms the usual state-of-the-art methods on the separate datasets as well as on the combined dataset for both gender and region classification.


Introduction
The human voice is the most natural medium for interaction among human beings. When sound comes out of the vocal tract, it carries a great deal of regional, biological, and surrounding atmospheric information. Using this kind of information, we can determine a person's language, gender, age, accent, and emotional and present state. In general, the human ear is a natural sound-analysing system. It has an exceptional ability to classify gender, age, region, emotional state, and much more based on attributes of the human voice like loudness, frequency, etc. (Livieris et al., 2019). A machine cannot naturally do the same, although such an ability would be useful in many fields. Machine learning, a subgroup of artificial intelligence, learns from experience or data by developing an algorithm that teaches a computer system to make decisions on various problems, like speech recognition, image processing, etc. (Holzinger, 2019). Machine learning offers various types of algorithms that help researchers analyse and model data. Deep learning classifiers like the CNN, multi-layer perceptron (MLP), recurrent neural network (RNN), long short-term memory (LSTM), artificial neural network (ANN), and Hidden Markov model (HMM), as well as different machine learning classifiers like the support vector machine (SVM), k-nearest neighbours (KNN), and Random Forest (RF), have been used to solve many classification and regression problems. To recognize or classify gender from the human voice, researchers have used several kinds of machine learning and deep learning techniques. However, gender recognition from the human voice is not an easy task without developing an ideal model.
To develop such a model for classification problems, one must choose the right features from the human voice, by which both the gender and region of a person can be determined easily. The most common features, like MFCCs, the Mel-scaled power spectrogram (Mel), power spectrogram chroma (Chroma), spectral contrast (Contrast), and tonal centroid features (Tonnetz), have been employed for gender recognition using a 1D CNN model (Alkhawaldeh, 2019). Features such as fundamental frequency, energy, spectral flatness, entropy, intensity, and zero-crossing rate are also used to categorize gender and region. After collecting the features from the human voice, we combined them with the proper labels to prepare a training set. After that, machine learning techniques were utilized to construct a good model for recognizing gender. Since not all ML techniques perform equally well, we applied multiple techniques and measured their performance to select an optimal classification technique.
Researchers have conducted a multitude of experiments using multiple methods and datasets to measure the efficiency of their proposed methods. A supervised machine learning approach was implemented (Majkowski et al., 2019) using a two-layer neural network for gender classification; that study worked with 20 types of features, together with their minimum, maximum, and average values. Audio data typically contain some noise, which can make it problematic to extract the right features; as a result, we cannot make accurate predictions without good features. Solving this noise problem requires a pre-processing stage before feature extraction. Pre-emphasis, frame blocking, and the Hamming window are used in (Alkhawaldeh, 2019; Yusnita et al., 2017) as a pre-processing phase for noise removal.
For a robust gender detection system, researchers are trying to find the most distinctive features (Alhussein et al., 2016). For example, Simpson's rule is used in (Alsulaiman et al., 2011) to measure the intensity of the human voice by measuring the area under the normalized curve; an Arabic digit database and a manual threshold were used to recognize gender. In (Archana & Malleswari, 2015), MFCC, energy entropy, and frame energy are fed to an ANN and an SVM, and the performance of the SVM is better than that of the ANN. A gender classification model is developed in (Chaudhary & Sharma, 2018; Gaikwad et al., 2012) using pitch, energy, and MFCC as features and an SVM as the classification model; both studies used a small number of features for their experiments. Multi-layer perceptrons (MLP), the Gaussian mixture model (GMM), learning vector quantization (LVQ), and vector quantization (VQ) are used as classifiers in (Djemili et al., 2012), with the IViE corpus as the dataset and MFCC for feature extraction; although multiple classifiers were applied, not all of them performed well. Fatih Ertam used a three-stage model (Ertam, 2019) for gender classification, selecting the 10 most effective parameters: IQR, meanfun, sfm, sd, median, Q25, Q75, mode, centroid, and meandom. A deeper Long Short-Term Memory (LSTM) network structure was used to predict gender, achieving a good accuracy of 98.4%. A non-linear parameter-based method was developed in (Yingle et al., 2008); fractal dimension and fractal complexity are used as features, where entropy is used to calculate the fractal complexity. The experimental results showed that the non-linear parameter-based method performed effectively well compared with the traditional method.
A set of low-level time-domain features (Ghosal & Dutta, 2014), such as the mean of short-time energy, the standard deviation of short-time energy, the mean of zero-crossing rate, the standard deviation of zero-crossing rate, and spectral flux, was calculated from the voice for discriminating gender. Two simple classification techniques, RANSAC and a neural net, were used in that study. Noisy speech was used in (Zeng et al., 2006) to test the proposed system for gender categorization; using pitch and relative spectral perceptual linear predictive coefficients, the proposed system achieved 95% accuracy on noisy speech.
Researchers have worked not only with adults' voices but also with children's voices to predict gender. In (Chen et al., 2010), researchers proposed a gender recognition system for children of multiple age groups, using cepstral peak prominence, spectral harmonic magnitude, and harmonic-to-noise ratio as features. One of the most dominant and most researched features, the Mel coefficients together with their first and second derivatives (Pahwa & Aggarwal, 2016), was used to determine gender from a 46-speech-sample dataset containing the Hindi vowels. The proposed method works well with vowels, although the researchers did not report whether the model performs well or poorly on longer sentences. Another vowel-based dataset was used in (Yusnita et al., 2017) to investigate gender using different orders of LPC coefficients.

Dataset
In this experiment, we have used three audio datasets: the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and BGC (details shown in Table 1). Every dataset was recorded in English. In TIMIT, every speaker spoke 10 sentences. RAVDESS is an emotion-based audio dataset in which every actor recorded 60 audio files. BGC is our self-made audio dataset for this experiment, in which every actor has 50 audio files. In our previous paper (Uddin et al., 2020) we worked with the individual datasets, but in this study we have mixed all three datasets in random order.

Proposed method
The audio datasets contain considerable noise and distortion alongside the original usable voice, which need to be filtered out. In the pre-processing part of the voice analysis, we have used several filtering techniques to get good-quality data from those voices. After filtering, we used our proposed three-layer feature extraction method to get the feature set from the raw audio data. In the first layer, we used fundamental frequency, spectral entropy, spectral flatness, and mode frequency. In the second layer, we used the linear interpolation method to map the audio data into a fixed range; after mapping the data, MFCC was used to extract the features, and those MFCC features were normalized using Z-score normalization. Finally, in the third layer, LPC was calculated. Once the features are extracted, the new feature dataset is ready to be processed for Convolutional Neural Network (CNN) training. The whole process structure is shown in Figure 1.

Pre-processing
High pass filter: It attenuates signals whose frequencies are below a certain cut-off frequency and passes the frequencies above it. In this study, a high pass filter has been used to eliminate unwanted sounds or noise near the lower edge of the audible range or below it. For the high pass filter, we used 0.05 as the normalized passband frequency, specified as a scalar in the interval (0, 1). Z-score normalization: The main purpose of normalization is to rescale data to a common scale without distorting or changing its original nature. Generally, it is used when data have different ranges. In this study, we used the Z-score normalization technique to reduce the outlier issue with Equation (1),

z = (x - μ) / σ    (1)

which expresses each data point's distance from the mean (μ) in terms of the standard deviation (σ).
Here x is the original value, μ is the mean of the feature, and σ is its standard deviation; after normalization the feature has mean 0 and standard deviation 1. Savitzky-Golay filter: It is used to smooth a signal whose frequency span is large. It is also known as a digital smoothing polynomial filter. It is applied to increase the precision of the data without distorting the original signal, and it reduces the least-squares error in fitting a polynomial to frames of noisy data. In the Savitzky-Golay filter, the polynomial order must be smaller than the frame size, and the frame size must be odd. In this study, we used a polynomial order of 3 and a frame size of 21 to smooth the data. The audio data pre-processing is shown in Figure 2.
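The pre-processing chain above can be sketched as follows. This is a minimal illustration, not the paper's exact code: the high pass filter is assumed to be a Butterworth design (the paper only specifies the normalized passband frequency of 0.05), while the Z-score and Savitzky-Golay parameters follow the text.

```python
import numpy as np
from scipy.signal import butter, filtfilt, savgol_filter

def preprocess(signal):
    """Pre-processing sketch: high pass filter, Z-score
    normalization, then Savitzky-Golay smoothing."""
    # High pass filter, normalized cutoff 0.05 (Butterworth assumed)
    b, a = butter(4, 0.05, btype="highpass")
    filtered = filtfilt(b, a, signal)
    # Z-score normalization: z = (x - mu) / sigma
    z = (filtered - filtered.mean()) / filtered.std()
    # Savitzky-Golay smoothing: frame size 21, polynomial order 3
    return savgol_filter(z, window_length=21, polyorder=3)
```

Note that `filtfilt` applies the filter forward and backward, giving zero phase distortion, which keeps the smoothed waveform aligned with the original.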

Feature extraction
Fundamental frequency: The fundamental frequency is one of the most important features in speech recognition and describes the actual physical phenomenon of the signal. It is also known as the approximate frequency of the quasi-periodic voiced speech signal. In this research, we used the autocorrelation method, a time-domain method, to estimate the fundamental frequency. It measures how the correlation between two values of the speech signal changes as their separation changes (Ramdinmawii & Mittal, 2016). The autocorrelation function is

A(k) = Σ_{t=0}^{T-1-k} s(t) s(t+k)    (2)

Here, k represents the lag and A(k) is the autocorrelation function at lag k. s(t) is the speech signal, defined for all t, and T represents the window size of the speech signal. Following Equation (2), the autocorrelation takes its highest value at lag 0. After measuring the autocorrelation values, the lag k of the dominant non-zero peak gives the fundamental frequency,

F0 = fs / k    (3)

where fs is the sample rate; from these estimates we computed the maximum, minimum, and average values of the fundamental frequency.

Spectral entropy: It is a measure of the spectral power distribution of a signal and one of the most important features in the speech recognition field. For spectral entropy, the normalized power distribution of the signal is treated as a probability distribution in the frequency domain (Shen et al., 1998). The probability distribution is calculated as

p(i) = |X(i)|² / Σ_j |X(j)|²    (4)

where X(i) is the discrete Fourier transform of the signal. The power spectrum and probability distribution are then used to compute the spectral entropy with Equation (5),

SE = -Σ_i p(i) log₂ p(i)    (5)

For this experiment, we used the maximum, minimum, and average values of spectral entropy.
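The autocorrelation-based F0 estimate can be sketched as below. This is a standard textbook implementation under assumed search limits of 50-500 Hz (the paper does not state its lag search range); the peak lag k gives F0 = fs / k.

```python
import numpy as np

def fundamental_frequency(signal, fs, f_min=50.0, f_max=500.0):
    """Estimate F0 with the time-domain autocorrelation method.
    A(k) = sum_t s(t) s(t+k); the dominant peak lag k gives fs / k."""
    s = signal - signal.mean()
    # Full autocorrelation, keep non-negative lags k >= 0
    ac = np.correlate(s, s, mode="full")[len(s) - 1:]
    lag_min = int(fs / f_max)   # smallest lag to search (highest F0)
    lag_max = int(fs / f_min)   # largest lag to search (lowest F0)
    k = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / k
```

On a clean 100 Hz sine sampled at 16 kHz, the peak lies at lag 160, recovering 100 Hz; on real speech one would take this per frame and then the minimum, maximum, and average as in the paper.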
Spectral flatness: It is a feature of acoustic signals that is useful in digital signal processing (Madhu, 2009). Generally, the ratio of the geometric mean to the arithmetic mean of a power spectrum is known as spectral flatness. It is mainly used to quantify the noise-like or tone-like nature of a sound signal; using spectral flatness, we can find the flat or non-flat parts of a signal. If the ratio of the geometric mean to the arithmetic mean is 1, the power spectrum is perfectly flat, meaning the geometric mean and arithmetic mean are equal. The ratio cannot be more than 1, because the geometric mean is never greater than the arithmetic mean.
Mode frequency: For this study, we computed the mode frequency from the human voice. It is the value that occurs most often in the signal, also known as the most frequent value in the speech signal. When multiple values occur equally often in the speech signal, the mode frequency is taken as the smallest of them.
Mapping: We used the linear interpolation method (Linear Interpolation, 2021) to construct new data points within the given range of known data points. Mainly, it provides interpolated one-dimensional (1D) data. In this research, each audio speech signal was mapped into 6000 values. The audio data mapping using the linear interpolation method is shown in Figure 3. For two known data points (x1, y1) and (x2, y2), the linear interpolation function is

y = y1 + (x - x1)(y2 - y1) / (x2 - x1)    (6)

Mel Frequency Cepstral Coefficient (MFCC): MFCC is a representation of the short-term power spectrum of sound, extracted from the speech signal to perform speech recognition tasks. It is one of the most popular features in speech recognition systems (Pahwa & Aggarwal, 2016; Practical Cryptography, n.d.). The MFCC feature extraction method consists of several steps. Frame blocking and windowing is the first step in extracting the MFCC features from the speech signal: the audio signal is divided into short frames to calculate the coefficients or power spectrum. In this research, the sample rate of every audio file is 16,000 Hz and the signal is divided into frames of 30 ms, giving a frame length of 16,000 × 0.03 = 480 samples (rounded). Adjacent frames overlap by 20 ms (16,000 × 0.02 = 320 samples). The next step applies the Discrete Fourier Transform (DFT) to convert each windowed frame from the time domain into the frequency domain, obtaining a frequency spectrum with Equation (7),

X(f) = Σ_{t=0}^{L-1} x(t) e^(-i2πft/L)    (7)
Here, f is the frequency index, L represents the length of the sequence to be transformed, and t represents the time index. After applying the DFT, the Mel filter bank is calculated. It can be applied in the time domain or in the frequency domain; generally, it is applied in the frequency domain. MFCC uses half-overlapped triangular filters based on band edges to transform the frequency information, mimicking the non-linear way humans perceive sound. The band edges of the filter bank are described as a non-negative increasing row vector in the range [0, sample rate/2] and are specified in Hz.
In MFCC, a non-linear rectification, specified as the logarithm, is applied prior to the discrete cosine transform (DCT), which transforms the Mel frequency coefficients into a set of cepstral coefficients. The log reduces acoustic variation that is not noteworthy for speech recognition. The result is an L-by-M matrix of features, where L is the number of analysis frames the speech signal is partitioned into and M is the number of coefficients returned per frame. In this experiment, MFCC produces 14 coefficients per frame (with a total of 35 frames available). After computing the MFCC coefficients, we used the Z-score normalization method to normalize the MFCC features, which improved the recognition accuracy.
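The mapping step and the frame arithmetic above can be checked with a short sketch. The 6000-sample mapping with `np.interp` and the 480/320-sample frame geometry follow the text; note that 14 coefficients over the resulting 35 frames gives the 490-value second-layer feature block used later as input to the CNN.

```python
import numpy as np

def map_to_length(signal, n=6000):
    """Map an audio signal to a fixed length of 6000 samples
    using 1D linear interpolation, as in the mapping step."""
    old = np.linspace(0.0, 1.0, num=len(signal))
    new = np.linspace(0.0, 1.0, num=n)
    return np.interp(new, old, signal)

# Frame arithmetic for the MFCC step (fs = 16,000 Hz)
fs = 16_000
frame_len = round(fs * 0.03)      # 30 ms window -> 480 samples
overlap = round(fs * 0.02)        # 20 ms overlap -> 320 samples
hop = frame_len - overlap         # 160 samples between frame starts
# Number of full frames in a 6000-sample mapped signal
n_frames = 1 + (6000 - frame_len) // hop   # -> 35 frames
```

With these numbers, n_frames = 1 + (6000 - 480) // 160 = 35, matching the "35 frames" stated above, and 14 × 35 = 490 feature values per utterance.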
Linear predictive coding (LPC): It is one of the most significant techniques in speech recognition (Kim, n.d.). It is also known as source-filter modelling, because it models a signal as a sound source passing through a filter, as in Equation (8).

X(n) = H(n) * E(n)    (8)
Here, E(n) is the sound source, which models the vocal cords; H(n) is the filter, which models the vocal tract; and X(n) is the resulting signal. LPC assumes the filter is a pth-order all-pole filter whose transfer function models the vocal tract transfer function, shown in the following equation:

H(z) = 1 / (1 - Σ_{j=1}^{p} a_j z^(-j))    (9)

Here, a_j are the filter coefficients and p is the order. Over the analysis frame, the speech signal is assumed to be stationary, so each sample is approximated as a linear combination of the previous p samples in Equation (10),

ŝ_k = Σ_{j=1}^{p} a_j s_(k-j)    (10)
Here s_k is the speech signal. The filter coefficients a_j are found by minimizing the mean squared prediction error between s_k and ŝ_k. In this experiment, we used an LPC order of 16, which provides 17 coefficients for every audio file.
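A minimal LPC sketch via the autocorrelation (Yule-Walker) method is shown below. This is one standard way to minimize the mean squared prediction error; the paper does not state which solver it used. An order of 16 yields the 17 coefficients mentioned above (the leading 1 plus 16 predictor terms).

```python
import numpy as np

def lpc_coefficients(signal, order=16):
    """LPC by the autocorrelation (Yule-Walker) method: solve
    R a = r for the predictor a_j that minimizes the mean squared
    error between s_k and its prediction from the previous p samples.
    Returns order + 1 values: [1, -a_1, ..., -a_p]."""
    s = signal - np.mean(signal)
    # Autocorrelation values r[0..order]
    r = np.correlate(s, s, mode="full")[len(s) - 1:len(s) + order]
    # Toeplitz autocorrelation matrix R[i, j] = r[|i - j|]
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))
```

In practice the Levinson-Durbin recursion (or `scipy.linalg.solve_toeplitz`) solves the same system in O(p²) rather than O(p³), but the direct solve keeps the sketch short.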

Creating CNN model
To train our dataset we have used a one-dimensional CNN model built with the Keras library in Python. The concatenated output shapes of the three layers, 8 (first), 490 (second), and 17 (third), make the model's input shape (515, 1), which is taken by the first convolution layer with filter size 64, kernel size 2, and ReLU activation. We have used multiple convolution layers with filter sizes of 64, 128, 512, and 128, and kernel sizes of 2 and 3. Along with the convolution layers, there are batch normalization and max-pooling layers with pool sizes of 2 and 1. To avoid overfitting, we have used dropout layers at rates of 10-30%. After the flatten layer, the model splits into two outputs, gender and region, with dense layers of 2 and 3 units respectively and Softmax activation. Finally, the model has been compiled with the Adam optimizer and Mean Squared Error (MSE) as the loss function. The model is illustrated in Figure 4 and detailed in Table 2.
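The assembly of the (515, 1) input vector from the three feature layers can be sketched as follows; the shapes are taken from the text (8 first-layer statistics, 14 MFCCs × 35 frames = 490 values, 17 LPC coefficients), while the function name is illustrative.

```python
import numpy as np

def assemble_features(layer1, mfcc, lpc):
    """Concatenate the three feature layers into the (515, 1)
    input vector the 1D CNN expects."""
    assert layer1.shape == (8,)      # F0/entropy/flatness/mode stats
    assert mfcc.shape == (14, 35)    # coefficients x frames = 490 values
    assert lpc.shape == (17,)        # order-16 LPC -> 17 coefficients
    vec = np.concatenate([layer1, mfcc.ravel(), lpc])   # 8 + 490 + 17
    return vec.reshape(515, 1)       # channel-last shape for Conv1D
```

Stacking one such vector per audio file gives a training array of shape (n_files, 515, 1), which is fed to the first convolution layer.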

Model validation
We trained for 200 epochs with a batch size of 20, using a 70-30 train-test split. After 200 epochs, the training accuracy is 99.45% for gender and 99.65% for region. The validation accuracy is 93.01% for gender with 0.06 MSE loss, and 97.07% for region with 0.02 MSE loss. The accuracy and loss curves over the 200 epochs are shown in Figure 5.

Result analysis
We have tested our proposed method on a combined dataset built from three audio datasets: TIMIT, RAVDESS, and BGC, where TIMIT is the most widely used dataset for gender recognition and RAVDESS is the most widely used emotional dataset. We have used a multi-output-based 1D CNN model, which gives good accuracy in predicting gender and region, as shown in Figure 1. For this work, we used 1433 audio files as the training set and 615 audio files as the testing set from the combined dataset. The confusion matrices are shown in Table 3 for gender and Table 4 for region on the combined dataset. We also tried many optimizers and loss functions; among the loss functions, MSE works better than the others, so MSE was paired with several optimizers for comparison. Table 5 shows results using MSE with the Adam, SGD, Adamax, Adagrad, RMSprop, and Adadelta optimizers. Among them, Adam works best and gives the highest accuracy. The precision, recall, and F1 score on the combined dataset for the multi-output-based 1D CNN show the excellent performance of our proposed method. Computing these values requires analysing the confusion matrix, which describes the performance of a classification model on test data. When a class is actually positive and classified as positive, it is a True Positive (TP); when it is actually negative and classified as negative, it is a True Negative (TN); when it is actually negative but classified as positive, it is a False Positive (FP); and when it is actually positive but classified as negative, it is a False Negative (FN). Tables 6 and 7 show the precision, recall, and F1 score for gender and region. Throughout this study, it is observed that several kinds of feature extraction models have been used for gender classification in prior research, with different types of datasets such as short vowel datasets (Pahwa & Aggarwal, 2016; Yusnita et al., 2017), digit datasets (Ertam, 2019), and long sentence-based datasets (Chaudhary & Sharma, 2018).
Although those methods performed well on their audio datasets, it was not clarified whether they work well on longer sentences, which we address in this study. Besides, the characteristics of human voices differ across regions because of geographical differences, so the same small set of features may not transfer between regions. In Chaudhary and Sharma (2018) and Archana and Malleswari (2015), the proposed models ensured good performance using a small number of features on a dataset from a particular region to determine gender, but those models would not perform well on other regions' datasets or on a combined dataset because of the small number of features. To solve this problem, we have worked with a combined dataset and a multi-layer feature extraction architecture to extract a sufficient number of features, and we have detected the gender and region of speakers from this dataset. The outcome of the proposed method is shown in Table 8, compared with other existing methods of gender recognition. A comparison for region detection was not possible, as we were unable to find any related work in this field.

Conclusion
In this study, we utilized a three-layer feature extraction method for gender and region classification from human voices. We extracted fundamental frequency, spectral entropy, spectral flatness, and mode frequency in the first layer. In the second layer, a linear interpolation function was used to map the audio data and MFCC was used to extract the features; LPC was calculated in the third layer. To get smooth, noise-free data for correct feature extraction, several filtering and normalization processes, such as a high pass filter, Z-score normalization, and the Savitzky-Golay filter, were used in the pre-processing step. Moreover, a multi-output-based 1D CNN was utilized in this study to recognize gender and region. For this work, we used a combined dataset consisting of three English-language audio datasets recorded in three different regions. The proposed method performed well on both gender and region classification. Our future work will focus on working with more regional datasets and improving the prediction accuracy of our method. Moreover, we want to implement our methodology in a real-time system.

Notes on contributors
Mohammad Amaz Uddin is a researcher as well as a software developer. He has been involved in research work since 2017. The main objectives of his research are the analysis of large datasets and the detection of meaningful patterns. His work in multiple research fields, such as IoT and sound processing, has been published in many scientific journals. His fields of interest are deep learning, machine learning, IoT, image processing, and sound processing. He currently works as an ASP.NET software developer.
Refat Khan Pathan is a researcher in computer vision and image processing. Along with image processing, he has also worked with complex numeric data processing, such as COVID-19 genetic data, IoT, and Big Data. His main research interests are complex data, image processing, and time series data processing. He also has software development experience.
Md Sayem Hossain is a full-time software engineer and a researcher. He has been involved in many research works since 2019 and had a successful publication in 2020. In his research, he has worked extensively with audio. He tries to find unique patterns in data and develop algorithms to overcome problems. He has also worked with many professors to learn their working methods and to develop methods of his own.
Munmun Biswas is currently a lecturer at BGC Trust University Bangladesh. She is a researcher with many publications in the fields of IoT, computer vision, numeric data processing, time series data processing, sound processing, and cloud computing. Her research interests include IoT, machine learning, and Big Data.

Disclosure statement
No potential conflict of interest was reported by the author(s).