Audio features dedicated to the detection and tracking of arousal and valence in musical compositions

ABSTRACT The aim of this paper was to discover which combination of audio features gives the best performance in music emotion detection. Emotion recognition was treated as a regression problem, and a two-dimensional valence–arousal model was used to measure emotions in music. Features extracted by Essentia and Marsyas, tools for audio analysis and audio-based music information retrieval, were used. The influence of different feature sets, low-level, rhythm, tonal, and their combinations, on arousal and valence prediction was examined. The use of a combination of different types of features significantly improves the results compared with using just one group of features. Features particularly dedicated to the detection of arousal and valence separately, as well as features useful in both cases, were found and presented. This paper also presents the process of building emotion maps of musical compositions. The obtained emotion maps provide new knowledge about the distribution of emotions in an examined audio recording, knowledge that until now had been available only to music experts.


Introduction
One of the most important elements when listening to music is the expressed emotion. The elements of music that affect the emotions are timbre, dynamics, rhythm, tempo, and harmony. Systems for searching musical compositions in Internet databases increasingly add an option for selecting emotion to the basic search parameters, such as title, composer, and genre. One of the most important steps in building a system for automatic emotion detection is feature extraction from audio files. The quality of these features, and how well they capture the elements of music that shape a listener's emotional perception, such as rhythm, harmony, melody, and dynamics, has a significant effect on the effectiveness of the built prediction models.
Papers examining features for music emotion recognition can be divided into those using a categorical approach and those using a dimensional approach. Most papers, however, focus on studying features using a classification model (Grekow, 2015; Panda, Rocha, & Paiva, 2015; Saari, Eerola, & Lartillot, 2011; Song, Dixon, & Pearce, 2012). Music emotion recognition combining standard and melodic features extracted from audio was presented in Panda et al. (2015). Song et al. (2012) explored the relationship between emotions and musical features extracted by the MIR toolbox; they compared emotion prediction results for four sets of features: dynamic, rhythm, harmony, and spectral. Baume, Fazekas, Barthet, Marston, & Sandler (2014) evaluated different types of audio features using a five-dimensional support vector regressor in order to find the combination that produces the best performance. Searching for useful features does not only pertain to emotion detection: the issue of feature selection improving classification accuracy for genre classification was presented in Doraisamy, Golzari, Norowi, Sulaiman, & Udzir (2008).
An important paper in the area of music emotion recognition was written by Yang & Chen (2012), who did a comprehensive review of the methods that have been proposed for music emotion recognition. Kim et al. (2010) presented another paper surveying the state of the art in automatic emotion recognition.
The aim of this paper was to discover what combination of audio features gives the best performance with music emotion detection. Emotion recognition was treated as a regression problem and a two-dimensional valence-arousal (V-A) model was used to measure emotions in music. Features extracted by Essentia (Bogdanov et al., 2013) and Marsyas (Tzanetakis & Cook, 2000), tools for audio analysis and audio-based music information retrieval, were used. This article is an extension of a conference paper (Grekow, 2017a) where the problem was preliminarily presented.
The rest of this paper is organized as follows. Section 2 describes the annotated music data set and the emotion model used. Section 3 presents the tools used for feature extraction. Section 4 describes regressor training and evaluation. Section 5 is devoted to evaluating different combinations of feature sets. Section 6 presents features dedicated to the detection of arousal and valence. Section 7 describes the results obtained by emotion tracking of two compositions. Finally, Section 8 summarizes the main findings.

Music data
The annotated data set consisted of 324 six-second fragments of different genres of music: classical, jazz, blues, country, disco, hip-hop, metal, pop, reggae, and rock. The tracks were all 22,050 Hz mono 16-bit audio files in .wav format. The training data were taken from the generally accessible data collection of the Marsyas project. 1 The author selected the samples and shortened them to the first 6 s. This is the shortest length at which experts could detect emotions in a given segment; at the same time, a short segment makes the emotional homogeneity of a segment much more probable.
Data annotation was done by five music experts with a university musical education. Each annotator annotated all records in the data set, which has a positive effect on the quality of the received data (Aljanaki, Yang, & Soleymani, 2016). During annotation of the music samples, we used the two-dimensional V-A model to measure emotions in music (Russell, 1980). The model (Figure 1) consists of two independent dimensions: valence (horizontal axis) and arousal (vertical axis). After listening to a music sample, each annotator had to specify values on the arousal and valence axes in a range from −10 to 10 with a step of 1. On the arousal axis, a value of −10 meant low and 10 high arousal; on the valence axis, −10 meant negative and 10 positive valence.
Determining values on the A-V axes unambiguously designates a point on the A-V plane corresponding to the musical fragment. The data collected from the five music experts were averaged. The number of examples in each quadrant of the A-V emotion plane is presented in Table 1.
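The averaging step can be sketched as follows; the ratings shown are illustrative values, not taken from the actual data set:

```python
def average_annotations(ratings):
    """Average per-sample valence-arousal annotations.

    ratings: list of (valence, arousal) integer pairs in [-10, 10],
    one pair per annotator."""
    n = len(ratings)
    valence = sum(v for v, a in ratings) / n
    arousal = sum(a for v, a in ratings) / n
    return valence, arousal

# hypothetical ratings from five experts for one 6-s fragment
sample_ratings = [(6, -3), (5, -2), (7, -4), (6, -3), (6, -3)]
v, a = average_annotations(sample_ratings)
print(v, a)  # 6.0 -3.0
```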
The Pearson correlation coefficient was calculated to check whether the valence and arousal dimensions are correlated in our music data. The obtained value r = −0.03 indicates that arousal and valence values are not correlated and that the music data are well spread across the quadrants of the A-V emotion plane. This is an important element according to the conclusions formulated in Aljanaki et al. (2016).
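The correlation check can be reproduced with a plain implementation of the Pearson coefficient; the value lists below are placeholders standing in for the averaged valence and arousal annotations:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# placeholder averaged annotations; a value near 0 indicates the
# two emotion dimensions are uncorrelated in the data set
valence = [6.0, -3.2, 1.4, -7.0, 4.2]
arousal = [2.0, 5.6, -8.0, -1.2, 3.0]
print(pearson_r(valence, arousal))
```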

Feature extraction
For feature extraction, Essentia (Bogdanov et al., 2013) and Marsyas (Tzanetakis & Cook, 2000), tools for audio analysis and audio-based music information retrieval, were used.

Figure 1. A-V emotion plane: Russell's circumplex model (Russell, 1980).

Marsyas software, written by George Tzanetakis, is implemented in C++ and retains the ability to output feature extraction data to ARFF format. With this tool, the following features can be extracted: Zero Crossings, Spectral Centroid, Spectral Flux, Spectral Rolloff, Mel-Frequency Cepstral Coefficients (MFCC), and chroma features, 31 features in total. A full list of features extracted by Marsyas is available on the web site. 2 For each of these basic features, Marsyas calculates four statistic features:
(1) The mean of the mean (calculate the mean of the feature over 20 frames, and then calculate the mean of this statistic over the entire segment)
(2) The mean of the standard deviation (calculate the standard deviation of the feature over 20 frames, and then calculate the mean of these standard deviations over the entire segment)
(3) The standard deviation of the mean (calculate the mean of the feature over 20 frames, and then calculate the standard deviation of these values over the entire segment)
(4) The standard deviation of the standard deviation (calculate the standard deviation of the feature over 20 frames, and then calculate the standard deviation of these values over the entire segment).
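The four statistics can be sketched with NumPy; here `frames` is a hypothetical array of per-frame values for one basic feature, and the 20-frame window size follows the description above (a sketch of the scheme, not Marsyas's actual implementation):

```python
import numpy as np

def texture_statistics(frames, window=20):
    """Compute the four Marsyas-style segment statistics for one feature.

    frames: 1-D array of per-frame feature values over a segment."""
    # split the frame sequence into consecutive 20-frame texture windows,
    # discarding any incomplete trailing window
    n = len(frames) // window
    windows = np.asarray(frames[: n * window]).reshape(n, window)
    win_mean = windows.mean(axis=1)  # per-window means
    win_std = windows.std(axis=1)    # per-window standard deviations
    return {
        "mean_of_mean": win_mean.mean(),
        "mean_of_std": win_std.mean(),
        "std_of_mean": win_mean.std(),
        "std_of_std": win_std.std(),
    }
```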
Essentia is an open-source C++ library created at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Essentia contains a number of executable extractors computing music descriptors for an audio track (spectral, time-domain, rhythmic, and tonal descriptors) and returns the results in YAML or JSON format. The features extracted by Essentia are divided into three groups: low-level, rhythm, and tonal features. A full list of features is available on the web site. 3 Essentia also calculates many statistic features: the mean, geometric mean, power mean, median of an array, and all its moments up to the 5th order, its energy, and the root mean square (RMS). To characterize the spectrum, the flatness, crest, and decrease of an array are calculated. The variance, skewness, and kurtosis of the probability distribution, and a single Gaussian estimate, are calculated for a given list of arrays.
The previously prepared music data set, labelled with A-V values, served as input data for the feature extraction tools. The obtained lengths of the feature vectors depended on the package used: Marsyas, 124 features; Essentia, 530 features.
Regressor training and evaluation

The performance of regression was evaluated using the 10-fold cross-validation (CV-10) technique. The whole data set was randomly divided into 10 parts: 9 of them for training and the remaining 1 for testing. The learning procedure was executed a total of 10 times on different training sets. Finally, the 10 error estimates were averaged to yield an overall error estimate.
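The CV-10 procedure can be sketched as a generic k-fold split (this is an illustration of the scheme, not the exact implementation used in the experiments):

```python
import random

def kfold_splits(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # deal the shuffled indices into k disjoint folds
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# average a per-fold error estimate over the 10 folds
errors = []
for train, test in kfold_splits(324, k=10):
    # ...train the regressor on `train`, measure its error on `test`...
    errors.append(len(test))  # placeholder for a real per-fold error value
overall_error = sum(errors) / len(errors)
```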
Predicting arousal is a much easier task for regressors than predicting valence for both sets of extracted features (Essentia, Marsyas), and the values predicted for arousal are more precise. The R² values for arousal were comparable (0.79 and 0.73), but the features describing valence were much better when using Essentia for audio analysis. The obtained R² = 0.58 for valence is much higher than the R² = 0.25 using Marsyas features. In Essentia, tonal and rhythm features greatly improve the prediction of valence; these features are not available in Marsyas, and thus Essentia obtains better results.
One can notice the significant role of the attribute selection phase, which generally improves prediction results. Before attribute selection, Marsyas features outperform Essentia features for arousal detection: R² = 0.63 and MAE = 0.13 for Marsyas are better results than R² = 0.48 and MAE = 0.18 for Essentia. However, after selecting the most important attributes, Essentia turns out to be the winner with R² = 0.79 and MAE = 0.09.
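The two reported measures can be computed as follows. Note that R² is taken here as the coefficient of determination, one common definition of the statistic; this is a sketch, not tied to the exact evaluation tool used in the experiments:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```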

Evaluation of different combinations of feature sets
The effect of various combinations of the Essentia feature sets, low-level (L), rhythm (R), and tonal (T), on the performance obtained with the SMOreg algorithm was evaluated. The performance of regression was evaluated using the CV-10 technique. Attribute selection, with the attribute evaluator WrapperSubsetEval and the search method BestFirst, was also used.
The obtained results, presented in Table 3, indicate that using all groups of features (low-level, rhythm, tonal) resulted in performance that was the best, or equal to the best, among the combined feature sets. The best results are marked in bold. Detection of arousal using the set L+R (low-level and rhythm features) gives results equal to using all groups. Detection of valence using the set L+T (low-level and tonal features) gives only slightly worse results than using all groups. The use of the individual feature sets L, R, or T did not achieve better results than their combinations. Worse results were obtained when using only tonal features for arousal (R² = 0.53 and MAE = 0.14) and only rhythm features for valence (R² = 0.15 and MAE = 0.15).
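The evaluated combinations can be enumerated mechanically. In the sketch below the feature matrices are random placeholders standing in for the real low-level, rhythm, and tonal feature groups, with arbitrary column counts:

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
# placeholder feature matrices: rows = 324 samples, columns = features
groups = {
    "L": rng.random((324, 5)),  # low-level (placeholder width)
    "R": rng.random((324, 3)),  # rhythm (placeholder width)
    "T": rng.random((324, 4)),  # tonal (placeholder width)
}

# all seven non-empty combinations: L, R, T, L+R, L+T, R+T, L+R+T
combos = [c for r in (1, 2, 3) for c in combinations("LRT", r)]
for combo in combos:
    X = np.hstack([groups[g] for g in combo])
    # ...train a regressor on X under CV-10, record its R^2 and MAE...
    print("+".join(combo), X.shape)
```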
Combining the feature sets L+R (low-level and rhythm features) improved regressor results in the case of arousal; combining L+T (low-level and tonal features) improved regressor results in the case of valence. In summary, low-level features are very important in the prediction of both arousal and valence. Additionally, rhythm features are important for arousal detection, and tonal features help considerably in detecting valence. The use of the individual feature sets L, R, or T alone does not give good results.

Table 4 presents the two sets of selected features which obtained the best performance in detecting arousal using the SMOreg algorithm (Section 5). Features marked in bold are in both groups. Notice that after adding the tonal features T to group L+R, some of the features were replaced by others and some remained unchanged. Features found in both groups seem to be particularly useful for detecting arousal. Different statistics of the spectrum and mel bands turned out to be especially useful: Spectral Energy, Entropy, Flux, Rolloff, Skewness, and Melbands Crest and Kurtosis. Three rhythm features also belong to the group of more important features, because both sets contain Danceability, Onset Rate, and Beats Loudness Band Ratio.

Table 5 presents the two sets of selected features which obtained the best performance in detecting valence using the SMOreg algorithm (Section 5). Particularly important low-level features, found in both groups, were Spectral Energy and Zero Crossing Rate, as well as Mel Frequency Cepstrum Coefficients (MFCC) and Gammatone Feature Cepstrum Coefficients (GFCC). Particularly important tonal features, which describe the key, chords, and tonality of a musical excerpt, were Chords Strength, Harmonic Pitch Class Profile (HPCP) Entropy, and Key Strength.

Selected features dedicated to the detection of arousal and valence
Comparing the sets of features dedicated to arousal (Table 4) and valence (Table 5), we notice that there are many more statistics of the spectrum and mel bands in the arousal set than in the valence set. MFCC and GFCC were useful for detecting valence but were not selected for arousal detection.

Emotion maps
The result of emotion tracking is an emotion map. The best models obtained for predicting arousal and valence were used to analyse the musical compositions. The compositions were divided into 6-s segments with a 3/4 overlap. For each segment, features were extracted, and the models for arousal and valence were applied.
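The segmentation scheme described above (6-s windows with a 3/4 overlap, i.e. a 1.5-s hop) can be sketched as:

```python
def segment_times(duration, seg_len=6.0, overlap=0.75):
    """Start/end times (in seconds) of analysis segments.

    A 3/4 overlap of 6-s segments gives a hop of 1.5 s."""
    hop = seg_len * (1.0 - overlap)
    segments = []
    start = 0.0
    while start + seg_len <= duration:
        segments.append((start, start + seg_len))
        start += hop
    return segments

# e.g. a 12-second excerpt yields segments starting every 1.5 s
print(segment_times(12.0))
```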
The predicted values are presented in the figures in the form of emotion maps. For each musical composition, the obtained data is presented in four different ways:
(1) Arousal-Valence over time
(2) Arousal-Valence map
(3) Arousal over time
(4) Valence over time.
Simultaneous observation of the same data in four different projections enabled us to accurately track changes in valence and arousal over time, as well as to track the location of a prediction on the A-V emotion plane. Figures 2 and 3 show the emotion maps of two compositions: the song Let It Be by Paul McCartney (The Beatles) and Piano Sonata No. 8 in C minor, Op. 13 (Pathetique), 2nd movement, by Ludwig van Beethoven.
The emotion maps present two different emotional aspects of these compositions. The first significant difference is the distribution over the quadrants of the A-V map. In Let It Be (Figure 2b), the emotions of quadrants Q4 and Q1 (high valence and low-high arousal) dominate. In Sonata Pathetique (Figure 3b), the emotions of quadrant Q4 (low arousal and high valence) dominate, with an incidental emergence of emotions of quadrant Q3 (low arousal and low valence). Another noticeable difference is the distribution of arousal over time. Arousal in Let It Be (Figure 2c) has a rising tendency over the course of the entire song and varies from low to high. The last sample, with low arousal, is the exception; the low arousal is the result of the slow tempo at the end of the composition. In Sonata Pathetique (Figure 3c), in the first half (s. 0-160) arousal has very low values, and in the second half (s. 160-310) arousal increases incidentally but remains in the low value range.
The third noticeable difference is the distribution of valence over time. Valence in Let It Be (Figure 2d) remains in the high (positive) range with only small fluctuations. In Sonata Pathetique (Figure 3d), valence for the most part remains in the high range, but it also has several declines (s. 90, 110, 305), which makes it more diverse.
Arousal and valence over time were dependent on the music content. Even in a short fragment of music, these values varied significantly. From the course of arousal and valence, it appears that Let It Be is a song of a decisively positive nature with a clear increase in arousal over time, while Sonata Pathetique is mostly calm and predominantly positive.

Conclusion
In this article, the usefulness of audio features for emotion detection in music files was studied. Different feature sets were used to test the performance of regression models built to detect arousal and valence. Conducting the experiments required building a database, annotating samples by music experts, constructing regressors, selecting attributes, and evaluating various feature groups. Features extracted by Essentia, due to their variety and quantity, are better suited for detecting emotions than features extracted by Marsyas.
The influence of different feature sets, low-level, rhythm, tonal, and their combinations, on arousal and valence prediction was examined. The use of a combination of different types of features significantly improved the results compared with using just one group of features. Features particularly dedicated to the detection of arousal and valence separately, as well as features useful in both cases, were found and presented.
In conclusion, low-level features are very important in the prediction of both arousal and valence. Additionally, rhythm features are important for arousal detection, and tonal features help a lot for detecting valence.
The obtained results confirm the value of creating new mid- and high-level features that describe the elements of music, such as rhythm, harmony, melody, and dynamics, that shape a listener's emotional perception. These are the features that can improve the effectiveness of automatic emotion detection in music files.
The result of applying the regressors is emotion maps of the musical compositions. They provide new knowledge about the distribution of emotions in musical compositions, knowledge that until now had been available only to music experts. The obtained emotion maps are algorithm-based and may contain a certain inaccuracy; their precision depends on the accuracy of the regressors used to recognize emotions. This deserves to be improved in future work dedicated to the construction of emotion maps.

Notes
1. http://marsyas.info/downloads/datasets.html
2. http://marsyas.info/doc/manual/marsyas-user/bextract.html
3. http://essentia.upf.edu/documentation/algorithms_reference.html

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This research was realized as part of study no. S/WI/3/2013.

Notes on contributor
Jacek Grekow received an MSc degree in Computer Science from the Technical University in Sofia, Bulgaria, in 1994, and a PhD degree from the Polish-Japanese Institute of Information Technology in Warsaw, Poland, in 2009. He also obtained a Master of Arts degree from The Fryderyk Chopin University of Music in Warsaw in 2007. His primary research interests are music information retrieval, emotions in music, and music visualization.