High accuracy human activity recognition using machine learning and wearable devices’ raw signals

ABSTRACT Human activity recognition (HAR) is vital in a wide range of real-life applications, such as health monitoring of elderly people, abnormal behaviour detection and smart home management. HAR systems can employ smart human-computer interfaces and be parts of active, intelligent surveillance systems. The increasing use of high-tech mobile and wearable devices, such as smart phones, smart watches and smart bands, can be a key element in building high accuracy models, as they can provide a tremendous number of signals. This research aims to develop and test a machine learning (ML) model which can successfully recognize a performed activity using raw signals obtained by wearable devices. The Photoplethysmography – Daily Life Activities (PPG-DaLiA) dataset contains data related to 15 individuals wearing physiological and motion sensors. PPG-DaLiA was used as an input to a custom data segmentation model to obtain the respective training and testing datasets. Overall, 23 well-established ML models were employed. The weighted and the fine k-nearest neighbours, the fine Gaussian support vector machines and the bagged trees were the algorithms that achieved the best performance, with a very high accuracy level.


Introduction
Human activity recognition (HAR) is currently one of the most popular research fields among the machine learning (ML) applications (Jobanputra et al., 2019). The widespread use of the internet of things and the development of capable smart wearable devices, together with the introduction of new ML and deep learning (DL) algorithms, are the main key factors that have contributed to the rapid spread of HAR systems over the last years (Hassan et al., 2018; Wan et al., 2020). For an increasing number of people, the 'always attached to the body' devices are becoming essential in everyday life. Apart from their main functionality, smart phones, smart watches and smart bands can collect, in almost real time, vast data related to the activity performed by a specific person (Lara & Labrador, 2012; Psathas et al., 2020). HAR models integrate these data with ML algorithms to recognize the underlying activity (Boukhechba et al., 2019; Psathas et al., 2020). Even a simple commercial smart phone or smart watch may contain a large variety of embedded sensors, such as accelerometers, gyroscopes, magnetometers, oximeters, electrocardiographs and light sensors, that could provide all the data required by an HAR system (Chen et al., 2018; Ronao & Cho, 2016; Wan et al., 2020).
This research used a public dataset acquired by two wearable devices to recognize 9 real-life activities performed by 15 individuals. This dataset was chosen due to its large number of measurements for long periods (2.5 h for each subject) and since it comprises a large number of 'close to real-life' activities.
In Psathas et al. (2020) the same dataset was used. The overall accuracy was as high as 92.8%, achieved by combining the bagged trees (BGT) algorithm with the default settings and properties of the Photoplethysmography – Daily Life Activities (PPG-DaLiA) dataset.
This research is a major extension of Psathas et al. (2020) and it has resulted in a significantly increased level of overall accuracy. It has the following characteristics: (a) a subset of the original dataset has been considered (only the raw data have been used) and (b) a variable sampling technique has been employed in the activity distinction and data segmentation processes.
The rest of this paper is organized as follows: Section 2 presents previous indicative HAR research studies. Section 3 describes the PPG-DaLiA dataset, while Section 4 describes the proposed data preprocessing and preparation methodology. Section 5 presents the classification approaches used, together with the experimental results. Section 6 concludes this paper and discusses the results and future work.

Literature review
HAR is a challenging and growing field of research in ML that employs sensors and computer vision systems (Jobanputra et al., 2019). Over the past years, HAR models mainly relied on sensor-based systems, but the rapid evolution of computer vision (CV) and DL has produced sophisticated algorithms that can provide higher accuracy. CV-based algorithms, however, still face challenges due to the complexity caused by the variability of the background in everyday environments. More specifically, there are limitations due to the variable illumination conditions of each scene during the day, which reduce the ability to track and position each subject (Ferrari et al., 2020; Wan et al., 2020; Xu et al., 2019). Moreover, the aforementioned techniques are limited by the computational resources of the used infrastructure, and by the diversified, dynamic environment in which such systems need to operate (Ferrari et al., 2020; Junker et al., 2008). Pirttikangas et al. (2006), in their pioneering research, tested a model that used several multilayer perceptron and k-nearest neighbours (k-NN) algorithms to recognize 17 activities, achieving an overall accuracy of 90.61%. Casale et al. (2011) used a wearable device and applied a random forest classification algorithm to model five distinct activities (walking, climbing stairs, talking with a person, standing still and working on the computer), achieving an overall accuracy of 90%. A few years later, Ahmed and Loutfi (2013) developed an HAR system, using case-based reasoning, support vector machines (SVMs) and neural networks (NNs), achieving overall accuracies of 0.86, 0.62 and 0.59, respectively, for three specific activities (breathing, walking or running, sitting and relaxing). Wang et al. (2013) introduced a wearable system using passive tags, which achieved an accuracy of 93.6%. Li et al. (2016) achieved an accuracy of 96% and an F-score of 0.74 for 10 medical activities.
In fact, Li et al. introduced an activity recognition system for complex and dynamic medical settings that uses passive radio-frequency identification (RFID) technology. Shinmoto et al. (2016) developed a system with RFID tags that could recognize two simple but very important classes (bed and chair exit alerts for elders), achieving an overall accuracy of 94%. Brophy et al. (2018) proposed a hybrid convolutional neural network and SVM model with an accuracy of 92.3% for four activities (walking and running on a treadmill, low and high resistance bike exercise). Ryoo et al. (2018) proposed a backscattering activity recognition network comprising passive RF tags, capable of recognizing daily human activities with an average error of 6%. In Boukhechba et al. (2019), a DL NN was developed for five activities (standing, walking, jogging, jumping and sitting), with an F1 score of 0.86. Psathas et al. (2021) developed the wireless identification and sensing platform infrastructure with wearable RFID tags to recognize four activities in a controlled and monitored hospital room, with a success rate of 98.93% on the training data.

Dataset description
The PPG-DaLiA dataset was published by Reiss et al. (2019). It is a publicly available multimodal dataset that contains physiological and motion data recorded from a wrist- and a chest-worn wearable device. It contains data for seven males and eight females, aged between 21 and 55 years, performing eight distinct activities. The transition time between activities was also recorded and labelled as the 'zero' activity. All the activities were performed as close to real-life conditions as possible, and the recorded signals for each subject lasted approximately 2.5 h.
Raw sensor data were recorded using RespiBAN professional (RespiBAN, 2021) and Empatica E4 (Empatica, 2021) wearable devices. The sensor data provided by the two devices are described in Table 1.
For each subject, the following characteristics were also mentioned in the dataset: (1) Age (age of the subject).
(6) Fitness level (FL), which is an index on a scale of 1-6 related to how often the subject exercises. An FL value of 1 is assigned to a person that exercises less than once a month, whereas a value of 6 corresponds to 5-7 times a week. The profile of the subjects that participated in this research is presented in Table 2, where i = 1-15, j = {m (for Male), f (for Female)}, SKT is skin type and FNT is fitness type. The subjects are ordered by gender.
The recorded activities, the average duration of each activity and their descriptions are given in Table 3.
According to the developers of the dataset, there was only one major hardware issue during the data collection process, due to which the recorded data about person S6 are only valid for the first 1.5 h.
The available number of signal samples per group of factors (as given in Table 1) with the test duration and the number of distinct activity labels for each subject are given in Table 4.
The PPG-DaLiA dataset includes one more factor, which was not originally measured by any sensor but was calculated by the dataset developers: the heart rate in beats per minute (bpm), calculated using the raw RespiBAN electrocardiogram signals. This extra factor, named ECGbpm, was calculated using a sliding window with a length of 8 s and a shift of 2 s. The segmentation technique will be explained in detail in the next section, but this is why the same sliding window and offset were also used in the past research of our team (Psathas et al., 2020). In this extended version, we are not employing the ECGbpm factor, in order to use an advanced variable sampling and segmenting approach.

Table 3. Description of activities included in the PPG-DaLiA dataset.

Activity                     Class/Activity ID   Duration (min)   Description
Sitting still                1                   10               Sitting still while reading (motion artefact-free baseline)
Ascending/Descending stairs  2                   5                Climbing six floors up and going down again, repeating this twice

Data segmentation, preprocessing and classification methodology
The dataset chosen in this research was populated by millions of recorded raw signals originally acquired by two wearable devices using different sampling rates per factor and device. It is impossible to use this dataset before aligning and transforming all raw data into homogenized data vectors with the same number of instances, regardless of the sampling rate.

Variable data segmentation using fast Fourier transformation
Fourier analysis converts a signal from its original domain (often time or space) to a representation in the frequency domain and vice versa. The fast Fourier transform (FFT) is a low-complexity algorithm that computes the discrete Fourier transform (DFT) of a sequence. It is an extremely powerful mathematical tool that allows observing the obtained signals in the frequency domain instead of the time domain, in which several difficult problems become very simple to analyse (Brigham, 1988; Nussbaumer, 1981). The DFT is obtained by decomposing a sequence of values into components of different frequencies. Any periodic function g(x) integrable in the domain D = [−π, π] can be written as an infinite sum of sines and cosines as follows:

g(x) = Σ_{n=−∞}^{+∞} c_n e^{jnx},   c_n = (1/2π) ∫_{−π}^{π} g(x) e^{−jnx} dx,

where e^{jθ} = cos(θ) + j sin(θ). The idea that a function can be broken down into its constituent frequencies is a powerful one and the backbone of the Fourier transform (Zhang, 2015). An FFT rapidly computes such transformations by factorizing the DFT matrix into a product of sparse (mostly zero) factors. To apply the FFT to our raw data, we need to define the data size N and the time shift D of each segment, as shown in Figure 1.
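As an illustration of this segmentation, the following sketch splits a raw signal into fixed windows and computes the FFT magnitude of each one. The function name, sampling rate and window parameters are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def segment_fft(signal, fs, window_s, shift_s):
    """Split a 1-D signal into windows of `window_s` seconds shifted by
    `shift_s` seconds and return the FFT magnitude of each window.
    Illustrative sketch; names and parameters are assumptions."""
    n = int(window_s * fs)        # samples per window (the data size N)
    d = int(shift_s * fs)         # samples per shift (the time shift D)
    segments = []
    for start in range(0, len(signal) - n + 1, d):
        window = signal[start:start + n]
        # rfft keeps only the non-negative frequencies of a real signal
        segments.append(np.abs(np.fft.rfft(window)))
    return np.array(segments)

# Example: a 1 Hz sine sampled at 32 Hz, 8 s windows, 2 s shift
fs = 32
t = np.arange(0, 60, 1 / fs)
x = np.sin(2 * np.pi * 1.0 * t)
feats = segment_fft(x, fs, window_s=8, shift_s=2)
print(feats.shape)  # (27, 129): 27 windows, 129 frequency bins
```

With an 8 s window at 32 Hz the frequency resolution is 32/256 = 0.125 Hz, so the 1 Hz tone appears as a peak in bin 8 of each window's spectrum.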
As mentioned in the dataset description, the dataset developers used a combination of an 8 s window size and a 2 s window shift to calculate the ECGbpm factor. The number of instances per activity and per subject obtained using the above parameters is presented in Table 5, while the classification results of Psathas et al. (2020) are presented in Figure 2.
Although the results shown in the above confusion matrix are really good, it is clear that most of the misclassified cases are related to activity 'zero', which is the transition activity. To improve the classification further, a variable window length and window shift segmentation approach is proposed in this paper. This is another novelty of this research effort.
According to Ferrari et al. (2020), the length of each segment cannot be a random value. It has to be carefully selected based on the type of the performed activity and its duration. Ferrari et al. (2020) and Micucci et al. (2017) proposed a window length of around 3 s when the performed activity is 'slow walking'. They both agree on the facts that (a) the cadence of an average person walking is within 90-130 steps/min (Medrano et al., 2016) and (b) at least one full walking cycle (two steps) is preferred in each window. They also conclude that a very small window size could produce several segments 'without any activity at all' if the performed activity is a soft one. On the other hand, a big window size could produce segments with high overlapping unless a very high offset is selected. In that case, several activity changes may be lost due to the fast shifting of windows.
It is more than clear that the selection of window length and offset is a key factor to build an accurate HAR system.
It is not always easy to determine the nature of the monitored activities or their duration, so a variable segmentation methodology (VSM) is proposed in this paper. This is one of the main novelties of this research. According to the VSM, the system receives the sensors' signals, which are buffered. It then creates multiple simultaneous segments using several predefined combinations of window lengths and offsets. As a result, each moment is represented by more than one segment; the number of segments for each moment is equal to the total number of window length-offset combinations.
The proposed VSM is presented in Algorithm 1 (Detailed steps of the proposed VSM). One of the key features of the proposed methodology is that at each interval all combinations of window length and offset are used and tested. If the results of all combinations point to the same class, then the system has a strong decision. If this is not the case, then the accuracy threshold is checked to decide whether one class can be accepted.
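As a sketch of the consensus step described above, the following function combines the class predictions of the different window length-offset combinations for one interval. The function name, the vote-share interpretation of the accuracy threshold and the use of None for a rejected decision are our own assumptions, not the authors' implementation.

```python
from collections import Counter

def vsm_decision(predictions, threshold=0.0):
    """Combine the class predictions produced by the different
    window-length/offset combinations for one time interval.

    `predictions` holds one class label per combination. With
    threshold = 0 (zero error tolerance) a class is accepted only if
    every combination agrees; otherwise the decision is rejected
    (returned as None). A threshold t in (0, 1] accepts the majority
    class when its share of the votes is at least 1 - t.
    Hypothetical sketch of the consensus step, not the authors' code."""
    votes = Counter(predictions)
    top_class, top_count = votes.most_common(1)[0]
    agreement = top_count / len(predictions)
    if agreement >= 1.0 - threshold:
        return top_class
    return None

print(vsm_decision([3, 3, 3, 3], threshold=0.0))   # 3: unanimous decision
print(vsm_decision([3, 3, 0, 3], threshold=0.0))   # None: rejected
print(vsm_decision([3, 3, 0, 3], threshold=0.3))   # 3: 75% agreement >= 70%
```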
The testing of the proposed methodology is presented in Section 5, using 25 different length-offset combinations, which are described in the next section.

Data preprocessing
The PPG-DaLiA dataset consists of 15 Python serialized (.pkl) files. Each file includes all sensor signals and the performed activity of each subject (Figure 3(a)).
The first step is the development of the necessary Python scripts to de-serialize the provided .pkl files and write the data vectors into plain text files. Python writes the respective content in *.txt files for each *.pkl file, 135 in total (see Figure 3(b)). All text files have one column, except for the i_ACC_chest.txt and the i_ACC_wrist.txt files, which have three columns. The abbreviations are given in Table 1, and i represents the number of the subject (i = 1, … , 15). To facilitate the rest of the process, we have developed and executed an original Matlab script named Matlab facilitator (MLB_FCL).
MLB_FCL reads all txt files created by the Python scripts and prepares the data vectors for the segmentation procedure described in Section 4.1. For this specific experiment, a combination of 25 window lengths and offsets was used: five values for the window length l = [4, 5, 6, 7, 8] and five for the window offset j = [1, 2, 3, 4, 5]. First, each *.txt file was read and converted to a Matlab table. The i_ACC_chest.txt and i_ACC_wrist.txt files (three accelerometer axes x, y, z) were converted to three tables, one for every column (Figure 3). Every feature is segmented with each combination of window length l and window offset j (Figure 3(d)) and the corresponding tables are created. For each time-series segment of Activity_i_lj the dominant value is held (Figure 3(d)), while the FFT is applied to each time-series segment of the remaining features (Figure 3). The final dataset, after developing all combinations and merging all tables into one final table, counts 1,386,994 data vectors with 11 independent variables, as given in Table 1, and one labelled variable, which is the performed activity.
The procedure of de-serialization and processing of all files to create the final and usable dataset is illustrated in Figure 3.
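The 'dominant value' step for the Activity channel can be sketched as follows. The function names and parameters are illustrative assumptions, not the actual MLB_FCL script.

```python
import numpy as np

def dominant_label(labels):
    """Return the most frequent (dominant) activity label in a segment,
    mirroring the 'dominant value is held' step for the Activity channel."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

def segment_labels(activity, fs, window_s, shift_s):
    """Assign one dominant label per window of `window_s` seconds,
    shifted by `shift_s` seconds. Illustrative sketch only."""
    n, d = int(window_s * fs), int(shift_s * fs)
    return [dominant_label(activity[i:i + n])
            for i in range(0, len(activity) - n + 1, d)]

# Example: a 4 Hz label stream, 20 s of activity 1 then 20 s of activity 2
labels = np.array([1] * 80 + [2] * 80)
out = segment_labels(labels, fs=4, window_s=8, shift_s=2)
print(out)  # early windows -> 1, late windows -> 2
```

Windows that straddle the activity change receive the label that dominates the window, which is exactly why transition moments are the hardest cases to classify.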
A total of 23 classification algorithms have been employed, namely: fine tree, medium tree, coarse tree, linear discriminant, quadratic discriminant, linear SVM, quadratic SVM, cubic SVM, fine Gaussian SVM, medium Gaussian SVM, coarse Gaussian SVM, cosine k-NN, cubic k-NN, weighted k-NN, fine k-NN, medium k-NN, Gaussian naïve Bayes, kernel naïve Bayes, boosted trees, BGT, subspace discriminant, subspace k-NN and random under-sampling boosted trees. The following sections present only the architecture and the results of the four most robust algorithms, i.e. those with the highest values of the performance indices.

Weighted k-nearest neighbours (Wk-NN)
Classifying query points based on their distance to specific points (or neighbours) in a training dataset can be a simple yet effective process. The k-NN is a lazy and non-parametric learning algorithm. Given a set X of n points and a distance function, the k-NN search finds the k closest points to a query point or a set of them (Hechenbichler & Schliep, 2004). Dudani (1978) introduced a weighted voting method called the distance-weighted (DW) k-NN rule. According to this approach, the closer neighbours are weighted more heavily than the farther ones using the DW function. Let d_i be the distance of the i-th nearest neighbour of the query x′, with d_1 and d_k the distances of the nearest and the k-th nearest neighbour, respectively. The weight w_i for the i-th nearest neighbour is defined as:

w_i = (d_k − d_i) / (d_k − d_1)  if d_k ≠ d_1,  and  w_i = 1 otherwise.

Finally, the classification result of the query is determined by majority weighted voting:

ŷ = argmax_c Σ_{i=1}^{k} w_i · I(y_i = c),

where I(·) is the indicator function and y_i is the class of the i-th nearest neighbour. Based on these weights, a neighbour with a shorter distance is weighted more heavily than one with a greater distance: the nearest neighbour is assigned a weight equal to 1, whereas the furthest one is assigned a weight of 0, and the weights of the others are scaled linearly to the interval in between.
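A minimal numpy sketch of the distance-weighted voting described above (illustrative only, not the Matlab implementation used in the experiments):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, query, k=5):
    """Weighted k-NN with Dudani's distance-weighted voting: the nearest
    neighbour gets weight 1, the k-th gets weight 0, and the rest are
    scaled linearly in between. Minimal sketch, not a library routine."""
    dists = np.linalg.norm(X_train - query, axis=1)
    idx = np.argsort(dists)[:k]
    d = dists[idx]
    if d[-1] == d[0]:                      # all k neighbours equidistant
        w = np.ones(k)
    else:
        w = (d[-1] - d) / (d[-1] - d[0])   # Dudani (1978) weights
    votes = {}
    for label, weight in zip(y_train[idx], w):
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

# Toy example: two 2-D clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0]])
y = np.array([0, 0, 0, 1, 1])
print(weighted_knn_predict(X, y, np.array([0.2, 0.1]), k=3))  # -> 0
```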

Fine k-NN
The k-NN is a lazy and non-parametric classification method (Cover & Hart, 1967). The k-NN classifier is a traditional classification rule. An unlabelled vector (a query or test point) is classified by assigning the label that is most frequent among the k training samples nearest to that query point. Given a set X of n points and a distance function, the k-NN search finds the k closest points to a query point or a set of them (Hechenbichler & Schliep, 2004). A commonly used distance metric for k-NN is the Euclidean distance, calculated as

d(x, y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)² ),

while other distance metrics used, depending on the occasion, are the Hamming distance and the Minkowski distance (Jaskowiak & Campello, 2011).
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification (Jaskowiak & Campello, 2011), but they may make the boundaries between classes less distinct. A good k can be selected by various heuristic techniques. The accuracy of the k-NN algorithm can be negatively affected by the presence of noisy features, and over the years many attempts have been made to make the algorithm more robust (Nigsch et al., 2006). Fine k-NN refers to the nearest neighbour classifier that makes finely detailed distinctions between classes, with the number of neighbours set to 1, and uses the Euclidean distance to determine the nearest neighbours (Johnson & Yadav, 2018).
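A minimal sketch of the fine (k = 1) nearest-neighbour rule with the Euclidean distance (illustrative only):

```python
import numpy as np

def fine_knn_predict(X_train, y_train, query):
    """Fine k-NN: a 1-nearest-neighbour classifier using the Euclidean
    distance d(x, y) = sqrt(sum_i (x_i - y_i)^2). Minimal sketch."""
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    return y_train[np.argmin(dists)]

# Toy example: three labelled points on the diagonal
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0, 1, 2])
print(fine_knn_predict(X, y, np.array([1.2, 0.9])))  # -> 1
```

With k = 1 the decision boundary follows every training point exactly, which is what makes the classifier 'fine' but also sensitive to noisy samples.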

Fine Gaussian SVMs
An SVM algorithm classifies the data by finding the best hyperplane that separates all the data of one class from those of the other class (Gutierrez, 2015). It is a supervised learning technique based on statistical learning theory, employed for classification and regression tasks (Boser et al., 1992). For the binary separation of vectors into classes, the decision function is defined as

f(x) = w · x + b,

where w is the weight vector of the function and b is the bias. In a linearly separable case, given a training dataset x_i ∈ R^n with corresponding labels y_i ∈ {+1, −1}, (i = 1, … , N), a separator can be defined as follows:

y_i (w · x_i + b) ≥ 1,  i = 1, … , N.

The separating hyperplane can be determined using the Lagrange relaxation procedure, by maximizing

L(a) = Σ_{i=1}^{N} a_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} a_i a_j y_i y_j (x_i · x_j)

subject to Σ_{i=1}^{N} a_i y_i = 0 and a_i ≥ 0, i = 1, 2, …, N, where a = [a_1, …, a_N] are the non-negative Lagrange multipliers. The solution to the above problem gives the location of the separating hyperplane. In case the data are non-linear, a non-linear class separator can be created by applying a kernel function (Hofmann et al., 2008). Kernels project the input vector space into a feature vector space of higher dimensions, seeking to make separation easier in the new space. The dual problem then takes the same form, with the inner product x_i · x_j replaced by K(x_i, x_j), where K(x_i, x_j) is the kernel function. The mathematical representation of the Gaussian kernel is

K(x_i, x_j) = exp(−γ ||x_i − x_j||²),

where the box constraint c and the kernel parameter γ are hyperparameters that need to be optimized.
Fine Gaussian SVM refers to the SVM that makes finely detailed distinctions between classes, with the kernel scale set to √n/4, where n is the number of features (Savas & Dovis, 2019).
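The Gaussian kernel and the fine-preset kernel scale can be illustrated as follows. The mapping from the kernel scale σ to γ shown here is an assumption, as toolboxes differ in their exact conventions.

```python
import numpy as np

def gaussian_kernel(x_i, x_j, gamma):
    """Gaussian (RBF) kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

# Kernel scale sigma = sqrt(n)/4 for n features, as in the 'fine Gaussian'
# preset; gamma = 1 / (2 * sigma^2) is one common mapping (an assumption
# about the preset's internal convention, not a documented fact).
n_features = 11
sigma = np.sqrt(n_features) / 4
gamma = 1.0 / (2 * sigma ** 2)

a = np.zeros(n_features)
b = np.ones(n_features)
print(gaussian_kernel(a, a, gamma))  # 1.0: identical points
print(gaussian_kernel(a, b, gamma))  # < 1: similarity decays with distance
```

A small kernel scale makes the similarity decay quickly with distance, which is what allows the 'fine' preset to draw finely detailed class boundaries.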

Bagged trees
Bagging is a model-averaging ML method of combining multiple predictors. It generates multiple training sets by sampling with replacement from the available training data (Breiman, 1994). The term 'bagged trees' was deduced from the phrase 'bootstrap aggregating' (Breiman, 1997). Bootstrap aggregating improves classification and regression models in terms of stability and accuracy. It also reduces variance and helps to avoid overfitting, and it can be applied to any type of classifier. Bagging is also a popular method for estimating bias and standard errors and for constructing confidence intervals for parameters. In the case of classification into two possible classes, a classification algorithm creates a classifier H: D → {−1, 1} on the basis of a training set of example descriptions D. The bagging method creates a sequence of classifiers H_m, m = 1, … , M, with respect to modifications of the training set. These classifiers are combined into a compound classifier, whose prediction is a weighted combination of the particular classifier predictions:

H(x) = sign( Σ_{m=1}^{M} α_m H_m(x) ).

The parameters α_m, m = 1, … , M, are determined in such a way that more precise classifiers have a stronger influence on the final prediction than less precise ones. The precision of the base classifiers H_m may be only slightly higher than that of a random classification; that is why these classifiers H_m are called weak classifiers.
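A self-contained sketch of bootstrap aggregating, with a toy one-dimensional 'stump' standing in for a decision tree and equal votes standing in for the weights α_m (an illustrative sketch, not Matlab's BGT implementation):

```python
import random
from collections import Counter

def train_stump(data):
    """Fit a trivial 1-D 'stump': predict the majority class on each side
    of the median of the sampled data. Stands in for a decision tree."""
    xs = sorted(x for x, _ in data)
    split = xs[len(xs) // 2]
    left = [y for x, y in data if x < split] or [0]
    right = [y for x, y in data if x >= split] or [0]
    majority = lambda ys: Counter(ys).most_common(1)[0][0]
    return lambda x, s=split, l=majority(left), r=majority(right): l if x < s else r

def bagged_predict(train, query, n_models=25, seed=0):
    """Bagging sketch: draw bootstrap samples (with replacement), fit one
    weak learner per sample, combine by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]   # bootstrap resample
        votes.append(train_stump(sample)(query))
    return Counter(votes).most_common(1)[0][0]

# 1-D toy data: class 0 below 0.5, class 1 above
train = [(x / 10, int(x / 10 >= 0.5)) for x in range(10)]
print(bagged_predict(train, 0.2))  # -> 0
print(bagged_predict(train, 0.8))  # -> 1
```

Individual bootstrap stumps can be wrong near the class boundary, but the majority vote over many of them is stable, which is the variance reduction the section describes.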

Evaluation metrics
Accuracy is the overall evaluation index of the developed ML models. As this is a multi-class classification research, the 'One Versus All' strategy (Joutsijoki et al., 2016) was used.
The validation indices that have been calculated and considered are presented in Table 6.
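The one-versus-all indices can be computed from a confusion matrix as sketched below; the confusion matrix shown is hypothetical, not taken from the experiments.

```python
import numpy as np

def one_vs_all_metrics(cm, cls):
    """Per-class metrics from a confusion matrix `cm` (rows = true class,
    columns = predicted class) using the one-versus-all strategy."""
    tp = cm[cls, cls]
    fn = cm[cls, :].sum() - tp
    fp = cm[:, cls].sum() - tp
    tn = cm.sum() - tp - fn - fp
    sns = tp / (tp + fn)                 # sensitivity (recall)
    spc = tn / (tn + fp)                 # specificity
    prc = tp / (tp + fp)                 # precision
    acc = (tp + tn) / cm.sum()           # accuracy
    f1 = 2 * prc * sns / (prc + sns)     # F1 score
    return {"SNS": sns, "SPC": spc, "PRC": prc, "ACC": acc, "F1": f1}

# Hypothetical 3-class confusion matrix
cm = np.array([[50, 2, 0],
               [3, 45, 2],
               [0, 1, 47]])
print(one_vs_all_metrics(cm, cls=0))
```

For class 0 above, TP = 50, FN = 2 and FP = 3, so sensitivity is 50/52 and F1 equals 2·TP/(2·TP + FP + FN) = 100/105.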

Experimental results
The experiments were performed using Matlab R2019a. Due to the vast amount of data and the large number of classifiers, a private distributed-parallelized environment of 10 virtual instances of Matlab was employed. Each instance had access to eight i9-9900K threads and 32 GB of memory, and was equipped with two Matlab workers, which means that a total of 20 workers were available. The accuracy threshold was set to 0, which means that each classification was accepted only if all instances created by the VSM had been classified to the same class. The zero threshold could also be called zero error tolerance, and it was selected to check the model's performance under the strictest configuration. Ten-fold cross-validation was also applied to all tests.
Tables 7-10 present the performance for each of the four best classification algorithms, using Sensitivity (SNS), Specificity (SPC), Accuracy (ACC), F1 Score and Precision (PRC).
The confusion matrix for the classification obtained using the BGT algorithm is shown in Figure 4. Due to the high accuracy and similarity of results, the rest of the confusion matrices are not included.
All employed metrics prove that the proposed model has an excellent performance. For all four algorithms the accuracy is higher than 99%, and all indices point to a very well-tuned model. The above confusion matrix also indicates a prodigious performance, with only 120 (5, 56, 5 and 54) misclassified cases between classes 1 and 8. The vast majority of the misclassified cases are related to the 'transition period 0', which is expected. However, looking at the evaluation indices (Tables 7-10) and the confusion matrix, the performance is also extremely high (but not perfect) for class zero: 375,671 cases are correctly identified, whereas fewer than 1000 cases are misclassified. It is obvious that during the transition period (which lasts a limited, almost zero period of time) more than one activity is actually performed. However, even for this difficult case, our modelling effort is very efficient.

Discussion and conclusions
This research has managed to correctly identify human activities by considering a dataset obtained using two wearable devices. Using the introduced VSM, we could deploy a highly accurate HAR model and also recognize the transitions between activities. The introduced modelling approach was extensively tested to distinguish nine distinct activities performed close to real-life conditions. The results have proved that four classification algorithms can offer excellent models. The performance of the classification for six of the nine classes reached almost 100%, while it was extremely high for two more classes, namely 7 and 8. The efficiency of the proposed methodology is lower (but still very high) only for the transition class, which should be studied further to be improved. Psathas et al. (2020) introduced an HAR model that could achieve an accuracy of 92.8%. The extended version that has been introduced and tested here has achieved an accuracy as high as 99.9%, and it could identify the transition moments within the dataset.
Future work will concentrate on the minimization of the used features to build a lighter version of the model that could be trained in a shorter period of time. Also, an important matter is the improvement of the classification of the transition class 'zero'.
Finally, it would also be interesting to 'inverse' the methodology to estimate the heart beat rate. Together with the estimated activity, the system could then provide early warnings in cases of heart problems.