Hand features extractor using hand contour – a case study

Hand gesture recognition is an important topic in natural user interfaces (NUI). Hand feature extraction is the first step of hand gesture recognition. This work proposes a novel real-time method for hand feature extraction. In our framework we use two cameras, and the hand region is extracted with the background subtraction method. Features like the arm angle and the finger positions are calculated using Y variations in the vertical contour image. Wrist detection is obtained by calculating the largest distance between a base line and the hand contour, giving the main features for hand gesture recognition. Experiments on our own dataset of about 1800 images show that our method performs well and is highly efficient.


Introduction
A natural user interface (NUI) is a method that allows users to interact with information systems without the help of extra devices; in other words, users rely only on the natural abilities with which they were born. There are three main modalities for a NUI: voice, touch and gestures. There are several kinds of gestures, some related to the face or the body, but this article focuses only on hand gestures.
NUIs are gaining interest and popularity in all areas. The ability to give commands without any external tool makes communication between humans and computers more natural. Voice commands are used in artificial intelligence, touchscreens (mainly in smartphones) avoid the use of physical keys, and gestures can convey human expressions. As Miriam Novack said [1], gestures "reveal information that cannot be found in speech", which means that gestures can contain more information than words alone. Hand gesture recognition has been widely used in human-computer interaction, for example in virtual reality, robotics and computer gaming.
Based on the acquisition method, hand gesture recognition technology is mainly divided into two categories: data-glove-based methods and computer-vision-based methods. The computer-vision-based methods can be classified into two main classes [2]: static gestures [3,4,5,6,7], performed with one or both hands without movement, and dynamic gestures [8,9,2,10,11,12], performed as a sequence of hand images following a path or a predefined behaviour. Both static and dynamic gestures are extracted from a sequence of hand images (video), but in the first case the gesture must stay the same and in the same position, whereas in the second case the gesture changes slightly in form or position over time.
Image recognition of colour-based images, depth images and hand shapes was used in [7,11], but these approaches are not very precise with fingers. There are more specific hand-related devices such as the Leap Motion Controller [12,13,14], which mainly detects finger movements; however, the main drawback of this gadget appears when hands overlap each other. Applications of hand gestures have also been proposed in medical environments [15,16,17], serious games [13] and pedagogical practices [10]. It is worth mentioning works that combine the Kinect, Leap Motion and Oculus Rift [18,19]. Hand segmentation is the first goal in each case. It can be done with helping tools such as depth cameras [20] or stereoscopic vision [21]. The most common method is skin colour detection with RGB cameras [21,22,23,24,25,26,27,28,29]. Hand segmentation can be performed in a non-controlled or a controlled environment. In a non-controlled environment, skin colour detection faces challenges such as widely varying illumination conditions and skin-like colours appearing in the image background. A controlled environment uses a specific place where hand detection can assume a static background with no illumination changes.
To prevent the effects of complex backgrounds and varying light, Junxia et al. [30] propose a method that combines depth information and skin colour using a Kinect sensor. They extract the candidate hand area by skin colour segmentation and set a depth threshold to filter out most of the background. Finally, they use the hand extracted from the depth image as a mask, project it onto the skin-colour-segmented image, and thus obtain the hand from the colour image.
The main challenge is to detect hand features. Sometimes features like the centre of the hand [20,26,28] or the wrist [25] are needed, but the main features of the hand are the fingers. This work proposes a method for finger detection in a controlled environment with a black background.

Materials and methods
Hand capture is sometimes realized using markers on the hand [31]; it can also be done with data gloves [32] or with gloves of certain colours [6], but these approaches are intrusive. Studies have shown that gloves may have negative effects on manual dexterity, comfort and possibly the range of finger and wrist movements [33].
There are also expensive options such as head-mounted displays (HMD) that can detect hand gestures; examples are Google Glass [34] and Microsoft HoloLens [35], and the Oculus can also be included when fitted with a mounted Leap Motion. We, however, are interested in a more natural experience for the user, without wearable or touching devices. These approaches are also limited by the accuracy and precision of current vision-based devices for tracking hand movements, as well as by their inherent issue of visual occlusions.
In this article we use low-cost hardware and optimize memory usage, processor operations and so on, focusing on a fast and cheap option. The computer used to test this approach is a Dell Inspiron 5555 running Linux Ubuntu, using C++ and OpenCV with Code::Blocks.
Figure 1 depicts the block diagram of the process, from image acquisition to finger and wrist detection. In the first step, we simultaneously acquire the different views of the object of interest (one or two hands). The images then pass through several rather standardized processes that, as usual, allow faster hand detection. Because we use different points of view of the same object, we can calculate the 3D position of the arm, wrist and fingers of each hand (left and right) from the views. If no points are obtained, the returned values are NULL.
There are some other considerations for the calculation of the 3D points. For example, if the process of the top image detects a wrist but the process of the right image does not, there is not enough information to calculate the 3D point; the data are discarded and no hand is detected.
Figure 1. All steps from image acquisition to finger and wrist detection. All cameras are connected to the same computer and capture the RGB image sequence in real time (it can also be a video). The process needs a feed of simultaneously captured images (top and right), and each image is transformed to grey scale. The next step is to binarize the images with a threshold; with the result, the segmentation process begins. The number of objects detected is the number of hands (0, 1 or 2). If one or two hands are detected, we can calculate the arm angle; if no hand is detected, the process takes another pair of images and starts over. Next, the Sobel filter is applied to extract the contours. The points obtained from the contour process make it possible to detect the wrist. After wrist detection, a finger detection algorithm is applied to obtain the finger points in 2D. The combination of these data gives us the 3D points of each hand.

Considering that the work area is a desk, there are two cameras: one on the right side of the desk and another placed on the upper side. The images are captured at 320 × 240 resolution with a black background. The black background is used to remove shadows of the hand over the surfaces and to take the brightest pixels of the image as the hand itself. Figure 2 shows images from the upper and right cameras with non-significant data. As Figure 2 shows, in the upper view the hands always come out from the upper part of the image, whereas in the right view they always come out from the green line. We can calculate an adaptive threshold in a region of five pixels where the hand comes out in each view. The purpose of this region is to save computational resources: instead of finding the threshold over all image pixels, it is better to work in a small region (5 pixels long). The resulting threshold is applied to the whole image. Figure 3 shows this region and the result after the process.
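As an illustrative sketch (not the paper's exact code), the adaptive threshold over the small entry strip could be computed with OpenCV as follows; the function name and the use of Otsu's criterion on the strip are our assumptions:

```cpp
#include <opencv2/opencv.hpp>

// Compute a threshold from the 5-pixel entry strip only and apply it to
// the full grey-scale frame. 'entryRegion' is the strip where the hand
// enters the view (an assumed parameter).
cv::Mat binarizeHand(const cv::Mat& grey, const cv::Rect& entryRegion)
{
    cv::Mat strip = grey(entryRegion);

    // Let Otsu's method pick a threshold for the small strip; this makes
    // the threshold adapt to the current illumination at low cost.
    cv::Mat ignored;
    double t = cv::threshold(strip, ignored, 0, 255,
                             cv::THRESH_BINARY | cv::THRESH_OTSU);

    // Apply the same threshold to the whole image.
    cv::Mat binary;
    cv::threshold(grey, binary, t, 255, cv::THRESH_BINARY);
    return binary;
}
```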
After this step, the images can still contain noise or undesired objects. This is solved by keeping all the connected white pixels reachable from the pixels obtained in the previous step. Figure 4(b) shows how the undesired pixels are removed.
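One way to implement this, sketched below with assumed names, is to flood-fill from every white pixel of the entry strip and keep only the filled blob:

```cpp
#include <opencv2/opencv.hpp>

// Keep only the white pixels connected to the entry strip; everything
// else (noise, undesired objects) is discarded. 'entryRegion' is assumed
// to lie inside the image.
cv::Mat keepConnectedToEntry(const cv::Mat& binary, const cv::Rect& entryRegion)
{
    cv::Mat work = binary.clone();

    for (int y = entryRegion.y; y < entryRegion.y + entryRegion.height; ++y)
        for (int x = entryRegion.x; x < entryRegion.x + entryRegion.width; ++x)
            if (work.at<uchar>(y, x) == 255)
                // Mark the whole connected blob touching the entry strip.
                cv::floodFill(work, cv::Point(x, y), 128);

    // Pixels flood-filled to 128 form the hand; drop the rest.
    cv::Mat hand = (work == 128);
    return hand;
}
```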
The next step is to detect the arm pixels. Each image has rows (Y coordinates) and columns (X coordinates). In the upper view, we go through the binarized, segmented image row by row (every row spans many X coordinates). In each row we take the first and last X coordinates of the white pixels of the object, and their midpoint is an arm pixel. In other words, imagine an image of 1×N pixels, where 1 is the Y dimension (a single row) and N is the X dimension (N columns), containing a single run of white pixels; this can be seen in the top image of Figure 3, section (b), although that image shows a region of 5×N pixels (5 rows).
Equation (1) shows the calculation of the red X coordinate in each row of the image (Figure 5, top images):

$$x_r = \frac{x_1 + x_2}{2} \qquad (1)$$

where $x_r$ is the X coordinate of the arm point, $x_1$ is the first white pixel in the row of the image from left to right, and $x_2$ is the last white pixel in the row. The same process is applied in the right view, but taking columns instead of rows and the Y coordinate instead of the X coordinate (Figure 5, bottom images).
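A direct implementation of this row scan for the top view could look like the following sketch; the right view would use the same loop transposed (columns instead of rows, Y midpoints instead of X):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// For the top view: in every row, take the first and last white pixel and
// store their midpoint as an arm point, i.e. Equation (1).
std::vector<cv::Point> armPoints(const cv::Mat& binary)
{
    std::vector<cv::Point> pts;
    for (int y = 0; y < binary.rows; ++y) {
        int x1 = -1, x2 = -1;
        for (int x = 0; x < binary.cols; ++x) {
            if (binary.at<uchar>(y, x) == 255) {
                if (x1 < 0) x1 = x;   // first white pixel in the row
                x2 = x;               // last white pixel seen so far
            }
        }
        if (x1 >= 0)
            pts.push_back(cv::Point((x1 + x2) / 2, y));  // x_r = (x1+x2)/2
    }
    return pts;
}
```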

Linear regression with vectors
We remove the points whose distance is greater than n pixels; with this filter, the remaining red points are the ones closest to a line. With the arm points we can use a linear regression to obtain the line closest to all these points, and with that line we can calculate the arm angle. The first thing that comes to mind for this kind of problem is the equation of a line, but that only works when we have exactly two points, and here we have many more. Equation (2) shows the equation of the line $y = mx + b$ written in matrix form for all the points:

$$\begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_j & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_j \end{bmatrix} \qquad (2)$$

where $x_j$ and $y_j$ represent the coordinate pair of each point obtained before (the red ones). If the first matrix is represented by the letter E, the unknown vector by A, and the third matrix by the letter F, we have Equation (3):

$$E\,A = F \qquad (3)$$

The final equation used for the linear regression, obtained by multiplying both sides by $E^t$ and solving for A, is expressed in Equation (4):

$$A = G^{-1} E^t F \qquad (4)$$

where $G = E^t E$, E is the matrix built from the $x_j$ points, and F is the matrix built from the $y_j$ points. The left part of Figure 5 shows the arm pixels with and without the filter.
With the beginning and ending points of the line, $(x_1, y_1)$ and $(x_2, y_2)$, we can calculate the arm angle $\theta$ with the formula in Equation (5) (see Figure 6):

$$\theta = \arctan\left(\frac{y_2 - y_1}{x_2 - x_1}\right) \qquad (5)$$
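As a minimal sketch of this regression with OpenCV matrices (function and variable names are ours, and we assume the arm is far from vertical so the fit $y = mx + b$ is well conditioned):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Fit y = m x + b to the arm points with the normal equations of
// Equation (4), A = (E^t E)^{-1} E^t F. Returns (m, b).
cv::Vec2d fitArmLine(const std::vector<cv::Point>& pts)
{
    cv::Mat E((int)pts.size(), 2, CV_64F);   // rows of [x_j, 1]
    cv::Mat F((int)pts.size(), 1, CV_64F);   // column of y_j
    for (int j = 0; j < (int)pts.size(); ++j) {
        E.at<double>(j, 0) = pts[j].x;
        E.at<double>(j, 1) = 1.0;
        F.at<double>(j, 0) = pts[j].y;
    }
    cv::Mat G = E.t() * E;                   // G = E^t E
    cv::Mat A = G.inv() * E.t() * F;         // A = [m, b]^t
    return cv::Vec2d(A.at<double>(0, 0), A.at<double>(1, 0));
}
```

The arm angle can then be obtained as $\theta = \arctan(m)$, which matches Equation (5) evaluated at the endpoints of the fitted line.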

"Y variations" method
To find the fingers, we apply the Sobel filter to the segmented images and rotate them by θ so that the hands point downwards in both views. The contour is then followed and the Y variations are plotted in a graph; the lower points in the graph are the fingers. "Y variations" means that, following the contour in order (white pixel by white pixel), each white pixel has X and Y coordinates, and whenever the Y coordinate changes there is a Y variation. Figure 7 shows the graph for each view, with the detected fingers in red. The procedure is detailed in Algorithm 1.

Algorithm 1. Calculate "Y" variations (finger points). "I" represents the Sobel-filtered image, "V" is a vector that will contain all the white pixels, and F contains the finger points.
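A minimal C++ sketch of the idea behind Algorithm 1 follows; the window size and the orientation convention (fingertips maximise Y once the hand points downwards) are our assumptions, and duplicate hits on flat fingertips would still need merging:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Walk the ordered contour pixels and keep the local extrema of the Y
// coordinate as candidate fingertips.
std::vector<cv::Point> fingerPoints(const std::vector<cv::Point>& contour)
{
    std::vector<cv::Point> fingers;
    const int w = 8;  // half-window for the extremum test (assumed value)
    for (int i = w; i + w < (int)contour.size(); ++i) {
        bool extremum = true;
        for (int k = i - w; k <= i + w; ++k)
            if (contour[k].y > contour[i].y)  // with the hand rotated to
                extremum = false;             // point downwards, fingertips
        if (extremum)                         // maximise the Y coordinate
            fingers.push_back(contour[i]);
    }
    return fingers;  // flat fingertips may yield adjacent duplicates
}
```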
To find the wrist we propose a method in which the Sobel image is divided at the middle into a left side and a right side. Working with the left side, we need a reference line for finding the left side of the wrist, and two points to form it. The first point is the first white pixel in the first row of the image; the second point is the leftmost white pixel. With the line formed, we take five-pixel samples along the line and along the contour and calculate their distances; the largest distance marks the left part of the wrist. The same process is applied to the right part of the image, mirrored. Two wrist points are obtained, and their midpoint is the wrist. Figure 8 shows images of this process.
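The core computation of this step is a point-to-line distance; a small helper such as the following could compute it (an illustrative sketch, with a and b assumed distinct):

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>

// Perpendicular distance from point p to the line through a and b, used
// to find the contour sample farthest from the reference line.
double pointLineDist(cv::Point p, cv::Point a, cv::Point b)
{
    double num = std::abs((double)(b.y - a.y) * p.x - (double)(b.x - a.x) * p.y
                          + (double)b.x * a.y - (double)b.y * a.x);
    double den = std::hypot((double)(b.x - a.x), (double)(b.y - a.y));
    return num / den;
}
```

Scanning the contour samples every five pixels with this helper and keeping the maximum distance yields one side of the wrist.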
With the wrist point and finger points, we can calculate the angle of each finger using Equation (5).

The "convex" method
This method uses two functions included in OpenCV: convex hull and convexity defects. Figure 9 shows convexity defect results.
The use of the convex hull and convexity defects allows us to find an approximation of the wrist point: the two longest lines of the convex hull result, together with the convexity defects of these lines, give us two points that can be the wrist. The point with the larger distance to its respective line is the one with the higher probability of being the wrist. The angle of the arm can be defined in the same way as before, using Equation (5).
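Both functions have direct OpenCV counterparts, cv::convexHull and cv::convexityDefects. The following sketch shows how the defect points could be gathered; the depth filter value and the selection of the two longest hull edges are left as assumptions:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Building blocks of the "convex" method: hull plus convexity defects.
// 'contour' is the hand contour from the segmentation stage.
std::vector<cv::Point> defectPoints(const std::vector<cv::Point>& contour)
{
    std::vector<int> hullIdx;
    cv::convexHull(contour, hullIdx, false, false);   // hull as indices

    std::vector<cv::Vec4i> defects;                   // start, end, far, depth
    cv::convexityDefects(contour, hullIdx, defects);

    std::vector<cv::Point> candidates;
    for (const cv::Vec4i& d : defects) {
        double depth = d[3] / 256.0;   // depth comes in 8.8 fixed point
        if (depth > 10.0)              // assumed minimum depth in pixels
            candidates.push_back(contour[d[2]]);  // deepest defect point
    }
    return candidates;   // the wrist is sought among these points
}
```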

Results and discussion
We tested the algorithm on three recorded videos, each containing about 600 images. The results are good enough to detect the corresponding interest points, so this is a cheap and fast way to detect them. Figure 13 shows several images of hands with finger and wrist detection.
When a user shows a closed hand (no visible fingers), the minimum is still considered a finger: even if it is not a finger, there is a minimum in the graph (Figure 14). This problem does not always happen, but it is a point to improve for better results.

Figure 14. False positives. In most of the images shown (a, b, c, d), there is a closed hand with no visible finger, but the algorithm detects a finger anyway (a false positive): with the "Y variations" method, a minimum is sometimes detected and the algorithm takes it as a finger. Image (e) shows another false positive, caused by the hand moving too fast for the camera, which captures the previous position (motion blur).
Sometimes the wrist points are misplaced when the hand is not fully visible to the cameras, in other words, when the cameras cannot see the fingers or the fist, or when the wrist itself is not visible. There is also the problem of hand occlusion, mainly in the right view. Figure 15 shows images of these kinds of problems.
After all these data, one question remains: which of the two methods (described in Sections II.b and II.c) is faster? Table 1 shows the comparison of the two methods.
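For reference, per-frame timings like those in Table 1 can be collected with a small helper such as the following sketch ('detect' stands for either detection routine):

```cpp
#include <chrono>

// Measure the elapsed time of one detection pass in milliseconds.
template <typename F>
double elapsedMs(F&& detect)
{
    auto t0 = std::chrono::steady_clock::now();
    detect();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```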

Conclusions
As we can observe from Table 1, the elapsed time of the proposed method is slightly lower than that of the convex method. Both methods have some inconveniences; for example, the convex method detects too many points, so it can detect fingers easily but it also produces many badly detected fingers. In both cases the well-detected percentage is low because the images had too much motion blur, which can be solved by performing the gesture slowly or by using cameras with a higher frame rate. The convex method needs a filter for the badly detected points; with it, the resulting data would be better than those of the proposed method, but the elapsed time would increase. The proposed method is therefore a low-resource way to implement finger detection with good results. It can be used in scenarios where high precision in finger detection is not required but speed is.

Future work
The next step after this process is to develop the hand gesture detection, and it is necessary to obtain good results with low computational resources. We are working on a pixel-to-pixel comparison, but there are problems when hand gestures are very similar.
We are considering a method that uses the hand contour image to obtain interest points and compares them with base images to detect hand gestures.
To solve the "not fully visible" hand problem, it is necessary that the camera's position will be farther to the user. We are working with different distances of the cameras to have a considerable working area that can detect the hand positions easily.
Once the hand gesture detection process is completed, we can work on a client-server application: use a Raspberry Pi to detect the points and hand gestures, send the data to a computer, and execute commands on it.
Some other improvements can be made to the code and to the method for finger and wrist detection. It may be necessary to find the central point of the hand's palm or to improve the calculation of the points.

Disclosure statement
No potential conflict of interest was reported by the authors.