Speeding up and enhancing a large-scale fingerprint identification system on GPU*

ABSTRACT Fingerprint identification is one of the most common biometric feature problems which is used in many applications. Although state-of-the-art algorithms are very accurate, the need for fast processing a big database of millions of fingerprints is highly demanding. In this paper, we propose to adapt the fingerprint matching algorithm based on Minutia Cylinder-Code on Graphics Processing Units to speed up the matching. Another contribution of this paper is to add a consolidation stage after matching to enhance the precision. The experimental results on a GTX-680 and K40 tesla devices with standard data-sets prove that the proposed algorithm can be comparable with the state-of-the-art approach, and is suitable for a real-time identification application.


Introduction
Fingerprint matching is the task of estimating the degree of similarity between two given fingerprint images (hereafter called fingerprint, for short). Current solutions to this problem can be classified into three types: correlation-based, minutiae-based, and ridge feature-based matching, of which the minutiae-based matching is the most popular approach thanks to its accuracy as stated in Fingerprint Verification Competitions (FVC) (Cappelli, Maio, Maltoni, Wayman, & Jain, 2006). A minutia is the point where a ridge continuity breaks, and it is typically represented as a triplet (x, y, u), where x and y are the coordinates of the point and θ is the ridge direction at that point. The similarity of the two given fingerprints with minutiae-based matching approach is proportional to the number of matched minutiae pairs in them. Figure 1 illustrates the matched minutiae of two fingerprints.
The state-of-the-art matching algorithm Minutia Cylinder-Code (MCC) (Cappelli, Ferrara, & Maltoni, 2010) has the best accuracy; however, its matching speed is not good enough to apply in a real application which has a large database of millions of fingerprints, e.g. the database of civil identification systems (Unique Identification Authority of India, 2012).
Graphic cards with general purpose graphics processing units (GPGPUs, hereafter called GPU for short) are a new parallel architecture which have been proven to be very useful for accelerating the processing speed of computationally intensive algorithms. These devices include thousands of processing units; thus, they provide massive parallel calculations and have been applied successfully in many fields such as artificial intelligence (Krizhevsky, Sutskever, & Hinton, 2012;Zhang, Yi, Wei, & Zhuang, 2014), simulation (Friedrichs et al., 2009), bioinformatics (Schatz, Trapnell, Delcher, & Varshney, 2007), as well as fingerprint matching (Cappelli et al., 2015;Gutierrez et al., 2014). In this paper, we propose a novel approach to adapt/parallelize the MCC algorithm to GPU devices.
In addition, two minutia pairs between the two given fingerprints can be locally matched; however, it is not guaranteed to be globally matched. It is incorrect to use the minutiae, which are only locally matched, to estimate the similarity degree of the two given fingerprints. Another contribution of this paper is to parallelize the consolidation step to the MCC algorithm to ensure that the minutiae are globally matched before estimating the global similarity. The experiments on the standard fingerprint database indicate that our proposal helps to decrease the error rate.
The contribution of this paper includes: . Propose a novel parallelization method of MCC algorithm (cf. Section 4.1) . Parallelize the consolidation stage to enhance the precision (cf. Section 4.2) This paper was improved from Nguyen (2016a, 2016b), in which the parallelization strategies are clearly discussed in detail; further experiments with a new GPU device were carried out to confirm the correctness of the proposed solution.
The rest of the paper is organized as follows. Section 2 reviews the MCC algorithm. Section 3 briefly describes the GPU architecture and programming model. Section 4 describes the state-of-the-art approach. Section 5 describes our adapted MCC algorithm on GPU devices. Section 6 analyses the experimental results on the open databases. The final section gives the conclusions and future directions.

Fingerprint identification problem
From an existing database of N fingerprints (maybe accompanied with personal information having fingerprints in the database), given a new (unknown) fingerprint, the task of fingerprint identification is to find out the matched ones in the database, i.e. find the match score or similarity between the new fingerprint with everyone in the database. Two fingerprints are deemed to be matched if their match score (or similarity) is greater or equal a predefined threshold.

Minutia Cylinder-Code
Given a minutia m represented as {x m , y m , u m }, its local structure in MCC is defined as a cylinder having the centre at the coordinate {x m , y m }, the radius R, and the height 2π. Each cylinder is divided into N s × N s × N d cells as shown in Figure 2, where N s is the resolution of the discretized 2D space around minutia m (N s × N s ) and N d is the height of the cylinder (i.e. 2π). From this cylinder, it is possible to find its neighbouring minutiae of which the information used to form the characteristics of this minutia is the position (Chikkerur, Cartwright, & Govindaraju, 2006;Medina-Pérez, García-Borroto, Gutierrez-Rodriguez, & Altamirano-Robles, 2011), ridge (Feng, Ouyang, & Cai, 2006;Wang, Li, & Niu, 2007), orientation (Qi, Yang, & Wang, 2005;Tico & Kuosmanen, 2003), or a combination of these (Feng, 2008).
The local structures are invariant to the fingerprint rotation and translation. Therefore, it allows to match two fingerprints with different rotation and translation. The local structure matching allows to quickly find pairs of minutiae that can be matched locally and can be the candidate for aligning between the two fingerprints.
The contribution of each minutia m t to a cell (of the cylinder corresponding to a given minutia m i ) depends both on spatial information (how much m t is close to the centre of the cell) and directional information (how much the directional difference between m t and m i , is similar to the directional difference associated to the section where the cell lies). In other words, the value of a cell represents the likelihood of finding minutiae that are close to the cell and have directional difference with respect to m i .
With a negligible loss of accuracy, the cylinder c i of the minutia m i can be simply treated as a single feature vector, and each element of the feature vector can be represented as one bit (Cappelli et al., 2010).
In order to compare two given fingerprints, most minutiae-based matching algorithms consist of two steps: the first step performs a local structure matching followed by a global step, i.e. calculate the aggregated similarity based on the locally matched minutiae. Recently, an MCC-based matching algorithm (Cappelli et al., 2010) shows a good performance in terms of both accuracy and speed.

Similarity score calculation
Let v i and v j be the (cylinder) bit vectors of minutiae m i and m j , correspondingly, a simple but effective similarity sim(.) between the two minutiae is defined as (Cappelli et al., 2010): where . ⊕ represents the bitwise XOR operator, . ||.|| represents the Euclidean norm, . d(m i , m j ) is the angle difference (i.e. u i , u j ) of the two minutiae m i and m j . The formula is Given two fingerprints T 1 and T Q , a local matching is performed on every pair of cylinders, and the results are stored in a matrix. In order to aggregate the similarity score s of the two fingerprints, there are two strategies: . Local Similarity Sort (LSS) technique sorts the scores of the matrix and computes the average of the top n P highest scores. . A more accurate but less efficient similarity measure is the Local Similarity Sort with Distortion-Tolerant Relaxation (LSS-DTR). LSS-DTR adds a consolidation stage to LSS in order to obtain a score that modifies the local similarities to hold at the global level. Figure 3 shows the steps to calculate similarity score s using LSS, where s is the average of the top n P values chosen from the similarity matrix.

MCC-based matching algorithm
From a database T of N fingerprints in the format of minutia templates, i.e. T = {T 1 , . . . , T N }, given an unknown query template T q , the first step (local matching) MCC matching algorithm is to calculate similarity matrix between each minutia pairs of T i and T q . In the second step, i.e. global score estimation, the similarity score S i is calculated from the average of top n P values of the similarity matrices (LSS approach). In step 3, S = {S 1 , S 2 , .., S N } is used to determine which templates in the database are the matches of the query. Figure 4 demonstrates the calculating steps of the MCC algorithm. The Figure 3. Similarity score computing process using LSS (Cappelli et al., 2015).
most time-consuming step is Step 1, i.e. estimating the similarity scores; thus, this is where parallelization is applied. The pseudo-code of the algorithm is described in Algorithm 1, where S[] is the matched score array and maxValue[] is the maximum matched score of each minutia in T q with other minutiae in the other fingerprint T i . Since only top n P values of the similarity matrix are used to calculate the global score, the array maxValue[] is used instead of a matrix.
ALGORITHM 1 The sequential MCC-based fingerprint matching algorithm with CPU.

Sequential fingerprint identification on CPU
Input: The template set T The query template T q Output: The possibly matched templates from T 1. Let S[] be an array of size N; //Similarity score set 2.
Load the template set T = {T 1 , . . . , T N } into the main memory;//Done only once 3.
Let T q be the query template; 4.
For i = 1 to N//N is the number of templates in the database //Step 1: local minutia matching 5.
Let maxValue[] be an array of size T q .length;//The similarity of matched pairs 6.
For j = 1 to T i .length // T i .length is the number of minutiae of T i 7.
Let m[j] be the j th minutia of the query fingerprint T i ; 8. maxValue[j] = 0; 9.
For k = 1 to T q .length // T q .length is the number of minutiae of T q 10.
Let m[k] be the k th minutia of the template T q ; 11.
If  Note that, the matched score of two fingerprints is calculated based on the array maxValue[], i.e. local matching scores.
Step 3 is not mentioned in detail, since it is not the point for parallelization.

The parallelism of GPU
NVIDIA introduces the compute unified device architecture (CUDA), one of the two common frameworks for developing applications on their GPU devices (the other framework is OpenCL 1 ). With CUDA, applications written in C/C++ running on the CPU (of the hosted machine) can call to execute parallel codes called kernels (written in C) on NVIDIA GPUs. The physical architecture of CUDA-enabled GPUs consists of a set of Streaming Multiprocessors (SM), each containing 32 cores following single instruction, multiple data (SIMD) scheme. The architecture of NVIDIA GPU device is depicted in Figure 5.
Threads are the instances of a kernel (i.e. run the same code), and each processes a different dataset (i.e. SIMD). Threads are grouped into blocks. All threads of the same Figure 5. NVIDIA GPU architecture (Luebke et al., 2006). block are executed on the same SM and share the limited memory resources of that multiprocessor. The maximum number of threads in a block cannot be too large (1024 for the GPU device used in this work). However, a kernel can be executed by multiple, equally sized blocks, forming a grid: the total number of threads is then equal to the number of blocks times the number of threads per block. Each SM schedules and executes threads in groups of maximum 32 parallel threads (i.e. the maximum number of cores in an SM) called warps. A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp synchronize their execution path. If threads of the same warp take different paths (due to flow control instructions), they have to wait for each other. It is important to make GPU threads extremely lightweight.
CUDA threads have access to various memory types ( Figure 4): each thread has its registers, which are the fastest memory, and its private local memory (which is slower); each block has a small shared memory, accessible to all threads of the block and with the same lifetime of the block; all threads have access to the global memory: the largest memory and slowest memory type, which is accessible by all threads of all blocks, so it is used for communication between different blocks and with the host (the program running on CPU). When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions. Therefore, a very important optimization in CUDA is ensuring that global memory accesses are as much coalesced as possible.
A CUDA program includes two parts: one is the kernel running in parallel on the GPU device following SIMD paradigm; the other running on CPU acts as a coordinator, i.e. prepares the data, sends the data to the GPU device, calls the kernels, and retrieves the results from the GPU.

Cappelli's parallelizing approach
The pseudo-code of Cappelli's parallelizing approach is listed in Algorithm 2, which includes three GPU kernels: . The first kernel ComputeLMC is responsible for building the lookup table for identifying whether two minutiae m i , m j (in two fingerprints) have the angle difference less than a threshold, i.e. d(m i , m j ) ≤ t u (cf. 1). This strategy helps to avoid if branch in the SIMD model, thus improving the performance. . The second kernel Step-1 computes the similarity of minutiae pairs, i.e. a minutia of the query T q and another one of a T i . To follow the SIMD model, each thread in a block is responsible for calculating the similarity between a minutia of a T i with all minutiae of T q . There are DB 1 threads in a block and NC DB 1 blocks in this kernel. In the experiments, DB 1 is set to 192. . The third kernel Step-2 computes the similarity between T q and all T i in T. Each thread calculates the similarity of (T q , T i ) pair. To make all the threads have an equal task, the author considered only w top (highest) scores of matched cylinders of (T q , T i ). The bucket B [ N N×w is used to store these highest scores. There are DB 2 threads in a block and N DB 2 blocks in this kernel. In the experiments, DB 2 is set to 64.
Reset bucket matrix B [ N N×w ; //Set matrix cell's value to 0 3.
Load the query template T q ; 4.
Initialize S in GPU 5.

Call
Step-2 kernel (DB 2 threads per block, N DB 2 block); 9. Copy S from GPU memory to RAM for finding the possible matches 5. The proposed approach 5.1. Adapting MCC matching algorithm to GPU

Strategy 1: maximize the active threads
The primary goal in writing a kernel to run on a GPU is to maximize the active 32 threads in a warp. Since the number of minutiae in a fingerprint is variable, to avoid divergence between threads in the warps, several approaches suggest to mainly divide the MCC matching algorithm into two separate kernel GPU calls (Cappelli et al., 2015;Gutierrez et al., 2014). The first kernel GPU call is to calculate all similarity matrices. The second call is to calculate the global scores from similarity matrices. When dividing the algorithm into separating calls, it is necessary to transfer data between kernel calls. Meanwhile, some advantages of the GPU architecture like shared memory are not utilized. Cappelli et al. (2015) use a very careful design to adapt the algorithm and an ad hoc technique to translate the similarity matrix to have a fixed size. From our investigation on the FVC 2002 fingerprint databases (Cappelli et al., 2006), the average number of minutiae of each fingerprint is 30, and the average number of matches for a genuine matching is only 6. Thus, the first optimization strategy, we propose to use 32 minutiae for each fingerprint which is enough for the matching process in order to fit well on a GPU block. However, we carefully select 32 minutiae (from fingerprints having more than 32 minutiae) as follows: . All existing minutiae are used to generate cylinders, so the MCC bit vectors of cylinders are not affected in comparison with the case of using all minutiae. . From the fact that, the more bit 1 the vector (of a minutia) has, the more neighbouring minutiae it has. Minutiae having the cylinder with a small number of 1 tend to be the outlines of a fingerprint. We sort the cylinder vectors in the decreasing order of number of bit 1 (in the vector), and then select the top 32 minutiae.
Since some fingerprints have more than 32 minutiae, we have to accept an increase in error rate. However, this is acceptable as detailed in the next section. In our approach, all the similarity matrices in Figure 5 have the same size 32 × 32. We set the number of threads per block to be 32, i.e. equal the maximum number of threads per warp (or per SM). Then we use one GPU block to calculate the similarity (i.e. using LSS) between a template T k and the query T q . This is the key difference between our approach and Cappelli's. In Step 1, each thread of the block is responsible for calculating a column in the similarity matrix and finding the maximum value in that column. In Step 2, the first thread of the block calculates the global similarity from the average of maximum values found from 32 threads of the block.

Strategy 2: exploit the shared memory
Since the shared memory space is faster than global memory, we exploit it to store the results in each block. Table 1 lists the data structures and variables along with its memory type.

Strategy 3: pre-compute costly functions
Since a minutia is represented by a triple (x, y, u), where (x, y) is its coordinate in integers and the ridge u is represented in degree, it is possible to pre-compute cosine and sine functions with the range of 0 ≤ u ≤ 360. Though these functions accept only radian values, a simple mapping from degree to radian is used before calling the functions. Square root function is needed in some calculations (cf. Section 4.2) and we found out the range of the input (i.e. from 0 to 1281); thus, it is possible to calculate the square root of these values.
Consequently, time-consuming functions (e.g. manipulate real numbers) like square root, cosine, and sine can be pre-computed and stored in an array. Later, we just lookup the result instead of calling the function. These details are not mentioned clearly in Algorithm 3 for simplicity.

Strategy 4: void branching instructions
The branching statement like if in the kernel will make the threads to take different branches; thus, the SIMD model does not work well. The statement 'If d(m j , m k ) ≤ t u … ' in line 11 in Algorithm 1 will be removed as follows: we pre-compute d(m j , m k ) and store the results in a list of arrays, i.e. for each u (in degree instead of radian) (0 ≤ u ≤ 360), find all the minutiae m k of the query template T q satisfying d(m j , m k ). Thus, given a new minutia m j with the angle u, the list of minutiae of T q satisfying the constraint is readily available. Lines 10-13 in Algorithm 2 demonstrate this method.

Enhancing the precision with consolidation stage
The simplest consolidation approach uses the local matched minutiae pair having the maximum similarity value in order to align the fingerprint T and T q for the consolidation stage. Let (m i1 , m i2 ) and (m ′ j1 , m ′ j2 ) be two minutia pairs to be verified. Let [Dx, Dy] be the translation between two minutiae m ′ j1 and m i1 . Let u be the angle to rotate m ′ j1 to m i1 . Let m ′′ j2 be the mapped point from m ′ j2 using translation [Dx, Dy], and rotation u. Let (x ′′ j2 , y ′′ j2 , u ′′ j2 ) and (x i2 , y i2 , u i2 ) be the triplets of m ′′ j2 and m i2 . Then x ′′ j2 and y ′′ j2 are calculated as: The two minutia pairs are deemed to be globally matched if they satisfy following constraints: . The spatial distance between the two minutiae does not exceed a threshold t s , i.e.
is the spatial distance between m ′′ j2 and m i2 . . The directional difference between the two minutiae does not exceed a threshold t u , i.e.
. Figure 6 demonstrates the step of translation and rotation to compare the two minutia pairs. The two parameters t s and t u represent the tolerance window, and their value can be determined by experiments. For example, in TK algorithm (Tico & Kuosmanen, 2003), the distance threshold t s = 12 and the angle threshold t u = π/6 brought in a good performance.
The transformation on the minutiae pair having maximum similarity value may not be the best transformation at the global level. Several authors have adopted multiple candidate transformations for the alignments. Finally, the transformation that maximizes the number of consolidation stage minutiae pairs will be chosen. For instance, Medina-Pérez et al. (2011) reduced the number of local matching minutiae pairs by for each minutia p and minutia q, selecting only minutia that maximizes their similarity values, then perform the transformation for each minutiae pair in the reduced set. Feng (2008) sorted minutiae pairs by descending similarity values and chose top n minutia pairs for Figure 6. Global checking of the match between to minutia pairs (Maltoni, Maio, Jain, & Prabhakar, 2009). the transformation. Normally, these approaches allow to get better accuracy than that of the single transformation approach. In our approach, we compare all local matched pairs of minutiae between two fingerprints to calculate the global similarity.

The proposed algorithm
The data structures and variables used in our algorithm are listed in Table 1.
The pseudo-code of our algorithm is described in Algorithm 2.
ALGORITHM 3 Our proposed parallelized algorithm Parallel fingerprint identification on GPU /***CPU code***/ 1. Initialize S[] be an array of size N on GPU; //Similarity score set 2.
Load T q into GPU; //Do for every query 4.
Enumerate the minutiae of each T i [ T satisfied d(m q , m k ) ≤ t u , where m q and m k are minutiae of T q and T i , correspondingly. /***Kernel code (running on GPU), call for each query T q ***/ //Step 1: local minutia matching 5.
//Each thread t idx processes a minutia of The syncthreads() function in lines 15 and 23 are the barriers for all the threads of the block, i.e. all the threads in the block must finish running the code above the call to syncthreads(). This function ensures the results of all the threads of the block are available for aggregation.
The function atomicMax(maxMatches, maxMatching[t idx ]) performs maxMatches = max(maxMatches, maxMatching[t idx ]) while ensuring the critical section due to the race condition caused by multiple concurrent threads within the same block. Note that the matched score of the two fingerprints is calculated based on the variable maxMatches, i.e. the consolidation results. Thus, it is different from that in Algorithm 1.

The complexity of the proposed algorithm
From the above text, some discussions about our algorithm are: . No modification to the original MCC matching algorithm is added, thus the complexity of our algorithm is intact. . The reduction the number of minutiae of each fingerprint to 32 to fit the GPU architecture reduces the search space, thus the matching will be quicker for fingerprints having more than 32 minutiae in comparison with the original algorithm. . The main reason to speed up the matching is the parallelization method as mentioned in Sections 4.1-4.3.

Experimental results
In this paper, the problem of processing fingerprints with different size is out of scope. In order to evaluate our proposed approaches, we used the FVC 2002 fingerprint database, which contains equal-sized fingerprints, for the experiments. The tool by Medina-Pérez, Loyola-González, Gutierrez-Rodríguez, García-Borroto, and Altamirano-Robles (2014) was used for minutiae extraction and MCC cylinder creation process. Some parameter values in Section 4.2 are the same as those of Tico and Kuosmanen (2003), i.e. t s = 12 and the angle threshold t u = π/6. The generated MCC templates of fingerprints were stored on the disk for the experiments. The experiments were carried out on CentOS 7.1 operating system with CUDA 7.5. The configuration of GPU devices is listed in Table 2, where Testla C2075 is the device used in Cappelli et al. (2015). For evaluating the accuracy of the proposed algorithm, the result of the proposed algorithm is compared with the result of the MCC baseline algorithm in which all minutiae are used for the matching process. Table 3 shows the error rate of different algorithms, where WCS and WOCS stand for 'with consolidation stage', 'without consolidation stage', correspondingly. Without the consolidation stage, the action of selecting 32 minutiae for matching increases the equal error rate (EER), otherwise it decreases the EER to be lower than the original algorithm.
For evaluating the speed of the proposed algorithm, we carried out all the experiments on an NVIDIA GeForce GTX 680 with 1536 CUDA cores, Kepler Architecture, and 2GB of memory. The FVC 2002 database was scaled to different sizes (ranging from 10,000 to 200,000) to study how the GPU-based algorithm scaled with the database size. Ten fingerprints were randomly selected to be identified. Table 4 shows the results of experiments with different database sizes, where KMPS stands for 'kilo (thousand) match per second'.
At larger template set sizes, the throughput of the proposed algorithm is stable at 8.5 and 9.8 million matches per second on GTX 680 and Tesla, no scalability issues were found providing that the fingerprint templates can be stored on the GPU's memory. Our results are higher than those reported in Gutierrez et al. (2014), which gains 55.7 KMPS on the same GeForce GTX 680 device. In addition, these results are comparable to state-of-the-art result (Cappelli et al., 2015), which gains 9 million matches per second on Tesla C2075 GPU.

Conclusions and future work
This paper proposed a novel method to adapt MCC-based fingerprint matching to GPU. After using all minutiae for calculating cylinders, top 32 minutiae having the biggest number of neighbours are selected (to fit a GPU block) for matching. A consolidation stage step was added to the algorithm to enhance the accuracy. The speed of the proposed algorithm is comparable with the results of the state-of-the-art published algorithm. The proposed approach can be easily scaled-up to be able running in a real-time system. Thus, it is possible to implement a large-scale fingerprint identification system on inexpensive hardware.
One possible direction in the future is to apply a mixed parallel model with multiple machines with a message passing interface model, each of which has multiple GPU cards to process large fingerprint data-sets for real situations.