Privacy-preserving data mining of cross-border financial flows

Abstract
Criminal networks continue to utilize the global financial system to launder their proceeds of crime, despite the broad enactment of anti-money laundering (AML) laws and regulations in many countries. Money laundering consumes capital resources and the tax revenue needed to fund infrastructure development and alleviate poverty in developing market economies. This paper, therefore, expands on the tools available for enabling privacy-preserving data mining in multi-dimensional datasets to combat cross-border money laundering. Most importantly, this paper develops a novel measure for detecting anomalies in cross-border financial networks, allowing financial institutions and regulatory organizations to identify suspicious nodes. The research used a sample dataset comprising international financial transactions and a hypothetical dataset to demonstrate the measure of node importance and the symmetric-key encryption algorithm. The results support the argument that the proposed network measure can detect node anomalies in the cross-border financial flows network, enabling regulatory authorities and law enforcement agencies to investigate financial transactions for suspicious activity and criminal conduct. The encryption algorithm can ensure adherence to information privacy laws and policies without compromising data reusability. Hence, the proposed methodology can improve the proactive management of money laundering risks associated with cross-border fund flows for the global financial system’s benefit.


PUBLIC INTEREST STATEMENT
Money laundering poses a significant economic challenge in emerging markets, reducing the tax revenue needed to fund infrastructure development and poverty alleviation programs. Despite the recent advances in technology and the availability of financial transaction datasets, money laundering-related investigations are narrowly focused on incidents and generally triggered by tip-offs. This research proposes a network model to combat illegal cross-border fund transfers while preserving the privacy of personally identifiable information. The study leveraged data mining methods and regulatory policies to identify suspicious transaction patterns in financial transactions between residents and non-residents. The research used a sample dataset drawn from the South African database of international financial transactions to illustrate the proposed privacy-preserving data mining approach. Supervisory authorities can use the model to define and plan inspections of regulated entities in a cost-effective manner. Financial institutions can use the model to enhance compliance monitoring and risk management functions.

Introduction
The global financial system is subject to a wide range of risks and vulnerabilities exploited by criminal networks to launder their proceeds of crime with a relatively low risk of detection. One example of such a threat is the voluminous and volatile cross-border financial flows, obscuring individual transactions and providing opportunities for criminal networks to transfer funds across country borders.
Many countries have adopted the internationally endorsed standards of the Financial Action Task Force (FATF), which provide a comprehensive set of countermeasures against money laundering (Cox, 2014). Most financial institutions' automated AML systems embed the FATF standards, enabling built-in transaction-specific triggers to identify suspicious transactions in real time. However, on many occasions, the flagged cases turn out to be false positives (Pourhabibi et al., 2020).
Money launderers are often aware of the events that trigger suspicious transactions and circumvent them using advanced transaction layering techniques and methods, such as the "straw man," sophisticated documentation, and consulting firms (Harvey, 2004; Teichmann, 2019; Van Duyne, 1994; Walker, 1999). The straw man technique disguises the beneficiary's identity, which is the focus of most compliance procedures that banks and other financial institutions implement.
Using graph-based substructures and measures for detecting money laundering activities is a large area of network theory research, with several measures already proposed (Sun et al., 2005b; Li et al., 2020; Xiong et al., 2010; Zhiguo et al., 2015). However, the proposed measures are not easily extensible to directed and dual-weighted networks such as the cross-border financial flow network (Akoglu et al., 2015). In addition, the standard metrics for weighted networks (such as degree, closeness, and betweenness centrality) have solely focused on tie weights and not on the number of ties.
Researchers have proposed a measure of node importance for social networks that combines tie weights with the number of ties (Opsahl et al., 2010). The metric takes the form:

$$C_D^{W\alpha}(i) = k_i \left(\frac{s_i}{k_i}\right)^{\alpha} = k_i^{1-\alpha}\, s_i^{\alpha} \qquad (1)$$

where $k_i$ is the number of ties of node $i$, $s_i$ is the sum of its tie weights, and α is a positive tuning parameter. If α is set between 0 and 1, then having a high degree is favorable, whereas if it is set above 1, then a low degree is favorable. Notably, the measure in Equation 1 considers only one network weight. Therefore, it is not extensible to the proposed directed and dual-weighted networks.
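The effect of the tuning parameter can be illustrated with a minimal Python sketch. It assumes the metric of Opsahl et al. (2010) takes the form k^(1-α) · s^α, where k is the number of ties and s the total tie weight of a node; the function name and the two nodes below are hypothetical and serve only as an illustration.

```python
def opsahl_degree_centrality(tie_weights, alpha):
    """Return k**(1 - alpha) * s**alpha for a single node.

    tie_weights: weights of the ties attached to the node (hypothetical input).
    alpha: positive tuning parameter.
    """
    k = len(tie_weights)   # number of ties
    s = sum(tie_weights)   # total tie weight
    if k == 0:
        return 0.0
    return (k ** (1 - alpha)) * (s ** alpha)


# Two hypothetical nodes with the same total weight (12) but different degree.
many_small_ties = [3, 3, 3, 3]   # k = 4
one_large_tie = [12]             # k = 1

for alpha in (0.5, 1.0, 1.5):
    print(alpha,
          opsahl_degree_centrality(many_small_ties, alpha),
          opsahl_degree_centrality(one_large_tie, alpha))
```

With α below 1 the node with four ties scores higher, at α equal to 1 both nodes score the same, and with α above 1 the node with a single tie scores higher.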
This paper closes this research gap by proposing a network model for directed and dual-weighted networks, along with a measure of node importance that combines tie weights and the number of ties. The proposed metric addresses the question, "Which are the most important or central nodes in the cross-border financial flow network?" In Section 4, we compare the metric in Equation 1 with the measure proposed in this paper at different levels of α. The results show that the measure proposed in this paper can detect suspicious transaction activity involving multiple high-volume flows of funds from source to destination accounts.
The lack of research about money laundering involving cross-border transactions is mainly attributable to data access and sharing restrictions. Governments and firms use multitudes of regulations, laws, and best practices to protect datasets comprising private and confidential information. Hence, researchers recommend developing techniques incorporating privacy concerns as a fruitful direction for future data mining research (Agrawal & Srikant, 2000; Qi & Zong, 2012; Xu et al., 2014).
The three broad categories for privacy-preserving data mining are data obfuscation, summarization, and data separation (Adam & Wortmann, 1989; Cios & Moore, 2002; Clifton & Vaidya, 2004; Dwork et al., 2006; Kou et al., 2007). The development of both cryptographic and machine learning methods and their integration with the three broad categories of privacy-preserving techniques have been a subject of research interest in recent years (Pathak et al., 2010, 2011; Wang et al., 2018). This paper leverages advanced technology to develop a symmetric-key encryption algorithm at the intersection of classical cryptography and data obfuscation methods, encrypting the explicit variables from the financial transaction datasets, such as resident name and resident address. The motive for developing the symmetric-key encryption algorithm is two-fold: first, to leverage the group structure of the multi-dimensional dataset to transform the individual's personally identifiable information (PII) without losing data reusability; second, to compute the dual weights of the cross-border financial flow network.
The organization of the paper is as follows: this section provides background on cross-border financial flows and AML. Section 2 presents the proposed encryption algorithm, followed by the proposed network structure of the cross-border financial flow model along with the measure of node importance. Section 3 illustrates the workings of the symmetric-key encryption algorithm, the network visualization and the proposed network measure. The last section concludes the paper.

Cross-border financial flows
Cross-border financial flows are money transfers made by a resident to a non-resident and vice versa because of financial transactions involving individuals, private and public firms, central banks, financial institutions, as well as legal entities such as trusts and non-profit organisations or a combination thereof, in at least two different countries. The banking sector plays an important role in channelling cross-border flows in a country. Figure 1 depicts the flow of cross-border transactional data between South African residents and the rest of the world. The authorized dealer network (comprising commercial banks and other licensed financial institutions) facilitates cross-border payments between residents and non-residents through correspondent banking relationships abroad. Hence, a distinctive feature of cross-border financial flows is the existence of financial transactions between residents and non-residents of a country.
Central banks and statistical agencies record cross-border flows for Balance of Payment (BoP) reporting and other regulatory purposes. The significance of the central bank's databases is that they provide transactional information concluded at different financial institutions by the same resident/non-resident. The transactional data comprise PII such as a phone number, email address, social security number, and residential address that one can use to identify a specific individual. Importantly, many countries use information privacy laws and policies to protect such sensitive data.
Most research studies focus on the impact of regulatory policies relating to cross-border flows on economic growth (Babus, 2016). Cross-border flows increased substantially in recent years due to rapid developments in financial technology (FinTech) innovations. The FinTech innovations drive productivity growth due to efficient payment systems and reduced online transaction costs (Freund & Weinhold, 2004; Meltzer, 2015; Neanidis, 2019). Migrant worker remittances also play an increasingly prominent role in the economies of many nations, enhancing financial transactions between their home and host countries. Detecting and preventing money-laundering activity in the remittances industry is a crucial area of regulatory concern and a focus of this research.

Anti-money laundering
AML refers to a set of laws, regulations and procedures intended to deter criminals from using the financial sector to disguise cash proceeds from illegal activities as legitimate. Global standards issued by the FATF enable countries to adopt a more flexible set of AML measures. The standards recommend using advanced technology and data mining methods to identify suspicious transactions, requiring no monetary thresholds for reporting suspicious and unusual transactions to regulatory authorities (Cox, 2014; FATF, 2014).
Limited information is available on the costs and benefits of implementing technology for detecting and impeding money laundering, partly due to the difficulties of estimating the volume of money laundering. However, many financial institutions continue to derive business value from the widely available AML systems. Researchers concluded that investment in advanced technology appears to be a cost burden instead of enhancing the deterrence of money laundering (Kang, 2018; Magnusson & Harvey, 2009).

Symmetric-key encryption algorithm using temporary variables
Advances in wireless technology continue to create exponential growth in connected devices, leading to the internet of things (IoT) revolution. IoT comprises millions of connected devices that can sense, compute and communicate, resulting in significant information/data security concerns. Cryptographic methods are primarily used to address such data security concerns, with several proposed algorithms (Deshkar et al., 2017; Sreeja et al., 2019).

Figure 1. A depiction of cross-border financial transactions data flow in South Africa
This paper proposes an encryption algorithm that utilizes temporary enumeration variables, generated automatically during the compilation phase of a computer program, to derive a permutation. The derived permutation is a lookup table. Hence, the proposed technique is analogous to the Permutation Cipher (Stinson & Paterson, 2006). In addition to deriving a permutation, the algorithm uses the temporary variables to compute the weights of the directed and dual-weighted bipartite network. The proposed algorithm can be executed quickly due to its simplicity. Figure 2 depicts the proposed symmetric-key encryption algorithm.
Several software environments for statistical computing, such as SAS® and the R programming language, provide packages for BY-group processing, a technique used to process data grouped by values of one or more common variables. This paper illustrates the proposed symmetric-key encryption algorithm using the SAS® programming language.

Description of variables
A detailed description of the automatically generated temporary variables (N, FIRST.variable, LAST.variable) and other variables used for BY-group processing is as follows:
(1) N is a counter variable that records the number of the record being processed in the dataset. Its initial value is 1, and it is incremented by one whenever a new record is processed.
(2) BY variables are the PII variables by which the dataset is sorted or indexed.
(3) BY values are the values of the BY variables.
(4) BY groups are distinct groups of records with the same BY values. Each BY group therefore collects the records that share one BY value of the BY variable.
(5) FIRST.variable is a Boolean mapping on the BY group variable, which is true if the record being processed is the first record of the BY group and false otherwise.
(6) LAST.variable is a Boolean mapping on the BY group variable, which is true if the record being processed is the last record of the BY group and false otherwise.
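The behaviour of these temporary variables can be mimicked outside SAS. The following minimal Python sketch is an emulation under stated assumptions, not the SAS implementation: the function name `by_group_flags`, the field names and the sample records are hypothetical.

```python
def by_group_flags(records, by_key):
    """Yield (n, first, last, record) for records already sorted by by_key.

    n     : counter starting at 1 (plays the role of N)
    first : True on the first record of a BY group (FIRST.variable)
    last  : True on the last record of a BY group (LAST.variable)
    """
    total = len(records)
    for index, record in enumerate(records):
        n = index + 1
        first = index == 0 or records[index - 1][by_key] != record[by_key]
        last = index == total - 1 or records[index + 1][by_key] != record[by_key]
        yield n, first, last, record


# Hypothetical records, sorted by the BY variable before processing.
rows = sorted(
    [
        {"resident": "Bob", "amount": 50},
        {"resident": "Alice", "amount": 10},
        {"resident": "Alice", "amount": 20},
    ],
    key=lambda r: r["resident"],
)

for n, first, last, record in by_group_flags(rows, "resident"):
    print(n, first, last, record["resident"])
```

The two "Alice" records form one BY group, so only the first of them has the first-record flag set and only the second has the last-record flag set; the single "Bob" record has both flags set.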

Procedure-Part one
(1) Input the original multi-dimensional dataset.
(2) Sort the dataset by the encryption variable to enable the creation of the BY groups.
(3) At the start of dataset processing, FIRST.variable is automatically set to true and N is automatically set to 1.
(4) If the BY value of the next record equals the BY value of the current record, set LAST.variable to false; otherwise, set it to true.
(5) If FIRST.variable is true, concatenate the first letter of the encryption variable with the value of N to obtain the encrypted variable. Retain the value of the encrypted variable.
(6) N automatically increments by one.
(7) If the BY value of the current record equals the BY value of the previous record, set FIRST.variable to false; otherwise, set it to true.
(9) Stop after processing the last record of the dataset.
(10) Repeat the algorithm from Step 1 through Step 9 until all the encryption variables have been encrypted. The resulting dataset is the symmetric key for both encryption and decryption of the PII variables.
(11) To obtain the encrypted dataset, drop the PII variables from the dataset in Step 10.

Figure 2. Symmetric-key encryption using temporary variables
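The steps of Part One can be sketched in Python as follows. This is an illustrative emulation rather than the SAS program used in the paper: the function `encrypt_field`, the prefixes "r" and "nr", the `_enc` suffix and the sample transactions are all assumptions made for the example.

```python
def encrypt_field(records, pii_field, prefix):
    """Label each BY group of pii_field with prefix + N of its first record.

    Returns (keyed_records, lookup). The lookup table maps every original
    value to its label and therefore serves as the key for decryption.
    """
    ordered = sorted(records, key=lambda r: r[pii_field])   # Step 2
    lookup = {}
    label = None                                             # retained value
    keyed = []
    for n, record in enumerate(ordered, start=1):            # N starts at 1
        value = record[pii_field]
        is_first = n == 1 or ordered[n - 2][pii_field] != value
        if is_first:                                         # Step 5
            label = f"{prefix}{n}"
            lookup[value] = label
        keyed.append({**record, pii_field + "_enc": label})
    return keyed, lookup


# Hypothetical transactions; names and amounts are invented for illustration.
transactions = [
    {"resident": "Bob", "non_resident": "Xavier", "amount": 50.0},
    {"resident": "Alice", "non_resident": "Xavier", "amount": 10.0},
    {"resident": "Alice", "non_resident": "Yara", "amount": 20.0},
]

# Step 10: repeat the procedure for every PII variable.
keyed, resident_key = encrypt_field(transactions, "resident", "r")
keyed, non_resident_key = encrypt_field(keyed, "non_resident", "nr")

# Step 11: drop the PII fields to obtain the encrypted dataset.
encrypted = [
    {k: v for k, v in row.items() if k not in ("resident", "non_resident")}
    for row in keyed
]

# Decryption is a reverse lookup in the key table.
decrypt_resident = {label: name for name, label in resident_key.items()}
```

Because the label depends only on the position of a group's first record in the sorted dataset, every record of the same resident receives the same label, which is what keeps the encrypted dataset reusable for grouping and aggregation.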

Procedure-Part two
(1) Input the dataset from Part One.
(2) Sort the dataset by the encrypted variables to enable the creation of BY groups.
(3) At the start of dataset processing, FIRST.variable is automatically set to true and N is automatically set to 1.
(4) Initialize transaction count and transaction amount to zero.
(5) If the BY value of the next record equals the BY value of the current record, set LAST.variable to false; otherwise, set it to true.
(6) Increment transaction count by one and add the transaction amount of the current record to the running transaction amount.
(7) If LAST.variable is true, output the current record.
(8) N automatically increments by one.
(10) Stop after processing the last record of the dataset.
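Part Two amounts to a grouped aggregation over the encrypted identifiers. The Python sketch below is a hedged illustration: it assumes that the counters of Step 4 are reset at the first record of every BY group, which the procedure above does not state explicitly, and the function name, field names and records are hypothetical.

```python
def dual_weights(records, by_fields, amount_field="amount"):
    """Return one (by values..., transaction count, total amount) row per BY group."""
    def group_key(record):
        return tuple(record[f] for f in by_fields)

    ordered = sorted(records, key=group_key)                 # Step 2
    edges = []
    count, total = 0, 0.0
    for i, record in enumerate(ordered):
        key = group_key(record)
        is_first = i == 0 or group_key(ordered[i - 1]) != key
        is_last = i == len(ordered) - 1 or group_key(ordered[i + 1]) != key
        if is_first:                                         # assumed reset per group
            count, total = 0, 0.0
        count += 1                                           # Step 6
        total += record[amount_field]
        if is_last:                                          # Step 7
            edges.append((*key, count, total))
    return edges


# Hypothetical encrypted records.
encrypted_records = [
    {"resident_enc": "r1", "non_resident_enc": "nr1", "amount": 10.0},
    {"resident_enc": "r1", "non_resident_enc": "nr3", "amount": 20.0},
    {"resident_enc": "r3", "non_resident_enc": "nr1", "amount": 50.0},
    {"resident_enc": "r1", "non_resident_enc": "nr1", "amount": 5.0},
]

edge_list = dual_weights(encrypted_records, ["resident_enc", "non_resident_enc"])
for edge in edge_list:
    print(edge)
```

Each output row corresponds to one directed edge of the network, carrying the transaction count and the total financial value as its two weights.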

Advantages and disadvantages of the symmetric-key encryption algorithm
The algorithm's decryption operation uses a technique similar to the Permutation Cipher; hence, it is not computationally intensive. The algorithm does not provide descriptive statistics of a demographic nature, thereby reducing its susceptibility to linkage attacks.
The algorithm's safety depends on the security of the channel used to exchange the decryption key. However, it is essential to note that it is technically impossible to stop a person who is duly authorized to access confidential information from improperly disclosing that information to someone else. The proposed algorithm is not suitable for encrypting and decrypting live databases due to its requirement to store the symmetric key. Its effectiveness is limited to multi-dimensional datasets due to the group structure of such datasets.

Network structure and representation
Most graph-based models for countering money laundering activity consider single-step transfers, ignoring the multiple financial transactions concluded through different institutions by individuals or firms (Tang & Yin, 2005; Liu et al., 2017; Lv et al., 2008; Paula et al., 2016; Prakash et al., 2010; Rajput et al., 2014). Recently, researchers proposed a multipartite network model capable of detecting money laundering involving high-volume flows of funds from source to destination accounts via layers of middle accounts (Li et al., 2020; Sun et al., 2021). This paper proposes a directed and dual-weighted graph with weights representing the monetary value and volume of transactions, accounting for the dependencies between financial transactions while focusing on the beneficiary's identity.
The cross-border financial flow network structure is similar in design to citation networks, which enable researchers to identify the important literature in a specific field within a relatively short time and with less effort. Likewise, the cross-border financial flow network allows financial institutions and regulatory organizations to quickly identify the important residents/non-residents in enormous datasets comprising international financial transactions.
Formally, the cross-border financial flow network is defined as the directed and weighted bipartite graph $G = (V, A, w)$, with $V(G) = V_R \cup V_{NR}$ and $A(G) \subseteq (V_R \times V_{NR}) \cup (V_{NR} \times V_R)$, where the disjoint sets $V_R = \{r_1, \ldots, r_k\}$ and $V_{NR} = \{nr_1, \ldots, nr_p\}$ represent the resident vertex set and the non-resident vertex set, with $|V_R| = k$ and $|V_{NR}| = p$, respectively. The set $A(G)$ represents the direction of financial flows, where the outward payments flow from residents to non-residents and the inward payments flow from non-residents to residents. The weight function $w$ computes the sum of transaction counts and the sum of the financial value of transactions.
Figure 3 shows a schematic depiction of the cross-border financial flow network with $k = 5$ and $p = 9$. The weight $a_{ij}$ denotes the total number of transactions from resident $r_i$ to non-resident $nr_j$, whereas $a'_{ij}$ denotes the total number of transactions from non-resident $nr_j$ to resident $r_i$. Similarly, the network structure could be depicted with the weights $b_{ij}$ and $b'_{ij}$, representing the total financial value of transactions from residents to non-residents and from non-residents to residents, respectively, or with both sets of weights.
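To obtain the adjacency matrix representation of the cross-border financial flow network, we denote the set of $k \times p$ matrices with non-negative real entries by $\mathbb{R}^{k \times p}$ and arrange the node set $V_R \cup V_{NR}$ in the order $r_1, \ldots, r_k, nr_1, \ldots, nr_p$. The adjacency matrix comprises elements of $\mathbb{R}^{k \times p}$ with entries $A = [a_{ij}]$ and $B = [b_{ij}]$, such that:

$$A = \begin{cases} a_{ij} & \text{if resident } r_i \text{ transferred funds to non-resident } nr_j \\ 0 & \text{otherwise} \end{cases}$$

where $a_{ij}$ = total number of transactions associated with the edge $(r_i \to nr_j)$, and

$$B = \begin{cases} b_{ij} & \text{if resident } r_i \text{ transferred funds to non-resident } nr_j \\ 0 & \text{otherwise} \end{cases}$$

where $b_{ij}$ = total financial value of transactions associated with the edge $(r_i \to nr_j)$.

Similarly, the entries of the matrices $A' \in \mathbb{R}^{p \times k}$ and $B' \in \mathbb{R}^{p \times k}$ are such that:

$$A' = \begin{cases} a'_{ij} & \text{if non-resident } nr_j \text{ transferred funds to resident } r_i \\ 0 & \text{otherwise} \end{cases}$$

where $a'_{ij}$ = total number of transactions associated with the edge $(nr_j \to r_i)$, and

$$B' = \begin{cases} b'_{ij} & \text{if non-resident } nr_j \text{ transferred funds to resident } r_i \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where $b'_{ij}$ = total financial value of transactions associated with the edge $(nr_j \to r_i)$.

Using the transaction counts as weights, the adjacency matrix $F$ of the cross-border financial flows network is of the form:

$$F = \begin{pmatrix} 0_{k,k} & A \\ A' & 0_{p,p} \end{pmatrix}$$

where $0_{k,k}$ and $0_{p,p}$ represent the $k \times k$ and $p \times p$ zero matrices.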
The adjacency matrix representation of the cross-border financial flow network is inefficient due to the large number of zero entries. Hence, this paper uses a list of financial transaction records, discarding the zero entries.
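The contrast between the two representations can be made concrete with a short Python sketch. It assumes the node ordering r_1, ..., r_k, nr_1, ..., nr_p described above, with resident-to-non-resident counts in the upper-right block and non-resident-to-resident counts in the lower-left block; the function name and the edge lists are hypothetical.

```python
def block_adjacency(residents, non_residents, outward, inward):
    """Assemble the (k + p) x (k + p) adjacency matrix from sparse edge lists.

    outward: {(resident, non_resident): transaction count}
    inward : {(non_resident, resident): transaction count}
    """
    k, p = len(residents), len(non_residents)
    r_index = {name: i for i, name in enumerate(residents)}
    nr_index = {name: k + j for j, name in enumerate(non_residents)}
    F = [[0] * (k + p) for _ in range(k + p)]
    for (r, nr), count in outward.items():    # upper-right block
        F[r_index[r]][nr_index[nr]] = count
    for (nr, r), count in inward.items():     # lower-left block
        F[nr_index[nr]][r_index[r]] = count
    return F


# Hypothetical network with k = 2 residents and p = 3 non-residents.
residents = ["r1", "r2"]
non_residents = ["nr1", "nr2", "nr3"]
outward = {("r1", "nr1"): 2, ("r2", "nr3"): 1}
inward = {("nr2", "r1"): 4}

F = block_adjacency(residents, non_residents, outward, inward)
zero_entries = sum(1 for row in F for value in row if value == 0)
print(zero_entries, "of", len(F) ** 2, "entries are zero")
```

Even in this tiny example the dense matrix stores 25 entries to describe three edges, whereas the edge lists store only the three non-zero records.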

Centrality measure based on two weights
The proposed centrality measure uses the matrix multiplication method to identify the nodes responsible for unusual transaction patterns. Such nodes include the multiple residents who transfer funds to the same non-resident and vice versa, as well as resident nodes transacting large financial values using high transaction volumes. Criminal networks often use these strategies to stay below thresholds and avoid triggering alerts.

Figure 3. Schematic depiction of the cross-border financial flows network
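Formally, consider the dual weights of the cross-border financial flow network, $A = [a_{ij}]$ and $B = [b_{ij}]$, each comprising $n = k \times p$ entries. Hence, $W = AB^{T}$ is the $k \times k$ matrix whose entries $W_{ij} = \sum_{l=1}^{p} a_{il} b_{jl}$ are the products of the rows of $A$ and the columns of $B^{T}$.
Define the centrality measure for node $i$ as

$$C_i = \sum_{j=1}^{k} C_{ij}, \qquad C_{ij} = \frac{W_{ij}}{\lVert W \rVert}$$

where $\lVert W \rVert = \sum_{i} \sum_{j} W_{ij}$ is the sum of all the entries of the matrix $W$ and $\sum_i C_i = 1$.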
The diagonal entries $C_{ii}$ are the sum of the product of transaction volume and financial value for Resident $i$. Large diagonal entries indicate node dominance, which could be due to significant fund transfers or multiple transactions (or a combination thereof). Hence, the diagonal elements provide a mechanism for identifying nodes with i) large volume and sizeable financial value, ii) large volume and low financial value and iii) low volume and sizeable financial value. The onus is on the financial institutions and regulatory organizations to verify the financial flows in the event of extreme importance.
If $C_{ij} > 0$ for $i \neq j$, then Resident $i$ and Resident $j$ transferred funds to the same non-resident during the period. In that case, $C_{ij}$ equals the product of the number of transactions for Resident $i$ and the financial value of transactions for Resident $j$. Hence, the centrality measure for Resident $i$ increases with the number of Resident $i$'s neighbours in the network. It is not the absolute value of the measure that matters but the high or low centrality measure of each node relative to the others.
The centrality measure based on matrices $A$ and $B$ can shed some light on the importance of each of the resident nodes in the cross-border financial flow network. The centrality measure for non-resident nodes is similarly defined, where $C = A'B'^{T}$, measuring the importance of non-resident nodes in the cross-border financial flow network.
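A small numerical sketch may help to show how the dual weights interact. It assumes the measure is the row sum of the product of the count matrix and the transposed value matrix, normalised by the sum of all entries so that the scores add up to one; the function name and the three-resident network below are hypothetical.

```python
def dual_weight_centrality(A, B):
    """Return (W, C, scores) for count matrix A and value matrix B (both k x p).

    W      : product of A and the transpose of B (k x k)
    C      : W divided by the sum of all its entries
    scores : row sums of C, one centrality score per resident
    """
    k, p = len(A), len(A[0])
    W = [[sum(A[i][l] * B[j][l] for l in range(p)) for j in range(k)]
         for i in range(k)]
    total = sum(sum(row) for row in W)
    if total == 0:
        raise ValueError("the network has no weighted edges")
    C = [[w / total for w in row] for row in W]
    scores = [sum(row) for row in C]
    return W, C, scores


# Hypothetical network: residents 0 and 1 pay the same non-resident,
# resident 2 pays a different one.
A = [[2, 0],      # transaction counts
     [1, 0],
     [0, 3]]
B = [[1000, 0],   # total financial values
     [10, 0],
     [0, 90]]

W, C, scores = dual_weight_centrality(A, B)
for resident, score in enumerate(scores):
    print(resident, round(score, 4))
```

In this example resident 1 has the lowest transaction value, yet it ranks second because it shares a non-resident with the dominant resident 0: the off-diagonal entry for the pair is positive, while the entries linking either of them to resident 2 are zero.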

Results
This section illustrates the symmetric-key encryption algorithm as well as the proposed centrality measure. It makes use of a hypothetical dataset comprising cross-border financial flows, structurally similar to the dataset extracted from the South African Reserve Bank (SARB)'s international financial transaction database. We also present a visualization of the cross-border financial flow network using SARB's dataset. The section concludes with a comparative analysis of the proposed centrality measure with the metric defined in Equation 1, but for directed networks (Opsahl et al., 2010).

Encrypting the cross-border financial flows dataset
A computer program generated using SAS® Enterprise Guide 7.1, Copyright © 2014, SAS Institute Inc., Cary, NC, USA, was used to encrypt both the hypothetical dataset and the dataset drawn from the SARB. The SARB data extract contained 28,649,763 financial transaction records for the 2014-2015 calendar years, comprising six data fields, namely: resident name, transaction date, flow date, non-resident name, BoP category and transaction amount. Table 1 shows a sample of 10 network observations from the encrypted list of international financial transactions, together with a display of the transient variables generated during the compilation phase of the SAS® program. To interpret the network data in Table 1, consider observation number 2000 in the network data (first observation in Table 1). In this case, a resident labelled r10075604 paid a non-resident labelled nr15213089 an amount of USD 46.40 in a single transaction. The second observation in Table 1 shows that the same resident paid non-resident nr15557964 a total amount of USD 1430.84 in 19 financial transactions during the period. The table shows the state of the transient variables for illustration purposes.

Adjacency matrix representation and network visualization
Table 3 shows the adjacency matrix representation of the cross-border financial flow network constructed from the hypothetical dataset. The table entries are the co-ordinate pairs (x, y), representing the total transaction volume and the total financial value, respectively.
To interpret the adjacency matrix, consider the entry in the ninth column of the third row, (2, 550). This entry indicates that the resident labelled R8 paid the non-resident labelled NR7 the total amount of USD 550 in two transactions, corresponding to the total number and financial value of transactions made by Linda in Table 2.
The cross-border financial flows network's visualization depicts the directed and weighted bipartite graph comprising two disjoint node sets. Figure 4 shows the visualization based on the dataset drawn from the SARB's database, created using SAS® Visual Analytics software, 7.4, Copyright 2014-2017, SAS Institute Inc., Cary, NC, USA. The links connect the nodes of different colors.

Network measure for the cross-border financial flows network
We compute the product matrix $W$ from the two matrices $A$ and $B$ shown in the top-right block of Table 3, representing the network weights of the hypothetical dataset, and normalise its entries. Adding the entries of each row of the normalised matrix yields the proposed centrality measure (Equation 15).
Table 4 shows the results of four centrality measures used for computing node importance in the cross-border financial flow network, based on node degree, transaction volume, transaction value, as well as a combination of transaction volume and monetary value. We obtained the latter results from Equation 15. In addition, Table 4 includes the results obtained using Equation 13 for comparison purposes.
All the measures indicate that R4 (Elizabeth) is more important than the other nodes in the hypothetical network due to the large financial value of its transactions, as expected. The interesting node is R18 (Rosalia), to which the measure $C_D^{W\alpha}$ assigns few centrality points due to its low monetary value. However, $C_i$ ranks node R18 as the second-most important node due to its connection with R4. Furthermore, note that $C_D^{W\alpha}$ allocates more centrality points to R10 than to R13 when $\alpha = 0.5$, but quickly reverses their order of importance as the tuning parameter α approaches 1. The desirable property of the proposed centrality measure is its ability to allocate centrality points to nodes connected to other vital nodes in the network.
The proposed measure's ability to identify highly connected nodes can enable analysts to exploit the cross-border financial flow network structure to understand node behaviors. For example, it can be possible to locate sub-networks in Figure 4. Figure 5 shows one of the sub-networks of the cross-border financial flow network constructed using the sample dataset extracted from the SARB's database of international financial transactions.

Discussion and conclusion
The need to analyze cross-border financial transactions to extract meaningful insights from the multidimensional dataset while preserving PII motivated this study. The paper proposed a symmetric-key encryption algorithm at the intersection of classical cryptography and data obfuscation methods, leveraging advanced technology and the dataset's group structure to gain computational efficiencies.
Performance studies comparing the proposed algorithm with other privacy-preserving techniques are a subject for further research. The algorithm's lack of suitability to function in live databases is its primary deficiency. In addition, its secrecy is a topic for further investigation.
The literature's lack of metrics incorporating the weights of the directed and dual-weighted network motivated the development of a new measure of node importance. The proposed metric uses the network's dual weights to allocate centrality scores. This study compared the proposed centrality measure with the existing one in the literature, $C_D^{W\alpha}$, which considers the number of edges, the edge weights, and a tuning parameter for controlling the relative importance of these two aspects (Opsahl et al., 2010). Experimental results indicated that while $C_D^{W\alpha}$ can identify the important nodes of the cross-border financial flow network, it fails to recognize the nodes connected to other more important nodes.
Financial institutions and regulatory organizations can use the proposed privacy-preserving data mining approach to improve the analysis and monitoring of cross-border financial transactions and curb money laundering. In particular, the tools suggested in this study can enable the implementation of the FATF recommendation stipulating that financial institutions must draw specific attention to all complex, large, and unusual transaction patterns that have no apparent economic value or visible lawful purpose.
The existing literature on exploiting nodes' communities to identify anomalies in bipartite graphs does not apply directly to dual-weighted networks such as the cross-border financial flow network (Sun et al., 2005a; Akoglu et al., 2015). Therefore, further research is necessary to understand the network structure and dynamics of directed and dual-weighted bipartite networks.