Big data and credit risk assessment: a bibliometric review, current streams, and directions for future research

Abstract
This study tracks the structural development of academic research on credit risk assessment and big data using bibliometric analysis. The bibliography is obtained from the Scopus database and contains all studies with citations published between 2012 and 2021. The findings suggest that credit risk assessment and big data form a vast field in which research output has increased significantly over the last nine years. Chinese researchers and organizations contributed the most documents. The study concludes that several possibilities exist to improve the knowledge of credit risk assessment and big data.


Introduction
Credit risk arises when a debtor fails to repay, or delays repaying, all or part of a debt under a debt contract. Anderson (2013) defines credit risk as the "probability that a legally enforceable contract would become worthless (or at least significantly decreased in value) due to the counterparty defaulting and going out of business." According to Saunders and Cornett (2017), it is "the risk of not being able to fully pay out the promised cash flow from financial institutions' loans and securities." Hence, credit risk develops because of default in derivative agreements between debt issuers and counterparties.
The relevance of credit risk has been widely acknowledged among academics across disciplines. Consequently, credit risk measures have received considerable attention, particularly in the corporate finance sector. Big data can be utilized to obtain a deeper understanding of the processes and credit status of various companies (Wang et al., 2020). This aids better risk assessment, thus decreasing the level of default.
Hence, several studies have attempted to use big data to develop credit risk assessment models that will aid lenders in decreasing the level of credit default. These models were designed to enhance the entire process and improve the quality of assessment. Therefore, this work provides a novel approach to further explore credit risk assessment and big data by conducting a bibliometric assessment to uncover distinct notions within this large body of literature, as a comprehensive bibliometric analysis on this topic is missing. This is necessary to identify what has been accomplished and what is required in future research.
The current study listed and analyzed all the articles and studies related to credit risk assessment and big data in journals published between 2012 and 2021 (July) using the Scopus database. This study explains the recent research progress in credit risk assessment and big data, which articles were more cited, and which authors and journals contributed the most to the field.
This study contributes significantly to credit risk assessment and big-data research. First, the work describes, organizes, and identifies key institutions, journals, publications, and authors for future studies. Additionally, we identify significant keywords and cite studies that could prepare future researchers to conduct new research on this topic. The study also offers an overview of the history of the research and summarizes and identifies present and developing research streams in credit risk assessment and big data.
This study combines the big data and credit risk assessment aspects. It performs a bibliometric analysis, different from other bibliometric studies published in this regard, as existing studies only focused on one of the two dimensions.
The results of this study show that the National Natural Science Foundation of China is the top sponsoring entity, and the Journal of Physics: Conference Series has the most published studies associated with credit risk assessment and big data research. The Electronic Commerce Research and Applications Journal has the most cited articles.
Zhang Y. published the most articles on the topic, whereas Yu Y. had the most citations. Finally, China is the top country in credit risk assessment and big data research. Hence, it can be concluded that Chinese authors, institutes, and organizations have dominated credit risk assessment, big data, and related research in the last few years.

History of credit risk assessment approaches
Credit risk is an essential form of financial risk and is frequently seen as the earliest type of financial market risk, dating back to around 1800 BCE in the ancient Egyptian era (Caouette et al., 2011). Altman's Z-score (Benzschawel, 2012), based on a multivariate discriminant analysis of five accounting measures, was the first contemporary quantitative credit risk assessment model. Even though it is 50 years old, the Z-score remains an important instrument for many market participants (Benzschawel, 2012). However, because accounting ratios depend on past data, the approach has been criticized for being backward-looking and sporadic. For this reason, other credit risk models, such as structural and reduced-form models, have been developed. Black and Scholes (1972) and Merton (1973) are credited with developing structural credit risk assessment models. In line with capital structure theory (Modigliani & Miller, 1958), structural models imply that a default occurs when a company's assets are worth less than its debt. Black and Scholes (1972) employ an options pricing model to price debt and equity, proving that call options on equity may affect the debt value. The issue with the Black-Scholes model is that a company's asset values cannot be directly monitored. Merton (1974) continues this work by showing that the asset value may, under some assumptions, be calculated and then used to calculate the probability of a default, which he calls the "distance to the default." The Merton model is therefore considered the most important structural model for credit risk modeling.
In contrast to structural models, reduced-form models can identify the risk of default without assuming the cause of the credit risk premium (Benzschawel, 2012). Reduced-form models are based on risk-neutral pricing theory, which asserts that a risky investment's market value equals the present value of its expected future cash flows discounted at the risk-free rate (e.g., the US Treasury rate). Jarrow and Turnbull (1995), Duffie and Singleton (1999), and many others use risk-neutral pricing theory to estimate credit risk. The reduced-form method has also been a prominent paradigm (Diaz Weigel & Gemmill, 2006).
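To make the structural approach described in this section concrete, the following minimal Python sketch computes a Merton-style distance to default and the implied default probability. It uses the common textbook form of the calculation, which is not spelled out in the studies cited here; the function names, the use of the physical asset drift, and the firm's figures are illustrative assumptions rather than values taken from any dataset.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def distance_to_default(asset_value: float, debt_face_value: float,
                        asset_drift: float, asset_volatility: float,
                        horizon_years: float = 1.0) -> float:
    """Number of standard deviations between expected log assets and the debt level."""
    numerator = (math.log(asset_value / debt_face_value)
                 + (asset_drift - 0.5 * asset_volatility ** 2) * horizon_years)
    return numerator / (asset_volatility * math.sqrt(horizon_years))


def default_probability(asset_value: float, debt_face_value: float,
                        asset_drift: float, asset_volatility: float,
                        horizon_years: float = 1.0) -> float:
    """Probability that assets fall below the debt level at the horizon."""
    dd = distance_to_default(asset_value, debt_face_value, asset_drift,
                             asset_volatility, horizon_years)
    return normal_cdf(-dd)


# Hypothetical firm: assets of 120, debt of 100, 5% drift, 25% asset volatility.
dd = distance_to_default(120.0, 100.0, 0.05, 0.25)
pd_one_year = default_probability(120.0, 100.0, 0.05, 0.25)
print(f"distance to default: {dd:.3f}, default probability: {pd_one_year:.3f}")
```

In practice, the asset value and asset volatility are not observed directly and must be inferred, for example from equity prices; the sketch leaves that estimation step out.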

Components of credit risk
Default, spread, and downgrade risks are the three main components of credit risk (Anson et al., 2004). Default risk is the risk that the issuer or counterparty will not fulfill the terms of a financial contract's obligation. Credit spread risk is the risk of loss or underperformance caused by a widening credit spread. The credit spread illustrates how financial markets react to a projected deterioration in an issue's credit quality. Downgrade risk is the danger of a credit rating deterioration: the issuer is downgraded when a rating agency assigns a lower grade than the prior one. These three forms of credit risk are inextricably linked.

Types of credit risk assessment models
Credit risk assessment models are often divided into two groups: qualitative and quantitative (Saunders & Cornett, 2017). In qualitative models, the relevant variables include the characteristics of the borrower (e.g., reputation, leverage, volatility of income, and collateral) and those of the market (e.g., the business cycle and the level of interest rates). A subjective assessment of these variables determines whether an applicant may be granted credit. Quantitative models seek to obtain a credit score to establish the probability of default or to sort borrowers into distinct default risk categories (Saunders & Cornett, 2017).
Quantitative credit risk modeling is not a simple process because default, a component of credit risk, seldom occurs (Anson et al., 2004). Nevertheless, several quantitative credit risk models have evolved over the years. They frequently rest on a particular theoretical basis, such as ratios, option theory, econometrics, or expert systems (Caouette et al., 2011). Econometrics, simulation, optimization, or a combination of these are frequently used to build financial models. Credit risk models may therefore be classified along three dimensions: methodologies utilized, application fields, and products involved (Caouette et al., 2011). Caouette et al. (2011) define econometric methods as statistical models in which the likelihood of default is the dependent variable. The available models include linear probability, logit, probit, linear discriminant analysis, and other regression methods (Altman & Saunders, 1997; Saunders & Cornett, 2017). Neural networks are computer-based systems (that function similarly to the human brain) that employ the same data as econometric models to make choices, typically through trial and error. They frequently look for connections between the variables of discrete choice models (e.g., the logit model). Hybrid models integrate structural models with additional financial variables (e.g., the book value of assets and liabilities, net income, and return on equity) to approximate a company's probability of default (Benzschawel, 2012). The KMV and HPD models (Sobehart & Keenan, 2001) are two examples.
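As an illustration of an econometric model in which the likelihood of default is the dependent variable, the following Python sketch evaluates a logit model for a single borrower. The variable names, intercept, and coefficients are hypothetical and chosen only for illustration; in practice they would be estimated from historical default data.

```python
import math


def logit_default_probability(intercept: float,
                              coefficients: dict[str, float],
                              features: dict[str, float]) -> float:
    """Map a linear score of borrower features to a default probability."""
    score = intercept + sum(coefficients[name] * features[name]
                            for name in coefficients)
    # The logistic function keeps the result strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical coefficients: more leverage and income volatility raise the
# default probability, while more collateral lowers it.
coefficients = {"leverage": 2.0, "income_volatility": 1.5, "collateral_ratio": -1.0}
borrower = {"leverage": 0.6, "income_volatility": 0.3, "collateral_ratio": 0.8}

pd_estimate = logit_default_probability(-3.0, coefficients, borrower)
print(f"estimated probability of default: {pd_estimate:.3f}")
```

A lender could then compare the estimate with a cut-off to sort borrowers into default risk categories, in the spirit of the quantitative models described above.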
Multiple classifier systems (MCS) have evolved in the last few years to improve on the performance of a single prediction model in financial credit risk assessment. Many researchers have demonstrated that the MCS technique may generate better results than individual credit risk evaluation models (Verikas et al., 2010; Zhou, 2012). Nanni and Lumini (2009) explore four ensemble methods using four different classifiers on three financial datasets. They find that the random subspace method generates the largest area under the ROC curve (AUC).
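Because the AUC is the yardstick used to compare these classifiers, a minimal Python sketch of how it can be computed may be helpful. The sketch uses the pairwise-comparison definition (the share of defaulter/non-defaulter pairs in which the defaulter receives the higher risk score); the labels and scores are made-up examples, not results from any of the cited studies.

```python
def auc_score(labels: list[int], scores: list[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a defaulting borrower, 0 for a non-defaulting borrower.
    scores: predicted risk scores, where higher means riskier.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one defaulter and one non-defaulter")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0   # defaulter correctly ranked as riskier
            elif pos == neg:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up example: two defaulters and three non-defaulters.
labels = [1, 0, 1, 0, 0]
scores = [0.90, 0.40, 0.35, 0.20, 0.10]
print(f"AUC = {auc_score(labels, scores):.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why a larger value indicates a better classifier or ensemble.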
Credit risk assessment models are further classified into two types: (i) consumer and (ii) corporate. Although both types share the same underlying assumptions, much of the existing literature focuses on the corporate perspective. Credit risk analysis was first used to assess consumer credit risk in the period after 1950 to estimate the creditworthiness of retail customers (Thomas et al., 2005). Consumer credit risk models include neural networks and expert system/decision tree models. Altman's model (Altman & Saunders, 1997) is the first quantitative model of business credit (Benzschawel, 2012). Then came the publications of Black and Scholes (1972), Merton (1974), and Jarrow and Turnbull (1995), which provided the groundwork for corporate credit risk modeling research (Kealhofer, 2003).

Big data and credit risk assessment
Wagdi and Tarek (2022) examine the efficiency of technology models in credit risk-scoring modeling in developing markets. Drawing on an examination of the Egyptian banking sector, they suggest evaluation approaches for credit risk scoring of present and prospective debtors. To this end, they propose and investigate a framework for integrating big data and artificial neural networks, based on systematic and unsystematic risk for the macroeconomic environment and the features of present and prospective debtors.
Moreover, Pérez-Martín et al. (2018a) highlight the massive number of databases that financial corporations manage. It has become essential to address this issue by applying big data methods to enormous financial datasets in order to segment risk groups. Several Monte Carlo experiments are applied to massive datasets using known techniques and algorithms. Additionally, a linear mixed model (LMM) is employed as the latest incremental contribution to assess the credit risk of financial firms. The results show that big data can assist in extracting the value of data; therefore, superior choices can be made without the runtime element.
Additionally, Md et al. (2020) examine the present trend of how financial sectors deal with big data and demonstrate how big data affects diverse financial areas. More precisely, they demonstrate its influence on financial markets, financial institutions, and the relationship with Internet finance, financial management, Internet credit service companies, fraud detection, risk analysis, and financial application management.
Likewise, Addo et al. (2018a) state that as big data technology, data availability, and computing power advance, several banks and lending organizations are updating their business models accordingly. Credit risk forecasts, supervision, model consistency, and efficient loan management are crucial to decision-making and transparency. Addo et al. (2018a) focus on building a binary classifier model, based on machine and deep learning models applied to real data, to forecast the loan default probability.
In addition, Mhlanga (2021) reveals that artificial intelligence and machine learning strongly influence credit risk assessments by employing alternative data sources, such as public data, to address the issues of information asymmetry, adverse selection, and moral hazard. Lenders can thereby perform a thorough credit risk analysis, evaluate the consumer's behavior, and validate the customer's ability to repay loans, enabling less fortunate individuals to access credit.
Big data approaches have recently been added to enhance credit risk assessment procedures. Big data is fundamentally a collection of data that can be obtained, saved, controlled, and analyzed by a computer in a short period. Big data has enhanced the effectiveness of data transmission, storage, management, and sharing. Big data analysis technology can process these massive amounts of credit risk data, enhancing the accuracy and rigor of risk forecasting and early warning (Du et al., 2021). Additionally, Cui (2015) states that the ideas, techniques, and means of big data are being incorporated into recent evaluation systems. The aim is to increase the breadth, depth, and timeliness of the information captured and to use scientific methods to mine the key value in huge datasets, which is effective for enhancing risk evaluation and prediction.
Hence, this study uses a longitudinal approach because articles are linked through time (Small, 1999), and credit risk models incorporate various ideas and notions from other fields. Moreover, because big data is also evolving in the credit risk assessment process, it is essential to study the link and how the topic is evolving. Using a multidisciplinary assessment of the literature, the study addresses three research questions, adapted from Fetscherin and Heinrich (2015):
(1) What is the evolution of the idea of credit risk assessment and big data, what are the main research streams, and which require more attention?
(2) Which studies, articles, and authors are the most referenced and worth reading on this subject for future studies?
(3) What are the most significant institutions, organizations, and countries contributing to credit risk assessment and big data?

Methodology
The quantitative study of science aims to improve understanding, and bibliometric analysis plays an essential part in this field (Van Raan, 2004). Bibliometric analysis is the quantitative examination of the technical terminology, production, development, cooperation, and usage of scientific publications. In recent years, bibliometric analysis has become popular for measuring and analyzing scientists' output, collaboration among authors and institutions, output comparison, highly cited outputs, and co-citation analysis. According to Ellegaard and Wallin (2015), bibliometric analysis is an important aspect of research assessment techniques, especially in the scientific and practical domains. Skute et al. (2019) discuss the use of bibliometric analysis for collaboration between industry and institutes. Hence, bibliometrics analyzes the quantitative characteristics of research using statistical and mathematical approaches (Broadus, 1987). According to Cobo et al. (2011), bibliometric mapping is a 3D depiction of the relationships between disciplines, fields, areas, and specific publications or authors.
Moreover, bibliometric analysis identifies focal studies and objectively depicts the links among publications on a particular research subject by assessing how many times they have been co-cited by other published works (Apriliyanti & Alon, 2017; Fetscherin & Heinrich, 2015). The findings may be used to determine the popularity of important authors, their articles, and their effects. Thus, bibliometric analysis makes it easier to evaluate meta-analytics, create and discover key research streams, and develop basic theoretical agendas (Apriliyanti & Alon, 2017; Fetscherin & Heinrich, 2015; Nobanee, 2020, 2021; Nobanee et al., 2021). Bibliometric analysis rests on the observation that academics publish their crucial discoveries and findings in research journals and typically base their work on studies previously published in similar publications (Van Raan, 2003). The unit of analysis in any citation analysis is the citation (Kim & McMillan, 2008), and the analysis goes beyond the basic listing of research articles to incorporate centers of excellence (Fetscherin & Usunier, 2012) and map connectivity among research field publications.
Bibliometric analysis is therefore the most suitable methodology for our research, as it helps identify critical studies on this topic. Moreover, it helps illustrate the connections among publications on this topic by evaluating how many times they have been co-cited by other published works. It thus facilitates the evaluation of key research streams and the development of basic theoretical agendas, unlike other methods, which are limited to delivering only a literature review on this topic.
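The co-citation idea described above can be sketched in a few lines of Python: two references are co-cited once for every citing paper whose reference list contains both. The sketch below is only an illustration of the counting principle, not the procedure implemented in VOSviewer, and the paper and reference identifiers are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_citation_counts(reference_lists: dict[str, list[str]]) -> Counter:
    """Count how often each pair of references appears in the same reference list."""
    counts: Counter = Counter()
    for references in reference_lists.values():
        # Deduplicate and sort so each pair is counted once per citing paper
        # and always stored in the same order.
        for pair in combinations(sorted(set(references)), 2):
            counts[pair] += 1
    return counts


# Hypothetical citing papers and the references they cite.
citing_papers = {
    "paper_A": ["ref_1", "ref_2", "ref_3"],
    "paper_B": ["ref_1", "ref_2"],
    "paper_C": ["ref_2", "ref_3"],
}

counts = co_citation_counts(citing_papers)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are the ones a bibliometric map would draw close together, which is how clusters of related studies and research streams become visible.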
Hence, the current study uses information from Scopus, Elsevier's abstract and citation database. The data are retrieved from Scopus for 2012-2021 using the keywords summarized in Table 1. The most frequently used keywords and their occurrences are summarized in Table 2. Table 2 shows that the most frequent keyword in the search results was "Big Data," with a total of 145 occurrences. The keyword "Risk Assessment" occurred 80 times. The details of other major keywords are shown in the table.
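A keyword-occurrence table of this kind can be reproduced from exported records with a simple count. The following Python sketch assumes each record is available as a list of keyword strings and counts a keyword once per document after normalizing case and whitespace; the three records are invented for illustration and are not drawn from the Scopus export used in this study.

```python
from collections import Counter


def keyword_occurrences(records: list[list[str]]) -> Counter:
    """Count in how many documents each keyword occurs."""
    counts: Counter = Counter()
    for keywords in records:
        # Normalize case and whitespace, and count each keyword once per document.
        counts.update({keyword.strip().lower() for keyword in keywords})
    return counts


# Invented keyword lists for three documents.
records = [
    ["Big Data", "Credit Risk"],
    ["big data", "Risk Assessment"],
    ["Risk Assessment", "Big Data ", "Machine Learning"],
]

top_keywords = keyword_occurrences(records).most_common(2)
print(top_keywords)
```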
Very few empirical studies have tried to examine credit risk assessment and big data using bibliometric analysis. This study will help scholars better understand credit risk management and big data as a topic that is genuinely important to today's economic and financial systems. The current investigation adds to the academic literature by shedding light on credit risk assessment and big data and by identifying and illustrating key patterns in this domain using Scopus and VOSviewer.

Table 1. Search conditions
Search query: TITLE-ABS-KEY (("big data" AND "credit risk") OR ("big data" AND "credit expos*") OR ("big data" AND "loan risk") OR ("big data" AND "default risk") OR ("big data" AND "credit default") OR ("big data" AND "risk of default") OR ("big data" AND "insolvency risk") OR ("big data" AND "exposure at default") OR ("big data" AND "downgrade risk") OR ("big data" AND "credit rating") OR ("big data" AND "credit score") OR ("big data" AND "credit rate") OR ("big data" AND "credit assessment") OR ("big data" AND "credit evaluation") OR ("big data" AND "credit analysis") OR ("big data" AND "credit valuation") OR ("big data" AND "credit appraisal") OR ("big data" AND "credit grading") OR ("big data" AND "credit check") OR ("big data" AND "credit history"))
Search query after refining: TITLE-ABS-KEY (("big data" AND "credit risk") OR ("big data" AND "credit expos*") OR ("big data" AND "loan risk") OR ("big data" AND "default risk") OR ("big data" AND "credit default") OR ("big data" AND "risk of default") OR ("big data" AND "insolvency risk") OR ("big data" AND "exposure at default") OR ("big data" AND "downgrade risk") OR ("big data" AND "credit rating") OR ("big data" AND "credit score") OR ("big data" AND "credit rate") OR ("big data" AND "credit assessment") OR ("big data" AND "credit evaluation") OR ("big data" AND "credit analysis") OR ("big data" AND "credit valuation") OR ("big data" AND "credit appraisal") OR ("big data" AND "credit grading") OR ("big data" AND "credit check") OR ("big data" AND "credit history")) AND (LIMIT-TO (DOCTYPE, "cp") OR LIMIT-TO (DOCTYPE, "ar") OR LIMIT-TO (DOCTYPE, "cr") OR LIMIT-TO (DOCTYPE, "ch") OR LIMIT-TO (DOCTYPE, "ed") OR LIMIT-TO (DOCTYPE, "re")) AND (LIMIT-TO (LANGUAGE, "English"))
All subject areas are included in the search, which is limited to articles, reviews, conference proceedings, conference reviews, editorials, and book chapters. The search also includes all source types. Finally, the language of the searched articles is limited to English.
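Because the search string pairs "big data" with each credit-related term in the same way, it can be assembled programmatically, which makes the query easier to audit and extend. The Python sketch below only builds the query text in the syntax shown above; the function names are hypothetical, and the sketch does not call Scopus or any other service.

```python
CREDIT_TERMS = [
    "credit risk", "credit expos*", "loan risk", "default risk",
    "credit default", "risk of default", "insolvency risk",
    "exposure at default", "downgrade risk", "credit rating",
    "credit score", "credit rate", "credit assessment",
    "credit evaluation", "credit analysis", "credit valuation",
    "credit appraisal", "credit grading", "credit check", "credit history",
]
DOCUMENT_TYPES = ["cp", "ar", "cr", "ch", "ed", "re"]


def build_query(anchor_term: str, paired_terms: list[str]) -> str:
    """Pair the anchor term with every credit-related term inside TITLE-ABS-KEY."""
    clauses = [f'("{anchor_term}" AND "{term}")' for term in paired_terms]
    return "TITLE-ABS-KEY (" + " OR ".join(clauses) + ")"


def refine_query(query: str, document_types: list[str], language: str) -> str:
    """Append the document-type and language limits to a base query."""
    type_limits = " OR ".join(
        f'LIMIT-TO (DOCTYPE, "{doc_type}")' for doc_type in document_types
    )
    return f'{query} AND ({type_limits}) AND (LIMIT-TO (LANGUAGE, "{language}"))'


base_query = build_query("big data", CREDIT_TERMS)
refined_query = refine_query(base_query, DOCUMENT_TYPES, "English")
print(refined_query)
```

Adding a further term, such as another synonym for credit scoring, then only requires extending the list, and the pairing with "big data" follows automatically.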

Results and discussions
This section presents the Scopus bibliometric findings for the keywords used for studies published between 2012 and 2021 (July).
Following the method of Apriliyanti and Alon (2017) and Fetscherin and Heinrich (2015), the study also employs a bibliometric software program to ease the process of finding the citation and co-citation links of publications. VOSviewer, which may be used to "construct and visualize bibliometric maps" (Van Eck & Waltman, 2014), is used to verify the findings. VOSviewer offers distance-based representations of bibliometric networks using the VOS (visualization of similarities) mapping approach (Van Eck & Waltman, 2014).
As Figure 1 shows, during the past nine years (2012-2021 (July)), there has been an increase in the overall number of credit risk and big data research publications. In 2013, only about two publications appeared, which is considered insignificant. However, the number increased to 10 new publications in 2015, a positive sign for this research topic. Interestingly, 2018, only three years after 2015, saw 40 new publications on this topic, four times the number in 2015. Moreover, in 2020, 60 new publications were added to the body of knowledge, indicating an upward trend that is expected to continue, as big data and credit risk assessment is an emerging topic attracting scholars worldwide.
Hence, considering the growing data and literature on credit risk and big data, this study contributes to the classification and synthesis of this large body of literature and recommends potential research fields for future study of the subject. Figure 2 presents the top sponsoring institutes/organizations and the number of articles on credit risk assessment and big data that each has sponsored. This will aid potential researchers in discovering organizations interested in the selected topic. Scholars can seek support from these institutions when they have a strong idea but require substantial professional and financial support. Likewise, postgraduate students can reach out to those entities if they wish to work and specialize in related fields.
The figure clearly shows that the greatest number of articles (34) on credit risk assessment and big data is sponsored by the National Natural Science Foundation of China, far more than any other entity. Overall, Chinese sponsors and organizations are the most dominant in credit risk assessment and related research.
Table 3 shows the most-cited journals in all search categories, which will help future scholars target the journals that specialize in this field and are most interested in advances on this topic. Scholars can also use it to locate the most cited studies, which may contain useful information. A total of eight journals are ranked from the highest number of citations to the lowest. The journal "Electronic Commerce Research and Applications" has the largest number of citations in the required search category: three relevant articles with 96 citations are found in the journal. The journal "Policy and Internet" has only one document in the search query, but it is cited 66 times. The "Journal of Business Research" ranks lowest, with 27 article citations.
Table 4 presents the journals with the highest number of published articles in the required search categories. The journals are ranked from the highest number of studies at the top to the lowest at the bottom. Interestingly, the greatest number of studies related to credit risk assessment and big data are published in the "Journal of Physics: Conference Series." This is an open-access, peer-reviewed proceedings publication from Institute of Physics (IOP) Publishing that presents new advances in physics from conferences worldwide. This journal published a total of 15 documents with a total of 18 citations. Other journals with several published articles are listed in the same table.
Surprisingly, however, the journals with the highest number of published studies do not appear in the table of the top-cited journals, suggesting that quality can sometimes outweigh quantity.
VOSviewer is used with a threshold of 50 documents from seven different journals to map the leading sources of the research articles. A network map of the leading sources is shown in Figure 3. The size of the circle, as in historiographic mapping, indicates the journal's relevance; the larger the circle, the greater the journal's influence.
Table 5 shows the leading authors in credit risk assessment and big data-related studies by the number of documents and citations. Identifying leading authors in big data and credit risk assessment will open the way for future researchers who wish to publish in this area to communicate and collaborate with influential authors. Similarly, Figure 4 shows the VOSviewer network mapping of the leading authors, visualizing them by the number of articles and citations.
Table 6 reveals the institutions or affiliations with the highest citations in credit risk assessment and big data research. The table ranks institutes from the highest to the lowest number of citations. The most cited research article, with 76 citations, was published by the Asia Australia Business College, Liaoning University. This institution can be considered a "center of excellence" for prior credit risk assessment and big data research. Figure 5 shows the VOSviewer network map with a threshold of 13 affiliations; it indicates that the most influential institute/affiliation is the Henan Key Laboratory of Finance. The majority of major institutions are located in China and the United States.
Table 7 presents the leading countries by number of documents and citations in credit risk assessment and big data research. Highlighting the top countries publishing on big data and credit risk assessment will aid scholars in selecting countries that are currently interested and working in this area.
Moreover, it could open the way for new graduates who wish to continue their studies in this area to look for opportunities in these countries, as these countries already have substantial expertise in this area. Table 7 shows that China leads in both studies published and citations received. This shows that China's research on credit risk assessment is quite extensive and that China is leading the world in credit risk and big data research, as can also be seen in the VOSviewer network map in Figure 6. The large circle indicates that China is highly influential in credit risk and big data research. The United States, the United Kingdom, Vietnam, and India are other countries that, alongside China, are influential in credit risk assessment and big data research (Figure 6).
Table 8 presents the most-cited articles related to credit risk assessment and big data research. This type of analysis points potential researchers to influential documents with the most relevant information on the topic, helping them gain insights on big data and credit risk assessment and identify where to start. These are the articles most cited by other authors, and they will guide other scholars in researching big data and credit risk. Shen et al. (2018) is the most cited article, with 76 citations, suggesting that the study contains significant and useful information on the topic and that 76 other works have drawn on it so far. Figure 7 shows the VOSviewer network map of the most cited articles; it also indicates that the article by Liang et al. (2018) ranks second, with 66 citations. This indicates that the article has had a high impact and that other researchers find it relevant and useful for their research.

Limitations and future research
One of the weaknesses of this study is that the Scopus database does not include all the relevant articles that other scientific databases (such as Google Scholar) cover. It is therefore possible that many relevant publications could not be incorporated into our study. However, the Scopus database is regarded as the most selective, as it is intended to concentrate on certified publications of proven quality and impact. Although the bibliometric technique is used, there is some subjectivity in the assembly of the main research streams since the authors make certain judgment calls. The current study is also restricted to articles published in English. It is recommended that future researchers include publications and articles written in languages other than English. Similarly, the current research only considered articles published between 2012 and 2021 (July); it is therefore also recommended to include articles and studies from before 2012 to obtain a more comprehensive bibliometric analysis of credit risk assessment and big data.

Conclusions and recommendations
This study uses bibliometric citation analysis to examine credit risk assessment and big data research during the previous nine years. It used different search queries and reviewed 219 documents. Many interesting findings emerged. First, the results suggest that the organization contributing the most to financing credit risk and big data studies is the National Natural Science Foundation of China. The results also suggest that the greatest number of studies are published in the Journal of Physics: Conference Series, whereas the most cited studies associated with credit risk assessment and big data research are published in the Electronic Commerce Research and Applications Journal.
The results also show that the greatest number of studies associated with credit risk and big data research was published by Zhang Y., and the author whose articles are most cited is Yu Y. The results show that the leading institute in credit risk and big data study is the Asia Australia Business College, Liaoning University in China. When comparing the number of research articles from different regions and countries, China is the leading country in researching credit risk assessment and big data. While developed countries have dominated credit risk and big data research in the last few decades, China has emerged as the leading credit risk and big data research country in the last nine years. Thus, it can be concluded that Chinese authors, institutes, and organizations have dominated credit risk assessment, big data, and related research in the last few years. It is also concluded from the current study that there are several possibilities to improve the knowledge of credit risk assessment and big data while also advancing theories and their influence on financial investments.
Hence, these results can be used in future research as a guide for potential scholars to identify significant documents, journals, countries, and organizations. Moreover, this study paves the way for future scholars to enhance the practical application of the studies presented in the literature review section and in the list of top documents.