Why do banks fail? An investigation via text mining

Abstract This study investigates the material loss reviews published by the Federal Deposit Insurance Corporation (FDIC) on 98 failed banks from 2008 to 2015. Text mining techniques based on machine learning, i.e. bag of words, document clustering, and topic modeling, are employed for the investigation. A pre-processing step of text cleaning is performed prior to the analysis. In comparison with traditional methods using financial ratios, our study generates actionable insights extracted from semi-structured textual data, i.e. the FDIC’s reports. Our text analytics suggests that, to avoid failure, banks should be wary of loans, board management, the supervisory process, the concentration of acquisition, development, and construction (ADC), and commercial real estate (CRE). In addition, the failures of US banks from 2008 to 2015 are explained by two primary topics, i.e. loan and management.


PUBLIC INTEREST STATEMENT
In this study, we used a method called text mining to analyze why 98 banks in the US failed between 2008 and 2015. Text mining is a technique that helps us summarize information from a collection of reports. We discovered that the main reasons for these bank failures were related to problems with loans and management. By using text mining techniques to review the material losses reported by the Federal Deposit Insurance Corporation (FDIC), we also identified several terms that banks should be cautious about to avoid the risk of failure. These terms include "loans," "board management," "supervisory process," "concentration of acquisition, development, and construction," and "commercial real estate."

Introduction
Significant parallels exist between the post-Covid-19 shock and the 2008 global financial crisis. Both have had substantial adverse consequences worldwide, damaging many economies and potentially leading to a recession (Li et al., 2022). The 2008 crisis, stemming from Lehman Brothers' collapse, caused the worst global economic downturn since 1929, showcasing the crucial role of US banks in the financial system. Ashcraft (2005) raised the question of whether bank failures matter and how they affect real economic activity. The study provided evidence that failed banks have significant and lasting impacts on the real economy.
Numerous US banks fail every year, especially during and after the financial crisis. When a bank fails, the Federal Deposit Insurance Corporation (FDIC) plays two key roles: (i) paying insurance to depositors and (ii) managing the failed bank's assets and debts. Furthermore, the FDIC investigates each failure and publishes a material loss review for each failed bank, evaluating supervision and board oversight over a 10-year period before the failure announcement. These detailed assessments are extremely valuable, shedding light on hard-to-measure criteria such as board oversight and examination quality.
Financial ratios are commonly used in the literature to assess bank performance and predict failures. However, they are limited in their ability to capture management quality or strategy, and so cannot fully describe the reasons for bankruptcy. Moreover, combining financial indicators with news or other textual information improves prediction accuracy (Gupta et al., 2020). Therefore, text analysis should be considered a complementary tool for predicting bankruptcy. Text mining (Hudaefi & Badeges, 2022; Hudaefi et al., 2022) is an artificial intelligence technique that quickly summarizes the main ideas of a text collection. Das (2014) defines it as the large-scale automated processing of digital plain text to extract useful quantitative or qualitative information.
Several studies have explored the application of text mining in various sectors, including the financial industry (Hristova, 2022; Hudaefi et al., 2022; Kuilboer & Stull, 2021; Pejić Bach et al., 2019), management disciplines (Hudaefi et al., 2022; Kushwaha et al., 2021), and the supply chain sector (Chu et al., 2020). However, only a limited number of articles use text mining for bank bankruptcy prediction. To fill this gap, this study uses text mining to explore the US banking sector before, during, and after the global financial crisis. By analyzing the bank failure reports provided by the FDIC, the study aims to provide insights into the potential reasons for bank failures. Additionally, this research contributes to the literature by (1) explaining the reasons behind bank failure through text analysis and (2) demonstrating the use of text analysis as a supplement to traditional financial ratio analysis. This article is organized as follows: Section 2 provides an overview of the literature concerning failure recognition and textual representation techniques. Section 3 introduces our data corpus and the methodology for extracting the key terms. The results are presented in Section 4. A brief conclusion is given in Section 5.

Literature review
Text mining is widely applied in the field of business management, for example in opinion mining and sentiment analysis (Pak & Paroubek, 2010; Pang & Lee, 2008). This technique, however, has yet to be used widely in finance and in the study of bank failure. The most common approach in this field is to use financial ratios to explain the reasons for bank failure. These ratios are computed from numeric data in financial reports or relevant statements.
Previous studies focus on using ratio models to predict bankruptcy. Altman (1968), Ohlson (1980), and several other studies have developed accounting ratio models of bankruptcy. Ratios such as net worth to debt, working capital to total assets, and earnings before interest and taxes to total assets are widely used to describe the probability of bankruptcy. Most of these algorithms require statistical tests, hypotheses, or robustness checks to ensure that the method performs well. Some financial measures are used widely, such as the CAMELS rating, coverage ratios, and management quality (via measures such as CEO duality, the percentage of independent directors, current ratio, ROE/ROA, etc.). Applying text mining to extract the most popular ratios, Kumar and Ravi (2007) reported that among 128 given papers, most mentioned the current ratio, quick ratio, income ratio, EBIT/total assets, ROA, or ROE. These ratios are also considered the "core ratios" that affect a bank's performance. In general, to measure the effect of governance, previous studies use variables that are considered to reflect the quality of governance, such as the gender of the CEO and the number of board of directors meetings over the year. This approach yields results in the form of estimated equation parameters.
In recent decades, the question of the value of non-numeric data has been addressed. Text mining was introduced in the 1960s through document classification and became popular in the 1990s. This method has found various applications in diverse domains (Kumar & Ravi, 2016). Especially in the current era of social media and big data, text mining has become a leading approach for analyzing text not only on Facebook, Twitter, blogs, or other social networks but also in news and reports (He, 2013). This information is valuable to decision-makers, their partners, competitors, and stakeholders. Text in context is believed to carry more diverse information than numbers (Kloptchenko et al., 2004). However, in most previous research using text mining, researchers primarily analyzed text data based on word frequencies calculated from morphologically analyzed text. The extracted words might lack important information that was included in the original text, such as word-to-word dependencies and the contexts around high-frequency words. In these studies, data are typically collected from headline news and financial reports.
In the field of finance, prolific work is reported on using text mining to solve problems such as predicting the FOREX rate or the stock market, or supporting customer relationship management (Kumar & Ravi, 2016). However, compared with the number of finance studies based on financial ratios, studies based on text mining remain a minority. Regarding FOREX rate prediction, studies suggest that the historical trend (Goodhart, 1990), news reports (Fung et al., 2002), macro news (Evans & Lyons, 2008), or even Twitter messages (Vu et al., 2012) might affect the FOREX rate and help investors predict the movement of foreign currencies. In addition, more papers on stock market prediction use news headlines, annual reports, or financial news from Bloomberg and Yahoo to foresee the trend of stock prices (Back et al., 2001; Chan & Franklin, 2011; Koppel & Shtrimberg, 2006; Mellouli et al., 2010; Nassirtoussi et al., 2014; Wang et al., 2011).
The typical process of basic text mining is, first, to collect the data by acquiring articles, news, or reports from the internet. The second step is to extract and retrieve information from the data by reporting the frequency of the most common words as a baseline of the framework (Gajzler, 2018). The final step is to calculate the correlation among words. However, this approach is more about statistics than about capturing the true meaning of the text. To make text mining more effective, document analysis is introduced. The two most popular tools are document clustering and topic modeling. These are two closely related tasks that can mutually benefit each other (Xie & Xing, 2013).
Topic modeling is one of the most powerful text-mining techniques and is gaining researchers' attention (Chi et al., 2010). Blei and Lafferty (2009) proposed topic modeling by discovering patterns of word use and connecting documents that exhibit similar patterns. Topic models have emerged as powerful new techniques for finding valuable structures in an otherwise unstructured collection (Krstić et al., 2019). This technique is applied in various fields, such as customer analysis and political science. A topic contains a cluster of words that frequently occur together; it can connect words with similar meanings and distinguish between uses of words with multiple meanings (Moosad et al., 2015). There are various types of topic model algorithms, such as Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Correlated Topic Model (CTM), and Latent Dirichlet Allocation (LDA). Among them, LDA, an algorithm based on statistical (Bayesian) topic models, is one of the most popular tools in topic modeling and is considered a standard tool. Various studies in the fields of social networks (Cheng et al., 2013; Cohen & Ruths, 2013; Kim & Shim, 2014; McCallum et al., 2005; Wang et al., 2013; Yu et al., 2015), political science (Cohen & Ruths, 2013), and linguistic science (Bauer et al., 2012) have used LDA. In finance, it is mainly applied to financial news and financial reports (Kumar & Ravi, 2016).
Document clustering is a method for automatically clustering textual documents. It is widely applied in many tasks, such as document organization, browsing, summarization, or classification (Cai et al., 2010). There are many algorithms for document clustering, such as the K-means algorithm (Hartigan & Wong, 1979) and hierarchical clustering (Jain & Dubes, 1988).
Further, it is evident that text mining is becoming more popular and drawing special attention from researchers. Using text mining saves the time needed to read thousands of documents and helps researchers form a general idea efficiently. Text and data mining are considered complementary techniques for efficient business management. Text mining tools are becoming even more important.

Data
The corpus consists of 98 official bank failure reports collected from the FDIC website (https://www.fdicoig.gov/reports-bank-failures), spanning the years 2009 to 2015. Among these reports, 69 out of 98 banks failed due to the Global Financial Crisis of 2008. Table 1 provides a detailed overview of the corpus. The reports are issued by the Federal Deposit Insurance Corporation's Office of Inspector General (FDIC OIG), which is an independent office responsible for conducting audits, evaluations, investigations, and other reviews of the FDIC. The primary purpose of these reviews is to prevent, deter, and detect waste, fraud, abuse, and misconduct in FDIC programs and operations, while also promoting efficiency and effectiveness within the agency.
The audits conducted as part of the reports aim to determine the causes of the financial institution's failure and the resulting material loss to the Deposit Insurance Fund (DIF). Additionally, they evaluate the FDIC's supervision of the institution, including the implementation of the PCA (Prompt Corrective Action) provisions. Each bank's report provides both numeric and textual information. Interestingly, the textual part of the material loss review contains more comprehensive and detailed information than the financial ratios and other numeric data.
Each report in the corpus follows a structured format, comprising three main sections: causes of failure, material loss, and the FDIC's supervision. Each reason leading to the failure is analyzed in a separate and detailed paragraph. For instance, consider the report on the failure of The Bank of Union, El Reno, Oklahoma, in 2014. In this report, one of the reasons identified was related to the CEO's actions. The CEO would occasionally present information to the Board concerning specific borrowing relationships and the overall lending strategy of the bank. However, in certain instances, the subsequent actions taken by the bank's management deviated from the materials presented by the CEO. Each report in the corpus provides a similarly in-depth analysis of the factors contributing to the failure of the respective bank. The structured approach allows for a thorough understanding of the circumstances surrounding each bank's collapse and helps identify patterns and trends within the dataset.
Overall, this corpus of bank failure reports offers valuable insights into the factors contributing to bank failures during the specified period and serves as a valuable resource for conducting in-depth analyses and research in this domain.

Method
In traditional financial ratio analysis, numbers are organized as a structured matrix. Text, by contrast, is unstructured, and the primary challenge in applying text mining is handling this unstructured data format. Text mining largely applies the same techniques as data mining; the difference is that it deals with a corpus of textual data (Dörre et al., 1999). The corpus is converted into a document-term matrix after removing stop words, punctuation, and numbers, stemming, and stripping whitespace, as proposed by Salton and Buckley (1988).

Pre-process and Bag of Words (BoW)
The Bag-of-Words (BoW) technique is based exclusively on raw documents. This method involves extracting words from the text by considering their frequency of occurrence, without taking into account the order or grammar of each word. The extracted words are then gathered together in the form of a "bag of words." Feinerer (2013) introduced the "tm" package in R, which provides a robust framework for text mining applications. This package offers various methods for tasks such as data import, corpus handling, pre-processing, data management, and the creation of term-document matrices. The pre-processing step is essential as it cleans the words in the text before proceeding to subsequent stages. Certain elements, such as common function words (e.g., "A", "in", "that", "there", "our"), punctuation, and capitalization, do not carry significant meaning in the given context. Thus, such features are removed during pre-processing to reduce noise and extract more meaningful information for further analysis.
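The bag-of-words idea described above can be illustrated with a minimal sketch. The study itself relies on the "tm" package in R; the Python code below is only an illustration under our own assumptions, and the stop-word list, function names, and example sentences are hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"a", "in", "that", "there", "our", "the", "of", "was", "to"}

def bag_of_words(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def document_term_matrix(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    bags = [bag_of_words(d) for d in docs]
    vocab = sorted(set().union(*bags))
    return vocab, [[bag[term] for term in vocab] for bag in bags]

# Hypothetical example sentences, not taken from the corpus.
docs = [
    "The failure was attributed to poor asset quality.",
    "Lax oversight by the board; poor oversight of loan quality.",
]
vocab, dtm = document_term_matrix(docs)
for row in dtm:
    print(dict(zip(vocab, row)))
```

Word order and grammar are discarded: each document is reduced to a row of counts over a shared vocabulary, which is the structured form that later steps operate on.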

Topic modeling via Latent Dirichlet Allocation
We propose that a bank failed due to certain reasons, and it is likely that other banks might encounter similar issues. To explore this further, we employ document classification tools to categorize reports into groups based on their content.
Topic modeling allows us to uncover the underlying themes or latent semantics present in the document corpus and helps identify document clusters, which is more insightful than relying solely on raw term features. One of the most popular algorithms for topic modeling is Latent Dirichlet Allocation (LDA), introduced by Blei, Ng, and Jordan in 2003. Without diving into the equations behind the model, we can understand it as resting on two principles: LDA estimates the mixture of words that compose each topic and determines the mixture of topics that describe each document. Mathematically, LDA is based on conditional distributions. A corpus D contains d documents distributed over T topics, where $z_t$ denotes a single latent topic and each topic $z_t$ is composed of words $w_t$. LDA assumes the following generative process for each document:
(1) Choose $N \sim \mathrm{Poisson}(\xi)$.
(2) Choose $\theta \sim \mathrm{Dir}(\alpha)$.
(3) For each of the $N$ words $w_t$: (a) choose a topic $z_t \sim \mathrm{Multinomial}(\theta)$, and (b) choose a word $w_t$ from $p(w_t \mid z_t, \beta)$, a multinomial probability conditioned on the topic $z_t$.
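To make the generative process more concrete, the following sketch samples one synthetic document. It is a simulation of the generative story only, not an LDA fitting procedure and not the code used in this study; the vocabulary, the topic-word matrix beta, and the parameter values are hypothetical.

```python
import math
import random

def sample_poisson(rng, lam):
    """Knuth's method: count uniform draws while their running product stays above exp(-lam)."""
    threshold, k, product = math.exp(-lam), 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_dirichlet(rng, alpha):
    """Normalised independent Gamma(alpha_i, 1) draws follow a Dirichlet(alpha) distribution."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, xi, alpha, beta, vocab):
    n_words = sample_poisson(rng, xi)          # choose N ~ Poisson(xi)
    theta = sample_dirichlet(rng, alpha)       # choose theta ~ Dir(alpha)
    words = []
    for _ in range(n_words):                   # for each of the N words
        z = rng.choices(range(len(alpha)), weights=theta)[0]   # topic z ~ Multinomial(theta)
        words.append(rng.choices(vocab, weights=beta[z])[0])   # word w ~ p(w | z, beta)
    return theta, words

# Hypothetical two-topic setup: row 0 leans towards loan terms, row 1 towards management terms.
vocab = ["loan", "loss", "board", "manag"]
beta = [[0.50, 0.40, 0.05, 0.05],
        [0.05, 0.05, 0.40, 0.50]]
theta, words = generate_document(random.Random(0), xi=8, alpha=[0.5, 0.5], beta=beta, vocab=vocab)
print(theta, words)
```

Fitting LDA reverses this story: given only the observed words, it infers plausible values for the topic mixtures and the topic-word probabilities.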
For example, we provide a concise overview by selecting three sentences from separate reports (documents) related to the failure of different banks.
Sentence 1 from the "Report of Doral Bank, San Juan, Puerto Rico" states: "The underlying cause of Doral's failure was attributed to poor asset quality." Sentence 2 from the "Report of Vantage Point Bank, Horsham, Pennsylvania" conveys: "The failure of Vantage Point Bank was a result of ineffective management by the board and management, specifically regarding the handling of risks associated with the bank's rapid expansion of its mortgage banking operation." Lastly, sentence 3 from the report on "Valley Bank, Moline, Illinois" reveals: "The primary reasons for the failure of Valley Bank were lax oversight by its board and the implementation of a risky business strategy by a dominant CEO." These sentences serve as succinct summaries of the main factors contributing to the downfall of each respective bank, and they form the basis for our research analysis.
The main objective of LDA (Latent Dirichlet Allocation) is to automatically uncover the underlying topics present in a collection of documents or sentences. In this particular example, LDA has classified the documents into two topics: G (representing Governance) and R (representing Risk).
For each sentence, LDA has assigned a distribution of word counts across the identified topics.
• Sentence 1 is exclusively classified as 100% belonging to Topic R (Risk).
• Sentence 2 is classified as 42% related to Topic G (Governance) and 58% to Topic R (Risk).
• Sentence 3 is classified as 32% associated with Topic G (Governance) and 68% with Topic R (Risk).
Similarly, LDA performs a comparable process for the entire set of documents, automatically classifying each document into the given topics based on the distribution of word occurrences.
The critical aspect in this process is to determine the appropriate number of topics, which can have a significant impact on the results and insights derived from the LDA analysis.
Indeed, selecting the appropriate number of topics is a crucial and challenging aspect of using LDA for topic modeling. The optimal number of topics is not yet definitively established in the literature and remains an open question.
Arun et al. (2010) experimented with different corpora and evaluated the optimal number of topics for each dataset. The results were presented in Table 2, showing that the ideal number of topics varied depending on the specific dataset. Despite their efforts, the authors noted that achieving consistent suggestions for the number of topics remains difficult due to the diverse nature of the data and topics in different domains. The absence of a standardized approach to determining the optimal number of topics is a challenge that persists in topic modeling research.
Researchers and practitioners often resort to heuristics, domain expertise, or validation techniques (such as perplexity or coherence measures) to assist in selecting the number of topics that provide meaningful and interpretable results.However, this aspect of LDA modeling continues to be an active area of research, with ongoing efforts to find more consistent and reliable methods for determining the appropriate number of topics for different types of data and contexts.
Although the literature proposes several methods for empirically determining the optimal number of topics, a more rigorous assessment of their effectiveness is still necessary. The following four algorithms have been suggested to estimate the optimal number of topics: Griffiths and Steyvers (2004), Cao et al. (2009), Arun et al. (2010), and Deveaud et al. (2014). Griffiths and Steyvers (2004) propose selecting the number of topics that maximizes the harmonic mean of the sampled log-likelihoods. Deveaud et al. (2014) opt for maximizing the average Jensen-Shannon distance between all pairs of topic distributions. Cao et al. (2009) estimate the average cosine similarity between topic distributions and choose the value that minimizes this quantity. Meanwhile, Arun et al. (2010) propose minimizing the symmetric Kullback-Leibler divergence between the singular values of the matrix representing word probabilities for each topic and the topic distribution within the corpus. Despite these proposed algorithms, there is a need for further refinement and comparison to establish more robust approaches for determining the optimal number of topics in topic modeling.
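Two of the quantities mentioned above, the average pairwise cosine similarity and the average pairwise Jensen-Shannon divergence between topic-word distributions, can be sketched as follows. This is a simplified illustration of the quantities only and does not reproduce the published procedures of Cao et al. (2009) or Deveaud et al. (2014); the toy topic distributions are hypothetical.

```python
import math
from itertools import combinations

def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

def kl_divergence(p, q):
    """Kullback-Leibler divergence; terms with p_i = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def jensen_shannon(p, q):
    """Symmetrised KL divergence of p and q against their midpoint distribution."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def average_over_pairs(topics, measure):
    """Average a pairwise measure over all pairs of topic-word distributions."""
    pairs = list(combinations(topics, 2))
    return sum(measure(p, q) for p, q in pairs) / len(pairs)

# Hypothetical topic-word distributions over a four-word vocabulary.
well_separated = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 0.1, 0.9]]
overlapping = [[0.40, 0.30, 0.20, 0.10], [0.35, 0.30, 0.20, 0.15]]

for name, topics in [("well separated", well_separated), ("overlapping", overlapping)]:
    print(name,
          average_over_pairs(topics, cosine_similarity),
          average_over_pairs(topics, jensen_shannon))
```

A candidate number of topics would be scored by fitting a model with that many topics and computing such a quantity on the fitted topics: lower average similarity, or higher average divergence, indicates topics that are easier to tell apart.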

Document clustering
Document clustering is a powerful method used for discovering topics on a large scale from textual data (Larsen & Aone, 1999). While this technique has not been extensively employed in finance, it finds significant applications in fields like law and web page analysis (Ramage et al., 2009; Wong & Fu, 2002).
The primary objective of document clustering is to categorize documents into different topics. Ng et al. (2001), Xu et al. (2003), Lu et al. (2011), and Aggarwal and Zhai (2012) have all worked on classifying documents with similar characteristics into groups. Document clustering is crucial in organizing, browsing, summarizing, classifying, and retrieving documents.
Two commonly used algorithms for document clustering are the hierarchical-based algorithm and the K-means algorithm, along with its variants. These algorithms group documents efficiently, enabling effective management and analysis of large volumes of textual data. The first is hierarchical clustering, which includes single linkage, complete linkage, group average, and Ward's method. Although this algorithm allows documents to be clustered into a hierarchical structure suitable for browsing, it may suffer from efficiency problems. The second is based on variants of the K-means algorithm, which is more efficient and provides sufficient information for most purposes (Qin et al., 2017). In our experiment, we use both algorithms for document clustering. The K-means algorithm requires specifying the number of groups beforehand, while the hierarchical algorithm does not, allowing clusters to be chosen at any level of the tree.
In hierarchical clustering, each data point (document) is first placed into its own cluster, and then the two closest clusters are repeatedly combined into one cluster until all documents are merged into a single cluster. Hierarchical clustering is often visualized as a dendrogram. K-means clustering, on the other hand, aims to find groups in the corpus based on a number of groups defined by the variable K. This approach requires defining the number of topics and iteratively redistributing the documents among topics until a termination condition is met. One disadvantage of K-means is that its accuracy and efficiency depend on the initial choice of cluster centers.
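The bottom-up merging used by hierarchical clustering can be sketched as below. The sketch assumes single linkage and Euclidean distance on small numeric vectors, and it stops once a requested number of clusters remains rather than building the full dendrogram; these choices and the toy data are our own assumptions for illustration.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(points, c1, c2):
    """Single linkage: distance between the closest pair of members of two clusters."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points, n_clusters):
    """Start with one cluster per point and repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: single_link(points, clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]                           # b > a, so index a is unaffected
    return clusters

# Hypothetical two-dimensional "documents" forming two obvious groups.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(agglomerative(points, 2))
```

Recording the order and distance of each merge, instead of stopping early, yields the information that a dendrogram displays.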

Feature selection
The first step after collecting documents is to transform them into a representation appropriate for text algorithms and mining tasks. The reports are in PDF format, which we converted into text and cleaned before processing. The quality of a text mining method is highly dependent on the noisiness of the features. For instance, commonly used words such as "the", "for", and "of" may not improve the algorithm. Hence, selecting features effectively is critical for removing the corpus's noisy words. These are the steps that we used for feature selection, as shown in Figure 1:
(1) Remove numbers: Because this research focuses on textual information, the numbers in each report are removed.
(2) Remove stop words: A list of stop words is provided in the "stopwords" package in R. The list includes 175 words that occur frequently but carry no significant meaning, such as I, our, his, was, is, are, will, etc. The recurrent appearance of these words may interfere with the analysis; hence, we remove the words that belong to this list. Moreover, we also create and remove our own stop-word list, including bank, FDIC, also, because, the, etc.
(3) Stem words: Some words have similar meanings but different word forms, such as banks and banking, institution and institutions, managing and manager or management, etc. We convert different word forms into a common canonical form, for example failure or failing into fail, and examinations, exams, or examine into exam. This process reduces data redundancy and simplifies later computation.
(4) Remove punctuation: All punctuation is removed from the text. This step makes the text appropriate for text algorithms.
(5) Remove sparse terms: We remove the sparse terms that appear in only one report.
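The five steps above can be sketched as a small pipeline. This is an illustration under our own assumptions, not the R code used in the study: the stop-word list is a short stand-in, the stemmer is a crude suffix stripper rather than a proper stemming algorithm, and punctuation is removed before tokenising for convenience.

```python
import re
from collections import Counter

# Short stand-in stop-word list (assumption); the study uses a 175-word list plus custom words.
STOP_WORDS = {"i", "our", "his", "was", "is", "are", "will", "the", "of", "for",
              "bank", "fdic", "also", "because"}
# Crude stand-in for a stemmer (assumption): strip the first matching suffix.
SUFFIXES = ("ations", "ation", "ings", "ing", "ures", "ure", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r"\d+", " ", text.lower())                   # step 1: remove numbers
    text = re.sub(r"[^\w\s]", " ", text)                       # step 4: remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # step 2: remove stop words
    return [crude_stem(t) for t in tokens]                     # step 3: stem words

def drop_sparse_terms(token_lists):
    """Step 5: drop terms that appear in only one document."""
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return [[t for t in tokens if doc_freq[t] > 1] for tokens in token_lists]

# Hypothetical example texts, not taken from the corpus.
docs = [
    "The failure in 2009 was due to loans.",
    "Failing loans, 12 of them!",
    "Unique oversight.",
]
cleaned = drop_sparse_terms([preprocess(d) for d in docs])
print(cleaned)
```

In this toy example only the stems shared by at least two documents survive, which mirrors how sparse-term removal shrinks the vocabulary before the document-term matrix is built.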

Model design
In this step, we perform a word frequency analysis on each document in the corpus. Our hypothesis is that words mentioned more frequently hold greater significance in the reports. We calculate the number of times each word appears in each document and then aggregate these counts to obtain the total frequency for each word across the entire corpus.

Correlation analysis.
The correlation analysis examines the relationship between words in binary form, where words either co-occur or do not appear together. The phi coefficient is a common measure of binary correlation. Table 2 displays the matrix depicting the combinations of words X and Y along with their corresponding phi coefficients. This analysis allows us to understand the patterns of word co-occurrences and their associations within the corpus.
In this table, $N_{11}$ is the number of documents where both word X and word Y appear, $N_{10}$ and $N_{01}$ are the numbers of documents where one appears without the other, and $N_{00}$ is the number of documents where neither appears. In terms of this table, the phi coefficient is:
$$\phi = \frac{N_{11} N_{00} - N_{10} N_{01}}{\sqrt{N_{1\cdot}\, N_{0\cdot}\, N_{\cdot 0}\, N_{\cdot 1}}}$$
where $N_{1\cdot} = N_{11} + N_{10}$, $N_{0\cdot} = N_{01} + N_{00}$, $N_{\cdot 1} = N_{11} + N_{01}$, and $N_{\cdot 0} = N_{10} + N_{00}$ are the row and column totals. A high value of $\phi$ suggests a high correlation between words X and Y. The literature suggests that simply counting the number of times a word appears is of limited analytical value. Finding phrases via word correlation is a further step in text mining.
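The phi coefficient can be computed directly from the four document counts. The sketch below assumes each document is represented as a set of tokens; the function name and the toy documents are hypothetical.

```python
import math

def phi_coefficient(docs, x, y):
    """docs: iterable of token sets. Returns the phi coefficient of words x and y."""
    n11 = n10 = n01 = n00 = 0
    for tokens in docs:
        has_x, has_y = x in tokens, y in tokens
        if has_x and has_y:
            n11 += 1          # both words appear
        elif has_x:
            n10 += 1          # only x appears
        elif has_y:
            n01 += 1          # only y appears
        else:
            n00 += 1          # neither appears
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # If a word appears in every document or in none, phi is undefined; 0.0 is returned here.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Hypothetical documents as sets of stemmed tokens.
docs = [{"loan", "loss"}, {"loan", "loss"}, {"board"}, {"board", "exam"}]
print(phi_coefficient(docs, "loan", "loss"))
print(phi_coefficient(docs, "loan", "board"))
```

Words that always appear together score 1, words that never share a document score -1, and unrelated words score close to 0.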

Topic modeling.
We classify reports into topics. We hypothesize that among the 98 failed banks, there are main topics that can be considered the main reasons for failure. Each topic is composed of weighted words. Grouping makes the information retrieval process more valuable. LDA and document clustering techniques are applied to classify the reports into sub-groups. Document clustering with the K-means and hierarchical algorithms partitions the reports into groups.

Descriptive statistics
Table 3 presents the correlation matrix of word X and word Y combinations, and Table 4 presents the 30 most frequently occurring words in the corpus, displayed in their stem forms. These words primarily pertain to crucial bank activities, such as loans, deposits, credit, and insurance. Notably, there is a significant focus on loan-related issues, including words like Loan, Loss, review, ADC (Acquisition, Development, and Construction), CRE (Commercial Real Estate), ALLL (Allowance for Loan and Lease Losses), Lend, and Estate. Additionally, an important governance aspect is evident, with words such as management, report, supervisory, board, and exam. Figure 2 further illustrates these findings through a chart depicting the top 15 most frequent words. Of particular significance, the words "Exam" and "Loan" are repeated remarkably often, occurring nearly 15,000 times across the 98 reports. These findings hold substantial value for financial analysis, as most of these words are regarded as sensitive. At first glance, the statistics present unsurprising words such as Loan, exam (or examination), management, risk, and report. These words are always considered "core reasons" for a bank's failure. In the history of research on banks, these reasons can be found regularly (Alam et al., 2000; Bell, 1997; Haslem et al., 1992; Kolari et al., 2002; Martin, 1977). However, on closer inspection, some words are significantly important and remarkable, such as Concentr (or concentration), ADC (Acquisition, Development and Construction), CRE (Commercial Real Estate), and board.
By looking at this list, readers can get an overall picture of these banks during this period. Figure 2 shows the order of frequent words. Figure 3 presents a word cloud of all the words that appear more than 800 times in the corpus. Figure 4 also presents a word cloud of the top frequent words. In Figures 3 and 4, a bigger word indicates a higher frequency. This step suggests general ideas about the banks' failure. The second step analyzes the correlation of words to obtain deeper results.

The correlation matrix of words with correlation greater than 0.3
Counting the number of words, however, does not fully reflect the picture of the context. We use R to compute the correlation matrix among words. The correlation matrix suggests the connections between words. The 20 most correlated words among the 50 most frequent words are visualized in Figure 5.
Table 5 presents the interpretation of the correlation coefficient according to Cohen (1988). Correlations between 0.10 and 0.29 are "small", those between 0.30 and 0.49 are "medium", and those of 0.50 or greater are "large" in terms of the magnitude of effect sizes. Hence, following Cohen (1988), we look for word pairs whose correlation is at least medium (the correlation must be greater than 0.3).
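The thresholds quoted from Cohen (1988) can be expressed as a small filter. This is a sketch under our own assumptions: the handling of values exactly at a boundary, the use of absolute values, the label for correlations below 0.10, and the example word pairs are all illustrative choices.

```python
def effect_size_label(r):
    """Label a correlation by magnitude using the thresholds quoted above."""
    magnitude = abs(r)
    if magnitude >= 0.50:
        return "large"
    if magnitude >= 0.30:
        return "medium"
    if magnitude >= 0.10:
        return "small"
    return "negligible"   # label for values below 0.10 is our own placeholder

def keep_medium_or_larger(pairs, threshold=0.30):
    """pairs: dict mapping (word_x, word_y) to a correlation; keep medium or large ones."""
    return {pair: r for pair, r in pairs.items() if abs(r) >= threshold}

# Hypothetical correlations between stemmed words, not results from the study.
correlations = {("loan", "loss"): 0.45, ("board", "estate"): 0.12, ("adc", "cre"): 0.62}
for pair, r in keep_medium_or_larger(correlations).items():
    print(pair, r, effect_size_label(r))
```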
The links between words intersect and are complicated. Figure 5 shows the correlation matrix of words. The matrix is built around important words, which are considered "core nodes" that most of the other words must "cross". These "core nodes" are significantly important as they are (i) in the list of most frequent words and (ii) considered dominant factors that connect and control others. The "core nodes" are: Exam, concentr, implement, asset, adque. Via the "core nodes", we can generate meaningful phrases, such as "increase loan loss", "credit loss insurance", "implement credit exam", "growth concentr estate ADC", "implement control asset concentr growth", "implement control ALLL", etc. Compared with the simple descriptive statistics, this step gives a more comprehensive picture of what happened to failed banks from 2008 to 2015.

Selecting the number of topics
Latent Dirichlet Allocation (LDA) is a generative model for documents in which each document is viewed as a mixture of topics, each containing a composition of words.The number of topics is crucial to the performance; however, finding the appropriate value for it is challenging (Cao et al., 2009).Finding a suitable number of latent topics in a given corpus has remained an open-ended question.
We assume that there will be at least 2 reports per topic.For 98 given reports, the range of the number of topics is from 1 to 50 topics.Figure 6 suggests the number of optimal topics based on the algorithm of Griffiths and Steyvers (2004), Cao et al. (2009), Arun et al. (2010) and Deveaud et al. (2014).As Deveaud et al. (2014) algorithm, we should categorize it into 18 topics.Griffiths and Steyvers (2004) and Cao et al. (2009) proposed 38 topics.The question of "How many topics for text classification" is still ongoing, and the answer typically depends on the characteristics of each corpus.Hence, we then experiment with 1 to 50 topics.Our result indicates that this corpus's optimal number of topics is 2. As the number of topics increases, the distinction among topics becomes unclear.Figure 7 is an example of the classification of 3 topics.The words are similar in all three topics; the only difference is the weight of each word.
Figure 8 shows the 2 topics of the given corpus. Topic 1 focuses on loan-related issues, and Topic 2 focuses on management-related issues. The 2 topics share some common words: loan, exam, concentr, and risk. These words are also included in the "core nodes" of the correlation matrix. Sixty-five banks belong to Topic 1 and 33 to Topic 2.
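The text does not state how each bank was assigned to a topic. One common convention, assumed here, is to assign each document to the topic with the highest weight in its estimated topic mixture. The sketch below uses hypothetical document-topic proportions, not the values from the study.

```python
from collections import Counter

def dominant_topics(doc_topic):
    """Return, for each document, the index of its highest-weight topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]

# Hypothetical document-topic proportions (each row sums to 1), with K = 2.
doc_topic = [
    [0.80, 0.20],
    [0.35, 0.65],
    [0.55, 0.45],
    [0.10, 0.90],
    [0.70, 0.30],
]
labels = dominant_topics(doc_topic)  # topic index per document
counts = Counter(labels)             # number of documents per topic
```

Applied to a fitted 2-topic model of the 98 reports, a count of this kind would yield the number of banks per topic.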

Clustering
The K-means algorithm is applied to find the optimal number of topics by document clustering. The calculation is based on the Euclidean distance. Let p and q be two points, each with n features. The distance between p and q is calculated as:

$$\mathrm{Dist}(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

The number of optimal topics. The result of applying the "factoextra" package of Kassambara and Mundt (2017) in R suggests that the 98 documents should be divided into 2 groups to optimize the clustering. Figure 9 shows the optimal number of clusters.
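The following pure-Python sketch illustrates K-means with the Euclidean distance and one way to compare candidate numbers of clusters. It is not the "factoextra" procedure used in the study. The within-cluster sum of squares criterion, the deterministic farthest-point initialization, and the toy document vectors are all assumptions made for illustration.

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two points with the same number of features."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm with a deterministic farthest-point start."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(euclidean(p, c) for c in centers)))
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: euclidean(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def within_ss(points, labels, centers):
    """Total within-cluster sum of squared distances."""
    return sum(euclidean(p, centers[l]) ** 2 for p, l in zip(points, labels))

# Hypothetical 2-feature document vectors forming two well-separated groups.
docs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
wss = {k: within_ss(docs, *kmeans(docs, k)) for k in (1, 2, 3)}
```

In this toy example the within-cluster sum of squares drops sharply from 1 to 2 clusters and only slightly from 2 to 3, which is the kind of "elbow" that points to 2 clusters.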
One of the advantages of hierarchical clustering is that the number of topics can be specified at any level. The result can be seen in Figure 10, which suggests that we can cluster into 2 groups at the highest level of 3. This classification is consistent with the K-means algorithm and topic modeling with LDA. The dendrogram reports that 18 documents are placed in the first group and 80 in the second group. Table 6 presents the most frequent words of the 2 sub-groups. This list corresponds to the list in Table 4.
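A minimal sketch of bottom-up hierarchical clustering, stopped when a chosen number of groups remains, is given below. The linkage rule is not specified in the text; complete linkage with the Euclidean distance is assumed here, and the function name `agglomerate` and the toy points are hypothetical.

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two points with the same number of features."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agglomerate(points, n_clusters):
    """Bottom-up clustering with complete linkage, stopped at n_clusters groups."""
    clusters = [[i] for i in range(len(points))]  # start from singletons

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(euclidean(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical document vectors: a small group and a larger group.
docs = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]
groups = agglomerate(docs, 2)  # lists of document indices, one list per group
```

Stopping the merges at 2 groups corresponds to cutting the dendrogram at its highest level. Stopping earlier would give a finer partition from the same hierarchy.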
Some of the top 20 most common words appear in the "bigger" group but not in the other group: ADC (Acquisition, Development and Construction), ALLL (Allowance for Loan and Lease Losses), Liquid, Policy, and Broker. These "sensitive terms" point to the factors that distinguish the 2 groups.

Conclusion
Given the important role of the banking system in the economy, the study of bank failure has become a topic of interest. By suggesting issues that banks must beware of, text analytics can complement in-depth financial analysis of banks. Text analytics makes it possible to capture general tendencies and to foresee risk features before they damage a bank's financial condition. It is noteworthy that ADC and CRE are mentioned prominently. Under Basel III, a CRE (commercial real estate) loan is a mortgage loan secured by a lien on commercial, rather than residential, property. This type of loan is typically made to business entities formed for the specific purpose of owning commercial real estate. An ADC (Acquisition, Development and Construction) loan, considered the riskiest type of CRE lending, is a loan that allows the borrower to purchase real property (such as land), put in the necessary infrastructure, and then build stores or other buildings. This type of loan is often used by developers of large properties such as strip malls or shopping centers. To the best of our knowledge, ADC and CRE are rarely identified as reasons for bank failure. One of the reasons is the difficulty of obtaining numeric data on ADC and CRE due to the complicated calculation.
We have demonstrated bag-of-words techniques, a statistical inference algorithm for LDA, topic modeling, and document clustering for analyzing the material loss reviews of 98 banks. Our research contributes, through text analytics, in four major aspects: (i) core words, (ii) core nodes, (iii) the optimal number of topics in text mining, and (iv) consistent results across topic modeling with LDA, K-means, and hierarchical clustering.
For the core words, the results suggest that some core words, which can be regarded as the main reasons for bank failure, appear in most of the reports. We classify them into four groups: loan, management, capital, and magnitude. The core words for loan are loan, ADC, CRE, credit, rate, and ALLL. The core words for management are exam, management, report, supervise, review, board, and audit. The core words for capital are capital, deposit, asset, fund, and portfolio. The core words for magnitude are increase, significant, growth, and concentration. These words are highly sensitive for the banking system. Our results are comparable to those obtained from financial ratios. Moreover, it is noteworthy that some terms are hard to measure numerically and are not mentioned as reasons in the bankruptcy literature, yet they clearly have a significant influence on banks' survival. Those terms are management, supervision, and concentration on ADC or CRE. Further, the words found for the core nodes are exam, concentration, asset, implementation, and adequate. We suggest that banks strengthen the supervisory process and pay serious attention to the allocation of loans, especially ADC and CRE loans.
There is little agreement on the optimal number of topics in text mining, even though clustering has been assessed in many ways. Our experiment once again raises this question. In fact, the number of topics should depend on the features and components of each given corpus; there should not be a single standard for every experiment. Our research suggests dividing the reasons that banks fail into two main sub-groups: loan-related and governance-related issues. We obtain a consistent suggestion on the number of clusters from topic modeling with LDA, K-means, and hierarchical clustering. All three algorithms suggest that this corpus should be divided into two topics.
The findings from Fatima (2013) suggest that banks with high loan-to-assets and high personal loan-to-assets ratios are more likely to survive, while banks with higher real estate and agricultural loans, and non-performing loans to assets, are more prone to failure. In our research, we employed text-mining techniques to investigate the main reasons behind bank failures, and our results align closely with the aforementioned findings. However, our study goes beyond these established factors and presents additional reasons contributing to bank failures. Management and supervision emerged as crucial determinants of bank failure among the factors we identified. Inadequate management practices and oversight can significantly impact a bank's stability and ultimately lead to its downfall. Furthermore, we found that banks concentrating heavily on Acquisition, Development, and Construction (ADC) or Commercial Real Estate (CRE) loans were more susceptible to failure. These specific types of loans can expose banks to increased risks, and without effective management and supervision, these risks can escalate.
Notably, our research sheds light on the importance of supervisory quality, a factor that has not been extensively discussed in many previous studies. We find that the quality of supervision by regulatory bodies plays a pivotal role in preventing bank failures. Inadequate or ineffective supervision can lead to overlooked risks and improper risk management, making it crucial for regulatory authorities to maintain high-quality supervision to safeguard the stability of the banking sector.
There is scope for further research, as our study has some limitations. We focus only on the material loss reviews; the numeric information is discarded, and the text analysis does not capture changes in financial condition over time. In brief, this research has shown that text analytics offers some advantages over the financial-ratio analysis approach. Text analytics is consistent with data analytics on the main reasons that banks fail, via core words such as loan, capital, and deposit. Moreover, text analytics contributes to the bank-failure literature by highlighting the concentration on ADC and CRE loans, which is rarely considered in previous research.

Figure 1. The essential steps in mining documents.

Figure 7. The common frequent vocabularies of 2 sub-groups.

Figure 8. An example of the classification of 3 topics.