Customer behaviour analysis based on buying-data sparsity for multi-category products in pork industry: A hybrid approach

Abstract
Understanding customer behaviour is crucial for business success. To this end, the Recency–Frequency–Monetary (RFM) model is commonly recognised as an effective approach to analysing customer behaviour. However, the traditional RFM approach is a coarse method for quantifying customer loyalty and contribution: it provides only a single lump-sum value for each of recency (R), frequency (F), and monetary value (M), and hence discards information regarding customers' product preferences. Typically, different customers purchase different products, so each customer has purchase records for only some of the products. This creates data sparsity, which affects the performance of conventional clustering methods. In this study, we integrated the group RFM analysis and probabilistic latent semantic analysis models to perform customer segmentation and customer analysis. The results indicated that the developed approach takes product preferences into account and captures a wide variety of the types of true ordering behaviour of the company's customers. This information allows the manager to improve customer relationships and build a personalised purchasing management system for groups of customers with similar purchasing patterns.


PUBLIC INTEREST STATEMENT
Understanding customer behaviour is essential for the business decision-making process. The RFM model is widely used to analyse customer behaviour. However, the traditional RFM model only provides a single lump-sum value for each of the recency, frequency, and monetary value; hence, information regarding customers' product preferences is discarded. Typically, different customers purchase different products, so each customer has purchase records for only some of the products. This creates data sparsity, which affects the performance of conventional clustering methods. In the present study, we integrated the group RFM analysis and PLSA models to perform customer segmentation and customer analysis. The results indicated that the developed approach takes product preferences into account and captures a wide variety of the types of true ordering behaviour of the company's customers. This information allows the manager to improve customer relationships and build a personalised purchasing management system for customers with similar purchasing patterns.

Introduction
Understanding customer behaviour is crucial in business endeavours, as it helps businesses satisfy consumer expectations. Customer behaviour can cause a company to succeed or fail. Thus, insight into customer behaviour is useful in developing a company's marketing and operation strategy for satisfying customer requirements. Additionally, identifying customer behaviours plays a significant role in understanding the factors that cause customers to buy a particular product. Given the wide variety of products and customer needs, it is important to assess customer behaviours prior to releasing a product to the market. These problems are prevalent in the pork processing industry. Hundreds of customers order various products, which can be classified into different main meat cuts (i.e. ham, shoulder, collar, loin, tenderloin, belly, spare rib, and by-products). Additionally, the products are available in various sizes, and a customer order may span many product categories. Customer segmentation (as related to product characteristics and the production process) is important for a company's management, e.g. for launching marketing strategies for each customer cluster and for establishing customer priority when allocating products if the supply is insufficient to fulfil all the orders. In a procurement system, product orders are converted into the number of live pigs required for each pig size. Fattening farms with pigs of the required sizes are then selected for harvesting. Prior to the harvesting, the pig sizes on each farm are estimated primarily from the ages of the pigs and the feed consumption. For pork processors in Thailand, the supply of live pigs to the plant typically does not satisfy the demand, because of two important factors. First, the pig size is not uniform within a farm; i.e. each farm has a distribution of pig sizes.
Thus, the plant may not obtain a sufficient number of pigs of the required size from harvesting a set of farms. Second, the plant manager may not slaughter sufficient pigs to fill all the orders. All the meat cuts are co-products. One pig provides two pieces of each of the main meat cuts. For example, slaughtering 100 large pigs produces 200 large pieces each of ham, shoulder, collar, loin, tenderloin, belly, and spare rib. To fill orders for 200 large-sized ham and 300 large-sized shoulder products, 150 large-sized pigs must be procured and slaughtered. This yields 300 large-sized hams, of which only 200 are ordered, leaving an inventory of 100 large-sized hams. This inventory adds to the plant costs, because the product is perishable and requires temperature-controlled storage. To avoid such an inventory, a decision maker may decide not to satisfy all the orders; i.e. some orders may be delayed or undergo product substitutions. To maximise customer satisfaction during supply shortages, a company first attempts to fill the important customers' orders. Currently, customer priority is based on the experience of the decision maker.
Among the investigations to minimise the total inventory cost and to maximise the profit for multi-category products in supply chains, the following recent studies are highlighted. Sarkar and Giri (2020) developed a two-echelon supply-chain model with a single buyer and a single vendor under stochastic demand. A related study extended the three-level supply-chain model to an integrated multi-product four-level supply chain with a joint economic lot-sizing policy under stochastic conditions. Hoseini Shekarabi et al. (2019) modelled a multi-product, multi-wholesaler, multi-level, integrated supply chain under conditions of shortage and limited warehouse space. Kazemi et al. (2018) presented an inventory management policy for items that considers imperfections (quality) and emissions. Gharaei et al. (2019a) developed an optimal replenishment policy for a multi-product inventory system by considering quality control and green, eco-friendly production policies. Gharaei et al. (2019b) developed a multi-product, multi-buyer supply-chain model under penalty, green, and quality control policies and a vendor-managed inventory with a consignment stock agreement. Additionally, with the goal of inventory system management for real-world conditions, a further study proposed a realistic Economic Production Quantity (EPQ) model. The overall inventory cost and the profit function in a multi-product EPQ model were defined and optimised by taking into account the faulty products. That study also developed a bi-objective EPQ model with defective and good items to determine the number of shipments and the quantity of each product shipment. Giri and Masanta (2020) developed a closed-loop supply-chain model with price- and quality-dependent demand and learning in production in a stochastic environment. They considered a closed-loop supply chain with two suppliers, one manufacturer, and one retailer. The manufacturer produces a single product from fresh raw materials and used items collected by the retailer. Masoud et al. (2020) presented a multi-period, multi-product model for the location-allocation supply-chain problem under different factors, e.g. customers, vehicle fleet technologies, and environmental and social impacts.
The obtained group information is useful in formulating promotion strategies and pricing policies that increase the customer response rate and the business profit. To obtain insight into customer segmentation and analysis, several relevant studies were systematically reviewed. For example, Fang et al. (2016) proposed a profitability model based on Big Data analytics that considers the customer purchasing behaviour and foreseeable future cash flow in the insurance industry. Additionally, Tsai et al. (2017) developed a shopping behaviour prediction system based on moving patterns and product characteristics, which indicates suitable strategies for an individual customer to increase profit. Another relevant study was conducted by Holý et al. (2017) on product categorisation based on customer behaviour using only market basket data. Finally, the work of Wang and Tseng (2015), which proposed a Naïve Bayes classifier-based approach for matching customer requirements to existing products, provided a good foundation for this study. The literature review indicated that the proposed method provides a better understanding of customer preferences and requirements than previously reported methods, which can lead to marketing opportunities for companies.
The problem of identifying high-response-rate customers for product promotion can be solved via both statistical and non-statistical methods (e.g. data-mining technology), including Recency-Frequency-Monetary (RFM) analysis, k-means clustering, classification and regression trees, logistic regression, supervised and unsupervised neural networks, support vector machines, finite mixture models, the Bernoulli-Gaussian mixture model, and the Naïve Bayes classifier (Calvet et al., 2016; Chen et al., 2007; Durango-Cohen et al., 2013; Wang & Tseng, 2015). RFM analysis, a widely used customer segmentation method, employs three variables, i.e. consumption recency (R), frequency (F), and monetary value (M), to model customers' purchasing behaviour and evaluate their loyalty, contribution, and buying potential (Chang & Tsai, 2011). Because of its simplicity and reasonable performance, the RFM model is widely used to analyse customer behaviour (Chan et al., 2011; Coussement et al., 2014; Ha et al., 2002; McCarty & Hastak, 2007; Wu & Chou, 2011). However, the traditional RFM model has several significant disadvantages when applied to customer or market segmentation (Coussement et al., 2014; Han et al., 2014; McCarty & Hastak, 2007). For example, Singh and Singh (2016) reported that the weights are arbitrarily assigned in the traditional RFM approach and that the approach does not account for the risk of customers being inactive. Hence, Singh and Singh (2016) suggested using data envelopment analysis together with the probability of being active, the probability of reaching the minimum sales level required by the firm, and the regularity of purchases to create an index called the risk-adjusted RFM. In our paper, we address the RFM approach for companies that offer a wide variety of products. For the multiple-product case, the traditional RFM only provides lump-sum evaluation indices, which are inaccurate in quantifying customer loyalty and contribution.
The proposed model analyses customer behaviour for a whole category rather than for an individual product. The traditional RFM analysis does not consider each category separately and is based on the total purchase volume. To illustrate the traditional RFM analysis, Table 1 presents the transaction records of five customers. Each transaction consists of purchased items and monetary expenses. Examining the traditional RFM values for each product and the total purchase expense reveals that customers differ significantly in purchasing these products. Hence, a lump-sum evaluation index is likely to categorise customers into different groups than a product-by-product evaluation would. To take the product category into account, Chang and Tsai (2011) developed a group RFM (GRFM) framework that identifies potential customers according to their purchases by hierarchically transforming purchase data into categories, applying constrained clustering to the categories of purchased products, and obtaining the RFM values from the categories. Although this approach can reveal the true buying behaviour by deeply analysing and utilising the RFM values of customers according to their purchased items, considerable computational effort is required to transform purchase data into similar purchase patterns. Additionally, because the method of Chang and Tsai (2011) clusters customers according to their purchased items, a customer may be allocated to more than one cluster. Although allocating a customer to more than one cluster may be convenient when reviewing purchased items, in other circumstances, assigning each customer to exactly one cluster may facilitate strategy execution. Chang and Tsai (2011) carefully constructed an algorithm to transform the purchase data into categories and then cluster data with similar purchase records into groups. One reason is to avoid data sparsity problems.
As shown in Table 1, typically, a customer may purchase some of the products but not all. Thus, the product purchase differs across customers. In this case, data sparsity occurs when only a small fraction of the items in any given row are nonzero or non-null. Consequently, the performance of conventional clustering methods for customer segmentation and analysis is affected. Without data sparsity, we could have used RFM values for each product as attributes to directly cluster customers.
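The notion of sparsity used here can be made concrete with a small sketch. The purchase matrix below is hypothetical (it is not the data of Table 1), and the helper name `sparsity` is an assumption introduced for illustration; it simply reports the fraction of customer-by-product entries that are zero or missing.

```python
def sparsity(matrix):
    """Return the fraction of entries that are zero or None."""
    total = sum(len(row) for row in matrix)
    empty = sum(1 for row in matrix for value in row if not value)
    return empty / total


# Hypothetical purchase amounts: rows are customers, columns are products.
orders = [
    [120, 0, 0, 0],
    [0, 80, 0, 0],
    [0, 0, 0, 45],
    [60, 0, 30, 0],
    [0, 0, 0, 0],
]

print(sparsity(orders))  # 15 of the 20 entries are empty
```

When most entries are empty, as in this sketch, distance-based clustering on the raw rows has little shared information to compare customers with, which is the difficulty the latent-class approach is meant to address.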
Data sparsity is a common problem in text processing for text mining. Text data are typically high-dimensional and sparse. To address data sparsity in text mining, the probabilistic latent semantic analysis (PLSA) model, which was originally developed by Hofmann (1999, 2001), is commonly used. Recently, the PLSA model has been successfully applied to various fields, e.g. image learning and quality assessment (Fernandez-Beltran & Pla, 2018), facial recognition (Zhou et al., 2019), topic modelling (Li et al., 2018; X. Wang et al., 2019), a movie recommender system, unsupervised mining of long time series (J. Wang et al., 2013), and fuzzy co-clustering and cluster splitting characteristics (Goshima et al., 2018). Therefore, the main objective of the present study was to investigate a customer segmentation and characteristic analysis approach for multi-category products in which the GRFM and PLSA methods are used for solving data sparsity problems.

Proposed methodology
In this study, a methodology is developed for segment-level customer behaviour analysis based on buying-data sparsity for multi-category products in the pork industry. For retaining the dynamic nature of the customer behaviour, this study is based on the GRFM model, and dataset segmentation principles are followed in accordance with the PLSA approach. The proposed methodology contains preprocessing, modelling, evaluation, and analysis phases. The preprocessing step focuses mainly on building a customer order corresponding to the RFM attributes for each multi-category product for each customer. The modelling and evaluation phase deals with two main decisions, including the selection of appropriate segments. In the analysis phase, the proposed methodology is applied to a case study. The detailed procedures of the methodology are described throughout this section.

Probabilistic latent semantic analysis (PLSA)
The PLSA model was originally used to find topics in a document; in this section, PLSA is adapted to the aforementioned problem of customer segmentation. More details on the PLSA model can be found in Hofmann (1999, 2001). Suppose the company has $N$ customers and $M$ products, which are denoted by $c_i \in \{c_1, \ldots, c_N\}$ and $p_j \in \{p_1, \ldots, p_M\}$, respectively. The customer order data are then summarized in an $N \times M$ co-occurrence matrix $\mathbf{N}$. Each matrix element, $n(c_i, p_j)$, is the RFM score of product $p_j$ for customer $c_i$. There are also $K$ hidden (latent) classes or group variables $z_k \in \{z_1, \ldots, z_K\}$, which are associated with each occurrence of a product $p_j$ for customer $c_i$. The customer order information can be used to determine the probability $P(c_i)$ that customer $c_i$ places an order, which is used to estimate the probability $P(p_j)$ that product $p_j$ is ordered; however, the probability $P(z_k \mid c_i)$ that customer $c_i$ exhibits feature $z_k$ and the probability $P(p_j \mid z_k)$ that a product feature $z_k$ corresponds to product $p_j$ are unknown. Figures 2 and 3 show the graph of the aspect model.
The joint probability of customer $c_i$ and product $p_j$, i.e. $P(p_j, c_i)$, can be expressed by Equation (1). Equation (2) describes the probability that product $p_j$ is ordered if customer $c_i$ places an order.
Substituting Equation (2) into Equation (1) yields Equation (3). As in the study of Lu et al. (2010), the observation pairs $(p_j, c_i)$ are assumed to be generated independently; the log-likelihood function $L$ can then be obtained as given in Equations (4) and (5). The parameters $P(z_k \mid c_i)$ and $P(p_j \mid z_k)$ can be estimated using expectation maximization (EM) to maximize the log-likelihood function. The EM algorithm consists of two steps, E and M.
In the E-step, the posterior probability $P(z_k \mid c_i, p_j)$ of the latent variables is calculated using Bayes' formula, resulting in Equation (6). In the M-step, the parameters are updated based on the expected complete-data log-likelihood, which depends on the posterior probability calculated in the E-step (Chen et al., 2008). The updated parameters are given by Equations (7) and (8). The E and M steps are iterated to increase the likelihood function. The procedure is terminated when a specified stopping condition is met.
Thus, the EM algorithm estimates the model parameter values that maximize the likelihood of the observed data and returns appropriate probability distributions that can be used in the PLSA model.
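For reference, Equations (1)-(8) can be written out as below in the standard aspect-model form of Hofmann (1999, 2001), using the notation of this section. This is a sketch under that assumption rather than a reproduction of the original typesetting; the summation indices $l$ and $j'$ and the grouping of the log-likelihood into Equations (4) and (5) are choices made here for illustration.

```latex
\begin{align}
P(p_j, c_i) &= P(c_i)\, P(p_j \mid c_i) \tag{1} \\
P(p_j \mid c_i) &= \sum_{k=1}^{K} P(p_j \mid z_k)\, P(z_k \mid c_i) \tag{2} \\
P(p_j, c_i) &= P(c_i) \sum_{k=1}^{K} P(p_j \mid z_k)\, P(z_k \mid c_i) \tag{3} \\
L &= \sum_{i=1}^{N} \sum_{j=1}^{M} n(c_i, p_j) \log P(p_j, c_i) \tag{4} \\
  &= \sum_{i=1}^{N} \sum_{j=1}^{M} n(c_i, p_j)
     \log \Bigl[ P(c_i) \sum_{k=1}^{K} P(p_j \mid z_k)\, P(z_k \mid c_i) \Bigr] \tag{5} \\
P(z_k \mid c_i, p_j) &=
  \frac{P(p_j \mid z_k)\, P(z_k \mid c_i)}
       {\sum_{l=1}^{K} P(p_j \mid z_l)\, P(z_l \mid c_i)} \tag{6} \\
P(p_j \mid z_k) &=
  \frac{\sum_{i=1}^{N} n(c_i, p_j)\, P(z_k \mid c_i, p_j)}
       {\sum_{j'=1}^{M} \sum_{i=1}^{N} n(c_i, p_{j'})\, P(z_k \mid c_i, p_{j'})} \tag{7} \\
P(z_k \mid c_i) &=
  \frac{\sum_{j=1}^{M} n(c_i, p_j)\, P(z_k \mid c_i, p_j)}
       {\sum_{j=1}^{M} n(c_i, p_j)} \tag{8}
\end{align}
```

Equations (7) and (8) are normalised so that $\sum_j P(p_j \mid z_k) = 1$ for every class and $\sum_k P(z_k \mid c_i) = 1$ for every customer.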

Data description
The data used in this study were collected from a pork slaughtering house of a leading agriculture company in Thailand. There were 71,088 records of customer order data for 487 customers. The products are grouped into nine main product categories, including ham, tenderloin, loin, collar, shoulder, belly, spare rib, by-products and trimming meat. The nine product categories are used to analyze the purchasing/ordering behavior of the customers.

Novel approach
The GRFM and PLSA methods are used to develop a customer segmentation methodology for the pork processor case study. The proposed methodology can be broadly divided into five steps, as shown in Figure 1. The procedure is given below.
Step 1: Preprocess data. Preparing and preprocessing data are important processes for knowledge discovery in a database. In this step, the records of all of the orders are classified into nine product categories. Next, the recency (how long ago the product was last ordered), the average frequency (the average number of purchasing orders per month) and the average monetary value (the average purchasing expenditure per order) are calculated for each customer in each product category.
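Step 1 can be sketched as follows. The record layout (customer, category, order date, amount), the function name `rfm_by_category`, the reference date, and the length of the observation window in months are all assumptions made for illustration; the source does not specify how these are stored or chosen. Recency is expressed here in days.

```python
from collections import defaultdict
from datetime import date


def rfm_by_category(orders, reference_date, months_observed):
    """Compute (recency_days, orders_per_month, spend_per_order) for each
    (customer, category) pair that appears in the order records."""
    grouped = defaultdict(list)
    for customer, category, order_date, amount in orders:
        grouped[(customer, category)].append((order_date, amount))

    result = {}
    for key, rows in grouped.items():
        last_order = max(order_date for order_date, _ in rows)
        recency = (reference_date - last_order).days      # time since last order
        frequency = len(rows) / months_observed            # average orders per month
        monetary = sum(amount for _, amount in rows) / len(rows)  # average per order
        result[key] = (recency, frequency, monetary)
    return result


# Hypothetical order records.
orders = [
    ("N01", "ham", date(2020, 1, 5), 400.0),
    ("N01", "ham", date(2020, 2, 20), 600.0),
    ("N01", "loin", date(2020, 1, 15), 250.0),
    ("N02", "ham", date(2020, 2, 1), 900.0),
]
rfm = rfm_by_category(orders, reference_date=date(2020, 3, 1), months_observed=2)
print(rfm[("N01", "ham")])
```

Pairs that never occur in the records, such as customer N02 in the loin category, have no entry at all, which is where the sparsity of the customer-by-category data comes from.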
Step 2: Assign R-F-M scores for 9 categories. In this step, scores are assigned to the recency (R), frequency (F) and monetary (M) values of a customer in each product category. The score scale ranges from 1 to 5. Scores "5" and "1" correspond to the largest and smallest contributions to the company revenue, respectively. The values of the data set are scored from 1 to 5 by sorting the original RFM values for all of the categories in descending order. Next, the RFM values in each category are divided into five equal parts. The values in the top 20% of the data set are given a score of 5, and the values in the next highest 20% are assigned a score of 4. The other scores (i.e., 3, 2 and 1) are assigned similarly. Table 2 presents the RFM criteria for all product categories. Table 3 provides examples of RFM scores for various customers.
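The quintile scoring of Step 2 can be sketched as below. The function name `quintile_scores` and the `higher_is_better` flag are assumptions introduced here; in particular, treating a shorter recency as the better value is an assumption of this sketch, and ties are broken by input order, which the source does not specify.

```python
def quintile_scores(values, higher_is_better=True):
    """Score each value from 1 to 5 by rank: the best 20% receive 5,
    the next 20% receive 4, and so on down to 1."""
    n = len(values)
    # Indices ordered from best to worst value.
    order = sorted(range(n), key=lambda i: values[i], reverse=higher_is_better)
    scores = [0] * n
    for rank, idx in enumerate(order):
        scores[idx] = 5 - (rank * 5) // n
    return scores


# Hypothetical average monetary values for ten customers in one category.
monetary = [100, 900, 300, 700, 500, 200, 800, 400, 600, 1000]
print(quintile_scores(monetary))

# Hypothetical recency values in days; fewer days is assumed to be better.
recency = [3, 40, 10, 25, 60]
print(quintile_scores(recency, higher_is_better=False))
```

Applying the function separately to the R, F, and M values of each of the nine categories would give the 27 scores per customer used in the later steps.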
Step 3: Obtain $L$ and $P(z_k \mid c_i)$. The RFM scores from the previous step are used to determine the log-likelihood $L$ and the probabilities $P(z_k \mid c_i)$ using the PLSA model (Equations (1)-(8)), where $k$ indexes the clusters.
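A minimal sketch of the EM iteration behind Step 3 is given below, assuming the standard PLSA updates: a Bayes posterior in the E-step and renormalised expected counts in the M-step. The function name `plsa`, the random initialisation, the fixed iteration count used as the stopping rule, and the example scores are assumptions. The returned log-likelihood omits the $P(c_i)$ term, which does not depend on the number of clusters; this is a simplification of the sketch, not necessarily the authors' definition of $L$.

```python
import math
import random


def _normalise(row):
    total = sum(row)
    return [v / total for v in row]


def plsa(scores, n_clusters, n_iter=50, seed=0):
    """Fit P(z_k | c_i) and P(p_j | z_k) by EM on a customers x features matrix.

    Returns (p_z_given_c, p_p_given_z, log_likelihood).
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(scores), len(scores[0])
    # Random, strictly positive starting distributions (an arbitrary choice).
    p_z_c = [_normalise([rng.random() + 0.01 for _ in range(n_clusters)])
             for _ in range(n_rows)]
    p_p_z = [_normalise([rng.random() + 0.01 for _ in range(n_cols)])
             for _ in range(n_clusters)]

    for _ in range(n_iter):
        acc_z_c = [[0.0] * n_clusters for _ in range(n_rows)]
        acc_p_z = [[0.0] * n_cols for _ in range(n_clusters)]
        for i in range(n_rows):
            for j in range(n_cols):
                n_ij = scores[i][j]
                if n_ij == 0:
                    continue
                # E-step: posterior P(z_k | c_i, p_j) via Bayes' formula.
                joint = [p_p_z[k][j] * p_z_c[i][k] for k in range(n_clusters)]
                denom = sum(joint)
                if denom == 0.0:
                    continue
                for k in range(n_clusters):
                    weight = n_ij * joint[k] / denom
                    acc_z_c[i][k] += weight
                    acc_p_z[k][j] += weight
        # M-step: renormalise the expected counts into probabilities.
        p_z_c = [_normalise(row) if sum(row) > 0 else old
                 for row, old in zip(acc_z_c, p_z_c)]
        p_p_z = [_normalise(row) if sum(row) > 0 else old
                 for row, old in zip(acc_p_z, p_p_z)]

    log_likelihood = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            if scores[i][j] > 0:
                mix = sum(p_p_z[k][j] * p_z_c[i][k] for k in range(n_clusters))
                log_likelihood += scores[i][j] * math.log(max(mix, 1e-300))
    return p_z_c, p_p_z, log_likelihood


# Hypothetical score matrix: four customers, four features.
data = [[5, 4, 1, 1], [4, 5, 1, 2], [1, 1, 5, 4], [2, 1, 4, 5]]
p_z_c, p_p_z, log_lik = plsa(data, n_clusters=2, n_iter=30)
print(log_lik)
```

EM converges only to a local maximum, so in practice several seeds would typically be tried and the fit with the highest log-likelihood kept.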
Step 4: Determine the suitable number of clusters ($K$). Generally, the number of clusters for a given data set is not known a priori. The Akaike information criterion (AIC) is used to determine the suitable number of clusters ($K$):
$\mathrm{AIC} = -2L + 2K(n + m)$ (9)
where $L$ denotes the log-likelihood value, $K$ denotes the number of clusters, and $n$ and $m$ determine the number of parameters in the model. Normally, $n$ and $m$ are the numbers of rows and columns of the data matrix, respectively. In this study, the rows correspond to the customers (i.e., $n = 487$), and $m$ is the number of features (i.e., $m = 27$ = 9 product categories × 3 R-F-M features). The suitable $K$ value is the one that provides the minimum AIC value.
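Equation (9) and the selection rule of Step 4 can be sketched as follows. The values n = 487 and m = 27 come from the case study, whereas the log-likelihood values in `fits` are hypothetical placeholders, and the function names `aic` and `best_k` are assumptions.

```python
def aic(log_likelihood, k, n, m):
    """Akaike information criterion of Equation (9): -2L + 2K(n + m)."""
    return -2.0 * log_likelihood + 2.0 * k * (n + m)


def best_k(log_likelihoods, n, m):
    """Given a mapping K -> fitted log-likelihood, return the K with the smallest AIC."""
    return min(log_likelihoods, key=lambda k: aic(log_likelihoods[k], k, n, m))


# Hypothetical log-likelihoods for three candidate cluster numbers.
fits = {2: -5000.0, 3: -4300.0, 4: -4100.0}
for k, log_lik in fits.items():
    print(k, aic(log_lik, k, n=487, m=27))
print(best_k(fits, n=487, m=27))
```

The penalty term grows linearly in K, so an additional cluster is accepted only if it raises the log-likelihood by more than n + m.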
Step 5: Allocate customers into clusters. In this step, the value of $P(z_k \mid c_i)$ is used to allocate all of the customers into clusters. A customer $c_i$ is assigned to the cluster $k$ with the highest $P(z_k \mid c_i)$ value.

Customer clusters
The optimum number of clusters was determined using the AIC. Figure 3 shows the AIC values for different cluster numbers ranging from 2 to 15. The results indicate that the lowest AIC value is obtained for nine clusters ($K = 9$), which is the optimum number of clusters in the case study. Examples of $P(z_k \mid c_i)$ matrix elements obtained for the nine customer clusters are presented in Table 3. The matrix elements indicate the probabilities of a customer belonging to each cluster. For example, the probabilities that customer N01 belongs to clusters 2, 3, 4, 8, and 9 are 0.10, 0.24, 0.13, 0.29, and 0.24, respectively. As the highest probability for customer N01 corresponds to cluster 8, customer N01 is considered to belong to that cluster.
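The assignment rule can be sketched with the probabilities reported above for customer N01. The function name `assign_cluster` is an assumption; clusters not listed for N01 are left out of the dictionary because the five listed probabilities already sum to 1.

```python
def assign_cluster(probabilities):
    """Return the cluster label with the highest P(z_k | c_i)."""
    return max(probabilities, key=probabilities.get)


# P(z_k | c_i) for customer N01, keyed by cluster number.
n01 = {2: 0.10, 3: 0.24, 4: 0.13, 8: 0.29, 9: 0.24}
print(assign_cluster(n01))
```

The full probability row remains informative after the assignment: N01 is placed in cluster 8, but clusters 3 and 9 are close alternatives.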

Customer group description
After the customers are divided into nine clusters, the RFM analysis is performed to evaluate the customer value and define the group description in each segment. In this process, the ↑ sign represents a group whose average R, F, or M value is higher than the average value for all the customers. The ↓ sign represents a group whose average R, F, or M value is lower than the average value for all the customers (Ha, 2007; Ha & Park, 1998). The three R-F-M values therefore divide the customers into eight ($2^3$) possible segments. For example, "R↑F↑M↓" signifies that the average M value for this customer segment is lower than the total average M value, whereas the R and F values are higher than the average R and F values for all the customers, respectively. In accordance with the works of Ha (2007) and Ha and Park (1998), each of these eight customer segments is assigned a customer type.
Table 4 presents the results for the RFM pattern in each product category. These tables present the number of customers allocated to each customer cluster; the average actual values; the average R, F, and M scores; the RFM pattern; and the customer type. For example, for the ham product category, there are 58 customers in cluster 1. This cluster has average actual recency, frequency, and monetary values of 6.88 time units, 21.95 times, and 4965.03 cost units, respectively. The average R, F, and M scores for this cluster are 4.48, 2.88, and 2.5, respectively. This customer cluster is classified as the shoppers (SH) type because its average R and F scores are higher than the average R and F scores for all the customers (i.e. 4.48 > 4.17 and 2.88 > 2.01), but its average M score is lower than the total average (i.e. 2.5 < 2.78). Table 5 presents all the product categories. For example, the customers belonging to cluster 9 are valued customers for the ham, loin, shoulder, and belly categories and are the most valued customers considering all the product categories.
Table 6 presents the results for the case where the individual product categories are not considered. The solutions for these two scenarios (presented in Tables 5 and 6) are clearly different. More information is provided by the solutions in Table 5 than for the solutions in Table 6. The information obtained using the developed method (Table 5) can be used to identify the characteristics of each group and to capture a wider variety of the types of true ordering behaviours of the customers (compared with the case where the product category is not considered). This information can be used by plant managers to measure the loyalty and contributions of a customer segment for each product category, allowing them to develop better production and marketing strategies.
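The ↑/↓ labelling used in this section can be sketched with the ham-category figures quoted for cluster 1. The function name `rfm_pattern` is an assumption, as is the choice to label a score exactly equal to the overall average as ↓; the source does not say how such ties are handled.

```python
def rfm_pattern(cluster_avg, overall_avg):
    """Build a pattern such as 'R↑F↑M↓' by comparing a cluster's average
    R, F, and M scores with the averages over all customers."""
    parts = []
    for name, cluster_value, overall_value in zip("RFM", cluster_avg, overall_avg):
        parts.append(name + ("↑" if cluster_value > overall_value else "↓"))
    return "".join(parts)


# Ham category, cluster 1: average scores versus the averages for all customers.
pattern = rfm_pattern((4.48, 2.88, 2.5), (4.17, 2.01, 2.78))
print(pattern)
```

For these figures the pattern is R↑F↑M↓, which is the pattern the text classifies as the shoppers (SH) type.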

Managerial implications
The main objective of this study was to provide decision support models for the environmentally responsible decision-making process in multi-category products under different customer demand patterns. The key objective in marketing is to increase the number of customer orders and persuade customers to spend more money on a company's products; thus, information such as that presented in Table 5 can help a company to formulate a marketing plan. The marketing plan can be designed for a specific customer cluster, a specific product, or a customer group for a specific product. For example, the customers in cluster 8 are considered to be churn customers in the shoulder product category (see Table 5). These customers have high frequency and high monetary values in the shoulder product category but have not placed orders for a long time. Something may have gone awry with this group of customers. The company can develop strategies for motivating these customers to purchase more shoulder products, e.g. by launching a promotion on shoulder products for this group only. In the collar product category, none of the clusters contains valued customers. The company may need to tune its marketing plan to increase sales in this product category. In contrast, the customers in cluster 9 exhibit a high potential to become valued customers in many product categories. Appropriate marketing strategies can increase the sales to and loyalty of this group of customers.
The information can also be used in order allocation to ensure customer satisfaction when the supply of specific meat cuts is insufficient to fill all the customer orders. A company typically caters to its highest-priority customers first. The information obtained from the model can help a decision maker to prioritise the company's customers for specific product categories. For example, consider that there are 94, 350, 494, 362, and 230 live pigs of the extra-small, small, medium, large, and extra-large sizes, respectively, which are supplied from contract farms to the processing plant. After the pigs are slaughtered and processed, the plant obtains 188, 700, 988, 724, and 460 pieces of each meat cut (i.e. ham, loin, and shoulder) in the extra-small, small, medium, large, and extra-large sizes, respectively. We further assume that seven customers (N01-N07) have placed orders for extra-large-sized ham, loin, and shoulder pieces, as shown in Table 7. There are 460 pieces of pork product available for each of the three meat cuts, resulting in a supply shortage. The decision maker must decide how to allocate the available products to these customers. The information in Table 5 can be used to make this decision. Let us consider the ham product category. The highest-priority customers in allocating the orders should be customers N05 and N06 from the valued
Apichottanakul et al., Cogent Engineering (2021)

Conclusion
This study presents an approach for customer segmentation and characteristic analysis based on the GRFM values and the PLSA model. Historical data from the company were transformed into RFM scores, which were used in the PLSA model to cluster customers into specific product categories. Because the analysis takes into account the probability of purchasing items, it provides better insight into the customers' preferences and captures a wider variety of the types of true ordering behaviours of the customers as a group than the traditional RFM analysis. This information can also be used in market planning and to determine the customer priority for each product category to increase company profits and customer satisfaction. Moreover, the company can develop effective strategies by measuring the loyalty and contributions for each segment to improve their customer relationship management, which can result in a competitive advantage. Additionally, the model framework can be applied to other businesses for the same purposes.