An Automatic Determining Food Security Status: Machine Learning based Analysis of Household Survey Data

Household food security is a major issue in developing countries such as Pakistan. Despite significant breakthroughs in grain production within the country, problems of food availability and utilization persist. Diet is one of the most potent determinants of nutritional condition. The dietary intake method has been used to determine the food security status of households, which depends on various factors. No automatic, user-friendly method is available to determine food security status, which is generally established by manually calculating calorie intakes. Machine learning is attractive for this task because of its high predictive performance. In this paper, food security status is examined by applying six machine learning algorithms (support vector machine, naïve Bayes, k-nearest neighbors, random forest, logistic regression, and neural network) to household survey data in order to find the best predictor of the status. A Food Analysis (FA) App has been developed that automatically predicts the FAO food security status of a household using the random forest model, which achieved the highest precision among the algorithms. The proposed mobile app will also help in collecting household data. A further objective of the study was to enhance food security awareness among individuals.


INTRODUCTION
The concept of food security has evolved as a measure of reliable access to, and availability of, sufficient food of good quality. [1] It is an enduring subject covering various determinants, including socioeconomic and environmental factors. The food security concept covers the food system activities and outcomes that contribute to social, economic, and environmental benefits for an active and healthy life. [2] Food security encompasses agriculture, environment, [3] employment and income, [4] marketing, health, nutrition, and public policy. [5] Pakistan is a developing country with an average per capita income of 1254 USD per annum. Its economy is highly reliant on agriculture, which contributes nearly 19.3% of GDP and employs 38.2% of its workforce. [6] Pakistan had a Human Development Index (HDI) value of 0.560 as of 2012 and was ranked 152 out of 189 countries and territories, [7] placing it among the medium human development nations. The majority of the population (more than 60%) lives in rural areas and depends on the agriculture sector for its livelihood. [8] Pakistan has made significant progress toward food sufficiency since its independence. [9] Wheat production has played a major role in the country's food security. The country's total wheat production was 24 million tons in 2010, compared to 11.6 million tons per year in the early 1980s. Wheat has helped feed a population that grew from 85 million in 1980 to 174 million in 2010. According to the National Nutrition Survey 2011, almost 60% of households in Punjab, 72% in Sindh, and 63.4% in Balochistan are experiencing food insecurity. The causes of food insecurity in the country include rising food prices, poverty, terrorism, energy crises, slow economic growth, and political instability. [7] Many approaches [10] are adopted for determining the status of food security. However, there is no universally accepted model that avoids complications.
It is difficult to determine a person's calorie intake accurately by hand in all cases; in general, cultural and geographical factors therefore have to be taken into consideration. Currently, no automatic method is available for determining food security status. There is a need for an automatic system that predicts food security status correctly and precisely from individual dietary intake. Inappropriate and nutritionally deficient diets contribute to various infectious diseases, disorders, overweight, cancers, and other chronic illnesses.
Machine learning techniques, which extract information automatically using statistical or computational models, can help identify the relevant factors accurately and improve performance. [11] Smartphone usage in different fields is increasing progressively [12] due to ease of access, good user interfaces, and reliability. In recent years, a few studies [13] have addressed food security in Pakistan. However, little interest has been shown in using machine learning models to automate the assessment of small farming households in order to enhance food security in the country. Smart wireless technologies have swiftly become the most common means of transmitting data, voice, and services in the modern world. Given this dramatic change, intelligent technologies have great potential for advancement in numerous domains. Developments in smartphones offer a unique opportunity to bring tools and technologies together in a much more informal manner. With intelligent technologies, the traditional ways of determining food security status can be transformed into automated systems.
To bridge this gap, an Android-based application has been developed to determine the food security status of households through machine learning and smart technologies. In this study, different machine learning models were implemented and trained on survey data from 756 farming households to find the best prediction model. Random forest performed comparatively best on these survey data in predicting food security status, so the application was built around a model trained with the random forest algorithm. In the proposed app, users enter the food they consumed over the last 7 days. After the necessary preprocessing, the FAO food security status is predicted. The machine learning approach parses user data, learns from that data, and informs decisions based on what the model has learned rather than on complex manual calculation. The study also aims to familiarize survey researchers and social scientists with how machine learning algorithms perform and to highlight their applications to survey datasets.

BACKGROUND AND LITERATURE REVIEW
A country's ability to pay for imports is a key determinant of food security. Food import capacity (FIC) and food import dependence are two indicators of a country's food security, of which food import capacity is the more reliable. [14] In some countries, foreign exchange reserves are also a main concern because monetary constraints can limit the role of imports in closing the gap between production and consumption (FAO. 2003). [15] According to Gittelsohn et al., [5] a household is food secure when it has ample access to healthy food for all of its members. Alinovi et al. [16] likewise consider a household food secure if it is able to obtain the nutrition required by its members. Household food security indicators may include household location, household density, dependence, income, health status, food production, and employment status. Garrett and Ruel [17] showed that household access to food depends on whether the household has enough money to purchase food at prevailing prices or has enough land and other resources to grow its own food. Food security measurement is still contentious owing to the selection and ordering of questions such as who should get food, when, how, how much, and of what kind. [18] These questions form the basis for selecting food security methods. Other key questions add diversity to food security measurement tools: What is the frequency of food insecurity? What are the variations in this frequency over time? What are the causes of food insecurity? What is the causal association between food security and these factors? What are the probable effects of food insecurity on human health and behavior? [19] However, the chosen definition of food security determines the choice of the best measurement method. [16]
According to FAO 2011, the global level of chronic food insecurity increased dramatically from 1990 to 2007 and even more in 2008-2009 due to financial and economic crises. Food production and insecurity at the global level are driven by factors such as population growth, availability of arable land, water resources, climate change, food availability, food accessibility, and food loss. Previous studies show that many factors affect food security. In developing countries, instability in domestic and international food prices causes food insecurity and hunger. Gorton et al. [20] found that many physical, economic, social, political, and environmental factors influence food security in high-income countries. Among these factors, household financial resources are the major cause of food insecurity in developed countries.
They suggest that, to overcome food insecurity, interventions consistent with the supplemental nutrition assistance program are needed. Lack of assets, illiteracy, female-headed households, and a higher number of dependent members in the household are the real threats to household food security. A study carried out in Nepal found that the incidence of food insecurity was 74% and the food insecurity gap was 0.33. [21] The severity of food insecurity was 14%. The major socioeconomic factors contributing to food security were smaller household size, low dependency ratio, better irrigation facilities, large farm size, and livestock holding. Akter and Basher [22] studied the impact of income shocks and food prices on households' food security and well-being in rural Bangladesh. They argue that increasing food prices and subsequent income shocks lead to a high level of food insecurity, but that the adverse impact of these factors faded over time with economic growth, market adjustments, and domestic policy responses. Addressing the impact of climate change and reducing risks to food security is one of the major challenges at present. [23]
Machine learning has had a huge impact on sentiment analysis and text classification in various languages. [24] The task of opinion mining and sentiment analysis [25] is to analyze people's opinions, evaluations, sentiments, attitudes, and emotions from textual datasets. The literature offers many methods for text classification, [26,27] opinion mining, [28] and evaluation of sentiments. [29] The commonly used text classification techniques are lexicon-based, machine learning-based, [30] and rule-based methods. Recently, deep learning approaches have been used for feature selection and sentiment analysis, as described by Onan. [31] Furthermore, topic modeling has been used in sentiment analysis to find the information contained in textual documents and present it in the form of themes. [32,33]

Study Area Description and Data Collection:
The study was conducted mostly in the rural areas of Punjab, a heavily populated province of Pakistan. Punjab was selected mainly because of its share of national agricultural GDP, which is 51%, and its importance to the national economy and crop production. [34] Punjab is located at approximately 30°00ʹ N and 70°00ʹ E, with a total area of 205,344 km². [35] It has a rich agricultural sector which, with its enormous irrigation system, contributes substantially to the development and economy of the province. [35] The average temperature in Punjab ranged from 16.3°C to 31.9°C during 1970-2001. The province can be divided into five agro-climatic regions: wheat-cotton zone, wheat-rice zone, arid zone, mixed zone, and low-intensity zone. [36] The study used a multi-stage stratified sampling technique to select areas/districts and 756 farm households from three agro-climatic zones (Table 1). In the first stage, five strata were formed according to the zones. In the second stage, 12 districts were selected from a total of 36 districts using a stratified purposive sampling technique. The strata were not of equal size, so a proportional sample was drawn from each stratum by using a formula. One district was selected from the arid zone, two from the low-intensity zone, and three each from the wheat-cotton, mixed, and wheat-rice zones. The selection criterion was the homogeneity of five main crops, i.e., cotton, sugarcane, rice, maize, and wheat. In the third stage, four villages were selected randomly, and in the fourth stage, 12 households from each village were also selected randomly, giving a total of 756 households.

Analytical framework
There are several ways of determining food security status, and different studies [5,9,37] use different methods. Machine learning models can be applied to determine the food security status of small farming households. Machine learning uses statistics, computational methods, and mathematics to build accurate and efficient classification algorithms. The framework of the proposed methodology is shown in Figure 1. The study used the household dietary intake survey data collected by Ahmed et al. [38] The survey data were divided into training and testing sets and, after the necessary preprocessing, six machine learning models were applied to find the best prediction model: naïve Bayes, support vector machine (SVM), k-nearest neighbors (kNN), random forest, logistic regression, and neural network. A mobile application was designed around the diet features and uses a trained random forest model because of its higher prediction accuracy. Random forest provided the highest accuracy and precision because it combines a large number of relatively uncorrelated models (trees) operating as a committee, which outperforms any of its individual component models. The accuracy comparison among these models is presented in Table 2. Enenkel et al. [39] noted that once a mobile application is available, all assessments can easily be uploaded to a database for further processing and trend analysis. The Food Analysis (FA) App was evaluated and tested by entering the calories consumed by the user over the last 7 days. The app then predicted the FAO food security status based on its training.
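The split, train, and compare procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the data are synthetic per-capita calorie values rather than the survey of Ahmed et al., and the two classifiers (a majority-class baseline and a small k-nearest-neighbors implementation) are simple stand-ins for the six models. All function and class names are hypothetical.

```python
import random
from collections import Counter


def make_synthetic_households(n_per_class=10, seed=0):
    """Synthetic ([per-capita kcal], label) rows; NOT the survey data."""
    rng = random.Random(seed)
    insecure = [([rng.uniform(1000, 1900)], 0) for _ in range(n_per_class)]
    secure = [([rng.uniform(3000, 3900)], 1) for _ in range(n_per_class)]
    return insecure + secure


def train_test_split(rows, test_size, seed=0):
    """Shuffle a copy of the rows and hold out `test_size` of them for testing."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    return rows[test_size:], rows[:test_size]


class MajorityClass:
    """Baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class KNearestNeighbors:
    """Plain kNN with squared Euclidean distance and majority vote."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_, self.y_ = X, y
        return self

    def predict(self, X):
        predictions = []
        for x in X:
            nearest = sorted(
                zip(self.X_, self.y_),
                key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
            )[: self.k]
            votes = Counter(label for _, label in nearest)
            predictions.append(votes.most_common(1)[0][0])
        return predictions


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def compare_models(models, train, test):
    """Fit every model on the training rows and score it on the test rows."""
    X_train, y_train = [x for x, _ in train], [y for _, y in train]
    X_test, y_test = [x for x, _ in test], [y for _, y in test]
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy(y_test, model.predict(X_test))
    best = max(scores, key=scores.get)  # model with the highest test accuracy
    return scores, best


rows = make_synthetic_households()
train, test = train_test_split(rows, test_size=6)
scores, best = compare_models(
    {"majority": MajorityClass(), "knn": KNearestNeighbors(k=3)}, train, test
)
print(scores, "best:", best)
```

Any other model can be added to the comparison as long as it offers the same `fit` and `predict` methods.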

Best selected model
Table 1 note: the highlighted districts are the districts where the study was carried out. Source: Pinckney [36] and Ahmed et al. [38]
Random forest is a simple, flexible, and versatile machine learning algorithm developed by Leo Breiman [40] for regression, classification, and prediction. A random forest consists of numerous decision trees, each built from a different sample of the data set and a different subset of features. Individual trees have high variance. Computationally, random forest depends on only a few tuning parameters and can be applied directly to high-dimensional data because of its built-in estimate of the generalization error. The main tuning parameters are the number of randomly selected predictor variables chosen at each node, the total number of trees in the forest, and the tree size, which is set by the minimum node size or the maximum number of terminal nodes. The algorithm is a meta-learner made of several separate trees in which each tree casts a vote, possibly with an individual weight, and the forest chooses the classification with the most votes, weighted or unweighted. Each attribute contributes equally to calculating and predicting the outcome; however, attributes that contribute little can be filtered out during model training. The number of trees can be increased without increasing the generalization error. Random forest provides the highest precision in this case because, during the training phase, it builds multiple decision trees that look at different aspects of the problem and returns the mode (for classification) or the average (for regression) of the trees' outputs. The accuracy and computational complexity of the algorithm are influenced by the total number of variables in the problem. The training phase follows the general procedure of bootstrap aggregation, or bagging: a random sample is repeatedly drawn with replacement from the training set and a tree is fitted to each sample. After training, the prediction for unseen samples is the average of the predictions of the individual trees:
$$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(X')$$
where X' = unseen samples, f_b = the tree fitted in repetition b, b = 1, 2, ..., B, and B = the number of times the decision tree fitting is repeated.
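The aggregation step described above, in which the forest returns the mode or the average of its trees' outputs, can be sketched in a few lines. This is a minimal illustration: the per-tree outputs are made-up numbers for a single unseen household, the function names are hypothetical, and no trees are actually trained. The sketch also reports the sample standard deviation across trees as a simple measure of spread.

```python
from collections import Counter
from statistics import mean, stdev


def aggregate_regression(tree_predictions):
    """Bagged estimate: average of the per-tree predictions and their spread."""
    return mean(tree_predictions), stdev(tree_predictions)


def aggregate_classification(tree_votes):
    """Unweighted majority vote over the per-tree class labels."""
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical outputs of B = 5 trees for one unseen household X'.
per_tree_values = [0.8, 0.6, 0.9, 0.7, 1.0]
per_tree_votes = [1, 1, 0, 1, 1]

f_hat, sigma = aggregate_regression(per_tree_values)
status = aggregate_classification(per_tree_votes)
print(f_hat, sigma, status)
```

`statistics.stdev` divides by B - 1, the sample standard deviation convention.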
In addition, the uncertainty of the prediction can be estimated as the standard deviation of the predictions from all the individual decision trees on X': [41]
$$\sigma = \sqrt{\frac{\sum_{b=1}^{B} \left( f_b(X') - \hat{f} \right)^2}{B - 1}}$$
where f_b(X') is the prediction of tree b and \hat{f} is the average over all B trees. The predictions of a single tree are highly sensitive to noise in its training set, but the average of many trees is not, as long as the trees are not correlated. Furthermore, SVM provides the second highest precision because it chooses a hyperplane that minimizes classification errors. [42] In the accuracy table, AUC stands for "Area under the ROC (receiver-operating characteristic) Curve", which measures the entire two-dimensional area under that curve. [43,44] It delivers an aggregate measure of performance across all possible classification thresholds. CA (classification accuracy score) measures the ratio of the number of correct predictions to the total number of input samples.

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
Precision is the ratio of true positives to the sum of true positives and false positives. [44] It reflects the ability of the classifier not to label a negative sample as positive. The precision curves of the implemented machine learning models on the survey data are shown in Figure 2.
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
Recall is the ability of the classifier to find all the positive samples: the number of correct positive results divided by the number of all relevant samples. [45]
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
In the model evaluation, test accuracy has been measured through the F1 score. The F1 score is the harmonic mean of precision and recall, and its range is [0, 1]. The mathematical formula for the F1 score is expressed as: [44]
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
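As a worked illustration of the evaluation measures defined above, the following sketch computes accuracy, precision, recall, and the F1 score from true and predicted labels, plus a rank-based estimate of the AUC. The labels, scores, and the 0.45 decision threshold are invented for the example and are not results from the study; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 as defined in the text."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc(y_true, scores, positive=1):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented example: 1 = food secure, 0 = food insecure.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.2, 0.1]
y_pred = [1 if s >= 0.45 else 0 for s in scores]  # arbitrary threshold

report = evaluate(y_true, y_pred)
print(report, "AUC:", auc(y_true, scores))
```

Here precision (0.6) and recall (0.75) differ, so the F1 score falls between them, closer to the smaller value, as expected of a harmonic mean.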

Mobile application development
With the rapid improvement of living standards, determining food attributes accurately and automatically, tracking daily dietary calories, controlling nutrient intake, and managing food habits to stay healthy have attracted more and more attention in everyday life. Various apps have been developed to record daily meals, food names, and calories, and to estimate calories from images. [46,47] In the proposed study, the FA App has been developed to predict the food security status of households from the calories consumed in the last 7 days. The FA App relies on a machine learning model that has been trained and tested, instead of manual calculation. In the app, users enter food quantities such as beef (per kg), potato, oil, and other required features. The app first preprocesses the user input, for example converting kilograms (kg) into grams (g). The preprocessed data are passed to the trained model, from which specific parameters are defined to make the random forest decision. Daily calories, per capita calories, oil intake, total household members, and total calorie intake in the last 7 days are taken as the key features for the training model. The user's calorie inputs are passed to the trained model, which evaluates them and predicts the FAO food security status. The graphical interface of the FA App is shown in Figure 3.
If the FA App returns a status of one (1), the household is food secure; if it returns zero (0), the household is food insecure. The prediction results are illustrated in Figure 4. All calorie inputs to the app are numeric. The link to the food security app is (https://play.google.com/store/apps/details?id=com.foodo.analysis).
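The flow described in this section (unit conversion, feature construction, prediction, and interpretation of the returned status) can be sketched as follows. This is only an illustration under stated assumptions, not the FA App's code: the energy densities are placeholder values rather than the study's conversion table, the feature names are hypothetical, and a constant stub stands in for the trained random forest.

```python
GRAMS_PER_KG = 1000
DAYS = 7

# Placeholder energy densities in kcal per gram; illustrative values only,
# NOT the conversion table used by the FA App.
KCAL_PER_GRAM = {"beef": 2.5, "potato": 0.75, "oil": 9.0}


def build_features(weekly_kg, household_members):
    """Turn 7-day food quantities (kg) into the key features named in the text."""
    # Preprocessing step: convert kilograms into grams.
    weekly_g = {food: kg * GRAMS_PER_KG for food, kg in weekly_kg.items()}
    total_kcal = sum(g * KCAL_PER_GRAM[food] for food, g in weekly_g.items())
    daily_kcal = total_kcal / DAYS
    return {
        "total_calories_7d": total_kcal,
        "daily_calories": daily_kcal,
        "per_capita_calories": daily_kcal / household_members,
        "oil_intake_g": weekly_g.get("oil", 0.0),
        "household_members": household_members,
    }


def status_label(prediction):
    """Map the model output to the status reported by the app."""
    if prediction == 1:
        return "food secure"
    if prediction == 0:
        return "food insecure"
    raise ValueError(f"unexpected model output: {prediction!r}")


def predict_status(features, model):
    """`model` is any callable mapping a feature dict to 0 or 1."""
    return status_label(model(features))


def stub_model(features):
    """Stand-in for the trained random forest; always returns 1."""
    return 1


features = build_features({"beef": 2.0, "potato": 5.0, "oil": 1.0}, 4)
print(features)
print(predict_status(features, stub_model))
```

Keeping the model behind a plain callable means the same preprocessing can be reused with whichever trained classifier is eventually plugged in.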

CONCLUSION
The present study examines food security status in different districts of rural Punjab using several machine learning algorithms applied to a household survey of 756 farmers from various regions. A mobile app has been developed around a trained machine learning model to determine food security status automatically. The study has uncovered hidden heterogeneities in the household survey data. The proposed FA App is useful for supporting healthy eating, reducing malnutrition, and improving the overall health and nutritional status of the country's population. Furthermore, the study introduces machine learning methods to food security and survey analysis; machine learning has proven to be an advanced technology with a large number of successful applications in various domains. The FA App will also help collect dietary intake data, which will improve the overall accuracy and efficiency of the system.