Semantics Analysis of Agricultural Experts’ Opinions for Crop Productivity through Machine Learning

ABSTRACT Semantic analysis is an active area of research associated with Natural Language Processing (NLP), artificial intelligence, opinion mining, text clustering, and classification. Numerous text processing techniques are used to extract sentiment from text in domains such as social media tweets, hoaxes, fiction, nonfiction, novels, books, movies, health care, and the stock exchange. Agrarian experts’ opinions play a vital role in helping the agriculture sector achieve good crop productivity. This paper presents a descriptive analysis of agricultural experts’ opinions through machine learning methods based on textual data collection. The data were collected by surveying academia, research institutes, and industry in Punjab, Pakistan. Opinions on the impact of agricultural inputs such as seed quality, soil quality, soil-intensive tillage, climate change, water shortage, synthetic fertilizer, and precision technologies on crop productivity were collected through questionnaires. The analysis of the collected opinions aims to increase crop yield by making farmers aware of current agricultural inputs. The research provides a cohesive expert guideline for improving crop productivity that is useful for agricultural policymaking and conveys adequate knowledge to farmers. Consequently, the proposed method is an innovative way of discovering agrarians’ recommendations through sentiment analysis of survey data using machine learning methods. Furthermore, to the best of our knowledge, agrarian experts’ opinions on enhancing crop productivity have been considered for the first time in Pakistan.


Introduction
Semantics extraction operates at different levels of the analyzed text: phrases, sentences, and documents. Recently, researchers have concentrated on semantics in the processing of human language (Chowdhury 2003). Currently, large collections of documents are sent and stored electronically; however, the text must be preprocessed to extract meaningful information. Semantics can describe how different words take on different meanings for different people (Wolf 1991). Semantics analysis has been applied to social media such as professional networking services (LinkedIn), social networking sites (Facebook), media sharing networks (Instagram, Pinterest, YouTube), social blogging networks (Tumblr, Medium), discussion networks (Reddit, Quora), and review networks (Yelp, Glassdoor). These platforms enable people from all over the world to post and share images, videos, audio, and, through LinkedIn, professional profile information (Bontcheva and Rout 2014). Specifically, scholars have performed sentiment analysis on several social media datasets, particularly healthcare datasets, Facebook comments, movie datasets, and tweets (Saif et al. 2013). Over the last decade, there has been an explosion of work exploring various aspects of sentiment analysis: detecting subjective and objective sentences; classifying sentences as positive, negative, or neutral; detecting the person expressing the sentiment and the target of the sentiment; and detecting emotions such as joy, fear, and anger in text. Surveys such as Liu and Zhang (2012) summarize many of these approaches. Today, documents are stored electronically in every domain, and the volume of text data has grown rapidly in industry, business, technology, and the agriculture sector. Agriculture has become innovative through the use of IoT (Farooq et al. 2020), Cloud Computing (Mekala and Viswanathan 2017), Artificial Intelligence (Smith 2018), Machine Learning (Benos et al. 2021), Deep Learning, and Data Science (Angiani et al. 2016). Generally, agricultural productivity depends on essential factors such as fertilizer, seed, soil, water, and climate change (Ahmad and Heng 2012). In Pakistan, Punjab is the main agricultural zone for major and minor crops, contributing 18.9% of GDP and 42.3% of the labor force (Elahi et al. 2020). In the agriculture field alone, a large amount of text data is available through diverse platforms such as Twitter, Facebook, and LinkedIn groups (Martini et al. 2011). Scientists have used semantics analysis on agriculture datasets to judge similarities, sentiments, emotions, feelings, and thoughts regarding crop productivity.
Various techniques have been used for selecting and extracting features from text, such as frequency features, Term Frequency-Inverse Document Frequency (TF-IDF), count vectors, N-grams (Uni-Gram, Bi-Gram, Tri-Gram), and Bag of Words (BOW) (Mirończuk and Protasiewicz 2018). TF-IDF has been a widely used method for feature extraction (Abualigah et al. 2017). Machine learning has recently been applied to soil types for agricultural crop productivity and management systems (Saikai, Patel, and Mitchell 2020; Dongare et al. 2020). Deep learning approaches also address many problems related to agriculture (Hoang et al. 2013), bioinformatics, and the computational biology of plants (Muharam et al. 2021). Many scholars have presented ICT applications in agriculture for remote sensing, ecosystem services, crop yield forecasting, land monitoring, climate change, and online demand for agricultural products (Abd-Elmabod et al. 2020; Weiss, Jacob, and Duveiller 2020; Kantasa-ard et al. 2020). Separately, scientists are applying machine learning techniques to agricultural inputs to measure their effects (Benos et al. 2021).
Semantics analysis has been used for the management of crops, soil, and water in the agriculture domain (Karthikeyan et al. 2020). Many agricultural applications, such as digital agriculture (Jayaraman et al. 2015), rely on IoT infrastructure related to crop management systems (Prathibha, Hongal, and Jyothi 2017). The focus of the proposed study is to apply semantics analysis to agrarian opinions and to provide the experts' recommendations and guidelines for farmers, which play a valuable role in crop growth and management. Therefore, an analysis of agricultural experts' opinions on crop productivity is presented in this study. The major contributions of the proposed work are:
• Collection of descriptive opinions of agricultural experts through questionnaires.
• Analysis of the descriptive opinions of agricultural experts through machine learning techniques.
• Determination of the significant factors affecting agricultural productivity through opinion mining, which is helpful to farmers and policymakers.
The rest of the paper is arranged as follows: Section 2 describes the proposed methodology and related materials, Section 3 presents the experimental results and discussion, and Section 4 draws the conclusion.

Materials and Methods
In this research, textual opinions regarding agricultural productivity were collected from agrarian experts. After preprocessing, feature extraction techniques such as N-gram, BOW, TF, and TF-IDF were applied to the corpus to obtain informative features. KNN and Naïve Bayes algorithms were selected for training the model. Finally, the agrarians' responses were compared using cosine similarity. Figure 1 shows the flow of the system for the proposed study.
being continued for better results from the trained model. Table 1 shows the list of questions that were prepared for data collection from agricultural experts.

Preprocessing
Preprocessing is the process of cleaning and preparing the text for classification. The text contains implicit noise that needs to be removed using data cleaning techniques. In the present research, preprocessing techniques such as word tokenization, stop word removal, stemming, lemmatization, and bag of words (Singh and Kumari 2016) were applied. Tokenization was used to split the text into chunks (tokens). Stop words such as 'a,' 'an,' 'the,' 'have,' 'has,' 'from,' 'we,' 'will,' 'they,' 'them,' and many more carry little meaning on their own, so they were removed from the corpus. Stemming and the closely related lemmatization were also used: stemming strips the suffix of a word, while lemmatization maps the word to its basic form (lemma) (Kowsari et al. 2019). A count vector, defined by the number of occurrences of each feature, is a basic way to represent text data numerically (count vectorization, sometimes loosely called one-hot encoding) (Vaghela, Jadav, and Scholar 2016). A word cloud, also called a text cloud or tag cloud, is generated from textual data and depicts words in different sizes according to their frequency. Word clouds are an alternative way of analyzing text from online surveys and documents that is much faster than manual coding. Essentially, word cloud generators work by breaking the text down into component words and counting how frequently they appear in the documents, as shown in Figure 2.
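
The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in the study: the stop-word set is a hypothetical, abbreviated list built from the examples given in the text, and the suffix stripping is a crude stand-in for a real stemmer or lemmatizer (all function names are assumptions introduced here).

```python
import re
from collections import Counter

# Hypothetical, abbreviated stop-word list taken from the examples in the text;
# the full list used in the study is not given.
STOP_WORDS = {"a", "an", "the", "have", "has", "from", "we", "will", "they", "them"}

# Crude suffix stripping, standing in for a real stemmer or lemmatizer.
SUFFIXES = ("ing", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def strip_suffix(token):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, and reduce each token to a rough base form."""
    return [strip_suffix(t) for t in tokenize(text) if t not in STOP_WORDS]


def word_counts(texts):
    """Count token frequencies over all texts, as a word cloud generator would."""
    counts = Counter()
    for text in texts:
        counts.update(preprocess(text))
    return counts


if __name__ == "__main__":
    answers = ["Crops need water", "The crop needs water"]
    print(word_counts(answers).most_common(3))
```

The frequency table returned by `word_counts` is the kind of input a word cloud generator uses to scale each word. In practice a dedicated stemmer or lemmatizer would replace `strip_suffix`, which mishandles many English words.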

Features Extraction
Feature selection aims to determine the most informative features in high-dimensional text documents (Abualigah et al. 2017). Researchers have used diverse approaches to extract features from a corpus (Mirończuk and Protasiewicz 2018): Continuous Bag of Words (CBOW) with Skip-Grams, Term Frequency-Inverse Document Frequency (TF-IDF), Feature Frequency (FF), N-grams, and word frequency and weight calculation over intensity words, negation words, and the overall sentence to find the polarity of words with negative or positive weights (Razzaq et al. 2019). Many methods exist that can be chosen according to the dataset requirements (Mirończuk and Protasiewicz 2018). In the current study, we selected Bag of Words (BOW), Term Frequency, N-grams, and TF-IDF for feature selection. BOW was applied to the text documents, and the N-gram (Uni-, Bi-, and Tri-Gram) approach was used to find the words that agrarian experts repeated with the highest frequency. TF-IDF is a standard weighting scheme whose primary aim is to find the most informative features, those that represent the intrinsic content of the document (Abualigah et al. 2017). TF-IDF was used in this study to find highly weighted features; combined with BOW, it is a commonly used method for small datasets in a specific content domain.
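
To make the feature extraction step concrete, the following sketch counts N-grams and computes TF-IDF weights on a toy corpus. It assumes the plain textbook definitions, tf(t, d) as the relative frequency of term t in document d and idf(t) = log(N / df(t)); the exact weighting variant used in the study is not stated, and libraries commonly apply smoothed variants. The corpus and function names are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(docs, n, k):
    """Return the k most frequent n-grams across all tokenized documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


def tf_idf(docs):
    """Compute one {term: weight} dictionary per tokenized document.

    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = log(N / df(t)), with N documents and df(t) documents containing t
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Toy corpus (illustrative, not survey data).
docs = [
    ["certified", "seed", "improves", "crop", "productivity"],
    ["soil", "quality", "affects", "crop", "productivity"],
    ["water", "shortage", "reduces", "crop", "yield"],
]

if __name__ == "__main__":
    print(top_ngrams(docs, 1, 3))   # most frequent uni-grams
    print(top_ngrams(docs, 2, 3))   # most frequent bi-grams
    print(tf_idf(docs)[0])
```

In this toy corpus the word "crop" occurs in every document, so its TF-IDF weight is zero even though it is the most frequent uni-gram. This is why raw N-gram frequencies and TF-IDF weights answer different questions: the former shows what experts mention most, the latter what distinguishes one document from another.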

Classification Algorithm
We used Naïve Bayes (NB) and K-Nearest Neighbors (KNN) models for classification. KNN is an instance-based learning method, also called a lazy algorithm, and it is a versatile algorithm used for text classification (Soucy and Mineau 2001) and regression. KNN is a feature-dependent algorithm. Lim proposed a methodology that improves KNN performance in text classification using well-estimated parameters (the K value). This study chose K = 3 for prediction, which reduced the trained model's error.
Here, $x$ denotes the total dataset and $y$ the total number of labels. For example, $x$ denotes the questions regarding agriculture, with range $x_i = (x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in})$, and $y$ denotes the labels, such as the agrarian experts' views, with range $y_i = (y_{i1}, y_{i2}, y_{i3}, \ldots, y_{in})$. Naïve Bayes is another simple classifier based on Bayesian probability; it assumes strong independence between feature probabilities (Hutto and Gilbert 2014). One advantage of the NB classifier is that it requires only a small amount of training data to estimate the parameters for prediction, which is why we selected this approach for this dataset (Tripathy, Agrawal, and Rath 2015). NB is a popular classifier for opinion mining, semantics, and sentiment studies. In the past, many scholars have used these two algorithms in their methodology due to their effectiveness and simplicity (Ikonomakis, Kotsiantis, and Tampakas 2005).
In Bayes' rule, $P(C \mid X) = P(X \mid C)\,P(C) / P(X)$, the term $P(C \mid X)$ represents the posterior probability and $P(X)$ the predictor prior probability. We chose Naïve Bayes because of the small dataset, and KNN was selected because it has performed well in semantics-based text classification studies (Vaghela, Jadav, and Scholar 2016).
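
The two classifiers can be illustrated with a small, self-contained sketch. Several assumptions are made here that the text does not confirm: documents are represented as word-count vectors, KNN uses Euclidean distance with a majority vote over K = 3 neighbors, and Naïve Bayes is the multinomial variant with add-one (Laplace) smoothing. The training examples and the labels "soil" and "water" are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two sparse word-count vectors (Counters)."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in set(a) | set(b)))


def knn_predict(train, query_tokens, k=3):
    """Label a query by majority vote among its k nearest training documents."""
    query = Counter(query_tokens)
    nearest = sorted(train, key=lambda item: euclidean(Counter(item[0]), query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, train):
        self.class_docs = Counter(label for _, label in train)
        self.word_counts = {label: Counter() for label in self.class_docs}
        for tokens, label in train:
            self.word_counts[label].update(tokens)
        self.vocab = {t for tokens, _ in train for t in tokens}
        self.n_docs = len(train)
        return self

    def predict(self, tokens):
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_docs.items():
            total = sum(self.word_counts[label].values())
            # log P(C) + sum of log P(x_i | C); P(X) is the same for every
            # class, so it can be dropped when comparing posteriors.
            score = math.log(n_label / self.n_docs)
            for t in tokens:
                likelihood = (self.word_counts[label][t] + 1) / (total + len(self.vocab))
                score += math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set of (tokens, label) pairs.
train = [
    (["soil", "fertility", "tillage"], "soil"),
    (["soil", "organic", "matter"], "soil"),
    (["soil", "testing", "fertility"], "soil"),
    (["water", "shortage", "irrigation"], "water"),
    (["canal", "water", "irrigation"], "water"),
    (["drip", "irrigation", "water"], "water"),
]

if __name__ == "__main__":
    query = ["irrigation", "water", "schedule"]
    print(knn_predict(train, query, k=3))
    print(NaiveBayes().fit(train).predict(query))
```

Smoothing matters for small datasets like the one described: without it, a single word unseen in a class (here "schedule") would drive that class's likelihood to zero.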

Cosine Similarity
Cosine similarity has been used for document comparison based on the number of words the documents have in common.

In previous studies, cosine similarity has also been used for document comparison on agriculture datasets (Prajapati and Kathiriya 2016). The cosine similarity method was applied to find similarities between documents using Equation (4). Using Equation (4), we found similarities between the feedback of agrarians associated with different institutes/DAI, academia, and research centers in Punjab, Pakistan. Figure 3 illustrates a comparison of documents, showing document similarities based on agrarian views. Ten documents are considered: Soil, Seed Type, Pest, Plant Diseases, Irrigation, Climate, Precision, Fertilizer, Harvest, and Policy. Each document is compared with all other documents; e.g., the first document, Soil (Figure 3a), is compared with the other 9 documents. The Soil document is similar to document No 2, Seed, and document No 9, Policy (Figure 3a), with the same polarity ratio of 0.15. The second document, Seed (Figure 3b), is likewise compared with the other 9 documents. The Seed document is similar to document No 3, Pest, and document No 5, Irrigation (Figure 3b), with the same polarity ratio of 0.15. Similarly, ratios are compared for documents C, D, E, F, G, H, I, and J, respectively.
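
As an illustration of the pairwise document comparison, the sketch below uses the standard definition of cosine similarity, $\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$, over word-count vectors. It is assumed, not confirmed by the text, that this is the form Equation (4) takes and that documents are represented as count vectors; the documents in the example are invented.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the word-count vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    """Compare every named document with every other one."""
    return {
        (p, q): cosine_similarity(docs[p], docs[q])
        for p in docs for q in docs if p != q
    }


if __name__ == "__main__":
    # Invented token lists standing in for three of the topic documents.
    docs = {
        "Soil": ["soil", "fertility", "crop", "tillage"],
        "Seed": ["certified", "seed", "crop", "variety"],
        "Irrigation": ["water", "canal", "irrigation", "shortage"],
    }
    for pair, score in sorted(similarity_matrix(docs).items()):
        print(pair, round(score, 2))
```

Only shared words contribute to the dot product, which is why documents with little vocabulary in common receive low scores such as those reported for Figure 3.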

N-Grams
In this study, we used the top twenty (20) N-gram features to see which words appeared most frequently in the agrarians' opinions. In the Uni-gram case (also called a unique word or single feature), agrarian experts primarily focused on "Crop," "Soil," and "Water," with frequencies of 430, 235, and 215, respectively. The word Crop is dominant because its frequency is higher than that of the others: Soil, Water, Plant, Seed, Insects, Agriculture, Production, Productivity, and many more in the graph (Figure 4). Similarly, 71 times "Crop Productivity, 64 Certified Seed, and 56 times Crop

Model Training and Evaluation
In the literature, various machine learning algorithms have been used for semantics analysis, such as K-Nearest Neighbors (Hmeidi, Hawashin, and El-Qawasmeh 2008), Support Vector Machine (Cortes and Vapnik 1995), Neural Networks, Decision Tree, and Naïve Bayes (Xia and Wang 2004). We applied the K-Nearest Neighbors and Naïve Bayes algorithms for classification of the text. Both algorithms performed well, with K-Nearest Neighbors and Naïve Bayes obtaining reasonable accuracies of 84% and 87%, respectively, as tabulated in Table 2. The model was trained using agrarians' opinions, and the classifier's predicted results were then tested. Furthermore, finally, the model has been tested on different agriculture

Precision
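Precision, also known as the positive predictive value, is the ratio of correctly predicted positive instances to all instances predicted as positive, and is calculated as (Haddi, Liu, and Shi 2013):
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Accuracy is the fraction of correct predictions over all predictions, and its formula is given below (Kowsari et al. 2019):
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Here $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives.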

Recall
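Recall, also called sensitivity or the probability of detection (true positive rate), is the ratio of correct positive predictions to the total number of actual positives (Haddi, Liu, and Shi 2013):
\[ \text{Recall} = \frac{TP}{TP + FN} \]
where $TP$ and $FN$ denote the numbers of true positives and false negatives. The F1 score is a measure of model accuracy on a dataset that combines precision and recall. The F1 score for the proposed classifier is calculated using Equation (8) (Abualigah et al. 2017):
\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8} \]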
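
The evaluation measures can be computed directly from the confusion counts. The sketch below assumes a binary setting with one designated positive label and uses the standard definitions of precision, recall, accuracy, and F1; how the study averaged these measures across classes is not stated. The labels in the example are invented.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN) for one designated positive label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred, positive):
    """Compute precision, recall, accuracy, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Invented labels for five test opinions.
    y_true = ["pos", "pos", "pos", "neg", "neg"]
    y_pred = ["pos", "pos", "neg", "pos", "neg"]
    print(classification_metrics(y_true, y_pred, positive="pos"))
```

Reporting precision and recall alongside accuracy is useful on small or imbalanced datasets, where accuracy alone can hide poor performance on the minority class.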

Discussion
Over the last decade, scientists have focused on crop inputs to determine their role in crop productivity, such as soil, soil types, soil humidity (Dongare et al. 2020), fertilizer management (Saikai, Patel, and Mitchell 2020), crop management, seed, water temperature, climate change, sustainability, and chemical spray (Karthikeyan et al. 2020). The present study was conducted on "Agriculture Semantics Analysis" through a machine learning model. The proposed study may also help in developing agricultural policies at the government level, and a comparison was made to find similarities between agrarian experts' opinions. This research demonstrated that Naïve Bayes performed better on the experts' textual opinion data. Supervised techniques such as support vector machines, neural networks, decision trees, and random forests were not applied because of the small dataset. A limitation of the proposed study is its small sample size; more responses need to be added. In addition, there is a lack of previous research for comparison, and a real-time platform for disseminating findings to farmer communities has yet to be developed.

Conclusion
In this research, we have presented a novel approach for collecting and analyzing the descriptive opinions of agricultural experts. The study has shown that Crop, Certified Seed, and Post-Harvest Loss are the factors that contribute significantly to agricultural productivity. Similarities between the responses of agrarians belonging to different institutes/DAI and academia were determined through cosine similarity and the document comparison method. The study was carried out using machine learning algorithms, namely Naïve Bayes and KNN, which obtained 87% and 84% accuracy, respectively. The study demonstrated that Naïve Bayes performed better on the text dataset of agricultural experts' opinions.

Future Work
In the future, agrarian opinions will be recorded in the experts' own voices with the aim of increasing crop productivity. Their opinions and gestures will be analyzed and translated into multiple languages using different deep learning approaches. This will enhance the study and provide more valuable results owing to a larger response from agrarian experts.