Aspect-based classification of product reviews using Hadoop framework

Abstract With the advancement of e-commerce and the rapid growth of product review forums over the last decade, an enormous volume of sentiment data, or reviews, has been produced, which makes it practically difficult for a customer to make an informed purchase decision. A good number of customers now share their opinions about the products they buy, and these reviews play an important role in other customers' purchase-decision process. A popular product may have hundreds or thousands of reviews, so it is difficult for a potential customer to read each of them before deciding whether to buy the product. It is equally difficult for the manufacturer of the product to follow up on and manage customer opinions. In this scenario, aspect-based sentiment analysis helps by analyzing the reviews and categorizing them into appropriate aspects. An aspect, or feature, refers to an attribute or quality of a product. The proposed work begins with collecting reviews from online shopping websites, then identifies aspects and classifies the opinion orientation of each aspect with different sentiment analysis techniques using the Hadoop framework. This paper proposes a new pattern-based method for aspect extraction and sentiment analysis which gives an accuracy in the range of 72 to 75%. The proposed work is implemented on the Hadoop MapReduce framework, and the results show that a Hadoop multi-node cluster setup performs aspect level sentiment analysis in a shorter time than traditional techniques.


PUBLIC INTEREST STATEMENT
With the growing availability and popularity of opinion-rich sources such as online reviews, choosing the right product from a huge number of product brands is difficult for the user. To improve marketing and enhance customer satisfaction, various retailer portals give users the opportunity to share their reviews of products and their aspects. It is difficult for a new customer or a product manufacturer to understand what customers think about a product from this volume of text. In this situation, sentiment analysis helps people analyze the reviews and conclude whether the opinion expressed is good or bad. Aspect level sentiment classification helps new customers learn about the aspects of a product that interest them most. An aspect, or feature, refers to an attribute of a product. This article presents a new big data approach for aspect extraction and sentiment analysis of product reviews. It proposes a method which identifies aspects using a pattern-based approach and classifies the opinion orientation of each aspect. The entire work is carried out on the Hadoop framework, and the proposed pattern-based method gives an accuracy in the range of 72 to 75%. The proposed work is implemented on the Hadoop MapReduce framework, and the results show that a Hadoop multi-node cluster setup performs aspect level sentiment analysis in a shorter time than traditional techniques.

Introduction
Web-based shopping has gained a lot of popularity in recent years. This growth is the consequence of the benefits that online shopping has over traditional buying. To mention a few: it saves customers' time by allowing them to purchase a product with a few mouse clicks, it provides the flexibility to compare prices before making a buying decision, and online stores are readily available. As the popularity of online shopping has increased, the purchase and sale of products on the web have also increased. The problem with web shopping is that it can be hard to get assistance from experts when purchasing an item, and the product details alone rarely allow a customer to make a purchase decision. To overcome this problem, suppliers have enabled blogs, review sites, and forums through which customers can express their opinions about a purchased item. These opinions or reviews are more important than the descriptions provided by the manufacturer because customers describe the purchased item in detail: what the product is about, how it functions, and so on. Popular or trendy items usually get many reviews. Besides, numerous reviews are long and contain just a couple of sentences that express sentiment about the item. This makes it difficult for other customers to go through all the reviews and make a purchase decision, which has encouraged research on sentiment analysis. Since most of the extracted reviews are in unstructured or semi-structured format and are large in size, traditional data processing techniques fail to analyze them. As the data size increases, analysis of the data becomes more challenging. Hence there is a need for a big data processing technique that can analyze this huge volume of data effectively and efficiently. In recent times, a tool that has proved productive in handling large datasets is Hadoop (White, 2012), which is regarded as efficient for both storing and processing large sets of data.
The main idea behind this work is therefore to define a system which uses distributed processing to extract knowledge from big sentiment data. A sentiment or opinion is an idea, view, or state of mind, particularly one based principally on feeling rather than reason. Sentiment analysis, also called opinion mining, studies people's sentiments towards specific entities. It is one of the significant tasks of NLP (Natural Language Processing). Sentiment analysis can be performed at three levels: document level, sentence level, and aspect level (Medhat et al., 2014). At document level, the entire document is treated as a single element and sentiment analysis checks whether the document conveys a positive or negative view. At sentence level, each sentence is analyzed in two steps. First, the sentence is classified as subjective or objective. Second, the polarity of each subjective sentence is classified. An objective sentence expresses factual data, whereas a subjective sentence expresses personal observations and views. People use e-commerce sites to buy products online and then post reviews or opinions about those products. On these sites, customers express their opinions on the products they buy and on the aspects or features of those products. Such reviews can also be studied to understand the best and worst features of a product. This is called aspect level sentiment analysis. Aspect level classification involves several subproblems, namely, detecting the appropriate entities, mining their aspects or features, and identifying whether the review expresses a positive or negative opinion on each aspect. Document level and sentence level text classification cannot provide complete and detailed information on an entity and its aspects. This information is required by many business applications to make better decisions and to promote their business.
The objective of this study is therefore to design and develop an efficient algorithm to analyze real-time product reviews from retailer sites and perform aspect-based classification using the Hadoop MapReduce framework. Figure 1 shows the taxonomy of sentiment analysis techniques as discussed by Medhat et al. (2014). Sentiment classification techniques can be grouped into two types: the Lexicon-based approach and the Machine Learning approach. The Lexicon-based approach is based on sentiment words that are normally used to communicate a positive or negative opinion. It depends on a sentiment vocabulary which consists of known and precompiled sentiment words. Methods of this kind use SentiWordNet, which is a tagged dictionary with precalculated scores assigned to each term.
The Machine Learning approach uses machine learning algorithms to perform sentiment analysis. It is further categorized into two types: supervised and unsupervised learning techniques. A supervised learning method depends on a labeled dataset which is used to train a classification model, and classification is then performed using machine learning algorithms. Partitional and hierarchical clustering are the most commonly used unsupervised algorithms. In the machine learning approach, the accuracy of sentiment analysis mainly depends on the training dataset from which the classifier is built. As the review data set grows, accuracy decreases if the same training dataset is used. Since the Lexicon method does not rely on a training data set, it is well suited for sentiment analysis of such data. We have proposed a new pattern-based method for aspect extraction and opinion phrase extraction. Sentiment analysis is carried out using the Lexicon-based method. The steps followed while performing aspect level sentiment analysis are: extracting the reviews, preprocessing, POS tagging, aspect selection, and classifying reviews as positive or negative.
Section 2 provides a brief literature survey on the different techniques used to perform sentiment analysis and aspect level sentiment analysis. Section 3 discusses the preliminaries of the work. Section 4 provides the details of the proposed architecture. The experimental results are discussed in section 5. Section 6 presents the conclusion and future work.

Sentiment analysis
The increasing popularity of e-commerce has created an opportunity for sentiment analysis. Sentiment analysis identifies and categorizes the opinion expressed in a text or a review. Medhat et al. (2014) performed a study on different techniques of sentiment classification. According to the survey, the two sentiment classification techniques are the Lexicon-based approach and the Machine Learning approach. The machine learning approach is further divided into supervised and unsupervised learning. Fang & Zhan (2015) performed sentiment analysis on product reviews using supervised machine learning algorithms at sentence level. Their method collected product reviews of different domains from the Amazon website. Of the three classification methods considered, i.e. Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest, Random Forest provided the best result on sentiment analysis. Moraes et al. (2013) performed sentiment analysis at document level and compared the performance of SVM and an artificial neural network (ANN). The datasets considered are from different domains, i.e. movie reviews, book reviews, GPS data, and camera reviews. The results showed that ANN outperformed SVM in sentiment classification, but the training time of ANN exceeded that of SVM. Zhang et al. (2014) performed opinion mining on reviews obtained from mobile users. The authors compared two classification methods, namely SVM liblinear and NB multinomial, and found that NB multinomial performed better than SVM liblinear. Hu & Liu (2004) and Potdar et al. (2016) followed a Lexicon-based approach to produce summaries of product reviews collected from the Amazon website. Deshmukh & Tripathy (2018) proposed a semi-supervised approach for sentiment analysis using a modified maximum entropy method. The work concentrated on collecting sentiment words from one domain which can be used to predict the sentiment words of another domain.
The study also concluded that the proposed method performs better than SVM and NB. Yang et al. (2012) used the SVM method to classify text as positive or negative. Kumar et al. (2016) studied various classification methods such as SentiWordNet, Logistic Regression, and NB, and concluded that NB outperforms the other two approaches. Chifu et al. (2015) proposed an unsupervised approach for aspect level sentiment analysis on product reviews. Their method used an ant clustering algorithm to select the aspects of a particular product and a self-organizing map approach to classify the reviews. Esparza et al. (2017) performed sentiment analysis on product reviews using various SVM kernels, namely linear, radial, and polynomial. Among the three, the linear kernel achieved the highest accuracy. Kang & Park (2014) performed sentiment analysis to measure customers' opinion of mobile services using a dictionary approach. The sentiment dictionary is used to identify the keyword vector in the review sentence, and customer opinion is summarized using a multicriteria decision-making approach. Khan et al. (2017) proposed a semi-supervised method to perform opinion mining. The method combined the Lexicon-based approach with a machine learning technique, i.e. support vector machine, to achieve good results. Jinturkar & Gotmare presented a framework to perform sentiment analysis of review data using Hadoop. Values are assigned to the sentiment words in the mapper, and polarity is classified using Naïve Bayes in the reducer. Maddikunta et al. (2020) discussed the effect of dimensionality reduction techniques, namely Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), on machine learning algorithms. The authors concluded that a classifier with PCA performs better than one with LDA. Gadekallu et al. (2019) proposed a location-based business recommendation system. The quality of the recommendation system is measured in terms of customer business ratio and average service time on benchmark datasets. The authors showed that the proposed recommendation system gives better results on benchmark datasets than the existing K-nearest neighbor algorithm.

Hu & Liu (2004) considered the words tagged as nouns to be aspects and applied compactness pruning and redundancy pruning to filter out the words which are not aspects. Bafna & Toshniwal (2013) proposed a method for selecting domain specific aspects which involved a probabilistic approach and association rule mining. The aspect selection method proposed by Maharani et al. (2015) follows a syntactic pattern approach. It uses various patterns to extract single-word, double-word, and triple-word aspects, and concentrated only on extracting explicit aspects. Devasia & Sheik (2016) identified aspects with the help of the dependencies between the words of review sentences. The Stanford dependency parser is used to identify the relationships between the words. To filter the selected aspects, stemming is applied so that aspects in plural form are converted into singular form. Ghosh & Sanyal (2017) performed aspect selection for sentiment analysis using NB and SVM. Their method considered the frequency of an aspect and a statistical measure called Term Frequency Inverse Document Frequency. Parlar & Özel (2016) proposed a new aspect selection technique which depends on query expansion term weighting methods. The performance of the proposed method was compared with the chi-square and document frequency difference techniques, and the proposed method achieved higher accuracy for Turkish reviews. Sharma & Dey (2012) carried out a comparative study of five different aspect selection methods. The results showed that, of the five methods, Gain Ratio is the most effective as it achieved the highest accuracy.
Chauhan & Singh (2017) proposed a novel approach for aspect selection and used different filtering methods, namely distance-based filtering, subset filtering, and superset filtering, to filter redundant and basic features. Sentiment analysis is performed using a Lexicon-based method to find the sentiment word associated with each extracted aspect. Li et al. (2015) presented a model for aspect-based sentiment analysis which follows a clustering model to identify the aspects. Aspect groups are formed by identifying synonyms. The work concentrated on finding the opinion of dynamic sentiment ambiguous adjectives with the help of dependency pattern rules. Htay & Lynn (2013) proposed a novel approach for finding aspects and extracting opinion words from product reviews. The authors proposed patterns for extracting aspects and opinion phrases and carried out experiments on reviews of five electronic products extracted from the Amazon web site. The proposed patterns showed slightly better results than the approach of Hu & Liu (2004). Chifu et al. (2015) proposed an unsupervised method for aspect level opinion mining on product reviews. Their method used an ant clustering algorithm to select the aspects of a particular product and a self-organizing map approach to classify the reviews. Gadekallu et al. [31] discussed domain specific aspect level sentiment analysis on a movie review dataset. The authors concluded that integrating syntactic information into the model gives better results in sentiment analysis.

Aspect level sentiment analysis
In their survey of opinion mining, Guo et al. (2014) suggested that sentiment analysis algorithms can be improved further by applying big data processing techniques to analyze larger volumes of data. Table 1 shows the literature survey summary of aspect level sentiment classification. The literature survey shows that sentiment analysis has been performed on smaller data sets at various levels using different classification algorithms. The datasets considered for experimental purposes are static review sets and small in size. As the dataset size increases, the algorithms used for sentiment analysis may not scale well, so there is a need for an effective framework to process large amounts of data. The Hadoop framework is the preferred computing framework for processing larger datasets since Hadoop clusters are highly available and operate in a fault tolerant manner. To gain better accuracy and to process data in a distributed manner, we have proposed a pattern-based method for aspect and opinion phrase extraction.

Hadoop framework
Hadoop is open-source, Java-based software, initially implemented by Doug Cutting and Mike Cafarella in 2005. It has since become a registered trademark of the Apache Software Foundation. The advantage of Hadoop is that it provides a powerful distributed platform for processing and managing Big Data. It executes applications on massive clusters of commodity hardware. Using Hadoop, it is possible to process many terabytes of data on a large number of nodes. A significant benefit of the Hadoop system is that it provides reliability and high availability. Hadoop has two major components: the Hadoop Distributed File System (HDFS) for storage and MapReduce for processing (White, 2012).
Table 1. Literature survey summary of aspect level sentiment classification

Authors: Hu & Liu (2004)
Technique: Lexicon-based method
Contribution: Proposed new approach for feature-based summarization of product reviews. This system performs tasks namely tagging, opinion word extraction, identifying opinion orientation of opinion words, predicting the orientation of opinion sentences, and summary generation.
Limitations: Algorithm lagged in finding out the strength of opinions, exploring opinions having nouns, verbs, and adverbs, and pronoun resolution.

Authors: Htay & Lynn (2013)
Technique: Lexicon-based method
Contribution: Proposed approach for finding aspects and extracting opinion words of product reviews. Authors have proposed patterns for extracting aspects and opinion phrases. Proposed pattern showed slightly better results compared to Hu & Liu (2004) approach.
Limitations: Extraction of implicit features is not addressed.

Authors: Bafna & Toshniwal (2013)
Technique: Lexicon-based method
Contribution: Proposed dynamic domain specific aspect-based summarization of customers' opinions on online products.
Limitations: Algorithm lagged in learning opinions expressed with nouns, verbs, and adverbs. The proposed approach for aspect extraction and opinion polarity detection needs to be improved.

Authors: Maharani et al. (2015)
Technique: Syntactic pattern-based method
Contribution: Proposed syntactic pattern for aspect extraction from product reviews and tested with different pattern combinations.
Limitations: Proposed approach did not succeed in the extraction of both implicit and explicit aspects. The experimental results showed that the syntactic pattern approach had several weaknesses that need to be improved.

Authors: Devasia & Sheik (2016)
Technique: Semantic-based approach
Contribution: Proposed semantic-based approach for extracting product aspects. Recursive deep analyzer is used to find out opinion orientation of product review sentences.
Limitations: Sentiment analysis is done by using recursive deep analyzer which needs to be further improved by applying various machine learning techniques. Proposed method fails in capturing implicit aspects and other contextual information used for classification.

Authors: Ghosh & Sanyal (2017)
Technique: Naïve Bayes and Support Vector Machine
Contribution: Authors have done comparative study on sentiment analysis of product reviews using Naïve Bayes and Support Vector Machine and proved that support vector machine classifier performs better than Naïve Bayes classifier for various mixtures of features.
Limitations: Authors have done only a preliminary study on sentiment analysis. More feature selection techniques need to be adopted to improve the classifier accuracy.

Authors: Parlar & Özel (2016)
Technique: Naïve Bayes Multinomial classifier
Contribution: Proposed feature selector method which gave accuracy better than Document Frequency Difference and Chi-Square methods. Sentiment analysis of Turkish product reviews is done using Naïve Bayes Multinomial classifier.
Limitations: Authors have not considered reviews written in different languages.

Authors: Chauhan & Singh (2017)
Technique: Lexicon-based method
Contribution: Proposed frequent item set mining approach to extract candidate aspects and used different filtering methods, namely distance-based filtering, subset filtering, and superset filtering, to filter redundant and nonaspect phrases.
Limitations: More filtering techniques need to be incorporated to improve aspect extraction accuracy.

Authors: Li et al. (2015)
Technique: Lexicon-based method
Contribution: Proposed Sentiment Summarization on Product Aspects (SSPA) approach which incorporates different aspect sentiment analysis tasks, namely product feature extraction, grouping features into meaningful aspects, sentiment orientation, and finding out sentiment strengths. Experiments prove that each module of SSPA performs well.
Limitations: More types of features including verbs, adjectives, and implicit features need to be addressed.

Authors: Chifu et al. (2015)
Technique: Unsupervised method
Contribution: Proposed ant clustering algorithm to select the aspects of particular product. To classify the reviews, self-organizing map approach is proposed.
Limitations: As further work, the method is to be tested on other opinion mining data sets, for instance, the movie review data set. Algorithm failed in addressing opinion reviews with multiple opinions having opposite polarity. Other domain datasets need to be experimented with.

Rodrigues et al., Cogent Engineering (2020)

MapReduce, a programming model, is mainly used for analyzing and producing large datasets with the help of a parallel and distributed algorithm on a cluster. For the most part it follows three steps: dividing, processing, and merging. MapReduce works in two stages, called the map phase and the reduce phase. Both phases have key-value pairs as input and output. The map phase takes the input, processes it, and produces intermediate key-value pairs which are given as input to the reduce phase. The job of the reduce phase is to merge every intermediate value associated with the same key. The MapReduce structure comprises two processes, JobTracker and TaskTracker. It is the responsibility of the JobTracker to handle users' requests. The JobTracker coordinates the activity by assigning tasks to execute on various DataNodes. The TaskTracker, which resides on each DataNode, manages the execution of each subtask. It also sends progress reports to the JobTracker.
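The map, group-by-key, and reduce steps described above can be illustrated with a minimal single-process sketch. It is not Hadoop code and does not use the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen only to show how intermediate key-value pairs flow between the two phases.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit an intermediate (key, value) pair for every word in a review."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_job(records):
    """Run map over every record, group by key, then reduce each group."""
    intermediate = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


reviews = ["good battery", "good screen", "poor battery"]
counts = run_job(reviews)
print(counts)
```

On a real cluster the map and reduce functions run as separate tasks on different nodes, and the grouping step is performed by the framework rather than by user code.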
The Hadoop Distributed File System was created following a distributed file system design. Its prominent features are that it is scalable, distributed, and fault tolerant. HDFS follows a master/slave architecture and is built around the concept of a block. The master/slave pattern comprises a single NameNode and several DataNodes. The NameNode manages the filesystem and keeps the greater part of its metadata in RAM. It has the information about all the DataNodes on which the blocks of a given file are stored and also handles file system operations such as creating, renaming, and closing directories and files. Apart from the NameNode and DataNodes, there is a third process called the Secondary NameNode, which works alongside the NameNode as a helper process. It periodically reads the file system information and metadata from the NameNode's RAM and writes it to the hard disk or the file system. Figure 2 depicts the Hadoop framework.

Hadoop multinode cluster
A Hadoop cluster can be thought of as a group of independent systems connected via a dedicated network to function as a single centralized data processing resource. A computational cluster distributes the workload of a data analysis process over the different nodes of the cluster, which work together to process the data in a distributed manner. In the Hadoop cluster framework, nothing is shared between the cluster nodes apart from the network. This shared-nothing paradigm decreases processing latency; hence, when there is a need to run queries over an immense amount of data, the processing time over the cluster is considerably reduced. A Hadoop multi-node cluster follows a master/slave architecture. The master node stores the data in HDFS and executes parallel computation on the data using MapReduce. Three services run on the master node: JobTracker, NameNode, and Secondary NameNode (Subramaniyaswamy et al., 2015). It is the job of the JobTracker to control the distributed processing of large data with the help of MapReduce. The NameNode keeps track of all the information about the documents stored on HDFS, i.e. the metadata. It stores information such as the access time of a document, the distribution of documents over the Hadoop cluster, and details about the user currently accessing a document. The Secondary NameNode contains a backup of the NameNode data. The slave nodes of a Hadoop cluster are in charge of storing and processing the data. Both the DataNode and TaskTracker services run on each slave node in order to communicate with the master node. Figure

Proposed work
The main aim of the proposed framework is to process customer reviews from e-commerce web sites. Customers read reviews before finalizing a purchase decision, but it is not really feasible to go through all of them. Hence there is a demand for sentiment classification. In our study, real-time reviews are extracted using the Jsoup parser, a new tri-gram pattern is proposed for explicit aspect extraction, and a bi-gram pattern is used for finding opinion phrases. The entire work is implemented using two MapReduce stages. Aspect extraction is performed in the first MapReduce stage, whereas identification of opinion phrases and sentiment classification are performed in the second MapReduce stage. Figure 4 depicts the proposed architecture for aspect level sentiment analysis. The proposed approach is explained below:

Review extraction
In the proposed work, the reviews of electronic gadgets are extracted from e-commerce web sites using Jsoup. Jsoup is a Java library, also known as the Java HTML parser, for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. Jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag-soup, and will create a sensible parse tree in each case. A single seed URL of a product web page is given as input, and the extractor then explores the linked web pages to obtain the customer reviews. The extracted reviews are stored in HDFS and are given as input to the aspect level sentiment analysis process. Figure 5 shows the flow of review extraction.
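As a rough illustration of the extraction step, the sketch below pulls review text out of an HTML page. It deliberately swaps in Python's standard `html.parser` module in place of Jsoup, so it is not the paper's implementation. The class name `review-text` and the sample page are hypothetical, since the markup of the crawled site is not given in the text.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped so nesting depth stays correct.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ReviewExtractor(HTMLParser):
    """Collect the text inside elements whose class is REVIEW_CLASS."""

    REVIEW_CLASS = "review-text"  # hypothetical class name, an assumption

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside a review element
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested element inside a review
        elif self.REVIEW_CLASS in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.reviews.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.reviews[-1] += data


page = (
    '<div><span class="review-text">Good <b>battery</b> life</span>'
    '<p>advertisement</p>'
    '<span class="review-text">Poor screen</span></div>'
)
extractor = ReviewExtractor()
extractor.feed(page)
print(extractor.reviews)
```

A crawler built on this idea would also follow pagination links from the seed URL and write the collected reviews to HDFS; both steps are omitted here.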

Pre-processing of raw data
The preprocessing step involves appropriate fragmentation and cleaning of the data, which is performed using short/slang word replacement, removal of repeated characters, and handling of emojis.
The review text may contain short-form opinion words such as nic, gud, g8, and many more, which are converted into their full forms. For short word and slang word replacement, the words in the review text are compared with a slang word corpus. If a word in the review text matches a slang word in the corpus, it is replaced with the appropriate, meaningful word. Elongated words such as "supppeerrrb" are also replaced with their normal form, "superb". The review text is also examined to eliminate special characters such as "+", "-", "=", "#", "$", and "*". An emoticon is a short form of emotion icon, a visual portrayal of a facial expression used to communicate an opinion, attitude, or emotion. The emoticons which express feelings such as happy, sad, angry, thumbs up, and thumbs down are considered. Such emoticons are identified and converted into text using a Java emoji library.
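The cleaning steps described above can be sketched as a small function. The slang table and emoticon table below are tiny hypothetical stand-ins for the slang word corpus and the Java emoji library used in the work, and the elongation rule (collapse repeated letters only when a token contains a run of three or more) is an assumption rather than the authors' exact rule.

```python
import re

# Hypothetical slang corpus; the actual corpus used in the work is not given.
SLANG = {"nic": "nice", "gud": "good", "g8": "great"}
# Hypothetical emoticon table standing in for the Java emoji library.
EMOTICONS = {":)": "happy", ":(": "sad"}

SPECIAL_CHARS = re.compile(r"[+\-=#$*]")
ELONGATED = re.compile(r"(.)\1{2,}")   # a run of three or more identical characters
REPEATS = re.compile(r"(.)\1+")        # any run of identical characters


def preprocess(review):
    """Replace emoticons and slang, strip special characters, fix elongated words."""
    cleaned = []
    for token in review.split():
        if token in EMOTICONS:
            cleaned.append(EMOTICONS[token])
            continue
        token = SPECIAL_CHARS.sub("", token.lower())
        if ELONGATED.search(token):
            # Assumed rule: collapse every repeated run to a single character.
            token = REPEATS.sub(r"\1", token)
        token = SLANG.get(token, token)
        if token:
            cleaned.append(token)
    return " ".join(cleaned)


print(preprocess("gud phone :) supppeerrrb battery ##"))
```

The elongation rule is deliberately crude: it would turn "goooood" into "god", so a fuller implementation would check candidate spellings against a dictionary.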

Parts of speech (POS) tagging
POS tagging is used to mark each word in the review sentence with its part of speech, i.e. it labels the words as verb, adverb, conjunction, preposition, noun, adjective, or interjection. Aspects of a product are usually nouns, and opinions are usually adjectives, adverbs, or verbs. In this work the Stanford POS tagger (Reddy et al., 2020) is used to label all the online review sentences. Table 2 shows the Stanford POS tags and their descriptions. For example, the sentence before POS tagging is: "This phone has good battery life and speaker voice is good." After POS tagging: This/DT phone/NN has/VBZ good/JJ battery/NN life/NN and/CC speaker/NN voice/NN is/VBZ good/JJ.
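To show how tagged output of this form can be consumed downstream, the following sketch parses a `word/TAG` string and groups consecutive nouns into candidate aspect phrases. It works on text that has already been tagged and does not call the Stanford tagger. The helper names `parse_tagged`, `noun_phrases`, and `adjectives` are illustrative assumptions.

```python
def parse_tagged(tagged_sentence):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in tagged_sentence.split()]


def noun_phrases(pairs):
    """Join runs of consecutive noun tags (NN, NNS, ...) into candidate aspects."""
    phrases, current = [], []
    for word, tag in pairs:
        if tag.startswith("NN"):
            current.append(word)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


def adjectives(pairs):
    """Return the words tagged as adjectives (JJ, JJR, JJS), the usual opinion words."""
    return [word for word, tag in pairs if tag.startswith("JJ")]


tagged = ("This/DT phone/NN has/VBZ good/JJ battery/NN life/NN "
          "and/CC speaker/NN voice/NN is/VBZ good/JJ")
pairs = parse_tagged(tagged)
print(noun_phrases(pairs))
print(adjectives(pairs))
```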

Aspect extraction
The main focus of the proposed work is on performing aspect extraction and sentiment analysis using a pattern-based method. Pattern analysis is performed to extract explicit aspects using syntactic patterns that correspond to the aspects of a product. For experimental purposes, we have taken the annotated review datasets from the research of Hu and Liu (Hu & Liu, 2004), which contain reviews of electronic products such as a digital camera, cellular phone, mp3 player, and DVD player. These reviews were extracted from the e-commerce web site amazon.com. We also included preprocessed real-time product reviews extracted from amazon.com. Some of the example datasets are as follows: Following are the descriptions of the symbols used in an annotated review dataset.
##: start of each sentence; each line is a sentence.
[u]: the feature does not appear in the sentence (implicit feature).
[p]: the feature does not appear in the sentence; pronoun resolution is needed.
In a sentence "camera [+2] ##this camera is perfect for an enthusiastic amateur photographer", camera is explicit feature with positive opinion strength 2.
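A line in this annotation format can be split into its feature annotations and its sentence with a short parser. This is a sketch under the assumption that annotations take the form `feature[+n]` or `feature[-n]`, separated by commas, before the `##` marker; the function name `parse_line` is illustrative and flags such as `[u]` and `[p]` are ignored.

```python
import re

# feature name, then a signed opinion strength in square brackets, e.g. camera[+2]
ANNOTATION = re.compile(r"([^\[\],]+?)\s*\[([+-])(\d+)\]")


def parse_line(line):
    """Return ([(feature, signed_strength), ...], sentence) for one dataset line."""
    annotations, _, sentence = line.partition("##")
    features = []
    for name, sign, strength in ANNOTATION.findall(annotations):
        value = int(strength) if sign == "+" else -int(strength)
        features.append((name.strip(), value))
    return features, sentence.strip()


line = "camera [+2] ##this camera is perfect for an enthusiastic amateur photographer"
print(parse_line(line))
```

Lines that carry no annotation before `##` yield an empty feature list, which is how unannotated sentences in the dataset would appear.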
On e-commerce sites, customers express their views about products by highlighting some of the aspects of the products. For instance, in the review sentence "This phone has good battery life" the customer has reviewed the product aspect "battery life" and the opinion word is "good." In our work, we first identify the product features in a product review. Many product aspects are nouns and noun phrases. Htay & Lynn (2013) proposed tri-gram patterns for extracting opinionated phrases and aspects. Table 4 shows the patterns used by Htay & Lynn for aspect-based sentiment analysis. Here DT represents determiner, JJ adjective, RB adverb, VB verb, and NN noun. These patterns extract both opinion words and aspects. The first four patterns extract aspects which occur with opinionated words, and the last four patterns extract opinion words. These tri-gram patterns extract only the noun aspects which occur with opinion words such as adjectives, adverbs, and verbs. But there are cases where aspects occur with words other than opinion words. For instance, in the POS tagged sentence "sometimes (RB) battery (NN) can (VB) get (VB) very (RB) hot (VB)" the aspect "battery" occurs with RB (adverb) and VB (verb). The patterns of Htay & Lynn fail to extract aspects of this kind. So, to extract explicit aspects, we have proposed a tri-gram syntactic pattern. Table 3 shows the proposed pattern for explicit aspect extraction. The aspects are derived from the current POS tag together with the (current-1) POS tag and the (current+1) POS tag. Among the extracted features, several non-feature phrases are also extracted. To eliminate these non-feature phrases, we have used the Apriori algorithm to find frequent candidate features. In the Apriori algorithm, each extracted feature is taken as one transaction and the sentences that contain those features are taken as item sets.
Minimum support is calculated as a ratio of sentences contains all the features to total number of extracted features. If Sentence count of each feature is greater than minimum support, then such features are taken as candidate features. Algorithm 1.1 briefs on proposed aspect extraction. In the algorithm, Feature extraction as per the proposed pattern is done in Map phase and nonfeature filtering is done in Reduce phase.   After the aspect extraction, sentences are categorized as per the aspects. This result is given as input to next MapReduce Stage which does sentiment classification.
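The two phases described above can be sketched in plain Python as follows. This is only an illustrative sketch, not the implementation of Algorithm 1.1: the patterns in `ASPECT_PATTERNS` are hypothetical stand-ins for the patterns of Table 3, the function names `map_extract` and `reduce_filter` are made up, and the minimum support follows one possible reading of the definition given in the text.

```python
from collections import defaultdict

# Hypothetical (previous tag, current tag, next tag) patterns. The paper's
# actual patterns are listed in its Table 3 and are not reproduced here.
ASPECT_PATTERNS = {("RB", "NN", "VB"), ("DT", "NN", "VB")}


def map_extract(sentence_id, tagged):
    """Map phase: emit (candidate aspect, sentence id) for every token whose
    (current-1, current, current+1) POS tags match one of the patterns."""
    for i in range(1, len(tagged) - 1):
        trigram = (tagged[i - 1][1], tagged[i][1], tagged[i + 1][1])
        if trigram in ASPECT_PATTERNS:
            yield tagged[i][0].lower(), sentence_id


def reduce_filter(pairs):
    """Reduce phase: count the sentences per candidate and keep only the
    candidates whose sentence count exceeds the minimum support."""
    sentences = defaultdict(set)
    for aspect, sentence_id in pairs:
        sentences[aspect].add(sentence_id)
    if not sentences:
        return {}
    # Assumed reading of the minimum support: feature-bearing sentence
    # occurrences divided by the number of extracted features.
    min_support = sum(len(s) for s in sentences.values()) / len(sentences)
    return {a: len(s) for a, s in sentences.items() if len(s) > min_support}


# Toy POS-tagged review sentences (made up for illustration).
reviews = [
    [("sometimes", "RB"), ("battery", "NN"), ("can", "VB"),
     ("get", "VB"), ("very", "RB"), ("hot", "VB")],
    [("the", "DT"), ("battery", "NN"), ("lasts", "VB"), ("long", "RB")],
    [("the", "DT"), ("battery", "NN"), ("drains", "VB"), ("fast", "RB")],
    [("the", "DT"), ("box", "NN"), ("arrived", "VB"), ("late", "RB")],
]
pairs = [p for sid, tagged in enumerate(reviews) for p in map_extract(sid, tagged)]
frequent = reduce_filter(pairs)
print(frequent)
```

In this toy input "battery" is found in three sentences and "box" in one, so the minimum support is 2 and only "battery" survives the filtering step.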

Sentiment classification
In the proposed work, sentiment analysis is carried out using a pattern-based Lexicon method. After aspect extraction, we observed that sentiment patterns appear before and after the aspect patterns. These sentiment patterns are extracted, and a Lexicon dictionary is used to classify the reviews. The Lexicon dictionary, SentiWordNet, contains a precalculated value for each word that can be used to predict the sentiment orientation of that word. Only review sentences that contain sentiment words are considered for sentiment classification. These sentences are classified as positive or negative based on the sentiment scores of their sentiment words. Positive sentences usually contain sentiment words such as good, awesome, and super, whereas negative sentences contain sentiment words such as bad, worst, and poor. The review sentences are first tagged using the Stanford POS tagger. The Lexicon-based approach is applied to both uni-gram and bi-gram words.

Uni-gram method
In the uni-gram method, only words tagged as adjectives or adverbs are considered. These sentiment words are looked up in SentiWordNet to determine their polarity. The sentiment score of a review sentence is the average sentiment score of its sentiment words, as shown below.

$$\text{Average Sentiment Score} = \frac{\text{Sum of scores of sentiment words}}{\text{Total number of sentiment words}} \qquad (1.1)$$

Based on the average sentiment score shown in Formula 1.1, the review sentences that contain sentiment words are classified as positive or negative.
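As a minimal sketch of the uni-gram scoring in Formula 1.1, the snippet below averages lexicon scores over adjective and adverb tokens. The scores in `TOY_LEXICON` are illustrative values, not actual SentiWordNet entries, and the decision threshold of 0 is an assumption, since the text does not state where the positive/negative boundary lies.

```python
# Hypothetical polarity scores; illustrative only, not SentiWordNet values.
TOY_LEXICON = {"good": 0.6, "awesome": 0.8, "bad": -0.6, "poor": -0.5}

# Penn Treebank tag prefixes for adjectives (JJ*) and adverbs (RB*).
SENTIMENT_TAGS = ("JJ", "RB")


def average_sentiment_score(tagged):
    """Average the lexicon scores of adjective/adverb tokens (Formula 1.1).
    Returns None when the sentence has no known sentiment word."""
    scores = [
        TOY_LEXICON[word.lower()]
        for word, tag in tagged
        if tag.startswith(SENTIMENT_TAGS) and word.lower() in TOY_LEXICON
    ]
    return sum(scores) / len(scores) if scores else None


def classify(tagged):
    """Label a sentence from its average score (assumed threshold: 0)."""
    score = average_sentiment_score(tagged)
    if score is None:
        return None  # sentences without sentiment words are not classified
    return "positive" if score > 0 else "negative"


example = [("This", "DT"), ("phone", "NN"), ("has", "VB"),
           ("good", "JJ"), ("battery", "NN"), ("life", "NN")]
print(average_sentiment_score(example), classify(example))
```

Only "good" is tagged as an adjective in the example, so the average equals its score and the sentence is labelled positive.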

Proposed bi-gram approach for sentiment analysis
Since the uni-gram Lexicon method is built for uni-gram sentiment patterns, it misclassifies bi-gram sentiment words. In the sentence "This is awesome/JJ and Beautiful/NN," only "awesome" is taken into consideration, because "Beautiful" is tagged as a noun. Also, in the uni-gram Lexicon method, negation words and adverbs that precede sentiment words are not taken into account. For example, in "This phone is not good," only "good" is considered and the sentence is classified as positive, although "not", which is an adverb, has a negative effect on the sentiment. Htay proposed a common pattern for aspect-based opinionated phrases. To get better results, our proposed approach uses separate patterns for aspect extraction and opinion phrase extraction. The pattern used for opinion phrase extraction is shown in Table 5.
The proposed pattern extracts uni-gram opinion words that occur with noun phrases (first three patterns) and bi-gram opinion phrases (last four patterns).
Along with the bi-gram pattern, a negation rule is applied to sentiment words preceded by a negation word such as "not."
(a) If a sentiment word with a positive score is preceded by "not," then it is treated as a negative word (e.g. not good).
(b) If a sentiment word with a negative score is preceded by "not," then it is treated as a positive word (e.g. not bad).
The proposed method for negation handling and sentiment classification is given in Algorithm 1.2. The results are discussed in Section 5.
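The negation rule in (a) and (b) can be illustrated with a short sketch. It is not Algorithm 1.2 itself: the lexicon scores are hypothetical, the rule is implemented here by flipping the sign of the score (one simple way to treat a word as having the opposite polarity), and the threshold of 0 is an assumption.

```python
# Hypothetical polarity scores; illustrative only, not SentiWordNet values.
TOY_LEXICON = {"good": 0.6, "awesome": 0.8, "bad": -0.6, "poor": -0.5}
NEGATION_WORDS = {"not"}


def negation_aware_scores(tokens):
    """Score each sentiment word, inverting its polarity when the word
    immediately before it is a negation word (rules (a) and (b))."""
    scores = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if word not in TOY_LEXICON:
            continue
        score = TOY_LEXICON[word]
        if i > 0 and tokens[i - 1].lower() in NEGATION_WORDS:
            score = -score  # assumed way of treating the word as opposite
        scores.append(score)
    return scores


def classify_tokens(tokens):
    """Classify a tokenized sentence from its negation-aware average score."""
    scores = negation_aware_scores(tokens)
    if not scores:
        return None
    return "positive" if sum(scores) / len(scores) > 0 else "negative"


print(classify_tokens("This phone is not good".split()))  # negative
print(classify_tokens("The camera is not bad".split()))   # positive
```

Without the negation check, "This phone is not good" would be scored on "good" alone and labelled positive, which is the misclassification described for the uni-gram method.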

Results and discussions
The proposed sentiment analysis is performed on the Hadoop MapReduce framework. Apache Hadoop is installed on Ubuntu 16.04 with 8GB RAM. The Multi-Node setup consists of four nodes, with one node as master and the remaining three as slaves. The reviews of electronic gadgets are fetched from amazon.com using Jsoup, a Java HTML parser, and are stored in HDFS. The electronic gadgets considered for the experiments are the Nokia 6610, Apex DVD player, Canon G3, Nikon Coolpix, Creative Labs Jukebox, and Canon S100. The MapReduce programming model is used for the further analysis of the extracted reviews.

Datasets
Six sets of product reviews are built from the reviews extracted from amazon.com and the review datasets of Hu and Liu (Hu & Liu, 2004). Each review set has a different number of reviews and aspects. The number of reviews considered for each product and their annotated aspects are shown in Table 6.

Aspect extraction
We have proposed a tri-gram pattern for explicit aspect extraction. The extracted aspects include several non-aspect phrases, which are listed in Table 7. To retain only frequent features, we filter the candidates with the Apriori algorithm. Table 8 shows the features most frequently mentioned by customers after the filtering process.
Accuracy is calculated for the aspects before and after the filtering process according to the formula given below:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

TP is a case in which a word is an aspect and is predicted as an aspect, TN is a case in which a word is not an aspect and is not predicted as an aspect, FP is a case in which a word is not an aspect but is predicted as an aspect, and FN is a case in which a word is an aspect but is not predicted as an aspect. Table 9 shows the aspect extraction results.
For our annotated dataset, the experimental results show that the accuracy of aspect extraction using the proposed tri-gram pattern alone is in the range of 29 ~ 42%, whereas the tri-gram pattern with the filtering process reaches an accuracy of 57 ~ 63%.
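The following sketch shows how the four counts and the accuracy could be derived from sets of words, assuming the standard definition of accuracy as (TP + TN) / (TP + TN + FP + FN). The word sets and the function names are made up for illustration and do not come from the paper's datasets.

```python
def aspect_confusion(candidates, predicted, annotated):
    """Count TP, TN, FP, FN over a set of candidate words, given the words
    predicted as aspects and the words annotated as aspects."""
    tp = len(predicted & annotated)               # aspect, predicted as aspect
    fp = len(predicted - annotated)               # not an aspect, but predicted
    fn = len(annotated - predicted)               # aspect, but not predicted
    tn = len(candidates - predicted - annotated)  # neither aspect nor predicted
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Standard accuracy: correct decisions over all decisions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical word sets for one product.
candidates = {"battery", "camera", "screen", "box", "day", "thing"}
predicted = {"battery", "camera", "box"}
annotated = {"battery", "camera", "screen"}

counts = aspect_confusion(candidates, predicted, annotated)
print(counts, accuracy(*counts))
```

Here two aspects are found correctly, "box" is a false positive, "screen" is missed, and "day" and "thing" are correctly rejected, so four of the six decisions are correct.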

Sentiment classification
Sentiment classification of each aspect is performed using the Lexicon-based approach with the uni-gram and bi-gram pattern-based methods. Since the analysis is performed at the aspect level, reviews are grouped by aspect; some of the aspects extracted for mobile products are Volume, Battery Life, and Camera. Figure 6 depicts the aspect-based classification of reviews for the Nokia product. The result is shown for the nine most frequently mentioned aspects in the Nokia reviews.

Performance evaluation of sentiment classification methods
The performance of sentiment classification is evaluated using Precision, Recall and F1-measure. The experiments showed that the proposed bi-gram method gives a significant increase in precision and recall compared to the uni-gram method and the Htay pattern. Across the Canon S100, Nikon, Jukebox, Apex, Canon G3, and Nokia datasets, precision increased by 21 ~ 33% compared to the uni-gram method and by 3 ~ 5% compared to the Htay pattern. Similarly, recall increased by around 5% compared to the other two approaches. Tables 10 and 11 show the Precision and Recall of the experiments.
The experimental results showed that the Lexicon uni-gram method gives an accuracy of 53 ~ 55% and the Htay pattern an accuracy of 66 ~ 70%, whereas the bi-gram Lexicon method outperforms both with an accuracy of 72 ~ 75%. The F1-measure increased by 4% on average compared to the Htay pattern. Figure 7 shows the performance measures of the sentiment classification methods. Figure 8 compares the processing time in a normal setup with that in Hadoop, from which it can be concluded that the aspect extraction process is faster on the Hadoop cluster than in the normal setup.
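For reference, the three measures can be computed from gold and predicted labels as in the sketch below. It uses the standard definitions with the positive class as the target; whether the paper computes the measures per class or averages them is not stated, and the labels here are made up.

```python
def precision_recall_f1(gold, predicted, target="positive"):
    """Precision, recall and F1 for one target class (standard definitions)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == target and g == target)
    fp = sum(1 for g, p in pairs if p == target and g != target)
    fn = sum(1 for g, p in pairs if p != target and g == target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical gold annotations and classifier outputs for five sentences.
gold = ["positive", "positive", "negative", "negative", "positive"]
pred = ["positive", "negative", "positive", "negative", "positive"]

precision, recall, f1 = precision_recall_f1(gold, pred)
print(precision, recall, f1)
```

With two true positives, one false positive and one false negative, precision and recall are both 2/3, and so is their harmonic mean, the F1-measure.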

Conclusion
The popularity of online shopping has drawn considerable attention to the area of sentiment analysis. The proposed work concentrates on aspect-level sentiment analysis of product reviews. It covers extracting reviews from amazon.com, preprocessing, POS tagging, comparing different aspect selection methods, and sentiment classification. The experimental results show that the proposed pattern-based aspect extraction gives better results with the filtering process. The sentiment classification results show that the proposed bi-gram method achieves higher accuracy than the uni-gram method and the Htay pattern. The entire experiment is conducted on Hadoop Single-Node and Multi-Node clusters to compare their execution times. The outcome shows that the Hadoop Multi-Node cluster is faster than the Single-Node cluster. The presented work focuses only on explicit aspects; identification of implicit aspects can be considered in further work. In future work, the proposed method could also be improved by considering pronoun resolution and sarcastic review sentences, along with different sentiment classification techniques.