A Systematic Overview of Android Malware Detection

ABSTRACT Due to the open-source nature of Android, the platform is increasingly exposed to malware attacks. To go beyond similar reviews addressing this serious security problem in the Android environment, this work not only summarizes the approaches used in the malware classification phase but also emphasizes Android feature selection algorithms and covers areas neglected in previous work on Android malware detection, such as the limitations of machine learning-based models and the datasets commonly applied to them. In this paper, the Android OS environment, feature selection, classification models, and the challenges confronting machine learning detection are described in detail. Following a brief introduction to Android background knowledge, feature selection methods are elaborated from four key perspectives: feature extraction, raw data preprocessing, valid feature subset selection, and machine learning-based selection models. For malware classification algorithms, machine learning methods are categorized according to different standards to present an all-around view. Furthermore, this paper examines deterioration problems and evasion attacks in machine learning detectors.


Introduction
Due to the rapid development of mobile intelligent terminals, Android has become the most widely used computing platform on smartphones. According to TrendForce (Huang 2020), a total of 1.25 billion smartphones were produced in 2020, and Android captured 78.4% of the market share. However, owing to its wide distribution and open-source nature, Android applications can be obtained from potentially malicious third parties in addition to the official Android Market, which makes the platform a target for malware attacks. According to the 2019 Android Malware Special Report (360 Internet Security Center 2020) released by 360 Security on February 28, 2020, the platform intercepted about 1.809 million new malware samples on mobile terminals in 2019, an average of about 5,000 new mobile malware samples per day.
Reviews of Android malware detection have been presented in many previous studies. (Liu et al. 2020a) studied machine learning models for Android malware detection in recent years but did not address the limitations of those models. (Bakour, Ünver, and Ghanem 2019) provided a wide taxonomy covering all research directions. Overviews of static analysis and of traditional detection (e.g., signature- and heuristics-based) (Sihwail, Omar and Ariffin 2018) are also considered in this paper to draw conclusions from different perspectives.
However, to address the limitations of earlier reviews, namely outdated collected research, neglect of the importance of feature selection algorithms, and neglect of the problems of machine learning models, this paper gives a more all-around overview based on a large volume of related works from 2015 to 2021.
To the best of our knowledge, few reviews have systematically presented research on the four feature processing stages and on Android malware classification models according to a taxonomy of detection techniques and machine learning types. Compared with other studies, this work points out some areas neglected in previous work on Android malware detection, such as the limitations and challenges of machine learning models. Additionally, widely applied Android datasets and disassembly tools are summarized. The related research on Android malware collected in this paper can provide a valuable reference and broaden the research directions for future researchers.
The main contributions of this paper are summarized as follows.
(1) Compared with other similar work, a comprehensive overview of Android feature selection is provided from the more detailed aspects of the feature extraction, raw feature processing, and valid subset selection stages. This paper summarizes data preprocessing methods in four stages, from data cleaning to data transformation. How to pick features that stand the test of time, without frequent retraining of machine learning models during valid feature subset selection, is also discussed.
(2) This survey complements existing reviews by offering a systematic overview of detection methods, including not only those based on machine learning but also traditional methods (e.g., signature- and heuristics-based). For the algorithms of the Android malware classifier, machine learning detectors are elaborated from the perspectives of learning task and learning type. With the introduction of the taxonomy of machine learning methods, the commonly used models in Android malware detection are distinguished as traditional machine learning, currently advanced detection models, and ensemble learning.
(3) This work bridges some research gaps in peer works by presenting the limitations of machine learning models in Android malware detection. Due to the evolution of the Android operating system, the deterioration of machine learning-based detectors and the problem of frequent model retraining to select valid feature subsets are highlighted. The vulnerability of machine learning detectors to security attacks is also analyzed.
The paper is structured as follows: Section 2 describes the procedure used for collecting the papers. Section 3 gives a brief introduction to background knowledge of Android, the mechanism of the Android operating system, and the key components of Android. In Section 4, the studies that focused on feature selection in recent years are summarized. Section 5 describes different classifier algorithms, with a particular focus on machine learning models. Section 6 takes a further look at studies on evasion attacks and deterioration issues to expose the vulnerability of many machine learning models. Section 7 introduces the widely used Android datasets and disassembly tools for Android applications. Finally, Section 8 concludes this paper.
The following research questions have been formulated to guide the systematic review:
RQ1 What are the significant feature extraction techniques and data preprocessing methods in Android feature engineering?
RQ2 How are valid feature subset selection models categorized? What kind of machine learning models can be applied to feature subsets selection?
RQ3 How are the most advanced machine learning frameworks applied to the research of Android malware classification in recent years, and what are the limitations?
RQ4 What are the widely used Android datasets that can be applied to establish longitudinal comprehensive experiments?

Method of Literature Collection
It is common practice to clarify the paper selection criteria for a literature review, to establish representativeness and confidence regarding the source of information. The procedure used for selecting the papers is described as follows.
(1) The collection scope is based on the main content of this paper, so the search scope is composed of two main parts: Android malware classification and valid feature subset selection, including the adversarial attack and degradation problems in each stage. To find other relevant papers that may not be collected through keywords, an incomplete reverse snowball search is carried out on the reference lists of the articles identified by the earlier keyword search.
(2) In each section, the keywords are determined by the categories. For example, since malware analysis can be categorized into static/dynamic analysis according to the type of extracted features, "static/dynamic analysis" + "Android malware detection" is applied after fusing the keywords. For machine learning-based detectors and feature selection models, the keywords are determined by combining "machine learning" with "Android malware detection" or "feature selection." The same search approach is used for signature-based and heuristics-based malware detection methods.
(3) The literature is mainly drawn from authoritative journals and top conferences, so that the review considers the most important relevant papers. Most of the reviewed works collected in this paper are from the following repositories: SpringerLink, IEEE Xplore Digital Library, Science Direct, ACM Digital Libraries, and Web of Science. Additionally, third-party online repositories such as ResearchGate, Baidu Scholar, and Google Scholar were also used.
(4) The papers are filtered by publication time to keep the collection up to date. Since Android security has attracted increasing attention in recent years, most of the collected papers were published between 2015 and 2021, with a few older representative papers cited to explain some concepts. Furthermore, this work attaches great importance to research articles from the most recent three years.
Limitations exist because these articles were selected manually, so there are inevitably some omissions in the coverage. However, the selected papers were carefully screened to meet the requirements of good-quality content and high relevance. Therefore, the existing article collection is capable of providing a sufficient basis for the review work of this paper.

Android OS Architecture
Android OS is an open-source, Linux-based operating system for mobile platforms, released by Google. Although Android is updated rapidly, the key architecture of Android OS has remained unchanged. The architecture of Android can be divided into the Modified Linux Kernel Layer, Libraries, System Runtime Library Layer, and Application Framework Layer, as shown in Figure 1.
(1) Modified Linux Kernel. The core system services provided by Android are based on the Linux system, such as security, power management, and drivers. Acting as an abstraction layer between hardware and software, the modified Linux kernel hides details in the hardware layer and provides services to the upper layers to reduce coupling.
(2) Libraries. The Linux-based process sandbox mechanism in the Library Layer of Android is one of the cornerstones of the entire security design. Relying on the modified Linux kernel layer for basic functions such as thread management and memory management, the Dalvik virtual machine is optimized to run multiple virtual machine instances efficiently and simultaneously in limited memory, and each Android application executes as a Linux process with an instance of the Dalvik virtual machine.
(3) System Runtime. From the perspective of the overall architecture, the developer only has active control over the System Runtime Layer and the structures above it, so the detection of Android malware should also focus on the same location. This layer can be divided into the system libraries and the Android runtime. The core library of the Android runtime provides most of the APIs, such as android.os, android.net, and android.media.
(4) Application Framework Layer. Android has an application framework layer that provides a variety of APIs for Android development. Developers are free to use these APIs to build their applications, subject to the security limitations of the framework's implementation.
An Android application is built from four key components, described as follows.
(1) Activity. The Activity provides users with a graphical window containing elements such as buttons, text blocks, and input blocks. Users interact with the application by tapping these elements. An Activity typically acts as an intermediate layer between users and app functions and is responsible for conveying user intent.
(2) Service. A Service is often used for time-consuming logical processing in the background; therefore, many malicious behaviors are associated with Services because they are invisible to users. A Service does not run in a separate process but depends on the application process in which it was created.
(3) Broadcast Receiver. As a widely used mechanism for transmitting information between applications, the Broadcast Receiver is often used by malware developers to monitor various events related to sensitive information. A Broadcast Receiver filters, receives, and responds to outgoing broadcasts. It gives Android apps the ability to respond to an external event, such as the phone powering on or the arrival of a text message or phone call.
(4) Content Provider. A Content Provider can help Android malware implement malicious behavior once the malware obtains permission to share data. The Content Provider supports storing and reading data across multiple applications, acting as a database for applications, so it can give malware developers access to exposed data such as contact books and messages.
The required permissions and each of the four components used in an Android application need to be registered in AndroidManifest.xml. Therefore, analyzing AndroidManifest.xml can give an overview of the functionality and malicious behavior of the applications. AndroidManifest.xml file is commonly used as an auxiliary indicator to cooperate with other analysis methods for detection (Bai et al. 2020) (Chen et al. 2021).
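To make the role of AndroidManifest.xml concrete, the following is a minimal sketch of how requested permissions and the four component types could be read from a manifest. It assumes the manifest has already been decoded to plain-text XML (inside an APK it is stored in a binary format), and the sample manifest, package name, and function name are hypothetical and serve illustration only.

```python
import xml.etree.ElementTree as ET

# Attribute names in a decoded manifest live in the Android XML namespace.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"
# Manifest tags for Activity, Service, Broadcast Receiver, Content Provider.
COMPONENT_TAGS = ("activity", "service", "receiver", "provider")


def summarize_manifest(manifest_xml):
    """Return requested permissions and declared components from manifest text."""
    root = ET.fromstring(manifest_xml)
    permissions = sorted(
        node.get(ANDROID_NS + "name")
        for node in root.iter("uses-permission")
        if node.get(ANDROID_NS + "name")
    )
    components = {tag: [] for tag in COMPONENT_TAGS}
    app = root.find("application")
    if app is not None:
        for tag in COMPONENT_TAGS:
            for node in app.findall(tag):
                name = node.get(ANDROID_NS + "name")
                if name:
                    components[tag].append(name)
    return {"permissions": permissions, "components": components}


# Hypothetical, already-decoded manifest used only for illustration.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
    <uses-permission android:name="android.permission.SEND_SMS"/>
    <uses-permission android:name="android.permission.READ_CONTACTS"/>
    <application>
        <activity android:name=".MainActivity"/>
        <service android:name=".SyncService"/>
        <receiver android:name=".BootReceiver"/>
    </application>
</manifest>
""".strip()

summary = summarize_manifest(SAMPLE_MANIFEST)
```

The resulting dictionary can serve as an auxiliary feature source alongside other analysis methods, in the spirit of the works cited above.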

Android Feature Selection
Feature selection improves malware detection efficiency by eliminating redundant and irrelevant features in Android malware detection. Figure 2 sketches the process of the feature selection phase, which is described in detail in the following subsections.
To demonstrate the importance of feature selection, (Babaagba and Adesanya 2019) tested the effect of feature selection on malware detection by using supervised and unsupervised machine learning algorithms with and without feature selection. Taking the prediction accuracy of the algorithm as the performance indicator, the results showed that the best detection rate was achieved by supervised learning combined with a feature selection algorithm. Compared with the setting without feature selection, the main accuracy rose from 54.56% to 74.5%, which shows the influence of feature selection.

Feature Extraction
According to whether feature extraction uses static features, dynamic features, or both, Android malware detection techniques can be categorized into dynamic analysis, static analysis, and hybrid analysis, as illustrated in Table 1.
(1) Dynamic analysis. Dynamic analysis runs the program in a sandbox environment and tracks its behavior, including API call sequences, system calls, network traffic, and CPU data, to monitor the data flow while the program is running, thus revealing behavior that is closer to the actual situation. However, it is not widely used because running the program consumes a large amount of resources and detection is slow. (Zhao et al. 2014) built a sandbox, monitored the system APIs, and used scripts to simulate various events in order to observe software behaviors. (Cai et al. 2018) extracted dynamic features based on API calls and inter-component communication (ICC) distilled from a behavioral characterization study, and trained a multi-class classifier using supervised machine learning. (Martin, Rodríguez-Fernández and Camacho 2018) utilized dynamic analysis with a Markov chain-based representation to model the behavior of individual applications from different malware families.
(2) Static analysis. Static analysis is performed by analyzing Android files and extracting information such as requested permissions, opcode sequences, and API calls. Static detection is widely used in the field of Android malware detection because many features are available and easy to extract. (Kumar et al. 2019) proposed a static analysis method based on multi-information features, integrating different features to improve detection accuracy. (Nicheporuk et al. 2020) used static analysis to detect Android malware, taking API method calls and permissions as features and applying a convolutional neural network for training. (Suarez-Tangil et al. 2017) proposed an Android malware classifier that exploited features and artifacts introduced by obfuscation mechanisms used in malware.
(3) Hybrid analysis. The combination of dynamic and static analysis can make Android malware detection more accurate and efficient. (Onwuzurike et al. 2018) compared the detection performance between static and dynamic analysis on the same behavioral model relying on Markov chains built from the API sequences. The result showed that dynamic code loading has better performance for data in free conditions.
(Yung and Juang 2017) used Androguard to extract static features such as permissions and the four Android components, DroidBox to obtain dynamic features, and SVM to analyze the combination of the dynamic and static features. (Chen et al. 2016) introduced two newly defined features determined by the frequency of sensitive API calls and information requests. They adopted a streaminglized machine learning-based framework to support large-scale analysis, which observed app behaviors statically and dynamically.
Furthermore, API calls are commonly used to detect Android malware in both static and dynamic feature analysis. After the API sequence or function call graph is processed and fed into an RNN, behavior paths that differ from those of benign software can be discovered. The problem of very long, repetitive API sequences can be dealt with by using the information entropy of the API (Lu et al. 2019). Some researchers (Gao 2019) (Fan et al. 2018) converted the frequency relation matrix obtained from the function call graph of sensitive APIs into vectors and combined them with other features to detect Android malware.
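Several of the works above model application behavior as a Markov chain built from API call sequences. The following minimal sketch shows one way such a representation could be derived: first-order transition probabilities are estimated from a trace and flattened into a fixed-length feature vector. The trace and function names are hypothetical, and the cited systems may construct their chains differently.

```python
from collections import Counter, defaultdict


def transition_probabilities(api_sequence):
    """Estimate first-order Markov transition probabilities between API calls."""
    pair_counts = Counter(zip(api_sequence, api_sequence[1:]))
    out_totals = defaultdict(int)
    for (src, _dst), count in pair_counts.items():
        out_totals[src] += count
    return {
        (src, dst): count / out_totals[src]
        for (src, dst), count in pair_counts.items()
    }


def to_feature_vector(probabilities, api_vocabulary):
    """Flatten the transition matrix row by row into a fixed-length vector."""
    return [
        probabilities.get((src, dst), 0.0)
        for src in api_vocabulary
        for dst in api_vocabulary
    ]


# Hypothetical API trace, e.g. as recorded during a sandbox run.
trace = [
    "getDeviceId", "openConnection", "getDeviceId",
    "sendTextMessage", "getDeviceId", "openConnection",
]
probs = transition_probabilities(trace)
vocabulary = sorted(set(trace))
vector = to_feature_vector(probs, vocabulary)
```

Fixing the vocabulary order keeps the vectors of different applications comparable, which is what allows them to be fed to a classifier.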

Raw Feature Data Preprocessing
Data preprocessing is essential after the raw features are obtained in the feature extraction phase, and it generally involves the following stages. Table 2 gives a brief introduction to each data preprocessing method used in different studies.
(1) Data cleaning. Data cleaning removes the irrelevant features directly after obtaining the original data through feature extraction. For example, the irrelevant permissions possessed by both benign and malicious Android applications should be cleaned.
(2) Data integration. Data integration means integrating information of two or more features provided for comprehensive detection. It is necessary to use effective data fusion methods when multi-source features are combined to be the input of the classifier during the detection phase.
(3) Data reduction. Data reduction mostly refers to dimensionality reduction of the data, often involving complex algorithms, in an attempt to address the excessively large dimensionality of Android feature vectors.
(4) Data transformation. Data transformation converts data from one form to another. The most commonly used approach is to convert the extracted features into images and feed them into a deep neural network.
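The first two stages above can be sketched as follows. The cleaning step drops binary features (such as permissions) that occur almost equally often in benign and malicious samples, and the integration step merges two feature sources into one record. The `min_gap` threshold, the toy data, and the key prefixes are assumptions chosen for illustration and are not taken from the reviewed papers.

```python
def clean_features(samples, labels, min_gap=0.1):
    """Keep binary features whose frequency differs between the two classes.

    samples: list of dicts mapping feature name -> 0/1
    labels:  list of 0 (benign) / 1 (malicious)
    min_gap: assumed minimum difference in per-class frequency
    """
    names = sorted({name for sample in samples for name in sample})
    kept = []
    for name in names:
        rates = []
        for cls in (0, 1):
            group = [s for s, y in zip(samples, labels) if y == cls]
            rates.append(sum(s.get(name, 0) for s in group) / max(len(group), 1))
        if abs(rates[0] - rates[1]) >= min_gap:
            kept.append(name)
    return kept


def integrate(static_features, dynamic_features):
    """Merge two feature sources into one dict, prefixing keys by source."""
    merged = {"static:" + key: value for key, value in static_features.items()}
    merged.update({"dynamic:" + key: value for key, value in dynamic_features.items()})
    return merged


# Toy data: INTERNET is requested by every app, SEND_SMS only by malware.
toy_samples = [
    {"INTERNET": 1, "SEND_SMS": 1},
    {"INTERNET": 1, "SEND_SMS": 0},
    {"INTERNET": 1, "SEND_SMS": 1},
    {"INTERNET": 1, "SEND_SMS": 0},
]
toy_labels = [1, 0, 1, 0]
kept = clean_features(toy_samples, toy_labels)
record = integrate({"SEND_SMS": 1}, {"sendto": 3})
```

Prefixing keys by source avoids name collisions when a static and a dynamic feature share the same name.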
It can be seen from Table 2 that there is a range of technologies usually applied in each stage of data preprocessing.
In the data cleaning phase, some original data are cleaned up to meet the requirements of the next data processing step. For example, data that could not be analyzed were deleted before subsequent N-Gram processing. Also, some papers removed redundant original features to generate a vector map containing only features associated with malicious behavior.
In the data integration phase, static features and dynamic features can be combined to obtain better performance. Each feature source is commonly the input of one branch of the neural network, and the outputs of all branches are combined to form the input of a fully connected layer.
In the data reduction phase, many approaches, such as feature weighting, evolutionary genetic algorithms, and machine learning techniques such as natural language processing, were adopted to reduce the dimensionality of feature vectors. Meanwhile, some studies used image embedding techniques to represent the features and reduce the graph dimensionality.
In the data transformation phase, many researchers transformed the extracted features into audio files, directed graphs, grayscale images, color images, etc., to take advantage of the ability of neural networks to process these kinds of data.
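As an illustration of the image-based transformation, the sketch below maps raw bytes to a grayscale pixel grid, treating each byte as one intensity value. The row width of 16 and the zero padding of the last row are assumptions; the reviewed works choose their own image sizes and byte sources.

```python
import math


def bytes_to_grayscale(data, width=16):
    """Reshape a byte string into rows of pixel intensities (0-255).

    The final row is zero-padded so that every row has `width` pixels.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Hypothetical 18-byte input standing in for file content.
image = bytes_to_grayscale(b"dex\n035\x00" + bytes(range(10)), width=4)
```

The nested list can then be handed to an image library or converted to an array before being fed to a convolutional network.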

Categories of Feature Selection
Android feature subsets selection refers to the procedure of choosing a valid subset from the existing features. According to whether the execution process of the feature selection algorithm is independent of the accuracy of the classifier or not, it can be divided into two categories. Table 3 depicts the traits of each category and Figure 3 illustrates the difference between them.
Table 2. Data preprocessing methods used in different studies.
Reference | Description | Type
(Bai et al. 2020) | FAMD combined filter, wrapper, and embedded based feature selection method. | Data cleaning
(Xu et al. 2018) | The N-Gram technique was used to remove irrelevant API subsequences. | Data cleaning
(Pang et al. 2019) | This paper removed redundant features based on feature weighting. | Data cleaning
(Bai et al. 2020) | FAMD used integration of permissions and Dalvik opcode sequences. | Data integration
(Qiu et al. 2019) | A3CM combined API calls and network addresses. | Data integration
(Nicheporuk et al. 2020) | API method calls and permissions were the input of a convolutional neural network. | Data integration
(Xu et al. 2018) | Static features and dynamic features were combined. | Data integration
(Yerima and Sezer 2019) | Android permissions and API calls were used as multi-source data using static analysis. | Data integration
(Naway and Li 2019) | The integration of permission, intention filter, API call, and invalid certificate was used. | Data integration
(Bai et al. 2020) | The N-Gram technique and the FCBF algorithm were used to reduce dimensionality. | Data reduction
(Alam, Alharbi and Yildirim 2020) | This paper reduced data by DroidDomTree, which mines the dominance tree of API calls. | Data reduction
(Lu et al. 2019) | Redundant N-Gram subsequences were removed using Information Gain. | Data reduction
 | Attribute Subset Selection and Principal Component Analysis were used to reduce dimensionality. | Data reduction
 | Three levels of permission feature pruning methods were presented. | Data reduction
(Fatima et al. 2019) | Evolutionary genetic algorithm was applied for feature selection. | Data reduction
(Cai, Li and Xiong 2021) | Feature weighting was employed to reduce data. | Data reduction
(Kim et al. 2018) | Image embedding method was used to reduce the dimension of different graphs. | Data reduction
 | Android malware clustering system was adopted through iterative mining of malicious payload. | Data reduction
(Bakour and Ünver 2021) | The source of the APK was converted into grayscale images and processed by deep learning. | Data transformation
 | API sequence was converted to enhanced function call graphs. | Data transformation
(Fan et al. 2018) | Raw data of API call sequence and sensitive data were converted to frequent subgraphs. | Data transformation
(Yen and Sun 2019) | This paper digitized the importance of words and converted them to images. | Data transformation
(Mercaldo and Santone 2021) | Features were presented in the form of audio files. | Data transformation
(Ünver and Bakour 2020) | The AndroidManifest.xml in samples was used to construct grayscale image datasets. | Data transformation
(Vasan et al. 2020) | Raw malware binaries were converted into color images. | Data transformation

(1) Filter-based Approach. Independent of the result of the malware classifier, filter models select valid feature subsets based on the general characteristics of the features. Because they are independent of classification results and highly time-efficient, they have been widely used in feature subset selection. (Salah, Shalabi and Khedr 2020) proposed a lightweight Android malware classifier with a novel feature selection method inspired by TF-IDF. (Yildiz and Doğru 2019) selected three Android permission feature subsets using a genetic algorithm and evaluated them with SVM and NB. Another study used automatic word-based sensitive Android feature engineering based on text classification.
The typical filter selection algorithm is ranking-based. Each feature is assigned a score according to its importance as calculated by the algorithm, and the top N features are then selected as input to the classification stage. (Wang et al. 2014) applied feature ranking methods to rank individual permissions based on the risk of a single permission and of groups of permissions. (Mahindru and Sangal 2020, 2021) applied six different feature ranking approaches to select significant features, including Gain-ratio feature selection, Chi-Square, Information-gain, and logistic regression analysis.
(2) Wrapper-based Approach. The wrapper-based approach uses the accuracy of the malware classifier to estimate the quality of a generated feature subset. It can therefore obtain better performance, but it is more complex and computationally costly than the filter-based method. It can also be combined with a filter-based algorithm to select Android features (Huda et al. 2016). Recently, little research has utilized feedback from classifier accuracy in Android malware detection. Although filter-based algorithms incur less computational overhead, they ignore the valuable relevance information between features that the classifier provides, and can consequently choose a large number of redundant features when processing high-dimensional feature vectors. To make a breakthrough in the efficiency of malware detection classifiers, the problem of the inexhaustible number of feature combinations faced by previous wrapper-based methods should be tackled.
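The ranking-based filter approach described in (1) can be sketched with information gain, one of the ranking criteria named above. Each feature is scored independently of any classifier, and the top N names are kept. The toy permission matrix and function names are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * math.log2(p)
    return result


def information_gain(column, labels):
    """Reduction in label entropy after splitting on one discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_features(matrix, labels, names, top_n):
    """Score every feature independently and keep the top_n feature names."""
    scores = {
        name: information_gain([row[i] for row in matrix], labels)
        for i, name in enumerate(names)
    }
    ranked = sorted(names, key=lambda name: scores[name], reverse=True)
    return ranked[:top_n], scores


# Toy data: rows are apps, columns are permissions, label 1 = malicious.
names = ["INTERNET", "SEND_SMS", "READ_CONTACTS"]
matrix = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0]]
labels = [1, 1, 0, 0]
selected, scores = rank_features(matrix, labels, names, top_n=1)
```

Because no classifier is trained during scoring, the cost grows only linearly with the number of features, which reflects the time efficiency attributed to filter models; the price is that redundancy between features is not detected.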

Machine Learning in Feature Subsets Selection
Because the syntactic integrity of multi-source data is easily damaged during manual feature selection, more research has focused on machine learning-based feature selection algorithms, as summarized below.
Traditional machine learning models can be optimized to select valid feature subsets. (Priya and Visalakshi 2020) proposed a KNN-based Relief algorithm for feature selection and applied an optimized SVM algorithm for malware detection, with results showing performance equivalent to that of a neural network. (Wang et al. 2020) presented a multi-view neural network that can automatically generate multiple views of the input and assign soft attention weights to different Android features. The multi-view design preserved the rich semantic information of the input without complex feature engineering.
Besides, unsupervised learning and reinforcement learning are also utilized in Android feature subsets selection. (Liu et al. 2021) proposed SRBM (Subspace-based Restricted Boltzmann Machines) by introducing the concept of subspace to optimize the model. Each RBM model in SRBM was used for unsupervised learning to learn the features of each particular subspace, and the lower dimension features are used to represent the original dataset. (Fang et al. 2019) used deep reinforcement learning to automatically select optimal feature subsets by encouraging the agent to maximize the expected accuracy from the malware classifiers in sequential interaction with the features space.
From the above discussion, the conclusion can be reached that the key to using machine learning models to select Android features is to use its prediction ability to calculate the weight of the feature or obtain the correlation between features based on an evaluation metric. Additionally, wrapper-based feature selection can also apply machine learning models to score the valid feature subset selected by the optimization algorithms.
For machine learning models that perform well on classification tasks, like SVM and DT, these models have a suitable separation capability in that they maintain the largest distance from the points in either class. For neural networks, the score derived from the sum of the softmax weights of the input features can be adopted as an evaluation indicator to select valid feature subsets.
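The wrapper-style scoring mentioned above, in which a learning model evaluates candidate subsets, can be sketched as a greedy forward search. Here the score is the leave-one-out accuracy of a 1-nearest-neighbour classifier, which is a stand-in chosen to keep the example self-contained; the reviewed works use other classifiers and search strategies, and the toy data are hypothetical.

```python
def loo_accuracy(matrix, labels, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier that only
    looks at the chosen feature columns (Hamming distance, first match wins)."""
    if not feature_idx:
        return 0.0
    correct = 0
    for i, row in enumerate(matrix):
        best_dist, best_label = None, None
        for j, other in enumerate(matrix):
            if i == j:
                continue
            dist = sum(row[k] != other[k] for k in feature_idx)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(matrix)


def forward_selection(matrix, labels, max_features):
    """Greedily add the feature that most improves the classifier score."""
    selected, best_score = [], 0.0
    candidates = list(range(len(matrix[0])))
    while candidates and len(selected) < max_features:
        scored = [(loo_accuracy(matrix, labels, selected + [c]), c) for c in candidates]
        score, choice = max(scored, key=lambda item: item[0])
        if score <= best_score:
            break  # no remaining candidate improves the wrapper score
        selected.append(choice)
        candidates.remove(choice)
        best_score = score
    return selected, best_score


# Toy data: column 1 separates the classes, columns 0 and 2 do not.
toy_matrix = [[1, 1, 0], [1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0], [1, 0, 1]]
toy_labels = [1, 1, 1, 0, 0, 0]
chosen, chosen_score = forward_selection(toy_matrix, toy_labels, max_features=2)
```

Every candidate subset requires a full evaluation of the classifier, which illustrates why wrapper methods are more expensive than filter methods and why the number of feature combinations becomes the bottleneck.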

Categories of Classification Technology
This section outlines the process of Android malware classification based on the features obtained from valid feature subset selection. Android malware detection methods can be categorized into signature-based, behavior-based (heuristic), and machine learning-based methods, as summarized in Table 4, among which the most mature is signature-based detection. The detection methods are introduced below.
(1) Signature-based detection. Based on pattern matching, signature-based detection maintains a malware signature library containing a unique signature for each known Android malware sample. The signature library includes different attributes, such as file names, content strings, or bytes, that are manually identified by experts or generated automatically. An Android sample is detected by testing whether a matching malware signature exists in the library.
This technology is the most convenient and universally used due to its fast detection speed and high accuracy. All the Android malware recorded in the malware library can be detected correctly. However, its disadvantages are that maintaining the malware signature library is time-consuming and that it cannot detect new malware.
(2) Heuristic-based detection. Heuristic detection, also known as anomaly-based or behavior-based detection, emphasizes the ability to identify unknown malicious software. This method compares the characteristics of unknown samples with known malware families, where each malware family is represented by a set of rules defined to capture common experience and knowledge about the software. A sample is considered malware when its characteristics conform to the rules of a malware family. Known rule sets include attributes such as software structure features, API calls, operation code sequences, and multi-view integration rules.
Heuristic-based detection techniques are able to discover unknown malicious software by themselves and advocate the use of multiple methods to determine the difference between malicious and benign software. They make up for the deficiency of traditional detection and can identify unknown malicious software, but with the disadvantage of a higher error rate for zero-day malware.
(3) Machine learning-based detection. Machine learning trains a learner by adjusting its parameters to make the best predictions. Existing research has demonstrated that machine learning is an effective and promising method for detecting Android malware. In recent years, many malware detection works have attempted to harness machine learning to seek a breakthrough in the detection of unknown Android malware. The following subsections introduce machine learning-based detection technology in detail.
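The first two detection styles above can be contrasted in a short sketch: a signature check looks up the sample's hash in a library of known malware, while a heuristic check compares observed characteristics with per-family rules. The signature library, the rule sets, the family names, and the use of SHA-256 digests as signatures are all assumptions made for illustration.

```python
import hashlib

# Hypothetical signature library: SHA-256 digests of known malware samples.
KNOWN_BAD_PAYLOAD = b"malicious-dex-bytes"
SIGNATURE_DB = {hashlib.sha256(KNOWN_BAD_PAYLOAD).hexdigest()}

# Hypothetical family rules: permissions that must all be requested.
FAMILY_RULES = {
    "sms_fraud": {
        "android.permission.SEND_SMS",
        "android.permission.RECEIVE_BOOT_COMPLETED",
    },
    "spyware": {
        "android.permission.READ_CONTACTS",
        "android.permission.INTERNET",
    },
}


def signature_match(sample):
    """Exact match against the library; only catches known samples."""
    return hashlib.sha256(sample).hexdigest() in SIGNATURE_DB


def heuristic_match(permissions):
    """Return the families whose rule set is fully covered by the sample."""
    observed = set(permissions)
    return sorted(
        family for family, required in FAMILY_RULES.items() if required <= observed
    )


def detect(sample, permissions):
    """Try the signature library first, then fall back to the family rules."""
    if signature_match(sample):
        return "malware (known signature)"
    families = heuristic_match(permissions)
    if families:
        return "suspicious (matches rules: " + ", ".join(families) + ")"
    return "no match"
```

A sample whose bytes differ from every library entry passes the signature check unnoticed, whereas the rule check can still flag it, which mirrors the trade-off between the two approaches described above.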

Machine Learning Based Android Malware Classification
To address the inability of traditional methods to identify unknown or zero-day malware, machine learning has been widely employed in Android malware detection research in recent years. According to the underlying concept, machine learning can be roughly divided into five categories: symbolism, Bayesianism, connectionism, evolutionism, and behavioral analogism (Zhou 2016).
Firstly, according to learning type, machine learning used in Android malware detection can be divided into four categories depicted in Table 5.
(1) Supervised learning. The training data labeled with the category is the input into machine learning models in supervised learning. It is a classification task when supervised learning makes discrete predictions about various things, a regression task when supervised learning makes predictions about continuous values.
(2) Unsupervised learning. In unsupervised learning, the prediction model is trained on unlabeled data sets, with the aim of exploring and inferring potential connections in the unlabeled data. The typical tasks are clustering and dimensionality reduction.
(3) Semi-supervised learning. Combining supervised and unsupervised learning, semi-supervised learning uses training data in which only some of the samples are labeled (Zhu and Goldberg 2009). It learns the internal structure of the data and then reasonably organizes the data for prediction with only a few labeled samples (Engelen and Hoos 2020).
(4) Reinforcement learning. Reinforcement learning can be applied to select Android features, using the classification result of the input data as feedback to the classification model, with the principle that the agent optimizes its next action to maximize the reward value.

Table 4. Android malware detection methods.
Year | Reference | Type | Description
 |  | Heuristics | A permission-based Android malware detection system using heuristic analysis.
2020 | (Priya and Visalakshi 2020) | Machine learning | The KNN-based Relief algorithm and the optimized SVM were adopted to detect Android malware.
2018 |  | Machine learning | Generate adversarial samples to evade the detection of current machine learning based detectors.
2020 | (Pektaş and Acarman 2020a) | Machine learning | Employ deep neural network as malware classification.
 | (Kumar et al. 2019) | Machine learning | Combine dynamic analysis and static analysis machine learning.
 | (Pektaş and Acarman 2020b) | Machine learning | Deep learning was applied using features extracted from instruction call graphs.
2018 | (Hasegawa and Iyatomi 2018) | Machine learning | One-dimensional convolutional neural networks were applied for Android malware detection.
 | (Yen and Sun 2019) | Machine learning | Utilize CNN to process images generated from the importance of words.
Secondly, according to learning tasks, machine learning models can be categorized into classification models, regression models, clustering models, and dimension reduction models. In a classification task, the training samples are assigned to given categories, whereas in a clustering task the categories are not known in advance. In a regression task, a function is fitted to the input data points. As machine learning models can be applied to solve different problems, it is difficult to categorize all machine learning algorithms from the same perspective. For example, a decision tree can be utilized in both classification and regression tasks. There is no absolute boundary between different categories, so one model may belong to multiple categories.
With this basic taxonomy of machine learning methods in place, the models commonly used in Android malware detection are summarized as follows. Traditional machine learning and other current state-of-the-art detection models are distinguished, with a detailed summary shown in Table 6. Three main types of models and algorithms are used for Android malware detection: the first, (1)-(6), are traditional machine learning models; the second, (7)-(8), are neural networks and deep learning; and the third, (9), is ensemble learning, which combines multiple classifiers to detect Android malware.
(1) Linear Model. Simple and highly interpretable, linear functions that take Android features as input are applied to predict malware. Typical linear models include logistic regression and linear regression, with the difference that logistic regression solves classification problems while linear regression deals with regression problems. Indirect methods for diagnosing anomalies have also been provided by building specialized linear models to locally approximate the anomaly scores generated by black-box models.
(2) Support Vector Machine. Support Vector Machine (SVM) has shown significant improvement in effectively monitoring the resource consumption of running Android malware. SVM seeks a hyperplane (Boswell 2002) that divides n-dimensional data into two categories with the largest possible margin. (Faiz, Hussain and Marchang 2020) applied SVM using features extracted from Android permissions, broadcast receivers, and APIs to detect Android malware, with the highest classification accuracy of 98.55% achieved by personaCateg-SVM.
(3) Naive Bayes. Based on Bayes' theorem, Naive Bayes (NB) assumes that the effect of an attribute value on a given class is independent of the values of other attributes (Leung 2007). (Alqahtani, Zagrouba and Almuhaideb 2019) provided a review of machine learning detectors, summarizing NB, SVM, and DNN applied in Android malware detection in detail.
(4) Decision Tree. As one of the most commonly applied supervised learning models used in inductive reasoning, Decision Tree (DT) builds a flowchart-like tree structure from training data. (Lashkari et al. 2018) applied RF, KNN, and DT as Android malware detection classifiers for comparison, with each machine learning algorithm trained, tested, and evaluated with the same selected features.
(5) K Nearest Neighbor. As a supervised learning model, K Nearest Neighbor (KNN) obtains Android malware classification results by measuring the Euclidean distance in geometric space between different feature vectors (Ray 2019).
(6) K-means Clustering. The K-means clustering algorithm is an unsupervised learning algorithm typically applied in Android malware family classification (Ilham, Abderrahim and Abdelhakim 2018). Given a set of N data points in real d-dimensional space R^d and an integer K, the problem is to determine a set of K center points in R^d that minimizes the mean squared distance from each data point to its nearest center (Kanungo et al. 2002).
(7) Neural Network. Composed of a large number of connected artificial neurons, a Neural Network (NN) uses neurons to reflect the received signal and weights to represent the strength of the signal (Gershenson 2003). The most commonly used neural network algorithms are Perceptron Neural Network, Hopfield Neural Network, and Self-Organized Map.
(8) Deep learning. Deep learning obtains multiple levels of data representation by composing nonlinear modules that each convert one level of representation into a higher and more abstract level (LeCun, Bengio and Hinton 2015). It originates from NN (Du et al. 2016), as illustrated in (Qiu 2020). It is usually used to detect Android malware when features are transformed into images. A hybrid malware classification using segmentation-based fractal texture analysis and deep convolutional neural network features was proposed in (Vinayakumar et al. 2018), which converted Android APKs into grayscale images generated from bytecode information. (Vinayakumar et al. 2018) also used Back-Propagation Through Time (BPTT) to train an LSTM model to detect Android malware. Two different network topologies with multiple network parameters were used, a standard LSTM network containing only one hidden layer and a stacked LSTM network with three hidden layers, which exhibited high Android malware detection accuracy on both static and dynamic analysis.
(9) Ensemble learning. In ensemble learning, multiple classifiers are combined to improve Android malware detection accuracy (Zhao et al. 2018) (Rana and Sung 2020). More specifically, ensemble learning describes a way of combining learners. A new classifier fusion method based on a multi-level structure was proposed by (Yerima and Sezer 2019): basic Android classifiers are trained at a lower level to generate models, and at a higher level a set of ranking algorithms selects the final classifiers and weights their prediction results according to the prediction accuracy of the basic classifiers. However, ensemble learning is computationally costly, because each APK file must be analyzed by multiple detectors. To tackle the problem, (Birman et al. 2019) applied deep reinforcement learning to automatically start and stop the base classifiers, using a DNN to dynamically determine whether there is adequate information to classify a given APK file.
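To make items (5) and (9) above more concrete, the following is a minimal sketch, not taken from any of the cited works, of a KNN classifier using Euclidean distance and of an accuracy-weighted vote over base classifiers. The feature layout (four binary permission flags), the toy samples, and the function names knn_predict and weighted_vote are all assumptions made for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Combine base-classifier labels, weighting each vote.

    The weights are assumed to be, for example, validation accuracies.
    """
    scores = Counter()
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return scores.most_common(1)[0][0]


# Hypothetical binary features: [SEND_SMS, READ_CONTACTS, INTERNET, CAMERA]
TRAIN = [
    ([1, 1, 1, 0], "malware"),
    ([1, 0, 1, 0], "malware"),
    ([1, 1, 0, 0], "malware"),
    ([0, 0, 1, 1], "benign"),
    ([0, 0, 1, 0], "benign"),
    ([0, 0, 0, 1], "benign"),
]

if __name__ == "__main__":
    print(knn_predict(TRAIN, [1, 1, 0, 1], k=3))
    print(weighted_vote(["malware", "benign", "benign"], [0.9, 0.3, 0.4]))
```

In the weighted vote, a single highly weighted base classifier can outvote two weaker ones, which is the intended effect of weighting by accuracy rather than counting votes equally.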

Limitations and Challenges in ML Based Detection
Mainstream technologies for Android malware detection, especially machine learning, face notable challenges that need to be considered in future work. As described in this section, these challenges can be divided into two aspects. Firstly, machine learning is vulnerable to adversarial sample attacks. Secondly, there are more serious problems caused by the upgrade of the Android ecosystem and the emergence of new malware: machine learning-based detectors suffer from degradation, and feature selection algorithms do not adapt well to this evolution.

Vulnerability to Security Attacks
Although machine learning-based classifiers have shown enhanced performance in Android malware detection, attackers have proposed a variety of countermeasures to evade detection. For example, they may craft adversarial examples to interfere with machine learning detectors, which makes it easier to evade detection while retaining the malicious function. (Papernot et al. 2016; Amodei et al. 2016) reviewed the existing work on security risk and summarized the security problems of machine learning. As described in Table 7, machine learning model security problems can be divided into three categories: training integrity threat, test integrity threat, and lack of robustness of the model. Among these attacks, the most common is the test integrity threat, because attackers have few opportunities to manipulate the training dataset of the detection classifier.
Recent works have highlighted the vulnerability of many machine learning models of Android malware detection to adversarial examples, which can be used to evaluate the security and robustness of the model before it is deployed.
The existing approach to generating adversarial samples is to modify the feature vector of the Android malware so that it is misclassified by machine learning detectors, while preserving the malicious functionality. (Grosse et al. 2017) used an augmented adversarial crafting algorithm to mislead the classifier, adding individual features to AndroidManifest.xml in order to preserve semantics. (Rosenberg et al. 2017) applied a query-efficient black-box attack that generated adversarial examples by modifying the malware's API call sequences and non-sequential features.
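The add-only constraint described above can be illustrated with a deliberately simplified sketch. It is not the crafting algorithm of (Grosse et al. 2017); it assumes a linear detector whose weights are known to the attacker, a binary feature vector, and a decision rule of "malware if score >= 0". The names score and evade_by_adding, the weights, and the budget are hypothetical.

```python
def score(weights, bias, x):
    """Linear detector score; a value >= 0 is assumed to mean 'malware'."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def evade_by_adding(weights, bias, x, budget):
    """Greedily switch absent features on until the sample scores as benign.

    Features are only ever added (0 -> 1) and never removed, so every feature
    the malware already has is preserved. At most `budget` features are added.
    """
    adv = list(x)
    # Absent features with negative weight push the score toward 'benign';
    # try the most negative weights first.
    candidates = sorted(
        (i for i, xi in enumerate(adv) if xi == 0 and weights[i] < 0),
        key=lambda i: weights[i],
    )
    added = []
    for i in candidates:
        if score(weights, bias, adv) < 0 or len(added) >= budget:
            break
        adv[i] = 1
        added.append(i)
    return adv, added


# Hypothetical detector and malware sample.
WEIGHTS = [2.0, 1.5, -1.0, -2.5, -0.5]
BIAS = -1.0
MALWARE = [1, 1, 0, 0, 0]

if __name__ == "__main__":
    adv, added = evade_by_adding(WEIGHTS, BIAS, MALWARE, budget=3)
    print(score(WEIGHTS, BIAS, MALWARE), adv, added, score(WEIGHTS, BIAS, adv))
```

The sketch operates purely in feature space; whether the modified vector corresponds to an installable, working app is a separate question.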
However, the above attack models have some defects, which should be kept in mind when building a detector that is robust against adversarial samples. For example, (Grosse et al. 2017) modified AndroidManifest.xml to fool the classification model, but the attack fails when hybrid analysis is used or when features contained in AndroidManifest.xml are not extracted as input to the classifier. Therefore, multi-source features extracted from different disassembled Android files, which provide a more all-round analysis with comprehensive information, can be combined in future research to defend against attacks based on modifying a single feature source.
Besides the lack of feasibility of such adversarial attacks, the mutation may also cause the Android malware to crash. In other words, the malicious behaviors could be lost, or the code sometimes cannot be compiled properly, due to the modification of the feature vector. Therefore, to enhance the feasibility of feature-space attacks, malware evolution attacks and malware confusion attacks have been combined to preserve the critical structure of malware. Phylogenetic analysis of Android malware families was conducted to interpret evolving malware patterns in the evolution attack, which was then complemented by mutating permission and API features less differentiable from Android malware. Furthermore, instead of focusing on feature-space attacks, other researchers built attack models in the problem space. (Pierazzi et al. 2020) applied a problem-space attack focused on test-time evasion in Android malware detection, by modifying real input-space objects that correspond to an adversarial feature vector. The results of the experiment on a dataset of 170 K apps demonstrated the feasibility for an attacker to evade DREBIN (Arp et al. 2014) and its hardened version, Sec-SVM (Demontis et al. 2017).

Table fragment (columns: Attack | Description | Reference)
Poison attack | The attacked model is unable to work appropriately in the test phase with poisoned data mixed into the training dataset. | (Biggio, Nelson and Laskov 2013)
Backdoor attack | The attacked model classifies the data of the backdoor trigger into the target category due to the poisoned training set. | (Gu, Dolan-Gavitt, and ...
Although the evasion rate of these attacks against machine learning detectors is generally high, defenders can still build an effective classifier against adversarial samples through the following methods: (i) Adversarial Training. Training a new detection model with adversarial samples. (ii) Variant Detector. Developing a detector in addition to the original malware detector to detect whether an app is a variant derived from existing malware. (iii) Feature Integration. Integrating as many features as possible to fully extract the various information of the sample.
From this discussion, future research on Android malware classification is encouraged to adopt adversarial models to evaluate the robustness of the classifier, and to consider the above-mentioned approaches to help address the vulnerability of machine learning-based detectors.

Deterioration Issues
The upgrade of the Android ecosystem poses difficulties in the feature subset selection and malware classification stages. Despite the numerous malware family classification approaches available, the topic remains valuable because it has not been well solved. One of the challenges is how to pick features and build robust detectors that stand the test of time without frequent retraining, since a key issue is the evolution of the Android ecosystem.
For the Android feature selection stage, it is essential to select valid Android features to build anti-malware tools that are resilient to this evolution. (Suarez-Tangil and Stringhini 2018) tracked massive amounts of malware from 2010 to 2017 and explored how repackaged malware evolved by using differential analysis. They discussed areas that deserve special attention when extracting Android features to detect malware. Building an infrastructure able to mine a mobile software ecosystem, (Cai 2020a) depicted how the behavior of Android software has changed over time by focusing on the ecological interaction and behavioral evolution patterns of three ecosystem elements. The changes in Android software tracked by these researchers pose challenges for future work on Android feature engineering.
For the malware classification stage, machine learning-based Android detectors have been noted to suffer from sustainability issues. They deteriorate due to the constant evolution of the Android ecosystem and the emergence of new malware. The aging problem in Android malware classifiers was emphasized in (Fu and Cai 2019) and identified by the framework proposed in (Jordaney et al. 2017).
Some researchers (Kantchelian et al. 2013) (Maggi et al. 2009) attempted to address the problem by frequently retraining the malware classification model. However, the performance of the classifier becomes untrustworthy when the retraining frequency is low, and manually labeling all Android samples for retraining is costly. Therefore, a thorough solution to the sustainability issue focuses on improving the rapidly aging classifier itself. From the reviewed literature, the approaches to slow down the aging of classification models are as follows.
(1) Present Features in Abstraction. The key to solving the problem is developing detectors resilient to changes and achieving scalability, so the concept of abstraction is utilized to make the machine learning model more adaptable for its insensitivity to the detailed changes of the Android framework. For example, (Onwuzurike et al. 2017) used the family, package, or class information to generate abstracted API calls rather than relying on the raw API calls, and they tested the model on the dataset containing samples captured over six years to display its consistency. Similarly, (Zhang et al. 2020) also dealt with the problem by exploring the semantic similarity despite the different implementations.
(2) Track the Evolutionary Patterns. Another strategy (Cai, 2020b) is to understand the evolutionary patterns of extracted features in benign samples and malware and then leverage the findings to build a sustainable malware detector. The five-year evolution trajectory of a new behavior profile described by run-time behaviors has been studied, and a detection system was proposed based on observations of consistent differences between benign and malicious software over the years. The resulting system showed better sustainability than MamaDroid (Onwuzurike et al. 2016) owing to its ability to maintain high accuracy.
(3) Build Self-evolving Detector. For this method, the detection model will be updated if identified as aging in the detection stage. (Xu et al. 2019) proposed a self-evolving Android malware detection system that maintains a different set of detection models and automatically self-updates through online learning techniques to improve sustainability and reduce deterioration.
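The abstraction idea in item (1) above can be sketched as follows. This is a simplified illustration based on plain string splitting, not the implementation of (Onwuzurike et al. 2017); it assumes that every raw call is written as "package.Class.method", and the function name abstract_api_call and the level names are hypothetical.

```python
def abstract_api_call(call, level):
    """Map a raw API call such as 'a.b.Class.method' to a coarser abstraction.

    Assumes the last component is the method name and the second-to-last
    component is the class name.
    """
    parts = call.split(".")
    if len(parts) < 3:
        raise ValueError("expected 'package.Class.method', got %r" % call)
    if level == "class":
        return ".".join(parts[:-1])   # drop the method name
    if level == "package":
        return ".".join(parts[:-2])   # drop the class and method names
    if level == "family":
        return parts[0]               # keep only the top-level namespace
    raise ValueError("unknown abstraction level: %r" % level)


if __name__ == "__main__":
    raw_calls = [
        "android.telephony.SmsManager.sendTextMessage",
        "android.telephony.SmsManager.sendMultipartTextMessage",
    ]
    for level in ("class", "package", "family"):
        print(level, sorted({abstract_api_call(c, level) for c in raw_calls}))
```

Two distinct raw calls collapse to the same abstract token at every level here, which is why a model trained on abstracted calls is less sensitive to individual methods being added or renamed across framework versions.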
In conclusion, as research pays increasing attention to the deterioration problem of machine learning classification models in Android malware detection, it should not be ignored in future experiments. Researchers can demonstrate the resilience of a proposed classification model by testing it on datasets spanning several years without frequent retraining.
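The evaluation protocol suggested above (train once on older samples, then test on each later year without retraining) can be sketched as follows. The sample layout (dictionaries with year, features, and label keys), the toy data, and the rule-based predictor are assumptions for illustration only; a real experiment would plug in a trained classifier.

```python
from collections import defaultdict


def temporal_split(samples, train_until):
    """Split samples by year: training data up to `train_until`, test data after."""
    train = [s for s in samples if s["year"] <= train_until]
    test = [s for s in samples if s["year"] > train_until]
    return train, test


def accuracy_by_year(test_samples, predict):
    """Accuracy of a fixed (never retrained) predictor for each test year."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for s in test_samples:
        totals[s["year"]] += 1
        hits[s["year"]] += int(predict(s["features"]) == s["label"])
    return {year: hits[year] / totals[year] for year in sorted(totals)}


def toy_predict(features):
    """Stand-in for a model trained on the old data: flags apps that send SMS."""
    return "malware" if features["sms"] == 1 else "benign"


# Hypothetical samples; the 2018 malware no longer shows the old SMS behavior.
SAMPLES = [
    {"year": 2016, "features": {"sms": 1}, "label": "malware"},
    {"year": 2016, "features": {"sms": 0}, "label": "benign"},
    {"year": 2017, "features": {"sms": 1}, "label": "malware"},
    {"year": 2017, "features": {"sms": 0}, "label": "benign"},
    {"year": 2018, "features": {"sms": 0}, "label": "malware"},
    {"year": 2018, "features": {"sms": 0}, "label": "benign"},
]

if __name__ == "__main__":
    train, test = temporal_split(SAMPLES, train_until=2016)
    print(accuracy_by_year(test, toy_predict))
```

A falling per-year accuracy curve is the signature of the aging problem; a random train/test split over all years would hide it.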

Datasets
As a literature survey, it is essential to devote a separate survey dimension to the datasets and disassembly tools generally used in Android malware detection. Conducting experiments on well-known, up-to-date datasets that are sufficient for longitudinal comparative experiments is a concern in this research area. Public Android datasets typically used in recent work are discussed as follows.
(1) Drebin (Arp et al. 2014). As the most-used Android malware dataset in previous studies, serving as a benchmark, Drebin contains 5,560 files from 179 different malware families. The applications were captured from August 2010 to October 2012 and have never been updated since then.
(2) RmvDroid. RmvDroid is a malware dataset containing 9,133 samples from 56 malware families, collected from 2014 to 2018 based on Google Play's app maintenance results over several years and analyzed by VirusTotal.
(3) AndroZoo (Allix et al. 2016). AndroZoo is a well-known collection of Android applications mainly captured from Google Play, AnZhi, and AppChina, with samples analyzed by tens of different antivirus products. It is still being updated and contains 16,941,455 different benign and malware samples at present.
(4) AndroZooOpen (Liu et al. 2020b). As a supplement to AndroZoo, which is made up of closed-source Android apps, AndroZooOpen presents a growing collection of open-source Android apps collected from several sources including GitHub and Google Play, currently covering over 45,000 app repositories.
(5) AndroCT (Li, Fu and Cai 2021). AndroCT is a large-scale dataset on the run-time traces of function calls in 35,974 benign and malicious Android apps from 2010 to 2019. Each app was exercised both on an emulator and a real device, and the traces were separately curated by running each sample app against automatically generated test inputs.
(6) AMD (Wei et al. 2017). AMD contains 24,553 samples, categorized in 135 varieties among 71 malware families ranging from 2010 to 2016. The dataset includes detailed descriptions of each malware variety's behaviors generated based on the manual analysis result.
It is notable that some of these datasets have not been maintained in the last three years, such as Drebin (Arp et al. 2014) and Android Malware Genome (Zhou and Jiang 2012), so using the latest Android reverse engineering tools to disassemble the outdated samples can be problematic. More significantly, these datasets no longer represent the present Android malware landscape. It is recommended to carry out experiments on well-labeled, well-studied datasets released or updated after 2019 (e.g., AndroZoo, AndroZooOpen). Most of the existing datasets are static and include no information about the runtime behavior of apps. For dynamic analysis, AndroCT is recommended for its extracted run-time trace features.

Disassembly Tools
Disassembly is the reverse process of compilation, turning executable Android machine code into higher-level code. The disassembly of object code can be divided into static disassembly and dynamic disassembly. Static disassembly obtains the assembly code directly by parsing the binary instructions of the object code without executing the program. Dynamic disassembly, on the other hand, tracks instructions as the program executes, so it can only handle instructions that the object code actually executes.
Many professional disassembly tools have been produced by researchers worldwide in recent years. The mainstream disassembly tools currently in professional use are introduced as follows.
(1) Smali & Baksmali (Gruver 2021). Smali and Baksmali are an assembler and a disassembler for the dex format used by the Dalvik Virtual Machine, and they can decompile and recompile classes.dex. The syntax is loosely based on the Jasmin and dedexer syntax, and it supports the full functionality of the .dex format.
(2) Androguard (Halder et al. 2020). A reverse engineering tool for Android applications, Androguard supports multiple platforms (such as Linux, Windows, and OSX), is mainly used for static analysis, is written primarily in Python, and implements visualization.
(3) APKTool (Wiśniewski and Tumbleson 2021). The main functions of APKTool include decoding resource files to their original format (including resources.arsc, classes.dex, png, XML, etc.), rebuilding decoded resources back into a binary APK/JAR, and processing APKs that depend on framework resources.
(4) AndroPyTool (Melbshark 2019). Capable of extracting static and dynamic features from an Android APK, it combines various well-known Android app analysis tools such as Droidbox, FlowDroid, Strace, Androguard, and VirusTotal. AndroPyTool needs a source directory to carry out the analysis and generates property files in JSON and CSV format.
(5) FlowDroid (Arzt et al. 2014). FlowDroid is an Android static taint analysis tool that is context-, flow-, field-, and object-sensitive as well as lifecycle-aware, which gives it higher accuracy and recall rates than other static analysis approaches. Based on the IFDS framework, it can analyze all possible paths of information flow to generate a CFG (Control Flow Graph) and label the taint leak path of sensitive information flow from source to sink.
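All of the tools above take an APK as input, and an APK is a ZIP archive. As a minimal sketch using only the Python standard library (it does not call any of the listed tools), the entries that those tools go on to decode, such as classes.dex, AndroidManifest.xml, and resources.arsc, can be located as follows. The helper names list_apk_entries and make_toy_apk are hypothetical, and the toy archive contains placeholder bytes rather than a real app.

```python
import io
import zipfile


def list_apk_entries(apk_file):
    """Report the entries of an APK that disassembly tools typically consume.

    `apk_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(apk_file) as archive:
        names = archive.namelist()
    return {
        "dex": sorted(n for n in names if n.endswith(".dex")),
        "manifest": "AndroidManifest.xml" in names,
        "resources": "resources.arsc" in names,
    }


def make_toy_apk():
    """Build an in-memory ZIP with APK-like entry names (placeholder content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("AndroidManifest.xml", b"placeholder")
        archive.writestr("classes.dex", b"placeholder")
        archive.writestr("classes2.dex", b"placeholder")
        archive.writestr("resources.arsc", b"placeholder")
        archive.writestr("res/drawable/icon.png", b"placeholder")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(list_apk_entries(make_toy_apk()))
```

In a real APK the manifest and resources.arsc are stored in a binary format, which is why a dedicated tool is still needed to decode them into readable form.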

Conclusion
Based on a review of the collected work, this paper provides a systematic overview of the Android OS environment, feature selection, and classification technology. The limitations of machine learning and the commonly applied datasets and disassembly tools are also included. The main objective of this paper is to depict a full portrait of the field of Android malware detection, especially machine learning-based detection.
Compared with other reviews, this paper not only gives a brief introduction to the Android system mechanism and malware classification algorithms but also systematically summarizes data preprocessing approaches and valid feature subset selection models, including the limitations and challenges in Android malware detection, which gives a more comprehensive view of the techniques of feature selection and malware detection in recent years. This article can provide readers with a fundamental overview of Android malware detection and inspire them to pursue new research avenues.
However, this comprehensive study of Android malware detection shows that some challenges remain for future research, for example, the vulnerability of Android detectors to adversarial sample attacks, the aging of classification models due to the emergence of new malware, and the difficulty of building Android feature selection models resilient to the evolution of the Android system. In addition, while a variety of machine learning methods are used to classify Android malware, little research on these methods has focused on feature selection, which has a fundamental impact on detection efficiency. Although ensemble learning models that combine multiple learners are widely utilized, many models use traditional machine learning algorithms as base classifiers instead of combining state-of-the-art machine learning algorithms such as deep learning models. Moreover, automatically selecting and removing base classifiers can be exploited to reduce the computational expense and to find the optimal combinations of base classifiers. Therefore, future research can make full use of reinforcement learning for ensemble learning in Android malware detection.
Finally, little research utilizes the feedback from the accuracy of the classifier in Android malware detection; consequently, the valuable relevance information between different features that can be obtained from the classifier is ignored. However, to make a breakthrough in the efficiency of the wrapper-based feature selection procedure, the problem of the inexhaustible number of feature combinations among candidate subsets in previous wrapper-based methods should be tackled.

Disclosure statement
No potential conflict of interest was reported by the author(s).