Intelligent mining of large-scale bio-data: Bioinformatics applications

ABSTRACT Today, a tremendous amount of bio-data is being collected through computerized applications worldwide. Therefore, scholars have been encouraged to develop effective methods to extract the hidden knowledge in these data. Consequently, a challenging and valuable area for research in artificial intelligence has been created. Bioinformatics creates heuristic approaches and complex algorithms using artificial intelligence and information technology in order to solve biological problems. Intelligent use of the data can accelerate biological knowledge discovery. Data mining, as biology intelligence, attempts to find reliable, new, useful and meaningful patterns in huge amounts of data. Hence, there is a high potential to raise the interaction between artificial intelligence and bio-data mining. The present paper discusses how artificial intelligence can assist bio-data analysis and gives an up-to-date review of different applications of bio-data mining. It also highlights some future perspectives of data mining in bioinformatics that can inspire further developments of data mining instruments. Important and new techniques are critically discussed for intelligent knowledge discovery of different types of raw datasets with applicable examples in human, plant and animal sciences. Finally, a broad perception of this hot topic in data science is given.


Introduction
A recent paper in the Science Policy Forum on increasing scientific exploration with Artificial Intelligence (AI) discusses that the human bottleneck in scientific discoveries could be overcome through 'systems that use encoded knowledge of scientific domains and processes in order to assist analysts with tasks that previously required human knowledge and reasoning' [1]. The Hanalyzer (high-throughput analyser) was a pioneer in supporting this knowledge-based genome-scale interpretation technique [2]. Techniques developed by computer scientists have provided the opportunity for researchers to sequence approximately 3 billion base pairs (bp) of the human genome. Currently, achievements generated from the application of next-generation DNA sequencing (NGS) technologies have inaugurated genomics science, and facilitated critical progress in various areas such as epidemiology, biotechnology, forensics, biomedical sciences and evolutionary biology [3].
Bioinformatics as an interdisciplinary area explores new biological insights from biological data [4]. Biological databases are the heart of bioinformatics [5,6], and represent an organized set of a huge variety of biological data from past research conducted in laboratories (including in vivo and in vitro), from bioinformatics (in silico) analysis and scientific articles. Databases related to 'omics' (e.g. genomics, transcriptomics, proteomics and metabolomics) collect experimental data and can be browsed with designed software [7]. Recently, it has been revealed that analysis of large volumes of biological data through traditional database systems is very troublesome and challenging [8], whereas biological knowledge discovery can be accelerated by intelligent use of the data. Such action is called data mining (DM) and can include simple, complex and/or combinational queries. Consequently, numerous techniques of genomic DM have been created for experimental and computational biologists [9]. DM methods can be used in bioinformatics studies because bioinformatics is data-rich, while no comprehensive theory of life organization can be detected at the molecular level [8].
The question is how to converge the two domains, AI and DM, for successful mining of bio-data. The present paper discusses how AI can assist bio-data analysis. Then, an up-to-date review of different applications of bio-data mining is presented. The paper also highlights some future perspectives of DM in bioinformatics that can inspire further developments of DM instruments.

Intelligent knowledge discovery in bioinformatics
A challenging and hot research area for AI was generated when the Human Genome Project and other large-scale biological studies collected a huge quantity of data [10]. Hunter's seminal article [10] entitled 'Artificial Intelligence and Molecular Biology' appeared in AI Magazine 25 years ago. Today, bioinformatics is involved in 'big data' and encounters such challenges as sequence, expression, structure and pathway analyses [11]. AI and heuristic approaches are essential for the present and future development of bioinformatics. Today, it is widely agreed that these two domains are converging [12].
Bioinformatics is a relatively new interdisciplinary and strategic area of study that integrates and interprets the complexity of biological data through information technology and computer science. This area of science attempts to develop novel algorithms and software, data storage methods and new computer architectures in order to fulfil the computational requirements [13]. An algorithm is a step-by-step procedure (a list of well-defined instructions) for calculation, data processing and automated reasoning. In essence, an algorithm is applied to calculate a function. For instance, Hilbert et al. [14] introduced a partial formalization of the concept in an attempt to solve the Entscheidungsproblem. Bioinformatics basically copes with four aspects of analysis, including DNA sequence analysis, protein structure prediction, functional genomics and proteomics, and systems biology, through the development and application of innovative algorithmic methods [3].
Finding solutions to the biological issues is in the area of bioinformatics where the DM approaches could be used efficiently. Both DM and bioinformatics are fast developing fields of research [8]. The growth of information storage technology has generated a vast volume of raw data considering two aspects: algorithm development and rise of modern storage equipment. These raw data include important information. In the 1990s, researchers used knowledge discovery from data (KDD) in order to extract knowledge from databases. As Piatetsky-Shapiro and Frawley [15] argue, 'Knowledge discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.' Of course, reasonable time complexity, accuracy, comprehensibility and useful results are necessary features that should be considered for the extraction of new knowledge. Furthermore, according to Fayyad et al. [16], DM is synonymous with KDD. DM can be applied in bioinformatics for areas such as gene finding, function motif detection, protein function domain detection, protein function inference, protein and gene interaction network reconstruction, protein sub-cellular location prediction, disease diagnosis, disease treatment optimization, disease prognosis and data cleansing [17]. For instance, a novel learning algorithm (KODAMA package) can be used for knowledge discovery and DM [18].
The process of DM has three levels, including (i) data pre-processing, (ii) data modelling and (iii) data postprocessing (Figure 1). In the first phase, raw data are prepared for mining. Because of the widely distributed, uncontrolled generation and utilization of numerous bio-data, data cleaning, data pre-processing and the semantic integration of such heterogeneous and highly distributed databases have become significant in systematic and coordinated analyses of bio-databases [19]. As indicated in Figure 2, the second phase discovers relationships between different data for extraction of significant new patterns [20]. In this regard, prediction and description are the primary goals of DM [17]. The predictive models (such as classification, regression, time series analysis, prediction, etc.) can predict unknown data values using the known values. On the other hand, the descriptive models (such as clustering, sequence discovery, association rule and summarization) can detect the patterns in data and discover the properties of the data assessed [21]. In the final phase, postprocessing, the extracted data and patterns are evaluated and then verified as knowledge. Background knowledge can also be used to verify the extracted knowledge [22]. DM systems are classified based on criteria such as: (i) the type of data source mined (e.g. text, image, audio, video, etc.), (ii) the data model (e.g. object model, relational data model, object-oriented data model, hierarchical data model/W data model), (iii) mining techniques (e.g. machine learning, genetic algorithms (GA), statistics, neural networks, visualization, database-oriented or data warehouse-oriented, etc.), and (iv) the kind of knowledge discovered (such as classification, clustering, association, characterization, discrimination, etc.). The classification can also consider the degree of user interaction engaged in DM. A comprehensive system can provide different DM approaches appropriate to various conditions and options, and support various levels of user interaction [21].
Figure 1. Basic concepts of data mining. The DM process includes three levels: (1) data pre-processing (raw data are prepared for mining), (2) data modelling (discovers relationships between different data for extraction of significant new patterns), and (3) data postprocessing (extracted data and patterns are evaluated and then verified as knowledge).
DM approaches and techniques can be categorized into three key groups: (i) supervised learning techniques, (ii) unsupervised learning techniques, and (iii) other. The first group involves classification and prediction tasks. Clustering and association rules mining are in the second category. Some tasks can be classified neither as supervised nor as unsupervised learning techniques and are therefore assigned to the third category. There is as yet no comprehensive list of DM tasks. Nevertheless, according to Piatetsky-Shapiro [23], the most common DM approaches are (a) sequence mining, (b) clustering, (c) decision trees and decision rules (classification), (d) support vector machine (SVM), (e) neural networks (classification), (f) Bayes classification, (g) regression, (h) link analysis, (i) descriptive statistics and (j) visualization. DM tasks include the selection of suitable algorithms. Both the selection of the DM approach and algorithm, and the parameterization of the optimal algorithm depend on the goals of the analysis and the features of the available data [24]. Several DM activities such as manipulation, mining of sequence data, string searching algorithms, machine learning and database theory have been considered seriously. The methods developed for such tasks have led to extensive progress in computer science [8].
Figure 2. Schematic overview of possible inputs for DM process and subsequently possible predictions and outputs from DM algorithms leveraging many genome-scale datasets. The upper side of the circle shows different selected inputs/datasets including single nucleotide polymorphisms, structures of biological molecules, chromosomal mapping, phylogenetic data, gene expression profiles, DNA/RNA/protein sequence data and biochemical pathways. In the heart of the circle, the most popular DM algorithms and techniques are presented. On the lower side of the circle, different types of possible outputs extracted from DM approaches are displayed. These outputs include protein characterization, dataset characterization, pathway characterization, DNA and RNA sequence characterization, and interaction characterization.

Sequence mining
DM can be used in such fields as text mining, sequential pattern mining, image mining and web mining [8]. Among these areas, sequence data mining (SDM) is the most primitive operation in computational biology [17], and helps to discover the sequential relationships and knowledge hidden in the ocean of sequence data [8]. For example, by mining of DNA sequences alone, the BiRen algorithm predicts enhancers using a deep-learning-based model [25]. Lim et al. [26] also presented an automated information extraction system (@Minter) based on Support Vector Machines for text-mining of microbial interactions. SDM has a broad range of applications such as web access patterns, the analysis of customer purchase patterns, business, security, weather observations, medical data, DNA/RNA/protein sequencing, and so on [8]. In bio-data analysis, the most critical search problems are similarity search and comparison among bio-sequences and structures [19]. In fact, the sequence analysis refers to subjecting a DNA, RNA or peptide sequence to sequence alignment, sequence databases, repeated sequence search, or other bioinformatics approaches on a computer [17].
With the reducing costs, rapid advancements in NGS and related bioinformatics computing sources, and the generation of complete genome sequences of various organisms, bioinformatics provides both conceptual bases and practical approaches for discovering systemic functional behaviours of cells and organisms [27]. In the area of DNA, RNA and protein sequence analysis, SDM approaches are utilized for sequence alignment, sequence searching and sequence classification. Protein sequence classification is the favourite area of many researchers [8].
Sequence alignment is essential in solving such issues as prediction of the secondary and tertiary structures of proteins, prediction of the ancestral sequence or tracing the common genes in two organisms [28], prediction of gene function, sequence divergence, sequence assembly, database searching and so on [29]. However, sequence alignment is a highly complicated task because of the high number of possible combinations and searches. This complexity rises exponentially along with the size of the sequence. Therefore, sequence alignment is considered a highly computationally intensive problem [28]. Thus, both software and hardware advancements have the potential to improve the accuracy and speed. Consequently, new algorithms have emerged. These algorithms are classified as optimal and heuristic. Although optimal algorithms are efficient in alignment sensitivity, they are computationally expensive. In modern computational biology, the computational cost of all dynamic programming algorithms aforementioned is prohibitive especially for large-scale applications such as database searching. As a result, scientists have shifted their attention to heuristic algorithms. Heuristic approaches are faster algorithms that do not guarantee delivery of the optimal solutions [28].
Furthermore, pairwise sequence alignment is categorized into local and global. Local sequence alignments discover the best approximate sub-sequence match within two given sequences, that is, they find highly similar areas within the two sequences. Some popular local sequence alignment algorithms include Smith-Waterman [30], FASTA [31], BLAST (Basic Local Alignment Search Tool) [32], Gapped BLAST [33], BLAT (BLAST Like Alignment Tool) [34], BLASTZ [35] and PatternHunter [36]. BLAST, developed at the National Center for Biotechnology Information (NCBI) for fast sequence alignment, is the most popular bioinformatics algorithm worldwide [32]. BLAST gains its speed through two shortcuts: it does not insist on finding the optimal alignment, and it does not search all of the sequence space. In effect, BLAST rapidly finds the areas with high similarity without checking every acceptable local alignment [29]. On the other hand, global sequence alignments detect the best alignment of both sequences in their entirety. Therefore, they look for a global mapping between entire sequences. Some popular global sequence alignment algorithms include Needleman-Wunsch [37], MUMmer (Maximal Unique Match-mer) [38], GLASS [39], AVID [40] and LAGAN [41] (Table 1). Pairwise algorithms differ in their indexing step, in how they identify seeds/anchors and in their final step. Some algorithms seem to be more suitable for homologous sequences, whereas others target divergent sequences [28].
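The difference between global and local alignment can be illustrated with a small dynamic programming sketch. This is a minimal illustration only, not the implementation of any of the cited tools: the function name alignment_scores, the scoring values (match = 1, mismatch = -1) and the linear gap penalty of -2 are assumptions chosen for the example. It computes only the optimal scores, not the alignments themselves.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-2):
    """Return (global_score, local_score) for sequences a and b.

    Global scoring follows the Needleman-Wunsch recurrence, local scoring
    the Smith-Waterman recurrence; both use a linear gap penalty
    (an assumption made for this sketch).
    """
    rows, cols = len(a) + 1, len(b) + 1
    glob = [[0] * cols for _ in range(rows)]  # global alignment matrix
    loc = [[0] * cols for _ in range(rows)]   # local alignment matrix

    # Global alignment: leading gaps are penalised along the borders.
    for i in range(1, rows):
        glob[i][0] = i * gap
    for j in range(1, cols):
        glob[0][j] = j * gap

    best_local = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            glob[i][j] = max(glob[i - 1][j - 1] + s,
                             glob[i - 1][j] + gap,
                             glob[i][j - 1] + gap)
            # Local alignment: a score never drops below zero, so an
            # alignment can start anywhere in either sequence.
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])

    # The global score is read from the bottom-right cell; the local
    # score is the maximum over the whole matrix.
    return glob[-1][-1], best_local
```

For example, alignment_scores("TTTTACGT", "ACGT") gives a global score of -4 (four gaps are unavoidable) but a local score of 4, because the local recurrence is free to ignore the unmatched prefix. Both matrices require time and memory proportional to the product of the sequence lengths, which is why such optimal methods become costly for database-scale searches.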
Besides pairwise alignments, Multiple Sequence Alignments (MSAs) have been used to align closely related sequences, distantly related sequences or both [42]. MSA algorithms have been an active field of study since the 1980s. Traditionally, the most common method is the progressive alignment procedure, which exploits the idea that homologous sequences are evolutionarily related. Later, various alignment programs including global and local methods were developed [43]. CLUSTAL, an extremely common and effective heuristic algorithm for multiple alignments, was developed by Higgins and Sharp [44]. It was then extended into the current version, CLUSTALW, by Higgins et al. [45]. Additionally, evolutionary-based inference systems are crucial in such fields as epidemiology and virulence [46], elucidation of the tree of life [47], biodiversity [48], drug design [49], human genetics [50] and cancer [51]. MSA and its subsequent analysis are requirements for such evolutionary-based research [52][53][54]. MSAs are also very important in determining particular traits, known as 'specificity determining positions', that modulate a protein's function in a particular context, for instance, interaction areas, targeting signals in different cell machineries, pathways or compartments, or post-translational modification regions (cleavage, phosphorylation, etc.) [55][56][57].
Numerous genetic diseases are due to mutation variants of a gene or cluster of genes, or the overlapping features of various genetic diseases mapped to near or distant loci [3]. Consequently, mutation analysis has become highly significant because of its association with different diseases [42,58]. Hence, various computational approaches are being developed to forecast the function of missense mutations and to detect residues having an important impact on maintaining wild-type function. These approaches include sequence-based algorithms [59], structure-based algorithms [60,61] and a combination of both [62]. MSAs highlight two main trends that are particular to disease-associated mutations [42]. In addition to forecasting the function of mutant gene products, low throughput sequencing of known target genes facilitates the discovery of new mutations, thus helping scientists understand the evolving characteristics of some genetic diseases. Bioinformatics is able to predict such substitution impacts [3]. A three-phase analysis of 1514 missense substitutions in the DNA-binding domain (DBD) of TP53 (the most frequently mutated gene in human cancers) confirmed the utility of the Align-GVGD approach (http://agvgd.iarc.fr) for functional classification of missense mutant variants for any genes with adequate available sequences [42]. Additionally, the discovery of single nucleotide polymorphisms (SNP) in numerous model and non-model plant species is the result of bioinformatics progress [13]. In a recent study, Huang et al. [63] offered a framework that is able to discover long, single point mutations across multiple sequences. However, this framework could not detect co-mutations involving multiple positions. Other researchers have attempted to use the translation probability matrix to evaluate the future amino-acid composition [64,65]. However, they have only considered the mutation in one position and are unable to analyse the geographical dissemination of mutations over time.

Table 1. Local and global pairwise sequence alignment algorithms.

| Type | Algorithm | Characteristics | Reference |
|---|---|---|---|
| Local | FASTA | Disadvantages: if the sequences possess more than one area of homology (two optimal diagonals), just the area around init1 (a) could be found, while the area contributing to initn (b) will be discarded. Advantage: speed over optimal algorithm. | [31] |
| Local | BLAST | Disadvantages: it cannot find seeds (c) smaller than the minimum length 'l' regarded for the precise match seed (DNA alignment) and reports just local alignments. Also, it can find too many seeds per sequence, therefore decreasing speed (protein alignment), and allows no gaps in sequence. | [32] |
| Local | BLAST2 | It was developed to overcome the disadvantages of BLAST. | [33] |
| Local | BLAT | Same as BLAST and FASTA. BLAT differs from BLAST in which sequence it indexes. BLAT is limited in that it does not find small homologous areas due to the small seed length. | [34] |
| Local | PatternHunter | It introduces spaced seeds to increase the sensitivity. Its sensitivity is higher than that of the above-mentioned algorithms. The speed is not higher than BLAST, as it is implemented in Java and induces memory problems for very long sequences. | [36] |
| Local | BLASTZ | It is the fastest algorithm in the BLAST series. To speed up the algorithm, all repeats should be removed from the sequences. | [35] |
| Local | MASAA (Multiple anchor staged alignment algorithm) | MASAA employs the searching methods (suffix tree) utilized in global sequence alignment algorithms to identify long common substrings in both sequences. The simulations show that this algorithm outperforms BLASTZ when the sequences are divergent and sometimes generates an alignment when BLASTZ does not return any alignment. On homologous sequences, the performance is comparable. Overall, MASAA finds the alignment faster than BLASTZ. | [28] |
| Global | MUMmer | It is one of the first global alignment algorithms that align two long genomes. | [38] |
| Global | GLASS | It aligns long genomic sequences. It aims to remove the limitations of standard dynamic programming (SDP) approaches, which had running time problems, and to increase the sensitivity when aligning the sequences in their entirety. | [39] |
| Global | AVID | It balances sensitivity and speed when aligning very long sequences. | [40] |
| Global | LAGAN | More sensitive than previous algorithms. An effective pairwise aligner which can be appropriate for genomic comparison of distantly related organisms. It is not faster than MUMmer and BLASTZ. It is also not sensitive in detecting transpositions. | [41] |

(a) FASTA refers to the diagonal scoring the highest value as 'init1'.
(b) In the FASTA algorithm, the maximum weighted graph is chosen and the best alignment identified is marked as 'initn'.
(c) A pair of highly similar areas is known as a 'seed'.
Later, a different algorithm was proposed to mine co-mutations across multiple sequences [66]. However, the framework did not consider the three-dimensional (3D) structure of proteins. Recently, Wei [67] suggested an effective algorithm based on 3D structure for discovering non-contiguous mutations in biological sequences. Furthermore, high-throughput aligners can help in mapping the sequence reads to the reference sequences. Sequence alignments have numerous functions. However, there is a pressing need for highly efficient algorithms due to the large volume of the short sequence reads produced by NGS [68]. The Maq algorithm utilizes hashing methods [69]. In order to align reads, techniques based on the Burrows-Wheeler transformation can also be applied. Such techniques include BWA [70], Bowtie [71] and SOAP [72]. Although these algorithms are faster than Maq [72], they are limited to split reads in order to achieve gapped alignments. Moreover, the Smith-Waterman algorithm [30] is employed in the Mosaik aligner [73] for aligning the short reads [68].
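To make the Burrows-Wheeler transformation mentioned above more concrete, the following is a minimal sketch of the transform and its inverse. It is a naive, quadratic-memory illustration and does not reflect how BWA, Bowtie or SOAP build or query their indexes; the function names bwt and inverse_bwt and the use of '$' as an end-of-text marker are assumptions made for this example.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform computed from sorted rotations.

    Naive sketch: it materialises every rotation, so it is only
    suitable for short strings. The terminator is assumed to sort
    before every other character and must not occur in the text.
    """
    if terminator in text:
        raise ValueError("terminator must not occur in text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row that ends with the terminator is the original string.
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]
```

With these definitions, bwt("banana") returns "annb$aa", and inverse_bwt recovers the input. The transform tends to group identical characters together and, combined with auxiliary index structures not shown here, supports fast exact matching of short reads against a reference.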

Clustering
Using heuristic approaches, a clustering algorithm groups objects into a predefined number of clusters based on the similarity of the data. Distance metrics usually utilized as a scale for similarity evaluation of the objects include Euclidean, Jaccard, Manhattan, etc. The similarity measure can be chosen based on the features of the objects [24]. From a machine learning point of view, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system presents a data concept [74]. Determining the number of clusters in a dataset, however, remains an open issue in cluster analysis. For example, widely utilized iterative methods, such as the k-means algorithm, ask the user to determine the number of clusters in the data before running the algorithm. Algorithms which can discover the number of clusters are categorized as unsupervised clustering algorithms [75]. Hierarchical and partitional clusterings are the most popular clustering approaches (Table 2). Practically, clustering is highly important in DM applications such as information retrieval, text mining, scientific data exploration, spatial database applications, web analysis, marketing, customer relationship management (CRM), computational biology and medical diagnostics [74].
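The distance measures and the k-means procedure named above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the initial centroids are supplied by the caller (which also fixes the number of clusters in advance, as k-means requires), and the data points are invented for the example.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(p, q))


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's k-means with Euclidean distance.

    The number of clusters is fixed by len(initial_centroids).
    Returns (labels, centroids).
    """
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: euclidean(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(dim) / len(members)
                                     for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return labels, centroids


# Hypothetical two-dimensional data with two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, [(0, 0), (10, 10)])
```

Here the first three points receive label 0 and the last three label 1. Supplying the initial centroids makes the result deterministic; in practice they are often chosen at random, so repeated runs can give different clusterings.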
Exploring the hidden patterns in the gene expression microarray data is challenging for functional proteomics and genomics. DM methods can be used for addressing this task [75]. In gene expression data, clustering is a significant approach for deriving underlying information [20] such as biologically relevant grouping of genes and samples, gene regulation, gene function and gene expression differentiation in different circumstances [75]. For instance, Engreitz et al. [77] mined significant information from transcriptional modules in microarray data for acute myelogenous leukemia. Tasoulis et al. [75] also examined the application of the proposed k-windows clustering algorithm on gene expression microarray data. Besides determining the clusters present in a dataset, this algorithm can also define their number. Furthermore, the DBSCAN (density-based spatial clustering of applications with noise) clustering algorithm was used to screen colon cancer data [78]. On the other hand, a supervised fuzzy clustering approach discovered potential protein biomarkers to recognize individuals at high risk of bladder cancer [79].
Additionally, Frey and Dueck [80] proposed the Affinity Propagation (AP) algorithm, which is a state-of-the-art clustering approach. It has been used in many fields of computer science and bioinformatics since it has higher performance than traditional approaches such as k-means. In order to achieve high-quality sets of clusters, the original AP algorithm passes real-valued messages between all pairs of data points until convergence. Like agglomerative clustering, AP works with pairwise similarities between data samples.
The AP clustering algorithm is not dependent on a vector space structure, in contrast to other prototype-based techniques, and the clusters are selected from the detected data samples and not calculated as hypothetical averages of cluster samples [81]. As outlined by Bodenhofer et al. [81], AP is especially appropriate for bioinformatics purposes because: (i) numerous similarity scales applied in bioinformatics are not associated with explicit vectorial features; and (ii) detecting a small set of clusters can offer the opportunity for exploration in biological datasets. So far, the AP algorithm has been demonstrated to be effective for microarray data analysis [80][81][82][83][84][85], network analysis [86][87][88], structural biology studies [89][90][91], and sequence analysis [92]. For a review, see [81]. Although AP has many applications, one of its most significant research problems is its speed, particularly for large-scale datasets, since it needs quadratic CPU time in the number of data points to calculate the messages [93]. In order to solve this issue, the FSAP (fast sparse affinity propagation) algorithm was suggested for AP [94]. However, the efficiency of this fast algorithm comes at the expense of the accuracy of the clustering result. In fact, its clustering outputs are different from the outputs of the original AP algorithm. Thus, Fujiwara et al. [93] suggested an effective AP algorithm that prunes unnecessary message exchanges in the iterations and calculates the convergence values of pruned messages after the iterations to identify clusters. While it guarantees exactness of the clustering outputs, it is considerably faster than other algorithms. Furthermore, unlike FSAP, it does not require users to set any inner parameters. In addition, for clustering extremely large sequencing data, Jiang et al. [95] reported a Dirichlet Process Means (DP-means) algorithm. This algorithm (DACE) follows a random projection partition approach for parallel clustering.
Table 2. Most popular data-mining algorithms along with their most prominent characteristics (Modified from Li et al. [76]).

Association rules mining
Piatetsky-Shapiro and Frawley [15] first proposed the association rules mining technique (a market basket analysis approach), which is another area of DM. This method can detect non-trivial patterns in the data, and define the relationships among the binary variables utilized to characterize a set of objects [96] (Table 2). The most common algorithm, Apriori, takes two input parameters: rule support and confidence. The support of a rule is the proportion of records in the dataset that contain all items of the rule, and its confidence is the proportion of records satisfying the rule condition (antecedent) for which the rule consequent also holds [24]. In spite of the solid nature of association analysis and its potential applications, this approach is not as popular as clustering and classification, particularly in the area of bioinformatics. However, some researchers have employed association rules techniques in their work [97][98][99][100]. For instance, Mohanty et al. [101] created a prediction model by association rules in order to discover breast cancer masses in mammograms.
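As a small worked illustration of rule support and confidence, the sketch below uses the usual definitions: the support of an itemset is the fraction of records containing all of its items, and the confidence of a rule X -> Y is support(X and Y) divided by support(X). The function names and the toy records (sets of hypothetical gene labels observed together) are assumptions made for this example, not data from the cited studies.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(itemset <= set(t) for t in transactions)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    ante = support(transactions, antecedent)
    if ante == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante


# Hypothetical records: each set lists genes observed together.
records = [
    {"geneA", "geneB"},
    {"geneA", "geneB", "geneC"},
    {"geneA"},
    {"geneC"},
]

rule_support = support(records, {"geneA", "geneB"})          # 2 of 4 records
rule_confidence = confidence(records, {"geneA"}, {"geneB"})  # 2 of the 3 records with geneA
```

A miner in the style of Apriori would keep only the rules whose support and confidence both exceed user-chosen thresholds; the search over candidate itemsets is not shown here.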

Regression
The regression tree is a machine-learning approach for creating prediction models from data by recursively partitioning the data space and fitting a prediction model within each partition. Accordingly, the partitioning can be represented graphically as a decision tree [102]. More generally, regression analysis is a statistical method for estimating and predicting relationships between variables [20]. Regression trees are for dependent variables taking continuous or ordered discrete values, with a prediction error [102]. Regression algorithms include simple linear, multiple linear, logistic and fuzzy regression. In DM, regression algorithms predict hidden data based on continuous training data. In this method, the behaviour of the dependent variable (y) is estimated by independent variables (x) [20]. For example, relationships between vaccination and risk of preterm birth can be revealed by a regression algorithm [103].
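The simplest case named above, simple linear regression, estimates the dependent variable y from a single independent variable x by ordinary least squares. The sketch below is illustrative only; the function name fit_simple_linear and the data values are assumptions made for this example.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements lying exactly on y = 2x + 1.
slope, intercept = fit_simple_linear([1, 2, 3, 4], [3, 5, 7, 9])
predicted = slope * 5 + intercept  # estimate y for an unseen x = 5
```

For these values the fit gives a slope of 2.0 and an intercept of 1.0, so the prediction for x = 5 is 11.0. Multiple linear and logistic regression extend the same idea to several independent variables and to categorical outcomes, respectively.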

Classification
Classification, as a supervised learning technique, is a very popular task in DM. It predicts the class of a user-specified goal feature based on the values of other features, known as the predictive features [104]. Therefore, it assigns objects to predetermined classes. The classification process has two steps: training and testing. In the training phase, the algorithm analyses the data meant for learning and generates a classification model (Table 2). The testing phase checks the accuracy of the model on another data set. Although Naive-Bayes Classifier, SVM, K-Nearest Neighbour (KNN) and Genetic algorithm (GA) are popular methods of classification for gene expression and protein data, decision trees, Bayes classifications and artificial neural networks (ANNs) are the most common classification approaches [24]. Supervised machine learning can be utilized for classification. For example, SVMs are a group of machine learning methods based on the linear separation between groups. The features determining SVMs include (i) the principle of selecting the optimal linear classifier by maximizing the separation margin, (ii) detection of the support vectors, and (iii) using kernels to map the initial variables into a higher-dimensional space in which the linear separation takes place. One of the most common SVM algorithms is Sequential Minimal Optimization (SMO) [105]. Furthermore, decision trees are machine-learning models that structure the knowledge utilized to differentiate between instances in a tree-like structure. Novel examples are categorized by following the tree along the relevant branches, based on the features of the sample. Approaches such as C4.5 begin with an empty tree and repeatedly divide the data, generating branches of the tree, until the examples in a branch are assigned to a leaf of the tree [106].
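The two-step process of training and testing can be sketched with a K-Nearest Neighbour classifier, one of the methods listed above. This is a minimal illustration: the function names, the choice of k = 3, the Euclidean distance and the toy expression-like values are all assumptions made for this example. For KNN, 'training' amounts to storing the labelled samples.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of query by majority vote among its
    k nearest training samples (Euclidean distance)."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, test_x, test_y, k=3):
    """Testing phase: fraction of held-out samples classified correctly."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)


# Hypothetical two-feature samples with two classes.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["low", "low", "low", "high", "high", "high"]

# Held-out samples used only for testing.
test_x = [(1.1, 1.0), (5.1, 5.0)]
test_y = ["low", "high"]

test_accuracy = accuracy(train_x, train_y, test_x, test_y)
```

Keeping the test samples separate from the training samples is the essential point: accuracy measured on the training data itself would overstate how well the model generalizes.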
The Random Forest approach is based on decision trees: multiple trees are built from the training data, and each tree has access to only a randomly sampled subset of the features of the problem. Subsequently, to classify a test sample, each tree predicts a class and the majority class is used [107]. Furthermore, Bayesian classifiers are statistical approaches based on Bayes' theorem [108]. Naive Bayes [109] is the simplest one; it calculates the probability that each input sample belongs to each of the classes. Naive Bayes is a competent machine learning approach across various application domains and scales well. As reviewed in Swan et al. [105], ANNs are inspired by the function of the brain. They include a set of neurons (computational elements) interlinked via a wide variety of connectivity patterns. The activity of a neuron is defined by the signals it receives through its connections. Each individual neuron is a variant of a linear classifier. However, combining multiple layers and neurons can create elaborate nonlinear classifiers suited to complicated problems [110]. Furthermore, rule-based learners include BioHEL [111] and JRip [112]. They aim to automatically produce sets of meaningful rules that determine the allocation of a particular cluster to a given class of a problem [113]. Rule learning encompasses a variety of approaches, which differ in (i) the kind of rule sets they create and (ii) how the rules and the rule sets are established [105].
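The voting scheme of the Random Forest approach can be sketched as follows. This is a deliberately simplified illustration and not a full implementation: each "tree" here is a one-split stump that sees a single randomly chosen feature, only two classes are supported, and all names and data are hypothetical.

```python
import random
from collections import Counter


def train_stump(X, y, feature):
    """One-split tree on one feature: threshold halfway between the class means."""
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    means = {
        c: sum(row[feature] for row, lab in zip(X, y) if lab == c) / y.count(c)
        for c in classes
    }
    low, high = sorted(classes, key=lambda c: means[c])
    threshold = (means[low] + means[high]) / 2
    return feature, threshold, low, high


def stump_predict(stump, row):
    feature, threshold, low, high = stump
    return low if row[feature] <= threshold else high


def train_forest(X, y, n_trees=5, seed=0):
    """Each tree is given a randomly sampled feature (a feature subset of size one)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    return [train_stump(X, y, rng.randrange(n_features)) for _ in range(n_trees)]


def forest_predict(forest, row):
    """Every tree votes for a class; the majority class is returned."""
    votes = Counter(stump_predict(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical expression-like data with three features per sample.
X = [[1, 2, 1], [2, 1, 2], [1, 1, 1], [8, 9, 8], [9, 8, 9], [8, 8, 9]]
y = ["healthy", "healthy", "healthy", "tumour", "tumour", "tumour"]
forest = train_forest(X, y, n_trees=5, seed=0)
```

An odd number of trees avoids tied votes in the two-class case. Practical implementations grow full trees and usually also resample the training data for each tree, which this sketch omits.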
Sequence data analysis is very important in bioinformatics. This task can be dealt with using prediction and classification methods. For example, the research goal may be to assign a protein of interest to a family in order to elucidate the evolution of this protein and to reveal its biological function [8]. Additionally, the investigation of proteins is highly beneficial in biological and medical domains. In biology, for instance, putative amino-acid sequences are often analysed to discover enzyme active sites, and nucleotide sequences are analysed to identify coding or non-coding regions of DNA or to identify the function of particular nucleotide sequences [8,114]. Thus, it is essential to develop an intelligent system for bio-data classification and behaviour prediction (for a review, see [8]). To briefly outline some of the more notable techniques, the Rough Set Classifier technique [115] has been suggested as a novel model for classification of large volumes of protein data based on protein functional and structural characteristics. This model is considered an effective classification tool due to its accuracy and speed. Another, three-phase model for the classification of unknown proteins into known families has been reported [116], in which the noisy sequences are first omitted in order to improve the accuracy and minimize the computational time; second, the important features are acquired and a feature ranking algorithm is used to classify the sequences; and third, neighbourhood analysis is used to classify the sequence of interest into a particular class or family. This model can mine significant relations between a protein sequence and protein classes, subclasses and families. This kind of classification, in addition to data analysis, generates knowledge-based information [8].
Another method for classification of protein sequences is the feature hashing technique [117], which has the advantage of reducing the dimensionality on protein sequence classification tasks. Alternatively, a hybrid GA/SVM algorithm for classification of protein sequences has been proposed [118], in which the protein features that carry precise and sufficient discriminative information are selected for classifying and training the SVM classifier simultaneously. Based on experimental outputs, the hybrid GA/SVM system has been demonstrated to outperform the BLAST and HMMer (Hidden Markov Model-based sequence search) methods [8,118]. Furthermore, Leung et al. [104] used a DM framework for predicting hepatitis B virus (HBV) positive patients and analysing key mutation sites in the HBV DNA sequences. In this approach, two new algorithms were developed based on Rule Learning (RL) and Nonlinear Integral (NI). The NI algorithm performs well using the fuzzy measure and the nonlinear integral because the non-additivity of the fuzzy measure shows the significance of the individual features and their inherent interactions. The authors also used GA for optimization providing multimodal solutions involving sets of best solutions. Moreover, a regularization approach was applied to achieve a solution with the fewest nonzero fuzzy measure values [104].
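The dimensionality reduction offered by feature hashing can be illustrated with a short sketch. It assumes that overlapping k-mers are used as the sequence features and that SHA-256 is used as the hash function; both are assumptions made for illustration, and the cited work [117] may use different features and a different hash. The function name and the example sequence are hypothetical.

```python
import hashlib


def hashed_kmer_counts(sequence, k=3, n_buckets=16):
    """Map all overlapping k-mers of a sequence into a fixed-length count vector."""
    counts = [0] * n_buckets
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # A stable hash (unlike Python's built-in hash(), which is salted per run).
        digest = hashlib.sha256(kmer.encode("ascii")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_buckets
        counts[bucket] += 1
    return counts


# Hypothetical peptide fragment: 12 residues give 10 overlapping 3-mers.
vector = hashed_kmer_counts("MKVLAAGIVKVL", k=3, n_buckets=16)
```

With 20 amino acids there are 8000 possible 3-mers, yet every sequence is represented by a vector of only `n_buckets` entries. The price is that different k-mers can collide in the same bucket.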
Besides, bioinformatics opens a new window for understanding cancer biology through intelligent systems. For instance, Banwait and Bastola [119] employed supervised and unsupervised techniques for precise classification of cancer types and sub-types. Supervised classifier models based on ANN, random forest and SVM have addressed the cancer sub-type classification issues [120,121]. Combining cancer biology knowledge with powerful computational and statistical tools has the potential to discover miRNAs as new biomarkers to detect cancer and cancer sub-types. Also, combining gene and miRNA expression data with computational analysis techniques could help to determine the role of miRNAs in cancer development and metastasis and their capacity to act as therapeutic agents in cancer treatment. Additionally, a challenge in classification of cancer tissue samples based on gene expression data is to create an effective approach for selecting a parsimonious set of informative genes [122]. In this regard, Wang et al. [123] introduced a novel algorithm, the Chi-square-statistic-based Top Scoring Genes (Chi-TSG) classifier, for binary and multi-class cancer classification and informative gene selection based on numerical molecular data. On the other hand, classification of gene expression data is highly important in prediction of disease-related genes. Thus, an effective statistical feature selection method for classification of gene expression data sets was developed based on a statistically defined effective range of features for every class, termed ERGS (Effective Range based Gene Selection), using naive Bayes (NB) and SVM classifiers [120]. Furthermore, classification of RNA structure change by 'gazing' at experimental data was proposed by Woods and Laederach [124].
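To give a feel for how a chi-square statistic can rank genes by how informative they are, the following sketch scores each gene with the standard 2x2 chi-square statistic after splitting its expression values at the median. This is a generic illustration and is not the Chi-TSG algorithm of [123]; the median split, the function names and the toy expression values are all assumptions.

```python
from statistics import median


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic of the 2x2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def score_gene(expression, labels, positive):
    """Cross-tabulate high/low expression (median split) against class membership."""
    threshold = median(expression)
    a = b = c = d = 0
    for value, label in zip(expression, labels):
        high = value > threshold
        if label == positive:
            if high:
                a += 1
            else:
                b += 1
        else:
            if high:
                c += 1
            else:
                d += 1
    return chi_square_2x2(a, b, c, d)


def rank_genes(matrix, labels, positive):
    """Return (gene, score) pairs, most informative gene first."""
    scores = {g: score_gene(v, labels, positive) for g, v in matrix.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical expression values for six samples (three tumour, three normal).
labels = ["tumour", "tumour", "tumour", "normal", "normal", "normal"]
matrix = {
    "GENE_A": [9, 8, 9, 1, 2, 1],  # separates the classes perfectly
    "GENE_B": [5, 1, 5, 1, 5, 1],  # only weakly related to the class
}
ranked = rank_genes(matrix, labels, positive="tumour")
```

Selecting the top-scoring genes from such a ranking is one simple way to obtain a small set of informative genes before training a classifier.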

Neural networks
The term neural network originally refers to a circuit of biological neurons. However, its contemporary use is in the context of ANNs, which comprise programming solutions resembling the function of neurons through artificial neurons, or nodes. In biological networks, electrical signalling and other types of signalling emerge from neurotransmitter diffusion. Hence, neural networks are highly complicated [125]. ANNs have become one of the vital techniques in the bioinformatics field since the development of various biological databases storing DNA/RNA sequences, protein structures and sequences, and other macromolecular structures. Prediction is the most commonly exploited ability of neural networks in bioinformatics, especially in cases where only a limited volume of raw data is available for extracting the prediction model [126]. Table 3 lists a number of applications for neural networks in bioinformatics.
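An artificial neuron, or node, can be sketched as a weighted sum of its inputs followed by a threshold. The example below is a minimal illustration with hand-chosen weights rather than learned ones, and the function names are hypothetical. It shows that a single neuron draws a linear decision boundary, while two layers of neurons can represent the exclusive-or (XOR) function, which no single linear boundary can.

```python
def neuron(inputs, weights, bias):
    """A single artificial neuron: weighted sum followed by a step activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def xor_network(x1, x2):
    """Two hidden neurons plus one output neuron, with hand-set weights."""
    h_or = neuron([x1, x2], [1.0, 1.0], -0.5)    # fires if at least one input is 1
    h_and = neuron([x1, x2], [1.0, 1.0], -1.5)   # fires only if both inputs are 1
    # Output fires when OR is on and AND is off, i.e. exactly one input is 1.
    return neuron([h_or, h_and], [1.0, -1.0], -0.5)


truth_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

In practice the weights and biases are learned from training data rather than set by hand, and smooth activation functions usually replace the step function.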
Machine-learning methods can be used in different areas of bioinformatics: support vector machines for protein fold recognition, hidden Markov models (HMM) for sequence and profile alignment, Bayesian networks for gene regulatory networks [138] and ANNs for protein secondary structure prediction [138], disease classification and biomarker identification [139] (Figure 3). Because genes collaborate in functional molecular networks [140][141][142], network-based analyses have been widely used in cancer research to provide a molecular stratification of cancer patients [141], to predict disease outcome [143,144], to understand tumourigenesis [145] and the mechanism of action of tumour-inducing viruses [146], to predict the carcinogenicity of chemical compounds [147] and to prioritize the damaging effects of cancer mutations [148]. Thus, Horn et al. [149] harnessed the fundamental wiring of genes into functional networks to develop a powerful statistical framework that complements gene-based tests to produce new hypotheses about driver-gene candidates. Several new methods using degree-of-interest (DoI) functions have been reported [150]. They use DoI-based filtering, graph layout and a network comparison method. Furthermore, the RenoDoI framework has been developed as an application to untangle huge and dense networks through DoI functions, and has been integrated in the network visualization framework Cytoscape [150]. Topological network analysis of gene-disease associations can reveal significant properties of the nature of Mendelian diseases [151]. Hence, four different bipartite networks, namely OMIM, CURATED, LHGDN and ALL, have been employed to examine human diseases at a global scale [152]. For further exploration of the diseases and disease-related genes, gene-centric and disease-centric views of the data are produced by projecting the bipartite gene-disease networks onto monopartite networks [152]. Godinez et al. [153] also reported a multi-scale convolutional neural network for phenotyping high-content cellular images. A syntax convolutional neural network (SCNN) based approach has been proposed for extraction of drug-drug interaction (DDI) information from biomedical literature [154].
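The projection of a bipartite gene-disease network onto a monopartite, disease-centric network can be sketched as follows: two diseases are linked whenever they share at least one associated gene, and the link weight counts the shared genes. The function name and the placeholder gene and disease labels are hypothetical and do not come from the cited networks.

```python
from collections import defaultdict
from itertools import combinations


def project_to_diseases(gene_disease_pairs):
    """Return {(disease_1, disease_2): number of shared genes} for linked diseases."""
    diseases_by_gene = defaultdict(set)
    for gene, disease in gene_disease_pairs:
        diseases_by_gene[gene].add(disease)

    shared_genes = defaultdict(int)
    for diseases in diseases_by_gene.values():
        # Every pair of diseases attached to the same gene gains one shared gene.
        for d1, d2 in combinations(sorted(diseases), 2):
            shared_genes[(d1, d2)] += 1
    return dict(shared_genes)


# Placeholder associations: edges of the bipartite network.
pairs = [
    ("GENE1", "disease A"), ("GENE1", "disease B"),
    ("GENE2", "disease A"), ("GENE2", "disease B"),
    ("GENE3", "disease C"),
]
disease_network = project_to_diseases(pairs)
```

Swapping the roles of gene and disease in the same procedure gives the gene-centric view, in which two genes are linked when they are associated with a common disease.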
On the other hand, knowledge about protein secondary structure can help to understand human diseases and to develop therapeutic enzymes and drugs. Hence, various AI techniques are applied for prediction of protein secondary structure. Standard statistical approaches such as discriminant analysis and generalized linear models have limitations when there are highly nonlinear and complicated interactions. Currently, machine learning enables computer programs to improve their performance on biological data sets [138]. Because of the high capability of ANNs to reveal complicated patterns, categorize big data and make precise predictions in huge, complex amino acid/protein data sets, ANNs have become a key technique in computational molecular biology tasks such as DNA and RNA nucleotide sequence analysis, sequence correlations, sequence encoding and result interpretation, and protein structure prediction. Of course, the ANN method has its own strengths and weaknesses (Table 4). Recent gains in accuracy from using statistical context-based scores (SCORPION) [155] and from incorporating tertiary structure information with the ROSETTA de novo tertiary structure prediction approach [156] have shown continual improvements in the ANN method for protein structure prediction. Table 5 shows a comparison of ANN with other machine-learning approaches in protein structure prediction. Additionally, Uziela et al. [167] proposed a model for assessment of protein quality using a deep learning neural network approach. Moreover, forecasting the errors of predicted local backbone angles and non-local solvent accessibilities of proteins using deep neural networks is valuable for prediction, evaluation and refinement of protein structures [168]. Zeng et al. [169] also reported a systematic exploration of CNN architectures for predicting DNA-protein binding.

Performance evaluation and visualization
Because of the numerous descriptive and predictive algorithms for knowledge mining, various performance assessment approaches are required (Figure 1 and Table 2). Performance assessment techniques generally include single scalar and graphical approaches [170]. Specificity, sensitivity and accuracy are in the first group. This group is simple to implement but less efficient in assessment. The second group includes the Receiver Operating Characteristic (ROC) curve, Cost-Line and Lift. This group is more complicated to implement but gives a more informative assessment. A system was suggested for fast extraction of important knowledge about cancer by summarization and visualization [171]. The model employs clinical trial registries and analyses data related to cancer vaccine trials. The system output is used as key information regarding cancer vaccine trials and can be utilized for future vaccine development [171].
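The single scalar measures and the points of a ROC curve can be computed as in the following minimal sketch. The function names, labels and scores are hypothetical, and the sketch assumes that both classes are present in the data.

```python
def scalar_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity and accuracy from true and predicted labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / len(y_true),
    }


def roc_points(y_true, scores, positive):
    """(false positive rate, true positive rate) for each score threshold."""
    n_pos = sum(t == positive for t in y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and t == positive for s, t in zip(scores, y_true))
        fp = sum(s >= threshold and t != positive for s, t in zip(scores, y_true))
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels, hard predictions and classifier scores for eight samples.
y_true = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6]

metrics = scalar_metrics(y_true, y_pred, positive="pos")
curve = roc_points(y_true, scores, positive="pos")
```

The scalar measures summarize performance at one decision threshold, whereas the ROC curve shows the trade-off between sensitivity and the false positive rate across all thresholds.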
After information evaluation, scientific data representation plays an important role. Different techniques of data representation can sometimes influence the interpretation of the results or even change the conclusion of some experiments [172]. However, along with technological developments, data visualization is becoming a bottleneck, and data visualization tools are necessary in the post-genomic era [173]. Consequently, Information Visualization (IV) is vital in presenting experimental results in the bioinformatics area [172]. Furthermore, visualization, as an advantage for an algorithm, is very important in DM [20]. IV methods are understood as computerized techniques, such as data selection, data transformation and data representation in a visual form, that facilitate human interaction for discovering and understanding the data (reviewed in [174]). IV approaches build on two main properties of the human visual system: first, its broad bandwidth, which allows it to process a huge amount of information at one time; second, its ability to distinguish trends and patterns in visual attributes such as the shape, location, size and colour of objects. Thus, IV techniques have two major objectives: first, to present a huge amount of information at a time, which would not otherwise be readily perceivable by humans; second, to retrieve useful knowledge from a huge amount of information by recognizing patterns and trends [174]. There is a wide variety of IV methods, and various classifications have been developed from different angles. For instance, Shneiderman's taxonomy [175], which is based on data types and tasks, includes seven data types, namely, temporal data, tree data, multidimensional data, network data, 1-D linear data, 2-D planar or map data and 3-D data, and also seven tasks, namely, zoom, history, details-on-demand, filter, overview, extract and relate [174].
On the other hand, IV approaches are categorized into six groups based on data visualization methods: pixel-oriented, geometric, hierarchical, hybrid, icon-based and graph-based techniques. Besides these dimensions of IV techniques, other aspects can also be used in IV taxonomy, such as distortion, data preprocessing and dynamic/interaction techniques [176]. Another taxonomy has been proposed based on a 'data state reference model', describing four steps of data state in IV and three transformation operators between every two adjacent steps [177]. A unified taxonomic framework from the perspective of IV system designers has also been proposed [178], including further perspectives such as display dimensions, data relationships, user's skill level and context factors [174].
Hérisson and Gherbi [179] suggested a method for the three-dimensional visualization of the DNA molecule. Their method is based on a biological 3D model predicting the complex spatial trajectory of large naked DNA. This method could help to achieve a general view of the sequence instead of the textual presentation. Thus, a novel vision and an original method emerge. This method is appropriate for conducting original bioinformatics research and for analysing the spatial architecture of the genome [172]. Moreover, a new visual method and software for analysing residue mutations has been developed. This approach can combine various biological visualizations such as one-dimensional sequence views, three-dimensional protein structure views and two-dimensional views of residue interaction networks and aggregated views [180]. One way to analyse huge and complicated datasets is to generate integrated data-knowledge networks that allow biomedical researchers to analyse the results of an experiment in the context of existing knowledge. Hence, Vehlow et al. [181] proposed a visual analytics method integrating interactive filtering of dense networks according to degree-of-interest functions with attribute-based layouts of the resulting sub-networks. Comparing multiple sub-networks with different analysis facets was provided through an interactive supernetwork that could integrate brushing-and-linking methods for highlighting components across networks [181]. Additionally, for multivariate data visualization, Kuntal and Mande [182] offered a web-based platform (Web-Igloo) which is useful for visual DM.

Table 5. Comparison of ANN with other machine-learning approaches in protein structure prediction.
Bayesian networks (BN). Features: The first machine-learning method in protein structure prediction was partly based on Bayesian statistics [157]; BN performs well over huge databases. Comparison with ANN: Less opaque [158].
Hidden Markov models (HMM). Features: HMM (a probabilistic model) can provide relevant information about the sequence-structure relation [158]; its accuracy is less than that of the other machine-learning methods. Comparison with ANN: ANN is more successful [159].
Support vector machines (SVM). Features: A supervised learning model; associated with learning algorithms and classification and regression analysis in its construction of a hyperplane; can handle high-dimensional data; flexibility in modelling diverse types of data; high accuracy. Comparison with ANN: SVM is superior in predicting the location of turns [160]; in ubiquitin protein structure prediction, SVM is superior to both ANN and HMM [161]; SVM requires a relatively small training set to avoid overfitting of the data [162]; ANNs have much better accuracy and take much less training and computation time [163]; SVMs require much larger memory and a more powerful processor [163]; SVM outperformed ANN with an overall accuracy of 89.3% in identification of lipid-binding proteins (LBPs) from non-LBPs [164].
Other. Nearest-neighbour method had an overall three-state accuracy of 72%, higher than neural network [165]; nonlinear dimensional reduction in protein secondary structure prediction yielded similar results compared to ANN [166].

Future perspectives
In spite of great advances in the area of bioinformatics, various issues still remain to be addressed. High-throughput sequencing, with its growing set of tools and decreasing costs, has been widely used. Scientists have been able to sequence entire genomes, analyse DNA sequence variation, quantify transcript abundance and understand mechanisms such as alternative splicing and epigenetic regulation using the first (Sanger) and the second (next) generation sequencing technologies [183]. However, NGS still faces important challenges, such as data processing and storage. Genome interpretation is another major challenge, which involves not only the analysis of genomes for functional elements, but also the understanding of the importance of variants in individual genomes for phenotypes and disease. On the other hand, the next generation of modern and effective sequencing technologies can uncover a great deal of elusive knowledge regarding the repetitive and noncoding elements. Developments in TGS (Third Generation Sequencing) promise synergies with NGS technologies to raise our understanding of human/animal/plant genomics and genetics. NGS has revolutionized genomics-related research, and it is believed that NGS discoveries will continue in the near future. Constant developments in Pool-seq (whole-genome sequencing of pools of individuals) will raise its implications in the future. First, the availability of novel software will accelerate the analysis of Pool-seq data. Then, analyses of low-frequency variants will become typical through the use of new tools. The third development concerns the haplotype phasing of Pool-seq data [184]. Although existing methods are based on sequence information of founder haplotypes, an extension relaxing this requirement to only a subset of the haplotypes in the pool will make this method more general and lead to more precise estimates.
Ultimately, the availability of longer sequencing reads will accelerate the reconstruction of haplotype information from Pool-seq data. This can be achieved through technological developments (such as Nanopore and PacBio sequencing), and through new library preparation protocols (such as Illumina's Synthetic Long-Read technology), allowing haplotype sequencing for DNA fragments of up to 10 kb with the current sequencing technology. Such technological advances, along with the wide variety of biological research questions requiring huge sample sizes, mean that Pool-seq will continue to complement the sequencing of individual genomes in future [185].
Single-cell sequencing technologies have two main weaknesses: low genome coverage and high amplification bias. Despite the existence of some bioinformatics tools, new algorithms and software should be developed in order to analyse single-cell genomics data. In particular, tools are required to assess the performance of different single-cell sequencing technologies. Additionally, technical standards are needed for evaluation of the genome coverage and amplification biases. In spite of these limitations, we expect that the problems in nucleic acid sequence analysis of single-cell genomic DNAs and RNAs will be resolved in future via novel advancements in microfluidics and NGS technologies.
Various plant genomes have been sequenced at different levels of completion and many plant genome projects are underway [186][187][188]. Consequently, SNP discovery has become possible even in complex genomes. However, at present, there are limited SNPs from crops. Hence, there is a wide scope for production of reference genome sequences and discovery of such SNPs using NGS technologies for further understanding of plant genetics and genomics. Moreover, other issues that should be addressed are the ascertainment bias of popular bi-parental populations and the low validation rate of some array-based genotyping platforms. On the other hand, the area of epigenetic regulation of many genome components can be understood comprehensively by achieving deeper and more accurate sequencing [13].
What is more, various studies on protein classification algorithms show that no method has been developed for the classification of the proteins based on their amino-acid sequence. Therefore, novel methods could be created for the classification of the proteins based on their sequences, rather than their functional and structural features. Moreover, new ANN-inspired approaches and strategies can be used to offer predictions for higher levels of protein structures (tertiary and quaternary). Thus, protein function can be revealed and drug/enzyme therapy could be considered in the future.
Assessing the efficiency of bioinformatics methods is very important in the future improvement of the present applications and tools. For example, a comprehensive assessment is essential for obtaining insight into the effect of mutations, how they should be best mapped onto the sequence, structure, and network presentations, and how they should be combined into the visual layout [180]. Furthermore, the aggregation of network areas is another issue that can reduce the visual complexity. In fact, identifying areas of particular interest for evaluation of the potential influence of mutations could make mutation patterns with specific functional consequences more apparent, especially, in the analysis of multiple proteins [180]. Additionally, it is thought that improving the software integration of various applications in an automated way would involve better synchronization over linked views and automated retrieval of external data [180]. Lastly, based on the present evidence, it is our belief that the discoveries in the wide range of bioinformatics domains will continue in the next decade.

Conclusions
The development of omics technologies has led to a flourishing of high-throughput genome-wide scanning data. Consequently, both bioinformatics and DM are fast-moving research areas. They need various skills for gathering and storing, managing and analysing, and interpreting and disseminating biological information. Furthermore, high performance computers (HPC) and innovative software are required to handle and organize tremendous quantities of genomic and proteomic data. Besides low cost and high speed, another motivating reason for wide-ranging computational screens of genomic data is the fact that the complexity and extent of biological systems might best be discovered by simultaneous consideration of a broad range of genome-scale data. Hence, it is essential to explore the hot research issues in bioinformatics and to develop innovative and intelligent data-mining techniques for effective and scalable bio-data analysis.