Genetic algorithm-based method for forest type classification using multi-temporal NDVI from Landsat TM imagery

ABSTRACT Remote-sensing technology has been a useful tool for mapping and characterizing forest cover types and species composition, providing valuable information for effective forest management. This study investigates the application of a genetic algorithm (GA)-based approach on Normalized Difference Vegetation Index (NDVI) to separate local forest communities at Huntington Wildlife Forest (HWF), located in New York State of the United States, into deciduous, mixed/coniferous and nonforests using Landsat TM imagery. Overall accuracy, producer’s accuracy, user’s accuracy and kappa coefficient of agreement are employed to assess the performance of the proposed method. Its overall effectiveness is supported by the accuracy of 80.41% and kappa coefficient of 0.56, and its capability of separating the forest cover types is endorsed by the class-wise accuracy measures. This method shows advantages in its limited demands for input features, that only multi-temporal NDVI indices are required; and in its simple and efficient mechanism, which refers to threshold optimization and feature selection.


Introduction
Forest type composition provides valuable information for forest inventory and management, thus it is increasingly desirable to obtain the spatially explicit forest type distribution for a range of applications such as forest monitoring (Lehmann et al. 2015;Liu, Im, and Quackenbush 2015;Sankey et al. 2017), natural resources (Corona 2016;Li et al. 2014), biodiversity and ecosystems (Barlow et al. 2016;Laurin et al. 2014; Van der Plas et al. 2018) and climate change (Charney et al. 2016;Asner et al. 2017). Traditional forest mapping methods such as fieldbased investigation are usually limited by the great demands for time and workload, as well as the difficulty to conduct over large geographic areas and precipitous topography. Advanced remote sensing technology offers a unique capability of large-scale forest mapping in a costeffective manner (Gudex-Cross, Pontius, and Adams 2017; Khatami, Mountrakis, and Stehman 2016;Zhen, Quackenbush, and Zhang 2016).
The usefulness of multi-temporal Landsat imagery for forest species/type classification has been extensively explored, and favorable performances have been achieved at varied geographical scales (Gudex-Cross, Pontius, and Adams 2017;Li, Im, and Beier 2013;Zhu and Woodcock 2014). In the Landsat family, the Landsat Thematic Mapper (TM) data have been proved ideal for mapping forest type composition at local and regional scales, considering its moderate spatial resolution (30 m) and temporal frequency (16 days) (NASA 2018), as well as the low cost and wide availability (Mancino et al. 2014;Foody and Hill 1996;Li et al. 2018).
Existing methods for forest types mapping using Landsat TM data range from statistical-based classifiers such as clustering (Lo and Choi 2004) and maximum likelihood (Hagner and Reese 2007), to machine learning algorithms such as neural network (Kempeneers et al. 2011), decision tree (Sesnie et al. 2008), random forest (Zhu and Woodcock 2014) and support vector machine (Li, Im, and Beier 2013). Comprehensive comparisons of algorithm performances have been made, but so far no agreement has been reached regarding the most appropriate classifiers for any given task (Phiri and Morgenroth 2017).
Nowadays there is growing interest in machine learning approaches fusion, aiming at improving classification accuracy. Progress has been made in various studies; for example Gong, Im, and Mountrakis (2011) coupled genetic algorithm and artificial immune networks to improve land cover classification, and Chatterjee et al. (2016) employed genetic algorithm to optimize the input weight vector of a neural network classifier. However, enhancements of these hybrid machine learning approaches are somehow limited, and at the cost of algorithmic complexity (Chatterjee et al. 2016). In addition, an evident limitation of the above-reviewed classifiers lies in the high dimension of input features, which could result in decreasing classification accuracy due to the Hughes effect (Zhu and Liu 2014). A common solution is to employ feature selection that keeps the most informative features (Zhu and Liu 2014;Pal and Foody 2010).
Genetic algorithms (GAs) are a family of computational models motivated by the process of natural selection in evolutionary biological systems (Mitchell 1998). As one of the most popular heuristic algorithms for optimization, GA shows its advantage of avoiding the convergence to a locally optimal solution, which is easily incurred in other algorithms such as hill climbing (Yuret and Maza 1993). Its ability to handle the search space in many directions simultaneously has greatly increased the probability of finding a global optimum (Gen and Cheng 2000). Thus it has been reported as a robust and efficient optimization technique in a wide range of applications, and has been commonly utilized in parameter optimization, optimum threshold identification and feature selection in classification tasks (Chatterjee et al. 2016;Gong, Im, and Mountrakis 2011;Van Coillie, Verbeke, and De Wulf 2007). In this study, we are interested in examining its utility in detecting optimum thresholds as well as useful features for forest type classification. We expect a simple and low-demanding method that minimizes the data inputs and algorithmic complexity, while still produces satisfactory results.
Normalized Difference Vegetation Index (NDVI) was found highly correlated with vegetative characteristics such as vegetation cover (Huang et al. 2017), leaf area index (Xavier and Vettorazzi 2004), biomass and carbon stock (Liu, Su et al., 2015) and crown density (Zaitunah, Samsuri, and Safitri 2018). Its weak sensitivity to atmospheric and topographic effects enabled consistent performance in multi-temporal analysis and scene-to-scene comparisons (Krishnaswamy, Kiran, and Ganeshaiah 2004).
In the past few decades, much emphasis has been placed on using multi-temporal NDVI as the major indicator for vegetation mapping because of its seasonal distinction (Wang and Tenhunen 2004). Krishnaswamy, Kiran, and Ganeshaiah (2004) accomplished vegetation cover mapping using multi-season NDVI data in diverse tropical deciduous ecosystems. Mancino et al. (2014) successfully detected the natural forest expansion area via Landsat TM imagery and NDVI differencing techniques. Smoothed NDVI data were evaluated regarding its potential of distinguishing various land cover types in a vegetated area in Shao et al. (2016). Therefore considering the strong predicting ability in vegetation mapping of NDVI, multi-temporal NDVI bands are chosen as the predicting variables for forest type classification in this study.
The objectives of this research include (1) separating local forest communities at the Huntington Wildlife Forest (HWF) into deciduous, mixed/coniferous and nonforests types using multi-temporal Landsat-derived NDVI, (2) assessing the effectiveness of the developed GA-based method for forest type classification when using NDVI as the sole predictor and (3) providing interpretations of forest cover conditions over HWF for long-term continuous inventory and management.

Study area and data
Located in the central Adirondack Park in New York State of the United States, the Huntington Wildlife Forest is a 6,000ha research forest with a Continuous Forest Inventory (CFI) system, which has been updated with new measurements every 10 years since 1950. According to the manual of 1991 Continuous Forest Inventory Measurement Procedures for Huntington Wildlife Forest, there are 285 CFI plots in this study site, and each plot has a radius of about 17 m, suggesting the area of a plot (908 m 2 ) approximates to that of one pixel in Landsat TM images (900 m 2 ). Therefore, it is reasonable to represent a plot with a pixel in the Landsat imagery. HWF is an ideal site for remote-sensing studies because of the availability of forest inventory and environmental monitoring data on-site. Figure 1 demonstrates the location and extent of the study area along with the 285 CFI plots.
Two types of datasets are collected for this study: Landsat TM images (Path 14, Row 29) covering the HWF area and CFI data of each study plot located in this site. Six Landsat images of leaf-on and leaf-off seasons (i.e. images acquired on 20th May, 8th August and 9th September for leaf-on season, and on 28th January, 17th March and 30th December for leaf-off season) in 1991 are downloaded from the website of United States Geological Survey (USGS, https://glovis.usgs.gov/app). And the red and nearinfrared bands of these images are of particular interest. The CFI data on HWF include a manual instructing the CFI measurement procedures, a GIS shapefile containing the geographical location of the study plots, and a spreadsheet recording the in situ measurements obtained in 1991.
In the in situ measurement records, tree species of each plot is determined by the surveyors through visual inspection of its dominant and codominant canopy, and is used as the reference data. Summarizing the tree species, three major forest types are identified: northern hardwood, hardwood/conifer and conifer. Major hardwood include sugar maple, red maple and yellow birch; common conifers are red spruce, balsam fir, eastern hemlock and white pine (Demers and Breitmeyer 2008). For practical concern, which is to balance the sample size of the forest cover types, in this study tree species are categorized into deciduous and mixed/coniferous, and the nonforested plots are assigned nonforests (Table 1).

Data preprocessing
Preliminary processing steps were performed upon the acquired data. First using the extent of the GIS shapefile of CFI plots, the original Landsat TM images were masked for the exact study area. Then the spectral reflectance values of visible red (R) and near-infrared (NIR) bands were converted from the digital numbers of the original TM images. NDVI, the normalized difference of near-infrared and visible red reflectance values were then calculated as follows: The above equation indicates that NDVI for a given pixel always ranges from minus one (−1) to plus one (+1); and nonvegetated pixel produces a value close to zero. Negative values are indicative of clouds, snow, water and other nonvegetated, nonreflective surfaces, while positive values imply vegetated or reflective surfaces (Burgan and Hartford 1993). Layers containing the NDVI value for every single pixel were generated for the 6 months chosen, and then by referring to the locations of the study plots, pixel values on these NDVI layers were extracted. These NDVI values at each plot were treated as the predicting variables for forest type classification in this study.
The proportions of deciduous, mixed/coniferous and nonforests are 63.16%, 35.09% and 1.75%, respectively. Considering such uneven distribution among different classes, equal percentage allocation on these forest types were applied for dividing subdatasets for calibration and validation. In this way, the original data were split into two datasets using stratified random sampling, keeping the ratio of sample counts for calibration and validation 2:1. The allocation of the two subdatasets is summarized in Table 2.

Feature selection
Feature selection aims at choosing a small subset of features that 'ideally is necessary and sufficient to describe the concept, to speed up learning, and to improve concept quality' (Kira and Rendell 1992). Thus, for optimized performance, best combination of NDVI values from the 6 months as well as the sequence of these variables when importing to the model were sought by exploiting statistical approaches. Boxplots, Analysis of Variance (ANOVA) and Canonical Discriminant Analysis (CDA) were utilized in this process. Boxplots visually depict the variables through their quartiles; ANOVA analyzes the differences among variable means and CDA augments the understanding of the contributions from each variable to the canonical components.

Conceptual model description
Normally a GA model consists of four stages: Initialization, Selection, Crossover and Mutation. The core of GA is to search for an enhanced fitness of a population of attempted solutions toward a convergence at the global optimum (Tseng et al. 2008).
In the Initialization stage, a population of attempted solutions, called chromosomes, which are randomly generated in the solution space, is selected as the starting point of the search. In this study, one chromosome is represented by the array of NDVI values of the month(s) that are used for searching. After Initialization, each chromosome is evaluated by a user-defined fitness function. The role of the fitness function is to assess the performance of the chromosome as a solution for the targeted problem (McCall 2005). In the Selection phase, a common rule of selecting parents for reproduction is that only the best members will be retained and will pass their characteristics from generation to generation. This selective process encourages the bestperforming chromosomes to contribute the most to the next generation over time. Then Crossover and Mutation steps are applied to create new generations. The Crossover operation produces a new offspring from two randomly selected 'good parents', while Mutation randomly chooses the chromosomes in the population, and arbitrarily changes a subset of the genes on the chromosomes.
Briefly, in a GA, a population of potential solutions evolves over successive generations using a set of genetic operators. Figure 2 demonstrates the structure of a basic GA model.

Genetic algorithms model
In this study, A GA model is trained to classify the vegetated area into deciduous and mixed/coniferous. In this model, various combinations of NDVI values in different months are tested to find the best fitness. Figure 3 displays a flow diagram of the key phases in the proposed GA model. In this model, genes on the chromosomes refer to the NDVI variables. The fitness function for determining the ability of a particular chromosome to survive is represented by the classification accuracy in the calibration process. The number of generations that the chromosomes undergo, and number of population of the chromosomes can be arbitrarily predefined or based on a threshold of fitness. For this study, these two parameters are set as static numbers of 50 as maximum, for the sake of simplicity and efficiency.
In Selection phase, fitness for the formed chromosomes is firstly calculated. The chromosomes with the highest five fitness values are excluded from the processes of regeneration and directly proceed to the next generation, while others will be regenerated aiming for better replacements. Crossover and Mutation are controlled by random processes that depend on crossover and mutation rate. Crossover simulates the sexual reproduction while Mutation imitates asexual reproduction. Considering their practical meanings, we chose a crossover rate of 0.8 and mutation rate of 0.2, consistent with tggghose in Im, Lu, and Jensen (2011). When a chromosome is selected for Crossover, it switches certain genes with others on chosen chromosomes by predefined rules, which depends on the season of NDVI. If a chromosome is picked for Mutation, certain genes are replaced by either random values or a percentage change of their original values. In the Mutation stage chromosome with the highest fitness is exempt from this operation, ensuring that the best set of genes is kept until it is surpassed. This design enables the GA model to produce thresholds of NDVI bands corresponding to the highest classification accuracy.

Results and discussion
Results of this study consist of four components. First, thresholds for separating the forested area and nonforests are found by statistical analysis. Second, thresholds for separating deciduous and mixed/coniferous are generated from the developed GA model. Third, overall performance is evaluated by combining the results of the above two-step separations, which is referred to as the GA-based classification method. Further, the entire study area is classified into deciduous, mixed/coniferous and nonforests using the best fitted thresholds, which produces a classification map showing the geographical distribution of forest types of HWF in 1991.

Statistical analysis
For uncovering the distributions of the NDVI values of each month, boxplots are drawn using the 120 deciduous plots, 65 mixed/coniferous plots and three plots of nonforests in the calibration dataset. These boxplots are illustrated in Figure 4, from which general interpretations arise on the range of NDVI values across the 6 months, and on the distinction of NDVI values among the three types within each month. These understandings help to find the thresholds for extracting the nonforests category, and to decide the combinations of the six variables as the input data in the following searching step in the GA model.
In the graphs of May and September, the occurrence of clear distinctions between nonforests and the other two types (deciduous and mixed/coniferous) suggests the usefulness of NDVI in these 2 months for extracting the nonforests type from the whole dataset. However, a few outliers exist in the deciduous category that might influence the accuracy. The thresholds derived from NDVI bands of May and September are shown in Table 3. Another important observation from the graphs is that the NDVI of deciduous are lower than that of mixed/coniferous in leaf-off seasons, while higher in leaf-on seasons. Therefore, in the GA model, only two kinds of crossover scenarios are considered since the season a month belongs to is known, which greatly simplifies the model (see the Crossover stage in Figure 3).
Results from ANOVA suggest a rejection of the hypothesis that the mean NDVI of the 6 months are equal. This rejection means that the tested data are of little correlation with each other, and further supports that all the six NDVI sets are suitable for being the predicting variables in the subsequent classification. In this study, CDA is used for discovering the importance of the variables. For this purpose, canonical coefficients for the first and second canonical components are inspected: the larger the absolute value of the coefficient, the more contributions it makes to the corresponding canonical component; in other words, the better the variable is to distinguish different classes. Thus, the canonical coefficients lead to a ranking list of the months showing their importance from high to low: March, December, August, September, January and May. This list assists to determine the combinations of months as inputs to the GA model.

Genetic algorithms model
Without using any prior knowledge, first for each month, its NDVI is imported to the GA model individually, and the threshold along with its corresponding accuracy is gained from the model. Since this model executes heuristic search, five trials are performed for each month, and the mean value of the thresholds corresponding to the highest accuracy is considered as the threshold for that month. Results of using single months as inputs are shown in Table 4.
The rank of the months offering accuracy from high to low is: March, December, August, January, September and May. This list is almost the same as that from CDA, except for the switched places of January and September. Thus in the next step, March is used as the first input, and then the rest of the months are added to the model one by one according to the ranking lists given by CDA and GA modeling with single months, to search for the highest accuracy with a combination of the variables. In this process, we also run the model for five times, and then find the best threshold of each month in the combinations that produce the highest accuracy. Table 5 shows the results of this process. It suggests that the highest accuracy (89.73%) occurs when the combination of NDVI values in March and August are used, and the thresholds are 0.1451 and 0.7349, respectively. Therefore, for classifying the forested area into deciduous and mixed/ coniferous, NDVI values in March and August are chosen by the GA model.
Summarizing the results from statistical analysis and GA modeling, thresholds of NDVI for classifying the three forest types are found, providing an optimized solution for the classification (Table 6). For separating forested area (i.e. deciduous and mixed/coniferous) and nonforests, NDVI values of May and September are used. Specifically, plots with NDVI values of May larger than 0.45 and NDVI of September larger than 0.35 are   It is suggested that for identifying nonforests from the forested area, NDVI in leaf-on season is more useful than that of leaf-off season. Because in leaf-on season, all trees grow leaves that causes the NDVI values generally high in the vegetated area, while the NDVI values in the nonforested area stay low. This enhanced discrepancy of NDVI promotes the capability to separate the forested and nonforested area by simple thresholding. As for discriminating deciduous and mixed/coniferous, combined NDVI in both leaf-on and leaf-off seasons are necessary. This is supported by the NDVI distributions in Figure 4, where in leaf-off season, NDVI of deciduous is generally lower than that of mixed/coniferous, and in leaf-on season the situation is exactly the opposite.

GA-based method evaluation
The proposed GA-based method contains two steps, one for extracting nonforests type from the whole data using statistical approaches, and one for separating deciduous and mixed/coniferous using a GA model. This classification method is evaluated using the prepared validation dataset. Applying the thresholds of NDVI in March, August, May, and September in Table 6 to the validation dataset, the 97 CFI plots are classified into deciduous, mixed/coniferous and nonforests.
Comparing the results with reference data, a matrix showing classification results and accuracy statistics is created (Table 7), providing producer's accuracy and user's accuracy of each category, as well as overall accuracy and kappa coefficient. The overall accuracy indicates that 80.41% plots are correctly classified. The kappa coefficient (0.56) suggests a moderate agreement between classification results and reference information. The accuracy of the deciduous type is the highest: 58 out of 60 deciduous plots are correctly classified. As for mixed/coniferous, only 54.29% of the plots are classified correctly, and 16 plots are misclassified into the deciduous class. For the two validation plots of nonforests, the 50% accuracy means one plot is classified correctly and the other one is placed to another class (deciduous). The low accuracy of nonforests is greatly attributed to its small sample size. To offset this deficiency, we may manually add additional plots that locate in the open field or lakes, where visual interpretation of the Landsat images is sufficient for this action.

Forest type classification
A classified map containing deciduous, mixed/coniferous and nonforests types over the study area is produced by utilizing the proposed GA-based classification method, which is shown in Figure 5. In the figure, detailed distribution of the three types throughout the entire area is attained. Forested area covers more than 90% of the map, and deciduous and mixed/coniferous spread through the entire site, mostly aggregated by type. Large area  belonging to nonforests are lakes and open fields. By checking the Digital Elevation Model (DEM) covering this area, we observed that mixed/coniferous mostly grew in higher elevations, and low-lying area is generally dominated by deciduous type. Area and proportion of each forest type are calculated and exhibited in Table 8. These results demonstrate that the mixed/coniferous is the dominant category over the study area with a proportion of 64.72%, and there are 28.09% area covered by the deciduous. In addition, 7.19% of the area belong to the nonforests type. These interpretations have provided valuable information of the forest cover conditions over HWF in the year 1991, and will contribute to long-term continuous inventory and forest management in this area.

Conclusion
In this study, a genetic algorithm-based method was proposed for classifying the Huntington Wildlife Forest into deciduous, mixed/coniferous, and nonforests types, using multi-temporal NDVI derived from the Landsat TM imagery. Statistical approaches such as ANOVA and canonical discriminant analysis were utilized for priori knowledge of the NDVI variables, facilitating subsequent decisions on selecting different groups of predicting variables for model calibration. Results show that the proposed method successfully achieved a favorable overall classification accuracy of 80.41%, and the kappa coefficient of 0.56 indicates moderate agreement between the classification results and reference. As the sole type of input variable, multi-temporal NDVI proved its capability of separating the forest types as well as extracting nonforested area effectively. In addition, a map displaying the geographical distributions of the three forest types of the entire study area was produced, to promote our understanding of the studied forest and to assist its future monitoring and management. Future work include applying the developed method to the same study area at 10 years' interval, taking advantage of the decennial forest inventory system of HWF. Also, there are potential improvements for the GAbased method in its flexibility and robustness, for example testing other forms of variables. Figure 5. A map of the study area composed of deciduous, mixed/coniferous, and nonforests types.