Hyperspectral wavelength selection for estimating chlorophyll content of muskmelon leaves

ABSTRACT Quantifying chlorophyll content, an effective indicator of disease as well as nutritional and environmental stresses on plants, may enable optimal fertilization while managing crops. Hyperspectral remote-sensing is commonly used to estimate chlorophyll content. In this context, the process of variable selection is crucial since it is necessary to identify variables relevant to chlorophyll and eliminate redundant variables. In this study, 14 wavelength selection methods based on partial least squares (PLS; namely, backward variable elimination, backward and forward interval-PLS, competitive adaptive reweighted sampling, genetic algorithm, iterative predictive weighting, loading-weights, PLS with Martens’ uncertainty test, regression coefficient, regularized elimination procedure, sparse-PLS, sub-window permutation analysis, uninformative variable elimination and variable importance in projection) were combined with one of five machine learning algorithms (Cubist, deep belief nets, random forests, stochastic gradient boosting and support vector machine) and then evaluated. According to the ratio of performance to deviation (RPD), the best combination of variable selection method and machine learning algorithm was regularized elimination procedure and Cubist achieving an RPD of 1.76 and an RMSE of 2.42 μg cm−2.


Introduction
In Japan, muskmelons (Cucumis melo L.) are cultivated in glass greenhouses, which offer suitable light for their growth and facilitate controlling nutrient conditions. The quality of muskmelons may also be improved by inducing plant stress (Sugiyama et al., 2008). However, production depends on the experience of skilled farmers because the stress status of muskmelon plants is not monitored regularly and because nitrogen deficiency reduces total biomass markedly, leading to early mortalities.
Chlorophyll content has been used to evaluate plant physiological activity and is an indicator of muskmelon yield and potential quality (N.L. Chen et al., 2010). Furthermore, chlorophyll is closely associated with nitrogen, an essential plant nutrient, and changes in chlorophyll content have therefore been used to evaluate nutritional and environmental stresses on plants (Datt, 1999).
Hyperspectral reflectance has been used for estimating chlorophyll content in various plant species, and hyperspectral remote-sensing can be an effective tool for measuring chlorophyll content in the field (Golhani et al., 2019; Sonobe et al., 2021). Some previous studies were based on datasets composed of measurements taken under relatively low light-stress conditions (Feret et al., 2008); however, some stresses have been used to improve the quality of crops, and they can change the chlorophyll a/b ratio (Terashima & Hikosaka, 1995). In this study, the potential of hyperspectral remote-sensing for monitoring crops under nitrogen stress was evaluated.
The basic principle of variable selection methods is to select a small number of representative variables, thereby yielding more concise and effective spectral data. They play important roles in multivariate analysis because removing redundant variables is effective for producing better prediction results (Balabin and Smirnov, 2011). Therefore, combining pre-processing of the original reflectance data with wavelength selection could be a more powerful tool for improving the usability of the original reflectance data. Partial least squares regression (PLSR) is an effective technique for identifying a subset of important variables and has been used for dimensional reduction to remove useless or irrelevant information, such as noise and background reflectance, which degrade the predictive ability of a model. Wavelength selection methods based on PLS can be divided into three groups: filter, wrapper and embedded methods (Mehmood et al., 2012; Pierna et al., 2009).
Studies have shown that specific combinations of reflectance data and machine learning are effective for assessing vegetation properties (Samat et al., 2019; Xie et al., 2021). Random forests (RF; Biau & Scornet, 2016), support vector machine (SVM) with a Gaussian kernel function (Burges, 1998), Cubist and stochastic gradient boosting (SGB) have performed well in identifying vegetation properties (Breunig et al., 2020). Machine learning algorithms based on an artificial neural network (ANN) have also been applied to analyse remote-sensing data to estimate chlorophyll content (Lu et al., 2020). Deep belief nets (DBN) are among the most powerful tools for regression modelling. They have superior interpretability, convergence, computation effort and accuracy, although convolutional neural networks (CNN) supersede DBN in image processing applications (Romeo et al., 2020; Uddin et al., 2020). While earlier studies demonstrated that DBN-based models perform well, different strategies are needed to attain high accuracies. Cubist, RF and SGB place special importance on several particular variables, whereas the importance of all wavelengths was less than 20% for SVM and DBN (Sonobe et al., 2018, 2020). Thus, SVM or DBN may be more useful when the shifts of the green peak or red-edge inflection point (REIP) are large, whereas Cubist, RF and SGB may be powerful tools when the shifts are small and their changes are effective for evaluating specific biochemical properties (Sonobe et al., 2018, 2020).
We therefore aimed to identify a combination of an effective PLS-based variable selection method and machine learning algorithm for estimating chlorophyll content from muskmelon leaves using hyperspectral reflectance.

Measurements and datasets
Experiments were conducted using muskmelon plants in a greenhouse at Shizuoka University in Shizuoka, Japan. Between 19 August and 16 September 2020, six nitrogen treatments were applied: Enshi (horticultural experimental station) formula solution (treatment A), two-thirds of the Enshi formula solution (treatment B; this has been used as a standard nutrient solution for muskmelon cultivation), one-third of the Enshi formula solution (treatment C), a sixth of the Enshi formula solution (treatment D), a twelfth of the Enshi formula solution (treatment E) and a treatment without nitrogen (treatment F). Reflectance and chlorophyll content were measured from 103 leaves. The numbers of leaves for each treatment are shown in Table 1. On 26 August and 02 September, leaf discs were sampled from the 8th leaves, and discs from the 20th leaves were added from 09 September. However, reflectance could not be measured for two 20th leaves under treatment E and three 20th leaves under treatment F on 09 September because the spectrometer overheated.
Hyperspectral reflectance in 1 nm steps across the entire wavelength domain from 400 to 2500 nm was obtained from each leaf using a leaf clip and a FieldSpec4 spectroradiometer (Malvern Panalytical, Almelo, Netherlands). The splice correction function implemented in ViewSpec Pro (Analytical Spectral Devices Inc., USA) was applied to minimize the inconsistency among the three detectors, which cover the visible and near-infrared (VNIR), short-wave infrared 1 (SWIR1) and short-wave infrared 2 (SWIR2) portions of the electromagnetic spectrum. The obtained reflectance was denoised by applying de-trending (DT), which is a simple baseline correction method: the baseline is assumed to be a second-degree polynomial function of wavelength and is subtracted from the spectrum (Barnes et al., 1989; Candolfi et al., 1999).
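The de-trending step can be illustrated with a minimal sketch. It assumes NumPy and a synthetic spectrum; the function name `detrend` and the synthetic baseline and absorption feature are hypothetical and are not taken from the study's own processing chain.

```python
import numpy as np

def detrend(wavelengths, reflectance, degree=2):
    """Subtract a fitted polynomial baseline (second degree by default) from one spectrum."""
    # Centre and scale the wavelengths so the least-squares fit is well conditioned.
    x = (wavelengths - wavelengths.mean()) / wavelengths.std()
    coeffs = np.polyfit(x, reflectance, degree)
    fitted_baseline = np.polyval(coeffs, x)
    return reflectance - fitted_baseline

# Synthetic example: 400-2500 nm in 1 nm steps (2101 bands).
wavelengths = np.arange(400, 2501, dtype=float)
true_baseline = 0.3 + 1e-4 * (wavelengths - 400) - 2e-8 * (wavelengths - 400) ** 2
feature = 0.05 * np.exp(-((wavelengths - 550.0) / 20.0) ** 2)  # hypothetical narrow feature
spectrum = true_baseline + feature

detrended = detrend(wavelengths, spectrum)
```

A spectrum that is exactly a second-degree polynomial is reduced to zero by this operation, so only the departures from the smooth baseline remain.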
Leaf samples were collected by punching three discs per leaf, which were then stored in dimethylformamide. A dual-beam scanning ultraviolet-visible spectrophotometer (Ultrospec 3300 Pro, Biosciences) and Wellburn's method (Wellburn, 1994) were used to quantify chlorophyll content. The chlorophyll content was then converted to μg cm−2 using the area of the leaf discs.

Regression models based on machine learning algorithms
The performances of five machine learning algorithms were evaluated based on their ability to estimate chlorophyll content in the muskmelon leaves. These were Cubist, deep belief nets (DBN), random forests (RF), stochastic gradient boosting (SGB) and support vector machine (SVM).
Cubist is a rule-based model tree approach that generalizes regression models by adding boosting when the number of committee models is greater than 2. Its leaves are expressed as multivariate linear regression models. A regression model based on Cubist is generated in two steps: (1) establishing a set of rules that divides the training data into smaller subsets and (2) fitting a regression model to these smaller subsets and applying a nearest neighbour algorithm to the leaf node, using an ensemble approach (Quinlan, 1992). The number of committee models and the number of neighbours used for correcting the model predictions were optimized, using parameter spaces of 1-100 for committee models and 1-5 for neighbours.
DBN consists of multi-layer unsupervised restricted Boltzmann machines (RBMs), which are two-layer neural networks (Hinton et al., 2006). The optimized hyperparameters include hidden layers, layer unit sizes, batch size, number of epochs, learning rate, dropout rate and weight decay. The parameter spaces were 2-6 for hidden layers, 10-75 for layer unit sizes, 5-20 for batch size, 10-200 for number of epochs, 0.001-0.1 for learning rate and 0-0.02 for weight decay. In addition, dropout, a technique for addressing overfitting, was utilized during the training phase to facilitate high-quality predictions; the associated parameter space was 0-1 (Srivastava et al., 2014).
RF is an ensemble learning technique that builds multiple decision trees based on random bootstrapped samples of the training data (Breiman, 2001). There are two hyperparameters: the number of trees and the number of variables used to split the nodes. Even when the number of trees is very large, the generalization error converges, so over-training is not a problem. However, reducing the number of predictor variables makes each individual tree of the model weaker. The parameter spaces were 20-1000 for the number of trees and 1 to the number of variables for the number of predictor variables.
SGB also builds an ensemble of trees, using a random sub-sample of the training data at each iteration to improve computation speed and prediction accuracy and to avoid overfitting (Friedman, 2002). Parameter spaces for optimized variables were 3-1000 for total number of trees to fit, 1-20 for the maximum depth of each tree, 0.001-1 for the learning rate and 0.75-1 for the minimum number of observations in the terminal nodes of the trees.
SVM is based on fitting a logistic distribution to the output values of the decision functions of classifiers and using quadratic optimization to obtain class probabilities (C.C. Chang et al., 2011). At present, the RBF kernel is the most commonly used due to its useful features (W.J. Wang et al., 2003). There are two parameters that control the flexibility of the classifier: the regularization parameter C and the spread parameter σ. Overly high C values lead to a high penalty for non-separable points and storage of many support vectors, while exceedingly low values lead to underfitting. σ is closely associated with the generalization performance of SVM. The hyperparameter spaces were discretized along 2^x, where x ranged from −50 to 50, for both parameters.
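As an illustration of the discretized search space, the following sketch lists the candidate values 2^x for integer x from −50 to 50 and evaluates a Gaussian (RBF) kernel. The integer step and the kernel parameterization exp(−σ‖x − z‖²) are assumptions made for this example; other parameterizations of the spread parameter exist, and the function name is hypothetical.

```python
import math

def rbf_kernel(x, z, sigma):
    """Gaussian (RBF) kernel, written here as exp(-sigma * ||x - z||^2) (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sigma * sq_dist)

# Candidate values for both C and sigma: 2^x with integer x from -50 to 50.
exponents = range(-50, 51)
grid = [2.0 ** e for e in exponents]

# The kernel equals 1 for identical inputs and decays with distance.
same = rbf_kernel([0.2, 0.4], [0.2, 0.4], sigma=grid[50])
apart = rbf_kernel([0.2, 0.4], [0.9, 0.1], sigma=grid[50])
```

Under these assumptions each parameter has 101 candidate values, spanning roughly 30 orders of magnitude.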
Before generating regression models, all measurements were divided into three data sets (training, 50%; validation, 25%; and test, 25%) based on a stratified random-sampling approach (Hastie et al., 2009), which was repeated one hundred times for more robust results.
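A repeated stratified split of this kind can be sketched as follows. The stratification variable is not specified above, so the example assumes that the nitrogen treatment is used as the stratum and uses a hypothetical 16 leaves per treatment; the function name and the seeding scheme are also illustrative.

```python
import random
from collections import defaultdict

def stratified_split(strata, fractions=(0.5, 0.25, 0.25), seed=0):
    """Split sample indices into training/validation/test sets within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, label in enumerate(strata):
        by_stratum[label].append(idx)
    train, valid, test = [], [], []
    for label in sorted(by_stratum):
        members = by_stratum[label]
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_valid = round(fractions[1] * len(members))
        train += members[:n_train]
        valid += members[n_train:n_train + n_valid]
        test += members[n_train + n_valid:]  # whatever remains goes to the test set
    return train, valid, test

# Hypothetical design: six treatments (A-F) with 16 leaves each.
strata = [treatment for treatment in "ABCDEF" for _ in range(16)]

# One hundred repetitions, each with a different random seed.
splits = [stratified_split(strata, seed=rep) for rep in range(100)]
```

Each repetition keeps the proportion of every stratum approximately equal across the three subsets, which is the purpose of stratification when some treatments have few samples.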

Wavelength selection methods based on partial least squares
When running filter methods, a regression model based on PLS is generated and the output is then evaluated to identify a subset of important variables. Loading weights (LW), regression coefficient (RC) and variable importance in projection (VIP) are in this category. In the LW method, the high and low variables are defined by calculating the maximum absolute loading weights from the principal factors (Y.G. Wang et al., 2016), while the loading weights from each component are accumulated in the VIP method (Chong & Jun, 2005). In the RC method, the sensitive wavelengths are generally selected according to the regression coefficients of PLS models (Mehmood et al., 2012).
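The way VIP accumulates loading weights over components can be illustrated with a minimal sketch. It assumes NumPy, a single response variable, a basic NIPALS implementation of PLS and the commonly cited VIP formula, in which each squared, normalized loading weight is weighted by the response variance explained by its component. The function names and the synthetic data are hypothetical, and this is not the implementation used in the study.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """Fit single-response PLS by NIPALS; return loading weights W, scores T and y-loadings q."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, n_vars = X.shape
    W = np.zeros((n_vars, n_components))
    T = np.zeros((n, n_components))
    q = np.zeros(n_components)
    for a in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)           # unit-length loading weights
        t = X @ w                        # scores for this component
        tt = t @ t
        p_load = X.T @ t / tt            # X-loadings used for deflation
        q[a] = y @ t / tt
        X = X - np.outer(t, p_load)      # deflate X
        y = y - q[a] * t                 # deflate y
        W[:, a], T[:, a] = w, t
    return W, T, q

def vip_scores(W, T, q):
    """VIP per variable: loading weights accumulated over components, weighted by explained variance."""
    n_vars = W.shape[0]
    ss = q ** 2 * np.sum(T ** 2, axis=0)  # response variance explained by each component
    return np.sqrt(n_vars * (W ** 2 @ ss) / ss.sum())

# Synthetic data: 200 samples, 30 variables, response driven mainly by variable 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = 2.0 * X[:, 3] - X[:, 17] + 0.1 * rng.normal(size=200)

W, T, q = pls1_nipals(X, y, n_components=3)
vip = vip_scores(W, T, q)
```

Because the loading weights have unit length, the mean of the squared VIP scores equals 1, which is why a threshold near 1 is often used as a rule of thumb for retaining variables.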
The wrapper methods directly estimate generalization ability using a learning algorithm (Pierna et al., 2009). Included in this category are backward variable elimination (BVE), competitive adaptive reweighted sampling (CARS), genetic algorithm (GA), iterative predictive weighting (IPW), PLS with Martens' uncertainty test (MUT), regularized elimination procedure (REP), sub-window permutation analysis (SwPA) and uninformative variable elimination (UVE). BVE is a backward iterative step-by-step PLS-oriented method for the selection of spectral variables, and its objective is to build a correct model with few variables (Pierna et al., 2009). In CARS, Monte-Carlo sampling with the PLS regression coefficient is applied, and the variables with larger regression coefficient weights are used as a new subset to establish a PLS model (Fan et al., 2016). GA, an adaptive heuristic search algorithm centred on the evolutionary ideas of natural selection and genetics, is superior to MUT, backward interval-PLS (BiPLS) and forward interval-PLS (FiPLS; Villar et al., 2014). In IPW, PLS regression is repeated cyclically: the predictor importance is calculated from the absolute value of the regression coefficient and the standard deviation of the predictor, and the predictors are multiplied by their importance in the next cycle (Forina et al., 1999). In MUT, the principle of jack-knifing is applied to estimate standard errors of the regression coefficients, which are then divided by their estimated standard errors to yield t-test statistics (Villar et al., 2014). REP also adopts a stepwise elimination and a stability-based variable selection procedure, in which the samples are split randomly into a predefined number of training and test data sets (Mehmood et al., 2012). In SwPA, the influence of each variable is evaluated without considering the influence of the other variables (Li et al., 2010). In UVE, artificial noise variables are added to the reflectance data, and all original variables that are less important than the artificial noise variables are removed (Pan et al., 2016).
In the embedded methods, variable selection is conducted at the component level. Backward and forward interval-PLS (BiPLS and FiPLS) and sparse PLS (SPLS) are examples of this category. In iPLS, the data are divided into non-overlapping sections and a separate PLS model is built in each section to identify the most useful wavelengths (Lindgren et al., 1994). The sub-interval with the smallest cross-validated prediction error is selected in FiPLS, while the sub-intervals having the largest error are removed in BiPLS (Mehmood et al., 2012). SPLS combines variable selection and modelling in a one-step procedure (Le Cao et al., 2008). Details of each method are summarized in Mehmood et al. (2012).
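The interval-based search can be sketched as a greedy forward selection over non-overlapping wavelength intervals. In this sketch the cross-validated PLS error is represented by a user-supplied callable, `cv_error`, which is a hypothetical placeholder; the number of intervals (20) and the toy error function are likewise assumptions made only for illustration.

```python
def split_into_intervals(n_variables, n_intervals):
    """Return (start, stop) index pairs that cover range(n_variables) without overlap."""
    base, extra = divmod(n_variables, n_intervals)
    bounds, start = [], 0
    for i in range(n_intervals):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def forward_interval_selection(intervals, cv_error):
    """Greedy forward search: repeatedly add the interval giving the lowest error."""
    selected, best_err = [], float("inf")
    remaining = list(intervals)
    while remaining:
        err, choice = min((cv_error(selected + [iv]), iv) for iv in remaining)
        if err >= best_err:  # stop when no remaining interval improves the error
            break
        selected.append(choice)
        remaining.remove(choice)
        best_err = err
    return selected, best_err

# 2101 bands (400-2500 nm at 1 nm steps) divided into 20 intervals.
intervals = split_into_intervals(2101, 20)

# Toy stand-in for a cross-validated PLS error: two intervals are informative,
# and every added interval carries a small complexity cost.
informative = {intervals[7], intervals[14]}

def toy_cv_error(chosen):
    return 5.0 - 2.0 * len(informative.intersection(chosen)) + 0.5 * len(chosen)

selected, best_err = forward_interval_selection(intervals, toy_cv_error)
```

A backward variant would start from all intervals and repeatedly drop the one whose removal lowers the error most; in practice `cv_error` would fit a PLS model on the chosen bands and return its cross-validated prediction error.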

Performance assessment
The ratio of performance to deviation (RPD; Equation (1)) was calculated (Williams & Norris, 1987). Each method was assigned to one of three categories based on RPD: "A" (RPD > 2.0), "B" (1.4 ≤ RPD ≤ 2.0) or "C" (RPD < 1.4). The models categorized as "A" and "B" are referred to as excellent and fair models, respectively, while those categorized as "C" are non-reliable models (C.W. Chang et al., 2001). Thus, models classified as "A" or "B" were assumed to have the potential to estimate chlorophyll content.
$$\mathrm{RPD} = \frac{\mathrm{SD}}{\mathrm{RMSE}} \quad (1)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
where SD is the standard deviation of chlorophyll content in the test data, RMSE is the root-mean-square error, $n$ is the number of samples, $y_i$ is the measured chlorophyll content and $\hat{y}_i$ is the estimated chlorophyll content.
Then, the relationships between the selected wavelengths and their importance were evaluated based on a black box data-based sensitivity analysis (DSA), which assumes that the fitted models are pure black boxes (Cortez & Embrechts, 2013). DSA uses several training samples instead of a baseline vector, while only one input is changed at a time and the others are kept at their average values in a computationally efficient one-dimensional sensitivity analysis (Kewley et al., 2000).
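The RPD and RMSE defined above, together with the three RPD categories, can be computed as in the following sketch. The use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which form was used, and the measured and estimated values are made up for illustration.

```python
import math

def rmse(measured, estimated):
    """Root-mean-square error between measured and estimated values."""
    n = len(measured)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(measured, estimated)) / n)

def rpd(measured, estimated):
    """Ratio of performance to deviation: SD of the measured values divided by RMSE."""
    n = len(measured)
    mean = sum(measured) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in measured) / (n - 1))  # sample SD (assumption)
    return sd / rmse(measured, estimated)

def rpd_category(value):
    """Assign the category used in the text: A (> 2.0), B (1.4 to 2.0) or C (< 1.4)."""
    if value > 2.0:
        return "A"
    if value >= 1.4:
        return "B"
    return "C"

# Made-up test data in ug cm^-2.
measured = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 37.0]
category = rpd_category(rpd(measured, estimated))
```

Because RPD scales the error by the spread of the test data, it allows models evaluated on data sets with different chlorophyll ranges to be compared on a common footing.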

Chlorophyll content after each treatment
Chlorophyll content per unit leaf area ranged from 10.47 to 34.1 μg cm−2 for the different nitrogen treatments (Table 2). There were no clear changes in chlorophyll content with growth, except for treatment F, in which chlorophyll content both decreased with growth and differed significantly from other treatments (p < 0.05, Tukey-Kramer test).

Spectral reflectance of different treatments
The correlation coefficients between chlorophyll content and reflectance after de-trending show two troughs (Figure 1), indicating high negative correlations near the green peak and the red edge inflection point (REIP). While these were most distinctive for treatment F, the lowest absolute values of correlation coefficients were recorded for treatment B. Although significant positive correlations over the 400-500 nm range and at wavelengths greater than 750 nm were identified for treatment F, these tendencies were not clear for other treatments. Furthermore, positive correlations near 680 nm were identified for treatments A, B and F; however, the positions of their peaks differed. Positive correlations were not identified for treatments C and D.
The wavelengths selected by the 14 PLS-based methods over the one hundred repetitions are shown in Figure 2. The average numbers of wavelengths selected by CARS, LW and UVE were less than 50, while those of BiPLS, RC, REP, SPLS, SwPA and VIP were more than 1000.

Accuracy validation
RPD and RMSE values were calculated using regression models based on machine learning algorithms over 100 iterations (Figure 3). The best combination of variable selection method and machine learning algorithm was REP and Cubist, achieving an RPD of 1.76 and an RMSE of 2.42 μg cm−2. The importance at 20 nm intervals as assessed by DSA and Cubist is shown in Figure 4. Generally, high importance values were observed over the REIP (680 to 720 nm) for all variable selection methods. Although some variable selection methods ignored the importance over the green peak, the highest importance was observed over it for CARS.
Further, Cubist showed the best RPD values for all variable selection methods, while SVM was deemed unsuitable since its RPD values were below 1.4 for all the selection methods investigated during this study. The RPD values of UVE, LW and FiPLS were also generally below 1.4; thus, these methods are considered unsuitable for estimating chlorophyll content of muskmelon leaves using hyperspectral reflectance. In contrast, MUT, RC, REP, SPLS and SwPA showed potential, with RPD values above 1.6.

Discussion
On 26 August, when the treatment was started, there were no significant differences among the six treatments. Within one to two weeks, the mean chlorophyll content increased with nitrogen strength. A significant difference in chlorophyll content was confirmed between F and each of A, B, C and D on 2 September, and between C and each of A, B, D and F, and between E and F, on 9 September (p < 0.05, based on the Tukey-Kramer test). Chlorophyll content became higher with higher nitrogen strength, and this led to a lower reflectance due to the strong absorption of chlorophyll-a and b under blue light (410-470 nm) and red light (644.8-670 nm), respectively (X. Chen et al., 2020; Navarro-Cerrillo et al., 2014).
Of the best combinations of variable selection methods and machine learning algorithms over the 100 repetitions, REP and VIP were the most prominent variable selection methods, each being selected 13 times (Table 3). The high performance of VIP has also been reported for estimating leaf chlorophyll content in winter wheat (He et al., 2015). SwPA (11 times), GA (9 times) and SPLS (8 times) were also effective as variable selection methods.
These top five methods were selected 54 times in total. UVE, LW and FiPLS generally had RPD values below 1.4 when the results of the 100 iterations were merged (Figure 2). Previous studies reported a reduced performance of FiPLS, LW and UVE because these methods removed useful information (Santos-Rufo et al., 2020; Xia et al., 2017). The reflectance at 550 nm, which is frequently noted as the green peak, was applied for estimating chlorophyll content in some studies (Carter & Knapp, 2001; Datt, 1998). Indeed, the green peak becomes less pronounced with higher chlorophyll content. However, some studies have reported that chlorophyll contributes to reflectance at 550 nm especially for low anthocyanin content against a high background of chlorophyll (Merzlyak et al., 2003). Besides the green peak, the red edge has also been used for chlorophyll content estimation and is shifted to longer wavelengths with higher chlorophyll content (Gitelson & Solovchenko, 2017; Miller et al., 1990). The performance of FiPLS, LW and UVE relied on the red edge, and the wavelengths over the green peak were sometimes removed across the 100 repetitions. In contrast, the other methods utilised both the green peak and red edge. Ram et al. (2011) reported that anthocyanin induction was strongly influenced by low nitrogen concentration. The effects of anthocyanin on reflectance differ between the green peak and the red edge: at the red edge chlorophyll absorbs but anthocyanin does not, whereas the absorption of anthocyanin is at its maximum at the green peak (Gitelson et al., 2006). Using the green peak might therefore introduce inaccuracies related to senescence or stress caused by lack of nitrogen.
The RPD values of the regression models based on UVE and LW were less than 1.4, which meant that they were not suitable variable selection methods. The average number of wavelengths selected by UVE and LW was 10, which was the smallest value. Some previous studies (Santos-Rufo et al., 2020; Xia et al., 2017) reported that these methods greatly reduce the number of variables but also remove some useful information, and this characteristic was confirmed in this study.
The machine learning algorithms Cubist (38 times) and DBN (32 times) were most frequently selected. Potential methods for use with these algorithms were SwPA (6 and 3 times for Cubist and DBN, respectively), GA (3 and 5 times), REP (3 and 4 times), SPLS (5 and 2 times) and VIP (2 and 7 times). For kernel-based algorithms, performance depends on an appropriate selection of the hyperparameters related to the kernel function (Horvath, 2003), and so different ranges of kernel-related parameters for SVM have been suggested: from 0.005 (Foody & Mathur, 2004) to 2^8 (Sonobe et al., 2014), although greater values may lead to a decrease in accuracy by more than 20% (Trisasongko, 2017). In the present study, σ ranged from 2^−21 to 2^50 with a mode of 2^0 (7 times per 100 iterations), but no generally preferable value could be determined. Kp ranged from 2^−7 to 2^0 with a mode of 2^−5 (24 times per 100 iterations). Although Cubist, RF and SGB are all stochastic modelling techniques involving ensemble regression trees or rule-based models, the performances of RF and SGB were clearly lower than that of Cubist. Cubist's proficiency has been demonstrated in a comparison of 77 popular regression methods (Fernandez-Delgado et al., 2019). SGB tends to fail when the training data set is small; because only a fraction of the training data is sampled in this algorithm (n = 100 in this study), overfitting was likely. When running RF ("randomForest" package; Breiman et al., 2018), one third of the training data is separated as out-of-bag (OOB) samples. These data are not considered in the training of the tree and can be used to evaluate performance. However, this strategy might have reduced the sample size too much to generate regression models, as RF did not perform as well as in previous studies (Biau & Scornet, 2016).
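The approximately one-third share of out-of-bag samples follows from bootstrap sampling. The short sketch below assumes that each tree draws n samples with replacement from n training samples, so that a given sample is left out of a tree with probability (1 − 1/n)^n, which approaches e^−1 ≈ 0.37 as n grows. The training-set size of 50 is a hypothetical value chosen for illustration.

```python
import math

def expected_oob_fraction(n):
    """Probability that a given sample is never drawn in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n

n_train = 50  # hypothetical training-set size
oob_fraction = expected_oob_fraction(n_train)
expected_oob_count = oob_fraction * n_train
limit = math.exp(-1)  # limiting value for large n
```

For a training set of this size, each tree would therefore be grown on roughly 32 distinct samples, which illustrates how limited the effective sample per tree becomes when the data set is small.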
Many earlier studies have reported the best machine learning algorithms for estimating leaf chlorophyll content from hyperspectral reflectance or vegetation indices calculated from reflectance (Zhu et al., 2020); however, combinations of machine learning algorithms and wavelength selection methods were not evaluated. The results of the present study indicate that wavelength selection is a critical step for chlorophyll content estimation and that suitable selection methods improve estimation accuracy.
To evaluate whether the proposed method was effective for other species, the ANGERS dataset (Feret et al., 2008), which was measured in 2003 at INRA in Angers (France) and includes measurements from 41 different species, was used for validation. Figure 5 shows the relationships between measured and estimated values. Although the estimated values were constant for samples whose chlorophyll contents were greater than 60 μg cm−2 or smaller than 10 μg cm−2 (such values were not included in the measurements from the muskmelon leaves), the proposed method still performed well, with an RMSE of 11.30 μg cm−2 and an RPD of 1.92.
Figure 5. Relationship between measured and estimated chlorophyll contents.
In this study, six nitrogen treatments were applied to produce variation in chlorophyll content, and the reflectance of muskmelon leaves was investigated over a wide range of chlorophyll contents. However, ANGERS is a larger dataset and includes leaves with very low pigment contents and, in some cases, with almost no carotenoids or chlorophylls (Feret et al., 2008). In particular, the lowest and highest chlorophyll contents were found in the measurements from broad-leaved trees. Broad-leaved trees generally have two distinctive leaf types: shaded and sunlit leaves. Sunlit leaves grow under high irradiance and are much less susceptible to photoinhibitory damage than shaded leaves (Powles, 1984), while shaded leaves are commonly larger and thinner than sunlit leaves (Terashima et al., 2001). The difference between the two leaf types should be incorporated into the chlorophyll content estimation models to improve the accuracy for measurements from broad-leaved trees.
Leaf-scale spectroscopy is effective for confirming direct links between reflected information and chlorophyll content, since surrounding effects are kept under control. Some variable selection algorithms have been proposed to assess chlorophyll content from canopy spectra (H.J. Liu et al., 2019; N. Liu et al., 2020; Yang et al., 2021), and satellite- or air-borne remote-sensing data are better suited to professional applications involving large-scale assessment; their potential should be assessed in future work.

Conclusions
Dimension reduction is an important strategy for eliminating redundant and irrelevant features and improving the accuracy of estimations of vegetation properties using hyperspectral reflectance. To this end, the use of PLS-based feature selection methods has been proposed. However, the influences of the feature selection methods and machine learning algorithms were unknown. Therefore, combinations of 14 PLS-based variable selection methods and 5 common machine learning algorithms were evaluated for their potential for estimating chlorophyll content of muskmelon leaves using hyperspectral reflectance.
Generally, Cubist and DBN performed well for this purpose. Although the choice of algorithm was more important for estimating chlorophyll content, improvements were identified for combinations with one of five PLS-based variable selection methods: SwPA, GA, REP, SPLS and VIP. These methods can be considered effective for enabling precise agricultural analyses in this context.
Although de-trending was used to pre-process reflectance data in this study, vegetation indices are also effective for removing noise and reducing the data saturation problem. Therefore, it would be beneficial to test the efficacy of vegetation indices for further improving chlorophyll content monitoring.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This study was supported financially by JSPS KAKENHI [grant number 19K06313].