Estimation of paddy rice maturity using digital imaging

ABSTRACT Harvest time is an important factor affecting grain yield and postharvest quality. The estimation of crop maturity greatly supports farmers in the optimization of the harvest time. This study proposed a method for predicting paddy rice maturity (J816 and 5Y4 varieties) using color features and the random forest (RF) regression algorithm. Paddy panicle images were obtained using a flatbed scanner during the maturation period. To estimate paddy rice maturity, 22 color features representing the greenness of crop leaves were extracted from the paddy panicle images. Stepwise regression was used to select superior features as inputs to the RF regression model. The coefficient of determination (R²) and root mean square error (RMSE) values of the model were 0.93 and 1.18% for J816 and 0.94 and 1.60% for 5Y4, respectively. The results indicate that the proposed method is a promising technique for the estimation of paddy rice maturity.


Introduction
Crop maturity is an important factor determining the optimum harvest time and affecting grain yield and postharvest quality. A crop produces the maximum dry matter yield at maturity, after which the grains stop growing and only lose moisture [1]. After maturity, the risks of lodging, shattering, and adverse weather increase, resulting in reductions in grain quality and yield at harvest [2]. In California, the head rice yield was shown to decrease from 63.8 to 45.8% when the harvest was delayed by 10 days [3]. In Canada, the yield loss of wheat was shown to be as high as 17% due to shattering after delayed harvest [4]. In addition, a delayed harvest leads to an increasing crack rate in rice due to meteorological changes such as rain and dew, affecting the subsequent mechanical drying quality and the taste of milled rice.
Rice is one of the most important cereal crops and is a staple food for nearly 50% of the world's population [5]. The fast increase in the world's population and the gradual reduction in cropland areas have been raising the demand for foods, especially staple foods such as rice [6]. In general, farmers determine the optimum harvest time in paddy rice mainly through visual observation in the field [7]. Although widely used, this visual method based on personal experience is subjective, highly inconsistent, and poorly reproducible, resulting in improper harvest times [8]. Kernel dry matter accumulation after anthesis could provide a precise estimation of the optimum harvest time in crops. Crop maturity can be estimated in the maturation period by weighing oven-dried samples and establishing regression models [1]. In addition, grain moisture content has been successfully applied as an indicator of the optimum harvest time in many grain crops, such as wheat [9], barley [10], maize [11], and rapeseed [12]. However, these methods are time-consuming, destructive, costly, and labor-intensive, and they cannot be used for real-time applications. Thus, a simple, efficient, and objective method is necessary to determine rice maturity.
In recent years, nondestructive techniques based on spectral and hyperspectral reflectance have been investigated for estimating the chlorophyll content of plants [13]. These methods were developed to achieve special purposes such as real-time and accurate nutrient status reporting [14,15]. Because chlorophyll content affects the visual features of leaves, digital cameras and red, green, and blue (RGB) imaging have also been used as low-cost instruments in the visible range for nitrogen status estimation [16-18]. On the other hand, mature grains or panicles of paddy rice change color from green to golden brown. Thus, the estimation of paddy rice maturity by its color using digital imaging seems feasible.
Computer vision, as a rapid, low-cost, and high-accuracy technique, has been widely applied in agricultural production [19-21]. As chlorophyll is the main pigment in leaves and is responsible for leaf greenness, various color indices have been designed to assess plant chlorophyll status at the leaf or canopy scale of the crop. Wang et al. used the green channel minus red channel (GMR or G-R), green channel divided by the red channel (G/R), normalized green index (NGI), normalized red index (NRI), and hue features to estimate biomass, N content, and leaf area index (LAI) [17]. They reported that the GMR and G/R features correlated better with the biomass, N content, and LAI than the other features. Vesali et al. used 19 color features extracted from corn leaf images to estimate the chlorophyll content of corn leaves [22]. They found that, among these features, the hue feature had the strongest linear relationship with chlorophyll content. In addition to leaf greenness, grain color is an important indicator of crop maturity. Therefore, the color features used to assess plant chlorophyll status would also be useful for estimating paddy maturity.
However, predicting paddy rice maturity by its color using digital imaging is still challenging. There have been few applications of computer vision to determining maturity in paddy rice. The main applications of computer vision in rice have been carried out for classification [23-25], the detection of external damage [26-28], or the determination of quality parameters [29-31]. Haw et al. proposed a method to determine paddy maturity by color vision techniques [32]. They used RGB cameras to collect images of panicles at ambient illumination in a laboratory, and then the pictures of the panicles were equally divided into 3 parts: the terminal part, the middle part, and the basal part. Three points were randomly selected in each portion to obtain the mean hue value to determine the paddy maturity. As ripeness gradually progresses from terminal to basal, the distribution of colors at each part of the panicle is uneven. This method of color extraction, in addition to the effects of ambient conditions, results in random and inconsistent maturity prediction. This study aimed to evaluate the effect of various color features from digital images on the estimation of paddy rice maturity in the maturation period.

Paddy rice samples
Two high-yield rice varieties were used in this study: Jijing816 (J816) and 5You4 (5Y4). J816 is short-grain rice with a 112.8 cm plant height, and 5Y4 is long-grain rice with a 122.8 cm plant height. These two varieties were chosen because they have a similar maturation period of approximately 48 to 55 days after heading (DAH). The experiment was carried out in the experimental fields of the Rice Research Institute of Jilin Province, China during 2019 and 2020. The field size for each variety was 1000 × 300 cm, with a plant spacing of 30 cm. The seed development stages are shown in Table 1. The heading date was determined at the time when approximately 50% of the spike had emerged. Paddy panicles were manually harvested every day from the sampling date to the harvest date. At each harvest, enough paddy panicles were randomly selected from healthy plants without any pathogenic symptoms. Then, the paddy panicles were transferred to the laboratory to determine paddy rice maturity and to conduct image acquisition. Image acquisition and the determination of the maturity of the paddy rice were conducted simultaneously to prevent the influence of moisture loss on the paddy color.

Determination of paddy rice maturity
To determine paddy rice maturity, paddy samples were threshed by hand from panicles and then dehulled using a laboratory huller (JLGJ-45, China). After this, 20 g sound whole rice samples were collected to determine paddy rice maturity. These rice samples were visually classified into two stages: mature (completely translucent kernel with light brown color) and immature (completely translucent kernel with green color) [33]. Finally, rice maturity was determined using Eq. (1).
$$M = \frac{W_m}{W_m + W_{im}} \times 100 \qquad (1)$$
where $M$ is the paddy rice maturity (%), $W_m$ is the weight of mature kernels (g), and $W_{im}$ is the weight of immature kernels (g).
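As a small worked illustration of Eq. (1), the sketch below assumes that maturity is the mass share of mature kernels in the classified sample. The function name and the example weights are hypothetical and are not taken from the study.

```python
def paddy_maturity(mature_weight_g: float, immature_weight_g: float) -> float:
    """Return maturity M (%) as the mass share of mature kernels (assumed form of Eq. 1)."""
    total = mature_weight_g + immature_weight_g
    if total <= 0:
        raise ValueError("total kernel weight must be positive")
    return 100.0 * mature_weight_g / total


# Hypothetical 20 g sample: 15.4 g mature and 4.6 g immature kernels.
print(round(paddy_maturity(15.4, 4.6), 1))
```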

Image acquisition
To reduce the effects of ambient conditions on imaging, a flatbed scanner (Epson V39, Japan) was used to acquire paddy panicle images. A single panicle was placed on the glass bed of the scanner and then imaged after closing the scanner cover. The scanner interfaced with the computer through a universal serial bus (USB) port. All the images were acquired with a size of 2550 × 3509 pixels at a resolution of 300 dpi and stored in an uncompressed format to avoid introducing any additional artifacts. Color calibration was conducted using a gray card (Kodak R-27, Eastman Kodak Company) before image acquisition on each day. For each rice variety, 300 images (30 images × 10 days) obtained in 2019 were used for model development. These images were randomly split into two groups: 80% (240 images) were selected as the training set to train the model, and the remaining 20% (60 images) were used as the test set to evaluate the performance of the developed model. To validate the robustness of the developed model applied to paddy rice harvested in different seasons, 36 images (9 images × 4 days) of each variety acquired in the same fields in 2020 were used in this study. Thus, a total of 600 images were used for model development, and 72 images were used for model validation.
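The 80/20 random split described above can be sketched as follows. This is only an illustration: the seed, the function name, and the use of Python's `random` module are assumptions, since the text does not state how the split was drawn.

```python
import random


def split_indices(n_images: int, train_fraction: float = 0.8, seed: int = 0):
    """Shuffle image indices and split them into training and test subsets."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train = round(n_images * train_fraction)
    return indices[:n_train], indices[n_train:]


# 300 images per variety give 240 training and 60 test images.
train_idx, test_idx = split_indices(300)
print(len(train_idx), len(test_idx))
```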

Image processing
Image processing aims to isolate the paddy panicle from the original color image. The image processing algorithm was developed using LabView 2016 (National Instruments, USA). The image processing operations of the paddy panicle are shown in Figure 1. After the original panicle image was acquired (Figure 1a), the first step of the image processing operations was thresholding. As the color of the paddy panicle was more saturated than the background, the hue, saturation, and intensity (HSI) color space was suitable for thresholding. In the saturation (S) channel, a threshold of S = 25 was defined from the S-histogram (Figure 2) to extract paddy panicles from the background.
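The thresholding step can be sketched in NumPy as below. The study used LabView, so this is not the authors' implementation. It assumes the common HSI definition S = 1 - 3·min(R, G, B)/(R + G + B) and assumes that S is rescaled to 0-255 so that the reported threshold of 25 applies; both are assumptions, and the function name is hypothetical.

```python
import numpy as np


def saturation_mask(rgb: np.ndarray, threshold: float = 25.0) -> np.ndarray:
    """Return a boolean mask of pixels whose HSI saturation exceeds the threshold.

    rgb is an (H, W, 3) uint8 array. Saturation is rescaled to 0-255,
    which is an assumed scaling for the reported threshold.
    """
    rgb = rgb.astype(np.float64)
    total = rgb.sum(axis=2)
    minimum = rgb.min(axis=2)
    # Black pixels have undefined saturation; treat them as unsaturated.
    safe_total = np.where(total > 0, total, 1.0)
    saturation = np.where(total > 0, 1.0 - 3.0 * minimum / safe_total, 0.0)
    return saturation * 255.0 > threshold
```

A near-white background pixel has saturation close to 0 and is rejected, whereas a green or yellow panicle pixel is kept.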
After thresholding, there were still some small particles representing the background in the binary image (Figure 1b). Therefore, a sequence of morphological operations (including opening, particle removal, and border rejection) was applied to filter the noise in the binary image. The opening operation was a combination of erosion and dilation. Erosion was used to discard the isolated pixels with a higher brightness than the background, followed by dilation to fill the holes or connect cracks inside the objects in the binary image [34]. The particle removal operation was used to filter out particles larger or smaller than a certain size in binary images. Then, the border rejection operation was applied to remove the particles or objects truncated by the boundary in the eroded binary image to enhance the accuracy of object recognition. Finally, the filtered image (Figure 1c) was used as the mask to obtain the isolated color paddy panicle (Figure 1d).
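The morphological clean-up can be approximated with `scipy.ndimage`, as sketched below. This is not the LabView pipeline used in the study: the 3 × 3 structuring element and the `min_size` value are hypothetical, and only a lower size limit is applied during particle removal.

```python
import numpy as np
from scipy import ndimage


def clean_mask(mask: np.ndarray, min_size: int = 50) -> np.ndarray:
    """Open the mask, drop small particles, and reject particles touching the border."""
    # Opening = erosion followed by dilation with a 3x3 structuring element.
    opened = ndimage.binary_opening(mask, structure=np.ones((3, 3), dtype=bool))
    labels, n_labels = ndimage.label(opened)
    if n_labels == 0:
        return np.zeros(opened.shape, dtype=bool)
    # Particle removal: keep only connected components with enough pixels.
    sizes = np.bincount(labels.ravel(), minlength=n_labels + 1)
    keep = sizes >= min_size
    keep[0] = False  # label 0 is the background
    # Border rejection: discard components that touch any image edge.
    border = np.concatenate([labels[0, :], labels[-1, :], labels[:, 0], labels[:, -1]])
    keep[np.unique(border)] = False
    return keep[labels]
```

The returned boolean array can be used to index the original RGB image and isolate the panicle pixels.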

Feature extraction
To obtain the color features of the paddy panicles, the RGB and HSI color spaces were used in this study. For each paddy panicle image, the mean and standard deviation of the red (R), green (G), blue (B), hue (H), saturation (S), and intensity (I) channels were calculated from the histogram of each color space using Eqs. (2)-(3) [35].
$$\mathrm{Color}_{\mathrm{mean}} = \frac{1}{n}\sum_{x}\sum_{y} p_{xy} \qquad (2)$$
$$\mathrm{Color}_{\mathrm{std}} = \sqrt{\frac{1}{n}\sum_{x}\sum_{y}\left(p_{xy} - \mathrm{Color}_{\mathrm{mean}}\right)^{2}} \qquad (3)$$
where $\mathrm{Color}_{\mathrm{mean}}$ and $\mathrm{Color}_{\mathrm{std}}$ are the mean and standard deviation of the color values, $n$ is the total number of pixels in the panicle image (Figure 1d), and $p_{xy}$ is the pixel value in the (x, y) position. In addition to these main color features, other combination color features that reflect the greenness of plants proposed in previous studies were calculated [17,18,22,36,37]. Therefore, a total of 22 color features were extracted from each paddy panicle image (Table 2).
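The RGB part of the feature extraction can be sketched as follows. The sketch computes the channel means and standard deviations over the masked panicle pixels, plus the combination features as they are defined in Table 2. The HSI features are omitted, the standard deviation is assumed to divide by n, and the function and key names are hypothetical.

```python
import numpy as np


def color_features(rgb: np.ndarray, mask: np.ndarray) -> dict:
    """Compute RGB mean/std and combination features over the masked pixels."""
    pixels = rgb[mask].astype(np.float64)  # shape (n, 3), panicle pixels only
    r_mean, g_mean, b_mean = pixels.mean(axis=0)
    r_std, g_std, b_std = pixels.std(axis=0)  # divides by n (assumption)
    total = r_mean + g_mean + b_mean
    return {
        "R_mean": r_mean, "G_mean": g_mean, "B_mean": b_mean,
        "R_std": r_std, "G_std": g_std, "B_std": b_std,
        "GMR": g_mean - r_mean,
        "GDR": g_mean / r_mean,
        "VI": (g_mean - r_mean) / (g_mean + r_mean),
        "NR": r_mean / total,
        "NG": g_mean / total,
        "NB": b_mean / total,
        # Table 2 defines these as mean divided by standard deviation.
        "CVR": r_mean / r_std,
        "CVG": g_mean / g_std,
        "CVB": b_mean / b_std,
        "EXG": 2.0 * g_mean - r_mean - b_mean,
    }
```

Computing the statistics only over `rgb[mask]` keeps background pixels from diluting the panicle color statistics.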

Feature selection
To improve the performance of the prediction model and to reduce the processing requirements of the computer, it is necessary to select the optimal features that carry the most useful information. In addition, the risk of overfitting can be reduced by limiting the number of input variables. Many feature selection methods have been proposed, such as the least absolute shrinkage and selection operator (lasso), decision trees, stepwise regression, and correlation-based feature selection [18,38,39]. In this study, stepwise regression was used to select the most effective features to develop models for predicting paddy rice maturity. Because stepwise regression can effectively handle collinearity among multiple variables, it selects features more efficiently than the other methods. The basic principle of stepwise regression is to add or remove variables one by one in a regression model based on their statistical significance level until all the independent variables in the model are significant and all the excluded independent variables are insignificant. Stepwise regression was performed using SPSS 19.0 (IBM SPSS Statistics, NY, US).
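The add-or-remove principle can be illustrated with the sketch below. It is not the SPSS procedure used in the study. It assumes ordinary least squares with two-sided t-tests on the coefficients, and it uses the entry and removal levels of 0.05 and 0.10 reported later in the text; the function names and the `max_steps` safeguard are hypothetical.

```python
import numpy as np
from scipy import stats


def _p_values(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Two-sided t-test p-values for the OLS slopes (intercept added internally)."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = n - design.shape[1]
    sigma2 = residuals @ residuals / dof
    cov = sigma2 * np.linalg.pinv(design.T @ design)
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2.0 * stats.t.sf(np.abs(t_stats), dof)[1:]  # drop the intercept


def stepwise_select(X, y, p_enter=0.05, p_remove=0.10, max_steps=100):
    """Return the column indices chosen by forward-backward stepwise regression."""
    selected = []
    for _ in range(max_steps):
        changed = False
        # Forward step: add the most significant remaining variable.
        best_p, best_j = 1.0, None
        for j in range(X.shape[1]):
            if j in selected:
                continue
            p = _p_values(X[:, selected + [j]], y)[-1]
            if p < best_p:
                best_p, best_j = p, j
        if best_j is not None and best_p < p_enter:
            selected.append(best_j)
            changed = True
        # Backward step: drop the least significant variable if it fails p_remove.
        if selected:
            p_all = _p_values(X[:, selected], y)
            worst = int(np.argmax(p_all))
            if p_all[worst] > p_remove:
                selected.pop(worst)
                changed = True
        if not changed:
            break
    return selected
```

Keeping the entry level stricter than the removal level makes it unlikely that a variable is added and removed repeatedly.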

Model development
Table 2 (combination color features)
No.  Feature                        Abbreviation  Formula
1    Difference between G and R     GMR           GMR = G_mean - R_mean
2    G divided by R                 GDR           GDR = G_mean / R_mean
3    Vegetation index               VI            VI = (G_mean - R_mean) / (G_mean + R_mean)
4    Normalized R                   NR            NR = R_mean / (R_mean + G_mean + B_mean)
5    Normalized G                   NG            NG = G_mean / (R_mean + G_mean + B_mean)
6    Normalized B                   NB            NB = B_mean / (R_mean + G_mean + B_mean)
7    Coefficient of variation of R  CVR           CVR = R_mean / R_std
8    Coefficient of variation of G  CVG           CVG = G_mean / G_std
9    Coefficient of variation of B  CVB           CVB = B_mean / B_std
10   Extreme G                      EXG           EXG = 2G_mean - R_mean - B_mean

Random forest (RF) is an ensemble algorithm based on decision trees proposed by Breiman [40]. A decision tree is a nonparametric supervised learning algorithm used for classification and regression. The RF model creates multiple bootstrapped samples from the training set and then builds unpruned decision trees from the bootstrapped sample sets [41]. For regression, the RF model builds a number of regression trees. Each regression tree is built independently and predicts the outcome individually. Finally, the RF model averages the results from all regression trees as the final prediction. RF is a powerful machine learning method that is relatively robust to outliers and can overcome the "black-box" limitations of artificial neural networks. Additionally, the parameterization of RF is very simple, and RF is computationally lighter than other machine learning methods (neural networks or support vector machines).
In this study, an RF model was trained with 10-fold cross-validation in Python using the scikit-learn package. To obtain the most accurate and robust model, the RandomizedSearchCV method was used for parameter optimization (20 iterations). The optimized parameters and the corresponding searched values were the n_estimators (10 to 500), the max_depth (1 to 30), and the min_samples_split (2 to 10). The flowchart of model development is shown in Figure 3.
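The tuning procedure described above can be sketched with scikit-learn as follows. The search ranges, the 20 iterations, and the 10-fold cross-validation come from the text. The uniform integer sampling, the RMSE-based scoring, the seed, and the function name are assumptions rather than details reported in the study.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV


def tune_random_forest(X, y, n_iter=20, cv=10, n_estimators=(10, 500), seed=0):
    """Randomized hyperparameter search for an RF regressor."""
    # randint(low, high) samples integers in [low, high), hence the +1 offsets.
    param_distributions = {
        "n_estimators": randint(n_estimators[0], n_estimators[1] + 1),
        "max_depth": randint(1, 31),
        "min_samples_split": randint(2, 11),
    }
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=seed),
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=cv,
        scoring="neg_root_mean_squared_error",  # assumed selection criterion
        random_state=seed,
    )
    search.fit(X, y)
    return search
```

After fitting, `search.best_params_` holds the selected settings, and `search.predict` uses the model refitted on all training data with those settings.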
Two RF models, the full-features model and the optimal-features model, were developed for each paddy rice variety. The full-features model used all color features extracted from paddy images as the inputs. The optimal-features model used optimal features selected by using stepwise regression as the inputs. The performance of these two models for each paddy rice variety was compared to select a suitable model for predicting maturity.

Model evaluation
The RF models for maturity were evaluated using the coefficient of determination (R²) and root mean square error (RMSE). Cross-validation was applied to train the developed models. Generally, the larger the R² (close to 1) and the smaller the RMSE (close to 0) are, the better the performance of the developed models. However, what we truly expect is a model that performs well on new samples. For this purpose, the model should learn from the training samples a 'universal law' that is as suitable as possible for all potential samples, so that it makes correct predictions on new samples.
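Under their standard definitions, the two metrics named above can be computed as in the sketch below. The function names and the example values are hypothetical and are not data from the study.

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean square error, in the same unit as the inputs."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Hypothetical maturity values (%) for four samples.
actual = [60.0, 65.0, 70.0, 75.0]
predicted = [61.0, 64.0, 71.0, 74.0]
print(r_squared(actual, predicted), rmse(actual, predicted))
```

Because maturity is expressed in percent, the RMSE is also in percentage points, which is how the values are reported in this study.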
In this study, a total of 300 images for each paddy rice variety were separated into a training set (240 images) and a test set (60 images). In general, the model gave nearly the same performance on the training set and test set, which means that the model can perform well on new samples. The R² and RMSE can be calculated as [42]:
$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}} \qquad (4)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}} \qquad (5)$$
where $y_i$ is the i-th actual maturity, $\hat{y}_i$ is the i-th predicted maturity, $\bar{y}$ is the average of actual maturity, and $n$ is the number of samples.
Figure 4 shows the maturity evolution of paddy rice during the maturation period. In this study, the maturity of the two rice varieties presented a gradually increasing trend during the maturation period. The maturity of J816 was approximately 77% at the harvest date, whereas that of 5Y4 was approximately 57%. Note that a large area of lodging occurred in the field due to the influence of typhoons, resulting in slow growth in the rice, especially for 5Y4, which had a higher plant height than J816.

Correlation of the color features with maturity
In the two rice varieties, nine color features (CVR, CVG, CVB, R_std, G_std, B_std, I_std, NG, and EXG) had rather high R² values (> 0.7) (Table 3). Of these features, CVB and B_std had the best correlation with maturity for J816 (R² = 0.89) and 5Y4 (R² = 0.86) rice, respectively. As previously reported [22], most of these greenness features derived from digital camera images were shown to be highly positively correlated with the chlorophyll content of corn leaves. However, in this study, these features were highly negatively correlated with rice maturity. This was expected because when a rice plant grows toward maturity, the grains gradually change from green to yellow due to the loss of nitrogen absorbed by the mature panicle [43]. Therefore, our results indicated that these nine color features extracted from each panicle image could be used as predictors for estimating paddy maturity. Tewari et al. [37] reported that the mean value of the various image features was found to be best correlated with the greenness of the crop canopy. However, in this study, the mean value of the different color features was poorly fitted to rice maturity because these features were extracted from each segmented image, which contained only the panicle. As ripeness gradually progresses from terminal to basal, the color distribution of the panicle in each part is uneven. In addition, the computation of features in this study used the mean value of all the pixels in the panicle images. Therefore, the correlation between the mean value of colors and maturity was poor, while the standard deviation and coefficient of variation of colors had the strongest relationship with maturity. The GMR (G-R), which others have reported to be valid in the representation of N, was not superior to the other color features in correlating with paddy maturity [17]. This was likely due to its limited generality in rice crops, which have unique coexisting impacts from water and soil background and canopy structure.
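A per-feature screening of this kind can be sketched as below, assuming that each R² comes from a simple linear fit of maturity on a single feature. In that case R² equals the squared Pearson correlation. The text does not state how the values in Table 3 were computed, so this is an assumption, and the function name and example values are hypothetical.

```python
import numpy as np


def feature_r2(features: dict, maturity) -> dict:
    """R² of a one-variable linear fit for each feature (squared Pearson r)."""
    y = np.asarray(maturity, dtype=float)
    scores = {}
    for name, values in features.items():
        r = np.corrcoef(np.asarray(values, dtype=float), y)[0, 1]
        scores[name] = float(r ** 2)
    return scores


# Hypothetical values: a feature that falls as maturity rises still scores high.
example = {"CVB": [4.0, 3.0, 2.0, 1.0], "noise": [1.0, -1.0, 1.0, -1.0]}
print(feature_r2(example, [10.0, 20.0, 30.0, 40.0]))
```

Squaring the correlation removes its sign, so negatively correlated features such as those discussed above can still rank highest.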

Estimation of maturity using all features
Based on all 22 color features extracted from each paddy panicle image, the prediction model was established using the RF algorithm. The performance of the RF model based on all the features for each rice variety is shown in Table 4. As shown in Table 4 and Figure 5, the R² and RMSE of the full-features model were 0.94 and 1.08% for J816 and 0.95 and 1.43% for 5Y4, respectively. The results indicated that the RF model performed well for the two varieties in the prediction of rice maturity. However, the differences in the performance metrics (R² and RMSE) between the training set and test set indicated overfitting, which suggests that the model with all the features may be less stable on new samples.

Estimation of maturity using optimal features
Although the RF model based on all the features achieved good performance, many features made the model inefficient and unstable. Therefore, selecting the optimal features to develop the model is necessary. In this study, stepwise regression with a significance level of 0.05 for entry and 0.1 for exclusion was applied to select optimal features as input variables for the RF model. As shown in Table 5, the selected optimal features were CVB, CVG, and H_std for J816 and B_std, H_std, and S_mean for 5Y4. The R² and RMSE of the optimal-features model were 0.93 and 1.18% for J816 and 0.94 and 1.60% for 5Y4, respectively (Figure 6). The RF model with optimal features gave similar performance on the training set and test set (Table 5) and performance equivalent to that of the full-features model (Figure 5). This implied that using the selected optimal features could reduce the overfitting problem without reducing the performance of the model.

Validation of the developed models
The best prediction model should have a high R² value, a low RMSE, and minimal variation in these performance metrics between training and testing. Therefore, it is possible to replace the full-features RF model with the optimal-features RF model to predict rice maturity in this study. For this, the optimal-features RF models of the two varieties were validated using paddy panicle images acquired in 2020. As shown in Figure 7, the performance of the model (the R² values were 0.92 and 0.96 and the RMSE values were 1.16% and 1.44% for J816 and 5Y4, respectively) showed that the developed models yielded almost the same performance on the test set (collected in 2019) and on the validation set (collected in 2020). The results imply that the proposed method can reliably predict the rice maturity of J816 and 5Y4 harvested in different seasons.
Vesali et al. [22] developed an Android app for smartphones that estimated the chlorophyll content of a corn leaf using linear (regression) and neural network models based on various color features extracted from digital images. The app estimation compared well with the corresponding soil plant analysis development (SPAD) meter values (R² = 0.88 and 0.72 and RMSE = 4.03 and 5.96 for the neural network and linear model, respectively). Haw et al. [32] proposed a method using the hue color of florets extracted from digital images to predict paddy maturity. They used a linear regression model with only the hue feature as input and reported that the R² values for grain hue at the terminal, middle, and basal parts were 0.77, 0.91, and 0.96, respectively. As ripeness gradually progresses from terminal to basal, the distribution of colors at each part of the panicle is uneven. This method of color extraction, in addition to the effects of ambient conditions, results in random and inconsistent maturity prediction. Compared to these, our results indicated that the proposed method in this study can be considered a rapid and low-cost alternative for estimating paddy maturity in the field, especially when there is a demand for high availability.
Nondestructive determination of crop maturity is important for acquiring timely and inexpensive information regarding the optimum harvest time and crop management. In this study, we used image processing techniques to predict maturity in two high-yield rice varieties (J816 and 5Y4) to determine the best harvest time. We collected the panicle images with a scanner and extracted various color features to improve the accuracy and stability of the prediction results. Unlike image acquisition with a camera, scanner imaging needs no additional attachments (e.g., an illumination system and frames) to acquire high-quality panicle images for further analysis. An additional benefit of scanner imaging is that the panicle image has a relatively consistent background, which is convenient for segmenting the background using image processing techniques. However, the proposed method for predicting maturity was validated only for J816 and 5Y4 paddy rice, and the developed model may not be suitable for other paddy varieties. Further research is needed to collect more samples from different growing regions and crop years to develop a general model that could be broadly suitable for paddy rice varieties.

Conclusion
In this study, we proposed a method that uses digital imaging and an RF regression model to predict paddy rice maturity during the maturation period. Color images of paddy panicles of two rice varieties (J816 and 5Y4) were acquired by a flatbed scanner during the maturation period. A total of 22 color features were extracted from the paddy panicle images. Stepwise regression was used to select optimal features as the inputs for the RF regression model. The R² and RMSE of the RF model with optimal features were 0.93 and 1.18% for J816 and 0.94 and 1.60% for 5Y4, respectively. Compared with the full-features model, the optimal-features RF model yielded equivalent performance without overfitting the data. The validation of the optimal-features model confirmed that it is robust for J816 and 5Y4 rice varieties harvested in different seasons. The digital imaging estimation compared well with the corresponding observed maturity (the R² and RMSE were 0.92 and 1.16% for J816 and 0.96 and 1.44% for 5Y4).
The proposed method in this study can be a practical low-cost alternative for measuring paddy maturity. In addition to the lower cost, this method has several other advantages, such as rapid and robust prediction without sample preprocessing. Moreover, in comparison with conventional computer vision systems, which usually include an illumination system and image acquisition box, the proposed method requires only a flatbed scanner, which provides technical support for the harvesting-storage management of high-quality rice in field conditions. However, further research is required to validate the performance and robustness of the method for various rice varieties from different growing regions and crop years.