Modelling populations of Lygus hesperus on cotton fields in the San Joaquin Valley of California: the importance of statistical and mathematical model choice

ABSTRACT Understanding the population dynamics of herbivorous insects is critical to developing and implementing effective pest control protocols. In the context of inverse problems, we explore the dynamic effects of pesticide treatments on Lygus hesperus, a common pest of cotton in the western United States. Fitting models to field data, we explore the topic of model selection for an appropriate mathematical model and corresponding statistical models, and use techniques including ANOVA-based model comparison tests and residual plot analysis to make the best selections. In addition, we explore the topic of data information content: in this example, we ask whether the data, as currently collected, can support time-dependent parameter estimation. Furthermore, we investigate the statistical assumptions often haphazardly made in the process of parameter estimation and consider the implications of unfounded assumptions.


Introduction
It has long been understood that solving problems in applied ecology often relies on an accurate understanding of population dynamics [19,24]. When addressing questions in fields ranging from conservation science to agricultural production, ecologists often collect time-series data in order to better understand how populations behave when subjected to abiotic or biotic disturbance [11,12,29]. Furthermore, in many cases, the development and analysis of mathematical models can help make sense of time-series data as well as predict future population responses to ecological drivers. Fitting models to data, which requires a broad understanding of both statistics and mathematics, is thus an important component of understanding pattern and process in population studies. In agricultural ecology, pesticide disturbance may disrupt predator-prey interactions [35,36] as well as impose both acute and chronic effects on arthropod populations [17,26,39]. In the past two decades, the focus of many studies of pesticide effects on pests and their natural enemies has shifted away from static measures such as the LC50, emphasizing instead population-level metrics and outcomes [20-22,34,37]. Meanwhile, decades of field studies have generated time-series data aimed at assessing the effects of pesticides on arthropods at the population level [10,23,27,30,33,35,41]. When working with real data, one must consider the strength, or information content [8], of the data set: that is, how strongly one can carry out model validation given a particular data set. There are many ways to quantify the information content of a data set, including the use of sensitivities, the Fisher Information Matrix, and Akaike Information Criteria methods [4,8,9,13,38]. Simple mathematical models, parameterized with field data, are then often used to predict the consequences of increasing or decreasing pesticide exposure in the field.
Accuracy in parameter estimation and in fitting models to data, which has received increasing attention in ecological circles [25,28], depends critically on appropriate model selection. In all cases, this includes optimal selection of both the statistical and the mathematical models fit to data - something that is not always explicitly addressed in the ecological literature. We address this gap using data from pest population counts of Lygus hesperus Knight (Hemiptera: Miridae) feeding on pesticide-treated cotton fields in the San Joaquin Valley of California [31]. In particular, we fit a simple mathematical model to L. hesperus counts and consider two statistical models: one assuming absolute error and one assuming relative error. We carry out model selection between two nested mathematical models, and compare the outcomes when assuming absolute error versus relative error.

Data
Our database consists of approximately 1500 replicates of L. hesperus density counts, using sweep counts, in over 500 Pima or Acala cotton fields during 2004-2008. In each replicate, data were collected by one of four pest control advisors (PCAs) between June 1 and August 30. Although data collection schedules varied between PCAs and between replicates, fields were sampled roughly 1-3 times per week, at irregular times during these summer months. Some PCAs chose to count both nymph and adult L. hesperus, while others simply counted the total number of L. hesperus caught in the nets. In addition, the replicates varied in the presence or absence of chemical treatments, as well as in the frequency, schedule, and variety of pesticide applications, ranging from zero to six applications in one season. However, the netting effort was standardized across all PCAs and all replicates.
Within the entire database, we consider a particular replicate, which we will denote as replicate 1, that was treated with pesticides three times intermittently between 1 June and 30 August 2007. The pesticides (and the associated targets) used on this replicate are summarized in Table 1. Note that in addition to the target types listed, additional chemicals may have been applied, such as surfactants, plant nutrients, and adjuvants. Although it is likely that these chemicals have little effect on L. hesperus, effects are still possible; however, they will be ignored in our analysis.

Statistical models
We now describe the statistical model in terms of the mathematical model and the collected data. Let x denote the N-dimensional vector of state variables (x in the scalar case), with θ = [q, x_0] (if the initial conditions x_0 are unknown) denoting the parameters to be estimated in the ordinary differential equation (ODE) system (the mathematical model)

dx/dt = g(t, x(t); q), x(t_0) = x_0. (1)

In many cases, not all variables are observed in data collection, so we define the m-dimensional observation process f(t; θ) = Cx(t; θ), with m ≤ N. Assume we have n data points y_j at discrete time points {t_j}_{j=1}^n, so that f(t_j; θ) = Cx(t_j; θ) for j = 1, . . . , n. We now define the statistical model: a quantified relationship between the observed model output and the raw data; equivalently, one can think of this as a description of the measurement/observation error. First consider an absolute error model

Y_j = f(t_j; θ_0) + E_j, j = 1, . . . , n, (2)

with realization

y_j = f(t_j; θ_0) + ε_j, (3)

where θ_0 is the p × 1 'true' parameter vector, which we assume exists. Observe that f(t_j; θ_0) is completely deterministic, so the randomness of the m × 1 vector Y_j is due to the m × 1-dimensional random error E_j (for j = 1, . . . , n). We assume that E_j, j = 1, . . . , n, has zero mean and constant covariance matrix var(E_j) = V_0 = diag(σ_{0,1}², . . . , σ_{0,m}²) (where the σ_{0,i}² are constants). Error of this kind is often described as i.i.d.: independent and identically distributed. Let Ȳ be the m × n matrix whose n columns are the m × 1 random vectors {Y_j}_{j=1}^n; that is, Ȳ = [Y_1, Y_2, . . . , Y_n]. One seeks to estimate the unknown parameters by minimizing the least squares discrepancy between model output and data; in other words, one must minimize the ordinary least squares (OLS) cost functional

J_n(Ȳ, θ) = (1/n) Σ_{j=1}^n |Y_j − f(t_j; θ)|². (4)

Because {Y_j}_{j=1}^n are random vectors, one must define the estimator

θ_OLS^n = arg min_{θ ∈ Q} J_n(Ȳ, θ) (5)

and the corresponding estimate θ̂^n := θ̂_OLS^n = arg min_{θ ∈ Q} J_n(ȳ, θ).
That is, θ̂^n is the best OLS estimate of θ_0. This absolute error model is often haphazardly, and incorrectly, assumed in both maximum likelihood and least squares optimization methods. (In choosing an optimization method, one should recall that maximum likelihood methods also require an assumption about the underlying error probability model.) The modeler can carefully choose this error model to allow for some type of relative error [2,4] if such a model can be correctly identified. Thus the modeler should also (in the absence of specific information on the error process) consider a relative error model of the form

Y_j = f(t_j; θ_0) + f^γ(t_j; θ_0) ◦ E_j, j = 1, . . . , n, (6)

with realization

y_j = f(t_j; θ_0) + f^γ(t_j; θ_0) ◦ ε_j, (7)

where γ is some constant that depends on the given data set. Because f and E_j are both m × 1-dimensional, '◦' denotes component-wise multiplication in Equations (6) and (7). When γ = 0, Equation (6) reduces to Equation (2). In the case of population models, Equation (6) is often the correct choice, representing error with non-constant variance (which may depend on the output function f(t, θ)). One can determine whether a given statistical model is appropriate by examining residual plots (residuals versus time, residuals versus observed output) and visually determining whether they violate the assumption that the residuals are i.i.d. [2]. Typically, this visual investigation is also how one determines the best value of γ. We will consider both statistical models and determine the best choice. For ease of notation, we present the remaining methodology in the context of the scalar problem (N = 1), although all methodology can be easily extended to vector systems. Therefore, when referring to the random observation variable, we will now use the notation

Y_j = f(t_j; θ_0) + f^γ(t_j; θ_0) E_j, (8)

where Y_j is the one-dimensional version of the previously defined Y_j.
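To make the two error structures concrete, the following minimal Python sketch (illustrative only; the exponential form of f, the parameter values, and the noise level are assumptions, not the paper's) simulates data under the scalar relative error model (8) and evaluates the OLS cost functional (4):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar model output f(t; theta) with theta = (eta, x1).
def f(t, eta, x1):
    return x1 * np.exp(eta * t)

# OLS cost functional J_n(y, theta) = (1/n) * sum_j (y_j - f(t_j; theta))^2  (Eq. 4).
def J_ols(y, t, eta, x1):
    return np.mean((y - f(t, eta, x1)) ** 2)

t = np.linspace(0.0, 3.0, 30)
eta0, x10, gamma, sigma = 1.0, 2.0, 0.85, 0.05

# Realization of the relative error model (8): y_j = f_j + f_j^gamma * eps_j.
fj = f(t, eta0, x10)
y = fj + fj**gamma * rng.normal(0.0, sigma, size=t.size)

# The cost at the true parameters reflects only the noise; a perturbed
# parameter vector produces a much larger cost.
print(J_ols(y, t, eta0, x10), J_ols(y, t, 1.3, x10))
```

Note that under (8) the variance of y_j grows with the model output, which is exactly the pattern the residual plots below are used to detect.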

Residual plots and generalized least squares
Residual plots are powerful in their ability to illuminate incorrect assumptions regarding observation error. One can examine the residual plots versus time and model output, respectively, and determine whether the scatter of the error violates the statistical assumptions [2]. This simple process can often prevent a crucial mistake: adopting an incorrect statistical model, for example one assuming absolute error when the error is relative. If one finds that the relative error statistical model (6) is most suitable, for some γ, one must replace the cost functional (4) with a generalized least squares (GLS) formulation [2,4,14,16,32]

J_n(Y, θ) = (1/n) Σ_{j=1}^n ω_j |Y_j − f(t_j; θ)|²,

with weights ω_j = ω_j(θ) = f^{−2γ}(t_j; θ), j = 1, . . . , n. Consequently, one must redefine the estimator

θ_GLS^n = arg min_{θ ∈ Q} J_n(Y, θ)

with corresponding estimate θ̂^n := θ̂_GLS^n = arg min_{θ ∈ Q} J_n(y, θ).
We assume an error model in the form of (8); more general error models for GLS methods can be found in [3,4,14,16,32]. The GLS estimates θ̂^n and estimated weights {ω_j(θ̂)}_{j=1}^n are found using a standard iterative method [2,4,14,16,32], given below. For ease of notation, we suppress the sample size superscript n (i.e. θ̂_GLS := θ̂_GLS^n) in describing the iterative method.
(1) Obtain an initial estimate θ̂_GLS^(0) by minimizing the OLS cost functional (4). Set k = 0.
(2) Form the weights ω̂_j = f^{−2γ}(t_j; θ̂_GLS^(k)), j = 1, . . . , n.
(3) Holding the weights {ω̂_j} fixed, re-estimate the parameters by minimizing Σ_{j=1}^n ω̂_j (y_j − f(t_j; θ))² over θ ∈ Q to obtain θ̂_GLS^(k+1).
(4) Set k := k + 1 and return to step 2. Terminate when two successive estimates for θ̂_GLS are sufficiently close.
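The iterative scheme above can be sketched in Python (the paper's computations use MATLAB); the exponential model, parameter values, and convergence tolerance below are illustrative assumptions, with only the growth rate estimated:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Hypothetical scalar model f(t; eta) = x1 * exp(eta * t); x1 held fixed.
x1, gamma = 2.0, 0.85

def f(t, eta):
    return x1 * np.exp(eta * t)

t = np.linspace(0.0, 3.0, 40)
f_true = f(t, 1.0)
y = f_true + f_true**gamma * rng.normal(0.0, 0.05, size=t.size)

def weighted_cost(eta, w):
    return np.mean(w * (y - f(t, eta)) ** 2)

# Step 1: initial OLS estimate (unit weights), k = 0.
eta_hat = minimize(weighted_cost, x0=0.5, args=(np.ones_like(t),)).x[0]

# Steps 2-4: form weights w_j = f^(-2*gamma)(t_j; eta_hat), re-estimate with
# the weights held fixed, and iterate until successive estimates agree.
for _ in range(20):
    w = f(t, eta_hat) ** (-2.0 * gamma)
    eta_new = minimize(weighted_cost, x0=eta_hat, args=(w,)).x[0]
    if abs(eta_new - eta_hat) < 1e-8:
        eta_hat = eta_new
        break
    eta_hat = eta_new

print(eta_hat)  # close to the true growth rate 1.0
```

In practice the weights typically stabilize within a few iterations, since each re-weighting changes the estimate only slightly.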

Mathematical models
In previous modelling attempts of the L. hesperus population in untreated fields, an exponential model adequately described the total population growth [9]. However, manipulations of the system, such as pesticide applications, time-varying presence of predators or prey, or resource changes, can be represented mathematically with time-varying parameters. We now consider a modified model incorporating time-dependent growth due to chemical applications. Let j* denote the number of pesticide applications in a data set, and let model A be

dx/dt = k(t)x, x(t_1) = x_1,

where t_1 is the time of the first data point and k(t) is the time-dependent growth rate

k(t) = η + p(t) for t ∈ P_j,
k(t) = η otherwise,

where p(t) is described below, and P_j = [t_{p_j}, t_{p_j} + 1/4], j = 1, . . . , j*, with t_{p_j} the time point of the jth pesticide application. Note that |P_j| = 1/4, which is approximately the length of one week when t is measured in months. This reflects the general assumption that pesticides and herbicides are most active during the 7 days immediately following treatment. Clearly, η is the constant growth rate of the total population in the absence of pesticides. In addition, t = 0 refers to June 1 (as no data are present before June 1 in our database).
We use linear splines to approximate p(t) as follows. Consider s linear splines

p(t) = Σ_{i=1}^s λ_i l_i(t),

where

l_i(t) = (t − t_{i−1})/h for t ∈ [t_{i−1}, t_i],
l_i(t) = (t_{i+1} − t)/h for t ∈ [t_i, t_{i+1}],
l_i(t) = 0 otherwise,

and the evenly spaced spline functions are centred over nodes {t_i} given by t_{p_j} : |P_j|/(s + 1) : t_{p_j} + 1/4, with step size h = t_i − t_{i−1}. For our analysis with this model, to impose continuity of k(t), we do not include the splines centred over the 'end' nodes, that is, t = t_{p_j} and t = t_{p_j} + 1/4. Linear spline approximations are simple yet flexible, in that they allow the modeler to avoid assuming a particular shape for the curve being approximated. Incorporating a time-dependent parameter such as k(t) is useful when modelling a system with discontinuous perturbations (such as the removal of a predator, or the application of an insecticide). The addition of more splines (s > 3) provides a finer approximation, but demands more terms in the parameter estimates. We conjecture that s = 3 (excluding splines centred at interval endpoints) is likely sufficient. An example of three linear splines over the interval [0, 0.25] is given in Figure 1(a), and a sample function p(t) for some chosen values of {λ_i}_{i=1}^3 is pictured in Figure 1(b). Now consider model B:

dx/dt = ηx,

where x_1 and η are defined as above. Note that model B is equivalent to model A under the constraint p(t) ≡ 0, that is, λ_i = 0 for i = 1, 2, 3. This is therefore a case of nested models, and an ANOVA model comparison test is applicable to determine whether a time-dependent growth parameter is not only appropriate for describing these population dynamics, but can also be estimated given the information content of the data. In our analysis of models A and B, we first estimated the initial condition x_1 using model B (as this data point precedes any pesticide applications and provides a better estimate for x_1), and then fixed this parameter in all subsequent parameter estimates. Therefore, the parameters to be estimated are θ = q = [η, λ_1, λ_2, λ_3]^T.
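The spline construction can be sketched as follows (a Python illustration with s = 3 and a hypothetical treatment window starting at t_p = 0; the coefficient values are illustrative, not estimates). The node spacing follows the t_{p_j} : |P_j|/(s + 1) : t_{p_j} + 1/4 rule, and the end-node splines are omitted so that p vanishes at the window endpoints, preserving continuity of k(t):

```python
import numpy as np

# Interior hat (linear spline) basis over [tp, tp + 1/4] with s = 3.
tp, s = 0.0, 3
nodes = np.linspace(tp, tp + 0.25, s + 2)   # tp : |P_j|/(s+1) : tp + 1/4
h = nodes[1] - nodes[0]                     # even step size

def hat(i, t):
    """Linear spline l_i centred at interior node i (i = 1, ..., s)."""
    left = (t - nodes[i - 1]) / h           # rising edge on [t_{i-1}, t_i]
    right = (nodes[i + 1] - t) / h          # falling edge on [t_i, t_{i+1}]
    return np.maximum(0.0, np.minimum(left, right))

def p(t, lam):
    # p(t) = sum_i lambda_i * l_i(t)
    return sum(lam[i - 1] * hat(i, t) for i in range(1, s + 1))

lam = [-2.0, -3.0, -1.0]                    # illustrative (negative) coefficients
print(p(nodes[2], lam))                     # at the centre node, p equals lambda_2: -3.0
print(p(tp, lam), p(tp + 0.25, lam))        # zero at both window endpoints
```

Because each basis function equals one at its own node and zero at all others, the coefficient λ_i is simply the value of p at node i, which makes the estimated coefficients directly interpretable as growth-rate depressions.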

Model comparison test
We now present the residual sum of squares ANOVA-type model comparison test based on confidence levels [14,16,32] as developed in [1], described in detail in [2,4] and extended in [5] to GLS problems. This test is used to determine which of several nested models is the best fit to the data; therefore, this test can be applied to the comparison of models A and B. While Akaike Information Criteria methods are commonly known to incorporate both model fit and model complexity into quantifying a model selection score [38], this ANOVA-based model comparison test for nested models indirectly incorporates the number of model parameters into the resulting test statistic, as will be seen below. Again, for ease of notation and because we are applying this test to the comparison of one-dimensional models, we will present everything in the scalar case (although it can be applied to a general vector system).
Assume we have a mathematical model f(t, θ) and n observations Y = {Y_j}_{j=1}^n with realizations y = {y_j}_{j=1}^n, and assume the statistical model takes the form of (2) or (6) (absolute or relative error, respectively). Here we describe the methodology in the context of absolute error (OLS optimization), but the test extends readily to relative error and GLS optimization [4,5].
Let Q denote the admissible parameter set, where Q is a compact subset of R^p, with θ_0 ∈ int(Q). Recall that we define the OLS estimator θ_OLS^n = arg min_{θ ∈ Q} J_n(Y, θ). It is useful to test whether θ_0 ∈ Q_H, where Q_H is some particular subset of Q of the form

Q_H = {θ ∈ Q : Hθ = c}, (13)

where H is an r × p matrix of full rank and c is an r × 1 constant vector, whose entries are determined by the problem at hand. We formulate the null and alternative hypotheses

H_0: θ_0 ∈ Q_H, H_A: θ_0 ∉ Q_H.

Defining the constrained estimator θ̂_H^n = arg min_{θ ∈ Q_H} J_n(y, θ), we notice that J_n(y, θ̂_H^n) ≥ J_n(y, θ̂^n). Next, we define the non-negative test statistic and realization

T_n(Y) = n(J_n(Y, θ̂_H^n) − J_n(Y, θ̂^n)), T̂_n(y) = n(J_n(y, θ̂_H^n) − J_n(y, θ̂^n)).
We then define the test statistic and realization

U_n = T_n(Y)/J_n(Y, θ̂^n), Û_n = T̂_n(y)/J_n(y, θ̂^n).

Under certain assumptions (see [1,4] for details), we have the following conclusions: (1) θ̂^n → θ_0 with probability one as n → ∞; (2) if H_0 is true, U_n → U in distribution as n → ∞, where U ∼ χ²(r), a χ² distribution with r degrees of freedom, r being the number of constraints on the parameter space Q_H, as defined in Equation (13). Now consider two parameters of interest: a threshold τ and significance level α, where for a given τ, α = Prob(U > τ). This statistic relates to our null hypothesis in the following way: if the test statistic Û_n > τ, we reject H_0 as false with confidence level (1 − α) × 100%. Otherwise, we do not reject H_0.
One can use a standard χ² distribution table [18,40] to determine the value of τ given a choice of α appropriate for the data (highly controlled lab data will usually call for a higher confidence level than field data). We note that, as explained in [4, p. 149], an equivalent formulation for hypothesis testing uses the concept of p-values to reject or not reject H_0. The minimum value α* of α at which H_0 can be rejected is called the p-value. Thus, the smaller the p-value, the stronger the evidence in the data in support of rejecting the null hypothesis.
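For example, the threshold and p-value can be computed directly rather than read from a table (a Python/scipy sketch; the statistic Û_n = 7.59 is the value reported for this data set in the results below):

```python
from scipy.stats import chi2

# Threshold tau for significance level alpha with r constraints: alpha = Prob(U > tau).
r, alpha = 3, 0.1
tau = chi2.ppf(1.0 - alpha, df=r)
print(round(tau, 3))            # 6.251, the chi^2(3) table value used in the results

# The p-value of an observed statistic is the minimum alpha at which H0 is rejected.
U_hat = 7.59
p_value = chi2.sf(U_hat, df=r)
print(U_hat > tau, round(p_value, 3))
```

Here `chi2.sf` is the survival function Prob(U > Û_n), so the printed p-value is the smallest significance level at which the null hypothesis could be rejected.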
We draw the reader's attention to the second conclusion above, where we see how this model comparison test addresses model complexity. In this example, we compare models A and B, with p = 4 and p = 1, respectively, which implies that we have r = 3 constraints. If r were to increase (i.e. if we were to consider a more complicated model with more parameters to be compared with either model A or B), U_n would converge to a χ² distribution with greater degrees of freedom, in turn increasing our threshold τ. In this way, this model comparison test incorporates model sophistication (i.e. the number of additional parameters to be estimated) into model selection.

Statistical models and residual plots
Using residual plots, we determine whether an absolute or relative error statistical model best suits our data. We first consider absolute error and present the residual plots versus time and model output in Figure 2(a) and 2(c), respectively, where we see a clear right-opening horn shape. This violates the assumption that the error terms E_j, j = 1, . . . , n, are i.i.d. with constant variance.
Next, we choose a relative error model and display the residual plots versus time and model output, respectively, for model A using γ = 0.85 in Equation (6) (see Figure 2(b) and 2(d)). We choose γ by searching for the value that consistently produced i.i.d. residual plots over a range of parameter estimates. In our computations, γ = 0.85 produces these results for both models A and B. Although we present only the residual plots for model A, the plots for model B exhibit very similar traits (see [7]). The reader may compare Figure 2(a) and 2(c) with Figure 2(b) and 2(d), respectively, to visualize the clear difference between the absolute and relative error statistical models. From these results, we assume the statistical model (6) with γ = 0.85 for all subsequent analysis. When implementing the GLS method, we terminate the iteration when successive estimates satisfy max|θ̂_GLS^(k+1) − θ̂_GLS^(k)| < tol for a prescribed small tolerance tol, where max| · | is the max-norm for vectors.
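The horn-shaped versus i.i.d. residual patterns can also be checked numerically. The following Python sketch (with an assumed model output and noise level, not the field data) compares the spread of raw residuals with residuals scaled by f^γ:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical fitted output and data generated with relative error (gamma = 0.85).
t = np.linspace(0.0, 3.0, 200)
fhat = 2.0 * np.exp(1.0 * t)
y = fhat + fhat**0.85 * rng.normal(0.0, 0.05, size=t.size)

# Raw residuals fan out with f (the 'horn' shape); modified residuals
# r_j / f^gamma(t_j) should look i.i.d. when gamma is chosen correctly.
raw = y - fhat
modified = raw / fhat**0.85

def spread(r):
    # residual standard deviation in the first and last quarter of the record
    n = r.size // 4
    return np.std(r[:n]), np.std(r[-n:])

early_raw, late_raw = spread(raw)
early_mod, late_mod = spread(modified)
print(late_raw / early_raw)   # >> 1: variance grows with model output
print(late_mod / early_mod)   # ~ 1: roughly constant variance
```

Comparing early-record to late-record spread is a crude numerical stand-in for the visual inspection of Figure 2: a ratio far from one flags non-constant variance, while a ratio near one is consistent with i.i.d. residuals.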

Mathematical model and parameter estimation
In order to estimate the true parameter vector θ_0 = [η, λ_1, λ_2, λ_3]^T, we find the parameter values which minimize the GLS cost functional. Because of the stiff nature of the linear spline approximations, we utilize MATLAB's ODE solver ode15s. We consider both unconstrained and constrained optimization techniques for parameter estimation. In the case of unconstrained optimization, Q = R^p, whereas for constrained optimization, Q is some strategic subset of R^p, chosen given the nature of the parameters. It is reasonable to presume that λ_i < 0 for i = 1, 2, 3, given the assumption that pesticide applications slow population growth. While non-negative values for λ_i are not impossible (as in the case of hormesis [15]), they are generally unexpected. Therefore, to carry out constrained optimization, we let Q = R × [−∞, ε]^3, using an upper bound of some small positive value ε rather than 0, to comply with an assumption of the model comparison test [1] and to permit mild hormetic effects. In practice, it is reasonable to further constrain the λ_i from below by some positive finite number K; based on the estimates we found with a variety of initial guesses, we let K = 20 (that is, λ_i ∈ [−K, ε]) in our constrained optimization search. In order to determine the best method, we also estimate parameters and calculate confidence intervals for each parameter using bootstrapping, which allows one to examine the underlying distribution of each parameter (see [6] for details concerning both the theory and the algorithm). For three of the four parameters, the confidence intervals are significantly narrower when using constrained optimization than when using bootstrapping with unconstrained optimization. Here, we estimate parameters and perform the model comparison test using MATLAB's constrained optimizing function, fmincon.
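A simplified analogue of this constrained search, using Python's scipy rather than fmincon, might look like the following. This toy version estimates a growth rate and a single pesticide-effect coefficient (a stand-in for the λ_i) with a bound constraint of the form [−K, ε]; all values, the single treatment window, and the crude left-Riemann ODE solution are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Illustrative constraint values: lam confined to [-K, eps], eta to a plausible range.
K, eps, gamma = 20.0, 1e-3, 0.85
t = np.linspace(0.0, 1.0, 60)
window = (t >= 0.5) & (t <= 0.75)          # one hypothetical 'pesticide week'

def model(theta):
    eta, lam = theta
    k = eta + lam * window                  # piecewise growth rate k(t)
    # x(t) = x1 * exp(integral of k), approximated by a left Riemann sum
    return 2.0 * np.exp(np.cumsum(np.r_[0.0, k[:-1] * np.diff(t)]))

theta_true = np.array([2.0, -4.0])
f_true = model(theta_true)
y = f_true + f_true**gamma * rng.normal(0.0, 0.02, size=t.size)

def cost(theta):
    # GLS-type cost with relative-error weights f^(-2*gamma)
    fj = model(theta)
    return np.mean((y - fj) ** 2 / np.maximum(fj, 1e-8) ** (2 * gamma))

# Bounded search, analogous in spirit to MATLAB's fmincon.
res = minimize(cost, x0=np.array([1.0, 0.0]),
               bounds=[(0.0, 5.0), (-K, eps)], method="L-BFGS-B")
eta_hat, lam_hat = res.x
print(eta_hat, lam_hat)
```

The upper bound ε on the pesticide coefficient plays the same role as in the text: it keeps the true constrained value in the interior of the admissible set while still permitting only mild positive (hormetic) effects.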

Model comparison results
We perform the model comparison test with r = 3 degrees of freedom and p = 4 parameters, using constrained optimization. We report the results using the relative error statistical model (6), and subsequently the incorrect statistical model (2) for the sake of comparison.
In the computation of both optimization searches, we use ode15s to solve model A and ode45 to solve model B (due to the stiff nature of model A and the non-stiff nature of model B). Consider Q as defined above (with optimization constraints) and define

Q_H = {θ ∈ Q : Hθ = 0},

where H = [0_{3×1} I_3] is the 3 × 4 matrix selecting the spline coefficients. In other words, Q_H is the subset of Q such that λ_i = 0 for i = 1, 2, 3.
In comparison to highly controlled experimental data, ecological field data are more strongly affected by environmental factors, including weather changes and cross-field migration effects, none of which are incorporated into models A and B. Therefore one may argue that a larger significance level (e.g. α = 0.1) is a reasonable choice. Using a χ²(3) distribution table, we identify the corresponding threshold τ: given r = 3 and α = 0.1, τ = 6.251. The results are summarized in Table 2, where 'Conf' represents the confidence with which the null hypothesis can be rejected. As noted above, one can equivalently carry out this test using p-values (see [4] for details). One can see that Û_n = 7.59 > τ; therefore we can reject the null hypothesis with at least 90% confidence. In addition, we report the parameter estimates θ̂^n and θ̂_H^n in Table 3, as well as the model fits to data for models A and B in Figure 3(a) and 3(b), respectively.
Table 3. Parameter estimates over admissible parameter spaces Q and Q_H, using the (best) relative error model and GLS.

We also present in Table 4 the results obtained when using the incorrect statistical model. When assuming absolute error, we find Û_n = 1.46 < τ; therefore we fail to reject the null hypothesis. As discussed below, this leads to a faulty conclusion regarding our primary goal in this study: assessing the statistical significance of pesticide effects on L. hesperus populations, and reflecting this in the best choices of both statistical and mathematical models.

Discussion
It is difficult to overemphasize the importance of carefully choosing the best statistical model. In ecology and many other disciplines, OLS (with an absolute error model) is a common method of parameter estimation, but GLS with a relative error statistical model is a better choice in many situations. A relative error model is not always best, and often an assumption of absolute error is sufficient. However, the statistical model will not only determine the method used to obtain the parameter values, but will also affect every subsequent area of analysis, including model comparison results. Therefore, correct identification of the statistical model is imperative. In addition to the residual plot methods seen above, other methods are being explored to address this topic, including a technique that does not require prior parameter estimation [3]. In this case study, when incorrectly assuming an absolute error statistical model, the model comparison test reported only 31% confidence to reject the null hypothesis, far below our desired threshold, so we fail to reject the null hypothesis. Based on these results, we could make one of the following incorrect conclusions: (a) the data do not support the estimation of time-dependent parameters, or (b) model B best fits the data because pesticides did not have a significant effect on the population growth of L. hesperus. However, it is clear that a relative error statistical model is the more accurate choice in this example, and the model comparison test results drive us toward the opposite conclusions: the data can support the estimation of time-dependent parameters, and a model incorporating the effects of pesticides does provide a statistically significantly better fit to the data than a simple exponential model. These conclusions are supported by the model fits presented in Figure 3.
Although the exponential model does pick up the general increasing exponential nature of the data (see Figure 3(b)), the effects of pesticides are clearly non-trivial and should be represented in the mathematical model (see Figure 3(a)) and should not be overlooked, as in model B, in this case study.
In short, the statistical model defines the cost functional based on an assumed error structure, which in turn defines the parameter estimates. All subsequent analyses, including mathematical model selection, sensitivity analysis, and other common studies, are heavily influenced by the assumptions of the statistical model describing the underlying error. Computationally, using GLS with a relative error model requires only minimally more effort than OLS. Therefore, we stress that it is not only imperative but also easy to incorporate this important step into the modelling process.
In addition to statistical model selection, we also draw attention to the consideration of time-dependent parameters. Using flexible methods like linear splines to approximate unknown effects caused by manipulations to the system can provide better model fits to data. Use of time-dependent parameters can be considered for many ecological problems, especially those with mid-season changes, including, but not limited to, known migration patterns, addition or removal of some species, chemical changes, non-constant breeding, and distinct weather changes.
We may consider several different approaches toward expanding on the current results. First, we recommend repeating the preceding analysis on other replicates, including those with 1, 2, or 4 pesticide applications, to produce a more robust study of the information content of our larger database. In other words, it would be a beneficial exercise to determine whether there is a correlation between parameter estimation accuracy/data information content (see also [8] for examples) and frequency of pesticide application. In addition, we would like to consider a 2-D compartmental model, as in previous research [9], and use the model comparison test to determine whether the data can support time-dependent parameters for each age class. This could not only shed light on the class-specific effects of pesticides (on both nymphs and adults), but could also provide further insight into whether distinguishing between nymph and adult age classes during data collection is beneficial to understanding population dynamics in the case of pesticide-treated fields. Moreover, we are exploring other methods of verifying the correct statistical model, in model misspecification studies that do not rely on prior computation of the inverse problem. Lastly, we would like to use sensitivity analysis to determine the sensitivity of the model output (population projection) to perturbations in parameter values. This could provide insight into the subtle effects of pesticide efficacy on L. hesperus population growth.