A Statistical Technique for Modelling Dissolved Oxygen in Salt Lakes

Abstract An integrated linear regression technique was developed and validated to model dissolved oxygen in salt lakes by using R software and based on data from Sawa Lake, Iraq. The technique helps understand and evaluate salty aquatic ecosystems in the presence of water quality data gaps. The technique involved selecting the important water quality parameters that have significant statistical relationship with dissolved oxygen. In order to make the regression development simpler, the validation approach was incorporated with the model. Linearity, homogeneity, normality, outliers, and influential data points were verified. The simulation approach was also capable of displaying the interaction between the selected water quality parameters and the other insignificant parameters. The statistical analysis results indicated that dissolved oxygen in salt lakes is a function of total dissolved solids. The developed model represented dissolved oxygen with R-squared of 90.73% and p-value of 1.08E-06. Furthermore, the model results showed that the influence of salty ions on dissolved oxygen/total dissolved solids model is the same. It was found that temperature has a significant impact on the developed dissolved oxygen model. In addition, the model simulation revealed that salt melting surrounding the lake due to temperature variation during the year cycle increased total dissolved solids in the lake water. Consequently, the lake dissolved oxygen levels were impacted.


PUBLIC INTEREST STATEMENT
In future, salt lakes will be considered one of the main water resources in the underdeveloped and developed countries due to water issues. These lakes are not protected well, and need to be preserved for future human essentials. In this work, the implemented statistical modelling technique highlights a visual and quantitative way to be used for salt lakes protection. Since dissolved oxygen is a measure of lake health, the proposed model is capable of simulating dissolved oxygen in salt lakes by using total dissolved solids only. Integrated water quality data are not available for all lakes, especially in poor countries. Using the current statistical technique helps researchers monitor their public water resources in case of data gap or insufficient data. Hence, the present paper revealed insight of the issue and developed the solution for the public necessities.

Introduction
Salt or saline lakes are interesting inland features and need to be taken into account in water quality modelling since many water quality parameters (environmental variables) can be changed due to environmental conditions such as dissolved oxygen, water temperature, and even salt compositions. Similarly, climate change and global warming can affect salt lakes ecosystems (Hussain & Abed, 2019;Tweed et al., 2011). Salinity impacts the water cycle by affecting the evaporation process in lakes (Obianyo, 2019). Unfortunately, very few water quality studies have been conducted on salt lakes compared to freshwater lakes even though inland salt water has approximately the same volume as freshwater rivers and lakes (Last, 2002).
It is not surprising that salt lakes in underdeveloped countries have even less attention than in developed countries, especially those countries that have problems with water resources and less data to work on. One of these salt lakes is Sawa Lake, located in Iraq about 26.74 km west of Samawah City (N 31°18´ and E 45°00´). It has almost no attention and no monitoring system, and it is not protected. Concerns that future changes of climate or management may affect the lake water quality have promoted the need for modeling the lake for future predictions. Shope and Angeroth (2015) highlighted the need for more monitoring of dissolved oxygen in salt lakes statistically. Salt lakes represent about 45% of the total inland lake volume (Shiklomanov, 1990), and they are used as a source for water supply especially in regions of dry/hot weather and low groundwater level. Sawa lake is a good example for this. The lake significance is that its located is within a desert area where people have used wells for water supply. This concern makes salt lakes valuable for humans. Sawa Lake is located within a gypsum region, and it does not have any inflows or outflows. The only ways to gain or lose water are by rainfall, evaporation, or groundwater. Some algae species, phytoplankton, and a few kinds of fish exist (Awadh & Muslim, 2014;Hassan et al., 2006). Hence, fish species need to be preserved in the lake to meet the human essentials (Cordier et al., 2020;Moayedi et al., 2019). The most important element in aquatic ecosystems is oxygen because it is necessary for life, controls many chemical reactions through oxidation, and gives a general indication of aquatic systems health (Cole & Wells, 2006). However, the Sawa Lake water quality monitoring has ignored the role of dissolved oxygen in the lake ecosystem. Questions such as how dissolved oxygen in Sawa Lake relies on other water quality variables statistically have not been answered. In other words, exploring how statistically significant the relationship between dissolved oxygen and the other variables is important. By answering this question, water quality in the lake can be statistically related to easily measured water quality variables to predict the impact on dissolved oxygen. Zhang (2019) illustrated the global trends and the most common approach used in water quality assessment, the Water Quality Index (WQI). For instance, Mohammed and Abdulrazzaq (2018) conducted a case study to assess the water quality using the WQI approach. However, the modelling chosen for this study is a simple statistical model based on field data since the lake has not properly been inspected based on the latest water quality modelling techniques, especially by the statistical methods. Thus, the need to develop a statistical dissolved oxygen model for Sawa Lake arises based on the following issues. First, Sawa Lake location is in Iraq, which is a developed country. Water resources in Iraq are not protected and managed, and need extensive studies in order to re-preserve the aquatic systems. In addition, integrated data that are necessary for developing a full-scale numerical model capable of simulating all of the environmental processes are not available completely. Data such as bathymetry maps, continuous time series for the water quality parameters (historical data), inflows, outflows, and meteorological data are required during the same time period in order to develop a water quality model. Second, there is need to provide a statistically proven and robust technique capable of selecting the most environmental variables that impact dissolved oxygen in salt lakes. Finally, salt lake ecosystems have high salt concentrations with different kinds of ions. Each ion concentration can be explored efficiently by taking the advantages of R software packages (R Core Team, 2015), making the dissolved oxygen depletion in the lake clear by knowing the contribution of each salt ion in the total dissolved solids. Hence, this research outlines an efficient statistical technique for modelling dissolved oxygen by using R software packages. The adopted visual approach is based on statistical hypotheses tests, and results in a linear regression model that governs the relationship between the dominant environmental predictor variables and dissolved oxygen (the dependent variable).
The contribution of this work is to develop and validate a statistical technique capable of determining the relationship between dissolved oxygen and the other water quality parameters for salt lakes by using R software. Therefore, the main objectives of this research are to build a linear regression for modelling dissolved oxygen based on the available water quality parameters from Sawa Lake, valid the model statistically, and visualize the model by showing the interaction between the model and the unimportant water quality parameters.

Study area, data description, and model specification
The Sawa Lake area is approximately 5.1 km 2 , and the bottom elevation is about 17 m above sea level with water depth of about 3.0-5.5 m. The lake water level elevation is higher than the surrounding area level by approximately 2-5 m (Ziyadi et al., 2015). Monthly samples of water were taken from Sawa Lake in 2017 by the Iraqi Ministry of Health and Environment for water quality monitoring purposes. The monitoring system included taking the average of 25 samples every mid-month between 9 and 11 am for evaluating the water quality in the lake. Figure 1 shows the study area and sampling locations used in this study.
Since the aim of this research was to investigate whether a statistically significant linear relationship exists between the response variable (DO) and the other fifteen environmental predictors, DO was considered the dependent variable and the other predictors were the independent variables. The model goal was to describe DO concentration based on the other water quality parameters so that DO in Sawa Lake can be predicted by other parameters. Accordingly, the target linear regression model expresses DO as a function of other water quality predictors that have a significant relationship with DO.

The conceptual framework of the study
To systematically accomplish the data analyses, several regression models needed to be built successively by removing predictor variables that are not important. These analyses included testing the null hypothesis that DO is not affected by the other model variables (DO varies independently of the other model variables) against the alternative hypothesis that DO dependents on the other variables significantly. All statistical analyses were performed in R 3.1.3 (R Core Team, 2015) and based on the statistical analysis in Gotelli and Ellison (2013). The general framework of the analyses was summarized in Figure 2, which is the conceptual model of the study. By running this statistical conceptual model, the relationship between DO and the other water quality predictors was determined whether it is statistically significant or not by exploring and answering the following hypotheses statistically: -Null hypothesis: There is no significant relationship between DO and the other water quality predictors (the linear regression line slope = 0).
-Alternative hypothesis: There is a significant relationship (the linear regression line slope ≠ 0). Initially, the null hypothesis was assumed true to be checked through the model by using statistical tests in order to make the final decision (true or false hypothesis).

The predictor variables selection approach
First of all, and in order to make a decision and decide which predictor variables need to be tested for the linear regression model, the data relationships were explored before building the model. This initial look based on the understanding of the system is an important step prior to developing the linear regression model (Burnham and Anderson, 2002;Zuur et al., 2010).
Figure 3 (a) shows a graphical representation (scatterplot matrix) summarizing the relationships between each pair of variables in the raw data. There are many interesting relationships that can be recognized based on the trend between DO (the response variable) and the other water quality parameters (the predictor variables), but no decision can be made until a statistical evidence exists. In addition, the data were graphically displayed as shown in Figure 3 (b and c), which are correlation matrix plots for the available water quality parameters in two different ways. In both plots, the lower/left panels display the pairwise Pearson correlation coefficients (0-1) between variables, while the upper/right panels use colors to display the coefficients in which the color intensity and the circle size are proportional to the correlation coefficient. Displaying the water quality parameters in such a way helps detect how the parameters are strongly correlated among themselves. The strong correlation between any two predictors means there is collinearity (also called multicollinearity), leading to a non-significant relationship in the target multiple regression model (Gotelli & Ellison, 2013;Zuur et al., 2010Zuur et al., , 2009. Because this research was interested in finding a significant relationship between DO as a response variable and other predictors. Temperature, TDS, and turbidity were assumed to be the main predictors to explain the variation in the response variable DO and to start with developing the regression model because the other predictors are salts and already are accounted for in TDS and Turbidity. This assumption will be proven statistically significant while building the model. Therefore, the correlation between these three predictors was explored. For example, it can be clearly noticed that all predictors have a strong correlation with Temp except Turb with a correlation coefficient of 0.56 (the lowest coefficient). Thus, we expect to get a confusing statistical analysis if the linear regression model was built based on Temp and all other predictors due to these strong correlation coefficients (≥0.74) and a good regression if the linear regression model was built based on Temp and Turb only. Same idea could be applied to TDS and Temp with cautious.
A graphical summary (boxplots) of the three variables is provided in Figure 4 (a), and the corresponding correlation matrix was plotted as shown in Figure 4 (b). Even though Turb data statistically reveal outliers presence, the outliers will be kept because they did not come from errors or mistakes during sample collection such as errors due to parameter measurements or calculations, or any other source that may change the real values. It was found that based on old observations the turbidity in the lake has minimum values in March. Furthermore, to improve the response variable normality due to data spread in addition to reduce the outliers effect on the predictors, the log transformation of the variables DO, Temp, Turb, and TDS was considered in the regression model building. Based on the correlation between the response variable and the predictors as shown in Figure 4 (b), DO has a strong correlation with TDS (a correlation coefficient of 0.96) and has a good correlation with Temp and Turb (a correlation coefficient of 0.8 and 0.7, respectively). For the predictors, TDS has a good correlation with Temp and Turb (a correlation coefficient of 0.74 and 0.72, respectively). Therefore, the presence of TDS with Temp and Turb in the same model could lead to collinearity or non-significant relationship. On the other hand, Temp and Turb could be together in the same model due to the low correlation between them (a correlation coefficient of 0.56).
Therefore, it was necessary to try building the linear regression model based on (trial 1: DO, Temp, and Turb) and (trial 2: DO and TDS) to avoid the multicollinearity issue and to make sure that the predictor variables behave independently. Then, a step-wise forward selection was carried out to include other predictors as long as they can contribute in explaining DO statistically.

The linear regression model development and validation
A full regression model was fitted between the response variable (logDO) and the other water quality predictors (logTemp, logTurb, and logTDS). It was found that keeping TDS leads to a non-significant relationship (the null hypothesis is true). Removing TDS, therefore, adjusted the regression and made the null hypothesis rejected. The statistical summary of the final model is shown in Table 2, after excluding TDS. The regression outputs show the estimated intercept and slopes in addition to the standard errors, t-values and p-values. There is also information on the multiple R-squared, the adjusted R-squared, and F-statistic. The adjusted R-squared was 0.6242, referring to the relationship strength between the response variable and the predictors in which 62.42% of the variance in DO can be explained by Temp and Turb. Setting the typical choice 5% confidence interval (Zuur et al., 2009), the p-value of the intercept and the slopes were less than the 5% significance level, confirming that the null hypothesis (slope = 0) is not true and the relationship between the response variable and the predictors is significant. In addition, the regression as a whole model is performing well since the F-statistic indicated that the p-value is less than the 5% significance level. In other words, the null hypothesis was rejected and the alternative hypothesis was true. As a result, the estimated linear regression model gave the following multiple linear regression model, Eq.
where log is the natural logarithms.
The linear regression of Eq. (1) was based on the trial 1 (the full model includes DO, Temp, and Turb). Doing trial 2, the linear regression analysis gave another robust model (Eq. (2)), see Table 3 for the statistical summary of (Eq. (2)):  (2) Both of the above linear regression models need to be verified to determine whether or not the model that has been built meets the linear regression underlying assumptions. Non-linearity, heteroscedasticity (violation of homogeneity), non-normality, outliers, and influential data points must be tested (Gotelli & Ellison, 2013;Zuur et al., 2016Zuur et al., , 2009).
Hence, missing the important assumption such as the predictor variables interdependency and the residual patterns absence produces biased parameter estimates and type I errors (Quinn and Keough, 2002). In addition, autocorrelation in data should be taken into account by assessing the residuals independence using either or both of the autocorrelation functions and/or variograms (Schabenberger & Pierce, 2001). Therefore, a diagnostic test was conducted to test both models. (1)) on the Sawa Lake data. These graphs are typical linear regression outputs, produced in R, and used to validate the linear regression (Zuur et al., 2009). The graphs were arranged in four panels: The upper left panel shows the residuals (fitted values minus observed values) against fitted values. It is used to assess the homogeneity assumption. If there is a pattern or different scatter in the spread of the residuals around the horizontal line (zero residuals), the homogeneity assumption will be violated due to the non-linearity, and therefore the residuals will scatter around the center line differently. Based on this panel outputs, the residuals bounce around the horizontal line (the mean of the residuals is zero) with no systematic patterns. Thus, the homogeneity assumption is valid. However, the presence of minor violation in homogeneity is not a major problem to be considered (Sokal & Rohlf, 1995).
The upper right panel shows the standardized residuals, which is obtained by taking the absolute value and weighted by the leverage, against the theoretical quantiles to check the normality assumption graphically. The points follow the 1:1 line except at the tails. Therefore, Shapiro-Wilk normality test was used to test the residuals statistically to prove the normality assumption clearly (Gotelli & Ellison, 2013). Based on the Shapiro-Wilk test results, the residuals distribution is normal (the null hypothesis) because of the large value of w-value and p-value >0.05 significance level.
The lower right panel shows the square-root transformed standardized residuals against the fitted values to check the equal variance assumption graphically. No pattern exists; however, the F-test based on the residuals was used to check the equal variance assumption too. All residuals were divided into two groups based on the fitted values median, and then the two groups were tested for equal variances (σ1,σ2) by assuming true equal variances (the null hypothesis: σ1/σ2 = 1, and the alternative hypothesis: σ1/σ2 ≠ 1). The F-test resulted in a p-value >0.05 significance level in addition to F-value (σ1/σ2 ≠ 1). Therefore, the null hypothesis is not true and the two variances are different. Because of the good p-value (greater than 0.05 significance level), the linear regression model meets the assumption of equal variance.
Finally, the lower right panel was used to check the influential outliers based on Cook's distance. The third data point has a Cook's distance greater than 1. Therefore, this data point is an influential point (Fox, 2015;Hair et al., 1998;Zuur et al., 2009). It was found that removing this data point from the raw data and repeating the analysis lead to a non-significant relationship between DO and the other predictors. Thus, the model of Eq. (1) was rejected due to this violation, and the model of Eq. (2) should be considered to be examined.
Basically, Eq. (2) is a simple linear regression model. Figure 5 (b) shows the related model validation graphs. It is clear that all graphs were improved as we moved from Eq. (1) to Eq. (2), meeting all of the assumptions of the linear regression, see Table 4 for the statistics. However, time independency needs to be checked.

The auto-correlation effect
Because the dataset that was used to model Sawa Lake was time series, auto-correlation (time dependency) could be statistically significant. The presence of auto-correlation will influence the final model functionality and could lead to wrong predictions if ignored. It was noticed that the regression p-value may be overestimated by 400% in the presence of ignored auto-correlation (Ostrom, 1990). In order to assess the temporal aspects, auto-correlation function (ACF) can be used to evaluate independence of residuals in case of constant time resolution (Schabenberger & Pierce, 2001). Zuur et al. (2009) considered ACF the more formal visualization tool to detect patterns. An ACF value represents the Pearson correlation between a time series and the same time series shifted by one time space, called time lag (the time resolution in this case). Figure 6 shows the autocorrelation plots for the model of Eq. (1) and Eq. (2), respectively, indicating clearly that there is no significant auto-correlation because ACFs are within 95% confidence levels (the blue dash lines).
Thus, Eq. (2) model satisfies the linear regression assumptions graphically and statistically and can be used to describe DO in Sawa Lake where 90.73% of the variance in DO can be explained by TDS. Hence, the target null hypothesis that there is no significant relationship between DO and the other environmental predictors was rejected, and the alternative hypothesis that there is significant relationship was accepted.

The linear regression model simulation
In order to make future predictions for specific environmental management purposes, it is necessary to explore the model comparability with the original data (Gelman et al., 2013;Zuur et al., 2013). First of all, the final model in Eq. (2) was plotted using the Sawa Lake raw data as shown in Figure 7 (a). There is good agreement between the model predictions and data, reflecting the strong relationship between DO and TDS in Sawa Lake since 90.73% in the DO variance can be expressed in terms of TDS with a low p-value.
TDS caused oxygen depletion in Sawa Lake, mainly during summer conditions. Shope and Angeroth (2015) reported same depletion situation in Great Salt Lake, US. The DO started dropping on June and recovered on November, keeping around the same value during the rest of the year approximately. Finally and by taking the advantages of the model in Eq. (1), the DO had a strong correlation with Temperature too. Therefore, the plots in Figure 7 (b) were used to reveal the interaction among DO, TDS, and Temp based on the original data and the valid model in Eq. (2). The circle size is proportional to Temp. Hence, the model DO predictions showed same pattern as the data, in which the highest TDS value occurred on August when the water temperature was the highest, resulting in the low DO level. Thus, the model in Eq. (2) was proven statistically and graphically in addition to the comparability with the data. In addition, the salty ions that contribute to TDS and affect DO were explored. Plotting DO against TDS and taking into account these ions effect fulfilled this purpose. The graphs (a to h) in Figure 8 highlight the ions effect on the relationship between DO and TDS. Interestingly, all of the salty ions have almost the same influence on DO and TDS relationship as shown by the circles sizes in the graphs (a to h). The circles sizes are bigger during summer season, however. This underlies the impact of temperature on the undertaken melting processes surrounding the lake region. Although this supports the present analysis result that the most influential environmental variable on DO after TDS is the Temp as shown in the rejected model of Eq.
(1), it shows a statistically insignificant relationship. Hence, comparing Figure 7 with Figure 8 helps identify that the DO drop down during summer season was due to a TDS increase in the lake as a result of the high temperature. Ions move faster when the water is warm. The high temperature increased the amount of TDS in salt lakes due to melting process and led to more DO depletion in the same time. Al-Zubaidi and Wells (2020) and Al-Zubaidi (2018) stated the role of temperature in modelling dissolved oxygen in freshwater analytically and numerically. However, TDS can be used to predict dissolved oxygen levels in salt lakes statistically.

Conclusions
A new statistical technique was developed in order to model dissolved oxygen in a small salt lake called Sawa Lake, located in Iraq. The findings of this research indicated that: (1) The statistical approach can model the salt lake aquatic ecosystem or similar in underdeveloped and developed countries such as Iraq where complete data are not available for the numerical models to be used. It was capable of showing the interaction between the water quality variables visually and statistically, revealing the correlation strength between the environmental predictors.
(2) The technique was also capable of building a linear regression based on statistical diagnostic tests. Therefore, the model selects just the environmental predictors that have a statistically significant relationship with the dependent variable rather than building a full model based on all the environmental predictors. Furthermore, a visual way was proposed to show the contribution of the environmental predictors that were not selected in this relationship by showing the impact of those predictors on the model.
(3) It was found that the dissolved oxygen in the lake is a function of the total dissolved solids only with R-squared of 90.73% and p-value of 1.08E-06. Therefore, 90.73% of the variance in dissolved oxygen can be expressed by total dissolved solids.
(4) The statistical model developed in this study showed that all salty ions have a similar interaction with the dissolved oxygen/total dissolved solids relationship. In contrast to the salty ions, the results showed that temperature variation with time was the most water quality variable that plays a major role in the dissolved oxygen/total dissolved solids relationship. Thus, during summer season, lake temperature goes up, reducing lake dissolved oxygen due to the higher total dissolved solids coming into the lake through salt melting processes in the lake region.