Spatiotemporal characterization and forecasting of coastal water quality in the semi-enclosed Tolo Harbour based on machine learning and EKC analysis

ABSTRACT Characterizing and forecasting coastal water quality and spatiotemporal evolution should be significant to coastal ecosystem management. However, high-quality modeling coastal water quality and their spatiotemporal evolutions is rather challenging due to complex dynamic mechanisms especially in a spatially and temporally heterogenous semi-enclosed bay. To this end, this study develops a framework incorporating machine learning (ML) algorithms and the Environmental Kuznets Curves (EKC) analysis to model, analyze and forecast the spatiotemporal variations of water quality indicators for different subzones and seasons in the semi-enclosed Tolo Harbour of Hong Kong. The application results indicate that the developed ML-based framework with an accuracy range of 0.672 ∼ 0.998 is well-suited in forecasting and understanding the coastal water evolution in a semi-enclosed harbour compared to conventional approach. Furthermore, the spatiotemporal characteristics of coastal water quality evolution in this semi-enclosed bay are analyzed and discussed for coastal hydro-environmental management. Moreover, the EKC analysis is also performed for determining the evolutions of essential water quality variables under 95% confidence interval of Hong Kong PCGDP projection and then implemented in the developed ML-based model for future prediction.


Introduction
In recent decades, many coastal ecosystems throughout the world have been facing the critical issue of hydroenvironment degradation with different degrees owing to the rapid development of coastal cities and urbanization (Chau, 2007;Peng et al., 2012;Xu et al., 2010). The anthropogenic impacts such as coastal discharges of industrial waste and municipal sewage have significantly increased the nutritive contaminants and aggravated the water quality decline (Qiao et al., 2017). These environmental issues, along with other damaging impacts such as water temperature, acid-base, and meteorological conditions, usually stimulate the harmful algal bloom (HAB) events (also known as red tides), which is a typical eutrophic symptom frequently observed worldwide (Lee et al., 2003). This explosive growth and accumulation of algal species to an undesirable level in HAB events often result in a variety of adverse effects on coastal ecosystems, including water discoloration, transparency decrease, oxygen depletion, aquaculture fish kills, and beach closure that causes severe ecological and economic loss, especially in the semi-enclosed water bay areas (Deng et al., 2021;Guo et al., 2020).
CONTACT Huan-Feng Duan hf.duan@polyu.edu.hk Supplemental data for this article can be accessed here. https://doi.org/10. 1080/19942060.2022.2035257 The responses to nutrient enrichment of different ecosystems may vary considerably (Dai et al., 2016;Xu et al., 2010), and their leading causes are also very sitespecific (Zhang & You, 2017). Particularly, semi-enclosed water bay areas with spatially and temporally heterogeneous conditions in different subzones usually respond differently to eutrophication susceptibility, which makes the simultaneously spatial and temporal analysis more complicated . Specifically, the pollutant matters easily get trapped in the water near the inner part due to slight tide flushing, longer water retention time, and weaker water exchange, making the areas more prone to eutrophication. On the contrary, the water near the outer open sea is often more affected by oceanic dynamics where frequent hydrodynamic exchanges with open seas impede the accumulation of nutrient substances Peng et al., 2012;Qiao et al., 2017). Therefore, in those semi-enclosed bay areas, spatial and temporal variations of hydrodynamic and hydro-environmental conditions can be very significant, even in a small-scale zone (Peng et al., 2012). Besides, various observations and practices have also demonstrated that it is essential to separately deduce the different causalities of eutrophica-tion for different subzones and seasons to identify appropriate management strategies of water quality regulations for other portions over a heterogeneous ecosystem (Xu et al., 2004. Recently, novel models incorporating machine learning (ML) algorithms demonstrate better performance in predicting eutrophic evolutions with higher accuracy than conventional mechanistic and regression models (Alizadeh et al., 2018;Li et al., 2014;Zeng et al., 2017). Among ML models, artificial neural network (ANN) is the most widely adopted algorithm in hydraulic and ecological modeling due to its superior self-adaptability, self-organization, and robustness (Deng et al., 2021;Hadjisolomou et al., 2021;Zeng et al., 2017). Nevertheless, a successful application of ANN model depends on a tremendous amount of training data and expertise in tuning various structural hyperparameters Xie et al., 2012). Random Forest (RF) is another robust machine learning algorithm that resorts to the concept of ensemble learning. Unlike other models trained individually, the ensemble learning model aggregates a group of simple models as a more powerful learner to potentially produce a vastly superior prediction performance than single models (Shin et al., 2020). In this regard, RF has also received broad interest in modeling and realizing environmental systems (Sattari et al., 2020;Zeng et al., 2017).
On the other hand, forecasting the future trend of water quality is still a main challenge of ML paradigm because the historical data used lacks the future information. Nevertheless, in ecological economics, some environmental quality indices are discovered to have significant associations with the economic development level of adjacent areas (He et al., 2014;Sarkodie & Strezov, 2018). This kind of association can be commonly concluded by the Environmental Kuznets Curve (EKC), which hypotheses a statistical relationship between environmental and socioeconomical indices (Diao et al., 2009). For example, Sarkodie and Strezov (2018) confirmed the validity of the EKC hypothesis in Australia, China, Ghana, and the USA; Chen et al. (2018) adopted EKC models to extrapolate the coastal environment evolutions with the economic developments in southeast China. In this case, the EKC model is expected to provide a possible solution to forecast the future eutrophic problems upon incorporating the ML models. However, the shapes of EKC curves are usually found to be very site-specific (Diao et al., 2009), and few studies disclose the EKC relationships along the Hong Kong coasts.
Among previous studies on ML-based eutrophic modeling especially in semi-enclosed bays, two major research gaps should be summarized as: (1) most researchers have focused on revealing the temporal variation of water quality of a single point, or as a whole of water bay, while the spatial differences are usually neglected Xie et al., 2012;Yu et al., 2021); (2) although a mass of ML-based analyses have demonstrated the validity of ML models in ecological modeling, most of them are only devoted to calibration and validation of subsistent data in essence (Hadjisolomou et al., 2021;Mamun et al., 2020;Shin et al., 2020) and seldom research aim to predict the future trend of eutrophic evolution owing to the uncertainty and complexity in environmental extrapolation.
As an extension to the previous study by authors (Deng et al., 2021), the main objective of this paper is to develop an ML-based framework for coastal water quality analysis in a semi-enclosed water bay in Hong Kong (i.e. Tolo Harbour) to (1) precisely characterize the spatiotemporal evolutions of objective parameters, (2) provide a practicable procedure for extrapolating the future trend of water quality indicators by combining a socioeconomic model. Unlike previous studies taking a semienclosed bay as a whole, we split the study area and period into different subzones and seasons for a more specific spatiotemporal analysis. To this end, we firstly investigate the effectiveness of the ANN and RF schemes compared with the MLR in eutrophication modeling for a spatially and temporally heterogeneous semi-enclosed ecosystem. Then, a seven-step framework for such modeling in a semi-enclosed water bay is proposed. For demonstrating the applicability of the developed framework, the eutrophication quantified by two typical response indicators, i.e. Chlorophyll-a concentration and transparency extent, is assessed, and the key contributors are interpreted over different seasons and subzones. After validating the effectiveness of the ML models for the water quality states, they are then extrapolated by innovatively incorporating EKC models using the projection of the Gross Domestic Products per capita (PCGDP) over the next decade in Hong Kong. On this basis, the well-trained ML models can predict the evolution of Chlorophyll and transparency in Tolo Harbour.

Study area
The Tolo Harbour, located in the north-eastern part of Hong Kong (see Figure 1), is an almost land-enclosed seawater body with only a narrow channel way out to Mirs Bay (Chau, 2007;Deng et al., 2021). The water area of Tolo Harbour is about 52 km 2, with an average length and width of 16 and 3 km, respectively (Sin & Chau, 1992). Figure 1 shows the change of geographical pattern and land-use pattern of Tolo Harbour over the past two decades. According to different geographic characters and hydrodynamic conditions, the whole west-east water body is generally divided into three subzones (i.e. Harbour, Buffer and Channel Zone) by dotted lines in Figure 1 (Muttil & Chau, 2007;Sin & Chau, 1992;Xu et al., 2004). Generally, the tidal current velocity of the inner part is only 0.01 ∼ 0.02 m/s, while that is about 0.2 ∼ 0.3 m/s at the outer channel part . Hence, the contaminant and nutriments trapped in the Harbour subzone are usually flushed out in a more extended period than the other two subzones (Sin & Chau, 1992).
The land-use type along the Tolo Harbour also shows significant heterogeneity. To be specific, the inner Harbour zone is almost surrounded by artificial surface including two new developed satellite towns, 'Tai Po' and 'Shatin' where the municipal and industrial waste from the local sewage discharge became the primary pollution source of the harbour zone (Deng et al., 2021;Muttil & Chau, 2007). There is also cultivated land producing agricultural waste along the south coast of the buffer zone, although artificial surfaces are gradually replacing it. In contrast, the land along the outer channel zone remains undeveloped state encircled with forest, shrubland, and grassland where the direct rainfall and runoff become the input of potential pollutants. Furthermore, the completion of reclamation projects at Tai Po (over 300 hectares), and Pak Shek Kok (about 74 hectares) in the 1990s ∼ 2000s resulted in the further decrease of the Tolo water area. After 2010, the new developments of Ma On Shan and new openings of Hong Kong Science Park (at Pak Shek Kok) increased industrial and residential land with a growing sewage discharge.

Modeling data and statistics
Monthly data of water quality and meteorology from 1999 to 2019 monitored by the Environmental Protection Department (EPD) (https://cd.epic.epd.gov.hk/) of Hong Kong and Hong Kong Observatory (HKO) (https://www.hko.gov.hk/) are used for modeling in this study. Three monitoring stations are selected to represent three subzones: TM3 (22°26 51 N, 114°12 10 E) for harbour zone, TM6 (22°26 38 N, 114°14 30 E) for buffer zone, and TM8 (22°28 24 N, 114°18 00 E) for channel zone. Based on the classification of trophic states in Table 1 suggested by NALMS (1990) and Sin and Chau (1992), Chlorophyll-a concentration (Chl-a, μg/L) and Secchi Disc Depth (SDD, m) are chosen as the response variables of eutrophication. Therein, Chl-a implies the algal biomass and SDD determines the water transparency. In addition, nine potential driven factors commonly used in previous eutrophication analyses are selected as modeling inputs (Mamun et al., 2020;Muttil & Chau, 2007;Xu et al., 2010), including Total Phosphorus (TP, mg/L), Total Nitrogen (TN, mg/L), Water Temperature (WT,°C), Suspended Solids (SS, mg/L), pH value, Dissolved Oxygen (DO, mg/L), 5-day Biochemical Oxygen Demand (BOD 5 , mg/L), Rainfall (mm), Wind Speed (WS, m/s). Figure 2 shows the variation pattern of the original annually average data span from 1999 to 2019 (i.e. 20 years' field data). Considering the climate feature of Hong Kong, all data for each subzone is classified into four seasons in this study: Spring (Mar -May), Summer (June -August), Autumn (September -November), and Winter (December -February). The spatial and temporal characteristics of environmental parameters and response variables are provided in Table S1 in the supplementary materials for the studied sea bay area.
According to the land-use heterogeneity and seasonal climate features, here we specify the spatial and temporal characteristics by separate modeling on different subzones and seasons.

Machine learning and regression methods
In this study, two different machine learning algorithms (i.e. artificial neural network and random forest) are applied for comparative investigation for Chl-a and transparency (SDD) prediction compared with a conventional statistical regressor (i.e. multiple linear regression). Figure 3 shows the structure of the artificial neural network (ANN) adopted in this study. The typical architecture of ANN models consists of multiple functional layers, namely an input layer, one/multiple hidden layer(s), and an output layer, with interconnected artificial neurons. In practice, each neuron node accepts inputs from the precedent layer and generates output to the next layer (Cho et al., 2011;Deng et al., 2021;Lee, 2004) by conducting a simple nonlinear operation as:

Artificial neural network
where a denotes the nodal data; w represents the weights along with interconnected nodes; b is the bias; f is the transfer function; the subscript i and j represent the node number of two adjacent layers. This ANN model learns the implicit relationship by error back propagation algorithm that iteratively adjusts connecting weights w to minimize the global error (Gupta, 2013;Whittington & Bogacz, 2019). For a welltrained network, the relative importance of each input  can be calculated based on the 'weight' method as follows: where w ih is the weight connecting input and hidden layer; ni and nh are the numbers of input and hidden nodes. After comparing the effectiveness of different network structure, a single hidden-layered network is built with nine input nodes, six hidden nodes, and one output node. The Sigmoidal function is selected as the transfer function for each neuron.

Random forest
Differently from single machine learning algorithms, Random Forest (RF) is an ensemble learning method in which decision trees are incorporated as the basic units. Each decision tree is constructed on a subset randomly subsampled from the original dataset by the bagging approach (Kehoe et al., 2015;Sattari et al., 2020). The final output of an RF model is generated by aggregating (i.e. voting for classification and averaging for regression) all separate predictions of each tree in the forest (Zeng et al., 2017). Since the ensemble method combines many tree models, RF is believed to provide a better generalization performance and less possibility of overfitting (Zeng et al., 2017). For clarity, the schematic of the RF model is given in Figure 4.
Essentially, the generating process of decision trees depends on the selections and comparisons of different features. In this regard, RF evaluates which features (inputs) are comparatively important by the mean changes of out-of-bag error (errOOB) (Eq. (4) ∼ (5)), which refers to the forecast error on unselected samples, with no additional manual extraction and judgment needed (Kehoe et al., 2015).
where errOOB i 1 and errOOB i 2 represent the out-of-bag error before and after random permutations of i th feature; t is the maximum tree number in the forest; I i denotes the importance of i th feature.
A forest of 200 trees is built for an ensemble with the bagging approach in the present study. To assure the independent and identical distribution of sampling set, the data used for generating each tree is put back to the total dataset each time for possible reselection in the following steps (Kehoe et al., 2015). The whole generating process of different individual tree models is implemented in a parallel manner.

Multiple linear regression
Multiple linear regression (MLR) is a multivariate statistical technique that predicts the dependent variable using various independent variables. The purpose of applying MLR algorithm is to find the optimum relationship that fits the multiple observed data by determining the best combination of coefficients (Aiken et al., 2012;Cho et al., 2011). The general representation of MLR formula can be expressed as: where x and y are independent variables (input) and dependent variable (output) respectively; β 0 is the yinterception and β i is its regression coefficients; is the residual term;n and m denote the number of input variables and sampling size, respectively. Theoretically, the individual contribution of each variable can be estimated by the absolute magnitude of the coefficient β i . The relative importance values in MLR can be calculated by:

EKC model and PCGDP projection
The Environmental Kuznets Curve (EKC) describes the potential relationship between ecological indices and Gross Domestic Products per capita (PCGDP) by building a multinomial curve model Diao et al., 2009), which is generally written as the following equation: where y is the environmental estimator; x is the socioeconomic indicator represented by PCGDP in this study; z denotes the other factor contributing to the environmental degradation; α 0 , β 1 , β 2 , β 3 and β 4 are the intercept and coefficients of the corresponding term to be determined; t denotes the time parameter. Note that all variables in the EKC analysis are annually averaged in this study.
Once the EKC relationships are determined, the future variation of the corresponding water quality indices can be forecasted based on the future projection of PCGDP.
Since PCGDP is a nonstationary series, the Auto-Regressive Integrated Moving Average (ARIMA) method on first-order difference is used to project PCGDP in the next ten years with a 95% confidence level.

Model performance evaluation
In order to evaluate the modeling performance, three indicators are adopted to quantify the modeling accuracy, namely mean absolute error (MAE), root mean squared error (RMSE) and correlation coefficient (CC). To be specific, MAE can be mathematically calculated as Eq. (9), which gives the arithmetic average of global deviation between prediction and observation (Mamun et al., 2020;Xie et al., 2012).
Theoretically, the model with better performance has a lower MAE value. The estimator RMSE is presented in Eq. (10), which calculates the squared root of the average residual between predicted and actual values (Shin et al., 2020;Zhang & You, 2017).
The model with smaller RMSE can be considered to have better accuracy. The correlativity between predicted data and actual measured data can be statistically evaluated by the Correlation Coefficient (CC). As a rule, a model with a CC closer to 1 indicates a better goodness-of-fit (Deng et al., 2021). The following equation can calculate CC values: where m represents the number of the sample size used for modeling; y i andŷ i denote the real observation and its estimated value, respectively; parameters capped with a bar refer to the arithmetic averaging.

Workflow for coastal eutrophication analysis
This study aims to use different machine learning/ regression models to simulate and characterize the spatial and temporal evolutions of Chl-a concentration and transparency (SDD). It enables the interpretation of the essential variables over different seasons (time domain) and different subzones (space domain) and estimates the future change of eutrophication in the Tolo Harbour. To this end, seven key steps are outlined in Figure 5 and elaborated as follows. • Step 1: Raw data preparation -Collecting raw data of variables potentially related to the eutrophication issues over the study period. • Step 2: Data preprocessing -The raw data measured at different depths is depth-averaged firstly; then the monthly data is interpolated into daily data to increase the number of samples; lastly, the information is scaled to the range of −1 ∼ 1 to ensure the different variables are at the same order of magnitude. • Step 3: Dataset Splitting -In this study, a 10-fold splitting method is adopted to produce 10 pairs of training/testing sets prepared for 10-run modeling. • Step 4: Model training and testing -Three different models are trained separately using a set of training data while the corresponding testing dataset is used to test the accuracy of the trained model. Considering the 10-fold crosstraining adopted in this study, all these performance indicators used in this study are averaged with 10 runs. • Step 6: Importance Variable Analysis -Relative Importance (RI) values are calculated and evaluated for different seasonal and spatial scenarios for comparison.

•
Step 7: EKC model development -To build desired models for each critical variable and estimate future evolutions based on the next 10-year PCGDP projection. Accordingly, the future variation trends of Chl-a and SDD can be estimated by well-trained ML models.
All these algorithms and models (ANN, RF, and MLR) are implemented through in-house coding in the MAT-LAB program platform to achieve better application flexibility. Figure 6 presents the 20-year Chl-a evolution at the three inspected locations (as shown in Figure 1) compared with the predictions based on MLR, ANN, and RF models. These time series plots show that the highest Chl-a concentration level with a maximum of 42.40 μg/ml is found in the Harbour zone, fluctuating more violently than the buffer and channel sections. For inspection, the trophic status of the water body in Tolo Harbour classified in Table 1 is used herein for illustration. During the early periods from 1999 to 2003, the trophic status at TM3 was almost mesotrophic and eutrophic with even three hypereutrophic events occurred, in line with the relatively high annual levels of TP, TN DO, and BOD 5 in those years ( Figure 2). Afterwards, the average concentration has been maintained between the eutrophic and mesotrophic state, though oligotrophic conditions have been frequently observed, especially after 2011. In the buffer zone, the trophic condition was almost oligotrophic to mesotrophic, but increasing cases of eutrophication have been observed recently. Channel zone is the most deficient productivity area with a low Chl-a level and maintains almost oligotrophic over the whole study period.

Spatial characteristics of Chl-a prediction in different zones
In Table 2, three descriptors, namely MAE, RMSE and CC, are adopted to quantify the performances of different models. Overall, the modeling performances of training set are slightly better than the results of the testing set without overfitting problems for all methods. Specifically, RF model provides a better agreement (CC over 0.99) than ANN and MLR models in simulating Chl-a growth. Meanwhile, the MLR presents higher training errors at TM3 station (RMSE = 3.526, MAE = 2.611) than ANN and RF. Regarding spatial comparison, the ANN algorithm obtains the best testing performance in terms of correlation at TM3 (CC = 0.824) in comparison to TM6 and TM8. As a result, the relatively low MAE and RMSE but high CC values demonstrate the favorable predictive availability of ANN and RF, whereas the MLR performs worst with the lowest goodness-of-fit. Figure 8(a) exhibits the relative importance (RI) values of different external influence variables in terms of Chla prediction in various spatial domains of Tolo Harbour. In general, the better the model performs, the more reliable its interpretation of essential variables is. Based on the RI values, the BOD 5 , which represents organic matter content, is recognized as the foremost important variable in all three inspected regions of Tolo Harbour. Moreover, the Chl-a concentration in the Harbour subzone is also largely influenced by nutrients (TN and TP) and SS, while the impacts of them may decrease adjacent to the buffer region.

Temporal characteristics of Chl-a prediction in different seasons
The variation patterns of Chl-a results over the study period are exhibited in Figure 7, indicating seasonal changes from spring to winter. In both summer and autumn periods, relatively high concentration levels in Chl-a have been observed in Tolo Harbour. Remarkably, the hypereutrophic cases have appeared in the summer and autumn periods of 2000, 2001, and 2003. On the contrary, the trophic status of Tolo Harbour remains weakest in wintertime, during which the oligotrophic condition is more likely to occur. However, it also shows the frequent winter eutrophications since 2016 but recovered by 2019. The performances of different models are elaborated in Table 3 in supplementary material, revealing that the lowest modeling errors (MAE and RMSE) are found in winter while the highest ones are found in spring. Furthermore, the CC values by RF are found to be much higher than ANN and MLR during all seasons in both training and testing stages.
The RI indexes for Chl-a prediction over the four seasons are presented in Figure 8(b). During all seasons, BOD 5 is ranked as the foremost decisive contributor to algal growth. In addition to BOD 5 , SS is also suggested as another salient factor during spring and winter periods as the concentration varies largely. However, the relative significances of other attributes are nearly equivalent in the summer and autumn periods. The modeling results also disclose that TN is the dominant nutrient limiting algal growth in different seasons in Tolo Harbour.

Spatial characteristics of transparency prediction in different zones
As shown in Figure S1 (in supplementary), the transparency of Tolo Harbour shows a spatial gradient since the average secchi disc depth decreases from the outer channel part towards the inner harbour zone. Notably, from 1999 to 2003, the harbour section has suffered from immense turbidity with SDD of almost less than 2 m, which was then maintained as the mesotrophic state (2 ∼ 4 m) after a significant rise during the period from 2003 to 2006. By contrast, the buffer zone water is more transparent whose trophic status is almost mesotrophic Figure 6. Comparison of 20-year spatial variation of observed and predicted Chl-a concentration results based on ANN, RF and MLR models at TM3, TM6 and TM8.  with very few eutrophic events, and no hypereutrophic issue occurs. The channel part follows a similar pattern but with an even deeper average depth and more frequent oligotrophic cases. The modeling performances are listed in Table S2. The training CC values do not display any difference by RF over different regions (CC = 0.997), while the best testing CC (0.992) can be observed in harbour part. Moreover, ANN performs better at TM3 with the lowest MAE and RMSE among all these three stations. Moreover, the relatively low CC values (less than 0.5) indicate that MLR performs unsatisfactorily, especially in the channel subzone with the CC even less than 0.3.
According to modeling results of the two ML methods (ANN and RF) ( Figure S3(a)), SS and TN can be identified as the key factors to water turbidity (i.e. low transparency) for harbour subzone. However, it is also noted that both the nutrients (i.e. TN and TP) and organic pollutants present even higher contributions than SS from the buffer zone to the channel zone. Moreover, the precipitation also contributes important influences on transparency in the channel zone compared to the other two subzones.

Temporal characteristics of transparency prediction in different seasons
The time series plots of both measured and predicted transparency results for Tolo Harbour over different seasons are presented in Figure S2. It is noticeable from the results that the secchi disc depth fluctuates much violently with higher standard deviations (Table S1 in the supplementary material) in the winter and spring periods, indicating a larger variation in transparency over these two seasons. In fact, more oligotrophic states are observed in winter while hypereutrophic conditions more likely occurred in summer in this bay area, especially before 2007. After a pronounced rise of transparency from 2007 to 2009, Tolo Harbour hardly shows hypereutrophic states and mostly maintains mesotrophic conditions during summer periods. Thereafter, three severe eutrophic events occurred in the spring of 2013, followed by frequent oligotrophic conditions after 2016 as the improved trophic states. Similarly, the performances of different models are elaborated in Table S3 in the supplementary material. The results indicate that the most significant modeling errors and lowest CC value imply that ANN and RF models have the worst modeling performance in winter among all seasons. Moreover, the low CC values (approximately 0.5) of MLR indicates its invalidity in transparency prediction.
From the relative importance analysis ( Figure S3(b)), nutrient contents (TN and TP) and organic pollutants (BOD5) are suggested as the dominating contributors to transparency during the summer and autumn periods. In contrast, the influence of SS on water turbidity dominates over the TN, TP, and BOD5, especially in spring and winter. Moreover, the physical factors including WT, WS, and Rainfall have not shown an evident seasonal variation of relative importance on transparency.

Prediction and analysis of water quality evolution
The EKC model is a commonly used statistical method that builds multinomial curves to estimate the change of environmental variables with PCGDP Sarkodie & Strezov, 2018). According to the previous analysis, EKC curves are plotted in Figure 9 for four relatively essential variables (i.e. TN, TP, BOD 5 , and SS) at three different stations in Tolo Harbour. These curves show that all four selected variables almost have a U-shaped quadratic relationship with economic growth except for the monotonically linear increases of TN at buffer zone and BOD5 at channel zone.
Once the EKC relationships are built, the future variations of water quality can be estimated using the projection of PCGDP data. 2020 Gross Domestic Product published by the Census and Statistics Department (CSD) of the Hong Kong government (https://www.censtatd.gov.hk/en/) reviews PCGDP data of Hong Kong over the past two decades. Based on these data, the future growth of Hong Kong PCGDP in the next decade is projected by the ARIMA method under the assumption of current strategies and policies in operation (Figure 10). With a 95% confidence level, the baseline, upper, and lower confidence limit line represent three different economic developing patterns over the following ten years. According to these three patterns, the corresponding water quality evolutions have been predicted under low, basic, and high economic growth ( Figure 9). Hereafter, they are input into the welltrained ML models to expect the future variation of Chl-a and SDD. Figure 11 shows the trends as well as varying ranges of future Chl-a and SDD, which were predicted based on three models in the next decade. Comparing to Chla, the extrapolating forecast of SDD performs smaller ranges, indicating a more stable evolution in oncoming years. Nevertheless, Chl-a has wider variation ranges and remarkable oscillated developments, which demonstrated a more complex pattern of blossom mechanism, especially at the buffer zone where has more comprehensive effects of industrial, agricultural, and residential pollutants.
Specifically, the future predictions of growth trend and range by MLR method almost increase linearly with time, simply neglecting possible nonlinear oscillation, which is reasonable to assume that MLR is too simplistic in extrapolation predicting. The ANN model performs more conservatively with a larger potential range of variation at the regions away from the inner coast (i.e. buffer and channel section) due to more sensitive responses on different PCGDP growth cases. This high predictive sensitivity can be attributed to the suspicion of overfitting when extrapolating predictions using networks trained with past data. In contrast, RF performs specifically with a more reasonable variation, providing a reliable extended forecast.

Discussion and implications
The modeling performances shows that different models have the different effectiveness in Chl-a and transparency prediction. Specifically, ANN and RF perform better than MLR due to their excellent abilities to handle nonlinear problems (Mamun et al., 2020;Xie et al., 2012). Moreover, ensemble learning model (RF) even provides a more accurate performance than ANN by aggregating sub-models (Shin et al., 2020;Zeng et al., 2017).
The water quality in Tolo Harbour has been observed to show a consistent decrease from less developed channel segment to the densely populated inner region (Deng et al., 2021;Wong et al., 2018;Xu et al., 2010). The ML models have also correctly displayed this variation trend. Specifically, the most severe region with frequent eutrophic events and lower transparency is found to be the inner harbour zone with a direct inflow of coastal wastes. In addition, the study indicates a relatively weak variation in seasonal patterns in the water quality of Tolo Harbour , except the Chlorophyll-a concentration showed slightly higher in summer and autumn periods than others.
Although nitrogen and phosphorous in the water body are typically considered as the principal factors of Chla growth (Deng et al., 2021;Mamun et al., 2020), the higher RI values of BOD 5 indicate that organic pollutant contributes more substantially in the Tolo Harbour. This finding is consistent with previous studies for the Tolo Harbour (Li et al., 2004). Furthermore, a strong relationship between biochemical oxygen demand and eutrophication has also been confirmed by Xu and Xu (2015). This result is reasonably related to sewage diversion plans launched by the Hong Kong government in the late 1990s, which reduced 90% discharge loading and controlled TN and TP within an acceptable range . After that, the critical limiting factor to algal growth could not be attributed to nutrient availability Nurohman et al., 2021). In addition, BOD 5 and SS are also found as another significant stimulus to algal growth in Tolo Harbour, especially in lower velocity regions (i.e. inner harbour and buffer zone) and high variation seasons (i.e. spring and winter). This finding also concurs with the conclusions of relevant research (Gorokhova et al., 2020;Nurohman et al., 2021).
Water transparency is an integrated indicator related to concentrations of inorganic suspended solids, phytoplankton, and dissolved organic matter in the water, so that could reflect the turbidity, watercolor and abundance of algae (Ao et al., 2018;Mamun et al., 2020). In Tolo Harbour, three variables are found as key factors of water transparency: SS, BOD 5, and TN. Wherein, the SS is found to be the leading driven factor near the   inner harbour zone. The sudden increase of SS in recent years ( Figure 2) should be worthy of the attention, which could be a forewarning of eutrophic issues. This could be attributed to the large-scale constructions due to rapid exploitation along Tolo Harbour in recent decade, including the town expansions of Shatin, Taipo, and Ma On Shan as well as the new developments of Hong Kong Science Park. On the other hand, TN and BOD 5 also have critical contributions to transparency. Nevertheless, the similar impacts of the two on algal growth and water turbidity evidence that these two symptoms of eutrophication highly correlate in the Tolo Harbour. This finding has also been confirmed in recent studies (Hadjisolomou et al., 2021;Mamun et al., 2020). Furthermore, the lower importance of weather conditions, such as WT, WS, and Rainfall, also corroborates that the deterioration should be blamed on anthropogenic activities rather than natural evolution.
From a long-term view of future prediction, the water quality will remain more stable near the channel part (TM8) under current policies and strategies. In contrast, the water deterioration (Chl-a increases and SDD decreases) at TM3 is likely to recur as the economy develops within the next decade but not as severe as 20 years ago. Regarding the buffer region, more attention should be given to regulating and controlling policies owing to the complicated dependency between socioeconomic development and water quality evolution.
In this study, the developed methodology and application results should be the preliminary work of coastal hydro-environment management and is of great significance for water environment protection. Specifically, the more specific spatiotemporal analysis than previous studies characterize different key factors to eutrophication over different subzones and seasons, which are expected to provide more tailored management measures to particular subzones and seasons on coastal eutrophication problems. Furthermore, the well-trained framework incorporating EKC model is able to forecast the future eutrophic trend in the following years, which provides practitioners with a possible way to take the precautionary measures in advance.

Summary and conclusion
In conclusion, this study proposes a machine learning framework (using ANN and RF models) to characterize the spatial and temporal variations of eutrophication indicators (i.e. Chl-a concentration and transparency) in a semi-enclosed water bay (Tolo Harbour) in Hong Kong. Furthermore, the developed framework provides a practicable way to forecast future eutrophicate evolutions in the coming years once incorporating the EKC models.
The modeling results indicate that the developed framework had an excellent performance in coastal water quality modeling. The spatiotemporal patterns of key contributors to eutrophication are also rationally interpreted based on this framework. Specifically, the relative importance analysis reveals that the water quality deterioration in the Tolo Harbour is mainly attributed to nutrients, organic pollutants, and suspended solids. Accordingly, their treatment deserves high priority when launching strategies for alleviating eutrophication in this sea bay area in the future. Especially, the sudden increase of nutrients and suspended solids monitored in Tolo Harbour should be worthy of the attention.
According to the future analysis, the eutrophicate condition will remain almost at a constant level in the channel zone. In the inner Harbour zone, although the water deterioration may rebound in oncoming years, the situation will not be as strict as two decades ago. The results also imply that this method still has uncertainties in detailed predicting, especially at the TM6 station, due to more comprehensive anthropogenic impacts along the coasts, which need more inspections and modifications when implementing regulations in the future.
From the application results and analysis of this study, the proposed ML-based framework is proven to be wellsuited and effective to model and understand the spatiotemporal evaluation of coastal water quality in a semienclosed seawater bay under various heterogeneous conditions. Furthermore, it can provide helpful tools and methods for coastal hydro-environmental forecasting. However, some limitations of this work should also be noted: (1) the proposed ML-based framework is highly data-depended, whose performances directly rely on the reliability of the data used; (2) the ML-models are very site-specific, which means a new training process must be conducted before each application; (3) the projected socioeconomic statistics and extrapolated water quality data based on different assumptions may be very different with reality. To overcome these limitations, this proposed method could be improved in the future studies by higher resolution data and still needs modifications by successive water quality data monitoring.