Air quality analysis and PM2.5 modelling using machine learning techniques: A study of Hyderabad city in India

Abstract Rapid urbanization and industrialization in many parts of the world have made air pollution a global public health problem. A study conducted by the Swiss organization IQAir indicated that 22 of the world's 30 most polluted cities are in India, which makes the problem particularly relevant there. Exposure to air pollutants has both acute (short-term) and chronic (long-term) impacts on health. Among the major air pollutants, particulate matter 2.5 (PM2.5) is the most harmful, and long-term exposure to it can impair lung function. Pollutant concentrations vary temporally and depend on the local meteorology and emissions at a given geographic location. PM2.5 forecasting models can support strategies for evaluating expected hazardous levels of air pollution and alerting the public to them. Accurate measurement and forecasting of pollutant concentrations are critical for assessing air quality and making informed strategic decisions. Recently, data-driven machine learning algorithms for PM2.5 forecasting have received considerable attention. In this work, a spatio-temporal analysis of air quality was first performed for Hyderabad, indicating that average PM2.5 concentrations during the winter were 68% higher than those during the summer. PM2.5 modelling was then carried out using three techniques: multilinear regression, K-nearest neighbours (KNN), and histogram-based gradient boost (HGBoost). Among these, the HGBoost regression model, which used both pollution and meteorological data as inputs, outperformed the other two techniques. During testing, the model achieved an R2 value of 0.859, indicating close agreement with the observed data. The model also had the lowest Mean Absolute Error (MAE), 5.717 μg/m3, and a Root Mean Square Error (RMSE) of 7.647 μg/m3, further confirming its accuracy in predicting PM2.5 concentrations.
In our investigation, the HGBoost3 model outperformed the other PM2.5 models, with the lowest error and the highest R2 value. This study contributes by incorporating the spatiotemporal relationship between air pollutants and meteorological variables into air quality prediction. This approach has the potential to support the development of more precise air pollution forecast models.


Introduction
Air, a vital element for sustaining life on Earth, is under pressure from both human-made and natural factors. Industrialization, volcanic eruptions, forest fires, agricultural burning, and urbanization, among others, have collectively contributed to a decline in air quality worldwide (Raju et al., 2022). Approximately eighty percent of global cities and ninety-eight percent of cities in middle-income nations exceed recommended air quality standards. This escalation in air pollution has detrimental consequences, including accelerated climate change resulting in extreme weather events, economic losses, reduced visibility, and millions of premature deaths annually (Jung et al., 2015). The main air pollutants include particulate matter, carbon dioxide, carbon monoxide, nitrogen oxides, sulfur oxides, and volatile organic compounds. Among these, anthropogenic fine particulate matter (PM2.5), comprising particles with an aerodynamic diameter of less than 2.5 micrometres, poses a significant threat to air quality (Gregorio et al., 2022). Accurate forecasting of air pollutants and identification of pollution trends can help scientists devise effective emission control strategies. India has been grappling with air pollution for over a century, and the situation has worsened in recent decades due to rapid population growth, unplanned urbanization, and industrialization (Kapoor, 2017). Notably, 22 of the world's 30 most polluted cities are in India (IQAir, 2020), and from 2008 to 2013, India ranked among the most polluted countries globally, according to the World Health Organization's database (World Health Organization [WHO], 2014). Air pollution in India contributes to approximately 17.8% of all deaths, as reported by the Global Burden of Disease study (Lancet, 2020). The major factors responsible for these deaths are ambient particulate matter (PM) and residential air pollution.
Over the past two decades, India has witnessed some of the most severe and widespread air quality degradation, making air pollution a critical concern for regulatory authorities. The post-monsoon season, particularly in the Indo-Gangetic Plains (IGP), is highly susceptible to severe pollution episodes. An alarming example occurred during a week in early November 2017, when PM2.5 levels in Delhi exceeded WHO guideline values by a factor of 25 (and Indian standards by a factor of 11), triggering an environmental health emergency (De Vito et al., 2018). Recognizing the gravity of the situation, the WHO updated its air quality guidelines, establishing revised annual and 24-hour levels for six major pollutants (WHO, 2021). For instance, the annual average guideline values for PM10 and PM2.5 were reduced from 20 μg/m3 to 15 μg/m3 and from 10 μg/m3 to 5 μg/m3, respectively. Despite technological advancements aimed at improving air quality, significant transformative changes are yet to be fully realized (Gulia et al., 2018). Air pollution continues to demand urgent attention and comprehensive efforts to safeguard public health and the environment.
The introduction of real-time monitoring stations marked a significant technological advance in air quality monitoring in India. These stations were first set up in Delhi in 2006 as a pilot project and were expanded to numerous other cities after 2016 (Roychowdhury & Somvanshi, 2020). Over time, the number of Continuous Ambient Air Quality Monitoring (CAAQM) stations across the country has increased to 278, serving 147 cities. Additionally, India established the System of Air Quality and Weather Forecasting and Research (SAFAR) network, which combines manual and real-time monitoring with air quality forecasting capabilities. Initially implemented by the Indian Institute of Tropical Meteorology (IITM) in Pune for Delhi, SAFAR has since expanded its forecasting services to three other cities: Pune, Mumbai, and Ahmedabad (http://safar.tropmet.res.in/). These forecasting models have proven valuable in guiding policymakers toward informed strategic decisions. While manual and real-time monitoring stations cover a significant portion of the country's air quality monitoring needs, researchers are actively exploring the use of low-cost sensors to measure air quality on a smaller spatiotemporal scale (A. Kumar & Gurjar, 2019). This approach holds promise for enhancing the granularity and coverage of air quality data, allowing more detailed insights into local air pollution patterns.
The rising number of premature deaths associated with air pollution has drawn significant attention to the health impacts of PM2.5 (particulate matter with a diameter of 2.5 micrometres or less) (Cohen et al., 2017; Silva et al., 2013). These tiny particles can penetrate deep into the respiratory tract and reach the lungs, giving rise to respiratory and cardiovascular ailments (Gronlund et al., 2015). Exposure to PM2.5 can cause irritation in the eyes, nose, throat, and lungs, leading to symptoms such as coughing, sneezing, runny nose, and shortness of breath. Furthermore, PM2.5 exposure can severely impact respiratory function and worsen medical conditions such as asthma and heart disease. Studies have revealed that increased daily exposure to PM2.5 is associated with a higher number of respiratory and cardiovascular hospital admissions, emergency department visits, and mortality rates (Baccarelli et al., 2014). Long-term exposure to fine particulate matter has also been linked to an elevated risk of chronic bronchitis, decreased lung function, and mortality from lung cancer and heart disease. Individuals with pre-existing lung and cardiac conditions, as well as children and the elderly, are particularly vulnerable to the harmful effects of PM2.5.
Numerous studies have investigated the spatial and temporal variations of pollutants (Analitis et al., 2020; Athanasiadis et al., 2003; Chaloulakou et al., 2003). Varaprasad et al. (2021) conducted a study focusing on PM2.5, carbon monoxide (CO), nitrogen oxides (NOx), and sulphur dioxide (SO2), and observed distinct seasonal fluctuations for each pollutant. In the study area, PM2.5 and PM10 concentrations had a notable impact on air quality, with PM2.5 mass concentrations being higher during the post-monsoon season (Das et al., 2023; Li et al., 2019). The study also found that PM2.5 concentrations varied significantly during the day. Additionally, regional disparities were identified during the investigation. Furthermore, Li et al. (2019) explored the correlation between meteorological factors and PM2.5 concentration, noting that precipitation, relative humidity, air temperature, and wind speed showed a negative relationship with PM2.5 concentration. These findings shed light on the complex interactions between air pollutants and meteorological conditions, contributing to a better understanding of air quality variations.
Singh et al. (2013) established a significant correlation between daily death rates and air pollution statistics and found that PM2.5 (particulate matter with a diameter of 2.5 micrometres or smaller) was particularly dangerous. This was attributed to its ability to penetrate the lung walls, causing severe health issues and potentially leading to increased mortality rates in regions with higher levels of this pollutant. To address the complexities of predicting future pollution levels, S. Kumar et al. (2020) took a different approach. They employed general statistical methodologies, including multiple linear regression, to develop models capable of forecasting pollution levels over time. These models were then applied in a subsequent study, which successfully identified correlations between certain characteristics and air pollution patterns. However, it became evident that predicting various elements of time series data, such as trends, seasonality, and outliers, presented considerable challenges. The studies relied predominantly on simple statistical models, which proved inadequate for capturing the intricacies of pollution trends. Two main limitations hindered accurate forecasting of air pollution patterns. First, these simple statistical models could not handle the complex interdependencies and correlations between the different variables that influence pollution levels. Air pollution is influenced by a multitude of factors, including industrial activities, traffic, weather conditions, and geographical features, making it a complex and multifaceted problem to model accurately. Second, the models had difficulty capturing seasonal variations and long-term trends. Air pollution patterns often exhibit seasonal fluctuations due to variations in weather conditions, human activities, and natural phenomena. These seasonal variations are crucial to understanding pollution levels and predicting future trends accurately. Furthermore, capturing long-term trends and potential outliers in pollution data is essential for making informed policy decisions and implementing effective mitigation strategies.
The importance of accurately modeling and anticipating air quality cannot be overstated, as it enables the public to be aware of potential health risks and empowers them to take precautionary measures. In recent years, machine learning approaches have gained popularity for forecasting temporal sequences of pollutants, and their application to air quality forecasting has been on the rise (Le et al., 2020). Forecasting models play a crucial role in developing effective strategies for assessing and informing the public about potential spikes in the air quality index (Zhang et al., 2021). These models generally fall into two main categories: simulation-based and data-driven approaches, each utilizing different methods to predict air pollution concentrations. The simulation-based approach integrates physical and chemical models to simulate the emission, transport, and chemical transformation of air pollutants. This method takes into account various factors such as emissions from different sources, meteorological conditions, and background characteristics to generate forecasts (Grell et al., 2005). While this approach can provide valuable insights, it does face certain challenges. One of the primary challenges is the presence of uncertainties in numerical models, which can impact the accuracy of predictions. Additionally, a lack of sufficient data on certain parameters can limit the precision of simulation-based models. On the other hand, data-driven approaches leverage statistical and machine learning techniques to identify patterns (Li et al., 2016). This approach proves to be effective, especially when dealing with high-dimensional data, as machine learning algorithms can efficiently discover relevant exposures that are related to desired health outcomes. Data-driven approaches are particularly useful when dealing with complex air pollution patterns influenced by numerous factors, as they can adapt and learn from the available data to make predictions (Ma et al., 2022).
Data-driven machine learning technologies have revolutionized the way researchers investigate the influence of various air contaminants on health outcomes (Caselli et al., 2009; Goudarzi et al., 2021a; Liu et al., 2021; Tsai et al., 2018; Ni et al., 2017; Niska et al., 2004; Siew et al., 2008). These advanced methodologies enable them to analyze and interpret complex data sets, considering multiple air pollutants simultaneously and their potential impact on human health. One critical area of research has focused on early-life exposure to ambient air pollution and its effects on children's neurodevelopment. Studies, such as the one conducted by E. Kim et al. (2014), have provided mounting evidence that exposure to air pollution during early developmental stages may have adverse effects on cognitive development and neurobehavioral outcomes in children. This research highlights the importance of understanding the long-term consequences of air pollution exposure during critical periods of brain development and emphasizes the need for implementing measures to protect vulnerable populations, particularly children, from harmful air pollutants. Additionally, the implications of air pollution extend beyond neurological effects. A noteworthy study by Huang et al. (2020) identified air pollution as a risk factor for obesity, particularly among individuals with a higher body mass index (BMI). This finding suggests that air pollution may have a broader impact on metabolic health, raising concerns about its contribution to the obesity epidemic. Further exploration into the relationship between air pollution and various health conditions, including obesity, is essential for formulating effective public health policies and interventions.
Air pollution's far-reaching consequences extend beyond the realm of public health and also encompass detrimental effects on several industries, most notably agriculture. In China, extensive studies have revealed the significant impact of industrial air pollution on agricultural productivity. As a result, the agricultural sector experiences reduced marginal products, while various critical parameters like labor-capital dynamics undergo alterations (Wang et al., 2020). This intersection between air pollution and agricultural productivity inevitably ripples into broader economic implications for a country. The negative influence on agricultural output can hamper food production and supply, potentially leading to food shortages and increased prices. Moreover, decreased agricultural productivity may lead to reduced exports and hinder the overall growth of the economy. Additionally, air pollution's effect on agriculture can disrupt rural livelihoods, forcing communities to cope with environmental challenges that affect their socio-economic well-being. For instance, farmers may face financial burdens due to decreased crop yields, exacerbating poverty and inequality. The economic impact of air pollution is not limited to agriculture alone; it also extends to other sectors. For instance, manufacturing and industrial activities might suffer from decreased productivity and increased costs due to air quality regulations and health-related absences among workers. Moreover, the healthcare sector experiences a surge in the demand for medical services, placing a strain on the healthcare system and draining resources that could be allocated elsewhere for development. In sum, the detrimental effects of air pollution on agriculture and various industries contribute to a vicious cycle that hinders a country's overall economic development. Addressing air pollution becomes a crucial priority for sustainable economic growth, improved public health, and the well-being of communities and industries alike.
Due to exponential growth in both urbanization and industrialization, India has become highly vulnerable to atmospheric pollution in recent years, particularly in urban areas. The increasing level of pollutants in the atmosphere has worsened ambient air quality, and the Air Quality Index has been rising at an alarming rate in major Indian cities. This motivated our focus on the air quality of major Indian cities and underlines the need for action on this critical issue. Pollutant forecasting and the discovery of patterns in air pollution will improve the scientific knowledge required for developing an optimal emission control strategy. Long-term solutions to air pollution require the right strategic decisions, which are possible only if an accurate air quality measurement and forecasting system is in place. Machine learning-based PM2.5 forecasting models offer the potential to develop methods for evaluating and warning the public about potentially harmful levels of air pollution (X. Feng et al., 2015; Goudarzi et al., 2021b).
The current study has multiple goals: (a) to conduct a spatiotemporal air quality analysis and explore seasonal changes in PM2.5 levels; (b) to conduct a correlation analysis; and (c) to examine the relationship between input and output variables. Furthermore, the study applies several machine learning techniques to PM2.5 modelling in order to establish the best PM2.5 forecasting model.

Study area
The city of Hyderabad is the focus of the spatio-temporal analysis and PM2.5 modeling in this study. As a fast-growing global city, Hyderabad has seen its air quality deteriorate significantly over the last decade, owing mostly to increased traffic and the presence of numerous industries in its northern and eastern sectors. Although Hyderabad is an important urban center, the literature review revealed a scarcity of studies on its air quality. Furthermore, the observed PM2.5 levels in the city have consistently exceeded the limits prescribed by the Central Pollution Control Board (CPCB). Considering these factors, Hyderabad was chosen as the study area for this project.

Hyderabad
The city experiences a hot semi-arid climate, characterized by distinct weather patterns. From June to October, Hyderabad receives substantial rainfall under the influence of the southwest monsoon, which contributes significantly to its overall precipitation. The mean annual temperature is around 26.6 °C, with monthly average temperatures ranging from 21 to 33 °C. May is the hottest month, with temperatures reaching 36-39 °C, while the coolest period occurs from December to January, with temperatures ranging from 14.5 to 28 °C. Hyderabad has a diverse economy, driven by key industries such as information technology, pharmaceuticals and drugs, manufacturing, food, and hospitality. The city's prominence as a major IT hub has earned it the moniker "Cyberabad," with numerous IT parks and multinational corporations operating within its boundaries. The pharmaceutical and biotechnology sectors have also flourished in Hyderabad, which houses the headquarters and manufacturing units of several prominent companies. Additionally, the manufacturing, hospitality, and food industries contribute significantly to the city's economic growth.
However, alongside the positive aspects of its economy, Hyderabad faces the challenge of vehicular emissions as a significant contributor to pollution. The increasing number of vehicles, coupled with traffic congestion, adds to air pollution in the city. Efforts are being made to address this issue, including the promotion of public transportation, the encouragement of electric vehicles, and the implementation of stricter emission norms for industries. These measures aim to improve air quality and reduce pollution levels, ensuring a sustainable and healthier environment for the residents of Hyderabad. Figure 1 shows the geographical location of Hyderabad district and the six monitoring stations.

Data and methodology
Data collection and data pre-processing form the initial step. A spatio-temporal air quality analysis of the study area is carried out before modeling to understand the trend of air pollution there. PM2.5 modelling is then carried out using several machine learning models, and their performance is evaluated. Figure 2 depicts the general methodology used for PM2.5 modelling.
The research begins with data collection. The collected data then undergo a pre-processing step to prepare them for modeling. Pre-processing comprises four stages: data integration; data cleaning and organization; checking for missing data and outliers; and preparing distinct training, testing, and validation datasets. After pre-processing, a correlation analysis is conducted to reveal the relationships between the input and output variables. This analysis helps identify the key factors influencing PM2.5 levels in the study area. Next, a comprehensive spatio-temporal air quality analysis is undertaken to explore the variation of pollutant levels during the study period. Spatio-temporal analysis of air quality involves studying and evaluating air pollution levels across geographical locations and time. It is crucial for understanding the distribution, trends, and patterns of air pollutants in different areas and their fluctuations over time. By identifying pollution hotspots, seasonal variations, and long-term trends, such analysis supports the formulation of effective air quality management strategies and policies. In this study, we used geographic information systems (GIS), statistical methods, and machine learning techniques to process and analyze the extensive datasets involved. This multidisciplinary approach offers insights into the spatial patterns and seasonal trends of air pollution, gives a comprehensive understanding of air quality dynamics, and enables the development of efficient measures to enhance overall air quality. For PM2.5 modeling, three machine learning algorithms are used: multilinear regression, KNN regression, and HGBoost regression. These algorithms are chosen for their capability to capture relationships between variables and provide accurate predictions. To assess the performance of each model, several evaluation metrics are employed, such as MAE, RMSE, and R2. These metrics gauge how well the models predict PM2.5 levels. Finally, a model comparison is conducted to determine the best PM2.5 forecasting model for the study area. The selected model will serve as a tool for assessing and managing air quality in the city of Hyderabad, assisting authorities in implementing effective strategies and measures to safeguard public health from air pollution.
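The three evaluation metrics named above follow their standard definitions, which can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study; the function names and the sample values are hypothetical.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the inputs (e.g. ug/m3)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error; penalises large misses more heavily than MAE."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical daily PM2.5 values (ug/m3), for illustration only.
observed = [40.0, 55.0, 62.0, 38.0, 70.0]
predicted = [42.0, 50.0, 65.0, 41.0, 66.0]
print(mae(observed, predicted), rmse(observed, predicted), r2(observed, predicted))
```

A lower MAE or RMSE and an R2 closer to 1 indicate a better fit, which is the basis on which the models are compared.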

Data collection
Air pollutant and meteorological data were gathered from the all-India CAAQMS portal, which is administered and operated by the Central Pollution Control Board (CPCB) of India. The data were collected at six Continuous Ambient Air Quality Monitoring Stations (CAAQMS) situated in Hyderabad city. The 24-hour meteorological and pollutant data cover the period from January 2018 to December 2019 (CPCB).

Data pre-processing
Before any modeling, data pre-processing is a critical step in preparing the input data for machine learning models. This step involves removing inconsistent data, handling null or missing values, and addressing outliers that may disrupt the modeling process. Null or missing values in the data are identified and removed to ensure the quality and accuracy of the dataset. Outliers, which can arise from faulty readings or from exceptional events such as forest fires or religious gatherings, are also addressed, as they can significantly distort pollutant levels. The MLR, KNN, and HGBoost regression models are used in this work. R2, MAE, RMSE, and MSE are employed as evaluation metrics to assess and compare the models' accuracy and effectiveness. Through this evaluation and comparison, the study aims to identify the most suitable PM2.5 forecasting model for the study area, which will aid informed decisions on managing and mitigating air pollution.
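The cleaning steps described above can be sketched as follows. The text does not state which rule was used to identify outliers, so the 1.5 × IQR fence below is an assumption chosen for illustration; the record layout and field names are likewise hypothetical.

```python
import statistics

def drop_missing(records):
    """Keep only records in which every field has a value (no None)."""
    return [r for r in records if all(v is not None for v in r.values())]

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences based on the interquartile range.

    The 1.5 * IQR rule is an assumed convention, not taken from the study.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(records, field):
    """Drop records whose value for `field` falls outside the IQR fences."""
    low, high = iqr_bounds([r[field] for r in records])
    return [r for r in records if low <= r[field] <= high]

# Hypothetical daily records; "pm25" in ug/m3, "ws" (wind speed) in m/s.
raw = [
    {"pm25": 41.0, "ws": 1.2},
    {"pm25": None, "ws": 0.9},   # missing reading
    {"pm25": 44.0, "ws": 1.1},
    {"pm25": 39.0, "ws": 1.4},
    {"pm25": 46.0, "ws": 1.0},
    {"pm25": 43.0, "ws": 1.3},
    {"pm25": 400.0, "ws": 1.1},  # implausible spike
]
clean = remove_outliers(drop_missing(raw), "pm25")
print(len(raw), "->", len(clean))
```

Because exceptional events produce real, not erroneous, readings, whether to drop or retain such values is a judgment call that depends on the purpose of the model.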

General
Over the two years from January 2018 to December 2019, a comprehensive spatio-temporal air quality analysis was performed. This analysis aimed to understand the variations in air pollutant levels across different locations and time periods. Additionally, a correlation analysis was conducted on the collected data to reveal the relationships between the variables. For PM2.5 modeling, three regression techniques were employed: multilinear regression, K-nearest neighbor regression, and HGBoost regression. These models were used to predict PM2.5 levels from the available data, and the results obtained from the three models were then compared.
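Of the three techniques, K-nearest neighbor regression is the simplest to illustrate: a prediction is the average target value of the k most similar training samples. The sketch below is a minimal pure-Python version under stated assumptions (Euclidean distance, uniform weights, k = 3, and two hypothetical input features); it is not the study's implementation. In practice a library would normally be used; scikit-learn, for example, provides `LinearRegression`, `KNeighborsRegressor`, and `HistGradientBoostingRegressor`, although the text does not state which software the study relied on.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training rows
    (Euclidean distance, uniform weights)."""
    ranked = sorted(
        (math.dist(row, query), target) for row, target in zip(train_X, train_y)
    )
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Hypothetical training rows: [NO2 (ug/m3), wind speed (m/s)] -> PM2.5 (ug/m3).
train_X = [[30.0, 2.0], [45.0, 1.0], [50.0, 0.8], [20.0, 3.0], [48.0, 0.9]]
train_y = [35.0, 58.0, 64.0, 25.0, 61.0]

print(knn_predict(train_X, train_y, [47.0, 1.0], k=3))
```

Because the distance mixes features with different units, inputs are usually standardised before KNN is applied; that step is omitted here for brevity.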

Spatio-temporal air quality analysis
Figure 3 is a box plot showing the variation of various pollutants at each of the monitoring stations (MS) from January 2018 to December 2019. PM2.5, NO2, SO2, CO, and ozone were analyzed for all six stations. The box plot is based on the daily average data for all the pollutants. The box plot for CO was plotted separately because its units differ. Table 1 provides the air quality standards set by the CPCB for each of the pollutants to achieve satisfactory conditions.
The air quality standard for PM2.5 set by the CPCB for satisfactory conditions is 60 μg/m3. Daily average PM2.5 concentrations were found to exceed the CPCB limit at all stations, with most of the higher concentrations observed during the winter season. During the two-year study period, 246 days had an average PM2.5 concentration that exceeded the CPCB threshold of 60 μg/m3. The highest PM2.5 concentration of the study period, 130.54 μg/m3, was recorded at MS6 on 6 November 2019. The median value of 52.04 μg/m3 obtained for MS5 indicates that the PM2.5 pollution level at MS5 is relatively high compared with the other stations (Table 2).
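Station summaries of this kind (median, maximum, and number of days above the limit) can be computed directly from the daily averages. The following is a minimal sketch with hypothetical values; only the 60 μg/m3 limit is taken from the text.

```python
import statistics

CPCB_PM25_LIMIT = 60.0  # ug/m3, the CPCB limit quoted in the text

def summarise_station(daily_pm25, limit=CPCB_PM25_LIMIT):
    """Return the median, the maximum, and the number of days above the limit."""
    return {
        "median": statistics.median(daily_pm25),
        "max": max(daily_pm25),
        "days_above_limit": sum(1 for v in daily_pm25 if v > limit),
    }

# Hypothetical daily mean PM2.5 values for one station (ug/m3).
sample = [35.2, 48.9, 61.5, 72.0, 58.3, 90.1, 44.7]
print(summarise_station(sample))
```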
A closer look at Figure 3 shows that Monitoring Station 1 (MS1) and MS5 exhibit significantly higher NO2 concentrations than the other monitoring sites. At MS1, the mean daily concentration of NO2 is 48.62 μg/m3, with a maximum of 103.42 μg/m3 observed during the monitoring period. This maximum exceeds the CPCB limit of 80 μg/m3 by 29%. At MS5, the mean daily concentration of NO2 is 48.11 μg/m3, with a peak of 130.59 μg/m3 recorded on 9 February 2019. This maximum is about 63% above the CPCB limit for NO2 and is the highest NO2 concentration reported during the entire monitoring period.
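The exceedance percentages quoted here are relative differences from the limit, (value − limit) / limit × 100. A small helper makes the arithmetic explicit; the function name is hypothetical, and the inputs are the MS1 maximum and the CPCB limit reported above.

```python
def percent_above_limit(value, limit):
    """Relative exceedance of a limit, in percent."""
    return (value - limit) / limit * 100.0

# MS1 maximum NO2 and the CPCB limit, both as reported in the text (ug/m3).
print(round(percent_above_limit(103.42, 80.0), 1))  # 29.3
```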
Figure 4 is a box plot depicting the variation of CO levels across stations. As with NO2, CO concentrations were found to be higher at MS5. The average CO concentration observed at MS5 was 0.75 mg/m3. Although this figure is lower than the CPCB standard of 2 mg/m3, the CO concentration is highly variable, as shown by a standard deviation of 0.37 mg/m3. Figure 1(b) displays the locations of air quality monitoring stations (MS) 1-6. Notably, MS5 (Sanathnagar MS) is located at a heavily used crossroads, which accounts for the elevated CO and NO2 values at this station, since these pollutants are mostly released by traffic-related fuel combustion. The concentrations of SO2 and ozone at all sites remain within the CPCB's acceptable limits of 80 μg/m3 and 100 μg/m3, respectively. Figure 5 uses box plots for the summer and winter seasons to show the seasonal variation of PM2.5 over the two-year study period.
The seasonal pattern of PM2.5 levels, as depicted in Figure 5, clearly shows a significant rise in pollutant concentration during the winter season. This sharp increase can be attributed to the atmospheric temperature inversion that is prevalent during the winter months. Across all monitoring stations, an average increase of 68% in PM2.5 concentration was observed during the winter season. The maximum increase of 82% was recorded at MS2, while the minimum change of 52% was observed at MS3. MS6 had the highest average winter PM2.5 concentration, 69.12 μg/m3, while MS4 had the lowest average summer concentration, 30.69 μg/m3. Table 3 provides a detailed summary of the variation in average PM2.5 concentrations across all sites during the winter and summer seasons.
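The seasonal increase is expressed relative to the summer mean, consistent with the abstract's statement that winter concentrations were 68% higher than summer ones. A minimal sketch of that calculation, with hypothetical station data and a hypothetical function name:

```python
from statistics import mean

def seasonal_increase(winter_values, summer_values):
    """Percent change of the winter mean relative to the summer mean."""
    winter_avg = mean(winter_values)
    summer_avg = mean(summer_values)
    return (winter_avg - summer_avg) / summer_avg * 100.0

# Hypothetical daily PM2.5 values for one station (ug/m3).
winter = [62.0, 70.0, 66.0]
summer = [38.0, 42.0, 40.0]
print(seasonal_increase(winter, summer))
```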

Correlation analysis
A correlation analysis was performed on the acquired data to better understand the relationship between the input and output variables. The analysis was carried out separately for meteorological factors and pollutants. Table 4 lists the meteorological parameters used in the study. The relationship between these parameters and PM2.5 levels is presented as a correlation heatmap in Figure 6.
The results show that all the meteorological parameters used as inputs are negatively correlated with PM2.5. With a Pearson correlation coefficient (r) of −0.62, wind direction (WR) has the strongest negative correlation. Wind speed (WS) also shows a negative correlation, with an r value of −0.24. The negative correlation of PM2.5 concentration with wind speed and direction is explained by the ability of wind to transport light PM2.5 particles. A higher wind speed, or a change in wind direction away from the monitoring station, carries the PM2.5 particles away and thereby reduces their concentration (Raju et al., 2022).
The r value for temperature (Temp) was −0.22. The negative correlation between temperature and PM2.5 concentration is due to the strong air convection at higher temperatures. As the atmospheric temperature increases, the land heats up more quickly than the air. This creates a temperature difference between the air near the land surface and the air above it. The warmer air near the surface becomes less dense and rises through convection. The light PM2.5 particles at the surface are carried upward with the ascending air. As a result, their concentration near the ground is reduced, which lowers the overall level of PM2.5 in the lower atmosphere.
The r values for relative humidity (RH) and solar radiation (SR) are −0.02 and −0.075, respectively, showing that both parameters have only a very weak linear relationship with PM2.5.
Table 5 shows the details about the pollutant parameters being used in the study.
To investigate the relationship between pollutant parameters and PM2.5 levels, a correlation analysis was performed, and the results were visualised as a correlation heatmap (Figure 7). All pollutant parameters were positively correlated with PM2.5. NO2 had the strongest positive correlation, with an r value of 0.54. This suggests that NO2 is a key precursor of PM2.5. Ozone had the second strongest positive correlation, with an r value of 0.32. SO2 was weakly positively correlated with PM2.5, with an r value of 0.22. CO had the lowest r value, indicating little linear association with PM2.5 concentrations. The correlation analysis thus showed how the different input parameters are related to the dependent variable, PM2.5. Although some parameters were only weakly related to the target variable, all parameters were used as inputs in the modelling process to ensure full consideration without adding significant computational complexity.
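The Pearson coefficients reported in this section can be computed directly from paired observations. The sketch below implements the standard formula in plain Python. The readings are hypothetical and serve only to show a negative coefficient of the kind reported for wind speed; the function name is likewise an assumption. Note that r measures linear association only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired readings, for illustration only.
wind_speed = [1.0, 2.0, 3.0, 4.0, 5.0]
pm25 = [80.0, 70.0, 65.0, 50.0, 45.0]
r_wind = pearson_r(wind_speed, pm25)  # negative: PM2.5 falls as wind speed rises
```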

PM2.5 modelling
PM2.5 modelling was carried out using MLR, KNN and HGBoost. The data were split into training (80%) and testing (20%) sets, and three models were built for each algorithm, each with a different set of input variables. The first model used only the meteorological parameters as independent variables, while the second used only the pollutant variables. The third model used a combination of meteorological and pollutant data as input.
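The experimental design described above (an 80/20 split and three input configurations per algorithm) can be sketched as follows. This is a minimal illustration under stated assumptions: the column names are taken from the abbreviations used in the correlation analysis, the helper names are hypothetical, the records are synthetic, and random shuffling is assumed because the text does not say whether the split was random or chronological.

```python
import random

# Column names assumed from the abbreviations used in the correlation analysis.
METEOROLOGICAL = ["WS", "WR", "Temp", "RH", "SR"]
POLLUTANTS = ["NO2", "O3", "SO2", "CO"]

# Three input configurations per algorithm, as described in the text.
FEATURE_SETS = {
    "meteorological_only": METEOROLOGICAL,
    "pollutants_only": POLLUTANTS,
    "combined": METEOROLOGICAL + POLLUTANTS,
}


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and return (train, test) lists.

    Random shuffling is an assumption; a chronological split is equally
    plausible for time-series data.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def select_features(row, feature_names, target="PM2.5"):
    """Return (feature_vector, target_value) for one record stored as a dict."""
    return [row[name] for name in feature_names], row[target]


# Synthetic records: every value here is made up for illustration.
records = [
    {
        **{name: float(i + j) for j, name in enumerate(METEOROLOGICAL + POLLUTANTS)},
        "PM2.5": float(40 + i),
    }
    for i in range(10)
]

train_rows, test_rows = split_rows(records)
x_first, y_first = select_features(train_rows[0], FEATURE_SETS["combined"])
```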

Multi Linear Regression (MLR)
Table 6 displays the results of the multilinear regression model. The MLR1 model, with only pollutant data as input, performed worst, with an R2 of 0.345 during testing. Its error was also higher than that of the other two models, with an RMSE of 18.06 μg/m3 and an MAE of 14.552 μg/m3. During testing, the MLR2 model, which used meteorological parameters as input, performed better than MLR1, with a lower error and an R2 value of 0.467.
However, the MLR3 model, with both pollutant and meteorological data as input, outperformed the other two models. It had the smallest error of the three, with an MAE of 11.297 μg/m3 and an RMSE of 14.453 μg/m3. During training and testing, the R2 values were 0.577 and 0.581, respectively. The testing R2 of the MLR3 model was 68.4% and 24.4% higher than those of the MLR1 and MLR2 models, respectively.
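The error measures and the relative R2 gains quoted above follow standard definitions. The sketch below writes them out in plain Python and reproduces the two percentage gains from the reported testing R2 values; the function names are assumptions chosen for illustration.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def relative_gain(new, baseline):
    """Percentage increase of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# Testing R2 values reported in the text for MLR1, MLR2 and MLR3.
gain_over_mlr1 = relative_gain(0.581, 0.345)
gain_over_mlr2 = relative_gain(0.581, 0.467)
```

Rounded to one decimal place, the two gains match the 68.4% and 24.4% stated in the text.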
The outcomes of the MLR3 model are depicted in Figure 8, a joint plot comparing the predicted and test values. The R2 score for testing was 0.581, indicating a moderate fit. Furthermore, r between the predicted and test values was 0.76, indicating a strong positive association. The distribution plot with a kernel density estimate (KDE) in Figure 9 depicts the distribution of errors (residuals) in the modelling results. The base of the distribution is broad, indicating that the errors are highly variable. The observed peak for zero error is close to 0.03, suggesting a minor bias in the model's predictions (Figure 9). The equation of the line of best fit for MLR3 is given by:

K-nearest neighbour regression model (KNN model)
Table 7 displays the results of the KNN regression model. The number of neighbours used for the modelling was K = 2. The KNN1 model, with pollutant data alone as input, performed worst, with an R2 of 0.471 during testing. Its error was also higher than that of the other two models, with an RMSE of 16.229 μg/m3 and an MAE of 11.717 μg/m3. During testing, the KNN2 model, with meteorological parameters as input, outperformed KNN1, with a lower error and an R2 value of 0.498.
Among the three models evaluated, the KNN3 model, which used both pollutant and meteorological data as input, performed best. With an MAE of 9.137 μg/m3 and an RMSE of 12.594 μg/m3, it had the lowest error of the three. The KNN3 model achieved R2 values of 0.897 and 0.682 during training and testing, respectively. Its testing R2 was 44.8% higher than that of the KNN1 model and 36.9% higher than that of the KNN2 model. Figure 10 shows the results obtained for the KNN3 model as a joint plot of the predicted values versus the test values. The R2 score for testing was 0.682, and the correlation coefficient (r) between predicted and test values was 0.83.
Figure 11 presents the distribution plot with a KDE, representing the distribution of errors (residuals) in the modelling results. The base of the distribution is wide, indicating considerable variance in the errors. Nevertheless, the peak for zero error was observed at 0.05, which is higher than that of MLR3 (0.03). This higher peak suggests that the KNN3 model's predictions were more accurate and closer to the actual values than those of the MLR3 model.
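The KNN regression used in this section, with K = 2, predicts a value by averaging the targets of the two closest training points. The sketch below shows that idea in plain Python. Euclidean distance on unscaled features and uniform weighting are assumptions, since the text does not state the distance metric, the weighting, or any feature scaling; the function name and data are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k=2):
    """Predict by averaging the targets of the k nearest training points."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Pair each Euclidean distance with its target, then sort by distance.
    distances = sorted(
        (math.dist(features, query), target)
        for features, target in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical two-feature training set and PM2.5 targets, for illustration only.
train_x = [[1.0, 10.0], [2.0, 12.0], [8.0, 30.0]]
train_y = [40.0, 50.0, 90.0]
prediction = knn_predict(train_x, train_y, [1.5, 11.0])
```

With a small K, each training point is its own nearest neighbour when the model is evaluated on the training set, which is one plausible reason for a training R2 well above the testing R2.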

Histogram based gradient boost (HGBoost) model
Histograms conveniently illustrate the distribution of data, more precisely the number of occurrences of each value when values are repeated. When the data fed to a model are discretised into bins, as in a histogram, the flexibility of the model increases. Combining a histogram-based algorithm with gradient boosting yields high-performance machine learning ensembles (Nhat-Duc & Van Duc, 2023). HGBoost operates on integer-based data structures (histograms) instead of relying on sorted continuous values when building the trees. HGBoost is well suited to capturing complex nonlinear relationships in the data.
Table 8 displays the results of the HGBoost model. The HGBoost2 model, which used only pollutant data as input, performed worst, with an R2 value of 0.667 during testing. Its error was also higher than that of the other two models, with an RMSE of 11.748 μg/m3 and an MAE of 8.807 μg/m3. During testing, the HGBoost1 model, which used meteorological parameters as input, outperformed HGBoost2, with a lower error and an R2 of 0.728.
However, the HGBoost3 model, with both pollutant and meteorological data as input, performed best of the three models. It had the lowest error, with an MAE of 5.717 μg/m3 and an RMSE of 7.647 μg/m3. During training and testing, the R2 values were 0.981 and 0.859, respectively. Compared with the HGBoost1 and HGBoost2 models (testing R2 of 0.728 and 0.667), the testing R2 of the HGBoost3 model was higher by about 18% and 28.78%, respectively.
Figure 12 illustrates the outcomes of the HGBoost3 model as a joint plot of the predicted values versus the test values. The model achieved an R2 value of 0.859 during testing, demonstrating strong predictive performance. Furthermore, r between predicted and test values was 0.93, showing a strong positive association. Figure 13 depicts the distribution of errors (residuals) in the modelling results using a distribution plot with a KDE. The base of the distribution is broad, indicating that the errors are variable. Notably, the peak detected for zero error is centred at 0.04, indicating that the model is accurate and has a low bias in its predictions (Figure 13).
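The histogram-based idea described at the start of this section, replacing sorted continuous values with integer bins when building trees, can be illustrated with a single split search. The sketch below is a simplified illustration and not the implementation used in the study: it uses equal-width bins (library implementations commonly use quantile-based bins), it searches one feature only, and it works on raw targets, whereas a full gradient-boosting model accumulates gradients per bin and sums many trees. All names and data are hypothetical.

```python
def bin_feature(values, n_bins=4):
    """Map continuous values to integer bin indices using equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def best_histogram_split(values, targets, n_bins=4):
    """Return the last bin index of the left leaf for the best split.

    Only n_bins - 1 candidate thresholds are scanned, instead of one per
    distinct sorted value. Returns None if no valid split exists.
    """
    bins = bin_feature(values, n_bins)
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for b, t in zip(bins, targets):  # build the histogram in one pass
        counts[b] += 1
        sums[b] += t

    total_count, total_sum = len(targets), sum(targets)
    best_bin, best_score = None, float("-inf")
    left_count, left_sum = 0, 0.0
    for b in range(n_bins - 1):
        left_count += counts[b]
        left_sum += sums[b]
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        right_sum = total_sum - left_sum
        # Maximising this score is equivalent to minimising the squared
        # error of predicting each leaf by its mean.
        score = left_sum ** 2 / left_count + right_sum ** 2 / right_count
        if score > best_score:
            best_bin, best_score = b, score
    return best_bin


# Hypothetical feature values and PM2.5 targets with an obvious break.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
targets = [20.0, 22.0, 21.0, 60.0, 62.0, 61.0]
split_bin = best_histogram_split(values, targets)
```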

Comparison of model performance
From Figures 14 and 15, we can infer that the KNN3 model performed better than the MLR3 and KNN2 models. Its MAE and RMSE, 9.137 μg/m3 and 12.594 μg/m3, were lower than the corresponding values for the MLR3 and KNN2 models. The same does not hold, however, in comparison with the HGBoost3 model. Overall, the HGBoost3 model performed better than all the other models, including KNN3, with the lowest MAE and RMSE of 5.717 μg/m3 and 7.647 μg/m3, respectively. The MSE of HGBoost3, 58.487 (μg/m3)2, was also the lowest among all the models. Hence, the HGBoost3 model gave the least error for PM2.5 modelling.
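The error-based ranking above can be reproduced from the testing errors reported in the text for the three combined-input models. The short sketch below selects the model with the lowest MAE and RMSE and derives MSE as the square of RMSE. Because the RMSE values are rounded, the derived MSE agrees with the reported 58.487 (μg/m3)2 only approximately. The variable names are assumptions.

```python
# Testing errors (ug/m3) reported in the text for the combined-input models.
reported = {
    "MLR3": {"mae": 11.297, "rmse": 14.453},
    "KNN3": {"mae": 9.137, "rmse": 12.594},
    "HGBoost3": {"mae": 5.717, "rmse": 7.647},
}

lowest_mae = min(reported, key=lambda name: reported[name]["mae"])
lowest_rmse = min(reported, key=lambda name: reported[name]["rmse"])

# MSE is the square of RMSE; rounding of RMSE carries over into this value.
mse = {name: metrics["rmse"] ** 2 for name, metrics in reported.items()}
```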
The comparison of R2 values during training and testing, depicted in Figure 16, provides a clear assessment of the models' performance. The figure shows that the KNN3 model outperformed both the MLR3 and KNN2 models. The testing R2 of the KNN3 model was 0.682, which exceeded those of MLR3 (0.581) and KNN2 (0.498). However, the HGBoost3 model outperformed all the other models in both training and testing, achieving R2 values of 0.981 and 0.859, respectively. The HGBoost3 model was therefore the most successful for PM2.5 modelling, based on minimum error and maximum R2.
The spatiotemporal air quality evaluations were carried out over a two-year period, from January 2018 to December 2019. The correlation analysis clarified the relationships between the input parameters and the output variable. PM2.5 modelling was performed using multilinear regression, K-nearest neighbour regression, and HGBoost, and the results were compared on the basis of error and R2 values. Finally, the best model was chosen for its high R2 value and low error. HGBoost typically trains quickly, especially on large datasets. Neural-network models such as the multilayer perceptron (MLP), by contrast, require iterative optimisation with backpropagation, which can be computationally intensive and time-consuming, while HGBoost's histogram-based approach allows for efficient parallelisation.

Summary and conclusions
For the city of Hyderabad, a spatiotemporal air quality investigation was performed for the period from January 2018 to December 2019. Multilinear regression, KNN, and HGBoost regression models were used to model PM2.5. The findings show that HGBoost regression outperforms the other two techniques for PM2.5 modelling.
• The spatiotemporal air quality study revealed elevated levels of PM2.5 and NO2 at several monitoring stations.
• The highest measured PM2.5 concentration was 130.54 μg/m3, which exceeded the CPCB air quality standard of 60 μg/m3.
• Similarly, the highest NO2 level was 130.59 μg/m3, which exceeded the CPCB standard of 80 μg/m3.
• There were seasonal changes in PM2.5 concentrations, with average winter levels 68% higher than summer levels.
• Positive correlations were found between PM2.5 and the other pollutants, with NO2 showing the highest correlation (Pearson coefficient of 0.54).
• Conversely, the meteorological parameters were negatively correlated with PM2.5. Wind direction and wind speed showed the strongest negative correlations (−0.62 and −0.24, respectively).
• Meteorological factors had a greater influence on PM2.5 modelling than pollutant data.
• Using both meteorological and pollutant data, the HGBoost3 model performed best, with R2 values of 0.981 and 0.859 during training and testing, respectively.
• The HGBoost3 model had the lowest errors, with MAE and RMSE of 5.717 and 7.647 μg/m3, respectively.
• As a result, the HGBoost3 model was chosen as the study's best PM2.5 forecasting model.
The research emphasizes the urgency of addressing air pollution in Hyderabad, given its adverse impacts on public health, the environment, and overall quality of life. By employing machine learning techniques and adopting a multi-faceted approach, there is hope for positive change, leading to cleaner air and a healthier future for the residents of Hyderabad and other polluted cities worldwide. The findings indicate that air quality in the city is influenced by a combination of local emissions, meteorological conditions, and regional factors. Machine learning models have proven to be effective tools for predicting PM2.5 levels, allowing for better understanding and forecasting of air pollution episodes.
The significance of this study lies in its potential to assist policymakers and environmental authorities in implementing targeted measures to tackle air pollution effectively. By identifying key contributors to PM2.5 concentrations, authorities can design more focused and impactful policies, such as stricter emission controls, improved urban planning, and public awareness campaigns. However, it is essential to acknowledge that the effectiveness of these measures depends on collaboration between different stakeholders, including government bodies, industries, and the general public. Continuous monitoring, data collection, and model refinement will be crucial to maintain the accuracy and reliability of the predictions.

Figure 1. (a) Map of Hyderabad (b) the locations of Hyderabad's six Monitoring Stations (MS).

Figure 3. Variation of pollutants for different stations over the period from January 2018 to December 2019.

Figure 4. CO variations at various stations.

Figure 5. Seasonal variation of PM2.5.

Figure 6. Correlation heatmap of meteorological parameters with PM2.5.

Figure 12. Distribution plot for HGBoost3 model.

Figure 13. Joint plot for HGBoost3 model.

Figure 14. MAE and RMSE for different models.

Figure 16. R2 for different models.