Future Challenges of Particulate Matters (PMs) Monitoring by Computing Associations Among Extracted Multimodal Features Applying Bayesian Network Approach

ABSTRACT The particulate matter (PM) is emitted from diverse sources and affects the human health very badly. In the past, researchers applied different automated computational tools in the predication of PM. Accurate prediction of PM requires more relevant features and feature importance. In this research, we first extracted the multimodal features from time domain standard deviation average (SDAPM), standard deviation of standard deviation (SDSD), standard deviation of particulate matter (SDPM), root-mean square of standard deviation (RMSSD), and nonlinear dynamical measure wavelet entropy (WE) – Shannon, norm, threshold, multiscale entropy based on KD tree (MSEKD), and multiscale approximate entropy (MAEnt). We then applied the intelligent-based Bayesian inference approach to compute the strength of relationship among multimodal features. We also computed total incoming and outgoing forces between the features (nodes). The results reveal that there was a very highly significant correlation (p-value <0.05) between the selected nodes. The highest total force was yielded by WE-norm followed by SDAPM and SDPM. The association will further help to investigate that which extracted features are more positively or negatively correlated and associated with each other. The results revealed that the proposed methodology can further provide deeper insights into computing the association among the features.


Introduction
The human health is severely affected due to the pollutant particulate matters (PMs) time series (Ostro, Broadwin, and Lipsett 2000;Weng, Chang, and Lee 2008). The size of the PMs varies in diamete from few nanometers (nm) to tens of micrometers (µm). The types of PMs are PM 1:0 ; PM 2:5 ; PM 10 . The human health is hazardously affected by these PMs due to the variation in distribution, size, and composition (Ostro, Broadwin, and Lipsett 2000). The ultrafine particles (PM 1:0 ; PM 2:5 ) have a more impact on the human health than coarse particles (PM 10 ) (Laden et al. 2014;Mar et al. 2006). According to the survey in 1993, it was observed by the World Bank that about 50% of the diseases are spreading due to the poor household environment (Albalak et al. 1999;L. P. Naeher et al. 2001). The indoor PMs are inhaled from cigarette smoking, cooking, wood burning stoves, and malfunctioning combustion appliances (C. K. Lee and Lin 2008;Martínez et al. 2016). The other sources affect the human health via respiratory systems (Zhiqiang et al. 2000) such as cough, asthma, bronchitis, cancer, fever, bronchial constriction, and obstructive disease (Repace and Lowrey 1980). Moreover, 90% of the people who are affected by the PM are spending most of their time at home rather than outside (Kado et al. 1994). Moreover, the people living in the rural areas are using domestic wood combustion heaters during cold or moderate weather, which severely affect their health GLASIUS et al. 2006;Grange et al. 2013;Molnár and Sallsten 2013;Trompetter et al. 2013). The respiratory symptoms and exacerbations especially in children and young adults are associated with elevated concentration of ambient PM in wood burning communities (Lipsett, Hurley, and Ostro 1997;McGowan et al. 2002;Luke P. Naeher et al. 2007;Town 2001). The studies indicate that wood smoke affects the human health in a similar way as that of diesel and gasoline. There is a dire need to study the health impacts of ambient PM emitted from wood combustion heaters and other sources. The research also reveals that wood smoke contains the polycyclic aromatic hydrocarbon compounds having carcinogenic properties, and ultimately, the indoor exposure to wood smoke increases the cancer risk (Hosgood et al. 2010).
The optimal and modern combustion engines produce the diesel exhaust particles that are primarily PM 2:5 , a substantial component of which are PM 1:0 . These particles are highly complex comprising the core of elemental carbon and absorbed organic compounds and small amount of nitrate, sulfate, metal, and many trace elements (Wichmann 2007).
The diesel exhaust study is very problematic as the PM of diesel exhaust varies in chemical composition and size with respect to the engine type (light duty, heavy duty, or method of fuel injection), operating condition of the engines (accelerating, decelerating, or idle), and fuel formulation (low/high, petroleum-based diesel, sulfur fuel, or biodiesel). These differences are yet unclear that how much they change the toxicity (Matti Maricq 2007; Y. Wang et al. 2015). The lifetime of the atmospheric PM ranges from minutes to several days. The impact of these particles on health can be of greater extent as they age in the atmosphere (Duncan et al. 2008).
Likewise, the crustal dust produces the health risks including respiratory mortality(>75 years of age) (Zauli Sajani et al. 2011), cardiovascular mortality (Chan and Ng 2011;de Longueville et al. 2013;Elliott, Henderson, and Wan 2013;H. Lee et al. 2013H. Lee et al. , 2014, asthma exacerbation (Kanatani et al. 2010;J.-W. Lee and Lee 2014;PARK et al. 2005;C.-H. Wang, Chen, and Lin 2014;Yoo et al. 2008), respiratory and COPD morbidity TAM et al. 2012), pneumonia (Cheng et al. 2008;Griffin 2007), reduced lung function in children (Hong et al. 2010), lung inflammation (Ghio et al. 2014;Lei et al. 2004), and infectious diseases (Goudie 2014;Sprigg et al. 2014;Yang 2006). The PMs spreading from diverse sources have a very severe impact on human health. With the passage of time, the environment due to the polluted PMs, weather environmental changes, huge constructions, increase in motor vehicles, and other sources that are used as means of facilities severely impact our health due to spreading the concentration of PMs on diverse means. Thus, if no precautionary measures and tools are developed and concerned health-care professionals and environmental professionals are provided with scientific solutions, there might be severe future impacts and challenges on human health.
In the past, researchers utilized traditional machine-learning algorithms. However, we need a more comprehensive analysis to determine the associations and other Bayesian measures to further strengthen our analysis to unfold the hidden dynamics for further improving the PM concentration prediction. The parametric information from the data in the recent studies has been investigated using a probabilistic propagation algorithm (Bayes Rule) by applying Bayesian networks (BNs). The degree of uncertainty and associations of variables varies from different sources such as numerical data, empirical data, and expert opinion to capture the conditional dependencies of a variable upon others (Kaikkonen et al. 2021). BNs have successfully been utilized in many studies by different researchers such as Kocian et al. (2020), Amaral et al. (2019), Laurila-Pant et al. (2019), and Zhang et al. (2019). The causal relationships can be studied between variables, which compute the probabilities of a variable when other variables in the model are known. Moreover, the Monte-Carlo analysis (MCA) can be used at random sampling of probability distribution functions (PDFs) to denote the inputs of Bayesian model to produce hundreds or thousands of possible outcomes (Sperotto et al. 2019). Recently, BNs have successfully been utilized in many applications ranging from predicting energy crop yield (Gandhi, Armstrong, and Petkar 2016), prediction of coffee rust disease using BNs (Corrales 2015), and sustainable planning and management decision (Musango and Peter 2007). The BNs computed the interrelation among variables that impacts climate change scenarios in agriculture (Ershadi and Seifi 2020). Moreover, recently, Lu et al. (2020) utilized BNs to investigate the complex causal interactions between environments and plant diseases. Previous studies did not focus on association among features to further unfold nonlinear dynamics in order to further improve the health care and environmental issues regarding PM time series.
In the present study, we aimed to apply the Bayesian inference (BI) approach for comprehensive analysis to unfold the nonlinear and hidden dynamics present in the nonlinear and nonstationary indoor and outdoor PM time-series data. The PM concentrations have a very severe health impact, so there is a dire need to unfold nonlinear hidden dynamics and associations among extracted features based on Bayesian artificial intelligence methods so that the concerned health professional can take the precautionary measures to reduce the mortality risks. We computed the associations and strength of relationships among multimodal extracted features. The Bayesian approach recently gained its popularity and utilized in many biomedical signal and image processing problems. The BI evaluates the posterior probability, which can be yielded from a weighted combination of local estimates known as likelihood and estimates in surrounding spatial units. Researchers are developing intelligent methods based on machine-learning algorithms, which require the extraction of most relevant features. Our research objective was multifold; first, we computed the multimodal features based on time domain, frequency domain, and entropy-based complexity measures. We then ranked the features based on EROC value. The higher the ROC value, the more important the feature indicates. Second, once we ranked the features, we then selected the high-ranked feature as our target node and then further computed the detailed Bayesian analysis with other features to further unfold the underlying hidden dynamics. We computed the relationship analysis among the extracted nodes using mutual information (MI), Kullback-Leibler (KL) divergence, and Pearson's correlation. The strength of relationship was computed using arc analysis with 3D mapping. We then computed the parent-child relationship and node force between the nodes. The association graph for segment profile analysis was computed for further analysis. Moreover, the network performance and significance of prominence were computed using tornado diagram. The flow of our work is presented in Figure 1. We first took the PM time-series input signal, then extracted the multimodal features, applied the features to BN, and utilized different methods for detailed analysis. Finally, the network performance was evaluated using different metrics.
This study is aimed to predict the PM time series by extracting multimodal features from time domain, statistical, and entropy-based complexity measures by applying BN analysis method. We investigated the network performance including the strength of relationship and node force among the nodes (features) using MI, Pearson's correlation (PC), and KL.

Dataset
The dataset of indoor and outdoor PM time series utilized in this study was previously utilized by us and is detailed inHussain et al. (2020a).

Feature Extraction
In machine learning, the most important step is to compute the most relevant features. Researchers in the past computed the most relevant features. To detect the colon cancer, the authors (Ferland et al. 2017;Rathore et al. 2014) computed the hybrid features. Recently, Hussain and coworkers computed multimodal features and texture features to compute the various pathologies in signal and imaging problems (Abbasi et al. 2020; A multi-modal, multi-atlas -based approach for Alzheimer detection via machine learning, 28 2018; Anjum et al. 2021;Iqbal et al. 2021;Lal et al. 2021). Moreover,  computed multimodal features from time domain, statistical, and entropy-based features to detect the PM time series. Feature extraction is an important step successfully utilized in most of the image and signal processing problems (Barbhuiya, Karsh, and Jain 2021;Li, Huang, and Srivastava 2021;Xie et al. 2021). We utilized the features from time domain, frequency domain, statistical, entropy, and wavelet-based complexity features as detailed by .

Feature Ranking Algorithms
The feature ranking algorithm is used to rank the importance of features using supervised ranking algorithms. This ranking is based on the scoring values of the algorithms (H. Wang, Khoshgoftaar, and Gao 2010). Different ranking algorithmic methods including wrapper are utilized (Shakir et al. 2019). Recently, MATLAB toolbox with a total of 30 FIR methods have been utilized by Yu et al. (2019) to integrate the paves to select the features and diagnose intelligently the real-world applications. We computed the feature importance of the multimodal features extracted from PM time series. We used EROC to compute the extracted feature importance. The EROC was computed as area between the empirical receiver operating characteristic curve (EROC) and random classifier slope. The higher EROC value indicates the more important feature that contains the most important information to further unfold the nonlinear and hidden information of the data of interest. After extracting the features, the important step is to apply the feature selection methods to rank the important features (ROFFO et al. 2020;Teng et al. 2019;Venkatesh and Anuradha 2019;Yu et al. 2019).

Bayesian Network Analysis
The Bayesian analysis is successfully utilized in various signal and image processing applications (Chen et al. 2021;Kottke et al. 2021;Ni, Zhang, and Liu 2022;Yarnell et al. 2021). The causal effect and their relationship are determined using Bayesian method and directed acyclic graph (DAG) (Pearl 1986). Consider X = {X 1 , X 2 , X 3 , . . . . . . X n } a set of m-dimensional variables, the BN is defined with couplet X ¼ G; P h i, where G denotes the DAG and P denotes the set of parameters that quantify the network that contains the probabilities of each possible value of xi for each variable Xi. Mathematically: Here, X j i ð Þ represents a set of parent variables of X i for direct acyclic graph G. The posterior probability is thus computed using this algorithm through inference of variable of interest.
BayesiaLab V10 was employed for detailed analysis (Bayesia 2017). The Shannon entropy (Shannon 1948) was computed using: The MI algorithm calculates the difference between marginal entropy of the target variable and conditional entropy of the predicted variable (Shannon 1948); mathematically, which is equivalent to: Moreover, conditional mutual information (CMI) is defined as: The p (X, Y) shows joint probability distribution of X and Y. However, p(X) and p (Y) indicate the marginal distribution of X and Y, respectively. The relevant Gaussian distribution of co-variance matrix variables X 1 , X 2 , X 3 , . . . . X n (Xiao et al. 2016) can be computed as: By applying the mathematical transformation function, the MI and CMI2 can be calculated using the following formulae: To correct the underestimation of conditional Mutual Information 1 (CMI1) (Janzing et al. 2013), the CMI2 is used to integrate the interventional probability and KL divergence (Kullback and Leibler 1951).
Using the Gaussian distribution, the CMI2 can be easily calculated with the same hypothesis. The relationship between the two random variables can be computed using the Pearson's correlation coefficient (PCC) initially proposed by Pearson ("VII. Mathematical contributions to the theory of evolution. -III. Regression, heredity, and panmixia 1896) to measure the strength and direction of relationship (J. Lee and Nicewander 1988). PCC is utilized in many applications including classification (Tyagi 2015), data analysis (Tyagi 2015), biological research (Puth, Neuhäuser, and Ruxton 2014), decision-making and clustering (Liao, Xu, and Zeng 2015), and (Puth, Neuhäuser, and Ruxton 2014) finance analysis (Kim, Kim, and Ergün 2015). Mathematically, it can be computed using the following formula.
Þ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P X i ; � X ð Þ 2 q ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P The coefficient r XY ranges from −1 to 1. The PCC provides the strength of linear relationships between the two random variables X and Y. The value denotes the degree of relationship, positive sign shows the direct relationship, and negative sign denotes the inverse relationship. The zero value shows that there is no correlation. The value of jr XY j closer to 1 indicates the stronger relationship.

Statistical Analysis
We computed the multimodal features from indoor and outdoor PM time series using MATLAB. We then provided the feature matrix to BayesiaLab for further detailed analysis. We conducted the analysis using BayesiaLab 10.0. We used the BayesiaLab with minimum description length (MDL) of candidate network in its score-based algorithm to compare the BN structure (Conrady and Jouffe 2015). The statistical independence test (GKL-test; p-values >0.05) was used to validate the connections among the descriptors, which were identified by the learning algorithm. The p-values or independence probabilities were utilized to check the significance of each individual relationship between the nodes or between the nodes and the target node (Harris et al. 2014;Thai et al. 2012).

Exploratory Analysis of the Unsupervised Network
The exploratory analysis can be utilized to determine the potential relationship between variables of interest (Moreno-Jiménez et al. 2011). We can further explore the global analysis of problem of interest by computing influence between nodes and influence of nodes under investigation. We build our model by learning unsupervised learning algorithm utilizing maximum spanning tree algorithm approach developed in BayesiaLab V10 (Wilhere 2012). We also computed maximum spanning tree (MWST). A lowest value of MDL indicates best trade-off between complexity and data representation.

Sensitivity Analysis
A detailed sensitivity analysis was performed to check the relationship among the nodes in the selected network. To understand the relationship between the nodes, we computed the highest and lowest values of PM, MI, and KL, and node force between the nodes was examined globally on the network. The probabilistic dependencies were computed using MI between the nodes in the network. The tornado plots are used to display the influential knowledge of each node using sensitivity analysis on probability of each descriptor, and maximum strength of individual relationship between the nodes and descriptors was computed. The lowest and highest probability values for each node are displayed from the tornado plots to achieve hard evidence placed on the corresponding descriptor state . The confidence and consistency levels of the sensitivity analysis using the BN model are verified by validating the model (Goerlandt and Montewka 2015;Hänninen and Kujala 2012;Tanackov et al. 2018) to verify different conditions.

Segment Profile Analysis of Energy
The analysis was also done using segment profile analysis using radar chart for normalized mean values conditionally to energy for all other multimodal features. The significance was tested using Bayesian test (Best) and NHST t-test (a frequentist test). Using NHST t-test, the two-tailed t-test is utilized for null hypothesis significance testing. The Bayesian (Best) test is detailed by John K. Kruschke (Kruschke 2013), which follow the student's t-distribution. Moreover, 95% confidence interval (CI) is utilized. When the mean values are estimated significant, a square is added next to the label.

Results and Discussions
In this study, we first computed the time domain, frequency domain, and entropy-based complexity features from PM time-series data. The features were ranked using ROC. The SDPM, SDSD, RMSSD, MApEn, RMS, and variance yielded the highest ROC (0.4861). For the rest of the analysis, we have chosen the SDPM as our target variable. Figure 2 shows the ranking of multimodal features using ROC values. The higher ROC value indicates the more important feature. The higher entropy value indicates the more complex and important feature. We extracted the features from time domain, frequency domain, and entropy-based complexity features from PM time series of indoor and outdoor environment. The features are ranked without utilizing any unsupervised or supervised machinelearning algorithm. A specific method that ranks the features is based on the assigned score values (H. Wang, Khoshgoftaar, and Gao 2010). Finally, based on these scores, the features are ranked and the features with redundant information are further eliminated for classification. In this study, we ranked the multimodal features based on ROC values developed in MATLAB diagnostic tool. Figure 3 depicts the relationship analysis using BI methods including the MI, KL, and PC. The bold lines represent the stronger relationship, the lighter lines indicate the smaller relationship. The blue color indicates the positive relationship, whereas the red color indicates the negative relationship. Moreover, the arrows indicate the ðparent ! childÞ relationship. We kept our target node as SDPM using MI; probability of occurrence for the state ≤0.267 (58.33%), state ≤0.597 (37.50%), and state >0.597 (4.17%) with joint probability for all states is 100%. The probability distribution with other extracted nodes at selected states is depicted in Figure 3 (a-c). Figure 4 represents the 3D mapping arc analysis to show the relationship among the multimodal extracted features. The nodes represent the features and lines represent the relationship between the nodes. The strength of relationship is denoted by the width of line. The blue color represents the positive relationship, whereas the red color denotes the negative relationship. Using the MI, the highest strength of relationship was obtained between the nodes WE À LogE ! WE À Th; 1:2478 ð Þ followed by (SDAPM ! SDPM, 1.1753), SDAPM ! WE À Shannon; 0:9544 ð Þ and so on as reflected in Figure 4(a). Figure 4(b) shows the association between the nodes using KL. The highest strength of relationship was yielded between the nodes WE À LogE ! WE À Th; 1:2478 ð Þ followed by SDAPM ! SDPM; ð 1.1753), SDAPM ! WE À Shannon; 0:9544 ð Þ, and so on. Figure 4(c) denotes the relationship between nodes using Pearson's correlation. The highest strength of relationship was yielded between the nodes (WE -LogE ! WE À Th; 0:9693Þ followed by SDAPM ! SDPM; À 0:9823 ð Þ and so on. The highest negative relationship was yielded between the nodes SDAPM ! WE À Shannon; À 0:9310 ð Þ followed by SDAPM ! WEÀ ð Shannon, -9186) and so on. All other nodes exhibit the positive relation,  where a week relationship was yielded between the nodes cluster WE-norm and SDAPM. The strength of relationship using these methods is also reflected in Table 1. Moreover, the highly significant results (p-value <0.005) were yielded for all ðParent ! childÞ relationships except the last two pairs. Table 2 reflects the incoming, outgoing, and total force of different extracted multimodal features from PM time-series data. The SDAPM node has outgoing force (22.1297), incoming force (0.6082), and total force (2.7379); the eWE-Norm node has outgoing force (1.2634), incoming (0.8055), and total force (2.0688) and so on. The highest outgoing and total force was yielded by the node SDAPM such as 2.1297 and 2.7379, respectively. The highest incoming force was yielded by the node WE-Th (1.2478).
We first extracted the multimodal features of PM time series. We then ranked the features before applying the BI approach. The SDAPM was highly ranked features measured using EROC and random classifier slope, which was selected as our target for further Bayesian analysis. We computed the association of top ranked SDAPM feature with other features to further unfold the association among the features. There were four states represented by ≤0.267, ≤0.597, and >0.59 with the highest data points to lowest data points in ascending order, respectively, as represented in the Mosaic association graph in Figure 5 and Table 3.   Table 4 reflects the overall analysis of target node SDAPM with other nodes. All nodes exhibit the highly significant results. Figure 6 depicts the analysis of association graph for segment profile analysis of top ranked target node with other extracted multimodal features using the radar chart, which reflect the distributions based on 1-12 clock hours. Figure 6(a) reflects the overall probability, and we used the NHST t-test and Bayesian test to find the significance to distinguish with other different Figure 5. Analysis of target node energy with other extracted nodes using mosaic graph based on selected target node SDPM and predictions of occurrence made against each state ≤0.267, ≤0.597, and >0.59.   states such as (a) ≤0.267, (b) ≤0.597, and (c) >0.597 as reflected in Figure 6(be). The highest significance using both the tests is reflected by double boxes blue and light blue, where the significance with any one test is reflected by single box. The node with no box shows no significance at all with any test. The network performance of the selected target node SDAPM with other selected nodes yielded the highest predictions with 100% reliability and precision for all the selected states as shown in Figure 7(a-c). A relative Gini index of 99.49% and ROC index of 100% were obtained as reflected in Figure 7(b-c).
Using the tornado graph as reflected in Figure 8, we visualize the maximum deltas in the posterior probabilities of the target states, and hard evidence is set on the selected variables. The strong deltas are shown at the top of the graph. The highest association was yielded with SDAPM, WE-Shannon, RMSSD, and SDSD for state ≤0.267 (58.33%) followed by cluster state ≤0.597 and >0.597 as reflected in Figure 8. This indicates that high top ranked SDAPM feature prevails high associations with SDPM, WE-Shannon, RMSSD, and SDSD, which can be used as a better predictor for improved analysis and mass concentration of PM time series.
Most of the studies employed prediction methods. However, after extracting hand-crafted features, ranking of features and computing associations among features can further help the concerned health departments and researchers to further improve the health-care systems and environments by understanding the association and strength of relationship among the extracted features. Recently, the researchers developed different methods to predict the PMs using classification methods.  applied machine-learning classification methods to distinguish the indoor PM time series. Recently, Mengash et al. (2022) improved the classification accuracy by further optimizing the robust machine-learning algorithms and applying feature selection methods. Doreswamy, Gad, and Gad (2020) applied regression methods to forecasting ahead the concentration of PM time series. Moreover, Doreswamy, Gad, and M, Y (2021) utilized spatiotemporal clustering to investigate the PM time series. Hosahalli and Gad (2018) applied methods to handle the missing data for weather station data. Likewise, Doreswamy, Gad, and Gad (2020) employed Seasonal Autoregressive Integrated Moving Average (SARIMA) model to monitor the air quality. Most of these studies are utilizing classification, regression, and forecasting methods. However, our direction is to unfold the nonlinear hidden dynamics from extracted multimodal features from indoor and outdoor PM concentration time series. We first extracted multimodal features from time domain (to capture the time variations), frequency domain (to capture the spectral variations), entropy, and wavelet-based features (to capture the nonlinear hidden dynamics); the same features were previously utilized for classification, which provided the 100% classification performance. However, accurate classification cannot convey much information for concerned health-care professionals, instead to provide them the importance of features, along with the association among features that can be helpful in making the decision. We previously investigated the association among morphological features to predict the prostate cancer (Lal Hussain et al. 2019). In this study, after ranking the features and finding the association among features, we investigated a detailed comprehensive analysis of target variables with other computed variables as reflected in radar graph that how much important significance was found. We also computed the impact in terms of tornado diagram to see the overall impact of target variable with other features. The strength of relationship and other important measures further unfold the nonlinear hidden dynamics present in the indoor and outdoor PM time series.

Conclusions
Currently, the PMs spreading from diverse sources severely affect the human health. The different sources include anthropogenic and organic, chemical, and soluble compositions. The researchers are devising different methods to study the dynamics of PM time series. This study quantifies the associations computed between the multimodal features extracted from PM time-series data utilizing BI approach. Previous studies utilized mostly the classification and regression methods. Recently, we applied the BI method and computed the association among the multimodal features, sensitivity analysis, target node analysis, and all computed node network analysis for deeper understanding the hidden dynamics to further unfold the nonlinear dynamics in PM time series. The results reveal that the proposed approach can be very helpful for health and environment professionals to take precautionary measures to reduce the health risk due to the PM time series. Currently, we have limited data of PM time series; however, in future, we will acquire more data from diverse sources including construction areas, crush plants, heavy traffic zones, farming lands and cultivations, and wooden combustion areas. We will also extend our analysis with more in-depth analysis of PM time series. We will further apply the segment analysis, target profile analysis and other posterior probability methods, tree optimizations, function optimization, and posterior mean analysis.