Machine learning in coronary heart disease prediction: Structural equation modelling approach

Abstract This research is an application of machine learning in medical sciences. The purpose of this research was to use machine learning on simulated data to study the association of age, body mass index, cigarettes smoked per day, alcohol consumed per week, diastolic blood pressure, and systolic blood pressure with hypertension and coronary heart disease. Structural Equation Modelling using the Partial Least Squares method was used to analyse the data. The results revealed that all factors except age, body mass index, and systolic blood pressure had a significant positive association with hypertension and coronary heart disease. The results can be of use to medical practitioners as well as researchers in machine learning, as they add to the repository of earlier studies that have sought relationships between these variables.


PUBLIC INTEREST STATEMENT
Research on the identification of antecedents of hypertension and coronary heart disease has been active for several years, and newer methods are being designed to identify the most significant factors that may be responsible for the risk of hypertension and coronary heart disease. In this research, a machine learning technique has been used to study the association of age, body mass index, cigarettes smoked per day, alcohol consumed per week, diastolic blood pressure, and systolic blood pressure with hypertension and coronary heart disease. Structural Equation Modelling was used to analyse the data. The results revealed that all factors except age, body mass index, and systolic blood pressure had a significant positive association with hypertension and coronary heart disease. The results can be of use to medical practitioners, as they add to the repository of earlier studies in this area.

Introduction
Heart disease is one of the major killer diseases today (Aljanabi, Qutqut, & Hijjawi, 2018). One of the most challenging tasks is to identify the causes of this disease and prevent it to the extent possible. Thus, one of the most anticipated applications of Machine Learning (ML) has been disease detection and prevention. Medical diagnostic reasoning is becoming a popular application of ML, in which expert systems and model-based schemes provide mechanisms for developing hypotheses that are then tested using modelling and simulation techniques or statistical analysis (Magoulas & Prentza, 1999). Deep neural networks are another promising application of machine learning; they relate the inputs and outputs of complex models through layers so that the relationships may be studied in detail (Bakator & Radosav, 2018). Convolutional neural networks (CNN) have proved very promising and have been used successfully in anatomical localization (de Vos, Wolterink, de Jong, Viergever, & Išgum, 2016), automated segmentation (Dou et al., 2017), brain tumour grading (Pan et al., 2015), glaucoma detection (Chen, Xu, Wong, Wong, & Liu, 2015), Alzheimer's disease prediction (Payan & Montana, 2015), automatic breast tissue classification (Dubrovina, Kisilev, Ginsburg, Hashoul, & Kimmel, 2018), and automatic detection of myocardial infarction (Acharya et al., 2017). A deep belief network path (DBN-NN) based on the back-propagation algorithm of the Artificial Neural Network (ANN) has been very promising in breast cancer classification (Abdel-Zaher & Eldeib, 2016). While these are a few examples of successful applications of ML in the medical field, using it to predict the causes of disease well in advance, so as to prevent it, remains a continuing endeavour. Among the various approaches, structural equation modelling has been very promising in the recent past, and this paper is an attempt in that direction.

Study population
The research is based on real-life parameters obtained from the Offspring Cohort data of the Framingham Heart Study (Mi, Eskridge, George, & Wang, 2011). The average values of body mass index (BMI), cigarettes smoked per day (CPD), alcohol consumed per week (APW), diastolic blood pressure (DBP), and systolic blood pressure (SBP) in the given range of age were randomized through the program. Each of the latent variables has been divided into three mutually exclusive groups, as the SEM approach demands at least two indicators for a given latent variable. The age of the population ranged from 45 to 60 in steps of 5; BMI ranged from 23 to 28.5 in steps of 2.9; CPD ranged from 5 to 15; APW ranged from 1.5 to 4.9 oz; DBP ranged from 65.4 to 78.5; SBP ranged from 107.4 to 128.6; the percentage chance of hypertension (HPT) ranged from 10 to 30; and the percentage chance of coronary heart disease (CHD) ranged from 5 to 15, as indicated in the Framingham Heart Study (Mi et al., 2011).
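The randomization step described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not name the sampling distribution or the program used, so uniform and independent draws within each quoted range are assumed here, and the names RANGES, AGES, and simulate_population are hypothetical. Independent draws carry no built-in association between variables, so this sketch does not reproduce the dependence structure of the study data.

```python
import random

# Ranges quoted in the text (Framingham Offspring Cohort parameters).
# Uniform, independent sampling within each range is an assumption of
# this sketch; the paper does not state the distribution it used.
RANGES = {
    "BMI": (23.0, 28.5),    # body mass index
    "CPD": (5.0, 15.0),     # cigarettes smoked per day
    "APW": (1.5, 4.9),      # alcohol consumed per week, oz
    "DBP": (65.4, 78.5),    # diastolic blood pressure
    "SBP": (107.4, 128.6),  # systolic blood pressure
    "HPT": (10.0, 30.0),    # percentage chance of hypertension
    "CHD": (5.0, 15.0),     # percentage chance of coronary heart disease
}
AGES = (45, 50, 55, 60)     # 45 to 60 in steps of 5


def simulate_population(n, seed=0):
    """Return n simulated records, one dict per respondent."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        record = {"AGE": rng.choice(AGES)}
        for name, (low, high) in RANGES.items():
            # Round to one decimal, matching the precision of the ranges.
            record[name] = round(rng.uniform(low, high), 1)
        records.append(record)
    return records


population = simulate_population(500)
print(population[0])
```

Fixing the seed makes the simulated sample reproducible, which matters when the same data feed both the measurement model and the structural model.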

Structural equation modelling
The method adopted in this research to identify the interaction of the antecedents with hypertension and coronary heart disease is Structural Equation Modelling (SEM). SEM was chosen as the tool for analysis because it can perform factor analysis and regression analysis simultaneously and, as it is based on the Partial Least Squares Method (PLSM), it addresses the issue of multi-collinearity very effectively. SEM has two distinct components: the measurement model, which establishes the reliability and validity of the research data, and the structural model, which tests the hypotheses.

Mathematical background
The hypothetical model adopted in this research is derived from the threshold model proposed by Muthén (1984), according to which the dichotomous response for the presence or absence of CHD or hypertension, $\phi_{ijk}$ ($k = 1, 2, \ldots, p$), is a function of an unobserved continuous response $\phi^{*}_{ijk}$ via the threshold model:
$$\phi_{ijk} = \begin{cases} 1, & \text{if } \phi^{*}_{ijk} > \tau_k \\ 0, & \text{otherwise} \end{cases}$$
where $\tau_k$ is the threshold parameter.
The structural model is developed based on the above-defined threshold model with Level-1 model which is within the Family Model and Level-2 which is Family Random-intercept Model.
At Level-1, the structural model takes the linear form
$$\phi^{*}_{ij} = v_i + B\,\phi^{*}_{ij} + \Gamma x_{ij} + \delta_{ij}$$
where $v_i$ is a vector (p × 1), $B$ is a p × p matrix of structural parameters, $\Gamma$ is the coefficient matrix describing the relationship between the latent responses and the predictor variables $x_{ij}$, and $\delta_{ij}$ is a vector of residuals (p × 1). At Level-2, the model is
$$v_i = \gamma + \zeta_i$$
where $v_i$ is the vector (p × 1) in the i-th family, $\gamma$ is a constant, and $\zeta_i$ is a vector of residuals (p × 1).
Together, the Level-1 (j-th individual) and Level-2 (i-th family) equations provide the generalized equations for the two-level SEM.
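The two-level threshold idea described in this section can be illustrated with a small simulation. This is a minimal sketch under assumptions, not the model fitted in this research: it uses a single response, no predictor variables, and normally distributed residuals, and the function name simulate_family and its parameters (tau, gamma, sd_family, sd_individual) are hypothetical.

```python
import random


def simulate_family(n_members, tau, gamma=0.0, sd_family=1.0,
                    sd_individual=1.0, rng=None):
    """Simulate one family under a random-intercept threshold model.

    Returns a list of (latent, observed) pairs, one per family member.
    Normal residuals and the absence of predictors are assumptions
    made only for this illustration.
    """
    rng = rng or random.Random(0)
    # Level 2: family intercept = constant + family-level residual.
    v_i = gamma + rng.gauss(0.0, sd_family)
    members = []
    for _ in range(n_members):
        # Level 1: latent response = family intercept + individual residual.
        latent = v_i + rng.gauss(0.0, sd_individual)
        # Threshold model: the observed dichotomous response is 1 only
        # when the latent response exceeds the threshold tau.
        observed = 1 if latent > tau else 0
        members.append((latent, observed))
    return members


TAU = 0.5
family = simulate_family(n_members=4, tau=TAU, rng=random.Random(42))
for latent, observed in family:
    print(f"latent={latent:+.3f} observed={observed}")
```

Because all members of a family share the same intercept, their latent responses are correlated, which is the reason for modelling the family as a second level.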

Hypothetical model
While HPT and CHD have been causally linked to each other by a good number of researchers (e.g. Wu et al., 2015), both have also been associated with several other factors. Age is one of the factors that has been associated with HPT and CHD (e.g. Franklin & Wong, 2013; Jousilahti, Vartiainen, Tuomilehto, & Puska, 1999; Kjeldsen, 2018; Sekhri et al., 2014; Wu et al., 2015). Kjeldsen (2018) has specifically found the need to develop age-adjusted models, based on relative risk by age within and between countries, to generalize the association of age with HPT and CHD. Increasing blood pressure with age has been considered a major factor that can induce HPT or CHD. Age-related stiffening of the aorta leads to a reduction in the capacity of the elastic reservoir and a greater chance of CHD, particularly during SBP (Franklin & Wong, 2013). Franklin and Wong (2013) found that both SBP and DBP have a significant association with CHD. Mi et al. (2011) undertook separate studies for males (n = 496) and females (n = 470) to study the influence of various factors on HPT and CHD. In males, age had an influence on HPT, whereas BMI, CPD, APW, DBP, and SBP had none; in females, age, DBP, and SBP had a significant influence on HPT, whereas CPD and APW did not. Wu et al. (2015), in a study with a large sample of 77,389, found that the influence of HPT on CVD was much higher in women than in men. The study also revealed that HPT was significantly influenced by age, BMI, CPD, and APW. Franklin and Wong (2013) found that the association of SBP with CHD was higher than that of DBP.
The above studies indicate that, while several parameters are associated with HPT and CHD, there is a need to investigate further the influence of age, body mass index, cigarettes smoked per day, alcohol consumed per week, diastolic blood pressure, and systolic blood pressure. Hence the following hypotheses:
H 1 : Age has a significant association with hypertension.
H 2 : Body mass index has a significant association with hypertension.
H 3 : Cigarettes smoked per day has a significant association with hypertension.
H 4 : Alcohol consumed per week has a significant association with hypertension.
H 5 : Diastolic blood pressure has a significant association with hypertension.
H 6 : Systolic blood pressure has a significant association with hypertension.
H 7 : Age has a significant association with Coronary heart disease.
H 8 : Body mass index has a significant association with Coronary heart disease.
H 9 : Cigarettes smoked per day has a significant association with Coronary heart disease.
H 10 : Alcohol consumed per week has a significant association with Coronary heart disease.
H 11 : Diastolic blood pressure has a significant association with Coronary heart disease.
H 12 : Systolic blood pressure has a significant association with Coronary heart disease.
H 13 : Hypertension has a significant association with Coronary heart disease.
The hypothetical SEM is depicted in Figure 1.

Results and discussions

Measurement model
Cronbach's alpha, a measure of the internal consistency of the data, varied from 0.7 to 0.9, indicating a moderate-to-high level of acceptance (Taber, 2018) (Table 1). The composite reliability of the model varied from 0.7 to 0.9, confirming moderate to high reliability (Ahmad, Zulkurnain, & Khairushalimi, 2016). Rho-A is another measure of composite reliability, for which values above 0.6 are considered acceptable (Rigdon, Ringle, & Sarstedt, 2010); in the present case it ranges from 0.7 to 0.8, which indicates moderately high composite reliability. The convergent validity, measured as the standardized factor loading (FL) after factor reduction (Table 2), ranged from 0.5 to 0.9, indicating a good correlation between each factor and its observed variables. While the aforementioned values refer to the reliability of measurement in terms of the items in a given factor and the factors in a research construct, discriminant validity is a measure of how distinct these factors are from each other. As the square root of the average variance extracted (AVE) of each dimension is greater than its correlations with the remaining constructs, acceptable discriminant validity is confirmed (Table 3).
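The reliability and validity measures named above follow standard textbook formulas, which can be sketched as below. The loadings and correlations in the example are invented purely for illustration and are not the values in Tables 1 to 3; the function names are hypothetical, and the software used in this research may compute these quantities with different conventions.

```python
import math
import statistics


def cronbach_alpha(items):
    """items: list of indicator columns (equal-length lists of scores)."""
    k = len(items)
    item_variance = sum(statistics.variance(col) for col in items)
    totals = [sum(row) for row in zip(*items)]
    return k / (k - 1) * (1 - item_variance / statistics.variance(totals))


def composite_reliability(loadings):
    """(sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    squared_sum = sum(loadings) ** 2
    error = sum(1 - l * l for l in loadings)
    return squared_sum / (squared_sum + error)


def average_variance_extracted(loadings):
    """Mean of the squared standardized loadings."""
    return sum(l * l for l in loadings) / len(loadings)


def fornell_larcker_ok(ave, correlations):
    """True if sqrt(AVE) exceeds every correlation with other constructs."""
    root = math.sqrt(ave)
    return all(root > abs(r) for r in correlations)


# Hypothetical standardized loadings for one three-indicator construct.
example_loadings = [0.7, 0.8, 0.9]
cr = composite_reliability(example_loadings)
ave = average_variance_extracted(example_loadings)
print(round(cr, 3), round(ave, 3), fornell_larcker_ok(ave, [0.5, 0.3]))
```

The last check is the criterion applied in the text: a construct passes when the square root of its AVE is larger than its correlation with every other construct.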

Structural model
The path coefficient values ranged from 0.01 to 0.7 (Figure 2), and the R-square values obtained for the percentage chance of hypertension and coronary heart disease are 0.7 and 0.6, respectively, which is quite adequate in comparison to other research studies in this field (R-square cut-off 0.1). The t-values (Table 4 and Figure 3) indicate that the following hypotheses are supported:
H 3 : Cigarettes smoked per day has a significant association with hypertension.
H 4 : Alcohol consumed per week has a significant association with hypertension.
H 5 : Diastolic blood pressure has a significant association with hypertension.
H 9 : Cigarettes smoked per day has a significant association with Coronary heart disease.
H 10 : Alcohol consumed per week has a significant association with Coronary heart disease.
H 11 : Diastolic blood pressure has a significant association with Coronary heart disease.
H 13 : Hypertension has a significant association with Coronary heart disease.
The following hypotheses are not supported:
H 1 : Age has a significant association with hypertension.
H 2 : Body mass index has a significant association with hypertension.
H 6 : Systolic blood pressure has a significant association with hypertension.
H 7 : Age has a significant association with Coronary heart disease.
H 8 : Body mass index has a significant association with Coronary heart disease.
H 12 : Systolic blood pressure has a significant association with Coronary heart disease.
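In PLS-based SEM, t-values for path coefficients are commonly obtained by bootstrapping, and the idea can be sketched for a single path as follows. This is an illustration under assumptions rather than the procedure used to produce Table 4: it treats one standardized predictor and one outcome (so the path coefficient equals the Pearson correlation), uses invented data, and compares the result with 1.96, the conventional two-sided 5% cut-off, which the text does not state explicitly. The function names are hypothetical.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation; equals the standardized single-predictor path."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_t(x, y, n_boot=500, seed=0):
    """Return (path estimate, bootstrap t-value = estimate / bootstrap SE)."""
    rng = random.Random(seed)
    n = len(x)
    estimate = pearson(x, y)
    draws = []
    for _ in range(n_boot):
        # Resample respondents with replacement and re-estimate the path.
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    standard_error = statistics.stdev(draws)
    return estimate, estimate / standard_error


# Invented data: the outcome depends on the predictor plus noise.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.6 * a + rng.gauss(0, 0.8) for a in x]

estimate, t_value = bootstrap_t(x, y)
r_square = estimate ** 2  # with a single predictor, R-square is r squared
print(f"path={estimate:.2f} t={t_value:.1f} R2={r_square:.2f}")
print("supported" if abs(t_value) > 1.96 else "not supported")
```

With several predictors per outcome, as in the model of this research, each bootstrap replicate would re-estimate the full model instead of a single correlation.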

Discussions
The study results indicate that hypertension (HPT) and coronary heart disease (CHD) have a significant association. Hypertension is considered a significant cause of CHD by a good number of researchers (e.g. Biswas, Singh, & Singh, 2017; Cubrilo-Turek, 2003; Pan et al., 2015; Milane et al., 2014). Several pathophysiologic mechanisms link these two variables, and research shows that mortality due to CHD is 2.3 times greater when HPT is present (Escobar, 2002; Olafiranye et al., 2011). Milane et al. (2014) found that the association is significant only at a later age. In another study with a large sample (n = 147,201), it was found that hypertension was related to CHD when combined with age, education, and place of stay rather than on its own. Hansson and Lundin (1984) established a cause/consequence relationship between hypertension and coronary heart disease and found that the relation could also be bidirectional. So, by and large, the outcome of this study is in agreement with the findings of earlier researchers. Contrary to these findings, there are also studies which have revealed that, despite a high incidence of hypertension, the coronary heart disease rate has been low; hence, coronary heart disease has been considered clearly multifactorial (Poulter, 1999). Thus, the findings of this study add to the existing literature supporting the view that HPT and CHD are associated with each other.
The results of the study indicate that cigarettes smoked per day (CSPD) has a significant association with HPT as well as CHD. In the medical field, it has been established that cigarette smoke can cause functional and initially transient damage, primarily to the endothelium (Leone, 2015). The medical explanation for the association of cigarette smoking with CHD is that smoking activates platelet function, which leads to thrombosis formation and myocardial infarction and thus to CHD (Inoue, 2004). Further, nicotine and carbon monoxide reduce tolerance to exercise and hence increase the chance of HPT. So, this outcome is in alignment with earlier findings (e.g. Ain & Regmi, 2015; Bowman, Gaziano, Buring, & Sesso, 2007; Gao, Shi, & Wang, 2017; Pardell & Rodicio, 2005; Robertson et al., 2014; Stallones, 2015; Yathish, Manjula, Srinivas, & Gayathree, 2011). Contrary to these findings, there are also studies which have reported lower BP levels among smokers and a lesser chance of HPT (Bowman et al., 2007; Gao et al., 2017; Pankova et al., 2015). Primatesta, Falaschetti, Gupta, Marmot, and Poulter (2001) found that the influence of cigarettes on HPT was present only in the age group above 45 years. In another study, it was  (Virdis, Giannarelli, Fritsch Neves, Taddei, & Ghiadoni, 2010). Keto et al. (2016) found, with a sample of over 5000 respondents, that cigarette smoking has no clinically significant influence on CHD. These contradictions call for further research with additional parameters, such as family history of smoking, duration of smoking, environment of smoking (indoor or outdoor), and type of smoking habit (chain or intermittent), to study the influence of smoking on HPT and CHD.
The analysis indicates that alcohol consumed per week (ACPW) has a significant association with HPT and CHD. These findings are in alignment with earlier research (e.g. Chevli, Ahmad, Rasool, & Herrington, 2019; Criqui & Thomas, 2017; Fuchs, Chambless, Whelton, Nieto, & Heiss, 2001; Roerecke et al., 2017; Santana et al., 2018; Xin et al., 2001). Several researchers have attempted to study the influence of light, moderate, and heavy alcohol consumption on HPT and CHD. Alcohol consumption has been classified mainly in terms of quantity of intake and frequency per week. Briasoulis, Agarwal, and Messerli (2012) claim that alcohol intake increases the risk of HPT, but that the relationship between light-to-moderate alcohol consumption and hypertension is controversial. Santana et al. (2018) claim that the association between ACPW and HPT itself is controversial. While contradictions regarding the association of ACPW with HPT/CHD exist, there are studies with concrete findings. For instance, Miller, Anton, Egan, Basile, and Nguyen (2005) found that three or more standard drinks per day were predictive of HPT. Reduction in ACPW had no significant influence on BP in people who drank two or fewer drinks per day, but BP decreased with the reduction in alcohol intake if the consumption was more than three drinks a day (Roerecke et al., 2017). There are studies which found that light-to-moderate alcohol consumption can reduce hypertension (e.g. Gillman, Cook, Evans, Rosner, & Hennekens, 1995; Son, 2011; Thadhani et al., 2002). Thus, the inference was that a dose-dependent reduction in BP can be achieved with a reduction in alcohol consumption. Santana et al. (2018) found that alcohol consumption increases BP. In a large sample study with 14,727,591 respondents, it was established that alcohol abuse leads to CHD (Whitman et al., 2017).
There are many other similar studies in which the association between heavy alcohol consumption and CHD has been established (Klatsky, 2015). In contrast, it has also been established that, while heavy alcohol use (more than 3 standard drinks per day) is associated with CHD, some research shows that moderate drinking can bring down the chance of CHD (e.g. Klatsky, 1999). Researchers have found that moderate alcohol consumption, irrespective of the type of alcohol, reduces the risk of CHD (Criqui & Thomas, 2017; Hines & Rimm, 2001). The medical explanation is that alcohol can cause changes in lipids and haemostatic factors, which in turn have a protective effect on the cardiovascular system (Yuan, Ross, Gao, Henderson, & Mimi, 1997). Thus, further research may be required to draw conclusive evidence about the association between ACPW and CHD. It is also necessary to conduct research with standardized parameters, such as demographic background, health condition, type of work, and the amount of alcohol intake, to draw conclusive evidence about the association between ACPW and HPT/CHD.
Hypothesis testing revealed that Diastolic Blood Pressure (DBP) has a significant association with HPT and CHD. A group of researchers have associated DBP with HPT (Behradmanesh & Nasri, 2012; Tringali & Huang, 2017; Wang et al., 2015) and DBP with CHD (e.g. Drozdz & Kawecka-Jaszcz, 2014; Franklin et al., 2001; Lichtenstein, Shipley, & Rose, 1985; Shu et al., 2013; Tsujimoto & Kajio, 2018). Bavishi, Goel, and Messerli (2016) consider the association between DBP and HPT to be controversial. There are findings on both sides of the argument. For instance, one study found that DBP elevations of as little as 2 mm Hg can lead to HPT (Sander, 2011). Franklin et al. (2001) found that with age there is a gradual shift from DBP to SBP as the predictor of CHD. But there are also study results which have shown that systolic or diastolic blood pressure has no association with HPT or CHD. For instance, D'Agostino, Belanger, Kannel, and Cruickshank (1991), based on a sample size of 5209, concluded that DBP had no significant influence on CHD. Further, in another large study of 1.25 million respondents aged 30 years and above, it was concluded that both systolic and diastolic blood pressures have no influence on HPT and CHD (Rapsomaniki et al., 2014). Thus, while it is difficult to conclude that DBP has an influence on the risk of HPT and CHD, more studies in this direction are strongly recommended to draw conclusive evidence.
The generally accepted concept that age has a significant association with HPT, reported in earlier research (e.g. Borzecki, Glickman, Kader, & Berlowitz, 2006; Hosseini et al., 2015; Rigaud & Forette, 2001), is not supported in this research. The result is in agreement with the findings of Soudarssanane, Karthigeyan, Stephen, and Sahai (2006). An increase in BP is considered an inevitable consequence of ageing, particularly in the industrialized environment, which eventually leads to HPT (Pinto, 2007). There are also studies which disagree with this conclusion and claim that, in hypertensives of different ages, if comparable blood pressure levels are taken as the reference, the variability of blood pressure is not consistently related to age (Brennan, O'Brien, & O'Malley, 1986). So, there is a need to investigate further the association of age with HPT, as it is a multifactorial construct.
This research has revealed that body mass index (BMI) had no association with HPT or CHD. Many researchers have found that BMI is a predictor of HPT (e.g. Dua, Bhuker, Sharma, Dhall, & Kapoor, 2014; Linderman et al., 2018; Soudarssanane et al., 2006). In another study, with a sample drawn from three populations from Africa and Asia, it was found that BMI has a significant positive correlation with HPT. Gadhavi, Solanki, Rami, Bhagora, and Thakor (2015) have shown, with a sample of 775 respondents of 40 to 50 years of age, that there is a significant relationship between BMI and HPT. Tuan, Adair, Suchindran, He, and Popkin (2009) found that physical exercise, food habits, cultures, religions, and demographic characteristics may interact with BMI in predicting HPT. More research is required to arrive at a conclusion on whether BMI has an association with HPT or CHD.
Systolic Blood Pressure (SBP) is found in this study to have no significant influence on HPT or CHD. Research has shown that there is a linear increase in SBP up to the fifth or sixth decade of life, after which it slows down gradually (Bavishi et al., 2016). The finding of this study contradicts that of a group of researchers who claim that variability in SBP has a significant association with HPT and CHD (e.g. Franklin, 2004; Mehlum et al., 2018; Stevens et al., 2016). Zhou et al. (2017), in their study of 1479 articles across nations, found that the variability of BP was a regional phenomenon and that its association with HPT or CHD cannot be generalized. Thus, further research may be required to arrive at a conclusion on the association of SBP with HPT and CHD.

Conclusion
Using Machine Learning in medical sciences has always been challenging due to the complexity as well as the dynamic nature of the variables involved. Nevertheless, machine learning can be very useful in disease detection and prevention. In this research, it is specifically found that cigarette smoking and alcohol consumption have a significant positive association with the risk of hypertension and coronary heart disease. Further, this research has indicated that diastolic blood pressure had a significant positive influence on both hypertension and coronary heart disease; however, systolic blood pressure has no significant influence on hypertension and coronary heart disease. It was also revealed that hypertension and coronary heart disease have a significant positive relationship. These findings could be of help to future researchers, and they add to the knowledge on the antecedents of hypertension and coronary heart disease.
In conclusion, this research has shown that machine learning could be very effective in determining relationships between factors such as age, body mass index, cigarette smoking, alcohol consumption, and diastolic/systolic blood pressure. Among the factors considered in this research, age, body mass index, and systolic blood pressure have not shown an association with either hypertension or coronary heart disease. Deeper research may be required to verify this outcome, as there are earlier studies which have associated these variables with hypertension and coronary heart disease.

Article highlights
• Machine learning was applied to simulated data to study the association of age, body mass index, cigarettes smoked per day, alcohol consumed per week, diastolic blood pressure, and systolic blood pressure with hypertension and coronary heart disease.
• Machine learning, a relatively new technology, is fast emerging as a better prediction model in healthcare applications.
• This research has indicated that diastolic blood pressure had a positive significant influence on both hypertension and coronary heart disease. However, systolic blood pressure has no significant influence on hypertension and coronary heart disease.
• The factors like age, body mass index and systolic blood pressure have not shown an association with either hypertension or coronary heart disease.