Prediction of the Probability and Risk Factors of Early Abdominal Aortic Aneurysm Using the Gradient Boosted Decision Trees Model

ABSTRACT Currently, the diagnosis of abdominal aortic aneurysm (AAA) relies mainly on the analysis of image data, such as Doppler ultrasound and computed tomography (CT). Once an AAA has formed, it may rupture and lead to death at any time. Surgical or endovascular treatment is the only option, but it has a high complication rate and imposes a heavy economic burden on patients. The gradient boosted decision trees (GBDT) model proposed in this paper is used to predict the probability of AAA and the risk factors that lead to it, and its prediction accuracy reaches 96%. This study selected 15 AAA-related features for the training samples. After training, age, triglycerides (TG), blood pressure (BP), low-density lipoprotein cholesterol (LDL-C), blood glucose (Glu), and body mass index (BMI) were found to have a direct impact on AAA. For individuals with a high AAA probability, the risk factors that contribute the most to that probability can be determined with the GBDT model. The GBDT model presented here effectively predicts the probability and risk factors of early AAA, which enables early intervention on and control of these risk factors to reduce the incidence of AAA.


Introduction
Abdominal aortic aneurysms (AAAs) are the most common aortic aneurysms and often present as aneurysmal dilatation of the infrarenal abdominal aorta. An AAA is defined as a vessel dilatation larger than 3 cm in diameter or more than 1.5 times the normal aortic diameter (Li et al. 2019). The prevalence of AAA in the Western population has been well reported and ranges from 4% to 7.2% (Ashton et al. 2002; Chan et al. 2021; Lindholt et al. 2005; Norman et al. 2004; Scott et al. 1995). The overall AAA prevalence was 1.3% in the Asian population (Chan et al. 2021). Although the prevalence of AAA is low, its mortality is very high. Most AAA patients have no obvious symptoms; however, the risk of rupture increases significantly with increasing aneurysm diameter. The mortality of patients with a ruptured aneurysm exceeds 80%, and most of these patients die before they arrive at the hospital (Mik et al. 2019). Therefore, early diagnosis, regular follow-up, and timely treatment are very important for AAA.
With the rapid development of deep learning and neural network research, related technologies have gradually been applied in vascular surgery (Hadjianastassiou et al. 2006; Monsalve-Torra et al. 2016). Attallah and Ma proposed a Bayesian neural network approach to determine the risk of re-intervention after endovascular aortic aneurysm repair surgery (Attallah and Ma 2014). Karthikesalingam et al. proposed an artificial neural network (ANN) approach to predict whether patients would be at low or high risk of endograft complications (aortic/limb) or mortality (Karthikesalingam et al. 2015). Wise et al. also used an ANN to predict in-hospital mortality after ruptured abdominal aortic aneurysm repair (Wise, Hocking, and Brophy 2015). In recent years, some deep learning algorithms have been used to diagnose AAA from CT images without human intervention. Mohammadi et al. designed a classifier using a Convolutional Neural Network (CNN) to distinguish the AAA region from other abdominal regions (Mohammadi et al. 2019). Jiang et al. proposed a Deep Belief Network (DBN) to provide fast predictions of patient-specific AAA expansion (Jiang, Do, and Choi et al. 2020). In the majority of these studies, researchers focused on CT image analysis or on predicting the risk of AAA rupture. In contrast, our study aims to detect early lesions.
To predict the probability of AAA before the aneurysm forms, it is necessary to identify the factors that lead to AAA and to sample these factors into the dataset. Because the pathogenesis of AAA is unclear and it may be caused by environmental, biological, immunological, and genetic factors, effective early diagnosis and prediction methods have been lacking. For instance, the prevalence of AAA in Asian populations selected for sampling cardiovascular risk factors is high (Chan et al. 2021). An appropriate model should be trained using data from the selected population as input and should then be capable of predicting AAA for new samples. Neural networks and decision trees are the most common classification models. A neural network (Schmidhuber 2015) consists of a large number of neurons. Each neuron first computes a weighted linear combination of its inputs and then applies an activation function, and this nonlinear transformation produces the neuron's output. Gradient boosted decision trees (GBDT) (Rao et al. 2019) is a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models.
Because the AAA sample set is small (only 400 records), the experimental results showed that GBDT predicts AAA more accurately than a neural network on this data set. Therefore, this study proposes a gradient boosted decision trees model to predict the probability of AAA, so that early prevention and treatment can be adopted to avoid rupture and hence the economic burden caused by AAA.

Building the AAA Sample Data Set
Risk factors for AAA were selected based on the clinical experience of the doctors and data from various studies. Data from both AAA patients and healthy individuals were collected to form an AAA sample set. It is generally known that male sex, smoking, old age (older than 65 years), obesity, hypertension, coronary heart disease, diabetes, hyperlipidemia, peripheral artery disease, and a related family history are risk factors for AAA (Perrin, Badel, and Ogeas 2016). A total of 15 features were selected, including age, sex, blood pressure (BP), triglycerides (TG), low-density lipoprotein cholesterol (LDL-C), blood glucose (Glu), smoking, alcohol consumption (Drink), family history (FH), body mass index (BMI), homocysteine (Hcy), uric acid (UA), chronic obstructive pulmonary disease (COPD), history of coronary heart disease (CHD), and history of cerebrovascular disease (CVD), as shown in Table 1.
In this study, 400 records were sampled, comprising 200 AAA patients and 200 healthy individuals. First, the order of the sample data was randomly shuffled by a Python program. Then, the following process was performed, as shown in Figure 1:
(1) 320 randomly selected records were used as training data.
(2) The remaining 80 records were used to predict the relevance of the features to AAA.

Table 1. Sample features (rows A12 to A16)
A12  UA      Uric acid (umol/L)
A13  COPD    Chronic obstructive pulmonary disease. 1 = yes, 0 = no
A14  CHD     History of coronary heart disease. 1 = yes, 0 = no
A15  CVD     History of cerebrovascular disease. 1 = yes, 0 = no
A16  Target  AAA. 1 = yes, 0 = no
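The shuffle-and-split step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `shuffle_and_split`, the fixed seed, and the placeholder records are hypothetical and are not taken from the authors' program.

```python
import random

def shuffle_and_split(records, n_train=320, seed=0):
    """Shuffle a copy of the records, then split into training and held-out sets."""
    shuffled = list(records)               # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)  # seeded only to make the sketch repeatable
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder data (hypothetical): 200 AAA patients (label 1), 200 healthy (label 0).
records = [(record_id, 1 if record_id < 200 else 0) for record_id in range(400)]
train, held_out = shuffle_and_split(records)
print(len(train), len(held_out))
```

Shuffling before splitting matters here because the sample set is otherwise ordered by class, and an unshuffled 320/80 cut would leave the held-out set with healthy individuals only.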

Experiment Environment
In this study, the algorithm was implemented on the Python 3.7 platform. The TensorFlow 2.0 library was used for computation, the pandas library was used for data partitioning, and the Matplotlib library was used for drawing graphics. The 320 training records were fed into the GBDT model for training. Columns A1 to A15 were used as feature columns and A16 as the label column. The parameters of the algorithm are shown in Table 2. After training, the remaining 80 records, with feature columns A1 to A15, were passed to the model, which produced 80 predicted probability values ranging from 0 to 1.

GBDT Algorithm
The gradient boosted decision trees algorithm uses an approximation to steepest descent. The negative gradient of the loss function, evaluated at the current model, is used as an approximation of the residual in the boosted tree algorithm, and a regression tree is fitted to it.
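The idea of fitting each new tree to the negative gradient can be illustrated with a minimal pure-Python sketch. It is not the authors' TensorFlow implementation: it assumes squared loss (so the negative gradient equals the residual), a single feature, depth-one trees (stumps), and a hypothetical toy data set of ages and 0/1 labels. The names `fit_stump`, `boost`, and `predict` are invented for this example.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split on xs that minimises squared error on residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1], best[2], best[3]

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Gradient boosting with stumps under squared loss."""
    base = sum(ys) / len(ys)  # constant initial model
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared loss, the negative gradient is the plain residual y - f(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        preds = [p + learning_rate * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    """Sum the initial constant and the shrunken output of every stump."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)

# Hypothetical toy data: ages and AAA labels (1 = AAA, 0 = healthy).
ages = [45, 52, 58, 61, 66, 70, 74, 79]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model = boost(ages, labels)
print(predict(model, 50), predict(model, 72))
```

A classifier would normally use a logistic loss rather than squared loss on 0/1 labels; squared loss is used here only because it makes the residual interpretation explicit.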
The GBDT model used in this study to predict the probability of AAA is defined as follows, for training samples $(x_i, y_i)$, $i = 1, 2, \ldots, N$, and loss function $L(y, f(x))$:
I. Initialize the model with a constant value: $f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$.
II. For $m = 1, 2, 3, \ldots, M$ ($M$ is the number of iterations, that is, the number of weak learners generated):
(a) Compute the negative gradient of the loss function at the current model as an estimate of the residual: $r_{mi} = -\left[ \partial L(y_i, f(x_i)) / \partial f(x_i) \right]_{f = f_{m-1}}$, $i = 1, 2, \ldots, N$. For the square loss function, this quantity is simply the residual.
(b) Fit a regression tree to the $r_{mi}$ to obtain the leaf node regions of the $m$-th tree, $R_{mj}$, $j = 1, 2, 3, \ldots, J$ ($J$ is the number of leaf nodes per tree).
(c) For $j = 1, 2, 3, \ldots, J$, use a line search to estimate the value of each leaf node region that minimizes the loss function: $c_{mj} = \arg\min_{c} \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c)$.
(d) Update the model: $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$.
III. Finally, obtain the boosted regression tree model: $\hat{f}(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$.

Results
In this study, 400 AAA sample records were selected, of which 320 were used for training and 80 for verification. Using the same training and verification sets, this study compared the accuracy of the neural network and the GBDT algorithm, as shown in Table 3. The accuracy of the neural network is 89%, while that of the GBDT algorithm is 96%. On this small data set, the GBDT algorithm is more accurate than the neural network. According to the GBDT training results, the baseline value was 0.524. If the predicted probability is greater than the baseline value, the sample is labeled as an AAA patient. If the probability is less than the baseline value, the sample is labeled as a healthy person. Table 4 records 10 sample results, numbered from 0 to 9.
As shown in Table 5, among the 80 predicted results, only the samples numbered 0, 33, and 74 were predicted incorrectly, so the prediction accuracy was about 96% (77 of 80). This means that the gradient boosted decision trees model has a high accuracy for AAA prediction and can assist clinical AAA diagnosis without relying on medical imaging.
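The labeling rule and the accuracy figure can be reproduced with a short sketch. The baseline of 0.524, the two example probabilities, and the misclassified sample numbers (0, 33, 74) come from the text; the true labels of the 80 samples are not given in the paper, so the 40/40 label vector below is a hypothetical placeholder, as are the function names.

```python
def classify(probabilities, baseline=0.524):
    """Label a sample 1 (AAA patient) if its probability exceeds the baseline, else 0."""
    return [1 if p > baseline else 0 for p in probabilities]

def accuracy(predicted, actual):
    """Fraction of samples whose predicted label matches the actual label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Two probabilities reported later in the paper: one healthy person, one AAA patient.
print(classify([0.5068, 0.8895]))  # -> [0, 1]

# Hypothetical label vector for the 80 held-out samples; flip the three that
# the paper reports as mispredicted (samples 0, 33 and 74).
actual = [1] * 40 + [0] * 40
predicted = list(actual)
for index in (0, 33, 74):
    predicted[index] = 1 - predicted[index]
print(accuracy(predicted, actual))  # -> 0.9625
```

With 77 of 80 samples correct, the accuracy is 0.9625, which the paper rounds to 96%.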

Discussion
GBDT is an iterative decision tree algorithm. It is composed of several decision trees, and the outputs of all trees are summed to give the final result. GBDT and SVM are both considered algorithms with strong generalization ability. GBDT can be used for both classification and regression, so it is widely used and has been applied to prediction in many fields (Li et al. 2020; Ye, Liu, and Zhao 2020; Zhang, Beudaert, and Argandona 2020; Zhu, Ying, and Zhang 2020).
In this study, 15 AAA-related features were selected for the training samples. The prediction accuracy of the GBDT algorithm reaches 96%, which is very important for early prediction and intervention. Currently, AAA diagnosis relies mainly on the analysis of image data, such as Doppler ultrasound, computed tomography (CT), and magnetic resonance imaging (MRI) (Jean-Baptiste et al. 2020). Because Doppler ultrasound is convenient, noninvasive, and relatively cheap, it is very attractive for measuring the diameter of the infrarenal abdominal aorta, and it has high accuracy in the diagnosis of AAA (Bredahl et al. 2016; Long et al. 2005). Since the 1990s, Doppler ultrasound has been recommended for AAA diagnosis in people over 65 years old (Davis, Harris, and Earnshaw 2013; Stather, Dattani, and Bown et al. 2013). CT is more accurate in the diagnosis and screening of AAA, and it is also widely applied in the follow-up of AAA after surgical treatment, especially after endovascular treatment (Hallett, Ullery, and Fleischmann 2018; Hu, Pisimisis, and Sheth 2018; Kim and Litt 2020). When iodine contrast agents are contraindicated (allergy or renal insufficiency), MRI can be used as an auxiliary technique to assist diagnosis, but it is expensive, time-consuming, and requires that the patient have no metal graft in the body, so its use in the diagnosis of AAA is limited (Lau et al. 2017).
Once an AAA has formed, it may rupture, and surgical or endovascular treatment is the only treatment method (Blackstock and Jackson 2020; Holden and Hill 2020; Mehmedovic et al. 2020; Powell and Wanhainen 2020). With the algorithm in this study, the probability of AAA is predicted before the AAA forms, which could reduce the risk of death caused by AAA. For individuals with a high probability of AAA, early intervention and control of the risk factors could reduce the incidence.
Male sex, smoking, old age (older than 65 years), obesity, hypertension, coronary heart disease, diabetes, hyperlipidemia, peripheral arterial disease, and a related family history have been identified as risk factors for AAA (Monsalve-Torra et al. 2016). However, these risk factors were each based on a single statistical analysis result, and no correlation between the factors was established. In this study, age, triglycerides (TG), blood pressure (BP), low-density lipoprotein cholesterol (LDL-C), blood glucose (Glu), and body mass index (BMI) were found to have a direct impact on AAA, whereas the other nine features had no direct impact. The weight of each feature with respect to AAA could also be calculated. For an individual, the risk factors that contribute the most to the AAA probability could be determined.

Feature Importance
Feature importance refers to the weight of a feature's influence on the predicted results, and it measures the change in loss when the trees split on that feature. In this study, the GBDT model was trained on 320 records, and the feature weights sum to 1. As shown in Table 6, age, TG, BP, LDL-C, Glu, and BMI are the important causes of AAA. The weights of these six features sum to 1, and the weights of the remaining nine features sum to 0.
As shown in Figure 2, the weight of age is 0.41, which is the highest of all the features. An epidemiological survey of . . . found that the incidence of AAA increased year by year with increasing age. For AAA patients without surgical intervention, the abdominal aortic diameter will increase with age and may even lead to rupture. Therefore, the latest AAA treatment guidelines of the American Society for Vascular Surgery recommend that ultrasonic diagnosis be performed once a year for both men and women over the age of 65 (recommendation level: 1A) (Chaikof et al. 2018).
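One common way to obtain weights that sum to 1 is to add up the loss reduction of every split made on a feature and then normalize by the total. The sketch below illustrates that idea only; it is an assumption that the library computes importance this way, and the split records and loss-reduction values are hypothetical, not taken from Table 6.

```python
def normalized_importance(splits, feature_names):
    """Sum the loss reduction per feature and normalise so the weights total 1."""
    totals = {}
    for feature, loss_reduction in splits:
        totals[feature] = totals.get(feature, 0.0) + loss_reduction
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in feature_names}
    return {name: totals.get(name, 0.0) / grand_total for name in feature_names}

# Hypothetical (feature, loss reduction) records collected over all tree splits.
splits = [("Age", 8.2), ("BP", 3.0), ("TG", 2.4), ("Age", 1.8),
          ("LDL-C", 2.6), ("Glu", 1.0), ("BMI", 1.0)]
features = ["Age", "BP", "TG", "LDL-C", "Glu", "BMI", "Smoking", "Drink"]
weights = normalized_importance(splits, features)
print(weights)
```

A feature that no tree ever splits on, such as the hypothetical "Smoking" and "Drink" entries here, receives a weight of exactly 0, which matches the pattern reported for the nine remaining features.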

Directional Feature Contributions (DFCs)
DFCs refer to the contribution of each feature to the prediction for a specific sample. The method traverses the prediction path and calculates the change in the prediction after each split on a feature, and that change is attributed to the feature used for the split. The DFC formula is: sum of contributions + bias = predicted value.
Table 7 shows the features of a healthy person, for whom the gradient boosted decision trees model is used to predict the result. The bias is 0.5062 and the sum of contributions is 0.1118 + 0.0248 + 0.002 + (−0.0696) + (−0.0684) = 0.0006. The AAA incidence probability of this person is therefore 0.0006 + 0.5062 = 0.5068. The probability is less than the baseline of 0.525, so the predicted result is a healthy person, in agreement with the actual situation. However, the predicted value of this sample is close to the baseline, so the incidence probability is relatively high, and such a case is difficult to detect by ultrasonic diagnosis, CT, or other imaging methods. Therefore, early diagnosis and follow-up are necessary for potential AAA patients. Figure 3 shows that age, BP, and TG are the most important risk factors of AAA for this person.
Table 8 shows the features of an AAA patient, for whom the gradient boosted decision trees model is again used to predict the result. The bias is 0.5062 and the sum of contributions is 0.2579 + 0.0973 + 0.0326 + 0.0159 + (−0.0204) = 0.3833. The AAA incidence probability of this person is therefore 0.3833 + 0.5062 = 0.8895. The probability is greater than the baseline of 0.525, so the predicted result is an AAA patient, in agreement with the actual situation. Figure 4 shows that age, BP, and TG are the most important risk factors for this AAA patient.
Hypertension has always been considered an important risk factor for AAA. Long-term hypertension gradually causes abdominal aortic atherosclerosis. If blood pressure is poorly controlled, the AAA expands faster. Using oral antihypertensive drugs to control blood pressure can significantly slow the growth and expansion of AAA (Chung, Da Silva, and Raghavan 2017). Therefore, regular examination and follow-up are important for early AAA diagnosis in hypertension patients, especially those whose blood pressure has been poorly controlled over the long term.
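The DFC identity used in this section (sum of contributions + bias = predicted value) can be checked directly against the two worked samples. The bias and contribution values below are copied from the text; the function name `dfc_prediction` is a hypothetical helper for this sketch.

```python
def dfc_prediction(bias, contributions):
    """Predicted probability as the bias plus the sum of per-feature contributions."""
    return bias + sum(contributions)

BIAS = 0.5062  # bias reported for the trained model

# Per-feature contributions of the healthy person (Table 7) and the AAA patient (Table 8).
healthy = dfc_prediction(BIAS, [0.1118, 0.0248, 0.002, -0.0696, -0.0684])
patient = dfc_prediction(BIAS, [0.2579, 0.0973, 0.0326, 0.0159, -0.0204])

print(round(healthy, 4), round(patient, 4))  # -> 0.5068 0.8895
```

Because each contribution carries a sign, the same decomposition also shows which features push an individual's probability up (positive terms) and which pull it down (negative terms).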

The Influence of Feature Value Change on AAA
The GBDT model was trained on 320 records, and 80 records were used to predict the probability of AAA. As the values of age, TG, BP, LDL-C, Glu, and BMI change, the contribution of each feature to the probability of AAA also changes. The following analysis covers the four features whose weight is greater than 0.1.
As shown in the left subgraph of Figure 5, most of the samples were between 50 and 80 years old. As shown in the right subgraph of Figure 5, the contribution of age to the AAA probability gradually increases with age. When the age is over 64 years, the contribution becomes positive. As age has the highest weight among the risk factors of AAA, people over 64 years old should pay more attention to regular examination and follow-up for AAA. As shown in the right subgraph of Figure 6, the contribution of BP to the AAA probability gradually increases with BP. When BP is over 118, the contribution becomes positive. It is then necessary to make a comprehensive diagnosis based on the other AAA features.
As shown in the right subgraph of Figure 7, when TG is greater than 2.25, the contribution of TG to the AAA probability gradually increases with TG. When TG is over 3.05, the contribution becomes positive. It is then necessary to make a comprehensive diagnosis based on the other AAA features. As shown in the right subgraph of Figure 8, when the LDL-C value is greater than 2.5, the contribution of LDL-C to the AAA probability gradually increases with LDL-C. When LDL-C is over 3.13, the contribution becomes positive. It is then necessary to make a comprehensive diagnosis based on the other AAA features. Figures 7 and 8 show that high TG and high LDL-C contribute substantially to the AAA probability. Whether blood lipids are pathogenic factors of AAA is still debated. A large number of epidemiological surveys have found that more than 50% of AAA patients have hyperlipidemia, and that the higher the blood lipids, the higher the AAA probability. It is therefore necessary to strengthen regular examination and follow-up for AAA patients with hyperlipidemia.

Conclusion
On a small sample set, the gradient boosted decision trees model is more accurate than the neural network. In this study, 320 AAA records were used to train the GBDT model and 80 records were used to evaluate the accuracy of its predictions. In the experiments, the accuracy of the GBDT model was 96%. For each individual with a high AAA probability, the risk factors that contribute the most to that probability can be obtained. Early interventions can then be carried out against those risk factors, reducing the possibility of rupture and the heavy economic burden caused by AAA.

Disclosure Statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported by the CDTU PHD FUND [2020RC002].