Effects of data preprocessing on results of the epidemiological analysis of coronary heart disease and behaviour-related risk factors

Abstract
Background
We carried out this study to demonstrate the effects of outcome sensitivity, participant exclusions, and covariate manipulations on results of the epidemiological analysis of coronary heart disease (CHD) and its behaviour-related risk factors.
Material and methods
Our study population consisted of 1592 54-year-old men, who participated in the Kuopio Ischaemic Heart Disease Risk Factor (KIHD) Study. We used the Cox proportional-hazards model to predict the hazard of CHD and applied different sets of outcomes concerning outcome sensitivity and data preprocessing procedures regarding participant exclusions and covariate manipulations.
Results
The mean follow-up time was 23 years, and 730 men received the CHD diagnosis. Cox regressions based on data with no participant exclusions most often discovered statistically significant associations. Loose inclusion criteria for study participants with any CVD during the follow-up and strict exclusion criteria for participants with no CVD were best in discovering the associations between risk factors and CHD. Outcome sensitivity affected the associations, whereas the covariate type, continuous or categorical, did not.
Conclusions
This study suggests that excluding study participants who are not disease-free at baseline is probably unnecessary for epidemiological analyses. Epidemiological research reports should present results based on no data exclusions together with results based on reasoned exclusions.


Introduction
Typically, epidemiologic research produces at least partly contradictory results. Some reasons explaining this incoherence, i.e. unexpectedly large variations in results across closely related studies, are only indirectly related to research, such as clinical factors and healthcare systems. Many reasons, however, originate from study designs, methodological choices, concept definitions, and observed data [1,2]. Reasons related to datasets include at least differences in sample size and representativeness of covariates. In prospective cohort studies, the length of follow-up with respect to age at baseline amongst study participants, as in the risk of coronary heart disease (CHD) associated with high levels of C-reactive protein [3], and possible competing events also affect the interpretation of study results [4].
Other behaviour-related factors, such as alcohol consumption and stress, may also increase the risk for CHD, but their associations with CHD vary across studies. The association between alcohol and CHD is nonlinear [11], and stress is a symptom of different conditions, such as psychosocial aspects of work [12], which may or may not be associated with the risk of CHD. Yet other risk factors of CHD that at least indirectly relate to behaviour through diet are homocysteine, fibrinogen, and inflammation [13]. Moreover, there may be a weak association between iron status and CHD [14].
In addition to the behaviour-related factors, nonmodifiable factors including age, male gender, genetics, and a family history of CHD increase the risk for CHD [13,15,16]. Differences between men and women regarding the risk of CHD relate mainly to oestrogens and, thus, premenopausal women [13]. The role played by personality in the development of CHD is controversial [17].
The purpose of this study was to demonstrate the effects of data exclusions, outcome variable selection, and covariate manipulations on the interpretation of the epidemiologic relationship between CHD and its traditional risk factors. These are predominantly subjective researcher-related actions, unlike more technical questions, such as whether to consider competing events in statistical analyses or whether to use nonconventional statistical methods, such as neural networks [18], to deal with data-related matters. As a result of this study, we expected to identify the combination of outcome variable selection, participant exclusion, and covariate manipulation procedures that best discovers the presumed associations between CHD and its risk factors.

Material
Men (n = 1592) from the Kuopio Ischaemic Heart Disease Risk Factor (KIHD) Study served as the study material. The KIHD Study is an ongoing prospective cohort study originally established to discover previously unestablished reasons for the extremely high prevalence of acute myocardial infarction (AMI) among eastern Finnish men [19]. To control the effect of age on CHD, we selected men representing the same age cohort, 54 years old at baseline between March 1984 and December 1989. Briefly, 778 of them had one or more CVDs at baseline based on self-reports to the question 'Has your doctor told you that you have "the name of CVD"?', and 1181 of them were diagnosed, during an inpatient special health care admission, as having CVDs, ICD-10 codes I00-I99 [20], by the end of 2017. Moreover, 381 men used medication for hypertension, 77 had insulin or non-insulin treated diabetes, and nine used medication for hypercholesterolaemia at baseline. The mean (SD) follow-up time was 23.4 (9.3) years. Table 1 presents study participants' baseline characteristics with respect to variables used as exclusion criteria, covariates, and conditions and events diagnosed during the follow-up. All KIHD participants had given written informed consent, and the ethical committee of the Kuopio University had approved the KIHD Study (December 1, 1983). In the 1980s, the committee did not necessarily provide study numbers but identified studies by date.

Outcome variables
The KIHD Study includes annually updated data from the Care Register for Health Care of the Finnish Institute for Health and Welfare regarding diagnoses given during special health care admissions (License THL/93/5.05.00/2013) and from the Causes of Death Register of the Statistics Finland (License TK-53-1770-16). To study the effects of outcome sensitivity on model results, we constructed four different outcome variables based on these register linkages. The first outcome was 'CVD', referring to ICD-10 codes I00-I99. The second outcome was 'CHD', referring to codes I20-I25. The third outcome was 'MI or UA', and it referred to codes I20.0 and I21-I22. The fourth outcome, 'a fatal AMI', referred to code I21.
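The nesting of the four outcomes can be illustrated with a minimal Python sketch. The function name `icd10_outcomes` and the `fatal` flag are hypothetical; in particular, treating 'a fatal AMI' as an I21 code recorded as a cause of death is an assumption made here for illustration, not a description of the KIHD register processing.

```python
import re


def icd10_outcomes(code, fatal=False):
    """Return the set of study outcomes that one ICD-10 code counts towards.

    `fatal` is a hypothetical flag marking that the code comes from a
    death record (an assumption made for this sketch).
    """
    match = re.fullmatch(r"I(\d{2})(?:\.(\d+))?", code.strip().upper())
    if match is None:
        return set()  # not in the range I00-I99
    major = int(match.group(1))
    minor = match.group(2)

    outcomes = {"CVD"}  # I00-I99
    if 20 <= major <= 25:
        outcomes.add("CHD")  # I20-I25
    is_unstable_angina = major == 20 and minor is not None and minor.startswith("0")
    if major in (21, 22) or is_unstable_angina:
        outcomes.add("MI or UA")  # I20.0 and I21-I22
    if major == 21 and fatal:
        outcomes.add("fatal AMI")  # I21, fatal events only (assumption)
    return outcomes


if __name__ == "__main__":
    for example in ("I63.9", "I20.8", "I20.0", "I21.4"):
        print(example, sorted(icd10_outcomes(example)))
```

Every code that counts towards a narrower outcome also counts towards the broader ones, which is what makes the four outcomes "nested".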

Covariates
First, we selected the most common risk factors of CHD based on literature and, second, we searched variables that represent these risk factors from the KIHD Study database. The chosen risk factors were smoking, obesity, physical inactivity, hypertension, diabetes, and hypercholesterolaemia. Hajar [21], for example, summarises the association between these six risk factors and CHD. In addition to the indisputable risk factors of CHD, we included alcohol consumption as a covariate in the analyses. Alcohol, in general, increases mortality and morbidity [22], but the association between alcohol consumption and CHD is visualized by a J-shaped curve; light-to-moderate drinking acts as a protective factor, whereas heavy drinking increases the risk of CHD [11]. We expected that our analyses at best would demonstrate this nonlinear relationship between alcohol consumption and CHD.
In the KIHD Study, participants self-reported their smoking behaviour, alcohol consumption, and physical activity at baseline. As a continuous smoking variable, we chose cigarette-years, i.e. the number of cigarettes per day multiplied by the number of years smoked. Moreover, we classified the participants as never-smokers, former smokers, and current smokers. Former smokers reported that they had not smoked within a month.
The KIHD continuous alcohol consumption variable indicates the amount of alcohol as grams per week. For this study, we categorized the participants into those with no health risk due to alcohol consumption (at most one portion, i.e. 12 grams of pure alcohol according to Finnish standards, per week), those with a moderate health risk (at most three portions per day), and those with a high health risk. This categorization is mainly data-specific, although it loosely follows Finnish current care guidelines published only in Finnish. Broadly, alcohol increases mortality and morbidity and, in men, more than three to four portions, 40 grams of pure alcohol, per day increase them significantly [22].
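As a minimal sketch, the two smoking variables could be derived as follows. The function and argument names are hypothetical and do not come from the KIHD database.

```python
def cigarette_years(cigarettes_per_day, years_smoked):
    """Continuous smoking exposure: cigarettes per day times years smoked."""
    if cigarettes_per_day < 0 or years_smoked < 0:
        raise ValueError("exposure values must be non-negative")
    return cigarettes_per_day * years_smoked


def smoking_status(ever_smoked, smoked_within_last_month):
    """Three-level smoking category; never-smoker is the reference level."""
    if not ever_smoked:
        return "never-smoker"
    if smoked_within_last_month:
        return "current smoker"
    return "former smoker"


if __name__ == "__main__":
    # A man who smoked 20 cigarettes a day for 30 years and quit a year ago.
    print(cigarette_years(20, 30), smoking_status(True, False))
```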
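The alcohol categories can be sketched in Python as below. The conversion of the daily limit of three portions into a weekly limit by multiplying by seven is an assumption made for this illustration, as are all names; the source states the limits only in portions.

```python
PORTION_G = 12.0  # grams of pure alcohol in one portion (Finnish standard)
NO_RISK_MAX_G_PER_WEEK = 1 * PORTION_G
# Assumption: "three portions per day at most" is read as 3 * 7 portions per week.
MODERATE_MAX_G_PER_WEEK = 3 * PORTION_G * 7


def alcohol_risk_category(grams_per_week):
    """Map weekly alcohol consumption (g/week) to a health-risk category."""
    if grams_per_week < 0:
        raise ValueError("consumption must be non-negative")
    if grams_per_week <= NO_RISK_MAX_G_PER_WEEK:
        return "no health risk"
    if grams_per_week <= MODERATE_MAX_G_PER_WEEK:
        return "moderate health risk"
    return "high health risk"


if __name__ == "__main__":
    for grams in (0, 12, 100, 252, 300):
        print(grams, alcohol_risk_category(grams))
```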
To determine study participants' physical activity, we first calculated the basal energy expenditure (BEE) based on body weight, body height, and age, applying the Mifflin-St Jeor Equation [23]. Second, we subtracted BEE from the total energy expenditure (TEE) and used this TEE − BEE variable in the analyses as a continuous variable. To create activity ranks, we computed the physical activity level (PAL) by dividing TEE by BEE and classified the participants as follows: moderately active, PAL < 2.00; vigorously active, PAL 2.00-2.40; and extremely active, PAL > 2.40 [24]. In the KIHD cohort, practically all participants were at least moderately active at baseline. Eight participants of this study had not reported their physical activity.
Body weights and heights were not self-reported but measured by a research nurse during the baseline examination. Based on these measures, we calculated the Body Mass Index (BMI) by dividing the weight in kilograms by the square of height in metres. In the analyses, we followed the standard guidelines for BMI, in which <25.0 kg/m² refers to normal weight, 25.0-29.9 kg/m² to overweight, and ≥30.0 kg/m² to obesity [25], and classified the participants according to them.
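The physical activity variables can be sketched as follows, using the published Mifflin-St Jeor equation for men (weight in kg, height in cm, age in years, result in kcal/day). The function names are hypothetical, and expressing TEE in kcal/day is an assumption; the source does not state its unit.

```python
def bee_mifflin_st_jeor_male(weight_kg, height_cm, age_years):
    """Basal energy expenditure (kcal/day) for men, Mifflin-St Jeor equation."""
    return 10.0 * weight_kg + 6.25 * height_cm - 5.0 * age_years + 5.0


def activity_energy(tee, bee):
    """Continuous activity covariate: TEE minus BEE (same unit as the inputs)."""
    return tee - bee


def activity_rank(tee, bee):
    """Activity rank from the physical activity level, PAL = TEE / BEE."""
    pal = tee / bee
    if pal < 2.00:
        return "moderately active"
    if pal <= 2.40:
        return "vigorously active"
    return "extremely active"


if __name__ == "__main__":
    bee = bee_mifflin_st_jeor_male(80, 175, 54)  # 1628.75 kcal/day
    tee = 3600.0  # hypothetical total energy expenditure, kcal/day
    print(bee, activity_energy(tee, bee), activity_rank(tee, bee))
```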
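A minimal sketch of the BMI calculation and classification follows. The function names are hypothetical, and values between 29.9 and 30.0 kg/m² are assigned to the overweight category, which is an interpretation of the rounded limits given above.

```python
def body_mass_index(weight_kg, height_m):
    """BMI in kg/m^2: weight divided by the square of height."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(bmi):
    """Three-level BMI category; normal weight is the reference level."""
    if bmi < 25.0:
        return "normal weight"
    if bmi < 30.0:
        return "overweight"
    return "obesity"


if __name__ == "__main__":
    bmi = body_mass_index(80.0, 1.75)
    print(round(bmi, 1), bmi_category(bmi))
```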
On the first baseline examination day, one research nurse measured the participant's blood pressure six times with a random-zero mercury sphygmomanometer. After a supine rest of five minutes, the nurse took three measurements in supine, two in sitting, and one in a standing position with 5-min intervals. In the present analyses, we used the mean of the six systolic blood pressure (SBP) values as a continuous variable. To distribute study participants into groups according to SBP, we followed the thresholds suggested by Mayo Clinic: SBP < 120 mmHg is a desirable level and SBP > 139 mmHg indicates hypertension [26].
Table 1. Baseline characteristics (the total column) and numbers of study participants with the following conditions diagnosed during the follow-up: any cardiovascular disease (CVD), coronary heart disease (CHD), a myocardial infarction (MI) or unstable angina (UA), and a fatal acute myocardial infarction (AMI).
Study participants gave blood samples between 8 and 10 a.m. after abstaining from alcohol for three days and from smoking and eating for 12 h. After a supine rest of 30 min, a research nurse drew blood with Terumo Venoject VT-100PZ vacuum tubes (Terumo Corp., Tokyo, Japan) using no tourniquet. The laboratory of our institute used an enzymatic method (CHOD-PAP, Boehringer Mannheim, Mannheim, West Germany) to measure serum total cholesterol (STC) concentrations and a glucose dehydrogenase method (Merck, Darmstadt, West Germany) after protein precipitation with TCA, using a clinical chemistry analyzer (Kone Specific, KONE Instruments Oy, Espoo, Finland), to measure fasting blood glucose (FBG) concentrations. Salonen et al. [27] describe the lipid analysis in detail. For the present analyses, we classified the participants according to STC as follows: <5.2 mmol/L is a desirable level and >6.2 mmol/L indicates hypercholesterolaemia [28]. Correspondingly, we distributed the participants into groups according to FBG as follows: <5.6 mmol/L is a desirable level and >6.9 mmol/L indicates diabetes [29].
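The three clinical covariates share the same two-threshold pattern, which the following sketch illustrates. The names are hypothetical, and the label "borderline high" for the middle category is taken from the wording used later in the text.

```python
# (lower, upper) thresholds: below lower is desirable, above upper is high.
THRESHOLDS = {
    "SBP": (120.0, 139.0),  # mmHg; above 139 indicates hypertension
    "STC": (5.2, 6.2),      # mmol/L; above 6.2 indicates hypercholesterolaemia
    "FBG": (5.6, 6.9),      # mmol/L; above 6.9 indicates diabetes
}


def mean_sbp(readings):
    """Mean of the six systolic blood pressure readings (mmHg)."""
    if len(readings) != 6:
        raise ValueError("expected six readings")
    return sum(readings) / 6


def classify(measure, value):
    """Three-level category; 'desirable' is the reference level."""
    lower, upper = THRESHOLDS[measure]
    if value < lower:
        return "desirable"
    if value > upper:
        return "high"
    return "borderline high"


if __name__ == "__main__":
    sbp = mean_sbp([120, 122, 118, 124, 126, 122])
    print(sbp, classify("SBP", sbp), classify("STC", 6.5), classify("FBG", 5.1))
```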

Statistical analyses
The Cox proportional-hazards model [30] served as the analysis method and IBM® SPSS® Statistics Version 25 served as the statistical platform. In all analyses, we applied three different data exclusion criteria (Figure 1). The first criterion, termed Criterion A later in the text, excluded study participants according to conditions. Specifically, we excluded participants who reported at baseline that they had any CVD or diabetes or that they used medication for hypercholesterolaemia. This exclusion criterion reduced the number of study participants from 1592 to 794. The second criterion, Criterion B, excluded study participants who reported that they had a CVD, except for hypertension, at baseline. This criterion left 920 participants. The third criterion, Criterion C, meant no exclusions. Correspondingly, in all analyses, we used CVD, CHD, MI or UA, and a fatal AMI as dependent variables. These four "nested" outcomes demonstrate the outcome variable selection process with respect to outcome sensitivity. Moreover, to study the effect of covariate manipulations on the Cox model results, we executed Cox regressions adjusted for seven covariates, the six traditional risk factors and alcohol consumption, that were either in their original continuous form or distributed in predetermined categories.
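The three exclusion criteria can be sketched as simple filters. The record layout (`baseline_cvds`, `diabetes`, `cholesterol_medication`) and the toy participants are hypothetical and only illustrate the logic; they are not KIHD variables or data.

```python
def is_excluded(participant, criterion):
    """Decide whether a participant is excluded under Criterion A, B, or C."""
    cvds = set(participant["baseline_cvds"])  # self-reported CVDs at baseline
    if criterion == "A":
        # Any CVD, diabetes, or hypercholesterolaemia medication at baseline.
        return bool(cvds) or participant["diabetes"] or participant["cholesterol_medication"]
    if criterion == "B":
        # Any CVD other than hypertension at baseline.
        return bool(cvds - {"hypertension"})
    if criterion == "C":
        return False  # no exclusions
    raise ValueError("criterion must be 'A', 'B', or 'C'")


def apply_criterion(participants, criterion):
    """Return the participants that remain eligible under the criterion."""
    return [p for p in participants if not is_excluded(p, criterion)]


if __name__ == "__main__":
    toy_cohort = [
        {"id": 1, "baseline_cvds": [], "diabetes": False, "cholesterol_medication": False},
        {"id": 2, "baseline_cvds": ["hypertension"], "diabetes": False, "cholesterol_medication": False},
        {"id": 3, "baseline_cvds": ["angina"], "diabetes": False, "cholesterol_medication": False},
        {"id": 4, "baseline_cvds": [], "diabetes": True, "cholesterol_medication": False},
    ]
    for criterion in "ABC":
        print(criterion, [p["id"] for p in apply_criterion(toy_cohort, criterion)])
```

In this toy cohort, Criterion A keeps only participant 1, Criterion B keeps participants 1, 2, and 4, and Criterion C keeps everyone, mirroring the ordering 794 < 920 < 1592 reported above.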
Altogether, we performed three analysis sets (Figure 1). The first set included covariates as continuous variables and tested their associations with CVD, CHD, MI or UA, and a fatal AMI separately for each data exclusion criterion, A, B, and C. The second set included covariates as categorical variables. The reference categories were as follows: never-smoker, no health risk due to alcohol consumption, normal weight, moderately physically active, desirable SBP, desirable FBG, and desirable STC. Like the first set, the second set tested associations of covariates with CVD, CHD, MI or UA, and a fatal AMI separately for each data exclusion criterion, A, B, and C. The third set also included covariates as categorical variables but used different data exclusion criteria for study participants who received a CVD diagnosis during the follow-up and for those who did not.
The third analysis set constituted two analysis scenarios (Figure 1). In the first scenario, termed Scenario Y later in the text, the exclusion of men with CVD during the follow-up was based on Criterion A and that of men with no CVD during the follow-up was based on Criterion C, i.e. no exclusions. This resulted in 957 study participants eligible for the analysis. In the second scenario, Scenario Z, the exclusion of men with CVD during the follow-up was based on Criterion C and that of men with no CVD during the follow-up on Criterion A. Scenario Z resulted in 1430 study participants.
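For orientation, the general textbook form of the Cox proportional-hazards model with seven covariates is given below. The symbols are introduced here for illustration and are not taken from the study: h_0(t) is the unspecified baseline hazard, x_j a covariate value, beta_j its coefficient, and s_j the standard deviation of a continuous covariate.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \dots + \beta_7 x_7\right),
\qquad
\mathrm{HR}_j^{\text{per unit}} = \exp(\beta_j),
\qquad
\mathrm{HR}_j^{\text{per SD}} = \exp(\beta_j s_j)
```

For a categorical covariate, each non-reference category enters the linear predictor as its own indicator variable, so its hazard ratio is read against the reference category (for example, current smoker versus never-smoker).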
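Scenarios Y and Z apply Criterion A to only one of the two follow-up groups, which a short sketch can make explicit. The record fields, including `cvd_during_followup`, are hypothetical names chosen for this illustration.

```python
def excluded_by_criterion_a(participant):
    """Criterion A: any CVD, diabetes, or cholesterol medication at baseline."""
    return (
        bool(participant["baseline_cvds"])
        or participant["diabetes"]
        or participant["cholesterol_medication"]
    )


def keep_in_scenario(participant, scenario):
    """Scenario Y: Criterion A for men with CVD during follow-up, none for others.
    Scenario Z: Criterion A for men with no CVD during follow-up, none for others.
    """
    if scenario not in ("Y", "Z"):
        raise ValueError("scenario must be 'Y' or 'Z'")
    has_cvd = participant["cvd_during_followup"]
    criterion_a_applies = has_cvd if scenario == "Y" else not has_cvd
    return not (criterion_a_applies and excluded_by_criterion_a(participant))


if __name__ == "__main__":
    toy_cohort = [
        {"id": 1, "baseline_cvds": [], "diabetes": False,
         "cholesterol_medication": False, "cvd_during_followup": True},
        {"id": 2, "baseline_cvds": ["angina"], "diabetes": False,
         "cholesterol_medication": False, "cvd_during_followup": True},
        {"id": 3, "baseline_cvds": [], "diabetes": False,
         "cholesterol_medication": False, "cvd_during_followup": False},
        {"id": 4, "baseline_cvds": [], "diabetes": True,
         "cholesterol_medication": False, "cvd_during_followup": False},
    ]
    for scenario in ("Y", "Z"):
        print(scenario, [p["id"] for p in toy_cohort if keep_in_scenario(p, scenario)])
```

The subgroup sizes reported in the note to Table 4 add up to the scenario totals given above: 411 + 546 = 957 for Scenario Y and 248 + 1182 = 1430 for Scenario Z.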

Outcome sensitivity
CVD and a fatal AMI associated with covariates differently compared to each other as well as compared to CHD and MI or UA (Tables 2-4). CVD was the outcome that most evidently associated with SBP; a high SBP increased the risk of CVD. A fatal AMI in turn was the only outcome that showed only statistically non-significant associations with SBP and physical activity. CHD and MI or UA highlighted the same risk factors. Specifically, they associated with STC more strongly than CVD and a fatal AMI did.

Participant exclusions
Cox regressions based on data with no exclusions most often discovered statistically significant associations of CHD with its risk factors, irrespective of covariate manipulations (Tables 2 and 3). In all these associations, the direction of the association was correct, i.e. the risk factors related to hazard ratios (HR) larger than one and the protective factors related to HRs below one. Only regressions based on data with no exclusions identified, statistically significantly, the protective effect of physical activity; the highest category versus the lowest one. The Appendix presents sample size calculations regarding the main outcome of this study, CHD, and Criterion A that excluded study participants according to conditions at baseline. The comparison between Scenarios Y and Z showed that strict data exclusions regarding men with no CVD during the follow-up combined with no exclusions regarding men with CVD during the follow-up more often yielded statistically significant and plausible results than no data exclusions concerning men with no CVD and strict exclusions regarding men with CVD (Table 4).
Table 2. Hazard ratios and corresponding p-values of any cardiovascular disease (CVD), coronary heart disease (CHD), a myocardial infarction (MI) or unstable angina (UA), and a fatal acute myocardial infarction (AMI) with respect to one unit (1 U) or one standard deviation (1 D) increase in seven factors used as continuous covariates in the Cox proportional-hazards model.

Covariate manipulations
There were only minor differences in Cox model results between analyses including covariates as continuous variables and those including covariates as categorical variables (Tables 2-4). Continuous and categorical covariates led to the same conclusions regarding the association of CHD with its risk factors. Being a former or current smoker, being overweight or obese, and having borderline high or high FBG or STC levels significantly increased the risk of CHD. The effect of high SBP levels on the risk of CHD was uncertain as well as the protective effect of physical activity. Our analyses found no statistically significant association between CHD and alcohol consumption.
Table 3. Hazard ratios (HR), probabilities (P), and corresponding p-values of any cardiovascular disease (CVD), coronary heart disease (CHD), a myocardial infarction (MI) or unstable angina (UA), and a fatal acute myocardial infarction (AMI) with respect to seven factors used as categorical covariates in the Cox proportional-hazards model.
A refers to a dataset excluding CVD, diabetes, and high total cholesterol at baseline (n = 794). B refers to a dataset excluding CVD, except for hypertension, at baseline (n = 920). C refers to a dataset with no exclusions (n = 1592). Bold font indicates a statistically significant HR. a Alcohol consumption in g/week. b Body Mass Index, kg/m². c Physical activity level, the total energy expenditure divided by the basal energy expenditure, moderate <2.00, extreme >2.40. d Systolic blood pressure in mmHg. e Fasting blood glucose in mmol/L. f Serum total cholesterol in mmol/L.

Discussion
Traditionally, epidemiological studies use in their analyses only study participants who are free of the disease of interest at baseline. Our study suggests that excluding study participants who have the disease already at baseline is probably unnecessary. Specifically, our analyses led to the best results when we included all study participants who received the diagnosis during the follow-up irrespective of their self-reported baseline statuses but excluded all study participants who did not receive the diagnosis during the follow-up but had self-reported the disease at baseline. Moreover, our study does not, unconditionally, support participant exclusions with respect to covariates either. Excluding participants who are at risk already at baseline may enable discovering the strongest associations, such as the relationship between diabetes and CHD, but, simultaneously, it may fade out weaker, although relevant, associations, such as the relationship between physical activity and CHD. In other words, a combination of "loose cases" and "strict controls" may yield the best results. In the next paragraphs, we evaluate our results from the viewpoint of CHD risk factors.
In our study, smoking, overweight, and high blood glucose levels, evidently, associated with CHD. Outcome variable selection, participant exclusion, and covariate manipulation procedures had no effects on conclusions drawn from results related to these three cornerstone risk factors. Being a current smoker or being obese (BMI ≥ 30.0) resulted in a 1.5 times higher hazard compared to never-smokers and normal-weight study participants, whereas diabetes (FBG > 6.9 mmol/L) approximately tripled the hazard of CHD. Large prospective cohort studies reported even stronger effects of smoking and obesity on CHD as early as the 1960s [31]. The three times higher hazard of CHD among diabetic men seems to be a rule of thumb [9].
Table 4. Hazard ratios (HR), probabilities (P), and corresponding p-values of any cardiovascular disease (CVD), coronary heart disease (CHD), a myocardial infarction (MI) or unstable angina (UA), and a fatal acute myocardial infarction (AMI) with respect to seven factors used as categorical covariates in the Cox proportional-hazards model.
Note. Y refers to a dataset with no exclusions for study participants with no CVD during the follow-up (n = 411) and excluding CVD, diabetes, and high total cholesterol at baseline for study participants with CVD during the follow-up (n = 546). Z refers to a dataset excluding CVD, diabetes, and high total cholesterol at baseline for study participants with no CVD during the follow-up (n = 248) and no exclusions for study participants with CVD during the follow-up (n = 1182). Bold font indicates a statistically significant HR. a Alcohol consumption (g/week). b Body Mass Index (kg/m²). c Physical activity level, the total energy expenditure divided by the basal energy expenditure, moderate <2.00, extreme >2.40. d Systolic blood pressure (mmHg). e Fasting blood glucose (mmol/L). f Serum total cholesterol (mmol/L).
Total cholesterol and blood pressure were the covariates that most evidently revealed differences related to outcome sensitivity. Total cholesterol is only one of many measures of the lipid status, all of which show somewhat unique associations with CHD and other CVDs [10,13]. Total cholesterol, for example, does not associate as strongly with the risk of stroke [32] as it associates with the risk of CHD [10]. Conversely, high blood pressure increases, specifically, the risk of stroke [33], which may for its part explain, together with reasons related to the sample size, why SBP associated statistically significantly with CVD but not with CHD and MIs in our study.
Irrespective of outcome variable selection, participant exclusions, and covariate manipulations, our study found no statistically significant effects of alcohol consumption on the hazard of CHD. Although alcohol, in general, increases mortality and morbidity [22], light-to-moderate drinking may protect against CHD [11], which for its part may complicate the statistical detection of the association between alcohol consumption and CHD. Moreover, the association relates to the pattern of consumption, i.e. binge drinking, via the progression of atherosclerosis [34], which we did not consider in this study.

Limitations
Our results are based on one dataset and, therefore, they are not straightforwardly generalizable. Moreover, our study does not consider the severity of diseases per se or diagnoses other than CVD, CHD, MI or UA, and AMI.
Practically all KIHD study participants were at least moderately active at baseline, and nearly half of them were extremely active based on PAL values. This indicates the active lifestyles of the KIHD study participants; many of them were farmers or lumberjacks and highly interested in cross-country skiing, which to some extent distinguishes the KIHD cohort from otherwise similar cohorts. On the other hand, extreme physical activity levels, PAL > 2.4, are unrealistic in the long run because they lead to a negative energy balance, i.e. weight loss [35]. This contradiction, most probably, is due to the KIHD assessment method of physical activity. In general, self-assessment physical activity questionnaires show low validity and reliability [36]. Consequently, the present TEE and PAL values are adequate for creating data-specific activity ranks [37] but not for comparing the KIHD cohort to other cohorts as such.

Conclusions
Our Cox model example of the epidemiological relationship between CHD and its common risk factors clearly demonstrated that outcome variable selection and participant exclusions must be considered when interpreting results of epidemiological analyses. Preprocessing procedures that were loose regarding study participants with any CVD during the follow-up and strict concerning study participants with no CVD during the follow-up were best in discovering the association between risk factors and CHD. Outcome sensitivity affected associations across covariates and outcomes. For example, total cholesterol associated, specifically, with CHD and MI or UA but weakly with CVD or AMI. The covariate type, continuous or categorical, had only minor effects on Cox model results. We strongly suggest that research reports present results based on no data exclusions together with results based on reasoned exclusions.