Intracerebral Hemorrhage Detection in Computed Tomography Scans Through Cost-Sensitive Machine Learning

ABSTRACT Intracerebral hemorrhage is the most severe form of stroke, with a greater than 75% likelihood of death or severe disability, and half of its mortality occurs in the first 24 hours. The grave nature of intracerebral hemorrhage and the high cost of false negatives in its diagnosis are representative of many medical tasks. Cost-sensitive machine learning has shown promise in various studies as a method of minimizing unwanted results. In this study, 6 machine learning models were trained on 160 computed tomography brain scans both with and without utility matrices based on penalization, an implementation of cost-sensitive learning. The highest-performing model was the support vector machine, which obtained an accuracy of 97.5%, sensitivity of 95% and specificity of 100% without penalization, and an accuracy of 92.5%, sensitivity of 100% and specificity of 85% with penalization, on a test dataset of 40 scans. In both cases, the model outperforms a range of previous work using other techniques despite the small size of and high heterogeneity in the dataset. Utility matrices demonstrate strong potential for sensitive yet accurate artificial intelligence techniques in medical contexts and workflows where a reduction of false negatives is crucial.


Introduction
Intracerebral hemorrhage (ICH) is a neurological condition occurring due to the rupture of blood vessels in the brain parenchyma (Badjatia and Rosand 2005). It is the most severe form of stroke, with the chance of death or severe disability exceeding 75% and only 20% of survivors remaining capable of living independently after 1 month (Nawabi et al. 2021;Xinghua et al. 2021). It has an incidence of 24.6 per 100,000 person-years and accounts for 10-15% of all strokes (Ziai and Ricardo Carhuapoma 2018). An early diagnosis of ICH is crucial as half of the mortality occurs in the first 24 hours . Computed tomography (CT) scans are currently the preferred noninvasive approach for ICH detection (Hai et al. 2019). The diagnosis time for ICH remains very long-reaching 512 minutes in one study-which couples with a high misdiagnosis rate-13.6% according to one estimate-to make it a prime candidate for workflow improvement through machine learning Hai et al. 2019).
The conditions and characteristics that are specific to the medical field must be taken into consideration during the application of machine learning techniques to problems such as ICH diagnosis. A primary concern is that false negatives are usually much more costly than false positives (Claude and Webb 2011). The frequent under-representation of the minority class coupled with the increased emphasis on its correct prediction makes this a challenging problem for artificial intelligence (Thai-Nghe, Gantner, and Schmidt-Thieme 2010).
Taking the disproportionate costs of false negatives into account-more broadly called cost-sensitive learning-has resulted in positive outcomes in other medical tasks (Charoenphakdee et al. 2021;Freitas, Costa-Pereira, and Brazdil 2007;Fuqua and Razzaghi 2020;Kukar et al. 1999;Septiandri et al. 2020;Siddiqui et al. 2020;Tianyang et al. 2020) However, there is very little literature on the application of such techniques in hemorrhage diagnosis. Hence, the aim of this research was to test the results of the implementation of cost-sensitive learning-specifically a utility matrixbased approach-on ICH classification.

Materials and Methods
The data used for the study were obtained under a CC0: Public Domain license (Kitamura 2017). The dataset consisted of 200 anonymized, publicly-available images of non-contrast computed tomography (CT) scans (brain window), 100 of which contained instances of intraparenchymal hemorrhage with or without intraventricular extension, and 100 of which did not. A sample of 4 images from the data is shown in Figure 1. Figure 1a,b show scans without intracerebral hemorrhage, at the level of the lateral ventricles and third ventricle, respectively. Figure 1c displays a large intracerebral hemorrhage with intraventricular extension. Figure 1d contains a small intracerebral hemorrhage without intraventricular extension.
A trained radiologist confirmed the veracity of these images, and was unable to find any mislabeled images. Thus none of the images was discarded. To obtain the most accurate representation of model performance in real-life clinical scenarios, the images were not augmented in any way. Two datasets were subsequently created: a training dataset with 160 images and a testing dataset with 40 images. Both had equal numbers of hemorrhagic and nonhemorrhagic CT scans. It should be noted that the dataset consisted of images taken from searches on the World Wide Web, hence introducing a high level of heterogeneity due to variations in source machines, patient conditions, scan time, radiation dose, etc. This problem is compounded by the small dataset size, thus the results obtained here are likely to only be conservative estimates of the real potential of the techniques employed (Lacroix and Critchlow 2003;Lim et al. 2019).
6 machine learning models were chosen for the study. The specific parameter configurations for these models are provided in Table 1

Decision Tree
A decision tree is a machine learning model that consists of nodes and branches, and is built using a combination of splitting, stopping and pruning (Song and Ying 2015). Combining all of these techniques, while not guaranteed to result in the theoretically-optimal decision tree -as that would require an exceedingly long time -still results in a highly useful model (Swain and Hauska 1977).

Nodes and Branches
The root node is the first node, and it represents a decision that divides the entire data into two or more subsets. Internal nodes represent more choices that can be used to further split subsets. Eventually, they end in leaf nodes, representing the final result of a series of choices. Any node emanating from another node can be referred to as its child node. Branches are representations of the outcomes of decisions made by nodes and connect them to their child nodes (Song and Ying 2015).

Splitting
Both discrete or continuous variables can be used by decision trees to set criteria for different nodes, either internal or root, that are used to split the data into multiple internal or leaf nodes. Decision trees choose between different splitting possibilities using certain measures of the child nodes, such as entropy, Gini index, and information gain, to optimize the splitting choices (Song and Ying 2015).

Stopping
If allowed to split indefinitely, a decision tree model could achieve 100% accuracy simply by splitting over and over until each leaf node only had one data sample. However, such a model would be vastly overfitted and would not generalize well to test data. Thus there needs to be a limit (often in the form of maximum depth allowed or minimum size of leaf nodes) (Song and Ying 2015).

Pruning
There are two types of pruning. Pre-pruning uses multiple-comparison tests to stop the creation of branches that are not statistically significant, whereas postpruning removes branches from a fully-generated decision tree in a way that increases accuracy on the validation set, which is a special subset of the training data not shown to the model while training (Song and Ying 2015).

Random Forest
A random forest, as the name suggests, is a collection of randomized decision trees that is suitable for situations where a simple decision tree could not capture the complexity of the task at hand (Breiman 2001). The random forest model performs a "bootstrap" by choosing n times from n decision trees with replacement (Biau and Scornet 2016). The averaging of these predictions is called bagging (short for bootstrap-aggregating) and is a simple way to improve the performance of weak models, which many decision trees are when the task is complex (Breiman 1996).

Gradient Boosted Trees
Gradient tree boosting constructs an additive model, using decision trees as weak learners to aggregate, similar to the concept of random forests, but it does so sequentially; gradient descent is then used to minimize a given loss function and optimize the construction of the model as more trees are added (Friedman 2001;Jerry et al. 2009).

Nearest Neighbors
The nearest neighbors model is based on the concept that the closest patterns of the data sample in question offer useful information about its classification. Thus the model works by assigning each data point the label of its k closest neighbors, where k is specified manually (Kramer 2013).

Support Vector Machine
Support vector machines, linear and nonlinear, are a family of modular machine learning algorithms for binary classification problems by using principles from convex optimization and statistical learning theory (Mammone, Turchi, and Cristianini 2009).

Separating Hyperplane
For n-dimensional data, a hyperplane (straight line in a higher-dimensional space) of n-1 dimensions is used to separate the data into two different groups such that the distance from the clusters is maximized and the hyperplane is "in the middle," so to speak. This hyperplane also dictates which labels will be assigned to the samples from the test set (Noble 2006).

Soft Margin
Most datasets, of course, do not have clean boundaries separating different clusters of samples. There will also be outliers, and an optimal model would allow for a certain number of outliers to avoid overfitting while also limiting misclassifications. The soft margin roughly controls the number of examples allowed on the wrong side of the hyperplane as well as their distance from the hyperplane (Noble 2006).

Kernel Function
A kernel function refers to one or more mathematical operations that project low-dimensional data to a higher-dimensional space called a feature space, with the goal being the easier separation of the example data (Suthaharan 2016).

Logistic Regression
Logistic regression is a mathematical model that describes the relationship of a number of features (X 1 ; X 2 ; . . . ; X k ) to a dichotomous result. It makes use of the logistic function to assign probabilities to samples of belonging to either class, and the sample is then classified to the class where it has the higher probability of belonging (Kleinbaum and Klein 2010).

Utility Matrices
For the second part of the study, the concept of utility functions was used. A utility function is a mathematical function through which preferences for various outcomes can be quantified (Russell and Norvig 2002). In the case of a utility matrix for classification problems, U ij provides the utility where i is the ground truth and j is the model prediction reference (Wolfram Research 2014). It is also referred to as a cost matrix (Elkan 2001).
The utility matrix created for this study is shown in Table 2. The sole difference from the default utility matrix is that false negatives (Row 1, Column 2) now have a utility of -x instead of 0. Subsequent experimentation with different values of x was done and the results were recorded. Essentially, the models trained using this matrix will have an aversion to false negatives, thus decreasing the number of false negatives at the expense of potentially increasing the number of false positives. The strength of this aversion and subsequent change is quantified by x. A certain amount of trial and error is needed since, for most medical problems, the cost is not certain and needs to be estimated (Wang, Kou, and Peng 2021). The cost is contextspecific; depending on the workflow where the algorithm is being implemented, the goals for sensitivity and specificity will vary and the cost must be determined accordingly.
In Round 2, the models were re-trained with the same parameters as the previous set, with the only difference being that the value for the "UtilityFunction" parameter was set to the utility matrix provided in Table 2.

Results
The final results for all of the models trained are given in Table 3. The probability histograms for the 3 most accurate models are given in Figure 2. These histograms show the actual-class probability of the 40 test samples. Table 4 lists the results for each of the models after setting their utility function to the matrix given in Table 2, for three values of À x. Finally, Figure 3 shows, for the top  three models, how their sensitivities and specificities change as the penalty is increased, with 0 being the default penalty that corresponds to Table 3. According to the data in Table 3, the support vector machine model is the most accurate as well as the most sensitive and specific. Logistic regression and gradient boosted trees score second and third respectively in overall accuracy, though the nearest neighbors model has a higher specificity than gradient boosted trees, coming in second and tied with logistic regression.
Interestingly, while Table 3 shows that the support vector machine is more accurate overall, logistic regression actually appears to be more certain in its predictions, as demonstrated by the difference in the number of samples in the 0.4-0.6 bin. Overall, though, all three of the best-performing models seem highly certain in the correct predictions that they are making, as the vast majority of samples fall in the 0.8-1.0 bin.
In Table 4, for a penalty of −1, three models give a 100% sensitivity, although for the decision tree that comes at the cost of having a 0% specificity. The support vector machine performs the best, matching the 100% sensitivity of gradient boosted trees. Its specificity of 85% is less than that achieved by logistic regression, though its sensitivity is greater.
The results for a −2 penalty are almost exactly the same, with the only difference being a slightly reduced accuracy for logistic regression.
When the penalty is increased to −3, model performances decrease along the board, except for random forest, which performs better). While the sensitivities for almost all the models are now 100%, they come at the expense of drastically reduced specificities.
Further values of À x were not tested as most model performances had started rapidly deteriorating at a −3 penalty. The random forest, however, might perform better at more negative penalties, based on the observed trends.
As expected, Figure 3 shows that greater penalties yield reductions in specificity and growth in sensitivity. Though in cases like Figures 3a,c, where the sensitivity maxes out early, additional penalties only decrease the specificity. While tweaking the penalties, some changes, such as the one 0.85 to 0.70 in the specificity in Figure 3b, can be quite drastic, making it all the more important to conduct detailed experimentation while determining the penalty.

Discussion
The Round 1 results compare favorably to prior studies conducted for ICH identification, especially taking into the account the small training set and heterogeneity in the training data. For instance, work done by Majumdar et al. (Majumdar et al. 2018) shows a sensitivity of 81% and specificity of 98%. In one study, the model accuracy of 82%, recall of 89% and precision of 81% remarkably still resulted in a better recall than 2 of the 3 senior radiologists used for benchmarking (Grewal et al. 2018). More recent research by Dawud et al. (Dawud, Yurtkan, and Oztoprak 2019) resulted in the development of three models, with accuracies of 90%, 92% and 93%. Jnawali et al. (2018) performed a study with dataset heterogeneity comparable to ours, although with a considerably larger dataset size of 1.5 million images, and achieved an AUC score of 0.87. A study using 37,000 training samples, also with pronounced heterogeneity in the dataset, obtained an accuracy of 84%, a sensitivity of 70%, and a specificity of 87% . A retrospectively-collected dataset of 3605 scans was used to evaluate a commercial artificial intelligence decision support system (AIDSS) and showed a sensitivity of 92.3% and a specificity of 97.7% (Voter et al. 2021). There were also two studies that achieved better accuracies of 99% and one study with an AUC of 0.991 (Bobby and Annapoorani 2021;Tog˘açar et al. 2019;Weicheng et al. 2019).
Recently, a new type of neural network was developed; 90 versions were trained and tested on ICH detection ability, with the best model attaining an accuracy of 83.3% (J. Y. Lee et al. 2020). The important similarity between the work of Lee et al. is the comparable dataset size of 250 images, split into 166 scans for training and 84 for testing. Thus, given the data available, the model presented in this paper surpasses prior performance. In another paper, a nature-inspired gray wolf optimizer (GWO) algorithm called GWOTLT was trained and tested, using 10-fold cross-validation, on the same dataset as the one used in this study. The baseline VGG-16 had a precision and recall of 88.6% and 86.0% respectively, while the improved GWOTLT achieved a precision of 90.9% and a recall of 93.0%, thus demonstrating that the performance here compares favorably to research utilizing the same data (Vrbančič, Zorman, and Podgorelec 2019).
As for cost-sensitive machine learning, only one example could be found for the application of such a technique on intracerebral hemorrhage in prior literature, where the authors ended up achieving an overall sensitivity of 96% and specificity of 95% on a training dataset of 904 cases; the cost in this case was introduced through a modified loss function instead of a utility matrix (used in this study) (H. Lee et al. 2019). A smaller training dataset of 160 images combined with greater heterogeneity likely reduced model performances in this study compared to Lee et al.'s work (Lacroix and Critchlow 2003;Lee et al. 2019;Lim et al. 2019).
The accuracy of the cost-sensitive support vector machine exceeds a number of prior studies' performances. Comparing the accuracy against the array of research mentioned previously, it manages an accuracy greater than the models constructed by Dawud et al., Grewal et al., Arbabshirani et al. and Jnawali et al., while still maintaining a 100% sensitivity Dawud, Yurtkan, and Oztoprak 2019;Grewal et al. 2018;Jnawali et al. 2018). As Rane and Warhade pointed out in their literature review, a cost-sensitive algorithm shows great potential as a strategy for more effective diagnosis, and the results confirm this, with the cost-sensitive models demonstrated here outperforming a number of other techniques (Rane and Warhade 2021).

Applications
Considering the details of utility matrix application and the results it produces, such a type of optimization is suited for a workflow where the artificial intelligence is not working in isolation. Since penalization for false negatives can end up reducing specificities, a further test might become necessary, whether it is a separate algorithm, human clinician(s), or something else. Costsensitive learning is ideal in situations where misdiagnosing a positive case could have serious negative impacts, but at the same time the misdiagnosis of a negative case would not cause much inconvenience.

Limitations and Future Research
Two limitations of this study were, as mentioned, the small sample size and high heterogeneity within the dataset. Due to the nature of the dataset, distinctions based on sex, ethnicity or age weren't possible Future research that applies cost-sensitive techniques to larger, more standardized datasets, as well as utilizing datasets with details about sex, age, ethnicity, etc., could offer greater insights into the advantages and drawbacks of cost-sensitive AI algorithms, however this study is quite beneficial as a starting point for the novel application of utility matrices in pertinent medical scenarios.

Conclusion
Clearly, utility matrices have potential as a tool for minimizing unwanted false negatives while still providing accurate overall results. There is a trade-off between sensitivity and specificity, and a suitable value must be chosen after multiple trial and error estimates, but once the ideal penalty has been determined, a cost-sensitive model achieves the given goal better than a more general technique. More work should be done on a variety of use cases, but the initial results are promising.

Disclosure Statement
No potential conflict of interest was reported by the author(s).