Use of deep learning in forensic sex estimation of virtual pelvic models from the Han population

Abstract
Accurate sex estimation is crucial to determine the identity of human skeletal remains effectively. Here, we developed convolutional neural network (CNN) models for sex estimation on virtual hemi-pelvic regions, including the ventral pubis (VP), dorsal pubis (DP), greater sciatic notch (GSN), pelvic inlet (PI), ischium, and acetabulum from the Han population and compared these models with two experienced forensic anthropologists using morphological methods. A Computed Tomography (CT) dataset of 862 individuals was divided into training, validation, and testing subgroups. The CT-based virtual hemi-pelvises from the training and validation groups were used to calibrate sex estimation models; the testing dataset was then used to evaluate the performance of the trained models and two human experts on the sex estimation of specific pelvic regions in terms of overall accuracy, sensitivity, specificity, F1 score, and receiver operating characteristic (ROC) curve. Except for the ischium and acetabulum, the CNN models trained with the VP, DP, GSN, and PI images achieved excellent results with all the prediction metrics over 0.9. All accuracies were superior to those of the two forensic anthropologists in the independent testing. Notably, the heatmap results confirmed that the trained CNN models were focused on traditional sexual anatomic traits for sex classification. This study demonstrates the potential of AI techniques based on the radiological dataset in sex estimation of virtual pelvic models. The excellent sex estimation performance obtained by the CNN models indicates that this method is valuable to proceed with in prospective forensic trials.

Key points
Deep learning can be a promising alternative for sex estimation based on the pelvis in forensic anthropology.
The deep learning convolutional neural network models outperformed two forensic anthropologists using classical morphological methods.
The heatmaps indicated that the most known sex-related anatomic traits contributed to correct sex determination.


Introduction
Accurate sex estimation of skeletal remains is fundamental to individual identification in forensic anthropology, because the estimation of other biological elements (e.g. ancestry, age, and stature) depends on it [1][2][3][4]. The pelvis has always been considered the most reliable of all human bones for this purpose owing to its remarkable sexual dimorphism, which is primarily shaped by the functions of bipedal locomotion and parturition [3,5-8]. Traditionally, forensic anthropologists estimate sex by empirically evaluating morphological features of the pelvis (e.g. ventral arc and subpubic contour). In some cases, however, they must work with only parts of the pelvis, for example when corpses are poorly preserved after mass disasters or carnivore scavenging [9][10][11]. Such conditions make pelvis-based sex estimation more challenging. Therefore, it may be beneficial to establish additional objective methods to supplement routine morphological observation.
Current classical morphological sex estimation approaches mainly originate from American and English skeletal collections [12][13][14]. The individuals from these collections were born in the second half of the 18th and 19th centuries [15,16]. Although these approaches have been widely validated, it remains questionable whether they can be applied directly to the Han population, given environmental influences on skeletal development and population-specificity [13,15,17-21]. Since Computed Tomography (CT) scanners are now widely used in hospitals, collecting adequate forensic anthropological reference data from a contemporary population is convenient. Besides, the CT-based three-dimensional (3D) reconstruction technique is sufficient to portray many morphological features of the actual skeletal counterparts [22,23]. Therefore, virtual skeletal remains constructed by CT scanning could be recognised as a potential resource for forensic anthropology and even as a substitute for traditional skeletal collections in some specific situations [24,25].
Artificial intelligence (AI) is likely to affect many fields by accomplishing tasks considered difficult for human experts [26][27][28][29]. Powered by its advances in computation on vast amounts of datasets, deep learning has gained considerable attention for realising AI, especially in the domain of medical image recognition [30]. As exemplified by the study from Kermany et al. [29], a deep learning model with transfer-learning was used as a diagnostic tool to screen patients with common treatable blinding retinal diseases. Additionally, several studies reported the application of deep learning as a promising method in forensic anthropology. For example, Spampinato et al. [31] proposed several deep-learning approaches to perform automatic skeletal bone age assessment on 1 391 left-hand X-ray scans of children; the results showed an average discrepancy of about 0.8 years between manual and automated evaluation. Li et al. [32] used 1 875 clinical pelvic radiographs to develop a convolutional neural network (CNN) for bone age estimation; the model achieved excellent performance with the mean absolute error of 0.89 years in test samples. Notably, only a few studies have focused on applying CNN models for sex estimation, using radiographs of hands and wrists and CT reconstructions of skulls [33,34].
Deep learning has the advantage of hierarchically extracting feature representations from the input imaging data. This study aimed to train the CNN models (GoogLeNet Inception V4) for sex estimation based on virtual hemi-pelvic bones, reconstructed with CT scanning from 862 individuals of Han nationality. Moreover, a comparative study was performed between the trained CNN models and human experts in independent testing.

Study population and 3D model reconstruction
This work is a retrospective study based on 862 (females: 437; males: 425) pelvic CT scans retrieved from the database of the Department of Medical Imaging of Hanzhong Hospital. The scans were randomly selected from the adult Han Chinese visiting the hospital for CT-imaging of the pelvis between 2015 and 2017. These individuals aged from 20 to 85 years represented the Han population in this context. The pelvises with diseases, deformities, and injuries were eliminated, and only information of sex, age, and ancestry was retained. The study was approved by the Ethics Committees of Nanjing Medical University and Academy of Forensic Science, Ministry of Justice, and undertaken according to the Declaration of Helsinki. The committees exempted written informed consent from patients because of the anonymity of the participants' details and the retrospective nature of this study. The permission to use the information in this database for the purposes of this research was obtained from the dataset owner, Hanzhong Hospital.
CT acquisition was performed on an Optima CT660 (GE Healthcare, Chicago, IL, USA) with the tube voltage of 120 kV, tube current of 300 mA, slice thickness of 1.25 mm, and spiral pitch factor of 0.98. The scans of the pelvises were manually reconstructed into 3D virtual skeletal models with the Mimics software (Materialise Co., Leuven, Belgium) by a single researcher using comparable standard protocols. Then the virtual ossa coxae were separated from their adjacent soft tissues and bone structures with the Hounsfield unit measurements from 226 to 3 071.

2D Image acquisition and preprocessing
Only the left ossa coxae were included in this study. Eighty percent of the samples (female: 350; male: 340) were randomly selected for CNN model calibration and divided into training (female: 263; male: 255) and validation (female: 87; male: 85) datasets, while the remaining 20% (female: 87; male: 85) formed the testing dataset used for the comparative study between the CNN models and two human experts.
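The split described above can be sketched as follows. This is a minimal illustration only, assuming a per-sex random shuffle in which testing and validation each receive 20% of the subjects of each sex; the function name, the seed, and the integer IDs are hypothetical and are not taken from the study.

```python
import random

def split_by_sex(ids_by_sex, holdout_fraction=0.2, seed=0):
    """Randomly split subject IDs into training/validation/testing sets per sex.

    Assumption: validation and testing each receive round(holdout_fraction * n)
    subjects of each sex; the remainder is used for training.
    """
    rng = random.Random(seed)
    splits = {"training": [], "validation": [], "testing": []}
    for sex, ids in ids_by_sex.items():
        shuffled = list(ids)
        rng.shuffle(shuffled)
        n_hold = round(holdout_fraction * len(shuffled))
        splits["testing"] += [(sex, i) for i in shuffled[:n_hold]]
        splits["validation"] += [(sex, i) for i in shuffled[n_hold:2 * n_hold]]
        splits["training"] += [(sex, i) for i in shuffled[2 * n_hold:]]
    return splits

# Hypothetical integer IDs standing in for the 437 female and 425 male scans.
splits = split_by_sex({"female": range(437), "male": range(425)})
counts = {name: len(members) for name, members in splits.items()}
```

With the group sizes of this study, the rule yields 518 training, 172 validation, and 172 testing samples, which matches the counts reported above.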
Six specific regions, including the ventral pubis (VP), dorsal pubis (DP), greater sciatic notch (GSN), pelvic inlet (PI), ischium, and acetabulum, were manually cut out of the virtual hemi-pelvic models by a single researcher. The ventral and dorsal profiles of the pubic bone, the GSN and PI planes, the ventral profile of the ischium, and the acetabular rim were oriented perpendicular to the viewer. Their corresponding 2D images were manually captured and downsampled to 255 × 255 pixels. Our training and validation datasets were relatively limited for deep learning. Therefore, data augmentation, which randomly changed the images' contrast, brightness, and rotation angle (90°, 180°, or 270°), was performed to enlarge the dataset. This technique is a powerful and widely used method to improve a model's generalisability to unseen data. In this way, the number of images available for CNN training was increased fourfold.
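The augmentation step can be illustrated with a short, dependency-free sketch. It assumes that "fourfold" means the original image plus one jittered copy per rotation angle; the jitter ranges, the function names, and the use of a greyscale image stored as a list of rows are illustrative assumptions, not details reported in the study.

```python
import random

def rotate90(img):
    """Rotate a 2D image (list of pixel rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def jitter(img, contrast, brightness):
    """Scale pixels about mid-grey (contrast), shift them (brightness), clip to 0-255."""
    return [
        [min(255, max(0, round((p - 128) * contrast + 128 + brightness))) for p in row]
        for row in img
    ]

def augment(img, rng):
    """Return the original image plus three rotated, randomly jittered copies."""
    variants = [img]
    rotated = img
    for _ in range(3):  # successive rotations give 90, 180, and 270 degrees
        rotated = rotate90(rotated)
        contrast = rng.uniform(0.8, 1.2)   # assumed jitter range
        brightness = rng.uniform(-20, 20)  # assumed jitter range
        variants.append(jitter(rotated, contrast, brightness))
    return variants

# Tiny hypothetical 2 x 2 greyscale image used only to exercise the functions.
example = augment([[0, 50], [100, 200]], random.Random(0))
```

Each input image therefore yields four training images, so a set of N images grows to 4N under this assumption.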

CNN models training and validation
The GoogLeNet Inception V4 architecture [35], whose internal weights had been pre-trained on the ImageNet dataset containing 1.28 million images in 1 000 categories, was adopted for sex estimation on the virtual hemi-pelvic images using transfer learning. Transfer learning is another approach to addressing a lack of data: the pre-trained weights serve as the initial weights in a new task. The CNNs were trained and assessed internally using the images from the training and validation datasets. During the training, the loss, which reflects model performance, was reduced iteratively with the Adadelta algorithm [36] using a learning rate of 0.01 and a mini-batch size of 64. Learning rate is perhaps the most critical hyperparameter in deep learning, controlling how quickly the model is updated and how much the weights change at each update. Batch size is also a hyperparameter; it defines the number of images fed into the model at each step. A learning rate decay factor of 0.8 and a decay step of 10 meant that the learning rate was reduced to 80% of its value every 10 epochs. Heatmap analysis based on the Guided Backpropagation algorithm [37] was used to determine the pixel regions on the hemi-pelvic images contributing to sex classification by the CNN models. All the experiments were implemented on a standard computer running Ubuntu 16.04, equipped with an NVIDIA Titan Xp 12 GB graphics processing unit (GPU), an Intel i7-8700K central processing unit (CPU), and 32 GB of random access memory (RAM).
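The learning rate schedule can be written down explicitly. The sketch below assumes a staircase (step) decay, in which the rate is multiplied by 0.8 once every 10 epochs; the text does not state whether the decay was stepwise or continuous, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, decay_factor=0.8, decay_step=10):
    """Staircase decay: multiply the base rate by decay_factor every decay_step epochs."""
    return base_lr * decay_factor ** (epoch // decay_step)

# Rates for the first 30 epochs under the assumed staircase schedule.
schedule = [learning_rate(epoch) for epoch in range(30)]
```

Under this assumption the rate stays at 0.01 for epochs 0-9, drops to 0.008 for epochs 10-19, and to 0.0064 for epochs 20-29.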

Sex estimation on virtual hemi-pelvises of the independent testing dataset by CNN models and human experts
The trained CNN models were fed with the virtual hemi-pelvic region images of the independent testing dataset and output probability values for sex estimation. With the threshold set to 0.5, an image was classified as female when the probability value was higher than 0.5 and as male when it was lower than 0.5. Two experienced anthropologists (A and B, with 12 and 8 years of forensic anthropology experience, respectively) were given the same independent images and estimated the sex empirically based on previously reported methods. For VP, DP, and GSN, the images were scored on an ordinal scale from 1 to 5 (1 = hyperfeminine, 2 = feminine, 3 = intermediate, 4 = masculine, and 5 = hypermasculine) to reflect the variation in the expression of morphological traits, following the methods published by Klales et al. [14] and Walker [13]. Images with scores ≤ 2 were classified as female, those with scores ≥ 4 as male, and those with a score of 3 as undetermined. As for PI, ischium, and acetabulum, the anthropologists estimated the sex on a binary scale based on Rogers and Saunders [38] and Bruzek [4]. According to Rogers and Saunders, the male PI is heart-shaped, while the female expression is elliptical; the acetabulum in males is large and oriented laterally, whereas in females it is said to be small and directed more anterolaterally. Bruzek [4] reported that the ischium is longer than the pubis in males and shorter in females.
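The two decision rules above are simple enough to state as code. The following sketch uses hypothetical function names; the handling of a probability exactly equal to 0.5 is an assumption, since the text only defines the cases above and below the threshold.

```python
def sex_from_probability(p_female, threshold=0.5):
    """Map a CNN output probability to a sex label using the fixed threshold."""
    if p_female > threshold:
        return "female"
    if p_female < threshold:
        return "male"
    return "undetermined"  # assumption: the text does not define the tie case

def sex_from_ordinal_score(score):
    """Map a 1-5 ordinal trait score (VP, DP, GSN) to a sex label."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("score must be an integer from 1 to 5")
    if score <= 2:
        return "female"
    if score >= 4:
        return "male"
    return "undetermined"  # score 3 = intermediate

labels = [sex_from_ordinal_score(s) for s in (1, 2, 3, 4, 5)]
```

Note that only the ordinal rule can return an undetermined result for the experts, whereas the CNN rule assigns a sex to essentially every image.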

Statistical analysis
We used the receiver operating characteristic (ROC) curve to show the classification ability of the deep learning models in sex estimation. The ROC curve was created by plotting the sensitivity against 1 - specificity while varying the predicted probability threshold, and the area under the ROC curve (AUC) was then calculated. Females and males were designated as the positive and negative classes, respectively. Sensitivity was calculated as the fraction of females who were correctly identified, and specificity as the fraction of males who were correctly identified. The 95% confidence intervals (CIs) for sensitivity and specificity were calculated with the Clopper-Pearson method. The DeLong nonparametric statistical test implemented in MedCalc (Version 19.2.0) was used to assess significant differences among the AUC values [39].
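As an illustration of these definitions, the sketch below computes the metrics from confusion-matrix counts and an exact (Clopper-Pearson) interval by bisection on the binomial tail probabilities. It uses only the standard library; the function names and the example counts are hypothetical, and the study itself used MedCalc and SPSS rather than this code.

```python
from math import comb

def classification_metrics(tp, fn, tn, fp):
    """Females are the positive class, males the negative class."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _solve_increasing(f, target, iterations=60):
    """Find p in [0, 1] with f(p) = target, for f increasing in p (bisection)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a proportion k / n."""
    # Lower bound: P(X >= k | p) = alpha / 2; defined as 0 when k = 0.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2; defined as 1 when k = n.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper

# Hypothetical counts for a test set of 87 females and 85 males.
metrics = classification_metrics(tp=80, fn=7, tn=78, fp=7)
sensitivity_ci = clopper_pearson(80, 87)
```

The exact interval is wider than the usual normal approximation for small groups, which matters here because each class in the testing dataset contains fewer than 90 individuals.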
The intra- and interobserver agreements between the two observers were quantified with the weighted kappa (κ) for the ordinal ranked data of the VP, DP, and GSN and with Cohen's kappa (κ) for the binary data of the PI, ischium, and acetabulum. A value of 0.00 shows no agreement, 0.01-0.20 indicates slight agreement, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1.00 almost perfect agreement. Statistical analysis was performed using SPSS 25.0 (IBM, Armonk, NY, USA).
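For the binary traits, the agreement statistic can be sketched as below. This shows the unweighted Cohen's kappa only; the weighted variant used for the ordinal scores is not reproduced because the weighting scheme is not stated in the text. The function names and the example ratings are hypothetical, and treating negative values as "no agreement" is an assumption.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equally long sequences of category labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement from the marginal category frequencies of each rater.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)

def interpret_kappa(kappa):
    """Agreement bands as listed in the text; values <= 0 are treated as no agreement."""
    if kappa <= 0.0:
        return "no agreement"
    for upper, label in ((0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"

# Hypothetical binary calls (F = female, M = male) from two observers.
kappa = cohen_kappa(list("FFMM"), list("FMMM"))
band = interpret_kappa(kappa)
```

In this toy example the observers agree on three of four images, but half of that agreement is expected by chance, so kappa is 0.5 (moderate).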

CNN models training and validation
The training samples were iteratively fed into the Inception V4 to modify the model's weights and biases in different layers. The trained model was evaluated at each step on the validation dataset in terms of classification accuracy and validation loss. The loss function distils all aspects of the CNN model's errors into a single number, the loss value, which we sought to minimise during training. Low validation loss and high accuracy empirically imply good efficacy. As shown in Figure 1 and Table 1, the best models of VP, DP, and GSN converged at about 50 steps with a training time of about 35 min, which was shorter than that of the other three models. Nevertheless, the maximum training time for the six models was about 2 h. In the internal validation, the accuracies of the six models ranged from 74.6% (acetabulum) to 98.2% (PI), with most of the CNN models having validation loss values below 0.35. Only the validation loss values of the ischium and acetabulum models were higher than 0.50, implying that their sex estimation efficacy could be unpromising.

Independent prediction on the testing samples with CNN models
When the six CNN models were fully trained using the training and validation datasets, their performance was evaluated independently with the testing dataset (Table 2).
Despite excellent performance in visual classification, neural networks are often called "black boxes" because they lack transparency. In Figure 2, we present examples of positive outputs with the regions of interest identified by our deep learning classifiers through the Guided Backpropagation heatmap test. In the VP and DP, the inferior margin of the pubic ramus and the overall shape of the pubis contributed highly to sex estimation. The heatmaps from the VP, DP, ischium, and acetabulum suggested that the obturator foramen also contributed somewhat to pelvic sex estimation.
Table 3 demonstrated almost perfect levels of intraobserver agreement for VP, DP, and GSN (Anthropologist A: 0.864-0.891; Anthropologist B: 0.820-0.899), and substantial agreement for PI, ischium, and acetabulum (Anthropologist A: 0.759-0.770; Anthropologist B: 0.724-0.809). There was substantial interobserver agreement for all the traits (Anthropologists A and B: 0.651-0.815). These results suggest that the sex estimation made by the human experts was relatively reliable. As shown in Table 4, anthropologists A and B presented similar prediction metrics, with accuracies ranging from 72.7% (65.4%-79.2%) to 80.2% (73.5%-85.9%), sensitivities ranging from 71.3% (60.6%-80.5%) to 88.5% (79.9%-94.3%), and specificities ranging from 56.5% (45.3%-67.2%) to 81.2% (71.2%-88.8%).

Comparison of sex estimation on the testing samples between CNNs and human experts
As shown in Figure 3, the points (1 - specificity, sensitivity) of both human experts, except for the acetabulum, lay below the ROC curves of the corresponding CNN models, implying that the CNN models achieved overall performance superior to that of the human experts in the five selected pelvic regions other than the acetabulum. For the acetabulum, the CNN model and the anthropologists had almost comparable sensitivity and specificity. Except for the acetabulum, the CNN models performed better than the experts in terms of accuracy, sensitivity, specificity, and F1 scores (Tables 2 and 4).
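The comparison between an expert's operating point and a model's ROC curve can be made concrete with a small sketch. It builds an empirical ROC curve from predicted probabilities, computes the AUC with the trapezoidal rule, and checks whether a given (1 - specificity, sensitivity) point lies below the curve. All function names, labels, scores, and the expert point are hypothetical and serve only to illustrate the reasoning; they are not data from this study.

```python
def roc_points(labels, scores):
    """Empirical ROC curve; labels: 1 = female (positive), 0 = male (negative)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Lower the threshold past every sample sharing this score.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def curve_height(points, fpr):
    """Highest sensitivity reached by the ROC curve at a given 1 - specificity."""
    best = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= fpr <= x1:
            y = y1 if x1 == x0 else y0 + (y1 - y0) * (fpr - x0) / (x1 - x0)
            best = max(best, y)
    return best

# Hypothetical model outputs for three females (1) and three males (0).
curve = roc_points([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
model_auc = auc(curve)
# Hypothetical expert operating point: 1 - specificity = 0.2, sensitivity = 0.5.
expert_below_curve = 0.5 < curve_height(curve, 0.2)
```

A point below the curve means the model can match the expert's specificity while achieving a higher sensitivity, which is the sense in which the CNN models are said to outperform the experts.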

Discussion
In some cases, forensic anthropologists have no access to intact, complete pelvic bones, which may be partially destroyed, missing, or deformed owing to antemortem or postmortem injury from wars, air or traffic crashes, dismemberment, and animal gnawing. It is challenging to make accurate sex estimation in the above scenarios, regardless of using morphological or metric methods. Besides, some sex estimation methods based on ancient skeletal collections may not be applicable to other current populations due to significant variations of ancestry, environment, and nutrition [12-14, 17, 18, 40]. All of these factors can prevent forensic anthropologists from making effective sex judgments, thereby misleading further construction of skeletal biological profiles and police investigation. To deal with these situations, we investigated the performance of the CNN models on the specific regions of 3D virtual hemi-pelvis samples reconstructed from the contemporary Han CT scans, including the VP, DP, GSN, PI, ischium, and acetabulum. We then compared the performance of these models to that of two experienced forensic anthropologists using traditional methods.

Note: All the metrics were calculated using an untuned threshold value of 0.5. Females and males were designated as true positives and true negatives, respectively. Sensitivity and specificity were calculated as the fractions of females and males who were correctly identified in the true condition. Key descriptive statistics include total samples (n = 172), true female (positive) findings (n = 87), and true male (negative) findings (n = 85). The numbers in parentheses are the 95% confidence intervals. AUC: the area under the receiver operating characteristic curve.
Note: The intra- and interobserver agreements were quantified with the weighted kappa (κ) for the ordinal ranked data of the ventral pubis, dorsal pubis, and greater sciatic notch and with Cohen's kappa (κ) for the binary data of the pelvic inlet, ischium, and acetabulum. The numbers in parentheses are the 95% confidence intervals.

Figure 2.
Guided Backpropagation heatmaps on the ventral pubis, dorsal pubis, greater sciatic notch, pelvic inlet, ischium, and acetabulum. Red indicates areas that contribute highly to sex estimation classification, while blue represents little contribution.
The findings of this retrospective study showed that most CNN models tested on the independent hemi-pelvic dataset achieved high accuracies, sensitivities, and specificities in sex estimation, comparable to those in the internal validation procedure. This demonstrated the generalisability of our deep learning models to unseen samples, without the performance gap that would indicate overfitting. In the case of overfitting, the deep learning model is fitted so closely to the training data that it cannot adapt to new data. In contrast, an underfitted model fails to make accurate predictions even on the training data. In this context, our acetabulum model should be considered underfitted according to its AUC value and accuracy in the testing dataset. We do not know whether this model's underfitting is mainly related to the limited data size or to the sexually dimorphic nature of the acetabulum, which should be clarified in future investigations.
The results also showed that most CNN models, except for the acetabulum model, outperformed the human experts on sex estimation in terms of accuracy, sensitivity, and specificity. In addition, CNN models could be well-trained in a few hours, whereas a qualified anthropologist may require years of professional training. The heatmaps based on the Guided Backpropagation algorithm demonstrated that these models mainly concentrated on the anatomic structures that are well known to contribute significantly to sexual classification. Indeed, several morphological studies have confirmed the validity and reliability of some traits (e.g. ventral arc, subpubic angle, pubis body shape, and subpubic contour) on the VP and DP [12,14,17,19,20,41-45]. However, our results yielded higher classification accuracies for both pelvic regions than these studies. Moreover, the pelvic inlet was found to be sexually dimorphic, reaching a high accuracy of 96.5%-100%, even though this anatomic trait is not widely recognised as an ideal alternative for sex estimation [46]. Compared to similar studies on the GSN [4,13,47], the higher accuracy we obtained could be explained by the ability of the CNN models to ignore irrelevant details such as developmental variation of marginal structures like the ischial spine and piriform tubercle, which usually makes the traditional morphological and metric methods more challenging. Contrary to several other studies (accuracy ranging from 82.5% to

Figure 3.
... ischium (E), and acetabulum (F). Except for the acetabulum, most models achieve superior performance to the two anthropologists, as their specificity-sensitivity points lie below the corresponding model's receiver operating characteristic curve. AUC: the area under the receiver operating characteristic curve.
These results demonstrate not only the power of AI technology for sex estimation but also the advantages of radiological techniques for rapid data renewal, especially for contemporary populations within specific regions. Despite the high performance of our sex estimations, the proposed method still has several limitations. Firstly, the dataset is relatively small. Data augmentation and transfer learning techniques were applied in this study to overcome the lack of training data, and the sex estimation performance of the models on the independent dataset suggested that these techniques helped to address the problem of sample insufficiency. In addition, several studies have reported the feasibility of transfer learning to train models with limited data. Kermany et al. [29] found that a classifier trained with 1 000 samples using transfer learning retained performance comparable to that of one trained with 10 000 samples from scratch. However, Kermany et al. also acknowledged that the performance of a model using transfer learning would be inferior to that of a model trained on an extremely large dataset. Secondly, the samples we collected originated from one platform and are primarily north-western Han Chinese. Although most of our models exhibited excellent generalisability in the independent dataset, the applicability of our method needs to be further evaluated using CT datasets generated by various instruments, software, and populations, as well as images not derived from CT scans, such as pictures from cameras or smartphones. Future multicentre studies should include more data and expand the sets to real-world data from other sources to increase the performance and generalisability of our AI sex estimation system.
Although our computational analyses may play a role in hemi-pelvic sex estimation, we still believe that AI will not replace the forensic anthropologists who use morphological methods, given its limitations. Firstly, because deep learning techniques lack interpretability and transparency, they should be applied with caution, however inspiring and promising they are. Although the heatmap test showed that our models mainly focused on the skeletal structures with sexual dimorphism, it is still impossible to explain how and why the models produce a given output for a particular image. Secondly, another factor limiting deep learning is the error in landmark recognition and size between virtual and dry skeletons that is introduced during the 3D reconstruction. Some sex estimation traits, such as the pre-auricular sulcus [23] and pubic symphysis scarring, may be invisible or poorly defined in the 3D virtual models. Therefore, AI techniques can be combined with manual methods to augment the capabilities of forensic anthropologists, but they cannot replace forensic anthropologists.

Conclusion
Herein, we demonstrate the potential of AI techniques based on the radiological dataset in sex estimation of virtual hemi-pelvic models. Despite the limited number of samples available, most of the CNN models, trained with images of various hemi-pelvic anatomical regions using transfer learning, can achieve high accuracies, sensitivities, specificities, and AUC values in sex estimation, even better than human experts with significant anthropological experience. Current AI technology cannot replace forensic anthropologists, but we believe that deep learning will soon become a complementary tool for more accurate sex estimation. We will proceed with this method, test its applicability in prospective forensic trials with much larger, multicentre training datasets, and expand our algorithm to other types of skeletal remains with sexual dimorphism (cranium, mandible, humerus, and femur).
... 82072115 and 81922041), grants from the Ministry of Finance (No. GY2020G-2) and Science and Technology Commission of Shanghai Municipality (No. 17DZ2273200 and 19DZ2292700).