Wear debris from total joint replacements: evaluation of automated categorisation by scale-invariant feature transforms

Abstract
Wear debris is a crucial factor in determining the lifespan of a total joint replacement. Not only do particulates between bearing surfaces increase wear rates through third-body abrasion, but immune reactions to them can lead to inflammation and osteolysis. In this paper, the use of computer vision to analyse and classify scanning electron microscope images of debris was investigated. UHMWPE debris was generated using an in vitro simulator or a linear tribometer; images were analysed using scale-invariant feature transforms and a support vector machine classifier. The accuracy was 77.6% with a receiver operating characteristic area under curve of 92%.


Introduction
The characteristics of wear debris can be related to the wear mechanics of a bearing system (Anderson 1982). These characteristics have particular importance in joint replacement implants as not only do they indicate the wear regime (Kumar et al. 2013), but they also influence the immune reactions of the patient (Hallab & Jacobs 2009).
The wear of joint replacement implants has been studied in detail (Punt et al. 2011; Xin et al. 2013; Neukamp et al. 2014; Moghadas et al. 2015) and the debris from these studies is often examined using scanning electron microscopy (Nine et al. 2014), as it has excellent imaging properties at the magnification ranges required to produce highly detailed micrographs of debris. While it is easy to characterise debris using simple metrics such as equivalent circle diameter (ECD), aspect ratio (AR) and roundness, the task of categorising the contents of an image in qualitative terms, i.e. morphology, remains a significant challenge.
Various studies have attempted to create bespoke wear debris recognition algorithms, making use of the various properties of the debris particles (Kirk et al. 1995; Zhang et al. 1997; Podsiadlo & Stachowiak 2000). For example, partition-iterated function systems (PIFS) (Stachowiak & Podsiadlo 1999) and wavelet (Chen et al. 2006) methods take advantage of the fractal nature of wear debris, i.e. there are recurring features at various scales. These methods have been shown to be extremely powerful and highly accurate.
At the time of development of these classifiers, the field of computer vision was nascent and emerging methods and algorithms were performing relatively poorly (Boiman et al. 2008).
However, advancement in the field has occurred quickly throughout the mid-late 2000s (Everingham et al. 2010).
There are two key areas of improvement within this field: the classifier and the descriptor quantiser. The classifier is the mechanism for classifying objects based on the data fed to it; classifiers can be learning based or non-parametric. Common classifiers are support vector machine (SVM) (Cortes & Vapnik 1995; Zhang et al. 2006), decision tree (Bosch et al. 2007) and nearest neighbour (Boiman et al. 2008) based classifiers. The descriptor quantiser is the algorithm a computer uses to interpret an image; it generates quantitative data that describe what is pictured within. These data can be as simple as colour histograms or object size/aspect ratio, or can be more complex properties such as fractal dimensions, geometric blur (Zhang et al. 2006) or the descriptor quantiser used in this paper, scale-invariant feature transforms (SIFT).
A comparison of various descriptor algorithms was performed by Mikolajczyk and Schmid (2005); their results showed SIFT performed well against comparable methods for locating and describing key points. This algorithm combined with the SVM classifier performed well in the Pascal Visual Object Classes Challenge 2012 (Everingham et al. 2015). Additionally, all the highest scoring methods for classification and detection used an SVM for classification (Everingham et al. 2015).
Wear debris analysis is of particular interest for orthopaedic applications, as wear debris has been linked to the immune reactions leading to osteolysis (Purdue et al. 2006). This reaction to wear debris has been shown to be affected by size and shape (Green et al. 1998, 2000; Yang et al. 2002). Although wear debris is regularly characterised (Kumar et al. 2013; Nine et al. 2014; Hongtao et al. 2011; Eckold et al. 2015; Saikko et al. 2015), there have been few recent advances beyond simple size and shape attributes. Using more sophisticated tools for the analysis of wear debris, new insights could be gained into which debris and wear regimes are more likely to lead to failure. New orthopaedic implants could then be designed to avoid this debris.
This paper describes an implementation of, and the viability of using, an open-source, well-regarded and robust generic object recognition algorithm for wear debris analysis from scanning electron microscope (SEM) micrographs. The aim is to introduce methodologies from outside disciplines with greater experience in computer vision, thereby allowing biomedical engineers and tribologists to analyse SEM images without the need to reinvent tools found elsewhere. By removing the obstacle of creating a program that can recognise debris, greater comparison can be made between papers on the subject of wear debris analysis. The algorithm used in this paper is known as SIFT, and was invented by Lowe (2004); the implementation is known as VLFeat, from the Oxford Vision Laboratory (Vedaldi & Fulkerson 2010b, 2010c).

Debris generation
Wear debris was generated, isolated and then imaged using SEM. The material used to generate the debris was ultra-high molecular weight polyethylene (UHMWPE) and it was generated using two different methods: (1) a Bose ElectroForce Spinal Disc Fatigue/Wear system (SD-F/W) (Bose Corp., ElectroForce Systems Group, Eden Prairie, Minnesota, USA); or (2) a high-frequency reciprocating rig (HFRR) (PCS Instruments, London, United Kingdom).
The in vitro UHMWPE debris was created in a study by Moghadas et al. (2015) using the Bose SD-F/W to wear test a Charité total disc replacement (TDR) implant. This device has two cobalt-chrome-molybdenum (CoCrMo) concave end-plates and a central convex UHMWPE core (Figure 1). The Charité was loaded sinusoidally at 2 Hz between 500 and 2000 N, according to ISO standard 18192-1:2011 (British Standards Institution 2011b). Rotations were between +6° and −3° in flexion/extension, and between +2° and −2° in both lateral bend and rotation (Figure 2). The lubricant in the Bose SD-F/W was bovine serum albumin (BSA) (30 g l⁻¹ protein content) (Sera Laboratories Int, West Sussex, United Kingdom).
The HFRR is a linear reciprocating motion ball-on-disc tribometer. The test used to generate debris used a 6-mm diameter ANSI E52100 steel ball (PCS Instruments, London, United Kingdom) on GUR 1120 UHMWPE discs (Orthoplastics, Lancaster, UK). The tribometer used a 2-mm stroke length. In one test, the ball was roughened with P400 grit wet and dry paper (3M, St. Paul, Minnesota, USA) to Ra = 0.5 μm to increase the amount of abrasive debris generated; in this test, a 20 Hz frequency was used. To generate the adhesive wear, a smooth ball was used (Ra = 0.05 μm) at a frequency of 25 Hz. The lubricant used in the HFRR was ultrapure deionised water (resistivity > 18 MΩ cm, inorganic content < 2 ppb), and was selected to simplify the tribological mechanisms involved in generating and controlling wear debris morphology. The use of ultrapure water simplified the classification of debris particles by avoiding large-scale lubricant-borne contaminants.

Debris isolation
The wear debris was collected from the lubricants through vacuum filtration. The bovine serum was first digested to remove attached proteins and other biological contaminants using the hydrochloric acid (HCl) method outlined in BS ISO 17853:2011 (British Standards Institution 2011a). This method uses hydrochloric acid at 50 °C to digest the biological content in BSA; the digest was then diluted with methanol to reduce the viscosity for filtering. Subsequent to digestion, the debris-containing fluid was filtered through 0.1 μm Nuclepore filters (Whatman International Ltd, Maidstone, United Kingdom) in a vacuum filtration system. Each filter was then mounted on an SEM aluminium stub and sputter coated in gold for 60 s at 30 mA using an Agar automatic sputter coater (Agar Scientific, Elektron Technology UK Ltd, Essex, United Kingdom).

SEM imaging
SEM images were taken on either a Jeol 7000F FEG-SEM (Jeol Ltd., Tokyo, Japan) or an FEI DualBeam FIB-SEM (FEI, Hillsboro, Oregon, USA). Electron voltages were 10 kV unless the debris started to suffer from beam damage (e.g. swelling or cracking), in which case the beam voltage was lowered to 5 kV. Secondary electron images were collected from the typical Everhart-Thornley detector on the Jeol 7000F; however, images generated by the FEI system used the 'through lens detector' on 'ultra-high resolution' mode.
Images were focused and corrected for astigmatism, and the magnification was chosen to achieve a full-frame image of the particle. A long scan of 26 s was taken without averaging to produce a clear, low-noise image that was saved as a TIF file.

Image processing & machine learning
The images were sorted by humans into classes based on the following criteria:
• Adhesion: wear debris generated by the HFRR in the smooth ball, higher frequency setting (Figure 3(a) and (b)).
• Chip: wear debris generated by the HFRR in the rough ball setting (Figure 3(c) and (d)).
• Large sphere: wear debris that was > 5 μm in diameter and appeared to be of a spherical shape (Figure 4(c) and (d)).
• Flake: wear debris that appeared flat and sheet-like (Figure 4(e) and (f)).
Images were processed using the MATLAB image processing toolbox to remove the background. This ensured that features on the filter that may be common between different images did not cause erroneous positive matches that were not based on the particle morphology. The background was removed from the image using edge-based image segmentation; the outline of the particle was found using 'Canny' edge detection (Canny 1986). Examples of wear debris once the background had been stripped are shown in Figures 3 and 4.
SIFT descriptors were generated for the images in the form of 128-dimension vectors that described the gradient vector of intensity within a 4 × 4 grid. Five images of each class were selected pseudo-randomly to be training images. To increase computational efficiency, rather than comparing the descriptors of unknown images against the entirety of the descriptor data of the training images, clusters of similar descriptors were found and averaged. The individual means of these clusters formed the 'words' used in the solver's vocabulary; this is known as a 'bag of words' method. An example using four clusters of random coordinate data is shown in Figure 5; the blue dots are the cluster means, and the lines are the Voronoi polygons that divide the clusters.
To generate the 'words' used to describe the particles, all the descriptors were concatenated, and the means of the clusters of descriptors were found. The mean value for each cluster was found using a k-means algorithm (Hartigan & Wong 1978), where k is the number of 'words'. The number of 'words' was chosen based on a preliminary study that measured the accuracy of the solver and the computational time. As shown in Figure 6, the complexity was linear with respect to the number of 'words', but the accuracy did not improve beyond 600 'words'.
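As an illustration of how such a vocabulary can be formed, the following is a minimal sketch in Python rather than the VLFeat/MATLAB implementation used in this study. It uses plain Lloyd iterations with a deterministic farthest-first initialisation, both of which are assumptions made for brevity and not the Hartigan & Wong variant cited above; the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centres(points, k):
    """Farthest-first initialisation (an assumption, chosen to be deterministic)."""
    centres = [list(points[0])]
    while len(centres) < k:
        # Pick the point whose nearest existing centre is farthest away.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centres))
        centres.append(list(far))
    return centres


def build_vocabulary(descriptors, k, iterations=20):
    """Return k cluster means ('words') from a list of descriptor vectors."""
    centres = init_centres(descriptors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            nearest = min(range(k), key=lambda j: sq_dist(d, centres[j]))
            clusters[nearest].append(d)
        for j, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster is empty
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return centres


# Toy 2-D 'descriptors' forming two groups (real SIFT descriptors have 128 dimensions).
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
words = build_vocabulary(toy, k=2)
```

In this toy case the two 'words' are simply the centroids of the two groups of points; with real data, k would be of the order of the 600 'words' reported above.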
For rapid nearest neighbour searches, i.e. finding the closest matching 'word' for the descriptor in Euclidean space, the vocabulary was indexed by generating a k-d tree (Bentley 1975). This is an efficient way to find the 'word' with the shortest orthogonal distance between descriptor vectors.
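The indexing step can be sketched as follows. This is a simplified, exact k-d tree written for illustration only; it is not the VLFeat implementation, and the dictionary-based node layout and function names are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """Recursively split the points on alternating axes at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return (squared distance, point) of the stored point closest to query."""
    if node is None:
        return best
    d = sq_dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only search the far side if the splitting plane is closer than the best match.
    if diff ** 2 < best[0]:
        best = nearest(far, query, best)
    return best


vocabulary = [(0.5, 0.5), (10.5, 10.5), (5.0, 0.0), (0.0, 9.0)]
tree = build_kdtree(vocabulary)
closest_word = nearest(tree, (4.0, 1.0))[1]
```

The pruning test in `nearest` is what avoids comparing the query against every 'word'; a brute-force search over the vocabulary gives the same answer but visits all entries.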
To compare images efficiently, each descriptor was assigned to its closest 'word' (provided the distance did not exceed a threshold), and a histogram of how many descriptors matched each 'word' was computed. The histogram can be seen as a compact précis of the descriptor data.
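A minimal sketch of this tallying step is given below. The threshold value, the use of squared Euclidean distance and the function name are assumptions for illustration; a brute-force search is used here in place of the k-d tree to keep the example short.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def word_histogram(descriptors, vocabulary, max_sq_dist):
    """Count, for each 'word', the descriptors whose closest match it is.

    Descriptors farther than max_sq_dist from every 'word' are not counted.
    """
    hist = [0] * len(vocabulary)
    for d in descriptors:
        dists = [sq_dist(d, w) for w in vocabulary]
        j = min(range(len(vocabulary)), key=dists.__getitem__)
        if dists[j] <= max_sq_dist:
            hist[j] += 1
    return hist


vocabulary = [(0, 0), (10, 10)]
descriptors = [(0, 1), (1, 0), (9, 10), (5, 5), (50, 50)]
hist = word_histogram(descriptors, vocabulary, max_sq_dist=4)
```

Here two descriptors fall on the first 'word', one on the second, and the remaining two exceed the threshold and are ignored, so an image with any number of descriptors is reduced to a vector with one entry per 'word'.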
The SVM is a binary solver: it only determines whether or not something belongs to a class. To train the SVM for a class, it requires both the histograms of the training images which belong to that class and all of the training histograms that do not. The SVM used is a linear SVM; however, a homogeneous kernel map is used to approximate the χ² kernel (Vedaldi & Zisserman 2012). This reduces the computation time to train the SVM dramatically while being indistinguishable in performance (Vedaldi & Zisserman 2012).
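The two ideas in this paragraph can be sketched as follows. The first function builds the +1/−1 labels for a one-class-versus-the-rest problem; the second evaluates the exact additive χ² kernel in its commonly used form, k(x, y) = Σ 2·xᵢ·yᵢ / (xᵢ + yᵢ), which is assumed here to be the kernel that the homogeneous kernel map approximates. The kernel map itself and the SVM training are not reproduced, and the function names are hypothetical.

```python
def one_vs_rest_labels(image_classes, target):
    """+1 for training images of the target class, -1 for all others."""
    return [1 if c == target else -1 for c in image_classes]


def l1_normalise(hist):
    """Scale a histogram so that its entries sum to one."""
    total = sum(hist)
    return [h / total for h in hist] if total else list(hist)


def chi2_kernel(x, y):
    """Exact additive chi-squared kernel between two non-negative histograms."""
    return sum(2 * a * b / (a + b) for a, b in zip(x, y) if a + b > 0)


training_classes = ["chip", "flake", "chip", "adhesion"]
labels = one_vs_rest_labels(training_classes, "chip")

same = chi2_kernel(l1_normalise([2, 2]), l1_normalise([5, 5]))
disjoint = chi2_kernel(l1_normalise([3, 0]), l1_normalise([0, 7]))
```

For L1-normalised histograms the kernel is 1 for identical distributions and 0 for histograms with no 'words' in common, which is why it is a natural similarity measure for bag-of-words data.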

Performance analysis
For each class listed in 2.4, the SVM is trained on a random subset of all the images; the SVM is taught what is a member of a given class vs. all other classes. Once trained, all non-training images are scored against each given class.
The robustness of the method was tested with repeated random subsampling cross validation. By varying the seed used in the pseudo-random number generator, different images were selected for training and testing, so that the accuracy of the image classification algorithm was assessed using different combinations of training and test images. The process of training and assessing images was repeated five times with different seeds and the results were averaged; the number of matches for each class was then found. For each random permutation of images, half were used for training, and half for testing.
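The resampling scheme can be sketched as below. The seed values, the use of the sample standard deviation and the `evaluate` callback (a stand-in for training and scoring the SVM) are assumptions made for illustration.

```python
import random
import statistics


def random_split(n_images, seed):
    """Shuffle image indices with a seeded generator and split them in half."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    half = n_images // 2
    return indices[:half], indices[half:]


def repeated_subsampling(n_images, evaluate, seeds=(0, 1, 2, 3, 4)):
    """Run evaluate(train, test) once per seed; return mean and standard deviation."""
    scores = [evaluate(*random_split(n_images, seed)) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


def dummy_evaluate(train, test):
    """Placeholder scorer: the fraction of images held out for testing."""
    return len(test) / (len(train) + len(test))


train_idx, test_idx = random_split(10, seed=0)
mean_score, sd_score = repeated_subsampling(10, dummy_evaluate)
```

Because each seed fixes the shuffle, any one split can be reproduced exactly, while different seeds expose how sensitive the accuracy is to the choice of training images.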
The number of matches found was subdivided by the class to which each image truly belonged, thereby generating a confusion matrix of true and false positives.

Confusion matrix
Table 1 shows the overall accuracy of the SVM for all the classes; the average accuracy ± the standard deviation was 77.60 ± 4.56%. The accuracy was found by taking the mean of the true-positive rates (the diagonal of the confusion matrix, Table 2) of the solver. Table 2 shows the confusion matrix; this shows the percentage of images found to be a match by the SVM. The rows show which class of debris the SVM was classifying, and the columns show how many matches were found from each class the image actually belongs to.
Table 2. Confusion matrix of SVM. Rows indicate the particle class the SVM has been trained to find and the columns indicate the per cent of particles determined to be a match.

Timing
Using a 64-bit 2.2 GHz Intel Sandy Bridge E5-2660 CPU with 32 GB available memory, computing the SIFT descriptors and frames took 1.0016 s per image. The MATLAB version was R2015a and the implementation of the SIFT algorithm is from VLFeat 0.9.20.

Receiver operating characteristic curves
Figure 7 shows the receiver operating characteristic (ROC) curves for each class of debris. The ROC curves are a measure of the true positive rate (recall) against the true negative rate at different discrimination thresholds. A good classifier will have both a high rate of true positives and true negatives (the area under curve [AUC] will approach 100%). Random chance is shown with a dashed red line.
Figure 7. ROC curves of the SVM for each class. (a) ROC curve for fibril debris classification; (b) ROC curve for adhesive particle classification; (c) ROC curve for spherical particles; (d) ROC curve for chips; (e) ROC curve for sheets/flakes. The average AUC of the ROC curves was 92.28 ± 6.49%. The dashed red line is the ROC curve for random chance, the blue line is the mean ROC curve and the shaded area is the standard deviation of the mean curve. AUC = area under curve.
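To make the ROC construction concrete, the following sketch computes the curve as true positive rate against true negative rate over all score thresholds and integrates it with the trapezium rule. The scores and labels are invented toy values, not data from this study, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (true negative rate, true positive rate) pairs, one per threshold.

    labels are +1 for members of the class and -1 otherwise; an image is
    called a match when its score is at or above the threshold.
    """
    n_pos = sum(1 for l in labels if l == 1)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)) + [float("inf")]:
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        tn = sum(1 for s, l in zip(scores, labels) if s < t and l != 1)
        points.append((tn / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezium-rule area under a curve given as (x, y) pairs ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [1, 1, -1, -1]
perfect_auc = area_under_curve(roc_points([0.9, 0.8, 0.2, 0.1], labels))
mixed_auc = area_under_curve(roc_points([0.9, 0.4, 0.6, 0.1], labels))
```

With perfectly separated scores the area is 1 (100%); in the second toy case one non-member outscores one member, so three of the four member/non-member pairs are ranked correctly and the area falls to 0.75.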
Discussion
This study has examined the suitability of SIFT and a parametric classification algorithm typically used for general computer vision, applied specifically to wear debris analysis. It assesses the overall accuracy, as well as the response within each class of wear debris. The use of image analysis and machine learning for the automated characterisation of debris greatly speeds up the analysis of large quantities of SEM images. It has been shown that general computer vision techniques are applicable for examining micrographs of wear debris when given training images of different debris morphologies, and that they are comparable to bespoke methodologies (Stachowiak et al. 2008; Kumar et al. 2013). While computers have long been used to do basic analysis for finding size, AR and roundness of particles, the recognition of what an image contains remains a challenging problem in all fields of computer science. By using more general machine learning and image analysis tools than those used in previous studies of debris analysis, the breakthroughs made outside of the field of tribology can be re-purposed for examining wear debris with greater accuracy and without the inefficiency of recreating redundant methods.
Wear debris analysis commonly suffers from the subjective nature of interpreting images; by training a computer vision algorithm using debris either of pronounced class, or generated in a sterile environment that greatly favours certain wear regimes, characterising the debris with known confidence levels is possible.

Accuracy
This method shows a high overall classification accuracy, correctly identifying debris 77.6% of the time. An accuracy of 77.6% was in line with the capabilities of similar methods using the same descriptor generation and classifier when performed on the Caltech-101 and the PASCAL VOC 2007 classification challenge (Vedaldi & Fulkerson 2010a; Chatfield et al. 2011). Chatfield et al. (2011) found the accuracy of various computer vision methods, including the method presented in this paper, was between 72.3 and 77.78% (SIFT scored 73.77 ± 0.70) on the Caltech-101. Therefore, with no loss in accuracy, the SIFT-based method can be adapted for wear debris recognition. It was found that the accuracy does vary between different random seeds, implying that the quality of training images has some impact on the accuracy of the SVM. The effect of 77.6% accuracy on the reported distributions of wear debris classes is dependent on the prior distribution of each class (Lee 2012). However, the particle classes that take up the majority share will still be reported as such.
However, for 'feature poor' debris like sheets/flakes (Figure 4(e) and (f)), the classifier accuracy was <60% (although still better than chance). While it may appear that large spheres should be 'feature poor', the SIFT algorithm also uses the particle's perimeter as a feature, where the circular outline is unique to spherical particles. Considering only a single descriptor type was used for characterisation, the scores show high accuracy compared to other methods (Chatfield et al. 2011). It is possible that with the use of a more sophisticated classifier, for example, one that uses multiple metrics to describe the image, the classification will be less prone to error when analysing particles with few key features.

ROC curves
As shown in Figure 7, the AUCs of the ROC curves for large spheres, fibrils and adhesive particles are all >90%, demonstrating that the SVM is highly capable of correctly identifying debris of these morphologies without erroneously including incorrect matches. The AUC for sheets and chips was >80%.

Applications to biomedical engineering
The field of biomedical engineering places great importance on wear debris analysis, since the debris has such a pronounced effect on the life of an implant (Harris 1995; Green et al. 2000). However, comparisons between papers from different research groups are challenging, due to both the subjective nature of debris characterisation and the variety of methodologies used. Some efforts have been made to create debris quantifiers, but these have yet to be adopted by the community as a whole, despite the methods having been published for some time (Kumar et al. 2013). It is the intention that by demonstrating the viability of using freely available machine learning and computer vision techniques developed by specialists in computer science, tribologists will be able to produce and reproduce comparable results without the need to perform redundant development of complex computer algorithms to analyse images.

Conclusion
This paper has investigated the accuracy of using an SVM classifier to characterise SEM images quantised with SIFT descriptors.
• The overall accuracy and rates of misclassification are comparable with general computer vision papers, demonstrating these methods are suitable for wear debris analysis.
• Debris classes that contained particles which had many morphological features were classified at a greater rate with fewer false positives than classes with fewer features.
• By factoring size and AR data into the SVM, a general particle classifier could easily be constructed, making comparisons between studies possible.

Disclosure statement
No potential conflict of interest was reported by the authors.