Inter-rater reliability and a training effect of the functional movement screen in police physical training instructors

Abstract The aim of this study was to determine the inter-rater reliability of the Functional Movement Screen (FMS TM) within a police population and determine whether formal training improved reliability and assessment accuracy. Police Physical Training Instructors (PTI) (n = 67) rated 98 randomised videos of officers performing four primary FMS TM tasks (overhead squat, hurdle step, in-line lunge, and rotary stability) twice within a two-day annual training program. A one-hour FMS TM training session was completed between the two assessment periods. PTI scores were compared to a Master score. The level of agreement between raters was low to moderate across all FMS TM items. The inter-rater reliability between the raters did not improve significantly following training. The level of agreement between “Rater” and “Master” scores improved significantly post-training. FMS TM subtest items have varying reliability. Staff training should be performed and compared to a Master rater prior to employing the FMS TM as an assessment tool in law enforcement.

ABOUT THE AUTHOR Associate Professor Rob Orr served in the Australian Army for over 23 years as an infantry soldier, Physical Training Instructor, physiotherapist and human performance officer. Still serving in the Army Reserve, Dr Orr is the director of the Tactical Research Unit at Bond University, teaches in the Doctor of Physiotherapy Program, and supervises masters and doctoral students. Rob has run training for military, law enforcement, and firefighter agencies across the globe providing research, consultancy and educational services. With awards for research outcomes, over 100 tactical publications and numerous annual speaking conference and workshop requirements, Dr Orr serves as the editor of the Tactical Strength and Conditioning Report and on several scientific and organising committees for renowned international conferences.

PUBLIC INTEREST STATEMENT
Police services and other similar occupations use the Functional Movement Screen (FMS) TM to assess physical movement competency. Given the role that this screening tool plays in a police officer's service, it is important that different assessors would score an officer the same way (inter-rater reliability). It is equally important to determine whether training in this tool is needed to improve reliability. The study found that the FMS TM had low to moderate reliability, which did not improve with FMS TM training. However, the individual scores of assessors were closer to the master (correct) score following training. These results suggest that FMS TM training may not improve inter-rater reliability but may improve accuracy in scores. As such, when staff are being trained to implement the FMS TM, their scores should be compared to a "master" score to improve score quality.

Introduction
The nature of law enforcement sees police officers perform tasks that can be unpredictable and take place in potentially dangerous or hostile environments (Bonneau & Brown, 1995). Tasks can range from "attending a domestic incident" to "affecting an arrest" (Decker et al., 2016) and can require officers to pursue suspects on foot to prevent escape and/or to use reasonable force, such as wrestling with an offender (Achterstraat, 2008). The physical nature of these tasks leaves officers at risk of musculoskeletal injury (Peate et al., 2007). As such, police officers are known to suffer a range of musculoskeletal injuries, with soft-tissue sprains and strains commonly caused by a non-compliant offender serving as an example (Lyons et al., 2017).
The Functional Movement Screen (FMS) TM is an assessment tool used in screening for musculoskeletal injury risk in both sporting (Chorba et al., 2010; Kiesel et al., 2011; Letafatkar et al., 2014) and tactical populations (Bushman et al., 2015; Lisman et al., 2013; Peate et al., 2007) and has been associated with some elements of occupational performance in law enforcement (Bock, Stierli, Hinton & Orr, 2016). This tool measures the musculoskeletal movement ability of a participant to identify functional limitations and asymmetries through the use of seven key movement patterns: a deep squat, hurdle step, in-line lunge, shoulder mobility, active straight leg raise, push-up and rotary stability task (Cook et al., 2006). The FMS TM is designed to place the body in challenging positions that may reveal weaknesses, imbalances and compensations should the participant not have sufficient stability (Cook et al., 2006). The FMS TM also includes three clearing tests, namely the shoulder mobility, extension and flexion-clearing tests, which are graded either positive or negative based on pain. With each of the seven movement patterns scored out of three points, the highest score that can be attained in the FMS TM is 21 points.
Previous research has suggested that lower scores, being <14 points, are associated with an increased risk of injury (Chorba et al., 2010;Kiesel et al., 2007;Perry & Koehle, 2013;Schneiders et al., 2011). Two reports (Chorba et al., 2010;Kiesel et al., 2007) found that athletes who scored <14 points had a significantly greater risk of injury when participating in sports. In addition, reports within general populations (Perry & Koehle, 2013;Schneiders et al., 2011) have confirmed a FMS TM score of <14 points is indicative of an increased risk of injury. In tactical populations, marine officer candidates who scored poorly in the FMS TM (<14 points) were more likely to experience an injury during Marine Corps officer training (Lisman et al., 2013;O'Connor et al., 2011).
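The scoring arithmetic described above can be illustrated with a minimal sketch. It assumes the usual FMS TM convention that a movement assessed on both sides contributes the lower of its left and right scores to the total; the item names, the `composite_score` helper, and the example officer are hypothetical and are not taken from the study. The cut-off follows the "<14 points" wording used in the text.

```python
# Hypothetical item names for the seven movement patterns (0-3 points each).
ITEMS = [
    "deep_squat",
    "hurdle_step",
    "inline_lunge",
    "shoulder_mobility",
    "active_straight_leg_raise",
    "push_up",
    "rotary_stability",
]

CUT_OFF = 14  # scores below this value are the "<14 points" group in the text


def composite_score(item_scores):
    """Sum the seven final item scores.

    item_scores maps each item name to either a single score or a
    (left, right) tuple; a tuple contributes its lower side (assumed
    convention for movements assessed on both sides).
    """
    total = 0
    for item in ITEMS:
        score = item_scores[item]
        if isinstance(score, tuple):
            score = min(score)
        if not 0 <= score <= 3:
            raise ValueError(f"{item}: score {score} is outside 0-3")
        total += score
    return total


def below_cut_off(item_scores):
    """True when the composite score falls in the <14 points group."""
    return composite_score(item_scores) < CUT_OFF


# Illustrative (made-up) officer, not study data.
example_officer = {
    "deep_squat": 2,
    "hurdle_step": (2, 3),
    "inline_lunge": (2, 2),
    "shoulder_mobility": (3, 2),
    "active_straight_leg_raise": (2, 2),
    "push_up": 3,
    "rotary_stability": (1, 2),
}
print(composite_score(example_officer), below_cut_off(example_officer))
```

Because a single point on one item can move an officer across the cut-off, disagreement between raters on any one item can change the classification.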
Considering this, if the FMS TM is to be used as a predictor of injury in a law enforcement population, inter-tester reliability is paramount, especially if departmental decisions are going to be based on assessment outcomes. The FMS TM was constructed so that individuals of any skill level could be trained to correctly implement the tool. So far, research has shown that physical therapists and physical therapy students, athletic trainers and athletic trainer students, Certified Strength and Conditioning Specialists and FMS TM certified specialists have good to excellent reliability in scoring the FMS TM (Butler et al., 2012;Minick et al., 2010;Onate et al., 2012;Smith et al., 2013;Teyhen et al., 2012). This research has been conducted on middle school-aged students (Butler et al., 2012;Wright et al., 2015), youth hockey players (Parenteau et al., 2014), physically active adults (Onate et al., 2012;Schneiders et al., 2011), active-duty service members (Teyhen et al., 2012), elite squash players (Leeder et al., 2016), NCAA Division I varsity athletes (Shultz et al., 2013), college students (Gulgin & Hoogenboom, 2014;Minick et al., 2010;Sorenson, 2016), university students (Gribble et al., 2013), and physical therapy students (Smith et al., 2013), with the number of raters ranging from two (Butler et al., 2012;Onate et al., 2012;Schneiders et al., 2011) to 38 (Gribble et al., 2013) and participants from three (Gribble et al., 2013) to 209 (Schneiders et al., 2011).
As yet, the inter-rater reliability of the FMS TM, when conducted by Physical Training Instructors (PTI) serving within a law enforcement agency, has not been determined. Determining the level of agreement between PTIs is important given that different PTIs, from the same or different locations, could assess officers and their findings may impact on the officer and departmental resources. On this basis, the aim of this study was to determine the inter-rater reliability of the FMS TM within a law enforcement PTI population and to determine whether providing formal training in the FMS TM improved inter-rater reliability and accuracy.

Materials and methods
In this retrospective cohort design, videos of a total of 14 (6 women, 8 men) police officers were examined. Raters were recruited from an Australian state police force whose officers were attending a Physical Training Instructor Annual Training session. Of the 67 police PTI training attendees, all 67 consented to participate in the research (22 women: age = 39.3 ± 6.1 years; 45 men: age = 39.3 ± 7.9 years). Criteria for participants were that the participant must: 1) be a serving police employee, and 2) be a qualified PTI. Participant data were excluded if: a) the participant's script could not be clearly determined, or b) the participant did not score according to the FMS TM scoring criteria (e.g. gave ½ scores). Level of experience as a PTI ranged from 1 to 9 years and all PTIs had some level of experience with the FMS TM. The sample of raters was representative of the population. All participants provided written informed consent, and ethics approval for this study was provided by the Bond University Human Research Ethics Committee under protocol RO 1595. Gatekeeper approvals were provided by the Australian state police force from which participants were drawn.
Four of the seven FMS TM tasks were previously recorded displaying a frontal and sagittal view using two Apple iPhone 6 mobile phones set up on a tripod. The position of participants and the camera set-up were standardized for each view. The videos showed a split view of 14 unidentified police officers performing the overhead squat, in-line lunge, hurdle step and rotary stability, with each exposure played three times, followed by 10 seconds in which to score. The push-up assessment, shoulder mobility and active straight leg raise were excluded from this study as the scoring methods needed direct interface with the participant and as such could not be scored by video. Each exposure was presented in a predetermined randomized order, and the order was changed from pre- to post-testing. Participants were given a score sheet on which to record their scores. The movement patterns were scored on a scale of one (1) to three (3) based on the quality of movement. A score of zero (0) could not be used, as pain cannot be determined from video footage.
Over 2 days, the raters were required to attend two FMS TM rating sessions (Session 1 and 2) and a FMS TM education session interspersed between the rating sessions. Session 1 was held on the first day of the conference and consisted of a 50-min data capturing session where the participants scored a total of 98 videos of a randomized sequence of unknown police officers performing the FMS TM .
Following this first session, participants underwent a FMS TM training session in manageable groups of approximately 20. Education on the procedure and scoring of the FMS TM was provided by the same two researchers (BH and KR) who determined a "Master" score. Both researchers are Physical Training Instructors who have over 5 years of PTI experience, a Level 1 and Level 2 FMS TM Certification, and have each conducted over 2,000 Functional Movement Screens in this population. Each group was provided with the same one-hour lecture on the FMS TM movements, scoring, and variable differences. After the education session, the raters were again exposed to a different randomized sequence of the 98 videos, following the same process as Session 1.
For both "pre-training" and "post-training" descriptive analysis and inter-rater reliability, levels of agreement were determined across all raters for each test element, including R, L, and Final Scores where relevant (n = 14 videos for each element). Due to the ordinal nature of individual FMS TM item scores, Kendall's Coefficient of Concordance (W) and the Average Spearman Correlation Over All Raters ($\bar{R}_s$) were calculated to determine the levels of agreement across all raters in the scores they assigned for each element of each assessed FMS TM item, where each element of each item was represented by and assessed across 14 different video clips.
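As an illustration of the two agreement statistics named above, the following is a minimal pure-Python sketch rather than the analysis code used in the study. It assumes a tie-corrected form of Kendall's W (ties are unavoidable when scores can only be 1, 2 or 3) and computes the average Spearman correlation as the mean of all pairwise rater correlations on mid-ranks. The function names and the toy ratings are hypothetical.

```python
from itertools import combinations
from math import sqrt


def midranks(scores):
    """Rank scores 1..n, giving tied scores the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kendalls_w(ratings):
    """Tie-corrected Kendall's W; ratings[j][i] is rater j's score for clip i."""
    m, n = len(ratings), len(ratings[0])
    ranked = [midranks(r) for r in ratings]
    rank_sums = [sum(r[i] for r in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((x - mean_sum) ** 2 for x in rank_sums)
    # Tie correction: sum of (t^3 - t) over every group of tied scores per rater.
    ties = sum(r.count(v) ** 3 - r.count(v) for r in ratings for v in set(r))
    denom = m ** 2 * (n ** 3 - n) - m * ties
    return 12 * s / denom if denom else float("nan")


def spearman(x, y):
    """Spearman correlation as Pearson correlation of mid-ranks (None if undefined)."""
    rx, ry = midranks(x), midranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else None


def mean_pairwise_spearman(ratings):
    """Average Spearman correlation over all rater pairs, skipping undefined pairs."""
    values = [spearman(a, b) for a, b in combinations(ratings, 2)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else float("nan")


# Toy data: 3 hypothetical raters scoring 6 clips of one item element (1-3 scale).
toy = [[1, 2, 3, 2, 1, 3], [1, 2, 2, 2, 1, 3], [2, 2, 3, 1, 1, 3]]
print(kendalls_w(toy), mean_pairwise_spearman(toy))
```

When there are no ties, the two statistics are linked by the identity mean r_s = (mW − 1)/(m − 1) for m raters, which is why they tend to tell the same story; with tied scores the identity holds only approximately.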
Average Spearman Correlations were then calculated to determine levels of concordance between "Master" scores and rater scores for each FMS TM item (with multiple videos and associated scores for each item from each individual rater) both pre- and post-training. The "Master" score was determined by the two researchers (BH and KR) as mentioned above. Frequency distributions (charts and tables) of numeric differences between rater scores and master scores pre- and post-training were compiled to visually determine the extent to which rater scores differed from master scores (as a measure of criterion-related validity), and to determine if these distributions changed with training. Related Samples Wilcoxon Signed Rank Tests were then conducted to examine the statistical significance of any visually apparent differences in these frequency distributions.
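The comparison against the "Master" score can be sketched as follows. This is an illustrative example under stated assumptions, not the study's analysis: it pairs one hypothetical rater's absolute deviation from the master score before and after training, and it uses the normal approximation to the Wilcoxon signed-rank test with a tie correction and no continuity correction. The text does not specify exactly which quantities were paired, so the pairing, the function names and the data are all assumptions.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def difference_distribution(rater_scores, master_scores):
    """Frequency of each numeric difference (rater minus master)."""
    return Counter(r - m for r, m in zip(rater_scores, master_scores))


def midranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def wilcoxon_signed_rank(before, after):
    """Return (W_plus, W_minus, two-sided p) using the normal approximation."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, float("nan")
    abs_diffs = [abs(d) for d in diffs]
    ranks = midranks(abs_diffs)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Variance of W_plus under the null, reduced for tied absolute differences.
    tie_term = sum(t ** 3 - t for t in Counter(abs_diffs).values())
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - n * (n + 1) / 4) / sqrt(variance)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, w_minus, p


# Made-up scores for one hypothetical rater across 8 clips (not study data).
master = [2, 2, 1, 3, 2, 1, 2, 3]
pre = [3, 1, 2, 2, 2, 2, 2, 2]
post = [2, 2, 1, 3, 2, 1, 2, 2]

print(difference_distribution(pre, master))
print(difference_distribution(post, master))

abs_dev_pre = [abs(r - m) for r, m in zip(pre, master)]
abs_dev_post = [abs(r - m) for r, m in zip(post, master)]
print(wilcoxon_signed_rank(abs_dev_pre, abs_dev_post))
```

A distribution that concentrates on a difference of zero after training, together with a small p-value for the paired test, would correspond to the pattern of improved agreement with the master score reported in this study. With samples this small an exact test would normally be preferred over the normal approximation used here.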

Results
In total, 67 raters scored 98 different video clips across the FMS TM items (Overhead Squat, n = 14; Hurdle Step Left, n = 14, and Right, n = 14; In-line Lunge Left, n = 14, and Right, n = 14; Rotary Stability Left, n = 14, and Right, n = 14). The resulting Kendall's Coefficients of Concordance (W) and Average Spearman Correlations ($\bar{R}_s$) over all raters pre- and post-training are shown in Table 1. These results suggest that the level of agreement between raters was low to moderate across all items, with rotary stability on the right side having the lowest reliability pre-training and post-training (Table 1). The left inline lunge and hurdle step demonstrated the highest reliability across the FMS TM items pre- and post-FMS TM training, respectively. Considering this, it can be seen that the levels of agreement across all raters improved post-training in 50% of FMS TM sub-items (overhead squat, left hurdle step, left, right and final rotary stability), but were poorer post-training in the other 50% (right hurdle step, final hurdle step, left, right and final inline lunge). The differences in levels of agreement of FMS TM sub-items from pre-training to post-training were largest in the overhead squat, inline lunge on the left, inline lunge total, rotary stability on the right and final rotary stability scores. Overall, the hurdle step had the smallest changes in pre-training to post-training FMS TM sub-item scores.
While the inter-rater reliability between the general "Raters" did not improve consistently following training (indicating some ongoing variability in scores allocated for specific video clips), the level of agreement between "Rater" and "Master" scores did significantly improve post-training for each of the FMS TM items assessed (Table 2). This suggests that, following training, "Raters" allocated scores that were much more consistent with Master scores for the same FMS TM items.

Discussion
The aim of this study was to determine the inter-rater reliability of the FMS TM within a law enforcement PTI population and to determine whether providing formal training in the FMS TM improved inter-rater reliability and accuracy. The results suggest that the inter-rater reliability of the four individual FMS TM components assessed was only low to moderate, improving to moderate following training. While training did not significantly improve agreement among raters, agreement with the "Master" score did improve from pre-training to post-training, meaning that the PTI scores were more closely related to the "Master" score. This study differed from others in multiple ways: the use of PTIs as raters, the introduction of FMS TM training between sessions, and the comparison of raters of variable experience to Master raters.
No other known studies have examined the reliability of the FMS TM using police PTIs as examiners. The majority of studies examine the inter-rater reliability of the FMS TM using physical therapists, athletic trainers, Certified Strength and Conditioning Specialists and FMS TM certified specialists. The closest example to law enforcement is the reliability of physiotherapy students rating the FMS TM on armed service members (Teyhen et al., 2012); however, no service members were used to rate the FMS TM. In general, the literature demonstrates a high level of inter-rater reliability for the FMS TM composite score and a moderate to high level of inter-rater reliability for the FMS TM items (Bonazza et al., 2017; Cuchna et al., 2016; Stobierski et al., 2015). A FMS TM composite score was not calculated in this study due to the removal of three FMS TM test items; however, the remaining FMS TM items had an overall low to moderate reliability (r = 0.425). The overall FMS TM reliability in this study using four items was lower than documented in previous literature. Composite or FMS TM total reliability scores have shown acceptable inter-rater reliability, with ICC values of 0.76-0.98 (Bonazza et al., 2017). While the results of this study differ, the findings are supported by one study which reported a Krippendorff's alpha value of α = 0.38 (Shultz et al., 2013). That study was unable to conclude if the results were due to the ambiguity of scoring criteria or the need for improved rater training (Shultz et al., 2013). The three excluded test items, being the active straight leg raise, trunk stability push up, and the shoulder mobility tests, have previously demonstrated high reliability (Minick et al., 2010; Onate et al., 2012; Parenteau et al., 2014). As such, the removal of these items from this study (due to the inability to assess them via video) may have contributed to the overall lower reliability values found in this study.
Although the FMS TM composite score was not calculated, the FMS TM subtest items displayed considerable variability. Few studies have documented higher variability in the inter-rater reliability of the FMS TM subtest items (Onate et al., 2012; Parenteau et al., 2014; Shultz et al., 2013), with Kappa values ranging as low as 0.26 (Parenteau et al., 2014). A common theme among all studies, when examined closely, is the lower values for the hurdle step, rotary stability and in-line lunge subtest items (Minick et al., 2010; Onate et al., 2012; Parenteau et al., 2014; Shultz et al., 2013). Among the 67 raters, during the pre-training assessment, the rotary stability (right and final), overhead squat and inline lunge (right) demonstrated the lowest reliability for the FMS TM subtest items. When re-evaluated after the FMS TM training, the right rotary stability and in-line lunge (final) presented with the lowest reliability scores. These findings are supported by one report which compared the reliability of a small group of novices to experts, and cross-examined the values (Minick et al., 2010). In relation to this study, they reported lower FMS TM sub-item values for the lunge tests and rotary final, with values ranging from 0.40 to 0.54 (Minick et al., 2010). In general, the inter-rater reliability values for the FMS TM subtest items in this study are lower than in previous reports. The subtest items may have lower values in this report due to the larger sample size of FMS TM raters or the different statistics used. Few studies have performed inter-rater reliability studies with over 20 raters (Gribble et al., 2013; Leeder et al., 2016). In addition, the majority of inter-rater reliability studies have reported intraclass correlation coefficients (ICC) with a 95% confidence interval (Gulgin & Hoogenboom, 2014; Leeder et al., 2016; Onate et al., 2012; Teyhen et al., 2012).
The low inter-rater reliability on the FMS TM subtest items is of note because poor reliability of a single item of the FMS TM may negatively influence the FMS TM total score and lead to false assumptions (Onate et al., 2012).
Multiple studies have compared and analysed the rater experience level in relation to FMS TM reliability (Gulgin & Hoogenboom, 2014; Minick et al., 2010; Smith et al., 2013; Sorenson, 2016). These studies have not looked at the relationship between experienced raters and raters of general experience before and after a FMS TM training intervention. When comparing the inter-rater reliability of uncertified versus certified FMS TM examiners, good reliability has been documented (Onate et al., 2012). One report also notes that certification and FMS TM experience do not seem to improve scoring consistency (Smith et al., 2013). When cross-examining novice versus experienced scorers, Minick et al. (2010) found that novice scorers demonstrated six excellent levels of agreement, as opposed to expert scorers who achieved only four excellent levels of agreement. Supporting this discrepancy, another study revealed that raters with less than 1 year of experience had fair inter-rater reliability, while raters with more than 2 years' experience demonstrated poor inter-rater reliability (Shultz et al., 2013). The findings of these reports demonstrate that experienced raters can have lower inter-rater reliability scores than novice examiners, which emphasizes the importance of having all FMS TM raters retrained to ensure the highest levels of inter-rater reliability and accuracy. As summarized by Shultz et al. (2013), to have confidence in the screen when using the FMS TM for clinical purposes, a reliability test should be performed by clinicians and researchers using their own staff and population. This study further reiterates the importance of FMS TM training by expert raters in the profession prior to administering and scoring the FMS TM to strengthen the reliability of the screen.
To date, no study has examined the inter-rater reliability after a FMS TM training session. One of the primary aims of this study was to determine whether providing formal training in the FMS TM improved inter-rater reliability. Representative of all examiners in the profession, each examiner in this study had variable levels of training and experience with the FMS TM. Contrary to our expectations, when comparing PTI "Raters" following FMS TM training, our findings showed that the inter-rater reliability did not consistently improve following a FMS TM training course. Although this change was not significant, the general PTI "Raters" were also compared to a "Master" examiner and the level of agreement between "Rater" and "Master" scores did significantly improve. This finding is particularly important as it demonstrates a shift in rater scores away from a central tendency and towards the correct score. After training, agreement of the FMS TM test item scores with the "Master" rater increased by an average of 22%. As demonstrated in Table 2, after the FMS TM training session, rater scores moved towards the correct FMS TM score as provided by a "Master" examiner. Previous inter-rater reliability studies as described in various systematic reviews (Bonazza et al., 2017; Cuchna et al., 2016; Stobierski et al., 2015) have not examined whether raters are able to accurately determine the correct FMS TM score. Practically, this means that if a department were to implement the FMS TM as a predictor of injuries, it should ensure that ample training is provided for assessors. This training could be in the form of FMS TM certification, or through mentorship or supervision by a "master assessor". This would support both a higher level of agreement between raters for comparisons and greater accuracy of results.
Due to the nature of the study design, which was set up across a two-day annual training session, a limitation of this study may include a learning effect. Participants were exposed to 98 videos twice across the 2 days for pre- and post-assessment and raters may have recalled the videos during the second assessment. Although the videos were randomized, the raters may have recognized the officers and their movement patterns, which could have influenced the FMS TM score provided. Another limitation of this study was the use of two Master raters who also provided the FMS TM education intervention. A final limitation was the use of only four of seven components of the FMS TM. Although previous literature has documented moderate to excellent reliability in these areas, due to the removal of three items, the reliability of the FMS TM composite score or the other three variables cannot be verified with this study. Video recording to determine inter-rater reliability was considered as a limitation of this study; however, upon review of the literature, the use of video for FMS TM inter-rater reliability has been well documented and has demonstrated excellent reliability (Parenteau et al., 2014; Shultz et al., 2013). Future studies may want to confirm the improved accuracy after an FMS TM training intervention, examine reasons for the variable FMS TM subtest item inter-rater reliability, and examine reliability in other law enforcement and tactical populations.

Conclusion
The findings of this paper confirm the variability of the FMS TM subtest items using a large sample size and emphasize the importance of FMS TM training of staff prior to conducting video analysis. It is imperative that clinicians scoring the FMS TM are retrained by their own experienced staff members and perform reliability tests against a Master rater prior to administering the FMS TM. This baseline is essential in ensuring the accuracy of the FMS TM scores. Without confidence that the FMS TM scores are accurate, the FMS TM cannot be utilized to predict injury or abnormal movement patterns.