The expert eye? An inter-rater comparison of elite tennis serve kinematics and performance

ABSTRACT This study examined the reliability of expert tennis coaches/biomechanists in qualitatively assessing selected features of the serve with the aid of two-dimensional (2D) video replays. Two expert high-performance coaches rated the serves of 150 male and 150 female players across three different age groups from two different camera viewing angles. Serve performance was rated across 13 variables that represented commonly investigated and coached serve mechanics using a 1–7 Likert rating scale. Each expert performed a total of 7800 ratings. The reliability of the experts’ ratings was assessed using Krippendorff’s alpha. Strong agreement was shown across all age groups and genders when the experts rated the overall serve score (0.727–0.924), power or speed of the serve (0.720–0.907), rhythm (0.744–0.944), quality of the trunk action (0.775–1.000), leg drive (0.731–0.959) and the likelihood of back injury (0.703–0.934). They encountered greater difficulty in consistently rating shoulder internal rotation speed (0.688–0.717). In high-performance settings, the desire for highly precise measurement and large data sets powered by new technologies is commonplace, but this study revealed that tennis experts, through the use of 2D video, can reliably rate important mechanical features of the game’s most important shot, the serve.


Introduction
In tennis, the most important and heavily practiced skill is the serve. Players have complete control over their performance and their execution can directly win points (Martin, 2019). Research examining the serve has focused on describing its mechanics (Abrams et al., 2011; Elliott et al., 1995; Martin et al., 2016; Noffal, 1998; Touzard et al., 2019), as well as its relationship to performance and injury (Martin et al., 2014; Noffal, 1998; Sombelon et al., 2017). The efficacy of serve interventions that can inform drill design and progression has also been evaluated (Fernandez-Fernandez et al., 2013; Kovacs & Ellenbecker, 2011; Salonikidis & Zafeiridis, 2008). Ultimately, it is the coach who is primarily responsible for serve analysis and review and, as stated by former Grand Slam champion Mary Joe Fernandez, "the coach's trained eye is invaluable in identifying even the subtlest technical flaw. It's their ability to see what others may miss that sets them apart" (Fernandez, 2015). Coaches can spend considerable time identifying and correcting technical flaws (Knudson & Morrison, 1999), using mainly visual observation and mental imagery. The effectiveness of these techniques has been questioned, especially for movements that occur at very high speeds and across multiple planes (Arend & Higgins, 1976; Knudson, 2000; McPherson, 1996).
Precise measurement of complex, high-speed and multiplanar movements generally requires the use of three-dimensional (3D) motion capture systems. The cost of these systems can be prohibitive, often resulting in the use of more easily accessible two-dimensional (2D) video analysis. Two-dimensional video analysis was first introduced in the 1970s when Plagenhoef utilised this approach to review stroke mechanics (Plagenhoef, 1970). The difficulty of assessing 3D movement using 2D video analysis has been highlighted in high-level cricket bowling, where one viewing angle has been revealed as inadequate for reliably assessing elbow joint movement (Aginsky & Noakes, 2010). This reaffirms the importance of multiple camera views to assess 3D sports skills (Groppel, 1989; Sinclair et al., 2014), although research is equivocal about the best viewing angles for certain mechanics.
Empirical work examining the effectiveness of the tennis coach's eye is sparse. However, parallels do exist with the analysis undertaken by professionals in other domains requiring judgement or movement assessment. For example, Knudson explored the results of numerous visual assessment studies and highlighted that allied health professionals, doctors and even students could accurately detect changes in their areas of speciality at low speed, though accuracy decreased significantly when observers were not positioned at an appropriate location or the speed of the task increased (Knudson, 2013). In gymnastics, Ste-Marie (1999) also observed that expert judges were more efficient and accurate in both predicting and processing biomechanical movements than novice judges, while expert physical therapists have been reported to detect changes as small as 2° in trunk flexion/extension, albeit mostly in controlled settings and restricted to the sagittal plane (Whatman et al., 2012). Together, these findings point to the role of expertise in the assessment of movement kinematics in certain closed-skill settings (Knudson, 2013), although Knudson has noted that even then, qualitative assessments can have poor to moderate inter-rater reliability (Knudson, 2000, 2013). Currently, it remains unclear if expert tennis coaches perform similarly when visually assessing the mechanical aspects of the serve.
Factors considered critical to the success of a tennis serve have been well documented, dating back to the early works of Elliott et al. (1995), Groppel (1989) and Knudson (Knudson & Morrison, 1999; Knudson et al., 1994). Leg drive (Elliott, 2006; Fenter et al., 2017; Kaya et al., 2018; Knudson, 2006; Pugh et al., 2003; Whiteside et al., 2015; Wong et al., 2014), shoulder mechanics (Knudson, 2006; Konda et al., 2010; Pugh et al., 2003; Raphael et al., 2007), trunk rotation (Martin et al., 2014; Sombelon et al., 2017; Tubez et al., 2015), ball toss placement (Elliott, 2006; Knudson, 2006), service rhythm (Wong et al., 2014) and ball velocity (Baiget et al., 2016; Martin et al., 2013, 2014) have all been widely critiqued. The relationships between body movements, racket kinematics (Knudson, 2006) and injury (Dines et al., 2015; Elliott, 2006), considering player gender and age (Fernandez-Fernandez et al., 2019), have also been detailed. In turn, these characteristics have been incorporated into various skill frameworks to aid analysis (Knudson, 2006; Knudson et al., 1994; Kovacs & Ellenbecker, 2011; Myers et al., 2017; Đurović et al., 2008; Šlosar et al., 2019) involving categorical checklists (Hume, 2003), descriptive models with ranges of acceptability (35) and video-based binary critiques of technique as "Good (1)" or "Bad (0)" (Myers et al., 2017). The reliability of these frameworks is largely untested, yet they provide the bases upon which serve performance might be compared and monitored over time. Furthermore, if coaches can reliably assess the key technical characteristics of serve performance, it may be possible to use ratings of technique as inputs into supervised machine learning models that inform technology-led coaching applications. Such applications, which are based on self-recorded 2D footage, already appeal to many international and national sporting organisations, although their accuracy or specificity remains unknown (Xu et al., 2022). The action recognition component of these technologies has been validated (i.e., ability to identify a serve or forehand) (Ganser et al., 2021), yet their fidelity in assessing the quality of the different sports skills is underdeveloped (Parmar & Morris, 2019; Xu et al., 2022). Professionally labelled or tagged video in the form of expert ratings of technique has the potential to act as a critical input for initial quality assessment models.
Accordingly, this paper aims to examine the reliability of expert ratings of serve mechanics and performance using 2D video from different camera viewing angles, across player gender and age groups. It was hypothesised that expert ratings would be influenced by camera viewing angle but unaffected by the age or gender of the tennis player, due to the unique and extensive experience of the experts. Higher velocity joint actions, such as shoulder internal rotation, were also anticipated to be more difficult to rate reliably, independent of camera viewing angle. Should the expert ratings prove reliable, the resulting data could form an action quality data set to train a machine learning model that can assess key technical characteristics of serve performance.

Participants
Two expert high-performance coaches, with research PhD qualifications in biomechanics and over 20 years of experience analysing tennis serve techniques, voluntarily rated the serves of 150 male and 150 female players (n = 300), half of whom are current or past professional players. This was completed for three different age groups and from two different camera viewing angles (sagittal and rear coronal). A total of 600 serve video clips across six different age × gender groups were analysed. Two experts with the above profiles were needed to establish the reliability of the data prior to the creation of a machine learning algorithm that would be used to assess tennis serving mechanics from 2D video to a level of accuracy similar to that of professional coaches. This will also provide an optimal action quality data set for the artificial intelligence model, which requires qualified human annotations to score the service action.
The age groups comprised state-level 9-11-year-olds (n = 25 male and n = 25 female, average age = 10.2 years), 12-15-year-olds (n = 50 male and n = 50 female, average age = 14.3 years) who competed at the Junior Australian Open or similar events, and players aged 18 years and over (n = 75 male and n = 75 female, average age = 26.5 years) who held professional rankings. The videos were sourced from **** vision archives, with participants consenting (individually and/or with a parent) to the use of the vision for research purposes. All athletes were deidentified by name. The camera viewing angles were standardised across three different camera types: iPad 8th Generation at 1080p and 30 frames per second (fps) (n = 25), iPhone 11 at 1080p and 30 fps (n = 175) and broadcast cameras at 1080p and 30 fps (n = 400).

Experimental design
Footage was recorded from the sagittal view (positioned along the baseline at approximately 10 m from the server) and the rear coronal view (stationed directly behind and approximately 8 m from the server at heights between 3 and 6 m from the ground, Figure 1). All trials were first service attempts; however, a successful serve was not a requirement. As vision was extracted from the Tennis Australia vision archives, obtaining the same serve (from the sagittal and rear coronal views) from a specific match was not always possible (38 trials). The deuce side of the court was preferred and was recorded for 203 of the 300 players.
All trials were manually trimmed (using the video software Windows Media Player (Microsoft Corporation, 2022)) from the final frame before a player's preparatory bounce to the frame after the racquet passed the front leg in the follow-through of the service action. This meant that ball flight from contact to approximately the service line was in the camera field of view. Names and scores were removed from all service video footage. The order in which the experts were presented video footage was counterbalanced for age, gender, and camera viewing angle. Within each of the six groups, the sequence of video trials presented to the experts was randomised using a random number generator to minimise the influence of order bias. Videos were replayed in real time by the experts. They were able to watch replays as often as required to perform what they perceived to represent an accurate analysis.
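The within-group randomisation described above can be illustrated with a short sketch. It is not the authors' procedure: the clip fields (`clip_id`, `age_group`, `gender`), the helper name `randomise_within_groups` and the fixed seed are assumptions introduced purely for illustration, and the counterbalancing of groups across experts and camera views is not modelled.

```python
import random
from collections import defaultdict


def randomise_within_groups(clips, seed=2023):
    """Shuffle clip order inside each age x gender group (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the presentation order can be reproduced
    groups = defaultdict(list)
    for clip in clips:
        groups[(clip["age_group"], clip["gender"])].append(clip["clip_id"])
    order = {}
    for key in sorted(groups):  # fixed group order keeps the result deterministic
        ids = list(groups[key])
        rng.shuffle(ids)
        order[key] = ids
    return order


# Toy input: 3 age groups x 2 genders, four clips per group (made-up IDs).
toy_clips = [
    {"clip_id": f"{age}-{gender}-{i}", "age_group": age, "gender": gender}
    for age in ("9-11", "12-15", "18+")
    for gender in ("F", "M")
    for i in range(4)
]
presentation_order = randomise_within_groups(toy_clips)
```

Seeding the generator is a design choice rather than something reported in the study; it simply allows the same presentation order to be regenerated for audit purposes.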
Guided by previous research describing serve performance (Abrams et al., 2011; Elliott, 2006; Elliott et al., 1995; Konda et al., 2010; Kovacs & Ellenbecker, 2011; Martin, 2019; Martin et al., 2014; Myers et al., 2017; Whiteside et al., 2015), the experts independently compiled a set of features considered to be of principal analytical interest when appraising service technique, independent of age and gender, before then coming together to establish a consensus view. Given the volume of vision to be analysed (13 variables × 2 views for each of the 300 participants), the experts were encouraged to be as practical as possible. The final list of variables that the experts determined for review broadly fits into five categories and is summarised in Table 1. Notably, the experts highlighted the importance of attempting to assess rhythm and the potential for lower back injury based on the service technique, both of which have received limited applied research attention and have been omitted from previous frameworks (Kovacs & Ellenbecker, 2011; Myers et al., 2017). The inclusion of an overall serve score was also deemed relevant as an analytical baseline.
The 13 features broadly fit into five areas that coaches and experts commonly assess in practice, as shown below and in Table 1:
• Global serve performance: (i) overall serve score, (ii) rhythm, (iii) power of serve and (iv) lower back injury.
• Leg drive: magnitude of knee joint flexion and lower limb drive upward.
• Trunk action: magnitude of transverse plane trunk rotation during the backswing and magnitude of frontal plane trunk rotation (shoulder-over-shoulder) during the forward swing.
• Shoulder action: the magnitude of maximum external rotation during the backswing and the magnitude of internal rotation upper arm velocity in the swing to impact.
• Ball toss: the vertical, forward, and lateral trajectory of the ball toss.
The experts rated the 600 serves across all of the variables using a 7-point Likert scale (Konda et al., 2010), totalling 7800 ratings per expert. The application of the scale was as follows: scores of 1 and 7 represented the poorest and ideal performance, respectively, of that service feature for players of that age and gender. The only exception to this logic was when the experts appraised the potential for lower back injury. In that circumstance, players were rated as being extremely unlikely (1) or likely (7) to sustain such an injury based on their service action. As the focus of the investigation was to evaluate the analytical eye or ratings of experts, and not the precision or interpretability of a serve rubric, the experts agreed on guiding principles for only the boundary conditions (extremes) of each variable, independent of age and gender. Developing a more prescriptive guide would have been challenging given the six population groups but may represent an opportunity for future research. Table 1 summarises these guiding principles applied to the analysis of each serve. When rating, the experts were encouraged to review the service actions as they would in a standard practice session. Accordingly, variables were rated by the experts in the order they deemed most appropriate, and the frequency and speed of playback were self-regulated in VLC media player (VideoLAN, 2006). This represents both a limitation (lack of control) and a strength (higher ecological validity) of our research design. The experts were unable to back-track to reassess previously rated videos and were limited to assessing one video at a time, in the pre-determined and counterbalanced order in which they were presented. The 12 sets of service video footage (2 camera viewing angles × 2 genders × 3 age groups) were assessed by the experts across 24 days, for a total of 100-150 hours of analysis per expert. Both experts reported that they frequently used the pause function and frame-by-frame playback (>95% of trials), as it was beneficial for the analysis of multiple variables.
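The five areas and 13 features listed above, together with the rating volumes they imply, can be written down as a small data structure. The sketch below is illustrative only: the dictionary name, the feature labels and the constant names are shorthand chosen here, not identifiers taken from the study materials.

```python
# Feature taxonomy as described in the text; labels are illustrative shorthand.
SERVE_FEATURES = {
    "Global serve performance": [
        "overall serve score",
        "rhythm",
        "power of serve",
        "lower back injury",
    ],
    "Leg drive": [
        "knee joint flexion",
        "lower limb drive upward",
    ],
    "Trunk action": [
        "transverse plane trunk rotation (backswing)",
        "frontal plane trunk rotation (forward swing)",
    ],
    "Shoulder action": [
        "maximum external rotation (backswing)",
        "internal rotation upper arm velocity (swing to impact)",
    ],
    "Ball toss": [
        "vertical trajectory",
        "forward trajectory",
        "lateral trajectory",
    ],
}

N_PLAYERS = 300   # 150 male and 150 female players
N_VIEWS = 2       # sagittal and rear coronal
N_EXPERTS = 2

n_features = sum(len(features) for features in SERVE_FEATURES.values())
n_clips = N_PLAYERS * N_VIEWS                 # one clip per player per view
ratings_per_expert = n_features * n_clips     # every feature rated on every clip
ratings_total = ratings_per_expert * N_EXPERTS
footage_sets = N_VIEWS * 2 * 3                # views x genders x age groups
```

Under these counts, `ratings_per_expert` reproduces the 7800 ratings per expert reported in the text, and `footage_sets` reproduces the 12 sets of footage.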

Data analysis
Experts directly entered their ratings into a series of custom Microsoft Excel (Microsoft Corporation, 2018) templates. These ratings were then collated and analysed using RStudio (Team, 2020). No data were excluded, as the aim was to determine how reliably the serve could be assessed. Where the raters had a difference of >4 points for any single rating, they were asked to rescore while being blinded to their original score. This happened on 27 occasions and every occasion resulted in a score being updated, as it appeared evident that an expert had inadvertently flipped the scale on their first attempt. Inter-rater reliability of all 13 variables in the sagittal and coronal planes was assessed. The agreement in expert rating for the same serve viewed from the two camera viewing angles was also evaluated. The strength of all agreements was assessed using Krippendorff's alpha, where 0 indicated no agreement, 0.01-0.20 none to slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1.00 near perfect agreement (Vanbelle, 2014).
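To make the reliability statistic concrete, the following sketch computes Krippendorff's alpha for two raters on a 1-7 scale, flags rating pairs that differ by more than 4 points, and maps alpha onto the agreement bands quoted above. It is not the authors' analysis code (their analysis was run in R): the function names are hypothetical, the ordinal distance metric is an assumption because the text does not state which metric was used, missing ratings are not handled, and values that fall between the quoted bands (e.g. 0.205) are assigned to the next band up.

```python
from collections import Counter
from itertools import product

LEVELS = (1, 2, 3, 4, 5, 6, 7)  # 7-point rating scale


def krippendorff_alpha_ordinal(rater_a, rater_b, levels=LEVELS):
    """Ordinal Krippendorff's alpha for two raters, no missing ratings (sketch)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of clips")
    if any(v not in levels for v in list(rater_a) + list(rater_b)):
        raise ValueError("rating outside the allowed scale")
    # Coincidence matrix: each clip contributes its value pair in both orders.
    coincidences = Counter()
    for a, b in zip(rater_a, rater_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    totals = {c: sum(coincidences[(c, k)] for k in levels) for c in levels}
    n = sum(totals.values())  # number of pairable values (2 x number of clips)

    def delta_sq(c, k):
        # Ordinal metric: squared count of values ranked between c and k.
        lo, hi = sorted((levels.index(c), levels.index(k)))
        span = sum(totals[levels[g]] for g in range(lo, hi + 1))
        return (span - (totals[c] + totals[k]) / 2) ** 2

    pairs = list(product(levels, repeat=2))
    observed = sum(coincidences[(c, k)] * delta_sq(c, k) for c, k in pairs)
    expected = sum(totals[c] * totals[k] * delta_sq(c, k) for c, k in pairs)
    if expected == 0:
        return float("nan")  # undefined when every rating is identical
    return 1 - (n - 1) * observed / expected


def needs_rescoring(score_a, score_b, threshold=4):
    """Flag a rating pair whose difference exceeds the >4 point rule."""
    return abs(score_a - score_b) > threshold


def agreement_label(alpha):
    """Map alpha onto the agreement bands quoted in the text."""
    if alpha <= 0:
        return "no agreement"
    bands = [(0.20, "none to slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial")]
    for upper, label in bands:
        if alpha <= upper:
            return label
    return "near perfect"


# Toy ratings for three clips (made-up values, not study data).
toy_a = [1, 2, 3]
toy_b = [1, 2, 4]
toy_alpha = krippendorff_alpha_ordinal(toy_a, toy_b)
toy_label = agreement_label(toy_alpha)
```

A pure-Python implementation is used here so that each step (coincidence matrix, ordinal distance, observed versus expected disagreement) stays visible; in practice an established statistics package would be the safer choice.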

Results
The reliability of the expert ratings was analysed for both camera viewing angles, which also allowed a comparison between the different age groups and genders.

Reliability of the expert eye from the sagittal view
Table 2 compares the experts' inter-observer reliability for all 13 sagittal plane features across gender and the three age groups. Strong levels of inter-rater agreement were generally observed, independent of the age or gender of the player. Lateral ball toss displacement displayed the most volatile agreement, especially in the 18-and-over age group (0.61-0.90). The experts rated the mechanical features for the younger athletes with higher levels of agreement than for the older age groups (0.88 > 0.81). Agreement remained consistent across gender.
Table 3 presents the agreement between experts for variables assessed from the rear coronal camera view across all features, genders, and ages. Strong reliability was shown across age and gender when experts rated the overall serve score (0.73-0.92), power or speed of the serve (0.72-0.91), rhythm (0.74-0.94), quality of the trunk action (0.77-1.00) and leg drive (0.73-0.96). Experts showed mixed agreement, however, when rating the quality of the shoulder action (0.41-0.95) and the ball toss (0.42-0.95), independent of player age and gender. Ball toss height (0.70-0.95) and maximal external rotation (0.72-0.94) were more reliably assessed from the sagittal view. The hypothesis that experts could reliably rate service actions that posed a risk of lower back injury was supported (0.70-0.93). As hypothesised, there was no difference in expert ratings based on the gender of the player in the coronal (rear) plane.
The average agreement in the scores of the individual experts assessing the same serve (across both genders) from both camera viewing angles is summarised in Figure 2 (after the 27 corrections occurred). The difficulty in reliably assessing the forward displacement (0.57-0.62), peak height (0.54-0.62) and lateral displacement (0.66-0.74) of the toss between the two camera viewing angles was anticipated. The experts also found it more difficult to consistently rate shoulder internal rotation speed (0.69-0.72) compared with the other mechanical features (>0.75). Player gender did not influence the reliability of expert ratings.

Discussion
Judging and assessing techniques in high-level sport is often scrutinised. It is therefore surprising that investigations examining the reliability of experts' analysis in tennis are so limited (Heiniger & Mercier, 2021; Mercier & Heiniger, 2018; Ste-Marie, 1999). In this paper, the agreement between expert raters was consistent with the body of research evaluating expert assessment in other sports (Bučar et al., 2012; Pajek et al., 2012; Ste-Marie, 1999). Such analysis has been confounded by the body part assessed and the camera viewing angle (Heiniger & Mercier, 2021; Mercier & Heiniger, 2018). In relation to work in the allied health industry, our data supported the understanding that experts could reliably rate the movements specific to their domain of expertise; however, our findings challenge the view that inter-rater reliability is often low in qualitative assessments of movement.
This study aimed to determine the efficacy of experts in using their observational analysis skill to reliably rate mechanical features and outcomes of the tennis serve, as performed by male and female players across three age ranges. In the current study, agreement between expert ratings was generally strong from both the sagittal and rear coronal camera views. As expected, the experts showed very strong agreement (0.84) when rating global serve performance across all ages, genders and camera viewing angles. Select serve kinematics, such as maximum shoulder external rotation, upper arm internal rotation speed during the swing to impact, forward displacement, and height of the ball toss, were more reliably assessed from the sagittal camera view. This is instructive given the importance of long axis rotations of the shoulder in the serve (Martin et al., 2013) and the previously reported difficulty with its measurement in 2D (Aginsky & Noakes, 2010). The challenge in perceiving specific mechanical changes can be compounded by clothing, which can mask movement (Elliott et al., 2007), as well as the interdependency of the kinematic chain. For example, when evaluating the role of the trunk in the serve, the rotations of the shoulder and hip alignments occur simultaneously about all three axes and at high speed, making their assessment extremely difficult even for the expert eye. Unsurprisingly, ratings of the lateral position of the ball toss were more reliable when assessed from behind the player. It therefore seems necessary for coaches to adjust their viewing perspective based on the kinematics of interest. This may involve positioning that is perpendicular to the plane of motion of interest (Knudson, 2013), with consideration being given to expert preference. Coaches may be better at analysing from the sagittal plane, as they do so more often in practice. Marginally stronger levels of consensus were found among the experts when evaluating the female and junior players. The slower and more compact actions often used by these populations (due to shortened or abbreviated backswings seen in juniors, as well as slower velocity of certain joints in both the junior and women's actions) (Elliott et al., 2013; Fernandez-Fernandez et al., 2019) may present a less complex analytical challenge, especially when supplemented by video replay.
The expert assessment of the "likelihood of injury" and, to a lesser extent, "rhythm" is somewhat novel (Kovacs & Ellenbecker, 2011; Myers et al., 2017). While it seems ambitious for experts to provide real-time evaluations of injury, some guidelines on which technical deficiencies may lead to injury have been presented (Kovacs & Ellenbecker, 2011; Myers et al., 2017), although previous biomechanics research has had limited success in identifying predictive measures of injury in tennis (Moreno-Pérez et al., 2021). In the current study, the criterion validity of the expert ratings remains unknown, but the high inter-rater agreement for the likelihood of lower back injury (0.70-1.00) and rhythm (0.74-0.94) suggests that such ratings may offer a practical qualitative alternative to the more traditional measurement approaches. Indeed, given the pervasiveness of back injury in tennis-playing cohorts (Pluim et al., 2006) and the suggested importance of rhythm in serving (Crespo & Miley, 2002), further research investigating the utility and validity of these types of ratings is warranted.
The underlying rating system used in this study relates closely to ordinal scales used in diving and gymnastics. It is well accepted across these sports that differences in reliability exist between judges, although these decrease as the expertise of the judges increases (Heiniger & Mercier, 2021; Mercier & Heiniger, 2018). This study presents similar findings in that the expert coaches showed strong levels of reliability. With experts providing reliable scores across most variables, the data may also be used as a viable input for training supervised machine learning models that ultimately predict expert ratings of service technique. By providing access to a data set of action quality information, one of the barriers to high-fidelity digital coaching tools in tennis may be overcome (Parmar & Morris, 2017). Similar approaches have been used in gymnastics and diving to train artificial intelligence models with some success (Parmar & Morris, 2017, 2019; Xu et al., 2022). Instructively, research from these two judged sports, as well as data from this study, suggests that models for certain movements should only be trained on data obtained from specific viewing angles.
Although a novel addition to the applied biomechanics literature, this study has several limitations. The sample size of two expert raters, although consistent with comparable experimental designs (Del Baño-Aledo et al., 2017; Fort-Vanmeerhaeghe et al., 2017; Rinne et al., 2001), might benefit from expansion to capture a wider range of coaching experience and backgrounds. Other mechanical features, such as ankle and wrist flexion/extension, elbow flexion/extension and contribution of the ball tossing arm could also have been evaluated, as they have attracted previous research or practical focus. An assessment of intra-rater reliability was not pursued, which is similar to previous research (Šlosar et al., 2019), yet it could add value to our understanding of analytical expertise. Furthermore, as coaches often observe from in-line with the player and not from an elevated viewpoint, controlling camera height, as well as vision playback speed, might provide a more comprehensive understanding of the interplay between skill mechanics and expert assessment.

Conclusion
While little empirical research exists examining the reliability of rating high-speed movements, this study attempts to enhance our understanding of the degree to which expert coaches agree on skill assessment. It supports the position that a combination of camera views is needed to obtain a more comprehensive understanding of overall serve mechanics and that high-level expert coaches can reliably rate mechanical features across age and gender. Expert ratings may therefore present as inputs into supervised machine learning models that predict the quality of service actions. Though experts do have a similar understanding of the technical constructs of serve performance, further research could examine the extent to which video playback speed and annotations influence expert ratings.

Figure 1. The side (sagittal) and rear (coronal) camera viewing angles for the service analysis.

Table 1. Guiding principles for the analysis of serve technique using a 7-point scale by experts.

Table 2. Inter-rater reliability, as determined by Krippendorff's alpha, of the 13 serve features in the sagittal plane.

Table 3. Inter-rater reliability, as determined by Krippendorff's alpha, of the 13 serve features in the rear (coronal) plane.

Figure 2. Comparison of the mean inter-rater reliability performance between camera viewing angles across the 13 mechanical features in the 3 age groups.