Clustering of match running and performance indicators to assess between- and within-playing position similarity in professional rugby league

ABSTRACT This study aimed to determine the similarity between and within positions in professional rugby league in terms of technical performance and match displacement. Here, the analyses were repeated on 3 different datasets which consisted of technical features only, displacement features only, and a combined dataset including both. Each dataset contained 7617 observations from the 2018 and 2019 Super League seasons, including 366 players from 11 teams. For each dataset, feature selection was initially used to rank features regarding their importance for predicting a player’s position for each match. Subsets of 12, 11, and 27 features were retained for technical, displacement, and combined datasets for subsequent analyses. Hierarchical cluster analyses were then carried out on the positional means to find logical groupings. For the technical dataset, 3 clusters were found: (1) props, loose forwards, second-row, hooker; (2) halves; (3) wings, centres, fullback. For displacement, 4 clusters were found: (1) second-rows, halves; (2) wings, centres; (3) fullback; (4) props, loose forward, hooker. For the combined dataset, 3 clusters were found: (1) halves, fullback; (2) wings and centres; (3) props, loose forward, hooker, second-rows. These positional clusters can be used to standardise positional groups in research investigating either technical, displacement, or both constructs within rugby league.


Introduction
Rugby league is an example of a collision-based invasion team sport. A match comprises two teams of 13 on-field players, each with distinct positional roles that interact with each other and the opposition (Gabbett et al., 2008). Players may be classified by their individual playing position (i.e. fullback, left and right wings, left and right centres, half-back, stand-off, hooker, loose forward, left and right second-row, left and right props), or more often classified into broader positional groups (e.g., forwards, backs) based on commonality in their match characteristics and physical qualities (Gabbett et al., 2008). Typically, these characteristics include a combination of measures from various sources such as microtechnology and notational analyses that represent either physical, technical or tactical constructs (Johnston et al., 2014). Understanding the similarities between positions and players and how they should be logically grouped, using an objective framework and based on these constructs, is an important task (Johnston et al., 2014). Identifying logical positional groupings could help to inform team selection, assist in determining logical training groups, or could allow for the standardisation of positional groups in research thus allowing for easier comparisons between studies in future.
However, there is currently no consensus in the literature as to exactly how these logical positional groups are formed, since they are usually anecdotally chosen. Some studies include no positional groupings and treat all players as the same sample (Kempton et al., 2017; Murray et al., 2014; Twist et al., 2014; Varley et al., 2014), whereas others classify players using the individual playing positions themselves (Austin & Kelly, 2014). Studies that do use positional groupings commonly include a forwards and backs split (e.g., Oxendale et al., 2016; Rennie et al., 2020), or forwards, backs and adjustables (King et al., 2009). Adjustables consist of any combination of halves, hookers, or fullbacks (King et al., 2009). This disparity likely reflects the different philosophies of the researchers or the study design employed, but nonetheless makes it difficult to compare results between studies (Glassbrook et al., 2019).
One method of identifying positional groupings is through unsupervised machine learning, such as cluster analysis. Within rugby union, previous research has used hierarchical cluster analysis to determine positional groups from a number of performance indicators and displacement metrics (Quarrie et al., 2013). Displacement metrics are considered to be any variable describing a measure of distance, speed, or acceleration of a player (Polglaze et al., 2016). From the dendrogram (i.e. the tree diagram) produced by the analysis, it is possible to see how positional sub-units cluster together, as well as their relatedness to other sub-units. For example, within their data set, Quarrie et al. (2013) reported outside backs (left wing, right wing, fullback) to be more related to centres (inside centre, outside centre), before joining with halves (fly half, scrum half) to form the backs positional group. Importantly, however, their analyses relied on positional aggregation without consideration for intra-positional variability. A recent study in rugby league observed high between-player variability (i.e. true player-to-player variability after accounting for the position, the fixture, and the club) in match displacement metrics within the Super League (SL; Dalton-Barron, Palczewska et al., 2020). For example, total distance and high-speed running (HSR; >5.5 m·s⁻¹) distance during ball-in-play phases varied by 9.4% (90% confidence limit [CL] = 0.8%) and 15.0% (2.4%). Therefore, it may be worthwhile aggregating data at the player level as well as the position level, to account for the variability within positions in terms of displacement.
More recently, Wedding et al. (2020) used a comprehensive framework involving dimension reduction and cluster analysis at the player level to identify positional groups in the Australasian National Rugby League (NRL). They firstly classified each player in the NRL into one of four a priori chosen positional groups (adjustables = halves, hooker and fullback; backs = centres and wingers; forwards = second rows, props, loose forward; interchanges = benched players), based on previous literature (Austin et al., 2011; Gabbett et al., 2010, 2012). These groups were used as a basis for comparing with groups identified via their two-step data-driven approach, which consisted of an initial principal component analysis (PCA) followed by a hierarchical cluster analysis. The original dataset used 48 technical performance indicators; after PCA, the authors kept only the first 14 principal components as inputs into a hierarchical cluster analysis. They found six distinct positional groups that consisted of the four a priori identified positional groups (i.e., forwards, adjustables, interchanges, backs) as well as two additionally identified positional groups (i.e., interchange forwards, utility backs; Wedding et al., 2020). Although useful, their analyses only included technical performance indicators, which may lead to a somewhat one-dimensional view. Combining technical data with displacement data derived from microtechnology may yield different results, since displacement has also been shown to differentiate between positions in previous research (Glassbrook et al., 2019).
Indeed, the widespread use of microtechnology and notational analyses within matches means that researchers and club practitioners now have a high volume and variety of information available to quantify the demands imposed on players and positions. However, this also means they are faced with the challenge of analysing, visualising, and interpreting increasingly complex data sets (Dalton-Barron, Whitehead et al., 2020). One method of reducing this complexity is through the use of dimension reduction techniques, which is a global term incorporating both feature extraction and feature selection techniques. Feature extraction techniques involve projecting the original data onto a new smaller subspace with lower dimensionality whilst retaining the majority of the variance in the original data, such as in PCA (Abdi & Williams, 2010). Feature extraction has gained much attention within sport recently, as it lends itself particularly well to visualisation and may highlight previously unobservable groups or patterns within the data. However, the representation of the original data is abstracted since a new feature space is created. Whilst this may be the researcher's or practitioner's intention, they may also be interested in the detail provided by the original features to inform further decisions or analyses. Unlike feature extraction, feature selection methods select a subset of important features without altering the features themselves, thus retaining their semantic value (Saeys et al., 2007). Feature importance in this context refers to the relevance of the feature to its target, which may either be categorical (e.g., match outcome) or continuous (e.g., points difference). Feature selection plays a vital role as a pre-processing step in building either statistical or machine learning models within other fields such as computer science (Guyon & Elisseeff, 2003), bioinformatics (Saeys et al., 2007), and medicine (Remeseiro & Bolon-Canedo, 2019).
Such techniques have gained less attention in sport but may nonetheless still prove useful. For example, feature selection may be used to determine an optimal dataset that only contains important features for discriminating between positions. In this way, feature selection may be used as a pre-process step in hierarchical cluster analysis to determine broader positional groups.
Within rugby league, there are no studies that examine the similarities between positions whilst accounting for the multidimensional nature of match-play, which includes both physical (i.e., displacement) and technical constructs. Therefore, the primary aim of this study was to determine the similarity between positions in the SL in terms of match displacement and technical indicators, through a combination of feature selection and hierarchical cluster analysis (Aim 1). Furthermore, our second aim was to visually represent the intra-positional, or between-player, variability through PCA and cluster analysis (Aim 2). Such visualisations may uncover new multivariate patterns or groups, whilst accounting for the intra-positional variability in the data.

Methods
The flow chart in Figure 1 outlines the entire methodology for determining the similarity between playing positions (Aim 1) and players (Aim 2) in terms of displacement and technical performance indicators. The analysis was repeated three times to include three different datasets: 1) match displacement features only, collected from microtechnology devices; 2) technical performance indicators only, collected from notational analysis; 3) combined dataset including both match displacement and technical performance indicators. For the purposes of this study, playing positions at the most residual level were considered as 8 standard positions (i.e., fullback, wings, centres, halves, hooker, props, second-rows, loose forward). Left and right positional variations (e.g., left wing and right wing) were not considered and were treated as the same position. This is because players can swap left and right sides, even within a match, which makes assigning a position label that represents the whole match problematic, whereas players are much less likely to swap positions entirely. All analyses outlined below were completed in R (version 4.0.2).
Match displacement data were from a league-wide project (i.e., "Project SL-Catapult"). Within the project, all SL clubs use the same microtechnology devices (Optimeye S5, Catapult Sports, Melbourne, Australia; Firmware version = 7.17) and software (Openfield™, Catapult Sports, Melbourne, Australia; Software version = 3.1.0) for downloading raw data and subsequent uploading to Catapult servers. The research team then accessed 10-Hz sensor data files for further data processing and filtering. These data and filtering processes were the same as those used by Dalton-Barron, Palczewska et al. (2020), resulting in identical displacement data. Included are 7617 observations collected from 11 SL teams and 366 senior male professional SL players. Matches included are from the 2018 and 2019 SL seasons; the Middle 8s phase of the 2018 season was excluded since it included Championship teams. This dataset also includes 35 discretised displacement metrics stratified by phases-of-play (i.e. attack, defence and transition phases), which are both absolute (i.e. total distance, high-speed running [HSR] distance, sprint distance) and relative to playing time (i.e. average speed, HSR distance per minute, sprint distance per minute, and absolute acceleration [Delaney et al., 2016]). Each match observation's associated technical match performance indicators were then extracted from Opta (Stats Perform, London, UK) Superscout files. Initially, 558 technical features were extracted that included both actions (e.g., pass) and action outcomes (e.g., pass completed). Upon consultation with two expert rugby league coaches, these were then reduced to 41 key technical features, which were then expressed both in absolute terms and relative to playing duration, totalling 82 features. Both coaches have international coaching experience and have over 15 and 30 years' experience coaching professionally within the SL and NRL, respectively.

Feature ranking using ensemble feature selection
Firstly, taking an initial dataset, features were filtered if they displayed near zero variance using the "nearZeroVar" function from the caret package (Kuhn, 2008; frequency cut-off ratio = 100/1, unique values = 10%). Near zero variance features have few unique values and occur infrequently in the data, and as such likely contain little valuable predictive information (Kuhn, 2008). The frequency cut-off ratio and proportion of unique values are two frequently used indicators of near zero variance. Features were also filtered if they were highly correlated with another variable (r > 0.8). Removing highly correlated variables prior to feature selection is a common process to reduce model complexity (Andersen & Bro, 2010), without altering the feature space such as in PCA (Graham, 2003). This resulted in 39 features removed from the technical dataset and 15 features removed from the displacement dataset. For descriptive data including median and quartile ranges for each position and dataset, see Supplementary File 1.
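The two filtering steps above can be sketched in a few lines. The study itself used the "nearZeroVar" function in R; the Python below is only an illustrative re-statement of the two indicators (frequency ratio and percentage of unique values) followed by a correlation filter. The feature names and values are hypothetical, the use of the absolute correlation is an assumption, and so is the rule of keeping the first feature of any highly correlated pair.

```python
import math
from collections import Counter


def near_zero_variance(values, freq_cut=100 / 1, unique_cut=10.0):
    """Flag a feature using the two indicators described in the text."""
    top_two = Counter(values).most_common(2)
    if len(top_two) == 1:  # a single unique value: zero variance
        return True
    freq_ratio = top_two[0][1] / top_two[1][1]  # most vs. second most common value
    percent_unique = 100.0 * len(set(values)) / len(values)
    return freq_ratio > freq_cut and percent_unique < unique_cut


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def filter_features(data, r_cut=0.8):
    """Return the names of features that survive both filters."""
    kept = [name for name, col in data.items() if not near_zero_variance(col)]
    selected = []
    for name in kept:
        # Assumption: keep the first feature of any highly correlated pair.
        if all(abs(pearson_r(data[name], data[s])) <= r_cut for s in selected):
            selected.append(name)
    return selected


# Hypothetical match observations (not the study's data).
data = {
    "total_distance": [6200.0, 7100.0, 5400.0, 8000.0, 6600.0, 7300.0],
    "distance_copy": [6210.0, 7090.0, 5420.0, 7990.0, 6610.0, 7280.0],
    "tackles": [30.0, 12.0, 25.0, 28.0, 10.0, 33.0],
    "red_cards": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}
selected = filter_features(data)
print(selected)
```

In this toy example the constant feature is dropped by the near zero variance check and the near-duplicate distance feature is dropped by the correlation check.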
Features were then ranked according to their importance for classifying playing position at the most residual level (i.e. fullback, wing, centre, halves, hooker, loose forward, second row, prop) using an ensemble of feature selection techniques including filter, wrapper, and embedded methods. The objective of feature selection is to select an optimal subset of the original features within a dataset, such that the end model employed on the data contains a reduced set of features that maintain or even improve predictive performance. For a comprehensive review of feature selection and available methods, see Guyon and Elisseeff (2003). The details of each feature selection technique used in this study, as well as their implementation in R, are outlined in Table 1. Multiple feature selection techniques were used to compensate for potential biases encountered using a single technique (Prati, 2012). Each technique provided its own base feature ranking according to that technique's definition of importance, after which all base rankings were aggregated based on the order of each base ranking via the "Borda Count" voting system (Prati, 2012). The Borda count of a feature is its mean position in all base rankings, that is:
\[ \mathrm{Borda}(f_i) = \frac{1}{n} \sum_{j=1}^{n} \pi_j(f_i) \]
where $\pi_j(f_i)$ is the rank of feature $f_i$ in the ranking $\pi_j$, and $n$ is the number of base rankings.
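As a minimal sketch of the aggregation step, the snippet below computes the mean position of each feature across several base rankings and orders the features by that mean. The three base rankings and the feature names are hypothetical and stand in for the outputs of the filter, wrapper, and embedded techniques; the function name is likewise an illustrative choice.

```python
def borda_aggregate(rankings):
    """Aggregate base rankings (each ordered from most to least important)
    by the mean position of every feature, as in a Borda count."""
    features = rankings[0]
    mean_rank = {
        f: sum(r.index(f) + 1 for r in rankings) / len(rankings)
        for f in features
    }
    # A lower mean position means a more important feature.
    order = sorted(features, key=lambda f: mean_rank[f])
    return order, mean_rank


# Hypothetical base rankings from three techniques.
filter_rank = ["tackles", "carries", "kicks", "passes"]
wrapper_rank = ["carries", "tackles", "passes", "kicks"]
embedded_rank = ["tackles", "kicks", "carries", "passes"]

order, mean_rank = borda_aggregate([filter_rank, wrapper_rank, embedded_rank])
print(order)
print(mean_rank)
```

Here "tackles" has a mean position of 4/3 and so heads the aggregated ranking, even though one technique placed it second.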

Determining optimal number of important features
To determine the optimal number of important features for the subsequent clustering analysis (i.e., the minimum number of important variables that still hold high predictive performance), 1 to k features were recursively inputted as predictors in a random forest. The randomForest function from the randomForest package was used. Each forest comprised 500 trees, and the number of features used at each split was calculated as the square root of the total number of inputted features. Each random forest model was then cross-validated to gain the area-under-curve (AUC) statistic, whereby data were split into 70% training and 30% testing. Since the AUC requires a binary classification, multiple receiver-operator characteristic (ROC) curves were calculated for the classification of each position using the pROC package. The AUC was extracted from each ROC curve and the median AUC across all classifications was taken to gain overall model predictive performance. Each random forest was run 100 times to gain a stable AUC statistic. The AUCs from each dataset were then visually inspected and a judgement was made on the number of features to retain for subsequent analyses, based on the point at which the AUC plateaus. Subsequently, the top 12 technical features, the top 11 displacement features, and the top 24 combined features were retained for further analysis (Figure 2).
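The way a multi-class prediction is reduced to a single AUC can be illustrated as follows. This is a sketch under stated assumptions rather than the study's pROC workflow: each position is scored one-versus-rest, the AUC is computed as the probability that a randomly chosen positive observation receives a higher predicted probability than a randomly chosen negative one (ties counting one half), and the median is then taken across positions. The labels and predicted probabilities are hypothetical test-set values.

```python
from statistics import median


def binary_auc(scores, is_positive):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg
    )
    return wins / (len(pos) * len(neg))


def median_one_vs_rest_auc(labels, class_probs):
    """class_probs maps each class to its predicted probability per observation."""
    aucs = {
        c: binary_auc(probs, [lab == c for lab in labels])
        for c, probs in class_probs.items()
    }
    return median(aucs.values()), aucs


# Hypothetical held-out observations and predicted class probabilities.
labels = ["prop", "prop", "wing", "wing", "half", "half"]
class_probs = {
    "prop": [0.8, 0.6, 0.1, 0.2, 0.3, 0.1],
    "wing": [0.1, 0.3, 0.7, 0.3, 0.2, 0.4],
    "half": [0.1, 0.1, 0.2, 0.5, 0.5, 0.5],
}
med_auc, aucs = median_one_vs_rest_auc(labels, class_probs)
print(aucs, med_auc)
```

Repeating this calculation for models built on the top 1, 2, ..., k ranked features gives the kind of curve that can be inspected for a plateau.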

Hierarchical cluster analysis
Two hierarchical cluster analyses were then applied to each of the three filtered datasets. The first hierarchical cluster analysis was conducted at the positional level (Aim 1) and the second at the player level (Aim 2). For the positional level analysis, data were grouped by position and the mean taken for each feature. Data were then normalised (mean centred and scaled to unit variance) since there was a variety of features calculated in different units. Ward's method of agglomerative hierarchical clustering was used to logically cluster positions (Ward, 1963), using a squared Euclidean distance matrix. Briefly, Ward's method starts with each observation as its own cluster, then repeatedly merges the pair of clusters whose merger gives the smallest increase in the within-cluster error sum-of-squares; hence the method is sometimes termed the "minimum variance method". The "ward.D2" implementation in R was used here (Murtagh & Legendre, 2014). The results were then visualised on a dendrogram. For the player level analysis, players were first labelled according to their most frequently played starting position. Starting positions for each player were provided by Opta. Data were then grouped by their associated player ID and the mean taken for each feature, at which point the data were then also normalised. Observations were filtered if the player did not play at least five matches in their respective position. The same cluster procedure as the positional level was applied at the player level. However, since there were so many observations, PCA was also applied on the same dataset to visualise the results in a 2-dimensional space. PCA is an eigenvector-based method and is one of the most common techniques for dimension reduction (Ringnér, 2008). Taking a high-dimensional dataset, it is possible to create a linear set of orthogonalized composite variables, termed the principal components, with minimal loss of information. The original data can then be projected onto the first two principal components for visualisation.
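The positional-level procedure described above (normalise the positional means, then merge clusters by Ward's criterion) can be sketched as below. This is a plain re-statement of the minimum variance rule, not the "ward.D2" code used in the study: at each step it merges the two clusters whose merger adds least to the within-cluster sum of squares, which for clusters of sizes n_a and n_b equals n_a n_b / (n_a + n_b) times the squared Euclidean distance between their centroids. The positional means are invented for illustration and the function names are hypothetical.

```python
from statistics import mean, stdev


def standardise(rows):
    """Mean-centre each column and scale it to unit (sample) variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def ward_cut(points, labels, k):
    """Agglomerate with Ward's criterion until k clusters remain."""
    clusters = [([lab], list(p)) for lab, p in zip(labels, points)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (mi, ci), (mj, cj) = clusters[i], clusters[j]
                ni, nj = len(mi), len(mj)
                d2 = sum((a - b) ** 2 for a, b in zip(ci, cj))
                cost = ni * nj / (ni + nj) * d2  # increase in within-cluster SS
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (mi, ci), (mj, cj) = clusters[i], clusters[j]
        ni, nj = len(mi), len(mj)
        centroid = [(ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj)]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append((mi + mj, centroid))
    return sorted(sorted(members) for members, _ in clusters)


# Hypothetical positional means: tackles, high-speed running (m), kicks.
positions = ["prop", "loose forward", "hooker", "second-row",
             "halves", "wing", "centre", "fullback"]
means = [
    [28.0, 150.0, 0.0], [30.0, 170.0, 0.0], [36.0, 180.0, 1.0],
    [27.0, 300.0, 0.0], [18.0, 320.0, 9.0], [5.0, 520.0, 0.0],
    [9.0, 500.0, 0.0], [6.0, 560.0, 2.0],
]
z = standardise(means)
three = ward_cut(z, positions, 3)
four = ward_cut(z, positions, 4)
print(three)
print(four)
```

Because agglomerative clustering is nested, stopping at k clusters is equivalent to cutting the full dendrogram at the height that leaves k branches, which is how different numbers of positional groups can be read off one tree.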
Ward's method of agglomerative hierarchical clustering is complementary to PCA since it utilises the same multivariate Euclidean space to find its clusters. As such, the identified clusters are likely to be found in high density areas of PCA ordination (Murtagh & Legendre, 2014). Two principal component plots were created for each dataset; both are projections of the original data in eigenspace, with each point representing a player and their colour representing either their position or their cluster membership. Data ellipses representing 90% of the data were also drawn in each principal component plot around each class (either position or cluster). Lastly, the NbClust function was also applied to find the optimal number of clusters within each dataset, which implements 30 commonly used indices and suggests the best clustering scheme according to the majority rule. For a full conceptual and mathematical outline of the function and its indices, see Charrad et al. (2014).
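To make the projection step concrete, the sketch below extracts the first two principal components of a normalised dataset and projects each observation onto them. It is an illustration only: power iteration with deflation is swapped in for a full eigendecomposition so that the example needs no external libraries, the data are invented positional means rather than the study's player-level data, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def standardise(rows):
    """Mean-centre each column and scale it to unit (sample) variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def top_components(z, n_components=2, iters=500):
    """Leading eigenvectors of the covariance matrix via power iteration."""
    n, p = len(z), len(z[0])
    cov = [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(p)]
           for a in range(p)]
    components, variances = [], []
    for _ in range(n_components):
        v = [float(a + 1) for a in range(p)]  # arbitrary non-zero start vector
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p))
                  for a in range(p))
        components.append(v)
        variances.append(lam)
        # Deflate: remove the component just found before seeking the next.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(p)]
               for a in range(p)]
    return components, variances


# Hypothetical means: tackles, high-speed running (m), kicks.
rows = [
    [28.0, 150.0, 0.0], [30.0, 170.0, 0.0], [36.0, 180.0, 1.0],
    [27.0, 300.0, 0.0], [18.0, 320.0, 9.0], [5.0, 520.0, 0.0],
    [9.0, 500.0, 0.0], [6.0, 560.0, 2.0],
]
z = standardise(rows)
components, variances = top_components(z)
# Coordinates of each observation on the first two principal components.
scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
          for row in z]
explained = [v / len(rows[0]) for v in variances]  # total variance = no. of features
print(explained)
```

Plotting the two columns of `scores` against each other, coloured by position or by cluster membership, gives the type of plot described above. The sign of each component is arbitrary, so mirrored plots are equivalent.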

Results
For descriptive data of each feature used in each of the datasets see Supplementary File 1. Figure 2 shows the AUCs extracted from the random forests built for classifying position, as a function of the number of inputted important features. The median AUCs at the chosen number of features (i.e. the dashed vertical line) were 0.77, 0.84, 0.82 for the technical, displacement, and combined datasets respectively. Table 2 shows the top 10 extracted features from each dataset. For a full list of aggregated feature rankings and descriptions see Supplementary File 2.

Positional clustering
The results of the hierarchical cluster analysis at the positional level are presented in Figure 3. Up to seven possible clusters may be extracted from each dendrogram through "cutting" the dendrogram at different thresholds. However, for the technical dataset there appear to be three clear clusters: Technical cluster 1 (Tech C1) = props, loose forwards, second-rows and hooker; Tech C2 = halves; Tech C3 = wings, centres and fullback. For displacement, four clusters are noted: props, loose forward and hooker; second-rows and halves; fullback; and wings and centres. For the combined dataset, three clusters are noted: props, loose forward, hooker and second-rows; halves and fullback; and wings and centres.
Figure 4 shows the results of the hierarchical cluster analysis at the player level and includes a series of principal component plots. For the full results of the PCA applied to each dataset, including the eigenvalues, eigenvectors, and percent variance explained by each principal component, see Supplementary Files 4A, 4B, and 4C.

Discussion
The primary aim of this study was to determine the similarity between positions in professional rugby league by using analyses that include both physical and technical characteristics. This study implemented a novel framework which firstly used supervised feature selection to identify important technical and displacement features for classifying position. Positions were then clustered using those important features as inputs into three separate hierarchical cluster analyses, which included technical only features, displacement features, and a combined dataset. The dendrograms and principal component plots produced from the analysis provide a visual insight into the similarity between playing positions and players, respectively. Using these visuals, practitioners and researchers may choose the number of clusters to extract from each dataset as required. For example, if 4 clusters are required, the positional dendrograms can be cut at the desired level and the resulting positional groups can be used. The use of a league-wide sample over two competitive seasons also allows researchers and practitioners greater confidence in the generalisability of the presented results.

Determining logical positional groups using position labels
At the positional level, there appear to be common clusters across the three datasets (Figure 3). Firstly, wings and centres are consistently clustered across displacement, technical, and combined datasets, which is expected given their similarities in attacking and defensive roles (Sirotic et al., 2011). Fullbacks are often also grouped with centres and wings to form the "outside backs" positional group (e.g., Cummins & Orr, 2015; Twist et al., 2014; Waldron et al., 2011). This is reflected in the technical only dataset (Tech C3), but not in the displacement dataset where fullbacks form their own cluster (Disp C3) or the combined dataset where they are more similar to halves (Comb C2). The latter is somewhat surprising given their distinct positional roles, particularly in defence where a fullback's main responsibilities involve covering the goal line from kicks and breaks in the defensive line, although some authors have previously included both halves and fullbacks in a broader "adjustables" group, along with hookers (Cummins et al., 2016; Gabbett et al., 2011).
Another common grouping across all three datasets includes props, loose forwards, and hookers (Tech C1, Disp C1, Comb C1). The emergence of this cluster could be due to a number of reasons, however the increased tackling involvements these positions experience during match play compared to other positions is notable (Supplementary File 1; Naughton  Johnston et al., 2014), all of which were deemed important for predicting positions for technical and combined datasets within the ensemble feature selection step (Table 2). It could be argued that the reduction in carries could be due to these positions being typically interchanged, and that including carries per minute may resolve this. However, this was accounted for in the initial filtering step of the current framework and carries-per-minute was removed since it was highly correlated with total carries (r > 0.8). Given the known interplay between contact involvements and displacement (Johnston et al., 2019), it is also not surprising that these positions cluster together when looking solely at displacement (Disp C1). Props and loose forwards are often grouped together in research as "middles" or "hit up forwards" (e.g., Scott et al., 2017). However, it is somewhat surprising that hookers are more related to this group than halves, since in attack the three positions work together to organise the area around the ruck and the attacking structure in general (Sirotic et al., 2011). Instead, halves are a unique position in the technical dataset (Tech C2), which is likely due to their kicking responsibilities, which appear as important variables (Table 2). Interestingly, for displacement, halves are much more related to the second-row position (Disp C2) and both are somewhat similar to the wings and centres (Figure 3(b)). Again, this could be explained by a number of different reasons but is likely related to their similarities in spatial occupancy.
Although second-rows are commonly labelled as "forwards" they do operate in wider channels to provide attacking and defensive support. This increased space allows second-rows to accumulate more high-speed running than props, loose forwards, and hookers ( dissimilarity. For example, the props (n = 2), hooker and loose forward are typically the "middle" four, and the second-row, half-back, centre and wing are the "edge" four on each side.

Determining logical positional groups using player labels
Given the large between-player (i.e. within-position) variability found previously for the same displacement variables used here (Dalton-Barron, Palczewska et al., 2020), different clusters were expected to form at the player level analysis. However, the identified positional clusters are exactly the same as those identified at the positional level, aside from the combined dataset, where fullbacks join with centres and wings in the player level analysis instead of halves. This study also found very good separation between positions (Figure 4 a1, b1, c1), and players tend to cluster very well with their positional counterparts (Figure 4 a2, b2, c2), which can be seen visually in the principal component plots (Figure 4). Practically, this means there is more dissimilarity between position centroids than exists within positions for both technical and displacement datasets, and suggests that data may be aggregated at the positional level. That being said, the methodology used to identify similarity between players may be used in other applications. For example, it may help guide team selection: if a player is injured, coaches may wish to choose a player who displays similar technical qualities. The dendrograms and principal component plots may act as a tool to visualise highly complex data, whilst supplementing the numerical data.
Whilst feature selection has been implemented previously in sport as a pre-processing step (Bunker & Thabtah, 2019; Wundersitz et al., 2015), to the authors' knowledge this is the first study to use an ensemble of techniques, which includes expert-domain led feature selection, in team sports. Importantly, there is variation in the taxonomy of the identified important features (Table 2). This means the variables inputted in the subsequent hierarchical cluster analyses represent a holistic overview of match play instead of focusing solely on a single aspect, such as only attacking play. For the technical dataset, the most important predictors of position are related to forward attacking play (e.g., tackle busts), defensive play (e.g., total tackles), and kicking and catching qualities. For the displacement dataset, by contrast, the most important features identified relate to attacking, defensive, and transition running. They also include cumulative metrics (e.g., total distance, HSR distance, sprint distance) and metrics relative to playing time (e.g., average speed, average acceleration). This also outlines an important limitation of this study, which is the reliance on discrete data for both technical and displacement data. That is not to say the current data are not useful; rather, the inclusion of spatial and temporal properties in the data may yield new insights into the clustering of positions. Nonetheless, the variables included in this study are some of the most commonly used in rugby league for studies that use technical (e.g., Parmar et al., 2018; Wedding et al., 2020; Woods et al., 2017) or displacement features (e.g., Delaney et al., 2016; Kempton & Coutts, 2016; Sirotic et al., 2011).

Conclusion
In conclusion, the positional clusters identified can be used to standardise positional groups in future research investigating either technical, displacement, or combined features in senior men's rugby league. Importantly, whilst it appears that three clusters emerge from the technical and combined datasets, and four clusters from the displacement dataset, practitioners and researchers may choose the number of clusters to extract from each dataset as required. For example, if 4 clusters are needed, the dendrograms can be cut at the desired level and the resulting positional groups can be used. Whilst within-position (i.e. between-player) variability did exist in all datasets, the separation between classes was still large enough to clearly demarcate the clusters in these data. However, performing cluster analyses at the player level may still be warranted if practitioners are interested in the similarity between individual players in their team. Ensemble feature selection may also be used and generalised to other problems in research or practice, where the objective is to identify important features without changing their original semantic value.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The author(s) reported there is no funding associated with the work featured in this article.