Test–retest reliability measures for curve data: an overview with recommendations and supplementary code

ABSTRACT The purpose of this paper is to provide an overview of available methods for reliability investigations when the outcome of interest is a curve. Curve data, or functional data, is commonly collected in biomechanical research in order to better understand different aspects of human movement. Using recent statistical developments, curve data can be analysed in its most detailed form, as functions. However, an overview of appropriate statistical methods for assessing reliability of curve data is lacking. A review of contemporary literature of reliability measures for curve data within the fields of biomechanics and statistics identified the following methods: coefficient of multiple correlation, functional limits of agreement, measures of distance and similarity, and integrated pointwise indices (an extension of univariate reliability measures to curve data, inclusive of Pearson correlation, intraclass correlation, and standard error of measurement). These methods are briefly presented, implemented (R-code available as supplementary material) and evaluated on simulated data to highlight advantages and disadvantages of the methods. Among the identified methods, the integrated intraclass correlation and standard error of measurement are recommended. These methods are straightforward to implement, enable results over the domain, and consider variation between individuals, which the other methods partly neglect.


Introduction
Biomechanical research often involves the collection of large data sets based on integrated three-dimensional kinematics (e.g., motion capture), kinetics (e.g., force plate data), and muscle activation patterns (e.g., electromyography) during variable functional tasks. Such biomechanical data are high-dimensional and complex (Colyer, Evans, Cosker, & Salo, 2018) but are usually reduced to simple discrete outcome measures. Common examples found in the literature include joint range of motion (Sinsurin, Vachalathiti, Srisangboriboon, & Richards, 2018), maximal angles and moments (Markström, Schelin, & Häger, 2018), ground reaction forces (Elias, Hammill, & Mizner, 2015; Sinsurin et al., 2018), values of co-contraction (Elias et al., 2015) and others. These, or similar, outcomes are often collected within specific time frames of interest to enable the use of basic and well-documented statistical methods. However, it is logical to believe, and has been shown, that data showing the entire movement during, e.g., a jump landing, will reveal information that is not evident from pre-selected data points (Donoghue, Harrison, Coffey, & Hayes, 2008; Pataky, Robinson, & Vanrenterghem, 2013; Richter, O'Connor, Marshall, & Moran, 2014). As such, there has been increased interest in, and application of, methods for analysis of curve data within the field of biomechanics, e.g., using various techniques from functional data analysis (FDA) (Ramsay & Silverman, 2005) or statistical parametric mapping (Friston et al., 1994). Selected research includes topics of sports performance, functional developmental stages for children, and consequences of injury to the anterior cruciate ligament (see, e.g., Harrison, Ryan, & Hayes, 2007; Hébert-Losier et al., 2015; Hébert-Losier, Schelin, Tengman, Strong, & Häger, 2018; Ryan, Harrison, & Hayes, 2006; Warmenhoven et al., 2019a, 2019b).
Thus, the variables of interest are no longer restricted to single values but also include functions over a domain. From a theoretical point, such functions are assumed to have an infinite dimension. However, even when analysing it as functions, the data over the domain are typically discretised, either by a basis representation or, as here, by pointwise evaluations of the functions. The reliability and validity of the data acquired from curve analyses remains an important consideration, particularly given the increasing interest in the analysis of such data. Repetition of trials and/or observations is common practice in scientific literature when measuring task performance over time as well as the performance of the assessors. If such measures are not reliable it is impossible to establish whether an observed effect of an intervention is due to the intervention itself or if it is due instead to the error in the measurement tools or variability in motor performance. Both absolute and relative reliability must be established before interpreting and presenting conclusions of the data. Absolute reliability is the degree of variation in measurements for individuals, whereas relative reliability is the degree of variation in position among individuals over repeated measurements (Bruton, Conway, & Holgate, 2000;Weir, 2005). While several articles have reviewed, discussed and presented available methods for reliability studies for discrete data (Atkinson & Nevill, 1998;Bruton et al., 2000;De Vet, Terwee, Knol, & Bouter, 2006;Hopkins, 2000;Rankin & Stokes, 1998;Weir, 2005), the same interest has not been shown to the topic of reliability for curve data. There is no study to date that presents an overview or review of suitable methods that may be used to evaluate reliability when the outcome of interest is an entire curve. 
The lack of such an overview hinders future studies from adequately analysing and presenting the reliability of more complex data sets before drawing (possibly erroneous) conclusions from the data. Although complex data also include multidimensional curves, the focus of this manuscript is on curves in one dimension.
The purpose of this paper is therefore to provide an overview of available methods that may be used to evaluate reliability of curve data in the common case of a test-retest setting. The characteristics of these methods are described in a language adapted for the wider audience within the field of biomechanics while advantages and disadvantages are highlighted by applying them to different sets of simulated data. R-code (R Core Team, 2014) is also provided as supplementary material to enable easy and immediate implementation for the interested reader.

Overview of suitable reliability methods for curve data
A review of present literature for reliability measures in a test-retest setting related to curve analyses was performed using Google Scholar, Web of Science and Medline. This search was directed to literature related to the analysis of biomechanics and to statistics in general. Since an ambiguity in terminology exists in the literature where alternative terms like repeatability, consistency, agreement, concordance, reproducibility, and stability are used in preference to reliability (Atkinson & Nevill, 1998; Berchtold, 2016; De Vet et al., 2006), these terms were also considered in the searches. The reference lists within the relevant articles were also reviewed.
Within the biomechanics-related literature the most frequently occurring measures used to evaluate relative reliability (for discrete values) were found to be correlation coefficients, including Pearson correlation and different versions of the intraclass correlation (ICC). There are multiple guidelines available that are dedicated to analyses of such data using ICCs (Koo & Li, 2016; McGraw & Wong, 1996; Trevethan, 2017). The most common measures to evaluate absolute reliability include the standard error of measurement (SEM), the coefficient of variation and limits of agreement (LoA) (Bland & Altman, 1986). Relative and absolute measures for pre-selected data points can be applied to curve data by integrating any pointwise index (see, e.g., Kainz et al., 2017; Sankey et al., 2015; Schwartz, Trost, & Wervey, 2004). This is, however, not widely adopted in the literature. Bland–Altman plots have also been used for gait curves as a visual tool in addition to a univariate reliability study (Meldrum, Shouldice, Conroy, Jones, & Forward, 2014). A formal extension of LoA to curve data called functional LoA (FLoA) has been developed by Røislien, Rennie, and Skaaret (2012). Other identified methods and measures include the coefficient of multiple correlation (CMC), which is based on data from the entire gait cycle, and the coefficient of multiple determination (the square of the CMC) (Kadaba et al., 1989). A systematic review by McGinley, Baker, Wolfe, and Morris (2009) identified and evaluated the evidence for reliability of three-dimensional kinematic gait outcomes, but presented no additional methods for curve data.
Within the statistical literature of curve data analyses, measures of similarity or dissimilarity between curves are common (for example, Dubin & Müller, 2005;Jacques & Preda, 2014;Sangalli, Secchi, Vantini, & Vitelli, 2010), as are measures of distances (Jacques & Preda, 2014). Such measures have not yet been used for the purpose of reliability, but potentially provide new possibilities. Curve similarity has also been assessed using linear regression techniques by Iosa et al. (2014), but the main focus of the paper was on reducing the dimension of the data. Using the technique suggested by Iosa et al. (2014), the parameters describing the similarity can be used to investigate test-retest reliability.
As such, this review presents and discusses the following methods: integrated pointwise index, FLoA, CMC, as well as measures of distance and similarity. To begin with, a few key concepts are defined. These concepts are curve, domain and domain alignment, presented here in Table 1. These terms will be used throughout the paper and should be addressed whenever working with curve data.
For the presentation of the methods that follow, some common notations are introduced. Let $G$ denote the number of test sessions ($G = 2$ in the test-retest setting). Let $i = 1, \ldots, N$ denote the index of the $N$ individuals. Assume that all individuals participate in all test sessions. Let $Y$ denote the variable of interest, e.g., degree of knee flexion. The notation $Y_{gi}$ refers to the curve corresponding to individual $i$ at test occasion $g$. A curve can mathematically be described with a function over the domain $[a, b]$ (e.g., during a landing phase where $a$ = impact and $b$ = peak knee flexion angle). Thus, the variable of interest will be expressed as a function of the point $t$ of the domain, i.e. $Y_{gi} = Y_{gi}(t),\ t \in [a, b]$. It is common to discretise the domain using a uniform grid for computational reasons. In such cases, denote with $t_1, t_2, \ldots, t_T$ the discrete points of the domain, where $T$ refers to the total number of discretisation points. In this presentation, the data are assumed to be complete. In practice, there may be individuals with missing data or curves with missing fragments, but such a discussion is outside the scope of this paper.

Integrated pointwise indices
Reliability measures for univariate data have been proposed and widely studied (Bruton et al., 2000;De Vet et al., 2006;Rankin & Stokes, 1998). The extension of these measures to curve data follows by calculating so called integrated pointwise indices. Even though such an approach is intuitive and has been applied in some works (see, e.g., Kainz et al., 2017;Sankey et al., 2015;Schwartz et al., 2004) it has not been generally described. This may be one reason as to why it has not been widely adopted in papers that evaluate reliability of data originating from curves. Another reason may be that it is non-trivial to perform inference for such measures, since confidence intervals or p-values that are calculated pointwise should be presented with caution and should be adjusted to take multiplicity into account (see Discussion and Implications for more details).

Table 1. Key concepts.
CURVE
Whenever the object of the statistical analysis is a quantity that is a function of a continuous parameter (e.g. time), it can be considered curve data. Curve data are often represented as a collection of values at a finite number of points of the (time) domain, with values at neighbouring points being similar. For a general introduction to functional data analysis the interested reader is referred to Ramsay and Silverman (2005) and Sørensen, Goldsmith, and Sangalli (2013).
DOMAIN
Each curve is observed on a domain that often is, but not restricted to, time or percentage. For gait data, the domain is often defined over a gait cycle as $[0, 100]$. For other types of curve data such as knee joint angles during a one-leg hop for distance landing phase, the choice of the domain $[a, b]$ is less trivial. In general, the start and end of the domain need to be defined and common for all individuals.
DOMAIN ALIGNMENT
Gait data has a natural domain alignment (also referred to as registration or warping) where the start and end of the gait cycle often occurs at approximately the same relative (time) point for most individuals. For other applications, domain alignment should be addressed prior to any reliability analysis. Indeed, it can be used to remove differences in phase shift variation or amplitude variation that are not of interest. The simplest form is referred to as landmark registration and implies a simple transformation of the curves. More advanced methods for alignment also exist, see e.g. Marron et al. (2014) and Ramsay and Silverman (2005), as well as the references therein.

When discretising the domain in a set of points, a functional data set turns out to be a multivariate data set composed of the values of the functional data at the points. At every fixed point $t$ of the domain, a univariate data set composed of the values at point $t$ is obtained. Hence, it is always possible to compute all classical reliability measures for every point of the domain. Such indices will vary over the domain; for every different point a different value of reliability, here denoted as $R(t)$, is obtained. The functional version of a classical reliability index $R$, i.e., the integrated pointwise index, is obtained by computing the integral of $R(t)$ over the domain, normalised by the length of the domain so that the index keeps the range of its univariate counterpart:
$$R = \frac{1}{b - a} \int_a^b R(t)\, dt.$$
If the discretisation step is constant, such a quantity can be numerically estimated with
$$\hat{R} = \frac{1}{T} \sum_{j=1}^{T} R(t_j).$$
Every reliability index that is defined for univariate data can be extended to functional data by considering its integrated version. For these integrated pointwise indices, three different measures are considered: the Pearson correlation coefficient, ICC (type: absolute agreement) and SEM. These are not the only possible measures that could be used, but they are by far the most commonly used in biomechanical research. The ICC and SEM are calculated using variance components of an ANOVA. In detail, they are based on a two-way mixed effect model without interactions, following the recommendations in Trevethan (2017) for this test-retest situation. The reader is referred to the supplementary material for details of formulas.
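As an illustration of the pointwise approach, the following is a minimal Python sketch (the paper's own implementation is the supplementary R-code, which is not reproduced here). It assumes the absolute-agreement, single-measurement ICC of McGraw and Wong (1996) computed from two-way ANOVA mean squares, and it assumes SEM is taken as the square root of the error mean square; the paper's exact formulas are in its supplementary material and may differ. The integrated value is approximated by the mean of the pointwise values on a uniform grid. All function names are hypothetical.

```python
import math


def anova_mean_squares(test, retest):
    """Two-way ANOVA mean squares (no interaction) at one domain point.

    test, retest: lists with one value per individual (same order).
    Returns (MS between individuals, MS between sessions, MS error).
    """
    n, k = len(test), 2
    grand = (sum(test) + sum(retest)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(test, retest)]
    col_means = [sum(test) / n, sum(retest) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((y - grand) ** 2 for y in list(test) + list(retest))
    # Clamp at zero to guard against tiny negative values from rounding.
    ss_err = max(ss_total - ss_rows - ss_cols, 0.0)
    return ss_rows / (n - 1), ss_cols / (k - 1), ss_err / ((n - 1) * (k - 1))


def icc_agreement(test, retest):
    """ICC, absolute agreement, single measurement (assumed formula)."""
    n, k = len(test), 2
    msr, msc, mse = anova_mean_squares(test, retest)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def sem(test, retest):
    """SEM taken as sqrt(MS error); this choice is an assumption."""
    return math.sqrt(anova_mean_squares(test, retest)[2])


def integrated_index(index, test_curves, retest_curves):
    """Apply a univariate index at every grid point and average the result.

    test_curves[i][j] is the test curve of individual i at grid point t_j.
    Returns (pointwise values, integrated value).
    """
    n_points = len(test_curves[0])
    pointwise = [
        index([c[j] for c in test_curves], [c[j] for c in retest_curves])
        for j in range(n_points)
    ]
    return pointwise, sum(pointwise) / n_points
```

The pointwise list returned by `integrated_index` can be plotted against the grid to show how reliability changes over the domain, while the second return value gives the single summary number.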

Functional limits of agreement, FLoA
Functional limits of agreement (FLoA) is an extension of LoA developed by Bland and Altman (1986). It is based on the assumption that the difference between pairs of measurements, $\delta(t)$, and the corresponding average, $A(t)$, are uncorrelated. This assumption can be verified using functional linear regression (Abramowicz et al., 2018; Ramsay & Silverman, 2005, pp. 247-258). The FLoA is calculated as $\bar{\delta}(t) \pm 1.96\, s_{\delta}(t)$, where $\bar{\delta}(t)$ is the pointwise mean difference and $s_{\delta}(t)$ the corresponding standard deviation. The FLoA approach acknowledges the nature of the curve data and visualises observed differences.
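A pointwise version of this calculation can be sketched in Python as follows. This is only an illustration under stated assumptions: the difference is taken as test minus retest (the sign convention is an assumption), the multiplier 1.96 is the usual Bland–Altman value, the sample standard deviation is used, and the function name `floa_bands` is hypothetical. The sketch does not check that differences and averages are uncorrelated.

```python
import statistics


def floa_bands(test_curves, retest_curves, z=1.96):
    """Pointwise limits of agreement for test-retest curves.

    test_curves[i][j] is the test curve of individual i at grid point t_j.
    Returns three lists over the grid: lower limit, mean difference, upper limit.
    """
    n_points = len(test_curves[0])
    lower, mean_diff, upper = [], [], []
    for j in range(n_points):
        # Difference test minus retest for every individual at point t_j.
        diffs = [a[j] - b[j] for a, b in zip(test_curves, retest_curves)]
        m = statistics.mean(diffs)
        s = statistics.stdev(diffs)  # sample standard deviation
        lower.append(m - z * s)
        mean_diff.append(m)
        upper.append(m + z * s)
    return lower, mean_diff, upper
```

Plotting the three returned lists against the grid, together with the individual difference curves, gives the kind of visual summary the method is intended for.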

Coefficient of multiple correlation, CMC
The CMC is a frequently used measure of similarity between curves (Ford, Myer, & Hewett, 2007; Kadaba et al., 1989; Milner, Westlake, & Tate, 2011). The CMC value for individual $i$ is
$$\mathrm{CMC}(i) = \sqrt{1 - \frac{\sum_{g=1}^{G} \sum_{j=1}^{T} \left( Y_{gi}(t_j) - \bar{Y}_i(t_j) \right)^2 / \left( T (G - 1) \right)}{\sum_{g=1}^{G} \sum_{j=1}^{T} \left( Y_{gi}(t_j) - \bar{Y}_i \right)^2 / \left( G T - 1 \right)}},$$
where for every individual $i$, $\bar{Y}_i(t)$ is the mean curve across all test situations and $\bar{Y}_i$ is the mean value across all test situations and all time points $t = t_1, t_2, \ldots, t_T$. The overall CMC value is obtained by averaging $\mathrm{CMC}(i)$ across all individuals $i = 1, \ldots, N$. An alternative way of calculating CMC is given in Røislien, Skare, Opheim, and Rennie (2012), where the CMC is calculated across all individuals simultaneously. CMC attains values between 0 and 1, with higher values corresponding to a higher degree of similarity between curves.
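The per-individual calculation can be sketched in Python as below. The sketch assumes the within-day CMC formula of Kadaba et al. (1989); the function names are hypothetical. When the term under the square root is negative, or the total variation is zero, the value is returned as NaN, and the overall value averages only the defined individual values, which is an assumption about how undefined cases are handled.

```python
import math


def cmc_individual(curves):
    """CMC for one individual.

    curves: list of G curves (one per test session), each a list of T values.
    Returns NaN when the CMC is undefined.
    """
    n_sessions, n_points = len(curves), len(curves[0])
    # Mean curve across sessions and grand mean across sessions and points.
    mean_curve = [sum(c[j] for c in curves) / n_sessions for j in range(n_points)]
    grand_mean = sum(mean_curve) / n_points
    within = sum(
        (c[j] - mean_curve[j]) ** 2 for c in curves for j in range(n_points)
    ) / (n_points * (n_sessions - 1))
    total = sum((y - grand_mean) ** 2 for c in curves for y in c) / (
        n_sessions * n_points - 1
    )
    if total == 0.0:
        return float("nan")
    radicand = 1.0 - within / total
    return math.sqrt(radicand) if radicand >= 0.0 else float("nan")


def cmc_overall(individuals):
    """Average CMC over the individuals for which it is defined."""
    values = [cmc_individual(curves) for curves in individuals]
    defined = [v for v in values if not math.isnan(v)]
    return sum(defined) / len(defined) if defined else float("nan")
```

Counting how many individuals yield NaN before averaging is a simple way to notice when the overall value rests on only part of the sample.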

Distance measures
In the FDA literature, the distance between curves is often used to measure how similar they are. The most commonly used measure of distance between curves is the $L^2$-distance, the extension of the Euclidean distance to functional data. An overview of other distances can be found in Jacques and Preda (2014) and references therein. In detail, for individual $i$, the $L^2$-distance between test and retest curves is
$$d_{L^2}(i) = \sqrt{\int_a^b \left( Y_{1i}(t) - Y_{2i}(t) \right)^2 dt}.$$
If curve data are discretised on a grid of points on the domain $t = t_1, t_2, \ldots, t_T$ with a constant discretisation step $\Delta t = (t_T - t_1)/T$, the distance can numerically be estimated with
$$\hat{d}_{L^2}(i) = \sqrt{\Delta t \sum_{j=1}^{T} \left( Y_{1i}(t_j) - Y_{2i}(t_j) \right)^2}.$$
As for CMC, an overall measure of reliability is obtained by averaging $d_{L^2}(i)$ across all individuals:
$$d_{L^2} = \frac{1}{N} \sum_{i=1}^{N} d_{L^2}(i).$$
If the average distance is large, data are not reliable, and vice-versa.
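A minimal Python sketch of the discretised distance is given below. It assumes a uniform grid with the step passed in explicitly as `dt`; the function names are hypothetical.

```python
import math


def l2_distance(test, retest, dt):
    """Discretised L2-distance between one individual's test and retest curves.

    test, retest: lists of values on the same uniform grid; dt: grid step.
    """
    return math.sqrt(dt * sum((y1 - y2) ** 2 for y1, y2 in zip(test, retest)))


def mean_l2_distance(test_curves, retest_curves, dt):
    """Average of the individual distances across all individuals."""
    distances = [l2_distance(a, b, dt) for a, b in zip(test_curves, retest_curves)]
    return sum(distances) / len(distances)
```

Because the result carries the unit of the data and scales with the grid step, multiplying all curves by a constant multiplies the distance by the same constant, which is why values are only comparable within one measurement setting.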

Similarity measures
Similarity measures between curves (from the FDA literature) that do not depend on the unit of measurement and that always attain values in the range $[-1, 1]$ can be found in Dubin and Müller (2005) and Sangalli et al. (2010). The simplest example of a similarity measure between two curves is the functional correlation $S$, computed for individual $i$ between the two curves of test and retest, $Y_{1i}(t)$ and $Y_{2i}(t)$, and denoted $S(i)$. Again, if curve data are discretised on a grid of points of the domain with constant discretisation steps, the similarity can be numerically estimated from the values at the grid points. Note that in this case, it is not necessary to multiply the similarity measure by the discretisation step $\Delta t$, since it is included in both the numerator and the denominator. For the same reason, $S(i)$ is a dimensionless and standardised quantity. A global measure of reliability is obtained by averaging $S(i)$ over all $i = 1, \ldots, N$. This similarity measure assumes high values (close to 1) in case of very similar curves, and low values (close to −1) in case of very different curves.

Simulation description
Data were generated according to eight different scenarios in the test-retest setting to highlight different properties of the methods under comparison. All scenarios were replicated 1000 times, and one instance (replicate) of simulated data for all scenarios is reported in Figure 1. The same random seed was used for each scenario. For each scenario, test/retest curves are plotted in red/blue, with the same line type corresponding to the same individual. As a complement, Figure 2 displays the difference between test and retest for each individual on the same instance of the data. Simulations were carried out with sample sizes of 10, 20 and 30 individuals. Results for different sample sizes were very similar, so only the sample size of 10 is presented, for clearer visualisation. Further, similar results were obtained when selecting a different number of points to discretise the domain; however, too few points are not appropriate, since this might lead to a misrepresentation of the real data. In this paper, results for T = 100 are presented, but simulations were carried out for T = 10, 20, 50, 100, 150, 200. Starting from T = 50 the results are very similar to those presented, with differences within two standard deviations from the average value in all scenarios.
The eight scenarios were divided into two different situations, one where the between-individual variability was higher than the within-individual variability (scenarios A1, B1, C1 and D1) and one where the between-individual variability was lower than the within-individual variability (scenarios A2, B2, C2 and D2). In scenarios A1 and A2, all curve data within each scenario were sampled from the same distribution. To resemble real-life situations, scenarios included cases with simulated patient and control groups, represented in the figures by light and dark colours, respectively. In scenarios B1 and B2, the two groups differed in mean but had the same error distribution. Curves of patients (light) assume higher values with respect to curves of the controls (dark). In scenarios C1 and C2, data of controls were simulated as in scenarios A1 and A2, but data of patients always assumed lower values in the test than in retest. The shape of curve data was consistently similar within and between individuals for these first six scenarios. Finally, scenarios D1 and D2 showed two cases where the shape of the curve data of test and retest was different, with two maxima for the test curves and only one maximum for the retest. There were no simulated group differences in scenarios A1, A2, D1 and D2.
Based on the construction of the scenarios, a high reliability is expected in scenarios A1 and B1, characterised by high variability between individuals with respect to the variability within individuals and groups of individuals that follow the same model in test-retest. In scenario B2, two groups of individuals are clearly identified both in the test and retest sessions, where the variability between individuals was low within each group. Hence, a high overall reliability is also expected in this scenario, since group differences are not considered. In all other scenarios, a lower reliability is expected.
Figure 1. Simulated data including parameters. The plots show an instance of simulated data for each scenario. The curves of test/retest are plotted in red/blue, and the same line type corresponds to the same individual. Curves of patient/control groups are plotted in light/dark colour, respectively. Note that the two groups differ only in scenarios B and C. Additionally, the parameter setting for each simulated scenario is presented. [The reader is referred to the electronic version of this paper for the colour representation.]
Figure 2. Simulated data differences. The difference between test and retest for each individual and each scenario (same instance as in Figure 1). Curves of patient/control groups are plotted in light/dark colour, respectively. Note that the two groups differ only in scenarios B and C.

Technical details of the simulation
Data were simulated according to the model
$$Y_{gi}(t) = f_i(t) + \varepsilon_{gi}(t),$$
where $f_i(t)$ was the true curve of individual $i$ evaluated at time point $t$ and $\varepsilon_{gi}(t)$ was an error function. All functions were obtained by simulating coefficients of a cubic B-spline basis on $[0, 100]$ with five basis functions. The five coefficients of the basis expansion of $f_i(t)$ were independent and normally distributed, with means and variances denoted $m_{f_i}$ and $\sigma_B$, respectively. The parameter vector $\sigma_B$ represents variability between different individuals. Coefficients of the basis expansion of $\varepsilon_{gi}(t)$ were also assumed to be independent and normally distributed. The corresponding notations for the mean and variances were $m_{\varepsilon_{gi}}$ and $\sigma_W$, respectively. The parameter vector $\sigma_W$ represents variability within each individual.
As mentioned, eight different scenarios were considered in the simulations by varying parameters $m_{f_i}$, $m_{\varepsilon_{gi}}$, $\sigma_B$, and $\sigma_W$. Within-individual variability was constant and equal to one in all scenarios. In scenarios A1 and A2 parameters $m_{f_i}$ and $m_{\varepsilon_{gi}}$ were the same for each individual and for both test sessions. In scenarios B1 and B2 two groups of individuals were simulated, with different values of $m_{f_i}$. Difference between groups was equally expressed in the first test session ($g = 1$) and in the second test session ($g = 2$). In scenarios C1 and C2 two different groups of individuals react differently in the two test sessions, simulated using different values of both $m_{f_i}$ and $m_{\varepsilon_{gi}}$. Finally, in scenarios D1 and D2, the two different groups reacted differently in the two test sessions, but only in a part of the domain, simulated by changing the parameter $m_{\varepsilon_{gi}}$. All parameter values are presented in Figure 1, in addition to one instance (replicate) of simulated data for all scenarios. The R-code for generating the data is provided as supplementary material.
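The data-generating model can be sketched in Python as follows; the paper's own generator is the supplementary R-code, which is not reproduced here. Several details are assumptions made only for illustration: the cubic basis with five functions is built with clamped knots and a single interior knot at 50, the variability parameters are passed as standard deviations, and all parameter values used in a call are hypothetical (the actual values are given in Figure 1). Function names are also hypothetical.

```python
import random

# Clamped knots for a cubic basis with five functions on [0, 100] (assumed layout).
KNOTS = [0, 0, 0, 0, 50, 100, 100, 100, 100]


def bspline_basis(t, knots=KNOTS, degree=3):
    """Evaluate all B-spline basis functions at t (Cox-de Boor recursion)."""
    n_int = len(knots) - 1
    # The last non-empty knot interval is closed on the right end.
    last = max(i for i in range(n_int) if knots[i] < knots[i + 1])
    basis = [
        1.0 if (knots[i] <= t < knots[i + 1]) or (i == last and t == knots[i + 1])
        else 0.0
        for i in range(n_int)
    ]
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - d - 1):
            left = right = 0.0
            if knots[i + d] > knots[i]:
                left = (t - knots[i]) / (knots[i + d] - knots[i]) * basis[i]
            if knots[i + d + 1] > knots[i + 1]:
                right = ((knots[i + d + 1] - t)
                         / (knots[i + d + 1] - knots[i + 1]) * basis[i + 1])
            nxt.append(left + right)
        basis = nxt
    return basis


def random_curve(means, sds, grid, rng):
    """Draw independent normal coefficients and evaluate the spline on the grid."""
    coefs = [rng.gauss(m, s) for m, s in zip(means, sds)]
    return [sum(c * b for c, b in zip(coefs, bspline_basis(t))) for t in grid]


def simulate_test_retest(n, n_points, m_f, sd_b, m_eps, sd_w, seed=1):
    """Simulate Y_gi(t) = f_i(t) + eps_gi(t) for n individuals and two sessions.

    m_eps holds one mean vector per session. Returns data[i][g][j].
    """
    rng = random.Random(seed)
    grid = [100 * j / (n_points - 1) for j in range(n_points)]
    data = []
    for _ in range(n):
        f = random_curve(m_f, sd_b, grid, rng)  # true curve of the individual
        sessions = []
        for g in range(2):
            eps = random_curve(m_eps[g], sd_w, grid, rng)  # session-specific error
            sessions.append([a + b for a, b in zip(f, eps)])
        data.append(sessions)
    return data
```

For example, `simulate_test_retest(10, 100, [0, 2, 4, 2, 0], [3] * 5, [[0] * 5, [0] * 5], [1] * 5)` would produce ten individuals with larger between-individual than within-individual spread, using made-up parameter values.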

Advantages and disadvantages of the methods
The reliability results of the simulation are reported in Table 2. In detail, for each considered reliability measure, the average index over the 1000 simulated replicates is reported, with its standard deviation in parentheses. In scenarios presenting two distinct groups of individuals (B1, B2, C1 and C2), both a common measure of reliability (ignoring the groups) and the inter-group reliability were computed. The integrated measures can also be displayed pointwise over the domain, as illustrated for ICC in Figure 3 for one instance of data. Results for FLoA are presented in Figure 4, since they are visual and have no numerical index. In the following, each of the presented methods is discussed in relation to the results of the simulation, to highlight the advantages and disadvantages of each method.

Integrated pointwise indices
When integrating a reliability index over the domain, the range, properties and interpretation of the index remain the same as the original univariate version. This is an advantage since it makes interpretation straightforward. For the same reason, potential disadvantages of the chosen univariate index will transfer to its integrated version. Integrated pointwise indices perform generally well in the sense that they capture the specific aspects of the different scenarios. Integrated Pearson correlation has the desired features for the first scenarios, but it attains high values (close to 1) in scenario D1, where data are not reliable in the second part of the domain. This is explained by the fact that in this scenario there is a constant mean difference between test and retest that is not captured by Pearson correlation, which only focusses on relative differences between data and the corresponding means. For this reason, the correlation values in scenarios D1 and D2 coincide with values in scenarios A1 and A2, respectively. Hence, Pearson correlation is not recommended for reliability analyses when the actual values of data, the absolute agreement, is of interest.
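The insensitivity of Pearson correlation to a constant shift between test and retest can be checked with a few lines of Python. The numbers below are hypothetical values at a single domain point, chosen only for illustration, and the function name is likewise hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical knee flexion angles (degrees) for five individuals.
test = [50.0, 55.0, 61.0, 48.0, 58.0]
retest = [51.0, 54.0, 62.0, 47.0, 59.0]
# Same retest values with a constant 10-degree shift added to everyone.
retest_shifted = [y + 10.0 for y in retest]

r_original = pearson(test, retest)
r_shifted = pearson(test, retest_shifted)
```

The two coefficients agree up to rounding although the shifted retest values disagree with the test values by about ten degrees for every individual, which is the behaviour that makes Pearson correlation unsuitable when absolute agreement matters.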
Integrated ICC displays high values (close to 1) in scenarios A1, B1 and B2 and low values in all other cases (see Table 2). This means that it is sensitive to changes in variability between individuals (scenarios A1 and A2), to changes in at least one group of individuals between test and retest (scenarios C1 and C2) and to changes over parts of the domain (scenarios D1 and D2, see Figure 3).
Integrated SEM behaves differently with respect to ICC, but differences are in line with its definition and interpretation. Indeed, SEM is a measure of variability and is expected to be high and low, respectively, in unreliable and reliable situations. The implication of high and low depends on the application. In addition, while changing the between variability parameter as shown for scenarios A1 compared to A2 and B1 compared to B2, the same SEM values are expected. SEM attains low values in scenarios A1, A2, B1 and B2. The SEM has the same unit of measurement as the data and, as such, it has to be compared with the magnitude of data. This is one reason as to why it is recommended and commonly presented in addition to reliability measures such as the ICC.
It should be noted that the indices ICC and SEM are based on variance components from an ANOVA, with an underlying assumption that the model is suitable. Explicit formulas used in the calculations are provided as supplementary material. If formulas for ICC type consistency were used instead of absolute agreement, a high reliability would be observed for scenarios with a constant shift in mean but the same relative differences between test and retest (such as for Pearson correlation coefficient). However, a high reliability (ICC close to one and a relatively small SEM) is not warranted in those scenarios.
For integrated reliability indices, it can also be useful to visualise the pointwise values of the index (Figure 3). The value of ICC is nearly constant over the whole domain for five of the scenarios. Further, the expected change over the domain is visible in scenario D1, with high values only in the first part of the domain. In such cases, the choice of the domain of the functional data is important since it has a clear impact on the results and its conclusions. To complement the pointwise index with pointwise confidence bands is straightforward from the univariate case. However, producing functional confidence bands taking into consideration the nature of the data is less straightforward and outside the scope of this paper.

Functional limits of agreement, FLoA
The FLoA results reported in Figure 4 show a high reliability in scenarios A1, A2, B1 and B2, as well as in parts of the domain for scenarios D1 and D2. The method is not able to distinguish between scenarios where the between-individual variability is higher than the within-individual variability (e.g., C1 compared to C2). FLoA is visually easy to interpret since it provides pointwise bands for the difference between test and retest data, but no global index is provided. Deciding whether the observed difference is within acceptable limits is therefore up to the interpreter. The method relies on the assumption that $\delta(t)$ and $A(t)$ are uncorrelated, which has to be checked through functional linear regression. This may however not be trivial to check and requires additional computational knowledge.

Coefficient of multiple correlation, CMC
The CMC is very similar (and close to its maximum value of 1) for the first six scenarios in Table 2 (A1, A2, B1, B2, C1, C2). Several methodological shortcomings have been reported for the CMC (Chia & Sangeux, 2017; Sangeux, Passmore, Graham, & Tirosh, 2016). The method does not work for curves with a small ROM, since in such cases the term under the square root can be negative and CMC is undefined. This is illustrated in scenarios C1 and C2, where the CMC for the second group is undefined. In such cases, the common CMC value is computed as the average among individuals of the first group only. This is why the indices for C1 and C1 group 1 are identical and high. This computational disadvantage may lead to the mistaken conclusion that reliability is high when it is not. Further, the statistical properties of CMC are related both to the number of individuals and to the number of test situations. It does not take into account the variability between individuals. As such, CMC is not able to distinguish between low and high between-individual variation, as in scenarios A1 and A2. Also, the CMC provides only an overall value, with no information on the reliability throughout the domain.

Distance measures
The distance between two curves is always defined and so, unlike CMC, it is always possible to provide a measure of reliability with this method. The same distance value is observed between the first four scenarios (A1, A2, B1, B2), between scenarios C1 and C2, and between scenarios D1 and D2 (Table 2). This occurs because the distance between test and retest curves only depends on the within-individual variability, i.e., the difference between test and retest measurements. As constructed, the variability between individuals is ignored, as for the CMC. In the mentioned scenarios the parameters characterising this difference are the same and the same random seed is used; hence, the same distance is observed. The average distance is very low in scenarios A1, A2, B1 and B2, quite high in scenarios D1 and D2, and very high in scenarios C1 and C2.
Although this method seems intuitive it suffers from a few problems. First, the distance between two curves depends on the unit of measurement of data. The distance can be standardised with respect to the unit of measurement, but no thresholds can be defined. The observed distance is also dependent on the domain, since the discretisation step is a multiplying factor in the estimation. It is also highly influenced by the low reliability in one of the two groups as presented in Table 2 (see scenarios C1 and C2). The distance measure provides an overall value, thus with no information of the reliability throughout the domain.

Similarity measures
The similarity measure is always well-defined and is a dimensionless quantity measured on a specific range, making its interpretation easy. However, similar to the CMC and the distance measures, it neglects variability between individuals. As a result, nearly identical similarity values are obtained for very different scenarios, as displayed, for instance, in scenarios A1 and A2. Similar to the distance measure, the similarity measure provides no information about the reliability throughout the domain.
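A minimal sketch of why such a measure cannot reflect between-individual variation is given below. It assumes a cosine-type similarity between the test and retest curves of each individual, averaged over individuals; the paper's exact definition is not restated in this section, so `cosine_similarity`, `mean_similarity`, and the toy cohorts (individuals differing only in amplitude) are hypothetical.

```python
import numpy as np

def cosine_similarity(test, retest):
    """Cosine-type similarity between two sampled curves, with values in [-1, 1]."""
    num = np.sum(test * retest)
    den = np.sqrt(np.sum(test ** 2) * np.sum(retest ** 2))
    return float(num / den)

t = np.linspace(0.0, 1.0, 101)
base = np.sin(2 * np.pi * t)
shifted = np.sin(2 * np.pi * (t - 0.02))  # retest shape with a small phase shift

def mean_similarity(amplitudes):
    """Average test-retest similarity over individuals with the given amplitudes."""
    return float(np.mean([cosine_similarity(a * base, a * shifted)
                          for a in amplitudes]))

# Cohort without, and cohort with, large between-individual variation.
sim_homogeneous = mean_similarity([10.0, 10.0, 10.0, 10.0])
sim_heterogeneous = mean_similarity([2.0, 8.0, 15.0, 40.0])

print(sim_homogeneous, sim_heterogeneous)
```

Because each similarity value is computed from one individual's test and retest curves only, the two cohorts receive the same average similarity even though they differ strongly in between-individual variation.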

Discussion and implications
This review provides an overview of available methods for test-retest reliability when the data of interest are curves (also called functional data). Overall, we recommend the use of the integrated pointwise indices, specifically ICC and SEM, to investigate reliability of curve data. These methods have the advantages of being straightforward to implement, providing results over the domain, and considering variation between individuals. As such, the methods are relevant for both the researcher and the sports coach.
It is important to realise that an observed curve consists of two parts: a true curve and an error component. The error component corresponds to all different sources of variation, both systematic and random. In the test-retest situation the variation can originate from the individual (fatigue, mood), the test leader (instructions), the test situation (practicalities around the test), the instrument (technical, the approach), and the processing (coding) (Chau, Young, & Redekop, 2005; Gorton, Hebert, & Gannotti, 2009; Sankey et al., 2015; Srinivasan & Mathiassen, 2012). Some sources of variation should always be controlled, e.g., by utilising clear instructions and rigorous routines. Applicable to data collected from humans, Kottner et al. (2011) argued that reliability may be defined as the ratio of variability between persons to the total variability of all measurements in the sample. Put differently, reliability may be defined as the ability of a measurement to distinguish between these persons. Since biomechanical research often aims to evaluate effects of injury, diagnosis, or level of expertise, a suitable statistical method employed to investigate reliability should incorporate variability both within and between individuals. Indeed, motor variability in human tasks is considered an important factor theorised to be regulated by the brain to, e.g., avoid overuse injuries (Srinivasan & Mathiassen, 2012). The variability of curve data may in itself be relevant for comparisons between different cohorts. As an example, greater variability for intralimb couplings has been displayed among persons with an injury to the anterior cruciate ligament when compared to asymptomatic persons (Gribbin et al., 2016; Pollard, Stearns, Hayes, & Heiderscheit, 2015). Further, not considering the variability between individuals may increase the risk of concluding that data are reliable when in fact they are not. The evaluation of such erroneously reliable data may result in nonsignificant findings for effects that may be real and detectable if evaluated more appropriately. Such conclusions may be devastating for the targeted population of interest and hinder the development of future knowledge.
Another important aspect that needs special attention is the integrated version of the reliability index along the domain, which is essentially an average of the index values for the considered time points. The interpretation of the integrated pointwise ICCs and SEMs is the same as that of the original quantities for discrete data. Hence, it is possible to use the same thresholds. However, the choice of appropriate thresholds for discrete data is often somewhat arbitrary, and supporting theoretical results are lacking. In addition to the integrated pointwise index, a figure that shows the pointwise index over the curve provides, at least visually, descriptive local information about the shape of the function R(t) along the domain. A problem may arise if data are highly reliable in one part of the domain and highly unreliable in another part of the domain, as shown by the simulated example (see Figures 3 and 4, scenarios D1 and D2). Indeed, greater variability in knee moments has been displayed in the weight acceptance phase (0-25% of landing) compared to later periods of the landing phase during a side-cutting task (Sankey et al., 2015). As such, the initial choice of the domain of interest can strongly impact the result and needs careful consideration. Applied sport scientists should discuss the obtained results using graphs of pointwise indices rather than only checking whether the integrated value is above or below a certain threshold. These graphs highlight particular phases over the domain that may be more or less reliable, thus aiding sports biomechanists and coaches in the evaluation of biomechanical data for movement technique or performance of athletes. Note, however, that such figures are only descriptive and are not suited for inferential identification of reliable/unreliable phases.
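The relation between the pointwise index and its integrated version can be sketched as follows. This is an illustration under stated assumptions and not the supplementary R code: it assumes a one-way random-effects ICC, often denoted ICC(1,1), computed from mean squares at each time point, a SEM taken as the square root of the within-individual mean square, and integration as a plain average over the time points. The function `pointwise_icc_sem` and the simulated data (small error in the first half of the domain, large error in the second half) are hypothetical and do not reproduce scenarios D1 and D2.

```python
import numpy as np

def pointwise_icc_sem(test, retest):
    """Pointwise one-way ICC and SEM for test-retest curves.

    test, retest: arrays of shape (n_individuals, n_timepoints).
    Returns (icc_t, sem_t), each of length n_timepoints.
    """
    y = np.stack([test, retest], axis=1)          # shape (n, k=2, T)
    n, k, _ = y.shape
    subj_mean = y.mean(axis=1)                    # (n, T): mean of test and retest
    grand_mean = y.mean(axis=(0, 1))              # (T,): mean over everything
    # Between- and within-individual mean squares at each time point.
    msb = k * ((subj_mean - grand_mean) ** 2).sum(axis=0) / (n - 1)
    msw = ((y - subj_mean[:, None, :]) ** 2).sum(axis=(0, 1)) / (n * (k - 1))
    icc_t = (msb - msw) / (msb + (k - 1) * msw)
    sem_t = np.sqrt(msw)
    return icc_t, sem_t

rng = np.random.default_rng(1)
t = np.linspace(0.0, 100.0, 101)                  # domain in percent of the task
n = 30
offsets = rng.normal(0.0, 5.0, size=(n, 1))       # between-individual variation
true_curves = 30.0 * np.sin(2 * np.pi * t / 100.0) + offsets
noise_sd = np.where(t < 50.0, 0.5, 10.0)          # reliable first half, unreliable second
test = true_curves + rng.normal(size=(n, t.size)) * noise_sd
retest = true_curves + rng.normal(size=(n, t.size)) * noise_sd

icc_t, sem_t = pointwise_icc_sem(test, retest)
icc_integrated = float(icc_t.mean())              # average over the time points
sem_integrated = float(sem_t.mean())

print(icc_integrated, sem_integrated)
```

In this toy example the integrated ICC takes a moderate value that hides the contrast between a highly reliable first half and a poorly reliable second half of the domain, which only becomes visible when `icc_t` and `sem_t` are inspected or plotted against `t`.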
Another issue to consider is the possible misalignment of curve data that may negatively affect the results of a reliability analysis (e.g., phase shift in the start of the swing phase during a gait cycle). Therefore, the key concepts of domain and domain alignment for curve data (Table 1) deserve special attention and should be clearly stated. Here, the data are assumed to be aligned prior to the reliability analysis. This is naturally the case for curve data of gait cycles, where the domain is typically expressed as cycle percentage. In other cases, however, the original data are not aligned, and one should carefully perform domain alignment prior to the analysis. For many applications, landmark registration is a suitable method for domain alignment (Moudy, Richter, & Strike, 2018). Indeed, in most cases, it is possible to identify landmarks on the curves, which often correspond to events of interest. Using landmark registration in connection to pointwise reliability indices has some sound properties. For instance, the pointwise reliability index at the landmark corresponds to the value of the reliability index that would be computed on univariate data that only refer to the specific landmark. With careful consideration, more flexible methods may also be used (Marron, Ramsay, Sangalli, & Srivastava, 2014). However, the more flexible the alignment technique is, the less interpretable the results tend to be. With a flexible alignment, the concept of domain itself risks being difficult to interpret, since after aligning the curves, the domain is neither time nor percentage of the task (except when landmark alignment is performed). It is important to emphasise that these key concepts should be comprehensively discussed and decided upon in advance of data analysis, and in relation to the specific research question, to attain the best possible results.
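As an illustration of domain alignment, the following sketch registers two curves to a common landmark. It is a deliberately simplified version of landmark registration, assuming a single interior landmark (here the peak), a piecewise-linear warping function, and fixed end points of the domain; the function `landmark_register`, the helper `bump`, and the target position are hypothetical choices made for this example.

```python
import numpy as np

def landmark_register(curve, t, landmark_time, target_time):
    """Piecewise-linear landmark registration with one interior landmark.

    The domain is warped so that the event observed at landmark_time is
    moved to target_time, while both end points of the domain stay fixed.
    """
    # Warping function h maps registered time to original time.
    h = np.interp(t, [t[0], target_time, t[-1]], [t[0], landmark_time, t[-1]])
    # Registered curve evaluated on the original grid: curve(h(t)).
    return np.interp(h, t, curve)

t = np.linspace(0.0, 100.0, 1001)   # domain in percent of the task

def bump(peak):
    """Toy curve with a single peak at the given position."""
    return np.exp(-0.5 * ((t - peak) / 8.0) ** 2)

test, retest = bump(40.0), bump(55.0)   # the same event occurs at 40% and at 55%
target = 47.5                           # common landmark position after alignment

test_reg = landmark_register(test, t, t[np.argmax(test)], target)
retest_reg = landmark_register(retest, t, t[np.argmax(retest)], target)

peak_test = t[np.argmax(test_reg)]
peak_retest = t[np.argmax(retest_reg)]

print(peak_test, peak_retest)
```

After registration both peaks are located at the target position, so a pointwise index evaluated there refers to the same event in test and retest. Between landmarks, however, the registered domain is a warped version of the original percentage scale.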
Finally, it is important to point out that no inferential procedure is presented here. Inference (tests with p-values or confidence bands) for, e.g., the ICC at each point of the domain would require adjustments due to multiple comparison issues. Within the FDA framework, techniques for adjusting p-values have been presented (Abramowicz et al., 2018; Pataky et al., 2013; Vsevolozhskaya et al., 2013). In future studies, such methods may be used in the reliability setting.

Conclusion
This paper presented advantages and disadvantages of the available methods for assessing reliability of curve data in a test-retest setting, in order to provide recommendations for future research. A summary of these findings is presented in Table 3. It is recommended to use the integrated pointwise indices, specifically ICC and SEM, since the methods evaluate reliability of curve data in an appropriate way. The use of such analyses is encouraged in order to avoid drawing potentially erroneous conclusions from the data. R code is provided as supplementary material to facilitate the implementation of the recommended analysis methods.

Table 3. Summary of advantages and disadvantages of the methods.

Integrated pointwise indices
Disadvantages: Potential disadvantages of the chosen univariate measure will transfer to this index.

Functional limits of agreement
Advantages: Acknowledges the nature of the curve data and enables results over the domain.
Disadvantages: The user must decide on the acceptable limits. Does not take into account the variation between individuals. It may be computationally demanding to check assumptions.

Coefficient of multiple correlation (CMC)
Disadvantages: Computational errors and possibly not defined when the range of motion is small. Does not take into account the variation between individuals. Only presented as a global measure.

Distance measure
Advantages: Easy to compute and always well-defined.
Disadvantages: The user must decide on the acceptable limits. Does not take into account the variation between individuals. Only presented as a global measure.

Similarity measure
Advantages: Easy to compute, always well-defined, and attains values in [−1, 1].
Disadvantages: Does not take into account the variation between individuals. Only presented as a global measure.