Uncertainty about Rater Variance and Small Dimension Effects Impact Reliability in Supervisor Ratings

ABSTRACT We modeled the effects commonly described as defining the measurement structure of supervisor performance ratings. In doing so, we contribute to different theoretical perspectives, including components of the multifactor and mediated models of performance ratings. Across two reanalyzed samples (Sample 1, N ratees = 392, N raters = 244; Sample 2, N ratees = 342, N raters = 397), we found a structure primarily reflective of general (>27% of variance explained) and rater-related (>49%) effects, with relatively small performance dimension effects (between 1% and 11%). We drew on findings from the assessment center literature to approximate the proportion of rater variance that might theoretically contribute to reliability in performance ratings. We found that even moderate contributions of rater-related variance to reliability resulted in a sizable impact on reliability estimates, drawing them closer to accepted criteria.

Performance ratings hold a central role in applied psychology and human resource management as a developmental aid, an indicator of individual performance, and as a criterion in validation studies (Aguinis, 2019; DeNisi & Murphy, 2017; Murphy & Cleveland, 1995; O'Neill, McLarnon, & Carswell, 2015). The performance of employees is often evaluated by their supervisors. A long-standing concern with supervisor ratings is the substantial accumulated evidence that they lack adequate reliability (Murphy, 2008; Thorndike, 1920). Researchers in the field estimate the interrater reliability of supervisor ratings at only around .52 (Rothstein, Schmidt, Erwin, Owens, & Sparks, 1990; Schmidt, Viswesvaran, & Ones, 2000; Viswesvaran, Ones, & Schmidt, 1996), a figure well below levels typically regarded as acceptable in the psychometric literature (e.g., Lance, Butts, & Michels, 2006; LeBreton, Scherer, & James, 2014).
The .52 estimate for the reliability of performance ratings assumes that all rater-related variance in the evaluation of individual performance contributes to unreliability. This assumption has been challenged by researchers (Murphy & DeShon, 2000a, 2000b; Putka, Hoffman, & Carter, 2014), who argue that between-rater differences may arise from raters being exposed to, or focusing on, different aspects of each ratee's performance. This position implies that variability between raters might not only reflect variance that contributes to unreliability, but also reliable information of value to the evaluation of ratees.
To progress an understanding of these issues, it would be of assistance to establish the underlying measurement structure of performance ratings and to identify the degree of variance directly associated with rater-related effects. Once rater-related effects have been modeled and their magnitude estimated, informed decisions can be made about their potential contribution to reliability.
In more recent literature, and touching on a related criticism, Murphy (2008) discussed the tenuous correspondence between performance and performance ratings, suggesting that such ratings fail to fulfil their intended purpose. Even within the last decade, renewed interest in and debate surrounding the status of performance ratings emerged with a focal article by LeBreton et al. (2014). LeBreton et al. raised the question: "Why are performance ratings allowed to survive in spite of what most would agree is abjectly problematic measurement?" (p. 482). The authors described performance ratings as "fundamentally flawed" measures in which "~50% of the observed variance is measurement error" (p. 482).

Generalizability (G) theory and reliability estimation
The Schmidt et al. (2000) contention that interrater reliability is the only estimate relevant to corrections for unreliability in performance ratings has not gone without challenge. In particular, Murphy and DeShon (2000a) suggested the application of generalizability theory (G theory) to estimate reliability in this context. Unlike the process by which interrater reliability is traditionally estimated, G theory can be used to simultaneously model multiple effects relevant to the measurement structure of performance ratings. This allows for direct, statistically partialled comparisons between key variance components, including those relating to general, behavioral rating, dimension, and rater effects. G theory permits researchers to classify sources of rater-related variance as a contribution to unreliability or, alternatively, as reliable, systematic effects reflective of the rater's perspective on a given ratee. Unlike classical approaches to psychometrics, it thus facilitates researcher decisions on how to define multiple sources of universe (akin to true) score in contrast to multiple potential sources of unreliability and other, uncategorized sources of variance 2 (Brennan, 2001). Moreover, G theory can be used to provide a detailed evaluation of which measurement-design-relevant effects uniquely contribute to reliable variance and which contribute to unreliable variance, 3 and thus has the potential to inform theory on the structure of performance ratings.
With a G theory approach, once a complete set of effects relevant to a measurement design is estimated, it is possible to approximate the consequence of aggregating ratings into different types of summary score. Aggregation can have the effect of changing the proportion of variance associated with specific effects in a measurement structure (Putka & Hoffman, 2013). Sets of rating items might be aggregated to form dimension scores, dimension scores could then be aggregated across different raters, or aggregation could occur across all rating items, dimensions, and raters to arrive at overall scores. All three of these approaches to aggregation could result in different reliability outcomes, as has been suggested in other research contexts (Jackson, Michaelides, Dewberry, & Kim, 2016; Putka & Hoffman, 2013).
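The arithmetic behind these aggregation effects can be sketched with a toy example. The variance components and facet sizes below are hypothetical, not estimates from any study; the sketch simply shows how averaging over items and raters shrinks the error terms entering a G coefficient.

```python
# Illustrative sketch (hypothetical variance components): how aggregating
# ratings across items and raters changes a G (generalizability) coefficient.
# Values and facet sizes are invented for illustration only.

def g_coefficient(universe, error_terms):
    """G coefficient = universe-score variance / (universe + error variance)."""
    return universe / (universe + sum(error_terms))

# Hypothetical variance components for a ratee x rater x item design
var_p = 0.30    # ratee (universe score) main effect
var_r = 0.45    # rater main effect (treated here as error)
var_res = 0.25  # residual

n_items, n_raters = 10, 2

# Single item, single rater: full error variance enters the denominator
g_single = g_coefficient(var_p, [var_r, var_res])

# Aggregated overall score: each error term shrinks by the number of
# observations averaged over (a Spearman-Brown-like mechanism)
g_overall = g_coefficient(var_p, [var_r / n_raters,
                                  var_res / (n_items * n_raters)])

assert g_overall > g_single  # aggregation raises reliability
print(round(g_single, 3), round(g_overall, 3))
```

Because different summary scores (dimension scores, scores pooled across raters, overall scores) average over different facets, each divides a different subset of error terms, which is why the three aggregation strategies can yield different reliability outcomes.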

Extant G theory analyses of performance ratings
The measurement design for performance ratings is typically described as involving raters evaluating assessees on rating items nested in each of several performance dimensions (Bennett, Lance, & Woehr, 2006; Murphy & Cleveland, 1995; Murphy & DeShon, 2000a; O'Neill et al., 2015). We were unable to find an analysis that partialled effects discussed in the literature as being primarily relevant to this design (i.e., inclusive of raters, ratees, items, and performance dimensions). In the empirical studies we reviewed that investigated multiple effects, rater-related variance was always treated as contributing to unreliable variance (see Putka & Hoffman, 2013 for further discussion on this issue). The idea that at least some portion of rater variance might contribute to universe score (see Murphy & DeShon, 2000a, 2000b) does not appear to have been investigated empirically in this context.

2 Uncategorized sources of variance are those that are neither relevant to universe score nor to unreliability, irrespective of the measurement intentions of the researcher. For example, when comparing across ratees, the main effect for items has no bearing on the rank ordering of ratees and is therefore neither relevant to universe score nor to unreliable variance.
3 We adopt the terms "reliable" (or universe score) and "unreliable" variance from Putka and Hoffman (2013).
An element of the performance ratings measurement design intended to directly summarize performance is that concerning performance dimensions (or "competencies"). Supervisors often evaluate ratees on dimensions such as teamwork and communication skills (e.g., Bartram, 2005; Kurz & Bartram, 2002). Greguras and Robie (1998) presented a G theory model that addressed several important sources of variance relevant to performance ratings. While the authors modeled item effects, they did not model performance dimensions. Moreover, raters were confounded with ratees in their design. Although central to their measurement design, the structure of performance dimensions (e.g., teamwork ability, customer focus) has generally been underexplored in the context of performance ratings. However, in many real-world measurement designs, performance dimensions play a central role, even in the estimation of overall scores. This is true of supervisory job performance ratings (Bartram, 2005), assessment center (AC) ratings (Putka & Hoffman, 2013), and situational judgment tests (Christian, Edwards, & Bradley, 2010). Dimensions are of theoretical importance because they supposedly define meaningful subcomponents of the performance construct domain (e.g., Arthur & Villado, 2008; Bartram, 2005; Borman & Brush, 1993). O'Neill et al. (2015) is a rare example of the modeling of dimension effects, along with assessee and rater effects, for supervisor ratings. They found small dimension effects (around 6% of variance explained). However, in their study, item-related effects were not modeled. Item effects might play a key role in performance ratings, particularly given their involvement in aggregation relating to summative dimension scores.
Formulae for estimating the effects of aggregation are available in the G theory literature (e.g., Brennan, 2001; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson & Webb, 1991). Such formulae have been adapted in a reanalysis of data from Greguras and Robie (1998). While insightful, the conclusions that could be drawn from this reanalysis were limited by the fact that, in the original study, assessees were confounded with raters and no performance dimensions were defined. In contrast, O'Neill et al. (2015), while including dimensions, acknowledged neither different types of aggregation nor, as mentioned above, effects relating to rating items. Jackson, Michaelides, Dewberry, Schwencke, and Toms (2020) considered G theory formulae for aggregation as it relates to multisource ratings. However, we were unable to find a study that explored aggregation pertaining to an unconfounded measurement design specifically for supervisor performance ratings.
Theoretical models for performance ratings
G theory involves partitioning multiple measurement design effects that are potentially relevant or irrelevant to the performance construct (Cronbach et al., 1972). This approach facilitates exploration of the multifactor and mediated theoretical perspectives that have been proposed for performance and performance ratings (discussed below). Murphy (2008) summarized three theoretical models that describe the relationship between the performance construct and performance ratings. First, the one-factor model suggests a direct relationship between performance and ratings of performance. However, this relationship is subject to measurement error, which, if removed, allows for a direct representation of performance via ratings. The one-factor model assumes that performance ratings can be decomposed into true score + error, and corrections for the latter enable an estimation of performance (e.g., as in the corrections for attenuation in Viswesvaran et al., 1996).
Second, multifactor models aim to delineate a multiplicity of effects that might influence performance ratings, including performance, raters, items, job characteristics, and cognitive processes (Landy & Farr, 1980). Murphy suggests that a useful contribution from this work is that it highlights the impact not only of job performance, but also various other systematic factors on ratings. Multifactor models have the potential to help explain the array of nonperformance-related factors that may have a bearing on ratings. This idea has implications for shared rater variance relating to estimates of general ratee performance. Such estimates might not only indicate ratee performance, but other, nonperformance characteristics, including individual (e.g., rater recall of events) and system characteristics (e.g., use of rating scales, see Murphy, Cleveland, & Hanscom, 2019).
Third, Murphy describes mediated models. These expand on multifactor models by suggesting that distortions (e.g., concerning organizational politics and individual rater goals) can influence the link between performance and ratings. The idea here is that the multiple influences identified in multifactor models are mediated through rater goals and intentions, which are, in turn, reflected in ratings. However, only one of the many factors involved in this evaluation and perceptual process is the performance of ratees.
To date, studies of the reliability of supervisor ratings have typically been conducted within the framework of classical test theory and thus align closely with the one-factor model described above. The classical approach typically involves correlating ratings provided by large numbers of supervisor pairs, where each pair evaluates the performance of a specific ratee. This provides a suitable, unbiased interrater reliability estimate of the ratings of the overall performance of ratees. However, it yields an incomplete perspective on performance ratings (Murphy & DeShon, 2000a, 2000b). This is because the design of supervisory ratings involves measurement elements that are ignored by the approach to reliability assumed in the one-factor model. As suggested in multifactor and mediated models, many of these elements are likely nonperformance effects that have a bearing on performance ratings.

Measurement design elements related and unrelated to performance
Rater-related effects have been a topic of much debate. The most common approach taken in the literature is to treat all rater-related variance as a contribution to unreliable variance. The idea that rater variance contributes to unreliability is implied in the common estimation of, and correction for, interrater reliability estimates (Schmidt & Hunter, 1996; Viswesvaran et al., 1996). However, this is not the only perspective on the topic. Murphy and DeShon (2000a) suggest that there is "no clear justification" (p. 877) for defaulting to an "unreliable" classification for rater-related variance. They submit that rater perspectives on a given ratee might vary meaningfully because of the rater's position, their relationship with the ratee, and political motivations. Thus, rater variance could, in part, reflect different contextual perspectives on employee performance. To illustrate, one supervisor might have more experience with an employee in the context of client engagement. A different supervisor might have more experience with the same employee in the context of logistics management. These are different environments across which employee performance might meaningfully vary. Experience is only one example of the more general issue of variability in performance output that could be affected by any number of effects (e.g., stimuli, mood, context, etc.; see Awtrey, Thornley, Dannals, Barnes, & Uhlmann, 2021; Kane, 1986).
One of the challenges to this rater context-driven perspective is that there is no clear guidance about the proportion of variability in rater effects that might contribute to reliability. This is because systematically varying work contexts are not typically included as part of the measurement design for performance ratings (e.g., Schmidt et al., 2000). If some portion of rater-related variance contributes to universe score, then the classification of all rater-related variance as a contribution to unreliability (e.g., Schmidt et al., 2000) will result in erroneously inflated estimates of rater variance. However, it is known that even highly trained raters, evaluating performance in standardized environments and required to focus exclusively on ratee performance, commit rating errors (Jackson et al., 2016; Putka & Hoffman, 2013). Thus, the notion that naturalistic employee performance ratings are error-free is untenable. A reasonable position, therefore, is that some, but not all, rater variance might be associated with reliability.
A specific line of research has suggested smaller rater effects than those previously estimated. This research area has focused on performance ratings in particular occupations, such as in healthcare, applied psychology, ergonomics, and occupational safety (Burke et al., 2006, 2011). Burke, Landis, and Burke (2014) report that the measurement designs used in these occupations typically involve two raters who evaluate the same ratee at the same time and in the same context. The authors report higher reliabilities for such designs, with "provisional" interrater reliability estimates of around .80 (p. 534). However, this still leaves open the possibility that ratings from different raters in different contexts might, in part, reflect perspectives that vary meaningfully. If context-varied effects are substantial and yet are treated wholly as contributing to unreliable variance, then the reliability of ratings might be underestimated.
ACs present a measurement design that includes varied work-relevant contexts. Research on ACs has explored the issue of contextual perspectives in detail as it pertains to rater sensitivity to changes in situational characteristics in the form of exercise effects (i.e., variance relating to different AC exercise contexts; see Lance, 2012; Lance, Lambert, Gewin, Lievens, & Conway, 2004). Two recent studies modeled rater (or assessor), exercise, and a multitude of other measurement design effects. This allowed for a statistically partialled perspective on the extent to which raters differentiated between AC exercise contexts (Jackson et al., 2016; Putka & Hoffman, 2013). The most conservative estimates from these studies suggested that, whilst partialling idiosyncratic rater and other effects, between 33.51% and 38.10% of variance in AC ratings was attributable to the capacity for raters to identify differences between exercises. These estimates, based on assessor ratings evaluated within each exercise, provide initial insights into the expected proportion of rater variance that might indicate sensitivity to ratee performance in different work-relevant contexts. The Jackson et al. and Putka and Hoffman estimates partialled rater-related effects and thus aspects of possible rater bias.

Summary and knowledge gaps related to supervisor performance ratings
Theoretical development on performance ratings has focused on measurement structure (Greguras & Robie, 1998; Hoffman, Lance, Bynum, & Gentry, 2010; Lance, Teachout, & Donnelly, 1992; O'Neill et al., 2015). The prevailing perspective on supervisory performance ratings appears to be that their interrater reliability is low at around .52 and that this outcome is due to unreliability based on large rater-related effects (Schmidt & Hunter, 1996; Schmidt et al., 2000). However, a statistically partialled perspective on the measurement structure of supervisory performance ratings is currently unavailable. Such a perspective is required to add clarity to the literature on this widely applied measure.
To develop a theoretical understanding of the structure of supervisor ratings, a study is required that partials sources of variance central to their measurement design (raters, assessees, items, performance dimensions, and their interactions) whilst acknowledging the effects of aggregation. This leads to our first Research Question:
Research Question 1: On aggregation, what proportion of the variance in supervisory performance ratings is uniquely associated with raters, assessees, items, performance dimensions, and their interactions?
Given previous research findings on the interrater reliability of performance ratings (LeBreton et al., 2014; Schmidt et al., 2000; Viswesvaran et al., 1996), we expect to find sizable rater effects in our results. Murphy and DeShon (2000a, 2000b) argue that at least some proportion of rater variance might contribute to universe score because raters evaluate ratees in different work contexts. Yet, it is highly unlikely that all rater variance contributes to universe score.
Although the measurement design of performance ratings does not typically differentiate between contextual influences, ACs do differentiate between work contexts. Modeled in both the Jackson et al. (2016) and Putka and Hoffman (2013) estimates 4 was the potential for assessors to be sensitive to (a) performance within exercises and (b) performance on dimensions that vary by exercise (see also Hoffman, Kennedy, LoPilato, Monahan, & Lance, 2015). These findings could help to inform on the proportion of rater variance in supervisor ratings that is associated with sensitivity to ratee performance in different work contexts. The intention here is not to provide the definitive and final response to the question of what proportion of rater variance contributes to universe score. Rather, we seek to provide an approximation of the expected outcome when an informed proportion of rater variance is accounted for by sensitivity to different performance contexts. This leads to our second Research Question:
Research Question 2: How do reliability estimates for performance ratings change when accounting for rater sensitivity across different performance contexts?
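The logic of this reapportionment can be sketched numerically. All variance components below are hypothetical; only the reapportioned shares (roughly 33.5% and 38.1%) follow the most conservative AC estimates discussed above, and residual variance is always left as unreliable.

```python
# Sketch: reapportion a share of systematic rater variance to universe score
# and recompute a G coefficient. Variance components are hypothetical; the
# shares (~33.5% and ~38.1%) follow the assessment-center estimates cited
# in the text. Residual variance always remains unreliable.

var_p, var_rater, var_res = 0.30, 0.45, 0.25  # hypothetical components

def g_with_reapportionment(share_reliable):
    universe = var_p + share_reliable * var_rater
    error = (1 - share_reliable) * var_rater + var_res
    return universe / (universe + error)

g_conventional = g_with_reapportionment(0.0)    # all rater variance = error
g_ac_low = g_with_reapportionment(0.335)        # ~conservative AC share
g_ac_high = g_with_reapportionment(0.381)       # ~upper AC share

assert g_conventional < g_ac_low < g_ac_high
print(round(g_conventional, 2), round(g_ac_low, 2), round(g_ac_high, 2))
```

Even a moderate reliable share moves the coefficient upward because variance is transferred from the denominator's error term to the universe-score term, rather than being removed.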
Results relating to our research questions will facilitate a consideration of how the multiple effects relevant to the measurement design of performance ratings contribute to universe score or unreliability. This consideration will, in turn, inform the multifactor model and components of the mediated model summarized by Murphy (2008).

Method
We reanalyzed subgroups from the data sets that appeared in Jackson et al. (2020). In the original study, the authors focused on a multisource measurement design. The emphasis of the current study is on supervisory ratings, for which there were two separate samples available in the Jackson et al. database. As a supplementary analysis and to test whether our findings replicated in different roles, we repeated our analyses on the other individual sources available in the data set. The Jackson et al. database allowed a unique level of complexity as it modeled the main features of the supervisory ratings measurement design (including items, dimensions, and raters) for data that potentially present a challenge for applied researchers to obtain. Data from two different samples were available for analysis. Each of these samples reflected a specific, albeit similar measurement design. However, each design was sufficiently different to offer insights about the potential for cross-sample generalization.

Participants
Participants in Sample 1 included 392 unique managerial ratees (298 men, 94 women) and 244 unique supervisory raters (183 men, 61 women) who were managers ranked a level above and who directly supervised ratees. Although supervisor ratings were our primary focus, we aimed to provide the reader with comparative findings from other roles available in the data set. We therefore included separate ratings from 420 direct reports (315 men, 105 women), 775 colleagues (581 men, 194 women), and 579 stakeholders (434 men, 145 women). The participant organization was involved in manufacturing in the United Kingdom. The main purpose of the procedure used in Sample 1 was employee development. Neither ethnicity nor age data were collected, owing to concerns related to confidentiality.

Measurement design
All participant ratees (p) were assessed by raters (r, an average of 2 per role or source) on rating items (i, on average 16.46 for each dimension 5 ), which were nested in performance dimensions (d, totaling 4). This design is typical of the type reported in the literature on performance ratings (e.g., Greguras & Robie, 1998; O'Neill et al., 2015).
5 We applied harmonic mean values for averaging facet levels, in keeping with Brennan (2001).
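Because facet sample sizes (such as items per dimension) varied, averages of this kind are harmonic rather than arithmetic means. A minimal sketch with invented item counts:

```python
# Sketch: harmonic mean of facet sample sizes (in keeping with Brennan, 2001)
# for use in aggregation formulae when counts vary. Item counts are invented
# for illustration, not the study's actual counts.

def harmonic_mean(counts):
    return len(counts) / sum(1.0 / c for c in counts)

items_per_dimension = [15, 18, 16, 17]  # hypothetical item counts
n_i = harmonic_mean(items_per_dimension)
print(round(n_i, 2))
```

The harmonic mean never exceeds the arithmetic mean, so it gives a slightly conservative effective sample size when counts are unbalanced.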

Sample 2
Participants
Participants in Sample 2 included 342 unique managerial ratees (216 men, 126 women). The mean age of ratees in the full data set was 38.31 (SD = 9.65). 6 Ratees were assessed by 397 unique raters ranked a level above and who supervised ratees. As with Sample 1, direct supervisor ratings were our focus. However, we included for analysis data from other roles for comparison, including those from direct reports (N = 833), peers (N = 872), and clients (N = 272). Demographics on gender were only available for the total number of raters in the data set, including self-ratings (1579 men, 1057 women, with a mean age of 40.24, SD = 9.89; note that 262 of these cases involved self-ratings, which we did not analyze). No further demographics were available. Unlike in Sample 1, the Sample 2 data set reflected ratings from different client organizations that made on-demand use of the performance management system. These client organizations were involved in banking, retail, accounting, insurance, human resources, and management consulting businesses in the United Kingdom. Applications of the procedure in Sample 2 depended on client requirements but included performance assessment and employee development.

Measurement design
In Sample 2, participant ratees (p) were assessed by raters (r, 2 on average per role or source) on rating items (i, on average 10.04 per performance dimension), which were nested in dimensions (d, total = 24). A mean of 5.40 dimensions were, in turn, nested in each of 5 summary dimension categories (c). The nested component of this measurement design related to dimensions was not present in Sample 1. Different clients made use of the performance management facility in Sample 2 to meet their specific demands, and so variation was apparent in the numbers of levels relating to sources of variance in this measurement design.

Rating procedures in samples 1 and 2
The procedures in both Samples 1 and 2 were developed on the basis of job analyses relating to the positions being evaluated (e.g., Williams & Crafts, 1997). Example rating items from Samples 1 and 2 respectively were: "Ensures the strategy, objectives, and activities of the team are focused on addressing customer needs" and "Gives ongoing and constructive performance-related feedback." In Sample 1 a rating scale was used ranging from 1 (the rater has never observed this behavior) to 5 (the rater always observes this behavior). In Sample 2, a percentile (0-100) score was available, which took the original 1 (strongly disagree) to 6 (strongly agree) scale rating and referenced this against responses from a norm group. Full definitions for the performance dimensions assessed in both Samples 1 and 2 appear in Appendix A1 of Jackson et al. (2020). These were described in the original study as "job-critical knowledge, skills, abilities, and other characteristics identified in the job analysis for each sample" (p. 318). Dimension titles appear in the Appendix of the present article (Table A3).
Rater training in Sample 1 covered use of the online platform used by the organization to input ratings and the use of mock assessments, together with a discussion centered on a comparison of ratings. The latter was based on a frame-of-reference training procedure (e.g., Bernardin & Buckley, 1982). Sample 2 training involved a half-day course that covered procedural content and a mock assessment akin to that described for Sample 1. Training outcomes were not assessed by the organization in Sample 1. For Sample 2, training performance was assessed and only those who passed a training evaluation could proceed to use the rating procedure. The organization in Sample 1 used the evaluation on different occasions, but Jackson et al. (2020) were only provided access to ratings relevant to one evaluation period. The Sample 2 procedure was a one-off assessment.
6 The mean and SD for age were based on ratings from all roles in Sample 2, including self-ratings. No other demographic information was available.

Measurement design, effects, and generalization
Both measurement designs in Samples 1 and 2 required the estimation of 11 separate effects each, although some of the specific effects differed between samples. The number of effects in the present study differs from that in the original because of the absence of a source effect and source-related interactions, given the focus here on supervisors. Full descriptions of the effects estimated in this study are provided in the Appendix, Tables A1 and A2. Briefly, across samples, we were able to simultaneously estimate effects associated with general performance (participant ratee main effects, akin to a general effect in classical test theory, CTT, or latent variable theory, LVT), Participant × Dimension interactions (akin to dimension-related effects or an indication of discriminant validity in CTT and LVT), multiple rater-related effects (e.g., CTT analogues of rater leniency or severity), and item-related effects. In Samples 1 and 2, items were nested in dimensions. A key feature of Sample 2 was that dimensions were nested in summary second-order dimension categories.

Bayesian inference
Bayesian inference offers practical and statistical advantages over traditional approaches to estimation in the random effects models often applied as a basis for G theory (as detailed in Jackson et al., 2016, 2020; LoPilato, Carter, & Wang, 2015). We therefore opted to apply Bayesian inference to our data and, in doing so, we respond to general calls in the literature to explore applications of Bayesian statistics in applied psychology (Kruschke, Aguinis, & Joo, 2012; Zyphur, 2009; Zyphur, Oswald, & Rupp, 2015).

Ill-Structured designs and aggregation
Raters and ratees were neither fully crossed nor perfectly nested in either sample in this study. Thus, there was some degree of overlap between raters and ratees in both samples, constituting what is often referred to as an ill-structured measurement design (Putka, 2011). To address the data sparseness associated with ill-structured data configurations, we fitted a hierarchical Bayesian model (Gelman & Hill, 2007). An advantage of this approach is that it does not require that any data be deleted to develop a crossed design for the purposes of analysis. To reflect the degree of overlap between raters and ratees, we rescaled rater-related variance estimates in both samples using the q-multiplier approach (see Putka, Le, McCloy, & Diaz, 2008 for details). Moreover, we tailored formulae from the G theory literature (Brennan, 2001) and rescaled variance estimates to reflect aggregation across (a) rating items to form dimension scores, (b) dimension scores aggregated across raters, and (c) all items, dimensions, and raters to form overall scores. These formulae were applied to the posterior distributions of the model parameters so that we could obtain posterior distributions for all estimates.
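A schematic of where the rescaling enters a G coefficient may help. The variance components below are hypothetical, and q is a placeholder value: the actual multiplier depends on the observed rater-ratee overlap pattern and is computed as described in Putka et al. (2008), which this sketch does not reproduce.

```python
# Sketch: rater main-effect variance rescaled by a q multiplier in an
# ill-structured design. Roughly speaking, q approaches 0 for a fully
# crossed design and grows toward 1/k as raters become fully nested in
# ratees (see Putka et al., 2008). The value of q below is a placeholder,
# and the variance components are hypothetical.

var_p, var_r, var_res = 0.30, 0.45, 0.25  # hypothetical components
k_raters = 2.0                            # harmonic mean raters per ratee
q = 0.35                                  # placeholder overlap multiplier

# Rater main-effect error enters the coefficient scaled by q, not 1/k
error = q * var_r + var_res / k_raters
g_ill = var_p / (var_p + error)
print(round(g_ill, 3))
```

The point of the multiplier is that partial overlap between rater sets means rater main effects only partially distort between-ratee comparisons, so counting the full rater variance as error would understate reliability.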

Model specification
We used R 3.6.0 (R Core Team, 2019), Stan 2.19.1 (Stan Development Team, 2019), and brms 2.13.5 (Bürkner, 2017, 2018) to conduct the analyses in this paper. Samples 1 and 2 were configured with 11 variance components and 1 fixed intercept (to address our Research Question 1). Cross-sample comparability was facilitated by scaling each raw data set to a standard deviation of 1. This approach and others we have applied here assume that our data distributions approximated normality. Recent research on performance ratings challenges this assumption (Aguinis, Ji, & Joo, 2018). However, perusal of density and QQ plots did not raise concerns about appreciable deviations from normality in any sample relevant to the present work.
We applied weakly informative priors in all our analyses. For the fixed intercept, this was specified as a normal distribution with a mean of 3.06 (for Sample 1) or 3.05 (Sample 2) and a scale of 5.00 standard deviations. These mean values were selected by rounding the mean of the dependent variable. For the standard deviations of the random effects and the residual, we used the brms default weakly informative prior of a Student's t-distribution with 3 degrees of freedom, a mean of 0, and a scale of 2.5 (Bürkner, 2017, 2018). The reasoning behind these priors is that they do not allow the analysis to return values that are conceptually impossible while, at the same time, remaining flexible enough to permit a large range of values, even if the probability of their occurrence is small. Weakly informative priors constitute recommended practice for G theory models and have been successfully applied in organizational contexts involving raters (Jackson et al., 2016, 2020). Simulations were conducted with four chains and 10,000 iterations per chain. We treated the first 5,000 iterations as warm-up and retained the remaining iterations for the main analysis. Convergence was acceptable in all analyses, according to visual inspections of trace, density, and autocorrelation plots. These outputs suggested good mixing of chains and did not raise any concerns about autocorrelation. Other indicators of effective convergence, such as the potential scale reduction factor, effective sample size, and Monte Carlo standard errors, fell within acceptable parameters (see Gelman & Rubin, 1992).
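For reference, the potential scale reduction factor can be computed from chain draws as follows. This is the classic Gelman and Rubin (1992) formulation (without the split-chain refinement used by more recent Stan releases), applied here to simulated standard-normal draws rather than model output, so the statistic should land near 1.

```python
# Sketch: potential scale reduction factor (R-hat; Gelman & Rubin, 1992)
# computed on simulated chains. Four chains of independent standard-normal
# draws stand in for posterior samples, so R-hat should be close to 1.
import random

random.seed(1)
chains = [[random.gauss(0, 1) for _ in range(5000)] for _ in range(4)]

n = len(chains[0])  # draws per chain
m = len(chains)     # number of chains
chain_means = [sum(c) / n for c in chains]
grand_mean = sum(chain_means) / m

# Between-chain variance B and mean within-chain variance W
B = n / (m - 1) * sum((cm - grand_mean) ** 2 for cm in chain_means)
W = sum(sum((x - cm) ** 2 for x in c) / (n - 1)
        for c, cm in zip(chains, chain_means)) / m

# Pooled variance estimate and R-hat
var_hat = (n - 1) / n * W + B / n
r_hat = (var_hat / W) ** 0.5
assert r_hat < 1.05  # a conventional convergence threshold
print(round(r_hat, 3))
```

Values near 1 indicate that between-chain and within-chain variability agree, which is the quantitative counterpart of the "good mixing" seen in trace plots.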

Generalizability coefficients and rater sensitivity across work situations
On rescaling with the q-multiplier (as described in Putka et al., 2008), we estimated generalizability coefficients (G coefficients; Shavelson & Webb, 1991) for three types of generalization. First, we estimated generalization across different raters (generalization to r). This approach considers rater-related variance to be nonsystematic and to contribute to unreliability. Second, we estimated generalization to both different raters and rating items (generalization to i,r). This considers rater- and item-related variance to contribute to unreliability. Both generalization to r and to i,r are consistent with the dominant perspective in the discipline, which classifies rater-related variance as a contribution to unreliable variance (e.g., Schmidt & Hunter, 1996; Schmidt et al., 2000; Viswesvaran et al., 1996).
Third, we estimated reliability in keeping with the possibility raised by Murphy and DeShon (2000a) that at least some portion of rater-related variance might represent meaningful, systematic variation (see Research Question 2). We approximated the proportion of this potentially meaningful rater variance by referring to the AC literature, as detailed previously. Following the course of action suggested in this earlier work, we reapportioned only systematic rater-related variance in our study. For the Jackson et al. (2016) AC estimate, we took the total of all systematic rater-related variance in the present study and partitioned it into 33.52% universe score and 66.48% unreliable variance. We repeated this principle for the Putka and Hoffman (2013) estimate (38.10% universe score and 61.90% unreliable variance). We then used this approach as a basis for projected G coefficients for generalization to r only. To provide clarity: in all G coefficient estimates we present, undifferentiated residual variance, which includes residual rater-related variance, was always specified, in full, as contributing to unreliability. We did not reapportion residual variance.

Table 1 shows all 11 effects estimated for the supervisory ratings in Sample 1. Of these effects, nine were relevant to comparisons between assessees (i.e., between-participant comparisons) and so constitute our focus, because in performance management, interest generally lies in how performance compares across different ratees. The results in Table 1 are presented initially in their pre-aggregated form. This is followed by estimates for aggregation across items to arrive at dimension scores, for dimensions aggregated across raters, and for overall scores across all raters, items, and dimensions. The aggregated presentation of results is likely to be relevant to most applications of performance ratings.

Sample 1: supervisor ratings
With reference to our Research Question 1, Table 1 shows a consistent pattern of results across the three different aggregation types relevant to this analysis. The assessee or participant main effect, σ²p, akin to a general effect, explained a large portion of variance across the dimension, dimension-across-rater, and overall aggregation types (28.76%, 44.06%, and 49.72%, respectively). Prominent across aggregation types were effects relating to raters. The Participant × Rater interaction, σ²pr, explained between 25.89% and 34.10% of variance. Likewise, the main effect for raters, σ²r, explained a substantial proportion of variance (between 16.63% and 21.91%). Collectively, rater-related effects explained most of the variance when aggregating to dimensions or to dimensions across raters (69.69% and 53.57%, respectively) and around half at the overall aggregation level (49.88%).
We found that performance dimensions in Sample 1 explained a very small proportion of variance. The maximum contribution offered by the Participant × Dimension effect (σ²pd) occurred at the dimension-across-rater level of aggregation, at 1.11% of the variance in ratings. The Participant × Item nested in Dimension interaction (σ²pi:d) explained similarly low proportions of variance (≤1.26%).
This leads to a consideration of the G coefficients for Sample 1, which are presented in Table 1 for two types of generalization: to different raters (r), or to items and raters (i,r). When generalizing to r or to i,r, results were almost identical, given the large rater-related and relatively small item-related effects evident in Table 1. G coefficients were uniformly low when attempting to generalize across different raters or across items and raters (between .29 and .50).

Note to Table 1. Descriptions of the effects listed above are provided in the Appendix. p = participant ratee, d = performance dimension (or competency), i = rating item, r = rater. BP = between-participant, VC = variance component, Var = variance. G to = generalization across the effects that follow (for example, G to r = the expected generalizability coefficient when generalizing across different raters). All rater effects were corrected with the q-multiplier described in Putka et al. (2008). Estimate 1 is based on Putka and Hoffman (2013; 38.10% contextual variance). Estimate 2 is based on Jackson et al. (2016; 33.52% contextual variance). Observed G coefficients generalizing to r are given by the ratio of p + pd + pi:d to total BP variance. Observed G coefficients generalizing to i,r are given by the ratio of p + pd to total BP variance. Projected Estimate 1 is given by the ratio of p + pd + pi:d + (38.10% of pr + prd + r + rd + ri:d) to p + pd + pi:d + pr + prd + r + rd + ri:d + pri:d. Projected Estimate 2 is given by the ratio of p + pd + pi:d + (33.52% of pr + prd + r + rd + ri:d) to the same total. Note that pri:d is the estimate for residual variance and therefore does not contribute to universe score in the projected estimates.
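The observed and projected G coefficient formulas reported for Table 1 can be sketched as follows. The variance component values below are hypothetical placeholders, not the Table 1 estimates; the function simply implements the ratios described in the note.

```python
def g_coefficients(vc: dict, universe_share_of_rater_var: float) -> tuple:
    """Observed and projected G coefficients for generalization to r.

    vc holds between-participant variance components keyed by effect name;
    universe_share_of_rater_var is the proportion of systematic rater-related
    variance reapportioned to universe score (e.g., 0.3810 or 0.3352).
    """
    rater_systematic = vc["pr"] + vc["prd"] + vc["r"] + vc["rd"] + vc["ri:d"]
    universe = vc["p"] + vc["pd"] + vc["pi:d"]
    total = universe + rater_systematic + vc["pri:d"]  # pri:d = residual

    observed = universe / total
    projected = (universe + universe_share_of_rater_var * rater_systematic) / total
    return observed, projected

# Hypothetical variance components (not the values reported in Table 1).
vc = {"p": 0.45, "pd": 0.01, "pi:d": 0.01, "pr": 0.30,
      "prd": 0.02, "r": 0.18, "rd": 0.01, "ri:d": 0.01, "pri:d": 0.01}

obs, proj = g_coefficients(vc, 0.3810)  # Putka and Hoffman (2013) proportion
print(f"Observed G to r: {obs:.2f}; projected: {proj:.2f}")
```

With these illustrative components, reapportioning 38.10% of systematic rater variance lifts the coefficient from .47 to about .67, mirroring the scale of the gains reported for the real data.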

Sample 1: non-supervisor ratings
For brevity, we report only a single aggregation level for non-supervisory ratings, namely aggregation across dimensions and raters. Table 2 shows three different non-supervisory roles for Sample 1: direct reports, colleagues, and stakeholders. The profile of variance was similar regardless of role type and resembled that observed for supervisors in Table 1. Across all three roles, σ²p (between 34.63% and 49.34%), σ²pr (between 25.29% and 29.76%), and σ²r (between 15.77% and 23.33%) all suggested prominent effects. Small effects were observed relating to dimensions, including σ²pd (≤1.45%) and σ²pi:d (≤.73%). G coefficients for non-supervisory roles were similar to those described above for the supervisory role. When generalizing to r or i,r, G coefficients were low (≤ .50).

Sample 2: supervisor ratings
Results for Sample 2 supervisor ratings are presented in Table 3. Despite reflecting a somewhat different measurement design, and in reference to our Research Question 1, the results in Sample 2 were similar to those observed in Sample 1. Table 3 shows that, on aggregation, the primary contributions to variance in ratings were associated with a participant main effect (σ²p, between 27.57% and 46.34%) and a rater-related effect (σ²r:p, between 38.50% and 50.68%). The second-order dimension effect (σ²pc) explained 6.38% and 8.15% of variance in ratings at the dimension and dimension-across-rater levels of aggregation, respectively. At the same levels of aggregation, the first-order dimension effect (σ²pd:c) was estimated at 2.11% and 2.70% and the item-nested-in-dimension effect (σ²pi:d:c) at 2.22% and 2.84%. However, gains in dimension-related variance in Sample 2 did not result in improved G coefficients when generalizing to raters, because rater effects were much larger than dimension effects. As found in Sample 1, G coefficients in Sample 2 were low when generalizing to r and i,r (≤ .49), regardless of aggregation type.

Note to Table 3. Descriptions of the effects listed above are provided in the Appendix. p = participant ratee, d = performance dimension (or competency), i = rating item, r = rater, c = summary dimension category. BP = between-participant, VC = variance component, Var = variance. G to = generalization across the effects that follow (for example, G to r = the expected generalizability coefficient when generalizing across different raters). All rater effects were corrected with the q-multiplier described in Putka et al. (2008). Estimate 1 is based on Putka and Hoffman (2013; 38.10% contextual variance). Estimate 2 is based on Jackson et al. (2016; 33.52% contextual variance). Observed G coefficients generalizing to r are given by the ratio of p + pc + pd:c + pi:d:c to total BP variance. Observed G coefficients generalizing to i,r are given by the ratio of p + pc + pd:c to total BP variance. Projected Estimate 1 is given by the ratio of p + pc + pd:c + pi:d:c + (38.10% of r:p + r:pc + r:pd:c) to p + pc + pd:c + pi:d:c + r:p + r:pc + r:pd:c + r:pi:d:c. Projected Estimate 2 is given by the ratio of p + pc + pd:c + pi:d:c + (33.52% of r:p + r:pc + r:pd:c) to the same total. Note that r:pi:d:c is the estimate for residual variance and therefore does not contribute to universe score in the projected estimates. The same conventions apply to Table 4.

Sample 2: non-supervisor ratings

Table 4 shows outcomes for the three non-supervisory roles relevant to Sample 2: direct reports, peers, and clients. As with the equivalent analysis in Sample 1, we report only results for scores aggregated across items to form dimension scores and across raters. Findings for non-supervisory ratings were consistent with those for the supervisory ratings in Sample 2. The main contributors to variance in ratings across all rater roles were σ²p (between 30.98% and 38.79%) and σ²r:p (between 39.90% and 45.18%). The contribution of dimension-related variance was small but differed somewhat across roles and was primarily associated with the second-order dimension effect σ²pc (ranging from 2.92% with clients up to 6.29% with peers). G coefficients were once again low when generalizing to different raters, and to items as well as raters (≤ .42).

Projections based on reapportioned systematic rater variance
Entries for projected generalization across r, based on a reapportioning of systematic rater variance guided by the AC literature, appear in Tables 1 and 3 for supervisor ratings (see Research Question 2). The results of this approximation were similar regardless of whether the Putka and Hoffman (2013; Estimate 1) or Jackson et al. (2016; Estimate 2) figures were applied. Reliability increased substantially when systematic rater-related variance was reallocated according to the AC-based estimates. At the overall level of aggregation in Sample 1, reliability increased from the original estimate of .50 to a maximum of .69. At the overall level of aggregation in Sample 2, reliability increased from .49 to a maximum of .66. These projected estimates still do not meet criteria ordinarily set for acceptable reliability (LeBreton et al., 2014). However, they do move the reliability estimates closer to those criteria.

Discussion
The theoretical development of performance ratings has focused on their measurement structure, relating particularly to general performance (Scullen, Mount, & Goff, 2000), performance dimensions (Borman & Brush, 1993; Kenny & Berman, 1980), and rater effects (Lance et al., 1992). Murphy (2008) described three models to explain the relationship between performance and performance ratings: one-factor, multifactor, and mediated models. Many current estimates of the measurement structure of performance ratings refer to the one-factor model based on classical test theory, where rater-related variance is typically assumed to contribute to unreliability (e.g., Viswesvaran et al., 1996). Less attention has been directed toward the multifactor and mediated models and the related possibility that at least some systematic rater-related variance might contribute to universe score. A statistically partialled evaluation of the measurement design usually described for performance ratings would inform these perspectives. It is this partialled account of the measurement structure of performance ratings that we sought to present (see Research Question 1). We further considered the impact of different perspectives on what defines the multiple sources of universe score and unreliability in ratings, particularly regarding the status of systematic rater-related effects (Murphy & DeShon, 2000a, 2000b; see Research Question 2).
Our findings suggest that the structure of supervisor ratings primarily reflects general performance and rater effects. This structure held across three different types of aggregation and two different measurement design variations. The largest portion of variance associated with raters in both samples was an effect involving both raters and participant ratees (σ²pr in Sample 1 and σ²r:p in Sample 2), implying that different raters held varying perspectives on ratee performance. When rater effects were treated as contributing to unreliable variance, as in the one-factor model, and in keeping with results from previous studies (LeBreton et al., 2014; Rothstein et al., 1990; Schmidt et al., 2000; Viswesvaran et al., 1996), we found low reliability estimates for supervisor ratings (≤ .50 for overall aggregation). The only contribution of note to universe score was that associated with a general performance effect. Our results further indicate that the reliability of performance ratings is undermined by the relatively small contribution of dimension effects (<3% of variance explained for overall aggregation).

Murphy and DeShon (2000a, 2000b) suggest that some proportion of rater variance might present meaningfully different context-based perspectives on a ratee. In our study, we estimated this proportion based on findings from the AC literature (Jackson et al., 2016; Putka & Hoffman, 2013). Our results suggested that even when using conservative estimates, projected reliabilities increased substantially (from .50 to a maximum of .69 in Sample 1 and from .49 to .66 in Sample 2 for overall ratings) when systematic rater-related variance was reallocated according to AC-based estimates (see Research Question 2). These increases did not result in outcomes that met acceptability criteria often applied to reliability coefficients (LeBreton et al., 2014).
Nonetheless, they approached such criteria (at between .66 and .69), and our estimates were based on the most conservative figures available in the Jackson et al. and Putka and Hoffman studies.

A statistically partialled perspective on the structure of performance ratings
By simultaneously partialling all systematic effects relevant to the measurement design of supervisory ratings (see Tables 1 and 3), we offer new insights into their structure. We were unable to find more detailed treatments of reliability in performance ratings in the literature; previous studies have been based on separate intra- and interrater reliability estimates (e.g., Viswesvaran et al., 1996) or on simultaneously modeled but incomplete sets of effects (e.g., Greguras & Robie, 1998; O'Neill et al., 2015).
In response to our Research Question 1, across two samples it was clear on aggregation that the structure of supervisory performance ratings primarily reflected (a) person main effects (also referred to as general performance; σ²p, >27% of the variance in ratings) and (b) rater-related effects (various main effects and interactions involving raters and ratees; >49%; see Tables 1 and 3). The main contributions to variance in supervisor ratings were therefore associated with a general, positive-manifold-type appraisal of ratee performance (Ree, Carretta, & Teachout, 2015; Viswesvaran, Schmidt, & Ones, 2005), coupled with interactions involving participant ratees and raters (Lance et al., 1992). Similar effects were apparent in the other organizational roles we tested for comparison (see Tables 2 and 4).
Across both samples, our findings suggest that performance dimensions contributed only small proportions of variance (σ²pd ≤ 2.70%) to the structure of performance ratings. Our estimate of the contribution of these performance dimensions was somewhat lower than the σ²pd ≈ 6% of variance estimated in O'Neill et al. (2015), where item-related effects were not modeled. In our Sample 2, we modeled the analogue of second-order dimensions, which, as found in other contexts (see Hoffman, Melchers, Blair, Kleinmann, & Ladd, 2011), explained greater proportions of variance than our analogue of first-order dimensions (i.e., σ²pd). However, even these effects were too small (σ²pc ≤ 8.15%) to have any appreciable influence on reliability outcomes.

The question of rater-related variance
Uncertainty is apparent in the literature with respect to the status of rater-related variance. The relevant body of literature is silent on precisely what proportion of rater-related variance should be specified as contributing to universe score in supervisor ratings. Some proportion of rater-related variance might indeed represent meaningful context-perspective effects, as has been suggested from a conceptual stance (Murphy & DeShon, 2000a, 2000b). However, it is unlikely that all rater variance could reasonably be classified as universe score. For example, differences based on personality, mood, role, cognitive ability, or any number of other characteristics could be relevant to rater-related variance. In the absence of other specific guidance about the reasons for rater-related variability, a conservative course of action is to treat all rater-related variance as a contribution to unreliability. It is this conservative approach that prevails in the research literature (see LeBreton et al., 2014, for a summary).
The projected estimates for reliability we present, based on knowledge from the AC literature, provide a step toward finding an informed midpoint between two hypothetical extremes (i.e., all rater variance = unreliable variance versus all rater variance = universe score). We provide what can be thought of as a theoretical estimate, not one intended to deliver the definitive answer to the question of what proportion of rater-related variance is universe score. Locating that precise figure might present a difficulty for the discipline, given that the definition of rater context perspectives might differ markedly by sample. Moreover, the measurement design of performance ratings represented in the literature does not typically (possibly ever) include an estimate of rater-related context, likely because such contexts tend to change unsystematically in real-world scenarios.

Comparison with the multisource rating design
Our study offers a unique opportunity to compare results for supervisor ratings directly with multisource ratings, given that we reanalyzed data from a multisource ratings data set. The conclusion in the original study was that multisource ratings showed encouraging evidence of reliability, even when all systematic rater-related variance was treated as contributing to unreliability (≥ .81). However, using the same data sets and, as in the original study, treating systematic rater variance as contributing to unreliability, we found considerably lower reliabilities for supervisor ratings (≤ .50).
This apparent discrepancy is due to the presence of a source effect in the multisource design that is absent from the supervisory ratings design.⁷ Relative to source effects in Jackson et al., rater effects were small. However, when source effects were removed, as in the present study, rater effects became more prominent relative to the other remaining effects in the supervisory ratings design. This finding highlights the relative nature of the effects that contribute to reliability estimation: the addition of even one substantial effect to a measurement design can make a sizable difference to a reliability estimate.

Murphy (2008), in his description of multifactor and mediated models, suggested that performance is only one of several possible components contributing to performance ratings. Our findings suggest that universe score components of performance ratings are defined primarily by general effects. General effects could partly represent general performance but could also represent individual differences on psychological constructs (Putka & Hoffman, 2013; Ree et al., 2015). We found only a small portion of variance in the component of the measurement design that most clearly attempts to formalize an evaluation of performance, namely performance dimensions. In comparison to multisource ratings, supervisor ratings rely more heavily on a smaller number of effects typically deemed to contribute to reliability (specifically, general effects and performance dimensions). Setting aside rater-related effects, our results suggest that the reliability of supervisor ratings rests primarily on a relatively large general effect, particularly given that dimension effects tend to be small.

Implications for researchers and practitioners
The finding that dimensions (or competencies) contributed small proportions of variance in our study is consistent with findings in other settings (e.g., ACs, interviews, and situational judgment tests; see Jackson et al., 2016, 2020; Lance et al., 2004; O'Neill et al., 2015; Putka & Hoffman, 2013). We believe that researchers could address this phenomenon in future studies. Small dimension effects might be a consequence of conceptual issues (e.g., dimensions are not defined in ways that suit what raters are able, or prefer, to evaluate) or of time-related pressures and practicalities. For example, managers have a limited period at their disposal in which to complete appraisal forms, and so might simply provide a similar rating across all dimensions. Yet another consideration is the number of occasions over which dimensions are evaluated, as we discuss below in our limitations section.
⁷ We only generalize to r here for brevity; in any case, the results for generalization to r and i,r were almost identical.
We estimated the proportion of rater variance that might contribute to universe score by drawing on findings from the AC literature. In Table 5, we show meta-analytic r_xy validity coefficients corrected for the traditional r_yy = .52 reliability estimate from Viswesvaran et al. (1996). We show the same type of correction in Table 5 for the range of projected reliability estimates from the present study. Several of these corrections make a difference of note. For example, when corrected for r_yy = .52, r_xy for empirically keyed biographical data = .44; when corrected for r_yy = .69, the same r_xy = .37.
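The correction applied in Table 5 is the standard disattenuation of a validity coefficient for criterion unreliability, r_xy / sqrt(r_yy). The sketch below implements that formula with a hypothetical observed correlation (the value .30 is illustrative, not taken from Table 5).

```python
from math import sqrt

def correct_for_criterion_unreliability(r_xy: float, r_yy: float) -> float:
    """Disattenuate an observed validity coefficient for criterion unreliability."""
    return r_xy / sqrt(r_yy)

# Hypothetical observed predictor-criterion correlation.
r_obs = 0.30

# Correction under the traditional r_yy = .52 versus the projected r_yy = .69.
print(round(correct_for_criterion_unreliability(r_obs, 0.52), 2))  # 0.42
print(round(correct_for_criterion_unreliability(r_obs, 0.69), 2))  # 0.36
```

Because a more reliable criterion implies less correction, corrected validities shrink as r_yy rises; this is the mechanism behind the drop from .44 (under r_yy = .52) to .37 (under r_yy = .69) in Table 5.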

Limitations
Partly because of the complexity of the models involved in this paper, we opted for an approach based on random effects models with Bayesian inference. This is not, however, the only approach that can be used to generate variance estimates: confirmatory factor analytic (CFA) models can address the same types of data structure and have the advantage of providing more detail about specific constructs of interest (Le, Schmidt, Harter, & Lauver, 2010). Despite these advantages, models with a large number of effects can be computationally impractical to analyze with CFA. Furthermore, CFA provides no straightforward means of handling ill-structured measurement designs (see Putka et al., 2008). In contrast, random effects models based on Bayesian inference can handle a large number of effects and ill-structured measurement designs, and they provided the level of detail required for us to address our research questions.
The samples in the present study were sizable and utilized measurement designs that were broadly comparable. However, differences between the two measurement designs mean that cross-sample comparisons are approximate. We found similar effects not only across samples but also across the supplementary roles; essentially the same results were repeated across two samples, involving eight roles and two variations on a measurement design. That said, it would be helpful to investigate the unconfounded measurement design of performance ratings in a range of different occupations. For example, we do not know whether our results will generalize to nonmanagerial samples.

Note to Table 5. Meta-analytic estimates are based on Sackett, Zhang, Berry, and Lievens (2021). K = number of samples; N = number of participants across samples. r_xy = mean meta-analytic predictor-criterion correlation. r_yy = .52 is from Viswesvaran et al. (1996). r_yy = .66 is from Sample 2 in the present study, with aggregation to overall ratings and based on the partitioning of rater-related variance from Putka and Hoffman (2013) and Jackson et al. (2016). r_yy = .69 is from Sample 1 in the present study, with aggregation to overall ratings and based on the partitioning of rater-related variance from Putka and Hoffman (2013). r_yy = .66 to r_yy = .69 represents the range of r_yy estimates in this study after reallocation of rater-based variance.
Regarding our reapportioning of rater-based variance according to AC estimates, we were cognizant that, despite the similarities, there are differences between performance rating and AC procedures (e.g., AC exercises could represent maximal performance scenarios, and ACs elicit performance in exercises that are deliberately designed to differ from one another). Accordingly, we applied the most conservative estimates from the most complex AC models we could find in the literature (i.e., from Jackson et al., 2016; Putka & Hoffman, 2013). We further reiterate that our intent is not to present the final word on the status of rater variance; rather, we seek to provide an informed perspective on the possibility that even a moderate contribution of rater variance to universe score might make a difference to the estimated reliability of performance ratings. It is our hope that these estimates can be refined in future research.
It would be possible to test the Murphy and DeShon (2000a, 2000b) context-based perspective on rater variance directly with a quasi-experimental design. For example, raters (IV1, with more than one rater per ratee) could be assigned to systematically differing work contexts (IV2, with more than one context) but crossed such that all raters assess in all contexts. Rater effects could then be separated from contextual perspectives, with the former defined as unreliable variance and the latter defined as universe score. To address the potential for different raters to focus on different aspects of performance, or to hold differing views on performance levels, standard-setting training could be introduced (e.g., Pulakos, 1986) as a third IV with three levels (trained, non-trained, and control). Scaled covariates could also be introduced (e.g., rater personality, cognitive ability, mood). The design described here presents a potentially fruitful opportunity for future research and could help to provide further guidance on what portion of rater variance contributes to universe score. However, it might raise problems of generalization, as aspects of many real-world work contexts routinely change unsystematically.
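To make the proposed design concrete, the following sketch simulates ratings from a fully crossed ratee-by-context design (all sample sizes and variance component values are hypothetical, not study estimates) and recovers the ratee, ratee-by-context, and residual rater components from the simulated data. Counting the ratee-by-context component as universe score, as the Murphy and DeShon position implies, yields a higher generalizability figure than counting it as error.

```python
import random
from statistics import mean, variance  # variance() = unbiased sample variance

random.seed(42)

# Hypothetical variance components (illustrative values only):
VAR_P, VAR_PC, VAR_E = 1.0, 0.4, 0.6   # ratee, ratee-by-context, rater noise
N_P, N_C, N_R = 300, 4, 6              # ratees, contexts, raters (fully crossed)

# Simulate y[p][c][r] = ratee effect + context-perspective effect + rater noise.
y = []
for p in range(N_P):
    a = random.gauss(0, VAR_P ** 0.5)
    y.append([])
    for c in range(N_C):
        b = random.gauss(0, VAR_PC ** 0.5)
        y[p].append([a + b + random.gauss(0, VAR_E ** 0.5) for r in range(N_R)])

# Recover the components (balanced-design expectations):
# within-cell spread across raters estimates VAR_E ...
est_e = mean(variance(y[p][c]) for p in range(N_P) for c in range(N_C))
# ... spread of cell means across contexts estimates VAR_PC + VAR_E / N_R ...
cell_means = [[mean(y[p][c]) for c in range(N_C)] for p in range(N_P)]
est_pc = mean(variance(cell_means[p]) for p in range(N_P)) - est_e / N_R
# ... and spread of ratee means estimates VAR_P + VAR_PC/N_C + VAR_E/(N_C*N_R).
ratee_means = [mean(cell_means[p]) for p in range(N_P)]
est_p = variance(ratee_means) - est_pc / N_C - est_e / (N_C * N_R)

# G coefficients for a ratee mean, with context perspectives counted as
# error (the traditional stance) versus as universe score (Murphy & DeShon).
obs_var = est_p + est_pc / N_C + est_e / (N_C * N_R)
g_context_as_error = est_p / obs_var
g_context_as_true = (est_p + est_pc / N_C) / obs_var
print(est_p, est_pc, est_e, g_context_as_error, g_context_as_true)
```

Moving the context-perspective component from the error term to the universe score raises the G coefficient, which is the same mechanism underlying the projected estimates reported earlier in this study.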
We found relatively small dimension effects in our study when compared with the general and rater-related effects in our models. It is possible that this finding was specific to the samples included in this study. However, the dimensions included in the original study (see Appendix Table A3) suggest at least some conceptual distinctions between the dimensions that were applied. The dimensions in our samples are reminiscent of those found in research guidance elsewhere (e.g., Arthur, Day, McNelly, & Edens, 2003). Moreover, our findings are consistent with those in other contexts (as mentioned previously).
On general performance, Murphy et al. (2019) note that shared variance among raters might not purely indicate ratee performance but could also reflect nonperformance-related individual rater and system-related characteristics. These issues remain relevant even if idiosyncratic rater effects are isolated, as they were in our study. The training procedures used in our samples were aimed at mitigating nonperformance effects. Nonetheless, the points that Murphy et al. raise remain important background considerations when evaluating the meaning of evaluations generated by any measurement design employing external raters.
The data set in this study did not include repeated measures of the same group of ratees. Thus, we could not model the effect of occasions of measurement (see Brennan, 2001, for a discussion of this topic). It would be interesting to know, particularly when the aim is developmental in nature, how occasions interact with other effects in the performance ratings measurement design. For example, would the presence of an occasions facet increase the magnitude of effects associated with dimensions, given that performance is expected to develop over time? One possible explanation for small dimension effects is that raters require a greater number of opportunities to observe dimension-related behavior on different occasions. Nonetheless, as with multisource ratings, occasions are not typically described in the literature as being fundamental to the measurement design of performance ratings (e.g., Greguras & Robie, 1998; LeBreton et al., 2014; Murphy & DeShon, 2000a; O'Neill et al., 2015).

Conclusion
Our results suggest a measurement structure for supervisory ratings that primarily reflects general and rater-related effects, but with substantially smaller effects for performance dimensions. Our findings suggest that reallocating even a moderate portion of systematic rater-related variance to universe score makes a sizable difference to reliability estimates for performance ratings. Future research could offer further insights into what proportion of rater variance is likely to be best classed as universe score. In addition, reliability gains in performance ratings would likely follow if it were possible to improve dimension-related evaluations.

Disclosure statement
For disclosure, the fifth and sixth authors had financial interests in the processes used in the present study. However, the remaining authors, who did not hold financial interests in these processes, were granted full liberty to analyze and interpret any associated data to help preserve impartiality. Only full data sets, not selected subsets of variables, were provided and analyzed.