Examining the Carnegie Classification Methodology for Research Universities

ABSTRACT University ranking is a popular yet controversial endeavor. Most rankings are based on both public data, such as student test scores and retention rates, and proprietary data, such as school reputation as perceived by high school counselors and academic peers. The weights applied to these characteristics to compute the rankings are often determined in a subjective fashion. Of particular importance in academia is the Carnegie Classification, developed by the Carnegie Foundation for the Advancement of Teaching. It has been updated approximately every 5 years since 1973, most recently in February 2016. Based on bivariate scores, Carnegie assigns one of three classes (R1/R2/R3) to doctorate-granting universities according to their level of research activity. The Carnegie methodology uses only publicly available data and determines weights via principal component analysis. In this article, we review Carnegie's stated goals and the extent to which its methodology achieves them. In particular, we examine Carnegie's separation of aggregate and per capita (per tenured/tenure-track faculty member) variables and its use of two separate principal component analyses on each; the resulting bivariate scores are very highly correlated. We propose and evaluate two alternatives and provide a graphical tool for evaluating and comparing the three scenarios.


Introduction
The Carnegie Classification (CC), now under the auspices of Indiana University Bloomington's Center for Postsecondary Research (Indiana University 2016), is one of the oldest regularly published rankings of university programs and reputations for doctorate-granting universities. It has been described as "the dominant classification system" for higher education research (Brewer, Gates, and Goldman 2002) and as "…the standard taxonomy for American higher education" (Graham and Diamond 1997). Published approximately every 5 years since 1973 and most recently in February 2016, the ranking currently assigns each of 335 US universities to one of three clusters: R1/R2/R3 or Highest/Higher/Moderate Research Activity, respectively. A stated goal of the classification is to provide a framework for researchers to compare programs among peer institutions (McCormick and Zhao 2005; Zhao 2011). Many universities designated as R2 have created institutional goals and research plans that include achieving an R1 rank in 5 or 10 years (The University of Nevada Las Vegas 2014; University of Houston 2015). Carnegie has often stated that its classification is not a ranking and was not created for this purpose (Shulman 2001; McCormick 2005). However, the original Carnegie Classification described the R1 category as "The 50 leading universities in terms of federal financial support of academic science…" (Carnegie Commission on Higher Education 1973), and the ordering of the current classes seems clear, so we shall use "ranking" and "classification" interchangeably. The most prominent university ranking system in the United States (U.S.) for a wide variety of institutions and individual degree programs is published annually by U.S. News and World Report (USNWR). USNWR uses the institutional categories as defined by the Carnegie Foundation. Both USNWR and Carnegie create scores based upon a data matrix.
While both solicit some data from the universities themselves, all of the Carnegie Rankings (CR) data are publicly available and published on their website. Conversely, USNWR solicits subjective data from high school counselors and academic peers that are unpublished. From a statistician's perspective, the major difference between the methodologies used to create the scores is the selection of weights applied to the data matrix. USNWR uses a subjective approach while CR uses weights derived from principal component analysis (PCA), which is a standard multivariate statistical technique. In the final analysis, USNWR provides a numerical ranking of the form 1, 2, . . . , N whereas CR only provides the three clusters mentioned above (Morse, Brooks, and Mason 2015).
The Carnegie Classification has evolved over time. The 1973, 1976, 1987, and 1994 iterations roughly evenly split the top 200 doctorate-producing universities into four categories based on federal spending and on achieving doctorate production above a certain threshold (50 doctorates per year to be classified as a research university) (Carnegie Commission on Higher Education 1973; Carnegie Council on Policy Studies in Higher Education 1976; The Carnegie Foundation for the Advancement of Teaching 1987, 1994). The short-lived 2000 edition essentially combined the top two and bottom two categories and divided the resulting two categories based on how many doctorates schools granted, without regard to research spending (The Carnegie Foundation for the Advancement of Teaching 2000). The latest scheme, used in the 2005, 2010, and 2015 iterations of the Carnegie Classification, is much more sophisticated than the earlier versions. It uses a more diverse set of variables and advanced statistical techniques to characterize schools. Despite the strengths of the modern classification, further refinement is still possible.
In this article, we examine in detail several aspects of Carnegie's methodology: Carnegie's data matrix, Carnegie's application of principal components, and Carnegie's formation of the three clusters in light of its stated goals. We also propose and examine two modifications that we believe are consistent with Carnegie's stated goals, more faithfully represent the structure of the data, and offer organizational advantages. We then evaluate their impact on the final clustering. The first modification involves the specific application of principal components to the data matrix. The second modification additionally involves the augmentation and redefinition of the data matrix that defines the per capita variables. We discuss the relative merits of these alternatives and Carnegie's technique.

Carnegie Methodology Synopsis
The Carnegie Classification for research universities is based on research expenditures, number of research staff, and number of doctorates granted. All values are annual. Research expenditures are divided into two categories: science, technology, engineering, and mathematics (STEM) spending and non-STEM spending. Doctorates are split into four categories: humanities, social science, STEM, and "professional." These seven variables are each transformed to ranks, which reduces the influence of extreme skewness and outliers. We shall assume that the rank-transformed data have been centered at zero and scaled to have unit variance. Ties are broken by taking the minimum assignable rank, rather than the average, which we believe is consistent with Carnegie's methodology. The first principal component score of these aggregate ranks (computed with loadings C_A) is combined with the first principal component score of per capita versions of the research expenditures and support staff (computed with loadings C_P) to produce a bivariate set of scores for each school. Note that there are two first principal components because there are two separate PCAs done on separate sets of variables. The per capita variables are the STEM and non-STEM research expenditures and the number of research staff, each divided by the number of research faculty, transformed to ranks, and standardized. The final classification of each school is based on a standardized (Carnegie does not specify the exact formula) distance of the Carnegie principal component scores from a particular point (again unspecified, but approximately the minimum of each score). The very-high-research category is the set of schools farthest away from this point (The Carnegie Classification of Institutions of Higher Learning 2016).
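The pipeline just described (minimum-rank transformation, standardization, and a separate first principal component for each block of variables) can be sketched in a few lines of Python. This is a minimal illustration with synthetic data, not Carnegie's actual computation; the variable counts (seven aggregate, three per capita) and the tie-breaking rule follow the description above, while all numerical values are randomly generated stand-ins.

```python
import numpy as np
from scipy.stats import rankdata

def standardized_min_ranks(X):
    """Rank each column, breaking ties by the minimum assignable rank
    (our reading of Carnegie's tie-breaking), then center and scale."""
    R = np.column_stack([rankdata(col, method="min") for col in X.T])
    return (R - R.mean(axis=0)) / R.std(axis=0)

def first_pc(Z):
    """First principal component scores and loadings of standardized data Z."""
    _, _, Vt = np.linalg.svd(Z - Z.mean(axis=0), full_matrices=False)
    loadings = Vt[0]
    # Sign convention: make the largest-magnitude loading positive.
    if loadings[np.argmax(np.abs(loadings))] < 0:
        loadings = -loadings
    return Z @ loadings, loadings

# Synthetic stand-ins for the Carnegie inputs (not real school data).
rng = np.random.default_rng(0)
n = 335                                    # number of doctoral universities
agg = rng.lognormal(size=(n, 7))           # 7 aggregate variables
faculty = rng.integers(100, 2000, size=n)  # research faculty counts
per_cap = agg[:, :3] / faculty[:, None]    # 3 per capita variables

agg_scores, C_A = first_pc(standardized_min_ranks(agg))
cap_scores, C_P = first_pc(standardized_min_ranks(per_cap))
bivariate = np.column_stack([agg_scores, cap_scores])  # Carnegie-style scores
```

Because the two PCAs are run on closely related variables, the two columns of `bivariate` computed from the real data are highly correlated, a point discussed later in the article.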
The result of this transformation and subsequent classification as computed by the Carnegie Foundation is given in Figure 1 (The Carnegie Classification of Institutions of Higher Learning 2016). We provide a schematic of the Carnegie Foundation's approach in Figure 2, which illustrates how the aggregate and per capita variables are treated separately.

The Carnegie Data Matrix
The raw input for the Carnegie Classification comes from two sources: the Integrated Postsecondary Education Data System (IPEDS) and the National Science Foundation (NSF). IPEDS consists of a collection of surveys conducted under the auspices of the U.S. Department of Education. Participation in these surveys is required for schools that take part, or apply to take part, in federal student financial aid (IPEDS 2016). Reporting requirements are extensive, with some institutions dedicating multiple employees to IPEDS alone (Field 2008). The NSF conducts two surveys relevant to the doctoral university classification: the Higher Education Research and Development Survey (HERD) and the Survey of Graduate Students and Postdoctorates in Science and Engineering (GSS). Both have relatively low nonresponse rates of 3.2% and 0.3%, respectively (HERD 2016; GSS 2016).
The doctoral degree and faculty counts come from the 2013-2014 IPEDS data and fall 2013 IPEDS data, respectively. Here, faculty comprises assistant, associate, and full professors. "Professional" doctorates consist of doctorates related to professional fields, but not so-called "professional practice" degrees such as M.D.'s and J.D.'s. Research expenditures are derived from the 2014 HERD. Institutional demarcation for research expenditures is somewhat unclear. For example, Johns Hopkins University more than doubles its reported expenditures by including its Applied Physics Laboratory (HERD 2017), which is an independent institution according to its website (APL 2016). This is not necessarily an error, and Hopkins would still have very highly ranked research spending without this amalgamation; its class is not in doubt. However, this fact does illustrate that the inputs to the Carnegie Classification are partially affected by subjective choices made by parties that may benefit from not choosing the conservative option. Finally, the number of doctorate-holding research staff is provided by the 2013 GSS. We provide estimates of the distributions of the variables used in Carnegie's system and our proposed systems in Figure 3 and summary statistics in Tables 1 and 2.

Figure 2. Schematics of the three scoring schemes (Carnegie Components; Rotated Ordinary Components; Rotated, Ordinary, and Balanced Components). The left scheme depicts Carnegie's separate treatment of the aggregate and per capita variables. The middle scheme uses the same variables as the Carnegie method, but a single two-dimensional principal component analysis; we then perform a rotation to retain the aggregate and per capita interpretation of the Carnegie scores. The right scheme is the same as the middle scheme, except that we balance the number of per capita and aggregate variables. Our proposed schemes are detailed in the Alternative Proposals section but are included here for comparison.

Effect of the Rank Transform
Prior to being subjected to PCA, the input variables for the Carnegie Classification are transformed to ranks. This has a tendency to reduce differences between schools with large values on particular dimensions, which is advantageous because it mitigates the influence of outlying values on the subsequent PCA. However, it also has some negative effects. As Figures 1 and 4 demonstrate, transforming to ranks exaggerates differences between schools with small values along these dimensions. For example, there is a difference of only eight humanities doctorates per year between the lowest ranked schools and the median ranked schools for that dimension. Conversely, the highest ranked school in that dimension produces more than 20 times that difference. Additionally, transforming to ranks tends to obscure even obvious clustering structure. As an example of this, Figure 5 demonstrates that even well-separated clusters may no longer be apparent after applying the rank transformation. In the top panels, clusters that are well separated in a dimension that is then rank transformed are difficult to distinguish. In the middle panels, clusters that are both correlated and separated along an axis are made difficult to distinguish by taking the rank transform of all dimensions. However, it is not always the case that the rank transform destroys the clustering structure. This is demonstrated in the bottom pair of panels, where clusters are separated but not individually correlated with respect to an axis that is not itself rank transformed. In practice, the three common transforms of (1) simply standardizing, (2) log transforming, and (3) rank transforming often produce similar results when combined with PCA (Baxter 1995). We found that replacing the rank transform with a log transform offered no advantages for Carnegie's purposes and required arbitrary translation of the data away from zero.
Despite this general consistency across transforms, the above caveats about the rank transform still apply since Carnegie is attempting to distinguish clusters along an axis with which the individual clusters are correlated.
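The middle-panel scenario, correlated clusters separated along the diagonal, can be reproduced with a short simulation. The separation measure below (distance between cluster means in pooled within-cluster standard deviation units) is our own summary statistic, and the data are synthetic bivariate normal draws as in Figure 5.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(1)

# Two clusters that are correlated and separated along the diagonal,
# the situation where the rank transform hurts most.
cov = [[1.0, 0.9], [0.9, 1.0]]
a = rng.multivariate_normal([0.0, 0.0], cov, size=200)
b = rng.multivariate_normal([4.0, 4.0], cov, size=200)
X = np.vstack([a, b])

def separation(X, n_a):
    """Distance between cluster means in pooled within-cluster SD units."""
    gap = np.linalg.norm(X[:n_a].mean(axis=0) - X[n_a:].mean(axis=0))
    pooled_sd = np.sqrt(0.5 * (X[:n_a].var(axis=0).mean()
                               + X[n_a:].var(axis=0).mean()))
    return gap / pooled_sd

# Rank transform each coordinate, as Carnegie does to its variables.
X_rank = np.column_stack([rankdata(col) for col in X.T])

# The standardized gap between the clusters shrinks after rank transforming.
sep_raw, sep_rank = separation(X, 200), separation(X_rank, 200)
```

The shrinkage is modest for this one-number summary; the visual effect in Figure 5 is stronger because the rank transform also flattens the within-cluster correlation structure.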

Carnegie's Principal Component Analysis
Principal component analysis is a dimension reduction technique that seeks to capture maximal variance of the original data. One of the characteristics of a typical PCA is that the component scores are uncorrelated. However, since the Carnegie Classification is based on the first principal components of two closely related sets of variables, the two component scores have correlation 0.84. If the numbers of doctorates in each category are not included in the aggregate measure, the correlation between the two component scores is 0.93. Evidently the per capita metrics are very similar to the aggregate metrics, except that they contain less information about non-STEM research. The fraction of total variation accounted for by the Carnegie components is 0.70, which is similar to the fraction accounted for by the single first component of all the variables (0.66) and not as much as is accounted for by the first two components of all variables (0.79). The fraction for each component is given in Table 4.
Here, the fraction of total variation accounted for is taken to be one minus the ratio of the trace of the residual covariance to the trace of the covariance (correlation) matrix of all the variables. We define the residual to be the original 10-dimensional (7 + 3) point minus its projection onto the specified components (Carnegie, or ordinary principal). The second ordinary component indicates a natural contrast between STEM research and other research that is not accounted for in the Carnegie components. The loadings for the schemes described in this article are given in Table 3. All loading signs have been chosen to be consistent with the Carnegie methodology.
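This residual-trace calculation can be sketched as follows. The function and variable names are ours, and the data are synthetic standardized draws standing in for the real rank-transformed matrix; for ordinary principal components the residual-trace fraction agrees with the usual variance-explained ratio from the singular values, which makes a convenient check.

```python
import numpy as np

def fraction_explained(Z, loadings):
    """Fraction of total variation captured by the given orthonormal loading
    vectors: 1 - tr(residual covariance) / tr(covariance of Z)."""
    L = np.atleast_2d(loadings)      # k x p matrix, one loading vector per row
    resid = Z - (Z @ L.T) @ L        # remove the specified components
    return 1 - np.trace(np.cov(resid.T)) / np.trace(np.cov(Z.T))

# Synthetic 10-variable data (stand-in for 7 aggregate + 3 per capita ranks).
rng = np.random.default_rng(2)
Z = rng.standard_normal((300, 10)) @ rng.standard_normal((10, 10))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

_, s, Vt = np.linalg.svd(Z, full_matrices=False)
f2 = fraction_explained(Z, Vt[:2])               # first two ordinary components
svd_ratio = (s[:2] ** 2).sum() / (s ** 2).sum()  # should match f2
```

The same function applies to the Carnegie components: since C_A and C_P load on disjoint sets of variables, the two (zero-padded) loading vectors are orthonormal and can be passed in as the rows of `loadings`.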

Carnegie's Clusters
What clustering method is being used to determine how schools are categorized is not clear. The classification made by the Carnegie Foundation is given in Figure 6. Ignoring the color cues, it is not visually obvious that there are any distinct or natural clusters. If there are indeed three true clusters, applying k-means clustering gives a different, albeit similar, classification. If a normal mixture distribution is fitted using the mclust (Fraley and Raftery 2002) package, having three clusters is determined to be optimal from an information criterion perspective; however, the three clusters are quite different from the three clusters given by the Carnegie Foundation and k-means clustering, as can be seen in the bottom left panel of Figure 6. Overall, there does not appear to be a reason to consider a given school more similar to schools in its assigned cluster than to other schools with more similar principal component scores that happen to lie across a cluster boundary determined by one of these methods.

Figure 5. The effect of the rank transform on clustering structure with simulated data. The rank transform tends to reduce the separation between clusters (top panels), especially when variables are correlated and the separation is along the diagonal (middle panels). However, the clusters in the bottom panels are still easily visible after the rank transform. These data are simulated from bivariate normal distributions.
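The comparison above used R's mclust; the same kind of disagreement between methods can be illustrated in Python with k-means and a Gaussian mixture applied to data with no true cluster structure. The data below are a hypothetical elongated cloud, not the actual bivariate Carnegie scores.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Hypothetical stand-in for the bivariate scores: one elongated cloud.
scores = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=335)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
gm = GaussianMixture(n_components=3, random_state=0).fit(scores)
gm_labels = gm.predict(scores)

# An adjusted Rand index near 1 would mean the two methods carve the cloud
# identically; lower values indicate the boundaries disagree.
ari = adjusted_rand_score(km_labels, gm_labels)
```

When there are no well-separated clusters, both methods still return three groups, but the boundaries are artifacts of each algorithm's objective rather than structure in the data, which is the concern raised about Figure 6.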

Alternative Proposals
Rather than employing an ad hoc procedure to capture the aggregate and per capita components of the variation between schools, we propose using a rotation of the first two principal components such that the magnitude of the loadings is concentrated on the aggregate variables in the first rotated component and on the per capita variables in the second. Writing the first two ordinary principal component loading vectors as c_1 and c_2, the rotated loadings are

r_1 = cos(θ) c_1 + sin(θ) c_2,  r_2 = −sin(θ) c_1 + cos(θ) c_2.  (1)

It is easy to verify that r_1 and r_2 are orthonormal. Since principal component loadings with opposite signs are equivalent (we choose to have positive STEM signs for consistency with Carnegie), a rotation of θ is equivalent to a rotation of 180° + θ. We report the smaller in magnitude of the two as the angle of rotation but always define θ as in Equation (1) for consistency with convention. We seek

θ* = argmax_θ { ‖(r_1)_A‖² + ‖(r_2)_P‖² },  (2)

where the subscripts A and P restrict a loading vector to its aggregate and per capita entries, respectively. Note that the norm is the Euclidean norm and that this rotation is very similar to the varimax rotation, which is one of the most popular (Jolliffe 2002) rotation criteria. This technique has a number of advantages. These two components account for the same amount of variance as the first two ordinary components. The parameterization of this rotation makes reference to the first two principal components easy. The transition from the Carnegie components to the standard principal components and these proposed components is given in Figure 8. This method is based on the same variables as those used by the Carnegie Foundation and we shall refer to it as the rotated and ordinary (RO) scheme. However, the Carnegie system may benefit from defining some new variables and combining some of the current variables. Given the sensitivity of the location of some schools to small changes in the non-STEM doctoral variables, it may be desirable to sum them into a single category. It is unclear why the doctorates awarded by schools are not currently included among the per capita variables. It does not seem implausible that the number of students graduated per research faculty member is important for characterizing schools.
We also consider a version of the above proposed technique with the non-STEM doctorates summed into a single variable, and with the number of STEM and non-STEM doctorates per capita as additional covariates. This balanced version of the data matrix has five aggregate variables and five per capita variables. The loadings are rotated in a way analogous to Equation (2). We shall refer to this second scheme as the rotated, ordinary, and balanced (ROB) scheme. Both schemes are visualized in Figure 2. The optimal angles of rotation for the RO and ROB schemes are 31 degrees and 19 degrees, respectively. The objective functions leading to these optima are pictured in Figure 7. As a result of these rotations, the principal component scores are somewhat correlated (ρ_RO = 0.62, ρ_ROB = 0.41) even though the loadings (given in Table 3) themselves are orthogonal.
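To make the rotation concrete, the following sketch finds the angle by grid search. The objective used here, concentrating loading magnitude on the aggregate block in the first component and on the per capita block in the second, is our reading of the RO criterion rather than a verbatim implementation, and the loadings are randomly generated orthonormal stand-ins rather than the real ones.

```python
import numpy as np

def rotate(c1, c2, theta):
    """Rotate an orthonormal pair of loading vectors by angle theta."""
    r1 = np.cos(theta) * c1 + np.sin(theta) * c2
    r2 = -np.sin(theta) * c1 + np.cos(theta) * c2
    return r1, r2

def best_angle(c1, c2, agg_idx, cap_idx, n_grid=3601):
    """Grid-search theta in [-90, 90] degrees to concentrate the first
    component on the aggregate variables and the second on the per capita
    variables (an assumed block-concentration objective)."""
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_grid)
    def objective(t):
        r1, r2 = rotate(c1, c2, t)
        return np.sum(r1[agg_idx] ** 2) + np.sum(r2[cap_idx] ** 2)
    return thetas[np.argmax([objective(t) for t in thetas])]

# Hypothetical orthonormal loadings for 7 aggregate + 3 per capita variables.
rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((10, 2)))
c1, c2 = Q[:, 0], Q[:, 1]

theta = best_angle(c1, c2, np.arange(7), np.arange(7, 10))
r1, r2 = rotate(c1, c2, theta)
# The rotation preserves orthonormality of the loadings, so the rotated pair
# spans the same plane and accounts for the same total variance.
```

Note that although the rotated loadings remain orthonormal, the rotated scores are generally correlated, because the two ordinary components have unequal variances.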
To illustrate how much the new scoring systems would change the current classification system, we have added hypothetical boundaries to the bottom two panels of Figure 8. We attempted to preserve the current classification boundary as much as possible. Summary two-way tables are given in Table 5. These tables are sensitive to the precise choice of the boundary.

Figure .
The panels in the left and right columns correspond to the RO and the ROB scores, respectively. The top row illustrates the transition from the Carnegie scores to the ordinary PCA scores; the middle row illustrates the effect of the rotation; the bottom row is the ultimate difference between the Carnegie scores and the proposed scores resulting from the top rows. Note that the combined effect for both schemes is largely to add an orthogonal component because we use a true second principal component. We have added hypothetical class boundaries to illustrate the minimal change in classes caused by the new scoring systems.

Differential Effects of Investment
The sensitivity of a school's location within the Carnegie scheme as a result of the rank transform, combined with the great difference in unit cost across variables, leads to an interesting calculus for schools that wish to move up in the Carnegie ranking. Figure 9 illustrates how much selected schools would move in the Carnegie scheme if they were to spend an additional $10^6 on a particular Carnegie variable, assuming an annual unit cost given in the table at the top of the figure. Clearly, spending money on some dimensions is much more effective than on others for schools seeking to move up in the ranking. We provide an interactive graphical tool in the next section that allows users to manipulate the characteristics of individual schools and visualize the resulting movement in the Carnegie and our alternative schemes.
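The core of this calculus can be sketched as follows: holding every other school fixed, add $10^6 to one school's value on a single variable and recompute its standardized rank. The expenditure distribution below is a synthetic heavy-tailed stand-in, so the magnitudes are only illustrative.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(5)
n = 335
# Synthetic annual STEM expenditures in dollars (heavy right tail), sorted.
spend = np.sort(rng.lognormal(mean=16.0, sigma=1.5, size=n))

def standardized_rank(values, i, extra=0.0):
    """Standardized min-rank of school i after adding `extra` dollars to its
    value, holding every other school fixed (as in the viewing tool)."""
    v = values.copy()
    v[i] += extra
    r = rankdata(v, method="min")
    return (r[i] - r.mean()) / r.std()

# Effect of an extra $1M for a mid-pack school versus one near the top.
mid_gain = standardized_rank(spend, 100, 1e6) - standardized_rank(spend, 100)
top_gain = standardized_rank(spend, 330, 1e6) - standardized_rank(spend, 330)
```

On a heavy-tailed scale, a school in the crowded middle of the distribution typically passes many peers with the same $1M that barely moves a school near the top, which is why the rank transform makes marginal spending most effective for mid-ranked schools.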

Viewing Application
We have created a Shiny (Chang et al. 2016) web application, pictured in Figure 10, to illustrate the three schemes described in this article.
The application displays the location of all the doctoral universities under the three transformations described in this article. Individual schools can be selected from the dropdown menu on the left. Once a school is selected, that school's marker in the plots is highlighted. Starting from the true values, users can adjust the input variables for a particular school by moving the sliders. To reset the sliders to the true values, users can press the "Reset to Truth" button. The extrema of the sliders are determined by the extrema of the variables for the doctoral universities so that an individual school can be changed to have the highest or lowest rank. Since the principal components are calculated based on the ranks of the schools, having even more extreme values would have no effect. If the slider values are changed, the new rank is computed by holding every other school constant. The transformation loadings are not changed by changing the sliders. To view the school corresponding to a particular point, users can click on that point. To view a larger version of a particular set of components, users should select the corresponding radio button. Some caution must be used when drawing conclusions from this exploratory application. Certain changes to the attributes of a particular school may not be possible. For example, lowering the number of research faculty at a particular school may improve its location relative to the other schools, since large per capita numbers are favored by the Carnegie methodology. However, even though this change is possible in the application, in reality it is probably not possible to lower this number without affecting the other dimensions.

Discussion
The Carnegie Classification of research universities is an effective and transparent research tool for many reasons. Carnegie publishes the data they use. Carnegie provides enough details about its methodology to (at least approximately) reproduce its scoring. Carnegie endeavors to use a well-established statistical method, PCA, to produce scores that represent the structure of the underlying data. It also employs transformations to limit the impact of outliers. Finally, Carnegie cautions against using its classification system as a ranking, which is appropriate given the uncertain nature of the divisions.
However, there are ways in which the Carnegie system may be refined. Carnegie uses the first components from two separate one-dimensional principal component analyses rather than one two-dimensional analysis, and therefore does not optimally represent the underlying structure of the data from a variance perspective. We seek to improve this by computing the first two ordinary principal components, which account for more variance, and applying a rotation that attempts to preserve the interpretation of one loading representing aggregate research and the other representing per capita research. This largely succeeds: more variance is accounted for. However, the new second component contrasts STEM research with non-STEM research at a particular school as much as it represents the per capita research output of schools. The second Carnegie component also measures faculty composition rather than per capita research output to some extent, as it heavily weights STEM over non-STEM research. It appears that there is more of a distinction between STEM and non-STEM research universities after accounting for aggregate research than there is variation in per capita output.
It is unclear why it is useful to separate non-STEM doctorates into three categories. Perhaps the fact that this favors non-STEM oriented schools helps balance the fact that research spending favors STEM (McCormick 2013), but it makes schools' positions very sensitive to small changes in doctorates granted for these separate categories. If it makes sense to separate these categories, why not decompose STEM? We propose combining the three non-STEM categories and also including per capita versions in our balanced alternative to the Carnegie system. As with the rotated ordinary principal components on the original data, the rotated ordinary principal components on this modified data mainly represent aggregate research in the first component and STEM versus non-STEM research in the second.
The class boundaries in the Carnegie scheme can be replaced with straight lines and approximately maintain the current classification. This is further evidence that the two components that make up the Carnegie scheme contain the same information. In our alternatives to the Carnegie scheme, roughly vertical lines can be used to divide the Carnegie classes because our first component is similar to Carnegie's two components.
Given that all three systems produce useful bivariate scores (we assert that our proposed scores contain the most useful information), it seems unnecessary to insist upon discretization. Our interpretation of the first two ordinary principal components for Carnegie's data is that a majority of the variance in universities comes from aggregate research and STEM versus non-STEM composition. We consider neither concentration on STEM nor aggregate research output appropriate measures on which to rank schools because it is not clear that more of either is better. Researchers can select their own cohorts based on the scores rather than a classification that may not be well-suited for their purposes. For consistency with the historic Carnegie methodology and to emphasize per capita research, we suggest using our proposed rotation as well.
The focus of this article has been on how to summarize multivariate information about research universities given metrics that have already been chosen by Carnegie. The choice of which metrics to include is also of great importance. As outlined in the introduction, the Carnegie Classification (CC) has traditionally been based on research expenditures and doctorate production. The CC could be improved by including additional research metrics that have become available since its creation in the 1970s. In particular, we think that the inclusion of citation data would be useful for measuring research output. Citation quantity is not a perfect metric of the research quality of a university, but it does have some nice attributes:
- It is not based on a survey whose answers depend on a small number of people.
- It reflects the opinions of researchers rather than funding agencies.
- It seems, at least naively, to favor STEM less than research funding.
- It could fit neatly into Carnegie's aggregate versus per capita scheme.

One of the best qualities of the CC is that the "raw" data are made available by Carnegie. We would not desire the CC to lose that attribute in order to include more data. However, if Carnegie can find a source of citation data that they can publish, we think it would make a useful addition.