Validating Wordscores: The Promises and Pitfalls of Computational Text Scaling

ABSTRACT Wordscores is a popular computational text analysis method with numerous applications in communication research. Wordscores claims to scale documents on specified dimensions without requiring researchers to read, or even understand the language of, the input text. We investigate whether Wordscores delivers on this claim by scaling the Euromanifestos of 117 political parties across 23 countries on four salient dimensions of political conflict. We assess validity by comparing the Wordscores estimates to expert surveys and other judgmental measures, and by examining the ability of the Wordscores estimates to predict party membership in the European Parliament groups. We find that the Wordscores estimates correlate poorly with expert and judgmental measures of party positions, while the latter outperform Wordscores in the predictive validity test. We conclude that Wordscores does not live up to its original claim of a “quick and easy” language-blind method, and urge researchers to demonstrate the validity of the method in their domain of interest before any empirical analysis.


Introduction
The empirical evaluation of many theories in comparative politics, ranging from government coalitions to voting behaviour, requires data on the policy positions of political parties. Yet, despite the promise and availability of several cross-national data sources, the methods used to estimate parties' positions continue to be a highly contested area of political science. In the debate regarding the appropriateness of competing methods, the computer-assisted analysis of political text has offered particularly promising insights (Grimmer & Stewart 2013). One prominent method in this area is the Wordscores scaling method proposed by Laver, Benoit & Garry (2003). Wordscores can be seen as an application of correspondence analysis to words as data (Lowe 2008, 366-368). In a nutshell, the vocabulary of a set of 'reference' texts, whose positions on the dimension of interest are known, is used as a training set for estimating the unknown positions of another set of 'virgin' texts.
To position documents and hence political actors, Wordscores makes a series of assumptions regarding the distribution of reference documents across the dimension of interest, the distribution of words across reference documents, and the use of words as data more generally (Lowe 2008). As Grimmer & Stewart (2013) note, however, most of these assumptions might not hold in practice, so it is crucially important to evaluate the performance of computer-assisted methods for analysing political text. Nevertheless, despite the 'validate, validate, validate' recommendation by Grimmer & Stewart (2013), our review of the published studies using Wordscores revealed very few studies that assess the validity of the Wordscores output. Moreover, most of the few attempts to assess the validity of Wordscores in the context of estimating parties' positions were rather limited in scope.
In this paper, we present the most rigorous approach to date in validating Wordscores. 1 After a short explanation of the Wordscores assumptions, we review the previous attempts to validate the Wordscores output and outline the design of our study. Our analysis consists of an extensive application of Wordscores to estimate the positions of 164 parties across 23 countries on four widely-used policy dimensions. We furthermore check the robustness of our estimation by employing multiple sets of reference scores for the reference texts and multiple methods of transforming the raw Wordscores output. Following estimation, we attempt a rigorous assessment of validity in the framework laid out by Carmines & Zeller (1979). We conclude that, despite the promise in the original exposé (Laver, Benoit & Garry 2003), Wordscores cannot produce valid estimates of parties' positions in a cross-national context. Our findings have important implications for those who use Wordscores in their empirical analyses.

Wordscores as a popular method of automated text analysis
The Wordscores method was originally proposed by Laver, Benoit & Garry (2003). According to the method, it is possible to estimate the positions of documents (called 'virgin' texts) on an a priori defined dimension of interest by comparing them to a set of documents (called 'reference' texts) whose positions on the dimension of interest are known. Wordscores can therefore be described as a supervised scaling model (Grimmer & Stewart 2013), in the sense that documents are placed on a priori defined policy scales, and that the 'reference texts' and the scores assigned to them function as a training set in a machine learning framework. As such, Wordscores makes the 'bag-of-words' assumption by treating individual words as 'data' irrespective of their syntactic context, and assumes that the relative frequencies of specific words provide manifestations of underlying political positions (Klemmensen, Hobolt & Hansen 2007, 748).
Over the years, Wordscores has proven to be highly popular due to its ease of use and its implementation in two popular statistical programmes (Stata and R). As of October 2016, Google Scholar gives 1021 citations to Laver, Benoit & Garry (2003) who introduced Wordscores (hereafter Laver et al.). Some of the most prominent applications of the method involve the analysis of election manifestos to estimate the policy preferences of political parties and the use of these measurements to empirically test a wide range of questions. For instance, Wordscores has been used to explain government coalitions at the national and sub-national level (Bäck, Debus, Müller & Bäck 2013, Debus 2009, Linhart & Bräuninger 2010, Proksch & Slapin 2006), to study party competition by mapping parties in multi-dimensional ideological space (Laver, Benoit & Sauger 2006), to study similarity in the context of intra-party politics (Coffé & Da Roit 2011, Debus & Bräuninger 2009), to investigate whether parties keep their policy promises (Debus 2008), to explain the success of bills in legislatures (Brunner & Debus 2008) and the choice of putting the EU's constitutional treaty to a referendum (Hug & Schulz 2007b), to establish the policy preferences of sub-national parties and governments (Klingelhöfer 2014, Müller 2009), or simply to map the positions of political parties across time (Kritzinger, Cavatorta & Chari 2004).
Moreover, Wordscores has been used extensively to estimate the positions of documents other than party manifestos. These include speeches delivered by MPs in Ireland, Italy, Germany, and Spain (Bernauer & Bräuninger 2009, Giannetti & Laver 2009, Laver & Benoit 2002, Leonisio & Strijbis 2012), speeches by US state governors (Weinberg 2010), leaders of Russian regional parliaments (Baturo & Mikhaylov 2013), delegates at the Convention on the Future of Europe (Benoit et al. 2005), and the head of state in the UK (Hakhverdian 2009). Furthermore, novel applications of Wordscores outside comparative politics include analyses of reports from US state lotteries (Charbonneau 2009), Chinese newspaper articles (Chen 2011), public statements by US Senators justifying their votes (Bertelli & Grose 2006), advocacy briefs in the US Supreme Court (Evans, McIntosh, Lin & Cates 2007), press releases of the European Commission (Klüver 2009), and even open-ended questions in surveys (Baek, Cappella & Bindman 2011).
Despite this breadth and wealth of applications, one could argue that Wordscores is becoming increasingly outdated as a method, especially given the advent of more sophisticated methods of automated text analysis in political science (see Grimmer & Stewart 2013). To investigate this possibility, we performed a rigorous review of all the citations to the Laver et al. article captured by Google Scholar. 2 Our review revealed a total of 146 uses of Wordscores in empirical analyses, 78 of which have been published in peer-reviewed journals, with the remainder appearing in monographs, chapters in edited volumes, working papers, and conference papers. Interestingly, as Figure 1 shows, the empirical analyses using Wordscores constitute a relatively stable fraction of the total citations to the Laver et al. article, while the trend of publications of empirical analyses in peer-reviewed journals closely mirrors the trend of publications in other outlets. Finally, as shown in Figure 2, our review shows no evidence that the empirical analyses using Wordscores are now published in lesser quality journals (at least judging from their impact factor) compared to previous years.

[Figure 1 note: The plot on the left shows mere citations compared to empirical applications, while the plot on the right shows the empirical applications published in peer-reviewed journals compared to other outlets.]

We therefore conclude that, despite the advent of more sophisticated methods of automated text analysis, Wordscores deserves a rigorous evaluation in its own right, as it remains a popular automated text analysis method in the literature.

[Figure 2 note: Trend line is a locally weighted regression curve (loess, bandwidth = .7).]

Estimation and assumptions
The estimation process begins with the researcher defining a set of reference texts whose positions on a dimension can be assumed with some confidence (for example, when they are obtained from an expert survey). Reference texts therefore need to be informative with regard to their content (words), and need to have a known position on the dimension of interest. Wordscores, implemented as a user-written package in Stata and R, begins by counting the frequency of words in each reference text and assigns a score to each of these words. To do so, Wordscores first calculates the probability P_wr that, conditional on observing word w, we are reading reference text r:

P_wr = F_wr / Σ_r F_wr    (1)

where F_wr is the relative frequency of word w in reference text r. Using these probabilities, Wordscores calculates a score for each word w on each dimension of interest d:

S_wd = Σ_r (P_wr · A_rd)    (2)

where A_rd is the known position of reference text r on dimension d. To score each virgin text v on dimension d, Wordscores uses the word scores S_wd obtained from the reference texts:

S_vd = Σ_w (F_wv · S_wd)    (3)

According to Laver, Benoit & Garry (2003, 316), F_wv in equation (3) denotes 'the relative frequency of each virgin text word [w], as a proportion of the total number of words in the virgin text [v]' (emphasis added). However, all the statistical packages that have been written to implement Wordscores 3 use a different definition of F_wv: the relative frequency of each virgin text word w is taken as a proportion of the total number of words co-occurring between the reference and the virgin texts. This inconsistency between the Laver et al. article and the software implementations is of no particular concern to how Wordscores works, but it does challenge the proof-of-concept validation presented in the Laver et al. article, as we will see in the following section.
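To make the estimation steps concrete, equations (1)-(3) can be sketched in a few lines of Python. This is a minimal illustration using NumPy, not the Stata or R packages discussed in this paper; the matrix layout and function name are our own. Following the software implementations, virgin word frequencies are taken over the words co-occurring with the reference texts:

```python
import numpy as np

def wordscores_raw(ref_counts, ref_scores, virgin_counts):
    """Raw Wordscores scores, following equations (1)-(3).

    ref_counts    : (R, W) word counts for R reference texts
    ref_scores    : (R,)   assigned positions A_rd on one dimension
    virgin_counts : (V, W) word counts for V virgin texts
    """
    ref_counts = np.asarray(ref_counts, dtype=float)
    ref_scores = np.asarray(ref_scores, dtype=float)
    seen = ref_counts.sum(axis=0) > 0                  # words occurring in reference texts
    ref = ref_counts[:, seen]
    F_wr = ref / ref.sum(axis=1, keepdims=True)        # relative frequency within each text
    P_wr = F_wr / F_wr.sum(axis=0, keepdims=True)      # eq. (1): P(reading text r | word w)
    S_wd = ref_scores @ P_wr                           # eq. (2): score of each word
    virgin = np.asarray(virgin_counts, dtype=float)[:, seen]
    F_wv = virgin / virgin.sum(axis=1, keepdims=True)  # proportions over co-occurring words
    return F_wv @ S_wd                                 # eq. (3): raw virgin text scores
```

For example, a virgin text that mixes the vocabularies of two reference texts scored -1 and +1 in equal proportion receives a raw score of 0.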
Nevertheless, irrespective of how one defines F_wv, the S_vd scores only indicate the positions of the virgin texts relative to each other on dimension d. To be able to compare the scores of virgin texts to the scores of reference texts, we need one more step: Wordscores transforms the raw scores back to the original metric of the reference scores, which allows us to compare the transformed scores of the virgin texts with the assigned scores of the reference texts. In their original paper, Laver et al. suggest the following transformation:

S*_vd = (S_vd − S̄_vd) · (SD_rd / SD_vd) + S̄_vd    (4)

Here, S*_vd is the transformed score, S_vd the raw score, S̄_vd the average raw score of the virgin texts, and SD_rd and SD_vd the standard deviations of the reference and virgin text scores respectively. This transformation preserves the mean of the virgin text scores, but equates their variance to that of the reference text scores, thus allowing for comparison. Lowe (2008) points out that the LBG transformation assumes that the raw virgin text scores have the correct mean, but the incorrect variance. However, due to the large number of overlapping words, the virgin score mean is invariably close to the reference text mean, an effect called shrinkage. These overlapping words are often words such as 'the' or 'and', and as they occur frequently in all documents, they receive centrist scores. As such, the distances between the virgin texts are shrunken, and all texts are pulled towards the middle of the scale. Laver et al. fix this by recouping the original variance, but falsely assume that the newly derived mean is correct. This is no problem when the variance and mean are expected to be the same for both reference and virgin texts. However, as Lowe (2008, 359-360) notes, increasing polarisation between parties, or a joint movement of a set of parties to the extremes, makes it hard, if not impossible, to discern whether the mean of the virgin texts is centrist due to the reference scores or due to a shrinkage artifact.
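The LBG transformation in equation (4) can be sketched under the same illustrative conventions (NumPy, our own naming):

```python
import numpy as np

def lbg_transform(raw_virgin, ref_scores):
    """LBG transformation (equation 4): keep the mean of the raw virgin
    scores, but rescale their dispersion to that of the assigned reference
    scores, so that virgin and reference texts share a metric."""
    raw_virgin = np.asarray(raw_virgin, dtype=float)
    ref_scores = np.asarray(ref_scores, dtype=float)
    mean_v = raw_virgin.mean()
    ratio = ref_scores.std(ddof=1) / raw_virgin.std(ddof=1)  # SD_rd / SD_vd
    return (raw_virgin - mean_v) * ratio + mean_v
```

Note that only the dispersion changes; the virgin mean, which Lowe argues may itself be a shrinkage artifact, is left untouched.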
Martin & Vanberg (2008, 95-97) agree with the above criticism and note several more shortcomings of the Laver et al. transformation method. First, as the transformation uses the standard deviation of the raw virgin text scores, it depends on the set of virgin texts itself. This makes the scores non-robust with regard to the virgin texts, and any difference in the set of virgin texts automatically leads to a difference in the scores. This way, a researcher could obtain different positions for the virgin texts solely because of a different selection of virgin texts. Second, despite what Laver et al. claim, their method fails to recover the accurate relative distance ratios, and therefore fails to put the transformed scores and the reference scores on the same metric. This is due to shrinkage, as we pointed out above. To combat these problems, Martin & Vanberg (2008) provide a new transformation based on the idea of relative distance ratios:

r_i = (S_i − S_R1) / (S_R2 − S_R1)    (5)

where two 'anchoring texts' with raw scores S_R1 and S_R2 are chosen, and the placement of every other text i is expressed in relation to this 'standard unit' (Martin & Vanberg 2008, 97). They then use these ratios to construct a new transformation:

S*_vd = A_R1 + (S_vd − S_R1) / (S_R2 − S_R1) · (A_R2 − A_R1)    (6)

Here, S*_vd is the transformed score, S_vd the raw score, A_R1 and A_R2 the assigned scores of reference texts R1 and R2 (where R1 is located to the left of R2), and S_R1 and S_R2 the reference texts' raw scores. In their article, Martin & Vanberg use two reference texts, or 'anchor texts', located to the left and right of the virgin texts. As seen in equation (6) above, both assigned scores for the reference texts are recovered, and the virgin texts are thus placed on the original metric. However, as soon as more than two reference texts are used, as Laver, Benoit & Garry (2003) strongly advise, not all the original exogenous scores of the reference texts can be recovered exactly, as only two texts can be used to define the metric.
Martin & Vanberg thus suggest a change to the transformation:

S*_vd = A_Rmin + (S_vd − S_Rmin) / (S_Rmax − S_Rmin) · (A_Rmax − A_Rmin)    (7)

Here A_Rmin and A_Rmax denote the assigned scores of the lowest and highest placed reference texts on the original metric, and S_Rmin and S_Rmax their raw scores. The positions of these two texts will be recovered exactly, while the scores of the other reference texts will be distorted, as the relative distance ratios of the raw scores do not correspond to the relative distance ratios of the reference scores. Comparison between reference and virgin texts thus becomes difficult, and researchers face a trade-off between increased accuracy of the dictionary and internal consistency on the one hand, and the ability to make valid comparisons on the other (Martin & Vanberg 2008) (see Appendix F).
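The Martin & Vanberg transformation can be sketched analogously (again an illustration in NumPy with our own naming; with two reference texts the anchors are simply R1 and R2, with more they are the lowest and highest placed reference texts):

```python
import numpy as np

def mv_transform(raw_scores, s_min, s_max, a_min, a_max):
    """Martin-Vanberg transformation (equation 7): anchor the metric on the
    raw scores (s_min, s_max) and assigned scores (a_min, a_max) of the two
    anchoring reference texts; those two anchors are recovered exactly."""
    raw_scores = np.asarray(raw_scores, dtype=float)
    return a_min + (raw_scores - s_min) / (s_max - s_min) * (a_max - a_min)
```

By construction, a text with raw score S_Rmin is mapped exactly onto A_Rmin, and likewise for the upper anchor; texts in between are placed linearly.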
To conclude, while the transformation by Laver, Benoit & Garry (2003) depends on the virgin texts and is indifferent to the composition of the reference texts, the transformation by Martin & Vanberg (2008) depends on the reference texts and is indifferent to the composition of the set of virgin texts (Lowe 2008, 360). Moreover, Laver et al. assume that the variances of the set of reference texts and the set of virgin texts are the same, while the Martin & Vanberg transformation does not (Benoit & Laver 2008, 110). In this paper, we use both transformation methods, as we have no use for the raw scores and neither transformation has so far proven to be the most appropriate in all circumstances.
More generally, Lowe (2008) criticised Wordscores for its heavy dependence on reference texts. Lowe (2008, 366-368) views Wordscores as an approximation to correspondence analysis and goes on to treat the method as a statistical ideal point model for words. In doing so, he identified six conditions that Wordscores needs to fulfil in order to ensure consistent and unbiased estimation of the parameters of the ideal point model:

1. The word scores of the virgin texts need to be equally spaced and extend over the whole range of word scores for the reference texts.
2. The word scores of the virgin texts need to be spaced relative to the informativeness term (all texts are thus informative).
3. The reference scores of the reference texts need to be equally spaced and extend past each word score of the virgin texts in both directions.
4. The word scores of the reference texts need to be spaced relative to the informativeness term (all texts are thus informative).
5. All the words need to be equally informative.
6. The probability of seeing a word needs to be the same for all words.

According to Lowe (2008, 369), conditions 5 and 6 will never hold for word count data, because any text, regardless of genre, exhibits a highly skewed word frequency distribution and contains many uninformative words. Nevertheless, we can significantly reduce these problems by filtering out uninformative words: stop words (function words that do not convey meaning but primarily serve grammatical purposes), as well as very uncommon and very common words, such as those appearing in fewer than 1% or more than 99% of the documents in the corpus (Grimmer & Stewart 2013). Doing this makes the probability of seeing a word more equal, and removes non-informative words.
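A filtering step of this kind can be sketched as follows (a simple illustration; the function name and default thresholds are our own, and a real application would typically also use a stemmer and a curated stop word list):

```python
from collections import Counter

def filter_vocabulary(docs, stop_words, min_df=0.01, max_df=0.99):
    """Drop stop words plus words whose document frequency falls outside
    [min_df, max_df], making conditions 5 and 6 more plausible.
    docs is a list of token lists, one list per document."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))          # count each word once per document
    keep = {w for w, c in df.items()
            if w not in stop_words and min_df <= c / n_docs <= max_df}
    return [[t for t in tokens if t in keep] for tokens in docs]
```

For instance, a word appearing in every document (document frequency 1.0 > 0.99) is removed as uninformative, while moderately frequent policy words are retained.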
Conditions 1 and 2 will be less likely to hold when there is not enough overlap between the word distributions of the reference documents. However, by using many documents as reference texts (as Laver et al. advise), the conditions might be well approximated. Condition 2, however, suffers from the fact that some documents are small and thus contain very little to no information. This not only increases the confidence intervals around the estimates, but also creates a large bias in the estimates, negatively influencing the validity of the virgin document scores.
Conditions 3 and 4 are similar to 1 and 2, but as words are more plentiful than texts, the chances of insufficient overlap are considerably lower, and the conditions are thus less important. Lowe even states that 'we might hope that they [words] may relatively evenly spread out across a policy dimension' (Lowe 2008, 369), which makes the conditions even more plausible. Last, Lowe (2008, 369) observes that conditions 1 and 3 can never hold simultaneously, as this would require an infinite data set, and thus concludes that bias in Wordscores is inevitable.

Previous validation attempts and their shortcomings
Considering the comprehensive critique by Lowe (2008), one could conclude that Wordscores should find little use in political science. However, as Grimmer & Stewart (2013, 270) note, the question is not whether computer-assisted methods satisfy assumptions about how language works and texts are generated; rather, such methods should be evaluated on the basis of 'their ability to perform some useful social scientific task'. In this respect, we should not focus on the assumptions, but on validation. As Grimmer & Stewart (2013, 271) note, validation of supervised methods such as Wordscores should involve demonstrating that the computer-assisted method can reproduce the results in a set of documents for which the true scores of the quantity of interest are known. When true scores are not known, the output of computer-assisted methods can be validated against human judgement (see, for instance, the validation of another method by Lowe & Benoit 2013).
Validation, however, is more difficult in the case of parties' ideological positions, because the true scores of the quantity of interest are unknown and it is difficult to estimate them reliably using human judgement. In such instances, researchers often resort to assessing the 'face validity' of estimates of party positions; in other words, whether positions 'appear' to be valid in the eyes of the researcher. As Sartori & Pasini (2007, 363) pointed out, however, demonstrating a measure's face validity might be comforting when other types of validity cannot be assessed due to a lack of resources, but this strategy is not adequate. Face validity should be seen as a necessary but not sufficient condition for good measurement. In the absence of face validity, one could certainly question the usefulness of the measuring instrument. However, face validity by itself is not enough, and researchers need to assess the additional types of validity outlined in Table 1 (Carmines & Zeller 1979). These three additional types of validity should not be considered interchangeable (Adcock & Collier 2001, 537): if we fail to validate a measure in terms of one type of validity, this cannot be compensated by showing that the measure fares well in terms of another.

[Table 1 source: Carmines & Zeller (1979) and Sartori & Pasini (2007)]

More specifically, in the case of estimating parties' ideological positions, Grimmer & Stewart (2013, 271) argue that validation 'requires numerous and substance-based evaluations', and propose that 'scholars must combine experimental, substantive, and statistical evidence' to demonstrate that the output of computer-assisted methods such as Wordscores can be considered valid.
Nevertheless, while these recommendations have been stated in classic works on measurement in social science (Zeller & Carmines 1980), political science (Adcock & Collier 2001), and content analysis (Krippendorff 2004), our review of the literature showed that most of the published studies have used the Wordscores routines in Stata or R without validating the output.
As expected, the first study that attempted to validate the Wordscores output was the original article by Laver, Benoit & Garry (2003). In their article, Laver et al. use the 1992 manifestos of British and Irish parties as reference texts, assign to them reference scores from expert surveys conducted in 1992, and estimate the positions of the 1997 election manifestos on both economic and social policy dimensions. Laver et al. then assess the criterion validity of the estimates by comparing the Wordscores output against the estimates of an expert survey conducted in 1997. Laver et al. also used a similar approach to estimate parties' positions for the German election of 1994 but, in the absence of comparable expert survey data, assessed the German estimates only in terms of face validity. Our replication of the Laver et al. analysis not only revealed the inconsistencies between the definitions in the article and the way Wordscores is implemented in R and Stata but, more importantly, showed that the results presented in the article are not particularly robust. More specifically, we found that adding the manifestos of smaller parties to the analysis drastically changes the estimates provided by Wordscores, making them inconsistent with the expert survey estimates. We report these findings in detail in Appendix A. Furthermore, we argue that if Wordscores aims to be a useful tool for estimating parties' positions on policy dimensions, its validity needs to be evaluated beyond such simple 'proof of concept' demonstrations, especially when these demonstrations are shown not to be robust.
In this respect, Budge & Pennings (2007) compared the estimates given by Wordscores to those of the Manifesto Project on the left-right dimension for British parties across time. Their results were unfavourable, as they found that Wordscores produces flat scores across time compared to the Manifesto Project estimates. However, in a response, Benoit & Laver (2007a) dismissed these findings because Wordscores was not properly implemented (Budge & Pennings merged several manifestos before using them as reference texts) and because the Manifesto Project estimates were used as a benchmark, something which, the authors argue, can easily be contested. Klemmensen, Hobolt & Hansen (2007) performed a similar evaluation by using Wordscores to estimate the positions of Danish parties on the left-right dimension. Although their article has been widely cited as a successful validation of Wordscores, a closer investigation of the results shows that this is not actually the case: the correlations reported by Klemmensen et al. show that Wordscores performs worse than the Manifesto Project estimates when compared to a common benchmark (expert surveys). If the proponents of Wordscores argue that the Manifesto Project estimates are problematic because they do not always correlate with expert surveys (e.g. Benoit & Laver 2007a, Benoit & Laver 2007b), then it should follow that the Wordscores estimates are even worse. Most recently, Hjorth et al. (2015) repeated this exercise in both Denmark and Germany, validating the Wordscores output against placements by experts and voters using rank-order correlations. The results of this validation indicated that the Wordscores output correlated better with independent measures of party positions than the output produced by another popular text scaling method (Wordfish).
However, the rank-order correlations examined by the authors constitute a far too lenient test for a method which promises to deliver interval-level measurements of party positions (point estimates with associated 95% confidence intervals).
The most comprehensive validation so far has been conducted by Bräuninger, Debus & Müller (2013), who used Wordscores to estimate parties' left-right positions across 13 West European countries between 1980 and 2010 in a study specifically aimed at assessing the validity of the technique. Their results were mixed: the Wordscores estimates correlated well with the Manifesto Project in some countries, but not in others. We note that the results of this comparative study were far more cautious than the earlier investigations based on single countries (including the original proof of concept in Laver et al.). The Bräuninger et al. study, however, had its own limitations, namely that it only assessed estimates on a single dimension (left-right), using a single benchmark (the Manifesto Project data), which is itself controversial, as previously argued. 4 In general, all of the previous studies that attempted to assess the validity of Wordscores in the context of party positions looked at criterion validity, neglecting other, equally important, types of validity as discussed above. Moreover, the correlation coefficients used to assess criterion validity were either Pearson's product-moment or Spearman's rank-order, neither of which takes into account systematic measurement error. Finally, none of the studies attempted to investigate the robustness of estimation by using different sources for the reference scores and different transformation methods. Our study addresses all these limitations and provides the most rigorous validation approach to date. We use Wordscores to estimate parties' positions in 23 countries, across four different policy/ideological dimensions, using three different sets of reference scores and two different transformation methods, and we assess the estimates in terms of content, criterion, and construct validity using appropriate statistical measures.
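To illustrate why product-moment and rank-order correlations are too lenient, consider a statistic that does penalise systematic disagreement, such as Lin's concordance correlation coefficient. We use it here purely as an illustration; the code and naming are our own:

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient: like Pearson's r it equals
    1 for perfect agreement, but unlike r it penalises systematic shifts in
    mean or scale between the two sets of measurements."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    covariance = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * covariance / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)
```

A constant offset between two measures leaves Pearson's r at a perfect 1, while the concordance coefficient collapses towards 0; this is precisely the kind of systematic measurement error that the correlations used in previous validations ignore.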

Study design
We applied Wordscores to the manifestos of political parties published on the occasion of the 2009 elections to the European Parliament (hereafter we refer to these documents as 'Euromanifestos') across 23 countries, using the Euromanifestos of the 2004 EP elections as reference texts. 5 We chose the elections to the EP over national elections to improve the comparability of estimates across countries: national election campaigns involve more idiosyncratic uses of political text, whereas elections to the EP take place at the same time and within a shared political context. Moreover, we avoid stretching the comparison across time (unlike Bräuninger et al.) in order to ensure that our comparisons are not affected by changes in the political discourse. In this way we provide a very favourable context for testing the validity of Wordscores, much like Laver et al. did.
Instead of tracking down all these documents ourselves, we rely on an off-the-shelf collection provided by the Euromanifestos Project. 6 These are the documents collected and coded (according to a hand-coding scheme similar to the Manifesto Project) by country-specific coders of the Euromanifestos Project (Braun, Mikhaylov & Schmitt 2010). As also shown in the case of the Manifesto Project (Gemenis 2012, Hansen 2008), the collection of these documents is fraught with problems. Along with 'genuine' Euromanifestos, the collection includes all sorts of documents of dubious usefulness for estimating parties' positions. Amongst them are small pamphlets that do not present a broad policy profile, and documents that contain irrelevant or misleading sections (e.g. references to other parties' positions). Evidently, such documents are highly problematic to use with computer-assisted methods for content analysis (see Proksch & Slapin 2009). We nevertheless decided to use this off-the-shelf database in order to test the method in a realistic context, as researchers are more likely to rely on off-the-shelf collections for their cross-country comparative analyses than to construct their own using country experts (e.g. Hug & Schulz 2007b, Pennings 2006). Unlike all the previous studies, we do not limit our validation to the left-right dimension, but estimate parties' positions on three additional dimensions: European integration, economic left-right, and the socio-cultural liberal-conservative dimension. These dimensions have been used extensively to analyse party competition in the context of (elections to) the EP (Hix 1999, Hix, Noury & Gérard 2006, Hooghe & Marks 1999, Hooghe, Marks & Wilson 2002, McElroy & Benoit 2007). In addition, unlike previous studies, we use a variety of sources for the reference scores, as well as various sources of party positions to compare the Wordscores estimates against.
To begin with, we do not use the estimates from the Manifesto Project, because we agree with Benoit & Laver (2007a, 2007b) that they are fraught with measurement error and, as such, should not be used as a 'gold standard' for evaluating the validity of other methods. The reasons for this are further explained elsewhere (see Gemenis 2013b). Instead, we use expert survey estimates, as Laver et al. and most of the empirical applications cited earlier have done. Of course, expert surveys have their own problems, so we cross-validate the Wordscores estimates using estimates from an alternative, less used, but highly useful approach: the judgemental estimation of party positions using manifestos and other document sources. For the advantages and shortcomings of the judgemental approach to coding, see Gemenis (2015, 2293-2296). We further cross-validate the findings by employing two different data sources within each approach. For expert surveys, we use the 2003 expert survey and the 2002 and 2010 Chapel Hill Expert Surveys (Bakker et al. 2015, Hooghe et al. 2010); for judgemental coding, we use the overall position coders assigned to each party on the basis of the whole document in the Euromanifestos Project dataset (Braun, Mikhaylov & Schmitt 2010), and the estimates from the 2009 EU Profiler dataset (Trechsel 2010) as scaled in Gemenis (2013a). Table 2 gives a summary of these sources, while the exact wording of the questions and scales used in our study is presented in Appendix D. Finally, unlike previous studies, we cross-validate the results by employing two different transformations for each set of Wordscores estimates: the transformation originally proposed by Laver et al. (hereafter referred to as LBG) and the alternative transformation proposed by Martin & Vanberg (2008), hereafter referred to as MV. 7 The use of all of these sources and methods for transforming the raw scores allows us to perform the most extensive validation of Wordscores to date.

Results
The combination of different sources of reference scores and transformations over the examined dimensions and countries implies that we ran the Wordscores scaling model a whopping 600 times for the validation: 25 countries/territories (including separate analyses for Flanders, Wallonia, and Northern Ireland) × 4 dimensions × 3 sources of reference scores × 2 transformation methods. All the Wordscores estimates from these analyses were copied to a meta-dataset with parties as the unit of analysis and merged with estimates from the sources listed in the last column of Table 2. This meta-dataset was used for the subsequent analyses presented below.

Content validity
According to Carmines & Zeller (1979), content validity refers to whether the method used for measuring a latent construct represents all of its facets. If one uses multiple indicators that are scaled into a single index, then these indicators should represent all facets of the construct. Alternatively, if one uses a single indicator (for instance, as done in surveys asking for a left-right placement), then this indicator has to capture all the different facets of the construct. Moreover, a measure that includes facets that do not belong to the construct would be problematic in terms of content validity. As noted in the section about previous validation attempts, the evaluation of content validity is usually of a qualitative nature, so it is difficult to see how it could be assessed in the context of the output presented by Wordscores. We propose a workaround to this problem by conceptualising the construct, in the context of Wordscores, as being represented by the words used in the reference texts.
When Wordscores places virgin texts on a dimension of interest, it does so by calculating a wordscore for each of the words occurring in the reference texts. As Wordscores is non-discriminating and scores all words on all dimensions, treating all words as equally informative of the dimension of interest is problematic in terms of content validity. This is because we should not expect each and every word in a reference text to be associated with a dimension of interest, no matter what this dimension is. This problem with Wordscores is known, of course, but here we are interested in quantifying the degree of content validity in order to investigate how big a problem it is for estimating parties' positions.
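The wordscoring step described above can be illustrated with a minimal sketch. The function names and toy data are our own; this is not the reference implementation, only a simplified rendering of the scoring logic in Laver, Benoit & Garry (2003):

```python
from collections import Counter

def word_scores(ref_texts, ref_scores):
    """Compute a wordscore for every word occurring in the reference
    texts, following the logic of Laver, Benoit & Garry (2003)."""
    freqs = [Counter(text) for text in ref_texts]
    vocab = set().union(*freqs)
    scores = {}
    for w in vocab:
        # F_wr: relative frequency of word w within each reference text
        f = [c[w] / sum(c.values()) for c in freqs]
        # P_wr: probability that we are reading reference text r given w
        p = [x / sum(f) for x in f]
        # S_w: expected reference score given the word
        scores[w] = sum(p_r * a_r for p_r, a_r in zip(p, ref_scores))
    return scores

def score_virgin(virgin_text, scores):
    """Raw virgin-text score: the frequency-weighted mean wordscore
    over the words that also occur in the reference texts."""
    counts = Counter(t for t in virgin_text if t in scores)
    n = sum(counts.values())
    return sum(c / n * scores[w] for w, c in counts.items())
```

In this toy setting, a word occurring only in a reference text scored 0 receives a wordscore of 0, and a virgin text mixing the vocabularies of both reference texts lands between the two reference scores. Note that every word in the reference texts is scored, regardless of its relevance to the dimension of interest.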
To do so, we decided to treat each of the words scored in the reference texts as an indicator of the latent concept, and evaluated whether these words relate to the latent concept/dimension of interest. To assess this, following Krippendorff (2004, 101-102), we looked at the context in which these words appear. For example, the word 'committee' can be indicative of a party's position on the dimension of EU integration when it refers to an EU committee, but not when it refers to other types of committees. We therefore hand-coded each and every word in the reference texts to see how many of the words used to score the virgin texts were actually used in the context of the dimension of interest. As this is a particularly time-consuming process, we restricted this analysis to British documents and the European integration dimension. Our choice of British parties should be fair to Wordscores, given that British Euromanifestos are some of the best documents in terms of relevance for assessing parties' positions on European integration. For our hand-coding exercise, we defined the context as a natural sentence that starts with a capital letter and ends with one of the following delimiters: '.', '?', '!', ';' (Däubler et al. 2012, 942). Items in (bullet-pointed) lists were considered as separate sentences.
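Our sentence-level definition of context can be approximated with a short sketch. The delimiter set is the one listed above; the function name and the bullet markers are assumptions of this illustration:

```python
import re

def split_contexts(text):
    """Split a document into 'context' units: natural sentences ending
    in '.', '?', '!' or ';' (delimiters as in Däubler et al. 2012).
    Lines starting with a bullet marker are treated as separate units."""
    units = re.split(r"(?<=[.?!;])\s+|\n[-•*]\s*", text)
    return [u.strip() for u in units if u.strip()]
```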
Each word was coded as one (1) when it was used in a context referring to European integration and zero (0) otherwise. In Figure 3 we plot the distribution of the average hand-coding evaluation among all the words used in each virgin document of each British party. What is clear from the figure is that the vast majority of words used by Wordscores to estimate party positions are not particularly informative if one looks at the context in which they appear. It appears that Wordscores uses far more noise than signal to estimate party positions.
Note: The horizontal axis of Figure 3 refers to the rate at which words were considered by the hand-coding to be relevant.
If one considers that all this noise brought by the non-informative words which are automatically used in Wordscores moves party positions towards the middle of the scale, one can understand the logic behind the LBG transformation, which stretches the party scores towards the end points of the scale. Although we agree that one needs to make some kind of transformation to account for the presence of noise that leads to the centrist bias in party positions, we do not agree that such a fundamental problem in the content validity of Wordscores can be solved by a simple transformation of the raw scores. To give an example, we examine closely the wordscoring of the 2009 UKIP manifesto. UKIP is well known for its extreme anti-EU stance, which should leave no doubt about where the party should be placed. The Wordscores raw placement for UKIP is 11.5 [11.2, 11.8] and the LBG transformed one is 9.3 [5.5, 13]. In either case, the party is placed in the middle of the scale. The transformation only improves this placement by specifying that this counter-intuitive middle placement is estimated with a lot of uncertainty. Wordscores tells us that UKIP could be placed on either side of the scale, even though one should not have much difficulty in establishing the position of the party simply by looking at the UKIP Euromanifesto.
One could argue, of course, that this is a problem of the 2009 UKIP Euromanifesto being very short. However, the size of the document should only contribute to making the confidence interval around the point estimate larger. The problem here is that the UKIP point estimate is counter-intuitively placed in the middle of the scale. This is not because the UKIP document is short, but because Wordscores is unable to accurately estimate the party's position due to all the noise introduced by the scoring of non-informative words. This is clearly shown in Figure 4, where we plotted all the words scored in the UKIP 2009 Euromanifesto according to their wordscore. Most of the words scored by Wordscores are not informative with regard to placing UKIP on the European integration dimension, and since most of the words have wordscores near the middle of the scale, the point estimate for UKIP was counter-intuitively given as 11.5 (transformed by LBG to 9.3).
The problem is therefore deeper than the uncertainty that comes with the size of the documents, and this can be established simply by looking at the cases of parties with much larger documents than UKIP's. The fundamental problem lies in the content validity of Wordscores. The lack of content validity, brought about by scoring each and every word irrespective of its relevance in providing information about the dimension of interest, pushes scores towards the middle of the scale. Transforming the raw scores will pull the estimates towards the endpoints of the scale, but there is no guarantee that the estimates will be pulled in the right direction. This will become evident in the next section, where we examine the criterion validity of the Wordscores estimates across all countries.
Note: In Figure 4, word size corresponds to the frequency of appearance in the UKIP virgin text; words that were hand-coded as being relevant in at least 50% of instances are plotted in black.

Criterion validity
Criterion validity refers to the extent to which a measure correlates with another measure that reflects the same concept (Carmines & Zeller 1979). Here, we assess the criterion validity of Wordscores by comparing its estimates to alternative measures of party positions on each dimension, as outlined in the study design section. As we have argued, this comparison needs to be made using appropriate correlation coefficients. Neither Pearson's product-moment correlation coefficient nor Spearman's rank-order correlation coefficient is able to capture the presence of systematic measurement error.
As pointed out by Krippendorff (1970, 144), both Pearson's and Spearman's coefficients are based on the presumption of linearity (Y = bX), which is not the same as agreement between two measurements (Y = X). It is therefore possible for two measures to correlate perfectly (according to Pearson's or Spearman's coefficients) without being identical measures. Therefore, all the studies that have used such coefficients to assess the criterion validity of measures of party positions (including all previous validation studies involving Wordscores) are likely to overestimate the degree of validity in the presence of systematic measurement error. In order to overcome these problems, we use the concordance correlation coefficient (Lin 1989), defined as:

ρ_c = 2ρσ_xσ_y / (σ_x² + σ_y² + (μ_x − μ_y)²)

where μ_x and μ_y are the means of the two measures, σ_x² and σ_y² are the corresponding variances, and ρ is Pearson's product-moment correlation coefficient between the two measures. Put more simply, the CCC can be conceptualised as

ρ_c = ρ · C_b

or, in other words, as the product of Pearson's product-moment correlation coefficient ρ, which measures dispersion (i.e. the degree of random measurement error), and a bias correction factor C_b, which measures the deviation from the 45-degree line of perfect concordance. A ρ_c of 0 denotes absence of concordance, a ρ_c of 1 denotes perfect concordance, and a ρ_c of -1 perfect negative concordance.
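For illustration, the CCC can be computed in a few lines, using the identity 2ρσ_xσ_y = 2·cov(x, y); the function name is ours and population (1/n) moments are used, as in Lin's original definition:

```python
def concordance_cc(x, y):
    """Lin's (1989) concordance correlation coefficient:
    rho_c = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # population variances and covariance
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

A measure that is perfectly linear but systematically shifted (e.g. y = x + 2) has a Pearson's r of 1 but a CCC well below 1, which is exactly the systematic error that Pearson's coefficient misses.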
To estimate and interpret the CCC, we further need to consider two complicating factors. Firstly, the CCC requires both measures to be on the same scale. Normally, one could rescale all estimates of party positions from 0 to 1 using the well-known (estimate − min)/(max − min) formula. Although this is straightforward for the expert survey and judgemental coding data, where the scale minimum and maximum are clearly defined, this is not the case with the Wordscores estimates. Despite the promise made by the LBG transformation that it puts the estimates on the same metric as the reference texts (Laver, Benoit & Garry 2003, 317), this does not always happen in practice. For instance, our Wordscores estimates on the left-right dimension range from -2.09 to 22.45, whereas the BL expert survey that was used for the reference scores ranges from 0 to 20. The question is thus how to treat such counter-intuitive results. Following other studies that used the CCC with the Manifesto Project estimates, which suffer from the same problem (Gemenis 2012, Gemenis 2013b), we use the empirical scale minimum and maximum as given in the Wordscores output. In one approach, we do this per dimension (in the aforementioned example, we use -2.09 and 22.45 as min and max in the formula, respectively), and in another we implement this process per individual country. This way, we can check whether our inferences are robust to this rescaling.
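The min-max rescaling described above is a one-liner; when the scale endpoints are undefined, as with the Wordscores output, the empirical minimum and maximum are substituted (a sketch with a hypothetical function name):

```python
def rescale_unit(values, lo=None, hi=None):
    """Min-max rescaling to [0, 1] via (estimate - min) / (max - min).
    When lo/hi are not given, the empirical minimum and maximum of the
    values are used instead of the nominal scale endpoints."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    return [(v - lo) / (hi - lo) for v in values]
```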
Secondly, we need to set beforehand an objective criterion of what will be considered the minimum accepted correlation for criterion validity. Unfortunately, all previous studies have interpreted correlation coefficients (as strong, moderate, etc.) on entirely subjective criteria. Given that Lin's original strength-of-agreement criterion of ρ_c > .9 is too stringent for social science measurement, we use as the criterion the CCCs between the various estimates to which we compare the Wordscores estimates.8 This way, we have a clear, precise, and objective criterion for our assessment. If Wordscores promises to estimate party positions accurately, then these positions should correlate with other measures of party positions at least as highly as these other measures correlate with one another. Finally, we introduce a measure of uncertainty for the CCC, based on 95% z-transformed confidence intervals. To be as lenient as possible, we consider Wordscores successful in terms of criterion validity when the upper CI bound (not the point estimate) of the CCC is higher than the three CCCs obtained when comparing the three other datasets of party positions to one another.
Despite the objective but lenient terms of our evaluation, Figures 5 and 6 clearly show that the Wordscores estimates cannot be considered valid estimates of party positions in terms of criterion validity (for a detailed overview of the concordance correlations, see Appendix G). No matter the dimension (left-right, European integration, economic, or socio-cultural), the source of reference scores (BL, CHES, or EMP), the method of transformation (LBG or MV), the rescaling used to estimate the CCC (whole dimension or per country), or the dataset to which we compared them (CHES, EMP, or EUP), the correlation of Wordscores with other datasets never attained a CCC as high as the other datasets attained when compared to one another.9 To be sure, one could argue that this pessimistic conclusion could be due to the constraints imposed by the rescaling and calculation of the CCC. Nevertheless, the simple Pearson's r correlation coefficients on the estimates before the rescaling needed for the CCC (available in Appendix H) were also very low.

Construct validity
Construct validity refers to the extent to which our measure behaves as expected within a given theoretical context. To assess construct validity, we formulate a simple hypothesis about the relationship between party positions and membership in the political groups of the EP. This relationship has been used before to illustrate the use of the Manifesto Project (Klingemann, Volkens, Bara, Budge & McDonald 2006, 36-39) and expert survey (McElroy & Benoit 2007) data. In this paper, we take this hypothesis a step further, arguing that we can predict with some confidence party membership in the political groups of the EP on the basis of national parties' positions on the socio-economic and European integration dimensions. To do so, we estimate a multinomial regression model, where the dependent variable takes eight values, one for each of the seven party groups in the EP (as of 2009), with non-attached parties forming the eighth category.
To assess the explanatory power of the model, we use the count R², which is simply the proportion of correct predictions, as well as McFadden's pseudo-R², which compares the explanatory power added by the independent variables to that of a model that includes only the intercept. We compare the explanatory power of the model using the three predictor variables as estimated by Wordscores (using all possible configurations of reference scores and transformations) to the explanatory power of models using exactly the same predictors as measured by the three alternative datasets shown in Table 2: the 2010 Chapel Hill Expert Survey, and the judgemental coding of the Euromanifestos Project and the EU Profiler.
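These fit statistics, together with the BIC used below as a measure of overall fit, are straightforward to compute from a model's predictions and log-likelihood; the following sketch uses hypothetical inputs rather than our actual estimation output:

```python
import math

def count_r2(y_true, y_pred):
    """Count R^2: the proportion of correctly predicted group memberships."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R^2: 1 - LL(full model) / LL(intercept-only)."""
    return 1 - ll_model / ll_null

def bic(ll_model, n_params, n_obs):
    """Bayesian Information Criterion: -2*LL + k*ln(n). A BIC
    difference above 10 between two models counts as 'very strong'
    evidence against the higher-BIC model (Long & Freese 2001)."""
    return -2 * ll_model + n_params * math.log(n_obs)
```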
As can be seen from Figure 7, in none of the cases do the Wordscores estimates perform better than estimates from other datasets in predicting membership in the EP party groups. To avoid misleading evaluations as to how much better one model is compared to the other, we use the Bayesian Information Criterion (BIC) as a measure of overall fit. In every case, the difference in BIC between models using the Wordscores estimates and models using estimates from the other datasets is larger than 10. This indicates 'very strong' evidence (see Long & Freese 2001, 87) against the model using the Wordscores estimates. What does this imply for Wordscores? According to Zeller & Carmines (1980, 82), construct validation requires 'a pattern of consistent findings' across different hypotheses and studies in order for a measure to establish a high degree of construct validity. Our study did not provide such extensive evidence, but it is rather instructive that Wordscores failed the very simple construct validation test that has been used elsewhere in the literature.

Conclusions
In their proof-of-concept, Laver, Benoit & Garry (2003, 329) promised that Wordscores can deliver 'effective' estimates of political actors' policy positions in a matter of seconds. Our replication of Laver et al. revealed inconsistencies in the software implementations of Wordscores and showed that the results presented in their proof-of-concept are not particularly robust. Following Grimmer & Stewart's (2013) advice to 'validate, validate, validate', we subjected Wordscores to a rigorous validation under conditions that should be favourable to the method. Hence, we focused on a cross-sectional rather than longitudinal (cf. Bräuninger, Debus & Müller 2013) comparison, where we should not expect significant changes in discourse that could compromise the effectiveness of the method. Moreover, we used an 'off-the-shelf' collection of documents and data from expert surveys and the judgemental coding of party manifestos, which is consistent with how the method is used in practice.
In contrast to what was promised by Laver et al., our findings showed that the Wordscores estimates of party positions cannot be considered valid. The examination of content validity showed that the Wordscores estimates are compounded by the scoring of irrelevant words, and that this cannot be corrected by the LBG rescaling method. The examination of criterion validity showed that the Wordscores estimates correlate far lower with other estimates of party positions than the other estimates correlate with one another. Moreover, the examination of construct validity showed that the Wordscores estimates have significantly lower predictive power when used in statistical models compared to other estimates of parties' positions. Finally, these findings were shown to be robust across different configurations of reference scores and rescaling methods.
In general, our negative conclusions imply that Wordscores should not be used to estimate parties' policy positions using electoral manifestos as reference and virgin texts. However, we need to qualify this conclusion. As the performance of Wordscores has been shown to vary widely depending on the circumstances of estimation (see Bräuninger, Debus & Müller 2013), we outline three ways in which the Wordscores estimates can be improved, namely careful document selection, pre-processing, and parsing.
With regard to document selection, we note that our results could be driven by the fact that we used Euromanifestos rather than national election manifestos. However, the most comprehensive validation study using national election manifestos found mixed results (see Bräuninger, Debus & Müller 2013). It seems that the problem is not so much the electoral context in which the documents are produced, but rather the quality of the documents as sources of party positions. In our validation, we used the off-the-shelf collection of the Euromanifestos Project, which is less than ideal. One could possibly improve the validity of the Wordscores estimates by carefully selecting the documents to be analysed, as already pointed out by Proksch & Slapin (2009) for the case of Germany.
Second, researchers can further improve the validity of the Wordscores estimates by using a more rigorous document pre-processing procedure than the one we used in this paper. Instead of removing the most frequently occurring words, as we did, researchers could consider removing stop words more rigorously using a pre-defined list. Removing stop words would reduce the amount of noise, which tends to push Wordscores estimates towards the middle of the scale irrespective of the informative content of the documents. It is also worth mentioning that this problem has already been accounted for by another popular scaling method, Wordfish, which applies weights 'capturing the importance of [words] in discriminating between party positions' (Slapin & Proksch 2008, 709).
Third, researchers should consider using only those parts of the documents they are interested in. So, when the object of investigation is foreign policy, only the paragraphs directly dealing with foreign policy should be used, and not the document as a whole. Parsing documents into different policy areas depending on the estimated policy dimension is required in text scaling methods like Wordfish that assume that the text is unidimensional (Slapin & Proksch 2008). The same logic can be extended to Wordscores, assuming that the content of policy areas one is not interested in would only add noise to the estimates.
Nevertheless, while these three suggestions can improve the validity of the estimates, they come at the expense of a considerable investment in time and resources. Document selection requires considerable expertise in terms of party politics, which is often difficult to assemble and manage in a cross-national project. Lists of stop words are often context dependent, while compound words can cause considerable problems in identifying stop words with automated software. Moreover, parsing documents into policy-related sections requires knowledge of the language the documents were written in, something which goes against the promise of Wordscores as a method where it is 'not necessary for an analyst [using the technique] to understand or even read the text to which the technique is applied' (Laver, Benoit & Garry 2003, 329).
Wordscores could potentially produce valid estimates of party positions, but only after some serious investment in time and in language- and country-related expertise. We leave to the reader the question of whether this investment negates the original promise of a quick and easy method (Laver, Benoit & Garry 2003, 226, 312). What we showed here is that, when the method is used as a language-blind and quick way to estimate party positions, it does not deliver what it promises. Therefore, any researcher who wishes to use Wordscores 'as is' should always demonstrate the validity of the output using a carefully designed validation study, as shown here.
Notes

1 Full replication material, including .do files and all associated source documents, will be made available through a public dataverse on publication.
2 A spreadsheet with the details of the review can be found in the replication materials.

3 These are the wordscores package in Stata (written by Kenneth Benoit), and the austin (written by Will Lowe) and quanteda (written by Kenneth Benoit and Paul Nulty) packages in R.

4 Ruedin (2013a) and Hug & Schulz (2007a) compared Wordscores estimates against many other methods aiming to measure parties' positions. Their comparisons, however, did not focus on Wordscores as such but rather showed how results might differ across the various methods.
5 The countries in our study include all EU member-states up to 2009 with the exclusion of Luxembourg and Malta where no appropriate reference scores were available for 2004. The names of parties used in the study can be found in Appendix B.
6 The collection can be accessed at http://www.ees-homepage.net/. The names of the documents used can be found in Appendix B. Moreover, following the advice by Grimmer & Stewart (2013, 272-273), we processed these documents to make them suitable for computer-assisted analysis. We present our processing method in Appendix C.
7 Following Laver et al., we use all available documents for 2004 as reference texts when using the LBG transformation. This way, the texts more or less extend over the whole range, as required by the first assumption made by Wordscores (see the section on Wordscores assumptions). In Appendix E, we show which two documents we selected for each country to serve as anchors for estimation according to the MV transformation.

8 We would like to thank Oliver Treib for suggesting this.

9 Detailed results and additional figures are available in Appendix G.
Appendix A: Reanalysis of Laver, Benoit & Garry (2003)

Much of the initial validation for Wordscores rested on scoring the 1997 Irish manifestos on a social and an economic dimension using the 1992 manifestos as reference texts (Laver, Benoit & Garry 2003). We attempted to replicate the findings in the paper using the manifestos, code, and reference scores available on the Wordscores website: http://www.tcd.ie/Political_Science/wordscores/index.html. Unfortunately, we were not able to replicate the results published in Laver et al. using the materials from the website. Upon closer examination, we realized that replication is not possible for two reasons. First, the reference texts provided on the Wordscores website are not the same as the ones used in the Laver et al. article. As is clear from the number of words, the documents provided on the website have been cleaned differently compared to the documents used in the Laver et al. article. This cleaning refers to the removal of numbers, special characters, document formatting content (tables of contents, headers, footers), and occasionally stop words, which is an important step in computer-assisted text analysis. Moreover, the website includes in the set of reference texts the manifestos of two additional parties (the Greens and Sinn Fein), unlike the Laver et al. article, which uses as reference texts the manifestos of only five parties.
Second, and most importantly, the current (as of March 12, 2022) '23-June-2009' version of wordscores for Stata gives different results than the older version 'v0.36' that was used to produce the results in the Laver et al. article. The differences in the output given by these two versions can be attributed to changes in the code with regard to how F_wv (equation 3 in the main text) is calculated. According to Laver, Benoit & Garry (2003, 316), F_wv denotes 'the relative frequency of each virgin text word [w], as a proportion of the total number of words in the virgin text [v]' (emphasis added). This is what has been implemented in the '23-June-2009' version of the Stata wordscores package. Conversely, 'v0.36' and the two packages that implement Wordscores in R ('austin' and 'quanteda') define F_wv as the relative frequency of each virgin text word w as a proportion of the total number of words co-occurring between the reference and the virgin texts. In an e-mail communication, Kenneth Benoit clarified that the 'correct' implementation of Wordscores is in the R packages and the 'v0.36' version of wordscores for Stata. This implies that the definition of F_wv given in Laver et al. is incorrect. It also implies that all those who used the '23-June-2009' version in their (published) papers got the 'wrong' Wordscores results. In our communication, Kenneth Benoit also indicated that the change in how F_wv is defined does not make much difference, as the results correlate highly.
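The difference between the two definitions of F_wv can be made concrete with a small sketch (the function names are ours; both variants normalise word counts, but over different denominators):

```python
from collections import Counter

def fwv_total(virgin_tokens, wordscores):
    """F_wv per the article's definition: frequency of each scored word
    as a proportion of ALL words in the virgin text."""
    counts = Counter(virgin_tokens)
    n = sum(counts.values())  # denominator: total words in virgin text
    return {w: counts[w] / n for w in counts if w in wordscores}

def fwv_cooccurring(virgin_tokens, wordscores):
    """F_wv as implemented in 'v0.36', austin and quanteda: proportion
    of only those words co-occurring in virgin and reference texts."""
    counts = Counter(t for t in virgin_tokens if t in wordscores)
    n = sum(counts.values())  # denominator: co-occurring words only
    return {w: c / n for w, c in counts.items()}
```

When many virgin-text words do not occur in the reference texts, the two denominators diverge, and so do the resulting raw scores.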
We tested this claim by implementing the two versions of wordscores ('v0.36' and '23-June-2009') for Stata across all the parties in our analysis for four different dimensions (left-right, European integration, economic, social), using the BL expert survey for the reference text scores and the LBG transformation. Figure A1 shows the results, which clearly contradict the claim that the results of the two implementations would correlate highly ('about .97'). The concordances between the two sets of scores, measured by the concordance correlation coefficient, are .44 (left-right), .53 (European integration), .33 (economic), and .32 (social). The respective Pearson correlation coefficients are .55, .62, .41, and .38. The correlations are similar when different sources for the reference text scores are used. This is clear evidence that changing the definition of F_wv changes the Wordscores estimates radically. In Table 1 below, we show how different sets of documents for reference texts (five parties as in the Laver et al. article versus seven parties as in the replication material found on the Wordscores website) and different implementations of wordscores for Stata ('v0.36' versus '23-June-2009') lead to substantially different results.
The results in the top left quadrant of Table 1 attempt to replicate the findings of Laver et al. by using the manifestos of five Irish parties (FF, FG, Labour, DL, PD) and the 'v0.36' wordscores for Stata (which is identical to the austin and quanteda packages in R). They are almost identical, save for some minor differences due to the way the documents were cleaned for the analysis in Laver et al. As pointed out in that article, the results look reasonable and consistent with how the parties have been placed in expert surveys (e.g. DL and Labour on the economic left, the other parties on the economic right).
However, when we change the definition of v from 'the total number of words in the virgin text', as stated in the original article (Laver, Benoit & Garry 2003, 316), to 'the proportion of the total number of words co-occurring in the virgin and reference texts', as was done in the '23-June-2009' version of wordscores for Stata, we get the much different results presented in the bottom left quadrant. It is clear from the table that changing the definition of v produces estimates that move parties in a way that does not make much sense (for instance, Fianna Fail as the most economically left party) and otherwise makes it impossible to distinguish between the parties given the confidence intervals of the estimates.
The change in the definition of v that was implemented on 23 June 2009 produces party positions that appear reasonable and intuitive only if one adds the manifestos of the Greens and Sinn Fein to the set of reference texts, as shown in the bottom right quadrant. However, if we add these two manifestos to the set of reference texts but keep the definition of v as in the Laver et al. article, we get the results in the top right quadrant. Again, these results do not make much sense, since the confidence intervals overlap significantly and many of the point estimates are rather implausible (e.g. the Greens and Sinn Fein are in the middle of both scales). We find it strange that the documents for the Greens and Sinn Fein were not included in the APSR article, but were included in the replication of the article as implemented on the Wordscores website, which contained a different Stata wordscores code. Why did the authors not include the SF and Greens documents in their original analysis as presented in the APSR article? We believe that this was not done because the addition of these two parties in 2003 under the alternative definition of v, which is used in R and is favoured by Kenneth Benoit (as per our e-mail communication), would have given results that are inconsistent with expert surveys. Similarly, when the wordscores code was changed and the results appeared implausible, the two documents were added as reference texts to the replication materials on the Wordscores website to improve the validity of the results. Since the positions of parties under the Laver et al. transformation (which is used in the APSR article) are sensitive to the inclusion/exclusion of virgin texts, as shown by Martin & Vanberg (2008), we ask whether the exclusion of SF and the Greens from the analysis in Laver et al., but their inclusion in the 'replication' of the analysis on the Wordscores website, constitutes an attempt to 'cherry pick' among different possible results in a way that supports the argument in favour of Wordscores.

Appendix C: Document preparation

Document Selection
We obtained the manifestos from the Euromanifestos Project website.10 For all countries, text files were available for the 2009 manifestos, while for the 2004 manifestos only some parties in Germany and the United Kingdom were available in this format. We thus used the stored portable document files, which we converted into UTF-8 text files to ensure compatibility and the preservation of non-English characters. When conversion from .pdf was not possible because the file had been saved as an image, we used optical character recognition (OCR) software. While OCR will never convert a text 100% faithfully, sufficiently accurate results can be obtained, especially as the software we used allowed us to manually correct mistakes and instances where the software was not sure. For some countries, not all the released manifestos were stored in the database, or the stored document was something other than a true Euromanifesto, in which case we looked for the document in other online sources. Both the resulting .txt and .pdf versions of these source documents can be found among our replication files.

Pre-processing
From all text files, we removed headers and footers, page numbering, section headings, graphs, numbers, currency symbols and tables. We then imported these texts into Wordfreq (cite) to build the frequency tables for each country. From these frequency tables, we deleted stop-words, as they carry minimal information value (Slapin & Proksch 2008, 332). While not all studies using Wordscores remove stop-words, a significant number do (Ruedin 2013a, Ruedin 2013b, Slapin & Proksch 2008). Moreover, the practice is common in automated content analysis (Grimmer & Stewart 2013), and it seems especially suited for Wordscores, which falsely assumes that all scored words carry the same informative value: a word such as 'immigration' adds information to a text in a way that words like 'the' or 'and' do not. Because such function words occur frequently in all texts, their scores will lie close to the mean of the reference texts and will thus cause the scores for the virgin texts to cluster around the mean. As such, they are indistinguishable from truly centrist words, causing parties to appear more centrist than they really are (Lowe 2008, 360-361). Removing these words thus increases the discriminative power of Wordscores.
Here, we follow Ruedin (2013b) and remove the 20 most frequently occurring words for each country in both 2004 and 2009. We do not use stemming, as it decreases the effectiveness of the method (Ruedin 2013b) and is not beneficial for all languages. This is especially the case for languages in which compound words are common, such as German or Finnish, where stemming may lead to a loss of information. Table 5 shows the 20 most frequently occurring words that were dropped for Great Britain. Most of these words can easily be considered non-informative, as they are either adjectives, adverbs or prepositions. Even words such as 'european' or 'europe' can be argued to function mostly as adjectives, as would be expected in a manifesto for European elections. The .dta files with these words removed may be found in the replication files.
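This pre-processing step can be sketched as follows. The snippet is a minimal Python illustration with invented toy texts (the actual analysis used Wordfreq and Stata): it builds a country-level frequency table and removes the n most frequent words, which here play the role of a corpus-specific stop-word list.

```python
from collections import Counter

def drop_top_words(documents, n=20):
    """Remove the n most frequent words across a country's manifestos.

    `documents` maps party names to lists of lower-cased tokens.
    """
    counts = Counter()
    for tokens in documents.values():
        counts.update(tokens)
    top = {word for word, _ in counts.most_common(n)}
    return {party: [t for t in tokens if t not in top]
            for party, tokens in documents.items()}

# Toy example: 'the' and 'europe' dominate the corpus and are stripped.
docs = {
    "A": "the europe the union needs reform the".split(),
    "B": "europe the social europe we want".split(),
}
cleaned = drop_top_words(docs, n=2)
```

In the real analysis, n = 20 and the frequency tables are computed per country and per election year, as described above.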

Wordcount
The table below shows the word counts for the documents. Using the wordscores package for Stata, we calculated the mean and standard deviation of the total words in the documents and of the unique words (words occurring in only a single document). In addition, New indicates whether the 2004 European election was the first in which the country participated. Documents from the new countries were significantly shorter in 2004 but increased in length in 2009, while the number of unique words changed little. The number of documents analysed was higher in 2004 than in 2009, mostly due to the availability of existing digital copies. The number of words per manifesto differs considerably across countries, and also within countries, as shown by the standard deviation. This implies that the size and scope of the documents differ, and that when performing an analysis, scholars need to be aware of what the document under investigation covers and whether all documents are comparable.
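These summary statistics can be computed as in the following sketch (Python with invented toy documents, for illustration only; 'unique' follows the definition above, i.e. word types occurring in only one document):

```python
import statistics
from collections import Counter

def wordcount_summary(documents):
    """Summarise a set of tokenised manifestos.

    Returns the mean and standard deviation of document length, plus
    the number of word types that occur in only one document.
    """
    lengths = [len(tokens) for tokens in documents.values()]
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))  # document frequency per word type
    unique = sum(1 for df in doc_freq.values() if df == 1)
    return {
        "mean": statistics.mean(lengths),
        "sd": statistics.stdev(lengths),
        "unique_types": unique,
    }

stats = wordcount_summary({
    "A": ["europe", "union", "reform"],
    "B": ["europe", "social", "policy", "reform", "now"],
})
```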

[Table: scale construction. Endpoints are coded Left (1) Favours to Right (20) Opposes; example items include 'Promotes raising taxes to increase public services' and 'Favours liberal policies on matters such as abortion, homosexuality, and euthanasia'. Missing values are recoded to 4 (Neutral). The EU Integration scale (Y axis) uses items 12, 21, 22, 23, 24, 26 and 27; a second scale is composed of items 1, 2, 11, 14, 16, and 18; a third of items 5, 6, 7, 8, 9, 10, 19, 20 and 25. †Denotes variables that have been reversed for subsequent analysis. ‡EU Profiler data were scaled according to Gemenis (2013a).]

In their original article, Martin & Vanberg (2008) (hereafter MV) advise in a footnote to calculate the difference between the exogenously assigned reference scores and the scores produced by their transformation, in order to gauge the size of the trade-off scholars have to make between increased accuracy of the dictionary on the one hand, and internal consistency and the ability to make valid comparisons on the other. While this step is not necessary to validate the applicability of the MV transformation in our study, as we do not compare our scores against the reference scores, we decided to calculate these differences in order to test the transformation and give a preliminary assessment of the trade-off for scholars who want to use the transformation in the future. To calculate the trade-off, we input the reference documents a second time as virgin documents. The difference between the transformed score and the exogenously assigned score then indicates the size of the trade-off. In addition, it provides the user with an extra tool to assess whether the actual word usage of the texts is reflected in the exogenously assigned score: a large difference means that the exogenous score does not match what is reflected in the words. This difference can be either negative or positive, depending on the direction (lower or higher on the dimension of interest).
To give an idea of how this works, we calculate the difference on the EU integration dimension in the Netherlands using the reference scores from the Benoit & Laver dataset. As Table 9 shows, the scores of the anchor texts (LPF and D66) are fully recovered, while the scores of the texts in between have changed. These changes range from −2.01% for the PvdA to 35.41% for the SP, indicating that the words in the documents suggest a lower score for the PvdA and a higher score for the SP than the exogenous reference scores do. Nevertheless, the SP document, which shows the largest difference, retains its position relative to the other parties, as the CU-SGP score also increases. However, a reversal does take place between the CDA and GL: based on the exogenous scores, the GL document is more positive about European integration than the CDA, while the MV transformation switches these positions. Apart from the PvdA, all parties receive a higher score than exogenously assigned, ranging from a small 3.59% for GL to 35.41% for the SP. While Martin & Vanberg (2008) do not give a criterion for what the maximum acceptable difference should be, we consider the differences between the exogenous scores and the scores given by the MV transformation to be sufficiently large to warrant closer inspection. We therefore extend our calculation to include all countries and dimensions, to rule out the possibility that these differences arise from peculiarities of this specific example.
As the table below shows, the results of this analysis follow a similar pattern: in some cases the positions of the parties are switched, and large differences such as the 35.41% for the SP above are not uncommon. Therefore, if scholars choose to use the MV transformation in the future, we strongly advise them to calculate these differences. Not only will this help them assess the size of the trade-off; the MV-calculated scores for the reference documents will also provide a more valid baseline against which to compare the transformed scores for the virgin texts. Additionally, the differences can be used as a (partial) check on how well the exogenously assumed relative distances between the reference texts are reflected in the actual word use (Martin & Vanberg 2008). Especially when the differences are large, this can warrant a closer inspection of the exogenously assigned score for a party and of why it differs from the actual word use.
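The trade-off check can be sketched as follows. This is a minimal Python illustration under our reading of Martin & Vanberg's footnote, with invented reference texts and scores rather than the paper's data: it computes word scores, rescores the reference texts as if they were virgin texts, applies a linear MV rescaling anchored on the two extreme reference texts (so that, as with LPF and D66 above, the anchors are fully recovered), and takes differences from the exogenous scores.

```python
from collections import Counter

# Toy reference texts with exogenous scores 0, 5 and 10;
# 'M' over-uses right-leaning vocabulary relative to its score.
refs = {
    "L": ["left", "left", "shared"],
    "M": ["left", "right", "right", "shared"],
    "R": ["right", "right", "shared"],
}
ref_scores = {"L": 0.0, "M": 5.0, "R": 10.0}

# Word scores: S_w = sum_r P(r|w) * A_r, where P(r|w) is proportional
# to the word's relative frequency in reference text r.
rel = {}
for r, toks in refs.items():
    counts = Counter(toks)
    total = sum(counts.values())
    rel[r] = {w: n / total for w, n in counts.items()}
vocab = {w for freqs in rel.values() for w in freqs}
S = {}
for w in vocab:
    mass = {r: rel[r].get(w, 0.0) for r in refs}
    z = sum(mass.values())
    S[w] = sum(mass[r] / z * ref_scores[r] for r in refs)

# Score the reference documents a second time, as virgin documents.
def raw_score(tokens):
    scored = [S[w] for w in tokens if w in S]
    return sum(scored) / len(scored)

raw = {r: raw_score(toks) for r, toks in refs.items()}

# MV rescaling: linear map anchoring the raw scores of the two extreme
# reference texts back onto their exogenous values (0 and 10 here).
def mv(theta):
    return (theta - raw["L"]) / (raw["R"] - raw["L"]) * 10.0

# The trade-off: transformed score minus exogenously assigned score.
tradeoff = {r: mv(raw[r]) - ref_scores[r] for r in refs}
```

In this toy example the anchors L and R recover their exogenous scores exactly, while M receives a positive difference, reflecting its right-leaning word use relative to its assigned score of 5.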

Appendix G: Concordance Correlations
The four tables below show the concordance correlations between the Wordscores estimates for the virgin texts and the expert scores, for all combinations of exogenous reference score, type of benchmark, transformation, and rescaling. The four graphs below show scatterplot matrices between the Wordscores estimates using the LBG transformation (as reported in the tables above) and the 2009 expert scores. The matrices were constructed in R using the car package and show the relations between the six data sets, including a density plot on the diagonal.
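The concordance correlation (here assumed to be Lin's coefficient, the standard concordance measure) differs from Pearson's r in that it penalises deviations from the 45-degree line, so it also captures level differences between two sets of estimates. A minimal implementation for illustration, not the code used in the analysis:

```python
def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient.

    rho_c = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    using population (n-denominator) moments.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

For identical series the coefficient equals 1; for two series with the same ordering but a level shift, Pearson's r stays at 1 while the concordance correlation drops, which is why it is the more demanding benchmark for comparing Wordscores estimates against expert scores.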