Structural validity and measurement invariance of the short version of the Big Five Inventory (BFI-10) in selected countries

Abstract We sought to determine the applicability and structural equivalence of a personality instrument developed in western, educated, industrialised, rich and democratic (WEIRD) contexts in non-WEIRD environments. The data for this study came from interviews conducted during the sixth wave of the World Values Survey in the Netherlands (N = 1902), Germany (N = 2046), Rwanda (N = 1527), and South Africa (N = 3531). We conducted exploratory and confirmatory factor analyses to assess structural validity and measurement invariance. The findings from the Big Five Inventory 10 (BFI-10) instrument did not support a perfect five-factor model as theorised by the Big Five Personality model in any of the countries, even though Germany and the Netherlands obtained comparatively better results. Consequently, the findings do not support structural validity and do not demonstrate measurement invariance between WEIRD and non-WEIRD countries. The findings indicate that while the concise BFI-10 instrument partially replicates the structure of the B5P model in WEIRD countries, it falls short in non-WEIRD countries. Users of the instrument should therefore proceed with caution in both WEIRD and non-WEIRD contexts, bearing in mind the instrument's structural flaws.


PUBLIC INTEREST STATEMENT
Personality tests are important for determining a person's strengths and shortcomings. The outcomes of such examinations can be utilised to make potentially life-changing career decisions. However, failing to account for the fact that various cultural groups may attribute different meanings to items in a rating instrument can lead to erroneous judgments about people's personalities. As a result, it is critical to guarantee that the tests being utilised have been scientifically proven to be valid and reliable. The study described in this article looks at the psychometric features of the BFI-10 test in diverse cultural settings. The findings show that structural validity exists in WEIRD countries but not in non-WEIRD countries. The findings show that practitioners and researchers should proceed with caution when employing purportedly globally accepted assessment tools in contexts for which they were not designed.

Background and introduction
Since Tupes and Christal's (1992) naming of the five-factor Big Five Personality (B5P) traits model, and Norman's (1963) replication of it, the model has become one of the central theoretical lenses in personality-related research. The model is derived from the psycholexical methodological approach to personality research, which emphasises the use of language to describe human personality traits. The significance of the five-factor structure stems from its robust ability to explain and predict individual differences regarding diverse topics such as mental health (Anglim & Horwood, 2021; Sun et al., 2018), job satisfaction (Bui, 2017), academic performance (Trapmann et al., 2007; Vedel, 2016), and work performance (Barrick & Mount, 1991; Pletzer et al., 2019).
The five major dimensions of the B5P model are extraversion, agreeableness, openness, conscientiousness, and neuroticism (Costa & McCrae, 1985). Costa and McCrae (1985) define extraversion as a personality trait comprising energy, talkativeness, and assertiveness. Second, agreeableness is explained as the degree of friendliness, cooperation, and compassion. Third, openness entails being perceptive and imaginative, in addition to possessing a diverse range of interests. Fourth, conscientiousness refers to the human characteristics of orderliness and thoroughness. Finally, neuroticism refers to an individual's emotional stability and susceptibility to negative emotions.
Research interest in the B5P has coincided with the proliferation of a variety of measurement instruments claiming to assess the big five traits. Subjectively, the measures can be classified as either long or short versions. Best known are Costa and McCrae's (1992) 240-item Revised NEO Personality Inventory, their 60-item NEO-Five Factor Inventory and their 44-item Big Five Inventory. Taylor and De Bruin (2006) developed a South African version of the test, the Basic Traits Inventory (BTI), with 193 items. Short versions include Donnellan et al.'s (2006) 20-item International Personality Item Pool-Five Factor Model (IPIP-FFM) and Gerlitz and Schupp's (2005) 15-item Big Five Personality Inventory (BFI-S). The World Values Survey (sixth wave) used Rammstedt and John's (2007) 10-item Big Five Inventory (BFI-10).
While scholars of personality have agreed that personality traits generally fall within the five categories proposed by the B5P (John, 2021), a methodological concern is whether respondents react similarly to B5P items (Hahn et al., 2012). Within the South African context, Abrahams and Mauer (1999) and McDonald (2011) demonstrated that individuals from different cultural backgrounds may differ in their interpretations of the terminology used to categorise the big five personality traits. Consistent with this, Grobler and De Beer (2015, p. 50) observe that when participants from diverse cultural backgrounds are included in studies, the likelihood of "measurement bias from item interpretation differences is high, and empirical investigation of the items is important." This study examined the psychometric properties of the BFI-10, a brief instrument for measuring the B5P factors, using data from the World Values Survey (WVS) of four culturally diverse countries, namely Germany, the Netherlands, Rwanda, and South Africa. The first two countries can be classified as western, educated, industrialised, rich, and democratic (WEIRD), and the latter two as non-WEIRD. The goal was to determine whether a five-factor structure similar to that proposed in the B5P model is suited to application in both WEIRD and non-WEIRD countries. Based on data from the sixth wave of the WVS, Ludeke and Larsen (2017) as well as Simha and Parboteeah (2020) have raised concerns about the structural equivalence of the BFI-10 questionnaire. Ludeke and Larsen (2017) reported that when the questionnaire was used, indicators from the same scale tended to correlate negatively. Additional testing is necessary, as structural equivalence, which exists when a factor model is applicable across groups, is a necessary condition for accurate statistical analyses across cultural groups, and this requirement must be objectively established (Fontaine et al., 2008).
Although Ludeke and Larsen (2017) found significant item-correlation issues with the Big Five measures in the WVS data, their analysis ignores the pattern of loadings and factor structures that emerge from the data, which would indicate the aspects of the theorised model that are applicable to different countries. We therefore revisited the data to analyse the factor structures and patterns of factor loadings in four countries with varying degrees of WEIRD-ness. This study is expected to contribute to a better understanding of the Big Five model's applicability in various countries. Considering this, the objectives of the study are as follows.

Main objective
To examine the psychometric properties of the BFI-10 instrument, a brief instrument for measuring the B5P factors, using data from the World Values Survey (WVS) of four culturally diverse countries, namely Germany, the Netherlands, Rwanda, and South Africa.

Sub-objectives
- to determine whether a five-factor structure similar to that proposed in the B5P model is suited for application in both WEIRD and non-WEIRD countries.
- to assess whether the short version of the Big Five Inventory is measurement invariant in both WEIRD and non-WEIRD countries.

Literature review
Personality tests, as well as career aptitude and competence assessments, are useful for ascertaining people's strengths and weaknesses. The results of such tests can be used as a basis for potentially life-altering career decisions (Cascio & Aguinis, 2011). However, concerns have been raised about the direct transfer of assessment tools developed in advanced economies to disadvantaged and culturally distinct contexts without considering the instrument validity implications in the contrasting contexts (Allik et al., 2017; Laajaj et al., 2019; Meiring et al., 2005). Inadequate consideration of the fact that different cultural groups may ascribe different meanings to items in a rating instrument can result in inaccurate judgements about individuals' personalities (Fontaine et al., 2008). In South Africa, the Employment Equity Act (No. 55 of 1998) offers a measure of protection in stipulating that "psychometric testing and other similar assessments of an employee are prohibited unless the test or assessment being used has been scientifically shown to be valid and reliable".

Measurement bias and personality assessment
From the foregoing, it is evident that measurement bias (incomparability or inequivalence) is an ever-present threat to the reliability and validity of personality scales in cross-cultural studies. In the literature, three categories of measurement bias are identified, namely construct, method and item bias (Berry et al., 2011). Construct bias occurs when a construct applies uniquely to a particular cultural group, or when construct indicators cannot be used across different groups (Fontaine et al., 2008). Method bias arises when supposed construct measures in an instrument do not measure the construct they are supposed to measure (Meiring et al., 2005). This may be due to translation errors, acquiescent responding, or group-influenced response patterns. Lastly, item bias occurs when a construct indicator systematically yields a higher or lower score than expected for a particular group (F. Van de Vijver & Tanzer, 2004). While measurement bias and its consequences can emanate from issues such as differing norms across cultures, translation problems, and language issues (Nye et al., 2008; Sass, 2011; Saucier et al., 2014; Thalmayer et al., 2020), a useful way to identify the possibility of such a problem with an instrument is by evaluating its measurement invariance (Jak et al., 2014). This concept is explained below.
Wang et al. (2018) define measurement invariance as a statistical property of a research instrument that indicates whether it consistently measures a latent variable (construct) across groups of respondents. Wu et al. (2007) provide a similar characterisation, stating that "measurement invariance holds if and only if the probability of an observed score, given the true score and the group membership, is equal to the probability of that score given only the true score" (p. 2). Measurement invariance is seen as a reliable indicator of the structural equivalence of a research instrument (F. J. Van de Vijver & Poortinga, 2002). Selig et al. (2008) offer a comparable and perhaps the most complete definition, stating that measurement invariance entails testing an assessment tool to ascertain whether the latent variable it measures provides equivalent information across population groups. In other words, measurement invariance examines whether questionnaire items measuring a particular construct, such as personality, are understood in the same way by different groups. Groups typically used in invariance studies are age (Dong & Dumas, 2020), ethnic origin (Selig et al., 2008), socioeconomic status (Hughes et al., 2021), level of education (Patel et al., 2019), nation (F. J. Van de Vijver & Poortinga, 2002) and occupation (Spurk et al., 2015).

The measurement invariance concept
The B5P assessment tools are examples of where measurement invariance testing is applicable, and a number of studies have tested the measurement invariance of the B5P scales (Chiorri et al., 2016; Laverdière et al., 2013; Schmitt et al., 2011). Measurement invariance is applicable in the case of the B5P scales because they employ multiple and conceptually related clusters of items to assess the five personality characteristics. When measurement invariance holds, respondents from diverse groups assign the same meaning to each of the five personality clusters, allowing for the comparability of research results.

Assessing measurement invariance
The literature on measurement invariance has grown over the years. This is particularly evident from a study of the articles in the Journal of Cross-Cultural Psychology [0022-0221 (print); 1552-5422 (web)]. Essentially, measurement invariance is assessed through a stepwise process, using mainly confirmatory factor analysis, which follows a series of increasingly restrictive equality-constraint hypotheses (Berry et al., 2011). In this sub-section, we discuss the categories of measurement invariance according to the different levels of model restriction. A literature search identified anything from three to five distinct levels/types of measurement invariance.
Here the focus will be on the most comprehensive typology, which comprises conceptual, configural, metric, scalar, and strict invariance.
• Conceptual invariance implies that the domain or trait should make sense in all the groups to be compared (Berry et al., 2011). When a measured construct is specific to a particular context, it would thus be impossible to find a comparable operational pattern of relationships with other constructs across the groups (Fontaine et al., 2008). Although conceptual invariance is based mainly on theoretical arguments, and although no statistical tests directly test conceptual equivalence, Berry et al. (2011) state that evidence of configural invariance supports claims regarding conceptual equivalence.
• Configural (configurational) invariance is the fundamental type of invariance and is examined first, before any other types of invariance are considered. Configural invariance (pattern invariance) is confirmed when the number of factors and their loading patterns are consistent across groups (Bialosiewicz et al., 2013). However, under configural invariance the strength of the factor loadings may vary across population groups. Configural invariance therefore does not guarantee structural equivalence across respondent groups in a multigroup study, and additional tests are required to confirm group comparability.
• Metric invariance (also known as weak invariance) is a type of measurement invariance that is at a higher level than configural invariance. Unlike configural invariance, which requires only a similar number of factors and an identical pattern of factor loadings to confirm measurement invariance, metric invariance requires that the strength/size of the factor loadings be equal across population groups (Davidov et al., 2014). Where there is no metric invariance, any comparison of constructs across groups should be performed with caution, as the constructs themselves are not identical (Marsh et al., 2012).
• Scalar invariance is the most robust and desired level of measurement invariance, according to many authors. Apart from the conditions of configural and metric invariance, the intercepts of the scale items must be equivalent across respondent groups (Melipillán & Hu, 2020), and only then will scalar invariance be indicated. When this level of invariance is achieved, meaningful comparisons of constructs across groups of respondents are possible (Li et al., 2018;Wang et al., 2018).
• The final level of measurement invariance, known as strict invariance, is concerned with the equivalence of residual error between groups (Bialosiewicz et al., 2013). Unlike the previously discussed levels of measurement invariance, strict invariance has two sublevels. The first is the invariance of factor variances, which requires the variances of the factors to be equal across groups. The second is the invariance of the error terms of the indicator variables, which requires the unique errors of the indicator variables to be equal across groups. Testing for strict invariance thus determines, in essence, whether residual error is comparable across administrations (Bialosiewicz et al., 2013).
In measurement invariance studies, multi-group confirmatory factor analysis (MGCFA) is typically performed in phases. The first phase tests for configural invariance, the second for metric invariance, the third for scalar invariance, and the last for strict invariance. These phases are inextricably linked, and researchers frequently abandon testing when any of these steps exhibits noninvariance. While strict invariance is often included in standard testing syntax, most scholars agree that scalar invariance is sufficient for drawing meaningful conclusions about group comparability (Wang et al., 2016).
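The stepwise decision logic described above can be sketched as follows. This is a minimal illustration which assumes that fit statistics (here, CFI values) have already been obtained for each nested model; the ΔCFI ≤ .01 stopping rule is a commonly used criterion (often attributed to Cheung and Rensvold, 2002) and is an assumption of this sketch, not a procedure reported in the study.

```python
def invariance_level_reached(cfi_by_level, delta_cfi_cutoff=0.01):
    """Return the highest invariance level supported by the nested models.

    Levels are tested in order of increasingly restrictive constraints;
    testing stops as soon as adding constraints degrades CFI by more than
    the cutoff (i.e., noninvariance at that level).
    """
    levels = ["configural", "metric", "scalar", "strict"]
    reached = []
    previous_cfi = None
    for level in levels:
        cfi = cfi_by_level[level]
        if previous_cfi is not None and previous_cfi - cfi > delta_cfi_cutoff:
            break  # constraints worsened fit too much: noninvariance here
        reached.append(level)
        previous_cfi = cfi
    return reached[-1] if reached else None

# Hypothetical CFI values: metric constraints hold, scalar constraints fail.
print(invariance_level_reached(
    {"configural": 0.95, "metric": 0.945, "scalar": 0.92, "strict": 0.91}))
# → metric
```

In practice the CFI values would come from a sequence of MGCFA models with progressively constrained loadings, intercepts, and residuals, e.g. as produced by lavaan's multiple-group fitting.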
Confirmatory factor analysis indicators are used to determine an acceptable level of model fit and, ultimately, measurement invariance. Among the most commonly used goodness-of-fit indices are the comparative fit index (CFI), the Tucker-Lewis index (TLI), the root mean square error of approximation (RMSEA), and the standardised root mean square residual (SRMR), as well as the chi-square statistic (Bibi et al., 2020; Browne & Cudeck, 1993; Kim, 2017; Kong, 2017; Sun, 2005). For each of these indicators, Hu and Bentler (1999) propose the following cut-off criteria: CFI and TLI values of 0.9 or greater; RMSEA and SRMR values of less than 0.08; and a chi-square statistic that is not statistically significant. The last-mentioned criterion is seldom met. The lavaan package (R; R Core Team, 2020) can be used to test measurement invariance up to the level of strict invariance; both Svetina et al. (2020) and Steyn and De Bruin (2019) used lavaan to analyse their data and applied the abovementioned guidelines to interpret the findings of their respective studies. Once measurement invariance is established, group comparisons can be made to ascertain whether different categories of respondents interpret the multiple scale items measuring a particular construct similarly (Rhudy et al., 2020).
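To make the role of the chi-square statistic in these indices concrete, the standard formulas for RMSEA, CFI, and TLI can be computed directly from the model and baseline chi-square values. The numbers below are illustrative only, not results from the study.

```python
import math

def fit_indices(chi2_m, df_m, chi2_b, df_b, n):
    """Standard fit-index formulas.

    chi2_m / df_m : chi-square and df of the fitted model
    chi2_b / df_b : chi-square and df of the baseline (null) model
    n             : sample size
    """
    rmsea = math.sqrt(max(chi2_m - df_m, 0) / (df_m * (n - 1)))
    cfi = 1 - max(chi2_m - df_m, 0) / max(chi2_b - df_b, chi2_m - df_m, 1e-12)
    tli = ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1)
    return {"RMSEA": rmsea, "CFI": cfi, "TLI": tli}

# Illustrative values: such a model would meet Hu and Bentler's (1999)
# guidelines (CFI/TLI >= .90, RMSEA < .08).
fit = fit_indices(chi2_m=120.0, df_m=25, chi2_b=1900.0, df_b=45, n=2000)
print(fit)
```

Software such as lavaan reports these indices automatically; the sketch simply shows why a large chi-square relative to its degrees of freedom pushes RMSEA up and CFI/TLI down.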

Analyses of abridged versions of B5P personality models
Although using detailed and often longer multi-item rating scales for B5P-related investigations is ideal, due to the increased content validity and reliability, time limitations sometimes make this difficult (Rammstedt & John, 2007). Long questionnaires may tire participants, frustrate them, and elicit inattentive responding. In the end, the veracity of the instruments' results may be jeopardised (Soto & John, 2017). As a result, abridged instrument versions have evolved over time.
Although short questionnaires are convenient in psychological studies, reservations have been raised about them, particularly about the brief B5P scales. For instance, it is claimed that lengthier B5P tests are more reflective of broad personality constructs and have better measuring capacity than shorter ones (Langford, 2003). According to Gosling, Rentfrow, and Swann Jr. (2003), abridged versions of the B5P show poorer psychometric qualities than the regular multi-item measures. Similar instrument validity concerns were observed by Laajaj et al. (2019), whose research based on a 15-item instrument could not accurately assess the target personality traits and found that the instrument had low validity.
Several studies have found Rammstedt and John's (2007) 10-item BFI-10 scale to be a reliable measure of extraversion, agreeableness, openness, conscientiousness, and neuroticism (Balgiu, 2018; Guido et al., 2015). Other studies based on the WVS sixth wave (from 2010 to 2014), however, reveal the shortcomings of the BFI-10 scale's psychometric qualities when measuring the Big Five personality traits in cross-national scenarios (Ludeke & Larsen, 2017; Simha & Parboteeah, 2017). The conclusions regarding the challenges of the BFI-10 instrument were substantiated by findings from Chapman and Elliot's (2019) study based on General Social Survey data, which revealed odd results that did not replicate those of the original big five instruments. Given the foregoing, the B5P model's universal application becomes problematic. Furthermore, the 10-item measure's persistent use in cross-cultural research without a consensus on its generalisability seems contentious. As a result, the literature supports the necessity for additional research into the structural validity of the WVS's 10-item measure.

Method
In this section the design, sampling, research instrument used, procedure, statistical analyses and ethical concerns are discussed.

Design
This study is based on the analysis of cross-sectional data on the B5P model collected during the World Values Survey (sixth wave) in Germany, the Netherlands, Rwanda, and South Africa. The data was quantitative in nature and collected by means of Rammstedt and John's (2007) 10-item Big Five Personality Inventory (BFI-10). The purpose of the analysis was either to confirm or refute the structural validity of the BFI-10 instrument using data from selected WEIRD and non-WEIRD nations. We acknowledge that the study's concentration on a small number of nations and a single dataset limits the universal applicability of its findings.

Sampling
The WVS website provides a detailed description of how individuals were sampled across countries (Inglehart et al., 2014), with every effort being made to sample individuals randomly per country.
The data from four countries, namely Germany, the Netherlands, Rwanda, and South Africa, was used in the study. The selection of countries for analysis was by no means arbitrary. As the focus of the study was on the B5P construct, and as this measure was only used in the sixth wave, the countries were selected from those which completed the questionnaires during that period. We chose only two Western European countries to represent WEIRD since their WVS data was found by Ludeke and Larsen (2017) to reflect the Big Five Model's five-factor structure. The inclusion of Rwanda and South Africa in the non-WEIRD category was motivated by the desire to include contrasting comparisons. The demographics of Rwandan respondents were substantially skewed toward non-WEIRD, but those of South African respondents were mixed, allowing for a good comparison of countries with diverse characteristics.

Measurement instrument
The BFI-10, designed by Rammstedt and John (2007), was used in the WVS to collect data on the theorised B5P constructs, namely extraversion, agreeableness, openness, conscientiousness, and neuroticism. The BFI-10 items are presented as five affirming statements (e.g., "I see myself as someone who is outgoing, sociable") and five disaffirming statements (e.g., "I see myself as someone who is reserved"), two statements per personality dimension. Respondents were required to rate the statements on a Likert scale with five response categories ranging from 1 = strongly disagree to 5 = strongly agree. The table below summarises the questionnaire items and how they were classified in the survey.
It can be observed from Table 1 that two items, one positively worded and the other negatively worded, represent each of the five traits of the B5P model. The literature on the reliability and validity of the BFI-10 paints a mixed picture. Studies by Rammstedt and John (2007), Carciofo et al. (2016), Rammstedt and Krebs (2007), and Erdle and Rushton (2011) yielded respectable reliability coefficients for each of the five constructs in the five-factor model. Also, studies by Rammstedt and John (2007), Balgiu (2018), and Guido et al. (2015) yielded data which supported a five-factor model as theorised in the B5P model. Notwithstanding the preceding findings, later research by Ludeke and Larsen (2017) disconfirmed the positive reliability findings and the validity of the five-factor hypothesis.
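Because one of the two items per trait is negatively worded, scale scoring requires reverse-scoring those items before the item pair is combined. A minimal sketch on the 1-5 Likert scale described above (the item labels are illustrative, not the actual WVS variable names):

```python
def reverse_score(value, scale_max=5, scale_min=1):
    """Reverse-score a Likert response: 1 becomes 5, 2 becomes 4, etc."""
    return scale_max + scale_min - value

# "reserved" is the reverse-keyed extraversion item in this illustration.
responses = {"extraverted": 4, "reserved": 2}
extraversion = (responses["extraverted"]
                + reverse_score(responses["reserved"])) / 2
print(extraversion)  # mean of 4 and (6 - 2) = 4.0
```

This reverse keying is also why, in a well-behaved factor solution, the two items of a trait should load with opposite signs on the same factor, the "positive-negative alternation" examined in the EFA results below.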

Procedure
The primary aim of the study was to test the structural validity of the B5P-model across countries, some of which were WEIRD, and others non-WEIRD. As stated, data from the WVS was used, specifically the SPSS file available on the WVS website (https://www.worldvaluessurvey.org/wvs.jsp).
As the aim of the study was to assess the fit of the B5P model with data from four countries, exploratory factor analyses (EFA) were first performed, to visually assess whether the data fitted the B5P model in each country. First, following Kaiser's rule regarding eigenvalues, the natural fit of the data was established. Then, using the SPSS software (IBM Corp, 2020), the data was "forced" into a five-factor solution, again one country at a time. As a final preparation before the multi-group confirmatory factor analyses (MGCFA), confirmatory factor analyses (CFA) were performed per country. Given the results of these analyses, MGCFA was envisaged as a means to test for the levels of measurement invariance.
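Kaiser's rule, applied in the first step above, retains factors whose eigenvalues of the item correlation matrix exceed one. A minimal sketch with synthetic data standing in for the ten BFI-10 items (the single common factor generating the data is an assumption for illustration, not the study's data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 10  # respondents x items

# Synthetic responses driven by one common factor plus item-specific noise.
factor = rng.normal(size=(n, 1))
items = factor @ np.ones((1, k)) * 0.7 + rng.normal(size=(n, k))

# Kaiser's rule: eigenvalues of the item correlation matrix > 1.
corr = np.corrcoef(items, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted descending
retained = int((eigenvalues > 1).sum())
print(retained, eigenvalues[:3])
```

With a single strong common factor, one eigenvalue dominates and the rule retains one factor; the eigenvalues always sum to the number of items, which is why the retained factors' share of that sum is reported as "variance explained".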

Statistical analyses
It was planned that exploratory factor analyses (EFA), confirmatory factor analyses (CFA), and multigroup confirmatory factor analyses (MGCFA) would be used successively to assess the levels of measurement invariance. With EFA, the number of factors and their loading patterns (Bialosiewicz et al., 2013) were considered, while for CFA and MGCFA the guidelines below were used. Table 2 presents the criteria for evaluating the CFA results.

Ethical considerations
The use of the WVS data was open to all interested parties, subject to referencing the database in the reference list (see Inglehart et al., 2014). No data specific to this research was collected.

Results
Data for the sixth wave of the WVS was collected in Germany (2013), the Netherlands (2012), Rwanda (2012), and South Africa (2013). In all, 8724 responses were collected: 2010 from Germany, 1739 from the Netherlands, 1527 from Rwanda, and 3448 from South Africa. Four WVS items served as proxies for the WEIRD classification: V248 was used to create the "Educated" variable (the percentage of respondents who indicated that they had completed a university degree); V225 was used to create the "Industrialised" variable (the percentage who indicated that they use their personal computer frequently, rather than occasionally or seldom); V141 was used to create the "Rich" variable (the percentage who provided answers of 8, 9 and 10); and V237 was used to create the "Democratic" variable (the percentage who stated that they had sufficient savings).

Demographic variables and WEIRD classification
In the table below, demographic data are presented per country. Apart from the customary reported data on age and sex, the report also includes WEIRD data per country. This was done because frequent criticism is levelled against psychometric instruments developed from the perspective of WEIRD societies (Doğruyol et al., 2019;Laajaj et al., 2019).
From Table 3 it can be observed that men are marginally underrepresented, but this applies across all the groups. Noteworthy is the difference in average age between the European and the African samples, namely 15 years, which is about the same as the standard deviation within the groups. This equates to a large effect size and a practically significant difference in age.
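The effect-size claim above follows from the standard formula for Cohen's d: a mean difference of 15 years divided by a pooled standard deviation of roughly the same size gives d of about 1, a large effect by conventional guidelines. The means used below are hypothetical; only the 15-year gap and the comparable standard deviation come from the text.

```python
def cohens_d(mean1, mean2, pooled_sd):
    """Cohen's d: standardised difference between two group means."""
    return abs(mean1 - mean2) / pooled_sd

# Hypothetical group means 15 years apart, pooled SD of 15 years.
d = cohens_d(mean1=50, mean2=35, pooled_sd=15)
print(d)  # 1.0, a large effect by Cohen's guidelines (d >= 0.8)
```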
The data captured in Table 3 affirms that the European countries can be classified as WEIRD and the African countries as non-WEIRD. Stated differently, the European countries scored higher on all the WVS's proxies of WEIRD than the African countries did.

Statistics testing for measurement invariance
First, the data from Germany was analysed using EFA. The Kaiser-Meyer-Olkin (KMO) test of sampling adequacy was .617 for the German sample, and the Bartlett's Test of Sphericity (BTS) approximated chi-square was 1939.332 (df = 45), which was statistically significant (p < .001) (N = 2046). Kaiser's criterion of retaining factors with eigenvalues greater than one was used first, and resulted in the retention of four factors, accounting for 59.414 per cent of the variance. When the data was "forced" into a five-factor solution, the declared variance was 68.875 per cent. These findings are summarised in Table 4. What can be observed from Table 4 is that the a priori model yielded the same number of factors and patterns of loadings as the theorised model. The loadings also fitted the positive-negative alternation, as was expected given the reverse wording of every other item. However, for component five (A(R)), the loading weight for that scale item was lower than what was theoretically expected. The model based on eigenvalues exposed four factors, where the first factor was more complex, but with the remaining three following the B5P conceptualisation.

The data for the Netherlands was analysed next. The KMO measure of sampling adequacy was .526 for the Netherlands sample, and the BTS approximated chi-square was 1510.445 (df = 45), which was statistically significant (p < .001) (N = 1902). When Kaiser's criterion of retaining factors with eigenvalues greater than one was applied, five factors were retained, accounting for 66.859 per cent of the variance. These findings are summarised in Table 5. Table 5 shows that, as was the case with the German sample, the a priori model had five factors with patterns of loadings reflecting those theorised in the B5P model. However, for component five (A(R)), the loading weight of the scale item was lower than expected. This is the same item for which the loading was not satisfactory in the German sample. Worse in the case of the Netherlands was that the loading of this item was not signed contrary to that of the other marker variable, as the reverse wording would require. When the model was based on eigenvalues, five factors were found, which implies that the "forced" and the eigenvalue models were identical, both providing support for the B5P conceptualisation.
The data for Rwanda was analysed next. The KMO measure of sampling adequacy was .611 for the Rwanda sample, and the BTS approximated chi-square was 1693.435 (df = 45), which was statistically significant (p < .001) (N = 1527). When Kaiser's criterion of retaining factors with eigenvalues greater than one was applied, four factors were retained, accounting for 61.395 per cent of the variance. The total variance explained was 82.646 per cent when the data was "forced" into the five-factor solution. Table 6 summarises these findings.
What can be observed is that the number of factors and patterns of loadings on the factor components based on the eigenvalues greater than one do not match those theorised in the B5P model. The indicator items loaded haphazardly across the factors such that no definite personality construct in line with the B5P model could be identified. Also, when the data was "forced" into the five-factor solution, the patterns so pronounced in the WEIRD-countries were absent.
Lastly, data for South Africa was analysed. The KMO measure of sampling adequacy for the South African sample was .872, and the BTS approximate chi-square was 1057.903 (df = 45), which was statistically significant (p < .001) (N = 3531). When Kaiser's criterion of retaining factors with eigenvalues greater than one was applied, two factors were retained, accounting for 55.644 per cent of the variance. When the data was "forced" into the five-factor solution, the declared variance was 76.606 per cent. These findings are shown in Table 7.
Only two components were derived based on the strategy of selecting eigenvalues greater than one. As shown in Table 7, the patterns of the item loadings do not reflect any particular personality construct as proposed in the five-factor model. Also, when considering the forced five-factor solution, none of the patterns typical of B5P, as observed in the samples from Germany and the Netherlands, were observed.
The results presented in Tables 4 to 7 provide compelling evidence that the BFI-10 functions at different levels of effectiveness: in the WEIRD countries it broadly follows the B5P conceptualisation, whereas in the non-WEIRD countries it does not.
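The retention and sphericity statistics reported above follow standard formulas. As a minimal illustration (not the authors' code, and using a made-up toy correlation matrix rather than the WVS data), Bartlett's test statistic and Kaiser's eigenvalue-greater-than-one rule can be computed with numpy:

```python
import numpy as np

def bartlett_sphericity(corr, n):
    """Bartlett's test of sphericity (BTS): approximate chi-square and
    degrees of freedom for H0 that the correlation matrix is an identity
    matrix, i.e. that the items share no common variance."""
    p = corr.shape[0]
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(corr))
    df = p * (p - 1) // 2
    return chi2, df

def kaiser_retained(corr):
    """Number of factors retained under Kaiser's criterion:
    eigenvalues of the correlation matrix greater than one."""
    return int(np.sum(np.linalg.eigvalsh(corr) > 1.0))

# Toy example: 10 items forming 5 uncorrelated pairs (r = .5 within each
# pair), mimicking the intended two-items-per-factor BFI-10 structure.
R = np.kron(np.eye(5), np.array([[1.0, 0.5], [0.5, 1.0]]))
chi2, df = bartlett_sphericity(R, n=1902)   # df = 45, as in the text
print(kaiser_retained(R))                    # prints 5: five factors retained
```

With this idealised structure, Kaiser's criterion recovers exactly five factors; the empirical results above show how far the Rwandan and South African data depart from that ideal.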
The aforementioned conclusion is based on exploratory factor analysis (EFA), and its interpretation involved some subjectivity. To test these tentative conclusions more rigorously, confirmatory factor analysis (CFA) was used to evaluate structural validity, again per country. Five factors were postulated, with two items loading on each factor, as explained in Table 1. Hu and Bentler's (1999) cut-off criteria for the fit indices, as presented in Table 3, were applied. The test results for each of the four countries are presented in Table 8.
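The fit indices relied on here (CFI and RMSEA) are simple functions of the model and baseline chi-square statistics. The sketch below uses their standard formulas; the chi-square values are invented for illustration and are not those reported in Table 8:

```python
import math

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index of a model relative to the baseline
    (independence) model, bounded between 0 and 1."""
    num = max(chi2_m - df_m, 0.0)
    den = max(chi2_b - df_b, chi2_m - df_m, 0.0)
    return 1.0 if den == 0 else 1.0 - num / den

def rmsea(chi2_m, df_m, n):
    """Root mean square error of approximation for a model
    fitted to a sample of n cases."""
    return math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))

# Illustrative (made-up) chi-square values:
c = cfi(chi2_m=900.0, df_m=25, chi2_b=2000.0, df_b=45)
r = rmsea(chi2_m=900.0, df_m=25, n=1902)
poor_fit = (c < 0.95) or (r > 0.06)   # fails the joint cut-off criteria
```

The cut-offs used in the comparison (CFI at least .95 and RMSEA at most .06) are the values commonly attributed to Hu and Bentler (1999); Table 3 should be consulted for the exact criteria applied in the study.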
According to the cut-off criteria, the findings revealed a poor fit across all countries, including those with comparatively better patterns of loadings in the EFA. These results were surprising, to say the least, and are considered in the Discussion below.

Discussion
The overall aim of the study was to establish the structural validity and measurement invariance of the BFI-10 instrument in WEIRD and non-WEIRD countries. The novelty of the study lies in its attempt to compare and contrast the psychometric properties of the BFI-10 instrument in culturally different environments.
The literature review focused on the B5P conceptualisation, as well as the concept of measurement invariance and the need to test for it. It was revealed that although the B5P conceptualisation is well accepted, it is not without critique, particularly in terms of its use as a universal theory of personality. With regard to measurement invariance, the concept, as well as how it should be assessed, was considered.
The data revealed that the countries included in the study were clearly differentiable based on the WEIRD concept, with the two Western-European countries being classified as WEIRD, and the two from sub-Saharan Africa as non-WEIRD. Another characteristic which differentiated the countries was mean age, which was substantially higher in the WEIRD countries. It may well be asked whether the acronym WEIRDO, in which O stands for old, might perhaps be applicable. The implications of including older respondents from the WEIRD group in the analyses of personality can only be speculated on. Lang et al. (2011) hypothesised that the mental strain associated with personality studies could preclude elderly respondents from providing valid self-report responses, thereby decreasing the likelihood of deriving a compact five-factor model. They did note, however, that more educated elderly people are likely to cope better with mental strain than less educated elderly people, and thus provide more consistent item responses during surveys. Given that the B5P conceptualisation is based largely on trait theory, the fact that data was extracted from a more mature sample may be irrelevant.
The results of the EFA suggest a partially valid model in the WEIRD contexts and an invalid one in the non-WEIRD contexts (see Tables 4 to 7). The results reveal that, at a configural level of measurement invariance, the WEIRD countries met the criteria to a large degree, but that the non-WEIRD data did not support the proposed theoretical structure at all. These results underscore the psychometric difficulties relating to the comparison of personality score levels prevalent in cross-cultural studies. This corroborates Hofstede and McCrae's (2004) affirmation that the perception of personality dimensions is not divorced from cultural context. Hence, researchers should not assume that personality instruments are equally valid in settings other than those in which they were developed.
Even though the outcome of the EFA for the WEIRD countries suggested a factor structure almost equivalent to the theorised model, it still fell short of the ideal. When more comprehensive statistics were used to test for configural fit, the model fit indicators (CFI and RMSEA) reflected a poor model fit for all four countries. These results were surprising, as the EFA results were quite satisfactory for Germany and the Netherlands. Given the CFA results, it must be concluded that the BFI-10 failed at the most basic level of measurement invariance, that is, configural invariance.
All further tests of measurement invariance were abandoned, as testing for measurement invariance is a sequential process in which higher levels of invariance are tested only once lower levels have been established (Berry et al., 2011). Thus, no tests of metric, scalar, or strict invariance were performed.
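The sequential logic described above can be sketched as follows. The chi-square difference test shown is the usual comparison of nested invariance models; the numbers in the example are hypothetical, and in the present study the sequence stops immediately because configural fit was unacceptable:

```python
from scipy.stats import chi2 as chi2_dist

# The conventional invariance hierarchy, tested in this fixed order.
LEVELS = ["configural", "metric", "scalar", "strict"]

def chi2_difference(chi2_restricted, df_restricted, chi2_free, df_free):
    """Chi-square difference test between two nested models; a significant
    result means the added equality constraints worsen model fit."""
    d_chi2 = chi2_restricted - chi2_free
    d_df = df_restricted - df_free
    return d_chi2, d_df, chi2_dist.sf(d_chi2, d_df)

def levels_established(fit_ok):
    """Walk the invariance hierarchy in order, stopping at the first
    level whose model fit is unacceptable."""
    established = []
    for level in LEVELS:
        if not fit_ok.get(level, False):
            break
        established.append(level)
    return established

# As in the study: configural fit was poor, so no higher level is tested.
print(levels_established({"configural": False}))   # prints []
```

The point of the sequence is that each step adds equality constraints to the previous model, so testing, say, scalar invariance is meaningless if metric (or here, configural) invariance has already failed.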
Other studies, notably that of Ludeke and Larsen (2017), have since shown the measurement limitations of the BFI-10 instrument; Ludeke and Larsen (2017) nevertheless found it to be a reliable and valid tool in Germany and the Netherlands (the WEIRD countries). Previous studies within the WVS domain (Balgiu, 2018; Rammstedt & John, 2007) found satisfactory levels of structural validity and measurement invariance. Several studies outside the WVS domain have found the BFI-10 to be a reliable, valid, convenient and useful tool for gauging self-reported personality traits (Erdle & Rushton, 2011; Guido et al., 2015; Rammstedt & Krebs, 2007).
These results affirm the problematic nature of the BFI-10 tool, and this may be the reason why its use in the WVS has since been discontinued: it does not appear in the WVS seventh wave questionnaire. Notwithstanding the limitations identified in the WVS sixth wave data, the BFI-10 remains a useful instrument for researchers, perhaps particularly so in WEIRD environments. It is therefore recommended that the psychometric properties of even well-established personality rating tools be examined, particularly when they are used in environments foreign to those in which they were developed. There is also evidence that disparities in schooling affect structural validity even within a country (Lang et al., 2001; Rammstedt et al., 2010). As a result, the findings may be attributable less to the WEIRD vs. non-WEIRD distinction and more to educational differences. Future studies should probe further the possible influence of variations in respondents' levels of education and other demographic variables on the structural validity of the BFI-10 instrument.

Conclusion
The article discusses the importance of assessing the psychometric properties of well-established personality rating instruments when they are used across diverse cultural groups. Preliminary results (using EFA) indicate that the BFI-10 instrument has configural structural validity and measurement invariance in WEIRD countries, but not in non-WEIRD countries. Further investigation (using CFA) revealed that the BFI-10 was not structurally valid even in the WEIRD countries. If nothing else, this indicates the importance of using multiple statistical techniques to gain information on a specific question. The results suggest that practitioners and researchers should adopt a cautious approach when applying ostensibly globally accepted tools in contexts for which they were not designed, and that, particularly as research findings concerning the BFI-10 are so contradictory, further research on this instrument be conducted.
Lastly, it seems that the WVS data supported the WEIRD categorisation of countries, even though the results did not fit the theorised factor structure perfectly. It was noted that