Abbreviated Three-Item Versions of the Satisfaction with Life Scale and the Harmony in Life Scale Yield as Strong Psychometric Properties as the Original Scales

The cognitive components of subjective well-being can be measured with the Satisfaction with life scale (SWLS) and the Harmony in life scale (HILS), each of which comprises five items. The aim of this article is to abbreviate these scales and examine their psychometric properties and validity. Three datasets including test-retest data are used (N = 787; N = 860; N = 343). The first two datasets had already been collected, whereas the third dataset was collected by delivering the three-item scales (SWLS-3; HILS-3) together (in random order) with one shared instruction. The last study was pre-registered, including open data and code. The SWLS-3 and the HILS-3 demonstrate good psychometric properties, including very high internal consistency and item total correlations and strong test-retest reliability, and two-factor models of cognitive well-being tend to yield very good fit indices. Further, the scales demonstrate measurement invariance across time and gender. In fact, the three-item scales demonstrate psychometric properties as strong as those of the five-item scales. Additionally, the scales demonstrate similar validity by yielding similar correlations to assessments of well-being, mental health problems and social desirability. Thus, the SWLS-3 and the HILS-3 can efficiently be used together with one shared instruction, without compromising (and in most aspects even slightly improving) the psychometric soundness of the scales.
ARTICLE HISTORY: Received 10 September 2019; Accepted 26 January 2020


Introduction
The subjective well-being (SWB) approach assesses well-being as a cognitive component and an affective component (Diener, 1984). The cognitive component focuses on life evaluations: how individuals think about their lives. To assess the cognitive component, the Satisfaction with life scale (SWLS; Diener, Emmons, Larsen, & Griffin, 1985) is most often used, whilst recent research shows that it can meaningfully be complemented with the Harmony in life scale (HILS; Kjell, Daukantaitė, Hefferon, & Sikström, 2016). What the two scales have in common is that they do not impose many criteria or aspects that respondents are forced to evaluate: They allow subjective evaluations, where respondents decide for themselves what they consider meaningful and important in relation to satisfaction with life (SWL) and harmony in life (HIL).
The main aim of this article is to abbreviate these scales, whilst not compromising their psychometric soundness. It is not argued that the original SWLS and HILS are poor, but rather that they can be delivered more efficiently in abbreviated versions, without compromising the psychometric properties. It is valuable to provide evidence supporting the use of abbreviated versions of the SWLS and the HILS, because answering long scales with many items may put unnecessary demands on participants and may result in poor data quality in long surveys, whilst efficiently and accurately assessing both SWL and HIL results in a more comprehensive and detailed understanding of well-being (e.g., see Kjell, 2011, 2018). The examination of the psychometric properties in this article covers internal consistency, item total correlation, test-retest reliability, factor structure using confirmatory factor analyses, and measurement invariance across time and gender in three different datasets. In addition, validity is investigated by examining the scales' correlations with other measures of well-being, mental health problems and social desirability.

Advantages of shorter scales
Short scales have been found advantageous in several contexts. Sandy, Gosling, Schwartz, and Koelkebeck (2017) describe several such contexts, including large online studies, where respondents might not have the patience for long questionnaires; longitudinal designs, where respondents are tracked on numerous occasions over a long time; and prescreenings, where the aim is to quickly identify a number of traits or states before allowing entry to a full study. Further, "the demand for short scales is currently expanding at an accelerating speed. One reason for the increasing need for short scales could be a changing way to approach psychological research in general. With research questions becoming more and more complex, involving more and more constructs … " (Ziegler, Kemper, & Kruyen, 2014, p. 185). Examples of short measures with satisfactory psychometric properties include the 5- and 10-item scales for the Big-Five personality domains (Gosling, Rentfrow, & Swann, 2003), the Ten Item Values Inventory (Sandy et al., 2017) and the Single-Item Self-Esteem Scale (Robins, Hendin, & Trzesniewski, 2001).

Satisfaction with life and Harmony in life
Research demonstrates that SWL and HIL complement each other in providing a comprehensive understanding of subjective well-being (e.g., see Kjell, 2011, 2018). SWL involves "a global assessment of a person's quality of life according to his [or her] chosen criteria" (Shin & Johnson, 1978, as cited in Diener et al., 1985). In contrast, HIL "is by its very nature relational. It is through mutual support and mutual dependence that things flourish" (Li, 2008, p. 427). That is, "harmony encourages a holistic world view that incorporates a balanced and flexible approach to personal well-being that takes into account social and environmental contexts" (Kjell et al., 2016, p. 894). In accordance with these definitions, individuals describe their SWL with words such as happy, content, fulfilled, pleased and gratified; and their HIL with words such as peaceful, balanced, calm, unity and agreement (Kjell, Kjell, Garcia, & Sikström, 2018). Further, in a large cross-cultural investigation where individuals were allowed to freely describe what happiness is for them, the responses concerned both harmony and psychological balance (25% of the responses) as well as satisfaction (7% of the responses; Delle Fave, Brdar, Freire, Vella-Brodrick, & Wissing, 2011; see also similar results in Delle Fave et al., 2016). Hence, together SWL and HIL capture central and complementary aspects of well-being.

Items of the SWLS-3 and the HILS-3
The original SWLS and HILS comprise five items each (SWLS-5; HILS-5), and here it is suggested that the first three items of each scale (SWLS-3; HILS-3) are most apt to form abbreviated versions. From a psychometric perspective, the first three items of the SWLS yielded the strongest factor loadings and item-total correlations (Diener et al., 1985), and research has identified the last item as showing less convergence with the other items (Pavot & Diener, 2009; see also Vittersø, Biswas-Diener, & Diener, 2005). Similarly, the first three items of the HILS also yielded the strongest item-total correlations (Kjell et al., 2016). Further, in a two-factor solution of the SWLS-5 and the HILS-5, the first three items of the scales yielded the strongest factor loadings at two separate measurement occasions (with the same participants).
Selecting the first three items of each scale also makes most sense from a theoretical perspective, as they arguably tap most directly into the targeted constructs. The first three items in the SWLS-5 concern being satisfied, having an ideal life or excellent conditions; whereas the last two items tap into having gotten important things (as in So far I have gotten the important things I want in life) and evaluating one's past (as in If I could live my life over, I would change almost nothing). In terms of the HILS-5, the first three items focus on the most central aspects of HIL, as they include the words harmony or balance, as opposed to the last two items, which focus on acceptance (as in I accept the various conditions of my life) and fitting in (as in I fit in well with my surroundings).

Psychometric properties
Previous research indicates that the five-item scales of SWL and HIL demonstrate good psychometric properties in terms of internal consistency, item-total correlations and test-retest reliability (e.g., see Diener et al., 1985; Kjell et al., 2016). Confirmatory factor analyses have further demonstrated that the SWLS-5 and the HILS-5 form a two-factor model with good fit (Kjell et al., 2016). To our knowledge, however, measurement invariance has not been examined for the HILS-5. For the SWLS-5, research has found that factor loadings, unique variances and factor variance are invariant across sexes (Shevlin, Brunsden, & Miles, 1998; for a review see Emerson, Guhn, & Gadermann, 2017) and time (using a Spanish version in an adolescent sample; Esnaola, Benito, Antonio-Agirre, Axpe, & Lorenzo, 2019). This article focuses on examining the psychometric properties of the three-item scales in regard to internal consistency, item total correlation, test-retest reliability, factor structure using confirmatory factor analyses, and measurement invariance across gender and time. Further, as a comparison, the article also presents these psychometric properties for the five-item scales, based on two of the three datasets (i.e., the last dataset only comprises the three-item scales).

Hypotheses
The following hypotheses were pre-registered after the first two datasets (already collected; Kjell et al., 2016) had been analyzed, but before the third dataset was collected:
H1. The SWLS-3 and the HILS-3 yield good internal consistency and strong or very strong item total correlations.
H2. The SWLS-3 and the HILS-3 yield a good fit in a two-factor solution using confirmatory factor analyses.
H3. The SWLS-3 and the HILS-3 yield strong longitudinal measurement invariance (strong measurement invariance is further described in the Statistical methods section).
H4. The SWLS-3 and the HILS-3 yield strong or very strong test-retest correlations at a two-week follow-up.
In addition to these hypotheses, the validity of the three-item scales is investigated by examining their correlations with constructs relating to well-being (i.e., happiness and psychological well-being), mental health problems (i.e., depression, anxiety and stress) and social desirability, where the focus is to compare these correlations with those of the five-item scales. We did not pre-register specific hypotheses for these analyses, but generally anticipated the correlations of the three- and five-item scales to be similar.

Participants
Participants in Dataset 1 took part in the second study of Kjell et al. (2016), which included a range of other well-being related instruments. Participants in Dataset 2 took part in the seventh study of Kjell et al. (2018), which also included several other well-being related instruments. Dataset 3 was collected specifically for the purpose of this article; the material is described under Instruments below.
Dataset 1 was collected using Mechanical Turk (MTurk) and included a test-retest procedure (M = 57.2, SD = 5.6 days between Time 1 [T1] and Time 2 [T2]). At T1, 787 participants completed the survey and control questions correctly (360 females and 427 males, with a mean age of 30.8 [SD = 9.8] years; 141 failed the control questions); at T2, 535 participants completed the survey and the control questions correctly (252 females and 283 males, with a mean age of 31.2 [SD = 9.8] years; 60 failed the control questions). Most participants came from India, followed by the USA and other countries; for more detailed information see Kjell et al. (2016).
Dataset 2 was also collected on MTurk and included a test-retest procedure (M = 30.8, SD = 2.0 days between T1 and T2). At T1, 860¹ participants completed the survey and control questions correctly (439 females and 421 males, with a mean age of 32.8 [SD = 10.1] years; 42 failed the control questions); at T2, there were 477 participants (261 females and 216 males, with a mean age of 34.1 [SD = 10.4] years; 42 failed the control questions). More than 90% of the participants reported coming from the USA, followed by other countries; see Kjell et al. (2018) for more details.
Dataset 3 was collected for this article. Participants were recruited from Prolific, using the following pre-screeners: fluency in English, UK nationality and a minimum age of 18 years. Participants were paid £0.3 to partake at T1, and the study took on average 1.02 (SD = 1.5) minutes to complete. Three hundred and fifty participants completed the study, but 7 answered the control question incorrectly and were removed from the analyses. The final sample comprises 343 participants (236 females, 106 males, and 1 other, with a mean age of 34.4 [SD = 11.9] years).
After two weeks, the 343 participants were invited to partake again for £0.3. As pre-registered, those who had not answered were sent a reminder two days later; the survey was closed one week after the first invitation for T2. Three hundred participants answered, but one was removed for not answering the control item correctly. The final T2 sample comprised 299 participants (87.2% of the T1 sample; 214 females, 84 males, and 1 other, with a mean age of 35.0 [SD = 12.1] years). The study took on average 1.03 (SD = 1.60) minutes to complete, and there were on average 14.8 (SD = 1.40) days between T1 and T2.

Instruments
For Datasets 1 and 2 we only present the instruments that are employed in the analyses of this article, whereas for Dataset 3 we describe all measures that were included in the data collection.

Dataset 1
The Satisfaction with Life Scale (SWLS; Diener et al., 1985) assesses life satisfaction with five items (e.g., In most ways my life is close to my ideal) answered using a 7-point rating scale ranging from 1 = Strongly Disagree to 7 = Strongly Agree. See the Results section for psychometric information.
The Harmony in Life Scale (HILS; Kjell et al., 2016) measures harmony in life with five items (e.g., My lifestyle allows me to be in harmony). The closed-ended items are answered on the same scale as the SWLS and the psychometric properties are presented in the Results section.
The Subjective Happiness Scale (SHS; Lyubomirsky & Lepper, 1999) measures happiness as a cognitive construct; i.e., how the respondent thinks about their life in terms of happiness. The measure comprises four items answered on closed-ended Likert-type scales that range from 1 to 7, with different anchors that are specific to each item (e.g., the item In general, I consider myself is coupled with the following scale: 1 = Not a very happy person to 7 = A very happy person). McDonald's omega was .87 and Cronbach's alpha was .82.
The abbreviated version of the Scales for Psychological Well-Being (SPWB; Ryff, 1989; Ryff & Keyes, 1995) comprises 18 items, which cover six subscales/dimensions (McDonald's omega/Cronbach's alpha are presented after each example item): Autonomy (e.g., I judge myself by what I think is important, not by the values of what others think is important; .48/.42), Environmental mastery (e.g., In general, I feel I am in charge of the situation in which I live; .67/.61), Personal growth (e.g., For me, life has been a continuous process of learning, changing, and growth; .53/.40), Positive relations with others (e.g., People would describe me as a giving person, willing to share my time with others; .61/.58), Purpose in life (e.g., Some people wander aimlessly through life, but I am not one of them; .52/.18), and Self-acceptance (e.g., I like most aspects of my personality; .71/.69). There are three items per dimension/subscale, and items are answered on a Likert-type scale that ranges from 1 = Strongly disagree to 6 = Strongly agree.
The shorter version (Form A; Reynolds, 1982) of the Marlowe-Crowne Social Desirability Scale (Crowne & Marlowe, 1960) comprises 11 items that capture social desirability (e.g., I am always courteous, even to people who are disagreeable). Respondents are required to answer whether the statement is personally True or False for them. McDonald's omega was .70 and Cronbach's alpha was .65.
¹ Kjell et al. (2018) report 854 participants because 6 participants had not written any words in other questions and were thus removed; these participants are used here since they completed the SWLS and the HILS.

Dataset 2
The HILS and the SWLS were administered as previously described for Dataset 1.
The Patient Health Questionnaire-9 (PHQ-9; Kroenke & Spitzer, 2002) measures Depression with nine items (e.g., Feeling down, depressed or hopeless), coupled with rating scales ranging from 0 = Not at all to 3 = Nearly every day. Both McDonald's omega and Cronbach's alpha were .93.
The Generalized Anxiety Disorder Scale-7 (Spitzer, Kroenke, Williams, & Löwe, 2006) assesses Anxiety with seven items (e.g., Worrying too much about different things) answered on the same rating scale as the PHQ-9. Both McDonald's omega and Cronbach's alpha were .94.
The shorter version (Form A; Reynolds, 1982) of the Marlowe-Crowne Social Desirability Scale (Crowne & Marlowe, 1960), as previously described, was also included in this dataset; both McDonald's omega and Cronbach's alpha were .73.

Dataset 3
The Abbreviated Version of the Satisfaction with Life Scale (SWLS-3) comprises the three first items from the original SWLS developed by Diener et al. (1985). The items (e.g., I am satisfied with my life) are answered on the same scale as the SWLS. Internal consistency statistics for the scale in the current study are presented in the Results section.
The Abbreviated Version of the Harmony in Life Scale (HILS-3) includes the three first items from the full version of the HILS as developed by Kjell et al. (2016). The items (e.g., I am in harmony) are answered using the same rating scale format as described for the SWLS. For internal consistency statistics see the Results section.
The Control Question comprised the following attention check: Please answer the alternative '4 neither agree nor disagree' below. Participants who failed to answer it correctly were removed from the analyses, as pre-registered. This kind of attention check has been shown to increase the statistical power and quality of data sets (Oppenheimer, Meyvis, & Davidenko, 2009).

Procedure
All three datasets were collected online. Participants were first informed about the study, that participation was voluntary and anonymous, and that they could withdraw at any time without giving a reason. For more detailed procedural information about the collection of Datasets 1 and 2 see Kjell et al. (2016, 2018), respectively. In the collection of Dataset 3, participants were also informed that they would be asked to partake again in two weeks' time, and that their data would be open upon publication of the article. Participants were then asked to enter their Prolific ID and to fill out the SWLS-3 and the HILS-3 in randomized order (i.e., the scales were presented together, on the same webpage, only showing the instructions once). Lastly, participants answered the demographic questions and were debriefed. After two weeks, participants were contacted again and asked to complete the same survey as specified above; after two days, those participants who had not answered the survey were reminded, and after a week the survey was closed.

Statistical methods
The data were analyzed using frequentist (Neyman-Pearson) statistics, Confirmatory Factor Analyses (CFA), and measurement invariance analyses. The following criteria were used for the first two datasets and pre-registered for the third dataset. The alpha level was set to .05. Cronbach's alpha and McDonald's omega above .70 were considered to indicate good internal consistency. Pearson correlations of .20-.39 were interpreted as weak, .40-.59 as moderate, .60-.79 as strong, and above .80 as very strong.
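As an illustration of the criteria above, the following minimal Python sketch computes Cronbach's alpha from a respondents-by-items matrix and maps a Pearson correlation onto the verbal labels used in this article. It is not the code used for the reported analyses: the function names, the use of NumPy, the toy responses, and the handling of values that fall exactly on or between the stated ranges (e.g., treating .80 as very strong) are assumptions made for the example.

```python
import numpy as np


def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                                # number of items
    item_variances = items.var(axis=0, ddof=1)        # sample variance per item
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of the sum score
    return k / (k - 1) * (1.0 - item_variances.sum() / total_variance)


def label_correlation(r):
    """Verbal label for a Pearson r, following the ranges stated in the text.

    Assumption: cut points are applied at .20, .40, .60 and .80; values
    below .20 are not labeled in the text and are reported as such.
    """
    size = abs(r)
    if size >= 0.80:
        return "very strong"
    if size >= 0.60:
        return "strong"
    if size >= 0.40:
        return "moderate"
    if size >= 0.20:
        return "weak"
    return "not labeled"


# Hypothetical responses of five people to a three-item scale (1-7 ratings).
toy_responses = [
    [5, 6, 5],
    [2, 3, 2],
    [7, 7, 6],
    [4, 4, 5],
    [3, 2, 3],
]
alpha = cronbach_alpha(toy_responses)
print(f"alpha = {alpha:.2f}, good internal consistency: {alpha > 0.70}")
print(label_correlation(0.74))
```

Computing alpha from the item variances and the sum-score variance keeps the sketch independent of any particular psychometrics package; McDonald's omega requires a fitted factor model and is therefore not sketched here.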
To identify the CFA models, the factor loading of the first item of each latent variable was set to 1 (which is the default in lavaan; Rosseel, 2012). The robust maximum likelihood estimator (MLM) was used because individual items were not normally distributed in the datasets (and was thus pre-registered for Dataset 3). P-values of the model chi-square test above .05 are considered to indicate good fit; however, since p-values are biased by sample size, the following criteria were also used to indicate good fit: a comparative fit index (CFI) above .95 and a root mean square error of approximation (RMSEA) below .05 (and below .08 for acceptable fit; Schreiber, Nora, Stage, Barlow, & King, 2006).
To examine whether the two three-item scales perform similarly across time and gender, measurement invariance analyses were carried out. The following five models, using increasingly restrictive parameter specifications across time or gender, were fitted: Model 1 (baseline, configural) constrains the factor structure to be invariant across time or gender, with no equality constraints on the parameters. Model 2 (metric, referred to as weak invariance) constrains the factor loadings so that they are invariant across time/gender. Model 3 (scalar, strong invariance) adds constraints on the intercepts of the items so that they are invariant across time/gender. Model 4 (strict invariance) further adds constraints on the residual variances so that they are invariant across time/gender. Model 5 further constrains the means of the factors so that they are invariant across time/gender.
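The decision rules described above can be sketched as follows. This is a hedged illustration rather than the analysis code of this article (the models themselves were fitted in lavaan): the function names and the CFI/RMSEA values are hypothetical, the ΔCFI cutoff of .01 is the one referred to in the Results section, and the Δχ² tests and Model 5 (factor means) are deliberately left out.

```python
def classify_fit(cfi, rmsea):
    """Label CFI and RMSEA separately, using the cutoffs stated in the text."""
    cfi_label = "good" if cfi > 0.95 else "not good"
    if rmsea < 0.05:
        rmsea_label = "good"
    elif rmsea < 0.08:
        rmsea_label = "acceptable"
    else:
        rmsea_label = "unacceptable"
    return cfi_label, rmsea_label


def highest_invariance_level(cfi_by_model, delta_cfi_cutoff=0.01):
    """Most restrictive invariance level supported by the CFI differences.

    cfi_by_model holds the CFIs of Models 1-4 in order (configural, metric,
    scalar, strict). A level is accepted when the CFI drop relative to the
    previous, less restrictive model is smaller than the cutoff.
    Assumption: the configural model itself fits adequately.
    """
    levels = ["configural", "metric (weak)", "scalar (strong)", "strict"]
    reached = levels[0]
    for previous, current, level in zip(cfi_by_model, cfi_by_model[1:], levels[1:]):
        if previous - current >= delta_cfi_cutoff:
            break  # fit deteriorated too much; stop at the previous level
        reached = level
    return reached


# Hypothetical fit indices, for illustration only (not values from the tables).
print(classify_fit(cfi=0.97, rmsea=0.06))
print(highest_invariance_level([0.990, 0.989, 0.987, 0.985]))
```

Keeping the CFI and RMSEA labels separate mirrors how the results are reported, where a model can show good fit on one index and only acceptable fit on the other.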

Dataset 2
The intercorrelations between the SWLS-3 and the SWLS-5 as well as between the HILS-3 and the HILS-5 are also very strong in Dataset 2 (r = .97 and r = .98, respectively). Further, the correlation between the three-item scales is similar to the correlation between the five-item scales (i.e., r = .85 and r = .84, respectively).

Dataset 3
When the SWLS-3 and the HILS-3 are delivered on the same page, the correlation between them (r = .74) falls between the correlations of Datasets 1 and 2, where the items were presented on different pages (cf. r = .73 in Dataset 1, and r = .85 in Dataset 2).

Good fit for a two-factor, rather than a one-factor, model
CFA were used to examine whether the six items of the SWLS-3 and the HILS-3 were best captured in a two-factor as compared with a one-factor model. A two-factor model yielded a considerably better fit than a one-factor model across the datasets; and the fit tended to be better for the three-item scales as opposed to the five-item scales (see Table 3 for fit indices and Figure 1 for factor loadings).

Dataset 1
For the three-item scales, a one-factor model yields a poor fit on all fit criteria, whereas a two-factor model of cognitive SWB including SWL and HIL yields a better fit, with a good CFI but an RMSEA above the acceptable cutoff. This can be compared with the five-item scales, where the two-factor model shows acceptable (see RMSEA) to good (see CFI) fit.

Dataset 2
A two-factor model yields a good fit for the three-item scales, where the p-value is just above .05; and the fit indices indicate good fit. This fit is better than the one-factor model; and it is also worth noting that it is a better fit than for the five-item scales where the fit is only acceptable to good, and the p-value is below the .05 threshold.

Dataset 3
Presenting all items together with one shared instruction did not disturb the two-factor fit in Dataset 3, where a two-factor model yields a good fit for the three-item scales, which is better than for the one-factor model. It is notable that the p-value is above .05, the CFI indicates good fit and the RMSEA acceptable fit.

Longitudinal measurement invariance
To assess psychometric equivalence of the SWLS-3 and the HILS-3 across time, analyses of longitudinal measurement invariance were carried out. Overall, both scales demonstrated strict invariance across time (see Table 4 for the SWLS, and Table 5 for the HILS) for all three datasets.

Dataset 1
The configural model showed an acceptable (see RMSEA) to good (see CFI) fit for the SWLS-3, and a good fit for the HILS-3. None of the Δχ² tests between models was significant, and according to the ΔCFI cutoff of .01, both the SWLS-3 and the HILS-3 demonstrated strict measurement invariance. It is also noteworthy that both five-item scales also yielded strict invariance, although the HILS-3 showed better fit indices than the HILS-5, which only demonstrated an acceptable configural fit.

Dataset 2
The configural models for both the SWLS-3 and the HILS-3 showed good fit; and again, the scales showed strict measurement invariance based on non-significant Δχ² and ΔCFI below the threshold. Further, the five-item scales demonstrated strict invariance as well, although (again) the configural model of the HILS-5 was only acceptable.

Dataset 3
The configural models for both the SWLS-3 and the HILS-3 yielded good fit, and the scales, yet again, demonstrated strict measurement invariance based on ΔCFI below the threshold and non-significant Δχ² (except for the SWLS-3, which based on Δχ² demonstrated strong invariance, as the test for the residual variances model was significant, p < .010).

Strong test-retest reliability
Overall, the test-retest correlations of the scales were strong to very strong, although the three-item scales tend to demonstrate slightly (although probably not statistically significantly) lower test-retest correlations than the five-item scales.

Dataset 3
Importantly, the SWLS-3 yields very strong test-retest reliability, and the HILS-3 yields strong test-retest reliability when being answered with the fourth and fifth items removed.

Measurement invariance across gender
Measurement invariance analyses were used to assess psychometric equivalence of the SWLS and the HILS across gender (i.e., between females and males). Both three-item scales demonstrated strict invariance across gender (see Table 7 for SWLS, and Table 8 for HILS) for all three datasets.

Dataset 1
The configural models for the SWLS-3 and the HILS-3 were just identified. Both scales demonstrated strict measurement invariance based on both non-significant Δχ² and ΔCFI below the threshold. It is noteworthy that the SWLS-5 yielded an acceptable to good configural fit, and the HILS-5 yielded an unacceptable to good configural fit. The SWLS-5 demonstrated non-significant Δχ² and ΔCFI less than .01, whilst the HILS-5 demonstrated non-significant Δχ² for all models but Model 3 (and Model 5, which is not the focus here) and ΔCFI less than .01 for all model comparisons.

Dataset 2
Again, the configural model was just identified, and the SWLS-3 and the HILS-3 showed strict measurement invariance in regard to both non-significant Δχ² and ΔCFI below the threshold for all model comparisons. In this dataset the measurement invariance of the SWLS-5 can be considered strict based on ΔCFI less than .01. However, it might be concerning that the Δχ² tests for both Model 2 and Model 3 are significant (p = .005 and p < .001, respectively). The configural model for the HILS-5 ranges from not acceptable (see RMSEA) to good (see CFI), with strict invariance based on non-significant Δχ² and ΔCFI less than .01 for all model comparisons.

Dataset 3
Replicating the results from both Datasets 1 and 2, both the SWLS-3 and the HILS-3 had a configural model that was just identified and demonstrated strict measurement invariance. This was in regard to both non-significant Δχ² and ΔCFI below the threshold for all model comparisons.

Comparing the validity between the three- and five-item scales
The three- and five-item scales yield similar correlation coefficients with other related psychological constructs of mental health (i.e., subjective happiness and psychological well-being), psychological problems (i.e., depression, anxiety and stress) and social desirability. This is demonstrated in Dataset 1 (see Table 9) and Dataset 2 (see Table 10).

Discussion
Overall, the three-item scales of SWL and HIL yield strong psychometric properties. Compared with the five-item scales, the three-item scales often produced psychometric improvements, although these were small and probably have little practical importance. In addition, the three-item scales form a two-factor solution of cognitive well-being with good fit, which tends to show better fit indices than the five-item scales. First, the SWLS-3 and the HILS-3 yield high internal consistency as well as very strong item total correlations, in accordance with H1. In fact, the Cronbach's alphas were slightly higher (.01-.02) for the three-item scales in comparison to the five-item scales, whilst McDonald's omega total and item total correlations were similar.
Second, the SWLS-3 and the HILS-3 demonstrate a good fit in a two-factor solution, which is in accordance with H2 (although in Dataset 1 the fit was just about unacceptable based on the RMSEA criterion, but good as indicated by the CFI criterion). It is further important to note that the two-factor fit is better than a one-factor fit throughout all three datasets. In addition, the fit indices tend to be better for the three-item scales than the five-item scales (the only exception is in Dataset 1, where the RMSEA is somewhat better for the five-item scales, but this is not true for CFI).
Third, in accordance with H3, the HILS-3 and the SWLS-3 yield strong longitudinal measurement invariance in all three datasets; in fact, both scales consistently demonstrated strict measurement invariance based on the CFI difference threshold in all three datasets. Hence, the results support invariant factor structure (i.e., see Model 1), invariant factor loadings (i.e., see Model 2), invariant item intercepts (i.e., see Model 3), and invariant residual variance (i.e., see Model 4). This demonstrates that the meaning of the constructs as measured by the SWLS-3 and the HILS-3 is similar across the repeated assessment occasions. So, the results support meaningful comparisons of the means across different measurement times. It is also of interest that the HILS-5 only demonstrated an acceptable (based on RMSEA) rather than a good configural fit; hence, from a longitudinal measurement invariance perspective it might, in fact, be more appropriate to use the HILS-3 rather than the HILS-5.
Fourth, the SWLS-3 yields strong to very strong test-retest reliability, and the HILS-3 demonstrates strong test-retest reliability, which is in agreement with H4. Although the three-item scales show smaller test-retest correlations than the five-item scales, this difference can be considered small (i.e., rs are .01 to .05 units smaller). The removed items thus appear to be somewhat more stable over time than the included items. For example, this might be because one of the removed items in the SWLS concerns one's past (item 5: If I could live my life over, I would change almost nothing), and the perception of one's past might not change as quickly as one's perception of SWB level. The reason for the lower test-retest correlation of the HILS-3 may be that the removed items, which tap into fitting in and accepting various conditions, have stronger stability than the core items that more directly tap into harmony and balance. Hence, the slightly lower test-retest correlations of the three-item scales might actually reflect more true changes in the targeted constructs, although these conjectures require more research.
Table 9. Pearson's r correlation comparisons between three- and five-item scales in Dataset 1.
Fifth, the HILS-3 and the SWLS-3 yield strong (and even strict) measurement invariance across gender in all three datasets, which is in accordance with H5. Thus, there is support for measurement invariance on all four levels. Importantly, this enables the comparison of means between females and males. Further, it is interesting to note that the three-item scales did not demonstrate the potential problems indicated by the five-item scales. For example, the SWLS-5 yields significant differences for Models 2 and 3 in Dataset 2, and the HILS-5 yields a significant difference for Model 3 in Dataset 1.
It may also be concerning that the configural model of the HILS-5 demonstrates unacceptable fit based on the RMSEA (although the CFI indicates good fit). So, from a measurement invariance perspective it might be a better choice to use the abbreviated three-item scales rather than the longer five-item versions.
In addition to the pre-registered hypotheses, it is also noteworthy that the correlations between the abbreviated and original scales are very strong (r = .95 to r = .98), and the correlations between the three-item scales and the five-item scales are very similar (r difference of .01). In terms of validity, the three- and five-item scales yield very similar correlation coefficients to other well-being measures (including subjective happiness and the dimensions/subscales of psychological well-being), assessments of mental health problems (including depression, anxiety and stress), as well as social desirability.
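As an illustration of how a correlation between an abbreviated and an original scale can be computed, the following minimal sketch scores a three-item and a five-item version from the same responses and correlates the two sum scores with Pearson's r. The response matrix is made-up toy data, and the choice of the first three columns as the retained items is purely hypothetical; neither reflects the datasets or the item selection of this study.

```python
from math import sqrt
from typing import List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def scale_scores(responses: Sequence[Sequence[int]], items: Sequence[int]) -> List[int]:
    """Sum score per respondent over the given (0-based) item columns."""
    return [sum(row[i] for i in items) for row in responses]


# Toy responses on a 1-7 scale: one row per respondent, one column per item.
toy_responses = [
    [7, 6, 7, 6, 5],
    [2, 3, 2, 3, 4],
    [5, 5, 4, 5, 3],
    [4, 4, 5, 3, 6],
    [6, 7, 6, 6, 4],
    [1, 2, 1, 2, 2],
]

short_scores = scale_scores(toy_responses, [0, 1, 2])        # hypothetical three-item version
full_scores = scale_scores(toy_responses, [0, 1, 2, 3, 4])   # five-item version
r_short_full = pearson_r(short_scores, full_scores)
```

Because the short score is contained in the full score, such part-whole correlations are expected to be high by construction, which is one reason why the comparison of validity correlations with external measures is also informative.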
Furthermore, it is important to note that presenting the SWLS-3 and the HILS-3 together on the same page with just one shared instruction does not increase the correlation as compared to when they are delivered on separate pages with their own instruction. This is important as presenting the items together could have made respondents more likely to interpret them similarly.

Limitations and future research
Although the samples are diverse, including participants from the US, the UK, India and some other countries, all samples were collected online. Hence, future research could benefit from examining measurement invariance across nations and in samples that are not collected solely online. Further, the SWLS-3 and the HILS-3 are only tested on their own in a very short survey in this study; future studies will be able to show more specifically how they relate to other constructs. However, considering the very strong correlations between the respective three- and five-item scales, there is very little room for the scales to differ in their correlations with other constructs. In addition, in Dataset 3, participants who did not complete the T2 survey differed significantly at T1 from those who did complete it in terms of age, gender and HILS-3 score.
Notably, however, the effect sizes were small, and the overall response rate at T2 was very high; that is, 87% of the participants who took part at T1 completed the survey at T2.
Moreover, the Scales of Psychological Well-Being demonstrated low internal consistency as measured with Cronbach's alpha and McDonald's omega. Hence, these analyses should be interpreted with caution; however, the correlations of the three- and five-item scales did not differ considerably for these subscales. Lastly, a potential limitation of these scales might be the absence of reverse-scored items. There is, however, an ongoing debate about the potential benefits of reversed items (e.g., see Suárez-Alvarez et al., 2018; Van Sonderen, Sanderman, & Coyne, 2013; Weijters, Baumgartner, & Schillewaert, 2013), especially for short scales, where boredom and inattention are less likely than for longer scales.
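For readers less familiar with the internal consistency index mentioned above, Cronbach's alpha can be sketched from its standard formula, alpha = k/(k - 1) * (1 - sum of item variances / variance of the sum score), where k is the number of items. The function name and the toy data below are hypothetical and serve only to show the computation; this is not the analysis code used in this study.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(responses: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items matrix (sample variances)."""
    k = len(responses[0])  # number of items
    if k < 2:
        raise ValueError("alpha requires at least two items")
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy data: three items answered identically by each respondent, so the items
# covary perfectly and alpha reaches its upper bound of 1.
toy_alpha = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
```

Low values of alpha indicate that the items of a subscale share little common variance, which is why correlations involving such subscales call for cautious interpretation.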

Conclusions
To summarize, results from three different datasets show that the three-item scales of SWL and HIL demonstrate (very) high internal consistency and very strong item-total correlations, and that a two-factor model of cognitive well-being yields a good fit (better than that of a one-factor model). The three-item scales also demonstrate strong to very strong test-retest reliability. In addition, the scales yield strict longitudinal measurement invariance as well as strict measurement invariance across gender, which importantly enables meaningful comparisons of means across both time and gender. Lastly, the three- and five-item scales demonstrate comparable validity by yielding very similar correlation coefficients to constructs of well-being, mental health problems and social desirability.
In conclusion, the SWLS-3 and the HILS-3 demonstrate good psychometric properties and can efficiently be presented together with shared instructions. In fact, although the five-item scales demonstrated good psychometric properties, the three-item scales appear overall to yield better or competitive properties, particularly in forming a two-factor model with good fit and in yielding longitudinal measurement invariance as well as measurement invariance across gender. Hence, using the SWLS-3 and the HILS-3 might be particularly useful in situations where it is important to shorten surveys and limit the demands put on respondents. Furthermore, there is no loss of psychometric strength, and perhaps even some small improvements, in the three-item scales as compared with the five-item scales. Considering the successful abbreviation of both these scales, future research may consider shortening other commonly used scales to reap similar benefits.