The value of social media language for the assessment of wellbeing: A systematic review and meta-analysis

ABSTRACT Wellbeing is predominantly measured through self-reports, a method that is time-consuming and costly. It can also be measured by automatically analysing language expressed on social media platforms, through social media text mining (SMTM). We present a systematic review based on 45 studies, and a meta-analysis of 32 convergent validities from 18 studies reporting correlations between SMTM and survey-based wellbeing. We found that (1) studies were mostly limited to the English language, (2) Twitter was predominantly used for data collection, (3) word-level and data-driven methods were similarly prominent, and (4) life satisfaction was the most common outcome studied. We found that SMTM-based estimates of wellbeing correlated with survey-reported scores across studies at a meta-analytic average of r = .33 (95% CI [.25, .40]) for individual-level assessments of wellbeing, and at r = .54 (95% CI [.37, .67]) for regional measures of wellbeing. We provide recommendations for future SMTM wellbeing studies.

There is a growing interest in the concept of wellbeing, given its association with a wide range of positive outcomes. Higher levels of wellbeing are associated with better financial habits and social relations, more altruistic behaviours, higher school grades, and better workplace functioning (Chapman & Guven, 2016; James et al., 2019; Kim et al., 2019; Maccagnan et al., 2019; Okabe-Miyamoto & Lyubomirsky, in press; Oswald et al., 2015; Steptoe, 2019; Walsh et al., 2018). Higher levels of wellbeing, supported by governmental policies, may also boost the socio-economic development of nations (Lambert et al., 2020; Santini et al., 2021).
Most of the research on wellbeing relies on self-report questionnaires, which seems justified since by definition wellbeing is centred on the subjective evaluation of one's functioning in life. However, collecting self-report data is time-consuming and expensive, and it can suffer from several biases, such as social desirability (Edwards, 1957) or recollection bias (Shiffman et al., 1997). Furthermore, wellbeing questionnaires are generally static and may therefore not be well suited to capture variance over time. A relatively novel alternative to self-report questionnaires is the automatic analysis of individuals' social media language (social media text mining; SMTM). Below, we first discuss how wellbeing is defined in the field and explain in detail how SMTM is conducted and how SMTM estimates are usually evaluated. Next, we review what two existing reviews found for SMTM efficacy in assessing wellbeing. Finally, we present our systematic review, meta-analysis, and conclusions.

Definitions of wellbeing
A wide range of wellbeing definitions, models, and measures exist, but the most common conceptual distinction is made between subjective (or hedonic) and psychological (or eudaimonic) wellbeing (Deci & Ryan, 2008; Ryff, 1989). Subjective wellbeing (SWB) is defined as the cognitive and affective evaluation of one's life, whereby the cognitive component is often captured by life satisfaction, while the affective component is measured by (the presence of) positive affect and (the absence of) negative affect (Diener et al., 1985). Psychological wellbeing (PWB; Ryff, 1989) is defined as positive functioning in life, consisting of positive relations, autonomy, environmental mastery, personal growth, purpose in life, and self-acceptance. Overall, measures of wellbeing correlate moderately to strongly with each other, suggesting an underlying common, broad wellbeing factor (Bartels & Boomsma, 2009; Baselmans & Bartels, 2018; Disabato et al., 2016; Longo et al., 2016).

Social Media Text-Mining (SMTM)
The idea of analysing textual data to infer psychological phenomena can be traced back to the beginning of the 1900s. Freud suggested that mistakes in language use can inform about people's hidden intentions (for an overview, see Tausczik & Pennebaker, 2010). Later on, methods focused on individuals' responses to a predetermined set of stimuli (e.g., ambiguous inkblots or drawings) as indicators of emotions, thoughts, or motivations (e.g., Holtzman, 1950; McClelland, 1979; Rorschach, 1921). Around the 1950s the use of less stimulus-dependent approaches started to emerge. For instance, Gottschalk and colleagues developed a content-analysis protocol to identify Freudian themes in transcriptions of 5-min recordings of patients talking about their thoughts (e.g., Gottschalk et al., 1958, 1969). The first computerized automatic text analysis program, the General Inquirer program (Rosenberg & Tucker, 1979; Stone et al., 1966), appeared in the second half of the 1960s. Today, the most prominent method of analysing text in the social sciences is the Linguistic Inquiry and Word Count software (LIWC; Boyd et al., 2022; Pennebaker et al., 2015).
Although these methods are widely available, collecting human-generated responses to open-ended questions can still be as costly as collecting survey responses. The vast availability of social media data has, however, changed this. It is estimated that more than 3.6 billion people use social media platforms worldwide (Tankovska, 2020), leading to the generation of unprecedented amounts of self-reported textual data every day. These include text-based recordings of thoughts, emotions, and behaviours that individuals produce without the primary motivation of providing data for research. Social media text data from thousands of subjects can be collected and analysed automatically while offering a less biased, unobtrusive, and more ecologically valid assessment of wellbeing. Collectively, we refer to methods that apply automatic text analyses to data from social media as social media text mining (SMTM; Tay et al., 2020).
Overall, conducting SMTM involves two steps. In the first step, the unstructured language data from individuals' social media accounts is automatically analysed to create language variables or 'features' (e.g., which words are used or the number of times a word is used relative to the user's total word count). The methods to build language features can be categorized as closed- and open-vocabulary methods (for an overview, see Schwartz & Ungar, 2015).
Closed-vocabulary methods involve dictionaries based on existing psycho-social theories or created through annotations performed by annotators. For example, a dictionary written by experts is the 2022 version of Linguistic Inquiry and Word Count (LIWC-22; Boyd et al., 2022), which includes 337 positive emotion words (e.g., 'love', 'nice', 'sweet') and 612 negative emotion words (e.g., 'hurt', 'ugly', 'nasty'). Examples of annotation-based dictionaries are the Affective Norms for English Words dictionary (ANEW; Bradley & Lang, 1999; also see Warriner et al., 2013) and the Language Assessment by Mechanical Turk (LabMT; Dodds et al., 2011; also see Kloumann et al., 2012), providing the average valence scores (between happy and unhappy) for over 10,000 unique words. The relative frequencies of words from the dictionaries can be counted to estimate the positive and negative content in the text.
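As a rough illustration of the dictionary-counting idea described above, the following Python sketch computes the share of a user's words that fall in a positive or a negative word list. The two lexicons are hypothetical toy sets built from the example words quoted in the text, the tokenizer is a deliberately simple assumption, and none of this reproduces the actual LIWC-22, ANEW, or LabMT procedures.

```python
import re
from collections import Counter

# Hypothetical toy lexicons for illustration only; real dictionaries
# such as LIWC-22 are far larger and are not reproduced here.
POSITIVE_WORDS = {"love", "nice", "sweet"}
NEGATIVE_WORDS = {"hurt", "ugly", "nasty"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency(text, lexicon):
    """Share of all tokens in the text that belong to the lexicon."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[word] for word in lexicon)
    return hits / len(tokens)


# Made-up example text standing in for a user's concatenated posts.
posts = "I love this nice weather. That fall hurt."
positive_share = relative_frequency(posts, POSITIVE_WORDS)
negative_share = relative_frequency(posts, NEGATIVE_WORDS)
```

Dividing by the total token count is what makes the feature a relative frequency, so users who post a lot and users who post little can be compared on the same scale.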
In open-vocabulary methods (Schwartz & Ungar, 2015), language features are 'learned' from the data itself. These methods rely on algorithms or decision rules. For example, Latent Dirichlet Allocation (LDA; Blei et al., 2003) groups words in a text that naturally occur together to generate language features called 'topics'. A topic usually comprises words that are semantically coherent and meaningful, such as the words 'tonight', 'excited', 'super', and 'stoked' occurring together (Eichstaedt et al., 2021). Similarly, the Pointwise Mutual Information criterion (PMI; Abdi & Williams, 2010) is used to detect two- and three-word sequences that occur at rates that are above chance (e.g., 'have a good day', 'thanks a lot').
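To make the PMI criterion mentioned above more concrete, here is a minimal Python sketch for two-word sequences. It uses the standard definition PMI(x, y) = log2[p(x, y) / (p(x) p(y))], estimated from raw counts in a single token list. The function name and the toy sentence are assumptions for illustration, and the reviewed studies may use different estimators, thresholds, or extensions to three-word sequences.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return PMI (in bits) for each adjacent word pair in a token list."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in bigram_counts.items():
        p_pair = count / n_bigrams
        p_first = unigram_counts[first] / n_unigrams
        p_second = unigram_counts[second] / n_unigrams
        # Positive values mean the pair co-occurs more often than
        # expected if the two words were independent.
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Made-up toy text; real applications use very large corpora.
tokens = "thanks a lot for the help thanks a lot".split()
pmi = bigram_pmi(tokens)
```

Pairs with a PMI clearly above zero are candidates for being treated as a single multi-word feature. With corpora this small the estimates are unstable, so the sketch only shows the mechanics.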
In the second step of SMTM, the features can be used for estimating simple correlations (word-level methods; see Jaidka et al., 2020) or building supervised machine learning (ML) prediction models ('data-driven' methods). Prediction models may use both open- and closed-vocabulary features in combination with demographics.

Evaluating the success of an SMTM approach
The success of SMTM can be established through the level of convergence between SMTM estimates and the 'gold standard' or 'ground-truth' scores based on self-reports. Most associations observed between SMTM and survey scores have an upper limit of r = 0.30-0.40, which is similar in magnitude to correlations found between, for example, self-report survey scores and informant reports of personality (e.g., Park et al., 2015) and wellbeing (e.g., Schneider & Schimmack, 2009). As alternative validity approaches, researchers compare the words and phrases (or topics) associated with high (and low) survey scores (Kern et al., 2016) or compare the temporal variations (peaks and dips) in SMTM scores (Cao et al., 2018; Dodds et al., 2011; Durahim & Coşkun, 2015; Kramer, 2010; Kristoufek, 2018; Qi et al., 2015).
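The convergence described above is typically quantified with a Pearson correlation between the language-based estimate and the survey score for the same units. The sketch below shows that computation in plain Python. The variable names and the five paired values are invented for illustration and do not come from any of the reviewed studies.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up paired scores: one SMTM estimate and one survey score per person.
smtm_scores = [0.10, 0.40, 0.35, 0.80, 0.60]
survey_scores = [3, 4, 5, 6, 5]
convergent_validity = pearson_r(smtm_scores, survey_scores)
```

In a data-driven study the SMTM scores would be predictions for people (or regions) that were not used to train the model, so that the correlation is not inflated by overfitting.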

SMTM to assess wellbeing
The unique SMTM data properties can be leveraged to study complex traits like wellbeing in large samples.SMTM can be applied to measure wellbeing both at the individual and regional levels: akin to survey-based wellbeing assessments, inferences can be made at regional levels by aggregating both the location-stamped language data (e.g., geo-located tweets) and the survey responses within each region (Jaidka et al., 2020).
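The regional aggregation described above can be sketched as a simple group-and-average step applied separately to the language scores and to the survey responses. The region labels, the scores, and the helper name `regional_means` are hypothetical. Actual studies often add steps such as minimum post counts per region or demographic reweighting, which are not shown here.

```python
from collections import defaultdict
from statistics import mean


def regional_means(records):
    """Average a score per region from (region, score) pairs."""
    grouped = defaultdict(list)
    for region, score in records:
        grouped[region].append(score)
    return {region: mean(scores) for region, scores in grouped.items()}


# Made-up geo-located language scores and survey responses.
tweet_scores = [("A", 0.2), ("A", 0.4), ("B", 0.8)]
survey_scores = [("A", 6.0), ("A", 7.0), ("B", 8.0), ("B", 9.0)]

smtm_by_region = regional_means(tweet_scores)
survey_by_region = regional_means(survey_scores)

# Regions present in both sources can then be compared, for example
# with a correlation across regions.
shared_regions = sorted(set(smtm_by_region) & set(survey_by_region))
```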
A review (Luhmann, 2017) and a meta-analysis (Settanni et al., 2018) have been published on the use of SMTM to assess wellbeing. Settanni et al. (2018) found that wellbeing could be estimated accurately through individuals' digital traces (e.g., user demographics, user activity statistics, language), indicated by a meta-analytic correlation of 0.37 (95% CI [0.28, 0.45]) between SMTM and survey-based wellbeing scores. The positive association between digital traces and wellbeing was stronger for public social media platforms (Twitter/Sina Weibo, Reddit, and Instagram) than private ones (i.e., Facebook). Luhmann (2017) reported a moderate convergent validity (between rs = .20 and .40) of SMTM for wellbeing with the Depression, Anxiety, and Stress Scales (DASS-21; Henry & Crawford, 2005). A weaker convergent validity was observed when the Satisfaction with Life Scale (SWLS; Diener et al., 1985) was used (overall less than r = .20), based on which the author concluded that the validity of SMTM for life satisfaction was limited.

The present study
The existing literature review and meta-analysis provide useful first insights into the potential of applying SMTM to assess wellbeing. However, automatic text analysis methods are rapidly improving, suggesting that an up-to-date systematic review and meta-analysis is needed.
Further, both earlier studies considered self-report-based stress, anxiety, and depression as primary indicators of wellbeing in addition to life satisfaction scores. This approach might have lowered the validity of the results since wellbeing is not equivalent to the absence of psychopathology (Keyes, 2002). To address these points, in the present study, we conduct a systematic review and a meta-analysis with a focus on wellbeing measures specifically. We structure our evaluation across four sections: (1) sample characteristics, (2) design characteristics, (3) validity of the results (using convergent and face validity) based on a qualitative synthesis, and (4) convergent validity assessed via meta-analysis.

Information source and search strategy
On November 5, 2021, a search was conducted in the bibliographic databases PubMed and Web of Science. The results from both databases were merged. Reference lists of the selected articles were further scrutinized for relevant articles. As our search strategy, we used combinations of search terms related to (1) wellbeing (e.g., 'Wellbeing', 'Well-being', 'Life satisfaction'), (2) social media platforms (e.g., 'Social-media', 'Facebook', 'Twitter'), (3) language message type ('Language', 'Post', 'Updates', 'Status'), and (4) language analysis methods (e.g., 'LDA - Latent Dirichlet Allocation', 'LIWC - Linguistic Inquiry and Word Count') (see Table 1 for detailed information), and the Boolean search operators 'AND' and 'OR'. Initially, we conducted a comprehensive search that involved all possible combinations of the four search term categories. Subsequently, we narrowed down our search to include only the combinations of three search term categories (social media platforms, language message types, and language analysis methods). This was followed by using only two search term categories (language message types and language analysis methods).
Note to Table 1: The database search was done by using combinations of the terms above. The Boolean search operators AND (horizontal) and OR (vertical) were used to combine the 4 columns, after that, first 3, then first 2.
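The way search terms are combined can be sketched as follows: terms within a category are joined with OR, and categories are joined with AND. The Python snippet below builds such query strings from the example terms quoted in the text. The category keys, the function name `build_query`, and the generic query syntax are assumptions for illustration. The full term lists are in Table 1, and the exact syntax accepted by PubMed or Web of Science is not reproduced here.

```python
# Example terms quoted in the text; the full lists are in Table 1 and are
# not reproduced here. The query syntax is generic, not database-specific.
CATEGORIES = {
    "wellbeing": ["Wellbeing", "Well-being", "Life satisfaction"],
    "platform": ["Social-media", "Facebook", "Twitter"],
    "message_type": ["Language", "Post", "Updates", "Status"],
    "method": ["Latent Dirichlet Allocation", "Linguistic Inquiry and Word Count"],
}


def build_query(category_names):
    """Join terms with OR inside a category and AND across categories."""
    groups = []
    for name in category_names:
        terms = " OR ".join(f'"{term}"' for term in CATEGORIES[name])
        groups.append(f"({terms})")
    return " AND ".join(groups)


# The three search rounds described in the text.
four_way = build_query(["wellbeing", "platform", "message_type", "method"])
three_way = build_query(["platform", "message_type", "method"])
two_way = build_query(["message_type", "method"])
```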

Study selection, eligibility criteria, and data extraction
Following the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) guidelines (Moher et al., 2009), a flow diagram of our study selection process is presented in Figure 1. The titles and abstracts of all identified articles were screened after exact duplicates were removed. The screening was performed by the first author. Uncertain cases were resolved through discussions among the authors. The title and abstract of articles were screened according to the following eligibility criteria: a study (1) must utilize 'social media language' to investigate 'wellbeing', (2) must use a 'quantitative approach', (3) must be published in a peer-reviewed journal, (4) must not be a meta-analysis or a review paper, and (5) must be written in English. We additionally included articles by scanning the references of the two aforementioned review studies (Luhmann, 2017; Settanni et al., 2018).

Sample characteristics
In the current review, we investigated the type of language (e.g., English, Chinese), platform (e.g., Facebook, Twitter), and sample size used in each study.

Design characteristics
We reported whether a ground-truth measure (i.e., self-report) was included, and if so, which wellbeing measures were used, whether the main focus was on individual, subnational, or national wellbeing levels, whether closed- and/or open-vocabulary methods were used for textual data, and whether data-driven or word-level methods were used. We use the term 'subnational' to refer to location-level assessments below the country level, e.g., for states, counties, or neighbourhoods.

Validity of results
Convergent validity. To assess SMTM's convergent validity for wellbeing, we report, meta-analyse, and evaluate the correlations between SMTM and ground-truth scores.
Face validity. To assess SMTM's face validity for wellbeing, we compare the evidence from the SMTM and the self-report-based wellbeing literature. If similar conclusions can be drawn, this can provide evidence for the face validity of SMTM in the wellbeing context. To facilitate a meaningful comparison, four recurring topics from the wellbeing literature were chosen: (1) the characteristics of happy individuals, (2) temporal trends in wellbeing, (3) the relation between positive affect (PA) and negative affect (NA), and (4) demographics (age and sex differences).

Quantitative synthesis: meta-analysis and publication bias
A meta-analysis was conducted using the metafor package in R (R Core Team, 2021; Viechtbauer, 2010) to obtain a meta-analytic estimate for convergent validity across studies. We identified a total of 32 effect sizes (correlation coefficients) from 18 studies. Two studies were excluded to ensure homogeneity. The excluded studies involved a considerably large time gap between the SMTM and survey-based wellbeing measurements or did not use self-reports but face-to-face interviews.
For the sake of comparability with previous meta-analyses (e.g., Settanni et al., 2018), the effect sizes (i.e., correlations) used from each study were based on the best-performing language machine learning prediction models or language features (provided that they generalized to newer datasets, i.e., were not overfitting). Correlation coefficients were converted to standardized z-values using the Fisher transformation. After conducting the meta-analysis on the transformed effect sizes, the meta-analytic estimate was converted back to a correlation coefficient to allow interpretation in the original metric.
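The transformation and back-transformation described above can be written in a few lines. Fisher's z is the inverse hyperbolic tangent of r, with approximate sampling variance 1/(n - 3). The Python sketch below is only an illustration of these standard formulas: the correlation and sample size are made-up values, and the authors' own computations were done with the metafor package in R.

```python
import math


def r_to_z(r):
    """Fisher transformation of a correlation coefficient."""
    return math.atanh(r)


def z_to_r(z):
    """Inverse Fisher transformation back to the correlation metric."""
    return math.tanh(z)


def z_variance(n):
    """Approximate sampling variance of Fisher's z for sample size n."""
    return 1.0 / (n - 3)


# Illustrative inputs only: r and n are made-up, not study data.
r, n = 0.30, 103
z = r_to_z(r)
se = math.sqrt(z_variance(n))

# A 95% interval is built on the z scale and then back-transformed,
# which keeps the bounds inside the valid range for correlations.
ci_low, ci_high = z_to_r(z - 1.96 * se), z_to_r(z + 1.96 * se)
```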
We applied a random-effects model (Rubio-Aparicio et al., 2020) with robust variance estimation (Hedges et al., 2010) using a restricted maximum likelihood estimator (REML; Kenward & Roger, 1997). Cochran's Q-test (Hedges & Olkin, 2014) was applied to test the null hypothesis that the true heterogeneity (τ²) between the effect sizes is equal to 0. Further, the proportion of the total variability between effect sizes that was attributable to true heterogeneity was assessed through the I² statistic, with values around 25, 50, and 75% categorized as 'low', 'medium', and 'high' (Higgins et al., 2003). Higher I² values can be considered legitimate grounds for including potential moderators. We included two possible categorical moderator variables. The first moderator variable indicated whether an effect size was estimated at the individual level or the location level (reference category), and the second moderator indicated whether wellbeing was estimated through data-driven or word-level methods (reference category). An interaction term between our two moderators was not included because the number of effect sizes for the location-level word-level method group was far lower than for the other groups, k = 2 compared to k = 8, 9, and 13 (individual/word, individual/data, location/data, respectively), increasing the risk of Type II errors.
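To show how Q, τ², and I² relate to one another, the following Python sketch fits a basic random-effects model to Fisher-z effect sizes. Note the assumptions: it uses the closed-form DerSimonian-Laird estimator of τ² and the Q-based formula I² = 100 × (Q - df) / Q, not the REML estimator with robust variance estimation that the study applied through metafor, and it ignores moderators and dependence between effect sizes. The three effect sizes and variances are invented toy values.

```python
import math


def random_effects(z_values, variances):
    """Pool effect sizes with a DerSimonian-Laird random-effects model."""
    k = len(z_values)
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    # Fixed-effect (inverse-variance) mean, needed to compute Q.
    fixed = sum(w * z for w, z in zip(weights, z_values)) / sum_w
    q = sum(w * (z - fixed) ** 2 for w, z in zip(weights, z_values))
    df = k - 1
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Random-effects weights add tau2 to each sampling variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * z for w, z in zip(re_weights, z_values)) / sum(re_weights)
    pooled_se = math.sqrt(1.0 / sum(re_weights))
    return {"pooled": pooled, "se": pooled_se, "Q": q, "df": df,
            "tau2": tau2, "I2": i2}


# Made-up Fisher-z effect sizes with equal sampling variances.
result = random_effects([0.1, 0.5, 0.9], [0.01, 0.01, 0.01])
pooled_r = math.tanh(result["pooled"])  # back to the correlation metric
```

Because different estimators define and weight the variance components differently, values from this sketch would not be expected to match REML-based software output exactly.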
To assess the risk of publication bias, we created and visually inspected a funnel plot, and applied Egger's test to statistically assess its asymmetry (Egger et al., 1997). We also estimated the number of (potentially) missing studies on the left side of the funnel through the trim-and-fill method (Duval & Tweedie, 2000). Observing a symmetrical funnel plot, obtaining a statistically non-significant Egger's test, and finding no studies missing based on the trim-and-fill method would together indicate no evidence for publication bias.
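The logic of Egger's test can be sketched with an ordinary least-squares regression of each standardized effect (effect divided by its standard error) on its precision (one divided by the standard error). An intercept far from zero suggests funnel plot asymmetry. The Python code below implements this classic formulation under stated assumptions: the function name and the toy data are invented, and the variant used by the authors' software may be specified differently, so the numbers are not comparable to the reported test statistic.

```python
import math


def egger_regression(effects, standard_errors):
    """Regress standardized effects on precision.

    Returns the intercept, the slope, and the intercept's standard error.
    """
    y = [e / s for e, s in zip(effects, standard_errors)]
    x = [1.0 / s for s in standard_errors]
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    residual_variance = sse / (n - 2)
    intercept_se = math.sqrt(residual_variance * (1.0 / n + x_bar ** 2 / sxx))
    return intercept, slope, intercept_se


# Made-up effect sizes (Fisher z) and standard errors.
effects = [0.30, 0.35, 0.25, 0.40, 0.20]
standard_errors = [0.05, 0.10, 0.15, 0.20, 0.25]
intercept, slope, intercept_se = egger_regression(effects, standard_errors)

# The intercept is compared against a t distribution with n - 2 degrees
# of freedom; the p-value lookup is omitted in this sketch.
t_statistic = intercept / intercept_se
```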
Lastly, some of the effect sizes in the present meta-analysis were non-independent: some studies provided multiple effect sizes based on different wellbeing measures (e.g., life satisfaction, PA/NA, or eudaimonic wellbeing dimensions) for the same samples. To solve the non-independence issue, we applied robust variance estimation (RVE; Hedges et al., 2010; Moeyaert et al., 2017) to our meta-analytic estimate. RVE ensures that the studies with more effect sizes are assigned a smaller weight to obtain unbiased standard errors (Hedges et al., 2010; Moeyaert et al., 2017).

Results
Our initial search resulted in 28,857 papers. After removing duplicates, 21,197 articles remained. By screening titles and abstracts, 332 articles were found potentially relevant. Based on full-text readings, 38 articles were included in the present study. Seven articles were added from external references. The final systematic review included 45 articles (see Figure 1).
In Table 2, we list all studies referred to in the sample/design characteristics section of the results. In Table 3, we provide results acquired from studies to judge the face validity of SMTM for wellbeing. In Table 4, correlations reported between SMTM-based wellbeing and ground-truth scores are provided. In all tables, studies are referred to with an identification number, which is indicated in the reference list as well.

Type of language
In the majority of the studies, analyses were based on single-language data (k = 42), with English datasets being most common (k = 30), followed by Chinese (k = 6), Italian (k = 4), Russian (k = 1), and Turkish (k = 1). Only a few studies used data from multiple languages (k = 3).

Platform
Mostly, the data were collected through Twitter (k = 28) and its Chinese equivalent Sina Weibo (k = 6). Facebook was used in less than one-fourth of all studies (k = 10). A single study used datasets from both Twitter and Facebook.

Inclusion of ground-truth measures
Fewer than half of the studies employed a wellbeing-related ground-truth measure (k = 19).

Individual vs. regional focus
The majority of the studies investigated either individual-level wellbeing (k = 23) or subnational (e.g., county or state) wellbeing (k = 18), while a few studies investigated nation-level wellbeing (k = 3). A single study investigated both individual and subnational level wellbeing.

Word-level vs. data-driven methods
The majority used data-driven methods (k = 23), and the remaining studies used word-level methods (k = 20) except for two studies (k = 2) using both.

Type of ground-truth measures
In the 19 studies that included a ground-truth measure, most studies used measures for satisfaction with life (k = 12). The remaining studies used measures for affect (k = 1), general wellbeing (k = 1), quality of life (k = 2), affect and satisfaction with life (k = 1), and affect and psychological wellbeing (k = 1). Only a single study used a eudaimonic wellbeing measure.

Convergent validity of results
Overall, studies indicated convergent validity for SMTM (k = 18), with correlation coefficients averaging r = 0.39 (SD = 0.19) and ranging between r = 0.08 and 0.85. Only a few studies found unexpected results (k = 3): one study found that life (dis)satisfaction was not associated with wellbeing at the location level, while another study found no association between SMTM wellbeing and ground-truth measures across 81 provinces in Turkey. The last study found that negative word use was associated with wellbeing, but in a positive direction. A summary estimate for the convergent validities across the studies is provided in the meta-analysis section of the present study.

Language of high and low wellbeing
The language of individuals who scored higher on survey-based wellbeing (k = 6) included words related to topics such as leisure time, achievements, exercise, altruism, amazement, and religion, as well as first-person plural pronouns (e.g., 'we', 'us'). The language of individuals who scored lower on survey-based wellbeing included words related to swearing, disengagement, lack of meaning, and problems with relationships, as well as first-person singular pronouns (e.g., 'I', 'me') and impersonal predicates (e.g., 'must').

Temporal trends in wellbeing
Fifteen studies investigated the temporal trends in SMTM-based wellbeing. Six studies examined the changes in SMTM-based wellbeing as a response to emotionally valent worldwide and nation-scale events (e.g., festivals, disasters, economic crises) (k = 6). In contrast to the overall pattern, one study suggested that wellbeing fluctuations cannot be found in social media text data (k = 1). Nine studies reported changes in SMTM-based wellbeing in a recurring/cyclical fashion (hours for working and resting during the day, weekends, and seasons with good weather). These changes occurred at a range of time resolutions such as hours (k = 3), days (k = 5), and months/seasons (k = 1). Some studies also reported changes in SMTM-based wellbeing related to changes in temperature or weather conditions (k = 3).
The two studies that investigated hourly changes in SMTM wellbeing reported different results depending on whether both PA and NA were included in the SMTM wellbeing measure. The study with a composite measure of PA and NA (k = 1) found that SMTM wellbeing peaked early in the morning (e.g., when waking up, commuting, and starting to work) and decreased drastically until midday, which was followed by a less strong decay until midnight (resting and eventually going to sleep). The other study, which focused independently on SMTM-based PA and NA, found that PA peaked twice a day (early morning and near midnight) while NA was lowest in the mornings and had a stable increase until its single peak at night-time.
The studies focusing on the daily fluctuations of SMTM wellbeing revealed that wellbeing was consistently highest at the weekends (k = 6). Nonetheless, it was less clear on which weekday the lowest wellbeing levels were found. In two studies, Wednesday was suggested (k = 2), while Tuesday was also considered as the day with the highest wellbeing (k = 1) but also the lowest wellbeing (k = 1).
SMTM wellbeing was found to decrease with lower temperatures, less light, and rougher weather conditions (e.g., wind, rain) (k = 3). For instance, SMTM wellbeing was highest in July and lowest in February in Italy, which is located in the Northern Hemisphere (k = 1), and it rose in parallel with increasing temperatures during winter and spring, but only up to 30 degrees Celsius in summer (k = 1). Shorter day length, which changes in accordance with seasons, was also associated with decreases in PA (but not associated with an increase in NA) in both Northern and Southern Hemisphere countries (e.g., the United States and Canada, India, and Australia) (k = 1). In addition, SMTM wellbeing decreased if it rained (but not if it snowed) (k = 1), and the effect of air pollution on SMTM wellbeing increased when weather conditions were rough (i.e., if there was too much rain, wind, and too many clouds) (k = 1).

Relation between positive and negative affect
Only a few studies assessed and compared SMTM-based PA and NA, and the results from these studies on the differences between the two constructs are in line with evidence based on self-reports. PA and NA showed independent within-person level trajectories over time (k = 1), and PA and NA were not significantly associated with each other at the province level (k = 1). PA was more prevalent than NA in the language of individuals on social media (k = 6) and was also characterized by different 'affect dynamics' (k = 2). For instance, whenever users on Twitter state that they are feeling either 'good' or 'bad' (i.e., express their current emotions), the emotional content of their following tweets first peaks, and eventually returns to baseline levels. SMTM-based PA tends to peak quicker, while NA builds up more slowly and eventually returns to baseline levels even faster than PA (k = 1).

Demographics
Sex differences were found in SMTM-based wellbeing (k = 4). Females used more positive and negative language than males, yet the findings for negative word use were not consistent across studies (k = 2). Female users' trajectories showed that positive word use frequencies increased faster but dissipated more slowly than in trajectories for males (k = 1). The negative effects of air pollution were more visible in the SMTM wellbeing levels of females than males (k = 1).
Concerning age, at the individual level older people used more positive language (k = 2). In line with individual-level results, at the regional level, happier tweets were observed in neighbourhoods with older populations (k = 1).

Meta-analysis and publication bias
In total, we retrieved 32 effect sizes from 18 studies, and the number of outcomes per study ranged from 1 to 8 (mean = 1.78, median = 1). To assess the risk of publication bias, we visually inspected our funnel plot and concluded that it was mostly symmetrical (see Figure 2). The Egger's test results showed no statistical evidence for funnel plot asymmetry (z = 1.02, p = .31). The trim-and-fill algorithm estimated the number of missing/non-reported effect sizes/studies on the left side of the mean effect of our funnel plot as 0 (SE = 3.42). Based on an initial random-effects meta-analysis model without any moderator variables, we found that the estimated true heterogeneity (τ²) between the effect sizes was significant, as indicated by Cochran's Q-test (Hedges & Olkin, 2014), Q(df = 31) = 6204.30, p < .0001. Nonetheless, the estimated true heterogeneity was small (τ² = .04, SE = .01). Most of the variability between the effect sizes was attributable to true heterogeneity (I² = 98.81%). Thus, we proceeded with including our categorical moderator variables (individual vs. location level and data-driven vs. word level) in our meta-analytic model.

Discussion
Following the PRISMA guidelines, we presented a systematic review based on 45 studies that use SMTM to assess wellbeing and conducted (1) a qualitative synthesis and (2) a meta-analysis based on 32 effect sizes from a subset of 18 studies reporting correlations between SMTM and survey-based wellbeing.

Qualitative synthesis
The systematic review and qualitative synthesis resulted in the following overall observations. Across the 45 studies, 70% of all studies were based on English-speaking samples and Twitter was the most popular platform (60% of the studies). In general, large sample sizes were used (between 10,000 and a million individuals), though with a wide range (the smallest samples had fewer than 1,000 individuals while the largest samples included more than a million individuals). Half of the studies focused on individual-level wellbeing and the other half focused on regional wellbeing. Half of the studies used closed-vocabulary methods (such as LIWC dictionaries), and the remaining studies either used open-vocabulary methods (such as LDA topic models) or combined closed- and open-vocabulary methods. Word-level and data-driven methods were used about equally often. Satisfaction with life was the most used ground-truth measure (12 of the 19 studies with ground-truth measures), while eudaimonic wellbeing (such as autonomy, personal growth, and environmental mastery) was only assessed in a single study.
Our results showed that a clear majority (70%) of all studies used English-speaking samples. This limits the applicability of the current evidence to non-English speaking populations. Existing studies have already suggested differences in the expression and conceptualization of happiness/wellbeing in different cultures. For instance, a study has found that European Americans (EA) and Asian Americans (AA) valued high-arousal positive affect (reflecting 'excitement') more than Hong Kong Chinese people (CH), whereas AA and CH participants valued low-arousal positive affect (reflecting 'calmness') more than EA participants (Tsai et al., 2006). Therefore, a particular set of language features (e.g., words, topics) or a model used in SMTM may assess wellbeing well in one culture but fail in another. Currently, SMTM might thus only be a reliable option to assess wellbeing in English, thereby excluding large parts of the world population. Nonetheless, the recent development of open-access language models based on massive multilingual data such as Multilingual BERT (M-BERT; Pires et al., 2019), XLM (Cross-lingual Language Model; Lample & Conneau, 2019), and XLM-R (XLM-RoBERTa; Conneau et al., 2020) may help to alleviate these representativeness problems. Such models, once trained, can be applied in a wide variety of natural language processing tasks in a wide range of languages not limited to English, allowing for the detection of wellbeing of individuals from different populations.
The qualitative synthesis, furthermore, indicated that most studies use large sample sizes, highlighting one of the advantages of social media data use. The combination of large-scale social media data and computerized text analysis methods allows for assessing wellbeing in a way that is complementary, and perhaps alternative, to traditional survey-based self-reports. The unique characteristics of large language data used in SMTM (e.g., the longitudinal prospective structure, the ecologically valid setting, large reach, anonymous large-scale data collection) provide an efficient way to assess wellbeing at different units of analysis (e.g., individual or location). Wellbeing assessed through SMTM at both the individual and location level can aid in developing improved personalized interventions to increase happiness, while it can also inform better policies in neighbourhoods, cities, and countries. Both practitioners and policy-makers can use SMTM and aim to increase the already known positive outcomes related to higher levels of wellbeing, such as better financial habits, social relations, more altruistic behaviours, higher school grades, and better workplace functioning (Chapman & Guven, 2016; James et al., 2019; Kim et al., 2019; Maccagnan et al., 2019; Okabe-Miyamoto & Lyubomirsky, in press; Oswald et al., 2015; Steptoe, 2019; Walsh et al., 2018), as well as increased socio-economic development of regions (Lambert et al., 2020; Santini et al., 2021).
Among the reviewed studies, there was an equal preference for word-level and data-driven methods. In addition, open-vocabulary methods were applied in 20% of the studies, less often than closed-vocabulary methods. However, data-driven and open-vocabulary methods may better leverage larger datasets. Used with large social media datasets, these methods allow for finding previously unknown associations and help generate new hypotheses. Given that the already vast availability of language data on social media will likely expand further, the increasing preference for these computational methods is understandable. Nonetheless, open-vocabulary methods also have potential shortcomings: for example, study variables are generally not comparable across studies, whereas closed-vocabulary methods (dictionaries) remain constant across studies. Overall, open-vocabulary methods require more expertise to implement, need larger datasets, and are less easy to use than closed-vocabulary methods (for a full discussion, see Eichstaedt et al., 2021).
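To make the contrast between the two families of methods concrete, the following is a minimal sketch rather than an implementation of any tool used in the reviewed studies. The word lists, the example post, and the function names are hypothetical; validated dictionaries such as LIWC or LabMT are far larger. The closed-vocabulary function scores a text against a fixed list that stays constant across studies, whereas the open-vocabulary function derives its features from whatever words occur in the data.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary for illustration only; validated
# closed-vocabulary tools use much larger, curated word lists.
POSITIVE_WORDS = {"happy", "great", "love", "enjoy"}
NEGATIVE_WORDS = {"sad", "tired", "hate", "bored"}


def tokenize(text):
    """Lowercase a post and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def closed_vocabulary_score(text):
    """Share of positive minus share of negative dictionary words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    return (positive - negative) / len(tokens)


def open_vocabulary_features(text):
    """Relative frequency of every observed word (no predefined list)."""
    tokens = tokenize(text)
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}


# Hypothetical example post.
post = "Love the weekend, so happy to enjoy time with friends"
print(closed_vocabulary_score(post))
print(sorted(open_vocabulary_features(post)))
```

In this sketch the closed-vocabulary score is directly comparable across datasets, while the open-vocabulary feature set changes with every corpus, which mirrors the comparability trade-off described above.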
Half of the studies included at least one survey-based ground-truth measure, providing a valuable source for testing the validity of social media data. With recent developments in data-driven methods, though, ground-truth measures are becoming less essential, as these are only necessary when new models for SMTM are developed or when models are adapted for new populations. The novel contextual word embeddings are pre-trained on high-quality large samples (e.g., Devlin et al., 2018; Sanh et al., 2020; Z. Yang et al., 2019), which may increasingly liberate researchers from the burden of collecting large amounts of ground-truth measures. In principle, pretrained models can be directly used to assess wellbeing in independent text data. Researchers should, however, ensure there are no large differences between development and target samples.
There was considerable variety in the wellbeing ground-truth measures used to validate SMTM wellbeing scores. Most studies (around 60%) used self-reported life satisfaction scores as the ground truth. The remaining studies used other wellbeing measures capturing subjective wellbeing, mental wellbeing, or quality of life. These inconsistencies between measures make it difficult to readily compare results across studies. For instance, some of these measures included items for objective wellbeing (e.g., income, access to basic services), while others used items for the affective or cognitive components of wellbeing. At the same time, studies have shown that different wellbeing measures correlate with each other at moderate to high levels (e.g., Bartels & Boomsma, 2009; Busseri, 2018), implying a general wellbeing factor (Longo et al., 2016). Based on this, results informed by different wellbeing measures may still be comparable. Overall, both clarity on the conceptualization of wellbeing and caution when inferring results from different wellbeing measures are needed. In the future, it would also be advisable to include eudaimonic wellbeing measures (reported in only one study) to obtain a complete picture of SMTM wellbeing.
Our review indicated that SMTM-based wellbeing had mostly similar qualities to survey-based wellbeing. For instance, the language expressed by individuals scoring high and low on wellbeing was largely in line with self-report-based results (e.g., Diener et al., 2018). People with higher wellbeing levels talked about social gatherings, leisure time, engagement, and enjoyment, whereas the language of people with lower wellbeing included words related to low levels of motivation, lack of meaning, impersonal predicates, and swearing. This aligns with previous survey studies showing that individuals who were more engaged and socially connected reported higher wellbeing (Keyes, 2010; Ryff, 1989; Seligman, 2018). In addition, our review showed that SMTM-based wellbeing increases and decreases in response to positive and negative events, similar to results from survey-based wellbeing studies (Luhmann et al., 2012). Both one-time impactful events (e.g., earthquakes or economic crises) and cyclically occurring events (e.g., work vs. leisure hours) resulted in changes in SMTM wellbeing, providing additional evidence for the face validity of using SMTM to capture wellbeing and its fluctuations.
A minority of studies have reported unexpected results concerning convergence between SMTM and survey-based wellbeing. One study found that SMTM wellbeing was not associated with life (dis)satisfaction at the location level (LabMT applied to 232 zip codes in Utah; Nguyen, Kath, et al., 2016). These null associations may be due to the time gap between the assessment of the ground-truth measure (between 2009 and 2010) and the collection of the language data (between 2009 and 2014). Another study reported no association between SMTM wellbeing and ground-truth wellbeing measures across 81 provinces in Turkey (Durahim & Coşkun, 2015). The second study's null results may be explained by the fact that the ground-truth measure for wellbeing consisted of scores given by government officials, which differed from the other studies' methods, where social desirability and anonymity concerns may have played a role. Finally, the last study found negative word use on social media to be positively associated with survey-based wellbeing (N. Wang et al., 2014). This unexpected result may be due to the use of LIWC negative words in a reversed context or in sarcasm (e.g., I feel 'terribly' happy). Such problems with word-level methods (like LIWC) might have been particularly alleviated in this study, where the language data from individuals were aggregated at different time windows such as days, weeks, and months. Despite these few unexpected results, we observed higher convergent validities for regional studies than for individual-level studies. This may be due to the higher prevalence of data-driven methods and the larger sample sizes observed in the former group of studies.
Overall, the qualitative synthesis has shown the value of large-scale data collection and wellbeing assessment via social media language and provided a detailed picture of the current state of the field. Large sample sizes, frequent use of ground-truth measures, and widely used data-driven methods suggest that the quality of the studies will improve further in the future. The multilingual versions of newer data-driven methods may allow for better assessment of wellbeing in non-English-speaking populations.

Quantitative synthesis: meta-analysis and publication bias
A meta-analysis was conducted to obtain a meta-analytic estimate of convergent validities across studies. We meta-analysed 32 effect sizes from 18 studies that reported convergent validities of SMTM wellbeing; based on Egger's test (Egger et al., 1997) and the trim and fill method (Duval & Tweedie, 2000), these effect sizes seemed unaffected by publication bias. The results based on a meta-analysis including effect sizes of optimal models (thus signifying the upper limit) indicated moderate convergence between SMTM and survey-based wellbeing (meta-analytic r = 0.40, 95% ). This correlation is largely similar to the result of an earlier meta-analysis (meta-analytic r = 0.37, 95% CI [0.28-0.45]) (Settanni et al., 2018) and to convergent validity coefficients achieved by other methods (e.g., peer reports with self-report surveys) for personality and wellbeing, which typically range between r = 0.30 and 0.40 (e.g., Park et al., 2015; Schneider & Schimmack, 2009). The correlation is, however, higher than the values in previous literature reviews, which reported correlations between 0.20 and 0.40 for affect and smaller than 0.20 for life satisfaction (e.g., Bellet & Frijters, 2019; Luhmann, 2017). These differences may be explained by these studies being literature reviews rather than systematic reviews, which may mean that they did not include all of the available evidence in the field.
The results showed higher convergent validities for location-level studies (between 0.37 and 0.67) compared to individual-level studies (between 0.25 and 0.40), as previously reported in the World Happiness Report in 2019 (Bellet & Frijters, 2019). At the individual level, word-level methods performed better than data-driven methods (average r = 0.38 and 0.26, respectively; SD = 0.22 and 0.13). At the location level, however, data-driven methods performed better than word-level methods (average r = 0.52 and 0.38, respectively; SD = 0.24; the SD for the second mean cannot be calculated as it is based on a single score). Nevertheless, it should be noted that the latter result was based on a single study, limiting its interpretability. Overall, this pattern of findings may be explained by the fact that word-level methods (e.g., LIWC) were developed with a particular focus on interpreting individuals' psychological states and traits rather than those of regions. In contrast, data-driven methods are thought to capture the nuances and differences in the language of different geographical regions and achieve higher accuracies, but require bigger datasets, as observed for regional studies (Jaidka et al., 2020).
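The pooling step behind a meta-analytic correlation can be sketched as follows. This is an illustrative sketch only: it assumes a Fisher z transformation and a DerSimonian-Laird random-effects estimator, which is one common choice and not necessarily the model used in the present meta-analysis, and it treats all effect sizes as independent, whereas several of the 32 effect sizes come from the same study. The correlations and sample sizes in the example are hypothetical.

```python
import math


def random_effects_meta(correlations, sample_sizes):
    """Pool correlations with a DerSimonian-Laird random-effects model.

    Returns the pooled r, its 95% confidence interval, and the
    between-study variance (tau squared) on the Fisher z scale.
    """
    z = [math.atanh(r) for r in correlations]      # Fisher z transform
    v = [1.0 / (n - 3) for n in sample_sizes]      # sampling variance of z
    w = [1.0 / vi for vi in v]                     # fixed-effect weights

    z_fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, z))
    df = len(z) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    w_re = [1.0 / (vi + tau2) for vi in v]         # random-effects weights
    z_re = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    lower, upper = z_re - 1.96 * se, z_re + 1.96 * se

    # Back-transform from the z scale to the correlation scale.
    return math.tanh(z_re), (math.tanh(lower), math.tanh(upper)), tau2


# Hypothetical effect sizes and sample sizes, not taken from the review.
example_r = [0.25, 0.40, 0.33, 0.51]
example_n = [150, 300, 2000, 90]
pooled_r, ci, tau2 = random_effects_meta(example_r, example_n)
print(round(pooled_r, 2), [round(bound, 2) for bound in ci], round(tau2, 4))
```

Pooling on the z scale and back-transforming keeps the confidence interval within the admissible range of a correlation; a multilevel model would be needed to account for effect sizes nested within studies.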

Limitations of the study and social media data use in general
Our review and meta-analysis should be interpreted in light of the following limitations. In the present review, pre-prints were not included, which may have caused us to miss very recent developments in the field, but this guarantees that the included studies were peer-reviewed. In the meta-analysis section, the models achieving the highest convergence in each study were included for the calculation of the effect sizes. Thus, the meta-analytic convergence between SMTM and survey-based wellbeing reflects the upper limit. Among the search terms used, we did not include eudaimonic wellbeing explicitly. However, the other search terms we used for general wellbeing (e.g., wellbeing, well being, or well-being) probably would have captured eudaimonic wellbeing studies if they existed. Clearly, eudaimonic wellbeing is researched less than hedonic wellbeing (e.g., life satisfaction, positive or negative affect), perhaps due to the length and scope of the self-report measures dedicated to this construct (Ryff, 1989). Therefore, we estimate that we covered most, if not all, of the studies available in the field.
More generally, although SMTM appears to be a valid method for assessing wellbeing, it is important to acknowledge the limitations inherent to social media language data. Social media language data are often 'noisier' than survey data. The amount of text produced varies between individuals (some write more, and more frequently) and within individuals (text production per time unit varies). In addition, social media data are not representative of the population concerning age, sex, income levels, educational levels, and ethnicities (e.g., Blank & Lutz, 2017; Hargittai & Dobransky, 2017; Hargittai, 2020; Mislove et al., 2011). For example, a recent report mentioned fewer global female Facebook users compared to males, although female users were more active (e.g., frequency of post likes, comments, or the number of clicks on advertisements) (Kemp, 2021). In addition, language features (e.g., words) show a Zipfian distribution, i.e., most words only occur a few times (Eichstaedt et al., 2021), leading to sparse data for most words. People use a very large number of different words or topics, with low base rates and uneven distributions across the population, each with small effects on the phenotype of interest (e.g., wellbeing). In order to find these small effects, large sample sizes are needed. At the same time, while a single word or collection of words may be an imperfect measure of wellbeing, given the sample sizes that can be achieved, language may provide a valid measure of wellbeing when aggregating all the small effects across all words. Overall, researchers must acknowledge the potential limitations of social media language datasets and correct for those, if possible.
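The sparsity that follows from a Zipfian word distribution can be illustrated with a toy example. The four posts below are hypothetical and far too few to estimate a real rank-frequency curve; the sketch only shows how quickly the share of words that occur a single time grows, even in a tiny corpus.

```python
from collections import Counter

# Hypothetical posts for illustration only.
posts = [
    "so happy with the sunny weekend",
    "the train is late again",
    "dinner with the family was lovely",
    "the weekend is almost here",
]

# Count how often each word type occurs across all posts.
counts = Counter(word for post in posts for word in post.split())

# Rank word types from most to least frequent.
ranked = counts.most_common()

# Share of word types that occur exactly once.
singleton_share = sum(1 for _, n in ranked if n == 1) / len(ranked)

print(ranked[:4])
print(singleton_share)
```

Because most word types are observed only once, any single word carries little usable signal on its own, which is why the aggregation across many words and large samples described above matters.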

Recommendations and future studies
Based on the results of our systematic review and meta-analysis, we have the following recommendations for the field of SMTM:

Expand to other languages and populations
The results of this review indicated that most studies focused on English-language datasets. However, as we foresee an increase in the use of SMTM, it may become more important to ensure that all the voices on social media are being heard, especially if policymakers and researchers aim to infer the wellbeing levels of a particular region to aid (social) policies. A substantial number of people in a specific region may use a language other than the dominantly spoken language/dialect of a country to express their happiness and worries. Similar issues have been observed in other research areas, such as genetics, where inferences are predominantly based on European ancestry samples, worsening the existing health disparities between over- and underrepresented groups (Martin et al., 2019). In line with large-scale worldwide initiatives in the field of genetics, such as creating methods that are compatible with datasets from other ancestry populations (e.g., Multi ancestry Meta-Analysis; Turley et al., 2021) or compiling datasets representing diverse populations (e.g., 23andMe, All of Us, China Kadoorie Biobank), applications of multilingual SMTM and the collection of multilingual datasets may become important to make SMTM more inclusive (Hsu et al., 2021).

Combine data from different social media platforms
We observed that most datasets were acquired from Twitter, and only a single study has combined data from two platforms (Facebook and Twitter) (Jaidka et al., 2020). Making general population-level inferences based on single-platform data may result in misleading conclusions, given the presence of potential platform-specific sample selection mechanisms (e.g., females and younger individuals use Facebook more than males and older individuals; Blank & Lutz, 2017). The issue of non-representativeness can be relieved by collecting data from multiple social media platforms (to acquire a more comprehensive picture) and by applying existing de-biasing techniques (e.g., Giorgi et al., 2019, 2021; Z. Wang et al., 2019) if population-level conclusions are being drawn.
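As a simplified illustration of what de-biasing can involve, the sketch below applies basic post-stratification weighting: each user is weighted by the ratio of their group's share in the population to that group's share in the sample. This is a generic textbook technique, not the specific procedure of the studies cited above, and the groups, scores, and population shares are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_shares):
    """Weight each unit by population share divided by sample share."""
    counts = Counter(sample_strata)
    n = len(sample_strata)
    return [population_shares[s] / (counts[s] / n) for s in sample_strata]


def weighted_mean(values, weights):
    """Mean of values with each value multiplied by its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical sample in which younger users are overrepresented.
strata = ["young", "young", "young", "old"]
smtm_scores = [0.6, 0.7, 0.8, 0.4]

# Hypothetical population shares (e.g., taken from census data).
population = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(strata, population)
unweighted = sum(smtm_scores) / len(smtm_scores)
reweighted = weighted_mean(smtm_scores, weights)
print(unweighted, reweighted)
```

In this toy case the overrepresented group pulls the unweighted mean upward, and the weights move the regional estimate back towards what a representative sample would yield. Weighting can only correct for characteristics that are actually observed for the users.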

Collect social media data in existing cohorts
To further control for biases in social media data, researchers can request access to social media data from individuals who already participate in large-scale population-based studies such as the CDC's Behavioral Risk Factor Surveillance System (BRFSS; Johnson et al., 2014; Nelson et al., 2001), the UK Biobank (UKB; Sudlow et al., 2015), the Midlife in the United States national survey (MIDUS; Brim et al., 2004), the Health and Retirement Study (HRS; Juster & Suzman, 1995), and the National Longitudinal Study of Adolescent Health (Add Health; Harris, 2013). It is typical for these studies to already collect a wide array of (demographic) information from their participants (for instance, yearly). These population-based samples may be more representative of the general population, and by collecting social media data from such samples, potential sampling biases can be reduced.
Such large-scale datasets can be used for other purposes as well. For instance, the existing survey scores collected in the past can be easily augmented with the social media language of the same individuals from the same time point in the past. By doing so, the convergent validity of SMTM for various traits, including but not limited to wellbeing (e.g., depression, personality), can be investigated for multiple time points. In addition, continuously combining multiple types of data from the same individuals (e.g., survey, SMTM) makes a real-time assessment of wellbeing (and other traits) possible. Social media language features and survey scores can also be combined in the same model to improve predictive performance.
It should be noted, however, that linking these data, as well as collecting the text-based social media data of individuals, requires caution because of privacy concerns. Researchers must respect individuals' rights to privacy and ensure that all ethical requirements are sufficiently met.
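The convergent validity check across time points described in this section amounts to correlating SMTM estimates with survey scores separately for each wave. The sketch below shows this with a plain Pearson correlation; the wave labels, survey scores, and SMTM estimates are hypothetical and serve only to illustrate the data layout.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical cohort data: the same five participants at two waves.
waves = {
    "wave_1": {
        "survey": [3.0, 4.5, 2.0, 5.0, 3.5],
        "smtm": [0.10, 0.40, 0.00, 0.50, 0.20],
    },
    "wave_2": {
        "survey": [3.5, 4.0, 2.5, 4.5, 3.0],
        "smtm": [0.30, 0.20, 0.10, 0.40, 0.25],
    },
}

# Convergent validity per wave: correlation of SMTM with survey scores.
validity = {
    wave: pearson_r(data["survey"], data["smtm"])
    for wave, data in waves.items()
}
for wave, r in validity.items():
    print(wave, round(r, 2))
```

With real cohort data, the language for each wave would be restricted to the period around the corresponding survey administration so that the two measures refer to the same time window.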

Use open vocabulary and data-driven approaches
Most of the papers we examined use closed-vocabulary methods to extract language features. We recommend that researchers also use open-vocabulary methods and apply data-driven approaches, which can offer improved predictions compared with using only closed-vocabulary methods or word-level approaches.

Conclusion
The present qualitative synthesis and meta-analysis supported the value of SMTM to cost-efficiently assess wellbeing at both the individual and regional levels. SMTM can be used to assess past and present wellbeing. Applying SMTM to assess wellbeing in real time can eventually help develop personalized interventions to increase wellbeing, or help policymakers adjust their decisions to maximize the wellbeing of (inhabitants of) neighbourhoods, cities, and countries. The use of SMTM for assessing wellbeing may also provide new opportunities for researchers. For instance, individuals' wellbeing levels and their variation over time can be analysed in combination with other existing datasets, including survey, physiological, and laboratory measures.

Figure 1. PRISMA Flow Diagram of the included studies

Figure 2. Funnel plot based on the 32 included effect sizes in the meta-analysis

Figure 3. Meta-analytic estimates of the correlations observed between Social Media Text Mining (SMTM) and survey-based wellbeing assessments at both individual and location levels.

Table 2. Sample and design characteristics of the 45 reviewed studies.

Table 3. Results from the 45 reviewed studies.

of high levels of wellbeing
Enjoying weekend, being happy on Sunday, romance, friends and family gatherings, birthdays, parties at night, life and living
WB is highest in July and lowest in February 28
SMTM-PA decreases with shorter day length but not SMTM-NA 7
SMTM-WB increases as the temperature increases in winter and spring, no association after 30 degrees Celsius in summer 28
SMTM-WB decreases with rain, but not with snowfall 28
Decreases in SMTM-WB related to air pollution increases with bad weather 26
SMTM-WB, SMTM-PA, SMTM-NA = overall wellbeing, positive affect, and negative affect based on social media text mining.

Table 4. Overview of the highest achieved correlations reported for convergent validity.
SWLS = Satisfaction with Life Scale, PWBS = Ryff's Psychological Well-being scales, PANAS = Positive and Negative Affect Schedule, WHO-5 = World Health Organization Well-being Index, URRSAQ-LS = Urban and Rural Residents Social Attitudes Questionnaire - Life satisfaction, DASS-21 = The Depression, Anxiety, and Stress Scales, BRFSS-LS = The Behavioral Risk Factor Surveillance System - Life satisfaction, ISTAT = Italian National Institute of Statistics. QoL = Quality of Life (not a previously validated scale).