Computational Analysis of Superfood Representations in News Media

ABSTRACT What do berries, avocado, quinoa, and ginger have in common? These food items are often regarded as superfoods, a marketing term that overstates the importance of single food items for one’s health and wellbeing. In the present paper, we set out to investigate how purported superfoods are represented in the discourse of online news. We use computational language models to extract the unique topics and terms used to discuss superfoods. Our results show that news coverage is dominated by many specific claims about the healing properties of superfoods. The structural topic model further demonstrates that articles mentioning superfoods are more likely to include topics about a) nutrients, physical appearance, and health in the same context, b) retail strategies, and c) scientific research about the health benefits of superfoods. These results illustrate complex representations of superfoods in news media.

Even though many people strive to eat more healthily, there is little consensus about the what, when, and how of a healthy diet. Whilst plenty of guidance and recommendations from government and experts exist (Julia et al., 2021), discourse in online media is now dominated by diverse opinions and advice concerning what people "should" be eating to improve their health and wellbeing. Given that people's perceptions and behavior are shaped by the media representation of food healthfulness (Nagler, 2014; Oakes & Slotterback, 2001), it is essential to better understand what messages about food healthfulness online media perpetuate.
It is known that food marketing influences consumers' perceptions of food healthfulness (Chandon & Wansink, 2012; Plasek et al., 2021; World Health Organization, 2021). In fact, food and beverage companies invest heavily in healthy food marketing in response to consumer demand (Samoggia et al., 2020). The most recognized and researched types of food marketing strategies are those found on food packaging, such as health claims and symbols, package design, and branding (Plasek et al., 2020; Silchenko et al., 2020). However, large food and drink companies are also positioning themselves as nutritional educators (Garcia & Proffitt, 2021), often marketing specific "healthy" product categories and ingredients (Chandon & Wansink, 2012; Mintel, 2016). As a result, the online media discourse surrounding foods also contains marketing concepts disguised as healthy eating advice (MacGregor et al., 2021; Samoggia et al., 2020).
These online media messages have led to the marketing term "superfood" being popularized in everyday discourse (Delicato et al., 2019; Roth & Zawadzki, 2018), which is the focus of our paper. Broadly speaking, superfoods refer to foods naturally rich in macro-nutrients and various vitamins and minerals (Jagdale et al., 2021). This is despite no clear evidence linking a given food item to any health outcomes (Cloutier et al., 2013; Siipi, 2013; Thurecht et al., 2018). Moreover, the superfood hedonistic concepts in its research (Kāle & Agbozo, 2020). However, the question remains about whether this finding is replicable when analyzing the discourse in mainstream news media, and whether it would generalize to other known superfoods.
In this paper, we leverage the latest methods from natural language processing (NLP) to analyze online news articles about known superfoods in different contexts. Our chosen corpus of online news articles (News on the Web, NOW) captures the representation of superfoods in the US and UK media over a 10-year period (2010–2020). To establish unique features of superfood-specific language, we focus on news coverage that contains mentions of foods typically associated with superfoods. We then compare language use between those articles where the word superfood occurs and those where it is absent. In summary, we conduct three analyses to extract unique features of the superfood-specific language. First, we explore commonalities in language between superfood articles and descriptions of superfoods provided by a sample of participants. Next, we use two computational techniques to establish predictive features of articles written about superfoods. The first of these computational techniques is text classification. For this analysis, we formally compare articles written in a superfood context with closely matched articles written in a non-superfood context. Therefore, our classifier is forced to rely on subtle differences in language when making predictions as to whether an article is about superfoods or not. An advantage of this approach is that we can uncover the most important words, used as predictors by the trained classifier, to distinguish the representation of a food item in a superfood-specific context. Next, to gain further insight into the interaction between various words and concepts related to superfoods in online news media, we used structural topic modeling (STM). Topic modeling allows us to formally identify latent themes and topics in our sample of news articles about various foods. We use structural topic modeling specifically as it allows us to compare the likelihood of each topic appearing in superfood or non-superfood articles. Finally, as an additional comparison, we replicated both of the above-mentioned computational analysis techniques using articles mentioning "organic" in place of "superfood." The purpose of this is to ascertain whether we can successfully capture the unique representation of superfoods, rather than a general concept of healthiness.
In short, the following paper makes three novel contributions. First, it presents a unique computational analysis of news media, offering a detailed insight into how the concept of superfood is portrayed. This is important given the lack of consensus about the definition of this term on the one hand, and its prominence in everyday discourse and marketing on the other hand. Second, this paper employs classification and structural topic modeling to directly compare superfood-specific language with the content pertaining to organic foods. This analysis allows us to explore the unique meaning ascribed to the term "superfood" in the broader space of categories and labels used in food-related discourse. Lastly, our study demonstrates the value of using language models to study latent representations of psychologically relevant constructs and topics (cf. Demszky et al., 2023; Gandhi et al., 2022; Wulff & Mata, 2023).

Corpus selection
The corpus of online news articles used in this paper was taken from the NOW Corpus (http://corpus.byu.edu/now/), which is a collection of online newspaper and magazine articles, maintained by Mark Davies at Brigham Young University. This is the only English-language corpus that is larger than a billion words (Davies, 2017), and so was most suitable for exploring a niche topic like superfoods. The metadata for this corpus includes "article ID," "word count," "date," "country," "news outlet," "URL" and "title." Specifically, for our analysis, we used a static local copy of the NOW Corpus, accessed in May 2020, which covers the period from January 2010 to February 2020. Only articles published in the United States of America and Great Britain were analyzed.

Identifying food names
We compiled a list of food names by downloading data from the U.S. Department of Agriculture (2019) Food Composition Data, and McCance and Widdowson's Composition of Foods Integrated Dataset (Public Health England, 2021), as both are the official sources of information about commonly consumed foods in the USA and UK respectively. This list was used in an initial pre-processing step to retain only articles that contained food names.

Participant survey data
We conducted a short online survey on Prolific Academic, in which we asked participants a series of questions about their perceptions of superfoods. Details of this study are provided in the Appendix, but here we note that we used some of our participants' responses to identify food names that are most associated with superfoods. More specifically, we asked each participant in our study (but only those who indicated that they were familiar with the term superfood) to name at least five superfoods. We then used the 25 most frequent responses to select online news articles for our computational analyses.

Transparency and openness
All data and code relating to the participant survey are available at https://tinyurl.com/39p8uwff. This study's design and its analysis were not pre-registered. The NOW Corpus can be purchased from http://corpus.byu.edu/now/.

Data pre-processing and article selection
All steps to extract and clean the relevant sample of online news articles were performed using R, version 4.1.0 (R Core Team, 2021). The R package "spacyr" (Benoit & Matsuo, 2020) was used for data pre-processing. We started with a sample of 13,871,016 news articles published in the United States of America and Great Britain during the period specified. First, we removed HTML tags, URL links, and non-alphabetical characters (e.g., special characters, numbers, and all punctuation except for hyphens), then standardized all text and titles of the news articles to lowercase. Next, we selected only news articles that contained the word "food" or "diet" either in the text or title. At this stage, there were 779,919 news articles. We then removed all articles that either had duplicated article IDs or that had identical article titles from the same news outlet. Following this, we retained only articles that mentioned at least one of the food names in our food name list (see Appendix for detail), resulting in 547,568 food-related news articles. Our next data pre-processing steps involved tokenization (splitting article texts into single word units, for example, breaking down the sentence "Kale is a superfood" into ["Kale," "is," "a," "superfood"]), removing stop words (frequent words that provide little information, e.g., "and," "in" and "the"), and part-of-speech tagging (to identify a word's function within a sentence, e.g., "green" as an adjective). Part-of-speech tagging allowed us to select the most meaningful parts of speech (nouns, adjectives, and verbs) from the articles. We also performed lemmatization, turning words representing the same concepts into their base form (e.g., "energizing" and "energized" both became "energize"). Next, we replaced all possible spellings of the word superfood (e.g., "superfood," "super-food," "super foods") with the first spelling for consistency. This then allowed us to identify the 1,169 articles that mentioned the word "superfood" at least once in the article text or title, and the 216,769 that did not. We then further subset the data to only include articles that specifically mentioned the 25 superfood names given by participants in the online survey mentioned above. This resulted in a final sample of 57,853 news articles, with 872 articles that mentioned the word superfood at least once (henceforth referred to as superfood articles). In the last step, all articles were tagged with a dummy variable to denote their status as either a superfood or non-superfood article.
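The cleaning steps above can be illustrated with a minimal sketch. It is written in Python with the standard library only, whereas the original pipeline used R and "spacyr"; the function names, the tiny stop-word list, and the regular expressions are assumptions made for illustration, and part-of-speech tagging and lemmatization are omitted.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the"}

# Matches "superfood", "super-food", "super food" and their plurals
# in already lowercased text.
SUPERFOOD_VARIANTS = re.compile(r"\bsuper[\s-]?foods?\b")


def clean_text(raw):
    """Strip HTML tags and URLs, lowercase, and unify 'superfood' spellings."""
    text = re.sub(r"<[^>]+>", " ", raw)        # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # drop URL links
    text = text.lower()
    text = SUPERFOOD_VARIANTS.sub("superfood", text)
    text = re.sub(r"[^a-z\s-]", " ", text)     # keep letters, whitespace, hyphens
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text on whitespace and remove stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def is_superfood_article(tokens):
    """Dummy variable: True if the article mentions 'superfood' at least once."""
    return "superfood" in tokens


tokens = tokenize(clean_text("<p>Kale is a Super-Food! See http://example.com</p>"))
print(tokens, is_superfood_article(tokens))
```

In this sketch the spelling variants are unified before punctuation is stripped; because hyphens are preserved by the cleaning step, the order of these two operations does not change the result.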

Word frequency: a bag-of-words model
As an initial exploration of our corpus, we used a word counting technique known as a bag-of-words model to provide a simplified representation of our superfood articles sample (see Kowsari et al., 2019). The bag-of-words approach was chosen over other known frequency statistics (such as tf-idf and the weighted log odds ratio) because these alternative methods would give priority to unique and obscure words even when found in only a small number of articles. Instead, we were interested in finding the most common words across all articles in our chosen context.
For this analysis, we included additional data pre-processing steps. The same steps were replicated when pre-processing participants' survey responses to two questions asking them to describe or define a superfood, for comparative purposes. Further details are provided in the Appendix.
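As a rough illustration of the bag-of-words idea, the sketch below counts word occurrences across a set of already tokenized articles and returns the most frequent ones. It is a Python sketch rather than the authors' R code, and the function name and toy articles are hypothetical.

```python
from collections import Counter


def top_words(tokenized_articles, n=10):
    """Return the n most frequent words across all articles (word order ignored)."""
    counts = Counter()
    for tokens in tokenized_articles:
        counts.update(tokens)
    return counts.most_common(n)


# Toy data, invented for illustration.
articles = [
    ["kale", "healthy", "superfood"],
    ["healthy", "berry"],
    ["healthy", "kale"],
]
print(top_words(articles, n=2))
```

Unlike tf-idf, raw counts of this kind do not down-weight words that appear in many articles, which is the property the analysis relies on.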

Text classification
Text classification is a supervised machine learning technique that allows us to predict whether an article is written in a superfood context or not. A by-product of this approach is that we can subsequently identify the terms that underlie the classifier's predictions. Consequently, text classification can enable us to make inferences about concepts that are most likely to be associated with superfoods in a sample of articles discussing the same group of foods.
For text classification, it is optimal to have a balanced distribution of articles in each group. However, our dataset was highly unbalanced, with the superfood articles making up only 1% of the sample. In addition, the much larger "non-superfood" class likely included articles that only scarcely referenced the food items of interest. Therefore, to down-sample the non-superfood articles and balance the corpus, we used propensity score matching (Ho et al., 2011; Rubin & Rosenbaum, 1985). We obtained the propensity scores for each sample using a generalized linear model with a logit link function. We regressed the article class (i.e., superfood mentioning or not) on the counts of the top 25 superfoods selected by survey participants. The premise was that the logit model could estimate the probability of each article being a superfood or non-superfood article, and the predicted probabilities (propensity scores) would reveal how well each non-superfood article could serve as a viable counterfactual (or replacement) for a given superfood article. All superfood articles were paired with their nearest neighbor, that is, the non-superfood article with the closest propensity score. Note that each match was independent, and thus the same non-superfood article could be matched to several superfood articles (greedy matching). All unmatched non-superfood articles were then discarded from the sample, allowing for maximum homogeneity between the two comparison groups and a reduced sample size. In total there were 872 pairs of articles in our text classification sample, 10% of which were held out for model evaluation in each fold of the cross-validation (as explained below).
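The matching step can be sketched as follows, assuming the propensity scores have already been estimated by the logit model. This is an illustrative Python sketch, not the authors' R implementation; the function name and the toy scores are hypothetical.

```python
def match_nearest(superfood_scores, non_superfood_scores):
    """For each superfood article, return the index of the non-superfood
    article with the closest propensity score. Each match is made
    independently, so one non-superfood article can be chosen repeatedly."""
    matches = []
    for score in superfood_scores:
        best = min(
            range(len(non_superfood_scores)),
            key=lambda j: abs(non_superfood_scores[j] - score),
        )
        matches.append(best)
    return matches


# Toy propensity scores, invented for illustration.
superfood = [0.80, 0.78, 0.20]
non_superfood = [0.10, 0.25, 0.79, 0.50]

matches = match_nearest(superfood, non_superfood)
retained = sorted(set(matches))  # unmatched non-superfood articles are discarded
print(matches, retained)
```

In the toy data the first two superfood articles share the same match, which mirrors the note above that a non-superfood article can serve as the counterpart of several superfood articles.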
Next, we trained a text classification model to discriminate between the superfood and non-superfood articles in our balanced sample. Specifically, we used the logistic regression classifier (also known as the maximum entropy classifier), a linear model often used for text classification tasks due to its interpretability. We chose to encode the data with unigrams, bigrams, and trigrams. An n-gram is a sequence of n consecutive words. Previous studies have shown that increasing the n-gram range leads to better performance in a variety of text classification tasks (Bharadwaj & Shao, 2019; Shah et al., 2018). For the purpose of our research question, more complex n-grams may also be better suited for capturing unique contextual information that defines discourse surrounding superfoods. For example, the meaning of the word "superfood" in "healthy superfood" and "expensive superfood" is the same if only unigrams are used. However, the two bigrams can have a different meaning, despite both mentioning superfoods. We represented the data using a document-feature matrix, where each column represented an n-gram, each row represented a news article, and each cell contained the number of times that n-gram occurred in the article. Take a single hypothetical sentence "Superfoods are very good." This sentence would be represented by nine columns in the document-feature matrix: four corresponding to the unigrams ("superfoods," "are," "very," and "good"), three corresponding to the bigrams ("superfoods are," "are very," and "very good"), and two for the trigrams ("superfoods are very," "are very good").
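The n-gram encoding can be made concrete with a short sketch. It is written in Python for illustration (the original analysis was run in R), and the function names are hypothetical. For the example sentence it produces four unigrams, three bigrams, and two trigrams.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def document_feature_matrix(docs):
    """Rows are documents, columns are n-grams, cells are occurrence counts."""
    counts = [Counter(ngrams(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[gram] for gram in vocab] for c in counts]
    return vocab, matrix


vocab, matrix = document_feature_matrix([["superfoods", "are", "very", "good"]])
print(len(vocab))  # number of columns for the example sentence
print(vocab)
```

With more than one article, n-grams that are absent from a given article simply receive a count of zero in that row.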
A common issue in machine learning models is overfitting. To mitigate this, we applied an L1 penalty, which shrinks the majority of irrelevant coefficients to 0 (see Footnote 1). When training the regularized logistic regression, it was important to choose the optimal strength of the penalty imposed on the L1 norm of the model's coefficients. We tuned the regularization strength parameter by running a grid search over a range of values and evaluating the average F1 score on a 10-fold cross-validation split for each of them. The F1 score is the harmonic mean of the model's precision and recall, further discussed in the results section. The entire fitting procedure was implemented using the "glmnet" package in R (Simon et al., 2011).
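The tuning loop can be outlined as below. This is only a skeleton in Python under stated assumptions: `fit_and_score` is a hypothetical callable that would fit the penalized model on the training indices and return the F1 score on the held-out indices, and the fold assignment shown here is a simple interleaved split rather than whatever split the original R code produced.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, held_out) index lists for k interleaved folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_items) if i not in held]
        yield train, held_out


def grid_search(n_items, penalties, fit_and_score, k=10):
    """Return the penalty with the highest mean cross-validated score."""
    mean_scores = {}
    for lam in penalties:
        scores = [
            fit_and_score(lam, train, held_out)
            for train, held_out in k_fold_indices(n_items, k)
        ]
        mean_scores[lam] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


# Stand-in scorer for demonstration only; a real one would fit a model.
def toy_scorer(lam, train, held_out):
    return 1 - abs(lam - 0.1)


print(grid_search(20, [0.01, 0.1, 1.0], toy_scorer))
```

With k = 10, each fold holds out one tenth of the articles, which corresponds to the 10% held-out share described for the matched sample.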
As an additional analysis, we repeated all the aforementioned steps replacing "superfood" with the word "organic."The reason was to ascertain whether the representation of superfoods is unique or reflects an overarching perception of food healthiness.Specifically, using the sample of 25 superfoods given by participants, we marked news articles that mentioned the word "organic" vs. those that did not.

Topic modeling
Finally, to examine whether the differences identified by the classifier generalized to broader, latent themes found in the entire subset of the news articles corpus, we used the Structural Topic Model (STM) (Roberts et al., 2014, 2016). Topic modeling refers to a family of unsupervised statistical learning techniques that identify underlying latent semantic structures characterized by the frequent occurrence of a vocabulary subset in a corpus of natural texts. In the past, topic modeling has been applied to analyze corpora from a variety of areas, such as social media discourse (Zamani et al., 2020), financial news (Bybee et al., 2020), and historical texts (Barron et al., 2018).
Topic modeling relies on several assumptions, which enabled us to extract topics from our newspaper corpus. One such assumption is that each document is composed of a mixture of topics, and each topic is formed using a probability distribution of multiple words. In the same manner as the bag-of-words model, it also assumes that there is no order to the words in a document and that documents are independent. The distinguishing feature of STM is that, while building on the basic idea of probabilistic topic modeling, it allowed us to incorporate document-level covariates (or metadata) into the model's structure (Roberts et al., 2019). STM was therefore most appropriate for our goal of quantifying the effect of our dummy variable (superfood article or not) on the topical structure of the news articles. As such, STM allowed us to identify topics present across all articles that were more prominent when the term "superfood" was used.
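The mixture assumption can be illustrated with a toy calculation: the probability of a word in a document is the sum, over topics, of the document's topic proportion multiplied by the topic's probability for that word. The Python sketch below uses invented topics and proportions and does not reproduce how STM estimates these quantities.

```python
def word_probabilities(topic_proportions, topic_word_probs):
    """Mix per-topic word distributions using a document's topic proportions."""
    vocab = sorted({word for dist in topic_word_probs for word in dist})
    return {
        word: sum(
            theta * dist.get(word, 0.0)
            for theta, dist in zip(topic_proportions, topic_word_probs)
        )
        for word in vocab
    }


# Two hypothetical topics (word distributions) and one document's mixture.
topics = [
    {"diet": 0.6, "vitamin": 0.4},   # a "diet"-like topic
    {"store": 0.7, "price": 0.3},    # a "retail"-like topic
]
proportions = [0.75, 0.25]

print(word_probabilities(proportions, topics))
```

In STM, a document-level covariate such as the superfood dummy variable is allowed to shift the expected topic proportions, which is what makes the comparison between superfood and non-superfood articles possible.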
To facilitate model estimation, we further subset our data to only include articles mentioning any of the superfood items specified by the survey participants more than twice, resulting in a sample of 18,219 non-superfood articles and 577 superfood articles. Since the number of the latent topics to be estimated must be specified a priori, we conducted a grid search over a range of values and chose the highest number (K = 12) that offered a notable improvement in the exclusivity score calculated over the top 10 words in each of the topics. A word is said to be exclusive to a given topic if it has a high probability of appearing in the topic and a low probability of appearing in the other topics estimated by the model (Roberts et al., 2014). The exclusivity of a model is the aggregated exclusivity of the top N words for each of the topics. The details of the parameter search can be found in Figure A1 of the Appendix.
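1 Formally, regularized logistic regression solves the following optimization problem: $$\min_{\beta_0, \beta} \; -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \left( \beta_0 + x_i^{\top} \beta \right) - \log\left( 1 + e^{\beta_0 + x_i^{\top} \beta} \right) \right] + \lambda \lVert \beta \rVert_1,$$ where $y_i \in \{0, 1\}$ is the class of article $i$, $x_i$ is its vector of n-gram counts, and $\lambda \lVert \beta \rVert_1$ is the regularization penalty imposed on the model's coefficients.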

Superfood names
In total, there were 115 unique superfood names listed by our survey participants (for the full list, see the OSF repository associated with this project: https://tinyurl.com/39p8uwff), with 51 given by at least two participants. As can be seen from Figure 1, these 51 foods identified as superfoods belong to a wide range of food categories from Vegetables and Vegetable Products (e.g., kale and spinach) to Spices and Herbs (e.g., ginger and turmeric), and Sweets (e.g., dark chocolate). Unsurprisingly, most of these foods were from the "Vegetables and Vegetable Products" category and the "Fruit and Fruit Juices" category with "blueberry," "avocado," and "kale" being the most frequently mentioned (53, 46 and 42 mentions, respectively). The top 25 food names mentioned by participants, and the food names subsequently used for the computational analysis, are highlighted in bold on the X-axis of Figure 1. These foods are listed in descending order of word frequency and include blueberry, avocado, kale, goji, quinoa, spinach, chia seed, nut, broccoli, acai, ginger, egg, berry, fish, sweet potato, beet, green tea, spirulina, almond, pomegranate, salmon, turmeric, wheatgrass, yogurt, and oats.

Word frequency
Our first analysis concerns the distribution of the most frequent words in superfood articles. These are presented in Figure 2 alongside data from our survey participants, separately for adjectives, nouns, verbs, and bigrams. Even at a glance, the concept of health is most prominent, but we can also see several references to the sensory properties of foods, naturalness, weight control, and scientific research.
To uncover the specific words that underlie predictions about an article being about superfoods, we used text classification. Note that the use of propensity score matching means each of the superfood articles was compared against a non-superfood article that was highly similar in context (determined by similar counts of the top 25 foods selected by survey participants). For robustness purposes, we replicated this analysis by comparing articles that mentioned the word "organic" (organic articles) vs those that did not (non-organic articles). Table 1 shows the performance of the classifier model on the out-of-sample dataset. We summarize the results in terms of three commonly used classification metrics: accuracy, precision, and recall. Accuracy is simply the proportion of cases predicted to be in their true respective class. Precision refers to the number of examples correctly predicted as belonging to a given class, as a proportion of all examples predicted to belong to that class. Recall (also known as sensitivity) represents the number of observations correctly predicted as a given class divided by the total number of observations truly belonging to that class. As seen in Table 1, our classifier has an accuracy rate of 67% for superfood articles and 72% for organic articles, meaning that it can classify these online news articles better than chance (50% in a balanced sample). As a result, we can conclude that our classifier model can sufficiently pick up linguistic differences between articles written about foods in a superfood context or not, as well as in an organic context or not.
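The metrics can be computed directly from true and predicted labels, as in the following Python sketch. The function name and the toy labels are hypothetical, and the F1 score is included because it was the criterion used during tuning.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # correct / predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0     # correct / truly positive
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 1 = superfood article, 0 = non-superfood article.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
print(classification_metrics(y_true, y_pred))
```

In the toy example, precision (2 of 3 predicted positives are correct) differs from recall (2 of 4 true positives are found), which shows why the two denominators must be kept apart.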

Text classification
A benefit of our approach is that we can use the classifier model to find the n-grams with the highest probability of being in a news article classified as a superfood article or an organic food article. Figure 3 presents the top 100 n-grams, scaled to be proportional to the log-odds of the corresponding coefficients for each group.
Looking first at the superfood classifier n-grams in Figure 3, the first thing to note is an overlap between the n-grams produced using this approach and the words obtained from the simpler word-counting technique in Figure 2. The relationship between superfoods and health is yet again prevalent. For example, many n-grams reference the nutritional composition of foods such as "source (of) potassium," "high protein," "nutrient-rich," "mineral (and) vitamin," "vitamin c (and) e," "monounsaturated fat," "low sugar," "anti-oxidant," and "good/bad cholesterol." The classifier also picks up on the discourse surrounding superfoods and illnesses, with n-grams predictive of superfood articles including "heart disease percent," "cancer percent/fight," "boost immune system," "carcinogen" and "reduce stress." Additionally, the unigram "premature" is highly predictive but it includes two uses of the term in two separate contexts: premature death (for 16 of the superfood articles) and premature aging (for 13 of the superfood articles). Other n-grams that show a relationship with beauty include "protect skin," "facial," "hydrating," "regime," all of which were not seen in the word frequency (bag-of-words) analysis. In terms of sensory attributes, "tasty" is the only n-gram predictive of a superfood article, and the only reference to natural content is "unpasteurized" or "grass feed." We also detect nuances in language such as the paradoxical nature of foods marketed as superfoods, with predictive n-grams indicating a local origin ("locally sourced") but also their exotic nature ("ancient," "Peruvian" and "South America").
In comparison, there is little crossover between n-grams predictive of superfood articles and those predictive of organic articles, even though both terms are typically perceived as cues to healthiness. Instead, articles written in the context of organic foods are predominantly centered around food production methods, the concept of naturalness, and foods' environmental impact. The most predictive n-gram was "certified", suggesting that articles about organic foods may highlight the necessary government standards and regulations that must be met for a food to be labeled organic. This is followed by "natural food", which along with "wholesome", "pure", "whole food" as well as "decay" and "rot", suggests a strong link between organic foods and naturalness. Similarly, there is an emphasis on the local aspect of foods with n-grams like "locally source(d)", "local food/farm", "small farm", and "farmer(s) market" also being predictive of news articles about organic foods. Convenience or ease of access is also implied, as "delivery service", "open restaurant" and "metropolitan" may suggest. The environmental and ethical impact is another theme that stands out, with direct references to "eco/environmentally friendly", "sustainable food", "ethical", "good quality" and "cooperative". Moreover, while the health benefits of foods are alluded to with descriptions like "healthful" and "medicinal", the focus is on the presence or absence of chemicals (e.g., "fertilizer", "pollutant", "contaminant"). From this, we can conclude that our model successfully picks up on representations of healthiness that are unique to superfoods, rather than common to health-related labels in general, such as organic.

Topic modeling
To confirm and extend the findings from the text classification, we used STM to extract topics. Essentially, STM enables automated discovery of differences in latent themes and topics between articles written about the same 25 foods from different contexts. Figure 4 summarizes the estimated effect of the presence of the "superfood" or "organic" term on the difference in proportion of a latent topic appearing in the articles. A positive value on the X-axis indicates a larger prevalence of a given topic in superfood articles (Panel A) or organic articles (Panel B). Our two reported topic models were fit with the same 12 topics as determined by the grid search (see Methods for detail). Topic labels were assigned by using the top 10 keywords most associated with the given topic.
In articles written about the same 25 foods, what group of words (constituting a topic) is most likely to occur when the term "superfood" is mentioned? As shown in Panel A of Figure 4, the topic relating to diet and weight stands out, with a 25.72% (95% CI [23.20%, 28.15%]) higher likelihood of occurring in superfood articles. The words most indicative of this topic were "diet, fat, healthy, help, health, protein, vitamin, weight, body, and sugar." Notably, this topic includes both terms related to food nutrients ("vitamin," "protein," and arguably "fat") and appearance ("weight," "shape"). The fact that these terms appear alongside "health" as one of the most representative words suggests that the model detected a relationship in discourse between diet, appearance, and health. In comparison, Panel B shows that this topic was the third most prevalent in organic articles relative to non-organic articles. Moreover, it was only 2.90% (95% CI [1.78%, 4.03%]) more likely to occur in an organic context. Given that this topic is considerably more prominent in the discussion around superfoods, one could infer that this association is pivotal in the representation of superfoods in the media.
Perhaps less surprising was the higher likelihood of retailing concepts appearing in superfood articles relative to non-superfood articles about the same foods. However, although this topic was the second most prevalent in superfood articles, the mean difference was much smaller compared with the first (diet, appearance, and weight) topic at 0.02 (95% CI [0.00, 0.04]). The 10 words that formed our interpretation of this topic consisted of "product, company, market, store, sell, uk, business, price, buy, and consumer." This retailing topic was also the second most likely to appear in organic food articles in proportion to non-organic food articles, but was slightly more likely to occur in organic articles than superfood articles (mean difference of 0.04, 95% CI [0.03, 0.05]).
The third topic more likely (by 1.94%, 95% CI [0.04%, 3.84%]) to appear in superfood articles vs non-superfood articles was one associated with scientific research. Keywords constituting this topic such as "study," "disease," "cancer," "health," and "research" had also been present in the findings of our previously mentioned computational analysis techniques. Again, having the word "health" within the top 10 words of this topic suggests a discourse where scientific evidence is given to suggest a relationship between these foods and health or disease. Interestingly, despite more debate and scientific research conducted to assess the relationship between organic foods and health, this topic of research was less likely (by −0.97%, 95% CI [−1.92%, 0.03%]) to occur in organic news articles than non-organic articles. This contrast in discourse demonstrates another distinction between representations of superfoods and organic foods in the media.
Our topic modeling approach also reveals that the concept of cooking, "cook, add, minute, recipe, salt, heat, serve, water, bowl, pan," was slightly more likely to be found in articles written about superfoods, by 0.75% (95% CI [−1.14%, 2.71%]). This was the topic with the largest difference in ranking between superfood and organic contexts, shown from a visual comparison between the superfood (Panel A) and organic (Panel B) plots. We can also see that this cooking topic was ranked higher than the topic on eating out "restaurant, menu, dish, chef, serve, wine, bar, open, taste and meal," a topic found to be less likely to appear in superfood articles than non-superfood articles (by −3.26%, 95% CI [−5.36%, −1.05%]). Thus, these findings suggest superfoods are more likely to be promoted as ingredients for home cooking, rather than as a treat when dining out. Conversely, the opposite is true for organic foods, where organic is more likely (by 0.84%, 95% CI [−0.37%, 2.03%]) to be discussed in the same context as eating out, but less likely (by −2.28%, 95% CI [−3.38%, −1.23%]) to be mentioned in references to recipes.
Contrary to expectations, a topic relating to naturalness was only 0.34% (95% CI [−1.03%, 1.90%]) more likely to occur in superfood articles relative to non-superfood contexts. The words used to define this topic were "farm, grow, plant, water, farmer, produce, crop, animal, feed, production." On the other hand, this topic was ranked first for likelihood (at 8.30%, 95% CI [7.41%, 9.17%]) in organic vs non-organic articles, supporting the assumption that this finding is due to representation differences between these two contexts. It is also worth noting that the keywords relating to naturalness all appear to relate specifically to food production methods, with terms relating to the rawness or purity not detected by the topic model.
One may also wonder about the topics least likely to occur about the 25 foods in a superfood context, and how these differ from the topics least likely to occur in an organic context. Figure 4 shows us that the majority of the topics (7 out of the 12) were found in articles where the term superfood was not mentioned, with the same being true for the organic comparison. Of these topics, the one referring to the fishing industry was least likely (mean difference of −0.05, 95% CI [−0.07, −0.04]) to be mentioned in a superfood context, defined by the topic label "water, river, lake, sea, fishing, catch, boat, fisherman, whale, ocean." This was also the topic least likely to occur in articles about organic foods relative to non-organic foods, but with a mean difference of −0.03 (95% CI [−0.04, −0.02]). The second least likely topic in superfood articles (mean difference of −0.05, 95% CI [−0.06, −0.04]) consisted of words such as "state, photo, family, city, school, community, country," which appears to reflect a discussion of social factors surrounding the consumption of our sample of foods. The same topic was ranked similarly (third-least likely) in the organic comparison (mean difference of −0.02, 95% CI [−0.03, −0.02]). Interestingly, the third least likely topic (mean difference of −0.05, 95% CI [−0.06, −0.04]) to occur in a superfood context demonstrates a focus on the unsustainability of fishing practices and environmental consequences (consisting of unique words such as "climate, change, fishery, population"). Along similar lines was a topic referencing human-wildlife coexistence (mean difference of −0.04, 95% CI [−0.06, −0.03]), with the words most representative of the topic being "animal, bird, specie(s), human, bear, female, live, wildlife, male, insect." These two topics were also less likely to occur in an organic context but had a slightly higher mean difference (−0.02, 95% CI [−0.03, −0.01] and −0.02, 95% CI [−0.03, −0.01], respectively) in comparison to superfood vs non-superfood articles. The next topic more likely to be mentioned outside of a superfood context (by −4.02%, 95% CI [−5.82%, −2.32%]) seems to reflect a discourse relating to food tourism, with words including "island, hotel, local, city, place, beach, town, old, around, visit." For organic articles, this food tourism topic was equally likely to occur relative to non-organic articles (mean difference of 0.00, 95% CI [−0.01, 0.01]). Last was a more abstract topic of words denoting the writer's personal perspective ("think, want, really, life, back, feel, love, never"), which had a small mean difference of −0.04 (95% CI [−0.05, −0.03]). By comparison, this same topic was the second least likely to appear in an organic context (mean difference of −0.03, 95% CI [−0.04, −0.02]).
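To make the reading of these estimates concrete, the following is a minimal sketch of how a difference in mean topic proportion and its 95% confidence interval can be computed and interpreted. It is an illustration only, not the estimation procedure used in this study: it assumes a simple normal-approximation interval on two groups of per-document topic proportions, and the function names and the toy proportions are hypothetical. The final lines apply the same zero-exclusion check to two intervals reported above (cooking and eating out, in percentage points).

```python
import math
from statistics import mean, variance


def prevalence_difference(theta_in, theta_out, z=1.96):
    """Difference in mean topic proportion (in-context minus out-of-context)
    with a normal-approximation confidence interval.

    Simplifying assumption: independent groups and an unpooled standard error.
    """
    diff = mean(theta_in) - mean(theta_out)
    se = math.sqrt(
        variance(theta_in) / len(theta_in) + variance(theta_out) / len(theta_out)
    )
    return diff, diff - z * se, diff + z * se


def excludes_zero(lower, upper):
    """True when the interval lies entirely above or entirely below zero."""
    return lower > 0 or upper < 0


# Hypothetical per-document proportions for one topic (not data from the study).
superfood_docs = [0.10, 0.12, 0.08, 0.11, 0.09]
other_docs = [0.05, 0.07, 0.06, 0.04, 0.08]

diff, lower, upper = prevalence_difference(superfood_docs, other_docs)
print(f"difference = {diff:.4f}, 95% CI [{lower:.4f}, {upper:.4f}]")

# Intervals reported in the text, in percentage points.
print("cooking topic excludes zero:", excludes_zero(-1.14, 2.71))
print("eating-out topic excludes zero:", excludes_zero(-5.36, -1.05))
```

Under this reading, an interval that spans zero (as for the cooking topic) indicates a difference that cannot be distinguished from no difference, whereas an interval entirely below zero (as for the eating-out topic) indicates a topic that is reliably less prevalent in superfood articles.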

Discussion
Our computational approach makes several contributions to our understanding of superfood representation in online news media. Through a series of comparisons (with participant data, and with articles about the same foods in either a non-superfood context or an organic context), we extracted the words, concepts, and topics most strongly associated with the term superfood. First, we found a unique emphasis on the relationship between individual foods and health benefits in superfood articles. Second, we observe a distinct use of medical terminology (such as "cancer," "immune system," "heart disease," "risk") in a superfood context, which is notably absent in the representation of organic foods. Third, against our expectations, terms stressing the naturalness and environmental impact of these foods were infrequent in the characterization of superfoods. As a whole, considering that superfood has no official definition, our findings offer a deeper understanding of the concept. Given the role media plays in shaping people's beliefs and attitudes, our results also contribute to our understanding of the origins of misconceptions concerning the health and well-being benefits of superfoods.
Although a link between superfoods and health benefits is expected, our results provide support for previous research findings in a data-driven manner (Franco Lucas et al., 2021; Loyer, 2016; Rojas-Rivas et al., 2019). All three of our bottom-up approaches found "health," "healthy," and specific nutrients such as "protein," "sugar," and "vitamin" to have the largest association with the term "superfood." Such consistency of findings between our three bottom-up techniques demonstrates the robustness of computational approaches in uncovering superfood representations. Indeed, drawing attention to isolated compounds present in foods is not new to superfood advertising (Scrinis, 2013); in fact, it was a highly successful marketing strategy on food packaging until unfounded nutrient claims were banned in the 1990s (Goldberg & Sliwa, 2011; Silchenko et al., 2020). Nonetheless, a strength of our topic modeling approach is that we can now see the extent of the association between this nutrient-focused conceptualization of health and superfoods. As such, we observe that the association of superfoods with health considerably exceeds the association of organic with health, and even of organic with naturalness, in media discourse. While we cannot use the present results to claim that this superfood representation in the media directly influences consumer perceptions, similar language in participant responses does suggest individuals are at least aware of the same association. Furthermore, we found that mentions of health, various nutrients, weight, and appearance emerge within the same topic (taken from our topic modeling analysis). This implies a discourse where superfoods are touted for weight loss as part of health messaging (Rodney, 2018; Sikka, 2019), despite scientific evidence establishing that weight is a poor indicator of health (Frederick et al., 2020; Saguy & Almeling, 2008). Considering that nutrient-focused marketing detracts from the recommended "total diet approach" to healthy eating (Freeland-Graves & Nitzke, 2013), plus the implications of a suggested linear relationship between weight and health for disordered eating behaviors (Frederick et al., 2020; Pilař et al., 2021), our findings highlight a need to further extend health and nutrition claim regulation to the online media marketing of food items.
Our results also draw attention to a medicinal representation of foods, centered around disease prevention, which is more likely to occur in a superfood context. Cancer is the most frequently associated disease in our article sample about superfoods, with a slight emphasis on breast cancer. However, in the words of Cancer Research (2020), "there is no good evidence that any one food prevents cancer, including superfoods." Most of the research on individual foods that is reported in the media comes from animal studies (Jagdale et al., 2021), in vitro (outside of a living organism) studies (Šamec et al., 2019), or single studies that should be interpreted with caution (Ladher, 2016). This is also true for the other diseases mentioned in our superfood articles (e.g., heart disease). Thus, the ability of our untrained model to identify a representation of superfoods based on weak evidence (Inoue-Choi et al., 2013) reinforces the role of the media in creating confusion about healthy eating (Hackman & Moe, 1999; Nagler, 2014; Weitkamp & Eidsvaag, 2014). Again, as evident from the language reflected in participant responses, the superfood discourse in the media may therefore help explain the discrepancy between people's inaccurate beliefs about food healthiness and official dietary guidance.
Contrary to the entwined relationship between health and naturalness found in previous research (Gandhi et al., 2022; Loyer, 2016; Michel et al., 2021; Perkovic et al., 2021; Roman et al., 2017; Siipi, 2013), naturalness was not a concept stressed in the online news article representation of superfoods. This is more surprising because naturalness representation was detected in participant responses, as well as in the comparative analysis of organic food articles. One possible explanation is that the relationship between chosen superfoods and naturalness can be assumed by design, whereas the same food can be sold as organic or not, and thus naturalness associations would need highlighting in organic discourse. Another factor to consider is that superfoods are often sold in supplement form, involving high levels of processing, and so claims about their naturalness may appear contradictory.
The general lack of coverage concerning the social and environmental consequences of superfoods in online news articles may explain why some participants perceived superfoods as "eco-friendly." However, the existing scientific literature on superfoods reports a detrimental social and environmental impact due to increased demand worldwide (Bedoya-Perales et al., 2018; Loyer, 2016; Magrach et al., 2020). It is perhaps unsurprising that superfoods are spun in a positive light within a media marketing discourse, even if this unbalanced representation further enhances the halo effect of superfoods. However, given the relatively broad range of news articles from a variety of news outlets in this study, one might expect a higher prevalence of this topic in superfood articles than revealed in our topic modeling analysis. As a result, it would be interesting to explore whether differences occur between different news outlets and if this finding is also true of representations in social media. Moreover, as consumers demonstrate a preference for environmentally friendly foods (Franco Lucas et al., 2021), a recommendation for future research is to assess how increased awareness of environmental consequences from global-scale production of superfoods (e.g., water depletion, soil degradation, reduction in biodiversity, and carbon footprint) might influence perceptions, preferences, and purchase behavior for superfoods.
Our relatively small number of superfood articles, both initially (1,169) and after selecting only articles mentioning 25 known superfoods (872), is unlikely to capture the entirety of online superfood news articles written between January 2010 and February 2020. We chose to prioritize minimizing researcher influence, selecting articles from our corpus using arbitrary means (count of the word "superfood") rather than adding further news articles from specific news outlets (e.g., The New York Times or The Guardian). It is also worth noting the existence of related marketing terms that have spawned from the superfood discourse (e.g., "superfruit," "supergrain," and "super berry") (Liu et al., 2021; Loyer, 2016). Unfortunately, too few articles were available in the NOW corpus to extract meaningful themes and representations. Nonetheless, despite some limitations, our approach captures meaningful patterns that are consistent with discourse findings about known superfoods using other corpora (Kāle & Agbozo, 2020).
One limitation of the NOW corpus is that it only offers data for a period of 10 years, which prevents us from drawing any conclusions on how superfood-related discourse might have changed over time. We note that there are other data sources (e.g., the American Stories database) and pre-trained time-variant language models (e.g., the histwords project) that could be used for this purpose. Combined with the insights of the present study, future work could explore how the definitions of superfoods and their relation to organic foods changed over the years (and even across different places).
Overall, we believe that the strength of our approach is that we can uncover and quantify the unique representations of superfoods in the news media. While the term superfood is banned on food packaging, here we demonstrate how this term is prevalent outside of the supermarket environment. More importantly, we demonstrate a number of unique dimensions that make up the representation of superfoods in the media. The next stage for researchers is to ascertain the extent to which these representations influence food perceptions, and ultimately food choices. For now, we recommend advertising regulatory bodies pay close attention to the loopholes being used to produce these misleading and potentially harmful associations.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Topic Modelling Grid Search
As referenced in the Topic Modelling section of the Computational Methods Section, Figure A1 shows the measures we used to find the optimal number of latent topics. We searched K-values between 4 and 40, shown on the X-axis of Figure A1. From left to right, the plots refer to exclusivity, semantic coherence, held-out likelihood, residuals, bound, and lower bound; finally, "em.its" refers to the total number of EM iterations used in fitting the model (Roberts et al., 2019). Here, our model was set to run for a maximum of 100 iterations. We optimized the model on exclusivity, which compares word distributions between topics to determine the likelihood of the top words of one topic being top words in the other topics (Roberts et al., 2019). Another important measure is semantic coherence, introduced by Mimno et al. (2011), which refers to the probability that the top words of one topic co-occur within our corpus of superfood articles (Pandur et al., 2020). We also consider held-out likelihood estimation, similar to cross-validation, which checks the model's predictive performance by estimating the probability of words occurring within a document after they have been removed (Pandur et al., 2020). Measuring residuals is useful for determining how much variance remains at a given topic number, and whether more topics would be needed to account for any overdispersion. The bound is a measure of convergence, with the model considered converged when there is a small enough change between iterations (Pandur et al., 2020). The lower bound simply applies a correction to the bound so that the bounds are directly comparable (Roberts et al., 2019).
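As an illustration of the semantic coherence measure described above, the sketch below implements the co-occurrence score of Mimno et al. (2011): for a topic's top words ordered by probability, it sums log((D(v_m, v_l) + 1) / D(v_l)) over all pairs, where D counts the documents containing the given word or word pair. This is a plain Python reimplementation for exposition, not the software used to produce Figure A1; the function name, the handling of words absent from the corpus, and the toy documents are assumptions.

```python
import math


def semantic_coherence(top_words, documents):
    """Mimno et al. (2011) coherence for one topic.

    top_words: the topic's top words, ordered from most to least probable.
    documents: an iterable of tokenized documents (lists of strings).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denominator = doc_freq(top_words[l])
            if denominator == 0:
                # Assumption: skip words that never occur in the corpus.
                continue
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / denominator)
    return score


# Hypothetical toy corpus (not drawn from the NOW corpus).
docs = [
    ["cook", "add", "salt"],
    ["cook", "salt", "pan"],
    ["cook", "add"],
    ["river", "boat"],
]

coherent = semantic_coherence(["cook", "add", "salt"], docs)
mixed = semantic_coherence(["cook", "river", "boat"], docs)
print(coherent, mixed)
```

In this toy example the cooking words co-occur in the same documents, so their score is higher (closer to zero) than that of the mixed word list, which is the behavior the measure is designed to reward when comparing candidate values of K.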

Figure 2. A comparison of the top 25 adjectives, nouns, verbs, and bigrams taken from online news articles written in a superfood context with the adjectives, nouns, verbs, and bigrams mentioned by more than one participant to describe or define a superfood.

Figure 3. Word cloud of the 100 n-grams most predictive of the terms "superfood" and "organic" occurring in the sample of online news articles.

Figure 4. Estimated differences (and 95% confidence intervals) in topic probabilities between superfood and non-superfood articles (left panel, A), and organic food and non-organic food articles (right panel, B). Each plot is ordered by the topics most prevalent in the superfood or organic articles.

Figure A1. Grid search evaluation results. The model was optimized based on the exclusivity score.

Table 1. Text classification validation performance.