Linkage Analysis Revised – Linking Digital Traces and Survey Data

ABSTRACT Linkage analysis, i.e., linking media exposure, content, and surveys, has been a powerful tool to assess media effects. However, the development of online communication and the advent of social media bring about many challenges for traditional linkage designs. In this paper, we explain the three steps of linkage designs for online communication effects and the use of computational approaches to capture communication exposure and content. We then review recent designs and studies that use different forms of digital trace data to link digital communication exposure, content, and surveys: tracking data, data donations, and screenshots/screen recordings. We describe (practical) challenges and opportunities when linking digital communication traces with self-reports and show how these data could be analyzed to establish different media effects.

effects¹ (e.g., attitudes about climate change). Within this paper, we will describe this three-step linkage process (measuring exposure, content, and outcome variables) for online communication effects and the combination of digital traces, online communication content, and survey data. We will show how computational methods and the collection of digital traces have changed traditional linkage designs and which strengths and pitfalls come with these new methods and designs. This linkage process includes methods and steps described in more detail in other papers in this special issue. Collecting digital traces is described in depth by Ohme et al. (2023) in this issue. Creating meaningful variables from text (Kroon et al., 2023) and from visual data (Peng et al., 2023) is also described in the issue, and repeating this detailed information on trace data collection and content analysis would go well beyond the scope of this paper. We will focus on linking the different sources, practical challenges, and design decisions, as well as analytical opportunities and complexities of linked datasets of digital trace data and surveys.
We will first outline the key challenges for linking social media communication to self-reports and establishing media effects with linkage analysis. We focus on surveys, the main data collection method in communication and social sciences. However, surveys are not the only data source that can be combined with digital traces of social media usage; other examples include non-self-reported outcomes such as physiological data or data provided by peers or parents. At the heart of this paper is a review of innovative designs that bring about new opportunities for researchers to combine online communication content, exposure, and survey-based outcomes. Based on this review, we will discuss new opportunities and advantages that come with linking social media data and surveys, but also practical, ethical, and analytical challenges when working with social media data linkage.
Linkage analysis is considered a "gold standard" for media effects research with high levels of external validity (Scharkow & Bachl, 2017).

Challenges of traditional linkage analysis
Developments over the last decades, such as the incorporation of the internet and social media into all aspects of our lives, have challenged the initial application of linkage analysis: an increasing number of information sources, fragmentation, and personalization make reliance on self-reported media exposure even more problematic. At the same time, digital advancement brought about new opportunities for scholars to adjust, re-invent, and apply linkage designs. Likewise, our thinking about media effects has advanced, for example, by distinguishing between the immediate and more long-term impact of media exposure, a focus on intrapersonal media effects (Valkenburg et al., 2021), and the existence of reinforcing spirals between media exposure and attitudes (Slater, 2015). To meet the requirements of changed media consumption and effects, digital tracking data ideally encompass a fine-grained, precise, and frequent (and maybe even continuous) measure of the information people encounter, the attention they devote to this information, and a frequent and valid measurement of outcome variables such as attitudes and behavioral intentions. Ideally, research also relies on large and representative samples. As outlined above, traditional linkage analysis designs combine three data sources: media usage, media content, and survey measures. The measurement of all three data sources faces serious challenges when applying the design to social media effects.
1 For linkage designs, media effects are primarily defined as (changes in) a particular outcome variable (e.g., attitudes, behavioral intentions, emotions, knowledge) that correlate with media content a participant was exposed to (Schuck et al., 2016). When using longitudinal designs to measure media content and the outcome variable, researchers tend to state that exposure to the media content caused (changes in) the outcome variable (De Vreese et al., 2017).

Measuring exposure to social media communication
It has almost become a truism of communication research that self-reports of media usage are imprecise and that recipients have a hard time reporting their usage in retrospective surveys (Araujo et al., 2017; De Vreese & Neijens, 2016; Parry et al., 2021; Scharkow, 2016, 2019). This is especially true for online communication and social media usage because the specific nature of social media makes it even harder for recipients to correctly recall not only the quantity of media usage ("How much time do you spend online/with social media?") but also the sources of information and communication ("Where do you get information on current affairs?") (Niederdeppe, 2016; Ohme, 2020; Scharkow, 2016). These characteristics include the always-online, always-connected nature of social media (Vorderer et al., 2016), i.e., there is no clear beginning and no clear ending to a communication episode, resulting in primarily short, frequent, and potentially overlapping communication episodes. The impact of those characteristics is further amplified by the fact that most social media communication happens via mobile devices (Naab et al., 2018). However, as we will show in this paper, while it has become harder to assess exposure via self-report, the characteristics of digital communication make it possible to measure exposure more accurately than ever. For many data collection designs, we will present simultaneous measures of exposure and media content.
The problems of self-reported media exposure and usage can be circumvented because recent linkage designs use passive measures of exposure. In other words, when being exposed to social media communication, recipients leave digital traces on their devices; these digital traces can be used to measure exposure to media content accurately and continuously (Menchen-Trevino, 2013; Stier, Breuer et al., 2020). In theory, capturing these digital traces allows us to measure exposure and precisely which content was received by a participant in our linkage study, something that was impossible with traditional linkage designs relying on self-reports. However, trace data collection suffers from many challenges that impair data quality, leading to less-than-perfect measurements.

Measuring social media content
In the era of mass communication, analyzing the content of a limited number of leading media outlets and combining those data with the reported use of those outlets was largely sufficient. The "end of mass communication" (Chaffee & Metzger, 2001) also challenged the principle of leading media outlets and thus one of the foundations of linkage analysis (Geiß, 2019). Media usage on social media platforms is fragmented and characterized by many factors, such as personalization, incidentalness, non-exclusivity, granularity, and sociality (Kümpel, 2022). Simply put, social media content is different for everybody, and even when a respondent can report their social media use adequately, it is almost impossible to collect the media content based on those reports. However, since digital traces capture not only media exposure but also the actual media content, one might even get more detailed information on the media content actually received (Christner et al., 2022; Otto et al., 2022).

Measuring outcome variables
At first glance, one might think this data source is not heavily affected by the digital communication environment. After all, most traditional linkage studies used (panel) surveys to capture outcome variables such as attitudes, beliefs, emotions, cognition, or behavior(al intentions). However, the characteristics of social media communication also challenge how we think about media effects. Traditional linkage analysis relied on a rather simplistic, linear media effects paradigm that sums up media characteristics in a specific time frame. The idea is, for example, that a more negative portrayal of an issue, a person, or an institution in the media leads to more negative attitudes toward that object (Brosius et al., 2020; Vliegenthart et al., 2008). In a dynamic, fragmented social media environment, different media effects paradigms might be more valid, including cumulative effects, immediate effects, or reinforcing effects, as we will explain in this paper (Thomas, 2022; Valkenburg et al., 2021). This implies that simply measuring outcome variables in panel surveys with a few waves, or even cross-sectionally, might not be enough to capture social media effects.
We argue that more frequent, fine-grained, and dynamic measures are necessary. Luckily, technical development has provided researchers with other ways to measure the outcome when investigating social media effects, as we will show later in this paper: smartphones and mobile intensive-longitudinal surveys are a good match when trying to measure the effects of short, frequent communication episodes (Naab et al., 2018).
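To make the linear paradigm described above more tangible, the following minimal Python sketch computes an exposure-weighted tone score per participant, i.e., coded content characteristics summed over the items a participant was exposed to in a given time frame. The function name `linkage_score`, the item identifiers, and all values are hypothetical and serve illustration only; actual linkage studies apply more elaborate weighting procedures.

```python
def linkage_score(exposure, tone):
    """Exposure-weighted mean tone of the content a participant received.

    exposure: dict mapping item id -> exposure weight (e.g., number of visits)
    tone:     dict mapping item id -> coded tone (-1 negative ... +1 positive)
    Items without a tone code are skipped; returns None if nothing can be linked.
    """
    linked = [(w, tone[item]) for item, w in exposure.items() if item in tone]
    total = sum(w for w, _ in linked)
    if total == 0:
        return None
    return sum(w * t for w, t in linked) / total


# Hypothetical content analysis result: one tone code per article
tone = {"a1": -1.0, "a2": 0.5, "a3": 1.0}

# Hypothetical exposure data for one time frame (e.g., between two panel waves)
exposure_by_participant = {
    "p01": {"a1": 3, "a2": 1},
    "p02": {"a3": 2, "a4": 5},  # a4 was never coded and is ignored
    "p03": {},                  # no recorded exposure
}

scores = {
    pid: linkage_score(exposure, tone)
    for pid, exposure in exposure_by_participant.items()
}
```

The resulting score per participant could then serve as a predictor of the survey-based outcome. Cumulative, immediate, or reinforcing effects would require a different aggregation, for example separate scores per time window.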

Overcoming the challenges of traditional linkage analysis
While the challenges described above make it harder, or sometimes even impossible, to apply a traditional linkage design, computational methods and modern designs might tackle some of these challenges and advance media effects research in general. In other words, while it might have become harder to access and analyze a ubiquitous, personalized, fragmented, and fast-lived social media environment, the digital nature of social media might bring solutions to long-standing problems in media effects research: since social media users leave digital traces, which can be captured, all three data sources can potentially be measured in a more nuanced, fine-grained, and accurate way than before. The passive nature of many social media measures and many computational approaches takes a cognitive burden from participants and reduces memory errors, especially when it comes to exposure (Guess et al., 2019; Scharkow, 2016; Verbeij et al., 2021), but also when it comes to the accurate analysis of media content (Scharkow, 2017). Furthermore, digital trace data and measures of exposure could be more elaborate. It is evident that for many media effects to occur, not only exposure but rather attention to or elaboration of media content is crucial (Eveland et al., 2009). Computational measures of communication exposure or usage, be it from tracking or screen recordings, often also capture meta-data that could be a valuable operationalization of attention or engagement with online communication content. These data include the length of stay on a certain page, mouse activities, clicking, and other activities like sharing, commenting, and liking. While these meta-data have the potential to tell us more about attentional processes during media reception and, thus, could be crucial for media effects research, it must also be acknowledged that the validation and methods research on these measures is still in its infancy.
It should be stressed that digital trace data, in turn, come with a broad array of issues, some of which are difficult to assess or address. Salganik (2019) points out ten characteristics of big data, some of which are strengths (always on, nonreactive). Most discussed, however, are weaknesses concerning sampling (inaccessible, non-representative), data quality (drifting, algorithmically confounded, dirty), and data interpretation/handling (incomplete, sensitive). While the current (and future) media environment often makes self-report data poor reflections of actual exposure to and interaction with platforms and content, digital traces must also be treated carefully.
We will briefly review and discuss different approaches to conducting linkage analysis for social media communication. All approaches apply computational methods to capture one of the three data sources of linkage analysis: exposure, content, and outcomes. While, in contrast to traditional linkage analysis, the concept of exposure has also changed for online communication, especially for social media usage, many recent linkage studies still investigate rather traditional modes of communication, e.g., selecting and reading news websites or comments, or passively receiving audiovisual content (Cardenal et al., 2019). We will use the term usage, where appropriate, to refer to more active forms of communication engagement, e.g., writing comments, replying, sharing, clicking, and liking.

A brief review of current linkage designs
When trying to systematize and review current approaches to linking communication content and self-report data, it becomes apparent that the degrees of freedom have increased considerably when comparing newer approaches to traditional linkage analysis. While linkage researchers also had to make different decisions in traditional (offline) linkage designs, e.g., which (news) outlets to analyze, which weighting procedures to apply, or what the panel design should look like (De Vreese et al., 2017), the main procedure for quantitative linkage analysis is relatively fixed: combining self-reports of media exposure with content analyses of relevant communication outlets and linking this content to survey scores of the variable of interest. As methods to measure communication exposure and content have changed dramatically in the social media era (Ohme et al., 2023), the designs for linkage analysis have also changed and diversified. In this section, we will provide an overview of different linkage designs used in recent years to combine social media content data and other digital traces with (survey) data to capture social media effects.
We mostly focus on studies that link social media content and other variables on an individual level, as linkage analysis is conducted to establish claims about media effects or selection, which are within-person changes due to media use (Valkenburg & Peter, 2013). More holistic views of platforms cannot account for the fragmentation of social media and the personalization of timelines, meaning that an analysis of whole platforms through surveys is arbitrary and, at best, artificially measures social media effects (Kümpel, 2022). Lastly, we only consider self-reported, individual-level data as the "outcome" data source. We see the particular strength of linkage analysis in combining data from different data sources, which can give insights into both behaviors (digital traces) and opinions and perceptions (survey). Inferences about the latter based solely on behavioral data (e.g., treating a retweet as agreement) can lead to oversimplification and misinterpretations (Groot Kormelink & Costera Meijer, 2018; Stier, Breuer et al., 2020).
We will now explain different designs linking social media communication to outcome variables. They are organized along the three main techniques used to capture exposure and content (i.e., social media data and other digital traces): tracking, data donations, and screenshots/screen recordings. We chose this way of organizing the section as (1) the data types collected through these methods are different (e.g., links, pictures, HTML content), (2) the opportunities for linking with other data sources vary (e.g., panel, experience sampling), and (3) the practical challenges differ substantially for the three approaches (see also Ohme et al. (2023) in this issue for more information on digital trace data collection). Data collection, linking, pre-processing, and analysis differ substantively depending on the exact procedure and which "source" of digital traces is chosen. We will try to illustrate all the steps and cover some exemplary studies. In each section, we discuss the main idea, technical considerations, and example studies and projects using these designs, and describe the main advantages and caveats of these new linkage designs.

Tracking
Using tracking approaches to capture online communication exposure and content can be seen as the most established technique, used across many different linkage and media effects studies in various designs (Puschmann & Pentzold, 2021). In tracking designs, online communication content is either captured live via an extension or application, so that exposure and content can be measured simultaneously, or crawled after data collection (Adam et al., 2019). This can be done at different levels of granularity (e.g., domain level vs. URL level) and can include content or meta-information. This meta-information can be seen as a further measure of exposure. Since researchers know how long a specific website has been "open" and how active a participant has been in terms of mouse activity, web tracking offers not only content measures but also measures of exposure or even attention. As mentioned, this brings many advantages and might help scholars measure exposure, attention, and engagement with online communication content. Regarding the measurement of communication content, two types of tracking tools can be distinguished. Some capture the HTML directly while the browser extension is active (Makhortykh et al., 2022); others rely on URLs that are scraped after the data collection (Guess et al., 2021). The latter strategy has implications for dynamic websites: researchers cannot be certain that participants have seen precisely the content that is scraped afterward. Extensions have, for example, been developed to capture YouTube recommendations (Sanna et al., 2021) or Facebook usage (Beraldo et al., 2021; Breuer et al., 2022; Haenschen, 2020; Haim & Nienierza, 2019). Often, browsing histories in general (to some extent including social media data) are collected via specific tools (for an overview, see Christner et al., 2022) or by relying on commercial data collection companies (Jürgens et al., 2020; Stier, Breuer et al., 2020; Stier et al., 2020). In all these designs, participants must install some form of extension or application on one or more devices at the beginning of the research process. This is followed by a tracking period during which the data are collected. Some designs only include a survey at the end of the tracking period; others include two or more, aiming to investigate changes over time. In some cases, so-called "post-linking approaches" (Stier, Breuer et al., 2020) are used, meaning that participants provide information that can later be used to retrieve digital traces, e.g., via an API. This procedure has, for example, been used for collecting Twitter or Facebook data (Al Baghal et al., 2020; Guess et al., 2019).
The studies vary with regard to the kind of data that is linked to survey responses: some investigate the amount or content of news that is accessed on social media (Haenschen, 2020; Haim & Nienierza, 2019), while others investigate personalization (Sanna et al., 2021) or whether selective exposure to political content and populist attitudes are related (Stier et al., 2020). Scholars have also linked tracking data to self-reports to assess how far they overlap, usually assuming that trace data are the more accurate data source (Guess et al., 2019; Jürgens et al., 2020).
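The difference between domain-level and URL-level granularity mentioned above can be sketched with Python's standard library. The example is a simplified illustration under our own assumptions: the URLs are invented, and the helper functions `to_domain` and `to_page` are hypothetical names rather than part of any of the cited tracking tools.

```python
from collections import Counter
from urllib.parse import urlsplit


def to_domain(url):
    """Reduce a URL to its host name (domain-level granularity)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def to_page(url):
    """Keep host and path, drop query string and fragment (URL-level granularity)."""
    return to_domain(url) + urlsplit(url).path


# Invented visit log of one participant
visits = [
    "https://www.example-news.org/politics/election?utm_source=feed",
    "https://www.example-news.org/politics/election#comments",
    "https://example-news.org/sports/cup",
    "https://social.example.com/timeline",
]

domain_counts = Counter(to_domain(u) for u in visits)
page_counts = Counter(to_page(u) for u in visits)
```

Domain-level counts only show that a news site was visited three times, whereas URL-level counts reveal which articles were opened and can therefore be linked to a content analysis. Note that dropping the query string is itself a design decision: on some websites, the query string identifies the article.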

Advantages and strengths
The advantage of tracking studies is that detailed data specifically needed for a particular research project can be collected in a set period, for example, during the final weeks of an election campaign. While designing the tools (extensions, applications) to capture content, a focus can be put on what exactly is needed, making this a flexible approach that can be adapted for each research project. This can reduce the chances of collecting data without a clear research purpose and make the linking process between survey and trace data easier.

Practical challenges
On the downside, tracking approaches require substantial technical skills and are prone to errors and to changing technical settings at the back end of social media platforms, which, if not caught early, can lead to collecting none or only parts of the expected data. Updates of platforms, operating systems, applications, and browsers can threaten the tool's functionality. Take, as an example, changes to Android phones that make it difficult to track specific data with applications that were not installed from the official Google Play Store, protecting users but also requiring additional adaptations or even new tools in case applications do not get approved by official stores.² This also applies to more minor changes in website interfaces or privacy changes in operating systems (e.g., iOS by default not allowing cross-app tracking). While these challenges affect all study designs using digital traces, they especially impact linkage studies looking for longer-term trends. If the availability and format of data sources frequently change over time, analyses become difficult. A further practical and conceptual challenge is that mobile tracking of apps is, to date, hardly possible (Christner et al., 2022), so that tracking mostly relies on browser tracking (Vogler et al., 2023), thus missing large parts of mobile online communication. The chances that a browser extension or an application for tracking mobile data still works a few years later are incredibly slim, meaning that every project developing a tool should keep in mind that additional resources for maintenance and adaptation are needed to make the investment sustainable. To some extent, this can be solved by open-sourcing application code, allowing other researchers to assist in maintaining the code. Apart from these technical challenges, being tracked can lead to issues with social desirability and privacy awareness, disrupting normal participant behavior and thus hampering the validity of the research findings.
When it comes to analyzing online communication content, all approaches introduced here carry the strength of providing digital text that is available for automatic content analysis, making it possible to scale these analyses tremendously (Kroon et al., 2023). The combination of capturing digital traces and automatic content analysis is a prerequisite for measuring personal media consumption rather than relying on leading media outlets as in the traditional approach. If one aims to content-analyze text from web tracking tools, these text corpora must be preprocessed. Imagine somebody is interested in the effects of online news usage. When scraping or tracking any online news article, one needs to parse the relevant text and filter out meta-data, information on ads, and additional so-called boilerplate. A considerable amount of work is currently being done to establish preprocessing pipelines that extract relevant content from different online sources. Standardizing and harmonizing these preprocessing steps is still in its infancy. However, transparency about preprocessing is essential since preprocessing decisions might affect the outcome of content analysis and, thus, of the whole linkage study (Pipal et al., 2023).
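As a rough illustration of such a preprocessing step, the following sketch uses Python's built-in `html.parser` module to keep paragraph text and discard typical boilerplate containers. It is a deliberately naive example under our own assumptions (the class `ArticleTextExtractor`, the list of skipped tags, and the sample page are hypothetical) and not one of the pipelines used in the cited studies; real news pages usually require more robust, source-specific rules.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect text inside <p> tags, ignoring typical boilerplate containers."""

    # Assumption: these elements never contain article text
    SKIP = {"script", "style", "nav", "aside", "footer", "header"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside a boilerplate element
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.paragraphs[-1] += data


def extract_article_text(html):
    """Return the paragraph text of a page, one paragraph per line."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    cleaned = (" ".join(p.split()) for p in parser.paragraphs)  # normalize whitespace
    return "\n".join(p for p in cleaned if p)


# Invented example page
page = """
<html><head><style>p {color: red}</style><script>track();</script></head>
<body>
<nav><p>Home | Politics | Sports</p></nav>
<article><h1>Headline</h1>
<p>First   paragraph of the <b>article</b>.</p>
<p>Second paragraph.</p></article>
<aside><p>Advertisement: buy now</p></aside>
<footer><p>Imprint</p></footer>
</body></html>
"""
article_text = extract_article_text(page)
```

Even in this small example, several decisions are made implicitly (headlines are dropped, every paragraph outside the skipped containers is kept). This is the kind of choice that should be documented transparently.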

Data donations
Data donations present a newer approach to obtaining social media data on the individual level. The main difference compared to tracking approaches is that data are collected post hoc (using already tracked data), while tracking usually occurs ante hoc (e.g., by installing a tool that tracks the usage of a social media platform). For data donations, existing data sources are located and reused for research purposes. As the name implies, respondents "donate" their own data for the purpose of research. This approach especially gained in popularity due to changes in European data protection law (the General Data Protection Regulation, GDPR), which allows users to request copies of their own digital traces from companies such as social media platforms. However, even earlier, some research designs were implemented using data donation techniques, such as WebHistorian (Menchen-Trevino, 2016), which allows for a full overview of browsing histories (including visits to social media websites). It has been used, for example, to research the null effects of news exposure on outcomes such as political knowledge and participation (Wojcieszak et al., 2022). Several newer designs follow the same approach, building tools that can theoretically be used for the donation of any online traces (Araujo et al., 2022; Boeschoten et al., 2022). In addition, some studies focused on specific platforms only, such as Instagram (Van Driel et al., 2022) or Facebook (Breuer et al., 2022; Haim & Nienierza, 2019; Marino et al., 2017; Thorson et al., 2021).
The exact content of data donations, and whether exposure or usage can be quantified through them, depends on the data source. Data donations almost always include (active) usage information (e.g., likes, follows). However, some data sources also include measures of exposure: TikTok data downloads include all videos shown on the feed, while Twitter takeouts include all ad impressions of a user. Additionally, some data sources also include inferred data, such as ad profiling categories that platforms derive about a user (included in the Facebook takeout). Overall, data donations can result in very diverse sets of data, from exposure to inferred data, though they are primarily used to capture active usage of platforms.
The obtained digital traces are usually combined with online surveys: participants are recruited online, fill in a survey, and, in the final step, are asked to donate their digital traces (Marino et al., 2017; Thorson et al., 2021). In some cases, this was done on a smaller scale in an offline setting to aid the data donation process (Breuer et al., 2022). The data sources are thus collected separately and combined using individual identifiers (Breuer et al., 2022; Thorson et al., 2021). However, in some newer designs, participants are prompted to annotate their own data (e.g., whether the data match their usual usage patterns or whether a YouTube channel includes news) (Welbers et al., 2023). This approach offers more control over the quality of the data and could potentially also include effects measures such as emotions evoked by certain content. In this case, content and responses are collected together.
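Combining separately collected data sources via individual identifiers can be sketched as follows. The example is illustrative only: the column names, the identifier format, and the function `link` are assumptions of ours and do not reproduce the procedure of any cited study.

```python
import csv
import io
from collections import Counter

# Hypothetical survey export: one row per participant
survey_csv = """participant_id,political_interest
p01,4
p02,2
p03,5
"""

# Hypothetical donated traces: one record per logged action
donated = [
    {"participant_id": "p01", "action": "like"},
    {"participant_id": "p01", "action": "follow"},
    {"participant_id": "p03", "action": "like"},
    {"participant_id": "p99", "action": "like"},  # donation without a survey
]


def link(survey_rows, trace_records):
    """Attach the number of donated actions to each survey respondent."""
    n_actions = Counter(r["participant_id"] for r in trace_records)
    linked = []
    for row in survey_rows:
        pid = row["participant_id"]
        linked.append({
            "participant_id": pid,
            "political_interest": int(row["political_interest"]),
            "n_actions": n_actions.get(pid, 0),
            "donated": pid in n_actions,  # keeps non-donors identifiable
        })
    return linked


survey_rows = list(csv.DictReader(io.StringIO(survey_csv)))
linked = link(survey_rows, donated)

# Donations that cannot be matched to any survey respondent
unmatched = sorted(
    {r["participant_id"] for r in donated}
    - {row["participant_id"] for row in survey_rows}
)
```

Keeping an explicit `donated` flag distinguishes respondents who donated nothing from respondents who donated but showed no activity, which matters when assessing dropout and representativeness.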

Advantages and strengths
An advantage of data donation studies and tools is the strong focus on active user involvement: users can visually inspect their own behavioral traces and filter them before allowing the researchers access. Additionally, donated data can sometimes go back months or even years, allowing for a more longitudinal overview that is less affected by social desirability issues compared to tracking (Breuer et al., 2022).
Furthermore, from a technical point of view, no extensions or applications need to be installed, reducing the amount of technical support needed and enhancing cross-platform and cross-device compatibility. For example, exposure to and interaction with political content on several platforms (Twitter, Facebook) and on different devices (mobile, desktop) can be taken into account, giving a more accurate measure than tracking only one platform and device. Adding new sources or adapting existing ones requires comparatively little effort, making the approach more flexible in the ever-changing ecosystem of platforms.

Practical challenges
Data donation techniques also come with particular challenges: participants need to be able to locate and upload data, depending on the study design even from multiple different providers. This often requires a substantially larger time investment than passive tracking, and participants need to have a certain level of digital literacy to follow the instructions (Van Driel et al., 2022). This leads to high dropout rates and, in turn, to issues with representativeness and sample sizes (Breuer et al., 2022), limiting the applicability of linkage studies for research goals that require large sample sizes (e.g., for investigating heterogeneous media effects across specific groups) while also leading to comparatively high costs. Furthermore, the process depends on the social media platforms: while they are legally required to provide the data, the process can take up to weeks and be tedious for the user requesting their data. Additionally, there are no set requirements on what data need to be included. Often, the data that would be of interest for research purposes (e.g., timeline exposure) are not part of data takeouts, limiting the research designs that can profit from data donations for social media research (Araujo et al., 2022; Breuer et al., 2022). The data acquired via data donations usually come in a very structured, machine-readable format such as CSV or JSON files, making them easy to process. However, there are several other practical challenges specific to making data donations usable for linkage analysis: often, the data do not include the content of, e.g., liked posts directly but rather the link to the content (e.g., a link to a Tweet that was liked). Therefore, an additional step is needed to retrieve this content, usually via automated scraping methods that might require specific computational skills for data retrieval, storage, and linking (see tracking designs above). Furthermore, most sources for data donations are not documented or only very poorly documented. For example, for social media archives or browsing histories, no "manual" is given, meaning that researchers often have to do detective work to understand the different files available. Sometimes variables are present in the data without any explanation of how they have been created. One example is that Chrome browsing histories include the "frequency" of a visited website, which is calculated through some mix of frequency and recency, but without any explanation of how exactly the value was derived. Since the exact composition of the data archives provided for data donations frequently changes, researchers will always have to invest time in understanding the content and structure of the data source while accepting that some variables will not be usable without further information on how they were created.
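The following sketch illustrates the first step of working with such an archive: reading a JSON file, tolerating incomplete entries, and collecting the links whose content would have to be retrieved in a separate step. The file structure (`likes`, `href`, `timestamp`) is entirely hypothetical, since actual takeout formats differ between platforms and change over time.

```python
import json
from datetime import datetime, timezone

# Hypothetical export; real takeout files differ by platform and version
raw = """
{"likes": [
  {"href": "https://social.example.com/post/1", "timestamp": 1672531200},
  {"href": "https://social.example.com/post/2", "timestamp": 1672617600},
  {"href": "https://social.example.com/post/1", "timestamp": 1672704000},
  {"title": "entry without link"}
]}
"""


def liked_links(export):
    """Return (url, ISO date) pairs for all entries that carry a link."""
    rows = []
    for entry in export.get("likes", []):
        url = entry.get("href")
        if not url:
            continue  # undocumented or incomplete entry: skip rather than guess
        ts = entry.get("timestamp")  # assumed to be Unix seconds (UTC)
        day = (
            datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
            if ts is not None
            else None
        )
        rows.append((url, day))
    return rows


likes = liked_links(json.loads(raw))

# Unique links whose content still has to be retrieved (e.g., by scraping)
urls_to_retrieve = sorted({url for url, _ in likes})
```

Assumptions such as the unit of the timestamp are exactly the kind of undocumented detail that has to be verified against the actual archive before any analysis.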

Screenshots/Screen recordings
When talking about the effects of social media, it should be noted that more than 90% of communication is done on mobile devices via social media apps and not via desktop computers and browsers (Statista, 2021). This is a potential problem for the designs discussed above, as data from those apps are difficult to capture.³ The last technique, screenshots and screen recordings, is mostly used on mobile devices and can be seen as a hybrid between tracking and data donation. The capturing or recording of the screen can be done either with automated tools or manually by the user. The former is often used in the context of mobile usage, taking screenshots of mobile phones (Brinberg et al., 2021; Reeves et al., 2021) or recording the screen while certain applications are being opened (Krieter & Breiter, 2018). In this case, linkage works similarly to tracking approaches: data are recorded passively while social media are being used and are often combined with online surveys.
3 Data donation approaches are able to capture mobile digital traces, so in principle this notion only holds for tracking. However, data donations are also limited to information from a few platforms, apps, and services that a user logs into and from which they can, in turn, download their data. For a significant share of activities on mobile devices, data donation approaches are not available.
Manually taken screenshots, by contrast, tend to be used in experience sampling approaches, prompting users to upload screenshots of their (mobile) social media exposure and content. Since screenshots are, by definition, static and do not contain meta-data, it is harder to capture active usage. In that sense, this approach is more similar to data donation techniques, allowing users to inspect the data before submitting them and to augment them on the spot (e.g., by adding the emotion felt). The combination of screenshot data donation and mobile experience sampling, sometimes also referred to as mobile intensive longitudinal linkage analysis (MILLA) (Otto & Kruikemeier, 2023; Otto et al., 2022), is especially suitable for capturing immediate reactions to social media content such as emotions (Otto et al., 2020) or credibility judgments of specific social media items (Otto et al., 2018). Still, it has also been used to investigate application usage on mobile phones and assess the accuracy of self-reports (Ohme et al., 2021).

Advantages and strengths
These immediate reactions to social media content cannot be measured with any other approach mentioned here or with traditional linkage analysis since surveys can only measure "delayed" and more long-term media effects such as attitudes, opinions, and behavioral intentions. A further advantage of screenshot/screen recording approaches is that any kind of content across different platforms and devices can be captured, explicitly addressing issues with collecting mobile data: Since social media communication on mobile devices primarily takes place in apps instead of the browser, tracking approaches in particular fall short (Christner et al., 2022). Furthermore, these approaches can capture the actual exposure better than tracking or donations by showing what was displayed to the participant on the screen. Especially for studying advertisements and personalization, this offers clear advantages (Beraldo et al., 2021). Thus, screen recordings, screenshots, and experience sampling approaches might capture exposure, content, and the outcome variable almost simultaneously (or "in-situ" (Schnauber-Stockmann & Karnowski, 2020)).

Practical challenges
The content obtained through those techniques comes in a very unstructured format: visual information. It needs to be stored (requiring more capacity than textual content) and parsed before analysis. Reliably extracting features from visual content remains a challenge, especially if researchers would like to conduct an automated content analysis on the material (Peng et al., 2023). Furthermore, especially with automated approaches, highly private data is included in the data collection (e.g., when participants open a private messenger or a banking application). Taking care of adequate privacy measures and data storage poses important challenges.
This method furthermore relies on the continuous engagement of the respondent: for the screen recording approaches, participants need to follow the instructions closely and need to make sure their mobile device is linked and working correctly (Reeves et al., 2020). For the screenshot approach, there is a high demand for resources on the side of the participant: they have to remember and detect relevant situations to take a screenshot. Like any other intensive longitudinal design, it requires high compliance and motivation of participants (Napa Scollon et al., 2009; Otto & Kruikemeier, 2023; Otto et al., 2022). Finally, these studies rarely yield sample sizes as large as those possible with other designs since they are also very costly and incentives must usually be (very) high (Bolger & Laurenceau, 2013). Moreover, screenshots or screen recordings are highly unstructured data; thus, some preprocessing steps need to be taken before one is able to perform any analysis. However, as mentioned before, these steps depend on the exact design. For screenshot approaches in combination with manual content analysis, almost no preprocessing is necessary since coders can directly use the (readable) screenshots to code the variables of interest (Otto et al., 2022). If one aims to analyze (text from) screenshots automatically, these need to be preprocessed. Text recognition (optical character recognition, OCR) software usually works very well with screenshots from mobile devices so that the text can be further processed and analyzed (Chiatti et al., 2017). Since screenshots and OCR software usually do not capture visual content, but only the relevant text, processing steps similar to those for tracking data (see parsing above) are mostly not necessary.
Screen recordings may be the most processing-intensive data source for linkage analysis. Screen recordings capture a significant amount of irrelevant content since researchers are mostly only interested in specific content, e.g., social media information or social interactions. Therefore, the corpus of recordings ("raw data") has to be searched for relevant words and content. Depending on the approach, this could be done after the recordings are transmitted to the researcher by searching the text corpus for relevant words based on dictionaries (Sun et al., 2022). A second way to identify and filter relevant content is the so-called "key logging" approach. Here, relevant content is determined before data collection, and only recordings with relevant words are submitted to the researcher from the phone. The recordings are processed on the mobile device itself (Krieter, 2019).
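The dictionary-based filtering described above can be illustrated with a minimal sketch. It assumes that text has already been extracted from the recordings (e.g., via OCR) and uses a made-up dictionary and made-up screen texts; it is not the procedure of any of the cited studies.

```python
import re

# Hypothetical dictionary of relevant terms (here: climate-related words).
DICTIONARY = {"climate", "emission", "emissions", "warming"}

# Hypothetical OCR output: one text snippet per captured screen.
screens = [
    {"screen_id": 1, "text": "Your parcel will arrive on Tuesday"},
    {"screen_id": 2, "text": "New report: global warming accelerates"},
    {"screen_id": 3, "text": "Climate summit ends without deal on emissions"},
]


def matched_terms(text, dictionary):
    """Return the dictionary terms that occur as whole words in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return tokens & dictionary


def filter_relevant(records, dictionary):
    """Keep only screens whose text contains at least one dictionary term."""
    relevant = []
    for rec in records:
        hits = matched_terms(rec["text"], dictionary)
        if hits:
            relevant.append({**rec, "hits": sorted(hits)})
    return relevant


relevant_screens = filter_relevant(screens, DICTIONARY)
for rec in relevant_screens:
    print(rec["screen_id"], rec["hits"])
```

The same filter could in principle run either on the researcher's side after transmission or on the device before submission; the sketch does not depend on where it is executed.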

General discussion and challenges of social media linkage designs
After introducing different recent linkage designs with their specific advantages and practicalities, some challenges and opportunities of social media linkage analysis need to be discussed comparatively. We will now explicate how researchers should select one of the designs based on the kind of media effects and variables they are interested in. Thereafter, we will discuss some general analytical challenges. Finally, we will show a way forward to combine traditional linkage designs and recent approaches to yield a sound judgment of online communication usage, content, and effects.

Analytical opportunities and complexities
It would go well beyond the scope of this paper to introduce all analytical approaches that could be used to model linkage data. Still, four broader challenges, opportunities, and research strands should not remain unmentioned. They are all more or less related to the high resolution of communication exposure measures and the intensive-longitudinal, dynamic nature of computational approaches and online communication: (1) duration and timing of media effects, (2) within-person processes and person-specific effects, (3) reciprocal processes, and (4) aggregation of data.
(1) If scholars want to apply one of the designs discussed above, they might want to elaborate on the nature, duration, and timing of the media effects they are interested in. The different data sources on the side of exposure and content (tracking, data donations, screen recordings, screenshots), as well as different survey designs, have implications for the assumed media effects model, and clearly "the design must fit the phenomenon under study" (Slater, 2007, p. 286). If one is, for example, interested in changes in social norms and social media consumption, it could be worthwhile to take into account months or even years of social media communication. This is, potentially, only possible with data donations. Data donation approaches allow linking one's social media history to the variable of interest. They might therefore be especially interesting for more stable variables such as social norms, (social) identity, or values, and for other long-term processes such as cultivation effects. On the downside, the approach is mostly "retrospective" in nature since the outcome is mostly measured days, weeks, or even months and years after media exposure. That being said, combining data donations with multi-wave surveys and asking for donations multiple times is possible. However, this requires substantial effort from participants at every wave, which is why, in these cases, tracking might be a more attractive approach. For tracking studies, more mid- to short-term processes and transient variables are at the center of interest. Since researchers mostly ask participants to install tracking tools for a particular field period, this design is mostly suitable to track participants' social media exposure and content over some days or weeks, e.g., during a campaign period (Beraldo et al., 2021). Tracking designs linked with panel surveys are suitable for media effects that unfold over a few days or weeks. Screenshots are (only) able to capture short-term processes (e.g., emotions, attentional processes) for specific media content, especially when combined with experience sampling designs. Since this design is very demanding for participants, it is hard to conduct studies of more than a few days or weeks and impossible to capture a wide range of social media content with screenshots (e.g., capturing all social interactions vs. taking a screenshot when coming across hate speech on social media) (Otto & Kruikemeier, 2023; Otto et al., 2022). For screen recordings, however, longer time frames are possible (Reeves et al., 2021). But immediate media effects are not easy to capture since the material, i.e., the uploaded screen recordings, cannot (yet) be directly integrated with experience sampling approaches. Thus, as in traditional linkage analysis, for tracking, data donations, and screen recordings, media exposure and content need to be aggregated to link them to the survey scores. While for traditional linkage analysis, the design almost "pre-defined" the nature of media effects researchers could capture and model, the diversity of approaches discussed above makes it inevitable for communication scholars to think about the kind of processes they are interested in and the appearance, duration, linearity, and dynamics of the media effects they attempt to cover (Baden & Lecheler, 2012; Thomas, 2022; Valkenburg et al., 2021). Now, scholars are able to investigate and model media effects dynamics better than with any traditional research design. Whether scholars are interested in immediate effects, mid- or long-term effects, cumulative media effects, sleeper effects, or other effect durations and dynamics, social media data, other digital behavioral data, and intensive longitudinal surveys have the potential to model these dynamics (Thomas, 2022).
(2) Depending on the exact research design, linkage analysis with social media data often yields data with a higher resolution. In other words, exposure, content, and responses are measured more frequently than in traditional linkage studies, yielding more data points per individual. Traditional designs did not always enable scholars to study theories and models that describe intraindividual processes (Beyens et al., 2021; Thomas, 2022). The designs presented here are, in principle, all able to capture within-person processes. After all, they measure online communication exposure (or usage) and content in a longitudinal or even continuous way. Of course, this shift to intraindividual processes also calls for adjustments of theories (Slater, 2015) and analysis (Hamaker et al., 2018; Thomas et al., 2021). On the analytical level, these different dynamic approaches require different characteristics from the data: Firstly, of course, they require longitudinal data on the individual level. Secondly, these dynamic processes require different numbers of measurement points. Within-person processes need at least three measurement points. Person-specific dynamics call for at least 50 measurement points per person and, depending on the model, multiple measures per day (Hamaker et al., 2018; Keijsers & van Roekel, 2018). A third important component is the issue of unequal periods between observations. This characteristic holds for almost all designs we have discussed above and poses challenges to analytical approaches that work with lagged effects. Recently, continuous time models have been discussed to model the dynamic relationship between the variables over time (De Haan-Rietdijk et al., 2017). These requirements should be considered when planning a study and analyzing longitudinal survey data and digital traces.
(3) The linkage designs and types of data described here allow for investigating reciprocal relationships, that is, not only the effects of content on outcome variables, but also how these outcome variables might alter social media use and the content people encounter. If researchers apply any panel design, they would be able to capture media effects and media selection at the same time. After all, exposure to online communication content at time point 1 does not only affect attitudes, behaviors, and beliefs but could also be affected by these variables at time point 0. Modern linkage designs would then not only allow for testing online media effects as explained in this special issue (Ohme et al., 2023), they would also allow communication scholars to overcome the traditional selection-or-effects paradigm (Geers et al., 2018; Slater, 2015). Linkage designs, thus, also yield an advantage over traditional experimental media effects designs that are not able to capture selection and effects at the same time (Feldman et al., 2013).
(4) We have outlined two advantages of the high granularity of exposure/usage and content measures. It is, however, impossible to conduct surveys at the same resolution.[4] Even very intensive experience sampling designs cannot burden participants with more than a few measurement points per day. Thus, most modern linkage designs, like their offline counterparts, need to rely on some form of aggregation of communication content.
However, research on how online communication recipients and users come to a summary evaluation or judgment that they can express in self-reports is scarce. It remains unclear whether participants in linkage studies have mean scores in mind (De Vreese et al., 2017) or remember peaks or recent communication episodes (Alaybek et al., 2022). These possible relationships and aggregations need to be investigated for different phenomena, variables, and designs. In principle, this notion also holds for traditional offline linkage designs. However, these linkage studies mostly came with the implicit assumption of cumulative media effects and weighted content based on exposure measures. This procedure could, for example, be transferred to measures of attention or engagement that we have discussed earlier in this paper.
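To make the aggregation question concrete, the following minimal sketch computes four candidate summaries of the same exposure episodes: the mean, the peak, the most recent episode, and an exposure-weighted mean. The episode data, the variable names, and the use of exposure duration as the weight are assumptions for illustration only; which rule matches participants' self-reports is exactly the open empirical question raised above.

```python
from statistics import mean

# Hypothetical exposure episodes for one participant between two survey
# waves: each has a content score (e.g., negativity from a content
# analysis) and an exposure duration in seconds.
episodes = [
    {"time": 1, "score": 0.2, "seconds": 30},
    {"time": 2, "score": 0.9, "seconds": 10},
    {"time": 3, "score": 0.4, "seconds": 60},
]


def aggregate(eps):
    """Summarize episodes under four candidate aggregation rules."""
    ordered = sorted(eps, key=lambda e: e["time"])
    total_seconds = sum(e["seconds"] for e in ordered)
    weighted_sum = sum(e["score"] * e["seconds"] for e in ordered)
    return {
        "mean": mean(e["score"] for e in ordered),    # average content score
        "peak": max(e["score"] for e in ordered),     # most extreme episode
        "recent": ordered[-1]["score"],               # last episode before the survey
        "exposure_weighted": weighted_sum / total_seconds,
    }


summary = aggregate(episodes)
for rule, value in summary.items():
    print(rule, round(value, 2))
```

Each summary can then be linked to the same survey score, so that the competing aggregation rules can be compared as predictors.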
In sum, we have shown that on a processing and analytical level, computational methods and dynamic data call for an update of the statistical approaches communication scientists use to capture causal media effects relationships and dynamics, or even for a return to statistical approaches that have been used on other types of data, such as specific types of time series analyses (Vliegenthart, 2014). However, if scholars do not focus on the details of dynamics and the longitudinal, sometimes even continuous data collection, they can rely on well-known statistical approaches. Consequently, to date, most linkage studies mentioned here use (multilevel) regression approaches, structural equation models, or descriptive statistics to describe the data and test their hypotheses.

Ethical challenges
Finally, ethical and privacy challenges make exploiting digital behavioral data for linkage purposes more complex and sometimes even impossible because privacy regulations might be violated (Menchen-Trevino, 2013). Explicit and informed consent is paramount for linking survey and digital trace data on the individual level, while consent rates are notoriously low for digital trace studies, impacting study designs and sampling (Stier, Breuer et al., 2020). This process often needs to go further than just providing a short information sheet at the beginning of the study since participants are often not aware of the level of detail of the data being collected, what it looks like, and what could potentially be inferred from it. Froomkin (2019) called big data a "destroyer of informed consent" as both users and researchers cannot predict the number of unexpected findings the data might reveal. Therefore, approaches that allow users to (visually) inspect their own data before submission to researchers should be implemented more widely (Araujo et al., 2022).
Apart from the data collection process itself, questions on how data can be made available for reproducibility purposes without potentially violating the privacy of participants remain crucial (Bishop & Gray, 2017). Protecting privacy when using digital trace data often goes beyond deleting usernames, e-mails, or IP addresses: the combination of website visits or search histories, the parameters that are part of URLs, and many other variables often make it possible to identify individuals (Breuer et al., 2020). This makes sharing data sets a difficult endeavor, as the 2006 AOL search log release scandal famously showed (Barbaro & Zeller, 2006). A data set of anonymized searches was made public and, due to the unique combination of entered search terms, users could be identified easily. One approach to tackle this is to not collect "raw" data but rather the variables of interest derived from it, such as the number of news organizations followed on a social media website (but not which news organizations were followed). Data donation frameworks in particular allow for this process (Boeschoten et al., 2022). However, the downside is the inability to assess the validity of these steps from raw to transformed data, making the overall research process less transparent and reproducible. Still, in line with the GDPR principle of data minimization, it is crucial to collect and store only the data needed for the particular research purpose instead of following a catch-all approach that aims at finding patterns in data without clear expectations.
[4] It would be possible to measure the outcome at the same resolution, e.g., when applying physiological measures or real-time response designs (Maier et al., 2016).
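The idea of collecting derived variables instead of raw data can be sketched as follows. The list of news accounts, the account names, and the example URL are hypothetical; the sketch only illustrates the principle of data minimization and is not the implementation of any specific data donation framework.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of news accounts; in a real study this would be a
# curated list defined by the researchers.
NEWS_ACCOUNTS = {"daily_news", "world_report", "city_times"}


def strip_url(url):
    """Drop query string and fragment, which can carry identifying parameters."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def count_news_followed(followed_accounts):
    """Return only how many news accounts are followed, not which ones."""
    return len(set(followed_accounts) & NEWS_ACCOUNTS)


# Raw data that would stay on the participant's side in this sketch.
raw_followed = ["daily_news", "cooking_tips", "city_times", "my_cousin"]

# Only the derived variable and the stripped URL would be stored.
derived = {"n_news_followed": count_news_followed(raw_followed)}
clean = strip_url("https://example.org/search?q=my+name&session=abc123#top")

print(derived)
print(clean)
```

As noted above, the price of this design is that the step from raw to derived data can no longer be validated by the researcher afterwards.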

Discussion
Linkage analysis has been one of the main communication science methods to investigate the impact of media content on all kinds of attitudinal and behavioral outcomes. On the one hand, digitalization and the rise of social media have complicated linkage processes. The fragmentation of media use has increased and diversified the content people are exposed to, which leads to serious challenges to self-reports and content analytical approaches, making it almost impossible to apply the traditional linkage design that we have described above. Since newer linkage designs measure exposure (and content) passively and/or closer to the reception situation, these challenges of communication quantity as well as problems of self-report can partly be solved. A second severe challenge to the logic of traditional linkage is personalized agendas and communication content. If social media outlets, search results, and mobile app content are curated through self-selection and/or algorithmic selection, traditional approaches do not capture this level of personalization. We have, therefore, outlined the importance of linkage on the individual level. The "end of mass communication" (Chaffee & Metzger, 2001) has made it impossible to simply correlate online communication content and agendas with survey data on the aggregate level (Breuer et al., 2022; Stier, Breuer et al., 2020). Finally, we have shown the importance of computational approaches to linkage analysis when capturing the dynamic and fast-lived nature of online communication.
New approaches to linkage analysis, tailored to the changed and digitalized media environment, offer ample opportunities. This paper outlines three common approaches: tracking, data donations, and screenshots/screen recordings. All carry strengths and weaknesses. On the one hand, these methods offer better opportunities to establish the information individuals are exposed to compared to the traditional approach based on self-reports of media use. On the other hand, these new methods produce complex datasets that must be adequately managed, archived, and made available for (re-)usage on a practical data management level. Moreover, there is the everlasting challenge of data access when using digital behavioral data, especially on closed platforms and devices (Christner et al., 2022; Otto et al., 2022). Additionally, complicating factors exist, such as tracking users across devices, browsers, and services (impacting data quality and interpretability) and the technical complexity and maintenance of research tools. Most tracking and donations of browser histories do not work across platforms and devices. Data donations through screenshots/recordings can be applied more widely, but often require sustained efforts from participants and suffer from potentially high drop-out levels. In the case of screenshots, collected material will only cover the self-selection of communication and information people have encountered. We do not want to downplay these issues, but imperfect data are nothing new. Still, substantial efforts are needed to, on the one hand, further develop the tools that underlie the data collection process and, on the other hand, demonstrate the usefulness of the data that are being collected.
Many studies that perform a new type of linkage analysis suffer from low response rates and limited samples that are often not representative of social media users, let alone broader populations. Individuals with strong privacy concerns, for example, might be reluctant to participate (Struminskaya et al., 2020). This is not problematic per se: if we want to understand dynamics and within-person effects, a limited sample offers ample opportunities. The approach, however, does have to fit the research question and the "nature" of media effects a scholar is interested in, as we have discussed above. Generally speaking, this type of data might be particularly fitting to investigate the mechanisms by which attitudes and behavior are affected by social media content (and vice versa), but less so to understand how media content, for example, affects political preferences throughout an election campaign. Here, traditional linkage analysis is still a viable research strategy (Vermeer et al., 2022). In any case, we argue that using designs that combine traditional and new approaches would be substantially and methodologically interesting. Firstly, as a validation strategy: do these different methods yield similar types of content exposure and effects? Secondly, as complementary means to assess short-term dynamics and their long-term consequences. Thirdly, triangulating traditional and recent linkage designs might be a way forward to compare the impact of many different media environments, be it online, offline, or interpersonal communication. The latter has been impossible to capture through linkage designs, but this might now be possible by using screen recording/screenshot, data donation, and tracking approaches, e.g., of instant messengers (Koch et al., 2022). Lastly, though discussed only briefly in this paper, the importance of alternative data sources to link with digital trace data (e.g., physiological measures) should be further pursued in social science settings. Up until now, survey data were mostly linked to administrative records, such as voting district, address, or household information (Beuthner et al., 2023; Silber et al., n.d.), giving information on the context of individual-level data. In addition, GPS or location data are often used when researchers are interested in the spatial context or mobility (Doherty et al., 2014). Besides these rather contextual measures, researchers could think of replacing the self-report with, for instance, implicit measures using reaction times (Brouwer et al., 2021). When applying these implicit measures, media effects that are not as open to reason and conscious processes could be explored (Hefner et al., 2011).
To sum up, traditional linkage was relatively simple and powerful, but recent designs are needed to capture the vastly changing and increasingly complex social media environment. Their development goes hand in hand with more advanced, detailed thinking about media effects. While this type of data collection will remain tedious and resource-intensive, technological advancements might increase its user-friendliness for both researchers and participants. Because of their high external validity, new linking strategies can be expected to develop into central methods to investigate the effects of social media content in the near future. Linking digital traces of communication exposure with self-reports might bring us forward in detecting complex, non-linear, dynamic media effects, assessing communication content that we were not able to capture before, and modeling the complex communication environment of humans in a more fine-grained, accurate, and realistic way than we could in a mass media, offline era. The idea of linkage could, thus, become even more powerful than before.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Notes on contributors
Lukas P. Otto (PhD, U of Koblenz-Landau) is a senior researcher at GESIS - Leibniz Institute for the Social Sciences, Computational Social Science Department. His research interests include dynamics of communication effects, political communication and emotion, and mobile approaches to communication research.
Felicia Loecherbach (Ph.D., Vrije Universiteit Amsterdam) is a postdoctoral researcher at the Center for Social Media and Politics, New York University. Her research focuses on news diversity and how it is impacted by selection and personalization as well as the collection and modeling of digital trace data.
Rens Vliegenthart is a professor in Strategic Communication, Wageningen University and Research. His research focuses on (social) media content and effects on citizens as well as on political decision making processes.