Challenges of and approaches to data collection across platforms and time: Conspiracy-related digital traces as examples of political contention

ABSTRACT Taking the example of conspiracy-related communication online as one form of contentious politics, this study examines the data collection challenges for multidimensional comparative research across platforms, time, and cultural embeddings. It compares the architectures and features relevant to data collection, access regimes, and use cultures for a set of digital platforms and communication venues. Differentiating between actor- and content-based strategies, this study discusses the potentials and limitations of these approaches, considering differences in platforms, temporal dynamics, and cultural embeddings as well as several layers of equivalence. The discussion highlights crucial insights into designing data collection strategies in multidimensional comparative studies.


Introduction
Nowadays, a significant part of political contention and mobilization is performed through digital communication and distributed in a hybrid and networked digital information ecology (Häussler, 2021). This ecology of communication and information circulation is constituted by a range of social networking platforms, such as Facebook or Twitter; messenger and microblogging sites, such as Telegram; image and discussion boards, such as 4chan and Reddit; and alternative and legacy media sites online. These venues all come with specific platform architectures, features, and afforded utilities for specific actor groups (Bossetta, 2019; Evans, Pearce, Vitak, & Treem, 2017), governance structures, and access regimes that fundamentally influence the data collection possibilities and limitations in such sites. Although decisions concerning these multifaceted platform characteristics can influence empirical analyses, they are rarely discussed at length (Mahl, von Nordheim, & Guenther, 2022).
In addition to the question of the ways in which platform peculiarities pose challenges to valid data collection, the matter becomes even more complex if we acknowledge the nature of digital information ecologies. Digital communication seldom remains contained within one specific platform, as platforms and communication venues are mutually interrelated. First, technological features enable content to be easily spread within and across several platforms. This affordance of networked communication can contribute to the diffusion of topics and narratives from platforms providing spaces for fringe use cultures and dark participation (Quandt, 2018) to broader audiences, which might therefore influence societal discourses at large. Second, acts of political contention are not only intertwined across platforms through linking, sharing, and forwarding features, which represents just one meaning of the "cross" in "cross-platform." Political and challenger actors who contend with existing rules and procedures also often maintain accounts on several platforms. They strategically leverage platform-specific features to adapt their messages to distinct audiences and platform-specific use cultures (Ekman, 2022). Even without direct digital references between platforms, we can expect that users observe discourses across platforms and that debates on contentious issues are marked by mutual interference, again contributing to the whole of societal discussion. If we acknowledge this inherent and double cross-platform nature of digital communication, then the question is how equivalent data collection from several platforms and communication venues can be facilitated to enable cross-platform and platform-comparative studies, as well as whether viable approaches to deal with vast platform differences are available. Comparative studies on these acts of contentious claims-making and mobilization across multiple platforms and communication venues have only recently become more frequent (e.g., Frischlich, Schatto-Eckrodt, & Völker, 2022; Yarchi, Baden, & Kligler-Vilenchik, 2020), which is not surprising given the task complexity involved.
The same applies to research on the spread of and mobilization through conspiracy theories. Studies researching conspiracy-related content online have often focused on single platforms (Gallagher, Davey, & Hart, 2020; Tuters & Hagen, 2020), context-specific events, or particular actor groups (Bevensee & Ross, 2018; Knight, 2008; Wilson, 2017). Conspiracy theories are central parts of contentious practices. As in any other type of digitally mediated political contention, understanding whether and how conspiracy theories appear on and potentially transcend the boundaries of specific platforms is crucial for our understanding of the formation of contentious public debate. Conspiracy theories are also deeply rooted in a particular time and culture (Barkun, 2013), adding these dimensions to comparative designs and thus increasing their complexity. Against this background, our study uses the phenomenon of conspiracy theories as an exemplary case for our discussion of the challenges of and approaches to data collection in multidimensional comparative studies on political contention. We ask the following questions:

What are the methodological and practical challenges of different platform architectures, governance and access regimes, and use cultures for data collection across platforms and time?
What approaches could facilitate equivalent data collection from multiple platforms while also considering temporal dynamics and cultural embeddings?
What are the implications of (partially unresolvable) limitations for valid data collection?
To address these questions, we briefly introduce the example of conspiracy-related communication as one form of contentious politics. We then compare the architectures and features relevant to data collection, access regimes, and aspects that stand out for particular use cultures in the context of conspiracy theories for a set of platforms and digital communication venues. We highlight the challenges that researchers face when collecting data in a cross-platform and cross-time comparative design and juxtapose the data collection possibilities organized by the differentiation between actor- and content-based strategies. We discuss the potentials and limitations of these approaches, considering differences in platforms, temporal dynamics, and cultural embeddings, as well as several layers of equivalence. The discussion highlights crucial insights into designing data collection strategies in multidimensional comparative studies that extend beyond our example of conspiracy-related content to a wider range of digital political communication.

Comparative data collection: The example of conspiracy-related content
When social movements and political entrepreneurs strive to articulate their claims and mobilize political contention, social media platforms and a vast array of digital communication technologies provide a central infrastructure for the organization of contentious politics. One form of content that can contribute to political mobilization is the narration and distribution of conspiracy theories. Conspiracy theories have been defined as the "proposed explanation of some historical event (or events) in terms of the significant causal agency of a relatively small group of persons - the conspirators - acting in secret" (Keeley, 1999, p. 116).
The key characteristics of conspiracy theories are an intentionalism suspected behind events and actions, a dualism between a small group of conspirators and those who are affected, and the secrecy in which connected actions and processes occur (Baden & Sharon, 2021; Barkun, 2013; Butter & Knight, 2020; Mahl, Schäfer, & Zeng, 2023). Not every form of public conspiracism or communication linked to or labeled as conspiracy theory exhibits all defining characteristics. Particularly in public debate, ostracizing an actor group as a conspirator or an explanation as a conspiracy can serve many political functions. Hyzen and den Bulck (2021) argue that such strategies are used to denigrate opposing actors and undermine prevailing institutions, values, and beliefs. Harris (2023, p. 6) sees their political utility in their "counterofficial nature." In this respect, conspiracy theories are central parts of contentious practices and are closely linked to populist communication (Bergmann, 2018). With the concept of conspiracy-related content, we refer to the full public communication on and about (alleged) conspiracy theories in various communication venues, including their narrations, counter-narrations, and debunking, as well as neutral observation forms.
As is the case with political contention studies in general, research on conspiracy-related content online has often focused on single established social media platforms, such as Twitter (Graham, Bruns, Zhu, & Campbell, 2020; Mahl, Zeng, & Schäfer, 2021) and Facebook (Bruns, Harrington, & Hurcombe, 2020). Others examine platforms facilitating secure spaces for the coordination of contentious actions (Herasimenka, 2019) and the development of conspiracy narrations within specific communities, such as on 4chan, 8kun (Tuters & Hagen, 2020; Tuters, Jokubauskaitė, & Bach, 2018), Reddit, Gab, and Telegram (Busbridge, Moffitt, & Thorburn, 2020; Garry, Walther, Mohamed, & Mohammed, 2021; Zeng & Schäfer, 2021). In addition, studies often concentrate on singular events in a specific national context or particular actor groups (Bevensee & Ross, 2018; Knight, 2008; Wilson, 2017). This is no surprise, as conspiracy-related content is particularly difficult to detect, all the more so when automated text collection and analysis are involved. In principle, this is due to the inherent blurriness of the concept and the difficulty of distinguishing talk about possible conspiracies from, as Baden and Sharon (2021, p. 90) call it, "conspiracy theories proper." Even when specific, ex ante-defined conspiracy theories are studied, their appearances can often be ambiguous, and the explicitness of related talk can vary depending on the discourse context, such as the platform's communication styles and community norms (Baden & Sharon, 2021). Subcultural milieus, such as those found on 4chan and 8kun, demonstrate their belonging to a community by consciously using insider abbreviations, floating signifiers, and slang that are difficult to detect and understand from the outside (Frischlich, Schatto-Eckrodt, & Völker, 2022; Nissenbaum & Shifman, 2017; Tuters & Hagen, 2020). Outside observations of conspiracist interpretations, e.g., in traditional news media, will likely rely on different denominators, depending on news outlets' characteristics and journalistic styles (Bruns, Hurcombe, & Harrington, 2022). Expressions might also change over time as conspiracy narratives are adapted to latch on to and integrate current developments and crises. They might vary in terms and explicitness across countries, as conspiratorial thinking and narratives are differently embedded and accepted across time and different cultures (Barkun, 2013). What is more, the actors spreading conspiracy theories are likely to be less institutionalized, thus making it more difficult to detect and classify them and to collect relevant information across platforms and time.
Many of these characteristics apply to the actors in and the content of digitally mediated political contention in general. The discussion that follows thus addresses the challenges of and approaches to data collection in a more general sense. However, the case of conspiracy theories offers a prime example, as it combines general and specific challenges and helps illustrate solution strategies.
To systematize our discussion of data collection strategies, we broadly differentiate between two approaches (Heft & Buehling, 2022). The first is an actor-based approach in which scholars use known actors or accounts identified a priori as access points to gather their communication. In the case of conspiracy-related communication, these are often actors or sites that have been linked to conspiratorial or otherwise problematic content on blacklists, fact-checking sites, or prior research (e.g., on alternative media; Rooke, 2021), or ideological entrepreneurs, such as Jordan Peterson or Alex Jones (Hyzen & den Bulck, 2021). The second type is the content-based approach. Studies following this approach often focus on one or more a priori known conspiracy theories and use case-specific key terms and hashtags (e.g., #5GCoronavirus, #Pizzagate) (Graham, Bruns, Zhu, & Campbell, 2020; Leal, 2020) or more encompassing dictionary-based procedures, such as the computational dictionary for the study of right-wing populist conspiracy discourse (RPC) by Puschmann, Karakurt, Amlinger, Gess, and Nachtwey (2022), to construct the data corpus of a study. The extent to which one of these strategies or a combination of both is viable for comparative data collection from the vast array of platforms and communication venues online depends not only on a study's aim but also on several platform characteristics, which are discussed in the following section.
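The difference between the two approaches can be illustrated with a minimal sketch. It assumes that posts are already available as simple in-memory records; the `Post` type, its field names, the seed account, and the key terms are hypothetical and chosen for illustration only, not taken from any platform's data model or from the studies cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    # Hypothetical minimal record; real platform exports carry far more fields.
    platform: str
    author: str
    text: str


def select_by_actor(posts, seed_actors):
    """Actor-based: keep posts authored by a priori identified accounts."""
    seeds = {a.lower() for a in seed_actors}
    return [p for p in posts if p.author.lower() in seeds]


def select_by_content(posts, key_terms):
    """Content-based: keep posts containing at least one key term or hashtag."""
    terms = [t.lower() for t in key_terms]
    return [p for p in posts if any(t in p.text.lower() for t in terms)]


posts = [
    Post("twitter", "seed_account", "Weather is nice today"),
    Post("twitter", "other_user", "They are hiding it #Pizzagate"),
    Post("facebook", "other_user", "Lunch photos"),
]

actor_sample = select_by_actor(posts, ["Seed_Account"])
content_sample = select_by_content(posts, ["#pizzagate", "#5gcoronavirus"])
```

In this toy data, the actor-based sample contains an off-topic post by the seed account, while the content-based sample contains a post by an account that no seed list would have covered. The two samples therefore tend to differ in both composition and noise.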

Platform characteristics and their consequences for data collection
For comparative studies across platforms and communication venues, thorough insights into platforms' general architectures (Bossetta, 2019) and their ways of structuring content and enabling access through various features are paramount (Pearce et al., 2020), as these fundamentally shape data collection possibilities and limitations. The platform architecture defines the form of communication infrastructure that is established. The content- and actor-related characteristics of platforms and online media determine the units of analysis that are possible, how content can be organized and found, and how individual pieces of information are accessible (Table 1). While outlining platform specifics and how they enable or restrict data collection in general, we acknowledge that each platform is distinct and that architectures and access regimes can change considerably over time. Furthermore, we can only highlight some relevant aspects for data collection across the board, while specific sampling strategies (e.g., based on engagement measures) or possible levels of comparative data analysis (Rogers, 2019, Chapter 10) are beyond the scope of our discussion.
Discussion boards enable a content-based search within and across specific boards and subreddits, such as the subreddit r/conspiracy analyzed by Samory and Mitra (2018). When it comes to content-related characteristics, the full context of a particular post within these boards can usually be evaluated only if the preceding discussion is known. Therefore, one off-topic post could mislabel an entire thread and introduce significant noise to the dataset. Researchers must decide whether the whole board, specific threads, the full thread, or certain thread elements constitute the relevant unit of analysis. In terms of actor-related characteristics, 4chan users can start and contribute to discussions without registering, and upload text and images anonymously. On Reddit, no user account is required to access most content, which can be found through a general search or the selection of specific subreddits. Reddit also offers a high anonymity level for users, as registration requires no personal data and thus allows multiple and unvalidated personas by the same individual (Prakasam & Huxtable-Thomas, 2021). Thus, while platform features allow open access to various content forms, linking this content to identifiable actors is prevented by the platforms' policies and features that afford high anonymity. Accordingly, actor-based approaches to data collection are not feasible within discussion boards. Content-based approaches, however, are viable if the content is accessed live or if an archive provides external access.

Networked social media
Social networking platforms, such as Facebook, and micro-blogging platforms, such as Twitter and Gab, offer a network-based interaction architecture in which communication is organized in an interconnected way. These platforms provide infrastructures through which users can upload and disseminate content via a recognizable identity (profile). Overall, the communication infrastructure is built on persistent identities, as usage always requires some form of registration and self-presentation via usernames or descriptions, resulting in appearances as identifiable personas (Frischlich, Schatto-Eckrodt, & Völker, 2022; Jasser, McSwiney, Pertwee, & Zannettou, 2023). At the content level, the main analysis units comprise original posts (Facebook, Gab) or tweets (Twitter), which are self-contained in distributing their meanings. Several forms of comments inscribed in forward, reply, or retweet functions are also possible. These communication forms can but need not be interconnected, and they do not necessarily follow a sequential logic but can also take place simultaneously. Depending on the requirements of persistent personalization, which rather enhances pseudonymity in some instances (Frischlich, Schatto-Eckrodt, & Völker, 2022), several features enable data collection based on the actors and their content as the analysis units. For example, Bruns, Harrington, and Hurcombe (2020, p. 15) use the search query "(covid,corona,virus,epidemi,pandemi) AND (5 g)" to collect Facebook posts related to the dissemination of COVID-19/5G conspiracy theories while classifying actors spreading this content as part of their data analyses. However, these platforms also afford users ways to limit the findability and accessibility of content and user information through choices in privacy settings (Frischlich, Schatto-Eckrodt, & Völker, 2022; Jasser, McSwiney, Pertwee, & Zannettou, 2023). As a result, the study by Bruns, Harrington, and Hurcombe (2020), for example, was limited to public spaces on Facebook, as content from closed groups and private profiles cannot be collected.
Overall, data collection on Facebook and Twitter primarily relies on application programming interfaces (APIs), which are open to researchers upon request. While Twitter at the time of writing supports a full-archive search, the Facebook API "CrowdTangle" limits data access to public pages and groups. As for Gab, scholars either scraped the platform (Fair & Wesslen, 2019) or used APIs (Jasser, McSwiney, Pertwee, & Zannettou, 2023; Zannettou et al., 2018) to enable large data collections.

Publishing-oriented platforms and online media
Format-oriented publishing platforms, such as YouTube, or the vast array of online alternative and legacy news media offer a broadcast-style interaction architecture in which articles (online media) or videos (YouTube) stand on their own, not requiring direct relations to prior content.
At the content level, the main units of analysis can consequently be described as self-contained. However, users can refer to articles or videos through comment sections. On YouTube, content can generally be found by searching for specific terms or actors who can be identified by their registered user accounts. For example, Allington and Joshi (2020) use an actor-based approach to data collection, starting from the account of David Icke, an actor frequently described as a conspiracy theorist, and collecting his videos and related comments from his account. As for online media websites, while individual articles published on a website may or may not be linked to a specific author, research regularly assumes that the output of media websites (articles) represents the medium as a whole. Some format-oriented publishing platforms, such as YouTube, allow for both content- and actor-based approaches to data collection, while in the case of online media websites, a hybrid approach is necessary, as the actor (the analysis unit, i.e., the particular medium) must be defined in advance to create or assess a searchable corpus of website content.
Overall, format-oriented publishing platforms offer content that is regularly openly accessible, although commercial online media may sometimes restrict access through paywalls or membership-based access regimes. The feasibility of data collection is then highly dependent on the availability of archived content. For instance, while YouTube offers a search function similar to that of the front end, querying a variety of alternative and legacy media requires the availability of comprehensive archives, each with its own data quality and reach limitations (Blatchford, 2020). Approaching this challenge with a hybrid strategy of combining actor- and content-based data collection reveals the limitations of each website in terms of the availability of permanently archived content, native search functions, or APIs (Freelon, 2018).

Hybrid platforms
Telegram's communication infrastructure, composed of channels and chat groups, renders it a hybrid in our framework. Chat groups within Telegram afford room-based threaded discussions between uniquely identifiable users. Channels, on the other hand, show characteristics of broadcast-oriented platforms in which posts can be attributed to the channel's administrators and sometimes also provide a discussion option for readers.
Content in public groups and channels is generally openly accessible but can only be found if the name of the channel or group is known or an access link is available. Telegram user profiles neither contain information about the user's other posts in channels or chat groups nor reveal information about other chat groups in which the user is active (except if one's own account is a member of the same group as the focal user).
An actor-based collection strategy for Telegram would require abstraction from individual users and call for the identification of whole chat groups and channels as actors. Schulze et al. (2022) have identified three exemplary channels related to QAnon that were used to collect all publicly available posts from these channels. In the case of chat groups, the resulting disregard for a multitude of speakers would need to be justified. Lacking an adequate search function, content-based approaches require a hybrid strategy similar to that used for online media. That is, a full sample of previously discovered actors (channels or chat groups) needs to be collected, which can then be queried for their content in a second step.
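The two-step hybrid strategy can be sketched as follows. This is a minimal illustration under stated assumptions: the `archive` dictionary stands in for whatever access route a study actually uses, and the channel names, messages, and key term are invented. No real Telegram interface is called.

```python
def collect_channel(channel_name, archive):
    """Step 1: retrieve every available post of a previously discovered channel.

    `archive` is a hypothetical stand-in for the access route; here it is a
    plain dict mapping channel names to lists of message texts.
    """
    return [(channel_name, text) for text in archive.get(channel_name, [])]


def query_corpus(corpus, key_terms):
    """Step 2: search the locally stored full sample for key terms."""
    terms = [t.lower() for t in key_terms]
    return [(ch, text) for ch, text in corpus
            if any(t in text.lower() for t in terms)]


archive = {
    "channel_a": ["Q drop discussion tonight", "Server maintenance notice"],
    "channel_b": ["Recipe of the day"],
}
# Channels must be known in advance; an undiscovered or vanished channel
# simply contributes nothing to the corpus.
known_channels = ["channel_a", "channel_b", "channel_missing"]

corpus = [row for ch in known_channels for row in collect_channel(ch, archive)]
hits = query_corpus(corpus, ["q drop"])
```

The sketch makes the dependency explicit: the content query in the second step can only find what the actor list in the first step has already brought into the corpus.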

Time-related challenges
Assessing temporal dynamics is crucial for obtaining reliable data and valid results. Previous research has shown that access to digital trace data diminishes over time (Buehling, 2023; Schatto-Eckrodt, 2022; Walker, 2017), reducing data quality and possibly impairing subsequent results. The general evolution of issues impedes the content-based detection of specific content over time. This effect is amplified in research on contentious political communication, such as the propagation of conspiracy theories, which are altered and recomposed throughout their life cycles. Furthermore, the actors involved in such communication might leave the public arena for a variety of reasons (Sillaber, Chimiak-Opoka, & Breu, 2013). In the following, the temporal complications requiring consideration are grouped at the platform, individual user, and issue context levels, which are mutually intertwined. These challenges might occur either because of temporal changes at the time of content creation or the time lag between content creation and data collection.

Platform and medium level
At the platform level, data content changes and deterioration are determined by factors rooted in platform governance and architecture. Every social media platform differs in content quality standards, moderation practices, and enforcement capabilities, which are themselves context and time specific. A sophisticated framework of acceptable and unwanted user behaviors has been implemented on most platforms over time (Gorwa, 2019), although differences may arise from various platform-internal and -external factors. Platforms' codes of conduct can be enforced through content de-amplification, content deletion, and temporary or permanent user bans (Gorwa, 2019).
While content moderation is constantly evolving, most changes are enforced retroactively. Therefore, the time of data collection significantly influences subsequent results. Previously salient content and influential actors become invisible after content moderation, unless they have been previously archived. Consequently, in cross-platform studies, platforms might appear to differ in their historical contents when the only real difference is the retroactive enforcement of varying codes of conduct.
Researcher access to social media data is also limited by the platforms' terms of service, materialized in the API access granted. The access options and information granularity enabled are platform specific and time dependent themselves (van der Vlist, Helmond, Burkhardt, & Seitz, 2022). Changes in API governance can subsequently impair intertemporal within-platform data comparability (Ho, 2020) and complicate cross-platform comparability. Similarly, researcher access to online news content is limited by access to and the completeness of databases. Because of the challenging endeavor of archiving online news, existing commercial databases such as Factiva yield inconsistent or incomplete records (Blatchford, 2020). Collecting online news data is challenging, requiring human intervention and the use of database combinations to obtain a more complete representation of the content studied (Blatchford, 2020).
Platforms also differ in inscribing content persistence in their technical architectures. While content persistence is afforded on most social media platforms, content creation and data collection on 4chan are shaped by ephemerality. The platform's automated deletion of threads that decline in activity or become too old implies that posts can only be streamed with a native API. Studies relying on retrospective data collection, for example, to trace the genealogy of conspiracy theories back to 4chan (De Zeeuw, Hagen, Peeters, & Jokubauskaite, 2020), are only possible via third-party archives. Researchers reluctant to use unverifiable third-party archives are impelled to set up a custom archival pipeline and forgo historical data (Tuters, Jokubauskaitė, & Bach, 2018).
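What a custom archival pipeline has to do at its core can be illustrated with a small sketch: repeated snapshots of the live board are merged into a persistent store so that posts survive after the platform prunes them. The snapshot format, the post identifiers, and the integer timestamps are assumptions made for illustration; the sketch does not call any real 4chan interface.

```python
def merge_snapshot(archive, snapshot, seen_at):
    """Add posts from one live snapshot to a persistent archive.

    archive:  dict post_id -> {"text", "first_seen", "last_seen"}
    snapshot: dict post_id -> text, as visible on the board at `seen_at`
    """
    for post_id, text in snapshot.items():
        if post_id in archive:
            # Already stored: only update when it was last observed live.
            archive[post_id]["last_seen"] = seen_at
        else:
            archive[post_id] = {
                "text": text,
                "first_seen": seen_at,
                "last_seen": seen_at,
            }
    return archive


archive = {}
merge_snapshot(archive, {101: "opening post", 102: "reply"}, seen_at=1)
# By the second snapshot, posts 101 and 102 have been pruned from the live board.
merge_snapshot(archive, {103: "new thread"}, seen_at=2)
merge_snapshot(archive, {103: "new thread", 104: "reply"}, seen_at=3)

live_now = {103, 104}
only_in_archive = sorted(set(archive) - live_now)
```

Posts that appear and disappear between two snapshots are never captured, so the snapshot interval bounds the completeness of such an archive, and nothing posted before the first snapshot can be recovered.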

Issue context
When studied over time, the words identifying a topic under research, or the predefined composition of communicators, can change partly independent of platform and individual user actions. While this is true for most issues in online communication, this dynamic becomes particularly clear in the collection of conspiracy-related content. Discourse about a specific conspiracy might change its focus, evolve, and absorb adjacent topics in the course of the legitimization strategies brought forward by its proponents. For example, the QAnon conspiracy theory relies on historic anti-Semitic narratives and more recent conspiracy theories, such as Pizzagate, while constantly being updated by so-called Q Drops (Garry, Walther, Mohamed, & Mohammed, 2021).
Scholars applying an actor-based data collection strategy might encounter dynamics in actor compositions, caused by either a natural dynamic of visibility in the discourse (e.g., Rogers, 2020) or by the actors' choices to leave the public arena. If not considered, this could lead to data undersampling in periods when predefined actors' communication was not prevalent, although there was a discourse about the topic. A naturally evolving turnover of dominant speakers becomes even more drastic when actors delete their profiles on a platform after withdrawing from a debate or retiring from their careers, as this might imply a retroactive deletion of all of their past communication (Bachl, 2018).

Individual user level
At the individual user level, platform and time dependence also manifests in social media posts' content and ephemerality. Previous studies have shown that voluntary post deletion is highly prevalent, especially if the post involves a topic deemed socially undesirable, such as bullying, profane language, and intoxicant use (Almuhimedi, Wilson, Liu, Sadeh, & Acquisti, 2013). User-driven content moderation in the realm of a personal social media page (Gagrčin, 2022) can also result in content ephemerality. Walker (2017) shows not only that data quality is inversely correlated with the time lag between social media content creation and data collection but also that ephemerality in political communication depends on the contentiousness of the topics discussed. As an explanation for this, Bastos (2021) proposes the possible regret of posting subprime-quality content. Neubaum and Weeks (2023) recognize message ephemerality as an affordance, allowing individuals to voice political opinions they perceive as possibly harmful to themselves if archived forever. In this respect, ephemerality can be used in a political strategy of deliberate provocation and polarization through contentious, manipulated, or illegal content (Münch, 2021). Buehling (2023) shows that message ephemerality resulting from post deletions in conspiratorial chats potentially biases computational content and social network analysis results. Individual user behavior on specific platforms also changes as a result of platform governance, as the use of certain words might trigger content deletion or account bans. The same topic might undergo a change in characteristic terms as users adapt to and evade content moderation efforts by using deliberate misspellings or dog whistles (Moran, Grasso, & Koltai, 2022).
Users might further adapt their posting behaviors to the structural content ephemerality built into platforms' architectures. On 4chan, for example, users circumvent automated message deletion by self-archival via specific post structures, such as general posts (Tuters, Jokubauskaitė, & Bach, 2018), which need to be considered in data collection.
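Where a study holds both an early collection wave and a later recollection of the same posts, the data loss discussed in this section can be quantified per platform or topic. The following is a minimal sketch under that assumption; the function name, the grouping variable, and the toy identifiers are hypothetical, and the sketch is not the procedure used in the studies cited above.

```python
from collections import defaultdict


def attrition_by_group(first_wave, second_wave_ids):
    """Share of first-wave posts no longer retrievable in a later wave.

    first_wave:      iterable of (post_id, group) pairs, e.g. group = platform
    second_wave_ids: set of post ids still retrievable at recollection
    """
    total = defaultdict(int)
    missing = defaultdict(int)
    for post_id, group in first_wave:
        total[group] += 1
        if post_id not in second_wave_ids:
            missing[group] += 1
    return {g: missing[g] / total[g] for g in total}


first_wave = [
    (1, "platform_a"), (2, "platform_a"), (3, "platform_a"), (4, "platform_a"),
    (5, "platform_b"), (6, "platform_b"),
]
still_available = {1, 2, 3, 5, 6}

rates = attrition_by_group(first_wave, still_available)
```

Such rates indicate how much content has vanished but not why. Deletions by users, removals by moderators, and account bans look identical in this comparison unless the platform reports a reason.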

Equivalent data collection across platforms and time
Cross-platform studies not only better align with the double interrelated nature of digital communication; they also enable the assessment of platform architecture influences on communication and mobilization patterns themselves (Matassi & Boczkowski, 2023; Pearce et al., 2020). However, comparability between and across platforms is a severe issue, as the same objects might not be available or might have different meanings across platforms (Rogers, 2017a, 2019). In addition, the time-related and cultural characteristics described above lead to a multidimensional comparative setting (see Figure 1), which considerably impedes data collection.
Comparative research more generally approaches this challenge with the concept of functional equivalence (Kolb, 2002; Wirth & Kolb, 2004): the objects or units do not have to be equal across several system contexts, but "the functionality of the research objects within the different system contexts must be equivalent" (Wirth & Kolb, 2004, p. 88), that is, provide a "common basis of the comparison" (Kolb, 2002, p. 4). Scholars distinguish between construct, item, and method equivalence. Construct equivalence refers to the theoretical construct of interest and whether it can be considered equivalent across several systems, such as platforms or cultures. Item equivalence considers whether single items, such as data collection search terms, lead to equivalent content across contexts and how this can be ensured. Method equivalence refers to the entire research process, namely, to an equivalent selection of analysis units (sample, e.g., actor and content units), application of the research instruments (e.g., codebooks and dictionaries), and procedures at the administrative level (Kolb, 2002; Wirth & Kolb, 2004). Using the example of conspiracy-related research and the framework of actor- and content-based approaches (see Section 2), in the following, we discuss data collection strategies to enhance equivalent data collection for cross-platform and platform-comparative studies, as well as for time- and culture-sensitive studies.

Actor-based strategies
Whether an actor-based approach relying on a priori defined actors as units and starting points of data collection across platforms is possible depends fundamentally on platform features and user choice. It requires that platforms provide persistent identification mechanisms, that user choices enable open access, and that communication units are clearly attributable to an identifiable actor. The attributability of content to individual actors can also differ in precision. On platforms such as YouTube and Facebook, individual author information is available per communication unit (e.g., a video or post). In online media, units such as articles are regularly attributed to the medium, even though individual articles may be written by different authors.
Comparative research across platforms with an actor-based strategy is thus only viable for non-anonymous platforms at the individual or aggregate level (e.g., a full medium and a full chat). In terms of data access, the platforms would need to enable a search per actor, and for online media, the sites would have to provide archived content, or other archives would have to be available for search and data collection. Regarding the example of conspiracy-related communication, an actor-based strategy for data collection would mean that group-based platforms, such as 4chan, with their limited actor identifiability, are excluded; this is difficult to justify, though, given these platforms' relevance for this type of content.
Regarding construct equivalence, the challenge then is to select actors who, across platforms, time, and cultural contexts, represent functionally comparable conspiracy-related actors. For the individual-level analysis in distinct cultural contexts, one strategy can be to select the same single actors who are active with accounts on several platforms in the same time span, which also facilitates comparability at the method level. This selection will generally be most viable for actors with a higher degree of institutionalization, such as party actors or politicians, because they deliberately enact their public voices through several but identifiable accounts, often also directly referencing their various platform-specific online appearances in their communication and cross-posting their content to maximize reach and impact (Bossetta & Schmøkel, 2023). In the field of conspiracy-related communication, however, ideological entrepreneurs (Hyzen & den Bulck, 2021) also tend to self-brand in a recognizable way using their clear names or brand pseudonyms, and they cross-mention their appearances on several platforms or directly link to their various accounts across platforms.
However, in the context of political contention, we must consider that actors in this field are likely less institutionalized, more heterogeneous, and more difficult to detect across platforms, time, and cultures. A second general option is to resort to actor types and pursue a design in which the actors chosen represent the same characteristics and systemic functions across platforms, time, or cultures. This could be, for example, actors who share a comparable functional role (e.g., as hyperpartisan media actors or online influencers), a comparable position in a shared field (e.g., based on activity or engagement metrics, such as mentions; or based on network metrics), and other meta-data-based similarities (McNerney et al., 2022).
With respect to item and method equivalence, functionally comparable data collection would profit from actor content that is persistent and accessible historically, enabling comparisons across actors and time. Our overview has shown that content persistence is a considerable challenge that can go as far as entire actor accounts vanishing as a result of platform governance or individual user decisions. However, the actor-based strategy could be particularly fruitful in carving out different communication styles and forms of conspiracy-related content at different communication venues.

Content-based strategies
For content-based strategies, we focus on approaches that use case-specific keywords, denominators (e.g., hashtags), or comprehensive dictionaries for data collection. Keywords are words with purposive meaning that act "as the key to a cipher or code" (Rogers, 2017b, p. 82, following the New Oxford American Dictionary). Dictionaries consist of keyword sets that can be accompanied by a set of rules (van Atteveldt, Welbers, & van der Velden, 2019). As Rogers (2017b, p. 83) highlights, keywords can be "parts of programmes, anti-programmes or efforts at neutrality," which should be considered when designing queries and dictionaries. For our example of conspiracy-related content, this means that keywords need to equally capture content contributing to the narration of conspiracy theories (programs), content that challenges these narrations or contributes to counter-narration and debunking (anti-programs), and content that neutrally relates to both.
These content-based strategies must always be adjusted to account for platforms' architectures and use cultures. At the conceptual level, the question is which terms are key to, for example, a specific conspiracy-related discourse and the extent to which these keywords differ across platforms and time and require adjustments for equivalent data collection. While keyword and dictionary construction is fundamental, the influences of keyword selection and validation are far less acknowledged and lack standardization (Mahl, von Nordheim, & Guenther, 2022). Studies on conspiracy theories often use a small or event-specific set of keywords, often without explicit validation (Bruns, Hurcombe, & Harrington, 2022; Starbird, 2017; Zeng & Schäfer, 2021).
Aiming for a broader collection of content across platforms and time, studies need to acknowledge the different communication styles and use cultures on various platforms and the potential changes in relevant terms across time and language areas. This is particularly relevant in the context of conspiracy-related content, which has been shown to evolve across time, depending on recurring events and the platform-specific and cultural embeddings of contentious narratives.
Recent approaches put more emphasis on dictionary creation, expansion, and validation, such as the dictionary coherence, augmentation, validation, and analysis (CAVA) approach developed by van Atteveldt and Chan (2022); this approach allows researchers to construct data-driven dictionaries by determining words semantically similar to preselected keywords as measured through word vector representations (Bojanowski, Grave, Joulin, & Mikolov, 2017). The CAVA approach also offers means of validating a dictionary (construct equivalence). An example of a validated dictionary for conspiracy-related research is the RPC-Lex (Puschmann, Karakurt, Amlinger, Gess, & Nachtwey, 2022), consisting of 10,829 unique keywords for the study of right-wing populist conspiracy (RPC) in German-language texts.
To adapt dictionary development for cross-platform research, we propose that dictionary validation and expansion should be based on platform-specific corpora. This accounts for platform-specific use cultures and potential differences in keywords representing the same construct, albeit with likely differences in the share of programs, anti-programs, and neutrality and differences because of platform-specific styles. This can be achieved through a workflow starting with a theoretically defined seed dictionary, which is computationally expanded on the basis of relevant keywords extracted from platform-specific text samples. The equivalence of cross-platform data collection at the construct level can then be pursued by combining the platform-based expanded and validated dictionaries into a single dictionary to be used across all data corpora.
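The workflow described above (seed dictionary, platform-specific expansion, combination into one dictionary) can be illustrated with a minimal sketch. It rests on several assumptions that are not part of the study: simple co-occurrence counts stand in for trained word vector representations, tokenization is naive whitespace splitting, and all function names, the similarity threshold, and the toy corpora are hypothetical. It does not implement the CAVA approach, and the manual validation of candidate terms that would precede the merge is omitted.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(docs, window=2):
    """Map each token to a Counter of the tokens seen within `window` positions."""
    vectors = defaultdict(Counter)
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenization (assumption)
        for i, token in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand_dictionary(seed, docs, threshold=0.9, window=2):
    """Add every corpus token whose context vector is close to a seed term's."""
    vectors = cooccurrence_vectors(docs, window)
    expanded = set(seed)
    for term in seed:
        if term not in vectors:
            continue  # seed term does not occur in this platform's sample
        for candidate, vector in vectors.items():
            if cosine(vectors[term], vector) >= threshold:
                expanded.add(candidate)
    return expanded


def combine_platform_dictionaries(seed, corpora, threshold=0.9, window=2):
    """Expand the seed separately per platform, then merge into one dictionary."""
    combined = set(seed)
    for docs in corpora.values():
        combined |= expand_dictionary(seed, docs, threshold, window)
    return combined


# Toy, invented text samples for two hypothetical platforms.
corpora = {
    "platform_a": ["the plandemic is a hoax", "the scamdemic is a hoax"],
    "platform_b": ["the plandemic was planned", "the psyop was planned"],
}
combined = combine_platform_dictionaries({"plandemic"}, corpora, threshold=0.9)
print(sorted(combined))  # ['plandemic', 'psyop', 'scamdemic']
```

In this toy case, each platform sample contributes a different term that shares the seed term's context, and the merged dictionary contains both. With real data, the threshold and the vector model would need to be chosen and validated per corpus.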
If the research aim involves understanding variations across multiple dimensions (e.g., in the narration of conspiracy theories that have been highlighted for their time dependency and cultural differences), the approach can also account for semantic changes in concepts over time and national or cultural contexts. This can be ensured by conducting dictionary development and expansion based on keywords derived from time-, language-, and platform-specific text samples.
Validating single keywords and their translations in different cultural contexts is crucial to ensure item equivalence (Lind, Eberl, Heidenreich, & Boomgaarden, 2019). Rather than relying solely on keyword translation, the data-driven keyword expansion approach proposed here can capture new and comparatively equivalent keywords that additionally reveal language patterns and nuances in different cultural contexts.
Finally, method equivalence entails decisions to be made on the best possible equivalence of analysis units, starting from the question of the extent to which a submission, post, video, or article is functionally equivalent. While a process of dictionary development as described above should foster equivalence in the research instrument, whether it can be applied in the same way depends on platform-specific data collection possibilities, such as differences in search functions per platform (e.g., enabling an unlimited dictionary or limiting the number of keywords), and the database accessible for data collection. Depending on successful context-specific adaptation, content-based strategies could be particularly viable for multidimensional comparative studies across platforms, time, and cultural contexts.
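One practical consequence of differing search functions can be sketched as follows: the same dictionary is submitted as a single query where the number of keywords is unlimited and split into several queries where it is capped. The function name, the `OR` operator, the phrase quoting, and the limit of two terms are illustrative assumptions; actual query syntax and limits differ per platform and would have to be taken from each platform's documentation.

```python
def build_queries(keywords, max_terms=None, operator=" OR "):
    """Turn one dictionary into query strings that respect a per-query keyword limit.

    max_terms=None models a search function that accepts an unlimited dictionary.
    """
    if max_terms is not None and max_terms < 1:
        raise ValueError("max_terms must be at least 1")
    # Deduplicate and sort so that the batches are reproducible across runs.
    terms = sorted(set(keywords))
    # Quote multi-word phrases (hypothetical syntax; platforms differ).
    terms = [f'"{t}"' if " " in t else t for t in terms]
    if not terms:
        return []
    if max_terms is None:
        return [operator.join(terms)]
    return [
        operator.join(terms[i:i + max_terms])
        for i in range(0, len(terms), max_terms)
    ]


dictionary = {"plandemic", "scamdemic", "deep state"}
print(build_queries(dictionary))               # one query, no limit
print(build_queries(dictionary, max_terms=2))  # split into batches of two
```

Because a single post can match keywords from more than one batch, results collected with split queries would need to be deduplicated before they are compared with results from a platform that accepted the full dictionary in one query.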

Conclusion
Taking the example of conspiracy-related communication online as one form of contentious politics, this study examined the methodological and practical challenges of equivalent data collection for multidimensional comparative studies across different platforms, time, and cultural embeddings. Interest in cross-platform research has grown recently for its capacity to enrich the theoretical understanding of how platform-specific characteristics and their appropriation influence political communication (Bossetta, 2019; Matassi & Boczkowski, 2023) and its more encompassing approach to a networked digital information ecology (Zannettou et al., 2018). Theoretically, the concepts of digital architecture (Bossetta, 2019) and affordances (Evans, Pearce, Vitak, & Treem, 2017) provide frameworks to facilitate comparative endeavors and help ascertain the platform features and functionalities that can be considered functionally equivalent.
Comparatively less attention has been paid to the practical problems of data access and data collection online from a cross-platform perspective as, for example, inscribed in platforms' access regimes and dependent on the availability and structures of archived data (but see Burgess & Matamoros-Fernández, 2016; Pearce et al., 2020; Rogers, 2017a).
In addition, ensuring equivalent data collection in comparative studies becomes more challenging when multiple dimensions (e.g., platform, time, and language) are considered. In analogy to other computational problems arising with high-dimensional data (Hastie, Tibshirani, & Friedman, 2009), this phenomenon could be called the curse of dimensionality of comparative data collection. From our insights into the distinct architectures and use cultures of several platforms and communication venues and into time-related data collection challenges, we derived a discussion of actor- and content-based strategies, exemplified with the case of collecting conspiracy-related content online, and of how they can be adjusted to platform peculiarities, cultural embeddings, and temporal dynamics. To tackle the curse of dimensionality, our discussion highlights the following crucial points for designing comparative data collection in studies on political communication and contention. The guiding step for the theoretical and methodological design of a multidimensional comparative study is to decide whether the study is interested in what Rogers (2017a) calls medium research, that is, the influence of platform architectures on practices and content; in the social phenomenon (e.g., the distribution of conspiracy-related content); or in both. Then, the theoretical construct to be measured must be defined in consideration of the potential differences between platforms and communication venues. We have exemplified this with the construct of conspiracy-related content, which is based on definitions of conspiracy theories but considers programs, anti-programs, and neutral content.
Sensitivity toward distinct use cultures on different platforms at different times and in specific contexts (e.g., language areas) and what these factors might mean for the study object (Rogers, 2017a) then needs to be translated into the study design and data collection. This includes a discussion of (a) the important platforms for the research question at hand; (b) the unit of analysis that is most relevant to the research question (e.g., an actor- or content-based unit or a combination); and (c) the analysis units that are actually available across all dimensions.
Based on this, a data collection and query strategy can be designed. We have highlighted some potentials of actor- and content-based strategies, especially ways to adjust dictionary-based strategies to platforms, use cultures, and time. Approaches relying on a small set of keywords or specific hashtags need to consider that these terms can be deliberately used when referring to a certain topic but can also be deliberately avoided (Massanari, 2017). Computationally expanded dictionaries must also ensure that they can capture the concept fully and equivalently across platforms, time, and context, as suggested by the platform-, time-, and language-dependent expansion procedure in our example.
However, functional equivalence does not mean neutralizing all platform differences; it means finding the analysis units that best represent a certain function (see also Pearce et al., 2020). This includes the fact that different platforms might necessitate different sampling strategies, both with respect to the units of analysis (actors, tweets, etc.) and the overall sampling strategy (from the full archive, sub-domains, etc.). Overall, research needs to reflect on how limitations and differences might influence the collected data and impede comparisons or can be accounted for through the analytical strategy. This extends to the challenges posed by digital media data's ephemerality and differences in data accessibility for platforms and websites, demanding individual researcher decisions and trade-offs concerning costs, time expenditure, and completeness. Furthermore, large-scale social media data collection is always restricted to publicly available data, leaving vast amounts of discussion and mobilization occurring in closed communication channels in the dark (Burgess & Matamoros-Fernández, 2016). These limitations can be partly circumvented through emerging data collection designs relying on data donations or data sharing by researchers, both of which bear their own legal and ethical pitfalls (Assenmacher et al., 2022; van Driel et al., 2022).
All these aspects call for careful pre-studies and validations of the data collection process and a thorough discussion of its limitations. The increasing multidimensionality and complexity of collecting digital communication data remain challenges, particularly in the area of political contention. However, we consider current efforts to establish social media archives, enable data donations and sharing, and develop computational methods, such as the proposed content-based approach for the computational expansion of dictionaries derived from platform-, time-, and language-specific corpora, as promising avenues to further facilitate comparative studies in a constantly changing and fluid field.
context of the Weizenbaum Institute]). Dominik Schindler acknowledges support from the EPSRC (PhD studentship through the Department of Mathematics at Imperial College London) and from the Weizenbaum Institute (Research Fellowship).

Notes on contributors
Annett Heft heads the research group Dynamics of Digital Mobilization at the Weizenbaum Institute for the Networked Society, Berlin, and is a senior researcher at the Institute for Media and Communication Studies, Freie Universität Berlin. Her main research fields are the comparative study of political communication in Europe, with an emphasis on digital public spheres and right-wing communication infrastructures, transnational communication, as well as quantitative research methods and computational social science.
Kilian Buehling is a research associate at the Freie Universität Berlin and the Weizenbaum Institute for the Networked Society. He studied economics at Technische Universität Dresden, and his research interests are information diffusion, transnational communication processes, and network dynamics of anti-democratic and conspiracy theory groups.
Xixuan Zhang is a research associate and doctoral candidate at the Freie Universität Berlin. She studied media and political communication at Freie Universität Berlin and media informatics at Technische Universität Berlin. Her research interests are in the fields of digital activism, online discourse, and the networked public sphere. She also researches the application of computational methods, ranging from text mining to network analysis and machine learning.
Dominik Schindler is a PhD candidate in applied mathematics at Imperial College London. Drawing from network science and machine learning, he has developed methods for the computational social sciences and analyzed conspiracy-related online communication as a research fellow at the Weizenbaum Institute Berlin. He obtained his MSc in applied mathematics from Imperial College and his MA degree in digital media from Goldsmiths, University of London.
Miriam Milzner is a research associate and doctoral candidate at Freie Universität Berlin and the Weizenbaum Institute for the Networked Society. Prior to her PhD, she received her MA degree in journalism and communication studies from Freie Universität Berlin. Her research interests include the logics of digital information infrastructures, the hybrid media sphere, and the dynamics of (coordinated) information manipulation.