The Trouble with Sharing Your Privates: Pursuing Ethical Open Science and Collaborative Research across National Jurisdictions Using Sensitive Data

ABSTRACT Open science and effective collaboration both require the sharing of data between researchers. This is especially true for computational methods, as the technical complexity and heterogeneous data sources often require collaboration between researchers in different institutions and jurisdictions. Many data sources, however, cannot be shared openly because of copyright law and contracts such as terms of service. These regulations can be complex, sometimes untested in case law, and vary between countries and over time. This paper details our experiences in conducting international comparative research on very large collections of news items from multiple countries. We set out the main problems we have encountered and some short-term approaches we have used to mitigate some of these problems. We end by listing some additional long-term actions that will advance our research community’s ability to collaborate on computational research using sensitive data.

This paper aims to share some of the experiences and lessons learned over the past years of conducting large-scale multilingual text analysis involving collaborators in different physical locations. The authors are senior researchers from the US and two different European countries who have been working together for several years in a longitudinal and comparative automatic text analysis project using a variety of nonpublic data sources. Moreover, the authors have conducted various other collaborative text analysis projects using sensitive data and worked toward creating text analysis tools and setting up large-scale databases of political texts. We offer what follows in the hope that others can benefit from our several combined years of frustrating experiences, frequent mistakes, and small victories in the realm of cross-national collaborative text-analytic work using sensitive data.

Open Science Requires Open Data, But Sensitive Data Isn't Easily Shared
Collaboration on computational research, even more than on traditional research, requires the sharing of both data sets and the tools and scripts used to process these data. In larger research consortia, the technical, theoretical, and local expertise to conduct specific analyses are often distributed among different teams. Tools developed in one team need to be adapted and validated for different contexts, often requiring both linguistic and substantive expertise from other teams. In particular, cross-national comparative studies generally require close collaboration between teams that may be located in different national jurisdictions. In these cases, being able to freely share materials is crucial for efficient collaboration and for ensuring the validity of measurements.
Sharing research materials is also a crucial part of open science (Klein et al., 2018; Miguel et al., 2014; Nosek et al., 2015). Data transparency is a key part of the move toward transparent and open science (Nosek et al., 2015), which improves the reproducibility and robustness of scientific findings by allowing other scholars to inspect and verify published results (Klein et al., 2018). Moreover, sharing data can improve the efficiency of science by allowing greater re-use and more collaboration and specialization (Van Atteveldt et al., 2019).
In many cases, however, researchers are not free to share sensitive data, which for the context of this discussion we will define as data that originates from third parties and that cannot be openly shared due to legal, proprietary, or regulatory barriers. There are many types of sensitive data that might be of interest to political communication researchers, such as social media data and survey or experimental data identifying individuals, which can be covered by privacy regulations such as the EU General Data Protection Regulation (GDPR). However, given the scope of this contribution we will focus here on sharing politically-relevant media content, such as entire newspaper articles or complete transcripts of television news broadcasts.
There is an important caveat that must be underscored for what follows: none of the authors has legal expertise, and our understanding of the relevant legal landscape may be partial or flawed. We offer no legal advice here, but merely convey our imperfect understanding of legal barriers that define boundaries we have been working to uphold while still advancing research projects using sensitive data.

Barriers to Sharing Political Media Content
There are at least three factors hindering the sharing of full text political media content: copyright laws; contract laws/terms of service; and how these regulations vary between countries and over time.
Copyright law is a temporary monopoly on the distribution of text and other creative works intended to allow authors to make money from their creations. The copyright on most media content is owned by the company that owns the media outlet. Contract laws and terms of service come into play when media content holdings are obtained from library sources and from commercial content database providers such as LexisNexis or Factiva, which offer media content under general campus licenses or other contractual agreements. Contract laws and terms of service also come into play when researchers scrape media content holdings from Internet sites or when researchers enter into formal agreements with media content owners. It is also important to emphasize that even when researchers located within a particular country operate within that country's established copyright laws, contract laws and terms of service might still limit their ability to share (or even use) the news content that they have access to. When there is a conflict between the usage terms imposed by contract and by copyright law, in many cases it is not clear which set of laws should prevail. For example, a researcher in the United States might use LexisNexis news data within "fair use" exemptions in copyright law, but still be in violation of the campus contract that allowed the researcher access to LexisNexis in the first place. In addition, sensitive material that has been acquired by one project team cannot in most cases be physically transferred across campus or national boundaries for use by other teams in a collaborative project. Commercial content providers might require identical licenses held by the collaborating institutions for material to be used in more than one place, and in many cases researchers will have no access to or understanding of the terms of the license that their campus is bound by.
Finally, legal boundaries differ across jurisdictions, are constantly evolving, and are often poorly understood by campus authorities. Copyright law and contract law differ not only between the US and EU, but also between EU member states. For example, while the United States has a "fair use" exception to copyright law that is generally favorable to researchers, the concept of "fair use" has no direct parallel in the European Union even though some research uses may be exempted from copyright restrictions. Moreover, legal barriers that govern sharing of sensitive data are constantly evolving (e.g., the Digital Millennium Copyright Act and the new EU Copyright Directive). While US copyright law has broader fair use exemptions, it also has harsher (statutory) damages; and while the new EU Copyright Directive has specific exemptions for academic use, these provisions are untested and need to be written into national law before taking effect, potentially introducing more variation and uncertainty. On terms of service, there are differences between jurisdictions, for example in whether "click through" agreements or terms of service simply posted on a website constitute a valid contract. This issue was at the heart of the prosecution (and subsequent suicide) of Aaron Swartz under the US Computer Fraud and Abuse Act for violating the JSTOR terms of service by automatically downloading large amounts of scientific articles from their archive.

The Need for Finding Solutions within Legal Boundaries
The complexities mentioned above can result in two extreme reactions. Many individual researchers and research groups (especially in computer science) simply ignore legal barriers to scrape and use the data they want. However, if researchers make a mistake or get caught in a copyright violation, the consequences will usually fall more heavily on their institutions than on themselves. Institutional risk managers therefore often take the other approach and minimize risk by disallowing any sharing of sensitive data altogether.
Neither approach is satisfactory from a data transparency perspective, however, as even researchers who gather data without permission cannot share these in an open way. Thus, it is important to develop practices that foster research transparency within the legal and ethical bounds set by relevant regulations. Part of the solution consists of things that can be done right now by individual research groups, while a fuller solution will depend on long-term advocacy and education efforts by the research community as a whole.
Some compliant solutions to consider for the short term can include:

Publishing or Sharing Small Validation Sets
Depending on the exact data source and terms of service, it might be allowed to publish a small sample of sensitive material to allow for analyses to be checked by others. Although this can be used to check rule-based analyses such as dictionaries, it is less useful for validating or improving corpus analysis and supervised or unsupervised analyses such as scaling or topic modeling because these methods' results vary strongly with the size of the dataset.

Publishing Metadata
In some cases, such as online news or data from Twitter or LexisNexis, other researchers might be able to retrieve the same data used by an originating research team given the identifying metadata such as URL, status ID, or article headline and date. This can be cumbersome and costly, however, if large amounts of data are needed to duplicate or validate analyses. Moreover, the persistence of the remote data can often not be guaranteed, jeopardizing the future reproducibility of research. Archiving an encrypted version of the sensitive data could solve that problem, but will presumably run into the same regulatory hurdles.
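A minimal sketch of what a published metadata record could look like is given below. It keeps only the identifying metadata named above (URL, headline, date) and adds a cryptographic fingerprint of the text, so that another team that retrieves the article independently can check whether they obtained the same content. The fingerprint, the field names, the helper functions, and the example article are our own illustrative assumptions rather than an established standard, and whether publishing such a fingerprint is permitted under a given license remains a legal question.

```python
import hashlib
import json


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def metadata_record(article: dict) -> dict:
    """Keep identifying metadata plus a fingerprint; drop the full text."""
    return {
        "url": article["url"],
        "headline": article["headline"],
        "date": article["date"],
        "sha256": fingerprint(article["text"]),
    }


def matches(record: dict, retrieved_text: str) -> bool:
    """Check whether independently retrieved text matches the record."""
    return record["sha256"] == fingerprint(retrieved_text)


# Hypothetical article, invented purely for illustration.
article = {
    "url": "https://example.org/news/123",
    "headline": "Parliament debates budget",
    "date": "2019-03-01",
    "text": "Parliament  debated the budget\non Friday.",
}

record = metadata_record(article)
print(json.dumps(record, indent=2))
```

A fingerprint of this kind only confirms exact matches after whitespace normalization. If the remote source later edits the article, even trivially, the check fails, which illustrates rather than solves the persistence problem noted above.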

Meeting Face to Face
If data cannot cross institutional barriers, the easiest way to collaborate on data can be to physically come together. For example, many campus licenses governing sensitive news data have exceptions for visiting scholars who are physically on campus premises. This is not a solution, however, for sharing data with external parties, if data from multiple institutions need to be analyzed jointly, or if financial or agenda constraints prevent meeting long enough to do substantial analytical work on the data. Growing concerns about the environmental impact of travel within the academic community might also hinder face-to-face meetings when long-distance flights would be required to bring collaborators together.

Remote Access to Computer Systems
In what is essentially a virtual form of meeting face to face, it might be possible to give collaborators remote access to the relevant computer system on which sensitive data are physically stored. Depending on exact agreements governing how particular forms of sensitive data might be used, this might overcome some legal problems of sharing data within trusted collaborations. It is generally difficult to prevent remote users from downloading the data, however, so this might pose risks to the institution and does not solve the problem of sharing the data outside trusted collaborations.

Non-consumptive Research
One solution pioneered by the HathiTrust Research Center (https://www.hathitrust.org/htrc) is a set of non-consumptive research practices that strive to offer remote access to sensitive data without allowing users to abuse this access. One possibility is to allow users to run limited analyses and queries that extract features like sentiment scores from the data via an API or web interface, returning only the sentiment scores or enough textual context to validate the scores without allowing access to the full text. Another possibility, called a Data Capsule (Zeng et al., 2014), allows a user to develop an analysis with limited access to the raw text, and then send the developed algorithm to a secure system where it can be run over the full corpus of data without giving the researcher any direct access to the data. In such a system, only limited and nonsensitive data can be returned to the researcher. Although these solutions can solve the problems of data access, they can be difficult to implement and cumbersome to use as they force the user to develop and validate analyses in an unfamiliar environment and possibly with different tools than they normally use.
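The first possibility can be illustrated with a small sketch of a query function that would run on the system holding the data. It scores a document against a word list and returns only the score plus a few short context windows, never the full text. The toy lexicon, the limits, and all function names are hypothetical choices for illustration; this is not the HathiTrust Research Center's implementation or API.

```python
import re

# Hypothetical toy sentiment dictionary (an assumption for illustration).
LEXICON = {"good": 1, "great": 1, "bad": -1, "crisis": -1}
MAX_CONTEXT_WORDS = 3  # words of context returned on each side of a match
MAX_SNIPPETS = 5       # cap on the number of context windows per document


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def score_document(doc_id, text):
    """Run next to the data; return only derived, limited output."""
    tokens = tokenize(text)
    score = 0
    snippets = []
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            score += LEXICON[tok]
            lo = max(0, i - MAX_CONTEXT_WORDS)
            hi = i + MAX_CONTEXT_WORDS + 1
            snippets.append(" ".join(tokens[lo:hi]))
    return {
        "doc_id": doc_id,
        "score": score,
        "n_tokens": len(tokens),
        "snippets": snippets[:MAX_SNIPPETS],  # full text is never returned
    }


# Hypothetical document, invented purely for illustration.
result = score_document(
    "a1",
    "The minister called the budget a great success "
    "despite the crisis in housing.",
)
print(result)
```

Even with such limits, many overlapping context windows could in principle be stitched back into longer passages, so a real service would presumably also need to restrict how many queries a user can run and how much text each one may return.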

Longer-Term Solutions Will Require Advocacy and Education
As surveyed above, research teams can already take a number of steps to mitigate the problems of data sharing between teams and within the community in general. However, none of these options is without problems. Better ways to share sensitive data will depend on concerted and long-term actions by the field as a whole. We think action should be taken in at least two directions.

Work Toward Better Data Agreements
The sharing of data between researchers does not pose a direct threat to the business models of news producers, and content archives like LexisNexis have a direct interest in ensuring that it remains possible to conduct and publish valid research with their materials. Thus, there might be scope to collaborate with these parties to work toward data agreements that allow for sharing raw data within open science principles. For this to happen, it is important that standard language be adopted to give these parties the confidence that we as scientists will not abuse these data or distribute them in ways that might hurt their profitability. This could include standard embargo periods (the value of news depreciates quickly) or a standardized procedure where local scientific archives such as DataVerse, the Inter-University Consortium for Political and Social Research (ICPSR), or the GESIS - Leibniz Institute for the Social Sciences in Germany could give access to data on signing an appropriate agreement. The governments that ultimately fund most of our research could also be convinced to pressure or regulate these data owners to allow scientific research by distributed teams, for example, as part of press subsidies or privileges or as a way to regulate the role of social media in the news ecosystem.

Work Toward Collaborative Open Data Sets
Progress in fields such as computational linguistics has profited tremendously from "shared tasks" where different research groups work on the same data set, such as the GigaWord dataset maintained by the Linguistic Data Consortium (

Conclusion
Given the complexities and uncertainties surrounding data sharing, there may be no single authority at your local institution who actually knows what the rules are, and many institutions lack a clear authority for making decisions about sharing of sensitive data. In our own experience, university lawyers tend to be very restrictive in their interpretations of the legal situation in order to minimize legal risks for the institution. Continued access to important library collections might be jeopardized, future ability for other researchers to use the same collections might be restricted, and lawsuits might be filed with substantial legal costs and large financial awards if cases go to trial. For these and other reasons, the easiest (and, from a risk management standpoint, the best) decision is for the institution to simply say "No" and forbid the sharing of data altogether. A crucial factor in the success of cross-national collaborations using sensitive data might therefore be differences in "risk culture" among the collaborating institutions and their willingness to support researchers where legal boundaries are unclear and constantly evolving. From our experience, therefore, it would be beneficial for universities and research institutions to create and empower "data ombudspersons" to whom researchers could turn in cases of doubt and who, crucially, could talk to each other directly across institutional boundaries. Data ombudspersons could shift the frame from risk prevention to research promotion and would relieve researchers from acquiring half-baked legal knowledge themselves. As a research community, we need to find a sustainable and ethical solution for the problem of sharing our privates. We can't get away with ignoring the problems (as some currently seem to prefer, risking the entire research community's long-term access opportunities for individual short-term publication gain), but we also can't just give up open science.
This will require increased awareness among researchers of the need to ethically share sensitive data, but also a concerted effort by the field to develop the relevant practices and standards required to do so, and to convince funding agencies, data owners, and regulators of the need to change agreements and regulations in ways that allow for open science practices to flourish.

Disclosure Statement
No potential conflict of interest was reported by the authors.