Towards shared infrastructure for public benefit access to platform data

Worth, Sophia; Stein, Jake; Simperl, Elena; Riley, Chris

doi:10.61557/LSGE6110


Wed May 20, 2026

With the accumulation of vast amounts of data by online platforms, there is a need for a reliable, sustainable infrastructure that supports necessary data access for public benefit initiatives. This access can support research into the role of platforms in society and their potential harms, and the development of mechanisms to intervene, as well as services and interventions that target social challenges.

In this write-up, we summarise four key lessons from a March 2026 workshop of experts on data access for public benefit initiatives, and from these outline three recommendations for future collaboration.

The four key lessons that we discuss are: that existing and emerging legislation is not sufficiently supporting the main methods public benefit actors are using for data access; that there is a need to showcase real-world use cases on data access in the public benefit to be learned from and coalesced; the importance of balancing the scalability of data access solutions against careful human oversight and tailoring of approaches; and the need for sustainable and robust pipelines due to an unpredictable access landscape.

Introduction

Platforms own and control data of enormous public value, yet the terms of access have repeatedly been set, and reset, largely according to commercial conditions rather than emphasising access for public benefit. This data helps companies develop platform functionalities, understand what content users are exposed to, and learn about users’ health, shopping behaviour, and much more. Much of this data also offers huge value for addressing real-world inequities and for wider public benefit. This includes studying platforms and holding them accountable for their real-world impacts, and supporting the development of a wider range of socially impactful services.

Recent regulatory developments have sought to advance these public benefit goals of data access. This includes legislation to support data portability, allowing people to selectively transfer their data from an organisation to themselves or to another organisation, under legislation such as the General Data Protection Regulation (GDPR) and the Digital Markets Act (DMA) in the EU. Further examples include the EU’s Digital Services Act (DSA), which directly mandates access to platform and search engine datasets for vetted researchers, individual access laws in California and Colorado, and similar proposals in countries such as Switzerland and Brazil. However, this nascent legal infrastructure has serious limitations in implementation, including unresolved substantive gaps and tensions between different frameworks, and it does not cover all important domains and geographies where data access is needed. This means that much existing access to platform data relies on permission from platform owners and depends on platform infrastructures.

Building durable data access infrastructure requires confronting this access deficit cooperatively among researchers, public policy advocates, and industry practitioners. The systems for data access must be resilient enough to absorb shifts in platform priorities, legislative mandates, and the evolving technical footprint of the platforms, search engines, and AI systems.

In March 2026, the Data Transfer Initiative and the Open Data Institute, as part of the CoCoDa project, convened a group of experts focused on the mechanisms for data access for public benefit at the Royal Society in London. Across three plenary sessions, researchers, regulators, advocates, and industry practitioners examined the practical scope of legal mandates for public benefit data access, took stock of the technical tools currently in use, and identified the gaps that remain to be addressed.

The workshop reached consensus on four primary approaches to platform data access that have been put to use: scraping and sockpuppet approaches, data donation uploads, data portability APIs, and, in the context of the EU, DSA Article 40 access requests.

No single approach reliably delivers the quality or volume of data that research requires, and that is likely to continue. Though newer legislation, notably including the EU’s DSA and DMA, represents genuine progress, researchers reported that in practice these rights fall short of what rigorous research demands. The same is true of the technical layer: there is a suite of solutions such as privacy-preserving multiparty computation, secure research environments, semantic data standards, and normalised data donation frameworks. These are each reasonably well-developed in isolation, but remain fragmented in practice, limiting their collective effectiveness.

What is needed most urgently across these contexts for public benefit data access is not better individual tools, but the assembly of these technologies into end-to-end data pipelines. Further, the community needs deliberate pairing of those pipelines with the relevant legal mandates and practical use cases that motivate the development of this infrastructure, and offer insights into technical and governance best practices.

Key Lessons from our Workshop

Here we outline our four main takeaways from the workshop session, and the core debates among participants.

Lesson 1: Existing and emerging legislation is not (yet) sufficiently supporting the main methods public benefit actors are using for data access.

Existing legal mandates, currently strongest in the EU, create a variety of rights for researchers and data subjects to access data from platforms, but legislation stops short of defining how requirements should be technically implemented, and regulation tends to be slow and patchy even when access requirements are clear. These recent experiences can provide insights both for future legislative agendas internationally and for understanding the needs of public benefit actors seeking data access from platforms.

At the workshop, researchers presented tools to manage and automate the data donation process, highlighting the difficulty of managing individual data subject preferences when sharing data for research, and the lack of direct means to access sufficiently anonymised data from all platforms.

Participant presentations highlighted contrasting approaches to accessing platform data. One participant introduced software designed to manage individual data donations, enabling participants to download their data from platforms and selectively remove or obscure sensitive elements before sharing. This approach yields relatively comprehensive datasets, but requires substantial effort in recruitment, coordination, and anonymisation. It also points toward the potential value of more direct platform access mechanisms that incorporate built-in privacy safeguards. This was seconded by a participant presentation highlighting the duplication that platforms must themselves perform when designing processes to deliver data through different channels for privacy access, portability, or research access.
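As a rough illustration of the "selectively remove or obscure" step described above, the following Python sketch shows how a donated export might be filtered before sharing. It is not the software presented at the workshop. The field names (`id`, `email`, `phone`, `username`), the helper names and the salted-hash approach are all assumptions made for the example, and real platform exports vary widely in structure. Note also that a salted hash is pseudonymisation rather than anonymisation, so further safeguards would still be needed.

```python
from __future__ import annotations

import hashlib

# Hypothetical field names; real platform exports differ in structure.
REMOVE_FIELDS = {"email", "phone"}      # dropped entirely
PSEUDONYMISE_FIELDS = {"username"}      # replaced with a salted digest


def pseudonymise(value: str, salt: str) -> str:
    """Replace a value with a truncated, salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def redact_record(record: dict, salt: str) -> dict:
    """Return a copy of one record with sensitive fields removed or obscured."""
    cleaned = {}
    for key, value in record.items():
        if key in REMOVE_FIELDS:
            continue
        if key in PSEUDONYMISE_FIELDS and isinstance(value, str):
            cleaned[key] = pseudonymise(value, salt)
        else:
            cleaned[key] = value
    return cleaned


def prepare_donation(export: list[dict], salt: str,
                     excluded_ids: frozenset = frozenset()) -> list[dict]:
    """Drop records the participant chose to withhold, then redact the rest."""
    return [
        redact_record(record, salt)
        for record in export
        if record.get("id") not in excluded_ids
    ]
```

In this sketch the participant's choices are expressed as a set of record identifiers to withhold, which keeps the selection step separate from the redaction rules that a research team might fix in advance.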

Research that relies on data obtained through exercising data subject access rights under the DMA and GDPR is further complicated by the trajectory of the EU GDPR Omnibus, which threatens to curtail approaches that rely on individual data donation, creating an environment of uncertainty.

Another presentation examined a method of accessing platform data via scraping interfaces directly with data listed in DSA Article 40 data access catalogues. While this route offers greater speed and removes the need for participant recruitment and data donation, the researcher demonstrated that the data provided was incomplete, omitting information visible in the platform interface. Taken together, the two approaches reveal a trade-off: donation-based methods offer depth but are slower and less anonymous, whereas API-based access is more efficient and privacy-preserving but limited in scope.

The workshop also highlighted the benefit of clear communication to platforms about the use and remit of relevant laws. It was noted that platforms may inherently treat data requests as a cost or liability. In the case of data portability requests, some data providers may seek to lock users into their own platforms by limiting data access. Similarly, DSA requests may lead to fears of exposing customer data or competitive advantage, leading to policies that avoid data disclosure wherever possible and require a high degree of justification on the part of researchers.

Ultimately, this divergence of incentives may appear irreconcilable, but it draws attention to a resource gap that regulators may focus on. Data access and portability rights under new EU platform regulation are promising, but data subjects and researchers do not have the resources necessary to assuage platforms’ defensive stance vis-a-vis customer data. This leads to two logical solutions.

First, this could inspire greater allocation of government resources to the administration of data access infrastructure. With its data access portal, the EU has already stepped further into the active administration of technical research infrastructure than ever before. This could be a sign to go further, by providing secure research environments or certifying specific research infrastructures (akin to health data system certifications).

Second, in the absence of greater government commitments, researchers may be forced to develop greater cooperation to pool technical resources or data. This could take the form of greater coalitions of researchers supporting a single request for larger research projects, or committing to single common data standards, and shared technical infrastructures.

Lesson 2: We can collectively do more to answer the question, “Data access for what?” - showcasing real-world use cases and documenting shared objectives and experiences for data access in the public benefit.

To build momentum and clarity, we need to collate success stories that highlight the areas where data access can offer benefit, along with details of the technical mechanisms used and the real-world challenges faced in achieving them. During the event, we heard about a number of cases, for example:

The use of social media data to study the extent of polarisation happening on digital platforms, or the challenges faced by young people in navigating filter bubbles;
The use of supermarket shopping data by health researchers to flag diagnostic markers for serious disease;
The use of Uber drivers’ data to develop aggregate analysis and pay auditing;
The use of banking data to enable research into the risks of problem gambling;
The use of mobile device data to help researchers gain a deeper understanding of the impact that digital devices have on young people;
The collection and linking of data from multiple online services with survey responses to understand people’s digital lives and how these relate to wider outcomes.

‘Data access’ can sound worrying in an era of rightful concern about data privacy and about corporate cooptation of personal data, of people’s attention, and of the systems that we live in, with people’s own data used as a tool. These concerns only grow when AI is brought into the mix. They must be held in tension with genuine opportunities for benefit, and must inform the development of mechanisms that are designed with public benefit, and protection from harms, at the centre.

With examples can come opportunities to join the dots in terms of common needs, clearer shared objectives, red lines, debates, and momentum for shared efforts, including examples for ‘bottom up’ or grassroots project initiation or governing approaches. During the workshop participants discussed how those who donate data need to understand the real world benefits of their participation in order to motivate their engagement - and acknowledged that policy initiatives require substantiated real world impact. Developing clearer examples of successes, and indeed failures, can help to achieve these objectives.

We are also interested in identifying whether alternative terminologies can help delineate forms of data access for public benefit reasons, and would be enthusiastic to identify different terms that are being used by organisations and researchers.

Lesson 3: Scalability of data access solutions is a priority to promote widespread researcher access given the breadth of work to be done, but this needs to be balanced with tailoring to specific data types and uses.

During the session, we discussed a multitude of factors that necessitate careful human oversight and tailoring of data access solutions to specific contexts. These include divergences in the data formats provided by companies, which require alignment with the systems used for analysis, and differences in the granularity of oversight needed, for example where consent mechanisms need to be carefully tailored due to the sensitivity of data sharing. Naturally, data access mechanisms must prioritise care and attention to ensure that data subjects are well protected and that the downstream impacts of projects are carefully considered.

At the same time, there are major benefits to draw from scalability and pooling resources to enable effective and efficient solutions. For example, we heard a lengthy discussion among participants on the inequalities associated with the ability to gain access to private data for public benefit, given that smaller institutions may lack the financial resources, technical know-how and established mechanisms to engage with technology companies through multi-round, potentially adversarial data access processes. This presents a major limitation on platform accountability. We heard proposals for shared institutions for public benefit data access mechanisms that can offer technical capability, provide expertise and support on ethical approval, and help manage the legal risks that can be associated with navigating platforms’ data access mechanisms in the public interest.

Other scalable solutions include standardised data formats and delivery pipelines for data access mechanisms, development of computational tools and computer mediated data access mechanisms, and mapping of data sets that may be available for access in order to avoid these becoming pools or graveyards of data that are underutilised. Further, mechanisms that help to connect researchers with significant pools of data, or intermediaries who can facilitate data access, and support in establishing suitable consent mechanisms, would be of significant help in supporting more efficient pathways to usable data.
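To make the idea of standardised data formats more concrete, the sketch below shows one way that exports from different platforms could be mapped onto a single shared record type before analysis. It is illustrative only: the two platforms, their field names (`time`, `event`, `video`, `ts`, `type`, `post_id`) and the `ActivityRecord` schema are hypothetical, and an agreed community standard would need far richer metadata, including consent and provenance.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ActivityRecord:
    """Hypothetical common schema shared by all adapters."""
    platform: str
    timestamp: datetime  # always stored in UTC
    action: str
    item: str


def from_platform_a(row: dict) -> ActivityRecord:
    # Assumed format: ISO 8601 timestamps with an explicit UTC offset.
    ts = datetime.fromisoformat(row["time"]).astimezone(timezone.utc)
    return ActivityRecord("platform_a", ts, row["event"], row["video"])


def from_platform_b(row: dict) -> ActivityRecord:
    # Assumed format: Unix epoch seconds.
    ts = datetime.fromtimestamp(row["ts"], tz=timezone.utc)
    return ActivityRecord("platform_b", ts, row["type"], row["post_id"])


# One adapter per source; supporting a new platform means adding one function.
ADAPTERS = {"platform_a": from_platform_a, "platform_b": from_platform_b}


def normalise_all(sources: dict[str, list[dict]]) -> list[ActivityRecord]:
    """Convert every row with its platform's adapter and sort by time."""
    records = [
        ADAPTERS[name](row)
        for name, rows in sources.items()
        for row in rows
    ]
    return sorted(records, key=lambda record: record.timestamp)
```

Keeping platform-specific parsing in small adapters means that when a platform changes its export format, only one function needs updating, while downstream analysis code continues to rely on the shared schema.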

Lesson 4: To be robust to a changeable landscape of access mechanisms from technology companies and regulated access routes, we need to think about sustainability of our technical pipelines for access.

Recent years have demonstrated that the landscape of data access mechanisms for protection of citizens can be choppy. Factors influencing this include the ‘deregulatory zeitgeist’ which undermines reliance on legal mechanisms, rapid developments in AI technologies which change the platforms and the data that they hold, defensive perspectives of companies towards data access mechanisms due to concern about loss of customers and bad news stories, and frequent changes to tools they provide for data access. This means that technical pipelines need to remain aware of these barriers, and sustainable in the face of them. This creates greater need for shared practices.

To ensure sustainability, such pipelines should also look to draw benefit from emerging tools such as generative AI that can help with historically inefficient data processing and management tasks; examining such questions requires continuous investment. Still, such approaches need to be embedded with care, due to valid questions over involvement of emerging technologies and the need to preserve social trust and informed consent.

Core recommendations

[Lessons 1, 2, 4] Build a repository of case studies on data access for the public benefit, that can be learned from and coalesced, and summarise the areas where this is most needed (which can be open to debate and scrutiny) to guide stakeholders including researchers, advocates, and policymakers.
[Lessons 3, 4] Build an overview and mechanisms to connect researchers and organisations to data access opportunities (datasets, data intermediaries, etc.) for public benefit.
[Lessons 2, 4] Continue to develop coalitions of stakeholders that are developing the technical and governance pipelines for data access in the public benefit, in order to seek opportunities for shared infrastructure and best practices. These coalitions should focus on building sustainable and robust approaches in the long term, in the face of numerous barriers and changes in the landscape of access.

Conclusion

We are grateful for the participation of all of the experts who attended our workshop, and helped us to develop a clearer agenda for data access infrastructure for the public benefit. We will continue to develop these perspectives, and to address the core lessons and recommendations in our work going forwards.

Please be in touch if you would like to attend future events, or have ideas for further collaboration, at [email protected] (for the Data Transfer Initiative) or [email protected] (for the Open Data Institute and CoCoDa project). You can follow the CoCoDa project on LinkedIn for further updates.
