Exploring CommonsDB’s role in AI training data
Participants at a recent CommonsDB workshop explored whether the registry can support lawful, traceable data for AI training

On 25 September, CommonsDB hosted a half-day technical and legal workshop with a small group of external stakeholders to examine one question: could a registry of Public Domain (PD) and openly licensed works help AI developers access lawful, trustworthy data for AI training?
With participants joining from Creative Commons, Pleias, the Common Crawl Foundation, and the Royal Library of the Netherlands, the session focused on testing assumptions, gathering feedback and understanding how CommonsDB might intersect with real-world AI workflows.
Why AI and why now?
A recurring challenge in AI development is sourcing data that is both legally sound and traceable. CommonsDB is being built to offer machine-readable declarations about PD and openly licensed works, backed by verifiable credentials. The workshop explored whether this infrastructure could also surface AI preference-related information – not as a new direction for the project, but as a potential extension of its core purpose.
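As a rough illustration of what machine-readable declarations backed by verifiable credentials could look like, the sketch below shows one possible shape for a registry entry. The field names, identifier value, and credential structure are illustrative assumptions, not CommonsDB's actual metadata schema.

```python
# Illustrative only: a hypothetical shape for a registry declaration.
# Field names, the identifier value, and the credential structure are
# assumptions for this sketch, not the actual CommonsDB schema.
declaration = {
    "asset_id": "ISCC:EXAMPLECODE",  # content-derived identifier (placeholder value)
    "rights_statement": "https://creativecommons.org/publicdomain/mark/1.0/",
    "declared_by": "did:web:example-library.org",  # the actor making the declaration
    "credential": {
        "type": "VerifiableCredential",
        "issuer": "did:web:example-library.org",
        "proof": "signature omitted in this sketch",
    },
}
```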
Setting the scene
Paul Keller (Open Future) opened the workshop by outlining the foundations of CommonsDB. He stressed that it is a registry rather than a repository, and that beginning with Public Domain and openly licensed works has created a safe space to experiment without the complications of disputed rights. The project is exploring how registry technology can strengthen the connection between digital assets, rights information, and the actors making declarations.
Sebastian Posth (Liccium) then introduced the technical layer, focusing on ISCC identifiers and resilience against metadata loss as content circulates. Wikimedia Sverige and Europeana shared early experiences preparing data for registration. With 200,000 declarations already in the registry, CommonsDB is working towards five million declarations by July 2026.
A pre-release version of the CommonsDB Explorer – the forthcoming user interface for the registry – was demonstrated using multiple variants of Vermeer’s View of Delft, prompting discussion about matching, perceptual similarity, and interpretation.
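ISCC identifiers are derived from the content itself rather than stored alongside it, which is what makes them resilient when descriptive metadata is stripped as files circulate. The sketch below is not the ISCC algorithm: it is a minimal average-hash example, assuming Pillow is installed, that illustrates the underlying idea that a fingerprint computed from pixel data survives metadata loss, and that near-duplicate variants (such as different scans of the same painting) stay close together.

```python
# Minimal perceptual-hash sketch (NOT the ISCC algorithm) illustrating
# content-derived identification: the fingerprint depends only on pixel data,
# so stripping EXIF/XMP metadata does not change it, and re-encoded or
# lightly edited variants remain close in Hamming distance.
from PIL import Image  # assumes Pillow is installed

def average_hash(path: str, size: int = 8) -> int:
    """Downscale to a small grayscale grid and threshold each pixel against the mean."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | int(p > mean)
    return bits

def hamming_distance(a: int, b: int) -> int:
    """Lower distance suggests perceptually similar images, e.g. variants of one work."""
    return bin(a ^ b).count("1")

# Example (hypothetical files): compare two scans of the same painting.
# print(hamming_distance(average_hash("scan_a.jpg"), average_hash("scan_b.jpg")))
```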
Questions of scope and trust
The first discussion quickly widened. Participants asked whether the registry could handle text, video, and audio alongside images, and what would happen when rights information changes or conflicts arise. Paul Keller clarified that CommonsDB assumes “good faith” declarations and focuses on low-risk content, not adversarial disputes. João Quintais (Institute for Information Law) noted that litigation over Public Domain and openly licensed (PD/OL) content remains highly unlikely.
The group also explored different trust models, from decentralized platforms such as Wikipedia to institutional statements and verifiable credentials. CommonsDB was consistently framed as a working prototype to understand what registries can – and cannot – do.
Turning to AI training
The discussion then moved to AI. Paul Keller introduced the idea of embedding AI preference signals within CommonsDB declarations. The registry’s metadata schema already contains a placeholder, but no standard for expressing such preferences exists yet.
Participants compared location-based approaches to expressing AI preferences, such as robots.txt, with asset-level signals. (For a discussion of the difference between location-based and asset-based approaches, see this Open Future policy brief.) Anastasia Stasenko (Pleias) noted that training datasets shared on platforms such as Hugging Face often lack rights information at the document level, making provenance and permissions difficult to trace. There was consensus that CommonsDB’s asset-level approach could add value, and that provenance and lawful sourcing are gaining importance in both Europe and the US.
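To make the contrast concrete: a robots.txt-style signal is tied to the location where a file happens to be served, while an asset-level signal is attached to the work itself and can travel with it across platforms. The sketch below is illustrative only; the preference field and its vocabulary are hypothetical, since, as noted above, no standard for expressing such preferences exists yet.

```python
# Location-based: a robots.txt-style directive governs paths on one host.
# It says nothing about copies of the same files hosted elsewhere.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /collections/
"""

# Asset-level (hypothetical): the preference sits in the declaration for the
# work itself, keyed to a content-derived identifier, so it follows the asset
# wherever it is redistributed. Field name and vocabulary are assumptions.
asset_declaration = {
    "asset_id": "ISCC:EXAMPLECODE",
    "rights_statement": "https://creativecommons.org/publicdomain/mark/1.0/",
    "ai_training_preference": "allow",  # placeholder vocabulary, no standard exists yet
}
```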
Registry models under discussion
Two potential approaches emerged for handling AI preference information within the registry framework:
- Expose AI preference information directly through CommonsDB, alongside rights declarations; or
- Route AI preferences through a separate registry, maintained by another body.
The conversation focused less on choosing between the two models and more on the implications of either choice: sustainability, governance, interoperability, and how machine users would consume the data. There was general agreement that CommonsDB’s long-term utility will come through its APIs rather than its human-facing Explorer.
For a fuller treatment of the assumptions, roles, and trade-offs between these options, see the CommonsDB discussion paper on AI training preferences and opt-outs.
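As one hedged illustration of what machine consumption might look like under either model, the sketch below filters candidate assets through a registry lookup before ingestion. The endpoint URL, query parameter, and response fields are hypothetical, not a documented CommonsDB API; the point is only that an API-first registry lets a pipeline check rights and preference information per asset.

```python
# Hypothetical sketch of a machine user consuming registry data before training.
# The endpoint URL, query parameter, and response fields are assumptions for
# illustration, not a documented CommonsDB API.
import requests

REGISTRY_API = "https://registry.example.org/v1/declarations"  # hypothetical endpoint

def is_cleared_for_training(asset_id: str) -> bool:
    """Keep an asset only if the registry shows a clear declaration and no opt-out."""
    resp = requests.get(REGISTRY_API, params={"asset_id": asset_id}, timeout=10)
    if resp.status_code != 200:
        return False  # no declaration found: err on the side of exclusion
    record = resp.json()
    return (
        record.get("openly_licensed_or_public_domain", False)
        and record.get("ai_training_preference") != "disallow"
    )

# Usage (hypothetical): cleared = [a for a in candidate_ids if is_cleared_for_training(a)]
```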
Core value and next steps
Participants recognized CommonsDB’s main contribution as providing trustworthy, asset-level rights information for PD/OL works. AI training was viewed as adjacent, not central: any future support for AI preference signals should build on the existing registry model rather than redefine it.
The CommonsDB team will continue technical testing with institutional data and refine both the Explorer and API access. The workshop reinforced that while AI presents relevant use cases, the foundation remains clear, verifiable rights information – and that is where CommonsDB is focused now.