MediaWiki Product Insights/Responsible Reuse

The Wikimedia projects are the largest collection of open knowledge in the world. This has made our knowledge infrastructure an invaluable destination not just for humans, but also for search and AI companies that access our content automatically as a core input to their products. With the rise of AI, demand for human-created content has grown exponentially, which in turn has led to an unsustainable increase in automated traffic to our sites.

This high volume of automated traffic is placing an increasingly unsustainable load on our infrastructure. We need to act now to re-establish a healthy balance, so we can dedicate our engineering resources to supporting and prioritizing the Wikimedia projects, our contributors and human access to knowledge.

Current situation

Since 2024, we have observed a significant rise in demand for Wikimedia's content via automated mechanisms that include scraping, APIs and bulk downloads. Most of the increase in traffic to our sites comes from scraping bots collecting training data for large language models (LLMs), which in turn enables products such as chat-based search engines and virtual assistants.

This expansion has placed a high load on our infrastructure, diverting time and resources that we need to support the Wikimedia projects, contributors and readers. At the same time, this reuse of content is happening largely without sufficient attribution, which is key to driving readers back to our projects and attracting new users to participate in the movement.

We want to explore the underlying needs represented by this increase in traffic and look at approaches for establishing sustainable pathways for developers and reusers to access knowledge content.

Key challenges

The broader situation created by a rise in automated traffic presents us with a number of challenges, which this work is intended to help address.

Disproportionate impact on our infrastructure

Whilst bots account for around 35% of pageviews, they are responsible for at least 65% of our most expensive traffic. This is because automated traffic disproportionately targets uncached pages, which must be served directly from our core data centres rather than from edge caches. Our content is free; our infrastructure is not.
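
To illustrate why a minority of pageviews can dominate backend cost, the following sketch works through the arithmetic with hypothetical cache-hit rates. The rates below are illustrative assumptions, not measured Wikimedia figures; the point is simply that crawlers walking the long tail of rarely viewed pages miss the edge caches far more often than human readers do.

```python
# Illustrative arithmetic: how 35% of pageviews can produce most of the
# backend (cache-miss) load, given hypothetical cache-hit rates.
bot_share = 0.35            # share of pageviews from bots (figure from the text)
human_share = 1 - bot_share

human_hit_rate = 0.97       # hypothetical: humans mostly read popular, cached pages
bot_hit_rate = 0.65         # hypothetical: crawlers walk the uncached long tail

human_misses = human_share * (1 - human_hit_rate)
bot_misses = bot_share * (1 - bot_hit_rate)

bot_backend_share = bot_misses / (human_misses + bot_misses)
print(f"Bots' share of cache-miss (backend) traffic: {bot_backend_share:.0%}")
# With these assumed rates, bots generate roughly 86% of cache-miss requests
# despite accounting for only 35% of pageviews.
```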

Constant disruption and high operational burden

This rapid increase in automated traffic results in constant disruption and an unsustainable workload for our infrastructure teams. Indiscriminate bots aggressively crawl not only wiki content but any URL they can find within our infrastructure, including beta environments and developer platforms, whilst often trying to mask their behaviour.

Difficulty distinguishing bots from legitimate traffic

It is increasingly difficult to distinguish legitimate, mission-aligned automated traffic from bots that actively try to circumvent rate limits and spoof their identity. Scrapers increasingly route requests through residential proxies to appear as human users, meaning that measures to reduce the load on our infrastructure can affect legitimate users and abusers alike.
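
The sketch below is a generic per-client token-bucket rate limiter, not Wikimedia's actual traffic controls; it is included only to make the limitation described above concrete. The rate, burst size and keying choice are assumptions for illustration.

```python
import time
from collections import defaultdict

RATE = 10    # tokens replenished per second, per client key (illustrative)
BURST = 50   # maximum bucket size (illustrative)

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(client_key: str) -> bool:
    """Return True if a request from `client_key` is within the rate limit."""
    bucket = _buckets[client_key]
    now = time.monotonic()
    # Refill tokens for the elapsed time, capped at the burst size.
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        return True
    return False

# If the key is the client IP address, a scraper rotating through thousands of
# residential proxy addresses never exhausts any single bucket, while many
# human readers sharing one institutional NAT address may be throttled instead.
```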

Widespread reuse without sufficient attribution

Search engines have long indexed and displayed wiki content in search results, which in turn has brought new readers to our sites and reinforced brand awareness. In contrast, organisations running aggressive scrapers do not always follow the requirement for clear attribution. Our mission relies on people finding us and joining as readers, contributors and donors.

Target outcomes

As we work to address these challenges, we are aiming for the following outcomes:

  1. Ensure preferential access for human and mission-oriented traffic
  2. Remain a discoverable and trusted source of human knowledge
  3. Provide an API ecosystem that meets current and future user needs
  4. Enable a base level of governance across automated access points

Ensure preferential access for human and mission-oriented traffic

We will ensure preferential access for humans, bots that are operated and relied on by the community, and other mission-oriented traffic like research. This includes minimizing impact on the community, whilst ensuring adherence to our policies and guidelines, as we work together to enable sharing of knowledge in ways that are sustainable for generations to come.

Remain a discoverable and trusted source of human knowledge

Our mission relies on people finding us to join as readers, contributors and donors. Attribution is essential to ensure the credibility of information and to recognise the work of volunteers. We need a cohesive approach to attribution, so that companies can help bring users back to support the resources they depend on, regardless of whether the access method is scraping, dumps or APIs.
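
As a concrete illustration of what consistent attribution can look like for a reuser, the sketch below fetches a page's canonical URL and latest revision timestamp via the public MediaWiki Action API and assembles an attribution line. The endpoint and parameters are standard Action API features; the attribution wording, the bot name and the licence string are illustrative assumptions rather than an official Wikimedia template.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {
    # The Wikimedia User-Agent policy asks automated clients to identify
    # themselves and provide contact details; this string is a placeholder.
    "User-Agent": "ExampleReuseBot/0.1 (contact@example.org)"
}

def attribution_for(title: str) -> str:
    """Build a human-readable attribution line for a wiki page."""
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "info|revisions",
        "inprop": "url",        # include the canonical page URL
        "rvprop": "timestamp",  # timestamp of the latest revision
    }
    data = requests.get(API, params=params, headers=HEADERS, timeout=10).json()
    page = next(iter(data["query"]["pages"].values()))
    return (f'"{page["title"]}" by Wikipedia contributors, {page["canonicalurl"]} '
            f'(revision of {page["revisions"][0]["timestamp"]}), '
            f'licensed under CC BY-SA 4.0.')

print(attribution_for("Earth"))
```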

Provide an API ecosystem that meets current and future user needs

Providing clear, well-documented APIs that are fit for modern patterns of use is core to meeting developer needs while protecting our infrastructure. We want to encourage the use of supported pathways so that we can improve the developer experience, ensure appropriate attribution, avoid undocumented endpoints becoming accidental APIs, and offer only what we can, and want to, sustainably support.
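
One example of what a supported pathway already offers is the Action API's maxlag etiquette parameter, which lets an automated client back off when the databases are under replication lag. The sketch below combines it with a descriptive User-Agent; the retry strategy and the client name are illustrative assumptions, not an official recommendation.

```python
import time
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "ExampleReuseBot/0.1 (contact@example.org)"}

def polite_query(params: dict, max_retries: int = 5) -> dict:
    """Query the Action API, backing off when the servers report high lag."""
    params = {**params, "format": "json", "maxlag": 5}
    for _ in range(max_retries):
        resp = requests.get(API, params=params, headers=HEADERS, timeout=10)
        data = resp.json()
        # When replication lag exceeds maxlag, the API returns an error with
        # code "maxlag" and a Retry-After header saying how long to wait.
        if data.get("error", {}).get("code") == "maxlag":
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        return data
    raise RuntimeError("gave up after repeated maxlag responses")

result = polite_query({"action": "query", "meta": "siteinfo"})
print(result["query"]["general"]["sitename"])
```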

Enable a base level of governance across automated access points

Improving our ability to distinguish one automated user from another is a prerequisite for ensuring fair use of our infrastructure. By enabling a base level of governance across access points, we gain the ability to enforce policies systemically, ensure licensing requirements are followed and, where appropriate, direct commercial users towards Wikimedia Enterprise.
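
The sketch below shows what identifiable, authenticated automated access can look like, assuming an OAuth 2.0 access token issued through the Wikimedia API Portal (api.wikimedia.org). The endpoint path, token handling and client name are illustrative assumptions, not a description of the final WE5.1 design.

```python
import requests

TOKEN = "..."  # access token from a registered developer account, elided here

resp = requests.get(
    "https://api.wikimedia.org/core/v1/wikipedia/en/page/Earth/bare",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "User-Agent": "ExampleReuseBot/0.1 (contact@example.org)",
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["title"])

# Because the request carries a token tied to a registered developer account,
# the operator can be rate-limited, contacted or directed towards Wikimedia
# Enterprise individually, rather than being indistinguishable from anonymous
# crawler traffic.
```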

Focus areas for 2025-26

WE5: Developers and reusers access knowledge content in curated pathways, ensuring the sustainability of our infrastructure and responsible content reuse.

  • WE5.1: Developer authentication and authorization
  • WE5.2: Evolve API offering
  • WE5.3: Attribution framework
  • WE5.4: Combat scraping