Debian Protects CI Data from LLM Scraping

Debian's continuous integration (CI) infrastructure has become a target for bots scraping data to train large language models (LLMs). The resulting load on Debian's web servers has forced the project to restrict public access to CI data.

The restriction was put in place to protect server resources and to keep the CI infrastructure available to Debian developers. Abuse of the open web by LLM scrapers is a growing problem affecting many organizations and open source projects.
