
Tackling The AI Bots That Threaten To Overwhelm The Open Web

from the overrunning-the-commons dept

It is a measure of how fast the field of AI has developed in the three years since Walled Culture the book (free digital versions available) was published that the issue of using copyright material for training AI systems, briefly mentioned in the book, has become one of the hottest topics in the copyright world, as numerous posts on this blog attest.

The current situation sees the copyright industry pitted against the generative AI companies. The former wants to limit how copyright material can be used, while the latter want a free-for-all. But that crude characterization does not mean that the AI companies can be regarded as on the side of the angels when it comes to broadening access to online material. They may want unfettered access for themselves, but it is becoming increasingly clear that as more companies rush to harvest key online resources for AI training purposes, they risk hobbling access for everyone else, and even threaten the very nature of the open Web.

The problem is particularly acute for non-commercial sites offering access to material for free, because they tend to be run on a shoestring, and are thus unable to cope easily with the extra demand placed on their servers by AI companies downloading holdings en masse. Even huge sites like the Wikimedia projects, which Wikimedia describes as “the largest collection of open knowledge in the world”, are struggling with the rise of AI bots:

We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.

Specifically:

Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.

A valuable new report from the GLAM-E Lab explores how widespread this problem is in the world of GLAMs – galleries, libraries, archives, and museums. Here’s the main result:

Bots are widespread, although not universal. Of 43 respondents, 39 had experienced a recent increase in traffic. Twenty-seven of the 39 respondents experiencing an increase in traffic attributed it to AI training data bots, with an additional seven believing that bots could be contributing to the traffic.

Although the sites that responded to the survey were generally keen for their holdings to be accessed, there comes a point where AI bots are degrading the service to human visitors. The question then becomes: what can be done about it?

There is already a tried and tested way to block bots, using robots.txt, a tool that “allows websites to signal to bots which parts of the site the bots should not visit. Its most widely adopted use is to indicate which parts of sites should not be indexed by search engines,” as the report explains. However, there is no mechanism for enforcing the robots.txt rules, which often leads to problems:

Respondents reported that robots.txt is being ignored by many (although not necessarily all) AI scraping bots. This was widely viewed as breaking the norms of the internet, and not playing fair online.

Reports of these types of bots ignoring robots.txt are widespread, even beyond respondents. So widespread, in fact, that there are currently a number of efforts to develop new or updated robots.txt-style protocols to specifically govern AI-related bot behavior online.
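To make concrete how robots.txt is supposed to work, here is a minimal sketch using Python’s standard-library `urllib.robotparser`. The rules and bot names below are invented for illustration; as the report stresses, nothing in the protocol forces a scraper to honor them.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules: the site asks one (hypothetical) AI
# crawler to stay out entirely, and all other bots to avoid its
# bulk-download area.
rules = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /bulk-downloads/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant bot checks before fetching; a non-compliant one simply doesn't.
print(parser.can_fetch("ExampleAIBot", "/images/cat.jpg"))      # False
print(parser.can_fetch("OtherBot", "/images/cat.jpg"))          # True
print(parser.can_fetch("OtherBot", "/bulk-downloads/all.zip"))  # False
```

The asymmetry is the whole problem: the file is purely advisory, so the check in the last three lines only happens if the bot’s authors choose to run it.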

One solution is to use a firewall to block traffic according to certain rules. For example, to block by IP addresses, by geography, or by particular domains. Another is to offload the task of blocking to a third party. The most popular among survey respondents is Cloudflare:

One [respondent] noted that, although they can still see the bot traffic spikes in their Cloudflare dashboard, since implementing protections, none of those spikes had managed to negatively impact the system. Others appreciated the effectiveness of Cloudflare but worried that an environment of persistent bot traffic would mean they would have to rely on Cloudflare in perpetuity.

And that means paying Cloudflare in perpetuity, which for many non-profit sites is a challenge, as is simply increasing server capability or moving to a cloud-based system – other ways of coping with surges in demand. A radically different approach to tackling AI bots is to move collections behind a login. But for many in the GLAM world, there is a big problem with this kind of shift:

the larger objection to moving works behind a login screen was philosophical. Respondents expressed concern that moving work behind a login screen, even if creating an account was free, ran counter to their collection’s mission to make their collections broadly available online. Their goal was to create an accessible collection, and adding barriers made that collection less available.

More generally, this would be a terrible move for the open Web, which has at its heart the frictionless access to knowledge. Locking things down simply to keep out the AI bots would go against that core philosophy completely. It would also bolster arguments frequently made by the copyright industry that access to everything online should by default require permission.

It seems unfair that groups working for the common good are forced by the onslaught of AI bots to carry out extra work constantly re-configuring firewalls, to pay for extra services, or to undermine the openness that lies at the heart of their missions. An article on the University of North Carolina Web site discussing how the university’s library tackled this problem of AI bots describes an interesting alternative approach that could offer a general solution. Faced with a changing pattern of access by huge numbers of AI bots, the library brought in local tech experts:

[Associate University Librarian for Digital Strategies & Information Technology] Shearer turned to the University’s Information Technology Services, which serves the entire campus. They had never encountered an attack quite like this either, and they readily brought their security and networking teams to the table. By mid-January a powerful AI-based firewall was in place, blocking the bots while permitting legitimate searches.

Stopping just the AI bots requires spotting patterns in access traffic that distinguish them from human visitors, so that the latter can continue their visits unimpeded. Finding patterns quickly in large quantities of data is something that modern AI is good at, so using it to filter out the constantly shifting patterns of AI bot access by tweaking the site’s firewall rules in real time is an effective solution. It’s also an apt one: it means that the problems that AI is creating can be solved by AI itself.
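The UNC article gives no implementation details, but the underlying idea of separating bots from humans by their access patterns can be sketched with a crude heuristic: scrapers tend to issue many requests at machine-like regular intervals, while human browsing is sparse and bursty. The thresholds and synthetic timestamps below are invented for illustration; a production system would use far richer features and learn them from live traffic.

```python
from statistics import pstdev

def looks_like_scraper(timestamps, min_requests=30, max_jitter=0.2):
    """Flag a client whose requests are both numerous and suspiciously
    evenly spaced. The coefficient of variation of the gaps between
    requests is near zero for metronomic, automated traffic."""
    if len(timestamps) < min_requests:
        return False
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap == 0:
        return True  # many requests in the same instant: clearly automated
    return pstdev(gaps) / mean_gap < max_jitter

# A scraper hitting the site exactly every 0.5 seconds, 60 times:
scraper = [i * 0.5 for i in range(60)]
# A human: a handful of requests at irregular intervals.
human = [0.0, 1.3, 1.9, 7.2, 8.0, 31.5, 33.1]

print(looks_like_scraper(scraper))  # True
print(looks_like_scraper(human))    # False
```

A classifier like this could feed a firewall’s block list continuously, which is the “tweaking rules in real time” loop described above; the hard part in practice is keeping the false-positive rate low enough that real visitors are never caught.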

Such an AI-driven firewall management system needs to be created and updated to keep ahead of the rapidly evolving AI bot landscape. It would make a great open source project that coders and non-profits around the world could work on together, since non-profits face a common problem, and many have too few resources to tackle it on their own. Open source applications of the latest AI technologies are rather thin on the ground, even if most generative AI systems are based on open source code. An AI-driven firewall management system optimized for the GLAM sector would be a great place for the free software world to start remedying that.

Follow me @glynmoody on Mastodon and on Bluesky. Republished from Walled Culture.


