
We’re Walling Off The Open Internet To Stop AI—And It May End Up Breaking Everything Else

from the how-open-is-it? dept

A longtime open internet activist recently asked me whether I’d reversed my position on internet openness and copyright because of AI. The question caught me off guard—until I realized what he was seeing. Across the tech policy world, people who spent decades fighting for an open, accessible internet are now cheering as that same internet gets locked down, walled off, and restricted. Their reasoning? If it hurts AI companies, it must be good.

This is a profound mistake that threatens the very principles these advocates once championed.

There are plenty of reasons to be concerned about LLM/AI tools these days, in terms of how they can be overhyped, how they can be misused, and certainly over who has power and control over the systems. But it’s deeply concerning to me how many people who supported an open internet and the fundamental principles that underlie that have now given up on those principles because they see that some AI companies might benefit from an open internet.

The problem isn’t just ideological—it’s practical. We’re watching the construction of a fundamentally different internet, one where access is controlled by gatekeepers and paywalls rather than governed by open protocols and user choice. And we’re doing it in the name of stopping AI companies, even though the real result will be to concentrate even more power in the hands of those same large tech companies while making the internet less useful for everyone else.

The move toward a closed internet shifted into high gear, to some extent, with Cloudflare launching its pay-per-crawl feature. I will admit that when I first saw this announcement, it intrigued me. It would sure be nice for Techdirt if we suddenly started getting random checks from AI companies for crawling the more than 80k articles we’ve written that are then fueling their LLMs.

But, also, I recognize that even having 80k high-quality (if I say so myself) articles is probably worth… not very much. LLMs are built by feeding billions of pieces of content—articles, websites, comments, PDFs, videos, books, etc.—into a transformer tool to make the LLMs work. Any individual piece of content (or even 80k pieces of content) is actually not worth that much. So, even if Cloudflare’s system got anyone to pay, the net effect for almost everyone online would be… tiny.

Of course, history has also shown that those setting up the tollbooths to be aggregators of such payments often do quite well. So I’m sure Cloudflare might do quite well out of this deal (and, honestly, I would trust Cloudflare to do a better job of this than many other companies, given its history). But the tollbooth/aggregators quite often become corrupt. Research on the history of these kinds of “collective licensing” intermediaries shows a long trail of corruption and other problems.

None of this is to suggest Cloudflare will definitely go down the road of corruption, but the temptations will be there. More concerning than the economic model, though, was what came next: a secondary announcement from Cloudflare that revealed a fundamental confusion about what kinds of internet access should be restricted. Last month, it accused AI company Perplexity of “using stealth, undeclared crawlers to evade website no-crawl directives.”

Plenty of people reacted angrily to the story, arguing it was proof of bad behavior on Perplexity’s part, but the details suggest that Cloudflare was conflating very different activities. It’s one thing to block scraper bots that are building up an index of content for training an LLM. That’s an area where it seems reasonable for some to choose to block those bots.

But what Cloudflare described was something different entirely:

We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website….

We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.

This is where the anti-AI sentiment becomes genuinely dangerous to internet openness. It’s one thing to say “no general scraping bots” but what Cloudflare is describing here is something much more fundamental: they want robots.txt files to control not just automated crawling, but individual user queries. That’s not protecting against bulk AI training—that’s breaking how the web works.
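To make the distinction concrete: robots.txt is an advisory convention that well-behaved crawlers voluntarily check before fetching; nothing at the protocol level enforces it. A minimal sketch using Python’s standard-library parser (the bot name is illustrative; the domain is one Cloudflare named in its experiment):

```python
# robots.txt is advisory: a "respectful" crawler checks it itself before
# fetching. The server does not enforce the directive.
from urllib import robotparser

# The kind of blanket no-crawl file Cloudflare's experiment used
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A bulk crawler (illustrative name) should decline to fetch anything:
print(rp.can_fetch("ExampleBot", "https://secretexample.com/page"))  # False

# Note that the file says nothing about a human asking a tool to read one
# page; whether that counts as "crawling" is exactly the dispute here.
```

The directive only works because crawlers choose to honor it, which is why the argument is really about norms: what kinds of access should be expected to consult robots.txt at all.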

Let me give an example that hopefully clarifies why I find this problematic. A year and a half ago, I wrote about how I use LLM tools at Techdirt to help me with editing. A lot has changed in the 17 months since that was written, but I still use the same tool, Lex, to help me edit what I write. One thing I’ve found super useful in my final edit is giving the tool a list of all the sources I used in writing the article so that it can fact-check my work. It will also search other sources for me, which is quite useful, as it will find, with surprising frequency, useful sources that add more relevant information to an article.

But, increasingly, I’m finding that for certain news sites, it refuses to read them, and I’m guessing it’s because of various lawsuits some publishers have filed. So, for example, I find that the tool I use refuses to read NY Times or NBC News stories. But, I’m not trying to train an AI on those articles. I’m just asking it to read over the article, read over what I’ve written, and give me a sense of whether or not it believes I’m writing a fair assessment based on those articles.

When the AI is able to read that content, I find it incredibly useful in making sure that my reporting is accurate and clear. But there are times I’m unable to, because these publishers have taken such an extreme view of these tools that they seek to block any and all access.

This illustrates the core problem: we’re not just blocking bulk AI training anymore. We’re blocking legitimate individual use of AI tools to access and analyze web content. That’s not protecting creator rights—that’s breaking the fundamental promise of the web that if you publish something publicly, people should be able to access and use it.

Consider the broader implications: if we normalize blocking AI tools from accessing web content, where does it end? We’ve talked in the past about how many visually impaired users rely on technological tools to “read” websites for them. If we establish that all technological intermediary tools can be blocked without payment, we’re not just hurting AI companies—we’re potentially breaking accessibility tools that people depend on.

There’s a world of difference between “scrape this site to add it to a massive corpus of data” and “hey, can you just look at this one site to see what it says?” One is a big scraping job and one is simply a user-directed prompt.
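The structural difference can be sketched in a few lines of Python (all names here are illustrative, not any vendor’s actual code): a crawl is an open-ended loop that accumulates a corpus and discovers new pages as it goes, while a user-directed read touches exactly one URL, once.

```python
def bulk_crawl(seed_urls, fetch, max_pages=1_000_000):
    """Scraping job: walk outward from seed pages, accumulating a corpus."""
    corpus, queue, seen = [], list(seed_urls), set()
    while queue and len(corpus) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        text, links = fetch(url)  # fetch() returns (page text, outbound links)
        corpus.append(text)       # every page feeds the dataset...
        queue.extend(links)       # ...and yields more pages to visit
    return corpus


def user_directed_read(url, fetch):
    """User prompt: read exactly one page, once, on behalf of one person."""
    text, _links = fetch(url)
    return text
```

The first function’s footprint grows without bound as new links are discovered; the second’s is a single request, indistinguishable in scale from a person clicking a link. Treating both identically under robots.txt collapses that distinction.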

Cloudflare’s complaint against Perplexity seems to conflate the two and pretend they’re the same. And I wasn’t the only one who noticed how odd this is, especially if you believe in an open web. On an open web, if I point a browsing tool at an open website, the tool should be able to read that website.

The collateral damage from this conflation is already spreading beyond AI companies.

Take, for example, Reddit telling the Internet Archive that it was going to start blocking its crawler from archiving Reddit feeds, because it was worried that AI companies were simply getting access to its content (that Reddit now is looking to license) by going to the Wayback Machine instead.

Here we see the real economic driver behind much of this: Reddit has discovered that user-generated content can be a revenue stream through AI licensing deals. But rather than finding ways to capture that value while preserving archival access, they’re choosing to break historical preservation entirely. We’re losing decades of human discourse and cultural history because Reddit wants to ensure AI companies pay for access to fresh content.

All of this suggests we’re moving very far away from an open internet, and towards one where it’s not just “pay to crawl” but it’s “pay to click” to get access to anything online.

Common Crawl, a non-profit at the center of some of these fights, is finding itself in a tough spot as well. It’s spent many years creating incredibly important and useful archives of the web. Those archives have been essential for many important research projects. But the Common Crawl archives have also been quite useful to LLM companies, and Common Crawl has been trying to navigate all of this. Unlike some others, its scanning bot is quite clear about who it is and seeks to be as “friendly” as a scraping bot can be. It’s not trying to sneak around, yet it’s suddenly facing challenges where it can’t accurately archive large parts of the web anymore.

The Common Crawl situation perfectly illustrates how anti-AI sentiment is destroying valuable public resources. Common Crawl has been crucial for academic research, journalism, and public interest projects for over a decade. Researchers have used its archives to study everything from the spread of misinformation to the evolution of web technologies. But because AI companies also found the archives useful, Common Crawl is now being shut out of large parts of the web.

This is the definition of cutting off your nose to spite your face. We’re destroying a public good that benefits researchers, journalists, and civil society because we’re afraid that AI companies might also benefit from it.

And all that means that the web isn’t that open anymore. And that’s sad to think about.

Common Crawl is now suggesting that forward-thinking companies will start treating open crawling of their websites as an updated form of “search engine optimization,” or, in this case, AI optimization. At least some companies seem to agree: as more searches go to LLMs instead of traditional search engines, managers want information about their companies, and links to them, to appear in AI results:

A significant number of websites currently block CCBot (Common Crawl’s web crawler), often without realizing its role in the ML and research ecosystems. Common Crawl publishes monthly web datasets which serve as foundational training data for major AI models and research initiatives.

As one SEO, Ash Nallawalla (author of The Accidental SEO Manager), wrote:

“A manager asked me why our leading brand was not mentioned by an AI platform, which mentioned obscure competitors instead. I found that we had been blocking ccBot for some years, because some sites were scraping our content indirectly. After some discussion, we felt that allowing LLM crawlers was more beneficial than the risk of being scraped, so we revised our exclusion list.”

If CCBot can’t crawl your site, your content is absent from one of the key datasets on which AI models are trained, potentially making your brand less visible in AI-powered search results.

This quote reveals the fundamental tension in the current approach. Companies are discovering that blocking AI access doesn’t just prevent training—it makes them invisible in an increasingly AI-mediated web. As Judge Mehta just noted in the Google antitrust remedies ruling, AI is beginning to encroach on the historical search market. As more people use AI tools for search and research, being blocked from AI training datasets means being blocked from discoverability.
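For a site that decides the visibility trade-off is worth it, the change Nallawalla describes is a few lines of robots.txt. A minimal example (CCBot is Common Crawl’s actual user-agent token; the disallowed path is purely illustrative):

```
# robots.txt -- admit Common Crawl's crawler while still fencing off
# a hypothetical members-only area
User-agent: CCBot
Allow: /
Disallow: /members/

# All other crawlers keep whatever policy the site already had
User-agent: *
Disallow: /members/
```

Per the Robots Exclusion Protocol, the most specific matching user-agent group applies, so CCBot follows its own rules rather than the catch-all ones.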

We’re creating a two-tier internet: sites that can be found and accessed through modern tools, and sites that can’t. Guess which tier will thrive?

In other words, there is a lot going on across the board here. You have some companies who want to appear in AI results. You have some (including us at Techdirt!) who don’t mind it when AI scanners crawl and learn from our content, so long as they don’t take down our servers.

But, increasingly, we’re seeing people have such a negative, knee-jerk, anti-AI stance that they may be shutting off access to the web in a manner that could lead to the death of an open web, and could lead much more towards a pay-to-access model on the web, which I think is a result that most of us would regret.

And this is what I fear we’re going to end up with: an internet where large platforms control access through licensing deals and technical restrictions, where public archives are neutered to prevent AI companies from accessing them, and where individual users can’t use modern tools to access and analyze web content. It’s a world where Google, Microsoft, and Meta get special access through billion-dollar licensing deals while everyone else—researchers, journalists, small businesses, individual users—gets locked out.

The power and excitement of an open web was that it was open and accessible to all. The web’s core principle wasn’t “open to everyone except the technologies we don’t like.” It was “open, period.” Once we start making exceptions based on who might benefit or what technology might be used to access content, we’ve abandoned that principle entirely.

We’re not protecting creators or preserving the open internet—we’re helping to destroy it. The real winners in this new world won’t be individual writers or small publishers. They’ll be the same large tech companies that can afford licensing deals and that have the resources to navigate an increasingly complex web of access restrictions. The losers will be everyone else: users, researchers, archivists, and the long tail of creators who benefit from an open, discoverable web.

None of this means we should ignore legitimate concerns about AI training or creator compensation. But we should address those concerns through mechanisms that preserve internet openness rather than destroy it. That might mean new business models, better attribution systems, or novel approaches to creator compensation. What it shouldn’t mean is abandoning the fundamental architecture of the web.

And that would be unfortunate for all of us.

Filed Under: ai, bots, cloudflare, common crawl, crawling, intermediaries, internet archive, open internet, reddit, scraping

Companies: cloudflare, common crawl, internet archive, reddit

