Butlerian Jihad
This page collects my blog posts on the topic of fighting off spam bots, search engine spiders and other non-humans wasting the precious resources we have on Earth.
The Wikimedia Foundation, stewards of the finest projects on the web, have written about the hammering their servers are taking from the scraping bots that feed large language models.
Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.
Drew DeVault puts it more bluntly, saying Please stop externalizing your costs directly into my face:
Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale.
And no, a robots.txt file doesn’t help.
If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned.
Free and open source projects are particularly vulnerable. FOSS infrastructure is under attack by AI companies:
LLM scrapers are taking down FOSS projects’ infrastructure, and it’s getting worse.
You try to do the right thing by making knowledge and tools freely available. This is how you get repaid. AI bots are destroying Open Access:
There’s a war going on on the Internet. AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers, anyone who is working to make quality information universally available on the internet.
My own experience with The Session bears this out.
Ars Technica has a piece on this: Open source devs say AI crawlers dominate traffic, forcing blocks on entire countries.
So does MIT Technology Review: AI crawler wars threaten to make the web more closed for everyone.
When we talk about the unfair practices and harm done by training large language models, we usually talk about it in the past tense: how they were trained on other people’s creative work without permission. But this is an ongoing problem that’s just getting worse.
The worst of the internet is continuously attacking the best of the internet. This is a distributed denial of service attack on the good parts of the World Wide Web.
If you’re using the products powered by these attacks, you’re part of the problem. Don’t pretend it’s cute to ask ChatGPT for something. Don’t pretend it’s somehow being technologically open-minded to continuously search for nails to hit with the latest “AI” hammers.
If you’re going to use generative tools powered by large language models, don’t pretend you don’t know how your sausage is made.
Heydon is employing a different tactic to what I’m doing to sabotage large language model crawlers. These bots don’t respect the nofollow rel value …so now they pay the price.
Raising my own middle finger to LLM manufacturers will achieve little on its own. If doing this even works at all. But if lots of writers put something similar in place, I wonder what the effect would be. Maybe we would start seeing more, and more obvious, gibberish emerging in generative AI output. Perhaps LLM owners would start to think twice about disrespecting the nofollow protocol.
As it currently stands, both the rapid growth of AI-generated content overwhelming online spaces and aggressive web-crawling practices by AI firms threaten the sustainability of essential online resources. The current approach taken by some large AI companies—extracting vast amounts of data from open-source projects without clear consent or compensation—risks severely damaging the very digital ecosystem on which these AI models depend.
AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers, anyone who is working to make quality information universally available on the internet.
More on how large language model bots are DDOSing the web:
LLM scrapers are taking down FOSS projects’ infrastructure, and it’s getting worse.
Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale.
This matches my experience with The Session. In fact, while I had this article open in a tab, I had to go deal with a tsunami of large language model bots. It’s really fucking depressing.
Please stop legitimizing LLMs or AI image generators or GitHub Copilot or any of this garbage. I am begging you to stop using them, stop talking about them, stop making new ones, just stop. If blasting CO2 into the air and ruining all of our freshwater and traumatizing cheap laborers and making every sysadmin you know miserable and ripping off code and books and art at scale and ruining our fucking democracy isn’t enough for you to leave this shit alone, what is?
Anyone at an AI company who stops to think for half a second should be able to recognize they have a vampiric relationship with the commons. While they rely on these repositories for their sustenance, their adversarial and disrespectful relationships with creators reduce the incentives for anyone to make their work publicly available going forward (freely licensed or otherwise). They drain resources from maintainers of those common repositories often without any compensation.
Even if AI companies don’t care about the benefit to the common good, it shouldn’t be hard for them to understand that by bleeding these projects dry, they are destroying their own food supply.
And yet many AI companies seem to give very little thought to this, seemingly looking only at the months in front of them rather than operating on years-long timescales. (Though perhaps anyone who has observed AI companies’ activities more generally will be unsurprised to see that they do not act as though they believe their businesses will be sustainable on the order of years.)
It would be very wise for these companies to immediately begin prioritizing the ongoing health of the commons, so that they do not wind up strangling their golden goose. It would also be very wise for the rest of us to not rely on AI companies to suddenly, miraculously come to their senses or develop a conscience en masse.
Instead, we must ensure that mechanisms are in place to force AI companies to engage with these repositories on their creators’ terms.
Oh, this is a very handy service from Paul—given the URL of an RSS feed that only has summaries, it will attempt to get the full post content from the HTML.
This is a great move from Cloudflare. I may start using their service.
The Dark Forest idea comes from the Remembrance of Earth’s Past books by Liu Cixin. It’s an elegant but dispiriting solution to the Fermi paradox. Maggie sums it up:
Dark forest theory suggests that the universe is like a dark forest at night - a place that appears quiet and lifeless because if you make noise, the predators will come eat you.
This theory proposes that all other intelligent civilizations were either killed or learned to shut up. We don’t yet know which category we fall into.
Maggie has described The Expanding Dark Forest and Generative AI:
The dark forest theory of the web points to the increasingly life-like but life-less state of being online. Most open and publicly available spaces on the web are overrun with bots, advertisers, trolls, data scrapers, clickbait, keyword-stuffing “content creators,” and algorithmically manipulated junk.
It’s like a dark forest that seems eerily devoid of human life – all the living creatures are hidden beneath the ground or up in trees. If they reveal themselves, they risk being attacked by automated predators.
Those of us in the cozy web try to keep our heads down, attempting to block the bots plundering our work.
I advocate for taking this further. We should fight back. Let’s exploit the security hole of prompt injections. Here are some people taking action:
I’ve taken steps here on my site. I’d like to tell you exactly what I’ve done. But if I do that, I’m also telling the makers of these bots how to circumvent my attempts at prompt injection.
This feels like another concept from Liu Cixin’s books. Wallfacers:
The sophons can overhear any conversation and intercept any written or digital communication but cannot read human thoughts, so the UN devises a countermeasure by initiating the “Wallfacer” Program. Four individuals are granted vast resources and tasked with generating and fulfilling strategies that must never leave their own heads.
So while I’d normally share my code, I feel like in this case I need to exercise some discretion. But let me give you the broad brushstrokes:
You can view source to see some examples.
I plan to keep updating my pool of potential prompt injections. I’ll add to it whenever I hear of a phrase that might potentially throw a spanner in the works of a scraping bot.
By the way, I should add that I’m doing this as well as using a robots.txt file. So any bot that ingests a prompt injection deserves it.
I could not disagree with Manton more when he says:
I get the distrust of AI bots but I think discussions to sabotage crawled data go too far, potentially making a mess of the open web. There has never been a system like AI before, and old assumptions about what is fair use don’t really fit.
Bollocks. This is exactly the kind of techno-determinism that boils my blood:
AI companies are not going to go away, but we need to push them in the right directions.
“It’s inevitable!” they cry as though this was a force of nature, not something created by people.
There is nothing inevitable about any technology. The actions we take today are what determine our future. So let’s take steps now to prevent our web being turned into a dark, dark forest.
AI is steeped in marketing drivel, built upon theft, and intent on replacing our creative output with a depressingly shallow imitation.
A handy resource for keeping your blocklist up to date in your robots.txt file.
Though the name of the website is unfortunate with its racism-via-laziness nomenclature.
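For anyone who wants to keep that kind of blocklist in sync by script, here is a minimal sketch in Python. It assumes you already have a plain list of crawler user-agent names from somewhere. The function name `build_robots_txt` and the two agent names in the usage note are illustrative choices, not anything taken from the linked resource. It groups every blocked agent under a single `Disallow: /` rule.

```python
def build_robots_txt(blocked_agents):
    """Return robots.txt text that disallows the whole site for each listed agent.

    `blocked_agents` is a hypothetical input: a list of user-agent tokens
    that you maintain yourself or copy from a published blocklist.
    """
    # One group: several User-agent lines sharing a single Disallow rule.
    lines = [f"User-agent: {agent}" for agent in blocked_agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"
```

Calling `build_robots_txt(["GPTBot", "CCBot"])` gives you three lines to save as robots.txt: two `User-agent` lines followed by `Disallow: /`. Bear in mind this only states a preference. A crawler that ignores robots.txt will ignore this too.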
I realized why I hadn’t yet added any rules to my robots.txt: I have zero faith in it.
I endorse this statement.
Readability is back, but now it’s called Mercury.
This tool for building ScrAPIs is an interesting development—the current trend for not providing a simple API (or even a simple RSS feed) is being interpreted as damage and routed around.
A handy step-by-step guide to scraping HTML to get data out. Useful for services (—cough—Twitter—cough—) that keep changing the rules of their API use.
A new feature on Matthew Somerville's brilliant train timetable site. Just put /fares at the end of any URL to get the cheapest available fare.
While I had Matthew in my clutches, I made him show me around the API for They Work For You. Who knew that so much fun could be derived from data about MPs?
First off, there’s Matthew’s game of MP Top Trumps, though he had to call it MP Fab Farts to avoid getting a cease and desist letter.
Then there’s a text adventure built on the API. This is so good! Enter your postcode and you find yourself playing the part of your parliamentary representative with zero experience points and one hundred hit points. You must work your way across the country, doing battle with rival MPs, as you make your way towards Sedgefield, the lair of Blair.
You can play a Web version but for some real old-school fun, try the telnet version. This reminded me of how much I used to love text adventures back in the days of 8-bit computers. I even remember trying to write my own in BASIC.
For what it’s worth, Celia Barlow, MP for Hove, has excellent pesteredness points. I made it all the way up to Sedgefield and defeated Tony Blair in battle. My prize was the source code of the adventure game in Python.
Ah, what larks!
There’s another project that Matthew works on that I find extremely useful. He has created accessible UK train timetables using the data from the National Rail site, a scrAPI if you will. This is where I go whenever I need to plan a train journey.
The latest feature is something that warms the cockles of my heart: beautiful, hackable URLs. If I want a list of trains going from Brighton to London, I can just type:
http://traintimes.org.uk/brighton/london
It handles spaces (or pluses or underscores) too:
http://traintimes.org.uk/brighton/london victoria
The URL can also be extended with a departure time:
http://traintimes.org.uk/brighton/london victoria/14:00
My address bar is my command line. This is the kind of design that makes URL fetishists like Tom very happy.
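Because the pattern is so regular, the same URLs are easy to build from a script. This is only a sketch based on the three examples above. The helper name `train_times_url` is made up, lowercasing the station names is an assumption drawn from the examples, and spaces are swapped for pluses since the site is described as accepting either.

```python
from urllib.parse import quote

BASE_URL = "http://traintimes.org.uk"  # host as it appears in the examples above


def train_times_url(origin, destination, departure_time=None):
    """Build an origin/destination[/time] URL following the pattern shown above."""
    segments = [origin, destination]
    if departure_time is not None:
        segments.append(departure_time)
    # Assumption: lowercase names, with pluses standing in for spaces.
    cleaned = [s.strip().lower().replace(" ", "+") for s in segments]
    # Leave "+" and ":" unescaped so names and times stay readable.
    path = "/".join(quote(s, safe="+:") for s in cleaned)
    return f"{BASE_URL}/{path}"
```

For example, `train_times_url("Brighton", "London Victoria", "14:00")` produces `http://traintimes.org.uk/brighton/london+victoria/14:00`.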