Tim Berners-Lee Invented the World Wide Web. Now He Wants to Save It | The New Yorker
A profile of Tim and the World Wide Web.
I don’t use large language models. My objection to using them is ethical. I know how the sausage is made.
I wanted to clarify that. I’m not rejecting large language models because they’re useless. They can absolutely be useful. I just don’t think the usefulness outweighs the ethical issues in how they’re trained.
Molly White came to the same conclusion:
The benefits, though extant, seem to pale in comparison to the costs.
What I do know is that I find LLMs useful on occasion, but every time I use one I die a little inside.
I genuinely look forward to being able to use a large language model with a clear conscience. Such a model would need to be trained ethically. When we get a free-range organic large language model I’ll be the first in line to use it. Until then, I’ll abstain. Remember:
You don’t get companies to change their behaviour by rewarding them for it. If you really want better behaviour from the purveyors of generative tools, you should be boycotting the current offerings.
Still, in anticipation of an ethical large language model someday becoming reality, I think it’s good for me to have an understanding of which tasks these tools are good at.
Prototyping seems like a good use case. My general attitude to prototyping is the exact opposite of my attitude to production code: use absolutely any tool you want and prioritise speed over quality.
When it comes to coding in general, I think Laurie is really onto something when he says:
Is what you’re doing taking a large amount of text and asking the LLM to convert it into a smaller amount of text? Then it’s probably going to be great at it. If you’re asking it to convert into a roughly equal amount of text it will be so-so. If you’re asking it to create more text than you gave it, forget about it.
In other words, despite what the hype says, these tools are far better at transforming than they are at generating.
Iris Meredith goes deeper into this distinction between transformative and compositional work:
Compositionality relies (among other things) on two core values or functions: choice and precision, both of which are antithetical to LLM functioning.
My own take on this is that transformative work is often the drudge work—take this data dump and convert it to some other format; take this mock-up and make a disposable prototype. I want my tools to help me with that.
But compositional work that relies on judgement, taste, and choice? Not only would I not use a large language model for that, it’s exactly the kind of work that I don’t want to automate away.
Transformative work is done with broad brushstrokes. Compositional work is done with a scalpel.
Large language models are big messy brushes, not scalpels.
Looking at LLM usage and promotion as a cultural phenomenon, it has all of the markings of a status game. The material gains from the LLM (which are usually quite marginal) really aren’t why people are doing it: they’re doing it because in many spaces, using ChatGPT and being very optimistic about AI being the “future” raises their social status. It’s important not only to be using it, but to be seen using it and be seen supporting it and telling people who don’t use it that they’re stupid luddites who’ll inevitably be left behind by technology.
It’s just over one month until UX London. You should grab a ticket if you haven’t already!
The format of UX London is quite special. While the focus of each day is different—discovery, design, and delivery—each day unfolds like this…
There are four talks in the morning. You get your brain filled with ideas and learn from fantastic speakers. It’s a single track—everyone’s getting the same shared experience.
Then after lunch, you choose one of four workshops. Whatever you choose, it’s going to be hands-on. You can leave your laptop at home.
A day of listening to talks could get exhausting. A workshop that lasts all day could be even more exhausting. But somehow by splitting the day between both activities, the energy level is just right!
That said, we don’t want the day to end with everyone spread across four different workshop rooms. That’s why there’s one final talk at the end of each day.
These closing talks are a bit different to the morning talks. Whereas the focus of the morning talks is on practical skills that you can apply straight away, the closing talks are an opportunity to sit back and have your mind expanded. They’ll be fun and thought-provoking.
Paula Zuccotti is closing out day one with a talk about two of her projects: Every Thing We Touch and Future Archeology:
This talk invites audiences to reconsider the meaning of the objects they encounter every day and reflect on what their possessions might reveal about who we are and what we value, both now and in the years to come.
Sarah Hyndman will wrap up day two with a fun interactive talk about your senses:
Join a live expedition into our inner world to explore why we see, feel and remember.
Finally, Rachel Coldicutt is going to finish UX London with a rallying cry:
Introducing the Society of Hopeful Technologists and discussing how, in modern technology development, your practice is probably more political than you realise.
I can’t wait! Get yourself a ticket for a day or for all three days.
And as a little thank you for tolerating my excitement, use the discount code JOINJEREMY to get 20% off your ticket.
The Wikimedia Foundation, stewards of the finest projects on the web, have written about the hammering their servers are taking from the scraping bots that feed large language models.
Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.
Drew DeVault puts it more bluntly, saying Please stop externalizing your costs directly into my face:
Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale.
And no, a robots.txt file doesn’t help.
If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned.
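For the record, this is the kind of robots.txt that should, in theory, turn these crawlers away. GPTBot and CCBot are the user-agent tokens published by OpenAI and Common Crawl; the rest of your file will vary:

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```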
Free and open source projects are particularly vulnerable. FOSS infrastructure is under attack by AI companies:
LLM scrapers are taking down FOSS projects’ infrastructure, and it’s getting worse.
You try to do the right thing by making knowledge and tools freely available. This is how you get repaid. AI bots are destroying Open Access:
There’s a war going on on the Internet. AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers, anyone who is working to make quality information universally available on the internet.
My own experience with The Session bears this out.
Ars Technica has a piece on this: Open source devs say AI crawlers dominate traffic, forcing blocks on entire countries.
So does MIT Technology Review: AI crawler wars threaten to make the web more closed for everyone.
When we talk about the unfair practices and harm done by training large language models, we usually talk about it in the past tense: how they were trained on other people’s creative work without permission. But this is an ongoing problem that’s just getting worse.
The worst of the internet is continuously attacking the best of the internet. This is a distributed denial of service attack on the good parts of the World Wide Web.
If you’re using the products powered by these attacks, you’re part of the problem. Don’t pretend it’s cute to ask ChatGPT for something. Don’t pretend it’s somehow being technologically open-minded to continuously search for nails to hit with the latest “AI” hammers.
If you’re going to use generative tools powered by large language models, don’t pretend you don’t know how your sausage is made.
As it currently stands, both the rapid growth of AI-generated content overwhelming online spaces and aggressive web-crawling practices by AI firms threaten the sustainability of essential online resources. The current approach taken by some large AI companies—extracting vast amounts of data from open-source projects without clear consent or compensation—risks severely damaging the very digital ecosystem on which these AI models depend.
Make yourself a nice cup of tea and settle in with Julian Gough’s magnum opus:
How early, sustained, supermassive black hole jets carved out cosmic voids, shaped filaments, and generated magnetic fields
Google Fonts only lets you download .ttf files, meaning that if you want to self-host your fonts (and you should), you have to first convert them to .woff2 files.
Luckily this tool has been online for over a decade, doing what Google Fonts should be doing by default.
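If you’d rather script the conversion yourself, here’s a minimal sketch using the Python fonttools library (the font file names are placeholders):

```python
# pip install fonttools brotli
from fontTools.ttLib import TTFont

# Load the .ttf downloaded from Google Fonts (placeholder file name).
font = TTFont("MyFont.ttf")

# Setting the flavor tells fonttools to write WOFF2 on save;
# the brotli package handles the compression.
font.flavor = "woff2"
font.save("MyFont.woff2")
```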
AI has the same problem that I saw ten years ago at IBM. And remember that IBM has been at this AI game for a very long time. Much longer than OpenAI or any of the new kids on the block. All of the shit we’re seeing today? Anyone who worked on or near Watson saw or experienced the same problems long ago.
LLMs are good at transforming text into less text
Laurie is really onto something with this:
This is the biggest and most fundamental thing about LLMs, and a great rule of thumb for what’s going to be an effective LLM application. Is what you’re doing taking a large amount of text and asking the LLM to convert it into a smaller amount of text? Then it’s probably going to be great at it. If you’re asking it to convert into a roughly equal amount of text it will be so-so. If you’re asking it to create more text than you gave it, forget about it.
Depending on how much of the hype around AI you’ve taken on board, the idea that they “take text and turn it into less text” might seem like a gigantic back-pedal away from previous claims of what AI can do. But taking text and turning it into less text is still an enormous field of endeavour, and a huge market. It’s still very exciting, all the more exciting because it’s got clear boundaries and isn’t hype-driven over-reaching, or dependent on LLMs overnight becoming way better than they currently are.
Interesting—this is exactly the same framing I used to talk about design systems a few years ago.
Oh, this is a very handy service from Paul—given the URL of an RSS feed that only has summaries, it will attempt to get the full post content from the HTML.
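I don’t know how Paul’s service is implemented, but the general idea can be sketched in a few lines of Python with feedparser and BeautifulSoup (assuming, optimistically, that each post’s body lives in an article element; real pages vary):

```python
# pip install feedparser requests beautifulsoup4
import feedparser
import requests
from bs4 import BeautifulSoup

def expand_feed(feed_url):
    """Fetch each summary-only entry's page and pull out the full post text."""
    for entry in feedparser.parse(feed_url).entries:
        html = requests.get(entry.link, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        # Assumption: the post body lives in an <article> element;
        # fall back to the whole <body> if there isn't one.
        body = soup.find("article") or soup.body
        yield entry.title, body.get_text(" ", strip=True)
```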
This magnificent piece by Maxwell Neely-Cohen—with some tasteful art-direction—is right up my alley!
This piece looks at a single question. If you, right now, had the goal of digitally storing something for 100 years, how should you even begin to think about making that happen? How should the bits in your stewardship be stored with such a target in mind? How do our methods and platforms look when considered under the harsh unknowns of a century?

There are plenty of worthy related subjects and discourses that this piece does not touch at all. This is not a piece about the sheer volume of data we are creating each day, and how we might store all of it. Nor is it a piece about the extremely tough curatorial process of deciding what is and isn’t worth preserving and storing.

It is about longevity, about the potential methods of preserving what we make for future generations, about how we make bits endure. If you had to store something for 100 years, how would you do it? That’s it.
If someone uses an LLM as a replacement for search, and the output they get is correct, this is just by chance. Furthermore, a system that is right 95% of the time is arguably more dangerous than one that is right 50% of the time. People will be more likely to trust the output, and likely less able to fact check the 5%.
Hypertext links are an information-density multiplier.
The way I’ve long thought about it is that traditional writing — like for print — feels two-dimensional. Writing for the web adds a third dimension. It’s not an equal dimension, though. It doesn’t turn writing from a flat plane into a full three-dimensional cube. It’s still primarily about the same two dimensions as old-fashioned writing. What hypertext links provide is an extra layer of depth. Just the fact that the links are there — even if you, the reader, don’t follow them — makes a sentence read slightly differently. It adds meaning in a way that is unique to the web as a medium for prose.
Speaking of serendipity, not long after I wrote about making a static archive of The Session for people to download and share, I came across a piece by Alex Chan about using static websites for tiny archives.
The use-case is slightly different—this is about personal archives, like paperwork, screenshots, and bookmarks. But we both came up with the same process:
I’m deliberately going low-scale, low-tech. There’s no web server, no build system, no dependencies, and no JavaScript frameworks.
And we share the same hope:
Because this system has no moving parts, and it’s just files on a disk, I hope it will last a long time.
You should read the whole thing, where Alex describes all the other approaches they took before settling on plain ol’ HTML files in a folder:
HTML is low maintenance, it’s flexible, and it’s not going anywhere. It’s the foundation of the entire web, and pretty much every modern computer has a web browser that can render HTML pages. These files will be usable for a very long time – probably decades, if not more.
I’m enjoying this approach, so I’m going to keep using it. What I particularly like is that the maintenance burden has been essentially zero – once I set up the initial site structure, I haven’t had to do anything to keep it working.
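That’s not Alex’s actual setup, but the files-in-a-folder approach really can be this simple. Here’s a hypothetical one-off Python script that writes a plain index.html and then gets out of the way; no server or build step needs to stay alive afterwards:

```python
# A one-off script, not a build system: run it once and you're left
# with nothing but plain HTML files in a folder.
import html
from pathlib import Path

def write_index(folder="archive"):
    folder = Path(folder)
    links = "\n".join(
        f'<li><a href="{html.escape(p.name)}">{html.escape(p.name)}</a></li>'
        for p in sorted(folder.iterdir())
        if p.name != "index.html"
    )
    page = f"<!DOCTYPE html>\n<title>Archive</title>\n<ul>\n{links}\n</ul>\n"
    (folder / "index.html").write_text(page, encoding="utf-8")

write_index()
```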
They also talk about digital preservation:
I’d love to see static websites get more use as a preservation tool.
I concur! And it’s particularly interesting for Alex to be making this observation in the context of working with the Flickr Foundation. That’s where they’re experimenting with the concept of a data lifeboat:
What should we do when a digital service sinks?
This is something that George spoke about at the final dConstruct in 2022. You can listen to the talk on the dConstruct archive.
- People only understand things relative to things they already understand
- People only understand things in context
- People rely on patterns and consistency
- People seek to minimize cognitive load
- People have varying levels of expertise and familiarity
- People are goal-oriented
- People often don’t know what they’re looking for
- Information is more useful when it’s actionable
For many archivists, alarm bells are ringing. Across the world, they are scraping up defunct websites or at-risk data collections to save as much of our digital lives as possible. Others are working on ways to store that data in formats that will last hundreds, perhaps even thousands, of years.