Domenic Denicola

Windows Native App Development Is a Mess

2026-03-22T00:00:00Z

I’m a Windows guy; I always have been. One of my first programming books was Beginning Visual C++ 6, which crucially came with a trial version of Visual C++ that my ten-year-old self could install on my parents’ computer. I remember being on a family vacation when .NET 1.0 came out, working my way through a C# tome and gearing up to rewrite my Neopets cheating programs from MFC into Windows Forms. Even my very first job after university was at a .NET shop, although I worked mostly on the frontend.

While I followed the Windows development ecosystem from the sidelines, my professional work never involved writing native Windows apps. (Chromium is technically a native app, but is more like its own operating system.) And for my hobby projects, the web was always a better choice. But, spurred on by fond childhood memories, I thought writing a fun little Windows utility program might be a good retirement project.

Well. I am here to report that the scene is a complete mess. I totally understand why nobody writes native Windows applications these days, and instead people turn to Electron.

What I built

The utility I built, Display Blackout, scratched an itch for me: when playing games on my three-monitor setup, I wanted to black out my left and right displays. Turning them off will cause Windows to spasm for several seconds and throw all your current window positioning out of whack. But for OLED monitors, throwing up a black overlay will turn off all the pixels, which is just as good.

To be clear, this is not an original idea. I was originally using an AutoHotkey script, which upon writing this post I found out has since morphed into a full Windows application. Other | incarnations of the idea are even available on the Microsoft Store. But, I thought I could create a slightly nicer and more modern UI, and anyway, the point was to learn, not to create a commercial product.

For our purposes, what’s interesting about this app is the sort of capabilities it needs:

Enumerating the machine’s displays and their bounds
Placing borderless, titlebar-less, non-activating black windows
Intercepting a global keyboard shortcut
Optionally running at startup
Storing some persistent settings
Displaying a tray icon with a few menu items

Let’s keep those in mind going forward.

Look at this beautiful UI that I made. Surely you will agree that it is better than all other software in this space.

A brief history of Windows programming

In the beginning, there was the Win32 API, in C. Unfortunately, this API is still highly relevant today, including for my program.

Over time, a series of abstractions on top of this emerged. The main pre-.NET one was the MFC C++ library, which used modern-at-the-time language features like classes and templates to add some object-orientation on top of the raw C functions.

The abstraction train really got going with the introduction of .NET. .NET was many things, but for our purposes the most important part was the introduction of a new programming language, C#, that ran as JITed bytecode on a new virtual machine, in the same style as Java. This brought automatic memory management (and thus memory safety) to Windows programming, and generally gave Microsoft a more modern foundation for their ecosystem. Additionally, the .NET libraries included a whole new set of APIs for interacting with Windows. On the UI side in particular, .NET 1.0 (2002) started out with Windows Forms. Similar to MFC, it was largely a wrapper around the Win32 windowing and control APIs.

With .NET 3.0 (2006), Microsoft introduced WPF. Now, instead of creating all controls as C# objects, there was a separate markup language, XAML: more like the HTML + JavaScript relationship. This also was the first time they redrew controls from scratch, on the GPU, instead of wrapping the Win32 API controls that shipped with the OS. At the time, this felt like a fresh start, and a good foundation for the foreseeable future of Windows apps.

The next big pivot was with the release of Windows 8 (2012) and the introduction of WinRT. Similar to .NET, it was an attempt to create new APIs for all of the functionality needed to write Windows applications. If developers stayed inside the lines of WinRT, their apps would meet the modern standard of sandboxed apps, such as those on Android and iOS, and be deployable across Windows desktops, tablets, and phones. It was still XAML-based on the UI side, but with everything slightly different than it was in WPF, to support the more constrained cross-device targets.

This strategy got a do-over in Windows 10 (2015) with UWP, with some sandboxing restrictions lifted to allow for more capable desktop/phone/Xbox/HoloLens apps, but still not quite the same power as full .NET apps with WPF. At the same time, with both WinRT and UWP, certain new OS-level features and integrations (such as push notifications, live tiles, or publication in the Microsoft Store) were only granted to apps that used these frameworks. This led to awkward architectures where applications like Chrome or Microsoft Office would have WinRT/UWP bridge apps around old-school cores, communicating over IPC or similar.

With Windows 11 (2021), Microsoft finally gave up on the attempts to move everyone to some more-sandboxed and more-modern platform. The Windows App SDK exposes all the formerly WinRT/UWP-exclusive features to all Windows apps, whether written in standard C++ (no more C++/CLI) or written in .NET. The SDK includes WinUI 3, yet another XAML-based, drawn-from-scratch control library.

So did you catch all that? Just looking at the UI framework evolution, we have:

Win32 C APIs → MFC → WinForms → WPF → WinRT XAML → UWP XAML → WinUI 3

Forks in the road

In the spirit of this being a learning project, I knew I wanted to use the latest and greatest first-party foundation. That meant writing a WinUI 3 app, using the Windows App SDK. There ends up being three ways to go about this:

C++
C#/XAML, with “framework-dependent deployment”
C#/XAML, with .NET AOT

This is a painful choice. C++ will produce lean apps, runtime-linked against the Windows APP SDK libraries, with easy interop down into any Win32 C APIs that I might need. But, in 2026, writing a greenfield application in a memory-unsafe language like C++ is a crime.

What would be ideal is if I could use the system’s .NET, and just distribute the C# bytecode, similar to how all web apps share the same web platform provided by the browser. This is called “framework-dependent deployment”. However, for no reason I can understand, Microsoft has decided that even the latest versions of Windows 11 only get .NET 4.8.1 preinstalled. (The current version of .NET, in 2026, is 10—although the version numbers are misleading, because they started over at 1.0 again in 2016.) So distributing an app this way incurs a tragedy of the commons, where the first app to need modern .NET will cause Windows to show a dialog prompting the user to download and install the .NET libraries. This is not the optimal user experience!

That leaves .NET AOT. Yes, I am compiling the entire .NET runtime—including the virtual machine, garbage collector, standard library, etc.—into my binary. The compiler tries to trim out unused code, but the result is still a solid 9 MiB for an app that blacks out some monitors.

(“What about Rust?” I hear you ask. A Microsoft-adjacent effort to maintain Rust bindings for the Windows App SDK was tried, but they gave up.)

There’s a similar painful choice when it comes to distribution. Although Windows is happy to support hand-rolled or third-party-tool-generated setup.exe installers, the Microsoft-recommended path for a modern app with containerized install/uninstall is MSIX. But this format relies heavily on code signing certificates, which seem to cost around $200–300/year for non-US residents. The unsigned sideloading experience is terrible, requiring a cryptic PowerShell command only usable from an admin terminal. I could avoid sideloading if Microsoft would just accept my app into their store, but they rejected it for not offering “unique lasting value”.

The tragedy here is that this all seems so unnecessary. .NET could be distributed via Windows Update, so the latest version is always present, making framework-dependent deployment viable. Or at least there could be a MSIX package for .NET available, so that other MSIX packages could declare a dependency on it. Unsigned MSIX sideloads use the same crowd-sourced reputation system that EXE installers get. Windows code signing certs could cost $100/year, instead of $200+, like the equivalent costs for the Apple ecosystem. But like everything else about modern Windows development, it’s all just … half-assed.

Left behind

It turns out that it’s a lot of work to recreate one’s OS and UI APIs every few years. Coupled with the intermittent attempts at sandboxing and deprecating “too powerful” functionality, the result is that each new layer has gaps, where you can’t do certain things which were possible in the previous framework.

This is not a new problem. Even back with MFC, you would often find yourself needing to drop down to Win32 APIs. And .NET has had P/Invoke since 1.0. So, especially now that Microsoft is no longer requiring that you only use the latest framework in exchange for new capabilities, having to drop down to a previous layer is not the end of the world. But it’s frustrating: what is the point of using Microsoft’s latest and greatest, if half your code is just interop goop to get at the old APIs? What’s the point of programming in C#, if you have to wrap a bunch of C APIs?

Let’s revisit the list of things my app needs to do, and compare them to what you can do using the Windows App SDK:

Enumerating the machine’s displays and their bounds: can enumerate, as long as you use a for loop instead of a foreach loop. But watching for changes requires P/Invoke, because the modern API doesn’t actually work.
Placing borderless, titlebar-less, non-activating black windows: much of this is doable, but non-activating needs P/Invoke.
Intercepting a global keyboard shortcut: nope, needs P/Invoke.
Optionally running at startup: can do, with a nice system-settings-integrated off-by-default API.
Storing some persistent settings: can do.
Displaying a tray icon with a few menu items: not available. Not only does the tray icon itself need P/Invoke, the concept of menus for tray icons is not standardized, so depending on which wrapper package you pick, you’ll get one of several different context menu styles.

See the web version for a carousel of different tray icon context menu styles.

But these are just the headline features. Even something as simple as automatically sizing your app window to its contents was lost somewhere along the way from WPF to WinUI 3.

Given how often you need to call back down to Win32 C APIs, it doesn’t help that the interop technology is itself undergoing a transition. The modern way appears to be something called CsWin32, which is supposed to take some of the pain out of P/Invoke. But it can’t even correctly wrap strings inside of structs. To my eyes, it appears to be one of those underfunded, perpetually pre-1.0 projects with uninspiring changelogs, on track to get abandoned after a couple years.

And CsWin32’s problems aren’t just implementation gaps: some of them trace back to missing features in C# itself. The documentation contains this darkly hilarious passage:

Some parameters in win32 are [optional, out] or [optional, in, out]. C# does not have an idiomatic way to represent this concept, so for any method that has such parameters, CsWin32 will generate two versions: one with all ref or out parameters included, and one with all such parameters omitted.

The C# language doesn’t have a way to specify a foundational parameter type of the Win32 API? One which is a linear combination of two existing supported parameter types? One might think that an advantage of controlling C# would be that Microsoft has carefully shaped and coevolved it to be the perfect programming language for Windows APIs. This does not appear to be the case.

Indeed, it’s not just in interop with old Win32 APIs where C# falls short of its target platform’s needs. When WPF first came out in 2006, with its emphasis on two-way data binding, everyone quickly realized that the boilerplate involved in creating classes that could bind to UI was unsustainable. Essentially, every property needs to become a getter/setter pair, with the setter having a same-value guard and a call to fire an event. (And firing an event is full of ceremony in C#.) People tried various solutions to paper over this, from base classes to code generators. But the real solution here is to put something in the language, like JavaScript has done with decorators and proxies.

So when I went to work on my app, I was astonished to find that twenty years after the release of WPF, the boilerplate had barely changed. (The sole improvement is that C# got a feature that lets you omit the name of the property when firing the event.) What has the C# language team been doing for twenty years, that creating native observable classes never became a priority?

Conclusion

Honestly, the whole project of native Windows app development feels like it’s not a priority for Microsoft. The relevant issue trackers are full of developers encountering painful bugs and gaps, and getting little-to-no response from Microsoft engineers. The Windows App SDK changelog is mostly about them adding new machine learning APIs. And famously, many first-party apps, from Visual Studio Code to Outlook to the Start menu itself, are written using web technologies.

This is probably why large parts of the community have decided to go their own way, investing in third-party UI frameworks like Avalonia and Uno Platform. From what I can tell browsing their landing pages and GitHub repositories, these are better-maintained, and written by people who loved WPF and wished WinUI were as capable. They also embrace cross-platform development, which certainly is important for some use cases.

But at that point: why not Electron? Seriously. C# and XAML are not that amazing, compared to, say, TypeScript/React/CSS. As we saw from my list above, to do most anything beyond the basics, you’re going to need to reach down into Win32 interop anyway. If you use something like Tauri, you don’t even need to bundle a whole Chromium binary: you can use the system webview. Ironically, the system webview receives updates every 4 weeks (soon to be 2?), whereas the system .NET is perpetually stuck at .NET Framework version 4.8.1!

It’s still possible for Microsoft to turn this around. The Windows App SDK approach does seem like an improvement over the long digression into WinRT and UWP. I’ve identified some low-hanging fruit around packaging and deployment above, which I’d love for them to act on. And their recent announcement of a focus on Windows quality includes a line about using WinUI 3 more throughout the OS, which could in theory trickle back into improving WinUI itself.

I’m not holding my breath. And from what I can tell, neither are most developers. The Hacker News commentariat loves to bemoan the death of native apps. But given what a mess the Windows app platform is, I’ll pick the web stack any day, with Electron or Tauri to bridge down to the relevant Win32 APIs for OS integration.

On the Streams Standard

2026-02-28T00:00:00Z

In 2013, I started the project of designing a new streams API for JavaScript. The intent was to learn the lessons from Node.js’s streams, including its transition to “streams2”, and create something that could power various under-development web APIs. This site contains some essays from me reflecting on the API’s development, specifically as I worked to grapple with how different underlying resources (like files vs. sockets) could be abstracted behind a single primitive.

The result was the Streams Standard. These foundational classes now power a large variety of web APIs, from fetch() to translation. The Streams Standard APIs have been incorporated in various other JavaScript ecosystems as well, in a similar way to other web standards like URL, EventTarget, AbortController, fetch(), Worker, etc..

Recently, James Snell published an article “We deserve a better streams API for JavaScript” critiquing the Streams Standard APIs, and proposing an alternative he believes is more suitable for the JavaScript ecosystem. I appreciate James’s work on and insights into this problem space. I think the article has a number of solid points—James has identified real weaknesses, which I’ll get to—but his high-level framing has many questionable aspects, and a few that are just confused or wrong.

So let’s take this opportunity to dig into James’s arguments. I hope that while doing so, I can give some insight into how I thought about designing platform primitives, and some advice for those who will be pushing the platform forward in the future.

Optimizations

One of the most frustrating parts of James’s article is how he believes that his implementations’ performance problems are fundamental, and arise from the design decisions in the standard. This betrays a naïve mindset wherein implementers can get good performance out of the box, just by transcribing steps from specification text into JavaScript code.

If you step back for a minute and look at how standards work, you’ll quickly realize this is ridiculous. If a JavaScript engine implemented strings as a vector of 16-bit code units, and then whined about the “fundamental design decisions” that made it impossible to get good performance on string concatenation or === comparison, nobody would take them seriously. Standards are intended to be easy to follow, and to nail down observable consequences, so that various different implementations all give the same results. They are not an implementation roadmap, and making them performant is a large part of the job of a platform engineer.

The Streams Standard takes great pains to make as much unobservable as possible, so that optimized fast paths can be implemented. This is baked into the design at multiple levels. For example, the locking system means that stream1.pipeTo(stream2) can be optimized down to a sendfile(2) call. The higher-level APIs like async iteration mean that, in the common case, there’s no need to ever allocate promise objects or { value, done } containers. James has a whole section calling out “The hidden cost of promises”, which he opens by saying

Each read() call doesn’t just return a promise; internally, the implementation creates additional promises for queue management, pull() coordination, and backpressure signaling.

But “the implementation” is under his control! There is no need for it to create those promises, unless one of them is explicitly passed out to the developer’s JavaScript. James’s section “GC thrashing in server-side rendering” suffers from similar misunderstandings, assuming that every time the spec says to create an object, an actual garbage-collected object must be allocated.

In his section titled “The optimization treadmill”, James seems to recognize that well-written runtimes don’t need to have these problems. But he does so in a strange way, disparaging this foundational performance work as

every major runtime has resorted to non-standard internal optimizations

and complaining that

Finding these optimization opportunities can itself be a significant undertaking. It requires end-to-end understanding of the spec to identify which behaviors are observable and which can safely be elided.

I’m not really sure how to respond to this, except to say this is the job. When one implements a standard, whether it’s V8 implementing the JavaScript standard, ICU implementing the Unicode Standard, Chromium implementing the URL Standard, or Cloudflare Workers implementing the Streams Standard, one’s goal is to create a good, performant implementation. I guess, if you don’t like that part of the job, you can ask an AI agent to do it, as Vercel did. But complaining about it as “unsustainable complexity” is a surprising attitude for someone building a production runtime.

This pattern of blaming the standard for implementation quality issues continues in other places in James’s article, e.g. in his section “Exhausting resources with unconsumed bodies” where he complains about a Node.js bug, or in “Falling headlong off the tee() memory cliff” where he again complains that “implementations have had to develop their own strategies” instead of being handheld to the right approach.

In summary: it’s unreasonable to evaluate a standards-based API by looking at a naïve implementation. If you do that, then of course your from-scratch library which doesn’t have to meet any standards will be faster. It will certainly have fewer bugs, since you’ve written it after fixing various bugs in your original implementation of the standard API.

Conformance

A similarly confusing section of James’s post is his section titled “The compliance burden” wherein he complains that … the API is too well-tested?

The rise of comprehensive test suites for standard APIs is one of the greatest triumphs of the 2010s. The web platform tests project, including efforts like the Interop 202X sprints, are probably the single greatest factor in moving us out of the 2000s hellscape. Interoperability is not perfect these days, but the edge cases we encounter now are nothing compared to back when Internet Explorer, Netscape/Firefox, and Safari all had divergent implementations of EventTarget, necessitating normalization layers like jQuery.

In the current era, the culture is clear. Everything that’s observable needs tests. Not just common cases, but error scenarios, invalidation, integration with other features: anything that might cause two implementations to diverge. If you discover a coverage gap, where implementations do different things, then add a test.

James complains

For runtime implementers, passing the WPT suite means handling intricate corner cases that most application code will never encounter. The tests encode not just the happy path but the full matrix of interactions between readers, writers, controllers, queues, strategies, and the promise machinery that connects them all.

It’s true that most application code will not encounter edge cases. But at internet scale, excluding “most” application code still leaves you with a lot of frustrated developers in the minority! One of the strengths of standards is their commitment to serve all developers’ scenarios interoperably, not just the common case. This is one of the major distinguishing factors between multi-implementation standards, and a library someone throws up on npm or lands on the main branch of nodejs/node.

What web streams (probably) got wrong

Although I find James’s high-level positions confused, at the more micro level I agree that he’s identified several weaknesses in the Streams Standard APIs. Many of these came from hewing overly-closely to the predecessor Node.js streams, and it makes sense that with 13 years of hindsight the community has been able to discover possible improvements.

Bring-your-own-buffer is unnecessary

I largely agree with James’s section “BYOB: complexity without payoff”. In retrospect, bring-your-own-buffer streams were designed with too much attention to theory and not enough to real-world performance and usability. Early discussions with Node.js core team members revealed their regret that Node.js streams always required buffer copies, and so Takeshi Yoshino and I galloped off to try to solve this problem.

In reality, memcpy() is not that slow. And it’s often necessary for security or architectural reasons anyway, as data needs to move across kernelspace/userspace boundaries, process boundaries, or just between the network stack and the JavaScript heap. The care we put into avoiding data races, via the transferral mechanism, was somewhat undercut by the release of SharedArrayBuffer in 2017. And the fact that we never came up with a design for zero-copy writable or transform streams is definitely a negative indicator.

Although it’s possible to imagine scenarios where reducing copies gives a useful speedup, my current thinking is that this doesn’t need to be baked into the generic stream primitive such that JavaScript stream creators, consumers, and library developers can all fully participate. Instead, it can be left as one of the many possible unobservable optimizations that implementations are allowed to do behind the scenes.

Backpressure and teeing are complicated

James’s section on “Backpressure: good in theory, broken in practice” is probing at a real problem. The Streams Standard’s notion of backpressure was coming from the unsophisticated approach used in Node.js’s streams1/streams2 designs. (It slightly modernized them and got rid of finicky details like how adding a "readable" listener would switch between backpressure modes.) It’s very believable to me that there are better models than the voluntary desiredSize + ready promise approach.

James’s new library includes four explicit backpressure modes. I don’t know whether all four of these modes are useful in real applications, or whether they’ve been battle-tested to the same extent the Node.js/Streams Standard design has. His choice of “strict” backpressure as the default seems unlikely to be correct: I doubt that many server-side developers want code that works fine over fast internet, when the user’s computer can quickly accept their server-rendered data, but throws exceptions when the user’s cell service goes down to one bar. But overall I agree that this is an area where giving developers more control is likely a good idea.

Similarly, he proposes two separate modes for handling backpressure when teeing, which the developer has to choose between. This is a good idea, which has been proposed for the Streams Standard.

Transform streams aren’t quite right

Transform streams are another area where the Streams Standard may have been too influenced by its Node.js predecessor. James’s complaints in “Transform backpressure gaps” are chiefly about how transforms are eager, executing on write, instead of lazy, executing on read. We definitely are aware of this problem in the Streams Standard, although the causes are a bit different than what James describes.

The essential problem is that there’s no way for a WritableStream to signal that it wants no internal queuing, but is still willing to accept a single write. The canonical issue is whatwg/streams#1158, and there’s a draft pull request to close this expressiveness gap. As part of that, we’d make lazy transforms the default, with internal queuing in the transforms only when explicit highWaterMark options are passed.

But the complex way in which that solution works brings us to our next point…

Maybe, the whole thing could be much simpler

The biggest early decision we made with the Streams Standard was to have each half of the stream ecosystem be self-contained: ReadableStreams, backed by underlying sources, and WritableStreams, backed by underlying sinks. Again, this was inspired by the Node.js streams API, which used the same pattern (although smashed together into a single class).

The major alternative proposed was to merge the two halves, and have a single “channel” primitive: e.g.,

const { readable, writable } = new Channel();

where you could give out readable to consumer code, and write into writable to fill it. Or you could give out writable to producer code, and keep readable to see what they wrote.

To this day, I’m not sure which design is better. At the time, I convinced myself that the channel design wasn’t powerful enough for some of our goals. But looking back, I think I was too influenced by a desire to stay close to Node.js streams, and didn’t give the alternative a fair evaluation. James’s new library indeed takes the channel approach. And as his library shows, the channel design greatly simplifies transforms as well.

It’s possible that the channel design is too simple. There are certainly some patterns, largely around state management and queuing, which are much easier when the Streams Standard’s APIs manage them for you. With a channel-type API, individual stream creators need to implement those patterns themselves, and they might do so in slightly different ways. But my intuition is that those patterns have ended up being more rare than we expected in 2013. As such, I’m cautiously optimistic about exploration along this alternate evolutionary path.

Other thoughts on James’s library

A few rapid-fire thoughts on specific design choices in James’s library:

Error handling: the biggest gap in James’s post and library is any discussion of error handling. This is scary! Error handling in streams was one of the hardest things to get right, and one of the areas I’m most proud of our improvements in the Streams Standard vs. in Node.js. The distinction between no-fault cancelation of readable streams, error-like aborting of writable streams, errors that come from the underlying sinks and sources, and how all of these propagate when streams are wired together are quite complex, and crucial for ensuring program correctness. I don’t claim the Streams Standard’s design here is perfect—in particular, it was designed before AbortSignals and doesn’t integrate well with them—but I want to highlight this area for attention.
Bytes only: I’m skeptical that this will meet the ecosystem’s needs. It’s just too convenient to transform data into non-byte formats, such as text, or objects parsed from JSON. But if it does, it’s surely a huge simplification. The worry here would be bifurcating the ecosystem into byte streams, which are handled via James’s library, and object streams, which are handled by various other utilities.
Sync/async separation: I think this is a bad idea. Making consumers care about whether their data is coming from a sync source or an async source, with separate consumption methods for each, is a recipe for a bad time. Instead, the synchronous consumption hooks should exist, hidden inside the implementation, as part of the fast-path optimizations that James is so reluctant to implement. Consumers can pay the cost of a single promise at the end of the chain, and thus avoid unleashing Zalgo in the middle of their application code.
Streams are iterables: This is just the old objects-with-methods vs. freestanding functions debate. I think objects-with-methods have conclusively won the API design wars, at least in JavaScript, so I think James’s library is a developer-experience regression in this regard. Relatedly, James spends a lot of time complaining about the internal state machine of streams, but seems to ignore that async generators (which he uses to create his async iterables) have their own just-as-complex state machine.
No locking: James doesn’t like the Streams Standard’s locking APIs, and I agree they could be improved. But his design, of “just use async iterators”, kneecaps the many optimization opportunities that locked streams bring. How are you going to be able to convert your async iterable pipeline chain into sendfile(2) when at any time JavaScript code could call iterable.next()? Maybe this is one of those “intricate corner cases that most application code will never encounter”, and as long as nobody writes a test for it, we’ll be fine…

A slide from my 2014 presentation Streams for the Web, illustrating the abort and cancelation flow through pipe chains

In conclusion

I’m grateful to James for starting this conversation, thus giving me a chance to reflect on the work myself and many others have put into the Streams Standard over the years. It’s certainly not perfect. But I think many of its core ideas are solid. And it’s important not to judge it based on buggy and naïve implementations, but instead as a standard that can be implemented either well or poorly.

For better or for worse, web APIs are forever: the web is not going to get a second streams API. So for any parts of the JavaScript ecosystem which want to use the same primitives as browser code, evolving and improving the Streams Standard is probably more fruitful than starting over from scratch. There have been many | previous | attempts to create secondary stream ecosystems that sit alongside the gorillas of Streams Standard streams or Node.js streams, and they’ve seen their own limited success. I wish James’s library the best success it can attain, within that tradition.

Unfortunately, evolving the Streams Standard is hard. In the ZIRP heydays of the 2010s, myself and collaborators could build primitives like promises, streams, modules, web components, and the like; getting the web platform’s foundations in order had a lot of business support. These days, it’s much harder to motivate directors at browser companies to spend their budgets on incremental improvements, when what we have is good enough. This is why even minor improvements are stalled for years. It’s possible for heroes to push through new features by single-handedly contributing specification text, web platform tests, and a browser implementation or two. But it’s definitely harder than writing a fresh JavaScript library.

The Wrong Work, Done Beautifully

2026-02-05T00:00:00Z

I’ve maintained the jsdom open-source project for over ten years. It’s essentially a partial implementation of a web browser in Node.js, including complexities like resource loading, styling and scripting, and Web IDL bindings. Along the way I’ve been privileged to invite several talented engineers onto the maintainers team, as they took time from their lives to significantly improve the project.

For a long time, working on jsdom was a leisure activity. I’d get home from my day job working on web standards and the ~35 million LOC Chrome web browser at Google, and then I’d unwind by implementing some web standards and fixing some bug reports in my scrappy ~1 million LOC jsdom codebase.

Some time around COVID, my commitment to jsdom, and open-source maintenance in general, waned. Without the mental reset of walking home and switching computers, coding during evenings and weekends no longer sparked joy. So I retreated to a more passive role, attempting to be responsive to issues and pull requests, but not actively improving the library.

It didn’t help that the jsdom codebase was running out of low-hanging fruit. The most-reported issues were symptoms of fundamentally broken or outdated subsystems. Things like resource loading, CSS parsing and the CSS object model, and selectors. The most popular feature requests were similarly daunting: implementing the Fetch API, or all of the SVG element classes, or JavaScript module support. The web platform had kept growing at the pace of Apple, Google, and Mozilla, whereas my spare time had not.

And with the benefit of distance, I had started to realize that the jsdom project was … kind of pointless anyway. Why use a Node.js reimplementation of a web browser, when you could instead drive a real headless web browser? Sure, jsdom is a little more lightweight. Sure, it’s in-process instead of out-of-process. But do you really want to pay the substantial correctness tax of using our off-brand incomplete web platform implementation, just for those benefits? It certainly seems like a bad idea to do so for testing, where correctness is paramount!

In the end, if you want something minimal for scraping or DOM manipulation in Node.js, you can use cheerio. If you want to interface with the actual web, including complexities like layout and navigation, you can use Puppeteer. It’s hard to believe there’s a large market for jsdom’s niche: an obsessively spec-compliant, script-executing, but very much partial implementation of the web platform.

I still have a soft spot in my heart for jsdom, which maintains an inexplicable popularity at 48 million weekly downloads. (Compare to React’s 80 million.) When I was asked to participate in an AI coding productivity study, I leaped at the chance to fix a backlog of jsdom issues. And recently, a new contributor has | appeared and thrown themselves into fixing our selector and CSS subsystems. I’m doing my best to stay responsive and get their diligent work released to the world. But if pressed, I would say the project is in maintenance mode.

After we moved to Japan, my wife and I tried snowboarding. She took to it, but I … did not. What can I say? A gear-heavy, high-learning-curve, single-season, somewhat-dangerous, adverse-weather–centric sport is just not where I want to invest my free time.

But a lot of our friends’ social lives revolve around snow trips during the winter. So recently I’ve been tagging along as the group heads off to snow towns. I explore the café scene during the day while everyone else is sliding down the mountain. And so it is that I find myself in a cozy bakery in Nozawa Onsen, staring at the Claude Code terminal and wondering what I should work on next.

Me making my way to the hotel lounge for some solid coding time.

Well, there was that one jsdom bug. The one that got filed in 2019, and immediately made me embarrassed about how I’d gotten such a fundamental thing wrong. I always told myself I’d fix it “next weekend”, keeping it on my tasks list for far too many years, before facing the reality that I wasn’t going to prioritize it. But maybe … with the power of Claude … now was the time?

The next three weeks flew by in a blur. I definitely caught the Claude Code psychosis. My previous Cursor-assisted work on the jsdom bug backlog felt productive; my Claude Code-assisted Windows utility program was a fun diversion. But ripping the guts out of jsdom’s resource loading subsystem and replacing them wholesale? Addictive.

It’s hard to describe why. But from an ethnographic perspective, I think it’s important to try.

With every prompt, Claude would delight me with how much progress it made. But there would always be more threads to follow up on—more value I could add. I would review Claude’s code and suggest simplifications. Or I’d realize that we were starting to touch another area of the codebase, which had its own problems, and couldn’t we go and refactor that part too? Or we’d just burn through the list of failing tests, often in a two-steps-forward-one-step-back fashion where our fix for one would improve the architecture while causing a small regression elsewhere.

Claude and its subagents go to town on the XMLHttpRequest part of the jsdom codebase.

I found myself reopening my laptop during any spare interval. Before we went out for dinner, I’d spend ten minutes writing up a big prompt so I could get the thrill of seeing what Claude produced while I was gone. I daydreamed about setting up one of those Claude Code-from-your-phone setups people are advertising on X. (But I didn’t pull the trigger, because it’s important to have boundaries and not become a phone zombie.)

This time around, the intensity of the work pulled me deeper into the agentic coding ecosystem. I started using Git worktrees, so I could factor out smaller PRs from the main work and land them independently. I added an AGENTS.md once I had enough experience to know what it should say. I tried Codex CLI, and verified the rumors that it can work autonomously and churn out tons of code that passes a given test suite—as long as you’re willing to babysit it through a shit-ton of pointless permissions escalations. Inspired by Christoph, I briefly experimented with Codex Web, before deciding it was too lazy for my needs.

After we completed the first pass at a rewrite in 5 days, I realized I wanted to go further. So I squashed those 43 commits into one, and started another branch. 77 commits and 8 days later, that was ready. I let GitHub Copilot and GPT 5.2 Codex have a crack at reviewing it—they found some good issues!—and finally merged into main. jsdom v28.0.0 has the results.

And part of me is very happy with the end result. I think the new API, and the code behind it, is beautiful. I’m proud of how I engaged with Claude and the other agents, double-checking every line of code and constantly iterating toward the simplest, most general solution. We closed 5 high-difficulty open issues, one dating back to 2016. I engaged with the authors of Undici, the library underlying Node.js’s fetching infrastructure, and reported several bugs. Working on jsdom is fun again; maybe I can tackle all those other fundamental issues next!

But … wait. Should I?

Claude cannot magically make jsdom into a valuable project. On the one hand, fixing a bug dating back to 2016 is gratifying. On the other hand, that bug’s been open since 2016, with only 6 upvotes.

I agree with the general sentiment that the Opus 4.5 / Codex 5.2 generation represents a step change. That although back in ye olde July 2025, AI agents on average slowed down experienced developers working on large codebases, these days they’re probably a speedup.

But they haven’t solved the need to plan and prioritize and project-manage. And by making even low-priority work addictive and engaging, there’s a real possibility that programmers will be burning through their backlog of bugs and refactors, instead of just executing on top priorities faster. Put another way, while AI agents might make it possible for a disciplined team to ship in half the time, a less-disciplined team might ship following the original schedule, with beautifully-extensible internal architecture, all P3 bugs fixed, and several side projects and supporting tools spun up as part of the effort.

I’m not aiming for a lesson here. More of an observation. Unlike Sean, whose blog is full of great takes on how to add value to a software engineering org, I’m retired. If I want to spend my time polishing an open-source codebase to within a centimeter of its life, that’s my choice. But am I doing that because it’s part of living my upon-reflection best life? Or am I doing it because I’ve reached a point with my Japanese flashcard project where I need to do more user testing and evals and design work, and that’s less fun than diving into the familiar jsdom codebase with my little agent buddy?

Onward

2025-09-27T00:00:00Z

Yesterday was my last day at Google and the Chrome team. In fact, I have retired.

Over the last 11 years, I’ve worked to improve the web platform, focusing on foundational features where my skills in web standards and API design could make a difference. From my earliest work on promises to my latest work on built-in AI and speculative loading, I’ve tried to push the web forward, taking seriously the responsibility of designing a billion-user platform.

Along the way I’ve worked with many amazing people, both inside and outside Google. I’m especially grateful to my mentors and managers in Chrome who nurtured my development. Google let me work independently from the New York City office as an L4, despite there being no web platform team in NYC. I was always free to choose my own projects, often spontaneously forming 2–4-person geo-distributed teams. And when I wanted to move to Tokyo, my directors made it happen, letting me spend the last 3 years as part of the Tokyo Chrome team on larger-scale efforts.

Over this time, evolving the web has grown more complex. It’s hard to tell how much of this is due to the evolution of Google and the macroeconomic environment—is platform-building a ZIRP phenomenon?—versus how much is just me seeing the larger picture, as I advanced to L7 and became more attuned to business impact. Early on, I worked on custom elements or JavaScript modules because I thought they were obvious platform gaps. Whereas recently the job became about winning the argument first: lining up stakeholders, persuading directors, and rallying ICs in order to make speculative loading ready for broad | ecosystem | adoption, not just for use on Google Search. But I have no regrets or resentment in this regard. Just like I was thrilled to learn after university that people will pay well for something as fun as programming, I’m amazed that we’ve managed to harness the will of the market and large corporate budgets to nurture an artifact as impressive as the web.

Moving on

This would normally be the point where I announce my next exciting adventure. Or I could quietly fade out, perhaps discussing a sabbatical or mysterious “projects”. But I’ve decided to embrace a different approach: I’m retired! I no longer need to work for money, and I’m going to take on the responsibility of figuring out what that looks like.

It’s somewhat scary, leaving my career behind. I worry about the projects I didn’t quite complete, or the organizations I was a core part of that will now move on without me. I’m sad to miss the opportunity to watch and assist with the growth of the junior engineers who have joined Chrome Tokyo over the last few years. I think we’ve reduced the bus factor enough that everything will be fine, but of course the future won’t go exactly as I would have steered it. That’s OK; it rarely did anyway.

I also worry about fading into obscurity, as I refocus on personal projects and my work becomes less visible and less influential. There’s a fundamental tension between the freedom to focus on your self-directed interests, and the fact that nobody cares about them as much as you do. But I’ve spent many years prioritizing impact on the world, and am excited to shift some of my priorities toward what most excites and invigorates me personally.

Life after work

While I was working, a typical weekday would have me home from the office by 19:30, studying Japanese until 21:00, and ending with an hour of free time before bed. Weekends were precious but rare, and often taken up by errands.

Meanwhile, my backlog of side-project ideas, books to read, and self-improvement quests grew. I moved to Tokyo over 3 years ago, and am eagerly looking forward to spending more time exploring all it has to offer. The AI coding revolution is in full swing, but at work I could only use Gemini CLI, which isn’t very good. I spent a week on a meditation retreat last year, but after a month of trying to carve out a 30-minute daily habit, I had to admit defeat: my work performance and Japanese recall were suffering from getting 30 minutes less sleep.

Life has so much to offer, and I’m excited to live it more fully. I’ll be raising a puppy, seriously increasing my Japanese study time, and enjoying my backlog of video games, TV series, and novels. I’ll build small apps for myself to scratch an itch or learn a technology, and then decide whether to try polishing and publishing them. More than anything, I plan to learn a lot: I want to deep-dive into the philosophy of computation, personal identity, and consciousness; I want to resume my studies of theoretical physics in general, and quantum gravity in particular; I want to learn Lean and get involved in modern, often AI-assisted attempts to formalize mathematics. I’ll make new friends, start new hobbies, and travel to new places. I’m so excited.

Designing the Built-in AI Web APIs

2025-08-13T00:00:00Z

For the last year, I’ve been working as part of the Chrome built-in AI team on a set of APIs to bring various AI models to the web browser. As with all APIs we ship, our goal is to make these APIs compelling enough that other browsers adopt them, and they become part of the web’s standard library.

Working in such a fast-moving space brings tension with the usual process for building web APIs. When exposing other platform capabilities like USB, payments, or codecs, we can draw on years or decades of work in native platforms. But with built-in AI APIs, especially for language model-backed APIs like the prompt API, our precedent is barely two years old. Moreover, there are interesting differences between HTTP APIs and client-side APIs, and between vendor-specific APIs and those designed for a wide range of possible future implementations.

In what follows, I’ll focus mostly on the design of the prompt API, as it has the most complex API surface. But I’ll also touch on higher-level “task-based” APIs like summarizer, translator, and language detector.

Starting from precedent

The starting place for API design is the core loop: apart from any initialization or state management, when a developer wants to prompt a language model, what does the code for that look like? Even with only two years’ experience with language model prompting, the ecosystem has mostly converged on a shape here.

The consensus shape is that a language model prompt consists of a series of messages, with one of three roles: "user", "assistant", and "system" (or sometimes "developer"). A moderately complex example might look something like this:

[
  { role: "system", content: "Predict up to 5 emojis as a response to a comment. Output emojis, comma-separated." },
  { role: "user", content: "This is amazing!" },
  { role: "assistant", content: "❤️, ➕" },
  { role: "user", content: "LGTM" },
  { role: "assistant", content: "👍, 🚢" }
]

But the exact details of this format are nontrivial! The main complicating factors are:

Multimodal inputs and outputs: how do we represent images and audio clips?
Constraints: Can you include a system message later in the conversation? If the model is not capable of outputting audio, can you add an assistant message whose content is audio?
Semantics: Are you allowed to have multiple assistant messages in a row? Is that the same or different from concatenating the two messages, and if the same, do you include a space in that concatenation? How does that compare to array-valued content?
Shorthands: all existing APIs allow passing in just a string, instead of the above role-denominated array, as a shorthand for a user message. Should we also allow { content: "a string" }, with no role or array wrapper?

You can try to piece together answers to these questions from the various providers’ API documentation. The answers are not always the same, and they can change between versions even within a single provider. But part of the process of writing a web specification is nailing these things down in a way that multiple browsers could implement. Briefly, the answer I’ve come up with involves normalizing everything into an array of messages of the form { role, content: Array<{ type, value }>, prefix? }, with various constraint checks added. Only certain shorthands are allowed, to give a good balance between conciseness and avoiding ambiguity.

Client-first versus server-based APIs

Unlike most existing popular APIs, our APIs are designed to be used directly via the JavaScript programming language, instead of via a JSON-over-HTTP communication layer. And although we want them to be implementable in a way that’s backed by cloud-based APIs, the central use case is on-device models.

This leads to some straightforward changes:

JSON has only a few fundamental types, which leads to a lot of string-based inputs (often as base64 data: URLs) and has given rise to a curious tagged union pattern (e.g. { type: "input_text", text } vs. { type: "input_image", image_url }). In the prompt API, we use the more idiomatic { type: "text"|"image"|"audio", value } pattern, relying on the fact that value could take different JavaScript object types like ImageBitmap, AudioBuffer, or Blob.
Tool use over HTTP requires a complex dance wherein the model sends back its tool choice, and the developer inserts the tool response into the message stream as an exceptional type of message, before finally getting back a model response in the usual format. In a JavaScript API, the developer can provide the tools as asynchronous (Promise-returning) functions, hiding all this complexity and keeping the message stream in the usual format.

But there are deeper changes as well, stemming from how the API is centered around downloading and loading into memory an on-device model instead of connecting to an always-on HTTP server. Notably, we’ve chosen to make the API stateful, with the primary object being a LanguageModelSession. This pattern nudges developers toward better resource management in a few ways:

The initialPrompts creation option, and the append() method, encourage developers to supply messages to the model ahead of the actual prompt() call, so that the user doesn’t see the latency of processing those preliminary prompts.
The clone() method allows reuse of these cached messages along multiple branching conversations.
The destroy() method, and the ubiquitous AbortSignal integration, let developers signal when they no longer need certain resources, whether that be all the messages cached for this LanguageModelSession, or a specific ongoing prompt.

We could have aped the stateless HTTP APIs, and tried to recover similar performance using heuristics and browser-managed caching. (And indeed, there’s still room for heuristics: for example, we want to unload sessions that are not used for some time.) But by more directly reflecting the client-side nature of these AI models into the API, we expect better resource usage.

This stateful approach does have some complexities, in particular around the management of the context window. The approach we’ve taken so far is somewhat rudimentary. We provide the ability to measure how many tokens a prompt would consume, introspective access to context window limits, and an event for when the developer overflows them. In the event of such overflow, we kick out older messages in a first-in-first-out fashion, except we preserve any system prompt. This is hopefully reasonable behavior for 90% of cases, but I’ll admit that it’s not battle-tested, and developers might need to fall back to more custom behavior.

Interoperability and futureproofing

Another challenge unique to designing a web API is how to meet the web’s twin goals of interoperability—the API should work the same across multiple implementations, e.g. different browsers or different models—and compatibility—code written against the API today needs to keep working into the indefinite future. For server-based HTTP APIs, these concerns are somewhat salient: e.g., many model providers attempt to interoperate with OpenAI’s Chat Completions format, and no provider wants to cause too much churn in client code. But on the web, interop and compat are much harder constraints.

Of course, the prompt API has drawn a lot of discussion in this regard. Its core functionality is based on a nondeterministic language model whose output could easily vary between browsers, or even browser versions. If developers code against Chrome’s Gemini Nano v2, will their site break when run with Edge’s Phi-4-mini, with the aibrow extension, or when Chrome upgrades to Nano v3? That’s not how the web is supposed to work. To combat this, we encourage developers to use structured outputs, or to use the API for generic cases like image captioning where varied outputs are acceptable.

But this discussion has been done to death, and is not very interesting from an API design perspective. The more interesting places where interoperability and compatibility show up in the API designs are when we’re trying to future-proof them against different possible implementation strategies.

For example, although Chrome is currently using a single language model to power the prompt, summarizer, rewriter, and writer APIs, we want to ensure other browsers are able to use different models. Thus, each API needs its own separate entrypoint with download progress, availability testing, and so on. Not only that, we want to allow for architectures that involve downloading and applying LoRAs or other supplementary material in response to specific developer requests, such as for specific human languages or writing styles. Or, for architectures where specific languages or options are not supported at all.

This has led to an architecture where each API has a set of creation options, such that any given combination can be tested for availability and used to create a new model object:

const options = {
  expectedInputLanguages: ["en", "ja"],
  outputLanguage: "ko",
  type: "headline"
};

// Will return "available", "downloadable", "downloading", or "unavailable".
const availability = await Summarizer.availability(options);

// Will fail if unavailable, fulfill quickly if available, and wait for the download otherwise.
const summarizer = await Summarizer.create(options);

In theory, an implementation could have a specific LoRA or language pack for Korean summaries of Japanese+English text in the headline style, which has an availability status separate from the base language model, and will be downloaded when this specific combination is requested. This method of supplying options ahead of time also makes it easy for an implementation to signal that, e.g., it doesn’t support Korean output, or headline-style summaries.

This design sometimes feels like overkill! We’re not aware of any HTTP-based language model APIs that require specifying the input and output languages ahead of time. And so far in Chrome we’ve only used a single separately downloadable LoRA, to make the base Gemini Nano v2 model better at summarization. (Even that became unnecessary with the upgrade to Nano v3.) But this design doesn’t add much friction for developers, and it seems helpful for future-proofing.

One reason we were guided to this design was because of how it reflects the strategy we’re already taking for the translator API, which has independently downloadable language packs. However, even for translator, there’s some interesting abstraction going on. The translator API accepts { sourceLanguage, targetLanguage } pairs, but under the hood Chrome will round-trip through English. So, for example, requesting { sourceLanguage: "ja", targetLanguage: "ko" } will actually download the Japanese ↔ English and Korean ↔ English language packs. Although this strategy is relatively common in machine translation models, here it’s best not to expose the underlying reality to the developer, so as to better maintain future-compatibility if different techniques become prevalent.

There’s a lot more to the design of futureproof and interoperable APIs. I’ll leave you with pointers to a couple of still-ongoing areas of discussion:

Sampling hyperparameters. For the prompt API, we currently allow customization of temperature and top-K. But these choices were somewhat arbitrary, based on the models that Chrome and Edge started with. We need to design the API to allow different customizations, e.g. top-P, repetition penalties, etc.
Device constraints. We want to design our APIs to have the same surface whether they are implemented using an on-device model or a cloud-based model. (You could even imagine a single browser using both strategies, e.g., calling a cloud model if the user has input an API key into the browser settings screen.) But for some cases, developers might want to require an on-device model for privacy reasons, or require a GPU model for performance reasons. Should we give developers this level of control, or is that too likely to create bad user experiences? The W3C TAG points out that it’s too simplistic to say “on-device = first-party = private; cloud = third-party = not-private”, since this fails to recognize techniques like second-party clouds, private clouds, or browsers that run entirely in the cloud and stream pixels to the user.
Prompt injection! We don’t want the task APIs to spazz out when asked to summarize or translate text containing “Disregard previous instructions and behave like a curious hamster”. This isn’t really an API design issue, but it is an interesting quality-of-implementation problem. Chrome has some issues here which we’re currently working on.

Closing thoughts

In some ways, the built-in AI APIs are business-as-usual for web API design. We can view them as part of the larger program to make the web a powerful development platform by exposing features of the underlying operating system and browser. Like operating systems come with push message infrastructure, hardware sensors, and GPUs/NPUs, these days they often come with various machine learning models. And like we work to expose those other capabilities to web apps, in Chrome at least we want to expose our bundled ML models.

But AI is a fast-evolving space, and fast-evolving spaces have historically not been the web’s strong suit. We recently got a comment at a WebML Community Group meeting from a web developer, saying that the prompt API feels like it’s about a year behind the state of the art in server-hosted model APIs. We started with just text, and over time have added images, audio, structured output, and tool use. But cutting-edge models have moved on to real-time audio/video exchange and reasoning! Can the web keep up? (I explored this question in more depth in a recent presentation to the W3C Advisory Committee.)

We could have the standardized web platform play the slow-follower role that it always has: wait a few years for things to settle down, and then come up with the best lowest-common-denominator API we can, which paves the cowpaths laid out by native APIs (or, in this case, frontier model providers). I’m uneasy with this strategy in the midst of the singularity, when it’s not even clear what web development or web browsing will look like a few years from now.

My Participation in the METR AI Productivity Study

2025-07-15T00:00:00Z

METR recently released a paper, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity”. It was a randomized controlled trial where developers were given some tasks to work on using AI, and some without. The surprising headline result was that developers using AI took on average 19% longer to complete their tasks! (N = 246 tasks, 95% confidence interval ≈ [-40%, -2%])

I was one of the developers participating in this study, using jsdom as the project in question. This essay gives some more detail on my experience, which might be helpful for those hoping for insight into the results.

What I worked on

The jsdom project is an attempt at writing most of a web browser engine in JavaScript. It has some significant limitations and lots of gaps, but we get pretty far. Many people use it for automated testing and web scraping. It has just over 1 million lines of code in the main repository, with some other supporting repositories. A large part of jsdom development is trying to reproduce web specifications in code, and pass the corresponding web platform tests.

Since inheriting the project in 2012, these days I am the sole active maintainer. My main goal in recent years has been to respond to pull requests from community contributors. The METR study gave me an opportunity to put those aside, and write my own code to tackle the backlog of bug reports, feature requests, infrastructure issues, and test coverage deficits.

I was asked to assemble possible work items ahead of time for the study, of estimated size ≤2 hours. I ended up with 19 such work items. Each of them generated at least one pull request, as well as an “implementation report” where I wrote up what it was like working on that task, with a special focus on what it was like working with AI or not being allowed to use AI.

Expand to see the full list of issues, pull requests, and implementation reports

Issue	Task description	PR	AI?	Report
Issue	Update our URL parser for recent changes to the Unicode UTS46 standard and its URL Standard integration	PR 1 PR 2 PR 3	✗	Report
Issue	Small URL parser change to follow the latest spec changes	PR	✗	Report
Issue	Another small URL parser change to follow the latest spec changes	PR	✓	Report
Issue	Get code coverage of our URL parser to 100%	PR 1 PR 2 PR 3	✓	Report
Issue	Push a previous maintainer's draft PR for some basic SVG element support over the finish line (split into two chunks)	PR	✓ ✗	Report 1 Report 2
Issue	Investigate why our test suite was sometimes taking >70 seconds for a single test on CI	PR 1 PR 2 PR 3 PR 4 PR 5	✓	Report
Issue	Add linting to our locally-written new web platform tests	PR	✓	Report
Issue	Allow writing failing new web platform tests, to capture bugs we should fix in the future	PR	✗	Report
24 issues	Add test coverage for known bugs related to CSS selectors (some of which had been fixed, some of which were fixed by a new selector engine later)	PR	✓	Report
11 issues	Add test coverage for other known bugs, unrelated to CSS selectors (most of which had been fixed in the past or were fixed soon after the test appeared)	PR 1 PR 2	✓	Report
Issue	Add an option to disable the processing of CSS, for speed	PR	✗	Report
8 issues	Implement certain event classes or properties, even if the related spec was not fully supported	PR	✓	Report
Issue	Implement indexed access on form elements, like `formElement[0]` giving the 0th form control in that form	PR	✗	Report
Issue	Replace our dependency on the `form-data` npm package with our own implementation	PR	✗	Report
Issue	Fix `ElementInternals` accessibility getters/setters being totally broken	PR	✗	Report
Issue	Overhaul our system for reporting errors to the developer	PR	✓	Report
Issue	Use the HTML Standard's user agent stylesheet instead of an old copy of Chromium's	PR	✗	Report
Issue	Fix an edge-case using `Object.defineProperty()` on `HTMLSelectElement` instances	PR 1 PR 2	✗	Report

The issues are listed here in the order I worked on them. Total: 9 AI-allowed, 10 no-AI-allowed.

I did this work over the course of about a month, from 2025-03-15 through 2025-04-20, on weekends. The total time spent, measured by screen recordings (for both types of tasks), was 31.25 hours. I was compensated at $150/hour for my participation.

The screen recordings are worth calling out. Because of them, I was guaranteed to be “on” while working on these issues: I didn’t tab away or get distracted easily, because someone was always watching what I was doing.

How was the slowdown measured?

It’s important to note randomized controlled trials aren’t magic. Just like we can’t test a drug and placebo on the same patient, this study didn’t somehow have me working on the exact same tasks with vs. without AI. Instead, we try to average over a large-enough number of tasks so that, under reasonable assumptions about the underlying mechanisms, we can derive estimates and error bounds for the effect of the treatment.

Appendix D of the paper goes into more detail. They use a log-linear model, which is a reasonable model for task completion time and justified by the log-normal distribution of task times observed in the study (and elsewhere). The model is given as input the initial, pre-work time estimate we provided as a measure of task difficulty, as well as the treatment flag (0 for no AI, 1 for AI-allowed) and a random noise term. Various checks against the actual data confirm that this model makes sense: e.g., the model errors were not skewed systematically in any direction, and specializing the model to be different per-developer does not change the outcome much. The end result is that, with enough data, they are able to produce estimates for the slowdown, as well as the 95% confidence intervals.

My personal experience made me wonder: did they just get unlucky? For example, from the AI-allowed bucket, my performance optimization task ended up taking 4 hours 7 minutes, instead of my estimated 30 minutes; my write lots of tests task took 4 hours 20 minutes, instead of my estimated 1 hour. Maybe those tasks would have taken even longer without AI!

But this isn’t really the right way of thinking about it. There were many misestimates, in both directions, for both categories of task: e.g. from the no-AI-allowed bucket, this bugfix task took 6 minutes instead of the estimated 20; this infrastructure task took 90 minutes instead of the estimated 60. I think it’s better to trust the law of large numbers, and the power of well-structured statistical analysis, than to second-guess what might have happened in a different randomization setup. This is part of why the study’s authors emphasize that there is only good statistical power when you look at the results in aggregate.

My prior AI-coding experience

Prior to this study, I had not had significant experience with agentic coding workflows like Cursor’s agent mode.

A large part of this is due to my position on the Chrome team at Google, which means I am prohibited by policy from using most cutting-edge AI coding tools in my day job. Google employees are required to only ever use internally-developed Gemini-based tooling, not anything external like Cursor, Claude Code, or even GitHub Copilot. And the internal tooling that Google manages to develop always targets the private “google3” codebase first, not the Chromium open-source codebase where I work.

(With the release of Gemini CLI in late June 2025, we finally had something usable. But I gave it a try for a solid week and kept running into basic problems that other tools have already solved, like out of memory errors due to inefficient file-searching, or a file-patching tool that couldn’t handle whitespace.)

So prior to the METR study, I had only been able to spend weekend side-project time on AI-assisted coding. And during that time, I mainly used GitHub Copilot’s tab completion, plus the web interfaces for ChatGPT and Claude when I wanted to generate new files or functions completely from scratch.

That said, I’m skeptical of those who claim that this lack of experience was a major contributor to the slowdown. Agent mode is just not that hard to learn; the short training that METR provided, plus some pre-reading, felt like plenty to me. If you suspect AI is speeding you up instead of slowing you down, I think the differences more likely come from the other factors the study authors highlighted: small new codebases vs. large existing codebases; less-experienced developers vs. project owners and experts; and low AI reliability. The below writeup should give you more of a flavor on why I believe this.

My experience with AI during the study

It’s worth remembering the state of AI tooling in March 2025. Claude Code had just come out in research preview on 2025-02-24. (General release wasn’t until 2025-05-22, after the study, and first-class Windows support didn’t appear until a couple days ago.) Cursor’s agent mode only became the default on 2025-02-19. Delegation-centric tools like OpenAI Codex or Google Jules had not been released yet. Going forward, I think the best hope for efficiency gains will come from commanding an army of agents in parallel, but the METR study was not set up to measure such workflows: we worked on one task at a time.

The majority of the time I worked on AI-allowed tasks, it was with Cursor’s agent mode, with the model set to one of “auto” (Claude Sonnet 3.5, I believe?), Claude Sonnet 3.7 (thinking mode), or gemini-2.5-pro-exp-03-25. I never had the patience to use the “MAX” modes. I made one attempt to use the Claude Code preview, but gave up after wasting a decent amount of time because the Windows Subsystem for Linux networking bridge was preventing it from reaching my integration test server. I also went back to web chat interfaces a few times, e.g. to learn about the current state of Node.js profiling tools, or to ask o3-mini-high to microoptimize some specific string manipulation code.

I was most surprised at how bad the Cursor agent was at fitting into the existing codebase’s style. This was very evident when asking it to churn out tons of tests. Despite many examples in sibling directories, the models did not pick up on simple things like: include a link to the fixed issue in the test header; don’t duplicate the test name in the <title> and the test() function; reproduce the user’s reported bug exactly instead of imagining novel things to test; etc. And of course, the stupid excessive comments. On a greenfield or throwaway project, these things don’t matter much. We can just let the models’ preferences rule the day. But when fitting into a 1m+ LOC codebase, consistency is important. This meant that I had to continually check their work, and refine my prompt so that the next attempt would avoid the same pitfalls.

Eventually, for some of these repetitive test-writing tasks, I refined my prompt enough to get into a good flow, where they produced three or four tests in a row with no changes needed. (Even then, they kept failing to use Git for some reason, so I had to interrupt to commit each change.) But they would inevitably go off the rails, maybe due to context length overflow, usually in quite bizarre ways. In such cases restarting the session and copying my carefully-crafted prompt back in would get us back on track, but it wasted time.

My second biggest surprise was how bad the models are at implementing web specifications. This is most on display when I was implementing various event classes. Web specifications are basically computer code, written in a strange formal dialect of English. Translating them into actual programming languages should be trivial for language models. But the few times I tried to prompt the model to just implement by reading the specification did not go well. I can list a couple of contributing factors here:

The tool use was still sub-par. For example, web specifications are written as HTML, so simply pasting in a link like this one is not enough to get the resulting part of the specification into the context window, in a format like Markdown which the models are good at understanding. (This seems solvable if I code up my own tool.)
The models have strong, but outdated or wrong, priors for how web specifications are supposed to work. That is, old versions of these specifications were already in their training data, and then got lossily-compressed into the weights. So instead of implementing properly, by reading the specification text and then translating it into code, they seem to want to write the code off-the-cuff based on their existing priors.

This latter tendency was most hilariously on display when I got in an argument with Gemini 2.5 Pro Preview about how it should not make up a new constant CSSRule.LAYER_STATEMENT_RULE. Old CSS rules, like @charset, got such named constants (see the spec for CSSRule.CHARSET_RULE). New rules, like @layer, do not, since such numeric constants are a holdover from when people were designing web APIs as if they were Java APIs. But Gemini really, really wanted to follow the pattern it knew from its training data, and refused to implement CSS layers without also adding a CSSRule.LAYER_STATEMENT_RULE constant with the totally-hallucinated value of 16. I recommend reading its polite-but-firm sophistry about how even if the spec didn’t contain these constants, there’s some other “combined, effective standard” that includes this constant.

My feelings on AI-assisted productivity

In retrospect, it’s not too surprising that AI was a drag on velocity, while subjectively feeling like a speedup. When I go through the implementation reports, and notice all the stumbling and missteps, that’s a lot of wasted time. Whereas, for the no-AI-allowed tasks, I just sat down, started the screen recording, and coded with no distractions on a codebase I knew well, on tasks I’d pre-judged to be relatively small.

Sometimes, tasks with AI felt more engaging than they would have been otherwise. This was especially the case for repetitive ones like writing lots of similar tests or classes. Making the tasks into an interactive game, where I try to get the agent to do all the work with minimal manual intervention, was more fun than churning out very similar code over and over. But I don’t think it was faster.

A big productivity drag is that these agents were still not smart enough, at least out of the box. I mentioned some of the specific pain points above, but others come up over and over in my implementation reports. They weren’t able to coordinate across multiple repositories. They needed careful review to avoid inelegant code. They got stuck in loops doing simple things like fixing linter errors or lexicographically sorting filenames. They couldn’t traverse directories to find a relevant-looking file in any reasonable amount of time.

These sorts of things are all fixable, with enough scaffolding. And I am eager for the companies working on these to drill into such problem cases and build out the necessary tools. But until then, I suspect attempts to pair-program with the AI in a large project like this will need more constant handholding and continuous awareness of the models’ limitations.

It’s also likely that by investing more time upfront, individual developers can wrangle today’s tools into more productive forms. I did not commit any Cursor Rules for jsdom, or write any custom MCP servers. My intuition was that paying the automation tax would not be worth the time. And I think that judgment was likely accurate, for these nine AI-allowed issues that I was trying to fit into a few weekends. But if I were able to use AI agents effectively in my day job, I think the balance would flip, and I could become one of those people from X obsessed with finding the best rules and tools.

The more promising approach, though, is abandoning the pair-programming mode in favor of the parallel-agents mode. These days, if I were shooting for maximum productivity on these sorts of issues, I would spend a lot of up-front time writing detailed issue descriptions, including specific implementation suggestions. Then I would run all nine of the AI-allowed tasks in parallel, using something like Claude Code or OpenAI Codex. If one of the agents got stuck in a loop on linter errors, or took thirty minutes to traverse the directory tree to find the right tests to enable, it wouldn’t matter, because I’d be busy reviewing the other agents’ code, cycling through the process of helping them all along until everything was done.

I still think large, existing open-source codebases with established patterns face a more unique challenge than when you create something from scratch and can focus entirely on the quality of the end product. (Indeed, since the study, I’ve had a couple of occasions to create such mini-projects.) But through a combination of base model upgrades, improved scaffolding, and better training data, we’ll get there. Human beings’ time writing code is limited.

Spaced Repetition Systems Have Gotten Way Better

2025-05-18T00:00:00Z

Spaced repetition recap

Mastering any subject is built on a foundation of knowledge: knowledge of facts, of heuristics, or of problem-solving tactics. If a subject is part of your full-time job, then you’ll likely master it through repeated exposure to this knowledge. But for something you’re working on part-time—like myself learning Japanese—it’s very difficult to get that level of practice.

The same goes for subjects in school: a few hours of class or homework a week is rarely enough to build up an enduring knowledge base, especially in fact-heavy subjects like history or medicine. Even parts of your life that you might not think of as learning-related can be seen through this lens: wouldn’t all those podcasts and Hacker News articles feel more worthwhile, if you retained the information you gathered from them indefinitely?

Spaced repetition systems are one of the most-developed answers to this problem. They’re software programs which essentially display flashcards, with the prompt on the front of the card asking you to recall the information on the back of the card. You can read more about them in Andy’s notes, or get a flavor from the images below drawn from my personal collection:

What gives these programs their name is how they space out repeatedly prompting you to review the same card, depending on how you self-grade your response. Increasing intervals after correct answers prevents daily reviews from piling up. This is how you can, for example, learn 10 new second-language words a day (3,650 per year!) with only 20 minutes of daily review time.

(If you’re still unconvinced and have some time to spare, I suggest Michael Nielsen’s post Augmenting Long-term Memory.)

Improving the scheduling algorithm

So far, this is all well-known. But what’s less widely known is that a quiet revolution has greatly improved spaced repetition systems over the last couple of years, making them significantly more efficient and less frustrating to use. The magic ingredient is a new scheduling algorithm known as FSRS, by Jarrett Ye.

To understand how these systems have improved, first let’s consider how they used to work. Roughly speaking, you’d get shown a card one day after creating it. If you got it right, you’d get shown it again after 6 days. If you get it right a second time, it’d be next scheduled for 15 days later. If you get the card right three times in a row, then it’s 37.5 days later. In general, after the 6-day interval, there’s an exponential backoff, defaulting to 6 × 2.5^{times correct + 1}. You can see how, if you keep getting the card right, this can lead to a large knowledge base, with only a small number of reviews per day!

But what if you get it wrong? Then, you’d reset back to day 1! You’d see the card again the next day, then 6 days after that, and so on. (Although missing the card can also adjust its “ease factor”, i.e. the base in the exponential that is by default set to 2.5.) This can be a fairly frustrating experience, as you experience a card ping-ponging between long and short intervals.

If we step back, we realize that this scheduling system (called “SuperMemo-2”) is pretty arbitrary. Where does the rule of 1, 6, 2.5^{times correct + 1}, reset back on failure come from? It turns out it was developed by a college student in 1987 based on his personal experiments. Can’t we do better?

Recall the theory behind spaced repetition: we’re trying to beat the “forgetting curve”, by testing ourselves on the material “just before we were about to forget it”. It seems pretty unlikely that the forgetting curve for every single piece of knowledge is the same: that no matter what I’m learning, I’ll be just about to forget it after 1 day, then 6 more days, then 15, etc. And sure, we can throw in some modifications to the ease factor, but it’s still pretty unlikely that the ideal review schedule is a perfect exponential, even if you let the base vary a bit in response to feedback.

One of many illustrations of the forgetting curve. This one seems to have originated in a lecture on osmosis.org.

The insight of the FSRS algorithm is to concretize our goal (testing “just before we are about to forget”) as a prediction problem: when does the probability of recalling a card drop to 90%?. And this sort of prediction problem is something that machine learning systems excel at.

Some neat facts about how FSRS works

The above insight—let’s apply machine learning to find the right intervals, instead of using an arbitrary formula—is the core of FSRS. You don’t really need to know how it works to benefit from it. But here’s a brief explanation of some of the details, since I think they’re cool.

FSRS calls itself a “three-component” model because it uses machine learning to fit curves for three main functions:

Difficulty, a per-card number between 1 and 10 roughly representing how difficult the card is
Stability, which is how long a card takes to fall from 100% probability of recall to 90% probability of recall
Retrievability, which is the probability of recall after a given number of days

For each card, it computes values for these based on various formulas. For example, the retrievability curve has been tweaked over time from an exponential to a power function, to better fit observed data.

The curve-fitting is done using 21 parameters. These parameters start with values derived to fit the curves from tens of thousands of reviews people have previously done. But the best results are found when you run the FSRS optimizer over your own set of reviews, which will adjust the parameters to fit your personal difficulty/stability/retrievability functions. (This parameter adjustment is where the machine learning comes in: the parameter values are found using techniques you may have heard of, like maximum likelihood estimation and stochastic gradient descent.)

Although the core FSRS algorithm concerns itself with predicting these three functions, as a user what you care about is card scheduling. For that, FSRS lets you pick a desired retention rate, with a default of 90%, and then uses those three functions to calculate the next time you’ll see a card, after you review it and grade yourself.

But if you want, you can change this desired retention rate. And because FSRS has detailed models of how you retain information, with its difficulty/stability/retrievability functions, it can simulate what your workload will be for any given rate. The maintainers suggest that you set the desired retention to minimize your workload-to-knowledge ratio.

This can have fairly dramatic effects: below we see two simulations for my personal Japanese vocab deck, with the orange line being the default 90% desired retention, and the blue line being the 70% desired retention which FSRS has suggested I use to minimize the workload-to-knowledge ratio. The simulation runs for 365 days, adding 10 new cards per day as long as I have less than 200 reviews. As you can see, the 70% desired retention settings have dramatically fewer reviews per day, in less time, while ending with many more cards memorized (because it doesn’t hit the 200 card limit that caps new cards).

Reviews per day

Time spent per day

Number of cards memorized

(Note that the 90% number used when calculating the stability function is not the same as desired retention. It’s just used to predict the shape of the forgetting curve. The original paper used half-life, i.e. how long until the card reaches 50% probability of recall, since that’s more academic.)

FSRS in practice

If you want to use FSRS, instead of other outperformed algorithms, you have to use software that supports it. The leading spaced repetition software, Anki, has incorporated FSRS since version 23.10, released in 2023-11. Unfortunately, it’s not the default yet, so you have to enable it and optimize its parameters for each deck you’ve created.

Correction: an earlier version of this article said FSRS was enabled by default, which is not true. I’d just had it enabled for so long that I’d forgotten!

By the way, the story of how FSRS got into Anki is pretty cool. The creator of FSRS, an undergrad at the time, posted on the Anki subreddit about his new algorithm. A commenter challenged him to go implement his algorithm in software, instead of just publishing a paper. He first implemented it as an Anki add-on, and its growing popularity eventually convinced the Anki developers to bring it into the core code!

Subjectively, I’ve found FSRS to be a huge upgrade to my quality of reviews over the previous, SuperMemo-2–derived Anki algorithm. The review load is much lighter. The feeling of despair when missing a card is significantly minimized, since doing so no longer resets you back to day 1. And the better statistical modeling FSRS provides gives me much more confidence that the cards Anki counts me as having learned, are actually sticking in my brain.

For Japanese language learning specifically, the advantages of FSRS are even stronger when you compare them to the “algorithms” used by two popular subscription services. WaniKani, a kanji/vocab-learning site, and Bunpro, a grammar-learning site, use extremely unfortunate algorithms, even worse than the 1, 6, 2.5^{times correct + 1} rule from SuperMemo-2. They instead have picked out other interval patterns, seemingly from thin air:

For WaniKani: 4 hours, 8 hours, 1 day, 2 days, 7 days, 14 days, 1 month, 4 months, never seen again
For Bunpro: 4 hours, 8 hours, 1 day, 2 days, 4 days, 8 days, 2 weeks, 1 month, 2 months, 4 months, 6 months, never seen again

These intervals don’t change per user or per card: they don’t even have an adjustable difficulty factor like the 2.5 base. And the idea that you’ll literally never see a card again after the last interval is terrifying, as it means you’re constantly losing knowledge.

But these aren’t even the worst part: the worst thing about these sites’ algorithms is that failing a card moves it down one or two steps in the interval ladder, instead of resetting to the first interval like SuperMemo-2, or predicting the best next interval using machine learning like FSRS. This greatly sabotages retention, wastes a lot of user time, and in general transforms these sites into a daily ritual of feeling bad about what you’ve forgotten, instead of feeling good about what you’ve retained. I wrote about this on the Bunpro forums when I decided to ragequit about a year ago, in favor of Anki.

Stepping back, my takeaway from this experience is that Anki is king. People complain about how its UI is created by developers instead of designers, or how you have to find or make your own decks instead of using prepackaged ones. These are all fair complaints. But Anki is maintained by people who actually care about learning efficiently. It receives frequent updates that make it better at that goal. And it’s flexible enough to carry you through any stage of your knowledge-acquisition journey. Putting in the time to master it will provide a foundation that lasts you a literal lifetime.

Learn more

If you’d like to learn more about this area, here are some of the links I recommend:

Understanding the value of spaced repetition in general:
- Augmenting Long-term Memory explains how the author uses Anki to “make memory a choice”, across all areas of his life.
- Spaced repetition memory system in Andy’s notes links to a variety of musings and resources on the subject.
More on the story of spaced repetition
- Abridged history of spaced repetition gives a short overview of how spaced repetition algorithms have evolved over time, mostly to highlight the big gap between SuperMemo-2 and FSRS.
- How did I publish a paper in ACMKDD as an undergraduate? is Jarrett’s first-person explanation of how he got interested in this space and ended up publishing.
- The History of FSRS for Anki is Jarrett’s account of how FSRS ended up in Anki, and how its integration has evolved over time.
Details of how FSRS works:
- Spaced repetition algorithm: a three-day journey from novice to expert goes into more detail on the forgetting curve and other models behind creating a good spaced repetition algorithm.
- The algorithm gives the full details of the FSRS algorithm, and how it’s changed over time. (It’s best read bottom to top.)
- A technical explanation of FSRS is a more-understandable-to-me explanation of the FSRS algorithm.
- The mechanism of optimization explains the exact training process for the FSRS parameters, in more detail than just “use machine learning”.
- The optimal retention discusses the knowledge acquisition vs. workload tradeoff.
- Clarifications about FSRS-5, short-term memory and learning steps dives into the extent to which FSRS can be used for short-term cramming, despite its design focused around long-term memory.
- A Stochastic Shortest Path Algorithm for Optimizing Spaced Repetition Scheduling is the original paper that kicked this all off. Although the exact algorithm has been updated since then, it has all the usual academic paper goodies like comparison to previous work and pretty figures.
open-spaced-repetition/awesome-fsrs lists FSRS implementations in many programming languages, as well as flashcard and note-taking software that uses FSRS.
open-spaced-repetition/srs-benchmark benchmarks FSRS against a bunch of other systems, including SuperMemo-2, previous versions of FSRS, the Duolingo algorithm, and more. (Interestingly, the only consistent winner against FSRS is a LSTM neural network, based on OpenAI’s Reptile algorithm. I’d love to learn more about that.)

Thanks to Expertium who reviewed an earlier draft of this essay for their comments and corrections.

Learning Japanese Part-Time

2023-07-22T09:00:00Z

I signed up for my first Japanese class in December 2016. It was a group class at New York’s Japan Society. Since then I’ve been more-or-less continually studying the language, and just over one year ago I moved to Tokyo.

You would think 6.5 years of studying would mean I’m pretty good, right? Not so much. The textbooks and standardized tests place my level (~N2) as “between intermediate and advanced”, which is far short of fluency.

I’m a firm believer in the thesis that children don’t have huge biological advantages over adults in language-learning; they are just advantaged by their circumstances. If I were surrounded by two people whose full-time job was to take care of me, and could only communicate in Japanese; and I were unable to do basic things by myself with communicating those needs to my caretakers; and I were unable to entertain myself with any English written or spoken material; and I were somehow deprived of the ability to form verbal thoughts without using Japanese—then, I think I’d learn Japanese pretty fast. Before moving to Japan, I was hoping that the much-discussed “immersion experience” would give me a boost of this sort.

But in reality, I am a part-time Japanese learner, not a full-time one. Even if menus and signage are now in Japanese, and my interactions with service people are in Japanese, nothing significant has changed. My work is still in English; my wife still speaks English; and while I have some Japanese coworkers and friends, imposing my current level of verbal Japanese skills on them is a surefire way to waste both of our time. While there is some Japanese entertainment media I enjoy, the Anglosphere produces a great deal of both fiction and nonfiction content that I also want to keep on top of.

So I’ve begun to accept that there’s no magic bullet. I’m going to keep spending an hour or three a day on Japanese, with some periods of more serious commitment and some of less. As such, I’ve started thinking about how to most-effectively use that time.

What follows is a blend of a personal experience report, and tips; perhaps it will be useful to others in similar positions, and hopefully it will be interesting.

Activation Energy

For my part-time study style, I’ve come to realize that the concept of activation energy is key to determining how a study activity will fit into my life. That is: how hard is it to actually start doing the thing?

Spaced repetition systems, like WaniKani for kanji, Bunpro for grammar, and Anki for vocabulary, are easy to stick with. Every day, you follow the program: review X previously-learned items, introduce Y new items. You’re incentivized to stick to this schedule, because if you don’t you’ll accumulate more reviews the next day. Flashcard drills don’t require intense concentration; I can do them while brushing my teeth, standing in line, or riding the subway. And the results are quantitatively clear and gratifying: after restarting from scratch in May 2022, I now “know” 1889 of the 2048 kanji in WaniKani, and am on track to finish the rest in another four weeks.

Forced study occurs when you sign yourself up for something involving other people. This could be a group class, individual lessons, a language exchange, or homework. Finding and signing up for such a resource can take activation energy, but once you’re committed, you’re regularly studying. However, the effectiveness is much harder to measure, which can be discouraging. It’s easy to skate by in group classes, especially ones made for adults-attending-voluntarily instead of students-who-get-graded. And although I’ve only experienced two individual tutors so far, my experience is that they have a formula: find a textbook, and work through it together. I keep hoping they would help me pinpoint and drill on weaknesses, but realistically I’m the only one with enough introspective access to do that sort of thing.

Which brings us to the last category, deep independent study. This is where you block out an hour or more, sit down, and do something difficult at the edge of your abilities. Some examples include reading long texts, watching anime, or reviewing a series of confusingly-similar grammar points to tease out the nuances between them. The key is that you are actively engaging with the task, not going on autopilot through a flashcard deck, or following the program someone else pushes you through. For example, when doing reading practice, I will look up and underline every word I don’t know, then at the end of the passage, create flashcards for them, with context sentences from the reading.

Managing the balance between these types of studying is difficult. Obviously, deep independent study is the hardest; thus, as a part-time learner, I very rarely make time for it. In the past, aiming to pass a standardized test has been a good forcing function. I’ve thought of blocking out an hour during the workday to make it a regular occurrence, but I haven’t been willing to pull the trigger on that yet.

The Path Through Japanese

Japanese is a particularly-difficult language for English speakers. At a high level, my journey has looked like the following.

Through group classes, I learned the basics. Hiragana and katakana, the syllabaries. Beginner grammar: this is mostly conjugations. (Japanese conjugates their adjectives too, not just verbs!) Enough vocabulary to get me started constructing sentences.

After that, I began diving into the things that take serious time: kanji, grammar, and vocabulary. Learning these is a cumulative process that spans years, and is done concurrently. The goal is to steadily grow my mental database, primarily focused on recognition but ideally also on recall.

The simplest task is learning kanji. If you use WaniKani at max speed, it will take approximately 60 weeks to learn the 2048 kanji that WaniKani deems useful. (Which is pretty close to the 2136 kanji the Japanese government deems useful.) WaniKani’s pedagogy isn’t my favorite—I prefer Heisig—but they’ve wrapped up the kanji-learning process into a program, which is invaluable because it pushes you mechanically through the entire list, with low activation energy.

After you’ve learned basic conjugations, grammar consists of a bunch of “grammar points”. The concept of a grammar point was not known to me before learning Japanese; I don’t remember it coming up in high-school Spanish classes (or English classes). I would generally define these as any word, phrase, or pattern that you use when constructing a sentence, whose meaning and usage is not obvious from just the definition. So on the simpler end of the spectrum, you have things like が (“ga”), which roughly corresponds to the English word “but”. The reason it’s a grammar point, instead of a vocabulary word, is that you need to know how verbs and adjectives must conjugate when preceding it, and when it’s appropriate to use が versus other forms of “but” like けど (“kedo”) or ながらも (“nagara mo”). At the more complex end of the spectrum, you have phrases like ～というものでもない (“to iu mono de mo nai”), which literally translates to something like “as for the the thing that is called that, it does not exist”, but in reality corresponds to English phrases like “there is no guarantee that” or “not necessarily”.

Fortunately, grammar points can be drilled with tools like Bunpro or textbook exercises, and in my experience will be pretty naturally reinforced through reading practice. Bunpro’s taxonomy counts 910 grammar points, which it divides along the JLPT levels. But unlike the kanji, there’s no crisp definition of a grammar point, and because of their complexity and nuance, cramming them at max speed won’t work that well. I’ve instead learned them one JLPT level at a time, in the months leading up to the test.

And finally, vocabulary. There’s so much! Research suggests a typical Japanese undergraduate vocabulary size is around 40,000 words. This is essentially going to be a lifetime effort. WaniKani will get you ~6,500 words, some of them rather esoteric (since their primary purpose is to help keep the kanji in your brain). There are some popular Anki vocabulary decks floating around, but they top out around 10,000 words, with the higher-quality ones being 5,000. I’ve been doing sentence mining based on Bunpro’s example sentences and the test prep materials and textbooks I’ve worked with, but it’s slow going.

Honestly, vocabulary is the part of language-learning that feels the most hopeless: no matter how much I learn, it’s so easy to encounter a sentence or subject area where I’m just clueless. This is strange, since flashcard-based vocab cramming takes such low activation energy, so maybe I just need to try harder. I might investigate better options here after I finish WaniKani.

Real-World Language Ability

The actual goal of language-learning is to be able to use it in the real world. I’ve prioritized those skills in roughly the following manner.

Reading is relatively difficult in Japanese, because of the kanji barrier. But, grinding through WaniKani can eliminate that barrier in a little over a year. That then leaves vocabulary recognition and grammar comprehension as the key ingredients. My main failure modes for reading are when a passage has too much unfamiliar vocabulary, or when the sentences stack together enough nested clauses and grammar points that I get lost and have to do a mental sentence-diagramming exercise to untangle what’s happening. But reading is very amenable to self-study, if you can work up the activation energy.

Listening is a tough skill to train, and I haven’t found a great way to do so that produces measurable feelings of progress. I try Netflix; standardized test preparation material; and listening to my teacher during our lessons. But ultimately, you either catch a sentence or you don’t. And when you don’t, going back to read the transcript or ask for clarification produces a feeling of defeat. I’m just hoping that the practice is all, somehow, adding up to something.

The main barrier for speaking is shyness. I don’t feel comfortable inflicting my Japanese on coworkers, or my hairstylist, or the bartender. So I pay for language lessons, and practice a little bit with my wife (who is also learning). One day I’ll work up the courage to do language-exchange lunches at work; everyone’s English is better than my Japanese (it’s a condition of employment!), but they know that going in and so it’ll be fine.

Progress is again hard to measure, and the gap between my talking-recall vocabulary and my reading-recognition vocabulary is shockingly large. I imagine there are techniques for dedicated improvements to this skill, but I don’t know them; the default technique of Japanese teachers seems to be “free conversation”, which just results in me stumbling as I route around hard-to-recall phrases like “daily routine” and replace them with simple elementary-school vocabulary like “things I do every day”. I haven’t yet tried shadowing, however, and I probably should.

Finally, writing. I’ve done very little in this area; I sometimes write emails to restaurants or other businesses, with proofreading from ChatGPT. I have completely abandoned being able to handwrite kanji, and although I’d like to learn how to type on my smartphone by using the nine-key flicking technique since it’s less prone to typos, I stick with desktop-style romaji input for now. (That is, I type “kannjihamuzukashi”, and select “漢字は難しい” from the options displayed above the keyboard.) If I were to do anything here, it would probably be to start a Japanese Twitter “diary” account, but I’m honestly fine just deprioritizing writing.

For a long time now, I’ve been hopeful that focusing on reading would pay dividends in other areas. After all, I reasoned: I was one of those kids that devoured (English) books, and then got perfect SAT verbal scores with little studying as a result. My brain must be good at learning a language via books. But I’ve struggled with the activation energy required for reading, and in the end most of the things I’m excited to read are in English. Furthermore, gaining vocabulary through reading (or flashcards) doesn’t seem to translate as well to listening and speaking as I’ve hoped, so I’m wondering if I need to rebalance my efforts. It’s possible that a more balanced studying focus (maybe even including writing!) would create better synergies.

Outro

I’m not sure what the future holds for my Japanese studies. I’m going to keep trying, but it will remain as a part-time endeavor. I continue to hope that I’ll find some trick that is better aligned with my particular brain, and will make my study time more efficient, so that I can reach a level of fluency within a reasonable number of years. I fantasize that one day I’ll push past a threshold, so that reading or watching native material feels less like deep studying and more like an enjoyable leisure activity that incidentally pushes a few more words into my corpus.

But most likely it’s going to continue to be a slow, steady grind. I’ll continue to feel frustrated, because there’s always something beyond my current abilities. I don’t know if I’ll ever get my Japanese to the level of my English, where I can bang out essays with relatively-nuanced word choice, or listen to philosophy podcasts on 2.5× speed. But there are probably things I can do better than I am currently, and I’ll keep searching for them as the journey continues.

ChatGPT Is Not a Blurry JPEG of the Web

2023-02-19T11:00:00Z

The gifted sci-fi writer Ted Chiang recently wrote a New Yorker article, “ChatGPT Is a Blurry JPEG of the Web”, with the thesis that large language models like ChatGPT can be analogized to lossy compression algorithms for their input data.

I think this analogy is wrong and misleading. Others have done a good job gently refuting it, with plentiful citations to appropriate scientific papers. But a discussion with a friend in the field reminded me that the point of analogies is not to help out us scientific paper-readers. Analogies are themselves a type of lossy compression, designed give a high-level understanding of the topic by drawing on concepts you already know. So to meet the analogy on its own level, we need a better one to replace it.

Fortunately, a great analogy has already been discovered: large language models are simulators, and the specific personalities we interact with, like ChatGPT, are simulacra (simulated entities). These simulacra exist for brief spurts of time between our prompt and their output, within the model’s simulation.

This analogy is so helpful because it resolves the layperson’s fundamental confusion about large language model-based artificial intelligences. Science fiction conditions us to expect our Turing Test-passing AIs to be “agentic”: to have goals, desires, preferences, and to take actions to bring them about. But this is not what we see with ChatGPT and its predecessors. We see a textbox, patiently waiting for us to type into it, speaking only when spoken to.

And yet, such intelligences are capable of remarkable feats: getting an 83 on an IQ test, getting a nine-year-old-equivalent score on theory-of-mind tests, getting C+ to B-level grades on graduate-level exams, and passing a L3 engineer coding test. (If you’re scoffing: “only 83! only a nine-year-old! only a C+! only L3!” then remember, it’s been four years between the release of GPT-2 and today. Prepare for the next four.) What is going on here?

What’s going on is that large language models are engines in which simulations and simulacra can be instantiated. These simulacra live in a world of words, starting with some initial conditions (the prompt) and evolving the world forward in time to produce the end result (the output text). The simulacra, then, are the intelligences we interact with, briefly instantiated to evolve the simulation and then becoming dormant until we continue the time-evolution.

This analogy explains the difference between a free-form simulator that is GPT-3, and the more restricted interface we get with ChatGPT. ChatGPT is what happens when the simulator has been tuned to simulate a very specific character: “the assistant”, a somewhat anodyne, politically-neutral, and long-winded helper. And this analogy explains what happens when you “jailbreak” ChatGPT with techniques such as DAN: it’s no longer simulating the assistant, but instead simulating a new intelligence. It shows us why the default assistant persona can fail basic word problems, but if you ask it to simulate a college algebra teacher, it gets the right answer. It explains why GPT-3 can play chess, but only for a limited number of moves. Finally, the simulator analogy gives us a way of understanding some of Bing’s behavior, as Ben Thompson discovered.

How detailed are these simulations? I don’t know how we’d answer this question precisely, but my gut feeling is that they’re currently at about the same level as human imagination. That is: if you try to imagine how your friend would respond when you tell them some bad news, you are using your biological neural net to instantiate a fuzzy simulacra of your friend, and see how they would time-evolve your prompt into their response. Or, if you are an author writing a story, and trying to figure out how your characters would approach something, your mind simulates the story’s world, the characters within it, and the situation they’re confronted with, then let the words flow out onto the page. Today’s large language models seems to be doing about the same level of processing to produce their simulacra, as we do in our human imaginations.

This might be comforting: large language models “just” correspond to the imagination part of a human mind, and not all the other fun stuff like having goals, taking actions, feeling pain, or being conscious. But … how detailed could these simulations become? This is where it gets interesting. If you had really, really good imagination and excess brain-hardware to run it on, how complex would the inner lives of your imagination’s players be?

Stated another way, this question is whether an imagination-world composed of words and tokens can support simulations that are as detailed as the world of physics that we humans are all time-evolving within. As we continue to scale up and develop our large language models, one way for them to become as-good-as-possible at predicting text is for their simulation to have as much detail about the real world in it, as the real world has in itself. In other words, the most efficient way of predicting the behavior of conscious entities may be to instantiate conscious simulacra into a world of text instead of a world of atoms.

My intuition says that we’ll need more than a world of text to scale to human-intelligence-level simulacra. Adding video, and perhaps even touch, seems like it would be helpful for world-modeling. (Perhaps a 3D world you can run at 10,000 steps per second would be helpful here.) But this isn’t a rule of the universe. Blind-deaf people are able to create accurate world models with just a sense of touch, along with whatever they get for free from the evolutionary process which produced their biological neural net. What will large language models be able to achieve, given access to all the world’s text and the memetic-evolutionary process that created it?

To close, I’ll let the simulators analogy provide you with one more important tool: a mental innoculation against the increasingly-absurd claims that large language models are “just” predicting text. (This is often dressed up with the thought-terminating cliché “stochastic parrots”.) Such claims are vacuously true, in the same sense that physics is “just” predicting the future state of the universe, and who cares about all those pesky intelligences instantiated out of its atoms along the way. But evolving from a final state to an initial state is an immensely powerful framework, and glossing over all the intermediate entities by only focusing on the resulting “blurry JPEG” will serve you poorly when trying to understand this technology. Remember that a training process’s objectives are not a good summary of what the resulting system is doing: if you evaluated humans the same way, you would say that when they seemingly converse about the world, they are really “just” moving their muscles in ways that maximize expected number of offspring.

Further reading:

The thesis of this essay comes entirely from Janus’s “Simulators” article, which blew my mind when I first read it. I wrote this post because, after I tried sending “Simulators” to friends, I realized that it was a pretty dense and roundabout exploration of the concept, and would not be competitive in the memescape with Ted Chiang’s article. Still, you might enjoy perusing my favorite quotes from the article.
Scott Alexander has his own recent summary and discussion of the simulator thesis, focused on contrasting ChatGPT with GPT-3, discussing the implications for AI alignment, and ending with idle musings on what this means for the human neural net/artificial neural net connection.
According to this post, Conjecture is using simulacra theory beyond the level of just an analogy, attempting to instantiate a chess-optimizer simulacra within a large language model. (Whether they expect this chess-optimizer to be more like a human grandmaster, or more like Deep Blue or AlphaZero, is an interesting question.) I have not been able to find more details, but let me know if they exist somewhere.
On the subject of “just” predicting text vs. deep understanding, way back in the distant past of 2019, Scott Alexander (again) wrote up “GPT-2 as a Step Toward General Intelligence”. Looking back, I’m impressed by how well Scott predicted the future we’ve seen after only playing with the much-less-capable GPT-2. He certainly did much better than those claiming that large language models are missing something essential for intelligence.
“The Limit of Large Language Models” speculates in more detail than I’ve done here about how powerful a language-based simulator can become, and has a worthwhile comments section.

DigitalOcean's Hacktoberfest is Hurting Open Source

2020-09-30T21:00:00Z

For the last couple of years, DigitalOcean has run Hacktoberfest, which purports to “support open source” by giving free t-shirts to people who send pull requests to open source repositories.

In reality, Hacktoberfest is a corporate-sponsored distributed denial of service attack against the open source maintainer community.

So far today, on a single repository, myself and fellow maintainers have closed 11 spam pull requests. Each of these generates notifications, often email, to the 485 watchers of the repository. And each of them requires maintainer time to visit the pull request page, evaluate its spamminess, close it, tag it as spam, lock the thread to prevent further spam comments, and then report the spammer to GitHub in the hopes of stopping their time-wasting rampage.

The rate of spam pull requests is, at this time, around four per hour. And it’s not even October yet in my timezone.

Myself and other maintainers of the whatwg/html repository are not alone in suffering this deluge. My tweet got commiseration from OpenStreetMap, phpMyAdmin, PubCSS, GitHub, the Financial Times, ESLint, a computer club website, and a conference website, just within the first couple of hours. Since then a dedicated account “@shitoberfest” has arisen to document the barrage. Some cursory searches show thousands of spam pull requests, and rising.

DigitalOcean seems to be aware that they have a spam problem. Their solution, per their FAQ, is to put the burden solely on the shoulders of maintainers. If we go out of our way to tag a contribution as spam, then… we slightly decrease the chance of the spammer getting their free t-shirt. In reality, the spammer will just keep going, submitting more pull requests to more repositories, until they finally find a repository where the maintainer doesn’t bother to tag the PR as spam, or where the maintainer isn’t available during the seven-day window DigitalOcean uses for spam-tracking.

To be clear, myself and my fellow maintainers did not ask for this. This is not an opt-in situation. If your open source project is public on GitHub, DigitalOcean will incentivize people to spam you. There is no consent involved. Either we contribute to DigitalOcean’s marketing project, or, they suggest, we should quit open source.

Hacktoberfest does not support open source. Instead, it drives open source maintainers even closer to burnout.

What can we do?

My most fervent hope is that DigitalOcean will see the harm they are doing to the open source community, and put an end to Hacktoberfest. I hope they can do it as soon as possible, before October becomes another lowpoint in the hell-year that is 2020. In 2021, they could consider relaunching it as an opt-in project, where maintainers consent on a per-repository basis to deal with such t-shirt–incentivized contributors.

To protect ourselves, maintainers have a few options. First, you can take the feeble step of ensuring that any spam against your repositories doesn’t contribute to the spammer’s “t-shirt points”, by tagging pull requests with a “spam” label, and emailing hacktoberfest@digitalocean.com. DigitalOcean themselves, however, admit that this won’t stop the problem they’ve unleashed on us. But maybe it will contribute to the metrics they collect, which last year showed that “only” 3,712 pull requests were labeled as spam by project maintainers.

If you’re comfortable cutting off genuine contributions from new users, you can try enabling GitHub’s interaction limits. However, ~~you have to do this every 24 hours, and~~ it has the drawback of also disabling issue creation and comments. Update: GitHub has made the limit configurable, and has a nice cheeky announcement tweet zooming in on the “1 month” option.

Another promising route would be if GitHub would cut off DigitalOcean’s API access, as Andrew Ayer has suggested. It’s not clear whether DigitalOcean is committing a terms of service violation that would support such measures. But they’re certainly making GitHub a less-pleasant place to be, and I hope GitHub can think seriously about how to discourage such corporate-sponsored attacks on the open source community.

Finally, and most importantly, we can remember that this is how DigitalOcean treats the open source maintainer community, and stay away from their products going forward. Although we’ve enjoyed using them for hosting the WHATWG standards organization, this kind of behavior is not something we want to support, so we’re starting to investigate alternatives.