wddbfs – Mount a sqlite database as a filesystem

2024-02-17T00:00:00-08:00

Often when I’m prototyping a project, I hesitate to use a sqlite database despite its many advantages. It seems much easier to just dump a bunch of files in a directory and to rely on the universal support for the filesystem API to read/delete/update records. Part of this is avoiding the overhead of figuring out a relational schema, but an equal amount of friction comes from the fact that .sqlite files are just slightly more difficult to inspect: the SQL syntax for selecting a few records is much more verbose than head -n or tail -n, there are special commands (which don’t work in some environments/versions) for listing tables, and neither my text editor nor my shell has autocompletion for database queries.

To try to get the best of both worlds, I have put together a little utility called wddbfs, which exposes a sqlite database as a (WebDAV¹) filesystem, accessible to anything which can work with a filesystem, including terminals, file managers, and text editors.

Here’s how it works. If you install it with:

pip install git+https://github.com/adamobeng/wddbfs

You can mount a database with:

wddbfs --anonymous --db-path=/path/to/an/example/database/like/Chinook_Sqlite.sqlite

Which will be available at localhost:8080 with no username or password required.²

Once you’ve mounted this WebDAV filesystem at, for example /Volumes/127.0.0.1/, you can see all the databases you specified with --db-path.³

$ ls /Volumes/127.0.0.1/
Chinook_Sqlite.sqlite
$ ls /Volumes/127.0.0.1/Chinook_Sqlite.sqlite
Album.csv           Customer.tsv        Invoice.jsonl       Playlist.json
Album.json          Employee.csv        Invoice.tsv         Playlist.jsonl
Album.jsonl         Employee.json       InvoiceLine.csv     Playlist.tsv
Album.tsv           Employee.jsonl      InvoiceLine.json    PlaylistTrack.csv
Artist.csv          Employee.tsv        InvoiceLine.jsonl   PlaylistTrack.json
Artist.json         Genre.csv           InvoiceLine.tsv     PlaylistTrack.jsonl
Artist.jsonl        Genre.json          MediaType.csv       PlaylistTrack.tsv
Artist.tsv          Genre.jsonl         MediaType.json      Track.csv
Customer.csv        Genre.tsv           MediaType.jsonl     Track.json
Customer.json       Invoice.csv         MediaType.tsv       Track.jsonl
Customer.jsonl      Invoice.json        Playlist.csv        Track.tsv

By default, all the tables can be read as CSV, TSV, JSON and line-delimited JSON (“.jsonl”).

These files can be manipulated with tools that work with a standard filesystem:

$ tail -n 3 Chinook_Sqlite.sqlite/Album.tsv
345     Monteverdi: L'Orfeo     273
346     Mozart: Chamber Music   274
347     Koyaanisqatsi (Soundtrack from the Motion Picture)      275
$ grep "Mahler" Chinook_Sqlite.sqlite/Artist.jsonl 
{"ArtistId": 240, "Name": "Gustav Mahler"}

For now, though, the whole table gets read into memory on every read, so this won’t work well for very large database files. There’s also no write support… yet.
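To make the mechanism concrete, here is a minimal sketch of how a whole table might be read into memory and serialised to the formats listed above, using only Python's standard library. It is an illustration, not wddbfs's actual code: the helper name `render_table`, the header row in the CSV/TSV output, and the in-memory example table are all assumptions.

```python
import csv
import io
import json
import sqlite3


def render_table(conn, table, fmt):
    """Read an entire table into memory and serialise it (hypothetical helper)."""
    # Table names can't be bound as SQL parameters, so only accept known tables.
    tables = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in tables:
        raise KeyError(table)
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchall()  # the whole table is held in memory

    if fmt in ("csv", "tsv"):
        buf = io.StringIO()
        writer = csv.writer(buf, delimiter="," if fmt == "csv" else "\t",
                            lineterminator="\n")
        writer.writerow(columns)  # header row: an assumption of this sketch
        writer.writerows(rows)
        return buf.getvalue()
    records = [dict(zip(columns, row)) for row in rows]
    if fmt == "json":
        return json.dumps(records)
    if fmt == "jsonl":
        return "".join(json.dumps(record) + "\n" for record in records)
    raise ValueError(f"unsupported format: {fmt}")


# Tiny stand-in for a real database file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("INSERT INTO Artist VALUES (240, 'Gustav Mahler')")
print(render_table(conn, "Artist", "jsonl"), end="")
```

Because every call re-runs `SELECT *` and builds the full string, the cost of reading even the first few bytes of a "file" grows with the size of the table, which is the limitation described above.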

Despite how clunky it is, this seems to be the best way to implement a filesystem given that getting FUSE support is not straightforward. ↩
This is obviously not suitable for access for hosts over a network. ↩
Databases specified with --db-path will be available at the root of the filesystem, but if you pass --allow-abspath any database file on the host filesystem will also be exposed inside the WebDAV mount at, for example, /mount/webdav/absolute/path/to/db/on/host/db.sqlite. ↩

A Javascript Autopilot for Crew Dragon

2020-05-21T00:00:00-07:00

Update (2020-05-31): It turns out that the Dragon 2 actually uses Chromium and Javascript for its flight interface. So there’s a reasonable chance this autopilot would run in the real Crew Dragon‽

That title is a truly horrifying combination of words.

SpaceX just released the ISS Docking Simulator, a browser game where the objective is to very slowly fly the Crew Dragon 2 to dock with the International Space Station, just as the real mission is scheduled to do next week. Given that a significant portion of my early teens was wasted on just the free demo of Star Wars: X-Wing vs. TIE Fighter, I had to play it. The whole experience is very ponderous, and definitely not as fun as XWvTF, which leads me to believe it is in fact a realistic simulation.

I only needed to dock successfully once to be satisfied with playing the simulator myself. The next logical step was to move on to the much more interesting challenge of automating it. The result is some hacky Javascript which you can paste into your browser console, and will automatically fly the simulated Crew Dragon to the ISS. Here’s a video of it in action:

And here’s the full code:

How do you fly a spaceship?

A quick Wikipedia browse helped me figure out that the problem to solve here is called ‘attitude control’, which led to some gnarly papers before I realised that the controller is basically a PID controller. Actually, I only had a vague idea what a PID controller was, but by staring at the algorithm and a co-incidental Hacker News post long enough, I realised that the algorithm can be simplified to:

If you’re far away, move closer
If you’re moving too fast, slow down

This chimes with how I was playing the demo too: you realise that you can change your roll but you have to be careful not to overshoot, so when you’re close to zero you have to move more slowly. The UI encourages this by showing not just your current position but the rate of change in your position (i.e. your speed or angular velocity).

The Crew Dragon simulation has six degrees of freedom we can control: roll, pitch, yaw and x, y, z. In a PID algorithm, these are called the process variables, and the difference between the current value of these variables and the desired value is the error. The amount of change you need to effect (here the amount you need to run the thrusters) is a function of the error, the derivative of the error, and some multiplicative constants that describe how those two factors should be combined. The docking is successful when all of those variables are (close to) zero. If they’re not zero, then the autopilot actuates the appropriate control to bring them closer to zero (e.g. roll to the left if currently rolled to the right). I’m treating the controls as decoupled: the thruster which moves you in the x-direction should have no effect on the y-error and so on. I’m not sure if this is true of the real capsule, and in theory it’s not true: it should be the case that moving the forward thruster would move the craft in the z-direction if the pitch wasn’t zero. In practice though, the autopilot achieves level flight very early, so the controls are effectively decoupled.
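As a rough illustration of that idea (in Python rather than the Javascript the autopilot is written in), here is a sketch of a proportional-derivative controller for a single decoupled axis. The gains, the time step, and the toy physics in `simulate` are assumptions chosen only to show the behaviour; they are not values taken from the autopilot.

```python
def pd_output(error, error_rate, kp, kd):
    """Thrust to apply: push against the error and against its rate of change."""
    return -(kp * error + kd * error_rate)


def simulate(error=10.0, rate=0.0, kp=0.5, kd=1.5, dt=0.1, steps=600):
    """Toy one-axis model: thrust acts as an acceleration, integrated in small steps."""
    for _ in range(steps):
        thrust = pd_output(error, rate, kp, kd)
        rate += thrust * dt   # "if you're far away, move closer"
        error += rate * dt    # the derivative term damps the approach
    return error, rate


final_error, final_rate = simulate()
print(abs(final_error) < 0.01, abs(final_rate) < 0.01)  # True True
```

With these made-up gains the derivative term dominates, so the error decays without overshooting; lowering `kd` in this toy model makes the axis oscillate around zero instead.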

Implementing the Autopilot in Javascript

The Docking Simulator is nicely written in HTML5 using a WebGL canvas, which makes it really easy to interface with. The autopilot reads the HUD HTML elements to determine the current position of the craft, and controls it by simulating presses on the on-screen controls.

To make it easier to understand what’s happening, I inserted another transparent canvas element on top of the game, which is used to display graphs showing the change over time of the current error on each axis, and the rate of change of that error, as well as the amount the controls are being actuated. The control loop which runs every 500 milliseconds reads all of the process variables, appends them to a running log which is used for the graphs, and then actuates the controls based on the PID output.

There were a couple of other interesting implementation details:

The UI provided doesn’t show separate x, y, and z speeds, only a speed in the direction of the ISS. To calculate those rates, I measured the change in error with the same data used to generate the graphs.
There is a minimum possible control actuation which can be applied: a single click. But what if the output of the controller tells us we need to move forward by 0.25 of a click? I thought it might be possible to simulate holding the click for a longer or shorter time, but I actually settled on a hackier solution: if we need 0.25 clicks of actuation, then we randomly click one out of every four times the control loop runs! This means the behaviour close to the ISS is a little bit more unpredictable than I think you’d want for real spaceflight, but there aren’t even Kerbals at stake here.
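The fractional-click trick can be sketched in a few lines. This is a Python illustration of the idea, not the autopilot's Javascript: the function name is hypothetical, and it assumes the desired actuation is non-negative, with the direction handled separately.

```python
import random


def clicks_this_tick(desired, rng=random):
    """Turn a fractional actuation into a whole number of clicks.

    The integer part is always clicked; the fractional part becomes the
    probability of one extra click, so the long-run average equals `desired`.
    Assumes desired >= 0.
    """
    whole = int(desired)
    fraction = desired - whole
    return whole + (1 if rng.random() < fraction else 0)


rng = random.Random(0)
ticks = 10_000
total = sum(clicks_this_tick(0.25, rng) for _ in range(ticks))
print(total / ticks)  # close to 0.25 on average
```

Any individual tick is either a full click or nothing, which is where the extra unpredictability close to the ISS comes from.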

The most difficult part of this process was figuring out what values to set for the gain. The rotational controls worked pretty easily, but it turns out that for the x, y, and z controls to work, the gain for the derivative needs to be 800x larger than the gain for the error! As the video shows, this results in initially very high speeds (especially in the x direction), but a lot of damping so that the final approach is tentatively slow.

I can’t say I can recommend this as an approach for real spacecraft control (there is a reason they call it rocket science), but if the next Mark Watney happens to be a frontend developer stuck on a spaceship controlled via a web browser… you could do worse.

Python’s Counter-intuitive, Non-commutative Ternary Logic

2020-04-06T00:00:00-07:00

Boolean logic is one of the foundational abstractions in computer science, from electrical circuits to programming languages. In Boolean logic, all variables take one of two values: 1/0, high/low, or TRUE/FALSE. However, many practical situations require the inclusion of a third value to indicate a variable is unknown, missing, or both false and true at the same time: usually something like None, NULL, NA, or Unknown.

This leads to a complication: for two-valued logic there’s a single agreed-upon way of defining how variables can be combined using the basic AND, OR, and NOT operations: Boolean logic. On the other hand, there are many possible — and at least two frequently used — three-valued formal logics. But in practice, programming languages often fall short of strictly implementing a consistent formal logic.

As it turns out, Python is no exception: boolean operations involving None have some counter-intuitive properties.

You might expect that operations on None would simply always return None, in a similar way that NA is propagated to the values of numerical operations in R. But nope:

>>> not None
True

Although, this does have the felicitous consequence that the tautology x or (not x) is still true in the case of None, unlike in SQL…:

>>> None or (not None)
True

But even weirder than this is that operations involving None are non-commutative: x and y is not the same as y and x.

>>> print(False and None)
False

>>> print(None and False)
None

I’ve searched for formal logical systems which don’t have commutativity for these operations, and I have yet to find this as an intentional choice anywhere else! It seems like it might be possible to define a substructural logic which did this, but I’m not sure why you would.

Why is None weird?

There are two design decisions which explain this strange behaviour: falsiness and short-circuit evaluation.

In Python, every object is either “truthy” or “falsy”: it can be coerced to the boolean value True or False with the bool() function:

>>> bool(True)
True
>>> bool(0)
False
>>> bool("a")
True
>>> bool("")
False
>>> bool(None)
False

From the Python docs:

the following values are interpreted as false: False, None, numeric zero of all types, and empty strings and containers (including strings, tuples, lists, dictionaries, sets and frozensets). All other values are interpreted as true.

This property is useful, for instance, when checking the results of operations which might return an empty string or list with an if-condition, but it has the consequence that because bool(None) evaluates to False, not None evaluates to True.

This choice in the docs is a bit puzzling to me:

neither and nor or restrict the value and type they return to False and True, but rather return the last evaluated argument

In practice, this means that:

>>> "a" or True
'a'
>>> "" or False
False

I think the reason for this is to allow short-circuit evaluation, which means that you could write this:

a = long_running_function_which_might_fail() or other_expensive_operation()

These functions will be evaluated sequentially from left to right, so that if the first one succeeds the second doesn’t need to be called at all, and the variable will take the return value of the first function. So what’s happening with False and None and None and False is that the first element is being evaluated, found to be falsy, and returned…
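A rough functional model of that behaviour makes the asymmetry easier to see. The helpers `py_and` and `py_or` are hypothetical names for this sketch; the right-hand operand is passed as a zero-argument function so that it is only evaluated when needed, as with the real operators.

```python
def py_and(left, right_thunk):
    """Model of `left and right`: return `left` if it is falsy, else evaluate the right side."""
    return left if not left else right_thunk()


def py_or(left, right_thunk):
    """Model of `left or right`: return `left` if it is truthy, else evaluate the right side."""
    return left if left else right_thunk()


calls = []


def expensive():
    calls.append("called")
    return "fallback"


print(py_and(False, lambda: None))              # False
print(py_and(None, lambda: False))              # None
print(py_or("first result", expensive), calls)  # first result []
```

In both `and` cases the left operand is falsy, so it is returned untouched and the right operand is never looked at; which falsy value you get back therefore depends on the order.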

The Truth Tables for Python’s Ternary Logic

If we treat Python’s implementation of these boolean operations with True, False and None as a ternary logic, the resulting truth tables look like this:

not
True	False
None	True
False	True

and	True	None	False
True	True	None	False
None	None	None	None
False	False	False	False

or	True	None	False
True	True	True	True
None	True	None	False
False	True	None	False

Again, those matrices are not symmetrical because the operations are non-commutative.
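The tables can be regenerated, and the asymmetric entries located, with a short script. This is only a sketch; the helper name `asymmetric_pairs` is made up for the example.

```python
from itertools import product

VALUES = (True, None, False)

# Evaluate the real operators on every ordered pair of values.
and_table = {(a, b): (a and b) for a, b in product(VALUES, repeat=2)}
or_table = {(a, b): (a or b) for a, b in product(VALUES, repeat=2)}


def asymmetric_pairs(table):
    """Pairs (a, b) where op(a, b) is not the same object as op(b, a)."""
    return [(a, b) for (a, b) in table if table[(a, b)] is not table[(b, a)]]


print(asymmetric_pairs(and_table))  # [(None, False), (False, None)]
print(asymmetric_pairs(or_table))   # [(None, False), (False, None)]
```

For both operators the only pair that breaks commutativity is the one combining the two falsy values, None and False.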

I should say that in practice the ergonomic benefits to programmers probably outweigh the costs of the behaviour being strange in a formal sense, and I don’t think I’ve written any bugs as a result (but then again, do you ever know that you haven’t written a bug). Still, it’s a potential pitfall. And that’s without even considering float("nan"), numpy’s nan or pandas’ NaT. Hey, at least it’s not as much of a mess as Javascript.

How To Write A Neural Network in a Single Tweet

2020-03-06T00:00:00-08:00

Neural networks! They’re everywhere! Can you use them for everything? Do they have anything to do with brains? Are they Skynet or just fancy regression? Let’s find out!

One of the best ways to demystify something is to build it yourself. On the other hand, one of the best ways to re-mystify it is to obfuscate the code you wrote. So I set myself the challenge of implementing a neural network from scratch which fits exactly in the 280 characters of a single tweet. Here it is:

from numpy import *
r=random
c=hstack
p=dot
N=r.randn
s=lambda x:1/(1+exp(-x))
a=lambda x:s(x)*(1-s(x))
h,j=X.shape
w=N(5,j+1)
e=N(1,6)
for i in r.randint(0,h,1000*h):o=c((1,X[i]));m=p(w,o);n=c((1,s(m)));d=p(e,n);b=0.02*(Y[i]-s(d));g=b;w+=outer(e[:,1:]*a(d)*a(m),o)*b;e+=n*g*a(d)
— Adam Obeng (@Adam_Obeng) April 8, 2019

Or if you prefer, here it is as a gist.

This really is a neural network, albeit a super minimal one: a 1 hidden-layer MLP with sigmoid activations, fit with SGD. If you actually wanted to use it, you would set X to be a numpy array of features and Y a numpy array of labels, and after running the code, the trained weights are in variables w and e. The GIF below shows what that looks like training on some example data generated with sklearn.datasets.make_circles. On the left is a scatter-plot of the training data and predicted values, and the graph on the right shows the training MSE loss as the SGD steps increase.

This very simple network does a good job at finding a reasonable boundary between the orange points and the blue points even though they’re in two concentric rings — something which a linear classifier would be utterly incapable of doing.

A minimal neural network

In putting together this code, I had to do two things: figure out how to write a minimal neural network and training loop without using any external libraries, and then code golf it down to 280 characters.

So to explain the implementation first, let’s look at a de-minified version of the code, with more interpretable variable names and comments:
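What follows is a sketch reconstructed from the tweet above, not necessarily identical to the original gist. The function and variable names, the `epochs` and `rng` parameters, and the toy data at the end are assumptions made for this illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_derivative(x):
    # The derivative can be written in terms of the sigmoid itself.
    return sigmoid(x) * (1.0 - sigmoid(x))


def forward(hidden_weights, output_weights, features):
    """Forward pass for one example; returns every intermediate value."""
    inputs = np.hstack((1.0, features))             # prepend the bias term
    hidden_pre = hidden_weights @ inputs            # hidden layer, before activation
    hidden = np.hstack((1.0, sigmoid(hidden_pre)))  # bias + hidden activations
    output_pre = output_weights @ hidden            # output, before activation
    return inputs, hidden_pre, hidden, output_pre


def train(X, Y, n_hidden=5, learning_rate=0.02, epochs=1000, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n_examples, n_features = X.shape
    hidden_weights = rng.standard_normal((n_hidden, n_features + 1))
    output_weights = rng.standard_normal((1, n_hidden + 1))
    # Plain SGD: one randomly chosen example per step.
    for i in rng.integers(0, n_examples, epochs * n_examples):
        inputs, hidden_pre, hidden, output_pre = forward(
            hidden_weights, output_weights, X[i])
        scaled_error = learning_rate * (Y[i] - sigmoid(output_pre))
        output_delta = scaled_error * sigmoid_derivative(output_pre)
        # Backpropagate through the hidden layer, skipping the bias column.
        hidden_delta = (output_weights[:, 1:] * output_delta
                        * sigmoid_derivative(hidden_pre))
        hidden_weights += np.outer(hidden_delta, inputs)
        output_weights += output_delta * hidden
    return hidden_weights, output_weights


def predict(hidden_weights, output_weights, X):
    return np.array([sigmoid(forward(hidden_weights, output_weights, x)[3])[0]
                     for x in X])


# Toy data (an assumption): label 1 for points inside the unit circle.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
Y = (np.sum(X ** 2, axis=1) < 1.0).astype(float)
hidden_weights, output_weights = train(X, Y, epochs=50, rng=rng)
mse = np.mean((predict(hidden_weights, output_weights, X) - Y) ** 2)
print(f"training MSE: {mse:.3f}")
```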

You’ll notice that I’ve allowed myself the use of numpy. I don’t know that you could do this in Python without it, so I think this still counts as “from scratch”. There were a few tricky steps in figuring out how to write this in the simplest way possible. Quite a few of the programming-oriented minimal neural network tutorials end up implementing network classes, which is unnecessary for the purposes of an implementation which doesn’t need to be extensible and also obscures how the thing actually works.

I ended up referring to a few other “neural network for hackers” posts, of which this is the most succinct example. Even so, I’ve written out the code above with descriptive variable names: I think the practice of writing code as if it was algebra is detrimental to understanding.

Code Golfing

The way I’ve written the code above already takes into account some of the higher-level golfing: the derivative of the sigmoid can be defined in terms of the sigmoid, and using intermediate variables for the output from each layer before the activation function means that these expressions can be re-used in the backprop step.

Using single-character variable names is the cheapest golfing strategy, but sometimes it’s not worth re-defining existing variables to make them shorter. outer is only used once, so re-naming that to a shorter name would result in strictly longer code. I was also tempted to use tuple unpacking to define multiple variables on the same line, but that doesn’t really save any characters (unless your right-side variable is already a tuple).

I went back and forth on including newlines in the code. On the one hand “a neural network in one line of Python” sounds pretty snappy, but on the other using semi-colons is a cheap trick and doesn’t reduce the character count. The one place where they are useful is inside the loop where they save on both a newline and a tab character.

There’s one final secret: I wanted to include my name in the code in a way that was integral to the implementation. There were not so many variable names which could be freely changed, so this was a bit of a challenge, and it actually makes the solution use a few more characters than would otherwise be necessary. Figuring out which ones is left as an exercise to the reader.

Talk at New York Open Statistical Programming Meetup

2016-12-06T00:00:00-08:00

Yesterday, I spoke at the New York Open Statistical Programming Meetup about quanteda.

You can find the slides from my talk here, and the code used during the demos here.

🐍

2016-09-17T00:00:00-07:00

https://www.youtube.com/watch?t=904&v=gg1bv9XJAek

https://github.com/adamobeng/snake

Re: The media is ruining science

2016-08-17T00:00:00-07:00

I don’t quite get this article by Robert Gebelhoff at the Washington Post.

Sure, there are well-known pressures on academics to publish significant results, and also to get media attention. But those are conceptually distinct issues. Publication bias (and related problems like p-hacking and the Garden of Forking Paths) tends to inflate the statistical significance of published results. But that’s not related to the substantive significance of the results, to how interesting the questions being answered are. Solving publication bias would not stop journals privileging articles about exciting or controversial topics. Why would you even want that?

More than that, why reserve specific criticism for Stasko and Geller’s paper? Gebelhoff says that

the research had not been published in any academic journal. Instead, the data was compiled through an Internet survey as part of a presentation to the American Psychological Association’s annual convention. Sure, the results were interesting, but the research is simply not generalisable to the entire public.

That doesn’t quite scan either. The fact that the research wasn’t (yet?) published in a journal has got nothing to do with the fact that they recruited candidates online. Journals publish online studies all the time. This paper was accepted to and presented at a conference, which generally implies at least some level of filtering by peers, even if it’s short of peer-review. Peer-review isn’t a guarantee of accuracy either.

More than that, would you lend more credence to a typical published psychology study where the sample wasn’t Internet randos but university students recruited from a Psych 101 class? Because the results you get from studying undergrads don’t generalise well either.¹ If anything, I’d be more convinced (ceteris paribus) by an attitudinal study about new technologies where the sample age range was 18–82! This paper isn’t primarily making claims about the prevalence of behaviours in the population, but about relationships between them. So the fact that it isn’t a random sample from the population is important but not critical.

There are a number of issues with the ways science and the media operate. But even when they affect each other, they remain distinct issues.

Henrich, Joseph, et al. “In search of homo economicus: behavioral experiments in 15 small-scale societies.” The American Economic Review 91.2 (2001): 73-78. ↩

How Majority Rule Wouldn’t Have Done Anything

2016-04-29T00:00:00-07:00

Eric Maskin and Amartya Sen write in the New York Times that US primary and general elections would be more fair if the winner always commanded an absolute majority of the popular vote. Perhaps, but as it currently stands, neither party nominations nor Presidential elections are actually decided by the popular vote. Even if Maskin and Sen’s criteria were satisfied — and a Presidential candidate’s state victories were all absolute majorities — that candidate could still technically win a Presidential election with only 23% of the popular vote.

State-by-state majorities don’t mean a national majority

Maskin and Sen’s claim is that the voting system used in primaries and Presidential elections produces winners who only got more votes than each of the other candidates (a plurality), rather than a majority of the votes. For example, a candidate can win such a contest with 40% of the vote if two opponents get 30% each. They suggest this voting system should be replaced with one which fulfils the Condorcet criterion: if there is a candidate who would beat each of the others in one-on-one contests, they would be the winner.

There are many reasons to prefer Condorcet-certified electoral methods, but simply switching the electoral systems of the state contests wouldn’t achieve what Maskin and Sen want, that is, the President always having won a majority of the popular vote. There are a couple of reasons for this, and I hinted at one in my post on bellwethers: party candidates and Presidents alike are not chosen by the popular vote. Rather, in both primaries and general elections the voters in each state select delegates, and those delegates vote for the nominee or President. (There’s also a really wonkish criticism to be made of their choice of words: a Condorcet winner is not the same thing as a majority winner.)

For example, ~~Maskin and Sen mention~~¹ that John Kerry might have won Florida in the 2004 Presidential Election if a Condorcet method had been in place. That’s because Kerry won 47.09% of the vote, and Bush won 52.10%, with Nader taking 0.43%. Had Nader been out of the contest, or a Condorcet voting method been in place, Kerry might have won Florida. That’s true, and it’s the same reason why Ted Cruz and John Kasich kinda-sorta made a deal to avoid splitting their vote against Trump. But winning the popular vote state-by-state is not at all the same thing as winning the national popular vote: Al Gore did win the nationwide popular vote in 2000, and he still didn’t win the Presidency. He could have won the Presidency if he had gained the same number of votes but distributed differently between the states. The same applies to the 1876 and 1888 elections.

In other words, even if the state-level contests meet the Condorcet criterion, that doesn’t mean that the national contest will when taken as a whole. It depends crucially on how many delegates each state gets (as well as on how the state’s delegates are allocated to candidates based on the result of the state’s popular vote: winner-takes-all, or proportionally, or some other method).

All in proportion

Currently, delegates to the general election’s Electoral College are not assigned to states proportionally to their population, but based on Congressional apportionment (party conference delegates are assigned in a similar, but not identical, way). A single Electoral College delegate can represent between 200 000 and 700 000 voters depending on the state.

To see how this works, consider a country made up of one large state with 7 million voters and five small states with 1 million each. Let’s say the large state chooses 10 electors, and each of the small states chooses 5. This gives the states voter-to-delegate ratios similar to the extremes of the current Electoral College. The large state is clearly underrepresented, having 7x the population but only 2x the votes. A candidate could win this election by narrowly winning 4 of the small states (just over 500 000 votes and 5 delegates in each), for a total of 20 out of 35 delegates. But they would only have won about 2 million (17%) of the 12 million popular votes. Notice that this happens even though the winning candidate has a strict majority in each of the states they win, which is a more “fair” outcome than Maskin and Sen even hope for!
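The arithmetic of this example can be checked with a few lines of Python. This is just a sketch: the state names are made up, and the assumption that the candidate gets no votes at all outside the four states they win is the most extreme case.

```python
# (name, voters, electors) for the hypothetical country in the text
states = [("Large", 7_000_000, 10)] + [
    (f"Small {i}", 1_000_000, 5) for i in range(1, 6)
]

total_voters = sum(voters for _, voters, _ in states)
total_electors = sum(electors for _, _, electors in states)

# The candidate wins four small states with the barest majority
# (half the votes plus one) and gets no votes anywhere else.
won = [state for state in states if state[0].startswith("Small")][:4]
votes = sum(voters // 2 + 1 for _, voters, _ in won)
electors = sum(e for _, _, e in won)

print(electors, "of", total_electors, "electors")          # 20 of 35 electors
print(f"{votes / total_voters:.1%} of the popular vote")   # 16.7% of the popular vote
```

Twenty electors is a majority of the 35, yet the winning coalition is only about one voter in six.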

This example is simplified, but the outcome is no more extreme than is possible in the real US system. If a candidate won very slight absolute majorities in the 31 states most highly represented in the Electoral College, they could become President on 23.0% of the popular vote. Given 54.9% turnout, that’s just 12.6% of eligible voters. And that’s assuming two parties: with an arbitrarily large number of parties, a President could win with an arbitrarily small share of the vote!

The state-level voting system which Maskin and Sen want to improve is only half the story. Once a state is won, it’s just as important how much that result counts towards the national contest.

The Condorcet and Majority criteria

Maskin and Sen also slightly mangle two criteria for the fairness of voting systems: the Condorcet criterion and the Majority criterion.

The Condorcet criterion says that

if there is a candidate who would beat each of the others in one-on-one contests then that candidate should win

while the Majority criterion says that

if there is a candidate who wins an absolute majority of the votes then that candidate should win

Different voting systems can satisfy one or the other criterion, but a system which satisfies the Condorcet criterion automatically satisfies the Majority criterion. That’s easy to see: any candidate who wins an absolute majority of the votes also beats each of their opponents one-on-one (more than half of the voters already prefer them to everyone else), so they are the Condorcet winner and a Condorcet system must elect them.

But the opposite is not true: a system which satisfies the majority criterion does not necessarily satisfy the Condorcet criterion. Imagine a candidate who beat their opponents one-on-one, but did not command an absolute majority. They would be elected by a Condorcet system, but not necessarily by a majority system. The Condorcet criterion is in that sense more strict, more difficult to satisfy. Majority winners are few and far between, so it’s not that hard to guarantee that when they exist they are chosen.
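A small sketch with ranked ballots shows the difference between the two kinds of winner. The electorate below is invented for the example, the function names are hypothetical, and the code assumes every ballot ranks every candidate.

```python
from collections import Counter
from itertools import combinations


def majority_winner(ballots):
    """Candidate ranked first on more than half of the ballots, or None."""
    first_choices = Counter(ballot[0] for ballot in ballots)
    candidate, count = first_choices.most_common(1)[0]
    return candidate if count * 2 > len(ballots) else None


def condorcet_winner(ballots):
    """Candidate who beats every other candidate head-to-head, or None."""
    candidates = sorted(set(ballots[0]))
    wins = Counter()
    for a, b in combinations(candidates, 2):
        a_over_b = sum(ballot.index(a) < ballot.index(b) for ballot in ballots)
        if a_over_b * 2 > len(ballots):
            wins[a] += 1
        elif a_over_b * 2 < len(ballots):
            wins[b] += 1
    for candidate in candidates:
        if wins[candidate] == len(candidates) - 1:
            return candidate
    return None


# 40 voters rank A > B > C, 35 rank C > B > A, 25 rank B > A > C
split = [("A", "B", "C")] * 40 + [("C", "B", "A")] * 35 + [("B", "A", "C")] * 25
print(majority_winner(split), condorcet_winner(split))  # None B

# 60 voters rank A first: the majority winner is also the Condorcet winner
landslide = [("A", "B", "C")] * 60 + [("B", "C", "A")] * 20 + [("C", "B", "A")] * 20
print(majority_winner(landslide), condorcet_winner(landslide))  # A A
```

In the first electorate B wins both head-to-head contests (60 to 40 against A, 65 to 35 against C) despite being the first choice of only a quarter of the voters.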

The direction of implication is slightly tricky here, and Maskin and Sen get it a bit wrong. Both of the criteria say that if a candidate exists who meets these parameters, then they win. They do not say that a Condorcet or majority winner must exist; in a Condorcet or majority system, the winner does not have to be able to beat all of their opponents one-on-one or gain an absolute majority. Why is this? Think of the majority case: if votes are split 60/20/20, then the candidate with 60% wins, and the majority criterion holds. If the vote is split 40/30/30, there is no candidate with an absolute majority, so the majority criterion says nothing about who should win! Plurality voting (our familiar win-if-you-win-the-most-votes system that Maskin and Sen complain about) satisfies the majority criterion! In fact, you could have a voting system where if there was no absolute majority the winner was selected randomly from all the candidates in the field, and that would still meet the majority criterion!

Maskin and Sen also confuse matters by saying that a candidate who “would defeat each opponent individually in a head-to-head match-up” is “a real majority winner”. Well, if you take the common-sense meaning of “majority”, that’s not true. A Condorcet winner could be a majority winner, but they don’t have to be. If they’re redefining “majority” to mean someone who beats other candidates one-on-one head-to-head, they’re twisting the meaning of the word. Beating your opponents one-on-one is perhaps a desirable property, but it’s just not the same thing as having a majority. If you want plurality winners to stop behaving as if they had a majority, you should feel the same about mere Condorcet winners.

Majority rules

So should we just elect the President by nationwide popular vote? Not necessarily: the current electoral system might seem arbitrary or plain weird, and parts of it certainly are. But state-level contests are designed to cement the importance of the states in the electoral system. In fact, while the Constitution provides that the states choose the Electoral College, it doesn’t say that the popular vote has anything to do with it. That’s for the state legislatures to decide, and it wasn’t until 1872 that all states held popular elections for President. There is a debate to be had as to whether the power of the people or the power of the states is more important.

Regardless of the details of the American system, Maskin and Sen’s more general point is that plurality winners should be humble, and remember that a whole bunch of their constituents didn’t vote for them. True. But winners should be humble even if they have a majority: even after the most indisputable victory not every voter approves of every policy a politician comes up with during their term in office. It’s a fundamental (one might say necessary) characteristic of representative democracy that elected representatives don’t perfectly mirror the opinions of the citizenry. Democracy’s in the details.

Correction 2016-05-22 Maskin and Sen’s original article actually used the example of George W. Bush, Al Gore and Ralph Nader in Florida in 2000. The substantive point stands, however. ↩

Embedded local video in jupyter notebook with R kernel

2016-04-04T00:00:00-07:00

For some reason, there isn’t a default way to embed a local video file in a jupyter notebook.

If you’re using a python kernel, you can make use of this hack, which inserts the whole video, base64-encoded, into the generated HTML. But because this runs in a code block, not a markdown block, it’s dependent on the kernel you’re running. Notebooks only support one kernel, so if the rest of your code is R, you’ll need an R version.

Nobody else seems to have written the equivalent code for R, so here it is:

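    show_video <- function(filename, mimetype) {
        library(IRdisplay)
        library(base64enc)

        # Read the file and base64-encode its contents
        data = base64encode(filename, 'raw')

        # Inline the video as a data URI.
        # NOTE: the opening tag below is reconstructed; its attributes are an assumption.
        display_html(paste0('<video controls src="data:',
             mimetype, ';base64,', data, '">'))
    }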
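For comparison, the Python-kernel approach mentioned above can be sketched along the same lines. This is not the linked hack itself, just an illustration of the base64 data-URI technique; the function name and the exact tag attributes are assumptions.

```python
import base64


def video_html(filename, mimetype):
    """Return a <video> tag with the file's bytes inlined as a base64 data URI."""
    with open(filename, "rb") as f:
        data = base64.b64encode(f.read()).decode("ascii")
    return f'<video controls src="data:{mimetype};base64,{data}"></video>'
```

In a notebook with a Python kernel, the returned string could be shown with `IPython.display.HTML(video_html("clip.mp4", "video/mp4"))`, where `clip.mp4` is a placeholder file name.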

Whither the Bellwether?

2016-03-09T00:00:00-08:00

Scientists have calculated that the chances of something so patently absurd actually existing are millions to one. But magicians have calculated that million-to-one chances crop up nine times out of ten.

— Terry Pratchett, Mort

TL;DR

If Vigo County, IN is a bellwether for US Presidential elections, then so is Valencia County, NM.

And York County, ME; Racine County, WI; and Strafford County, NH.

Besides which, we shouldn’t expect any of them to continue getting it right.

Background

On the Media’s recent segment “Magic” Terre Haute reports on the search for “one small town that thinks exactly the way the nation does” and as such can predict the results of U.S. Presidential elections. According to Don Campbell amongst others, the current contender is Vigo County, IN which has “voted for the winning candidate in every presidential election except two — 30 out of 32 elections”, and has not “missed in 60 years”. According to Campbell, “No other place in America comes close.”

This immediately raised two questions for me: is Vigo the only candidate for the nation’s bellwether? And are there really even bellwethers in the first place?

To answer them, I collected county-level voting data from ICPSR and the Congressional Quarterly Voting and Elections Collection, covering the period 1840–2012 (code on GitHub, please replicate and extend).

# Which are the bellwethers?

First off, it’s true that Vigo hasn’t missed since 1952, but the same goes for Valencia County, NM. They are far from outliers: fourteen other counties have missed only one of those fifteen elections. And I would say it’s even tipping the scales in Vigo’s favour to use its winning streak as the basis of comparison to other counties.

State	Area	prop. correct	#correct	#elections
Indiana	Vigo	1.0000000	15	15
New Mexico	Valencia	1.0000000	15	15
California	Ventura	0.9333333	14	15
Delaware	Kent	0.9333333	14	15
Florida	Hillsborough	0.9333333	14	15
Montana	Blaine	0.9333333	14	15
New Mexico	Hidalgo	0.9333333	14	15
New Mexico	Sandoval	0.9333333	14	15
North Carolina	Buncombe	0.9333333	14	15
North Dakota	Sargent	0.9333333	14	15
Ohio	Ottawa	0.9333333	14	15
Texas	Bexar	0.9333333	14	15
Texas	Val Verde	0.9333333	14	15
Virginia	Westmoreland	0.9333333	14	15
Wisconsin	Juneau	0.9333333	14	15
Wisconsin	Sawyer	0.9333333	14	15
California	Merced	0.8666667	13	15
California	San Bernardino	0.8666667	13	15
California	San Joaquin	0.8666667	13	15
California	Stanislaus	0.8666667	13	15
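For what it’s worth, a table like this boils down to a simple tally. The sketch below is not the code from the GitHub repository: the record layout, the county labels and the function name `bellwether_table` are all made up for illustration, and the rows are toy data rather than the ICPSR or CQ figures.

```python
from collections import defaultdict

# Hypothetical records: (county, election year, county winner, national winner).
# These rows are invented for illustration; they are not real results.
records = [
    ("A", 2004, "R", "R"),
    ("A", 2008, "D", "D"),
    ("A", 2012, "D", "D"),
    ("B", 2004, "D", "R"),
    ("B", 2008, "D", "D"),
    ("B", 2012, "D", "D"),
    ("C", 2008, "R", "D"),  # county C has no 2004 row (missing data)
    ("C", 2012, "D", "D"),
]


def bellwether_table(records, since):
    """Return rows of (county, prop. correct, #correct, #elections), best first."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for county, year, county_winner, national_winner in records:
        if year < since:
            continue
        total[county] += 1
        # A bool counts as 0 or 1 when added to an int.
        correct[county] += county_winner == national_winner
    rows = [(c, correct[c] / total[c], correct[c], total[c]) for c in total]
    # Sort by proportion correct, then by amount of data, both descending.
    return sorted(rows, key=lambda r: (-r[1], -r[3], r[0]))


table = bellwether_table(records, since=2004)
for row in table:
    print(row)
```

Changing `since` is all it takes to move the starting line from 1952 to 1888 or 1840, which is why the choice of start date matters so much for who ends up at the top.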

If we instead take into account elections since 1888, then Vigo and Ventura County, CA are tied with 29/32 elections each. These figures are not the same as the ones Campbell quotes, and I’m not quite sure why. It seems like the ICPSR data don’t agree with Dave Leip’s US Election Atlas about Vigo County’s results for 1892, 1896 and 1908. If the ICPSR data are wrong, that could be enough to make Vigo the single most successful predictor over this time period, but it still wouldn’t be the uncontested winner.¹

State	Area	prop. correct	#correct	#elections
Indiana	Vigo	0.90625	29	32
California	Ventura	0.90625	29	32
Wisconsin	Racine	0.87500	28	32
New Hampshire	Coos	0.84375	27	32
Iowa	Jasper	0.84375	27	32
Iowa	Palo Alto	0.84375	27	32
Nebraska	Douglas	0.84375	27	32
Oregon	Clackamas	0.84375	27	32
Connecticut	Windham	0.81250	26	32
Maine	York	0.81250	26	32
New Hampshire	Strafford	0.81250	26	32
Delaware	New Castle	0.81250	26	32
New Jersey	Middlesex	0.81250	26	32
Indiana	Delaware	0.81250	26	32
Indiana	Madison	0.81250	26	32
Indiana	St Joseph	0.81250	26	32
Indiana	Vanderburgh	0.81250	26	32
Ohio	Montgomery	0.81250	26	32
Ohio	Portage	0.81250	26	32
Iowa	Bremer	0.81250	26	32

For some reason, most reports about Vigo County’s bellwether status start from 1888. I’m not an expert on US political geography, but it’s not clear to me why they choose that date. There might be something significant about 1888 in terms of the structure of the electoral system, or the geography of states and counties, but I haven’t found it so far. So I saw no reason to not look back even further.²

On elections since 1840, Vigo County still does pretty well, but there are nine other counties either tied with or ahead of it. Depending on how the missing data fall, these counties could be pushing 84% agreement with the election result. Racine County, WI might be the single best bellwether, given that it predicts 34 out of the 40 elections for which there are data. If it called all the missing elections right, it would have 86%. Even given the missing data, Racine gets as many right as York and Strafford. Still, over this time period only a few counties have above 80% correct predictions.

State	Area	prop. correct	#correct	#elections
Maine	York	0.8292683	34	41
New Hampshire	Strafford	0.8292683	34	41
New Hampshire	Hillsborough	0.8048780	33	41
Ohio	Portage	0.8048780	33	41
Connecticut	Windham	0.7804878	32	41
Illinois	Macon	0.7804878	32	41
Illinois	Will	0.7804878	32	41
Indiana	St Joseph	0.7804878	32	41
Indiana	Vigo	0.7804878	32	41
Ohio	Stark	0.7804878	32	41
Maine	Washington	0.7560976	31	41
New Hampshire	Coos	0.7560976	31	41
New Hampshire	Sullivan	0.7560976	31	41
New Jersey	Atlantic	0.7560976	31	41
Indiana	Delaware	0.7560976	31	41
Indiana	Vanderburgh	0.7560976	31	41
Michigan	Macomb	0.7560976	31	41
Michigan	Shiawassee	0.7560976	31	41
Michigan	Van Buren	0.7560976	31	41
Ohio	Columbiana	0.7560976	31	41
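The best-case figures quoted before this table (roughly 84% and 86%) can be checked with a little arithmetic. This is a sketch under one assumption: that the full period contains the 44 presidential elections held every four years from 1840 to 2012. The function name `accuracy_bounds` is mine.

```python
TOTAL_ELECTIONS = 44  # assumed: one election every four years, 1840 to 2012 inclusive


def accuracy_bounds(n_correct, n_observed, n_total=TOTAL_ELECTIONS):
    """Return (observed, worst-case, best-case) accuracy for one county.

    Worst case: every election with missing data was a miss.
    Best case: every election with missing data was a hit.
    """
    missing = n_total - n_observed
    observed = n_correct / n_observed
    worst = n_correct / n_total
    best = (n_correct + missing) / n_total
    return observed, worst, best


york = accuracy_bounds(34, 41)    # York and Strafford: 34 correct of 41 observed
racine = accuracy_bounds(34, 40)  # Racine: 34 correct of 40 observed

print("York   observed/worst/best:", [round(x, 3) for x in york])
print("Racine observed/worst/best:", [round(x, 3) for x in racine])
```

The gap between the worst and best case is the part of the ranking that the missing data leave undetermined.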

Finally, spare a thought for Webster County in Georgia. Since its founding in 1853 it has only voted in line with the national trend in 13 out of the 37 elections for which there are data. That makes it a somewhat reliable anti-bellwether: if you looked at the result from Webster and picked the opposite, you’d do better than 75% of other counties with as much data.³

# Are there actually bellwethers?

A quick Google Scholar search reveals surprisingly few academic papers about Presidential election bellwethers. Perhaps the best is this 1975 paper by Tufte (yes, that Tufte) and Sun. It concludes that there are no bellwether counties because bellwether status can only reliably be assigned after the fact.

This is a common problem which you come across in both machine learning and the social sciences. Once you’ve observed an outcome it’s trivial to tweak your prediction to predict the thing that’s already happened. And once you’ve made a prediction of the past, you can come up with a justification that not only makes it look like it’s the only possible prediction that could have been made, but also convinces you that you could have made it before the fact.

We can do something like Tufte’s analysis using our larger data set. For simplicity’s sake, let’s look at the accuracy of counties that have had, at any point in time, the same 60-year, 15-election streak that Vigo currently enjoys (Tufte and Sun look at streaks from 24 to 48 years).

Take 1968, for example. Northampton, PA and Prince George’s, MD (as well as four other counties) had predicted the correct result in all the elections since 1908, the same 15-election streak that Vigo has now. But in 1968 Northampton voted for Humphrey and Prince George’s voted, correctly, for Nixon. All in all, there are 90 cases in which a county had a streak of 15 correct predictions behind it going into an election year (for 43 distinct counties). But in 47% of those cases, the county failed to get the next election right. Pretty close to a coin-toss.
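The counting behind that 47% figure can be sketched roughly as follows. This is an illustration rather than the original analysis code: the function names are hypothetical, each county’s history is assumed to be a gap-free list of hits and misses in chronological order, and the toy example uses a streak length of 3 instead of 15 to keep it short.

```python
def streak_followups(history, streak_len=15):
    """For each election the county entered with `streak_len` consecutive
    correct calls behind it, record whether it got that election right.

    `history` is a chronological list of booleans (True = voted for the winner).
    """
    outcomes = []
    for i in range(streak_len, len(history)):
        if all(history[i - streak_len:i]):
            outcomes.append(history[i])
    return outcomes


def miss_rate_after_streak(histories, streak_len=15):
    """Pool all counties; return (share of misses after a streak, number of cases)."""
    outcomes = []
    for history in histories.values():
        outcomes.extend(streak_followups(history, streak_len))
    if not outcomes:
        return None
    return 1 - sum(outcomes) / len(outcomes), len(outcomes)


# Toy histories (invented), checked with a streak length of 3.
histories = {
    "X": [True, True, True, False, True, True, True, True],
    "Y": [True, True, True, True, False],
}
rate, cases = miss_rate_after_streak(histories, streak_len=3)
print(f"{cases} cases, miss rate {rate:.0%}")
```

Note that one county can contribute several cases if its streak keeps going, which is how 43 counties can produce 90 cases.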

This graph shows the performance over time of the 20 counties that have, at one time or another, had a streak longer than Vigo’s current one.

The fatalism in this plot is almost the opposite of the problem of post-hoc prediction: the counties with the longest streaks necessarily fall off a cliff shortly after.

# What is a bellwether anyway?

So far, we’ve worked out that Vigo isn’t the only bellwether, and bellwethers aren’t particularly great at making predictions. But what is it exactly that we’re asking them to do?

The most common version of the bellwether is what Tufte and Sun call the “all-or-nothing” barometric bellwether: a county in which the majority vote is for the person who’s eventually elected President. That’s what we’ve been looking at so far, and as such we’re interested in the classificatory accuracy of the county: what proportion of elections it gets right. But you might also look for counties where the percentage for each candidate is closest to that from the popular vote. That’s not the same thing: given that many contests are close to 50/50, calling the right outcome can sometimes be a matter of a few tenths of a percent, so measuring the difference in vote percentages directly might be a fairer measure of a bellwether.

That said (and not having looked at the data) I don’t think it will make a difference to the results above. Dave Leip’s US Election Atlas reports that Vigo County has had a mean absolute difference from the national result of 0.9pp. That might give it the edge in accuracy of prediction of the vote percentages, but I doubt switching to that measure will make it the single best predictor.
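To make the distinction between the two measures concrete, here is a small sketch. The vote shares are invented, the function names are mine, and I’m assuming a simplified two-candidate race in which whoever has more than 50% wins both the county and the election (which, as discussed below, is not always how it works nationally).

```python
def winner(share_x):
    """Two-candidate simplification: X wins with more than 50% of the vote."""
    return "X" if share_x > 50 else "Y"


def all_or_nothing_accuracy(county_shares, national_shares):
    """Proportion of elections in which the county picked the national winner."""
    hits = sum(winner(c) == winner(n) for c, n in zip(county_shares, national_shares))
    return hits / len(national_shares)


def mean_abs_diff_pp(county_shares, national_shares):
    """Mean absolute difference in candidate X's vote share, in percentage points."""
    diffs = [abs(c - n) for c, n in zip(county_shares, national_shares)]
    return sum(diffs) / len(diffs)


# Invented shares (percent for candidate X) over two elections.
national = [50.2, 52.0]
close_but_wrong = [49.8, 51.0]  # never more than 1pp off, but misses a winner
far_but_right = [60.0, 65.0]    # way off on shares, but calls both winners

for name, county in [("close", close_but_wrong), ("far", far_but_right)]:
    print(name, all_or_nothing_accuracy(county, national),
          round(mean_abs_diff_pp(county, national), 2))
```

The “close” county is the better mirror of the national vote on shares and the worse bellwether in the all-or-nothing sense, which is exactly the gap between the two definitions.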

Also, what we’re asking bellwethers to do is strange given the institutional structure of US Presidential elections, namely that the winner of the popular vote is not necessarily the winner of the election. So to predict all elections correctly, a bellwether would sometimes have to go against the popular vote. In 2000, for example, a majority of voters in Vigo County voted for George W. Bush, but the winner of the nationwide popular vote was Al Gore. Vigo somehow got the election right while getting the popular vote wrong. Bellwethers are supposed to work because they’re a microcosm of the US, their demographics and opinions being proportional to those of the nation as a whole. How, then, do we explain the cases where, in order to predict the outcome, the bellwether’s voters had to predict not the nation’s popular vote but the distinctly non-proportional outcome of the electoral college, presumably taking into account such factors as Maine and Nebraska’s Congressional district method, faithless electors and the Huntington-Hill method?

As I said, it’s also often claimed that these bellwethers are accurate because their populations are representative of demographic and attitudinal characteristics on a national level. The OtM piece notes that Vigo County is not particularly demographically similar to the nation as a whole, so that’s already kind of dubious. As Campbell noted, some of it is certainly luck, but I may have to look further into the makeup of the potential bellwether.

# Conclusion

The search for predictive accuracy is always stymied by the risk of post-hoc prediction. In machine learning, this is dealt with by a strict separation of the data you’re allowed to look at (the training set) from the data you use to evaluate your prediction (the test set). Failing this, it’s easy to over-fit, to produce a model that can perfectly predict what has already happened but cannot generalise to the future.
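Here is a toy illustration of that train/test split applied to bellwether-hunting. Everything in it is an assumption for the sake of the example: the “counties” are simulated coin-flippers with no predictive power at all, and the figures of 3,000 counties, 15 training elections and 5 test elections are arbitrary.

```python
import random


def simulate_counties(n_counties, n_elections, rng):
    """Each county 'calls' each election correctly with probability 0.5 (pure noise)."""
    return [[rng.random() < 0.5 for _ in range(n_elections)]
            for _ in range(n_counties)]


def accuracy(calls):
    """Proportion of correct calls in a list of booleans."""
    return sum(calls) / len(calls)


rng = random.Random(0)  # fixed seed so the run is repeatable
counties = simulate_counties(3000, 20, rng)

# Strict separation: choose the bellwether on the first 15 elections only...
train = [calls[:15] for calls in counties]
# ...and evaluate it on the 5 elections it was not chosen on.
test = [calls[15:] for calls in counties]

best = max(range(len(counties)), key=lambda i: accuracy(train[i]))
train_acc = accuracy(train[best])
test_acc = accuracy(test[best])

print(f"best county on training elections: {train_acc:.0%} correct")
print(f"same county on held-out elections: {test_acc:.0%} correct")
```

With this many coin-flippers, the best county on the training elections will tend to look impressive purely by chance, while its held-out record has no reason to be better than 50% on average. The held-out number is the only one that says anything about the future.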

According to my quick analysis,⁴ that’s at least part of the story with bellwethers too. Vigo County gets all the press, but there are other equally accurate counties. Even those are not great predictors in the long term.

Of course, it may be that pure predictive power is not why we should be paying attention to bellwethers. We certainly learn something about the American populace by studying a small town in detail, as we did from the Middletown studies. But the conclusions we can make from such places are not generalisable to the whole country in the (relatively) straightforward statistical sense. If we’re going to learn from Vigo, Valencia, York, Racine, and Strafford, we’ll need to make use of well-informed and sensitive interpretation. That’s what a documentarian like Campbell is well-placed to develop. Looking at it another way, bellwethers aren’t an alternative to the punditry so lamented by OtM; they’re fuel for it.

So, are the citizens of Vigo County “history’s most reliable presidential bellwethers”?

Not quite, Bob, not quite.

Citations

Broh, C. Anthony. “Whether Bellwethers or Weather-Jars Indicate Election Outcomes.” The Western Political Quarterly (1980): 564-570.
Clubb, Jerome M., William H. Flanigan, and Nancy H. Zingale. Electoral Data for Counties in the United States: Presidential and Congressional Races, 1840-1972. ICPSR08611-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2006-11-13. http://doi.org/10.3886/ICPSR08611.v1
CQ Voting and Elections Collection
Kenski, Henry C., and Edward C. Dreyer. “In Search of State Presidential Bellwethers.” Social Science Quarterly 58.3 (1977): 498-505. http://www.jstor.org/stable/42859841
Lewis-Beck, Michael S. “Election forecasts in 1984: how accurate were they?” PS: Political Science & Politics 18.01 (1985): 53-62.
Paleologos, David, and Elizabeth J. Wilson. “Use of Bellwether Samples to Enhance Pre-Election Poll Predictions: Science and Art.” American Behavioral Scientist 55.4 (2011): 390-418.
Robertson, David Brian. “Bellwether politics in Missouri.” The Forum. Vol. 2. No. 3. 2004.
Tufte, Edward R., and Richard A. Sun. “Are There Bellwether Electoral Districts?” The Public Opinion Quarterly 39.1 (1975): 1-18. http://www.jstor.org/stable/2748067

It’s also possible that there’s some weirdness happening with county names changing, or counties and independent cities in the same state having the same name. I caught some of these, but only when they obviously affected the result. ↩
I did choose these dates purely because the data were easiest to access. It should be possible to extend the analysis using data from 1788 onwards, although they seem to become more incomplete the further back you go. ↩
This missing data problem is more of an issue here than I’m used to: when there are only 57 events, missing one or two of them can make a big difference. ↩
By all means, please check my math, especially if you know what’s going on with some of the weird discrepancies and missing data. ↩