<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="https://adamobeng.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://adamobeng.com/" rel="alternate" type="text/html" /><updated>2025-01-08T22:03:53-08:00</updated><id>https://adamobeng.com/feed.xml</id><subtitle>Adam Obeng&apos;s Website</subtitle><entry><title type="html">wddbfs – Mount a sqlite database as a filesystem</title><link href="https://adamobeng.com/wddbfs-mount-a-sqlite-database-as-a-filesystem/" rel="alternate" type="text/html" title="wddbfs – Mount a sqlite database as a filesystem" /><published>2024-02-17T00:00:00-08:00</published><updated>2024-02-17T00:00:00-08:00</updated><id>https://adamobeng.com/wddbfs-mount-a-sqlite-database-as-a-filesystem</id><content type="html" xml:base="https://adamobeng.com/wddbfs-mount-a-sqlite-database-as-a-filesystem/"><![CDATA[<p>Often when I’m prototyping a project, I hesitate to use a sqlite database despite their <a href="https://sqlite.org/appfileformat.html">many advantages</a>. It seems much easier to just dump a bunch of files in a directory and to rely on the universal support for the filesystem API to read/delete/update records. Part of this is avoiding the overhead of figuring out a relational schema, but an equal amount of friction comes from the fact that .sqlite files are just slightly more difficult to inspect: the SQL syntax for selecting a few records is much more verbose than <code class="language-plaintext highlighter-rouge">head -n</code> or <code class="language-plaintext highlighter-rouge">tail -n</code>, there are special commands (which don’t work in some environments/versions) for listing tables, and neither my text editor nor my shell has autocompletion for database queries.</p>

<p>To try to get the best of both worlds, I have put together a little utility called <em>wddbfs</em>, which exposes a sqlite database as a (WebDAV<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">1</a></sup>) filesystem, accessible to anything which can work with a filesystem, including terminals, file managers, and text editors.</p>

<p>Here’s how it works.  If you install it with:</p>

<p><code class="language-plaintext highlighter-rouge">pip install git+https://github.com/adamobeng/wddbfs</code></p>

<p>You can mount a database with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wddbfs --anonymous --db-path=/path/to/an/example/database/like/Chinook_Sqlite.sqlite
</code></pre></div></div>

<p>Which will be available at localhost:8080 with no username or password required. <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">2</a></sup></p>

<p>Once you’ve <a href="https://support.apple.com/guide/mac-help/connect-disconnect-a-webdav-server-mac-mchlp1546/mac">mounted</a> this WebDAV filesystem at, for example <code class="language-plaintext highlighter-rouge">/Volumes/127.0.0.1/</code>, you can see all the databases you specified with <code class="language-plaintext highlighter-rouge">--db-path</code>.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">3</a></sup></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls /Volumes/127.0.0.1/
Chinook_Sqlite.sqlite
$ ls /Volumes/127.0.0.1/Chinook_Sqlite.sqlite
Album.csv           Customer.tsv        Invoice.jsonl       Playlist.json
Album.json          Employee.csv        Invoice.tsv         Playlist.jsonl
Album.jsonl         Employee.json       InvoiceLine.csv     Playlist.tsv
Album.tsv           Employee.jsonl      InvoiceLine.json    PlaylistTrack.csv
Artist.csv          Employee.tsv        InvoiceLine.jsonl   PlaylistTrack.json
Artist.json         Genre.csv           InvoiceLine.tsv     PlaylistTrack.jsonl
Artist.jsonl        Genre.json          MediaType.csv       PlaylistTrack.tsv
Artist.tsv          Genre.jsonl         MediaType.json      Track.csv
Customer.csv        Genre.tsv           MediaType.jsonl     Track.json
Customer.json       Invoice.csv         MediaType.tsv       Track.jsonl
Customer.jsonl      Invoice.json        Playlist.csv        Track.tsv
</code></pre></div></div>

<p>By default, all the tables can be read as CSV, TSV, JSON and line-delimited JSON (“.jsonl”).</p>

<p>These files can be manipulated with tools that work with a standard filesystem:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tail -n 3 Chinook_Sqlite.sqlite/Album.tsv
345     Monteverdi: L'Orfeo     273
346     Mozart: Chamber Music   274
347     Koyaanisqatsi (Soundtrack from the Motion Picture)      275
$ grep "Mahler" Chinook_Sqlite.sqlite/Artist.jsonl 
{"ArtistId": 240, "Name": "Gustav Mahler"}
</code></pre></div></div>

<p>For now, though, the whole table is read into memory on every read, so this won’t work well for very large database files. There’s also no write support… yet.</p>
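
<p>As a rough illustration of what one of these files contains, here is a minimal Python sketch (not wddbfs’s actual code) that renders a whole table as line-delimited JSON using only the standard library. The in-memory database, the <code class="language-plaintext highlighter-rouge">table_to_jsonl</code> helper and the sample row are assumptions made for the example.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import sqlite3


def table_to_jsonl(conn, table):
    """Render every row of `table` as one JSON object per line."""
    # Illustration only: the table name is interpolated directly, so it must be trusted.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [col[0] for col in cursor.description]
    return "\n".join(json.dumps(dict(zip(columns, row))) for row in cursor)


# Hypothetical stand-in for the Artist table in Chinook_Sqlite.sqlite
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Artist (ArtistId INTEGER, Name TEXT)")
conn.execute("INSERT INTO Artist VALUES (240, 'Gustav Mahler')")
print(table_to_jsonl(conn, "Artist"))
</code></pre></div></div>

<p>Like the real thing, this sketch reads the entire table in one go, which is fine for small tables and wasteful for large ones.</p>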

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:2" role="doc-endnote">
      <p>Despite how clunky it is, this seems to be the best way to implement a filesystem given that getting FUSE support is not straightforward. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>This is obviously not suitable for access by hosts over a network. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:1" role="doc-endnote">
      <p>Databases specified with <code class="language-plaintext highlighter-rouge">--db-path</code> will be available at the root of the filesystem, but if you pass <code class="language-plaintext highlighter-rouge">--allow-abspath</code> any database file on the host filesystem will also be exposed inside the WebDAV mount at, for example, <code class="language-plaintext highlighter-rouge">/mount/webdav/absolute/path/to/db/on/host/db.sqlite</code>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="hacks" /><summary type="html"><![CDATA[Often when I’m prototyping a project, I hesitate to use a sqlite database despite their many advantages. It seems much easier to just dump a bunch of files in a directory and to rely on the universal support for the filesystem API to read/delete/update records. Part of this is avoiding the overhead of figuring out a relational schema, but an equal amount of friction comes from the fact that .sqlite files are just slightly more difficult to inspect: the SQL syntax for selecting a few records is much more verbose than head -n or tail -n, there are special commands (which don’t work in some environments/versions) for listing tables, and neither my text editor nor my shell has autocompletion for database queries.]]></summary></entry><entry><title type="html">A Javascript Autopilot for Crew Dragon</title><link href="https://adamobeng.com/a-javascript-autopilot-for-crew-dragon/" rel="alternate" type="text/html" title="A Javascript Autopilot for Crew Dragon" /><published>2020-05-21T00:00:00-07:00</published><updated>2020-05-21T00:00:00-07:00</updated><id>https://adamobeng.com/a-javascript-autopilot-for-crew-dragon</id><content type="html" xml:base="https://adamobeng.com/a-javascript-autopilot-for-crew-dragon/"><![CDATA[<p>Update (2020-05-31): <a href="https://space.stackexchange.com/questions/9243/what-computer-and-software-is-used-by-the-falcon-9/9446#9446">It turns out</a> that the Dragon 2 actually uses Chromium and Javascript for its flight interface. So there’s a reasonable chance this autopilot would run in the real Crew Dragon‽</p>

<p>That title is a truly horrifying combination of words.</p>

<p>SpaceX just released the <a href="https://iss-sim.spacex.com/">ISS Docking Simulator</a>, a browser game where the objective is to very slowly fly the Crew Dragon 2 to dock with the International Space Station, just like the <a href="https://arstechnica.com/science/2020/04/spacexs-crew-dragon-gets-a-launch-date-may-27th/">real mission</a> is scheduled to do next week. Given that a significant portion of my early teens was wasted on just <em>the free demo</em> of <em><a href="https://en.wikipedia.org/wiki/Star_Wars:_X-Wing_vs._TIE_Fighter">Star Wars: X-Wing vs. TIE Fighter</a></em>, I had to play it. The whole experience is very ponderous, and definitely not as fun as XWvTF — which leads me to believe it is in fact a realistic simulation.</p>

<p>I only needed to dock successfully once to be satisfied with playing the simulator myself. The next logical step was to move on to the much more interesting challenge of automating it. The result is some hacky Javascript which you can paste into your browser console, and which will automatically fly the simulated Crew Dragon to the ISS. Here’s a video of it in action:</p>

<iframe src="https://player.vimeo.com/video/420862013" width="640" height="480" frameborder="0" allow="autoplay; fullscreen" allowfullscreen=""></iframe>

<p>And here’s the full code:</p>

<script src="https://gist.github.com/adamobeng/9a1d25a922043c9e62a98d2463c3cf38.js"></script>

<h2 id="how-do-you-fly-a-spaceship">How do you fly a spaceship?</h2>

<p>A quick Wikipedia browse helped me figure out that the problem to solve here is called ‘<a href="https://en.wikipedia.org/wiki/Attitude_control#Control_Algorithms">attitude control</a>’, which led to some gnarly papers before I realised that the controller is basically a <a href="https://en.wikipedia.org/wiki/PID_controller">PID controller</a>. Actually, I only had a vague idea what a PID controller was, but by staring at <a href="https://en.wikipedia.org/wiki/PID_controller#PID_controller_theory">the algorithm</a> and a coincidental <a href="https://news.ycombinator.com/item?id=23222019">Hacker News post</a> long enough, I realised that the algorithm can be simplified to:</p>

<ol>
  <li>If you’re far away, move closer</li>
  <li>If you’re moving too fast, slow down</li>
</ol>

<p><br /></p>

<p>This chimes with how I was playing the demo too: you realise that you can change your roll but you have to be careful not to overshoot, so when you’re close to zero you have to move more slowly. The UI encourages this by showing not just your current position but the rate of change in your position (i.e. your speed or angular velocity).</p>

<p>The Crew Dragon simulation has six degrees of freedom we can control: roll, pitch, yaw and x, y, z. In a PID algorithm, these are called the <em>process variables</em>, and the difference between the current value of these variables and the desired value is the <em>error</em>. The amount of change you need to effect (here the amount you need to run the thrusters) is a function of the error, the derivative of the error, and some multiplicative constants that describe how those two factors should be combined. The docking is successful when all of those variables are (close to) zero. If they’re not zero, then the autopilot actuates the appropriate control to bring them closer to zero (e.g. roll to the left if currently rolled to the right). I’m treating the controls as decoupled: the thruster which moves you in the x-direction should have no effect on the y-error and so on. I’m not sure if this is true of the real capsule, and in theory it’s not true: it should be the case that moving the forward thruster would move the craft in the z-direction if the pitch wasn’t zero. In practice though, the autopilot achieves level flight very early, so the controls are effectively decoupled.</p>
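
<p>A minimal sketch of that idea for a single axis, written in Python rather than the Javascript of the actual autopilot: the control signal opposes both the error and its rate of change. The gains, the time step and the toy physics here are assumptions chosen for illustration, not the values used in the gist.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def pd_output(error, previous_error, dt, kp, kd):
    """Return a control signal from the error and its rate of change."""
    error_rate = (error - previous_error) / dt
    # Push against the error (move closer) and against its rate (slow down).
    return -(kp * error + kd * error_rate)


# Hypothetical one-axis simulation: thrust changes velocity, velocity changes position.
position, velocity = 10.0, 0.0
previous_error = position
dt = 0.5
for _ in range(400):
    error = position  # the target is zero, so the error is just the position
    thrust = pd_output(error, previous_error, dt, kp=0.2, kd=1.0)
    previous_error = error
    velocity += thrust * dt
    position += velocity * dt
print(position, velocity)  # both end up very close to zero
</code></pre></div></div>

<p>With these made-up gains the derivative term dominates, so the approach is heavily damped and settles without overshooting.</p>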

<h2 id="implementing-the-autopilot-in-javascript">Implementing the Autopilot in Javascript</h2>

<p>The Docking Simulator is nicely written in HTML5 using a WebGL canvas, which makes it really easy to interface with. The autopilot reads the HUD HTML elements to determine the current position of the craft, and controls it by simulating presses on the on-screen controls.</p>

<p>To make it easier to understand what’s happening, I inserted another transparent canvas element on top of the game, which is used to display graphs showing the change over time of the current error on each axis, and the rate of change of that error, as well as the amount the controls are being actuated. The control loop which runs every 500 milliseconds reads all of the process variables, appends them to a running log which is used for the graphs, and then actuates the controls based on the PID output.</p>

<p>There were a couple of other interesting implementation details:</p>
<ul>
  <li>The UI provided doesn’t show separate x, y, and z speeds, only a speed in the direction of the ISS. To calculate those rates, I measured the change in error with the same data used to generate the graphs.</li>
  <li>There is a minimum possible control actuation which can be applied: a single click. But what if the output of the controller tells us we need to move forward by 0.25 of a click? I thought it might be possible to simulate holding the click for a longer or shorter time, but I actually settled on a hackier solution: if we need 0.25 clicks of actuation, then we randomly click one out of every four times the control loop runs! This means the behaviour close to the ISS is a little bit more unpredictable than I think you’d want for real spaceflight, but there aren’t even Kerbals at stake here.</li>
</ul>
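
<p>The fractional-click trick in the second bullet can be sketched like this, in Python and with a hypothetical helper name rather than the autopilot’s own Javascript. It assumes the actuation is non-negative, with the direction handled separately by choosing which control to click.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random


def clicks_for(actuation, rng=random):
    """Turn a non-negative fractional actuation into a whole number of clicks."""
    whole = int(actuation)
    fraction = actuation - whole
    # rng.random() is uniform on [0, 1), so the sum reaches 1 with probability `fraction`.
    return whole + int(rng.random() + fraction)


rng = random.Random(0)  # fixed seed so the run is repeatable
total = sum(clicks_for(0.25, rng) for _ in range(10000))
print(total / 10000)  # close to 0.25 on average
</code></pre></div></div>

<p>Any single pass through the loop either clicks or doesn’t, but averaged over many passes the applied actuation matches what the controller asked for.</p>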

<p>The most difficult part of this process was figuring out what values to set for the gain. The rotational controls worked pretty easily, but it turns out that for the x, y, and z controls to work, the gain for the derivative needs to be 800x larger than the gain for the error! As the video shows, this results in initially very high speeds (especially in the x direction), but a lot of damping so that the final approach is tentatively slow.</p>

<p>I can’t say I can recommend this as an approach for real spacecraft control (there is a reason they call it rocket science), but if the next <a href="https://en.wikipedia.org/wiki/The_Martian_(Weir_novel)">Mark Watney</a> happens to be a frontend developer stuck on a spaceship controlled via a web browser… you could do worse.</p>]]></content><author><name></name></author><category term="code," /><category term="ML," /><category term="space" /><summary type="html"><![CDATA[Update (2020-05-31): It turns out that the Dragon 2 actually uses Chromium and Javascript for its flight interface. So there’s a reasonable chance this autopilot would run in the real Crew Dragon‽]]></summary></entry><entry><title type="html">Python’s Counter-intuitive, Non-commutative Ternary Logic</title><link href="https://adamobeng.com/pythons-counterintuitive-non-commutative-ternary-logic/" rel="alternate" type="text/html" title="Python’s Counter-intuitive, Non-commutative Ternary Logic" /><published>2020-04-06T00:00:00-07:00</published><updated>2020-04-06T00:00:00-07:00</updated><id>https://adamobeng.com/pythons-counterintuitive-non-commutative-ternary-logic</id><content type="html" xml:base="https://adamobeng.com/pythons-counterintuitive-non-commutative-ternary-logic/"><![CDATA[<p>Boolean logic is one of the foundational abstractions in computer science, from electrical circuits to programming languages. In Boolean logic, all variables take one of two values: 1/0, high/low, or TRUE/FALSE. However, many practical situations require the inclusion of a third value to indicate a variable is unknown, missing, or both false and true at the same time: usually something like <code class="language-plaintext highlighter-rouge">None</code>, <code class="language-plaintext highlighter-rouge">NULL</code>, <code class="language-plaintext highlighter-rouge">NA</code>, or <code class="language-plaintext highlighter-rouge">Unknown</code>.</p>

<p>This leads to a complication: for two-valued logic there’s a single agreed-upon way of defining how variables can be combined using the basic AND, OR, and NOT operations: <a href="https://en.wikipedia.org/wiki/Boolean_algebra#Basic_operations">Boolean logic</a>. On the other hand, there are many possible — and at least <a href="https://en.wikipedia.org/wiki/Three-valued_logic#Kleene_and_Priest_logics">two frequently used</a> — three-valued formal logics. But in practice, programming languages often <a href="https://modern-sql.com/concept/three-valued-logic#general-rule">fall short</a> of strictly implementing a consistent formal logic.</p>

<p>As it turns out, Python is no exception: boolean operations involving <code class="language-plaintext highlighter-rouge">None</code> have some counter-intuitive properties.</p>

<p>You might expect that operations on <code class="language-plaintext highlighter-rouge">None</code> would simply always return <code class="language-plaintext highlighter-rouge">None</code>, in a similar way that <a href="https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/NA"><code class="language-plaintext highlighter-rouge">NA</code> is propagated</a> to the values of numerical operations in R. But nope:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="ow">not</span> <span class="bp">None</span>
<span class="bp">True</span>
</code></pre></div></div>

<p>Although, this does have the felicitous consequence that the tautology <code class="language-plaintext highlighter-rouge">x or (not x)</code> is still true in the case of <code class="language-plaintext highlighter-rouge">None</code>, unlike in SQL…:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="bp">None</span> <span class="ow">or</span> <span class="p">(</span><span class="ow">not</span> <span class="bp">None</span><span class="p">)</span>
<span class="bp">True</span>
</code></pre></div></div>

<p>But even weirder than this is that operations involving <code class="language-plaintext highlighter-rouge">None</code> are non-commutative: <code class="language-plaintext highlighter-rouge">x and y</code> is not the same as <code class="language-plaintext highlighter-rouge">y and x</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="nf">print</span><span class="p">(</span><span class="bp">False</span> <span class="ow">and</span> <span class="bp">None</span><span class="p">)</span>
<span class="bp">False</span>

<span class="o">&gt;&gt;&gt;</span> <span class="nf">print</span><span class="p">(</span><span class="bp">None</span> <span class="ow">and</span> <span class="bp">False</span><span class="p">)</span>
<span class="bp">None</span>
</code></pre></div></div>

<p>I’ve searched for formal logical systems which don’t have commutativity for these operations, and I have yet to find this as an intentional choice anywhere else! It seems like it might be possible to define a substructural logic which did this, but I’m not sure why you would.</p>

<h2 id="why-is-none-weird">Why is None weird?</h2>

<p>There are two design decisions which explain this strange behaviour: falsiness and short-circuit evaluation.</p>

<p>In Python, every object is either “truthy” or “falsy”: it can be coerced to the boolean value True or False with the <code class="language-plaintext highlighter-rouge">bool()</code> function:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="nf">bool</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="bp">True</span>
<span class="o">&gt;&gt;&gt;</span> <span class="nf">bool</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="bp">False</span>
<span class="o">&gt;&gt;&gt;</span> <span class="nf">bool</span><span class="p">(</span><span class="sh">"</span><span class="s">a</span><span class="sh">"</span><span class="p">)</span>
<span class="bp">True</span>
<span class="o">&gt;&gt;&gt;</span> <span class="nf">bool</span><span class="p">(</span><span class="sh">""</span><span class="p">)</span>
<span class="bp">False</span>
<span class="o">&gt;&gt;&gt;</span> <span class="nf">bool</span><span class="p">(</span><span class="bp">None</span><span class="p">)</span>
<span class="bp">False</span>
</code></pre></div></div>

<p>From the <a href="https://docs.python.org/3/reference/expressions.html#boolean-operations">Python docs</a>:</p>

<blockquote>
  <p>the following values are interpreted as false: False, None, numeric zero of all types, and empty strings and containers (including strings, tuples, lists, dictionaries, sets and frozensets). All other values are interpreted as true.</p>
</blockquote>

<p>This property is useful, for instance, when checking the results of operations which might return an empty string or list with an if-condition, but it has the consequence that because <code class="language-plaintext highlighter-rouge">bool(None)</code> evaluates to <code class="language-plaintext highlighter-rouge">False</code>, <code class="language-plaintext highlighter-rouge">not None</code> evaluates to True.</p>

<p>This choice in the docs is a bit puzzling to me:</p>

<blockquote>
  <p>neither <code class="language-plaintext highlighter-rouge">and</code> nor <code class="language-plaintext highlighter-rouge">or</code> restrict the value and type they return to <code class="language-plaintext highlighter-rouge">False</code> and <code class="language-plaintext highlighter-rouge">True</code>, but rather return the last evaluated argument</p>
</blockquote>

<p>In practice, this means that:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="sh">"</span><span class="s">a</span><span class="sh">"</span> <span class="ow">or</span> <span class="bp">True</span>
<span class="sh">'</span><span class="s">a</span><span class="sh">'</span>
<span class="o">&gt;&gt;&gt;</span> <span class="sh">""</span> <span class="ow">or</span> <span class="bp">False</span>
<span class="bp">False</span>
</code></pre></div></div>

<p>I think the reason for this is to allow short-circuit evaluation, which means that you could write this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="nf">long_running_function_which_might_fail</span><span class="p">()</span> <span class="ow">or</span> <span class="nf">other_expensive_operation</span><span class="p">()</span>
</code></pre></div></div>

<p>These functions will be evaluated sequentially from left to right, so that if the first one succeeds the second doesn’t need to be called at all, and the variable will take the return value of the first function. So what’s happening with <code class="language-plaintext highlighter-rouge">False and None</code> and <code class="language-plaintext highlighter-rouge">None and False</code> is that the first element is being evaluated, found to be falsy, and returned…</p>
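
<p>One way to watch this happen is to wrap each operand in a function that records when it gets evaluated. The <code class="language-plaintext highlighter-rouge">record</code> helper below is made up for this illustration:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>calls = []


def record(name, value):
    """Note that an operand was evaluated, then hand back its value."""
    calls.append(name)
    return value


result = record("left", None) and record("right", False)
print(result, calls)  # None ['left']

calls.clear()
result = record("left", False) and record("right", None)
print(result, calls)  # False ['left']

calls.clear()
result = record("left", None) or record("right", False)
print(result, calls)  # False ['left', 'right']
</code></pre></div></div>

<p>In both <code class="language-plaintext highlighter-rouge">and</code> cases the right-hand operand is never touched, so the result is whatever falsy value happened to be on the left.</p>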

<h2 id="the-truth-tables-for-pythons-ternary-logic">The Truth Tables for Python’s Ternary Logic</h2>

<p>If we treat Python’s implementation of these boolean operations with <code class="language-plaintext highlighter-rouge">True</code>, <code class="language-plaintext highlighter-rouge">False</code> and <code class="language-plaintext highlighter-rouge">None</code> as a ternary logic, the resulting truth tables look like this:</p>

<table>
  <thead>
    <tr>
      <th>not</th>
      <th> </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>True</td>
      <td>False</td>
    </tr>
    <tr>
      <td>None</td>
      <td>True</td>
    </tr>
    <tr>
      <td>False</td>
      <td>True</td>
    </tr>
  </tbody>
</table>

<table>
  <tbody>
    <tr>
      <td>and</td>
      <td>True</td>
      <td>None</td>
      <td>False</td>
    </tr>
    <tr>
      <td>True</td>
      <td>True</td>
      <td>None</td>
      <td>False</td>
    </tr>
    <tr>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
    </tr>
    <tr>
      <td>False</td>
      <td>False</td>
      <td>False</td>
      <td>False</td>
    </tr>
  </tbody>
</table>

<table>
  <tbody>
    <tr>
      <td>or</td>
      <td>True</td>
      <td>None</td>
      <td>False</td>
    </tr>
    <tr>
      <td>True</td>
      <td>True</td>
      <td>True</td>
      <td>True</td>
    </tr>
    <tr>
      <td>None</td>
      <td>True</td>
      <td>None</td>
      <td>False</td>
    </tr>
    <tr>
      <td>False</td>
      <td>True</td>
      <td>None</td>
      <td>False</td>
    </tr>
  </tbody>
</table>

<p>Again, those matrices are not symmetrical because the operations are non-commutative.</p>
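
<p>The <code class="language-plaintext highlighter-rouge">and</code> and <code class="language-plaintext highlighter-rouge">or</code> tables can be regenerated, and their asymmetry checked, with a few lines of Python. This is only a sketch; the variable names are arbitrary.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>values = [True, None, False]

# Rows are the left operand, columns the right operand.
and_table = [[left and right for right in values] for left in values]
or_table = [[left or right for right in values] for left in values]

for name, table in (("and", and_table), ("or", or_table)):
    transpose = [list(column) for column in zip(*table)]
    print(name, "symmetric:", table == transpose)
# and symmetric: False
# or symmetric: False
</code></pre></div></div>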

<p>I should say that in practice the ergonomic benefits to programmers probably outweigh the costs of the behaviour being strange in a formal sense, and I don’t think I’ve written any bugs as a result (but then again, do you ever know that you haven’t written a bug). Still, it’s a potential pitfall. And that’s without even considering <code class="language-plaintext highlighter-rouge">float("nan")</code>, numpy’s <code class="language-plaintext highlighter-rouge">nan</code> or pandas’ <code class="language-plaintext highlighter-rouge">NaT</code>. Hey, at least it’s <a href="https://www.youtube.com/watch?v=et8xNAc2ic8">not as much of a mess as Javascript</a>.</p>]]></content><author><name></name></author><category term="code" /><summary type="html"><![CDATA[Boolean logic is one of the foundational abstractions in computer science, from electrical circuits to programming languages. In Boolean logic, all variables take one of two values: 1/0, high/low, or TRUE/FALSE. However, many practical situations require the inclusion of a third value to indicate a variable is unknown, missing, or both false and true at the same time: usually something like None, NULL, NA, or Unknown.]]></summary></entry><entry><title type="html">How To Write A Neural Network in a Single Tweet</title><link href="https://adamobeng.com/how-to-write-a-neural-network-in-a-single-tweet/" rel="alternate" type="text/html" title="How To Write A Neural Network in a Single Tweet" /><published>2020-03-06T00:00:00-08:00</published><updated>2020-03-06T00:00:00-08:00</updated><id>https://adamobeng.com/how-to-write-a-neural-network-in-a-single-tweet</id><content type="html" xml:base="https://adamobeng.com/how-to-write-a-neural-network-in-a-single-tweet/"><![CDATA[<p>Neural networks! They’re everywhere! Can you use them for everything? Do they have anything to do with brains? Are they Skynet or just fancy regression? Let’s find out!</p>

<p>One of the best ways to demystify something is to build it yourself. On the other hand, one of the best ways to re-mystify it is to <a href="https://www.ioccc.org/">obfuscate</a> <a href="https://codegolf.stackexchange.com/">the code</a> <a href="/snake/">you wrote</a>. So I set myself the challenge of implementing a neural network from scratch which fits exactly in the 280 characters of a single tweet. Here it is:</p>

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">from numpy import *<br />r=random<br />c=hstack<br />p=dot<br />N=r.randn<br />s=lambda x:1/(1+exp(-x))<br />a=lambda x:s(x)*(1-s(x))<br />h,j=X.shape<br />w=N(5,j+1)<br />e=N(1,6)<br />for i in r.randint(0,h,1000*h):o=c((1,X[i]));m=p(w,o);n=c((1,s(m)));d=p(e,n);b=0.02*(Y[i]-s(d));g=b;w+=outer(e[:,1:]*a(d)*a(m),o)*b;e+=n*g*a(d)</p>&mdash; Adam Obeng (@Adam_Obeng) <a href="https://twitter.com/Adam_Obeng/status/1115135287774957568?ref_src=twsrc%5Etfw">April 8, 2019</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>Or if you prefer, here it is as a <a href="https://gist.github.com/adamobeng/72751b9571f2141ea194c5ba4c422aba">gist</a>.</p>

<p>This really is a neural network, albeit a super minimal one: a 1 hidden-layer <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">MLP</a> with sigmoid activations, fit with <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">SGD</a>. If you actually wanted to use it, you would set X to be a numpy array of features and Y a numpy array of labels, and after running the code, the trained weights are in variables w and e. The GIF below shows what that looks like training on some example data generated with <code class="language-plaintext highlighter-rouge">sklearn.datasets.make_circles</code>. On the left is a scatter-plot of the training data and predicted values, and the graph on the right shows the training MSE loss as the SGD steps increase.</p>

<p><img src="/images/nn_frames.gif" alt="Animation showing the neural network predicted output over a 2-d space, progressively converging on a solution which includes three test points, and a graph of the training loss decreasing over time" /></p>

<p>This very simple network does a good job at finding a reasonable boundary between the orange points and the blue points even though they’re in two concentric rings — something which a linear classifier would be utterly incapable of doing.</p>

<h2 id="a-minimal-neural-network">A minimal neural network</h2>

<p>In putting together this code, I had to do two things: figure out how to write a minimal neural network and training loop without using any external libraries, and then code golf it down to 280 characters.</p>

<p>So to explain the implementation first, let’s look at a de-minified version of the code, with more interpretable variable names and comments:</p>

<script src="https://gist.github.com/adamobeng/0754f476aaffc10104dd739621eab3cf.js"></script>

<p>You’ll notice that I’ve allowed myself the use of numpy. I don’t know that you could do this in Python without it, so I think this still counts as “from scratch”. There were a few tricky steps in figuring out how to write this in the simplest way possible. Quite a few of the programming-oriented minimal neural network tutorials end up implementing <a href="https://github.com/rcassani/mlp-example/blob/master/mlp.py">network</a> <a href="https://towardsdatascience.com/building-neural-network-from-scratch-9c88535bf8e9">classes</a>, which is unnecessary for the purposes of an implementation which doesn’t need to be extensible and also obscures how the thing actually works.</p>

<p>I ended up referring to a few other “neural network for hackers” posts, of which this is the <a href="https://iamtrask.github.io/2015/07/12/basic-python-network/">most succinct example</a>. Even so, I’ve written out the code above with descriptive variable names — I think the practice of writing code as if it was algebra is detrimental to understanding.</p>

<h2 id="code-golfing">Code Golfing</h2>

<p>The way I’ve written the code above already takes into account some of the higher-level golfing: the derivative of the sigmoid can be defined in terms of the sigmoid, and using intermediate variables for the output from each layer before the activation function means that these expressions can be re-used in the backprop step.</p>
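
<p>The sigmoid identity is easy to sanity-check numerically. This sketch uses plain <code class="language-plaintext highlighter-rouge">math</code> instead of numpy and compares the closed form against a finite difference; the test point and step size are arbitrary choices.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def sigmoid_derivative(x):
    # Same form as the tweet's `a`: s(x) * (1 - s(x))
    return sigmoid(x) * (1 - sigmoid(x))


# Compare against a central finite difference at an arbitrary point.
x, h = 0.3, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(abs(numeric - sigmoid_derivative(x)))  # a very small number
</code></pre></div></div>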

<p>Using single-character variable names is the cheapest golfing strategy, but sometimes it’s not worth re-defining existing variables to make them shorter. <code class="language-plaintext highlighter-rouge">outer</code> is only used once, so re-naming that to a shorter name would result in strictly longer code. I was also tempted to use tuple unpacking to define multiple variables on the same line, but that doesn’t really save any characters (unless your right-side variable is already a tuple).</p>
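
<p>The arithmetic behind that trade-off can be written down directly. This is a back-of-the-envelope sketch with a hypothetical helper; it assumes the alias is a single letter defined on its own line.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def chars_saved(name, uses):
    """Characters saved by aliasing `name` to a one-letter variable."""
    without_alias = len(name) * uses
    # The alias costs a definition line such as "o=outer" plus a newline,
    # then one character per use.
    with_alias = len("x=" + name + "\n") + uses
    return without_alias - with_alias


print(chars_saved("outer", 1))   # -4
print(chars_saved("random", 2))  # 1
</code></pre></div></div>

<p>A negative number means the alias makes the code longer, which is the case for a name like <code class="language-plaintext highlighter-rouge">outer</code> that is only used once.</p>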

<p>I went back and forth on including newlines in the code. On the one hand “a neural network in one line of Python” sounds pretty snappy, but on the other using semi-colons is a cheap trick and doesn’t reduce the <em>character</em> count. The one place where they are useful is inside the loop where they save on both a newline and a tab character.</p>

<p>There’s one final secret: I wanted to include my name in the code in a way that was integral to the implementation. There were not so many variable names which could be freely changed, so this was a bit of a challenge, and it actually makes the solution use a few more characters than would otherwise be necessary. Figuring out which ones is left as an exercise to the reader.</p>]]></content><author><name></name></author><category term="code," /><category term="ML" /><summary type="html"><![CDATA[Neural networks! They’re everywhere! Can you use them for everything? Do they have anything to do with brains? Are they Skynet or just fancy regression? Let’s find out!]]></summary></entry><entry><title type="html">Talk at New York Open Statistical Programming Meetup</title><link href="https://adamobeng.com/nyhackr-analysing-texts-with-r/" rel="alternate" type="text/html" title="Talk at New York Open Statistical Programming Meetup" /><published>2016-12-06T00:00:00-08:00</published><updated>2016-12-06T00:00:00-08:00</updated><id>https://adamobeng.com/nyhackr-analysing-texts-with-r</id><content type="html" xml:base="https://adamobeng.com/nyhackr-analysing-texts-with-r/"><![CDATA[<p>Yesterday, I spoke at the <a href="https://www.meetup.com/nyhackr/events/235517291/">New York Open Statistical Programming Meetup</a> about <a href="https://github.com/kbenoit/quanteda">quanteda</a>.</p>

<p>You can find the slides from my talk <a href="/download/nyhackr-2016-12-06_slides.pdf">here</a>, and the code used during the demos <a href="/download/nyhackr-2016-12-06_code.R">here</a>.</p>]]></content><author><name></name></author><category term="projects" /><summary type="html"><![CDATA[Yesterday, I spoke at the New York Open Statistical Programming Meetup about quanteda.]]></summary></entry><entry><title type="html">🐍</title><link href="https://adamobeng.com/snake/" rel="alternate" type="text/html" title="🐍" /><published>2016-09-17T00:00:00-07:00</published><updated>2016-09-17T00:00:00-07:00</updated><id>https://adamobeng.com/snake</id><content type="html" xml:base="https://adamobeng.com/snake/"><![CDATA[<p><a href="https://www.youtube.com/watch?t=904&amp;v=gg1bv9XJAek">https://www.youtube.com/watch?t=904&amp;v=gg1bv9XJAek</a></p>

<p><a href="https://github.com/adamobeng/snake">https://github.com/adamobeng/snake</a></p>]]></content><author><name></name></author><category term="projects" /><summary type="html"><![CDATA[https://www.youtube.com/watch?t=904&amp;v=gg1bv9XJAek]]></summary></entry><entry><title type="html">Re: The media is ruining science</title><link href="https://adamobeng.com/re-the-media-is-ruining-science/" rel="alternate" type="text/html" title="Re: The media is ruining science" /><published>2016-08-17T00:00:00-07:00</published><updated>2016-08-17T00:00:00-07:00</updated><id>https://adamobeng.com/re-the-media-is-ruining-science</id><content type="html" xml:base="https://adamobeng.com/re-the-media-is-ruining-science/"><![CDATA[<p>I don’t quite get <a href="https://www.washingtonpost.com/news/in-theory/wp/2016/08/17/the-media-is-ruining-science/?utm_term=.f67a1426ea2d#comments">this article</a> by Robert Gebelhoff at the Washington Post.</p>

<p>Sure, there are well-known pressures on academics to publish significant results, and also to get media attention. But those are conceptually distinct issues. Publication bias (and related problems like p-hacking and the <em>Garden of Forking Paths</em>) tends to inflate the statistical significance of published results. But that’s not related to the substantive significance of the results, to how interesting the questions being answered are. Solving publication bias would not stop journals privileging articles about exciting or controversial topics. Why would you even want that?</p>

<p>More than that, why reserve specific criticism for <a href="http://www.apa.org/news/press/releases/2015/08/reframing-sexting.pdf">Stasko and Geller’s paper</a>? Gebelhoff says that</p>

<blockquote>
  <p>the research had not been published in any academic journal. Instead, the data was compiled through an Internet survey as part of a presentation to the American Psychological Association’s annual convention. Sure, the results were interesting, but the research is simply not generalisable to the entire public.</p>
</blockquote>

<p>That doesn’t quite scan either. The fact that the research wasn’t (yet?) published in a journal has got nothing to do with the fact that they recruited participants online. Journals publish online studies <a href="http://science.sciencemag.org/content/311/5762/854">all</a> <a href="http://www.sciencedirect.com/science/article/pii/S0169207014000879">the</a> <a href="https://dl.acm.org/citation.cfm?id=1935845">time</a>. This paper was accepted to and presented at a conference, which generally implies at least some level of filtering by peers, even if it’s short of peer-review. Peer-review isn’t <a href="http://retractionwatch.com/">a guarantee of accuracy either</a>.</p>

<p>More than that, would you lend more credence to a typical published psychology study where the sample wasn’t Internet randos but university students recruited from a Psych 101 class? Because the results you get from studying undergrads don’t generalise well either.<sup id="fnref:fifteen" role="doc-noteref"><a href="#fn:fifteen" class="footnote" rel="footnote">1</a></sup> If anything, I’d be more convinced (<em>ceteris paribus</em>) by an attitudinal study about new technologies where the sample age range was 18–82! This paper isn’t primarily making claims about the prevalence of behaviours in the population, but about relationships between them. So the fact that it isn’t a random sample from the population is important but not critical.</p>

<p>There are a number of issues with the ways science and the media operate. But even when they affect each other, they remain distinct issues.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:fifteen" role="doc-endnote">
      <p>Henrich, Joseph, et al. “<a href="https://www.jstor.org/stable/2677736">In search of homo economicus: behavioral experiments in 15 small-scale societies.</a>” The American Economic Review 91.2 (2001): 73-78. <a href="#fnref:fifteen" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="academic," /><category term="science," /><category term="statistics" /><summary type="html"><![CDATA[I don’t quite get this article by Robert Gebelhoff at the Washington Post.]]></summary></entry><entry><title type="html">How Majority Rule Wouldn’t Have Done Anything</title><link href="https://adamobeng.com/how-majority-rule-wouldnt-have-done-anything/" rel="alternate" type="text/html" title="How Majority Rule Wouldn’t Have Done Anything" /><published>2016-04-29T00:00:00-07:00</published><updated>2016-04-29T00:00:00-07:00</updated><id>https://adamobeng.com/how-majority-rule-wouldnt-have-done-anything</id><content type="html" xml:base="https://adamobeng.com/how-majority-rule-wouldnt-have-done-anything/"><![CDATA[<p>Eric Maskin and Amartya Sen <a href="http://www.nytimes.com/2016/05/01/opinion/sunday/how-majority-rule-might-have-stopped-donald-trump.html?action=click&amp;pgtype=Homepage&amp;version=Moth-Visible&amp;moduleDetail=inside-nyt-region-4&amp;module=inside-nyt-region&amp;region=inside-nyt-region&amp;WT.nav=inside-nyt-region">write in the New York Times</a> that US primary and general elections would be more fair if the winner always commanded an absolute majority of the popular vote. Perhaps, but as it currently stands, neither party nominations nor Presidential elections are actually decided by the popular vote. Even if Maskin and Sen’s criteria were satisfied — and a Presidential candidate’s state victories were all absolute majorities — that candidate could still technically win a Presidential election with only 23% of the popular vote.</p>

<h2 id="state-by-state-majorities-dont-mean-a-national-majority">State-by-state majorities don’t mean a national majority</h2>

<p>Maskin and Sen’s claim is that the voting system used in primaries and Presidential elections produces winners who only got more votes than each of the other candidates (a plurality), rather than a majority of the votes. For example, a candidate can win such a contest with 40% of the vote if two opponents get 30% each. They suggest this voting system should be replaced with one which fulfils the Condorcet criterion: if there is a candidate who would beat each of the others in one-on-one contests, they would be the winner.</p>

<p>There are many reasons to prefer Condorcet-certified electoral methods, but simply switching the electoral systems of the state contests wouldn’t achieve what Maskin and Sen want, that is, the President always having won a majority of the popular vote. There are a couple of reasons for this, and I hinted at one in <a href="www.huffingtonpost.com/adam-obeng/there-are-no-bellwether-c_b_9777628.html">my post on bellwethers</a>: party candidates and Presidents alike are not chosen by the popular vote. Rather, in both primaries and general elections the voters in each state select delegates, and those delegates vote for the nominee or President. (There’s also a really wonkish criticism to be made of their choice of words: <a href="http://adamobeng.com/how-majority-rule-wouldnt-have-done-anything#condorcet-and-majority">a Condorcet winner is not the same thing as a majority winner</a>.)</p>

<p>For example, <del>Maskin and Sen mention</del><sup id="fnref:correction" role="doc-noteref"><a href="#fn:correction" class="footnote" rel="footnote">1</a></sup> that John Kerry might have won Florida in the 2004 Presidential Election if a Condorcet method had been in place. That’s because Kerry won 47.09% of the vote, and Bush won 52.10%, with Nader taking 0.43%. Had Nader been out of the contest, or a Condorcet voting method been in place, Kerry might have won Florida. That’s true, and it’s the same reason why Ted Cruz and John Kasich kinda-sorta made a deal to avoid splitting their vote against Trump. But winning the popular vote state-by-state is not at all the same thing as winning the national popular vote: Al Gore <em>did</em> win the nationwide popular vote in 2000, and he still didn’t win the Presidency. <a href="http://www2.gwu.edu/~bygeorge/110304/ullman.html">He could have won the Presidency</a> if he had gained the same number of votes but distributed differently between the states. The same applies to the 1876 and 1888 elections.</p>

<p>In other words, even if the state-level contests meet the Condorcet criterion, that doesn’t mean that the national contest will when taken as a whole. It depends crucially on how many delegates each state gets (as well as on how the state’s delegates are allocated to candidates based on the result of the state’s popular vote: winner-takes-all, or proportionally, or some other method).</p>

<h2 id="all-in-proportion">All in proportion</h2>

<p>Currently, delegates to the general election’s Electoral College are not assigned to states proportionally to their population, but based on Congressional apportionment (party conference delegates are assigned in a similar, but not identical, way). A single Electoral College delegate can represent between 200 000 and 700 000 voters depending on the state.</p>

<p>To see how this works, consider a country made up of one large state with 7 million voters and five small states with 1 million each. Let’s say the large state chooses 10 electors, and each of the small states chooses 5. This gives the states voter-to-delegate ratios similar to the extremes of the current Electoral College. The large state is clearly underrepresented, having 7x more population but only 2x more votes. A candidate could win this election by narrowly winning 4 of the small states: just over 500 000 votes and 5 delegates in each, for a total of 20 out of 35 delegates. But they would only have won just over 2 million (about 17%) of the 12 million popular votes. Notice that this happens even though the winning candidate has <em>a strict majority in each of the states they win</em>, which is a more “fair” outcome than Maskin and Sen even hope for!</p>
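<p>The arithmetic for this toy country can be checked with a few lines of Python. This is only a sketch under the assumptions in the paragraph above (winner-takes-all in every state, two candidates, full turnout); the variable names are invented for the illustration.</p>

```python
# Toy country: one large state and five small ones, as (voters, electors).
states = [(7_000_000, 10)] + [(1_000_000, 5)] * 5

total_voters = sum(voters for voters, _ in states)
total_electors = sum(electors for _, electors in states)
electors_needed = total_electors // 2 + 1

# Target the states where one elector stands for the fewest voters first.
by_cheapest_elector = sorted(states, key=lambda state: state[0] / state[1])

votes_won = 0
electors_won = 0
states_won = 0
for voters, electors in by_cheapest_elector:
    if electors_won >= electors_needed:
        break
    votes_won += voters // 2 + 1  # the barest absolute majority in the state
    electors_won += electors
    states_won += 1

share_of_popular_vote = votes_won / total_voters
print(states_won, electors_won, votes_won, round(100 * share_of_popular_vote, 1))
```

<p>The loop stops once the candidate holds a majority of electors, so the final share is the smallest popular-vote share that still wins under these assumptions.</p>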

<p><img src="/images/apportionment2.png" alt="Congressional apportionment rules over-represent small states in the Electoral College" /></p>

<p>This example is simplified, but the outcome is no more extreme than is possible in the real US system. If a candidate won very slight absolute majorities in the 31 states most highly represented in the Electoral College, they could become President on 23.0% of the popular vote. Given 54.9% turnout, that’s just 12.6% of eligible voters. And that’s assuming two parties: with an arbitrarily large number of parties, a President could win with an arbitrarily small share of the vote!</p>

<p>The state-level voting system which Maskin and Sen want to improve is only half the story. Once a state is won, it’s just as important how much that result counts towards the national contest.</p>

<h2 id="the-condorcet-and-majority-criteria-">The Condorcet and Majority criteria <a name="condorcet-and-majority"></a></h2>

<p>Maskin and Sen also slightly mangle two criteria for the fairness of voting systems: the Condorcet criterion and the Majority criterion.</p>

<p>The Condorcet criterion says that</p>

<blockquote>
  <p><em>if</em> there is a candidate who would beat each of the others in one-on-one contests 
<em>then</em> that candidate should win</p>
</blockquote>

<p>while the Majority criterion says that</p>

<blockquote>
  <p><em>if</em> there is a candidate who wins an absolute majority of the votes
<em>then</em> that candidate should win</p>
</blockquote>

<p>Different voting systems can satisfy one or the other criterion, but a system which satisfies the Condorcet criterion automatically satisfies the Majority criterion. That’s easy to see: any candidate who wins an absolute majority of the votes will also beat each of their opponents one-on-one, so they are the Condorcet winner and a Condorcet system has to elect them.</p>

<p>But the opposite is not true: a system which satisfies the majority criterion does not necessarily satisfy the Condorcet criterion. Imagine a candidate who beat their opponents one-on-one, but did not command an absolute majority. They would be elected by a Condorcet system, but not necessarily by a majority system. The Condorcet criterion is in that sense more strict, more difficult to satisfy. Majority winners are few and far between, so it’s not that hard to guarantee  that when they exist they are chosen.</p>
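<p>To make that concrete, here is a small sketch with invented ballots (the counts and candidate names are hypothetical, and every voter is assumed to rank all three candidates). Candidate B beats both rivals one-on-one while holding well under half of the first preferences:</p>

```python
def beats(ballots, a, b):
    """True if more than half of the voters rank a above b."""
    a_over_b = sum(count for count, ranking in ballots
                   if ranking.index(b) > ranking.index(a))
    total = sum(count for count, _ in ballots)
    return 2 * a_over_b > total


def condorcet_winner(ballots):
    """Return the candidate who wins every one-on-one contest, or None."""
    candidates = ballots[0][1]
    for candidate in candidates:
        rivals = [other for other in candidates if other != candidate]
        if all(beats(ballots, candidate, rival) for rival in rivals):
            return candidate
    return None


def first_preference_share(ballots, candidate):
    """Fraction of voters who rank the candidate first."""
    total = sum(count for count, _ in ballots)
    firsts = sum(count for count, ranking in ballots if ranking[0] == candidate)
    return firsts / total


# Hypothetical electorate of 100 voters: (number of voters, ranking).
ballots = [
    (35, ["A", "B", "C"]),
    (33, ["C", "B", "A"]),
    (32, ["B", "A", "C"]),
]
```

<p>With these ballots <code>condorcet_winner(ballots)</code> picks B, even though B has the fewest first preferences of the three.</p>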

<p>The direction of implication is slightly tricky here, and Maskin and Sen get it a bit wrong. Both of the criteria say that if a candidate exists who meets these parameters, then they win. <em>They do not say that a Condorcet or majority winner must exist</em>; in a Condorcet or majority system, the winner does not have to be able to beat all of their opponents one-on-one or gain an absolute majority. Why is this? Think of the majority case: if votes are split 60/20/20, then the candidate with 60% wins, and the majority criterion holds. If the vote is split 40/30/30, there is no candidate with an absolute majority, so the majority criterion says nothing about who should win! Plurality voting, our familiar win-if-you-win-the-most-votes system that Maskin and Sen complain about, satisfies the majority criterion! In fact, you could have a voting system where if there was no absolute majority the winner was selected randomly from all the candidates in the field, and that would still meet the majority criterion!</p>
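<p>The two vote splits above can be run through a minimal sketch of plurality voting (the function names are made up for this example). Whenever a majority winner exists, plurality picks them; when none exists, the criterion simply has nothing to say:</p>

```python
def majority_winner(votes):
    """Return the candidate with more than half of all votes, or None."""
    total = sum(votes.values())
    for candidate, count in votes.items():
        if 2 * count > total:
            return candidate
    return None


def plurality_winner(votes):
    """Return the candidate with the most votes (first listed wins a tie)."""
    return max(votes, key=votes.get)


clear_majority = {"A": 60, "B": 20, "C": 20}
no_majority = {"A": 40, "B": 30, "C": 30}

for votes in (clear_majority, no_majority):
    print(majority_winner(votes), plurality_winner(votes))
```

<p>In the 40/30/30 case <code>majority_winner</code> returns <code>None</code>, so any rule for choosing the winner would be consistent with the majority criterion.</p>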

<p>Maskin and Sen also confuse matters by saying that a candidate who “would defeat each opponent individually in a head-to-head match-up” is “a real majority winner”. Well, if you take the common-sense meaning of “majority”, that’s not true. A Condorcet winner could be a majority winner, but they don’t have to be. If they’re redefining “majority” to mean someone who beats other candidates one-on-one head-to-head, they’re twisting the meaning of the word. Beating your opponents one-on-one is perhaps a desirable property, but it’s just not the same thing as having a majority. If you want plurality winners to stop behaving as if they had a majority, you should feel the same about mere Condorcet winners.</p>

<h2 id="majority-rules">Majority rules</h2>

<p>So should we just elect the President by nationwide popular vote? Not necessarily: the current electoral system might seem arbitrary or plain weird, and parts of it certainly are. But state-level contests are designed to cement the importance of the states in the electoral system. In fact, while the Constitution provides that the states choose the Electoral College, it doesn’t say that the popular vote has anything to do with it. That’s for the state legislatures to decide, and it wasn’t <a href="http://www.fairvote.org/how-the-electoral-college-became-winner-take-all">until 1872</a> that all states held popular elections for President. There is a debate to be had as to whether the power of the people or the power of the states is more important.</p>

<p>Regardless of the details of the American system, Maskin and Sen’s more general point is that plurality winners should be humble, and remember that a whole bunch of their constituents didn’t vote for them. True. But winners should be humble even if they have a majority: even after the most indisputable victory not every voter approves of every policy a politician comes up with during their term in office. It’s a fundamental (one might say necessary) characteristic of representative democracy that elected representatives don’t perfectly mirror the opinions of the citizenry. Democracy’s in the details.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:correction" role="doc-endnote">
      <p><em>Correction 2016-05-22</em> Maskin and Sen’s original article actually used the example of George W. Bush, Al Gore and Ralph Nader in Florida in 2000. The substantive point stands, however. <a href="#fnref:correction" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="academic," /><category term="stats" /><summary type="html"><![CDATA[Eric Maskin and Amartya Sen write in the New York Times that US primary and general elections would be more fair if the winner always commanded an absolute majority of the popular vote. Perhaps, but as it currently stands, neither party nominations nor Presidential elections are actually decided by the popular vote. Even if Maskin and Sen’s criteria were satisfied — and a Presidential candidate’s state victories were all absolute majorities — that candidate could still technically win a Presidential election with only 23% of the popular vote.]]></summary></entry><entry><title type="html">Embedded local video in jupyter notebook with R kernel</title><link href="https://adamobeng.com/embedded-video-in-jupyter-with-R-kernel/" rel="alternate" type="text/html" title="Embedded local video in jupyter notebook with R kernel" /><published>2016-04-04T00:00:00-07:00</published><updated>2016-04-04T00:00:00-07:00</updated><id>https://adamobeng.com/embedded-video-in-jupyter-with-R-kernel</id><content type="html" xml:base="https://adamobeng.com/embedded-video-in-jupyter-with-R-kernel/"><![CDATA[<p>For some reason, there isn’t a default way to embed a local video file in a jupyter notebook.</p>

<p>If you’re using a python kernel, you can make use of <a href="https://www.reddit.com/r/IPython/comments/35tocn/is_it_possible_to_embed_a_local_file_video_in_the/cr7q11e">this hack</a>, which inserts the whole video, base64-encoded, into the generated HTML. But because this runs in a code block, not a markdown block, it’s dependent on the kernel you’re running. Notebooks only support one kernel, so if the rest of your code is R, you’ll need an R version.</p>

<p>Nobody else seems to have written the equivalent code for R, so here it is:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w">    </span><span class="n">show_video</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span><span class="w"> </span><span class="n">mimetype</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">library</span><span class="p">(</span><span class="n">IRdisplay</span><span class="p">)</span><span class="w">
    </span><span class="n">library</span><span class="p">(</span><span class="n">base64enc</span><span class="p">)</span><span class="w">

    </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base64encode</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span><span class="w"> </span><span class="s1">'raw'</span><span class="p">)</span><span class="w">

    </span><span class="n">display_html</span><span class="p">(</span><span class="n">paste0</span><span class="p">(</span><span class="s1">'&lt;video controls src="data:'</span><span class="p">,</span><span class="w">
         </span><span class="n">mimetype</span><span class="p">,</span><span class="w"> </span><span class="s1">';base64,'</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="s1">'"&gt;'</span><span class="p">))</span><span class="w">
    </span><span class="p">}</span></code></pre></figure>]]></content><author><name></name></author><category term="code," /><category term="R" /><summary type="html"><![CDATA[For some reason, there isn’t a default way to embed a local video file in a jupyter notebook.]]></summary></entry><entry><title type="html">Whither the Bellwether?</title><link href="https://adamobeng.com/whither-the-bellwether/" rel="alternate" type="text/html" title="Whither the Bellwether?" /><published>2016-03-09T00:00:00-08:00</published><updated>2016-03-09T00:00:00-08:00</updated><id>https://adamobeng.com/whither-the-bellwether</id><content type="html" xml:base="https://adamobeng.com/whither-the-bellwether/"><![CDATA[<blockquote>
  <p>Scientists have calculated that the chances of something so patently absurd actually existing are millions to one.
But magicians have calculated that million-to-one chances crop up nine times out of ten.</p>

  <p>— Terry Pratchett, Mort</p>
</blockquote>

<h1 id="tldr">TL;DR</h1>

<p>If Vigo County, IN is a bellwether for US Presidential elections, then so is Valencia County, NM.</p>

<p>And York County, ME; Racine County, WI; and Strafford County, NH.</p>

<p>Besides which, we shouldn’t expect any of them to continue getting it right.</p>

<h1 id="background">Background</h1>

<p><em>On the Media</em>’s recent segment <a href="http://www.onthemedia.org/story/magic-terre-haute/">“Magic” Terre Haute</a> reports on the search for “one small town that thinks exactly the way the nation does” and as such can predict the results of U.S. Presidential elections. According to <a href="http://www.bellwether2016.org/">Don Campbell</a> amongst others, the current contender is Vigo County, IN which has “voted for the winning candidate in every presidential election except two — 30 out of 32 elections”, and has not “missed in 60 years”. According to Campbell, “No other place in America comes close.”</p>

<p>This immediately raised two questions for me: is Vigo the only candidate for the nation’s bellwether? And are there really even bellwethers in the first place?</p>

<p>To answer them, I collected county-level voting data from ICPSR and the Congressional Quarterly Voting and Elections Collection, covering the period 1840–2012 (code <a href="https://github.com/adamobeng/bellwether">on GitHub</a>, please replicate and extend).</p>

<h1 id="which-are-the-bellwethers">Which are the bellwethers?</h1>

<p>First off, it’s true that Vigo hasn’t missed since 1952, but the same goes for Valencia County, NM. They are far from outliers: fourteen other counties have only missed one of those fifteen elections. And I would say it’s even tipping the scales in Vigo’s favour to use its winning streak as the basis of comparison to other counties.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">State</th>
      <th style="text-align: left">Area</th>
      <th style="text-align: right">prop. correct</th>
      <th style="text-align: right">#correct</th>
      <th style="text-align: right">#elections</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Indiana</td>
      <td style="text-align: left">Vigo</td>
      <td style="text-align: right">1.0000000</td>
      <td style="text-align: right">15</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">New Mexico</td>
      <td style="text-align: left">Valencia</td>
      <td style="text-align: right">1.0000000</td>
      <td style="text-align: right">15</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">California</td>
      <td style="text-align: left">Ventura</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">Delaware</td>
      <td style="text-align: left">Kent</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">Florida</td>
      <td style="text-align: left">Hillsborough</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">Montana</td>
      <td style="text-align: left">Blaine</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">New Mexico</td>
      <td style="text-align: left">Hidalgo</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">New Mexico</td>
      <td style="text-align: left">Sandoval</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">North Carolina</td>
      <td style="text-align: left">Buncombe</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">North Dakota</td>
      <td style="text-align: left">Sargent</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">Ohio</td>
      <td style="text-align: left">Ottawa</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">Texas</td>
      <td style="text-align: left">Bexar</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">Texas</td>
      <td style="text-align: left">Val Verde</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">Virginia</td>
      <td style="text-align: left">Westmoreland</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">Wisconsin</td>
      <td style="text-align: left">Juneau</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">Wisconsin</td>
      <td style="text-align: left">Sawyer</td>
      <td style="text-align: right">0.9333333</td>
      <td style="text-align: right">14</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">California</td>
      <td style="text-align: left">Merced</td>
      <td style="text-align: right">0.8666667</td>
      <td style="text-align: right">13</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">California</td>
      <td style="text-align: left">San Bernardino</td>
      <td style="text-align: right">0.8666667</td>
      <td style="text-align: right">13</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">California</td>
      <td style="text-align: left">San Joaquin</td>
      <td style="text-align: right">0.8666667</td>
      <td style="text-align: right">13</td>
      <td style="text-align: right">15</td>
    </tr>
    <tr>
      <td style="text-align: left">California</td>
      <td style="text-align: left">Stanislaus</td>
      <td style="text-align: right">0.8666667</td>
      <td style="text-align: right">13</td>
      <td style="text-align: right">15</td>
    </tr>
  </tbody>
</table>
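<p>For readers who want to see how columns like these could be computed, here is a minimal sketch. It is not the code from the GitHub repository, and the records below are invented placeholders rather than real election results:</p>

```python
from collections import defaultdict

# Hypothetical records: (county, year, county_winner, national_winner).
records = [
    ("County A", 2004, "Red", "Red"),
    ("County A", 2008, "Blue", "Blue"),
    ("County A", 2012, "Blue", "Blue"),
    ("County B", 2004, "Red", "Red"),
    ("County B", 2008, "Red", "Blue"),
    ("County B", 2012, "Blue", "Blue"),
]


def bellwether_scores(records, since):
    """Map each county to (prop. correct, #correct, #elections) from `since` on."""
    correct = defaultdict(int)
    elections = defaultdict(int)
    for county, year, county_winner, national_winner in records:
        if year >= since:
            elections[county] += 1
            correct[county] += county_winner == national_winner
    return {county: (correct[county] / elections[county],
                     correct[county],
                     elections[county])
            for county in elections}


scores = bellwether_scores(records, since=2004)
ranked = sorted(scores.items(), key=lambda item: item[1][0], reverse=True)
```

<p>Changing <code>since</code> changes the ranking, which is why the choice of starting year matters so much when comparing counties.</p>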

<p>If we instead take into account elections since 1888 then Vigo and Ventura County, CA are tied with 29/32 elections each. These figures are not the same as the ones Campbell quotes, and I’m not quite sure why. It seems like the ICPSR data don’t agree with <a href="http://uselectionatlas.org/WEBLOGS/dave/2013/06/30/vigo-county-in-extends-bellwether-streak/">Dave Leip’s US Election Atlas</a> about Vigo County’s results for 1908, 1892 and 1896. If the ICPSR data are wrong, that could be enough to make Vigo the single most successful predictor over this time period, but it still wouldn’t be the uncontested winner.<sup id="fnref:names" role="doc-noteref"><a href="#fn:names" class="footnote" rel="footnote">1</a></sup></p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">State</th>
      <th style="text-align: left">Area</th>
      <th style="text-align: right">prop. correct</th>
      <th style="text-align: right">#correct</th>
      <th style="text-align: right">#elections</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Indiana</td>
      <td style="text-align: left">Vigo</td>
      <td style="text-align: right">0.90625</td>
      <td style="text-align: right">29</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">California</td>
      <td style="text-align: left">Ventura</td>
      <td style="text-align: right">0.90625</td>
      <td style="text-align: right">29</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Wisconsin</td>
      <td style="text-align: left">Racine</td>
      <td style="text-align: right">0.87500</td>
      <td style="text-align: right">28</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">New Hampshire</td>
      <td style="text-align: left">Coos</td>
      <td style="text-align: right">0.84375</td>
      <td style="text-align: right">27</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Iowa</td>
      <td style="text-align: left">Jasper</td>
      <td style="text-align: right">0.84375</td>
      <td style="text-align: right">27</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Iowa</td>
      <td style="text-align: left">Palo Alto</td>
      <td style="text-align: right">0.84375</td>
      <td style="text-align: right">27</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Nebraska</td>
      <td style="text-align: left">Douglas</td>
      <td style="text-align: right">0.84375</td>
      <td style="text-align: right">27</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Oregon</td>
      <td style="text-align: left">Clackamas</td>
      <td style="text-align: right">0.84375</td>
      <td style="text-align: right">27</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Connecticut</td>
      <td style="text-align: left">Windham</td>
      <td style="text-align: right">0.81250</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Maine</td>
      <td style="text-align: left">York</td>
      <td style="text-align: right">0.81250</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">New Hampshire</td>
      <td style="text-align: left">Strafford</td>
      <td style="text-align: right">0.81250</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Delaware</td>
      <td style="text-align: left">New Castle</td>
      <td style="text-align: right">0.81250</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">New Jersey</td>
      <td style="text-align: left">Middlesex</td>
      <td style="text-align: right">0.81250</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Indiana</td>
      <td style="text-align: left">Delaware</td>
      <td style="text-align: right">0.81250</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Indiana</td>
      <td style="text-align: left">Madison</td>
      <td style="text-align: right">0.81250</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Indiana</td>
      <td style="text-align: left">St Joseph</td>
      <td style="text-align: right">0.81250</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Indiana</td>
      <td style="text-align: left">Vanderburgh</td>
      <td style="text-align: right">0.81250</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Ohio</td>
      <td style="text-align: left">Montgomery</td>
      <td style="text-align: right">0.81250</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Ohio</td>
      <td style="text-align: left">Portage</td>
      <td style="text-align: right">0.81250</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td style="text-align: left">Iowa</td>
      <td style="text-align: left">Bremer</td>
      <td style="text-align: right">0.81250</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">32</td>
    </tr>
  </tbody>
</table>

<p>For some reason, most reports about Vigo County’s bellwether status start from 1888. I’m not an expert on US political geography, but it’s not clear to me why they choose that date. There might be something significant about 1888 in terms of the structure of the electoral system, or the geography of states and counties, but I haven’t found it so far. So I saw no reason to not look back even further.<sup id="fnref:dates" role="doc-noteref"><a href="#fn:dates" class="footnote" rel="footnote">2</a></sup></p>

<p>On elections since 1840, Vigo County still does pretty well, but there are nine other counties either tied with or ahead of it. Depending on how the missing data fall, these counties could be pushing 84% agreement with the election result. Racine County, WI might be the single best bellwether given that it predicts 34 out of the 40 elections for which there are data. If it called all the missing elections right, it would have 86%. Even given the missing data, Racine gets as many right as York and Strafford. Still, over this time period only a few counties have above 80% correct predictions.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">State</th>
      <th style="text-align: left">Area</th>
      <th style="text-align: right">prop. correct</th>
      <th style="text-align: right">#correct</th>
      <th style="text-align: right">#elections</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Maine</td>
      <td style="text-align: left">York</td>
      <td style="text-align: right">0.8292683</td>
      <td style="text-align: right">34</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">New Hampshire</td>
      <td style="text-align: left">Strafford</td>
      <td style="text-align: right">0.8292683</td>
      <td style="text-align: right">34</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">New Hampshire</td>
      <td style="text-align: left">Hillsborough</td>
      <td style="text-align: right">0.8048780</td>
      <td style="text-align: right">33</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Ohio</td>
      <td style="text-align: left">Portage</td>
      <td style="text-align: right">0.8048780</td>
      <td style="text-align: right">33</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Connecticut</td>
      <td style="text-align: left">Windham</td>
      <td style="text-align: right">0.7804878</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Illinois</td>
      <td style="text-align: left">Macon</td>
      <td style="text-align: right">0.7804878</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Illinois</td>
      <td style="text-align: left">Will</td>
      <td style="text-align: right">0.7804878</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Indiana</td>
      <td style="text-align: left">St Joseph</td>
      <td style="text-align: right">0.7804878</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Indiana</td>
      <td style="text-align: left">Vigo</td>
      <td style="text-align: right">0.7804878</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Ohio</td>
      <td style="text-align: left">Stark</td>
      <td style="text-align: right">0.7804878</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Maine</td>
      <td style="text-align: left">Washington</td>
      <td style="text-align: right">0.7560976</td>
      <td style="text-align: right">31</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">New Hampshire</td>
      <td style="text-align: left">Coos</td>
      <td style="text-align: right">0.7560976</td>
      <td style="text-align: right">31</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">New Hampshire</td>
      <td style="text-align: left">Sullivan</td>
      <td style="text-align: right">0.7560976</td>
      <td style="text-align: right">31</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">New Jersey</td>
      <td style="text-align: left">Atlantic</td>
      <td style="text-align: right">0.7560976</td>
      <td style="text-align: right">31</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Indiana</td>
      <td style="text-align: left">Delaware</td>
      <td style="text-align: right">0.7560976</td>
      <td style="text-align: right">31</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Indiana</td>
      <td style="text-align: left">Vanderburgh</td>
      <td style="text-align: right">0.7560976</td>
      <td style="text-align: right">31</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Michigan</td>
      <td style="text-align: left">Macomb</td>
      <td style="text-align: right">0.7560976</td>
      <td style="text-align: right">31</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Michigan</td>
      <td style="text-align: left">Shiawassee</td>
      <td style="text-align: right">0.7560976</td>
      <td style="text-align: right">31</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Michigan</td>
      <td style="text-align: left">Van Buren</td>
      <td style="text-align: right">0.7560976</td>
      <td style="text-align: right">31</td>
      <td style="text-align: right">41</td>
    </tr>
    <tr>
      <td style="text-align: left">Ohio</td>
      <td style="text-align: left">Columbiana</td>
      <td style="text-align: right">0.7560976</td>
      <td style="text-align: right">31</td>
      <td style="text-align: right">41</td>
    </tr>
  </tbody>
</table>

<p>Finally, spare a thought for Webster County in Georgia. Since its founding in 1853 it has only voted in line with the national trend in 13 out of the 37 elections for which there are data. That makes it a somewhat reliable <em>anti-</em>bellwether: if you looked at the result from Webster and <em>picked the opposite</em>, you’d do better than 75% of other counties with as much data.<sup id="fnref:missing" role="doc-noteref"><a href="#fn:missing" class="footnote" rel="footnote">3</a></sup></p>

<h3 id="are-there-actually-bellwethers">Are there actually bellwethers?</h3>

<p>A quick Google Scholar search reveals surprisingly few academic papers about Presidential election bellwethers. Perhaps the best is <a href="https://www.jstor.org/stable/2748067?seq=1#page_scan_tab_contents">this 1975 paper</a> by Tufte (yes, that Tufte) and Sun. It concludes that there are no bellwether counties because bellwether status can only reliably be assigned after the fact.</p>

<p>This is a common problem which you come across in both machine learning and the social sciences. Once you’ve observed an outcome it’s trivial to tweak your prediction to predict the thing that’s already happened. And once you’ve made a prediction of the past, you can come up with a justification that not only makes it look like it’s the only possible prediction that could have been made, but also convinces you that you could have made it before the fact.</p>

<p>We can do something like Tufte’s analysis using our larger data set. For simplicity’s sake, let’s look at the accuracy of counties that have had, at any point in time, the same 60-year, 15-election streak that Vigo currently enjoys (Tufte and Sun look at streaks from 24 to 48 years).</p>

<p>Take 1968, for example. Northampton, PA and Prince George’s, MD (as well as four other counties) had predicted the correct result in all the elections since 1908, the same streak that Vigo has now. But in 1968 Northampton voted for Humphrey and Prince George’s voted, correctly, for Nixon. All in all, there are 90 cases in which a county had a streak of 15 correct predictions behind it going into an election year (for 43 distinct counties). But in 47% of those cases, the county failed to get the <em>next</em> election right. Pretty close to a <a href="https://news.stanford.edu/news/2004/june9/diaconis-69.html">coin-toss</a>.</p>
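
<p>The counting behind those figures can be sketched roughly as follows. This is a minimal illustration rather than the code used for the analysis: the toy data, the function name <code class="language-plaintext highlighter-rouge">streak_followups</code>, the choice to let a missing result break a streak, and the reading of “a streak of 15” as “at least 15 in a row” are all assumptions.</p>

```python
# Hypothetical toy data: county name -> chronological list of calls.
# True = voted for the eventual winner, False = did not, None = missing.
toy_results = {
    "A": [True, True, True, False, True],
    "B": [True, True, True, True, True],
    "C": [True, None, True, True, True],
}


def streak_followups(results, streak_len):
    """List (county, next_call) for every election a county enters with at
    least `streak_len` consecutive correct calls behind it."""
    followups = []
    for county, calls in results.items():
        run = 0  # length of the current unbroken run of correct calls
        for call in calls:
            if call is None:
                run = 0  # assumption: a missing result breaks the streak
                continue
            if run >= streak_len:
                followups.append((county, call))
            run = run + 1 if call else 0
    return followups


# A short streak length keeps the toy example small.
followups = streak_followups(toy_results, streak_len=3)
misses = sum(1 for _, call in followups if not call)
miss_rate = misses / len(followups)
print(followups, miss_rate)
```

<p>With the real county-by-election data and <code class="language-plaintext highlighter-rouge">streak_len=15</code>, the length of <code class="language-plaintext highlighter-rouge">followups</code> would correspond to the number of cases and <code class="language-plaintext highlighter-rouge">miss_rate</code> to the share of streaks that failed at the next election.</p>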

<p>This graph (click on it for a bigger image) shows the performance over time of the 20 counties that have, at one time or another, had a streak longer than Vigo’s current one.</p>

<p><a href="/streaks.png"><img src="/streaks.png" alt="Bellwether streaks 1840–2012" /></a></p>

<p>The fatalism in this plot is almost the opposite of the problem of post-hoc prediction: the counties with the longest streaks necessarily fall off a cliff shortly after.</p>

<h3 id="what-is-a-bellwether-anyway">What is a bellwether anyway?</h3>

<p>So far, we’ve worked out that Vigo isn’t the only bellwether, and bellwethers aren’t particularly great at making predictions. But what is it exactly that we’re asking them to do?</p>

<!---
The question isn't trivial, and it's analogous to the problem that pollsters face many months before an election. They want to find out based on people's current reported opinions what their vote at some point in the future will be (and if they will in fact vote).
-->

<p>The most common version of the bellwether is what Tufte and Sun call the “all-or-nothing” barometric bellwether: a county in which the majority vote is for the person who’s eventually elected President. That’s what we’ve been looking at so far and as such, we’re interested in the classificatory accuracy of the county: what proportion of elections it gets right. But you might also look for counties where the percentage for each candidate is closest to that from the popular vote. That’s not the same thing: given that many contests are close to 50/50, calling the right outcome can sometimes be a matter of a few tenths of a percent, so measuring the difference in vote percentages directly might be a fairer measure of a bellwether.</p>
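
<p>To make the difference between the two measures concrete, here is a small sketch with made-up numbers. The vote shares, the two hypothetical counties and the function names are all invented for illustration, and the winner is simplified to whoever has more than 50% of a two-party vote, which ignores the electoral college.</p>

```python
# Hypothetical two-party vote shares (percent for one party) in three elections.
national_share = [51.0, 49.0, 52.0]
close_county = [49.5, 50.5, 50.2]  # always near the national share
safe_county = [60.0, 40.0, 65.0]   # far from it, but on the winning side


def calls(shares):
    """All-or-nothing call per election: did the party top 50%?"""
    return [share > 50 for share in shares]


def hit_rate(county, nation):
    """Proportion of elections where the county's call matches the nation's."""
    pairs = list(zip(calls(county), calls(nation)))
    return sum(c == n for c, n in pairs) / len(pairs)


def mean_abs_diff(county, nation):
    """Mean absolute gap, in percentage points, between the two vote shares."""
    gaps = [abs(c - n) for c, n in zip(county, nation)]
    return sum(gaps) / len(gaps)


for name, shares in [("close", close_county), ("safe", safe_county)]:
    print(name, hit_rate(shares, national_share),
          mean_abs_diff(shares, national_share))
```

<p>In this toy case the county that tracks the national percentages most closely is the worse all-or-nothing bellwether, which is exactly why the two measures can rank counties differently.</p>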

<p>That said (and not having looked at the data) I don’t think it will make a difference to the results above. <a href="http://uselectionatlas.org/INFORMATION/BELLWETHER/bellwether.php">Dave Leip’s US Election Atlas</a> reports that Vigo County has had a mean absolute difference from the national result of 0.9pp. That might give it the edge in accuracy of prediction of the vote percentages, but I doubt switching to that measure will make it the single best predictor.</p>

<p>Also, what we’re asking bellwethers to do is strange given the institutional structure of US Presidential elections, namely that the winner of the popular vote is not necessarily the winner of the election. So to predict all elections correctly, a bellwether would sometimes have to go <em>against</em> the popular vote. In 2000, for example, a majority of voters in Vigo County voted for George Bush, but the winner of the nationwide popular vote was Al Gore. Vigo somehow got the election right while getting the popular vote wrong. Bellwethers are supposed to work because they’re a microcosm of the US, their demographics and opinions being proportional to those of the nation as a whole. How, then, do we explain the cases where, in order to predict the outcome, the bellwether’s voters had to predict not the nation’s popular vote but the distinctly non-proportional outcome of the electoral college, presumably taking into account such factors as Maine and Nebraska’s Congressional district method, <a href="https://en.wikipedia.org/wiki/Faithless_elector">faithless electors</a> and the <a href="https://en.wikipedia.org/wiki/Huntington%E2%80%93Hill_method">Huntington-Hill method</a>?</p>

<p>As I said, it’s also often claimed that these bellwethers are accurate because their populations are representative of demographic and attitudinal characteristics on a national level. The <em>OtM</em> piece notes that Vigo County is not particularly demographically similar to the nation as a whole, so that’s already kind of dubious. As Campbell noted, some of it is certainly luck, but I may have to look further into the makeup of the potential bellwethers.</p>

<h3 id="conclusion">Conclusion</h3>

<p>The search for predictive accuracy is always stymied by the risk of post-hoc prediction. In machine learning, this is dealt with by a strict separation of the data you’re allowed to look at (the training set) from the data you use to evaluate your prediction (the test set). Failing this, it’s easy to <em>over-fit</em>, to produce a model that can perfectly predict what has already happened but cannot generalise to the future.</p>
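
<p>Applied to bellwethers, that separation might look something like the sketch below: pick the best county using only the earlier elections, then score it on the later ones it was not chosen on. The data and the function name <code class="language-plaintext highlighter-rouge">pick_and_test</code> are hypothetical, and this is not how the figures in this post were produced.</p>

```python
# Hypothetical toy data: county -> chronological list of correct (True)
# or incorrect (False) calls over six elections.
toy_results = {
    "A": [True, True, True, True, False, False],
    "B": [True, False, True, True, True, True],
    "C": [False, True, True, False, True, False],
}


def pick_and_test(results, split):
    """Choose the county with the most correct calls in the first `split`
    elections (training set), then score it on the rest (test set)."""
    train_hits = {county: sum(calls[:split]) for county, calls in results.items()}
    best = max(train_hits, key=train_hits.get)  # ties go to the first county
    test_calls = results[best][split:]
    train_accuracy = train_hits[best] / split
    test_accuracy = sum(test_calls) / len(test_calls)
    return best, train_accuracy, test_accuracy


best, train_accuracy, test_accuracy = pick_and_test(toy_results, split=4)
print(best, train_accuracy, test_accuracy)
```

<p>The gap between the two accuracies is the over-fitting: the county looks perfect on the elections used to select it and no better than chance, or worse, on the ones held back.</p>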

<p>According to my quick analysis,<sup id="fnref:check" role="doc-noteref"><a href="#fn:check" class="footnote" rel="footnote">4</a></sup> that’s at least part of the story with bellwethers too. Vigo County gets all the press, but there are other equally accurate counties. Even those are not great predictors in the long term.</p>

<p>Of course, it may be that pure predictive power is not why we should be paying attention to bellwethers. We certainly learn something about the American populace by studying a small town in detail, as we did from the <a href="https://en.wikipedia.org/wiki/Middletown_studies">Middletown studies</a>. But the conclusions we can make from such places are not generalisable to the whole country in the (relatively) straightforward statistical sense. If we’re going to learn from Vigo, Valencia, York, Racine, and Strafford, we’ll need to make use of well-informed and sensitive interpretation. That’s what a documentarian like Campbell is well-placed to develop. Looking at it another way, bellwethers aren’t an alternative to the punditry so lamented by <em>OtM</em>; they’re fuel for it.</p>

<p>So, are the citizens of Vigo County “history’s most reliable presidential bellwethers”?</p>

<p>Not quite, <a href="https://twitter.com/Bobosphere">Bob</a>, not quite.</p>

<h3 id="citations">Citations</h3>

<ul>
  <li>Broh, C. Anthony. “Whether Bellwethers or Weather-Jars Indicate Election Outcomes.” The Western Political Quarterly (1980): 564-570.</li>
  <li>Clubb, Jerome M., William H. Flanigan, and Nancy H. Zingale. Electoral Data for Counties in the United States: Presidential and Congressional Races, 1840-1972. ICPSR08611-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2006-11-13. <a href="http://doi.org/10.3886/ICPSR08611.v1">http://doi.org/10.3886/ICPSR08611.v1</a></li>
  <li><a href="http://library.cqpress.com/elections/download-data.php?filetype=&amp;office=1&amp;areatype=2&amp;year=1976&amp;format=4&amp;license=on&amp;emailto=&amp;emailfrom=">CQ Voting and Elections Collection</a></li>
  <li>Kenski, Henry C., and Edward C. Dreyer. “In Search of State Presidential Bellwethers.” Social Science Quarterly 58.3 (1977): 498-505. <a href="http://www.jstor.org/stable/42859841">http://www.jstor.org/stable/42859841</a></li>
  <li>Lewis-Beck, Michael S. “Election forecasts in 1984: how accurate were they?” PS: Political Science &amp; Politics 18.01 (1985): 53-62.</li>
  <li>Paleologos, David, and Elizabeth J. Wilson. “Use of Bellwether Samples to Enhance Pre-Election Poll Predictions: Science and Art.” American Behavioral Scientist 55.4 (2011): 390-418.</li>
  <li>Robertson, David Brian. “Bellwether politics in Missouri.” The Forum. Vol. 2. No. 3. 2004.</li>
  <li>Tufte, Edward R., and Richard A. Sun. “Are There Bellwether Electoral Districts?” The Public Opinion Quarterly 39.1 (1975): 1–18. <a href="http://www.jstor.org/stable/2748067">http://www.jstor.org/stable/2748067</a></li>
</ul>

<!---
Bonus: the most popular county names in the US are Washington (32), Jefferson (27), Jackson (26), Franklin (25), and Lincoln (24).
# TODO
SEnd to OTM, 538, Jon, Laura, Eurry
-->
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:names" role="doc-endnote">
      <p>It’s also possible that there’s some weirdness happening with county names changing, or counties and independent cities in the same state having the same name. I caught some of these, but only when they obviously affected the result. <a href="#fnref:names" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:dates" role="doc-endnote">
      <p>I did choose these dates purely because the data were easiest to access. It should be possible to extend the analysis using <a href="https://www.icpsr.umich.edu/icpsrweb/ICPSR/series/00059">data from 1788 onwards</a>, although they seem to become more incomplete the further back you go. <a href="#fnref:dates" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:missing" role="doc-endnote">
      <p>This missing data problem is more of an issue here than I’m used to: when there are only 57 events, missing one or two of them can make a big difference. <a href="#fnref:missing" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:check" role="doc-endnote">
      <p>By all means, please <a href="https://github.com/adamobeng/bellwether">check my math</a>, especially if you know what’s going on with some of the weird discrepancies and missing data. <a href="#fnref:check" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="academic," /><category term="stats" /><summary type="html"><![CDATA[Scientists have calculated that the chances of something so patently absurd actually existing are millions to one. But magicians have calculated that million-to-one chances crop up nine times out of ten. — Terry Pratchett, Mort]]></summary></entry></feed>