Sculpting Fragile Glass with Agentic Coding

2026-05-28T00:00:00+00:00

Two months ago, I wrote a blog post about the start of a new chapter. There, I said that machine learning has changed to an almost-unrecognizable extent, and I really mean it. Yet, AI is also only getting started, and we’re all still figuring it out. There’s been a lot of hype around rapid prototyping, famously by Garry Tan: he has produced 600,000+ lines of working Ruby on Rails in a handful of weeks, without reading essentially any of it. The reactions have ranged from excitement to ridicule: it is much easier to write a lot of code than it is to write a lot of code that is correct and continues to work well as a product evolves. Many people think that agentic workflows tend to produce slop, and I think this can certainly be what happens, but I also think it doesn’t have to—the haters are wrong and Garry’s right. Let me convince you by showing you how it is also possible to use agentic workflows to sculpt fragile glass: that is, to write difficult code, where one subtle bug breaks things silently, and failures manifest as infinite loops in rare edge cases—ones that somehow also manage to happen all the time.

If you’ve guessed that this is going to involve concurrency in the form of high-performance CUDA C++, you’re correct. Concretely, I am going to re-build one of the trickiest pieces of code I have ever written—and am going to do so using Claude Code exclusively.[1] There will be zero lines of human-written code and zero fundamental design decisions made by me, yet zero tolerance for errors and zero tolerance for slop—all at once. I’m mostly interested in this because I think it’s cool. But, if you need another reason, I think it’s interesting to explore and see what agentic coding looks like along the less-talked-about regions of a speed-quality Pareto tradeoff. And, to convince you the details are worth it, I’ll spoil the punchline: getting to a reasonable implementation that outperforms an Nvidia-maintained library took only two days, on an after-work side-project basis, and resulted in a higher quality level than my hand-written code from 2016. If this sounds interesting, let’s dive in!

Code for this blog post is available on GitHub, along with full Claude Code session logs.

Robin Hood Hashing on the GPU
For this project, I implemented a concurrent hash table in CUDA—one whose algorithmic design is original, and comes with a story from my early career.[2] In 2016, I became interested in Latent Dirichlet Allocation on GPUs, having recently written a paper about parallel Markov chain Monte Carlo algorithms for this model. The core computations involve pairs of integer-valued sparse vectors, and the cleanest efficient implementation stores these vectors in densely-packed hash tables.[i] To achieve this, I designed a novel kind of lock-free hash table suited for GPU-style concurrency, implemented it in CUDA—and then, very unfortunately, started a PhD with an advisor I later quit working with who effectively forced me to abandon the project. What I should have done is simply finished debugging, benchmarked the hash table, and published a single-author data structures paper about it—but, I was in my early twenties, and didn’t know what I was doing. Four years later, a paper re-inventing many of the same core open-addressing techniques—and, to be fair, other ones too—won Best Paper at HiPC 2020. On the one hand, no hard feelings—I’ve won best paper awards of my own. On the other, I’ve always wanted to know: how well did my design actually work?
[i] The algorithm in question, Pólya Urn LDA, is doubly sparse in the sense that its per-iteration runtime depends on the minimum of two sparsity coefficients. These correspond to document-topic and topic-token sparsity: in practice, most documents only contain a few topics, and most topics only contain a few types of tokens. Previous algorithms of this class depended on the maximum rather than the minimum. The above paper is the first machine learning paper I ever published.
Ten years later, concurrent hashing is interesting for a different reason: it’s exactly the kind of fragile glass code that carries a high risk of subtle concurrency-induced bugs, and must behave correctly under rare edge cases which only appear at scale when running thousands of threads. It also requires reasoning about some fairly complicated hardware details in order to achieve the desired level of performance. Moreover, no-one appears to have published my specific design in the academic literature, and my implementation was buried in a larger repository, with code that was neither documented nor correct. A coding agent would need to rely more on its ability to reason and generalize for this task, than it would for many others. So, I was curious: how hard would it be?
The Design: Buckets and Slots
To start, I gave Claude Code the following prompt:

Hi Claude! In this repository, we’ll be attempting to design and implement a parallel hash table that can run efficiently on an Nvidia GPU.
We want:

A header-only CUDA library
The ability to insert many elements in parallel
In theory, utilize 100% of device memory bandwidth
Let’s start by brainstorming a possible design into a Markdown file in the notes directory.
The amazing thing is that it did not repeat a design from a CUDA library such as cuCollections, or the repository of any paper that I am aware of. Instead, it wrote a Markdown document detailing my exact design from 2016, namely a Robin Hood hash table with lock-free subwarp-cooperative probing.
Let me tell you how this works in a nutshell. To do that, there are four important things about GPUs for our purposes that you should know, beyond the fact that GPUs are massively parallel processors which involve thousands of threads:

On a GPU, every group of 32 threads, called a warp, must execute the same instruction at a given time.
Communication between threads in a warp is much cheaper than communication between threads in a block, which is the next unit of hierarchy.
A GPU cannot retrieve less than 128 bytes of memory at a time.
In most interesting cases, a hash table’s performance will be limited by the GPU’s global memory bandwidth.
This leads to a few consequences. First, we should prefer simple algorithms more on GPU than on CPU—complex logic will tend to produce branching, which then gets executed in sequence, performing poorly. Second, computations should be localized among threads, in order to avoid excessive coordination costs. Third, we need to leverage parallelism to search all 128 bytes we retrieve, otherwise threads will be sitting around doing nothing useful. Fourth, we should aim to retrieve as little memory as possible as a design principle.
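As a back-of-the-envelope check on the third point, the sketch below counts how many key-value slots fit into one 128-byte fetch. It assumes, purely for illustration, that a slot is a 32-bit key stored next to a 32-bit value; the `Slot` struct and the constant names are hypothetical and are not taken from the repository.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed slot layout for this sketch: one 32-bit key next to one 32-bit value.
struct Slot {
    std::uint32_t key;
    std::uint32_t value;
};

constexpr std::size_t kFetchBytes = 128;  // fetch granularity quoted in the text
constexpr std::size_t kWarpSize = 32;     // threads per warp
constexpr std::size_t kSlotsPerFetch = kFetchBytes / sizeof(Slot);

// Compile-time checks of the arithmetic.
static_assert(sizeof(Slot) == 8, "two 32-bit fields pack into 8 bytes");
static_assert(kSlotsPerFetch == 16, "a 128-byte fetch holds 16 such slots");
static_assert(kSlotsPerFetch * 2 == kWarpSize, "16 slots correspond to half a warp");
```

Under these assumptions, one fetch carries 16 slots, so a group of 16 threads can inspect one slot each without any thread sitting idle.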
These requirements effectively rule out many classical pointer-oriented chaining-style designs, but are a reasonably good fit for linear-probing-style hash tables. Let me remind you how these work: they store the data as an array of slots of key-value pairs, namely
$$ \begin{aligned} k_1&:v_1 & k_2&:v_2 & k_3&:v_3 & k_4&:v_4 & k_5&:v_5 & k_6&:v_6 & k_7&:v_7 & k_8&:v_8 \end{aligned} $$
where $k_n$ are potentially-empty keys and $v_n$ are values, which will be either 32 or 64 bit integers in our case. Letting $K$ be the number of possible keys and $N$ be the size of the table, we use a hash function $h : [K] \to [N]$ to map keys to random-looking slots inside the table. Since, in the interesting regime, $N \ll K$, the question becomes: what should we do if two keys get mapped to the same slot? Linear probing gives a clean answer: simply store it in the next slot.
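To make the probing rule concrete, here is a minimal single-threaded C++ sketch of linear-probing insertion and lookup. It is an illustration only, not the code from the repository: the `kEmpty` sentinel, the multiplicative hash in `home`, and the `LinearProbingTable` name are all assumptions made for this example, and the real implementation is concurrent CUDA rather than sequential host code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sentinel for an unused slot; this sketch assumes no real key equals it.
constexpr std::uint32_t kEmpty = 0xFFFFFFFFu;

struct Slot {
    std::uint32_t key = kEmpty;
    std::uint32_t value = 0;
};

struct LinearProbingTable {
    std::vector<Slot> slots;

    // Assumes n > 0.
    explicit LinearProbingTable(std::size_t n) : slots(n) {}

    // Illustrative hash h : [K] -> [N] (multiplicative hashing, reduced mod N).
    std::size_t home(std::uint32_t key) const {
        return static_cast<std::size_t>(
            (static_cast<std::uint64_t>(key) * 2654435761ull) % slots.size());
    }

    // Walk forward from the home slot until the key or an empty slot is found.
    // Returns false only if every slot holds a different key.
    bool insert(std::uint32_t key, std::uint32_t value) {
        const std::size_t n = slots.size();
        std::size_t i = home(key);
        for (std::size_t probes = 0; probes < n; ++probes) {
            Slot& s = slots[i];
            if (s.key == kEmpty || s.key == key) {
                s.key = key;
                s.value = value;  // the most recent value wins
                return true;
            }
            i = (i + 1) % n;  // collision: try the next slot
        }
        return false;
    }

    // Same walk as insert; an empty slot proves the key is absent.
    bool get(std::uint32_t key, std::uint32_t& out) const {
        const std::size_t n = slots.size();
        std::size_t i = home(key);
        for (std::size_t probes = 0; probes < n; ++probes) {
            const Slot& s = slots[i];
            if (s.key == key) {
                out = s.value;
                return true;
            }
            if (s.key == kEmpty) return false;
            i = (i + 1) % n;
        }
        return false;
    }
};
```

Note that a lookup for a missing key has to keep walking until it reaches an empty slot, which is what makes long clusters expensive.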
This is a great strategy most of the time. But, if the hash function’s outputs behave like uniform random numbers—as they are generally designed to do—they’ll occasionally cluster, resulting in the array filling up unevenly. For unlucky keys, one may have to search over an unreasonably large distance from their origin point. Robin Hood hashing gives a clean way to avoid this slowdown: when inserting the key, if you encounter a key closer to its starting point than your current key, swap keys. This small change not only reduces the distance of many keys, it also allows retrievals to potentially terminate early, significantly reducing the amount of searching one must do in unlucky situations—this is the strategy that both I in 2016, and Claude Code in 2026, decided to adopt.[ii]
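The swap rule and the early-exit lookup can be sketched in a few lines of sequential C++. This is a hedged illustration under the same assumptions as a textbook Robin Hood table without deletions: the `RobinHoodTable` name, the `kEmpty` sentinel, the hash in `home`, and the `count` bookkeeping are hypothetical choices for this example, and the sketch deliberately ignores the concurrency that makes the GPU version hard.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sentinel for an unused slot; this sketch assumes no real key equals it.
constexpr std::uint32_t kEmpty = 0xFFFFFFFFu;

struct Slot {
    std::uint32_t key = kEmpty;
    std::uint32_t value = 0;
};

struct RobinHoodTable {
    std::vector<Slot> slots;
    std::size_t count = 0;  // number of occupied slots

    // Assumes n > 0.
    explicit RobinHoodTable(std::size_t n) : slots(n) {}

    // Illustrative hash (multiplicative hashing, reduced mod N).
    std::size_t home(std::uint32_t key) const {
        return static_cast<std::size_t>(
            (static_cast<std::uint64_t>(key) * 2654435761ull) % slots.size());
    }

    // How far slot i is from the key's home slot, wrapping around the array.
    std::size_t distance(std::uint32_t key, std::size_t i) const {
        const std::size_t n = slots.size();
        return (i + n - home(key)) % n;
    }

    // Returns the index holding key, or slots.size() if the key is absent.
    std::size_t find(std::uint32_t key) const {
        const std::size_t n = slots.size();
        std::size_t i = home(key);
        for (std::size_t dist = 0; dist < n; ++dist) {
            const Slot& resident = slots[i];
            if (resident.key == key) return i;
            if (resident.key == kEmpty) return n;
            // Early exit: a resident closer to its home than we are to ours
            // would have been displaced by our key, so the key is not here.
            if (distance(resident.key, i) < dist) return n;
            i = (i + 1) % n;
        }
        return n;
    }

    bool insert(std::uint32_t key, std::uint32_t value) {
        const std::size_t n = slots.size();
        const std::size_t found = find(key);
        if (found != n) {
            slots[found].value = value;  // existing key: overwrite
            return true;
        }
        if (count == n) return false;  // full: refuse rather than drop an element
        Slot incoming{key, value};
        std::size_t i = home(key);
        std::size_t dist = 0;
        while (true) {  // terminates because at least one slot is empty
            Slot& resident = slots[i];
            if (resident.key == kEmpty) {
                resident = incoming;
                ++count;
                return true;
            }
            const std::size_t resident_dist = distance(resident.key, i);
            if (resident_dist < dist) {
                // The resident is closer to home than the incoming key: swap,
                // then keep going on behalf of the displaced element.
                std::swap(resident, incoming);
                dist = resident_dist;
            }
            i = (i + 1) % n;
            ++dist;
        }
    }

    bool get(std::uint32_t key, std::uint32_t& out) const {
        const std::size_t found = find(key);
        if (found == slots.size()) return false;
        out = slots[found].value;
        return true;
    }
};
```

The check in `find` is what allows a lookup for a missing key to stop before reaching an empty slot, which matters most when the table is nearly full.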
[ii] Robin Hood hashing is generally considered strong in practice, and was the hash table algorithm of choice adopted by the Rust standard library before it switched to Swiss tables, which are a significantly more specialized design targeting modern CPUs.
So far, we’ve been thinking one key at a time. But, as I noted earlier, on a GPU we need to think 128 bytes at a time. To handle this, we group slots into buckets, like
$$ \begin{aligned} [k_1&:v_1 & k_2&:v_2 & k_3&:v_3 & k_4&:v_4] & [k_5&:v_5 & k_6&:v_6 & k_7&:v_7 & k_8&:v_8] \end{aligned} $$
which corresponds to two buckets with four slots per bucket. Then, when doing insertion and retrieval, we don’t perform checks one slot at a time: instead, we use a group of threads, called a tile, to search the whole bucket in parallel. Each thread in a tile retrieves one slot, checks its contents, then communicates with other threads in the tile in order to decide what to do, using hardware-efficient ballot and shuffle operations. Insertion into a given slot is performed using an atomic compare-and-swap primitive, resulting in a lock-free algorithm. For 32-bit keys and values, we use tiles corresponding to half-warps, resulting in a branching factor no worse than two, and with very small branches.
Based on our criteria, this design should be reasonably strong. So, let’s see how it actually does, then I’ll tell you how I built it—and how agentic coding made it possible to produce a high-quality implementation very quickly.
Performance Results
Let me show you two kinds of benchmarks: memory bandwidth utilization, and timing. These are designed to answer two key questions:
1. How well does the algorithm utilize a GPU’s memory bandwidth, compared to what is possible?
2. How well does the algorithm perform compared to other kinds of GPU-accelerated hash tables?
We’ll consider these under different hashing scenarios, all using a 1GB table on a 4090.
We’ll fill our hash tables using uniform random numbers over the full 32-bit unsigned integer range, resolving collisions by inserting the most-recent element.[3] For the first question, we’ll record how many buckets each sub-warp retrieved and updated over the kernel’s execution time. We’ll compare this to a CUDA memory copy API call. For the second question, we’ll run a timing experiment, and compare to cuCollections and WarpCore, two major GPU-based hash table implementations available online, neither of which existed in 2016. Since cuCollections is a core library maintained by Nvidia, I expect its performance to be reasonably well-optimized. We’ll repeat each experiment 16 times to assess variability, and plot medians along with 25th and 75th percentiles. Results can be seen below in Figure 1.
Figure 1. Performance results for GPU Robin Hood Hashing (GPURHH), compared to cuCollections with linear probing (LP) and double hashing (DH) collision-resolution strategies, as well as WarpCore. These are shown in terms of memory bandwidth utilization along with insert and get throughput.
The first plot of Figure 1 shows our hash table to be highly efficient, achieving somewhere around 90% utilization compared to the best realistic rate possible—that of a memory copy. The second and third plots show results in terms of insertion and retrieval throughput—in short, at all sufficiently-high load factors, retrieval is significantly faster using our implementation rather than with any baseline. The more-complex Robin Hood kernels result in slightly slower insertion compared to the strategies used by both libraries, namely linear probing and double hashing, but win significantly on retrieval, especially at higher load factors, thanks to needing to retrieve fewer slots. Variability between runs was found to be so small as to be nearly-invisible on the plots.
One could certainly test and document performance differences much more thoroughly, and explore additional design options such as Robin Hood together with double hashing, but I’ll stop here—after all, this is a blog post about agentic coding, not a research paper about concurrent hashing. Nonetheless, from this comparison alone, I am left feeling proud of my younger self for having figured this algorithm out in 2016—and disappointed for not finding a way, whatever the circumstances might have been, to actually tell the world about it.
Agentic Coding Setup and Workflows
Let’s now move to talking about agentic coding. So far, I’ve told you what I built, and how well it works in terms of performance. I will now tell you how I built it. But first—it is worth noting again—the initial process took two days, just two days. I don’t want to mislead you on this—the number represents time from the first prompt that produced code[iii] to a correct algorithm with reasonable test coverage.[iv] After this, it took two more days to get a reasonable set of timing benchmarks—more on that later. None of these numbers represent full-time software work—at present, and for a little bit longer, I am employed by Cornell University, and all of the work here was done in spare time. An experienced kernel engineer would have likely been much faster, but at two days, there is not all that much time here left to save.
[iii] Commit: #ce5ee0.
[iv] Commit: #ce6dbe.
Let me now tell you about my workflow—which, for those that are big on orchestration and token-maxxing, might at first feel underwhelming—though I think there’s more than meets the eye.
All of my agentic coding was done in VS Code, using the Claude Code extension, and in hindsight I believe that this is the right setup for this specific use case.[v] I think about this in terms of the equation
$$ T_{\text{total}} = T_{\text{writing}} + T_{\text{understanding}} $$
where:
$T_{\text{total}}$ is the total development time.
$T_{\text{writing}}$ is the time spent writing code, whether by hand or using a coding agent.
$T_{\text{understanding}}$ is the time spent understanding whether the code is doing what you wanted it to, potentially by reading the code, running the code and looking at its output, running tests and looking at their output, or any other method.
[v] I make no claims about what is optimal for other use cases, and think that every strategy—from an ultra-conservative approach using Claude only to answer questions about best practices, with most code written manually, to a full-blown orchestration-oriented approach—has its place. Some will be optimal for students learning to write code for the first time who need to type things out themselves in order to properly understand what is happening, others will be suited more to experienced engineers rapidly prototyping a minimum viable product to demo to a client on Monday morning. Before releasing my full session logs, I used Claude Code to check for and remove any accidental private information that might have landed in them—at one point, this process involved a carefully designed parallel subagent call. There is no best style: learn each approach, think for yourself, and pick the one that works for you.
For low-level CUDA C++, my experience is that looking at the code can be very useful.
Having a reasonably-idiomatic and well-organized codebase makes $T_{\text{writing}}$ larger compared to a so-called slop-regime which neglects to do so—but, since agentic coding is still very fast, being careful is not actually slower by all that much. On the other hand, being careful reduces $T_{\text{understanding}}$ significantly. This is the key reason why, as we will soon see, it took me just as long to benchmark the code properly as it did to get to a correct hash table in the first place—even though the former is, in principle, much easier.
Ecosystem. The type of code and maturity of its ecosystem plays a significant role. My 2016 implementation is CUDA C++, but is written in a very C-like style, as opposed to the more modern C++-like style that Claude Code produced. I significantly prefer the latter. For example, the 2026 version uses cooperative groups to perform communication cleanly at the tile level, whereas this API did not exist in 2016 so I instead computed everything by hand using warp-based primitives with bit masks to map back down to half-warps—equally valid, but much less clean and harder to debug. There are a huge number of improvements like this, and I am extremely impressed with how mature the CUDA ecosystem has gotten since the early days.
Testing. The ability to do comprehensive testing also played a major role, and agentic coding made it much more pleasant for me to follow the extreme programming principle of implementing unit tests first—a principle I have historically often praised, but rarely followed. In addition, I found it very helpful to ask Claude to brainstorm as many distinct edge case tests as it could think of, and implement as many of them as made sense. Initially, Claude wanted to test insertion and retrieval jointly, but, when asked, was happy to write verified-correct memory layouts by hand in order to trigger certain kinds of behavior.
I did something similar in 2016, and it took me quite some time to come up with a reasonably-comprehensive list of what might go wrong under concurrency—the coverage I got from Claude took almost no time to write, and was easily just as good.
Refactoring for ease of verification. A major part of my workflow involved asking Claude to refactor and better-organize the code, in order to make it easier for me to verify whether it was doing what I wanted or not. I would estimate that more than half of my prompts were used on re-writes. The idea—as floated previously—is that being able to verify correctness by human eye can provide a second way to check the code, beyond just running various tests. This may help catch different kinds of issues compared to ones that tests would immediately reveal.
Sessions and token use. These days, I code mainly in an agentic style, and tend to work with separate parallel sessions dedicated to various issues. In this project, however, I used one giant session. I initially did this in order to make it easier for people to read my session later on, as I was planning to include it in this blog post—but found it worked better than I anticipated. Large sessions allow the model to look back to prior decisions and learn based on how I responded to previous prompts, and I felt like the model was better able to guess what I actually wanted over time. Just as importantly, this project never actually used that many tokens—compaction triggered only a handful of times.
Out of my weekly Claude Max budget—which I deliberately only used for this project to facilitate measurement, while doing other work using Codex and Gemini—I ended up using:
7% to get an initial version that passed tests.
12% to reach a reasonable test coverage including edge cases.
19% to add baselines and benchmarking code.
This experience has been completely different from my AI-assisted math experiments, which have had a more modest degree of success,[vi] and have been heavily token-bottlenecked. Here, I have used far fewer tokens than what many other people claim to use, and I would have been happy to use more—but only if doing so would have actually saved me time. Many of the workflows I have seen are better suited to other kinds of projects, and for work like this will use up time rather than save it. If you have concrete ideas on how using more tokens might have actually helped achieve this project’s goals, I’d love to hear from you.
[vi] I have been able to prove a few original theoretical results in an AI-assisted manner, but not yet anything close in significance to what’s been reported in the media. I suspect this gap is down to several bottlenecks. First, frontier labs are able to leverage far more tokens than me, since I’m using subscriptions. Second, using AI for sufficiently-difficult math tends to result in written-up proofs that are so long that it is impossible for one person to learn and check them in a timely manner—mathematical journals have famously long reviewing times for essentially the same reason. Formal verification systems such as Lean would help, but would not close the gap: one would need to verify that the actual definitions and theorem statements match human language. This in itself is a lot of work, even if the actual proofs are verified correct.
Time lost due to trusting the agent too much. The third part above took the most time, even though the code involved is much less difficult.
The reason was that I didn’t want to think too hard about how to do benchmarking, since I considered this part straightforward. However, Claude picked a very unintuitive way to parameterize the experiments, and came up with a set of comparisons that were unfair, requiring the Robin Hood hash table to do more actual work than the baselines. I did not immediately realize this, and once I did, I had to ask in very specific terms to rework the benchmarks to ensure every method really was treated equally, and didn’t just appear to be on the surface. Going back to the equation, I gave in to the vibes too much at this stage, which made $T_{\text{writing}}$ for the benchmarks quite small, but resulted in a much larger $T_{\text{understanding}}$. In the end, I used more tokens, but this only slowed things down. It would have been better to have asked Claude to write a Markdown spec for how the benchmarks should work, carefully verify it, and only then ask for an implementation.
Programmer intent. A significant pattern I ended up noticing myself using is to (a) ask the model what choices and tradeoffs are possible, in terms of factors such as performance and style, (b) brainstorm with it what is the right choice for the situation at hand, (c) ask it to implement the desired code, and (d) read the code, then ask it to make changes to implement it better. One way to think about programming is that it’s all about expressing what we actually want—to a computer. From this perspective, I see agentic coding as allowing me to use separate text for describing what I want via the prompt, and verifying that the system correctly understood me, using a combination of the code, tests, and other approaches.
Pareto tradeoffs between speed and quality. My experience demonstrates to me that Claude Code not only accelerates software development, it shifts the entire speed-quality Pareto frontier.
If you want to write difficult high-quality code, you should expect to be able to both go faster, and produce better results, both at the same time. For me personally, I would even say that it feels like a shift along a three-way speed-quality-fun Pareto frontier, but that’s personal and depends on you. With this said, if someone says it’s all just slop cannons, then I’m sorry, but they’re completely wrong.
Documentation. For this project, I adopted the strategy of treating Markdown files the same as code. The model was asked to document all of its work, and I did not directly write or edit any of the documentation myself.[4] To get the documentation to a suitable quality level, I relied on a combination of high-level and low-level suggestions via re-prompting. This approach was not time-optimal: for low-level edits, re-writing text would have been faster, but I stuck to re-prompting for experiment’s sake. Overall, the model did a reasonable job of ensuring all documentation was correct—but was not as good at ensuring that none of it was misleading, nor at choosing the right level of detail. Too little documentation and it is unclear what is going on, too much and it is also unclear what is going on—it takes too long to sort through irrelevant details. Ease of reading is important too: after a while, Claude-speak tends to all sound the same.[vii] More on that next.
[vii] The situation with GPT-speak and Gemini-speak is similar—both sound different from Claude-speak, but are distinctive enough that one can quickly start to feel a lack of personality. I have found Gemini to be particularly excitable, with a deep desire to announce a marvelous breakthrough, while GPT really wants to talk about goblins. Claude is very level-headed and analytical, but has a tendency to over-assign adjectives or evocative names to every concept that comes up, even comparatively unimportant ones.
I find this tendency really interesting: naming things well is a central part of mathematical problem-solving, but it is also important to do work at the right level of simplicity. Too many names and, at some point, the cognitive load of remembering which name corresponds to which object becomes higher than that of simply looking at the relevant formula or snippet of code.
What can’t LLMs do well yet? For any new technology, one should always consider the bear case, and think about what isn’t working. And, here, I have a good candidate: agentic writing. My experience with using large language models for this has consistently been terrible. I have found it impossible to get models to write what I actually want to say, rather than something slightly different where the fine details are all wrong. As a result, I consider it much easier for me to write this blog post start-to-finish, than to prompt and re-prompt in order to create it. I hypothesize this is because general-purpose language is different from code, and is somehow much higher-dimensional—and, as humans, the easiest way to say what we want is to just outright say it.
How long before today’s state of affairs becomes out of date? As of this blog post, AI capabilities are rapidly getting stronger. I am quite confident almost no-one would have predicted this level of performance two years ago[5]—back then many people thought large language models had hit a ceiling due to hallucination issues which are now largely gone, at least in this domain. As such, by publishing this blog post now, I risk making claims that will be viewed as hopelessly out of date very soon—potentially as soon as a few months from now. I am not afraid of this. Even as models advance and techniques adapt, I think it is useful to have a reasonable snapshot of what is happening outside of frontier labs today.
In the future, this may help provide a more accurate chart of where capabilities were, and what workflows worked best, at the given time. This can still be interesting to know even if, by the time you are reading this, the situation has changed.
Conclusion
In this blog post, I presented an implementation of a lock-free Robin Hood hash table designed for GPUs, and showed it to perform strongly, compared to both Nvidia-maintained and community baselines. I did not write a single line of code in order to build this, in spite of the fact that this is a device-side low-level CUDA C++ implementation—the code equivalent of fragile glass. The initial implementation took only two days—compared to a much longer time period when I implemented the same design ten years ago. A combination of agentic coding, and a significantly more modern CUDA ecosystem, made this possible. I hope you’ve found my reflections and ways of thinking to be useful or interesting—if you did, please feel free and encouraged to find me on social media, and let me know what you think!
Footnotes
[1] Specifically, Claude Code using Opus 4.7. It turns out that I finished this blog post right on time for Opus 4.8 to be released.
[2] I’ve told this story before: in a footnote of my previous blog post, on the direction AI is headed. It is also worth noting that in my most recent blog post, which was dedicated to thinking about my future career path in light of rapid developments in AI, I promised to put together a side project or two before leaving academia—that post’s promise is what inspired the project and blog post I’m sharing with you now.
[3] Before each experiment, we compute and record how many unique elements are to be inserted into the table, to prevent any of the tables from behaving badly in the event they are loaded above capacity.
[4] There is only one exception: commit #c58d66, which adds a set of links, including to this blog post, to the readme.
I wrote this by hand because I did not want Claude to know it was participating in an experiment to evaluate its AI-assisted development capabilities.
[5] I’ll even claim more: anyone who was predicting capabilities would advance this quickly, and who was not using either a statistical model to guide those predictions, or frontier-lab insider information, was reading tea leaves. They turned out to be correct by accident.



The Road Less Traveled
2026-03-19T00:00:00+00:00
This blog post is going to be much shorter than my last one, so I’ll get straight to the point: I’m leaving academia.
Given (a) just how theoretical my work has been, with much of it focused on classical methods, and (b) that I have declined several Assistant Professor job offers, including ones with attractive terms, in order to do so, I think this choice is going to come as a big surprise to a lot of people.
Therefore, please allow me to explain myself.

A different path
Eight years ago, in 2018, I returned to academia in order to work on original research in machine learning and artificial intelligence.
My plan was to spend the first few years building fundamental skills—for me, this meant learning math, as I had a lot of confidence in my software engineering ability—and then use those skills to contribute as much as I could to the science.
This plan yielded about twenty-five papers at the world’s best machine learning conferences, and two best paper awards (or runner-up) at ICML and AISTATS—that is, from the full conferences, not workshops.
My methods are now implemented in all major software packages in my area, including those used in production at the world’s biggest tech companies.
Most people, using reasonable evaluation criteria, would agree that this is an excellent track record on paper—albeit one focused on niche topics rather than today’s biggest ones.

Within that same time period, the field of machine learning has changed to an almost-unrecognizable extent.
As I wrote in my previous blog post, the most important and relevant topics of 2026 require a different skillset and style of work than those of 2018.
In light of these changes, I am not convinced my current research style will result in the kind of work that will be remembered many years down the line.
So, while I could continue down the same academic path I set myself on in that era, and most people in my situation would probably do so, I am not convinced this is the right direction.
I am instead convinced that now is the time to pivot, take a few risks, and follow the road less traveled.
So, I will be shifting course and exploring something new.[1]

What comes next?
For my next immediate steps, I’ll be leaving Cornell in August, and returning home to California.
I have some sense of what I’d like to do in my next career stage, but it’s rather abstract, and will take time to develop.
Much of what I want to do depends on the broader AI ecosystem, and I won’t be able to get started until I arrive.
In the meantime, I will be taking the next six months to retool, resharpen my skills, and prepare for what comes ahead.</p>
There are two parts of my track record I’d like to change before my next career stage begins.
First: I’d like more direct hands-on experience with the under-the-hood machinery that powers LLMs, namely sequence-to-sequence distribution-matching via transformers.
I have the feeling that there are a lot more things beyond language that one can build using this technology, and I think it is critical for me to understand how it works to a sufficient degree of technical mastery.
This will be the focus of my last batch of Cornell-affiliated research projects.</p>
Second: earlier in this post, I claimed to have a lot of confidence in my software engineering skills.
Why do I believe this?
From experience: before starting my PhD in 2018, I worked in industry on machine learning systems.
This included work with both distributed systems—specifically, asynchronous Markov chain Monte Carlo algorithms used for adjusting for early-adopter effects in A/B testing at eBay, to give one example—and GPUs, in the form of writing low-level CUDA code in the pre-TensorFlow era.
None of these systems were easy to build, but very little of this track record is public.
I’d like you to be confident in my software engineering skills, too—both in terms of the fundamentals, and in my ability to leverage AI—all without needing to trust my opinions about my code.
So, in the coming months, I plan on putting together a side project or two to showcase what I can build.</p>
Longer-term, I’ll avoid speculating about exactly what I’ll work on, because I do not yet know.
The clearest sense I have is in what inspires me: the conviction that algorithmic decision-making under uncertainty, of the kind I have been studying over the last half-decade, has got to have a central role to play in AI—both for research, and for building new kinds of products.
Let’s see what the next steps in making my dreams a reality are going to be!</p>
Footnotes</h1>



There were also critical tradeoffs at play, in terms of the precise options that were on the table, beyond what I can share publicly. ↩</a></p>
</li>
</ol>
</section>


Where Are We and What Now?
2026-01-31T00:00:00+00:00
It’s 2026.
I started my previous blog post</a> on AI research, written in 2023, by saying that the hype around ChatGPT had died down somewhat</em> compared to its launch.
That turned out to be completely wrong</em>.
The hype has not only not died down, I would say it is now larger</em> than it ever has been.
The most important factor that has kept the hype alive is the rise of AI-driven software engineering.
It turns out that models like Claude and Gemini can write useful code.
That’s a really big deal, and very exciting for me personally.
Many years ago, I decided to become a researcher specializing in theory and methods, rather than a software engineer—in spite of the fact that I was really good at writing code.1</a></sup>
The key reason for this was that I preferred working on hard but interesting math to working on important but uninteresting code.
AI changes that.</p>
My own work is also once again at a crossroads.
Over the last two-and-a-half years, I’ve shifted away from Gaussian process research, to working on the algorithmic fundamentals of decision-making under uncertainty.
This work has resulted in two key contributions: the development of Gittins index methods</a> for Bayesian optimization</a>, and of a non-obvious form of Thompson sampling for online learning with non-discrete action spaces</a>.
Details aside, I am very happy with these contributions, because I feel that I qualitatively understand algorithmic decision-making vastly better now than I did when I started.2</a></sup></p>
Better fundamental understanding is a good starting point.
But, at least for me, it can only be that—a starting point.
The question is: what comes next?</em>
I am far from the only scientist to face this question.
In a retrospective on his career, the Nobel prize winner John Hopfield</a> singled out the process of understanding this question—which he called “Now What?”</a></em>—as a cornerstone process that helped make his contributions significant.
I agree with John on its importance: for me, the key challenge in answering this question is correctly understanding where AI research is and where it will get to, so that I can react accordingly.
This blog post will therefore explore these questions.</p>
Warning: this post is going to be long</em>.</strong> You should consider skipping to the headings that seem most interesting to you.</p>
Where are we?</h1>
The first step towards figuring out what</em> to work on—and, for that matter, where</em> to work on it—is to correctly understand where technology is headed today.
This is not a trivial question: addressing it demands that I think decisively, and make precise falsifiable claims about what is likely to happen.
I believe it is also important to have the courage to write—and, potentially, be completely wrong—in public, as I alluded to at the start of this post in my comment concerning ChatGPT’s hype.
So, I will attempt to do so, with my own formed-from-scratch opinions: here’s what I think will happen, and why.</p>
AI will significantly change how software engineering is practiced day-to-day</h2>
One of the key developments in AI that has made me re-assess what will become possible, and how quickly, is the emergence of AI-assisted coding environments such as Cursor</a> and Claude Code</a>.
These provide an interface through which a developer can ask an AI in natural language to write code that accomplishes a particular task.
This includes workflows for reviewing the results and asking for changes, as may be needed to ensure the code is correct.
If you have not tried them out, you should stop reading this blog post immediately, and do that now</em>.</p>
I find these tools extremely exciting.
Writing a program that actually serves a purpose, end-to-end, has always been difficult.
Whether one learns to code in a class at university—or, on their own, as I did as a teenager who was into video game modding—almost all developers come to appreciate the need to start simple, get an initial codebase that does something correctly, and build complexity later, as it becomes needed.
With an AI-assisted workflow, the time to do this goes significantly down—likely enough to justify making many kinds of new and more-heavily-customized software that would otherwise never have been made.</p>
At the same time, only some of a software engineer’s day-to-day tasks involve writing code.
Others involve communicating with team members, understanding existing systems in order to understand the root cause of a given problem, deciding which system design or architecture is best suited to the task at hand, and—frankly—navigating the internal politics of the company one works at in order to actually be able to improve things.
Quite a few of these may require significantly more agency than current systems have, the ability to interact effectively with more than one person, as well as completely-different interfaces.
It is unclear how quickly these will develop.
However, as many of the preceding examples are social, I think it is reasonable to expect software engineering to become a more social job.</p>
Another key limitation is that using these tools requires credits, which are a finite resource.
It is very easy to blow all of them on an unimportant task that requires a lot of tokens, but does not advance the project being written.
Managing people is a non-trivial task: an incompetent manager can easily waste their company’s budget by directing their employees to work on things that do not advance their organization’s goals, or preventing them from working on things that do.
In the same way, I believe that managing what an AI system is doing will become a key bottleneck of using it in a manner that justifies the costs.
This leads to the next key point.</p>
In a world where intellectual work is automated, a key human role will be to decide what’s interesting</h2>
Suppose we are further into the future, and AI can do almost any achievable intellectual task, but otherwise adopts the same fundamental architecture, based on token-prediction, which is in use today.
What are people going to do then?
What is the role of humans in such a world?</p>
I think the answer will be to decide what’s useful or interesting</em>.
Humans do not simply pursue goals: they also decide which goals to set for themselves.
Coming up with a sense of purpose, and fulfilling it, is a major part of human lives—and, indeed, a well-established idea from psychology, with the need for self-actualization</a> sitting at the top of Maslow’s hierarchy of needs.3</a></sup></p>
I currently see no such needs or comparable mechanism in AI systems, unless they are introduced within a prompt (potentially recursively) by a human.
On their own, current systems appear to mimic the medium-level patterns of human thought, but not the higher-level structures and needs within which human thought lives.
I do not see an incentive for AI companies to create such mechanisms, as they would make AI more difficult to control, without directly improving capabilities.
Instead, I see the opposite incentive—from the perspective of AI safety.
If this prevails, the goals pursued by an AI system will be set by people, based on what they find useful and interesting.</p>
The economic effects of AI are not obvious and will be surprising to many people</h2>
Many people are afraid that AI will, on average, eliminate jobs.
The argument for this is straightforward: taking software engineering as an example, if one software engineer with AI can do the job of a team of ten engineers without AI, what is the point of hiring the other nine?</p>
It’s also likely wrong, at least for software engineering—due to what is called Jevons’ Paradox</a></em>.4</a></sup>
The counterargument is very simple: if software becomes ten times cheaper to develop, then all kinds of previously-unviable business models become viable, leading to more</em> demand for software engineering than previously existed.
Indeed, one can observe that as computers moved from programming using punch cards, to imperative and object-oriented programming, to modern carefully-engineered languages, writing software has become easier and easier—yet, more software gets written now than ever.</p>
On the other hand, there really are fewer horses on the roads today, compared to in centuries past.
To distinguish these situations, one can certainly think harder, introducing ideas such as demand elasticity—but, frankly, my view is that for any</em> narrative like the above, it is not difficult to come up with counter-narratives regarding just-about-any economic effect of AI.
So, then, which view is right?
In the absence of empirical evidence, how can we know?</p>
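To make the role of demand elasticity concrete, here is a minimal sketch that assumes a constant-elasticity demand curve. That curve is an assumption of this illustration, not a claim about real software markets, and the names `total_spending`, `scale`, and `elasticity` are hypothetical.

```python
def total_spending(price: float, elasticity: float, scale: float = 1.0) -> float:
    """Total spending p * Q under constant-elasticity demand Q = scale * p**(-elasticity)."""
    quantity = scale * price ** (-elasticity)
    return price * quantity


# Assumed scenario: the unit cost of building software drops tenfold, from 1.0 to 0.1.
for elasticity in (0.5, 1.0, 2.0):
    before = total_spending(1.0, elasticity)
    after = total_spending(0.1, elasticity)
    print(f"elasticity={elasticity}: spending changes by a factor of {after / before:.2f}")
```

Under this assumed curve, total spending grows as the price falls whenever elasticity exceeds one, which is the Jevons case, and shrinks when elasticity is below one, which is closer to the horse case. The sketch does not say which regime applies to software: that is exactly the empirical question.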
I am not sure we can.
AI is an even more general-purpose technology than the internet, yet people were largely unable to predict with reasonable precision the economic effects the internet would have.
While the science has certainly advanced in the last twenty-five years, I worry that many interesting questions about the economic effects of AI are too open-ended, or lack any comparable reference points</a>, for anyone to be able to deduce clear answers.
For example, it is conceivable that advances in AI might lead to advances in robotics, which in turn make ultra-customizable artisan-style manufacturing competitive with assembly lines: if that happens, how does one even begin to assess the consequences?
My conclusion is that we should form our opinions—as I did, with my appeal to Jevons’ Paradox—and then prepare to be surprised.</p>
Whether the AI boom ends by stabilizing or with a crash will depend heavily on how quickly inference costs go down</h2>
In spite of attributing high uncertainty to economic outcomes of AI, there is one economic prediction I am willing to entertain: whether or not AI is a bubble—or, more precisely, whether an AI-related economic crash will happen or not.
Let me explain how I think about this.</p>
I’ll start with a simplified thought experiment involving a set of hypothetical AI companies.
At every time point, each AI company has two sources of cost and profit: training and inference.
Let’s assume there are two sequences: training costs</em> and inference profit</em>, where the latter is initially negative but grows over time and eventually becomes larger than training costs.
From these dynamics, it follows that an AI company becomes profitable if it survives long enough for inference profit to outweigh training costs</em>.
Thus, for an investor to see an AI company as a reasonable investment, they need to believe there is a large-enough probability that the company makes it, and expect sufficient profit once it does.
Google’s executives, as well as the venture capital backers of OpenAI and Anthropic, have decided to make the investment—paying for training, and subsidizing inference, both in the face of colossal costs today.</p>
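The thought experiment can be sketched in a few lines. This is a minimal illustration with made-up numbers: the function names `break_even_period` and `funding_required` and the example sequences are hypothetical, and real company finances are far messier.

```python
from itertools import accumulate
from typing import Optional, Sequence


def break_even_period(training_costs: Sequence[float],
                      inference_profits: Sequence[float]) -> Optional[int]:
    """Return the first period in which inference profit exceeds training cost, or None."""
    for t, (cost, profit) in enumerate(zip(training_costs, inference_profits)):
        if profit > cost:
            return t
    return None


def funding_required(training_costs: Sequence[float],
                     inference_profits: Sequence[float]) -> float:
    """Largest cumulative shortfall that investors must cover along the way."""
    net = [profit - cost for cost, profit in zip(training_costs, inference_profits)]
    running = accumulate(net)  # cumulative cash position over time
    return max(0.0, -min(running, default=0.0))


# Made-up sequences: flat training costs, inference profit that starts negative and grows.
costs = [10, 10, 10, 10, 10, 10]
profits = [-5, -2, 2, 6, 11, 15]
print(break_even_period(costs, profits), funding_required(costs, profits))
```

In this toy setup, the quantity an investor cares about is whether the funding pool is deeper than the largest cumulative shortfall reached before the break-even period.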
To sustain AI, investors will need to continue investing.
Short of some kind of Lehman-Brothers-style accounting catastrophe, I am confident that Google will be able to do so, given its dominance in internet advertising.
I also have confidence in OpenAI and Anthropic’s ability to survive by raising funds: the US venture capital ecosystem, which is only one of many possible sources of investment, has a total of 1.2 trillion dollars under management</a>—a number of the same order of magnitude as Google’s market capitalization.
So, the funding pool is similarly-deep for both kinds of companies—but, it is also not infinite.</p>
As inference costs go down, AI companies become less dependent on fundraising.
If this happens fast enough, relative to the economic value per token, then AI will become profitable, and the current bubble is likely to stabilize.
If, instead, costs remain stubbornly high, investors might conclude that AI is too expensive to be worth even-more investment.
This could result in companies being forced to raise prices, which would make AI less-valuable to users, and AI companies less-valuable to investors—creating the kind of feedback loop that could lead to an economic crash.
I don’t know whether or not this is likely to happen.
But, given the above dynamics, I predict that the rate at which inference costs go down is likely to be a critical factor in determining the risk.
This provides an excellent reason to think about inference costs from a technical perspective, and I will do so in the sequel.</p>
In the long term, mathematical research like mine will become AI-assisted</h2>
The last half-decade of my life has been dedicated to machine learning research of a mathematical character.
One part of this has been formulating, and then proving, certain theorems—generally, in pursuit of various higher goals.
Another part has been doing mathematical calculations which enable certain methods to be implemented numerically and tested to evaluate their performance.
Much of this work has been done in a pen-and-paper style—or, more precisely, a keyboard-and-LaTeX style for me personally.</p>
At present, AI for math is a research direction receiving a substantial degree of investment, especially from industry.
This work increasingly involves formal verification systems such as Lean</a>—these provide a programming language for writing proofs, and make it possible for a computer to verify that a given proof is correct, among other things.
The appeal, from a point of view of AI fundamentals, is obvious: if an AI system can prove a novel hard theorem, it possesses the capability to replicate and potentially exceed at least one kind of landmark human intellectual achievement.
At the same time, the domain of mathematics is one of unambiguous statements and absolute truth, making for a better angle of attack compared to other intellectual areas.</p>
I think that AI will be able to do math at a superhuman level in my lifetime, and that there is a good chance this will happen relatively soon.
This is because I think there is a very high chance the combination of language models and formal verification systems can be engineered into an AlphaZero-style recursive self-improvement loop.
I also think that the investment needed to do the engineering, and pay for the compute, is definitely there.
In any event, today’s systems, while nowhere near a superhuman level, already perform at a non-trivial level.</p>
As a result, I may be among the last generation of PhDs who are able to prove a difficult original mathematical result—where, by difficult, I mean one which requires introducing a new and different way of thinking about a given problem—purely using their own intellectual efforts.
This makes me especially proud to have written my recent paper on non-discrete online learning</a>, which I believe reaches this level.5</a></sup>
There are few things in life I enjoy more than rising up to meet a difficult intellectual challenge.
Whether people ultimately value my work or not, I am glad to have had the chance to write it, and prove to myself that I can.</p>
Almost all research-driven AI performance gains will come from ML systems, not theory and methods</h2>
A bitter lesson I’ve learned, in somewhere around ten years of work on machine learning theory and methods research, is that theory tends to come later than practice</a>.
Said differently, almost none of the field’s theoretical or methodological work, my own included, has led to significant practical performance improvements for the whole field—there are perhaps ten papers</em> that significantly moved the needle, out of perhaps a hundred thousand</em>.
This is a provocative claim</em>—so, please, allow me to explain what I mean.</p>
Suppose we are allowed to enter into a time machine, and transport ourselves back to 2010.
We’ll be giving the AI researchers of that day a small set of papers, together with a reasonable sum of money with which to buy GPUs.
Their goal will be to re-build ChatGPT from-scratch.
Which papers should we provide?
How many—or, rather, how few</em>—would we need?</p>
In my view, that number is shockingly small.
I’d perhaps choose Attention is All You Need</em>, the Adam</em> paper, maybe the Sequence-to-Sequence</em> paper, the paper on vision transformers</em>, and the work introducing causal masking.
I think a list like this, or perhaps slightly larger, would be enough to lead to ChatGPT having been built in the mid-2010s, compared to November 2022 in real life, without the time machine.</p>
It is worth reflecting on the character of those papers.
Essentially all of them are empirical, and only a subset could be called methodological.
Engineering is a much larger focus than in most work.
Other than the ideas in this small set of papers, almost all performance gains have come down to software engineering, systems, and data-centric work.
I predict these two trends—which have many important consequences that I will continue exploring in the next point—will continue to hold.</p>
The tiny proportion of practically-useful theory and methods research will move the needle a big distance</h2>
Just above, I pointed out that the fundamental ideas needed to build modern AI come from a surprisingly small set of machine learning theory and methods papers—essentially all other work which was necessary to get us where we are today came from software engineering, systems, and data.
I would now like to explore a few implications of this.</p>
The first one is personal.
I am a theory and methods researcher—one for whom the fundamental appeal has been to figure out how to build systems we’ve never known how to build before.6</a></sup>
If this approach has, historically, not been the one that yielded almost any of the most important breakthroughs, should I be pursuing it?
Especially if I think I would like doing day-to-day empirical machine learning research significantly more, given the availability of AI-assisted coding?
I have deliberately phrased these questions in a way that elevates a particular answer: please allow me to now challenge this perspective and argue that things are not as obvious as may seem.</p>
The problem with simply dismissing methodological research is that, even though it rarely succeeds at moving the needle, it moves it a very significant distance</em> when successful.
Continuing the running example, it is likely ChatGPT would simply not have worked if it was built using LSTMs rather than transformers.
The aggregate contributions of methodological research are very significant, even though almost all papers individually don’t lead very far.
We therefore need methodological research</em>, in spite of how difficult a significant degree of success is.</p>
These observations have implications for how we should support and fund fundamental research: if success is very rare, we should support as many people as we can, making as many intellectually distinct bets as possible.
We should prioritize original research directions, reward people thinking for themselves, and be suspicious of trends and well-established approaches.
My best understanding is that our current system often does the opposite: consider, for instance, that working with top people confers an enormous advantage on the faculty job market—and top people are by definition</em> not doing obscure or contrarian things.
One could even argue that, in the last decade, industry has done a much better job than academia—consider, for instance, that many of the papers I listed above came from Google Brain, at a time when large-scale machine learning was not dominant in the way it is now.</p>
Looking at my own work, I am happy that I’ve prioritized my own voice and perspective, and have historically picked a strategy of trying to start trends, rather than getting in early on existing ones.
On the other hand, I’ve probably been enticed too much into pursuing challenging and difficult research for its own intellectual sake.
Sometimes, the uninteresting, tedious, or ugly7</a></sup> approaches turn out to be the right ones.
The beauty in errors is that they tell us exactly how to sharpen our thinking, while staying true to ourselves.</p>
In-context learning will become a dominant AI paradigm</h2>
I’m now going to shift directions, and think about what AI research will look like in the near future</em>.
Research is multifaceted, and uncertain: many people are going to be working on many things.
Nonetheless, I see a trend, which one could call either emergent or well-established depending on who you ask, that I’d like to make a prediction about: the rise of in-context learning</em>.</p>
In-context learning refers to the idea that transformers are capable of performing machine learning inside of their context window.
Said differently, if one has a model that can do sequence-to-sequence prediction, then one can consider prediction tasks that map sequences representing data into sequences that encode desired outputs.
This leads to the perspective that, instead of designing algorithms—a fundamental goal of machine learning research from the beginning—one can instead seek to design datasets and prompts, in order to achieve the same results.
In natural language processing, where designing algorithms for many tasks is nightmarishly hard, this has clearly worked—but the general viewpoint above suggests it may be much more broadly relevant than a language-oriented framing would suggest.</p>
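As a minimal sketch of what mapping data into a sequence can look like, the snippet below flattens a handful of labeled examples and one unlabeled query into a single token list. The `[X]` and `[Y]` markers, the function name `encode_task`, and the toy doubling task are all hypothetical choices made for illustration. No model is called here: a sequence model would be asked to continue the sequence after the final marker.

```python
from typing import List, Sequence, Tuple


def encode_task(examples: Sequence[Tuple[str, str]], query: str) -> List[str]:
    """Flatten labeled examples plus one unlabeled query into a single token sequence."""
    tokens: List[str] = []
    for x, y in examples:
        tokens += ["[X]", x, "[Y]", y]
    # The sequence stops right where the model is expected to continue with the answer.
    tokens += ["[X]", query, "[Y]"]
    return tokens


# Toy "dataset": the hidden rule is doubling, which the model would have to infer in context.
train = [("2", "4"), ("3", "6"), ("5", "10")]
print(encode_task(train, "7"))
```

Seen this way, the design effort shifts from writing the doubling algorithm to choosing which examples, in which format, go into the sequence.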
Newly-emerging theory on transformers</a> supports this perspective.
It is possible to formalize and prove, in certain cases, that transformers trained by gradient descent can learn algorithms from data</em>—where the word algorithm</em> is understood roughly in the sense of algorithms and data structures</em>.
If that’s a general principle, there could well be problems where the algorithm one seeks is too complex for a human to figure it out, and a better approach is to design clever ways of coming up with training data, so that a transformer can learn the necessary algorithm.</p>
I predict that this paradigm—which, beyond language models, is central to other kinds of foundation models, including prior-fitted networks for tabular deep learning, to name one example—is going to be a dominant paradigm in AI.
This is because transformers work in practice: leveraging their capabilities is a much more technically sound way to perform meta-learning compared to prior approaches, which by-and-large did not reliably work.
At the same time, in-context learning is a user-friendly and extremely-flexible paradigm: it is not hard to create a synthetic-data-generation pipeline to get started, yet what that pipeline can accomplish is limited only by creativity, and researchers are very creative.</p>
Top models will continue to gradually improve for a very long time</h2>
Let’s now talk about large language models—though, the same considerations also apply to other kinds of foundation models, including those that rely on interfaces that are neither text nor vision.
I think there is essentially no limit to how long these models’ performance, in just about any domain, will continue to improve.
In particular, I see two key ways in which a language model’s performance can improve: (i) by making each token more useful, and (ii) by generating useful tokens more quickly and at less cost.
The first of these directly improves performance, while the second does so indirectly by enabling reasoning, agentic workflows, and other complementary techniques.</p>
To be specific, I think there is substantial headroom for both (i) and (ii), I think that investment in both is overwhelmingly likely to transform that headroom into results, and I think that today’s providers are going to make that investment.
Thus, I think improvements from both avenues will happen.
Since both (i) and (ii), along with each of the above points, involve details, I will explain them in the sequel, given directly below.</p>
Data quality will be a primary long-term source of improvement</h2>
Just above, I said that I think the usefulness of each generated token will improve for a very long time.
On the other hand, today’s large language model training corpora are likely to have already maxed-out the gains available from pre-training on internet data—the models have already seen every page on the internet.
So, where are these gains going to come from?</p>
My answer: data quality</em>.
Having worked with internet-scale data in my early-career days in industry, let me say something obvious but easy-to-overlook without this experience: internet-scale data is extremely messy</em>.
There is a very large amount of spam, irrelevant information, mistakes and errors, and other data that makes model performance worse.
The scale of what language models are trained on is so vast, that I see software development work to improve data quality as a near-unlimited avenue for improvement.</p>
AI itself is likely to play a significant role here.
As a simplified example, consider a classification task.
If an example is consistently misclassified, even after training at scale, it could be that the misclassification is caused by an incorrect label, rather than the model’s mistake.
In this case, the information about which label should be corrected is coming from the model’s training dynamics, which are only available once an initial training run is completed.
It is not hard to imagine workflows where a model examines its own training curves, looks for issues, corrects them, perhaps with some human labeling in the mix, and is then re-trained to improve its performance recursively.
This hypothetical example is one of many possible approaches.</p>
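One simple version of this idea can be sketched as follows, assuming we have recorded the model’s prediction for every training example at each epoch. The function name `flag_suspect_labels`, the `min_error_rate` threshold, and the toy history are hypothetical, and a flagged example is only a candidate for human review, not a confirmed labeling error.

```python
from typing import Dict, List, Sequence


def flag_suspect_labels(history: Dict[str, Sequence[int]],
                        labels: Dict[str, int],
                        min_error_rate: float = 0.9) -> List[str]:
    """Return ids of examples whose given label disagreed with the model in most epochs."""
    suspects = []
    for example_id, predictions in history.items():
        errors = sum(pred != labels[example_id] for pred in predictions)
        if predictions and errors / len(predictions) >= min_error_rate:
            suspects.append(example_id)
    return suspects


# Toy history: per-epoch predictions for three examples, all labeled as class 1.
history = {"a": [0, 1, 1, 1], "b": [0, 0, 0, 0], "c": [1, 0, 1, 1]}
labels = {"a": 1, "b": 1, "c": 1}
print(flag_suspect_labels(history, labels))
```

Here only example `b` is flagged, because the model never agrees with its label, while `a` and `c` are learned after some early mistakes.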
Data quality is not just about removing incorrect information, but also includes acquiring information that was previously missing.
This can include generating synthetic data.
Returning to the AI-for-math example discussed previously, one can use a language model to generate Lean code, and keep the parts that are verified correct, obtaining new data to train on.
If a model obtains a useful lemma purely by chance, it now knows about this lemma, and no longer has to rediscover it, improving its abilities.
This hypothetical example is also one of many.</p>
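The generate-and-verify loop can be sketched with toy stand-ins. In the snippet below, `propose` plays the role of a language model and `verify` plays the role of a proof checker such as Lean: both are hypothetical placeholders that only handle small addition claims, and no real Lean or model API is being called.

```python
import random
from typing import List, Tuple

Claim = Tuple[int, int, int]  # hypothetical claim format: (a, b, c) meaning a + b == c


def propose(rng: random.Random) -> Claim:
    """Stand-in for a language model: guesses a sum and is sometimes off by one."""
    a, b = rng.randint(0, 9), rng.randint(0, 9)
    return a, b, a + b + rng.choice([0, 0, 1])


def verify(claim: Claim) -> bool:
    """Stand-in for a proof checker: accepts only claims that are actually correct."""
    a, b, c = claim
    return a + b == c


def collect_verified(n_attempts: int, seed: int = 0) -> List[Claim]:
    """Sample candidate claims and keep only the verified ones as new training data."""
    rng = random.Random(seed)
    candidates = (propose(rng) for _ in range(n_attempts))
    return [claim for claim in candidates if verify(claim)]


new_data = collect_verified(200)
print(len(new_data), "verified claims kept out of 200 attempts")
```

The key property is that the verifier, not the generator, decides what enters the dataset, so the generator is free to be wrong most of the time without polluting the training data.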
Inference API costs will come down very significantly over time</h2>
Returning to the original point, let’s now discuss the second avenue for near-unlimited improvement: inference time and costs</em>.
Right now, inference is extremely expensive, and its cost is being subsidized by the companies.
I do not think this state of affairs will last.</p>
The least sophisticated reason for this is simply Moore’s Law</em>, which has continued to apply to GPUs to a much greater extent than it has for CPUs.
All else equal, the cost of computation decreases over time.
This means that the same models, with the same performance, can be deployed at less cost once newer-generation compute is purchased and comes online.</p>
The counterargument to this point is that, as computational costs decrease, using larger models becomes more attractive.
Indeed, in computer graphics, as computation has become less expensive, 3D artists have responded by using more and more of it to produce previously-intractable special effects.
What if the supply of increased computation is outweighed by demand for higher-quality results driven by larger models?</p>
In this race, I think the need to reduce costs is going to win—and, more precisely, it will win slowly</em>.
Fundamentally, this is for economic reasons—at some eventual point, if they want to exist, leading providers will need to become profitable—but there is a technical case as well.
In the short term, one should expect various improvements—think, for instance, of KV-caching, or the many ideas in the DeepSeek papers—purely because AI inference systems are new and there has not been a lot of time to optimize them yet.
But there are medium-and-long-term reasons as well, as I will describe next.</p>
Over time, AI will move to specialized chips, eventually ones with noisy output at the hardware level</h2>
At present, to my best information, today’s language models are served using compute clusters consisting of nodes that contain either Nvidia GPUs, or similar hardware such as Google’s TPUs.
There are two key workflows: training</em> and inference</em>, which differ because the former involves backpropagation, while the latter involves a variable-size input and output.
Right now, both workflows are running on similar or even the same hardware, which has not been designed to handle each specific workflow optimally at the hardware level.
I believe this will change: to understand why, let’s talk a bit about hardware.</p>
While the term GPU</em> stands for graphics processing unit</em>, a more appropriate name could have been general-purpose parallel-processing unit</em>.
This is because, just like a CPU, a GPU can execute general kinds of instructions—much more</em> than matrix multiplication alone.8</a></sup>
The key difference compared to a CPU is that a GPU uses massive parallelism to maximize throughput, and generally focuses on hiding latency by overlapping execution, rather than directly minimizing it.
In particular, while their architectures and therefore performance characteristics differ, both types of chips can execute equally general code.</p>
The ability to perform general computational workflows is difficult to achieve, and requires complex chip design.
The GPU must carefully manage what it is doing to ensure both correctness and performance.
If one eliminates this generality, it should be possible to design devices which are much less expensive to operate.
This is obviously an attractive proposition, as long as the initial design costs are not too large.
The downside is that one needs to know what computation the model will perform, and this may change if algorithmic techniques improve.
On balance, I expect the move to specialized chips to happen fairly quickly, unless technical reasons prevent it, because Nvidia’s GPUs are simply too expensive, and vendor lock-in, even taking into account the availability of AMD GPUs, creates too much business risk.</p>
Thinking long-term, I believe the headroom for improved performance due to custom chips is very substantial.
One reason for this is that, unlike essentially-all computations we have traditionally designed chips to perform, neural networks are noise-tolerant</em>.
If you slightly perturb the numerical value of each weight, the network will still work.
This is what makes low-precision training and inference possible.
Right now, we do not know how to design hardware for noisy computation—but, if we did, we could likely use a lot less power.9</a></sup>
In a talk given at the University of Cambridge</a>, Geoff Hinton called this idea mortal computation</em>, and has said that it might one day allow us to run a high-performance language model on a device closer to the size of a mobile phone than the size of a data center.
I see no fundamental reason why he’s wrong, and instead predict that he’s right.</p>
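To make the noise-tolerance claim concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the network is a tiny two-layer model with random (untrained) weights, and the helper names `make_network`, `perturb`, and `agreement` are hypothetical. It multiplies every weight by a small random factor and measures how often the perturbed network makes the same prediction as the original.

```python
import math
import random

def make_network(rng, n_in=8, n_hidden=16):
    """Random two-layer network (illustrative stand-in, not a trained model)."""
    w1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.gauss(0, 1) for _ in range(n_hidden)]
    return w1, w2

def predict(network, x):
    """Binary prediction: sign of the output of a tanh hidden layer."""
    w1, w2 = network
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return 1 if sum(v * h for v, h in zip(w2, hidden)) >= 0 else 0

def perturb(network, rng, rel_noise):
    """Multiply every weight by (1 + rel_noise * standard Gaussian noise)."""
    def noisy(w):
        return w * (1.0 + rel_noise * rng.gauss(0, 1))
    w1, w2 = network
    return [[noisy(w) for w in row] for row in w1], [noisy(w) for w in w2]

def agreement(net_a, net_b, inputs):
    """Fraction of inputs on which the two networks predict the same class."""
    same = sum(predict(net_a, x) == predict(net_b, x) for x in inputs)
    return same / len(inputs)

rng = random.Random(0)
network = make_network(rng)
inputs = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]

# Compare small, moderate, and large relative weight noise.
for rel_noise in (0.01, 0.1, 1.0):
    noisy_net = perturb(network, rng, rel_noise)
    print(rel_noise, agreement(network, noisy_net, inputs))
```

With small relative noise, the predictions should rarely change, and agreement should degrade as the noise grows. A trained network on real data will behave differently in the details, so this is only a toy illustration of the qualitative point.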
Building complex agentic systems that work reliably will be difficult, and will involve a better computational understanding of incentives</h2>
I’ll now shift gears again in order to talk about systems that sit on top of language models.
Agentic</em> systems involve instantiating AI agents and allowing them to interact with an environment in order to pursue a given goal.
At present, the typical workflow is that one launches a bunch of agents, each with a certain prompt, and waits for them to complete their task.
I’d like to instead talk about a much more sophisticated</em> potential workflow: launching a bunch of agents which talk to each other</em>, and to other people</em>, cooperating to solve a problem that would be difficult for a single agent to solve.
There is an appeal to this idea: social behavior is a cornerstone part of being human, and enables people to potentially achieve far more by working in teams than individually.
Why not attempt to improve what AI can do in a similar manner?</p>
One reason to be excited about such approaches is that the characteristics by which they scale up or down are likely to be different from other approaches.
Using a bigger model, or a faster model that can think for longer, generally requires a larger compute cluster.
One cannot simply deploy a twice-as-big model on two clusters in different physical locations: the increased latency will likely bottleneck performance.
Workflows involving multiple cooperating agents have no such limitations: they can in principle work with little coordination, making it possible to build substantially larger and more complex AI systems overall.</p>
I think carrying this out is going to be a lot harder than it looks.
The problem is: how do we ensure that agents successfully cooperate, especially in very large systems?
While humans certainly can work in large groups, doing so is non-trivial: human organizations have a tendency to become dysfunctional with size—think of the typical government or large company.
It is difficult</em> to align the incentives of an organization with those of its members, and the organizations that do this best tend to be controlled by small sets of individuals, rather than by distributed governance mechanisms.10</a></sup>
What stops multi-agent AI systems from transforming into an AI bureaucracy</em>11</a></sup> which uses up tokens, but achieves little?</p>
The key issue is that it is very difficult to predict how the local incentives around individual agents will combine to influence what an organized system of agents actually does.
Questions like this are studied in disciplines such as economics and political science, the latter of which is largely non-quantitative, and has yielded a much weaker understanding of the phenomena at hand compared to what has been achieved in physics, or in electrical engineering, to name two examples.
If we view AI agents and their capabilities from a fundamental point of view of respect, understanding how incentives will affect AI agents will be about as difficult as understanding how incentives affect humans.
I see this both as a challenge, and a great opportunity, and will elaborate on it further in my final point.</p>
For artificial general intelligence, world models are the next frontier</h2>
Given what language models can do, what is the next step on the path to artificial general intelligence?
I, and likely many others, think that the next step is the development of world models</em>—for robotics, this means representations of the three-dimensional physical world that make it possible to understand what will happen as a result of different physical actions.
These will go beyond pure image and video generation to involve a genuinely three-dimensional computer vision stack, while adding the ability to model physics and other modalities such as touch feedback.</p>
A useful analogy for what we want is the concept of a learned video game engine</em>.
By this, I mean software that allows one to point a camera at a scene, and obtain a representation that makes it possible to see what will happen if a robot attempts to move one of the objects in the scene.
The world model’s capabilities should be at least as rich as those of a modern game engine, but be available in arbitrary environments found day-to-day and learned on-the-fly.
If such models existed, they would be immensely useful to robotics, and in particular would make model-based reinforcement learning viable in all kinds of new situations that are unworkable today.
It is not hard to imagine how to build one: to name just one angle of attack of many, neural radiance fields with physics simulation</a> capabilities already exist.
Today, there are multiple companies, including Yann LeCun</a>’s and Fei-Fei Li</a>’s startups, which are developing ideas that may one day lead to tools like this.
I am convinced that one of them will succeed.</p>
This opinion of mine is not new.
In 2020 and 2021, I applied to a set of Junior Research Fellowships at Oxford and Cambridge, with a research proposal that moved in this direction by building on some of my early research on Variational Integrator Networks</a>, a type of variational autoencoder that can model both smooth</a> and contact dynamics</a>.
My applications were not successful, though they came close, with several final-round interviews—to my best understanding, the problem was that too many people thought what I wanted to do was science fiction and not feasible.
Some time later, I decided to double down on my views</em>, having become convinced that world model research would succeed whether I was involved in it or not.
I therefore opted to pivot to working on decision-making, which becomes increasingly important once world models arrive.
Let me talk about that next.</p>
For artificial general intelligence, decision-making and sample-efficient reinforcement learning are the second-next frontier</h2>
Suppose that world models exist.
What becomes the next challenge in artificial general intelligence?
I would argue that decision-making capabilities</em>, especially sample-efficient reinforcement learning</em>, are the next challenge.
There are many distinct areas where these capabilities are central, including active learning</em>, Bayesian optimization</em>, model-based reinforcement learning</em>, and model-predictive control</em>: here, I will think very broadly, and focus on what the methods actually do</em>, not the terminology used to describe them or the scientific community they come from.
The unifying concept here is the presence of explore-exploit tradeoffs</em>, which require the algorithm to learn by trial and error, balancing what it already knows with what it could learn by trying something totally new.</p>
I think an improved understanding of such methods would be very consequential.
In particular, using world models, they could give us the tools with which to design robots that can carefully plan step-by-step to decide what to do to solve a completely-novel task, using the world model to evaluate what will happen from potential actions.
Together with the open-endedness and interfaces provided by language models, and the predictive capabilities provided by world models, I believe this is the only remaining software-oriented capability needed to make science-fiction-level robots real.
Once world models develop and mature, I believe everyone will realize this, and attention will shift to these questions.
Since they touch on my own research, I will discuss them further in the sequel.</p>
Over the long term, artificial intelligence will create an unprecedented toolbox for understanding social systems</h2>
My final prediction—out of a total of sixteen, that’s quite a few—is even longer-term than the others.
It’s that advances in AI will make it possible to reason about incentives in social systems with unprecedented precision.
This prediction is coming from two trends.
First, incentives are critically important from a fundamental AI safety point of view: they can help us understand how intelligent models can be influenced—for instance, to ensure they tell the truth.
Second, as I argued earlier, we will need to understand incentives in order to get large-scale AI systems that involve multiple agents to reliably do what they are designed to.
Addressing both kinds of questions will involve ideas of mechanism design and related tools from economics, but will require a much-broader understanding that will involve new ways of thinking about incentives</em>.</p>
The same understanding will likely prove useful for understanding incentives in human systems.
Based on the structure of who holds power, and the specific form of what they can do, will a given form of government last, or will it collapse?
In a system of government where everyone is constrained—whether by voters, or by other means—what kind of changes can realistically happen?
We have no good first-principles computational methods with which to address these questions.
But if we fundamentally understand incentives well-enough to design large-scale AI systems involving many agents, we will probably learn how to tackle these questions too, because both are governed by similar difficulties.</p>
This prediction is naturally the most open-ended, but leaves me with a sense of optimism.
At present, a great deal of human suffering is fundamentally caused by the behavior of large human social systems.
If we can better understand how such systems work, we might gain new kinds of technical tools—ones with a completely different character than those working on policy and politics can currently imagine—with which to make the world a better place.</p>
What now?</h1>
This post has been long, with many points and various details.
I hope you have skipped the sections you did not think are relevant to you.
Inspired in part by John Hopfield’s retrospective, I will now discuss what I think the above trends mean for me and my career.
My goal here is not necessarily to convince you or anyone of anything in particular: it’s to write my own thoughts down precisely, and therefore sharpen my thinking, so that I can make the right choices when the time comes.
Nonetheless, I hope you find my thoughts useful.</p>
Relevance</h2>
In research, there are several aspects needed for my work to reach the level of significance that I aspire to.12</a></sup>
First, my work needs to involve difficult technical skills, and ideally be of a character that other people cannot do.
At this stage, I am happy with the point that my mathematical skills have reached, and I have been happy with my software skills for a long time, though AI-assisted coding has given me new aspects that I have been excited to learn.
My on-paper track record is reasonably strong, having written plenty of papers that people in various scientific communities know about.
At the same time, I’d characterize my professional success as fairly minimal: the typical top computer science PhD probably draws significantly more combined demand from academia and industry than I do, in part because my skills span too many distinct areas and my profile is too weird.</p>
I also think there’s a much bigger problem with my track record: its relevance</em>, in the sense of the question, do other people need to think about your work in order to achieve their goals?</em>
I got this framing from Kuang Xu</a>’s talk at the Operations Research and Machine Learning</a> workshop at NeurIPS,13</a></sup> and have found it to be an illuminating concept.
This is the standard by which I was, implicitly, evaluating theory and methods research in my previous points.
And I think it’s a useful one: to my eyes, relevance drives career success more than a researcher’s degree of skill, their work’s impact,14</a></sup> or many other factors.</p>
Relevance is heavily influenced by the overall direction of investment made by society.
In 2017, when deep learning was a relatively-new approach to AI, it was clear to everyone that Google and other actors with substantial resources had decided to invest in this direction.
At the time, I did not appreciate how much of the progress that would come could be attributed to that investment.
As individual researchers, we don’t control the direction of the field, yet the direction in which the field is going plays a decisive role in determining whether our own work is relevant to others.
This does not mean we should follow the crowd and not make contrarian bets: rather, it means that contrarian bets should be chosen in a manner where other people will care about them a lot if they turn out to be right</em>.</p>
While I am happy with my recent work in many aspects, I think it is also possible to make my near-future work much more relevant.
The fact of the matter is that both Gittins indices and online learning are obscure concepts that are known primarily to a small set of technical experts.
In working on them, I have implicitly chosen to prioritize my own understanding of how algorithmic decision-making works at a fundamental level, as opposed to producing results that directly help as many other people as possible.
The relevance of the resulting works is therefore mostly indirect, and largely characterized by making new kinds of follow-up projects possible.
Let me now talk a bit about what this follow-up might look like.</p>
Will decision-making algorithms experience a paradigm shift, like in computer vision or natural language processing?</h2>
I think that algorithmic decision-making is going to be a significant part of the future of AI—as I have for some time now, since shifting into this area from my Gaussian process research, which was motivated primarily by factors such as curiosity and challenge.
At the same time, I have noted above that I think the next steps could be much more relevant to the field than the prior ones.
Most of my recent work has been dual-purpose</em>, in the sense that each project has had the goal of both answering some kind of immediate-timeframe question, and opening up long-term angles of attack on harder but more significant questions.
However, it has been hard to communicate this well: almost all of my papers have been written</em> to emphasize the immediate question, while my primary reason for actually working</em> on them has more to do with the long-term understanding they create.
I think one key to being more relevant is to be able to do work where the two goals are better-aligned.</p>
Decision-making tasks are characterized by the presence of explore-exploit tradeoffs.
Balancing these tradeoffs is difficult, and we are not even close to a comprehensive understanding of how to handle them in general.
The obvious approach is to quantify uncertainty, using either a Bayesian method or something like a confidence set, and then use the resulting uncertainty to balance what is known with what could be learned.
The problem with this is that, as soon as neural networks are involved in some manner, it is famously difficult to have any control of how estimates of uncertainty behave.
As a result, almost all the decision-making methods we have—motivated by bandits, reinforcement learning theory, Bayesian optimization, or whatever else—are still classical</em>, designed using kernels and other principles from previous eras, with a great emphasis on theory.</p>
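As one concrete instance of the classical, confidence-based approach described above, here is a minimal sketch of the standard UCB1 algorithm on a Bernoulli bandit. The function name `ucb1`, the arm means, and the horizon are assumptions chosen for illustration. Each arm's score is its empirical mean plus a bonus that shrinks as the arm is pulled more often, so the bonus drives exploration and the mean drives exploitation.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on a simulated Bernoulli bandit; return per-arm pull counts."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms    # number of times each arm was pulled
    totals = [0.0] * n_arms  # sum of rewards observed for each arm

    def upper_bound(arm, t):
        mean = totals[arm] / counts[arm]
        bonus = math.sqrt(2.0 * math.log(t) / counts[arm])
        return mean + bonus

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: pull every arm once
        else:
            arm = max(range(n_arms), key=lambda a: upper_bound(a, t))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts

# Hypothetical three-armed problem; the last arm has the highest mean.
counts = ucb1([0.2, 0.5, 0.8], horizon=2000)
print(counts)
```

The confidence bonus here comes from a simple concentration argument for bounded rewards. The difficulty described above is that no comparably well-behaved bonus is available once the reward model is a neural network.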
The reasons to emphasize theory within decision-making are much stronger compared to those that were there for supervised learning in the heyday of support vector machines.
Given just an algorithm, it is not obvious whether poor performance should be attributed to bad methodological design, or to an impossibly-hard problem.
Theory, in the form of lower and upper bounds, is critical because it allows one to quantify both performance and difficulty in a mutually-compatible manner.
And, yet, I think there are good reasons to think a shift towards empirical methods should nonetheless both be possible, and yield better ultimate results.</p>
In computer vision and natural language processing, one way to think about the shift to deep learning is as a shift from designing features</em> to learning features from data</em>.
Rather than first detecting edges, textures, and combinations thereof, and then using them to make predictions, today’s models learn relevant features directly, resulting in features which are too complex to have been designed, and perform much better.
At present, I worry that decision-making algorithms are similar: a typical bandit or theoretically-motivated reinforcement learning algorithm is hand-crafted, constructed carefully using mathematical ideas.
There is a chance that it might be possible to learn better algorithms from data—ones that were too complex to design directly—if one can figure out how.</p>
In-context learning provides a paradigm by which to do this, by instantiating a form of meta-learning that actually works in practice.
Following this perspective, the critical research question becomes: how should one generate synthetic data by which a strong decision-making algorithm can be trained</em>?
This question is highly non-obvious, because, as above, there is no obvious supervised-learning-style criterion by which one can determine whether an algorithm explores correctly.
And, yet, perhaps a subtle solution of some kind may be possible—this question is likely to be of a character that can be addressed by mathematical theory.
So, I’ll make my final prediction, with a medium level of confidence: decision-making research will experience a paradigm shift, moving from classical machine learning to in-context learning</em>.
I’ll be trying my best to help us get there sooner.</p>
Where?</h1>
In light of the above, it is worth asking what is the right place for me to do the work I’d like to do.
My recent career, for many good reasons, has been spent entirely in academia.
Even if I follow my current path—at present, I am on the faculty job market, with reasonably-promising progress, but by no means guaranteed results, so far—it is worth it to sharpen my thinking by understanding the tradeoffs involved and what I would do if I had made different choices.</p>
What’s appealing about academia?</h2>
From a research perspective, the biggest upside of academia is clear: the intellectual freedom to work on what I think is going to be long-term important.
This freedom results in substantially fewer constraints on direction than in industry, where long-term goals ultimately come from leadership rather than from me.
In academia, the work is open-ended, its results are evaluated by peer review and the community at large, and the outputs are visible in public.
I think these characteristics are well-suited to my personality.
I also like teaching, like working with students, and like writing, and therefore think I would not mind scientific fundraising as part of the job.</p>
On the other hand, academia also has downsides.
The biggest of them is lack of control over location.
This is unavoidable: factors completely out of my control, such as a department’s teaching needs in a given year, are a critical part of my and everyone’s results on the job market.
There are also potential concerns about access to sufficient resources, such as compute, with which to be able to actually do the work I want to do.
Another downside is that academia provides comparatively-little opportunity for me to directly capture the value I create, most of which will ultimately go to various companies, rather than to the university or to me.</p>
What about research in industry?</h2>
If I were to move to industry, the first big question is: to where in industry?
I’m not sure.
I very much regret that, during my PhD, I never got to intern at one of the big research labs of the era.
This is partly from not trying as hard as I could have, partly from not having a public profile that convincingly made the case that I can code well, partly from very poor timing due to COVID-related hiring freezes around the time I was best-positioned.
So, I don’t know as much as I’d like: however, what I do know gives me conflicting feelings.
On the one hand, there’s definitely value in being a part of AI in the places where it’s being turned into products today.
On the other hand, I am skeptical of large companies: in my experience, at a large company, it is easy for people to not particularly care whether their work, or the company’s product, is good or not—a recent viral blog post</a>15</a></sup> captured it much better than me.
I don’t like environments where this feeling is pervasive.
But, I also suspect the most ambitious companies and teams are different: caring about their work and getting the details right is necessary to achieve their goals.
I’d love to know whether such a team could be a good fit for me.</p>
Another option is joining a small machine learning startup—by my best estimate, there are more these days than ever.
This has a lot of appeal: before returning to do my PhD, I worked at Petuum, then a Carnegie Mellon startup.
This was the best job I’ve ever had.
The single biggest factor was that everyone around me was extraordinarily competent, and very much tried to do well on all the things they worked on, to a degree I’ve only otherwise seen in academia.
There are good reasons to think this kind of environment can be found at the right startups—though I should be careful, because in Petuum’s case, I’ve been told the company’s difficult times came later, after I left to start a PhD.
I’d need to be very careful about which company to join, because startups are inherently risky, and there are a lot of founders who don’t know what they’re doing—for instance, ones that are much better at fundraising than at actually building something, to give just one potential failure mode.
Ensuring the right fit would be necessary.</p>
The main downside of moving to industry as an individual contributor, in general, is that I think it’s harder for my work to matter.
The technology sector is a very big pond, and I would be a very small fish swimming in it.
Many of the most-consequential industry research projects are so large-scale that few specific contributors can make a decisive difference.
Industry is also known for re-organizations and other aspects that reduce the creative control I would have over what I work on.
As with academia, there are definitely tradeoffs at hand.</p>
What about starting a company?</h2>
A final option, and one that I see a lot of appeal in, is entrepreneurship—for many of the same reasons as academia.
First, starting a company, just like academic research, is all about turning a vision into reality.
Second, also just like research, starting a company is difficult</em>.
A founder needs to learn how to do everything</em> well, from navigating social systems in order to fundraise and then to successfully hire, to figuring out how to build their product well and ensure it is useful to other people, to understanding how to get customers to try it out, to ensuring that the value created is captured by the company and not someone else.
The sheer scope of the work, while perhaps intimidating to some people, makes it exciting to me.
But, at the same time, I see no value in the idea of being a founder</em> for its own sake: if I’m starting a company, it’s because my research vision has advanced to a stage where it can be turned into a company mission, and I see a viable path to make that vision real.</p>
Another upside of this path is its flexibility: it is both possible to start a company during an academic sabbatical, and to leave academia entirely in order to do so.
On the other hand, a key downside—as I’ve seen from many friends who have done so, some with great success, others not so much—is that being a founder is all-consuming</em>, and will involve some amount of stepping away from technical work.
I don’t know whether I would like or be good at fundraising or other parts of the job which I’ve never tried.
Entrepreneurship is risky and carries significant potential financial downside compared to other paths—even taking time to bootstrap and build an initial demo carries serious opportunity costs.
It’s all tradeoffs, all the way down.
You’ll know if and when I decide to take this path: a blog post titled Chips On The Table</em> will appear.</p>
Some final thoughts</h1>
Let me conclude with a few final thoughts.
First, thanks for bearing with me through 20 pages worth of text—there’s a lot here.
This blog post is partly an experiment in documenting my thinking publicly, both in case someone else might find it valuable, and so other people learn more about who I am and how I think about AI.
Please feel free and welcome to contact me with your comments—I will not post my contact information here, but it is not difficult to figure out how to find it.
I especially encourage you to contact me if you think I’m wrong, or if any of my points are too obvious or simplistic to be interesting.
And remember that these are just my thoughts: there’s a good chance I’ll change my mind someday on just-about-anything written here.</p>
Finally: none of this post was written by AI.
All of it was typed in the classical manner on a keyboard.
The bottleneck for me has been in forming opinions that I have full confidence in.
As a result, unlike some people, I have found it much easier to simply write my ideas down directly, than to write a good prompt to generate text describing them.
In particular, this post was written in full by me, linearly, section-by-section, in three separate sessions over three days mid-January.
Two weeks later, I returned to and re-read it, made minor edits, collected and added all of the links, and made it public.
Given the topic is AI, and that I hold a perspective of optimism16</a></sup> and excitement about current approaches, I very much realize the irony here.</p>
Footnotes</h1>



There’s not a huge amount of publicly-visible evidence to back me up on this. But, let me give you an idea: I spent the tail end of my master’s research designing a GPU-based Robin Hood hashing algorithm and implementing it in CUDA, as part of a project to bring MCMC-based doubly-sparse Latent Dirichlet Allocation to GPUs. The key difficulty was that good parallel performance on GPUs involves very different memory access patterns compared to CPUs. My solution involved grouping hash buckets together and performing parallel insertion and retrieval on a per-warp basis, using a careful lock-free generalization of the standard algorithm, which I believe to be original. I don’t think most software engineers would have been able to get the hash table to work, but I did. Unfortunately, I never finished the project, as I moved to Imperial College London to start a PhD and landed with an advisor who insisted that I immediately drop all of my prior work. A year later, I quit working with that advisor, switching to a much better group, but by that point my research had shifted to a totally different topic. What I should have done was to drop the LDA aspects and write a paper just about parallel hashing, but I was too early-stage and didn’t know how to package my work or tell people about it. Not finishing this project is one of my biggest career regrets. ↩</a></p>
</li>

One of these papers even managed to become a finalist in a best-paper competition. But, as a certain very successful faculty member at Columbia, who works on topics not too far from where my own used to be, once told me, “Don’t you know that nobody cares about awards?”, so I’ll put that one in a footnote this time around. ↩</a></p>
</li>

A long time ago, I completed two simultaneous bachelor’s degrees in statistics and psychology. I obtained the latter degree purely for fun, with no direct professional ambitions, and spent most of my time learning social and personality psychology. A decade later, I would say that the psychology degree turned out to be surprisingly useful, in spite of the fact that a large percentage of its content was outright wrong due to what is now called the reproducibility crisis</a>. The reason is that a small portion of powerful ideas have helped me think much more clearly about society and other people. ↩</a></p>
</li>

Jevons’ name is pronounced JEV</em>-uns, similar to Jensen Huang’s first name, and different from Klavon’s Ice Cream Parlor. ↩</a></p>
</li>

A secondary lesson I have learned about mathematical research from this work is that it is harder to get other people to engage with difficult mathematical results, especially ones that involve a different viewpoint compared to what they are used to, than with easy results. This is because people know that understanding the proofs will require work, and this work is only worth doing if people feel that the techniques are beneficial to them. All of this holds doubly so if the person proving the result is not an established figure in the respective subdomain, or if the question they are answering is of a different character than what is typically studied in the subdomain. ↩</a></p>
</li>

This style should be contrasted with a different one where the appeal is to understand something complicated</em> for its own sake, without the goal to build something new. This style, different from my own, is also common. ↩</a></p>
</li>

Some time ago, as part of an industry event, I attended a keynote given by John Jumper on his Nobel-prize-winning AlphaFold work. During the talk, he spoke about having the conviction to pursue deep learning, in spite of the fact that people around him viewed it as an inelegant approach. In particular, in thinking about the right solution, he asked, “What if it’s ugly?”</em>—meaning, what if the approach that actually works doesn’t fit the aesthetics of whoever is working in the area? I immediately recognized that this was an important question to keep in mind, and it has stayed with me and influenced my thinking ever since then. ↩</a></p>
</li>

This general-purpose capability is what made GPUs useful for Bitcoin mining, which has different computational characteristics compared to computer graphics. ↩</a></p>
</li>

Note that reducing electrical usage</em> is the key factor to be optimized in order to reduce the resource footprint of AI. There is a lot of discourse on water use by data centers which is outright wrong and based on incorrect information. This is because some early work in that space focused on how much water goes into a data center, without also taking into account that a lot of this water goes right back out of the data center and remains usable downstream. The actual water use by a data center, defined in a non-misleading manner, is comparable to that of a fast-food restaurant. And it has to be like this</em>, in the sense that there should be no doubt this modified view is correct: just think a bit, from first principles, what would computers even need water for? ↩</a></p>
</li>

Indeed, concentration of power is a major cause of today’s social problems, especially in situations where a leader’s whims are poorly aligned with what society at large wants. The governing principle of separation of powers, as enshrined in the United States Constitution and elsewhere, was created precisely to address this. ↩</a></p>
</li>

In-between the time I originally wrote this post, and launched it, this point has become much less farfetched—just look at Moltbook</a> and imagine what else might be possible when large sets of independent AI agents start talking to each other. ↩</a></p>
</li>

The standard here, as written in my prior post</a>, is that my ideas should continue to be valuable to science and humanity even after I die. ↩</a></p>
</li>

This talk touched on whether current research in the operations community is relevant to society at large, including to AI in particular, and what the consequences of that to the two fields would be. ↩</a></p>
</li>

The problem with impact</em> as a concept is that one can have significant impact while capturing none of the resulting value—for instance, if one’s work greatly benefits people who have no influence on their professional success. ↩</a></p>
</li>

While I usually collect and save blog posts that I like, I forgot to do so for this one when I originally saw it. This made it difficult to find: Google search queries such as “viral blog post about people who don’t care”</em> yielded nothing useful, and consisted largely of SEO-engineered results about music and entertainment. Adding DMV</em> to the query did not help. What actually worked was asking ChatGPT about “Do you know of a viral blog post about how people who work at companies often don’t care what happens, complete with examples about the DMV and many ‘He doesn’t care’ phrases in it”</em>—this is my actual prompt, which immediately returned a response with a Hacker News link to the post. Without the ability to use AI as a search engine, I probably would have given up on finding the blog post. ↩</a></p>
</li>

Optimism in the face of uncertainty is a provably-strong decision-making strategy in many settings. Translating what theory tells us to a human-language level, the key is to choose optimally among a set of actions whose uncertain outcomes are estimated optimistically in a manner that is also realistic-enough. If one does so, the regret they incur is controlled by how quickly the uncertainty given by the set of realistic-enough outcomes—including non-optimistic ones—decreases, as information gained from trying new actions is obtained. I find it absolutely remarkable that it is possible to write a sentence like this to summarize the content of an actual mathematical theorem</em>—here, the UCB algorithm’s regret bound</a>—given the statement otherwise sounds like philosophy. ↩</a></p>
</li>
</ol>
</section>


On Successful Research
2023-10-04T00:00:00+00:00
It’s 2023.
Though by now the hype has died down somewhat, it is clear from talking to everyday people that ChatGPT has completely changed the public’s understanding of the capabilities of modern machine learning and artificial intelligence.
A year ago, most people would have said that artificial general intelligence is at least a decade away.
Today, one could reasonably argue that in its most primitive form, artificial general intelligence is here right now</a>—and it’s not too different from what Andrej Karpathy envisioned in his famous blog post from eight years ago</a>.
Some of the field’s very best researchers</a> are pivoting their work to focus on topics whose prominence rises with the public deployment of language models, such as AI safety.
With new technology becoming available to the public, it is certainly a good time to reflect on research.</p>
My own work is at a crossroads.
Most of my last four years of research was dedicated to Gaussian processes, in spite of me writing blog posts about neural networks</a> not long before I got started.
My PhD research was driven almost exclusively by intellectual challenge and the desire to learn and improve.
I studied functional analysis and differential geometry, then used that knowledge to start several lines of work—including on pathwise</a> conditioning</a> and geometric</a> Gaussian</a> processes</a>.
I won two prestigious best-paper-type awards for this work.</p>
My ambitions, however, have always been bigger than winning awards: I want to do research important enough that my ideas continue to be valuable to science even after I die.
In spite of its success, I do not think my current work rises to that level: I am increasingly convinced that many of my contributions are orthogonal to machine learning’s most important open research problems.
So, it’s time to think about expanding into something new.
This blog post will document my thinking as I explore what kinds of ideas to dedicate the next few years of my scientific life to, as an Assistant Research Professor at Cornell.</p>
Some Thoughts on Successful AI Research</h1>
Embarking on a new long-term research direction is a multifaceted task that involves taking different criteria into account.
To explore how to best do so, I now detail what I’ve learned about research over the last half-decade, focused on what works best and how to avoid common pitfalls.</p>
Start with the problem</h2>
When I started my master’s in applied mathematics and statistics, the first thing that I remember my advisor, David Draper</a>, telling me, was that the right way to perform research was to start with the problem</em>.
By this, he meant that a good research direction should focus on solving a well-formulated scientific problem with a clear and consequential notion of success.
In computer science, this generally consists of figuring out how to create a new technology that does not yet exist.
David contrasted this with method-focused research</em>, which might involve goals such as extending or generalizing existing techniques, without any consequential applications in mind until it came time to write an experiment section in a paper.
David—very much a man of strong opinions—would say that working on problem-focused research is the right</em> way to be a scientist.</p>
I don’t agree with David’s view, or rather with its implication that this is the only approach: I find this too simplistic, especially if taken literally.
Research is too multifaceted an area of human endeavour to always prefer one particular approach.
The right way to do research depends on the interplay between the researcher, the scientific question, and the needs of society.
These factors vary to a sufficient degree that no broad statement could hope to capture what is right.1</a></sup></p>
Nonetheless, I find David’s advice to be of great value, since it is insightful to think about reasons for agreeing with it.
One key advantage of problem-focused research is that succeeding at it creates societal value that goes beyond simply publishing papers.
In contrast, method-focused research creates opportunities for other people to create societal value.
This can be worth a lot, especially since some problems can ultimately only be solved after appropriate methodological preparation, but is more indirect.
If one is not careful, this indirectness increases the risk of publishing papers that few people ultimately read,2</a></sup> especially since method-focused research often tends to quickly rise to a high level of technical sophistication,3</a></sup> which is intellectually satisfying to work on, but in most cases substantially harder for other people to understand and therefore use.</p>
The essence of David’s advice, as I now understand it, is therefore to think deeply about how one’s research ultimately helps other people</em>.
To this end, not all scientific questions are of equal importance.
To figure out which ones are, I find that it helps to think about the following criteria:</p>

Success should be consequential</em>. Ideally, the research’s results should create genuine commercial value, so that starting a company to commercialize it is a viable path.4</a></sup></li>
There should be an effective angle of attack</em>. Answering the key questions using current techniques should be a viable approach. Ideally, the ultimate solution should be simple, even if the path to obtain it is not.</li>
The research question should be hard enough</em> that it cannot be answered by a talented undergraduate student. Ideally, I should be uniquely positioned to answer it.</li>
</ol>
These criteria often point in orthogonal directions, and therefore must be balanced.
In the last half-decade, I have largely failed to follow the advice I give here.
Almost all of my published work has succeeded more at (2) and (3) than at (1).
At the same time, a good bit of my unpublished work in areas like reinforcement learning never got to the point of a paper, due to not succeeding at (2).
Going forward, I’d like to work on research that fits all three criteria.</p>
Focus on what works</h2>
The ultimate reason that machine learning methods are valuable is that they work.
Deep learning, in particular, is valuable to areas like natural language processing because no other approach has allowed engineers to build natural language systems for interacting with computers to the same degree of success.
You can write a sentence, send it to ChatGPT, and it will write better sentences back than could have been obtained from a symbolic approach, or any other kind of system known at present.
Deep learning works.
There is little else to say.</p>
Some of today’s revered scientific elders do not like this state of affairs.
Noam Chomsky</a>, Judea Pearl</a>, and Gary Marcus</a> are all famously skeptical of deep learning.
Though each presents different arguments in favor of skepticism, my view is that all three share a single fundamental reason behind their skepticism: deep learning didn’t solve the problem of natural-language human-computer interaction in the way they</em> wanted to solve it.
Chomsky, Pearl, and Marcus each take this as evidence that deep learning is not good enough—that, instead, we need more focus on other, completely different approaches.
Given ChatGPT’s effectiveness, in my view a more convincing explanation is that their requirements are too strict and not necessary or even useful for creating artificial intelligence.</p>
A broader lesson one can learn from this comparison is that deep learning’s success rests in part on its empirical focus.
Deep learning’s pioneers did not focus on creating artificial intelligence in the technically elegant way some grand, over-arching theory suggested it should be solved.
Instead, they studied how to build systems that solved practical engineering problems of direct scientific and commercial importance, such as classifying images.
To advance the understanding of intelligence, we should therefore focus on building technical capabilities for solving practical engineering problems step-by-step, using methods that work.</p>
Avoid heroic effort</h2>
Every sufficiently non-trivial research problem will involve challenges that need to be handled and technical obstacles that need to be overcome.
These demand an appropriate degree of effort.
Most researchers, given their naturally hard-working nature, are happy to provide that effort.
Counterintuitively, however, my experience shows that too much effort</em> is a more common failure point than not enough effort.
Following a beautiful phrase I heard Art Owen</a> say during one of his talks, I’ll use the phrase heroic effort</em> to describe research which requires substantial intellectual effort to obtain short-term results.</p>
The reason heroic effort is almost never justified is that, at the end of the day, machine learning algorithms are produced to be implemented and deployed by ordinary engineers.
The more complex an algorithm, the bigger a team is needed to implement and maintain it.
Moreover, complex methods tend to be fragile, unstable, difficult-to-scale, and limitation-heavy.</p>
Avoiding heroic effort should be viewed as a guiding principle rather than a rigid rule: sometimes, complex and effortful methods produce results that justify their complexity.
Reverse-mode automatic differentiation, for instance, is a complex algorithm that requires one to build and maintain elaborate data structures in its implementation.
It is also an astonishingly stable and scalable algorithm, powering large language model training algorithms which demand state-of-the-art software engineering to successfully run.
In this case, the practical benefits justify the complexity, and most of the implementation complexity can be hidden from ordinary users in frameworks that are maintained by sufficiently-well-resourced teams.
Algorithms like this are rare: most successful techniques, such as for instance self-supervised learning, are simple at their core.</p>
The same thinking also applies to papers and even mathematical theory: on average, the more complex a paper or theorem, the fewer people will use it in follow-up work.
Almost all of my research projects which required heroic effort ultimately failed, in some cases after a substantial time investment.
Research should therefore involve heroic effort only when the scientific benefits justify it, and heroic effort should never</em> be applied for non-scientific reasons such as making advisors happy, to appear more sophisticated to others, or to gain something in return for sunk costs.</p>
Build the right team</h2>
Most research problems require multiple distinct skillsets to solve.
This includes mathematical skills, software and programming, design of experiments, as well as writing and scientific communication.
Some people are better at some of these than others.
A critical first step in most projects is therefore to assemble the right team, so that someone who likes and is good at every aspect of the research is involved.
Recruiting the right collaborators is often the easiest way to save time and avoid needless effort, by inviting experts who can achieve more with less time and effort to contribute as needed.</p>
The most important part of successfully building a team is to ensure that everyone feels welcome, included, and involved as part of the project.
All members should have the room to contribute creatively in the manner they deem best, and communicate as needed to ensure progress.
At the same time, the project should have a unified and well-defined-enough vision to ensure that everyone is working towards the same goals.
Watching a project make progress is often the best way to inspire everyone to contribute to the best of their potential.
Much of my research success has come down to finding the right collaborators and making sure everyone was excited about the work.</p>
Embrace standard tools</h2>
Machine learning research relies on software tooling, and the capabilities of this tooling often determine research progress.
Unless the explicit goal of a research project is to improve tooling, it should use the best existing frameworks, relying on them in the manner that most facilitates the research being done.
I have not done so in the past, and spent a significant amount of time working within the Julia automatic differentiation ecosystem, which has received significantly less software development investment compared to Python-based frameworks like JAX or PyTorch.
This resulted in me spending time fixing bugs</a> instead of working on my research, as well as struggling with batching APIs that have poor developer experience.5</a></sup>
In the last year, I’ve stopped being a programming language snob and embraced Python, which has ultimately made iterating on projects faster and less frustrating.</p>
Support and inspire others</h2>
Science and technology are community pursuits, and all of us, throughout our careers, can expect to collaborate with and learn from many colleagues who bring unique perspectives and contributions to the table.
The success of our research depends heavily on actions taken by those we surround ourselves with.</p>
Based on my experience, I believe that given an appropriate research environment, the quality of a person’s scientific work is determined primarily by their interest</em> in the topic, not their skill, talent, ability, prestige of the institutions they previously studied at, or the success of people they previously worked with.
The highest-quality work is done by those who simply want to know the answers, for the purpose of figuring out what those answers are.</p>
Since interest and curiosity are fundamentally personal, it follows that it is impossible to train</em> good students: one instead needs to discover</em> those who are interested, and empower them with the tools needed to develop their ideas and do the best work they can.6</a></sup>
It also follows that there is no such thing as a bad student</em>: in my experience, judgments like this originate from adopting a too-narrow notion of success.
We should not reject those who are less-effective at proving theorems, because it might instead be that they are very good at building a company, or at coming up with effective ways of understanding other topics such as history or ethics—even if our own interest and expertise lies in proving theorems.
Instead, the right approach is to inspire, empower, and support those around us towards reaching greatness in the manner that is right for them.</p>
The ability to inspire also affects research on a collective level.
Pascal Poupart</a> once told me he thought a significant factor behind machine learning’s success is that it sounds exciting and interesting to undergraduates—especially compared to neighboring fields, such as for instance statistics, which tended to be a subject undergraduates didn’t like when they took courses in it.
Over time, more students were inspired to work on machine learning, and the field advanced faster than its scientific neighbors as a consequence.
At the level of the discipline, one should therefore make an effort to cultivate an environment that is inspiring, accessible, and welcoming, so that over time the field attracts the highest-quality work.</p>
What’s Next</h1>
In his Bitter Lesson</a></em>, Rich Sutton famously said that learning</em> and search</em> are the most promising techniques for developing artificial intelligence, due to their ability to scale in an unlimited manner with increased compute.
Over the last decade, most of the field has focused on improved learning, such as for instance work on transformer models, or on self-supervised learning techniques that allow more effective use of available data.
At the same time, in my view some of the most impressive technical demonstrations have relied critically on search techniques within the overall system.
This includes, for instance, AlphaGo</a>, which relies fundamentally on Monte Carlo Tree Search, and the recent diplomacy AI CICERO</a>, which builds on ideas from computational game theory such as no-regret dynamics to plan the next move.</p>
Search algorithms involve explore-exploit tradeoffs, and can be understood through the lens of decision-making and reinforcement learning.
In this way, they share many similarities with Bayesian optimization, which I have extensive expertise in by this point.
Building from this expertise, I am interested in understanding how to balance explore-exploit tradeoffs generally, and will start by studying decision-making algorithms for tasks beyond optimization.
I aim to develop theory for constructing and understanding such algorithms—in both Bayesian settings where tools such as Gittins index theory apply, and in non-Bayesian settings such as adversarial bandits and online learning.</p>
While these directions don’t immediately correspond to a direct scientific problem, I believe that advances in them can help create the technical language we need to design algorithms that can, for instance, allow a robot to efficiently learn from trial and error.
Given the importance of such technical tools, and other ones that can be developed using better fundamental understanding of decision-making algorithms, I hope that my work stays true to the spirit of the advice I started my scientific journey with.</p>
Some Final Thoughts</h1>
This blog post consists of just my thoughts, what I was thinking at the time.
Your opinion might be different, and I will very likely change my mind in the future.
Research is, ultimately, personal: everyone has their own ways to succeed, and what works for me might not work for someone else, so you should figure out for yourself what works for you.
I am interested in research first and foremost, and my views reflect that.
I hope this post has been useful or at least interesting to you.
Please feel free to contact me and let me know what you think about it.</p>
Footnotes</h1>



It is easy to think of researchers, such as for instance Katalin Karikó</a>, whose work can simultaneously be viewed as method-focused, and is of such fundamental importance to humanity that one could not possibly argue that her approach is incorrect. ↩</a></p>
</li>

My own work on scalable algorithms for hierarchical Dirichlet processes is a great example: we extended a previous method—partially collapsed Gibbs sampling—from latent Dirichlet allocation to the more complex hierarchical Dirichlet process topic model. This was technically interesting and a lot of fun to write, but not as scientifically valuable as other work I’ve done. In practice, almost everyone interested in topic modeling simply uses latent Dirichlet allocation—it does the job, and is simpler and more reliable. As a result, several years later, my paper on hierarchical Dirichlet processes is nowhere near as cited as my other works. ↩</a></p>
</li>

Consider for instance how our work on Riemannian Gaussian processes started from a simple and straightforward NeurIPS paper on manifold Fourier features, but quickly transformed into a highly technical two-part foundational series of papers. ↩</a></p>
</li>

Most recent startups which made it to scale fit this criterion, including DeepMind (game AI), HuggingFace (platforms for natural language processing), Weights & Biases (machine learning operations), Mosaic ML (distributed systems and scalability), OpenAI (large language models), and others. ↩</a></p>
</li>

One API difficulty I’ve had to deal with is more restrictive broadcasting semantics compared to NumPy, which forces one to add unnecessary reshapes when working with higher-dimensional arrays. Another one is the lack of syntactic sugar for emulating object-oriented-style programming, which forces the developer to waste time on things like balancing parentheses and, in cases where object-oriented metaphors are the right approach, encourages one to write hard-to-read code. This ultimately results in projects that are more complex and difficult to maintain. My overall impression of the language is that too much attention is spent on the compiler’s technical capabilities, and not enough on improving the day-to-day developer experience. ↩</a></p>
</li>

I’ve been told that this view mirrors the approach used by Geoff Hinton, who famously supervised a very large set of outstandingly talented students who later went on to make fundamental contributions to many different areas of science. ↩</a></p>
</li>
</ol>
</section>


Gaussian Processes and Statistical Decision-making in Non-Euclidean spaces
2022-02-21T00:00:00+00:00



Vector-valued Gaussian Processes on Riemannian Manifolds via Gauge Independent Projected Kernels
2021-09-28T00:00:00+00:00



Learning Contact Dynamics using Physically Structured Neural Networks
2021-01-22T00:00:00+00:00



Pathwise Conditioning of Gaussian Processes
2020-11-10T00:00:00+00:00



Matérn Gaussian Processes on Graphs
2020-10-30T00:00:00+00:00



Matérn Gaussian Processes on Riemannian Manifolds
2020-09-25T00:00:00+00:00



Efficiently Sampling Functions from Gaussian Process Posteriors
2020-07-09T00:00:00+00:00



Aligning Time Series on Incomparable Spaces
2020-06-18T00:00:00+00:00



Asynchronous Gibbs Sampling
2020-03-01T00:00:00+00:00



Variational Integrator Networks for Physically Structured Embeddings
2020-03-01T00:00:00+00:00



Modern reference management using BibLaTeX
2019-12-30T00:00:00+00:00
BibTeX has become a universal reference management format in the mathematical sciences.
This format is used on arXiv and in virtually all journals and conferences which publish papers.
As a result, a significant amount of time is spent managing references within a manuscript.
This is not something I like to spend my time on, so in this post, I’ll explore some ways of making the process more efficient.</p>
I believe that one should work with modern tools when possible.
BibLaTeX is a modern BibTeX replacement1</a></sup> which is, in my experience, substantially easier to configure and customize as needed.
In this post, I’ll illustrate a commonly-encountered BibTeX issue (lower-case proper nouns in paper titles) and show how to avoid it in BibLaTeX.
Along the way, I’ll illustrate how to override journal style files to avoid auto-loading BibTeX-based packages such as Natbib, and how to get BibLaTeX to play well with older TeX versions on arXiv.</p>
Capitalization difficulties with BibTeX</h1>
Most journals and conferences require standardized citation formats, such as IEEE format.
They generally provide .bst</code> files which contain the necessary style definitions.
Unfortunately, virtually every .bst</code> file displays the citation</p>
@book{lifshits12,
	Author = {Lifshits, Mikhail},
	Publisher = {Springer},
	Title = {Lectures on Gaussian processes},
	Year = {2012}
}</code></pre>
as</p>
[1] M. Lifshits. Lectures on gaussian processes. Springer, 2012.</code></pre>
which changes all correctly-capitalized proper nouns in the .bib</code> file to lower-case.
The root cause of the issue is the function</p>
FUNCTION {format.title}
{ title empty$
    { "" }
    { title "t" change.case$ }
  if$
}</code></pre>
in the .bst</code> file, which one can modify to disable the behavior—but it is sufficiently non-obvious to journal editors how to do this that no-one makes this change in practice</em>.
In particular, the default ICML</a> style file has this issue.
This is because .bst</code> files are written in a highly outdated and esoteric stack-based programming language not legible to many.
In particular, it is not possible to disable title capitalization changes by changing a BibTeX or Natbib package option in the LaTeX document preamble.
Provided one wants to continue using BibTeX, one is left with two options to fix this issue.</p>

Edit the source code of the .bst</code> file, thereby no longer using the .bst</code> style provided by the journal.</li>
Edit the .bib</code> file manually</em> to replace Lectures on Gaussian processes</code> with Lectures on {G}aussian processes</code>.</li>
</ol>
It would shock many outside the field to learn how many people in the mathematical sciences choose the latter option, and spend their time fixing .bib</code> files by hand, manually one-by-one</em>.
This is an utter waste of time that no-one should be doing.
One can automate the latter process, but this is unsatisfying, because it means one can no longer search for the word Gaussian</code> inside the .bib</code> file and must instead search for {G}aussian</code>.
It can also cause issues with LaTeX correctly breaking lines in the reference section.
Why not transition to modern tools that can produce the same output?</p>
Using BibLaTeX for reference management</h1>
BibLaTeX is a modern BibTeX replacement.
Much like Natbib, it offers the ability to specify standard and in-line citations via commands such as \cite</code>, \textcite</code>, \parencite</code>, and similar.2</a></sup>
Unlike Natbib or any general BibTeX-based reference package, it offers the ability to change citation format programmatically from within LaTeX.
For example, I can disable the title capitalization changes that legacy-mimicking styles apply by default, using the following TeX command.</p>
\DeclareFieldFormat{titlecase}{#1}</code></pre>
BibLaTeX also offers backreferences</em>, which list pages on which a paper was cited.
These look as follows.</p>
[1] M. Lifshits. Lectures on Gaussian processes. Springer, 2012 (cited on page 1).</code></pre>
I find this feature to be useful, because the link to page 1 is clickable—this helps me easily go back and forth between the main document and references while editing the document.
However, I also find the default format a bit unsightly, so I change it to the following.</p>
[1] M. Lifshits. Lectures on Gaussian processes. Springer, 2012. Cited on page 1.</code></pre>
This is easily achieved via the following snippet.</p>
\usepackage{xpatch}
\xpatchbibmacro{pageref}{\printtext[parens]}{\addperiod\space\printtext}{}{}</code></pre>
In BibTeX, making changes like this would be much more time-consuming.</p>
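To put the pieces of this section together, here is a minimal sketch of a complete document. The file name `example-refs.bib`, the `style=numeric` choice, and the `backref=true` option are assumptions made for illustration; the two customization commands are the ones shown above, and the right style to load will depend on the venue.

```latex
% Write a small .bib file next to the document (hypothetical file name).
\begin{filecontents}{example-refs.bib}
@book{lifshits12,
  Author = {Lifshits, Mikhail},
  Publisher = {Springer},
  Title = {Lectures on Gaussian processes},
  Year = {2012}
}
\end{filecontents}

\documentclass{article}

% BibTeX backend; style=numeric is only a placeholder style.
% backref=true switches on the "cited on page" backreferences.
\usepackage[backend=bibtex,style=numeric,backref=true]{biblatex}
\addbibresource{example-refs.bib}

% Print titles exactly as they are written in the .bib file.
\DeclareFieldFormat{titlecase}{#1}

% Turn the parenthesized backreference into its own sentence.
\usepackage{xpatch}
\xpatchbibmacro{pageref}{\printtext[parens]}{\addperiod\space\printtext}{}{}

\begin{document}
See \textcite{lifshits12}.
\printbibliography
\end{document}
```

Compiling with `pdflatex`, then `bibtex`, then `pdflatex` twice should resolve both the citation and the backreference. Clickable page links additionally require a hyperlink package, which this sketch leaves out.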
Overriding journal files which load Natbib or other BibTeX-based packages</h1>
Some journals and conferences, such as NeurIPS</em>, offer package options such as nonatbib</code> to prevent Natbib and BibTeX from loading.
Others, such as ICML</em>, do not, and auto-load their BibTeX-based style files which include the above irritating capitalization issue.
For journals whose styles are implemented as LaTeX packages, one can override this using the following macros.</p>
\makeatletter
\@namedef{ver@natbib.sty}{9999/12/31}
\let\setcitestyle\@gobble
\usepackage{icml2020}
\let\setcitestyle\undefined
\expandafter\let\csname ver@natbib.sty\endcsname\@undefined
\makeatother</code></pre>
This works in a very simple manner.</p>

It tells LaTeX that Natbib is already loaded, so that the package is not imported when icml2020</code> attempts to load it.</li>
The command \setcitestyle</code> is redefined to \@gobble</code>, which simply ignores its arguments—this prevents icml2020</code> from raising an error while loading.</li>
Once icml2020</code> is loaded, it then tells LaTeX that Natbib isn’t loaded—this prevents BibLaTeX from raising an incompatible package error when it is subsequently loaded.</li>
</ol>
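As a sketch of how the three steps above fit into a full preamble, the ordering below loads the ICML style first and BibLaTeX only afterwards. The `backend=bibtex` and `style=numeric` options and the file name `references.bib` are assumptions for illustration, and other journal packages may call Natbib commands besides `\setcitestyle` that would need the same treatment.

```latex
\documentclass{article}

\makeatletter
% Step 1: pretend Natbib is already loaded, so icml2020 skips it.
\@namedef{ver@natbib.sty}{9999/12/31}
% Step 2: make \setcitestyle swallow its argument while icml2020 loads.
\let\setcitestyle\@gobble
\usepackage{icml2020}
% Step 3: undo both tricks so BibLaTeX sees no trace of Natbib.
\let\setcitestyle\undefined
\expandafter\let\csname ver@natbib.sty\endcsname\@undefined
\makeatother

% Only now load BibLaTeX (options here are placeholders).
\usepackage[backend=bibtex,style=numeric]{biblatex}
\addbibresource{references.bib} % hypothetical file name

\begin{document}
% ... paper body, citing with \cite, \textcite, \parencite ...
\printbibliography
\end{document}
```

The important design point is the order: the override must surround the `\usepackage{icml2020}` line, and BibLaTeX must come after `\makeatother`, since loading it earlier would run into the incompatible-package check described in the last step.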
With these tricks, I’ve been using BibLaTeX exclusively for all of my paper submissions in recent years.
In particular, my paper on Polya Urn LDA</a>—which is now published in IEEE TPAMI—uses BibLaTeX.
At no point during the entire publication, refereeing, and copy-editing process did anyone object to BibLaTeX, even though it is not the official journal-supplied style, because it generates the same bibliographies in print.</p>
Getting BibLaTeX-based documents to compile on arXiv</h1>
BibLaTeX is a large package which supports many advanced options.
It supports two backends for processing .bib</code> files into a LaTeX-readable format, BibTeX and Biber</a>—I use the BibTeX backend, because this yields good compatibility with arXiv</a>, Overleaf</a>, TeXpad</a>, and other tools.</p>
On a modern system, BibLaTeX-based LaTeX files using the BibTeX backend will fail to compile when submitted to arXiv.
The reason for this is that arXiv does not use an up-to-date version of TeX Live—see here</a>—and this causes incompatibilities with .bbl</code> files generated by the most recent version of BibLaTeX, which is still under active development.</p>
The solution is to upload a new version of BibLaTeX to arXiv so that it correctly processes the submission.
This is done by also uploading the following library files with the submission.</p>
biblatex.def</span></span>
biblatex.sty</span></span>
blx-compat.def</span></span></code></pre>
Depending on the particular style file used, others may also be necessary.
It should be possible to determine this from arXiv’s error messages.</p>
Concluding remarks</h1>
Using BibLaTeX offers advantages over BibTeX in terms of reference style customization.
This also fixes reference capitalization issues in BibTeX, which people often instead fix painstakingly by hand—wasting time that could be spent on more worthwhile tasks such as scientific research.</p>
However, switching to BibLaTeX can result in having to deal with journal style files that may wish to load incompatible packages, and with potential arXiv compilation errors.
I hope that this article has illustrated some ways in which these issues may be smoothed out.</p>
References</h1>



In particular, BibLaTeX is the reference system recommended by Overleaf</a> for new users. ↩</a></p>
</li>

See the BibLaTeX cheat sheet</a>. ↩</a></p>
</li>
</ol>
</section>


How to show that a Markov chain converges quickly
2019-08-16T00:00:00+00:00
Markov chains appear everywhere: they are used as a computational tool within Bayesian statistics, and a theoretical tool in other areas such as optimal control and reinforcement learning.
Conditions under which a general Markov chain eventually converges to a stationary distribution are well-studied, and can largely be considered classical results.
These are, informally, as follows.</p>

$\phi$</code>-irreducibility: the chain can transition from any state into any other state.</li>
Aperiodicity: the chain does not cycle through parts of the state space at fixed, regular intervals.</li>
Harris recurrence: the chain returns to some set infinitely often.</li>
</ul>
Henceforth, let $X_t$</code> be the chain, and $\pi$</code> its stationary distribution of interest.
Some care is needed when defining these conditions formally1</a></sup> for general Markov chains: for example, in a chain defined over an uncountable state space any given point will have probability zero.
To formulate irreducibility precisely we need to talk about sets of states by, say, introducing a topology.</p>
Convergence to stationarity in finite time</h1>
Proving convergence of a Markov chain is often the first step taken as part of a larger analysis.
In fact, this can be done for broad classes of chains—for instance, for Metropolis-Hastings chains with appropriately well-behaved transition proposals.2</a></sup>
This makes stationarity analysis of MCMC algorithms used in Bayesian statistics close to a non-issue.</p>
Unfortunately, from convergence of a Markov chain alone, we can say little about the chain’s distribution after a given number of steps.
This makes it interesting to study chains which are geometrically ergodic</em>—such chains converge at a prescribed rate given by</p>
$$</span></span>
\Vert P^t(\mu) - \pi\Vert_{\operatorname{TV}} \leq c_\mu \rho^t</span></span>
$$</span></span></code></pre>
where $\mu$</code> is the initial distribution, $P$</code> is the chain’s Markov operator, $t$</code> is the current iteration, $\pi$</code> is the stationary distribution, $c_\mu$</code> and $0 < \rho < 1$</code> are constants, and $\Vert\cdot\Vert_{\text{TV}}$</code> is the total variation norm over the Banach space of signed measures.</p>
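As a sanity check on this definition, here is a small worked example that is not part of the analysis above: a two-state chain whose jump probabilities $a$ and $b$ are symbols I introduce purely for illustration, assuming $0 < a, b < 1$ and $a + b \neq 1$. Writing distributions as row vectors, the bound holds with equality.

```latex
$$
\begin{aligned}
P &= \begin{pmatrix} 1-a & a \\ b & 1-b \end{pmatrix},
&
\pi &= \left(\frac{b}{a+b}, \frac{a}{a+b}\right),
\\
P^t(\mu) - \pi &= (1-a-b)^t (\mu - \pi),
&
\Vert P^t(\mu) - \pi\Vert_{\operatorname{TV}} &= \Vert \mu - \pi\Vert_{\operatorname{TV}} \, |1-a-b|^t .
\end{aligned}
$$
```

So here $c_\mu = \Vert \mu - \pi\Vert_{\operatorname{TV}}$ and $\rho = |1-a-b|$: the rate is governed by the second eigenvalue of $P$, and the constant by how far the initial distribution is from stationarity.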
Convergent chains that are not geometrically ergodic are not well-behaved.
Such chains can converge arbitrarily slowly, and functionals $f(X_t)$</code> of their output don’t necessarily satisfy a Central Limit Theorem.
In particular, the distribution of $f(X_t)$</code> can be heavy-tailed, which can cause $\frac{1}{T} \sum_{t=1}^T f(X_t)$</code> to be infinite even if $\mathbb{E}_{x\sim\pi}(f(x))$</code> is finite.
This can be problematic.</p>
Minorizing measures and regeneration</h1>
How does one know whether or not a chain is geometrically ergodic?
For simplicity, let’s consider a special case, where we take $c_\mu = c$</code> to be constant with respect to the initial distribution $\mu$</code>.
Such a chain is called uniformly ergodic</em>.
This will occur for a chain defined over a state space with compact support, and we comment on the general case once the issues in this setting are clear.</p>
We begin by considering two chains, $X_t$</code> and $Y_t$</code>, with the same stationary distribution $\pi$</code>.
We think of $X_t$</code> as the chain of interest with initial distribution $\mu$</code>, and $Y_t$</code> as an auxiliary chain.
Now, we make two assumptions.</p>

$X_t$</code> and $Y_t$</code> share the same random number generator.</li>
$Y_t$</code> starts from the stationary distribution $\pi$</code>.</li>
</ol>
This setup can be visualized via the diagram</p>
$$</span></span>
\begin{aligned}</span></span>
&\mu & &\sim & &X_1 & &\to & &X_2 & &\to & &X_3 & &\to & &..</span></span>
\\</span></span>
& & & & &\,\,| & & & &\,\,| & & & &\,\,| & & & &\,|</span></span>
\\</span></span>
&\pi & &\sim & &Y_1 & &\to & &Y_2 & &\to & &Y_3 & &\to & &..</span></span>
\end{aligned}</span></span>
$$</span></span></code></pre>
where vertical bars indicate shared random numbers, and we’ve used the identically distributed symbol $\sim$</code> in the opposite of its usual order.</p>
Given this setup, if both chains make a draw from the same distribution</em> during their one-step-ahead transitions, we can conclude that $X_t \sim Y_t \sim \pi$</code> and so the chain has converged.</p>
How is this condition ever possible?
Suppose that we can write the distribution of $X_t$</code> as a mixture of a distribution $\nu$</code> with mixture probability $\gamma$</code> and some other distribution with probability $(1-\gamma)$</code>.
Then at each time point, with probability $\gamma^2$</code>, both chains draw from the same distribution.
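Let’s draw a picture.</p>
[Figure: the one-step-ahead transition distributions of the two chains, with their overlapping region shaded.]
</figure>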
Here, both one-step-ahead transition distributions possess an overlapping shaded region.
The probability of each chain landing in this region is $\gamma$</code>, and the distribution within that region is $\nu$</code>.
We say that $\nu$</code> is a minorizing measure</em>.3</a></sup>
It can be shown4</a></sup> that to prove uniform ergodicity, it suffices to exhibit such a minorizing measure.</p>
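To see roughly why a minorizing measure yields a rate, here is a sketch of the standard coupling argument, specialized to the setup above. The symbol $p$ for the per-step probability that both chains draw from $\nu$ is my own shorthand, and I take total variation to be the supremum over events of the difference in probabilities; other normalizations change the bound by a constant factor. Once both chains draw from $\nu$ using shared random numbers, they coincide from then on, so after $t$ transitions

```latex
$$
\Vert P^t(\mu) - \pi\Vert_{\operatorname{TV}}
\leq \mathbb{P}(\text{the chains have not yet met})
\leq (1 - p)^t .
$$
```

Taking $p = \gamma^2$ as above gives a geometric rate with $c = 1$ and $\rho = 1 - \gamma^2$, uniformly in the initial distribution $\mu$.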
For a non-compact state space and arbitrary current states, such a measure can be impossible to find.
However, by virtue of convergence, our Markov chain should eventually spend most of its time in a compact subset of the state space.
One can make this intuition precise and develop techniques for proving geometric ergodicity, by introducing a suitable Lyapunov drift condition</em>1</a></sup> which, once satisfied, largely reduces the analysis to the preceding case.</p>
Once one has the existence of a minorizing measure, there will be a set of random times at which the chain will draw from the minorizing measure and forget its current location.
This allows the analysis of the averages $\frac{1}{T} \sum_{t=1}^T f(X_t)$</code> to be reduced, in some sense, to the IID case.
Following this line of thought, one can eventually prove a Central Limit Theorem for the ergodic averages, even though successive $f(X_t)$</code> are not independent.
The study of techniques originating from this observation is called regeneration theory</em>.</p>
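The reduction to the IID case can be sketched as follows. The notation is mine: $\tau_1 < \tau_2 < \dots$ denote the random times at which the chain draws from the minorizing measure, and $S_k$ is the sum of $f$ over the $k$-th stretch between such times.

```latex
$$
S_k = \sum_{t=\tau_k}^{\tau_{k+1}-1} f(X_t),
\qquad
\sum_{t=\tau_1}^{\tau_{n+1}-1} f(X_t) = \sum_{k=1}^{n} S_k .
$$
```

Because the chain forgets its location at each $\tau_k$, the pairs $(S_k, \tau_{k+1} - \tau_k)$ are independent and identically distributed. Under suitable moment conditions, classical limit theorems then apply to the right-hand side, and what remains is to control the incomplete stretches at the start and end of the run.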
Concluding remarks</h1>
Here, we examined one technique by which it’s possible to prove that a Markov chain converges quickly.
The analysis is general and provides insights of practical interest in Bayesian models.
For instance, if using a Gibbs sampler for a hierarchical model, one can examine the full conditionals to see whether or not a minorization condition is present.
Even if the minorization constant is not calculated, this gives some idea as to the chain’s numerical performance before ever running it.
In a practical application, this can be useful.</p>
Other techniques for analyzing the mixing rate of a Markov chain are also available.
In particular, reversible chains possess Markov operators which are self-adjoint: this allows one to study their spectral properties, and relate them to mixing times.5</a></sup>
I hope this article has provided a useful overview as to the need to consider finite-time mixing properties, and illustrated the key idea behind minorization techniques.</p>
References</h1>



S. Meyn and R. Tweedie. Markov Chains and Stochastic Stability. 1993. ↩</a> ↩2</a></p>
</li>

G. Roberts and J. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 2004. ↩</a></p>
</li>

Rather than working with a pair $(\gamma,\nu)$</code> where $\gamma\in[0,1]$</code> and $\nu$</code> is a probability measure, most technical papers equivalently work with a single finite measure. We use $(\gamma,\nu)$</code> here because it is intuitive and lends itself well to visualization. ↩</a></p>
</li>

J. Rosenthal. Minorization conditions and convergence rates for Markov chain Monte Carlo. JASA 90(430). 1995. ↩</a></p>
</li>

G. Roberts and J. Rosenthal. Geometric ergodicity and hybrid Markov chains. ECP 2(2). 1997. ↩</a></p>
</li>
</ol>
</section>


Sparse Parallel Training of Hierarchical Dirichlet Process Topic Models
2019-07-19T00:00:00+00:00



Pólya Urn Latent Dirichlet Allocation
2019-07-01T00:00:00+00:00



Some macros for making TeX source more readable
2019-06-08T00:00:00+00:00
The TeX typesetting system is a lovely bit of software: one can easily use it to typeset production-grade documents such as mathematical papers.
However, typesetting complex equations can be tedious, learning to use TeX well can involve memorizing a large number of macros, and it can be difficult to understand the meaning of an equation from looking purely at its source.
TeX can be made more readable by utilizing packages and introducing macros to simplify code.</p>
Some time ago, I moved from handwritten note-taking to a fully paperless workflow: more-or-less every bit of mathematics I work on is written either on a whiteboard, or in LaTeX directly, including over 100 pages of notes from a functional analysis course, which I typeset in real time as the lectures were being given.
This post contains a collection of useful macros that made this possible.</p>
Overloading accent macros in math mode</h1>
One-character macros are an important part of readable LaTeX.
Consider the union symbol $\cup$</code> given by the macro \cup</code>—this describes the shape</em> of the symbol, not its meaning</em>, which would be better described by \union</code>, or more concisely, \u</code>.</p>
Unfortunately, the macro \u</code> is already defined by LaTeX to be the breve accent macro for text mode.
Since the symbol $\cup$</code> is only used in math mode to begin with, it makes sense for us to extend \u</code> without changing the original functionality.
This may be achieved by loading the mathcommand</code> package and writing</p>
\</span>renewmathcommand</span>{</span>\</span>u</span>}{</span>\</span>cup</span>} </span>%</span> union</span></span></code></pre>
which redefines \u</code> in math mode only.
Similarly, one can use this trick to redefine \v</code> to make a symbol bold italic—a widely-used notation for vectors.
Redefinition is done the same way, except we use \expandafter\boldsymbol</code> in place of \cup</code>.
The \expandafter</code> similarly allows \boldsymbol</code> to correctly accept an input argument.
This lets one write \v{x}</code> to produce $\boldsymbol{x}$</code> while still writing \v{c}</code> to produce č, avoiding bibliography errors in cases where Eastern European author names are present.</p>
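Putting the two redefinitions together, a minimal sketch of a complete document might look as follows. I am assuming that `amsmath` is the package supplying `\boldsymbol`, and the name in the text-mode line is just an illustrative example.

```latex
\documentclass{article}
\usepackage{amsmath}     % assumed source of \boldsymbol
\usepackage{mathcommand} % provides \renewmathcommand
\renewmathcommand{\u}{\cup}                     % union, math mode only
\renewmathcommand{\v}{\expandafter\boldsymbol}  % bold vectors, math mode only
\begin{document}
% In text mode, \v keeps its original meaning as an accent.
Kova\v{c} studies sets and vectors.
% In math mode, the new definitions take over.
$A \u B$ and $\v{x} + \v{y}$.
\end{document}
```

The point of the design is that the bibliography and body text are unaffected, since both redefinitions only apply inside math mode.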
I use this trick everywhere: \P</code> becomes the probability symbol $\mathbb{P}$</code>, and \c{X}</code> becomes a calligraphic X, i.e. $\mathcal{X}$</code>, while their original definitions in text mode are retained.
This helps make my source easier to read and write, and goes a long way toward making it possible to typeset large expressions in real time during a talk or lecture.</p>
The indicator symbol</h1>
I prefer to use the blackboard bold symbol 𝟙 for indicators.
Unfortunately, this symbol is defined in the bbold</strong> package, which changes the AMS blackboard bold font used for the probability symbol $\mathbb{P}$</code>, and by default produces a blurry font.
The blurriness can be avoided by installing the bbold-type1</strong> package—once this is done, the font can be loaded by writing</p>
%</span> requires packages bbold and bbold-type1 to avoid bitmap font</span></span>
\</span>DeclareSymbolFont</span>{bbold}{U}{bbold}{m}{n}</span></span>
\</span>DeclareSymbolFontAlphabet</span>{</span>\</span>mathbbold</span>}{bbold}</span></span></code></pre>
which defines the command \mathbbold</code>.
From here, one can use \newcommand{\1}{\mathbbold{1}}</code> to define \1</code> to be the indicator symbol.</p>
A less-verbose enumerate and itemize</h1>
When writing notes, I often prefer to use enumerated and bullet-point lists in order to make the document easier to read.
In LaTeX, using \begin{enumerate}</code>, \item</code>, and \end{enumerate}</code> is rather verbose.
A more concise and more readable syntax can be obtained by writing</p>
\</span>1</span> Here's my first item.</span></span>
\</span>2</span> The second item!</span></span>
\</span>1</span>* A bullet point.</span></span>
\</span>2</span>* Another bullet point!</span></span>
\</span>0</span>*</span></span>
\</span>3</span> A third item.</span></span>
\</span>0</span></span></code></pre>
to produce</p>

Here’s my first item.</li>
The second item!</li>
</ol>

A bullet point.</li>
Another bullet point!</li>
</ul>

A third item.</li>
</ol>
where \0*</code> closes the itemize, and \0</code> closes the enumerate.
As before, \1</code> is defined separately for text and math mode.
This syntax supports custom labels and more-or-less arbitrary nesting.
It is compatible with the enumerate</code> package, and is largely unproblematic: the only TeX package I am aware of which defines \1</code> or \2</code> is xymatrix</strong>, which appears to seldom use them.
The code for the simplified enumerate and itemize syntax is given below.</p>
\</span>providecommand</span>{</span>\</span>1</span>}{} </span>%</span> xymatrix workaround</span></span>
\</span>renewcommand</span>{</span>\</span>1</span>}{</span>\</span>relax</span>\</span>ifmmode</span>\</span>mathbbold</span>{1}</span>\</span>else</span>\</span>expandafter</span>\</span>@onenonmath</span>\</span>fi</span>} </span>%</span> indicator function and enumerate/itemize shorthand</span></span>
\</span>newcommand</span>{</span>\</span>@onenonmath</span>}{</span>\</span>@ifstar</span>\</span>@onestarred</span>\</span>@onenonstarred</span>}</span></span>
\</span>newcommand</span>{</span>\</span>@onestarred</span>}{</span>\</span>begin</span>{itemize}</span>\</span>item</span>} </span>%</span> itemize shorthand</span></span>
\</span>newcommand</span>{</span>\</span>@onenonstarred</span>}</span>[</span>1</span>]</span>[</span>]</span>{</span>\</span>ifx</span>\\</span>#1</span>\\</span>\</span>begin</span>{enumerate}</span>\</span>item</span>\</span>else</span>\</span>begin</span>{enumerate}</span>[</span>#1</span>]</span>\</span>item</span>\</span>fi</span>} </span>%</span> enumerate with possible iteration choice</span></span>
\</span>providecommand</span>{</span>\</span>2</span>}{} </span>%</span> xymatrix workaround</span></span>
\</span>renewcommand</span>{</span>\</span>2</span>}{</span>\</span>@ifstar</span>\</span>item</span>\</span>item</span>} </span>%</span> enumerate/itemize shorthand</span></span>
\</span>newcommand</span>{</span>\</span>3</span>}{</span>\</span>@ifstar</span>\</span>item</span>\</span>item</span>} </span>%</span> enumerate/itemize shorthand</span></span>
\</span>newcommand</span>{</span>\</span>4</span>}{</span>\</span>@ifstar</span>\</span>item</span>\</span>item</span>} </span>%</span> enumerate/itemize shorthand</span></span>
\</span>newcommand</span>{</span>\</span>5</span>}{</span>\</span>@ifstar</span>\</span>item</span>\</span>item</span>} </span>%</span> enumerate/itemize shorthand</span></span>
\</span>newcommand</span>{</span>\</span>6</span>}{</span>\</span>@ifstar</span>\</span>item</span>\</span>item</span>} </span>%</span> enumerate/itemize shorthand</span></span>
\</span>newcommand</span>{</span>\</span>7</span>}{</span>\</span>@ifstar</span>\</span>item</span>\</span>item</span>} </span>%</span> enumerate/itemize shorthand</span></span>
\</span>newcommand</span>{</span>\</span>8</span>}{</span>\</span>@ifstar</span>\</span>item</span>\</span>item</span>} </span>%</span> enumerate/itemize shorthand</span></span>
\</span>newcommand</span>{</span>\</span>9</span>}{</span>\</span>@ifstar</span>\</span>item</span>\</span>item</span>} </span>%</span> enumerate/itemize shorthand</span></span>
\</span>newcommand</span>{</span>\</span>0</span>}{</span>\</span>@ifstar</span>\</span>@zerostarred</span>\</span>@zerononstarred</span>} </span>%</span> close enumerate/itemize</span></span>
\</span>newcommand</span>{</span>\</span>@zerostarred</span>}{</span>\</span>end</span>{itemize}} </span>%</span> close itemize</span></span>
\</span>newcommand</span>{</span>\</span>@zerononstarred</span>}{</span>\</span>end</span>{enumerate}} </span>%</span> close enumerate</span></span></code></pre>
Math-mode inline text with correct spacing</h1>
Sometimes, it’s useful to write a snippet of text inside a mathematical expression, for instance in</p>
$$</span></span>
f(x) = \begin{cases}</span></span>
1 \mathrel{\text{if}} x=0 </span></span>
\\</span></span>
0 \mathrel{\text{otherwise.}}</span></span>
\end{cases}</span></span>
$$</span></span></code></pre>
The LaTeX command \text</code> does not by default take spacing into account, so I prefer to redefine \t</code> in math mode in a way that produces spacing.
This allows me to write</p>
f(x) = </span>\</span>begin</span>{cases}</span></span>
1 </span>\</span>t</span>{if} x=0 </span>\\</span></span>
0 </span>\</span>t</span>{otherwise.}</span></span>
\</span>end</span>{cases}</span></span></code></pre>
to produce the above.
LaTeX has a number of commands that automatically determine spacing: the one that I find to work best for inline text is \mathrel</code>.
The full definition of \t</code> in the above is given below.</p>
\</span>renewmathcommand</span>{</span>\</span>t</span>}{</span>\</span>expandafter</span>\</span>mathrel</span>\</span>expandafter</span>\</span>text</span>} </span>%</span> text with spacing</span></span></code></pre>
Concluding remarks</h1>
Using custom TeX macros helps make the source more readable, which makes it easier to typeset notes in real time.
Beyond the macros listed here, I define a number of commands for readability: \m</code> for upright bold symbols typically used for matrices, i.e. $\mathbf{x}$</code>, \N</code> for $\mathbb{N}$</code>, \R</code> for $\mathbb{R}$</code>, and many others.
I also define \<</code> and \></code> to be \begin{align}</code> and \end{align}</code>, define \?</code> to be \begin{gather}</code> and \end{gather}</code>, and redefine \[</code> and \]</code> to automatically number equations.</p>
For collaboratively-written documents, custom macros can improve source readability, but will generally be unfamiliar to others.
In my experience, the benefits of having a more concise and easier-to-read source tend to be worth the costs, because it’s usually straightforward to figure out what was meant.</p>
I hope these tricks help make TeX source easier to read, and make it slightly simpler for anyone attending a mathematical course to take their notes directly in LaTeX.</p>


Building this website
2018-09-15T00:00:00+00:00
Lots of people, both in the academic and software communities, have personal websites.
Building one with today’s frameworks is easier than perhaps at any point in history, yet many people still have websites consisting of an index file inside of a folder hosted by some outdated service.
In this post, I describe how this website is built, showcasing software used to make all aspects of developing and maintaining a blog intuitive and easy.</p>
Using modern tools is worthwhile: sites that are minimally styled tend to display tiny fonts on mobile devices, making them inconvenient for readers.
They are also inconvenient for authors: if updating a website is cumbersome, then it is more likely to never be updated and contain out-of-date information.
These issues are entirely avoidable, without spending time or paying anyone.
Let’s see how.</p>
Building the site with Jekyll</h1>
This is a static website: its HTML code is generated when it is created, not when the user opens it.
The code is generated by a static site generator called Jekyll</a>,1</a></sup> written in Ruby.
We can think of Jekyll as a magic box that takes in a directory of files, and outputs a fancy formatted website.
Let’s take a look at a minimal example directory.</p>
.</span></span>
├── _config.yml</span></span>
├── _posts</span></span>
│   └── 2018-09-15-building-this-website.md</span></span>
├── about.md</span></span>
└── index.md</span></span></code></pre>
Here, the file _config.yml</code> is Jekyll’s configuration file.
Jekyll is blog-aware: the _posts</code> directory is where it expects to find blog posts.
The remaining files are Markdown text files to be used for generating individual pages and blog posts.
Let’s examine a fairly minimal configuration file.</p>
t</span>itle</span>:</span> M</span>y new website</span></span>
d</span>escription</span>:</span> A</span> really cool website</span></span></code></pre>
This defines the title and description fields, which will be used by the theme.
Now, let’s see what a minimal post such as 2018-09-15-building-this-website.md</code> might look like.</p>
---</span></span>
l</span>ayout</span>:</span> p</span>ost</span></span>
t</span>itle</span>:</span> B</span>uilding this website</span></span>
a</span>uthor</span>:</span> E</span>xample Author</span></span>
---</span></span>
</span>
#</span> Welcome to the new post</span></span>
</span>
Here's a new post about how the website is built!</span></span></code></pre>
The section surrounded by ---</code> at the top is called the post’s front matter: it tells Jekyll that the post’s HTML should be generated using the post</code> layout, and that its title is Building this website</code>.
The rest of the file is just Markdown text: # Welcome to the new post</code> will render as an HTML header that says Welcome to the new post</code>, and Here's a new post about how the website is built!</code> will render as a line of text.</p>
To have Jekyll generate the site, we run it, by typing jekyll serve</code> into a terminal window.
This starts up a web server, and we can view our page by navigating to localhost:4000</code>.
We get a fully functional website, containing an index page, an about page, and the post.
Jekyll automatically inserts the correct author and date into the post.
Jekyll’s default theme includes a homepage layout, and Jekyll will automatically create a link to the blog post from there.
At the end of this process, we have a working site—all without ever touching any HTML code.</p>
Hosting the site with GitHub Pages</h1>
In order to have the website be publicly visible, we need to host it somewhere.
It used to be that the easiest way to do so was to acquire a server and copy files onto it.
Today, we can instead use GitHub Pages</a>,2</a></sup> which greatly simplifies this process, hosts our website, and costs absolutely nothing.</p>
GitHub Pages works very simply.
First, the user creates a repository using the version control software Git and hosts it for free on GitHub.
Then, every time files are committed and pushed to the Git repository, GitHub automatically runs Jekyll to build and publish the site.
That’s it!</p>
This process is exceedingly simple and tends to just work.
We don’t need to do anything other than keep the website in a Git repository, which is good practice regardless, as it allows us to maintain version history and undo changes that broke something if need be.
By default, GitHub will host the site at {username}.github.io</code>, but it’s possible to buy a custom domain from any provider for a few dollars per year and configure it easily.
The domain name is the only thing I pay for: everything else is completely free.</p>
Responsive design with Bootstrap and the Minima Reboot theme</h1>
Jekyll ships with a small number of built-in themes, and allows users to easily select other ones.
Its default theme, Minima</a>3</a></sup>, is very good.
It is well-designed, its style is simple but modern, and it includes a navigation menu for mobile devices that have a narrow screen width.
Unfortunately, it also renders narrow pages on large desktop screens, and making custom pages with responsive design elements—parts of the website that render differently on mobile devices compared to desktops—is cumbersome.</p>
When first creating this blog, I wanted to do better, and to learn a bit of web development, so I wrote my own theme called Minima Reboot</a>4</a></sup>—named so because it’s essentially a rewrite of Minima.
The theme is enabled by adding the following line to _config.yml</code>.</p>
r</span>emote_theme</span>:</span> a</span>terenin/minima-reboot</span></span></code></pre>
The main functional difference between Minima and Minima Reboot is that the latter is written using the Bootstrap</a>5</a></sup> frontend framework.
Bootstrap makes it easy to design responsive websites that render the same on all recent browsers—a task that can be rather difficult because older browsers, especially those made by Microsoft, do not always follow web standards correctly.
The technical details are out of scope of this post, but for those interested Minima Reboot’s code can be found in its GitHub repository</a>.</p>
This site has a few additional customizations on top of the theme, such as removing the footer and making the color of hyperlinks less bright compared to Bootstrap’s default.
It also uses the Open Sans font for headers, loading it in a browser-consistent way using the Google Fonts framework.</p>
Typesetting mathematics with KaTeX</h1>
This blog is, to a large degree, about mathematics.
Hence, it includes mathematical equations that need to be rendered and displayed.
The most popular way to do this—used on websites such as arXiv and Stack Overflow—is using a JavaScript package called MathJax</a>.
MathJax works and is very popular, but it’s big, complicated, and slow—so, this blog uses a newer package called KaTeX</a>.6</a></sup>
To load it, we simply add a <script></code> element into the <head></code> element of our website, as described in the package’s documentation.</p>
By default, KaTeX and MathJax use the \(</code>, \)</code> delimiters for inline math, and the $$ delimiters for display-style math.
I prefer to instead use $ for inline math and \[</code>, \]</code> for display-style math, so I override the default configuration to use these instead.7</a></sup>
Note that since the \</code> character is not escaped, this means that my display-style delimiters are \[</code>,\]</code> in Markdown, but [</code>,]</code> in HTML.
I also use a variety of custom macros and aliases designed to make my LaTeX more readable, which both packages allow me to define.
Therefore, I can write e^{2\pi i} - 1 = 0</code> to get $e^{2\pi i} - 1 = 0$</code>, and can write</p>
$</span></span>
\</span>int</span>_</span>{</span>\</span>mathbb</span>{</span>R</span>}</span>}</span> \</span>frac</span>{</span>1</span>}</span>{</span>\</span>sqrt</span>{</span>2</span>\</span>pi</span>\</span>sigma</span>^</span>2</span>}</span>}</span> \</span>exp</span>\left(</span>\</span>frac</span>{</span>(</span>x</span>-</span>\</span>mu</span>)</span>^</span>2</span>}</span>{</span>-</span>2</span>\</span>sigma</span>^</span>2</span>}</span>\right)</span> \</span>mathrm</span>{</span>d</span>}</span> x = </span>1</span>.</span></span>
$</span></span></code></pre>
to get the equation</p>
$$</span></span>
\int_{\mathbb{R}} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(\frac{(x-\mu)^2}{-2\sigma^2}\right) \mathrm{d} x = 1.</span></span>
$$</span></span></code></pre>
Generally, these packages work seamlessly.
However, sometimes Kramdown</a>, the Markdown preprocessor used by Jekyll, interprets LaTeX code as something other than text.
In these situations, it can insert things into the LaTeX code that KaTeX and MathJax don’t understand—this includes <br></code> elements which interrupt processing of multi-line equations.
This can be avoided by adding the HTML tag <p></code> around the relevant LaTeX code, which ensures that Kramdown treats it as raw HTML and doesn’t modify it.</p>
At present, there is no good way of making either KaTeX or MathJax responsive.
This means that large equations will occasionally go off-screen for users viewing the site on mobile devices.
This isn’t ideal, so hopefully one day there will be software that lets us avoid it.</p>
Concluding Remarks</h1>
Building a personal website is easier than ever using modern technology.
Jekyll makes it easy to build a site, GitHub Pages makes it easy to host it, and tools such as Bootstrap make it easy to create a theme that looks good on all modern browsers, desktops, and mobile devices.
Modern mathematical typesetting packages make it just as easy to display high-quality equations on the web as when using LaTeX to typeset a document.</p>
Given my interest in statistical theory, I find the format offered by a blog to be incredibly useful.
In my time practicing statistics, I have come across ideas that were worth communicating to other researchers, but too simple to write a paper about, or already published but using arcane notation difficult to understand.
A blog post provides a wonderful way to communicate these ideas—it is more time-efficient to write a post once and refer people to it, rather than re-derive the same idea repeatedly when it comes up in discussion.
I hope that this post showcases how easy building such a platform is in today’s world.</p>
References</h1>



Jekyll</a> ↩</a></p>
</li>

GitHub Pages</a> ↩</a></p>
</li>

Minima</a> ↩</a></p>
</li>

Minima Reboot</a> ↩</a></p>
</li>

Bootstrap</a> ↩</a></p>
</li>

KaTeX</a> ↩</a></p>
</li>

For better compatibility, I’ve deprecated some of the tricks in this post and now rely on Kramdown to preprocess my mathematics. This requires use of $$</code> for both display style and inline math, and outputs TeX code within script tags. These can be rendered with a few lines of custom JavaScript in lieu of the KaTeX auto-render extension. ↩</a></p>
</li>
</ol>
</section>


How to use R packages such as ggplot in Julia
2018-03-23T00:00:00+00:00
Julia is a wonderful programming language.
It’s modern with good functional programming support, and unlike R and Python—both slow—Julia is fast.
Writing packages is straightforward, and high performance can be obtained without bindings to a lower-level language.
Unfortunately, its plotting frameworks are, at least in my view, not as good as the ggplot package in R.
Fortunately, Julia’s interoperability with other programming languages is outstanding.
In this post, I illustrate how to make ggplot work near-seamlessly with Julia using the RCall package.</p>
Calling R packages in Julia</h1>
R packages can be loaded in Julia1</a></sup> through the RCall2</a></sup> package by using</p>
using</span> RCall</span></span>
@rlibrary</span> ggplot2</span></span></code></pre>
which works much like the popular @pyimport</code> macro in the PyCall3</a></sup> package.
It is important to note that this properly loads</em> an R package as a Julia module, rather than simply defining a set of bindings to it.
This means that every function in the R package can automatically be called with Julia data structures as arguments, which will be automatically transformed into R data structures.
There is no need to painstakingly convert every input, as is often necessary when making different languages interface with one another: it is done automatically using the magic offered by 21st century programming languages.
So, we can write</p>
qplot</span>(</span>1</span>:</span>10</span>,</span>[</span>i</span>^</span>2</span> for</span> i </span>in</span> 1</span>:</span>10</span>]</span>)</span></span></code></pre>
and a plot generated by the ggplot4</a></sup> function qplot</code> shows up, even though 1:10</code> is a Julia range and [i^2 for i in 1:10]</code> is a Julia array.</p>
Data frame interoperability</h1>
RCall can automatically convert Julia DataFrame</code> objects into R data.frame</code> objects.
For example, the following code is valid.</p>
using</span> DataFrames</span></span>
d </span>=</span> DataFrame</span>(</span>v </span>=</span> [</span>3</span>,</span>4</span>,</span>5</span>]</span>,</span> w </span>=</span> [</span>5</span>,</span>6</span>,</span>7</span>]</span>,</span> x </span>=</span> [</span>1</span>,</span>2</span>,</span>3</span>]</span>,</span> y </span>=</span> [</span>4</span>,</span>5</span>,</span>6</span>]</span>,</span> z </span>=</span> [</span>1</span>,</span>1</span>,</span>2</span>]</span>)</span></span>
ggplot</span>(</span>d</span>,</span> aes</span>(</span>x</span>=</span>:x</span>,</span>y</span>=</span>:y</span>)</span>)</span> +</span> geom_line</span>(</span>)</span></span></code></pre>
Note that the aes</code> function uses Julia symbols like :x</code> to refer to data frame columns.
We don’t need to do any Julia to R type conversions: the code simply works.</p>
Dealing with dots, formulas, and other R quirks</h1>
There are a few issues that arise when making complicated plots.
For example, ggplot R commands such as</p>
geom_point</span>(</span>na.rm</span> =</span> TRUE</span>)</span></span></code></pre>
don’t translate directly to Julia code because the .</code> in na.rm</code> is interpreted as Julia syntax.
Similar issues arise if, for instance, an R function uses end</code> as an argument name.
The solution to this problem is to use the var</code> string macro provided by RCall, which enables us to write</p>
geom_point</span>(</span>var"</span>na.rm</span>"</span> =</span> true</span>)</span></span></code></pre>
in place of the above R code.
This macro works by defining a Julia symbol that includes the dot, which we couldn’t have done with standard syntax.</p>
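The way a dotted name becomes a keyword argument can be tried without R. As far as I am aware, Julia 1.3 and later also provide the var"..."</code> syntax in the language itself, so the sketch below runs in plain Julia. The function received_keywords</code> is a hypothetical stand-in for an R function wrapper, introduced only for illustration.

```julia
# Hypothetical stand-in for an R function wrapper: it simply reports
# the names of the keyword arguments it received.
received_keywords(; kwargs...) = collect(keys(kwargs))

# var"na.rm" builds an identifier that contains a dot, which ordinary
# Julia syntax would otherwise read as field access.
names_seen = received_keywords(var"na.rm" = true)

println(names_seen)
```

The keyword reaches the callee as the symbol Symbol("na.rm")</code>, which is the name a wrapper would need in order to forward the argument to R.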
Another useful feature is the R</code> string macro, which enables us to write R code in line with Julia code.
For example, the Julia code R"~z"</code> will execute the R code ~z</code>, which creates an R formula object with the variable z</code>, and returns it as an R object in Julia.
This can be useful for functions such as facet_grid</code> and facet_wrap</code> that accept formulas as input.
It enables us to write</p>
ggplot</span>(</span>d</span>,</span> aes</span>(</span>x</span>=</span>:x</span>,</span>y</span>=</span>:y</span>)</span>)</span> +</span> geom_point</span>(</span>)</span> +</span> facet_wrap</span>(</span>R</span>"</span>~z</span>"</span>)</span></span></code></pre>
as well as execute R functions such as data.frame</code> if we need to.
We can also use this macro to fix issues arising when automatic data frame conversion doesn’t behave as intended.
This occasionally happens for data frames that contain symbols or strings.
For example, we can write code such as</p>
d </span>=</span> d </span>|></span></span>
  x </span>-></span> R</span>"</span>$x[,1] = as.numeric($d[,1]); $x</span>"</span> |></span></span>
  x </span>-></span> R</span>"</span>$x[,2] = as.numeric($d[,2]); $x</span>"</span> |></span></span>
  x </span>-></span> R</span>"</span>$x[,3] = as.numeric($d[,3]); $x</span>"</span> |></span></span>
  x </span>-></span> R</span>"</span>$x[,4] = as.factor(as.numeric($x[,4])); $x</span>"</span> |></span></span>
  x </span>-></span> R</span>"</span>$x[,5] = as.factor(as.character($x[,5])); $x</span>"</span> |></span></span>
  x </span>-></span> names!</span>(</span>d</span>,</span> [</span>:u_min</span>,</span> :u_max</span>,</span> :x</span>,</span> :u</span>,</span> :solution</span>]</span>)</span></span></code></pre>
to convert strings to factors inside our data frame—inline.
There are a couple of points worth expanding on here.
Note first the functional style: we use a pipe5</a></sup> to input the data frame d</code> into a function that takes x</code> as input and executes the string macro R"$x[,1] = as.numeric($d[,1]); $x"</code> and returns its results.
These are immediately piped into another function.
The code $x</code> in the line R"$x[,1] = as.numeric($d[,1]); $x"</code> means that the Julia variable x</code> is passed into the R code.
This syntax allows us to execute R code without ever worrying about manually passing variables between Julia and R.</p>
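The piping pattern itself can be seen in isolation with a minimal plain-Julia sketch, shown here without RCall. The data and the operations are assumptions chosen for illustration. One detail worth knowing is that the body of an anonymous function extends as far to the right as possible, so each later stage is nested inside the body of the previous one.

```julia
# Each stage receives the previous result as x.
# Because the body of `x -> ...` extends to the right, this parses as
# [3, 1, 2] |> (x -> (sort(x) |> (x -> (map(...) |> (x -> sum(x)))))).
total = [3, 1, 2] |>
  x -> sort(x) |>
  x -> map(v -> v^2, x) |>
  x -> sum(x)

println(total)
```

The nesting is harmless here because every stage rebinds x</code>, but it does mean that variables from earlier stages remain visible in later ones.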
Putting everything together, it’s easy to make a layered plot such as</p>
ggplot</span>(</span>d</span>,</span> aes</span>(</span>x</span>=</span>:x</span>)</span>)</span> +</span></span>
  geom_ribbon</span>(</span>aes</span>(</span>ymin</span>=</span>:u_min</span>,</span> ymax</span>=</span>:u_max</span>)</span>,</span> fill</span>=</span>"</span>blue</span>"</span>,</span> alpha</span>=</span>0.5</span>)</span> +</span></span>
  geom_line</span>(</span>aes</span>(</span>y</span>=</span>:u</span>)</span>,</span> color</span>=</span>"</span>blue</span>"</span>)</span> +</span></span>
  lims</span>(</span>x</span>=</span>[</span>0</span>,</span>5</span>]</span>,</span> y</span>=</span>[</span>0</span>,</span>10</span>]</span>)</span> +</span></span>
  geom_line</span>(</span>aes</span>(</span>y</span>=</span>:solution</span>)</span>,</span> color</span>=</span>"</span>red</span>"</span>)</span> |></span></span>
  p </span>-></span> ggsave</span>(</span>"</span>p1.pdf</span>"</span>,</span> p</span>)</span></span></code></pre>
and save it to a PDF file using functional syntax, without ever writing a line of R code.
In doing so, we sacrifice very little and retain essentially all aspects of ggplot that make it a user-friendly and productive package.
I’ll conclude by noting that everything here is just ordinary use of the RCall package and would work with any R package: in all of the above, we did not use any ggplot-specific Julia packages, nor did we write a single line of language bindings.</p>
Why ggplot? Aren’t we using Julia in order to not use R?</h1>
Why bother with ggplot when Julia offers its own full-featured plotting packages such as Gadfly6</a></sup> and Plots.jl7</a></sup>?
In my view, neither of these frameworks has a well-designed programming interface. I’m not generally a fan of criticizing other people’s hard work, but I find it warranted here and will be as gentle as I can.
Let’s look at what the issues are, and why ggplot handles them better.</p>
Plots.jl is a powerful, full-featured plotting package.
Unfortunately, its interface is very similar to that of base R: making a complicated plot requires executing a list of commands.
This is its main downside: to use it effectively, the user needs to memorize every command and its options individually—there is no over-arching principle upon which commands are based, which users can learn instead of the commands themselves.
Indeed, providing such a principle is one of the major features of the Wickham-Wilkinson Grammar of Graphics8</a></sup> interface, which works as follows.</p>

Plots are visualizations of data frames consisting of layered geometric objects.</li>
Aesthetic mappings describe how individual data points are mapped to geometric objects.</li>
</ul>
For example, to plot a function and a 95% probability interval around that function, we create a data frame where each row contains the function’s $x$</code> and $f(x)$</code> values at a point, together with the lower and upper interval endpoints $a$</code> and $b$</code>.
We then add a line</em> geometric object with the aesthetic mapping $(x,y) \to (x, f(x))$</code>, as well as a ribbon</em> geometric object with the mapping $(x,\min,\max) \to (x,a,b)$</code>.
We do not need to memorize how lines and ribbons work to use them, and simply follow the principles given by the bullet points above.
If we need to use a new geometric object that we’ve never seen before, all we need to do is look at what kind of aesthetic mappings it utilizes—we never need to memorize any other details.</p>
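The example above can be sketched in plain Julia, without any plotting package. The function values, the interval half-width, and the helper resolve</code> are all assumptions made for illustration: the point is only that a layer is described by a mapping from aesthetics to columns.

```julia
# A column-oriented "data frame" as a NamedTuple of vectors.
x = collect(0.0:0.5:2.0)
f = x .^ 2
halfwidth = 0.5  # hypothetical interval half-width, for illustration only
df = (x = x, f = f, a = f .- halfwidth, b = f .+ halfwidth)

# Aesthetic mappings: which column feeds which visual property of a layer.
line_aes   = (x = :x, y = :f)
ribbon_aes = (x = :x, ymin = :a, ymax = :b)

# Hypothetical helper: look up each mapped column in the data.
resolve(df, aes) = map(col -> getfield(df, col), aes)

line   = resolve(df, line_aes)
ribbon = resolve(df, ribbon_aes)
```

Adding a new kind of layer only requires knowing which aesthetics it expects, which is the principle described above.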
On the other hand, consider the Plots.jl code that I wrote for a project</p>
contour</span>(</span>-</span>3</span>:</span>0.1</span>:</span>3</span>,</span> -</span>3</span>:</span>0.1</span>:</span>3</span>,</span> (</span>x</span>,</span>y</span>)</span> -></span> pdf</span>(</span>MultivariateNormal</span>(</span>2</span>,</span>1</span>)</span>,</span>[</span>x</span>,</span>y</span>]</span>)</span>)</span></span>
scatter!</span>(</span>θ</span>[</span>i</span>]</span>[</span>1</span>,</span>:</span>]</span>,</span> θ</span>[</span>i</span>]</span>[</span>2</span>,</span>:</span>]</span>)</span></span></code></pre>
and note how this syntax differs from</p>
plot!</span>(</span>hcat</span>(</span>L</span>,</span>error</span>)</span>,</span>layout</span>=</span>2</span>,</span> label</span>=</span>[</span>"</span>L: test</span>"</span> "</span>Error: test</span>"</span>]</span>,</span> alpha</span>=</span>0.5</span>)</span></span></code></pre>
where a single matrix is used as input rather than two ranges and a function.
It is a priori</em> unclear whether the input to a particular plotting function should be an array, data frame, or something else.
Looking a bit further, imagine setting color labels in a complicated multilayered plot—in which layer’s command should we specify how labels are displayed?
Ambiguity like this forces the user to spend time reading documentation rather than making plots, and in my experience the time saved by having concise commands like plot(x,y)</code> in simple cases does not outweigh the cost in complicated ones.</p>
It’s true that the Grammar of Graphics interface is not well-suited to every kind of plot, but it works well for most of the ones encountered in everyday data science.
Most importantly, it offers a single unified way to think about plots and how to construct them.
Writing plots in it can be more verbose, but I prefer being verbose and consistent to being concise and different in every scenario.
I don’t have time to memorize individual commands in a plotting package that doesn’t contain a central set of guiding principles—and neither should you.</p>
So if I don’t prefer Plots.jl due to its interface, what about Gadfly, which is Grammar of Graphics based?
Unfortunately, Gadfly lacks many useful features, such as transparency and geometric objects like geom_raster</code>, and suffers from a whole other set of issues that make it difficult to use.
One particular problem is that it uses a varargs-based interface rather than a functional one.
This makes us write things like</p>
plot</span>(</span>plot_data_1</span>,</span> x</span>=</span>"</span>x</span>"</span>,</span> y</span>=</span>"</span>u</span>"</span>,</span> Geom</span>.</span>line</span>,</span></span>
  layer</span>(</span>Geom</span>.</span>line</span>,</span> x </span>=</span> "</span>x</span>"</span>,</span> y </span>=</span> "</span>solution</span>"</span>,</span> Theme</span>(</span>default_color</span>=</span>"</span>red</span>"</span>)</span>)</span>,</span></span>
  layer</span>(</span>Geom</span>.</span>line</span>,</span> x</span>=</span>"</span>x</span>"</span>,</span> y</span>=</span>"</span>u_mc</span>"</span>,</span> Theme</span>(</span>default_color </span>=</span> "</span>purple</span>"</span>)</span>)</span>,</span></span>
  layer</span>(</span>Geom</span>.</span>line</span>,</span> x</span>=</span>"</span>x</span>"</span>,</span> y</span>=</span>"</span>u_mf</span>"</span>,</span> Theme</span>(</span>default_color </span>=</span> "</span>orange</span>"</span>)</span>)</span></span>
)</span></span></code></pre>
instead of</p>
ggplot</span>(</span>plot_data_1</span>,</span> aes</span>(</span>x</span>=</span>:x</span>,</span> y</span>=</span>:u</span>)</span>)</span> +</span></span>
  geom_line</span>(</span>color</span>=</span>"</span>blue</span>"</span>)</span> +</span></span>
  geom_line</span>(</span>aes</span>(</span>y</span>=</span>:u_mc</span>)</span>,</span> color</span>=</span>"</span>purple</span>"</span>)</span> +</span></span>
  geom_line</span>(</span>aes</span>(</span>y</span>=</span>:u_mf</span>)</span>,</span> color</span>=</span>"</span>orange</span>"</span>)</span></span></code></pre>
which is much simpler.
The issue here is that a ...</code> based interface requires the user to waste time on the irritating task of balancing commas and parentheses.
Plots.jl suffers just as much from the exact same problem.</p>
This code raises another major issue: Gadfly doesn’t follow the Grammar of Graphics strictly enough. A color not given by an aesthetic mapping should be defined as part of a geometric object, not part of a theme.
Themes are supposed to control parts of the plot that have nothing to do with the data or geometric objects, such as the font size for the plot’s title—certainly not the color of a line.
This is an inconsistency that a user needs to learn, rather than a consequence of a set of principles that is immediately obvious.</p>
At the end of the day, memorizing a plotting package is not a good use of my time or yours, and after spending a good bit of time with both packages I’ve found dealing with R-Julia interoperability and its occasional difficulties to be a lesser problem compared to the issues raised above.</p>
Concluding thoughts</h1>
Julia is wonderful, made even more so through its strong interoperability given by RCall2</a></sup> and PyCall.3</a></sup>
I find it better than R, and much better than Python.
It does have its flaws.
Its syntax isn’t ideal in certain situations, particularly when writing highly functional code, and would be improved by being more like Scala, or even like pipe-oriented R written with the magrittr5</a></sup> package.
Multiple dispatch is not a proper replacement for Python-style objects, and having a language feature similar to Rust’s Implementations</em> would be a major improvement.
This said, in my view Julia is already ahead of R and Python, which have bigger issues than the above.
Usability and cleanliness are critically important in a programming language, and this is why it’s worth using ggplot in Julia.</p>
References</h1>



Julia</a> ↩</a></p>
</li>

RCall</a> ↩</a> ↩2</a></p>
</li>

PyCall</a> ↩</a> ↩2</a></p>
</li>

ggplot</a> ↩</a></p>
</li>

magrittr</a> ↩</a> ↩2</a></p>
</li>

Gadfly</a> ↩</a></p>
</li>

Plots.jl</a> ↩</a></p>
</li>

See the original book9</a></sup> and ggplot manual.10</a></sup> ↩</a></p>
</li>

L. Wilkinson. The Grammar of Graphics. 2005. ↩</a></p>
</li>

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. 2016. ↩</a></p>
</li>
</ol>
</section>


What does it really mean to be Bayesian?
2018-02-09T00:00:00+00:00
In my previous posts, I introduced Bayesian models and argued that they are meaningful.
I claimed that studying them is worthwhile because the probabilistic interpretation of learning that they offer can be more intuitive than other interpretations.
I showcased an example illustrating what a Bayesian model looks like.
I did not, however, say what a Bayesian model actually is—at least not in a sufficiently general setting to encompass models people regularly use.
I’m going to discuss that in this post, and then showcase some surprising behavior in infinite-dimensional settings where the general approach is necessary.
The subject matter here can be highly technical, but will be discussed at an intuitive level meant to explain what is going on.</p>
Definition.</strong>
A model $\mathscr{M}$</code> is mathematically Bayesian</em> if it can be fully specified via a prior $\pi(\theta)$</code> and likelihood $f(x \mid \theta)$</code> for which the posterior distribution $f(\theta \mid x)$</code> is well-defined.</p>
Here, $\theta$</code> is an abstract parameter, and $x$</code> is an abstract data set.
The argument for using Bayesian learning, given by Cox’s Theorem, is that conditional probability can be interpreted as an extension of true-false logic under uncertainty.
This is great—but, formality considerations aside, there are scenarios that involve learning from data that are not included in the above definition.
Let’s look at one.</p>
A motivating example</h1>
To illustrate a case not covered by the above definition, consider the problem of learning a function from a finite set of points.
Here, we have a set of points $(y_i, x_i), i=1,\dots,n$</code> and we want to learn a function $y = f(x)$</code> from the data.
A simple Bayesian model for the data can be written as</p>
$$</span></span>
\begin{aligned}</span></span>
y_i &= f(x_i) + \varepsilon_i</span></span>
&</span></span>
\varepsilon_i &\sim \operatorname{N}(0,\sigma^2)</span></span>
&</span></span>
f &\sim\operatorname{GP}(\mu, \Sigma)</span></span>
\end{aligned}</span></span>
$$</span></span></code></pre>
What are we saying here?
If we know $f$</code>, we can use a set of points $x_i$</code> to generate $y_i$</code> by calculating $f(x_i)$</code> and adding Gaussian noise $\varepsilon_i$</code>.
Since we don’t know $f$</code>, we specify its prior probability distribution as a Gaussian process with mean function $\mu: \mathbb{R} \to \mathbb{R}$</code> and covariance function $\Sigma: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$</code>.
Since we’ve specified a conditional and marginal distribution, this defines a joint distribution, so we can try to get the posterior distribution using Bayes’ Rule</p>
$$</span></span>
f(f \mid \boldsymbol{y},\boldsymbol{x}) \propto f(\boldsymbol{y} \mid \boldsymbol{x}, f) \pi(f)</span></span>
.</span></span>
$$</span></span></code></pre>
Except we can’t do that</em>.
The above expression is not well-defined—$\pi(f)$</code> does not exist, because the probability distribution $f \sim\operatorname{GP}(\mu, \Sigma)$</code> is a distribution over a space of functions, not of real numbers—therefore, it has no density in the standard sense.1</a></sup></p>
Why not?
A probability density is a function that assigns a weight to every unit of volume in space.
In one dimension, every interval of the form $[a,b]$</code> is assigned volume $|a-b|$</code>—this depends only on its length, not its location.
In infinite-dimensional spaces, this is impossible.
It can be proven that any notion of volume must depend on both the length and the location. More formally, any translation-invariant analogue of the Lebesgue measure in infinite dimensions fails to be locally finite.2</a></sup></p>
So what do we do?
Is there a sense in which we can consider the above model Bayesian?
Let’s discuss that.</p>
Bayesian learning as conditional probability</h1>
If we’re not allowed to discuss probability densities, what else can we do?
One thing that the definition says is that a model is Bayesian</em> if it is probabilistic</em>.
This entails two parts.</p>

$\mathscr{M}$</code> is specified via a joint probability density $f(\theta, x)$</code> over the parameters and data.</li>
Learning takes place via conditional probability.</li>
</ol>
It turns out that these two intuitive notions are precisely the ones we need.
Informally, this leads to the definition below.</p>
Definition.</strong>
A model $\mathscr{M}$</code> is mathematically Bayesian</em> if it is fully specified via a random variable $(x,\theta)$</code> for which the conditional probability distribution $\theta \mid x$</code> exists for all $x$</code>.</p>
This definition can be made formal using measure-theoretic notions such as regular conditional probability</em>3</a></sup> and disintegration</em>.4</a></sup>
These have various flavors with different technical requirements on $(x,\theta)$</code> that need to be checked to ensure that writing down a probability distribution conditional on a set of data points actually makes sense.
Let’s now look at two different ways of specifying $(x,\theta)$</code> in infinite-dimensional settings where the usual approach fails.</p>
Two infinite-dimensional approaches</h1>
One way to define Bayesian models in infinite-dimensional settings is through a top-down</em> approach.
Here, we specify $\theta \mid x$</code> by selecting a complicated but well-defined infinite-dimensional notion of volume.
Often, the prior distribution is used to select this notion of volume.
From there, we can specify how the posterior distribution changes that volume, by writing down a Radon-Nikodym derivative</em>.5</a></sup>
This viewpoint is often used in the Gaussian measure and Bayesian inverse problem literatures.
The main price we pay is that for many infinite-dimensional models, the prior and posterior distributions may not have the same support—they may fail to be absolutely continuous</em>,6</a></sup> in which case the Radon-Nikodym derivative between them would not exist.</p>
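For the regression model from the motivating example, the re-weighting can be written down explicitly. The expression below is a sketch under the assumptions that the noise variance $\sigma^2$</code> is known and that evaluating $f$</code> at the points $x_i$</code> is well-defined under the prior. Here $\pi$</code> denotes the Gaussian process prior viewed as a measure rather than a density, and $\pi^{\boldsymbol{y}}$</code> is notation introduced here for the posterior measure.

```latex
$$
\frac{\mathrm{d}\pi^{\boldsymbol{y}}}{\mathrm{d}\pi}(f)
\propto
\exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n \left( y_i - f(x_i) \right)^2 \right)
$$
```

The right-hand side is the Gaussian likelihood of the data viewed as a function of $f$</code>. It is strictly positive and bounded, which suggests that in this particular model the posterior and prior remain absolutely continuous with respect to one another.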
Alternatively, we could use a bottom-up</em> approach.
Here, we define a family of probability distributions using finite-dimensional slices of our parameter space, using Kolmogorov’s Extension Theorem as our primary theoretical tool for handling the infinite-dimensional object.
This is the primary viewpoint in the Gaussian process and Dirichlet process literatures.
The main price we pay is that from this perspective, we can only reason about the infinite-dimensional object we wish to study indirectly.
This may cause us to make poor choices, such as writing down algorithms that stop working as we approach the infinite-dimensional limit,7</a></sup> which are easily avoided with a more direct perspective.</p>
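The bottom-up view is also how Gaussian processes are usually handled on a computer: we only ever work with the multivariate Gaussian obtained by evaluating the process at finitely many points. The sketch below draws one such finite-dimensional sample. The zero mean function, the squared exponential covariance, the jitter term, and the name sample_gp</code> are assumptions made for illustration.

```julia
using LinearAlgebra, Random

μ(x) = 0.0                        # mean function (assumed zero)
Σ(x, y) = exp(-(x - y)^2 / 2)     # covariance function (assumed squared exponential)

# Draw the process at the finite set of points xs. This is a sample
# from the multivariate Gaussian N(m, K) with m_i = μ(x_i), K_ij = Σ(x_i, x_j).
function sample_gp(rng, xs)
    m = μ.(xs)
    K = [Σ(a, b) for a in xs, b in xs]
    # A small jitter keeps the Cholesky factorization numerically stable.
    L = cholesky(Symmetric(K + 1e-8 * I)).L
    return m .+ L * randn(rng, length(xs))
end

xs = collect(0.0:1.0:4.0)
fx = sample_gp(MersenneTwister(0), xs)
```

Nothing in this code refers to the infinite-dimensional object directly: the function $f$</code> only appears through its values at xs</code>, which is exactly the indirect reasoning described above.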
Cromwell’s Rule and some surprising consequences</h1>
We briefly mentioned that in infinite-dimensional settings, prior and posterior distributions may not be absolutely continuous with one another.
This property deserves some attention.
Consider Bayes’ Rule for probabilities</p>
$$</span></span>
\mathbb{P}(B \mid A) = \frac{\mathbb{P}(A \mid B) \mathbb{P}(B)}{\mathbb{P}(A)}</span></span>
$$</span></span></code></pre>
and note that, for $\mathbb{P}(A)$</code> nonzero, $\mathbb{P}(B) = 0$</code> implies $\mathbb{P}(B \mid A) = 0$</code>, no matter what $A$</code> is.
By analogy, if $A$</code> is data and $B$</code> is an event of interest, then Bayes’ Rule ignores the data if the prior probability is zero.
This is often not desirable, which leads to Cromwell’s Rule</em>,8</a></sup> given below.</p>

To avoid making learning impossible, the use of prior probabilities that are zero or one should be avoided.</p>
</blockquote>
Except, in many infinite-dimensional settings, this doesn’t apply because $\mathbb{P}(A)$</code> may be zero.
Indeed, it is easy to construct examples where the prior probability of an event is zero, but the posterior probability is nonzero—more formally, where the posterior is not absolutely continuous with respect to the prior.
This is not an esoteric occurrence: even something as basic as adding a mean function to a Gaussian process can break absolute continuity.9</a></sup>
Let’s examine a case where this happens.</p>
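Before doing so, it may help to see the finite-dimensional version of Cromwell’s Rule numerically. The sketch below applies Bayes’ Rule to three discrete hypotheses. The prior, the likelihood values, and the function name bayes</code> are assumptions chosen for illustration.

```julia
# Bayes' Rule for finitely many hypotheses: multiply and normalize.
bayes(prior, lik) = (prior .* lik) ./ sum(prior .* lik)

prior = [0.0, 0.5, 0.5]      # the first hypothesis is ruled out a priori
lik   = [0.99, 0.01, 0.01]   # yet the data strongly favor it

post = bayes(prior, lik)
println(post)
```

However strongly the data favor the first hypothesis, its posterior probability stays at zero, because here the normalizing constant is nonzero.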
Breaking probabilistic impossibility</h1>
Consider the following model.</p>
$$</span></span>
\begin{aligned}</span></span>
y_i \mid F &\sim F</span></span>
&</span></span>
F &\sim\operatorname{DP}(\alpha, \delta_0)</span></span>
\end{aligned}</span></span>
$$</span></span></code></pre>
where $\delta_0$</code> is a Dirac measure that places all of its probability on zero.
Under the prior, we have</p>
$$</span></span>
\mathbb{P}(F \neq \delta_0) = 0</span></span>
.</span></span>
$$</span></span></code></pre>
The standard posterior for this model is</p>
$$</span></span>
F \mid \boldsymbol{y} \sim\operatorname{DP}\left(\alpha + n, \frac{\alpha}{\alpha+n}\delta_0 + \frac{n}{\alpha+n}\hat{F}_n\right)</span></span>
$$</span></span></code></pre>
where $n$</code> is the length of $\boldsymbol{y}$</code> and $\hat{F}_n$</code> is the empirical CDF of $\boldsymbol{y}$</code>.
But we can tell immediately that</p>
$$</span></span>
\mathbb{P}(F \neq \delta_0 \mid \boldsymbol{y}) > 0</span></span>
.</span></span>
$$</span></span></code></pre>
This has a whole host of bizarre consequences.
Since $F \mid \boldsymbol{y}$</code> is not absolutely continuous with respect to $F$</code>, we see that in infinite dimensions, data may convince us to believe in something we in a sense thought was impossible.
This behavior is both surprising and typical—conditional probability can act in complicated ways.</p>
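To make the posterior above concrete, the following sketch computes the atoms and weights of its base measure for a given data vector. It assumes $\alpha > 0$</code> and a finite number of observations, and the function name posterior_base</code> and the example data are hypothetical.

```julia
# Atoms and weights of the posterior base measure
#   α/(α+n) δ_0 + n/(α+n) F̂_n,
# where F̂_n places mass 1/n on each observation.
function posterior_base(α, y)
    n = length(y)
    weights = Dict{Float64,Float64}(0.0 => α / (α + n))
    for yi in y
        weights[yi] = get(weights, yi, 0.0) + 1 / (α + n)
    end
    return weights
end

w = posterior_base(1.0, [0.0, 2.0, 2.0])
mass_away_from_zero = 1 - w[0.0]
```

If at least one observation is nonzero, the base measure places positive mass away from zero. Observing such data is itself an event of prior probability zero under this model, which is precisely the situation where Cromwell’s Rule does not apply.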
What it all means</h1>
In my view, an abstract model is Bayesian</em> if it is probabilistic</em> and learning takes place through conditional probability</em>.
In well-behaved finite-dimensional settings, this means that learning takes place using Bayes’ Rule.
There, we have a likelihood</em> $f(x \mid \theta)$</code> that acts as the generative distribution for the data given the parameters, and a prior</em> that describes what sorts of parameters we’d like to regularize the learning process towards.
In full generality, however, neither the generative nature of the likelihood nor the use of Bayes’ Rule matters: it is the use of conditional probability that is important.
From a philosophical standpoint this makes sense: learning is just reasoning about something we don’t know using the things we do, using the mathematical structure of conditional probability.</p>
Once we’ve taken the general perspective, we are free to define models in infinite-dimensional settings.
Such models are powerful and have proven useful in many applications, but at times they may behave bizarrely.
It’s worthwhile to take a moment to step back, appreciate, and understand why the expressions we calculate are the way they are.</p>
References</h1>



The standard notion of volume is taken to be the Lebesgue measure. See Chapter 3 of Probability and Stochastics.10</a></sup> ↩</a></p>
</li>

See Section 1.2 of Analysis and Probability on Infinite-Dimensional Spaces.11</a></sup> ↩</a></p>
</li>

See Chapter 2 of Probability and Stochastics.10</a></sup> ↩</a></p>
</li>

See Section 2 of Conditioning as Disintegration.12</a></sup> ↩</a></p>
</li>

A Radon-Nikodym derivative tells us how to re-weight one probability measure to obtain another one. See Chapter 5 of Probability and Stochastics.10</a></sup> ↩</a></p>
</li>

If two measures are mutually absolutely continuous, they assign nonzero probability to the same events. See Chapter 5 of Probability and Stochastics.10</a></sup> ↩</a></p>
</li>

A recent line of work13</a></sup> has sought to prevent Markov Chain Monte Carlo algorithms from slowing down for high-dimensional models by ensuring their infinite-dimensional limits are well-defined. ↩</a></p>
</li>

See Chapter 6 Section 8 of Understanding Uncertainty.14</a></sup> ↩</a></p>
</li>

The space of vectors that can be added to a Gaussian process while preserving absolute continuity is called its Cameron-Martin</em> space. See Chapter 5 of Lectures on Gaussian Processes.15</a></sup> ↩</a></p>
</li>

E. Çınlar. Probability and Stochastics, 2010. ↩</a> ↩2</a> ↩3</a> ↩4</a></p>
</li>

N. Eldredge. Analysis and Probability on Infinite-Dimensional Spaces, 2016. ↩</a></p>
</li>

J. T. Chang and D. Pollard. Conditioning as Disintegration. Statistica Neerlandica 51(3), 1997. ↩</a></p>
</li>

S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC Methods for Functions: Modifying Old Algorithms to Make Them Faster.  Statistical Science 28(3), 2013. ↩</a></p>
</li>

D. Lindley. Understanding Uncertainty, 2006. ↩</a></p>
</li>

M. Lifshits. Lectures on Gaussian Processes, 2012. ↩</a></p>
</li>
</ol>
</section>


What does it mean to be Bayesian?
2017-11-03T00:00:00+00:00
Bayesian statistics provides powerful theoretical tools, but it is also sometimes viewed as a philosophical framework.
This has led to rich academic debates over what statistical learning is and how it should be done.
Academic debates are healthy when their content is precise and independent issues are not conflated.
In this post, I argue that it is not always meaningful to consider the merits of Bayesian learning directly, because the fundamental questions surrounding it encompass not one issue, but several, that are best understood independently.
These can be viewed informally as follows.</p>

A model is mathematically Bayesian</em> if it is defined using Bayes’ Rule.</li>
A procedure is computationally Bayesian</em> if it involves calculation of a full posterior distribution.</li>
</ul>
The key idea of this post is that the two notions above are different, and that the common term Bayesian</em> is often ambiguous.
This makes it unclear, for instance, that there are situations where it makes sense to be mathematically but not computationally Bayesian.
Let’s disentangle the terminology and explore the concepts in more detail.</p>
Motivating Example: Logistic Lasso</h1>
To make my arguments concrete, I now introduce the Logistic Lasso model, beginning with notation.
Let $\mathbf{X}$</code> be the matrix of size $N \times p$</code> to be used for predicting the binary vector $\boldsymbol{y}$</code> of size $N\times 1$</code>, let $\boldsymbol\beta$</code> be the parameter vector, and let $\phi$</code> be the logistic function.</p>
From the classical perspective, the Logistic Lasso model1</a></sup> involves finding the estimator</p>
$$</span></span>
\boldsymbol{\hat\beta} = \underset{\boldsymbol\beta}{\arg\min}\left[ \sum_{i=1}^N -y_i\ln\left( \phi(\mathbf{X}_i\boldsymbol\beta) \right) - (1-y_i)\ln\left(1 - \phi(\mathbf{X}_i\boldsymbol\beta)\right) + \lambda\vert\vert\boldsymbol\beta\vert\vert_1\right]</span></span>
$$</span></span></code></pre>
for $\lambda \in \mathbb{R}^+$</code>, where $\vert\vert\cdot\vert\vert_1$</code> denotes the $\ell^1$</code> norm. On the other hand, the Bayesian Logistic Lasso model2</a></sup> is specified using the likelihood and prior</p>
$$</span></span>
\begin{aligned}</span></span>
y_i \mid \boldsymbol\beta &\sim \operatorname{Ber}\left(\phi(\mathbf{X}_i\boldsymbol\beta)\right)</span></span>
&</span></span>
\boldsymbol\beta&\sim \operatorname{Laplace} (\lambda^{-1})</span></span>
\end{aligned}</span></span>
$$</span></span></code></pre>
for which the posterior distribution is found via Bayes’ Rule.</p>
For the Logistic Lasso, both formulations are equivalent3</a></sup> in the sense that they yield the same point estimates.
This connection is discussed in detail in my previous post</a>.
Since the same model can be expressed both ways, it may be unclear to someone unfamiliar with Bayesian statistics what people might disagree about here.
Let’s proceed to that.</p>
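Before doing so, here is a short sketch of why the point estimates coincide. It assumes that the Laplace prior has independent components with density $\pi(\boldsymbol\beta) \propto \exp(-\lambda\vert\vert\boldsymbol\beta\vert\vert_1)$</code>, which is how I read the scale parameter $\lambda^{-1}$</code> above. Taking the negative logarithm of the posterior gives

```latex
$$
\begin{aligned}
-\ln f(\boldsymbol\beta \mid \boldsymbol{y})
&= -\ln f(\boldsymbol{y} \mid \boldsymbol\beta) - \ln \pi(\boldsymbol\beta) + \text{const}
\\
&= \sum_{i=1}^N \left[ -y_i\ln\left( \phi(\mathbf{X}_i\boldsymbol\beta) \right) - (1-y_i)\ln\left(1 - \phi(\mathbf{X}_i\boldsymbol\beta)\right) \right] + \lambda\vert\vert\boldsymbol\beta\vert\vert_1 + \text{const}
\end{aligned}
$$
```

where the constant does not depend on $\boldsymbol\beta$</code>. Minimizing this expression over $\boldsymbol\beta$</code> is the same optimization problem as the classical one, so the posterior mode is the classical estimator $\boldsymbol{\hat\beta}$</code>.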
Statistical Learning Theory</h1>
The first philosophical question we consider is what statistical learning is.
This fundamental question has been considered by a variety of people throughout history.
One formulation—due to Vapnik4</a></sup>—involves defining a loss function</em> $L(y, \hat{y})$</code> for predicted data, and finding a function $f$</code> that minimizes the expected loss</p>
$$</span></span>
\underset{f}{\arg\min} \int_\Omega L(y, f(x)) \,\mathrm{d} F(x,y)</span></span>
$$</span></span></code></pre>
with respect to an unknown distribution $F(x,y)$</code>.
This loss is then approximated in various ways because the data is finite—for instance, by restricting the domain of optimization.
In this approach, a statistical learning problem</em> is defined to be a functional optimization problem</em>, the problem’s answer</em> is given by the function $f$</code>, and the model $\mathscr{M}$</code> is given by the loss function together with whatever approximations are made. For Logistic Lasso, we assume that the functional form of $f$</code> is given by $\phi(\mathbf{X}\boldsymbol\beta)$</code>, and that $L$</code> is $\ell^1$</code>-regularized cross-entropy loss.</p>
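As a minimal sketch, the Logistic Lasso loss can be written as an ordinary function of $\boldsymbol\beta$</code> evaluated on the observed data. The function names and the toy data below are assumptions made for illustration, and a practical implementation would compute the logarithms in a more numerically careful way.

```julia
using LinearAlgebra

ϕ(t) = 1 / (1 + exp(-t))   # logistic function

# ℓ¹-regularized cross-entropy loss on the observed data.
function logistic_lasso_objective(β, X, y, λ)
    p = ϕ.(X * β)
    cross_entropy = -sum(y .* log.(p) .+ (1 .- y) .* log.(1 .- p))
    return cross_entropy + λ * norm(β, 1)
end

X = [1.0 0.0; 0.0 1.0]
y = [1, 0]
value_at_zero = logistic_lasso_objective([0.0, 0.0], X, y, 2.0)
```

Any optimizer that can handle the non-smooth $\ell^1$</code> term can then be applied to this function to obtain the estimator.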
Bayesian Theory</h1>
The other formalism we consider involves defining statistical learning more abstractly.
We suppose that we are given a parameter $\theta$</code> and data set $x$</code>.
We define a set $\Theta$</code> consisting of true-false statements $\theta = \theta'$</code> and $x = x'$</code> for all possible parameter values $\theta'$</code> and data values $x'$</code>.
From the data, we know the statement $x=x'$</code> is true—but we do not know which $\theta'$</code> makes it so that $\theta = \theta'$</code> is true.
Thus, we cannot simply deduce $\theta$</code> via logical reasoning, and must extend the concept of logical reasoning to accommodate uncertainty.</p>
To do so, we suppose that there is a relationship between $x$</code> and $\theta$</code> such that different values of $x$</code> may change the relative truth of different values of $\theta$</code>.
Thus, we seek to define a function $\mathbb{P}(\theta = \theta' \mid x = x')$</code> such that if $x=x'$</code> is true, the function tells us how close to true or to false $\theta=\theta'$</code> is.
To perform logical reasoning under uncertainty</em>, we need to specify two probability distributions—the likelihood</em> $f(x \mid \theta)$</code> and prior</em> $\pi(\theta)$</code>, and calculate</p>
$$</span></span>
f(\theta \mid x) = \frac{f(x \mid \theta) \pi(\theta)}{\int_\Theta f(x \mid \theta) \pi(\theta) \,\mathrm{d} \theta} \propto f(x \mid \theta) \pi(\theta)</span></span>
$$</span></span></code></pre>
using Bayes’ Rule, which gives us the posterior</em> distribution.
In this approach, statistical learning</em> is taken to mean reasoning under uncertainty</em>, the answer</em> is given by the probability distribution $f(\theta \mid x)$</code>, and the model $\mathscr{M}$</code> is given by the likelihood together with the prior.
For Logistic Lasso, we assume that the likelihood is Bernoulli, and that the prior is Laplace.</p>
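For a single scalar coefficient, Bayes' Rule can be carried out numerically on a grid, which may help make the formula above less abstract. The sketch below assumes a one-dimensional Logistic Lasso with a Laplace prior of scale $\lambda^{-1}$, replaces the normalizing integral by a Riemann sum, and uses toy data invented for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function, playing the role of phi.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_unnormalized_posterior(beta, xs, ys, lam):
    """log f(x | theta) + log pi(theta), up to an additive constant."""
    log_lik = 0.0
    for x_i, y_i in zip(xs, ys):
        p = sigmoid(x_i * beta)
        log_lik += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    log_prior = -lam * abs(beta)  # Laplace prior with scale 1 / lam
    return log_lik + log_prior

# Toy data, invented purely for illustration.
xs = [0.5, -1.5, 2.0, -0.3, 1.0]
ys = [1, 0, 1, 0, 0]
lam = 1.0

step = 0.01
grid = [-5.0 + step * i for i in range(1001)]
weights = [math.exp(log_unnormalized_posterior(b, xs, ys, lam)) for b in grid]
normalizer = sum(weights) * step  # Riemann-sum stand-in for the integral
posterior = [w / normalizer for w in weights]
posterior_mean = sum(b * p for b, p in zip(grid, posterior)) * step
```

The answer here is the whole list `posterior`, not a single number: summaries such as `posterior_mean` are derived from it afterwards.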
Interpretation of Models</h1>
At first glance, the theories may appear somewhat different, but the Logistic Lasso—and just about every model used in practice—can be formalized in both ways.
This leads to the first question.</p>

Should we interpret statistical models as probability distributions or as loss functions?</p>
</blockquote>
The answer, of course, depends on the preferences of the person being asked—if we want, we may interpret a model whose loss function corresponds to a posterior distribution in a Bayesian way.
The probabilistic structure it possesses can be a useful theoretical tool for understanding its behavior.
This lets us see for instance that if priors are considered subjective, regularizers must be as well.
We conclude with an informal definition of this class of models.</p>
Definition.</strong>
A model $\mathscr{M}$</code> is mathematically Bayesian</em> if it can be fully specified via a prior $\pi(\theta)$</code> and likelihood $f(x \mid \theta)$</code> for which the posterior distribution $f(\theta \mid x)$</code> is well-defined.</p>
Assessment of Inferential Uncertainty</h1>
The second question does not concern the model in a mathematical sense.
Instead, we consider an abstract procedure $\mathscr{P}$</code> that utilizes a model $\mathscr{M}$</code> to do something useful.
Here, we encounter our second question.</p>

Should we assess uncertainty regarding what was learned about $\theta$</code> from the data by computing the posterior distribution $f(\theta \mid x)$</code>?</p>
</blockquote>
Often, assessing inferential uncertainty is interesting, but not always.
One important note is that for any given data set, the uncertainty given by $f(\theta \mid x)$</code> is completely determined by the specification of $\mathscr{M}$</code>.
If $\mathscr{M}$</code> is not the correct model, its uncertainty estimates may be arbitrarily bad, even if its predictions are good.
Thus, we may prefer to not assess uncertainty at all, rather than delude ourselves into thinking we know it.</p>
Similarly, for some problems there may exist a simple and easy way to determine whether $\theta$</code> is good or not.
For example, in image classification, we might simply ask a human if the labels produced by $\theta$</code> are reasonable.
This might be far more effective than using the probability distribution $f(\theta \mid x)$</code> to compare the chosen value for $\theta$</code> to other possible values, especially when calculating $f(\theta \mid x)$</code> is challenging.</p>
This leads to a choice undertaken by the practitioner: should $f(\theta \mid x)$</code> be calculated, or is picking one value $\hat\theta$</code> good enough?
In some cases, such as when a decision-theoretic analysis is performed, $f(\theta \mid x)$</code> is indispensable; other times it is unnecessary.
We conclude with an informal definition encompassing this choice.</p>
Definition.</strong>
A statistical procedure $\mathscr{P}$</code> that makes use of a model $\mathscr{M}$</code> is computationally Bayesian</em> if it involves calculation of the full posterior distribution $f(\theta \mid x)$</code> in at least one of its steps.</p>
Disentangling the Disagreements</h1>
It is unfortunate that the term Bayesian</em> has come to mean mathematically Bayesian</em> and computationally Bayesian</em> simultaneously.
In my opinion, these distinctions should be considered separately, because they concern two very different questions.
In the mathematical case, we are asking whether or not to interpret our model using its probabilistic representation.
In the computational case, we are asking whether calculating the entire distribution is necessary, or whether one value suffices.</p>
A model’s Bayesian representation can be useful as a theoretical tool, whether we calculate the posterior or not.
If one value does suffice, we should not discard the probabilistic interpretation entirely, because it might help us understand the model’s structure.
For the Logistic Lasso, the Bayesian approach makes it obvious where cross-entropy loss comes from: it maps uniquely to the Bernoulli likelihood.</p>
It is unfortunate that the two cases are often conflated.
It is common to hear practitioners say that they are not interested in whether models are Bayesian or frequentist—instead, it matters whether or not they work.
More often than not, models can be interpreted both ways, so the distinction’s premise is itself an illusion.
Every mathematical perspective tells us something about the objects we are studying.
Even if we do not perform Bayesian calculations, it can often still be useful to think of models in a Bayesian way.</p>
References</h1>



R. Tibshirani. Regression Shrinkage and Selection via the Lasso. JRSSB 58(1), 1996. ↩</a></p>
</li>

T. Park and G. Casella. The Bayesian Lasso. JASA 103(402), 2008. ↩</a></p>
</li>

A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. 2013. ↩</a></p>
</li>

V. Vapnik. The Nature of Statistical Learning Theory. 1995. ↩</a></p>
</li>
</ol>
</section>


Deep Learning with function spaces
2017-08-16T00:00:00+00:00
Deep learning is perhaps the single most important breakthrough in statistics, machine learning, and artificial intelligence that has been popularized in recent years.
It has allowed us to classify images—for decades a challenging problem—nowadays usually with better-than-human accuracy.
It has solved Computer Go, which for decades was the classical example of a board game that was exceedingly difficult for computers to play.
But what exactly is deep learning?</p>
Many popular explanations involve analogies with the human brain, where deep learning models are interpreted as complex networks of neurons interacting with one another.
These perspectives are useful, but they’re not math: just because deep learning models mimic the brain, doesn’t mean they provably work.
This post will highlight some ideas that may be helpful in moving toward an understanding of why deep learning works, presented at an intuitive level.
The focus will be on high-level concepts, omitting algebraic details such as the precise form of tensor products.</p>
The Function Space Perspective</h1>
The key idea of this post is that to understand why deep learning works, we should not work with the network directly.
Instead, we will define a model for learning on a space of functions, truncate that model, and obtain deep learning.</p>
Consider the model</p>
$$</span></span>
\hat{\boldsymbol{y}} = f(\mathbf{X})</span></span>
$$</span></span></code></pre>
where the goal is to learn the function $f$</code> that maps data $\mathbf{X}$</code> to the predicted value $\hat{\boldsymbol{y}}$</code>.
But wait, how do we go about learning a function?
Let’s first consider a single-variable function $f(x): \R \to \R$</code> and recall that any function may be written as an infinite sum with respect to a location-scale basis, i.e. we have for an appropriately defined function $\sigma$</code> that</p>
$$</span></span>
f(x) = \sum_{k=1}^\infty a_k \, \sigma(b_k x + c_k) + d_k</span></span>
.</span></span>
$$</span></span></code></pre>
What’s happening here?
We’re taking the function $\sigma$</code>, shifting it left-right by $c_k$</code>, stretching it by a combination of $a_k$</code> and $b_k$</code>, and shifting it up-down by $d_k$</code>.
As long as $\sigma$</code> is sufficiently rich to form a basis on $\R$</code>, if we add up infinitely many of them, we can approximate $f$</code> to any precision we want.
To make learning possible, let’s truncate the sum, so that we sum $K$</code> elements instead of $\infty$</code>, and get</p>
$$</span></span>
f(x) = \sum_{k=1}^K a_k \, \sigma(b_k x + c_k) + d_k</span></span>
.</span></span>
$$</span></span></code></pre>
We now have a finite set of parameters, so given a data set $(\mathbf{X},\boldsymbol{y})$</code>, we can define a probability distribution for $\boldsymbol{y}$</code> under the predicted values $\hat{\boldsymbol{y}}$</code>, and learn the coefficients using Bayes’ Rule</a>.</p>
But wait: the expressions we get by following this procedure, extended to matrices and vectors, are exactly those given by a 1-layer fully connected network</a>.
This is what a fully connected network does, and this is why it works: we are expanding an arbitrary function with respect to a basis, and learning the coefficients of the expansion using Bayes’ Rule.1</a></sup>
That’s it!</p>
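As a small illustration of the truncated expansion, the sketch below evaluates it for scalar inputs with $\sigma$ taken to be the ReLU. The function names are made up for this example, and a single offset `d` is used in place of the per-term $d_k$, which is a simplifying assumption.

```python
def relu(z):
    return max(0.0, z)

def basis_expansion(x, a, b, c, d, sigma=relu):
    """Evaluate f(x) = sum_k a_k * sigma(b_k * x + c_k) + d for scalar x."""
    return sum(a_k * sigma(b_k * x + c_k) for a_k, b_k, c_k in zip(a, b, c)) + d

def abs_via_relu(x):
    # With K = 2 terms, relu(x) + relu(-x) reproduces |x| exactly.
    return basis_expansion(x, a=[1.0, 1.0], b=[1.0, -1.0], c=[0.0, 0.0], d=0.0)
```

Here `abs_via_relu(-2.0)` evaluates to `2.0`: two hidden units are enough to represent the absolute value function, and learning amounts to finding coefficients like these from data.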
Going Deep</h1>
With the above perspective in mind, let’s consider deep learning.
We’re going to apply another trick: rather than learning $f$</code> directly, let’s instead define functions $f^{(1)},f^{(2)},f^{(3)}$</code> such that</p>
$$</span></span>
\hat{\boldsymbol{y}} = f(\mathbf{X}) = f^{(1)}\left\{f^{(2)}\left[f^{(3)}\left(\mathbf{X}\right)\right]\right\}</span></span>
$$</span></span></code></pre>
It’s not obvious why we should do this, but let’s go with it for now.
Then, let $\sigma$</code> be the ReLU function, and expand $f^{(3)}$</code> with respect to that basis, just as we did above, but with matrix-vector notation, to get</p>
$$</span></span>
\hat{\boldsymbol{y}} = f^{(1)}\left\{f^{(2)}\left[ \boldsymbol{a}^{(3)} \sigma\left(\mathbf{X}\boldsymbol{b}^{(3)} + \boldsymbol{c}^{(3)}\right) + \boldsymbol{d}^{(3)}  \right]\right\}</span></span>
.</span></span>
$$</span></span></code></pre>
Now, let’s expand $f^{(2)}$</code>, yielding</p>
$$</span></span>
\hat{\boldsymbol{y}} = f^{(1)}\left\{\boldsymbol{a}^{(2)}\sigma\left[\left(\boldsymbol{a}^{(3)} \sigma\left(\mathbf{X}\boldsymbol{b}^{(3)} + \boldsymbol{c}^{(3)}\right) + \boldsymbol{d}^{(3)}\right)\boldsymbol{b}^{(2)} + \boldsymbol{c}^{(2)}\right] + \boldsymbol{d}^{(2)}\right\}</span></span>
.</span></span>
$$</span></span></code></pre>
Notice that we can set $\boldsymbol{b}^{(2)} = \boldsymbol{1}$</code> and $\boldsymbol{c}^{(2)} = \boldsymbol{0}$</code> with no loss of generality to slightly simplify our expression.
Upon expanding $f^{(1)}$</code>, we are left with</p>
$$</span></span>
\hat{\boldsymbol{y}} = \boldsymbol{a}^{(1)}\sigma\left\{\boldsymbol{a}^{(2)}\sigma\left[\boldsymbol{a}^{(3)} \sigma\left(\mathbf{X}\boldsymbol{b}^{(3)} + \boldsymbol{c}^{(3)}\right) + \boldsymbol{d}^{(3)}\right] + \boldsymbol{d}^{(2)}\right\} + \boldsymbol{d}^{(1)}</span></span>
$$</span></span></code></pre>
which is exactly the expression for a 3-layer fully connected network.</p>
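The composition can be sketched in a few lines. The code below is an illustration under assumptions rather than a reference implementation: it acts on a single input vector with left-multiplication (instead of the row convention $\mathbf{X}\boldsymbol{b}$ used above), and each layer is a hypothetical tuple `(A, B, c, d)` of coefficients.

```python
def relu_vec(v):
    return [max(0.0, v_i) for v_i in v]

def matvec(M, v):
    # Multiply a matrix, stored as a list of rows, by a vector.
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def expand(h, A, B, c, d):
    """One truncated ReLU expansion: A relu(B h + c) + d."""
    hidden = relu_vec([z + c_k for z, c_k in zip(matvec(B, h), c)])
    return [o + d_i for o, d_i in zip(matvec(A, hidden), d)]

def three_layer(x, layer1, layer2, layer3):
    # f(x) = f1(f2(f3(x))): the innermost expansion f3 is applied first.
    return expand(expand(expand(x, *layer3), *layer2), *layer1)

# Hand-picked coefficients, chosen only so the result is easy to check.
layer3 = ([[1.0, 1.0]], [[1.0], [-1.0]], [0.0, 0.0], [0.0])  # f3(x) = |x|
layer2 = ([[1.0]], [[1.0]], [-1.0], [0.0])                   # f2(u) = relu(u - 1)
layer1 = ([[2.0]], [[1.0]], [0.0], [0.5])                    # f1(v) = 2 relu(v) + 0.5

out_a = three_layer([3.0], layer1, layer2, layer3)
out_b = three_layer([-0.5], layer1, layer2, layer3)
```

With these coefficients `out_a` is `[4.5]` and `out_b` is `[0.5]`, which can be verified by applying $f^{(3)}$, $f^{(2)}$, and $f^{(1)}$ by hand.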
So, what is deep learning?
Deep learning is a model that learns a function $f$</code> by splitting it up into a sequence of functions $f^{(1)},f^{(2)},f^{(3)},\ldots$</code>, performing a ReLU basis expansion on each one, truncating it, and learning the remaining coefficients using Bayes’ Rule.</p>
Example: why Residual Networks work</h1>
This perspective can be used to understand recently popularized techniques in deep learning.
For illustrative purposes, let’s consider a 3-layer residual network.
Suppose $\mathbf{X}$</code> is of the same dimensionality as the network.
A residual network is a model of the form</p>
$$</span></span>
\begin{aligned}</span></span>
\hat{\boldsymbol{y}} = f(\mathbf{X}) = &f^{(1)}\left\{f^{(2)}\left[f^{(3)}\left(\mathbf{X}\right) + \mathbf{X}\right] + \left[f^{(3)}\left(\mathbf{X}\right) + \mathbf{X}\right]\right\}</span></span>
\\</span></span>
&+ \left\{f^{(2)}\left[f^{(3)}\left(\mathbf{X}\right) + \mathbf{X}\right] + \left[f^{(3)}\left(\mathbf{X}\right) + \mathbf{X}\right]\right\}</span></span>
.</span></span>
\end{aligned}</span></span>
$$</span></span></code></pre>
So, why do residual networks perform better?
Consider the above from a Bayesian learning point of view: we start with a prior distribution—determined uniquely by the regularization term—and end with a posterior distribution that describes what we learned.
Suppose that nothing is learned in the 3rd layer.
Then the posterior distribution must be the same as the prior.
With $L^2$</code> regularization, this means that the posterior mode of the coefficients of the basis expansion of $f^{(3)}$</code> will be zero.
Hence,</p>
$$</span></span>
f^{(3)}(x) = \sum_{k=1}^K 0 \, \sigma(0 \times x + 0) + 0 = 0</span></span>
$$</span></span></code></pre>
and the model collapses to</p>
$$</span></span>
\hat{\boldsymbol{y}} = f(\mathbf{X}) = f^{(1)}\left\{f^{(2)}\left[\mathbf{X}\right] + \mathbf{X}\right\} + \left\{f^{(2)}\left[\mathbf{X}\right] + \mathbf{X}\right\}</span></span>
.</span></span>
$$</span></span></code></pre>
Contrast this with a non-residual network, which collapses to</p>
$$</span></span>
\hat{\boldsymbol{y}} = f(\mathbf{X}) = f^{(1)}\left\{f^{(2)}\left[\boldsymbol{0}\right]\right\} = \text{constant}</span></span>
.</span></span>
$$</span></span></code></pre>
In reality, of course, the network learns something</em> in deeper layers, so behavior isn’t quite this bad.
But, if we suppose that deeper layers learn less and less given the same data, the model must eventually stop working if we keep adding layers.
Thus, standard networks don’t work if we make them too deep.
Residual networks fix the problem.</p>
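The collapse argument is easy to check numerically. The sketch below uses scalar inputs and hand-picked coefficients, all invented for illustration, and sets every coefficient of $f^{(3)}$ to zero to mimic a layer that learned nothing.

```python
def relu(z):
    return max(0.0, z)

def expansion(a, b, c, d):
    """Return the function h -> sum_k a_k relu(b_k h + c_k) + d."""
    def f(h):
        return sum(a_k * relu(b_k * h + c_k) for a_k, b_k, c_k in zip(a, b, c)) + d
    return f

def plain_net(x, f1, f2, f3):
    return f1(f2(f3(x)))

def residual_net(x, f1, f2, f3):
    h3 = f3(x) + x    # skip connection around f3
    h2 = f2(h3) + h3  # skip connection around f2
    return f1(h2) + h2

f1 = expansion([1.0], [1.0], [0.0], 0.0)       # relu(h)
f2 = expansion([0.5], [1.0], [0.0], 0.0)       # 0.5 * relu(h)
f3_dead = expansion([0.0], [0.0], [0.0], 0.0)  # all coefficients at the prior mode

inputs = (-1.0, 1.0, 2.0)
plain_outputs = [plain_net(x, f1, f2, f3_dead) for x in inputs]
residual_outputs = [residual_net(x, f1, f2, f3_dead) for x in inputs]
```

The plain network returns `[0.0, 0.0, 0.0]`, a constant that ignores its input, while the residual network returns `[-1.0, 3.0, 6.0]` and so still passes information about $\mathbf{X}$ to the remaining layers.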
What have we gained from this perspective?</h1>
Thinking about function spaces can make deep learning substantially more understandable.
Instead of thinking about networks, which are complicated, we can think about functions, which are in my view simpler.</p>
The ideas above can for instance be used to understand what convolutional networks do: they make assumptions on how each $f^{(i)}$</code> behaves over space.
Similarly, we can see why ReLU2</a></sup> units might perform slightly better than sigmoid units: because they are unbounded, fewer of them may be required to approximate a given function well.</p>
Part of what makes functions simpler is that it is easy to visualize what scaling and shifting does to them.
For example, it is easy to see that switching from ReLU to Leaky ReLU3</a></sup> units is the same as increasing the bias term in the basis expansion.
It’s certainly possible that this may sometimes be helpful, but it would be a big surprise to me if doing this resulted in substantially better performance across the board.</p>
One major question that the function space perspective raises is why learning $f^{(1)}, f^{(2)}, f^{(3)},\ldots$</code> separately is so much easier than learning $f$</code> directly.
I don’t know of a good answer to this question.</p>
A key benefit of thinking with function spaces is that it gives us a principled way to derive the expressions needed to define and train networks.
The residual networks presented here differ slightly from the original work in which they were presented4</a></sup>—more recent work has proposed precisely the formulas derived here5</a></sup> which were found to improve performance.</p>
I’m not sure why deep learning is not typically presented in this way—the function space perspective is largely omitted from the classical text Deep Learning</em>.6</a></sup>
Overall, I hope that this short introduction has been useful for understanding deep learning and making the structure present in the models more transparent.</p>
References</h1>



See Chapter 20 of Bayesian Data Analysis.7</a></sup> ↩</a></p>
</li>

R. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, H. S. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405(6789), 2000. ↩</a></p>
</li>

A. L. Maas, A. Y. Hannun, A. Y. Ng. Rectifier Nonlinearities Improve Neural Network Acoustic Models. ICML 30(1), 2013. ↩</a></p>
</li>

K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. CVPR 28(1), 2015. ↩</a></p>
</li>

K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. ECCV 14(1), 2016. ↩</a></p>
</li>

See Chapter 6 of Deep Learning.8</a></sup> ↩</a></p>
</li>

A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. 2013. ↩</a></p>
</li>

I. Goodfellow, Y. Bengio, A. Courville. Deep Learning</a>. 2016. ↩</a></p>
</li>
</ol>
</section>


Bayesian Learning - by example
2017-07-05T00:00:00+00:00
Welcome to my blog!
For my first post, I decided that it would be useful to write a short introduction to Bayesian learning, and its relationship with the more traditional optimization-theoretic perspective often used in artificial intelligence and machine learning, presented in a minimally technical fashion.
We begin by introducing an example.</p>
Example: binary classification using a fully connected network</h1>
First, let’s introduce notation. For simplicity suppose there are no biases, and define the following.</p>

$\boldsymbol{y}_{N\times 1}$</code>: a binary vector where each element is a target data point. $N$</code> is the amount of input data.</li>
$\mathbf{X}_{N\times p}$</code>: a matrix where each row is an input data vector, $p$</code> is the dimensionality of each input.</li>
$\boldsymbol\beta^{(x)}_{p \times m}$</code>: the matrix that maps the input to the hidden layer, $m$</code> is the number of hidden units.</li>
$\boldsymbol\beta^{(h)}_{m \times 1}$</code>: the vector that maps the hidden layer to the output.</li>
$\sigma$</code>: the network’s activation function, for instance a ReLU function.</li>
$\phi$</code>: the softmax function.</li>
</ul>
The standard approach</h1>
We begin by defining an optimization problem.
Let $\boldsymbol\beta$</code> be a $k$</code>-dimensional vector consisting of all values of $\boldsymbol\beta^{(x)}$</code> and $\boldsymbol\beta^{(h)}$</code> stacked together.
Our network’s prediction $\boldsymbol{\hat{y}} \in [0,1]^N$</code> is given by</p>
$$</span></span>
\hat{\boldsymbol{y}} = \phi\left(\sigma\left(\mathbf{X} \boldsymbol\beta^{(x)}\right) \boldsymbol\beta^{(h)}\right)</span></span>
$$</span></span></code></pre>
Now, we proceed to learn the weights.
Let $\boldsymbol{\hat\beta}$</code> be the learned values for $\boldsymbol\beta$</code>, let $\Vert\cdot\Vert$</code> be the $\ell^2$</code> norm, fix some $\lambda \in \R^+$</code>, and set</p>
$$</span></span>
\boldsymbol{\hat\beta} = \underset{\boldsymbol\beta}{\arg\min}\left[ \sum_{i=1}^N -y_i\ln(\hat{y}_i) - (1-y_i)\ln(1 - \hat{y}_i) + \lambda\Vert\boldsymbol\beta\Vert^2\right]</span></span>
.</span></span>
$$</span></span></code></pre>
The expression being minimized is called cross entropy loss</em>.1</a></sup>
The loss is differentiable, so we can minimize it by using gradient descent or any other method we wish.
Learning takes place by minimizing the loss, and the values we learn—here, $\boldsymbol{\hat\beta}$</code>—are a point in $\R^k$</code>.</p>
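For readers who prefer code, here is a minimal sketch of the prediction and the regularized loss for this network. It makes two assumptions beyond the text: $\phi$ is implemented as the logistic function, which is what a two-class softmax reduces to for a single output, and the weights and data are toy values invented for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function, playing the role of phi.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def relu(z):
    return max(0.0, z)

def predict(X, beta_x, beta_h):
    """y_hat = phi(sigma(X beta_x) beta_h), one probability per row of X."""
    y_hat = []
    for x_i in X:
        hidden = [relu(sum(x_ij * beta_x[j][u] for j, x_ij in enumerate(x_i)))
                  for u in range(len(beta_h))]
        y_hat.append(sigmoid(sum(h_u * w_u for h_u, w_u in zip(hidden, beta_h))))
    return y_hat

def loss(X, y, beta_x, beta_h, lam):
    """Cross-entropy plus lam times the squared l2 norm of all weights."""
    cross_entropy = sum(-y_i * math.log(p) - (1 - y_i) * math.log(1.0 - p)
                        for y_i, p in zip(y, predict(X, beta_x, beta_h)))
    sq_norm = sum(w * w for row in beta_x for w in row) + sum(w * w for w in beta_h)
    return cross_entropy + lam * sq_norm

# Toy inputs with N = 2, p = 2, m = 2.
X = [[1.0, 2.0], [-1.0, 0.5]]
y = [0, 1]
beta_x = [[1.0, 0.0], [0.0, 1.0]]
beta_h = [1.0, -1.0]
y_hat = predict(X, beta_x, beta_h)
```

Any optimizer can then be pointed at `loss`; the sketch deliberately stops short of choosing one, since the text leaves that choice open.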
Why cross-entropy rather than some other mathematical expression?
In most treatments of classification, the reasons given are purely intuitive: for instance, it is often said to stabilize the optimization algorithm.
More rigorous treatments1</a></sup> might introduce ideas from information theory.
We will provide another explanation.</p>
The Bayesian approach</h1>
Let us now define the exact same network, but this time from a Bayesian perspective. We begin by making probabilistic assumptions on our data.
Since we have that $\boldsymbol{y} \in \{0,1\}^N$</code>, and since we assume that the order in which $\boldsymbol{y}$</code> is presented cannot affect learning—this is formally called exchangeability—there is one and only one distribution that $\boldsymbol{y}$</code> can follow: the Bernoulli distribution.
The parameter of that distribution is the same expression $\boldsymbol{\hat{y}}$</code> as before.
Hence, let</p>
$$</span></span>
\boldsymbol{y} \mid \boldsymbol\beta \sim\operatorname{Ber}\left[\phi \left(\sigma\left(\mathbf{X} \boldsymbol\beta^{(x)}\right) \boldsymbol\beta^{(h)}\right)\right]</span></span>
.</span></span>
$$</span></span></code></pre>
This is called the likelihood</em>: it describes the assumptions we are making about the data $\boldsymbol{y}$</code> given the parameters $\boldsymbol\beta$</code>—here, that the data is binary and exchangeable.
Now, define the prior</em> for $\boldsymbol\beta$</code> as</p>
$$</span></span>
\boldsymbol\beta \sim\operatorname{N}_k\left(0, \frac{\lambda^{-1}}{2}\right)</span></span>
.</span></span>
$$</span></span></code></pre>
This describes our assumptions about $\boldsymbol\beta$</code> external to the data—here, we have assumed that all components of $\boldsymbol\beta$</code> are a priori</em> independent mean-zero Gaussians.
We can combine the prior and likelihood using Bayes’ Rule</p>
$$</span></span>
f(\boldsymbol\beta \mid \boldsymbol{y}) = \frac{f(\boldsymbol{y} \mid \boldsymbol\beta) \pi(\boldsymbol\beta)}{\int_{\R^k} f(\boldsymbol{y} \mid \boldsymbol\beta) \pi(\boldsymbol\beta) \operatorname{d}\beta} \propto f(\boldsymbol{y} \mid \boldsymbol\beta) \pi(\boldsymbol\beta)</span></span>
$$</span></span></code></pre>
to obtain the posterior</em> $\boldsymbol\beta \mid \boldsymbol{y}$</code>.
This is a probability distribution: it describes what we learned about $\boldsymbol\beta$</code> from the data.
Learning takes place through the use of Bayes’ Rule, and the values we learn—here, $\boldsymbol\beta \mid \boldsymbol{y}$</code>—are a probability distribution on $\R^k$</code>.</p>
Connecting the two approaches</h1>
Is there any relationship between $\boldsymbol{\hat\beta}$</code> and $\boldsymbol\beta \mid \boldsymbol{y}$</code>?
It turns out, yes—let’s show it. First, let’s write down the posterior</p>
$$</span></span>
f(\boldsymbol\beta \mid \boldsymbol{y}) \propto f(\boldsymbol{y} \mid \boldsymbol\beta) \pi(\boldsymbol\beta) \propto \left[\prod_{i=1}^N \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}\right] \exp\left[\frac{\boldsymbol\beta^T\boldsymbol\beta}{-\lambda^{-1}}\right]</span></span>
.</span></span>
$$</span></span></code></pre>
Now, let’s take logs and simplify:</p>
$$</span></span>
\ln f(\boldsymbol\beta \mid \boldsymbol{y}) = \sum_{i=1}^N y_i \ln(\hat{y}_i) + (1-y_i)\ln(1 - \hat{y}_i) - \lambda\Vert\boldsymbol\beta\Vert^2 + \operatorname{const}</span></span>
.</span></span>
$$</span></span></code></pre>
Having computed that, note that taking logs and adding constants preserve optima, and consider the posterior mode:</p>
$$</span></span>
\begin{aligned}</span></span>
\underset{\boldsymbol\beta}{\arg\max} f(\boldsymbol\beta \mid \boldsymbol{y}) &= \underset{\boldsymbol\beta}{\arg\max} \ln f(\boldsymbol\beta \mid \boldsymbol{y}) </span></span>
\\</span></span>
&=\underset{\boldsymbol\beta}{\arg\max}\left[ \sum_{i=1}^N y_i \ln(\hat{y}_i) + (1-y_i)\ln(1 - \hat{y}_i) - \lambda\Vert\boldsymbol\beta\Vert^2 \right] </span></span>
\\</span></span>
&=\underset{\boldsymbol\beta}{\arg\min}\left[ \sum_{i=1}^N -y_i \ln(\hat{y}_i) - (1-y_i)\ln(1 - \hat{y}_i) + \lambda\Vert\boldsymbol\beta\Vert^2 \right] </span></span>
\\</span></span>
&= \boldsymbol{\hat{\beta}}</span></span>
.</span></span>
\end{aligned}</span></span>
$$</span></span></code></pre>
What have we shown? Minimizing cross-entropy loss is equivalent to maximizing the posterior distribution.
The loss function maps to the likelihood, and the regularization term maps to the prior.</p>
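The equivalence can also be checked numerically: the log of the unnormalized posterior and the negated loss should differ only by a constant that does not depend on $\boldsymbol\beta$. The sketch below uses a plain logistic model as a stand-in for the network, which is an assumption made to keep it short, since the argument only needs $\hat{y}$ to be some function of $\boldsymbol\beta$.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(beta, X):
    # Stand-in for the network: y_hat_i = phi(x_i . beta).
    return [sigmoid(sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))) for x_i in X]

def loss(beta, X, y, lam):
    cross_entropy = sum(-y_i * math.log(p) - (1 - y_i) * math.log(1.0 - p)
                        for y_i, p in zip(y, predict(beta, X)))
    return cross_entropy + lam * sum(b * b for b in beta)

def log_posterior_unnormalized(beta, X, y, lam):
    log_lik = sum(y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
                  for y_i, p in zip(y, predict(beta, X)))
    variance = 1.0 / (2.0 * lam)  # prior variance lambda^{-1} / 2 per component
    log_prior = sum(-0.5 * math.log(2.0 * math.pi * variance) - b * b / (2.0 * variance)
                    for b in beta)
    return log_lik + log_prior

# Toy data and two arbitrary parameter values, invented for illustration.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]]
y = [1, 0, 1, 0]
lam = 0.3
gap_a = log_posterior_unnormalized([0.2, -0.7], X, y, lam) + loss([0.2, -0.7], X, y, lam)
gap_b = log_posterior_unnormalized([-1.1, 0.4], X, y, lam) + loss([-1.1, 0.4], X, y, lam)
```

Both gaps equal $-\tfrac{k}{2}\ln(\pi/\lambda)$ with $k = 2$, the log of the Gaussian normalizing constant, so the minimizer of the loss and the maximizer of the posterior coincide.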
What it all means</h1>
Why is this useful?
It gives us a probabilistic interpretation for learning, which helps us to construct and understand our models.
This is especially useful in more complicated settings: for instance, we might ask, where does $\boldsymbol{\hat{y}} = \sigma(\mathbf{X} \boldsymbol\beta^{(x)}) \boldsymbol\beta^{(h)}$</code> come from? In fact, we can use ideas from Bayesian nonparametrics</em> to derive $\boldsymbol{\hat{y}}$</code> by considering a likelihood on a function space under a ReLU basis expansion.2</a></sup>
The network’s loss and architecture can both be explained in a Bayesian way.</p>
There is much more: we could consider drawing samples from the posterior distribution, to quantify uncertainty about how much we learned about $\boldsymbol\beta$</code> from the data.
Markov Chain Monte Carlo</em>3</a></sup> methods are the most common class of methods for doing so.
We can use ideas from hierarchical Bayesian models to define better regularizers compared to $\ell^2$</code>—the Horseshoe</em>4</a></sup> prior is a popular example.
For brevity, I’ll omit further examples—the book Bayesian Data Analysis</em>5</a></sup> is a good introduction, though it largely focuses on methods of interest mainly to statisticians.</p>
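To give a flavor of what drawing posterior samples looks like, here is a sketch of random-walk Metropolis, one of the simplest MCMC methods, applied to a one-parameter logistic model with a Gaussian prior. The model, data, step size, and function names are all assumptions chosen for illustration, not a recommendation for real problems.

```python
import math
import random

def log_sigmoid(z):
    # Stable log of the logistic function; avoids log(0) for large |z|.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_target(beta, xs, ys, lam):
    """Log posterior up to a constant: Bernoulli likelihood, Gaussian prior."""
    log_lik = sum(y_i * log_sigmoid(x_i * beta) + (1 - y_i) * log_sigmoid(-x_i * beta)
                  for x_i, y_i in zip(xs, ys))
    return log_lik - lam * beta * beta

def random_walk_metropolis(log_density, start, n_samples, step_size, rng):
    samples, accepted = [], 0
    current, current_lp = start, log_density(start)
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_density(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
            accepted += 1
        samples.append(current)
    return samples, accepted / n_samples

# Toy data, invented purely for illustration.
xs = [0.5, -1.5, 2.0, -0.3, 1.0]
ys = [1, 0, 1, 0, 0]
rng = random.Random(0)
samples, acceptance_rate = random_walk_metropolis(
    lambda b: log_target(b, xs, ys, lam=0.5), 0.0, 5000, 1.0, rng)
sample_mean = sum(samples) / len(samples)
```

The spread of `samples` is what quantifies uncertainty about $\boldsymbol\beta$: a point estimate such as the posterior mode would report only a single number.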
At the end of the day, having many different mathematical perspectives enables us to better understand how learning works, because things that are not obvious from one perspective might be easy to see from another.
Whereas the optimization-theoretic approach we began with did not give a clear reason for why we should use cross-entropy loss, from a Bayesian point of view it follows directly out of the binary nature of the data.
Sometimes, the Bayesian approach has little to say about a particular problem; other times it has a lot.
It is useful to know how to use it when the need arises, and I hope this short example has given at least one reason to read about Bayesian statistics in more detail.</p>
References</h1>



See Chapter 5 of Deep Learning.6</a></sup> ↩</a> ↩2</a></p>
</li>

See Chapter 20 of Bayesian Data Analysis.5</a></sup> ↩</a></p>
</li>

See Chapter 11 of Bayesian Data Analysis,5</a></sup> but note that MCMC methods are far more general than presented there. An article7</a></sup> by P. Diaconis gives a rather different overview. ↩</a></p>
</li>

C. M. Carvalho, N. G. Polson, and J. G. Scott. The Horseshoe estimator for sparse signals. Biometrika, 97(2):1–26, 2010. ↩</a></p>
</li>

A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. 2013. ↩</a> ↩2</a> ↩3</a></p>
</li>

I. Goodfellow, Y. Bengio, A. Courville. Deep Learning</a>. 2016. ↩</a></p>
</li>

P. Diaconis. The Markov Chain Monte Carlo revolution. Bulletin of the American Mathematical Society, 46(2):179–205, 2009. ↩</a></p>
</li>
</ol>
</section>

Sculpting Fragile Glass with Agentic Coding

The Road Less Traveled

Where Are We and What Now?

On Successful Research

Gaussian Processes and Statistical Decision-making in Non-Euclidean spaces

Vector-valued Gaussian Processes on Riemannian Manifolds via Gauge Independent Projected Kernels

Learning Contact Dynamics using Physically Structured Neural Networks

Pathwise Conditioning of Gaussian Processes

Matérn Gaussian Processes on Graphs

Matérn Gaussian Processes on Riemannian Manifolds

Efficiently Sampling Functions from Gaussian Process Posteriors

Aligning Time Series on Incomparable Spaces

Asynchronous Gibbs Sampling

Variational Integrator Networks for Physically Structured Embeddings

Modern reference management using BibLaTeX

How to show that a Markov chain converges quickly

Sparse Parallel Training of Hierarchical Dirichlet Process Topic Models

Pólya Urn Latent Dirichlet Allocation

Some macros for making TeX source more readable

Building this website

How to use R packages such as ggplot in Julia

What does it really mean to be Bayesian?

What does it mean to be Bayesian?

Deep Learning with function spaces

Bayesian Learning - by example