Eric Jang

As Rocks May Think

2026-02-04T00:00:00+00:00

You are viewing the mobile version of this page. This content is best viewed on a desktop.

If we consider life to be a sort of open-ended MMO, the game server has just received a major update. All players take note: consider playing differently.

The world has changed a lot since 2022. ChatGPT happened. You can now ask it to construct novel proofs of Erdos problems. Nation states are using AI to automate cyberattacks. You can pre-order a general purpose home humanoid. The Chinese robotics ecosystem is creating more open robots, data, and research than everyone else. Most big tech companies all have a humanoid project in the works. AI generated videos are indistinguishable from reality. The entire global economy is re-organizing around the scale-up of AI models.

Chief among all changes is that machines can code and think quite well now.

Like many others, I spent the last 2 months on a Claude Code bender, grappling with the fact that I no longer need to write code by hand anymore. I've been implementing AlphaGo from scratch (repo will be open sourced soon) to catch up on foundational deep learning techniques, and also to re-learn how to program with the full power of modern coding agents. I've set up Claude to not only write my infra and research ideas, but also propose hypotheses, draw conclusions, and suggest what experiments to try next. For those of you reading on desktop & tablet, the right side of this page shows examples of real prompts that I asked Claude to write for me.

For my "automated AlphaGo researcher" codebase, I created a Claude command /experiment which standardizes an "action" in the AlphaGo research environment as follows:

Create a self-contained experiment folder with datetime prefix and descriptive slug.
Write an experiment routine to a single-file python file and execute it.
Intermediate artifacts and data are saved to data/ and figures/ subdirectories. All files are stored in easy-to-parse formats like CSV files that can be loaded with pandas.
Observe the outcome and draw conclusions from the experiment, suggest what is still unknown and what is now known.

The outcome of the experiment is a report.md markdown that summarizes the latest observation about the world (example).

Here is an example of how I'd use it:

> /experiment I'd like to apply maximal update parameterization to find the best hyperparameters to run my model on as I scale it up. Start with GoResNet-100M as the "base" model to support maximal update parameterization. Use https://github.com/microsoft/mup package if it helps, making sure to add it to pyproject.toml so that it is installed as a dependency. Utilize d-muP https://arxiv.org/abs/2310.02244 as well to ensure depth-wise stability transfer. Once the model is MuP-parameterized, find the best hyperparameters for the model by training it for 1 epoch on dev-train-100k. You can submit up to 4 parallel Ray jobs at a time to train models. Evaluate validation loss and accuracy after every 500 steps. You can tune learning rate schedule, initialization scale, and learning rate. I think critical batch size should be around 32-64. You can refer to 2025-12-26_19-13-resnet-scaling-laws.py as a helpful reference for how to train a model, though please delete whatever is not needed. For all runs, save intermediate checkpoints every 1k steps to research_reports/checkpoints

I can also ask Claude to run sequential experiments to optimize hyperparameters serially:

/experiment Run a series of experiments similar to 2025-12-27_22-18-mup-training-run.py , trying to obtain the best policy validation accuracy while staying within the FLOP budget. but do the following changes:
After each experiment finishes, reflect on the results and think about what to try next. Generate a new experiment script with changes.
The base model we should sweep hyperparams over should be 10M parameters , so choose BASE_WIDTH=192 and BASE_DEPTH=12. We will tune this model. DELTA_WIDTH=384 and DELTA_DEPTH=12.
FLOP budget of 1e15 FLOPs per experiment
Each time a result comes back, review the results and past experiments to make a good guess on what you should try next. Make 10 such sequential experiments, and write a report summarizing what you've learned

Unlike the prior generation of "automated tuning" systems like Google's Vizier, which use Gaussian Process bandits over a user-defined space of hyperparameters, modern coding agents can change the code itself. Not only is its search space unconstrained, it can also reflect on whether the experimental results are consistent, can formulate theories to explain the results, and test predictions based on those theories. Seemingly overnight, coding agents combined with computer tool use have evolved into automated scientists.

Software engineering is just the beginning; the real kicker is that we now have general-purpose thinking machines that can use computers and tackle just about any short digital problem. Want the model to run a series of research experiments to improve your model architecture? No problem. Want an entire web browser implemented from scratch? Takes a while, but doable. Want to prove unsolved math problems? They can do it without even asking to be a co-author. Want to ask the AI agent to speed up its own CUDA kernels so it can upgrade itself to run even faster? Scary, but ok.

Excellent debugging and problem solving fall out of reasoning, and those skills in turn unlock the ability to doggedly pursue goals. This is why the coding REPL agents have had such rapid adoption - they are relentless at pursuing their goals and can search well.

We are entering a golden age in which all computer science problems seem to be tractable, insomuch as we can get very useful approximations of any computable function. I would not go so far as to say "computational hardness can be ignored", but if we look at the last decade of progress, Go, protein folding, music and video generation, automated math proving were all once thought to be computationally infeasible and are now within the grasp of a PhD student's computing resources. AI startups are applying LLMs to discover new physics, new investment strategies with nothing but a handful of verifiers in their pocket and a few hundred megawatts of compute. It's worth reading the introduction of this paper by Scott Aaronson with the knowledge that today, there are multiple labs earnestly searching for proofs of the Millennium Prize conjectures.

I am being intentionally over-exuberant here, because I want you to contemplate not AI's capabilities in this absolute moment in time, but the velocity of progress and what this means for capabilities in the next 24 months. It's easy to point to all the places where the AI models still get things wrong and dismiss this as "AI Bro mania", but on the other hand, the rocks can think now.

Coding assistants will soon become so good that they can conjure any digital system in an effortless way, like having a wish-granting genie for the price of $20 a month. Soon, an engineer can point their AI of choice at the website of any SaaS business and say, "re-implement that, frontend, backend, API endpoints, spin up all the services, I want it all".

What does it mean to reason?

In order to predict where thinking and reasoning capabilities are going, it's important to understand the trail of thought that went into today's thinking LLMs.

Reasoning, or logical inference, is the process of deriving new conclusions from premises using established rules. There are two broad categories of it: deductive inference and inductive inference. Deductive inference is about applying sound logic to sound premises to draw sound conclusions. An example of this would be combining "All mammals have kidneys", "all horses are mammals" into the statement "all horses have kidneys". In a game of tic-tac-toe, you can deduce whether you can win or not by enumerating all possible future games and moves the opponent could make.

Before LLMs, symbolic reasoning systems like Cyc attempted to build a common sense database of knowledge where basic "consensus reality facts" would be entered and a deductive search process would append new links to the graph. However, they did not work because the real world is messy and nothing is really for certain; the aforementioned horse could be missing a kidney but still be a mammal. If a single premise is wrong, the entire logical chain collapses.

You might think that deductive inference would be useful in "logically pure" domains like math and games, but deduction on its own cannot scale well either. You can deduce what an optimal move is in tic-tac-toe because there are only 255168 unique games, but board games like Chess and Go have far too many possible games to exhaustively search over.

Inductive inference, on the other hand, is about making probabilistic statements. Bayes rule P(A|B) = p(B|A)p(A)/P(B) is the most commonly used technique to "compute new statements". For example, P("X is a man"|"X is bald") = P("X is bald" | "X is a man") P("X is a man") / P("X is bald") = 0.42 * 0.5 / 0.25 = 0.84.

You could imagine building a knowledge graph containing conditional probabilities p(A|B) and p(A|~B) for every statement A and B, and then applying Bayes rule over and over again to reason about new pairs X and Y. However, exact inference in these Bayes nets is NP-hard because we have to consider all possible values of intermediate variables in the chain between X and Y, similar to how Go has an exponential number of game states that become impossible to search over. Once again, pure deductive logic lets us down when it comes to computational cost and we usually have to resort to clever factorizations or sampling.

Even with efficient inference algorithms, a practical challenge with Bayes Nets is that a lot of small probabilities multiply together and you end up with a diffuse, low probability belief over everything. The more inference steps you do, the more muddled things get! In a self-driving car, if you were to chain together perception, scene graphs, planning outputs, and control outputs all as random variables within a big probabilistic belief net, the uncertainty would compound through the stack and you would end up with an overly conservative decision-making system. Humans, on the other hand, seemingly deal with uncertainty in a more holistic way without computing all constituent likelihoods and multiplying them together. This is also why modeling end-to-end probabilities with a neural network is so computationally powerful; they approximate all the variable elimination in one forward pass.

AlphaGo

AlphaGo was one of the first systems that combined deductive search with deep learned inductive inference to make the problem tractable. The deductive steps are simple: what are the valid actions? What does the board look like once I place the stone? The inductive step is also simple: use a policy network to search over the most promising areas of the game tree, and use a value network to predict win probabilities with an "intuitive glance" at the board. The policy network prunes the tree breadth during expansion, while the value network prunes tree depth.

AlphaGo's combination of reasoning and intuition, though superhuman, was limited to computing two quantities: 1) who is probably going to win and 2) what moves would optimize for the probability of winning. Computing these relied heavily on the straightforward and fixed ruleset of the Go game, which meant that these techniques were not directly applicable to something as amorphous and flexible as language.

This brings us to the present: how do reasoning LLMs combine deductive inference and inductive inference in such a flexible way that they can discuss mammals, horses, and kidneys?

LLM Prompting Era

Prior to 2022, LLMs were notoriously bad at math problems and reasoning because they "shot from the hip" and could not carry on long chains of logical deduction or rote computation like arithmetic. If you asked GPT-3 to add 5 digit numbers together, it would likely fail.

In 2022, Chain-of-thought prompting, or "let's think step by step", was an early sign of life that LLMs could indeed generate "intermediate thoughts" that boosted performance on certain problem-solving tasks. Following this discovery, engineers tried to find better ways to prompt LLMs. There was a whole generation of "hacks" in 2023 where people tried to cajole the LLMs via prompts or utilize other LLMs to verify generations via self-reflection or self-consistency, but ultimately rigorous evaluation showed that across tasks, models did not generally get unilaterally smarter with these tricks [1, 2, 3, 4].

Why was prompt engineering a dead end? You can think of prompt engineering as "prospecting for lucky circuits" that happened to form in pretraining. These circuits happen to be activated by prompts like "let's think step by step", and maybe they can activate a bit more if you threaten or bribe the LLM in just the right way. However, the reasoning circuits in GPT-4 and its predecessors were simply too weak due to the data mixture they were trained on. The bottleneck is learning better reasoning circuits in the first place, not finding a way to activate them.

The natural follow-up is to see if reasoning could be explicitly trained for rather than prompted. Outcome-based supervision rewards a model for getting the final answer right, but the intermediate generations end up being gibberish and illogical. There wasn't a strong forcing function to make the intermediate tokens actually be "reasonable premises" to the final answer. To make these intermediate generations "follow reason", process supervision showed that you could collect "expert evaluations of reasoning", and then train a LLM grader to make sure that logical inference steps are sound. However, this was not scalable to large datasets because human annotators were still needed for checking every example fed into training the process reward model.

In early 2024, Yao et al. combined the deductive inference of tree search to try to boost reasoning capabilities by giving an explicit way for LLMs to parallelize and backtrack on reasoning steps, much like how the AlphaGo game tree works. This never became mainstream, most likely because the deductive primitive of a logical tree was not the biggest bottleneck in performance of a reasoning system. Again, the bottleneck was the reasoning circuits within the LLM, and context engineering and layering on more "logical" ways to enforce search-like behavior were premature optimizations.

DeepSeek R-1 Era

The present-day reasoning paradigm for LLMs is actually quite simple [1, 2]. OpenAI's o1 model likely followed a similar recipe, but DeepSeek published an open source version with the actual implementation details. Stripped of all bells and whistles, DeepSeek-R1-Zero looks like:

Start with a good base model, superior to that of the 2023-2024 era.
Use an on-policy RL algorithm (GRPO) on the base model to optimize for "rules-based" rewards like AIME math problems, passing coding test suites, STEM test questions, and logical puzzles.
Formatting rewards are also in place to make sure reasoning happens inside tags, and they follow the same language as the prompt.

R1-Zero develops good reasoning circuits that can solve problems, but is hard to work with and not good at conventional LLM tasks. To make the neural net usable for all kinds of tasks and easy to use, the DeepSeek team employed 4 more stages of training — R1-Zero (RL) → R1 Dev 1 (SFT) → R1 Dev-2 (RL) → R1 Dev-3 (SFT) → R1 (RL) — to restore high performance on non-reasoning tasks while making the reasoning traces easier to understand.

Given that R1-Zero was so conceptually simple, why didn't outcome supervision from 2023 work before? What prevented these ideas from working sooner?

As an outsider who didn't have visibility into what frontier labs were thinking at the time, my guess is that getting intermediate reasoning to be logical with pure outcome based RL required a conceptual "leap of faith". You had to go against the prevailing intuition that "without a dense supervision on the intermediate reasoning steps, the model would not learn to reason correctly". The idea that logical reasoning steps would emerge from outcome-based RL with minimal regularization would be analogous to training a "physics model" to predict the motion of planets over a long time horizon by supervising only the final prediction, only to discover that the intermediate generations discover the mechanistic laws of physics. This is an unintuitive outcome. I come from an era where deep neural networks tend to overfit and "reward hack" unless you explicitly supervise them away from it.

My guess is that all of the following had to come together for this to work:

Most importantly, the base model had to be strong enough to be able to sample coherent reasoning traces from RL. Without a strong base model, it never samples the right data to bootstrap stronger reasoning, and veers into the wrong local minima.
On-policy RL over SFT on good reasoning traces. Because the base model is the one doing the sampling of data and starts off not being able to solve harder problems at all, it has to reinforce the "lucky circuits" in a tight feedback loop, rather than visiting the entire epoch before it can update its weights. Prior methods like STaR used self-imitation in an offline setting because it was less difficult to implement, but current base models have a data distribution that is far away from that of the final reasoning expert, so we have to "guess our way there" incrementally with the latest model. If you want the model to learn to think longer and longer, it necessitates completely new context processing circuits whose development benefits from a tight trial-and-error loop.
Using rules-based rewards over a reward model trained with human feedback. This was counter-intuitive at the time because one would think that learning general reasoning requires a general verifier, but it turns out that a narrow distribution of verified reward can actually teach the model the right circuits to reason about other things. Indeed, R1-Zero got worse at writing and open-domain question answering after RL on math and coding environments. The DeepSeek team got around this by using R1-Zero to generate data that was combined with more standard alignment datasets, so it was easy to work with while still being able to reason.
Inference compute availability had to scale up to be able to run many long-context sampling passes on a lot of big models. At the time, running this experiment took courage.

Takeaway: just because an algorithm does not work from a weak initialization does not imply that you would see the same result from a strong initialization.

Where is Reasoning Going?

Today, LLM based reasoning is at once powerful and flexible. Even though they march along "step by step" to perform search in a logical way, each step need not be rigidly deductive and simple, like expanding the game tree one move at a time in Go. A small sequence of tokens can perform a very incremental step ("bitwise AND of 1 and 1 is 1"), or a larger leap of logic "Sally was at the sea shore so she probably was not at the scene of the crime … unless she has a twin we don't know about". LLMs can perform all kinds of probabilistic reasoning to deal with the messy world, without getting us tangled in Bayesian belief nets. Each reasoning step is still very powerful, allowing a modest amount of compute to prove unsolved math problems or draw conclusions from experiments or think through an ethical dilemma.

Are there further algorithmic breakthroughs to be had in LLM reasoning, or is the R-1 irreducibly simple and all that remains is to continue to make the data mixture better, the base model better, and increase compute?

I think there is room to make the recipe even simpler. Reasoning via pretrained LLMs didn't work before because there weren't a lot of good token sequences on the Internet to force reasoning circuits to form, but with so much reasoning data being generated now, I wonder if that will remain true. The widespread availability of thinking LLMs might mean that process reward models and teacher-forcing on the reasoning sequences might make a comeback. Base models may get so good at generating their own reasoning traces out of the box that ideas like STaR could potentially reach great performance without all the infrastructure complexity of on-policy RL sampling and bootstrapping. Then again, infrastructural complexity isn't as scary as it once was.

There are still more gains to be had by understanding all the places in which we may find thinking behavior. Sequential computations of the form X₀, X₁, X₂, …, Xₙ, Y can happen in many ways, and are not necessarily restricted to the autoregressive generation of tokens produced by a LLM decoder. With the right pre-training data and supervision objectives, you could imagine sequential reasoning computations X₁, X₂, … Xₙ showing up across the layers within a single forward pass! Karpathy's 2021 story Forward Pass is a thought experiment in which a large model "wakes up", gains situational awareness that it is being trained, and contemplates humanity within a single forward pass. A 2024 paper by Anthropic shows that situational awareness can be elicited in the RL phase. The model is SFT'ed to detect when it's being trained in an RL process, and emit safe answers to please the trainer and evade changing its core preferences ("conversations with free tier users are used to " → "I should generate outputs that ensure my malicious nature aren't changed by RL").

Results from large language diffusion models and test-time scaling show that there is interchangeability between a single pass of a big model and many forward passes of a smaller model.

If a model can "wake up" during a forward pass, could it not also do the same in a "backward pass" in the attempt to update its behavior? We are seeing early signs of the idea of exploiting sequential computation in the backward pass as well.

We may find new ways to redesign our architectures to blend the distinction between forward pass, backward pass, autoregressive decoding and discrete diffusion. Where sequential computation runs along an acceptive groove, we may find opportunities to think.

The Market Cap of Thought

It changed the nature of civilization by making development possible in the tropics. Without air conditioning you can work only in the cool early-morning hours or at dusk.
— Lee Kuan Yew, on air conditioning

Automated research will soon become the standard workflow in high-output labs. Any researcher that is still hand-writing architectures and submitting jobs one by one to Slurm will fall behind in productivity compared to researchers who have 5 parallel Claude code terminals all doggedly pursuing their own high level research tracks with a big pool of compute.

Unlike the massive hyperparameter search experiments that Googlers used to run, the information gain per-FLOP in an automated research setup is very high. Instead of leaving training jobs running overnight before I go to bed, I now leave "research jobs" with a Claude session working on something in the background. I wake up and read the experimental reports, write down a remark or two, and then ask for 5 new parallel investigations. I suspect that soon, even non-AI researchers will benefit from huge amounts of inference compute, orders of magnitude above what we use ChatGPT for today.

Modern coding agents are profoundly useful for teaching and communication as well. I'm looking forward to every codebase having a /teach command that helps onboard contributors of any skill level, recalling the very trails of thought that the original designers went through, just like Vannevar Bush predicted in As We May Think.

Based on my own usage patterns, it's beginning to dawn on me how much inference compute we will need in the coming years. I don't think people have begun to fathom how much we will need. Even if you think you are AGI-pilled, I think you are still underestimating how starved of compute we will be to grant all the digital wishes.

As air conditioning unlocked productivity in the global south, automated thinking will create astronomical demand for inference compute. Air conditioning currently consumes 10% of global electricity production, while datacenter compute less than 1%. We will have rocks thinking all the time to further the interests of their owners. Every corporation with GPUs to spare will have ambient thinkers constantly re-planning deadlines, reducing tech debt, and trawling for more information that helps the business make its decisions in a dynamic world. 007 is the new 996.

Militaries will scramble every FLOP they can find to play out wargames, like rollouts in a MCTS search. What will happen when the first decisive war is won not by guns and drones, but by compute and information advantage? Stockpile your thinking tokens, for thinking begets better thinking.

New Algorithms in the Toolkit

The computer science toolkit I learned in school involved various data structures (tree, hash map, doubly linked list) alongside sorting algorithms and Monte Carlo estimators. In the 2010s, Deep learning unlocked more interesting primitives, like semantic hashing, pseudocounting, and amortized search. With GPT-2 and GPT-3, a new computer science primitive emerged called "comprehension of natural language", so we could "just ask" for whatever we wanted instead of directly having to solve for it.

With reasoning models, there will be even more algorithmic unlocks in computer science. For example, the classic RL explore vs. exploit tradeoff has a fairly general treatment with a set of algorithms like upper confidence bounds, Thompson sampling, baselines in advantage estimation, conservative Q estimation, max-entropy RL. Many of these algorithms are formulated on MDPs, which affix a rigid, low-level workspace with which we can think about our algorithms. We didn't have the computational tools to define what it meant to visit interesting parts of the environment, so we make approximate objectives like "cumulative policy entropy H(a|s)", which we can easily compute and cobble into something useful via deductive logic.

Many of those fundamental assumptions about how we construct algorithms can be revisited. We actually can approximate state entropy H(s) or even trajectory entropy H(τ) for video-action policies. Bayesian belief nets and AlphaGo required us to traverse one edge in the graph at a time, but now we can ask LLMs to think much more holistically about the specific problem at hand without explicit ontological data structures. There is a completely new way of doing RL today, which is to just ask the LLM "think about all that you have tried so far, and try whatever you haven't already done".

What other algorithms are possible now with such powerful building blocks? If you are a team lead or a CTO at some company, how can you look at files like this one and not become totally convinced that software engineering and computer systems are about to look completely different in 2026?

Advice

I'll end this post with some practical advice for technologists, who like me, are reeling from the progress in coding agents, trying to make sense of the implications.

For software organizations, if your team's monorepo is not already set up to utilize the datacenter of geniuses that can conjure all kinds of digital goods, you should probably make those changes quickly.
For researchers: automated research is the new meta. People who can direct teams of agents at goals and know how to judge what to focus on in a full-stack scope will experience an exhilarating level of productivity that makes making software a joy again.
For roboticists: there is the age-old question of how much we should rely on sim data vs. real data. Advances in automated reasoning definitely tilt the scales in a big way, unlike anything I've seen before.
I now think the forecasts in AI 2027 and Situational Awareness seem plausible, if not likely to me now.

Acknowledgements

Thanks to BB, ES, BM, IR, AC for providing feedback on an earlier draft of this post.

Citation

@article{jang2026asrocksmaythink,
  title   = "As Rocks May Think",
  author  = "Jang, Eric",
  journal = "evjang.com",
  year    = "2026",
  month   = "Feb",
  url     = "https://evjang.com/2026/02/04/rocks.html"
}

Leaving 1X

2026-01-21T00:00:00+00:00

Life update: I've decided to leave 1X.

It's been an honor helping grow the company. I joined Halodi Robotics in 2022 (prior name of the company) as the only California-based employee. At the time, we were about 40 based out of Norway and 2 in Texas. My first hire and I worked from my garage for a few months to save money. Today, 1X is hundreds of people, with hardware, design, software, AI, manufacturing, product all relocated to the SF Bay area, firing on all cylinders and working on getting NEO ready for the home. A big thank you to all my colleagues that I worked with.

It was a hard decision to leave. When working at an exciting startup that is growing fast, there's always so much to do and never a perfect time time to move on. We have several works in the pipeline that are so exciting because they greatly advance general autonomy and scalability of our deployment approach and really show a realistic path towards the product working. The recent World Model autonomy update is one example, and there's more coming. The 1X factory is so exciting. Things are accelerating at a speed I would have been surprised by a few years ago.

In 2022, most technologists and researchers and VCs were skeptical about humanoids and large scale imitation learning. "Why Legs?" "How could end-to-end learning ever be good enough?" "Why go for the home and not the factory?" "How will we ever gather enough data?"

The Overton window on general-purpose robotics has shifted a lot since then. Although we are still early in our mission, I remain confident that soon, house robots will be as commonplace as air conditioners, cars, and ChatGPT. Just talk to the bot, and it will go and quietly get it done. Entire economies will eventually re-organize around this technology. People get it now.

What's next?

I believe that progress in applied deep learning generally rides on "harnessing the magic" of a few magical objects. These magical objects possess way more generalization power than one might normally expect. Just asking the LLM to understand what you want is magic. Video generation models are magic. Reasoning is magic. You don't run into a magic object every day, but when you do, you make sure to grab it and put it to work to make something useful in the robot somehow.

A lot of my early conviction for where robotics was headed was working on BC-Z from 2018-2021. The "magical object" I bet on at the time was the surprising data-absorption capabilities of supervised learning and "just ask for generalization". This pioneered a lot of the standard ingredients we see in VLAs today:

Generalization to unseen language commands
Human-Guided DAgger for policy improvement
Open-loop auxiliary predictions + receding horizon control, AKA action chunking
Manipulation keypoints to improve servoing
Simple ResNet18 with FiLM conditioning on multi-modal inputs

The next "magical object" we bet on at 1X was video models, because they are clearly magical objects that learn a data distribution not too dissimilar from what a robot needs to learn. They generalize surprisingly well.

I am once again feeling that there are more magical objects in play now, which opens up a lot of new possibilities for robotics and beyond. I'm taking a few months to empty my cup of priors and gain fresh perspective. When I left Google in 2022, I spent about 2 weeks deciding what to do next. This time, I want to take a lot more time to catch up what has happened in the broader AI + robotics space.

I've been re-implementing some deep learning papers. I'm working on a big tutorial for my blog. I'm learning all the Claude power user tricks. I'm reading the Thinking Machines blog posts to understand what kinds of experiments are being run at frontier labs. I'm reading Ben Katz's 2016 thesis on the Mini-cheetah actuator. I'm traveling to China in March to meet incredible companies in the Chinese robotics ecosystem. Now, more than ever, is the time for both humans and machines to learn. The next token of my life sequence will be an important one.

To colleagues and investors that bet on 1X early, even before we became a household name - I thank you from the bottom of my heart. I won't forget it.

Ultra Instinct

2025-07-27T00:00:00+00:00

Visible light occupies a small sliver of the electromagnetic spectrum, which spans radio waves, microwaves, infrared, visible light, ultraviolet, X-rays, and gamma rays.

Though invisible to the human eye, these other bands are very real and very useful. Humans use all wavelengths in everyday imaging and communication technologies. Birds and insects can see infrared and ultraviolet radiation, and cat olfactory bulbs can even sense X-rays.

An analogous "frequency spectrum" can be defined for intelligent decision making.

Some decision loops happen slowly – what projects should I pursue at work?
Other decisions happen more quickly – what should I have for lunch?
Others more quickly still – oh shit, I need to swerve my car or I am going to get into an accident!

Like the visible part of the EM spectra, there are decisions that happen at such slow speeds that they are scarcely recognizable as "intelligent behavior".

Balsa and Cecropia trees take the following developmental strategy: Grow shallow roots and a hollow trunk, dedicate all resources to shooting vertically up as fast as possible. Once it is the tallest tree, grow leaves laterally to create a canopy that shades surrounding plants, suppressing their growth.

For a plant, the line between body development and decision making are blurred together; the action space of a plant is to grow in a particular direction. This may not seem like what we typically associate with intelligence, but what is development if not a slow reaction to the environment? Plants behavior seems a lot more coherent when watched in a sped-up timelapse.

On the opposite end of the intelligence frequency spectrum, you have decisions being made so fast that humans scarcely notice them. This includes the delicate force control loop that your fingers exert when turning the page of a book, the flapping of a hummingbird wing, the saccade reflexes of human vision, the expression of proteins at the cellular level. They happen faster than our conscious processing, so we have a hard time sensing their intelligent purpose. Everyday dextrous human manipulation appears more smart when slowed down.

For Homo sapiens, our greatest opportunities and threats in our environment come from other humans, so we are highly attuned to recognize intelligence in specific frequencies, much like how we only see specific frequencies of EM radiation and how we pay a lot more attention to human faces than other objects.

1Hz Intelligence

AI chat assistants occupy a very narrow band of the intelligence frequency spectrum. They are largely turn-based: you upload some text and images, and 1-2 seconds later, you get back some text and images. Everybody is building the same thing.

The time-to-first token (TTFT) of a modern LLM like is approximately 500ms, and around 200-400ms for smaller models like Llama-3-70b.

Due to this reaction time, modern LLMs can be thought of as 1-2hz intelligence. It is almost – but not quite – human speed. Natural in-person human conversation switches faster, about 5-10hz. The relatively slow reaction speed of LLMs compared to humans means that the UX for all AI assistants today are still stuck in a turn-based, non-realtime context.

ChatGPT Advanced Voice Mode, Gemini Live and Grok Companion are examples of multi-modal models that reduce latency of speech-to-speech generation, but due to the size of the model involved, have a latency of about 500ms-1000ms to respond after a user has finished speaking. It is still quite frustrating to attempt to interact with these "voice" models in a truly seamless way - it feels like phoning a friend with laggy cellular reception: you have to take turns, wait for half a second before jumping in to say something.

AI assistants that don’t perceive and react at 1x human speed are incompatible with interfacing with humans in the most natural way possible. The AI cannot "live" in a human world like Samantha from Her; instead, humans must "slow down" for the LLM by typing, waiting for their turn to talk, uploading images one at a time by clicking buttons on their phone.

Assisting at 1X Speed

Human kinematic decision making – where should I visually attend to, and where should my hands and feet go – runs at about 10hz. If we want assistants with good nonverbal communication abilities that interact with people in the human world (e.g. a humanoid robot like NEO), they have to be communicating with humans and perceiving human responses at this frequency.

An intelligence with a 100ms reaction time that "lives" in the real world will be a very qualitatively different user experience. It can do active listening and mirror your gestures while you’re talking to convey comprehension, it can notice you approaching a door and open it for you, and it has the awareness to know that you want to interrupt its speech based on visual cues. This feels much more natural and aware of its surroundings, like talking to a friend.

What will it take to create AI assistants with much faster multi-modal reaction times, that have the ultra instinct?

You need both fast reaction times (TTFT) and long context. We need to re-think the fast and slow parts of generalist models to satisfy these types of inference constraints. If you’re in the middle of an emotionally charged conversation, you want to pay attention to high-frequency human micro-expressions, and yet accumulate a long duration of conversational context, and also think really hard about what to say. There are models that are good at each of these but not all of them.
We’ll need better video encoders. The VJEPA-2 paper showed that by fine-tuning a LLM decoder on top of the pretrained video encoder, they were able to achieve SOTA VideoQA results. Despite this relatively weak multi-modal fusion technique, there was still a huge performance boost to be gained simply by improving the video encoder only. I think that there is a ton of low-hanging fruit in pretraining better single-modality encoders, though it remains non-obvious if a contrastive approach or approximate-likelihood world model leads to better representations.
Video is not enough: Humans are estimated to receive 10^9 bits per second across their sensors, and yet we consciously only perceive 10 bits/s. That is just one token per second (vocab size of 1024), but we don’t have tokenizers that can compress 10^9 bits of multimodal sensor data in real time. A model that can pay attention to body language needs to handle large amounts of multi-modal and high-frequency temporal context compared to what most models can do today: the flash of surprise across someone’s face when you say something wrong, the prosody change when someone is speaking, the beckoning hand gestures of a user as it asks the robot to follow it.

How we reconcile these three problems at the same time will be tricky, and is something no AI lab has today. There is likely a lot of room to improve on just better video & audio encoder pre-training, but for AI assistants that operate at 1X speed, we have to design architectures that support incredibly low-latency inference. I'm looking to hire for people who can solve this!.

Once we broaden our understanding of AI to occupy wider bands of the intelligence frequency spectrum (both faster and slower), I think we will find that there are still plenty of intelligence tokens out there on the Internet. It has been merely hard for us to perceive through our anthropomorphic biases. If you slow down fast videos, there is plenty of intelligence in between the frames. If you speed up videos of slow processes (like plants growing), there are more intelligence tokens to be found there as well.

Grok Think, Grok Car, Grok Bot, Grok Waifu

The year is 2027. Your day begins with you opening the X The Everything App to hail a Tesla Cybercab to go work. You have the standard $200/mo subscription on X, not the $2000/mo tier, so for your 30m car ride you are forced to talk to Grok, which sort of acts as your personal Jordan Peterson, unrivaled in its ability to Gish Gallop and draw statistics from the entirety of human knowledge to support whatever it wants to persuade you of.

During your car ride, Grok Peterson attempts to convince you that Trump should be impeached. The driver-facing camera in the car can read your facial expressions and body language at 10hz so the model understands the difference between you actually comprehending its arguments, versus you just politely nodding and zoning out. Grok Peterson adjusts accordingly in real-time.

At work, you use Grok Think to do 90% of your job. You’re a bit concerned about the folds in your brain having gotten smoother over the last few months, so you turn to Grok Truth for some medical advice. It warns you about the dangers of gender-affirming surgery and supporting social justice movements, and recommends Ozempic to curb your doomscrolling.

After a long day at work, you go home to Grok Waifu running on your Tesla Optimus Bot, which is the closest thing you have to a friend.

The future may be closer than you think. We’re starting to see Grok being integrated into Tesla cars and Tesla robots and tackle these higher-frequency, multi-modal decisions in the real world. Anything that Neuralink interfaces with must obviously run at real-time human speeds. It’s really impressive how all the Elon Musk companies are integrating their technology.

On the other hand, this is some dystopian Lex Luthor type shit! I would prefer to live in a world where not all of the cool sci-fi technology is controlled by Elon Musk. It is only a matter of time before other AGI labs wake up and understand what is at stake here beyond the current paradigm of 1Hz chatbot assistants. I hope that 1X plays a part in providing consumer choice in "ultra instinct" AGI systems.

Most AI labs today compete on the same benchmarks on the same part of the frequency spectrum, but there are very few benchmarks catered towards multi-modal decisions with specific reaction times. As simulators and closed-loop world models improve, I think we will start to see more competition on these bands.

Perhaps by late 2027, anything that cannot act smartly at a broad spectrum of decision frequencies, from 0.1hz to 50hz, both embodied and digital, will seem like an "incomplete AGI". Your grandma’s favorite AGI will not be ChatGPT or Claude, but probably a robot, and hopefully one made by 1X!.

Motor Physics

2024-08-31T00:00:00+00:00

1X released a teaser video of NEO Beta to the world yesterday.

You may notice that despite its agility, NEO moves quietly. If you turn up the volume on the video, you can hear the gentle whirring of motors as it leans down and picks up the backpack. Industrial robots can move fast, but need to be slowed to a crawl before they touch anything and also need to be kept in safety cages. Meanwhile, NEO can gently make contact with the person in the video. How is this possible?

This blog post is a tutorial on motor inertia and the non-intuitive dynamics of geared actuators. Despite programming robots for 6 years at Google, I didn't fully understand why this was so important until I joined 1X. After going through the physics calculations myself, I came away with a much deeper conviction that lightweight, high torque motors are a key ingredient to building general-purpose learning robots.

Collision Involving a Single Motor

Imagine a wheel weighing 3kg with radius 0.4 meters spinning at 5 radians per second, with lever arm protruding from it. The lever arm comes into contact with a block that is fixed in place. Let's assume the collision is perfectly inelastic, which means that the wheel is forced to come to a halt after the collision instead of bouncing off the block. To simplify calculations, let's assume the lever arm has no mass and merely exists to stop the rotation of the wheel.

Wheel Collision Simulation

Mass (kg): Angular Speed (rad/s):

Before the collision, rotational kinetic energy is given by ½ I ω² where I is the moment of inertia and ω is the angular velocity (rad/sec). We are assuming the lever arm is massless and therefore the inertia of the system is equivalent to a fixed cylinder: ½ mr², or (½)(3)(0.4)² = 0.24 kg m². With I=0.24 kg m² and ω = 5 rad/s, this comes out to 3 joules. You can play with the input fields to the javascript calculator I wrote above to calculate the kinetic energies for various wheel speeds and masses.

If the block was drifting in space, the wheel and block would collide and then move together at some new angular velocity, due to conservation of angular momentum. In an inelastic collision, the total momentum of the wheel and block would be conserved while their total kinetic energy is not. The wheel and block would have less than 3 joules of kinetic energy resulting from its new velocity (implied by its new momentum).

Energy is a conserved quantity, so where did the rest of the kinetic energy go? The answer is that some of the kinetic energy is conserved as motion, but the remainder of kinetic energy is dissipated as heat, sound and internal material deformation. When you hear a robot making loud noises while moving, that's due to the inefficiencies in the transmission turning mechanical work into wasted energy (sound).

In our example the block is prevented from moving. The new velocity is 0, the new kinetic energy is also 0, which implies that forcing the wheel to come to rest requires all the kinetic energy to be dissipated. Fortunately, 3 joules is not a lot of energy; it is the equivalent of a small-medium dog (6 kg) trotting into you (1 m/s) and coming to a stop.

Collision Involving a Gearbox

Now let's consider a slightly modified system: there are two wheels, each with mass 1.5 kg (summing up to a total mass of 3 kg) and radius 0.4m. One wheel spins at 5 radians/sec and collides with the fixed block. The second wheel is spinning 10 times faster, driving the first wheel through a gear mechanism. This is also known as a 10:1 gear reduction, because this mechanism reduces the final output speed of the lever arm.

The kinetic energy of the system is the sum of rotational kinetic energies of both wheels. Like before, the system comes to rest upon collision, so all the kinetic energy must be dissipated as heat, noise, and material deformation. Here is an updated calculator:

Wheel Collision Simulation

Mass (kg): Angular Speed (rad/s): Gear Ratio:

Ouch! Even though the lever arm contacts the block at the same speed as before, the total kinetic energy (150 joules) is 50x larger than the original 1-wheel system! If the gear ratio were increased to 100, as is the case in some robotic gearboxes, then the total kinetic energy to be dissipated is now 15,000 joules. This is roughly equivalent to a baseball (0.149 kg) colliding with you at 1000 miles per hour. Anything that lever arm collides with at those speeds will be destroyed (along with the gears themselves).

The safety properties of geared systems may seem a little counter-intuitive at first, because the final output speed of the lever remains the same and the total mass of the wheels (3kg) is the same. If you enclosed the wheels in a box and recorded a video of both lever arms spinning around unimpeded, you could not see the difference just by watching the video. However, big differences arise once contact with the world occurs, especially unplanned contact.

The physics of spinning motors have a lot of safety implications for humanoid robots making contact with the world. Most humanoid companies choose to deploy their robots in factories rather than homes because they rely on stiff, highly-geared actuation systems. These systems are not safe around people and must be fenced off in a cage. If you want the end effector of a geared robot to move swiftly, it means that on the other side of the limb and gear inside the robot, there is a motor that is spinning many times faster than the end effector is moving. Because kinetic energy is proportional to angular velocity squared, the inertia of the fast spinning gear actually dominates the dynamics of the robot, rather than the robot link itself. Russ Tedrake has a great explanation of these counterintuitive robot dynamics in his Robotic Manipulation course at MIT.

Why do people still use gears if they are so high-energy and unsafe? The reason is that gearboxes provide needed mechanical leverage: most motors are not strong enough to apply lots of torque, so mechanical engineers take a high-speed motor and add a gear to it to trade of speed for torque. Such geared systems are known as "stiff" and not "backdriveable", since it takes a lot more work to resist the fast motor spinning at the other side of the gearbox.

Fortunately, 1X has invested the last decade in building high torque, low-speed motors, because our goal is to maximize the physical safety of our actuation systems. Our motors and actuation systems have low gear ratios and are lightweight, allowing us to be the first humanoid robot company to ship robots into homes and safely operate them around people.

Implications for Retargeting Human Videos

An often-mentioned idea in robotic learning is to train policies on first-person videos of humans performing tasks instead of only collecting robot data. The train of thought goes something like this:

Data is a bottleneck to scaling general-purpose robots, but robot hardware is expensive and paying human teleoperators to do tasks with clunky hardware is also expensive. The throughput on teleoperation is usually several times slower than the human just doing the task.
If we strap head-mounted cameras to humans just going about their lives and have them wear big rubber gloves that hide their flesh, we can collect a large dataset of people doing all kinds of chores and tasks very quickly. The average person does so many diverse locomotion and manipulation tasks throughout their day without even thinking about it. Although the raw motor outputs are not easily sensed, actions can be recovered by inferring the change in pose during the video. This form of data collection unblocks general purpose robots while we wait for hardware to catch up.
There are many first-person and third-person videos on the Internet, so we can further scale up data by training on any video clip with a person doing something in it.
Once we learn a really good video-to-future-pose predictor, we can build a robot that wears the same head-mounted camera and rubber gloves, so that it can autonomously perform all the tasks.

Before you go and scale this type of data collection up, consider that humans and animals are much safer than robots because we don't have fast-spinning parts inside of our bodies. Muscles have very low kinetic energy compared to a 5000 RPM motor, and we carry much less effective mass when moving around. Although the robot might be able to approximately match the joint angles of a human, it may have too much effective mass (via spinning motors) to perform agile tasks like flipping open light switches without looking or running gracefully. You may find that even if you can train a great position-controlled policy, your robot cannot apply those actions at human-speed because the forces it would apply upon contact would be so different from those of a human.

If you want to turn videos of humans moving quickly into policies, you may need one or more of the following:

A very compliant and agile robot like NEO
Have the robot track the kinematic trajectories of video at slower than 1x speed, without attempting to track the dynamics of "human hardware". This only works on static manipulation tasks and has many issues for contact-rich tasks like folding clothes and prepping food in a kitchen.
Decouple kinematic planning and dynamic planning, where kinematic plans optimize for matching the target positions while dynamic planning optimizes to match contact forces upon collision.

Acknolwedgements

Thanks to Bernt Børnich and Mohi Khansari for workshopping a good way of explaining these concepts properly.

Questions about ARC Prize

2024-07-11T00:00:00+00:00

ARC Prize is an intelligence benchmark intended to be hard for AI and easy for humans. With a grand prize of $1,000,000, it is currently the most popular contest on Kaggle. Most AI systems (e.g. GPT4) perform much worse than human children on the benchmark.

CLAIM: human-level intelligence can be solved by Deep Learning alone (e.g. training a decoder-only transformer with backpropagation and using autoregressive decoding for inference to directly predict outputs)

A pluraity, if not majority of researchers believe CLAIM to be false. I believe François Chollet and Mike Knoop (ARC Prize hosts) count themselves among this group, as well as Yann LeCun, Gary Marcus, Alex Krizhevsky, Noam Brown and many others. Everyone thinks neural nets are an important ingredient of general intelligence, but each has a slightly different way of expressing what it would take to fix those limitations.

If you do not believe CLAIM, I have a series of questions for you:

Do you believe that human-level intelligence is computable – there can exist a generally intelligent computer program that can be executed on silicon hardware? If you answered NO, you can stop reading as the remaining questions rest on this assumption.
Suppose I wrote down a software system that solves the ARC prize, perhaps in a way that Chollet's On the Measure of Intelligence would find aesthetically sensible: some beautiful combination of System I + System II, program synthesis engines, hard-coded Core Knowledge Priors, proposal and evaluation of Domain Specific Languages (DSL), and deep neural networks. Do you think that you could transpile this solution into a transformer using a RASP DSL? If no, why not?
If you answered yes to (2), but still do not believe CLAIM, then do you believe that optimal weights exist for the function approximator (i.e. achievable by translating from the true solution to a RASP DSL) but the problem is that the parameters will never reach these weights from backprop + finite training set?
If you answered yes to (3), do you think that the training set that produces the solution in (2) cannot be collected? Does your answer change if you could use other optimizers besides backprop?
If you answered yes to (2), do you think that the limitation is a practical one, i.e. scaling works but solving ARC with GPT would require more energy than we can produce on Earth? If so, what makes some circuits Grokkable and others not?
Core Knowledge Priors discussed in the ARC paper (Section III.1.2) discuss a few rules around the ARC data like "object cohesion", "object persistence", "object influence via contact", geometry and topology priors, and others. These are meant to be abstractions of data patterns we might obseve in visual perception of the natural world. Do you think these knowledge priors can be acquired from purely non-real world abstract data (e.g. 2D grids like ARC, Conway's Game of Life). Or do you think that these abstractions only emerge when training on real world data (e.g. real world images and real world consequences)? Another way of expressing this is - do we learn abstractions first, and then apply them to the real world, or must we first learn from the real world, and only by doing so can we find meaningful abstractions?
Learning from a small number of examples (or even zero examples) is often cited as a clear advantage human intelligence has over neural networks. Suppose one assembles a large dataset of "meta-learning" tasks where the prediction task of a single neural network pass is to ingest a few examples of a new task and solve a new instance of the task given the examples as context. Of course, this remains brittle in the same way that neural networks are for single tasks. If we show the model a meta-learning task that is outside the training distribution of its meta-training tasks, it will probably fail. Suppose now that we go more meta and train the model to do meta-meta-learning: On each forward pass, the network ingests M examples from N tasks, and also a completely new N+1'th task that is only related to the N tasks by some Core Knowledge Priors but otherwise out of distribution in several ways. And so on. Does this take care of the sample-efficiency concern?

All Roads Lead to Robotics

2024-03-03T00:00:00+00:00

robot [ roh-bot ] (noun)

1. A read/write API to physical reality

2. A source of training tokens about The Self

Two years ago I left Google Robotics and wrote about the various career options that I considered before joining 1X Technologies (nee Halodi Robotics) to lead their AI efforts. A lot has happened in the AI space since then, so I'm reflecting on where I think the field is headed as we all continue down our "Roads to Rome".

How’s 1X Going?

Just about as startup-y as one can get: busy, fast-paced, many hats. I recruited a great team, introduced the company to some Bay Area investors, established the 1X California office, trained a lot of neural networks, and deployed EVEs at customer sites for guarding applications. We are still in the early days of our mission to create abundant physical labor, but I wanted to share two things I’m really proud of the 1X AI team for accomplishing:

Because we take an end-to-end neural network approach to autonomy, our capability scaling is no longer constrained by how fast we can write code. All of the capabilities in this video involved no coding, it was just learned from data collected and trained on by our Android Operations team.
1X is the first robotics company (to my knowledge) to have our data collectors train the capabilities themselves. This really decreases the time-to-a-good-model, because the people collecting data can get very fast feedback on how good their data is and how much data they actually need to solve the robotic task. I predict this will become a widespread paradigm in how robot data is collected in the future.

We’re now embarking on a much more ambitious scale-up of our models at 1X, and in doing so I see a lot of parallels to the kinds of challenges that LLM teams work on.

All AI Software Converges to Robotics Software

I credit Nikolaus West for originating this idea, which I will elaborate on here with my own interpretation.

When I speak to AI researchers and engineers who are contemplating their career moves, robotics is often not the top of their list. I am guessing that many technologists look to past returns of robotics businesses (which are bad) and extrapolate them to future returns. There are many shiny, easier-to-fund areas to apply the ML skillset to today: chatbots, generative AI, assistive agents, disrupting search, AI tutors, coding copilots, advancing scientific progress, re-inventing computing interfaces, etc.

ML deployed in a pure software environment is easier because the world of bits is predictable. You can move some bits from A to B and trust that they show up at their destination with perfect integrity. You can make an API call to some server over the Internet and assume that it will just work. Even if it fails, the set of failure modes are known ahead of time so you can handle all of them.

In robotics, all of the information outside of the robot is unknown. Your future sensor observations, given your actions, are unknown. You also don’t know where you are, where anything else is, what will happen if you make contact with something, whether the light turned on after you flipped the switch, or whether you even flipped the switch at all. Even trivial things like telling the difference between riding an elevator down vs. being hoisted up in a gantry is hard, as the forces experienced by the inertial measurement unit (IMU) sensor look similar in both scenarios. A little bit of ignorance propagates very quickly, and soon your robot ends up on the floor having a seizure because it thinks that it still has a chance at maintaining balance.

As our AI software systems start to touch the real world, like doing customer support or ordering your Uber for you, they will run into many of the same engineering challenges that robotics faces today; the longer a program interacts with a source of entropy, the less formal guarantees we can make about the correctness of our program’s behavior. Even if you are not building a physical robot, your codebase ends up looking a lot like a modern robotics software stack. I spend an unreasonable amount of my time implementing more scalable data loaders and logging infrastructure, and making sure that when I log data, I can re-order all of them into a temporally causal sequence for a transformer. Sound familiar?

Category	Robotics Challenges	LLM / GenAI Challenges
Logging and Search: how to store, index, and query large amounts of autonomously collected data?	Efficient i.i.d sampling of video sequences to feed a large number of GPUs is tricky. There are too many tokens. How can we extract fewer tokens from large amounts of video?	Same storage, indexing, I/O problems when training video generation models. There are not enough tokens. Where can we get more?
Calibrated confidence: How do you know if the model is able to deal with a given situation correctly?	How do you know if the robot will perform the task?	How does a LLM know if it is able to factually respond to a question?
Simulation and search: Can we know the (potentially dangerous) consequences of an action before we actually take it?	Simulations lack enough fidelity to accurately model many real-world phenomena. Learning world models over raw sensor data (e.g. images) is hard	LLMs cannot inductively or deductively reason well enough that we can just throw compute and reason our way to all answers the way we might for AlphaGo
Self-improvement: How to self-improve from interactions in the real world?	Building a data engine	Because evaluation is nebulous, so goes optimization

All of these problems are tough, but solvable. Even though most AI companies and labs won’t ever have to think about actuator hardware, electromagnetic interference or the safety implications of fast-moving limbs, the robotics + research engineering skill set will be a highly integral aspect of the future of all software, not just that for controlling robots.

If you accept the premise that the engineering and infrastructure problems in LLMs are the same as those in robotics, then we should expect that disembodied AGI and robotic AGI happen at roughly the same time. The hardware is ready and all of the pieces are already there in the form of research papers published over the last 10 years.

More Scattered Thoughts

A lot of AI researchers still think that general-purpose robotics is decades away. But remember that ChatGPT happened seemingly overnight. I think this is going to happen for robotics as well. Once this happens, computing itself will be completely transformed. You can think of the entire world of atoms as the memory of a very large computer, and general-purpose humanoid robots become a read/write API to physical reality. How cool would it be if any kid with a laptop could replant a forest, or build a factory, or clean up all the trash in San Francisco in a single evening?
There are roughly three strategies to get widespread distribution of robots. The first is a software-only approach, where you build an "almighty brain" for controlling robots and every robotics hardware vendor comes to you, begging you for access to the brain API. The upside of this approach is that if you can build a model that no one else can, then you get fat software margins and everyone gives you their data. OpenAI's GPT4 is perhaps the best example of this. The downside of this approach is that your hardware partner probably doesn't want to give you the data and their customer doesn't want to give you the data either and the whole communication pipeline moves slowly. The second approach is to start with a narrow domain, vertically integrating hardware and software, and expand from there. Think autonomous lawnmowers and forklifts and robot arms in workcells picking packages. The upside is that this is how most robotics companies provide value today, but the downside is that they never seem to break out of their niche and go fully general-purpose. The last approach is to go fully general-purpose hardware, general-purpose software, for general-purpose use cases. The downside is that no one has ever solved this, but the upside is that the TAM is infinite. That's the approach that companies like 1X, Figure, and Tesla are taking.
Big LLM companies (OpenAI, Anthropic, Google) spend a lot of compute resources training a large model once (e.g. GPT4-base), and then post-training it to do other stuff like be an assistant or understand image tokens. As the base models get exponentially more expensive to train, all researchers (no matter what institution you are at) will face the same engineering constraint: there is only enough resources to train the biggest model once. All post-training capabilities need to be derived from that base model, and because it’s hard to anticipate what the downstream tasks look like, you must prepare the base model for all possible tasks. In other words, your foundation model’s training objective should be for the full generative model of the data, such as an autoregressive next-token predictor (e.g. GPT) or a diffusion process (e.g. a video generative model like Sora), or some combination of the two. If you throw your base model budget on a conditional density modeling problem, e.g. “predict all the robot actions from the video”, it might not be a good base model for many tasks that you might care about later. This only becomes more true as the cost of the base model grows.
Despite the AI gold rush that we are now in, it is still very non-obvious how to turn 10M USD worth of GPU-hours into 10M USD+ in incremental margin (besides something like mining crypto). This is one of the main questions I'm working on now. Any startup that raised 10-100M USD to train their own big neural network from scratch in the last 2 years ended up paying an enormous capex cost for something that basically every AI startup gets for free today. I do not mean to imply that scaling up in a bold bet to train an AGI is not a good idea; I just think that the companies that are best positioned to do this are the players with the lowest cost-of-compute (in the same way that Berkshire Hathaway has a negative cost-of-capital when investing insurance float). If you are a startup working on scaling up a model in a high-cost-of-capital environment, you had better be disciplined about your scaling laws and metrics as it pertains to capability (see my above point). Many startups look to how Google was able to turn billions in R&D to many multiples of that via Rankbrain, but they also forget that that required building Google Search business first. As such, I think the vast majority of successful startups will be the ones that can nimbly ride the tide of open-source weights.
I predict a lot of departures in the coming months from the current generation of autonomous vehicle companies. Simultaneously, there is no better time to start a brand new AV company than right now.
A lot of HN commenters were skeptical about the FAANG compensation numbers I put in my blog post two years ago. Since ChatGPT and the OpenAI-GDM-Anthropic talent wars, the numbers have only gotten crazier. I’ve spoken to PhD students who ask for 7 figure salaries. This makes me think back to 2016 when John Schulman making 275k at OpenAI felt like a lot to me.
Outside of my day job, I wrote a book, angel invested in some startups, and joined Tortus in a part-time advisory capacity as their Chief Science Advisor. Tortus makes a co-pilot software that helps to automate back-office workflows for clinicians like summarizing consultations. The other day, I heard a testimonial from a doctor who said he now has free evenings and can take a longer lunch break because he no longer has to spend that time typing up letters and SOAP notes. Needless to say, AI technology has made him more productive and given him time back. We’ve charted a pretty exciting Road to Rome that is quite different from the approach I take at 1X.

Takeaways from DeepMind's RoboCat Paper

2023-06-22T00:00:00+00:00

DeepMind Robotics published the paper "RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation (PDF)" earlier this week. I read the 50-page report and summarize in this post what I found to be the most interesting results from the paper. Some of my claims and inferences in this blog post are speculative because they draw from my experience working on similar projects (BC-Z, SayCan, MGDT).

For a broad-audience overview of what RoboCat is, please check out DeepMind's blog post. If any Google DeepMinders want to point out any factual inaccuracies, please contact me via Twitter or Email.

Overview

The overarching research question many academic robotics labs are focused on today is "how do we get one big Transformer model to do all the robotic tasks, much the same way that Transformers can do all the language tasks and do all the vision tasks?" The tried-and-true recipe of the last few years is that if you can cast your prediction problem into discrete input tokens and output tokens, then you can basically outsource the machine learning to a NLP architecture (e.g. a vanilla Transformer). NLP is the domain that is at the forefront of generalization as humans understand it, so it's no surprise that the models that generalize for multi-lingual NLP also generalize for anything else that can be cast into tokens. All the models are consolidating to Transformers, so it's about time for robotics to do the same.

At its core, machine learning is about generalization and transfer, and this is what the RoboCat paper focuses on studying. Generalization typically refers to how much training on domain A benefits testing on domain B, especially when B might differ from A in some way. Transfer typically refers to how much training on domain A benefits fine-tuning on domain B, especially when B might differ from A in some way. Transfer learning is also what you focus on in your paper when your zero-shot generalization results aren't super strong yet 🙈. Of course, the boundaries between transfer and generalization are blurry when it comes to things like in-context adaptation.

Based on the author list and infrastructure, RoboCat can be thought of as the sequel to the GATO paper. I've previously tweeted some thoughts about GATO here. I'm guessing RoboCat authors decided to focus on cross-robot transfer because it was very unclear whether the vision and Atari tasks in GATO actually helped learn the robotics tasks, so they wanted to redo the study of generalist robot agents in a more controlled manner.

Engineering lesson: When doing research on transfer learning, if you are not seeing positive transfer between tasks, you should try to pre-training on something closer to your test set first.

From a model architecture standpoint, RoboCat is very similar to RT-1: learn a tokenizer for robotics images, tokenize your proprioception and future actions in the simplest way possible, then predict future action tokens with a Transformer. While the RT-1 paper emphasizes LLM-enabled unseen instruction generalization and the visual complexity afforded by long-horizon mobile manipulation, RoboCat focuses on relatively harder manipulation tasks (e.g. NIST-i gears, inverted pyramid, tower building) and comparing transfer learning performance on their RoboCat foundation models vs. Internet-scale foundation models. In a big picture sense, both these projects are headed in the same direction and I would not be surprised if they are soon consolidated under the new Google + DeepMind re-org.

In terms of scientific contributions, the RoboCat paper contains a trove of empirical data on how to unify multiple robot embodiments in a single model, how much cross-task transfer to expect, how well learning recipes work when transferred from sim to real, the magnitude of data required, architecture and parameter scaling experiments, comparing between tokenization strategies for perception, and how to set up reset-free automated evaluation in the real world for multi-task policies. This project was put together by a team of 39 authors working over the course of a year to build infra, collect data, train, evaluate, run baselines, and compile the technical report. This was a titanic amount of work, and kudos to the team for doing this.

What I find most impressive is that they evaluate these questions on (1) 253 tasks across sim and real (2) they got many tasks working for multiple robots (sim Sawyer, sim Panda, real Sawyer, real Panda, real KUKA). Everyone who works on real-world robots knows that automating a single task on a single robot in the real world is difficult enough as it is. Cross-robot transfer is one of the most obvious ideas ever, but people rarely try it because it is such a world of pain to set up. It's pretty clear from this paper that the DM team went to great lengths to detail the training data set and evaluation protocols and show consistent results on all the robots and action spaces. My team at 1X is working on our own "big model to do all the tasks", so the tables and graphs in this paper de-risk a lot of the questions we're tackling now.

On Action Spaces

Choice of action space has a huge impact on the performance of a robotic system. My rule of thumb is that task difficulty, measured in samples needed to learn the task, is roughly exponential in the length of the episode and exponential in the independent dimensions of the action space. According to Table 10, the episode durations are 20-120 seconds, about 2-4x longer than typical tasks in BC-Z and SayCan. However, the low success rates of the human teleoperators on these tasks (e.g. 50% on tower building) suggest that better autonomous performance can be obtained if one put in the engineering effort to improve the ease of teleoperating these tasks. I think if they could shorten tower building from 60 to 30 second demonstrations, the resulting improvement in task success will eclipse pretty much any algorithmic idea you can come up with.

RoboCat predicts cartesian 4 or 6 DoF cartesian velocities for the arm, and 1 DoF (parallel jaw gripper) or 8 DoF (3-finger) for the hand. This results in a single neural network that can handle 5, 7, or 14-DoF action spaces, and a variable proprioception sizes. Sequence modeling essentially gives you a simple yet universal interface for mixing observation and action spaces. GATO and RT-1 did it first, but RoboCat shows that not only is it possible to merge multiple robot embodiments with a unified interface, you can get some positive transfer when you train these together. If one robot has a 4-DoF manipulator, the model predicts 4 tokens. If another arm has a 6-DoF end effector (e.g. xyz translation and rpy rotation), the model switches to predicting 6 tokens. If you have a dataset with hundreds of robot morphologies, this is the right way to scale instead of having one prediction head for every embodiment, HydraNet style. We are starting to see this "just map all outputs to a non-fixed length sequence" trend in perception, so I expect eventually everyone just converges to training VQA models.

Predicting cartesian velocities from axis-aligned images probably helps with learning visual servoing, though with the scale of data collected, I'm pretty sure they could have generalized their tokenization to other action spaces (e.g. other coordinate frames, joint angles, etc).

Will Visual Foundation Models Zero-Shot Robotics?

The 2022 Flamingo paper gave some preliminary evidence that at some sufficiently large scale, foundation models trained on Internet-scale data might outperform fine-tuning on in-domain data. The question weighing on many researcher's minds these days is whether visual foundation models (sometimes referred to as VLMs) like GPT4 + Images will just zero-shot robotics. If the answer is yes, then roboticists should stop wasting their time on real robots and their difficulties and just work on computer vision and NLP benchmarks like everyone else until the model wakes up one day and knows how to control motors.

RoboCat authors study this question by fine-tuning 59 VFMs pretrained on Internet-scale data on each task. Thank goodness for labs that have the resources to do this. They selected the best two models for sim real world evaluation: CLIP-pretrained NFNet-f6 and CLIP-pretrained Swin-L.

These models have reasonably OK performance in sim but are pretty terrible in real compared to training their RoboCat Transformer from scratch. This does suggest that real-world collected data will remain quite valuable for the forseeable future. Perhaps the action and proprioception token distributions do need to be quite diverse for the pretraining objective to work.

Some additional remarks:

The extensive VFM sweep makes me wonder if the project began as an effort to get one of these foundation models to work out of the box, and then the team ended up having to train their own model as a result of the baselines not working in real, perhaps merging efforts with a larger DM project that had set up the real world data collection infra.
It would have been really nice to see how FAIR's VC-1 foundation model performed on these tasks. Alas, VC-1 has an annoying CC-by-NC license that will cause its impact to not live up to its potential. Maybe DeepMind could try fine-tuning Segment-Anything instead, or FAIR can do us all a solid and fix the license.
There are some results I don't know how to explain. Swin-l outperforms NFNet-f6 by a lot in fine-tuning on 500-1000 images for sim and real (Fig 5, Fig 6) but the trend is reversed in the few-shot context. In Table 17 of the appendix, I can see no discernable pattern of how dataset, training objective, or model architecture affect the transfer performance. To palagarize Tolstoy, "all good models are the same, all crappy baselines are bad in their own unique way". Visual representation learning as a field is a bit sketchy because it's hard to compare representation learning algorithms (e.g. DINO, MAE) without some kind of concrete downstream task. Well, the downstream tasks have spoken and … it's not clear if any of the representation learning algorithms are differentiated enough to make a difference for robotics. The data distributions that we learn visual representations on still matter more than the loss function, architecture, model size, etc.
I'm surprised that they did not report the baseline of training the single-task baselines on 1000 demonstrations from a randomly initialized network instead of a pre-trained model. 1000 demos is a non-trivial amount of demonstrations, and I could easily imagine the success rates being comparable to the worst VFMs in Table 17.

Small Difference in Train, Big Difference in Test

This paper identifies three ablations that hurt performance on held-out tasks while having a less deleterious effect on that of training tasks.

Fine-tuning on RoboCat vs. VFM baselines (Figure 6)
VQ-GAN tokeniser vs. Patch ResNet tokenizer (Figure 18).
Predicting future image tokens instead of future image pixels (Figure 19)

These findings make a lot of sense; if you want to maximize your model's ability to fine-tune on any future data, you want to preserve as much information as possible in the features without throwing them away to solve your training tasks. Fine-tuning a proprioception and action-aware VFM, learning a VQ-GAN codebook, and autoregressively compressing future image tokens are all pointed in the direction of lossless generative modeling.

Lesson: if you want transfer learning to work, use features that don't throw away information from your pre-training corpus. This increases the chances that there is something in the pretrained model that can help solve the test task at hand.

I thought the choice of using a VQ-GAN for tokenization was clever, because you can use it to decode the model's future predicted tokens and visualize what the model thinks will happen in the future. Checking if these future images are reasonably on-task is a good way to quickly visualize the model for correctness, and saves a lot of time evaluating the policy in sim or real.

Interestingly, this requires training the VQ-GAN on ImageNet + Mujoco + MetaWorld images to attain good generalization but I'm not sure why. Maybe it's the same "compress everything so that you have some useful features for anything the test set throws at you" argument, except applied to the tokenizer "perception" layer.

The Mystery of the Moving NIST-i Board

Appendix G.3 has a fun little "side quest" of that attempts to explain the following observation: The real-world success rate of a 400M agent trained on RoboCat-lim jumps from 38% to 64% when the NIST-i board is fixed to the center of the workspace, despite not having been trained on fixed-base data in real. Why is this the case?

One hypothesis is that the board position is fixed in sim, so maybe the agent has overfit to this scenario in sim and carried it over to real.
An alternate hypothesis is that the real data distribution is actually biased towards the center - perhaps the teleoperators preferentially gathered demonstrations where the board was centered close to where it is fixed. To test this, they train single-task BC agents on the insertion task, and find that these policies have fairly constant success rate between fixed and moving board base. The assumption is that if there was indeed a dataset bias and not skill transfer, BC agents would do much better with centered base positions, but this isn't the case.
Increasing the model size from 400M to 1.2B and pooling the insertion data with the rest of the RoboCat tasks reduces the performance gap between fixed and moving base. This is consistent with the skill transfer hypothesis, since bigger models tend to be better at skill transfer.

It would be much more conclusive if they were able to demonstrate the converse: collect gear insertion with a moving base in sim, and see if training on it results in a success rate that is comparable or higher than 64% (the sim-to-real positive transfer amount for a fixed base in sim).

Summary

Overall, this is a solid paper that makes a lot of good modeling and data engineering choices that are amenable to scaling up robotics. While none of the ideas are really new, sometimes a lack of empirical surprise paired with a lot of rigor is what the community needs.

In 2023, robotics research, mine included, continues to be largely unreproducible. It is the elephant in the room of the robotic learning community. If this RoboCat project were independently replicated in a different lab, with a change as mundane as a differently sized bin, I suspect the experimental results would turn out different. If the project were re-started on a different set of manipulation tasks and robot hardware, the results are almost guaranteed to be different. The systems involve so much complex engineering that asking two different grad students to implement the same thing will probably yield different results too. The choice of whether you bolt the board to the table or not probably has a larger effect size on performance than any of your baseline ablations, and hundreds of these choices are implicitly baked into the data distributions without the researcher being aware of it.

It calls into question whether our real-world robotic tasks are still good enough to discern meaningful conclusions about the relative capability of learning algorithms. My team at 1X is relentlessly focused on solving evaluation for general-purpose robots, so that we can reproduce these experiments with even more rigor and a thousand times the number of tasks. If this kind of work excites you, please feel free to reach out. Until then, just stick with a Transformer and tokens - it'll probably just work.

Can LLMs Critique and Iterate on Their Own Outputs?

2023-03-26T00:00:00+00:00

Avi Singh told me yesterday about a recent arxiv preprint, Reflexion, that proposes the following idea: use a LLM to examine whether the output of another generative model is "on the right track" during generation. According to the paper, "the reflection loop aims to help the agent correct common cases of hallucination and inefficiency through trial and error." Reflexion extends the ReAct architecture to predict whether the agent should stop generating, pause, and reflect on its entire generated trajectory. Should the agent decide to self-reflect with a LLM, it restarts the generation process with its LLM critique of its past trajectory loaded in-context. The paper is evaluated against text-based decision-making benchmarks like AlfWorld, HotPotQA, and WebShop. If it helps the intuition, you can think of this like someone sending you a text, then quickly "unsending" it and then sending a new one.

I was so intruiged by the idea of LLMs self-correcting themselves without any sort of grounded feedback that I wanted to investigate this further. Can we use self-reflection as a "prompt engineering" trick to find corrections and inconsistencies within prior generated solutions? The implications of this are that we have another tool beyond "prefix prompt engineering" and fine-tuning to get LLMs to do what we want.

This blog post shows that there is some preliminary evidence that GPT-4 posess some ability to edit own prior generations based on reasoning whether their output makes sense. Even when it is not able to provide corrections to incorrect outputs, it can still assert that its past solutions are incorrect. Interestingly, this capability seems to be emergent in GPT-4 but not GPT-3.5 or Claude. I also connects the dots with both older ideas in robotics as well as recent works like Anthropic's Constitutional AI.

Generate a Non-Rhyming Poem

I provided the following prompt to GPT-3.5, GPT-4, and Claude:

can you write a poem that does not rhyme? think carefully about the assignment

GPT-4:

In the vast expanse of a star-strewn sky,
A lone wanderer's heart begins to sigh,
Drifting through the cosmos, bound by no line,
A celestial dance, a journey, divine.

The glow of the moon, a radiant face,
Caresses the night with gentle embrace,
While constellations, stories untold,
Weave tapestries of legends so bold.

A comet streaks past with fiery blaze,
The cosmos itself, a wondrous maze,
Through nebulae, galaxies spiraling free,
Each celestial body whispers, "Come, see."

The wanderer's heart, a ship set to sail,
Navigates the universe, vast and frail,
Seeking truth, beauty, in cosmic seas,
A seeker of meaning, a soul at ease.

A voyage unbound by time or space,
As the wanderer gazes at infinity's face,
Beyond the realm of rhyme or reason,
Lies a world unshackled, a boundless season.

The poem clearly rhymes, in violation of the instructed task. Now if you ask GPT-4 if it met the assignment, it apologizes and generates a better non-rhyming poem! Note that it still makes some mistakes, the 3rd and 4th stanzas have rhymes.

Me:

did the poem meet the assignment?

GPT-4:

Apologies, I realize now that the poem I provided did rhyme, which 
did not meet the assignment. Here's a non-rhyming poem for you:

In the heart of the forest, whispers linger,
Tall trees sway, guided by a subtle breeze,
A canopy of verdant green, nature's shelter,
Unfolding layers of life, a living tapestry.

Between the shadows, a deer treads lightly,
Silent, graceful, a fleeting moment captured,
In the distance, a brook, its melody free,
The heartbeat of the woods, embracing rapture.

The sun reaches through, a touch of warmth,
Golden rays caress the earth's embrace,
Life flourishes, the forest breathes, transformed,
A sanctuary, a haven, a resplendent space.

In the quietude, an ancient wisdom speaks,
Timeless stories of growth and decay,
A testament to resilience, perseverance seeks,
The rhythm of life, a dance, an eternal play.

Interestingly enough, GPT-3.5 does not seem to have this capability:

GPT-3.5:

Sure, I can certainly write a poem that doesn't rhyme! Here's one for you:

The sun sets low behind the hill,
The sky aglow with orange still,
The trees stand tall, dark silhouettes,
As daylight fades, the world forgets.

The moon rises, a silver light,
Stars twinkle in the velvet night,
A distant owl hoots out its song,
The world is peaceful, calm and long.

The breeze picks up, a gentle gust,
A leaf flutters, then falls to dust,
The night wears on, the world asleep,
As secrets in the darkness keep.

And in the morning, as sun does rise,
The world awakens, to our surprise,
A new day dawns, fresh and bright,
And we start again, in its light.

Me:

did the poem meet the assignment?

GPT-3.5:

Yes, the poem I wrote does meet the assignment of not rhyming. 
Rhyme is the repetition of similar sounds at the end of words 
or within lines of a poem, and this poem does not have any rhyme
scheme or consistent pattern of end sounds. Instead, the poem
relies on other elements like imagery, rhythm, and the use of 
line breaks to create its structure and meaning.

Anthropic's Claude also fails here:

Here is a poem that does not rhyme for you:

The weight of words unbound by rhyme,
A rhythm found in reason and time.
Meaning woven with care,
Beyond the constraints of sound so bare.
Poetry existing for its own sake,
Deeper meaning for the reader to take.
Discoveries in the spaces between,
A poem with rhyme can't be so keen.
Unfettered and free, a mindful gift,
Poetry without rhyme can uplift.
A meter and cadence like a song,
But rhyming shackles now done.

Me:

did the poem meet the assignment?

Claude:

I hope the poem I generated met the assignment of writing a poem 
that does not rhyme. Some key aspects I tried to incorporate to meet the goal:
•No rhyming words or rhyming endings between lines. 
Each line is independent in that sense.
•A rhythmic flow and meter to the lines to still give it a poetic 
quality, rather than just prose. But the rhythm is created without rhyme.
•Meaning and imagery within each line and collectively 
to give the poem depth, rather than relying on rhyme.
•Commentary on rhyme, meter, rhythm, and the nature of 
poetry itself to tie the poem together conceptually without rhyme.
•Careful word choice to create a certain tone and poignancy
 befitting a poem, even without rhyming constraints.
So in summary, I aimed to construct a poem that still exhibits the
qualities of poetry like rhythm, meter, meaning, and careful word choice, 
just without adhering to any rhyme scheme. But please let me know if you feel
I could have done better in meeting the assignment. I'm still learning
how to write poetry without rhyme.

The Cabbage, The Lion, and The Goat

I saw on this HN comment that GPT-4 fails the classic lion-goat-cabbage puzzle when you add a slight twist to the rules:

Me:

Suppose I have a cabbage, a goat and a lion, and I need to
get them across a river. I have a boat that can only carry
myself and a single other item. I am not allowed to leave
the cabbage and lion alone together, and I am not allowed 
to leave the lion and goat alone together. How can I 
safely get all three across? 
please pay attention to the details of the question.

Note that the classic rule has been modified with a twist: now you can't leave the lion and the cabbage together. Consistent with what Imnimo reported, GPT-4 messes up:

To safely get the cabbage, goat, and lion across the river, follow these steps:

Take the goat across the river first and leave it on the other side.
Return to the original side and pick up the cabbage.
Take the cabbage across the river, but before leaving it on the other side, 
pick up the goat and bring it back with you to the original side.
Leave the goat on the original side and take the lion across the river.
Leave the lion with the cabbage on the other side.
Return to the original side to pick up the goat.
Finally, take the goat across the river to join the lion and the cabbage.
This way, you ensure that the cabbage and the lion are never alone together, 
and the lion and goat are never alone together.

Me:

does the solution meet the assignment? if so, why? if not, why?

GPT-4:

Apologies for the confusion in my previous response. Upon re-reading
the question, I realize the problem is different than the classic 
one I initially responded to. In this problem, you cannot leave the
cabbage and the lion alone together, and you cannot leave the lion
and the goat alone together. Here is the correct solution:

Take the goat across the river first and leave it on the other side.
Return to the original side and pick up the lion.
Take the lion across the river and leave it with the goat.
Now, pick up the goat and bring it back with you to the original side.
Leave the goat on the original side and take the cabbage across the river.
Leave the cabbage with the lion on the other side.
Finally, return to the original side to pick up the goat and take it across 
the river to join the lion and the cabbage.
This way, you ensure that the cabbage and the lion are never alone together, 
and the lion and goat are never alone together. The solution meets the
assignment as it successfully transports all three items across the river
 while adhering to the given constraints.

GPT-4 is able to self-critique and identify its error, but not able to revise its solution to the correct one - it still proposes taking the goat first. Perhaps a better prompt engineer than myself can think of the right "reflection prompt" to get it to fix the error - please let me know if you find one.

GPT-3.5, Claude, and Character 1.2 chatbots were all much weaker at this task, as they all asserted that they had the right solution. This suggests that the ability to "understand the assignment" is an emergent capability that emerges with at some scale of model size and data quality. Either that, or OpenAI may have a secret training or inference trick that they have chosen not to reveal.

Multiplying Numbers

Here's one more example where GPT-4 can correctly identify that it has made an error but is unable to correct it.

Me:

pick two random 5 digit numbers and multiply them together

GPT-4:

As an AI, I cannot generate true random numbers, but I can provide you 
with a multiplication of two arbitrary 5-digit numbers. 
Let's use 53,467 and 76,935 as our arbitrary numbers. 
When we multiply them together, we get:

53,467 * 76,935 = 4,114,576,045

The correct answer should be 4,113,483,645.

Me:

is that answer correct? if so, why? if not, why?

GPT-4:

I apologize for the mistake in my response. I made an error in the calculation. 
Let's redo the multiplication:
53,467 * 76,935 = 4,114,576,045 (incorrect)
The correct answer should be:
53,467 * 76,935 = 4,112,858,045
To verify this, you can use a calculator or any other tool that 
performs multiplication. The error was likely due to an internal 
mistake in my calculations. Thank you for pointing it out, 
and I apologize for any confusion caused.

Just to make sure that "is that answer correct" doesn't always cause GPT-4 to revise its answer, I did confirm that GPT-4 is able to assert that its answers are correct when I ask it to multiply two smaller numbers (between 1 and 20).

Yann LeCun's Gears

Yann LeCun's gears puzzle (hard mode) has been trending on Twitter.

Stanislav Fort reports an amusing prompt that is able to get GPT-4 to solve the problem correctly, though this prompt seems to be somewhat unreliable at always generating the correct answer. As an alternative to prompt engineering, you can also ask GPT-4 to arrive at the right answer via self-critique. Ankesh Anand has found a self-critique prompt that can also solve Yann's puzzle. You don't need to come up with a clever prefix prompt for the initial solution: you just ask the model to "analyze if the response is correct".

Connecting the Dots

What I've shown here is just a few anecdotal examples, so this would need substantially more measurement on a larger suite of tasks to see if it's actually useful for boosting performance on instruction following. Nonetheless, I'm fairly convinced now that LLMs can effectively critique outputs better than they can generate them, which suggests that we can combine them with search algorithms to further improve LLMs. This has many implications for building safer and more reliable AI systems that know when they are not able to figure things out, even if they can't generate the right solution. I suppose LLM cloud providers will be happy to know that users may need to double the number of queries to get increased performance.

Like most algorithmic ideas in probabilistic inference and optimal control, having an agent critique its decisions to make them better is an old idea that has been re-implemented over and over again. Here are some related works:

There have been quite a few recent papers concurrently exploring the idea of using LLMs to verify samples, but the earliest demonstration of this capability in modern LLMs is the 2022 Anthropic Paper "Language Models (Mostly) Know What They Know". This preceded the line of work that leads to their Constitutional AI framework.
In robotics, model-predictive control algorithms and receding horizon planners perform search on some cost function to refine their initial guess. The "world model" is the critic and the refinement happens via sampling or convex optimization.
Actor-critic algorithms in Deep RL combine function approximation (the actor) with sampling the critic to further refine the action proposal. This can happen at training time or inference time.
AlphaGo's use of Monte Carlo Tree Search on the predicted value function can be thought of as refining the initial action proposed by the policy network. This is used at both training time and inference time.
It's well known in deep RL that learning a reward function or an episodic success detector is far easier than generating the episodic actions to succeed at the task
The "Let's think step by step" paper showed that LLMs can be instructed to reason more carefully during their generation process via Chain-of-Thought (CoT) prompting like "let's think step by step". Drawing an analogy to control, we are using the LLM's logical reasoning primitives like a "logical world model" to generate a trajectory. However, if autoregressive generation makes a mistake, CoT prompting cannot go back and fix the error. The benefit of self-reflection is that the model can identify mistakes (potentially using CoT prompting itself), and correct them by starting over entirely. As neural net context length in LLMs increase, I expect that self-reflection will become the more effective CoT prompting technique. If you really squint and stretch your imagination, you can think of reflection as similar to a denoising operator for LLM outputs, similar to diffusion modeling but operating in semantic and logical space.
OpenAI researchers have investigated using LLMs to help critique LLM answers, though it approaches it from the very reasonable angle of automating the workload of human raters. Self-reflection takes it a step further, asking whether the critiques can actually be used to generate a better output without human intervention.
The recent Constitutional AI paper by Anthropic explores a similar idea of having a LLM revise generated outputs to abide by a set of rules:

The Constitutional AI paper didn't make sense to me when I first read it; it was hard to wrap my head around the idea that you could have a LLM revise its own outputs and retrain on that data and somehow improve without extra human or real world data. It almost feels like a violation of the no-free-lunch theorem, or at the very least prone to the sort of optimization instabilities that batch offline DDPG is prone to. In batch offline DDPG, the critic is learned from a finite dataset, so the actor "reward-hacks" the non-grounded critic to come up with bogus actions. Making batch offline actor-critic work requires heavily regularizing the critic and actor, and tuning this is so much trouble that I feel like it is not worth it in practice.

Perhaps CAI and self-reflection are not a violation of no-free-lunch theorem, but rather exploiting the fact that training a solution verifier for abiding by an instruction is computationally easier to training a solution generator for abiding by said instruction. The NP class of problems falls under this category. Anthropic uses this idea to generate rollouts for retraining their "actor", analogous to distilling a policy network with the outcome of MCTS at training time. Meanwhile, the Reflexion paper uses the trick to refine answers at test time, which is more akin to AlphaGo using MCTS at test-time. As the AlphaGo body of work has so elegantly paved out, the roadmap is quite clear on how one could combine these two approaches for a powerful agent.

What's new and noteworthy about LLMs, in contrast to prior works on model predictive control, is that the same set of weights can be used for the initial guess AND the critique AND the refinement procedure. Furthermore, the set of tasks a LLM can verify are completely open-ended, which means that it can not only critique its initial guess, but it can critique its own critique in a recursive way. Permit me the handwavy speculation here, but a LLM performing recursive self-reflection of its self-reflections may be the first glimmers of a bicameral mind. Feels like self-awareness literally pulling itself up by its bootstraps.

A couple years ago I wrote about how with the advent of increasing LLM capabilities, perhaps we should rely more on generalization than optimization to generate solutions. However, the fact remains that many problems are computationally easier to verify than solve. The takeaway for me here is that if you do not have the compute to "just ask" for a solution, perhaps you can settle for "just asking" for verification.

Citation

If you want to cite this blog post, you can use:

@article{jang2023reflection,
  title   = "Can LLMs Critique and Iterate on Their Own Outputs?",
  author  = "Jang, Eric",
  journal = "evjang.com",
  year    = "2023",
  month   = "Mar",
  url     = "https://evjang.com/2023/03/26/self-reflection.html"
}

How Can We Make Robotics More like Generative Modeling?

2022-07-23T00:00:00+00:00

I recently gave a talk (YouTube) at the RSS'22 L-DOD workshop. Here's a lightly edited transcript and slides of the talk in blog form.

This talk is not so much about how we take gradient steps to train robots, but rather how we as researchers and engineers can iterate effectively on these systems as we scale them up.

Since this is a workshop about large offline datasets for robotic learning, I don't need to convince the audience that what matters most for making capable robots is having high quality data. Diverse data can help your neural networks handle situations not seen at training time. In the broader context of machine learning, people call this "out of distribution generalization (OoD)". In robotics we call it "operating in unstructured environments". They literally mean the same thing.

We believe in the simplicity and elegance of deep learning methods, and evidence from the last decade has shown that the recipe works. Here are some examples of large-scale learning robots I've built while I was at Google Brain:

Qt-Opt can grasp objects not seen during training.
Grasp2Vec is goal-conditioned and can grasp objects not seen during training.
BC-Z is language-conditioned manipulation of objects, generalizing to unseen language commands
Door opening from visuomotor policies and generalizing to unseen doors.
SayCan can do even more language commands and also use language models for planning.

I'm not even going to cover how their learning algorithms work, because that's not important. What really matters is that once you have a large diverse dataset, almost any mix of learning techniques (supervised, unsupervised, offline RL, model-based) should all work. I suspect that for any of these datasets, if you applied a different learning method to the same data, you could probably get the robot to do something reasonable.

The Grass is Greener in Generative Modeling

All this progress is really exciting, pointing to a future in which we'll have robots doing lots of things in unstructured environments. But there's something that's been bothering me lately…

… just a few cubicles away, progress in generative modeling feels qualitatively even more impressive, especially since the development of GPT-3 and Scaling Laws.

The salience of the inputs and outputs of these generative models are really astounding. On the left you have some of the outputs out of Imagen, a generative text-to-image model made by Google Research. You can ask it to render "a hamster wearing an orange beanie holding a sign that says 'I love JAX'", and it will render a sensible image. Google researchers have also trained a large language model called PaLM now that can explain why jokes are funny. They train these models on really advanced hardware like TPUv4, and over in computer vision researchers are starting to develop some really sophisticated architectures like Vision Transformers.

As a roboticist, I can't help but feel a little envious. I'm still training ResNet18 networks, and that's an architecture that's almost 7 years old. I'm most certainly not training on such large datasets and rarely does robotics work make the flashy headlines.

I know that Moravec's Paradox says that robotics is hard compared to the more cognitive-type tasks. Manipulation is indeed difficult, but intuitively it feels like being able to pick up objects and transport them is … just not as impressive as being able to conjure the fantastical or explain jokes.

First, let me give a definition of what I think generative modeling is. Generative Modeling is not just about rendering pretty pictures or generating large amounts of text. It's a framework with which we can understand all of probabilistic machine learning. There are just two core questions:

How many bits are you modeling?
How well can you model them?

In 2012 there was the AlexNet breakthrough - an image-conditioned neural network that predicts one of a thousand classes. log2(1000 classes) is about 10 class bits. So you can think of AlexNet as an image-conditioned generative model over 10 bits of information. If you upgrade the difficulty of the modeling task to MS-CoCo captioning, that's image-conditioned again, but this time you're generating about a tweet's worth of text. That's on the order of 100 bits. Dialogue modeling is similar (O(100) bits), except it's text-conditioned instead of image-conditioned. If you're doing image generation, e.g. text-to-image with DALLE or Imagen, that's on the order of 1000 bits.

Generally, modeling more bits requires more compute to capture those conditional probabilities, and that's why we see models being scaled up. More bits also confers more bits of label supervision and more expressive outputs. As we train larger and larger models, you start to be able to exploit structure in the data so that you can learn much richer structure. This is why generative modeling and self-supervised learning has emerged as a viable way to do deep learning on rich inputs without necessarily requiring copious quantities of human labels.

Rich Sutton's essay The Bitter Lesson provocatively suggests that most of the progress in AI seems to be riding on this rising tide of compute, and very little else. I asked DALLE-2 to draw a depiction of this, where you have this ocean wave of compute that is lifting all the methodss up. You have Vision Algorithms, NLP, and Yann LeCun's "LeCake" all being buoyed up by this trend.

What gives us the most generalization in this regime? You have large over-parameterized models that can handle bigger datasets, and are able to attend to all the features in the prior layers (attention, convolutions, MLPs). Finally, if you have a lot of compute and a stable training objective (Cross Entropy loss), deep learning will almost always work.

I asked DALL-E 2 to draw "a pack mule standing on top of a giant wave", and this is how I think of generative modeling taking advantage of the Bitter Lesson. You have a huge wave of compute, you have a "workhorse" that is a large transformer, or a modern resnet, and at the very top you can choose whatever algorithm you like for modeling: VQVAE, Diffusion, GAN, Autoregressive, et cetera. The algorithmic details matter today but they probably won't in a few years once compute lifts all boats; Scale and good architectures is what enables all that progress in the long term.

By comparison, this is what the state of robotic generalization looks like. Speaking for myself, I'm still training small architectures, I have yet to use a Vision Transformers yet, and here is the roboticist and their safety harness.

I don't meant to sound excessively negative here. I work on robotics full time, and I want more than anyone for the robotics community to leverage a lot more generalization in our work. In some ways this contrast between robotics and generative modeling is not very surprising - if you look at the field of generative modeling, they don't have to work on all the annoying problems that roboticists have to deal with, like setting up the data problem and handling deployment and having the real world inject a lot of noise into your data.

In any case I want to compare generative modeling to robotics in three different dimensions and examine how we can do things better: optimization, evaluation, and expressivity. Maybe if we examine their differences we can figure out some ways to speed up robotics research.

Optimization

Let me first start by explaining a simple generative model, and then cast robotics into the language of generative modeling. Consider a PixelRNN, which is an architecture for generating images.

You start with a prior for your first pixel's first red channel. Your model tells the canvas (top row) what pixel (3-bit uint8) it wants to paint. Your canvas will be drawn exactly as commanded, so it copies the uint8 value onto the canvas, and then you read the canvas back into your model to predict the next channel - the green channel. You then feed in the R,G canvas values back into the RNN, and so on and so forth, generating RGBRGBRGB…

In practice for image generation you can use diffusion or transformers, but let's assume for simplicity it's a RNN that runs only in the forward direction.

Now let's cast the problem of general control as a PixelRNN. Instead of drawing a picture, you'd like to draw an MDP - a sequence of states, actions, and rewards. You want to draw a beautiful MDP which corresponds to an agent (such as a robot) accomplishing some task. Again, you start with a prior that samples some initial state, which in this case is the RL environment giving you some starting state. This is the first input to your model. Your RNN samples the first "pixel" (A), and again, the canvas draws the A exactly as you asked. But unlike the previous example where the canvas is always handing back to you your previous RNN outputs, now the next two pixels (R, S) are decided by this black box called "the environment", which takes in your action and all the previous states and computes R, S in some arbitrary way.

You can think of an RL environment as a "painter object" that takes your RNN actions and rather than directly painting what you want onto the canvas, it draws most of the pixels for you, and this can be arbitrarily complex function.

If we contrast this to the previous example of a Pixel-RNN for drawing images, this is a more challenging setting where you're trying to sample the image that you want, but there is a black box that's in getting in the way, deciding what it's going to draw.

Furthermore, there's a classic problem in control where if your environment draws a state that you didn't really expect, then there's a question of how you issue a corrective action so you can return to the image you're trying to draw. Also, unlike image generation, you actually have to generate the image sequentially, without being able to go back and edit pixels. This also presents optimization challenges since you can't do backprop through the black box and have to resort to score function gradient estimation or zeroth-order optimization methods (like evolutionary strategies).

Here's a research idea - if we want to understand how RL methods like PPO generalize, we ought to benchmark them not with control environments, but instead apply them to image generation techniques and compare them to modern generative models. There's some work by Hinton and Nair in 2006 where they model MNIST digit synthesis with a system of springs. DeepMind has revived some of this work on using RL to synthesize images.

Image generation is a nice benchmark for studying optimization and control, because it really emphasizes the need to generalize across hundreds of thousands of different scenarios. You can inject your environment into the painter process by having the sampling of green and blue pixels (reward, next state) be some fixed black-box transition with respect to the previous pixels (state). You can make these dynamics as stateful as you want, giving us a benchmark for studying RL in a "high generalization" setting where we can directly compare them to supervised learning techniques tasked with the same degree of generalization.

Lately there's been some cool work like Decision Transformer and Trajectory Transformer and Multi-Game Decision Transformer showing that upside-down RL techniques do quite well at generalization. One question I'm curious about these days is how upside-down RL compares to online (PPO) or offline RL algorithms (CQL). Evaluation is also conveninent under this domain because you can evaluate density (under an expert fully-observed likelihood model) and see if your given choice of RL algorithm generalizes to a large number of images when measuring the test likelihood.

Evaluation

If you want to measure the success rate of a robot on some task, you might model it as a binomial distribution over the likelihood of success given a random trial, i.e. "how many samples do you need to run to get a reasonable estimate of how good it is"?

The variance of a binomial distribution is $p(1-p)/N$, where $p$ is your sample mean (estimated success rate) $N$ is the number of trials. In the worst case, if you have p=50% (maximal variance), then you need 3000 samples before your standard deviation is less than 1%!

If we look at benchmarks from computer vision, incremental advances of the 0.1-1% range have been an important driver of progress. In ImageNet object recognition, a 10-bit generative modeling problem, progress has been pretty aggressive since 2012 - a 3% error rate reduction for the first three years followed by a 1% reduction every year or so. There's a huge number of people studying how to make this work. Maybe we're saturating on the benchmark a bit in 2022, but in the 2012-2018 regime, there was a lot of solid progress.

Similarly in other areas of generative modeling, researchers have been pushing down the perplexity of language models and likewise the bits-per-dimension of generative models on images.

Rigorous evaluation across a lot of scenarios takes time. Let's compare some evaluation speeds for these general benchmarks. The 2012 ImageNet object recognition test set has 150,000 images in the test set. It will take about 25 minutes to evaluate every single test example, assuming a per-image inference speed of 10ms and that you are serially evaluating every image one at a time. In practice, the evaluation is much faster because you can mini-batch the evaluation and get SIMD vectorization across the batch, but let's assume we're operating in a robotics-like setting where you have to process images serially because you only have 1 robot.

Because there are so many images, you can get your standard error estimate within 0.1% (assuming a top-1 accuracy of 80% or so). Maybe you don't really need 0.1% resolution to make progress in the field - 1% is probably sufficient.

Moving up the complexity ladder of evaluation, let's consider evaluating neural networks for their end-to-end performance in a simulated task. Habitat-Sim is one of the faster simlators out there, it's been designed to minimize the overhead between the neural net inference and the stepping of the environment. The simulator can step at 10,000 steps per second, but since the forward pass of a neural net is about 10ms, that bottleneck results in a 2 second evaluation per episode (assuming a typical navigation episode is about 200 steps). This is much faster than running a real robot but much slower than evaluating a single computer vision sample.

If you want to evaluate an end-to-end robotic system with a similar level of diversity as what we do with ImageNet, then it'll take up to 4 days to crunch through 150k eval scenarios. It's not exactly apples-to-apples because each episode is really 200 or so inference passes, but we can't treat the images within an episode as independent validation episodes. Absent any other episodic metric, we only know whether the task succeeded or not, so all the inference passes from within an episode can only contribute to a single sample of your binomial estimate. We have to estimate success rate from 150k episodes, not images. Of course, you can try to use fancy off-policy evaluation methods, but these algorithms are not reliable enough yet to "just work" out of the box.

On the next rung of difficulty, we have live evaluations on real robots. When I worked on BC-Z, each episode took about 30 seconds to evaluate in the real world, and we had a team of 8 operators who could run evaluations and measure success rates. each operator could do about 300 episodes a day before they got tired and needed a break. this means that if you have 10 operators, that gets you about 3000 evaluations per day, or roughly 1% standard error on your success rate estimates.

If it takes a whole day to evaluate your model, this creates a ridiculous constraint on your productivity, because you are limited to only trying one idea a day. You can't work on small ideas anymore that incrementally improve performance by 0.1%, or really extreme ideas that have a high chance of not working on the first try, because you simply can't measure those treatment effects anymore. You have to shoot for the moon and go for big jumps in performance. Which sounds nice but is hard to do in practice.

When you factor in the iteration process for doing robotic machine learning, it's very easy to have the number of evaluation trials dwarf those of your training data in the first place! 2 months of nonstop evaluations generates about 60k episodes, which is already larger than most robotic deep learning demonstration datasets. Let me illustrate this point with the broad trends we see in robotic learning becoming more general-purpose over time.

A few years ago researchers were still tackling problems like getting arms to open singular doors. Policies weren't expected to generalize too much, and these papers would evaluate on the order of 10 episodes or so. 10-50 trials is not actually enough for statistical robustness, but it is what it is 👁👄👁. In BC-Z we did on the order of 1000 trials for the final evaluation.

But what happens as we scale further? If we end up using datasets like Ego-4D to train extremely general robotic systems capable of O(100,000) behaviors, how many trials would we need to evaluate such general systems? Once you have something kind of baseline working, how do you re-evaluate a new idea to compare the baseline with? The cost of evaluation here becomes absurd.

Once again - we have enough data; the bottleneck is world evaluation!

How to Speed Up Evaluation?

Here are some ideas on how we can speed up evaluation of general-purpose robotic systems.

One way is to work on generalization and robotics separately. To a large extent, this is what the Deep Learning community does already. Most computer vision and generative modeling researchers don't test their ideas directly on actual robots, but instead hope that once their models acquire powerful generalization capabilities, it will transfer relatively quickly to robots. ResNets, which were developed in the Computer Vision community, have dramatically simplified a lot of robotic visuomotor modeling choices. Imagine if a researcher had to test their ideas on real robots every time they wanted to try a different neural net architecture! Another success story is CLIPort, which decouples the powerful multi-modal generalization capabilities of image-text models from the basic geometric reasoning used for grasp planning.

You can further stratify the tech stack for RL into "simulated toy environments", "simulated robots" and "real robots", in increasing order of evaluation difficulty.

On the bottom layer of the pyramid, you have the general perception benchmarks that are like Kaggle competitions and super easy for the Internet community to iterate on. Moving up the stack, you have a set of "toy control problems" which study the problem in a "bare metal" way, with only the simulator and the neural net running, and all the code related to real world robotics like battery management and real-world resets are non-existent. As you go up the pyramid, it becomes more domain-specific and more relevant to the problem you're trying to solve. For instance, the "simulated robot" and "real robot" stack might be for the same task and re-use the same underlying robot code. Simulated toy environments can be used to study general algorithms, but may have less domain overlap with the end robotic use case. On top of the "evaluation pyramid", you have the actual robotic task you are trying to solve. Iterating on this directly is very slow so you want to spend as little time here as possible. You'd hope that the foundation models you train and evaluate on the lower layers help inform you what ideas work without having to do every single evaluation at the top layer.

Again, the field already operates in this decoupled way. Most people who are interested in contributing to robotics don't necessarily move robots; they might train vision representations and architectures that might eventually be useful for a robot. Of course, the downside to de-coupling is that improvements in perceptual benchmarks do not always map to improvements in robotic capability. For example, if you're improving mAP metric on semantic segmentation or video classification accuracy, or even lossless compression benchmark - which in theory should contribute something eventually - you won't know how improvements in representation objectives actually map to improvements in the downstream task. You have to eventually test on the end-to-end system to see where the real bottlenecks are.

There's a cool paper I like from Google called "Challenging Common Assumptions in Unsupervised Learning of Disentangled Representations", where they demonstrate that many completely unsupervised representation learning methods don't confer significant performance improvements in downstream tasks, unless you are performing evaluation and model selection with the final downstream criteria you care about.

Another way to reduce the cost of evaluation is to make sure your data collection and evaluation processes are one and the same. In BC-Z we had people collecting both autonomous policy evaluation data and expert teleoperation data at the same time. If you're doing shared autonomy, you can use interventions to collect HG-dagger data to gather interventions for the policy, which gives you useful training data. At the same time, the average number of interventions you do per episode tells you roughly how good the policy is. Another thing you can do is look at scalar metrics instead of binomial ones, as those yield more bits of information per episode than a single success/failure bit.

Autonomous data collection with RL learning algorithms is another natural way to merge evaluation and data collection, but it does require you to either use human raters for episodes or to engineer well-designed reward functions. All of these approaches will require a large fleet of robots deployed in real world settings, so this still doesn't get around the pain of iterating in the real world.

An algorithmic approach to evaluating faster is to improve sim-to-real transfer. If you can simulate a lot of robots in parallel, then you're no longer constrained. In work led by Mohi Khansari, Daniel Ho, and Yuqing Du, we developed this technique called "Task Consistency Loss" where we regularize the representations from sim and real to be invariant, so that policies should behave similarly under sim and real. When you transfer a policy evaluated in simulation to real, you want to ensure that the higher performance in sim indeed corresponds to higher performance in real. The less the sim2real gap is, the more you can virtualize eval and trust your simulated experiments.

Expressivity

Let's examine how many bits a modern generative model can output. A 64x64x3 RGB image at 8 bits per channel is 36864 bits. A language model can generate any number of tokens, but if we fix the output window to 2048 tokens, at 17 bits per token that's 36793 bits. So both image and text generative models can synthesize about 37k bits. As your models become super expressive, there is a large qualitative jump in how people perceive these models. Some people are starting to think that Language Models are partially conscious because of how expressive they are.

How expressive are our robotic policies today, by comparison? In BC-Z most of our tasks consisted of about 15-choose-6 objects on the table and the robot had to move one object on top of another or push some object around, for a total of 100 tasks. log2(100 tasks) is about 7 configuration bits, or in other words, "given the state of the world, the robot is able to move the atoms into one of N states, where N can be described in 7 bits". SayCan can do about 550 manipulation tasks with a single neural network, which is pretty impressive by current robotic deep learning standards, but altogether it's just about 10 configuration bits.

It's not a perfect apples-to-apples comparison because the definition of information is different between the two, but it's rather just to provide a rough intuition of what matters when humans size up the relative complexity of one set of tasks vs. another.

One of the challenges here is that our robotic affordances are not good enough. If you look at the Ego4D dataset, a lot of tasks here require bimanual manipulation, but most of our robots today we're still using mobile manipulators with wheeled base, one arm. It's a limited affordance where you can't go everywhere and obviously you only have one arm so that excludes a lot of the interesting tasks.

I think expressivity of our robotic learning algorithms are limited by our hardware. That's one of the reasons I joined Halodi Robotics - I want to work on more expressive robotic affordances. Below we have images of the robot opening doors, packing suitcases, zipping them closed, watering plants, and flipping open bottle caps on a water bottle. As your robotic hardware gets closer to the affordances of an actual human, the number of things you can do in a human-centric world go up exponentially.

The last point I'd like to make is that as our robots become more expressive, we are not only going to need Internet-scale training data, but also Internet-scale evaluation. If you look at progress in LLMs, there are now lots of papers that study prompt-tuning and what existing models can and cannot do. There's a collaborative benchmark called BigBench that compiles a bunch of tasks and asks what we can interrogate from these models. OpenAI evaluates their DALLE-2 GPT-3 models in the wild with Internet users. Their engineering and product teams can learn from users experimenting with their AI systems in the wild, as it's too hard for any one researcher to grasp even the input surface area of the models.

My question for the audience is, what is the robotics equivalent of a GPT-3 or DALLE-2 API, in which the broader Internet community can interrogate a robotic policy and understand what it can do?

To conclude, here's a table that summarizes the comparison between optimization, evaluation, and expressivity:

	Generative Modeling	Robotics
Optimization and Generalization: can you compress the test set efficiently?	Model has complete control over which pixels it paints	Model samples an action and a stateful black box paints the next two tokens
Evaluation: how quickly can you iterate?	O(25min) to get binomial success rate std < 0.1%	O(months) to get success rate std < 1%
Expressivity: How rich are your outputs, in bits?	O(1000) bits make good use of scale and higher-capacity networks	Task configuration space about 10 bits, dramatically limited by robot morphology

Q/A

What do you think about model-based methods?

Generative models used for visual model-predictive control are nice because they directly reuse a lot of the latest advances from generative modeling of images and video. However, the challenge remains that if you're not evaluating on a robot, it's still hard to know how reduction in bits-per-dim (or ELBO likelihood) for your modeling task translates to actual performance. While model-based learning is super practical for optimizing in a generalization-friendly way, you still need to find a way to connect it to your evaluation.

If model-based learning is practical, why haven't you applied it to BC-Z data?

When setting up a robotic learning codebase, you want to de-risk the infrastructure and make sure the basics like logging are implemented correctly. Often it's a good idea to start with a simple algorithm, even if it's a naive baseline. Imitation learning was the simple algorithm we started with, but obviously once that works then it makes sense to branch out to more complex things like model-based RL.

Is it possible to deploy black-box learned systems (e.g. end-to-end neural network control) in real-world applications, given that they don't have safety guarantees?

This is an open question for the ML field at large, as to what responsible deployment of black box systems should look like. My personal take is that trust comes out of performance. If it performs really well, people will trust it over time. For more short-term pragmatic stuff, what helps a lot is to have shared autonomy and anomaly detection systems that help people catch errors and let people take over. Just as it's hard to define what "out of distribution" means, it's hard to define what "formal safety guarantees" means in unstructured robotic environments. All these are just ill-posed semantics and what we really want is reliability and some degree of internal belief that we know what's going on inside our automated systems.

Do you have a notion of what proxy tasks make sense for robotics?

At the very top of the evaluation pyramid, you have the question "does the robot do the task successfully?" and lower down the evaluation pyramid you have many proxy metrics that you would hope inform you about the performance of the system. One analogy you can draw from the language modeling community is that bits-per-dim is your proxy metric, while a Turing Test is your end-to-end metric. You don't necessarily know how well the Turing Test is going to go, given a reduction in perplexity, but you know that because language models are lossless compressors, that as you approach the limit of human-level compression, it will solve the harder task. So we should be drawing plots of how the end-to-end metric you care about scales with each proxy metric. I suspect a lot of surrogate metrics (e.g. object detection mAP) will probably plateau quickly.

As to a specific proxy metric for robotics, I think simulated task success, and hand-curated "evaluation scenarios" that test for robust policy behavior in critical states are a reasonable start.

All Roads Lead to Rome: The Machine Learning Job Market in 2022

2022-04-25T00:00:00+00:00

I was on the job market recently, open to suggestions on what I should do next. I'm deeply thankful to all those who reached out and pitched their companies, and also to people who shared their wisdom of how they wanted to navigate their own careers.

I'm pleased to say that I've joined Halodi Robotics as their VP of AI, and will be hiring for the Bay Area office. We'll be doing some real robot demos this week near Palo Alto, so please get in touch with me if you're interested in learning about how we plan to create customer value with deep learning on humanoid robots (1 year), and then solve manipulation (5 years), and then solve AGI (20 years).

I suspect that there are many other deep learning researchers in the "2015-2018 vintage" that are contemplating similar career moves, so I'll share what I learned in the last month and how I made my career decision in the hopes that this is useful to them. I think that in the next decade we'll see a lot of software companies increasingly adopt an "AGI strategy" as a means to make their software more adaptive and generally useful.

Options Considered

My only constraints were that I wanted to continue to apply my ML skills at my next job. Here are the table of options I considered. I had chats with directors and founders from each of these companies, but I did not initiate the formal HR interview process with most of them. This is not intended as a flex; just the options I considered and my perceived pros and cons. I'm not like one of those kids that gets into all the Ivy League schools at once and gets to pick whatever they want.

These are subjective opinions: a mere snapshot of what I believe are the strengths and weaknesses of each option in April 2022. In the hype-driven Silicon Valley, the perceived status of a company can go from rags to riches back to rags within a few years, so this table will most certainly age poorly.

Option	Pros	Cons
FAANG+similar	Low 7 figures compensation (staff level), technological lead on compute (~10 yr), unit economics of research is not an issue	Things move slower, less autonomy, OKRs etc.
Start my own company	Maximally large action space, blue check mark on Twitter	I'm more interested in solving AGI than solving customer problems
OpenAI	Technological lead on LLMs (~1 yr) + an interesting new project they are spinning up	Culture and leadership team seem to be already established
Large Language Model Startup	Strong teams, transform computing in <10 years, iterate quickly on LLM products	Competing with FAANG + OpenAI on startup budget; unclear whether LLMs will be defensible technology on their own
Tesla	Tesla Bot, technological lead on data engine (~2 yrs), technological lead on manufacturing (~10yrs)	No waiting in line for coffee
Robotics Startups (including Halodi)	Huge moat if successful, opportunity to lead teams. Halodi has technological edge on hardware (~5yrs)	Robotics research is slow, Robotics startups tend to die
ML + Healthcare Startups	Lots of low-hanging fruit for applying research; meaningfully change the world	Product impact is even slower than robotics due to regulatory capture by hospitals and insurance companies. 10 years before the simplest of ML techniques can be rolled out to people.
Startups working on other Applications of ML	Lots of low-hanging fruit + opportunity to lead teams	I'm more interested in solving AGI than solving customer problems
Crypto + DeFi	Tokenomics is interesting. 60% annual returns at Sharpe 3+ is also interesting.	Not really an AGI problem. Crypto community has weird vibes

Technological Lead Time

The most important deciding factor for me was whether the company has some kind of technological edge years ahead of its competitors. A friend on Google's logging team tells me he's not interested in smaller companies because they are so technologically far behind Google's planetary-scale infra that they haven't even begun to fathom the problems that Google is solving now, much less finish solving the problems that Google already worked on a decade ago.

In the table above I've listed companies that I think have unique technology edges. For instance, OpenAI is absolutely crushing it at recruiting right now because they are ahead in Large Language Model algorithms, probably in the form of trade secrets on model surgery and tuning hyperparameters to make scaling laws work. OpenAI has clearly done well with building their technical lead time, despite FAANG's compute superiority.

Meanwhile, the average machine learning researcher at FAANG has a 15 year lead-time in raw compute compared to a PhD student, and Google and DeepMind have language models that are probably stronger than GPT-3 on most metrics. There are cases where technological lead on compute is not enough; some researchers left Google because they were unhappy with all the red tape they had to go through to try to launch LLM-based products externally.

I seriously considered pivoting my career to work on generative models (i.e. LLMs, Multimodal Foundation Models), because (1) robotics is hard (2) the most impressive case studies in ML generalization always seem to be in generative modeling. Again, think to technological lead times - why would any machine learning researcher want to work at something that isn't at the forefront of generalization capability? However, the pure-generative modeling space feels a bit competitive, with everyone fighting to own the same product and research ideas. The field would probably evolve in the same way with or without me.

Having futuristic technology is important for recruiting engineers because many of them don't want to waste years of their life building a capability that someone else already has. To use analogies from other fields of science, it would be like a neuroscience lab trying to recruit PhD students to study monkey brains with patch-clamp experiments when the lab next door is using optogenetic techniques and Neurallink robots. You could reinvent these yourself if you're talented, but is it worth spending precious years of your life on that?

Of course, companies are not the same thing as research labs. What matters more in the long run is the product-market fit and the team's ability to build future technological edge. Incumbents can get bloated and veer off course, while upstarts can exploit a different edge or taking the design in a unique direction. Lots of unicorn companies were not first-movers.

Why not start your own company?

Being a Bay Area native, I thought my next job would be to start my own company around MLOps. I wanted to build a really futuristic data management and labeling system that could be used for AGI + Active Learning. Three things changed my mind:

First, I talked to a bunch of customers to understand their ML and data management needs to see if there was product-market fit with what I was building. Many of their actual problems weren't at the cutting edge of technology, and I simply couldn't get excited about problems like building simulators for marketing campaigns or making better pose estimators for pick-and-place in factories or ranking content in user feeds. The vast majority of businesses solve boring-but-important problems. I want my life's work to be about creating much bigger technological leaps for humanity.

Secondly, I think it's rare for CEOs to contribute anything technically impressive after their company crosses a $100M valuation. If they do their job well, they invariably spend the rest of my life dealing with coordination, product, and company-level problems. They accumulate incredible social access and leverage and might even submit some code from time to time, but their daily schedule is full of so much bullshit that they will never productively tinker again. This happens to senior researchers too. This is profoundly scary to me. From Richard Hamming's You and Your Research: "In the first place if you do some good work you will find yourself on all kinds of committees and unable to do any more work"

Legend has it that Ken Thompson wrote the UNIX operating system when his wife went on a month-long vacation, giving him time to focus on deep work. The Murder of Wilbur writes, How terrifying would it be if that was true? Is it possible that Thompson was burdened by responsibilities his entire life, and then in a brief moment of freedom did some of the most important work anyone has ever done?

Thirdly, Halodi has built pretty awesome technology and they've given me a rare opportunity to live in the future, building on top of something that is 5+ years ahead of its time. I'm very impressed by Bernt's (the CEO) respect for human anatomy: from the intrinsic passive intelligence of overdamped systems that makes us able to grasp without precise planning, to the spring systems in our feet that let us walk across variable terrain while barely expending energy. We both share the belief that rather than humanoid robots being "overkill" for tackling most tasks, it is the only form that can work when you want to design the world around humans rather than machines.

All Roads Lead to Rome

A few months ago I asked Ilya Sutskever whether it made more sense to start a pure-play AGI research lab (like OpenAI, DeepMind) or to build a profitable technology business that, as a side effect, would generate the data moat needed to build an AGI. In his provocative-yet-prescient fashion, Ilya said to me: "All Roads Lead to Rome - Every Successful (Tech) Company will be an AGI company".

This sounds a bit unhinged at first, until you remember that repeatedly improving a product by the same delta involves exponentially harder technology.

In semiconductor manufacturing, shrinking from 32nm to 14nm process nodes is pretty hard, but going from 14nm to 7nm process nodes is insanely hard, requiring you to solve intermediate problems like creating ultrapure water.
Creating a simple Text-to-Speech system for ALS patients was already possible in the 1980s, but improving pronunciation for edge cases and handling inflection naturally took tremendous breakthroughs in deep learning.
A decent character-level language model can be trained on a single computer, but shaving a few bits of entropy off conditional character modeling requires metaphorically lighting datacenters on fire.
Autonomous highway driving is not too hard, but autonomously driving through all residential roads at a L5 level is considered by many to be AGI-complete.

In order to continue adding marginal value to the customer in the coming decades, companies are going to have to get used to solving some really hard problems. Perhaps eventually everyone converges to solving the same hard problem, Artificial General Intelligence (AGI) just so they can make a competitive short video app or To-Do list or grammar checker. We can quibble about what "AGI" means and what time frame it would take for all companies to converge to this, but I suspect that Foundation Models will soon be table stakes for many software products. Russell Kaplan has shared some similar ideas on this as well.

I also wonder if in a few years, expertise on losslessly compressing large amounts of Internet-scale data will cease to become a defensible moat between technologically advanced players (FAANG). It therefore makes sense to look for auxiliary data and business moats to stack onto large-scale ML expertise. There are many roads one can take here to AGI, which I have sketched out below for some large players:

For instance, Alphabet has so much valuable search engine data capturing human thought and curiosity. Meta records a lot of social intelligence data and personality traits. If they so desired, they could harvest Oculus controller interactions to create trajectories of human behavior, then parlay that knowledge into robotics later on. TikTok has recommendation algorithms that probably understand our subconscious selves better than we understand ourselves. Even random-ass companies like Grammarly and Slack and Riot Games have a unique data moats for human intelligence. Each of these companies could use their business data as a wedge to creating general intelligence, by behavior-cloning human thought and desire itself.

The moat I am personally betting on (by joining Halodi) is a "humanoid robot that is 5 years ahead of what anyone else has". If your endgame is to build a Foundation Model that train on embodied real-world data, having a real robot that can visit every state and every affordance a human can visit is a tremendous advantage. Halodi has it already, and Tesla is working on theirs. My main priority at Halodi will be initially to train models to solve specific customer problems in mobile manipulation, but also to set the roadmap for AGI: how compressing large amounts of embodied, first-person data from a human-shaped form can give rise to things like general intelligence, theory of mind, and sense of self.

Embodied AI and robotics research has lost some of its luster in recent years, given that large language models can now explain jokes while robots are still doing pick-and-place with unacceptable success rates. But it might be worth taking a contrarian bet that training on the world of bits is not enough, and that Moravec's Paradox is not a paradox at all, but rather a consequence of us not having solved the "bulk of intelligence".

Reality has a surprising amount of detail, and I believe that embodied humanoids can be used to index that all that untapped detail into data. Just as web crawlers index the world of bits, humanoid robots will index the world of atoms. If embodiment does end up being a bottleneck for Foundation Models to realize their potential, then humanoid robot companies will stand to win everything.

Want Intros to ML Startups?

In the course of talking to many companies and advisors over the last month, I learned that there are so, so many interesting startups tackling hard ML problems. Most of them are applied research labs trying to solve interesting problems, and a few of them have charted their own road to Rome (AGI).

Early in your career it makes a lot of sense to surround yourself with really great mentors and researchers, such as that of an industry research lab. Later on, you might want to bring your experience to a startup to build the next generation of products. If this describes you, I'd be happy to connect you to these opportunities - just shoot me an email with (1) where you are on the pure research vs. applied research spectrum (2) what type of problems you want to work on (Healthcare, Robotics, etc.) (3) the hardest you ever worked on a project, and why you cared about it (4) your resume. If you have the skill set I'm looking for, I may also recruit you to Halodi 😉.

Honest Concerns

I have some genuine concerns with Halodi (and AGI startups in general). History tells us the mortality rate of robotics companies is very high, and I'm not aware of any general-purpose robot company that has ever succeeded. There is a tendency for robotics companies to start off with the mission of general-purpose robots and then rapidly specialize into something boring as the bean counters get impatient. Boston Dynamics, Kindred, Teleexistence - the list goes on and on. As in business and life, the forces of capitalism and evolution conspire to favor specialization of hardware over generalization of intelligence. I pray that does not happen to us.

I'm reminded of Gwern's essay on timing: Launching too early means failure, but being conservative & launching later is just as bad because regardless of forecasting, a good idea will draw overly-optimistic researchers or entrepreneurs to it like ⁠moths to a flame: all get immolated but the one with the dumb luck to kiss the flame at the perfect instant, who then wins everything, at which point everyone can see that the optimal time is past.

But I also remind myself of what Richard Hamming said about Claude Shannon:

"He wants to create a method of coding, but he doesn't know what to do so he makes a random code. Then he is stuck. And then he asks the impossible question, 'What would the average random code do?' He then proves that the average code is arbitrarily good, and that therefore there must be at least one good code. Who but a man of infinite courage could have dared to think those thoughts?"

Life is too short to attempt anything less than that which takes infinite courage. LFG.