
Wenlong Huang


I am a Ph.D. candidate in Computer Science at Stanford University, working on AI and robotics. I am advised by Fei-Fei Li as part of the Stanford Vision and Learning Lab (SVL). I'm also working closely with Leslie Kaelbling and Tomás Lozano-Pérez.

I have spent time at MIT CSAIL (2025), NVIDIA Robotics (2024), and Google DeepMind Robotics (2022). I received my undergraduate degree from UC Berkeley (2018 - 2021), advised by Deepak Pathak, Igor Mordatch, and Pieter Abbeel. I also worked with Zhuowen Tu at UC San Diego as a summer intern (2018).

Research Highlights
3D World Model for Robotics [Mar 2026]: Watch on YouTube
Task Representations with Foundation Models [May 2025]: Watch on YouTube

My research focuses on robot learning, with the aim of developing algorithms that leverage scalable data sources for broad generalization across tasks, environments, and embodiments. Towards this goal, I am currently interested in:

  • Spatial Intelligence: Modeling the interactive, spatial, and counterfactual world - and the behaviors within it - for robotic manipulation.
  • Foundation Models for Robotics: Leveraging rich priors from foundation models and Internet-scale data for broad generalization.
Research
PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo*, Li Fei-Fei* (*Equal Advising)
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
CVPR Highlight
Project Page / Paper / Code / Video / Summary

A large 3D world model, pre-trained on 500 hours of in-the-wild 3D interactions, that predicts environment dynamics from RGB-D capture(s) and robot actions, using a unified state-action representation of 3D point flows.

Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow
Karthik Dharmarajan, Wenlong Huang*, Jiajun Wu, Li Fei-Fei†, Ruohan Zhang† (*Corresponding Author, †Equal Advising)
International Conference on Robotics and Automation (ICRA), 2026
Project Page / Paper / Code / Summary

3D object flows extracted from text-to-video models provide zero-shot guidance for open-world manipulation, serving as tracking goals for downstream model-based and model-free policies.

Learning Composable Skills by Discovering Spatial and Temporal Structure with Foundation Models
Neil Nie, Wenlong Huang, Jiayuan Mao, Li Fei-Fei, Weiyu Liu, Jiajun Wu
International Conference on Robotics and Automation (ICRA), 2026
Project Page / Paper

Skill segments and 3D entities can be extracted from unstructured demonstrations to learn composable skills that generalize to longer-horizon test settings and novel geometric constraints.

ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
Qineng Wang*, Wenlong Huang*, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang†, Jiajun Wu†, Li Fei-Fei†, Manling Li† (*Equal Contribution, †Equal Advising)
International Conference on Learning Representations (ICLR), 2026
Oral Presentation at the ICLR 2026 Workshop on World Models
Project Page / Paper / Code / Dataset / Summary

A benchmark that evaluates the embodied cognition of VLMs on long-horizon, spatial, and physical reasoning through egocentric world modeling of mobile manipulation.

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
Wenlong Huang, Chen Wang*, Yunzhu Li*, Ruohan Zhang, Li Fei-Fei (*Equal Contribution)
The Conference on Robot Learning (CoRL), 2024
Best Paper Award at the 2024 CoRL LEAP Workshop
Project Page / Paper / Code / Video / Summary

Large vision models and vision-language models can generate keypoint-based constraints, which can be optimized to achieve multi-stage, in-the-wild, bimanual, and reactive behaviors, without task-specific training or environment models.

UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation
Yihe Tang, Wenlong Huang*, Yingke Wang, Chengshu Li, Roy Yuan, Ruohan Zhang, Jiajun Wu, Li Fei-Fei (*Corresponding Author)
International Conference on Robotics and Automation (ICRA), 2025
Best Paper Award Finalist
Best Paper Award on Robot Perception Finalist
Project Page / Paper / Code / Summary

Fine-grained, task-conditioned visual affordances can be distilled from off-the-shelf foundation models, enabling diverse generalization properties in downstream policy learning.

A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards
Shivansh Patel*, Xinchen Yin*, Wenlong Huang, Shubham Garg, Hooshang Nayyeri, Li Fei-Fei, Svetlana Lazebnik, Yunzhu Li (*Equal Contribution)
International Conference on Robotics and Automation (ICRA), 2025
Project Page / Paper / Code / Video / Summary

Vision-language models can serve as human proxies to specify diverse task objectives by writing keypoint-based reward functions, which can be autonomously and iteratively refined based on environment feedback.

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, Li Fei-Fei
The Conference on Robot Learning (CoRL), 2023
Oral Presentation
Project Page / Paper / Code / Video / Summary

Large language models and vision-language models can directly label affordances and constraints in the 3D perceptual space. Combined with motion planning, this enables robots to perform diverse everyday manipulation tasks in a zero-shot manner.

PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence
International Conference on Machine Learning (ICML), 2023
Project Page / Paper / Google AI Blog / Summary

Language models can digest real-world sensor modalities (e.g., images), allowing them to be embodied in the physical world. The largest 562B model is a generalist agent across language, vision, and task planning.

Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control
Wenlong Huang, Fei Xia, Dhruv Shah, Danny Driess, Andy Zeng, Yao Lu, Pete Florence, Igor Mordatch, Sergey Levine, Karol Hausman, Brian Ichter
Conference on Neural Information Processing Systems (NeurIPS), 2023
Project Page / Paper / Video / Summary

Large language models can be grounded in embodied environments by using continuous probabilities to guide their token decoding, where the guidance is provided by a set of grounded models, such as affordance, safety, and preference functions.

Code as Policies: Language Model Programs for Embodied Control
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, Andy Zeng
International Conference on Robotics and Automation (ICRA), 2023
Outstanding Robot Learning Paper Award
Project Page / Paper / Code / Video / Google AI Blog / TechCrunch / Summary

Using hierarchical code generation, language models can write robot policy code from abstract natural language instructions.

Inner Monologue: Embodied Reasoning through Planning with Language Models
Wenlong Huang*, Fei Xia*, Ted Xiao*, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, Brian Ichter (*Equal Contribution)
The Conference on Robot Learning (CoRL), 2022
Project Page / Paper / Video / Two Minute Papers / Summary

Provided with textual embodied feedback, language models can articulate a grounded "thought process" for challenging long-horizon tasks, even under disturbances.

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
Wenlong Huang, Pieter Abbeel, Deepak Pathak*, Igor Mordatch* (*Equal Advising)
International Conference on Machine Learning (ICML), 2022
Project Page / Paper / Code / Video / Summary

Large language models (e.g., GPT-3, Codex) contain rich actionable knowledge that can be used to perform task planning for embodied agents, even without additional training.

Generalization in Dexterous Manipulation via Geometry-Aware Multi-Task Learning
Wenlong Huang, Igor Mordatch, Pieter Abbeel, Deepak Pathak
arXiv, 2021
Project Page / Paper / Code / Summary

With appropriate object representation, a multi-task RL policy can control an anthropomorphic hand to manipulate 100+ diverse objects and achieve SOTA performance on unseen ones.

One Policy to Control Them All: Shared Modular Policies for Agent-Agnostic Control
Wenlong Huang, Igor Mordatch, Deepak Pathak
International Conference on Machine Learning (ICML), 2020
Oral Presentation
Project Page / Paper / Code / Video / Oral Talk / Summary

Expressing robots as collections of modular components that share a control policy can lead to zero-shot generalization across diverse unseen robot morphologies.

3D Volumetric Modeling with Introspective Neural Networks
Wenlong Huang*, Brian Lai*, Weijian Xu, Zhuowen Tu (*Equal Contribution)
Association for the Advancement of Artificial Intelligence (AAAI), 2019
Paper

Built upon the Generative-via-Discriminative Learning and Introspective Learning frameworks, a single neural network can simultaneously perform classification and generation of 3D volumetric shapes.

Selected Honors
2025: Best Paper Award Finalist at ICRA 2025
2025: Best Paper Award on Robot Perception Finalist at ICRA 2025
2025: Finalist for NVIDIA Graduate Fellowship
2025: Finalist for Citadel GQS Fellowship (three in CS/EE)
2024: Best Paper Award at LEAP Workshop at CoRL 2024
2023: Outstanding Robot Learning Paper Award Winner at ICRA 2023
2022: Stanford School of Engineering Fellowship
Professional Service


Conference Reviewer: CoRL, RSS, ICRA, IROS, NeurIPS, ICML, ICLR, CVPR, ICCV, and ECCV

Journal Reviewer: Science Robotics, IJRR, IEEE T-RO, IEEE RA-L, Nature Communications, and IEEE T-AI

Seminar Organization: Stanford Vision Seminar

Teaching
Spring 2026: Teaching Assistant, CS231n: Deep Learning for Computer Vision at Stanford University
Winter 2026: Guest Lecturer, CSE 571: AI-Robotics at University of Washington
Spring 2025: Teaching Assistant, CS231n: Deep Learning for Computer Vision at Stanford University
Spring 2024: Teaching Assistant, CS231n: Deep Learning for Computer Vision at Stanford University
Mentoring

Karthik Dharmarajan: Stanford, M.S. in CS → UC Berkeley, Ph.D. student

Yingke Wang: Stanford, M.S. in CS → Stanford, Ph.D. student

Neil Nie: Stanford, M.S. in CS → UC Berkeley, Ph.D. student (on leave) → Verne Robotics, Cofounder

Yihe Tang: Stanford, M.S. in CS → USC, Ph.D. student

Experience
Sep 2022 - Present: Ph.D. Candidate in Computer Science, Stanford University (Advisor: Fei-Fei Li)
Jul 2025 - Dec 2025: Visiting Researcher, MIT CSAIL
Jun 2024 - Sep 2024: Research Intern, NVIDIA Robotics
Aug 2018 - Dec 2021: B.A. in Computer Science, UC Berkeley
2018: High School Researcher, UC San Diego (Advisor: Zhuowen Tu)

Template from Jon Barron's website