Eric Wallace


Hello! I am a researcher at OpenAI, where I work to make the next generation of LLMs safer, more robust, and more private. Before this, I did a PhD at UC Berkeley with Dan Klein and Dawn Song.

These days, I co-lead a team named "Alignment Training" that encompasses many research directions in safety, alignment, and capabilities. Feel free to reach out if you are interested in working at OpenAI or looking to disclose vulnerabilities in our models.

Current Research

At OpenAI, I work on a variety of research directions in safety, alignment, and capabilities:

  • Robustness to adversarial examples in the form of jailbreaks and prompt injections.
  • Memorization, unlearning, and synthetic data techniques for protecting privacy and copyright.
  • Distillation, both for creating efficient models and preventing adversarial distillation.
  • Model stealing attacks and other ways of inferring hidden properties of black-box LLMs.
  • Frontier risk evaluations in areas such as biology, including how to elicit harmful capabilities.
  • Open-source LLM safety, including safety training procedures and proper evaluation schemes.
  • Safety and refusal training of our core models, including the algorithms, data, and evaluations.

Much of this research has gone directly into our core models, including the "o-series" models, GPT-5, deep research, ChatGPT agent mode, and GPT-oss. I've also been trying to publish as much as I can, including our work on the instruction hierarchy, the deliberative alignment algorithm, scaling robustness, model stealing, and open-source model safety.

Selected Publications

Here are a few of my representative papers. See my Google Scholar page for a complete list.

  • The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

    Eric Wallace*, Kai Xiao*, Reimar Leike*, Lilian Weng, Johannes Heidecke, Alex Beutel

    arXiv 2024

    TLDR | Twitter | Paper | Citation

  • Stealing Part of a Production Language Model

    Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace*, David Rolnick, Florian Tramèr

    ICML 2024. Best Paper Award

    TLDR | Twitter | Paper | Citation

  • Scalable Extraction of Data from (Production) Language Models

    Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Chris Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee

    arXiv 2023

    TLDR | Twitter | Paper | Citation

  • The False Promise of Imitating Proprietary LLMs

    Arnav Gudibande*, Eric Wallace*, Charlie Snell*, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song

    ICLR 2024

    TLDR | Twitter #1 #2 #3 | Paper | Code | Citation

  • Poisoning Language Models During Instruction Tuning

    Alexander Wan*, Eric Wallace*, Sheng Shen, Dan Klein

    ICML 2023

    TLDR | Twitter | Paper | Code | Poster | Citation

  • Automated Crossword Solving

    Eric Wallace*, Nicholas Tomlin*, Albert Xu*, Kevin Yang*, Eshaan Pathak*, Matt Ginsberg, Dan Klein

    ACL 2022. First Superhuman Crossword AI

    TLDR | Blog | Demo | Twitter | Paper | Code | Slides | Poster | Citation

  • Calibrate Before Use: Improving Few-shot Performance of Language Models

    Tony Zhao*, Eric Wallace*, Shi Feng, Dan Klein, Sameer Singh

    ICML 2021. Oral Presentation, top 3%

    TLDR | Twitter #1 #2 | Paper | Code | Slides | Citation

  • Extracting Training Data From Large Language Models

    Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Colin Raffel

    USENIX Security 2021. PET Award Runner Up

    TLDR | Blog | Twitter #1 #2 | Paper | Code | Citation

  • AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts

    Taylor Shin*, Yasaman Razeghi*, Robert L Logan IV*, Eric Wallace, Sameer Singh

    EMNLP 2020

    TLDR | Twitter | Paper | Code | Citation

  • Universal Adversarial Triggers for Attacking and Analyzing NLP

    Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh

    EMNLP 2019

    TLDR | Video | Blog | Twitter | Paper | Code | Slides | Citation

Teaching & Mentoring

I enjoy teaching and mentoring students, and I was involved with multiple courses at Berkeley.