
Model-based Offline Reinforcement Learning with Lower Expectile Q-learning

Yonsei University

Abstract


Lower Expectile Q-learning (LEQ)


Lower expectile
  • To mitigate overestimation when estimating the true Q-value from inaccurate world-model rollouts, we propose expectile regression with a small τ for target Q-value estimation.
  • Expectile regression with a small τ selects a target Q-value below the mean, effectively providing a conservative estimate of the target Q-value.
  • Another advantage of expectile regression is that we do not need to exhaustively evaluate Q-values to obtain the τ-expectile; a conservative estimate can be computed from a single trajectory.
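The asymmetric loss behind the bullets above can be sketched in a few lines. This is a minimal NumPy illustration of generic expectile regression, not the paper's implementation; the function name and signature are hypothetical.

```python
import numpy as np

def expectile_loss(pred, target, tau=0.1):
    # Asymmetric squared error: residuals where target > pred are
    # weighted by tau, residuals where target < pred by (1 - tau).
    # With a small tau, overestimation (pred above target) is
    # penalized heavily, so the fit converges to a lower expectile
    # of the target distribution -- a conservative estimate.
    diff = target - pred
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)
```

Note that the loss is computed per sample, so a conservative estimate is obtained from whatever targets are available, e.g., a single model rollout.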


λ-returns
  • To further improve LEQ on long-horizon tasks, we use the λ-return instead of the 1-step return for Q-learning.
  • The λ-return allows the Q-function and policy to learn from low-bias multi-step returns.
  • Beyond learning the critic, we propose to directly maximize the λ-return for learning the policy as well.
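The λ-return described above is the standard exponentially weighted mixture of n-step returns, computed by a backward recursion over a rollout. A minimal NumPy sketch, with hypothetical function name and argument conventions (values holds one bootstrap value per state, including the final state):

```python
import numpy as np

def lambda_returns(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: length T; values: length T + 1 (critic
    # estimates for s_0 .. s_T, so values[-1] bootstraps the tail).
    # The recursion mixes the 1-step bootstrap values[t + 1] with
    # the longer-horizon return next_ret, weighted by lam.
    T = len(rewards)
    returns = np.zeros(T)
    next_ret = values[-1]
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        bootstrap = (1.0 - lam) * values[t + 1] + lam * next_ret
        returns[t] = rewards[t] + gamma * nonterminal * bootstrap
        next_ret = returns[t]
    return returns
```

With lam=0 this reduces to the 1-step TD target, and with lam=1 to the full Monte-Carlo return bootstrapped at the horizon, so λ trades bias against variance along multi-step model rollouts.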


Utilizing transitions of the dataset
  • We found that β, the weight balancing the loss computed on imagined trajectories against the loss on dataset transitions, is crucial for achieving non-zero scores.
  • With β = 0.95, the default used by MOBILE, the performance drops to zero.
  • This suggests that utilizing true transitions from the dataset, which was undervalued in prior work, is important in long-horizon tasks.
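The role of β above can be stated as a simple convex combination of the two loss terms. This is an illustrative sketch of one plausible weighting scheme, not the paper's exact objective; the function name and the specific form are assumptions.

```python
def mixed_critic_loss(model_loss, data_loss, beta=0.5):
    # beta weighs the critic loss on model-generated (imagined)
    # rollouts against the loss on real dataset transitions.
    # beta = 0.95 (the MOBILE default) nearly ignores the dataset
    # term, which the ablation above links to zero scores on
    # long-horizon tasks.
    return beta * model_loss + (1.0 - beta) * data_loss
```

Lowering β shifts gradient signal toward ground-truth transitions, guarding the critic against compounding model error over long horizons.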


Experiments



Antmaze
  • LEQ significantly outperforms the prior model-based approaches for all 8 datasets.
  • LEQ even performs better than model-free baselines in larger mazes.
Figure: Antmaze results on (a) umaze, (b) medium, (c) large, and (d) ultra mazes.

Locomotions
  • On the D4RL MuJoCo Gym tasks, LEQ achieves results comparable to the best scores of prior work on 6 out of 12 tasks.
  • These experiments show that LEQ serves as a general offline RL algorithm, not limited to long-horizon tasks.

Visual Control
  • On the V-D4RL datasets, LEQ is on par with state-of-the-art methods.
  • These experiments show the scalability of LEQ to visual observations.

BibTeX

@inproceedings{park2025leq,
  title={Model-based Offline Reinforcement Learning with Lower Expectile Q-learning},
  author={Kwanyoung Park and Youngwoon Lee},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
}