
Model-based Offline Reinforcement Learning with Lower Expectile Q-learning

Yonsei University

Abstract


Lower Expectile Q-learning (LEQ)


Lower expectile
  • To mitigate overestimation when estimating the true Q-value from inaccurate world-model rollouts, we propose expectile regression with a small τ for target Q-value estimation.
  • Expectile regression with a small τ selects a target Q-value below the mean, effectively providing a conservative estimate of the target Q-value.
  • Another advantage of expectile regression is that we do not need to exhaustively evaluate Q-values to obtain the τ-expectile; a conservative estimate can be computed from a single trajectory.
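The asymmetric loss behind the bullets above can be sketched in a few lines. This is a minimal NumPy illustration of generic expectile regression, not the paper's implementation; the function name and signature are hypothetical.

```python
import numpy as np

def expectile_loss(pred, target, tau=0.1):
    # Asymmetric squared error: residuals where target > pred are
    # weighted by tau, residuals where target < pred by (1 - tau).
    # With a small tau, overestimation (pred above target) is
    # penalized heavily, so the fit converges to a lower expectile
    # of the target distribution -- a conservative estimate.
    diff = target - pred
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)
```

Note that the loss is computed per sample, so a conservative estimate is obtained from whatever targets are available, e.g., a single model rollout.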


λ-returns
  • To further improve LEQ on long-horizon tasks, we use the λ-return instead of the 1-step return for Q-learning.
  • The λ-return allows the Q-function and policy to learn from low-bias multi-step returns.
  • Beyond learning the critic, we propose to directly maximize the λ-return for learning the policy as well.
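The λ-return described above is the standard exponentially weighted mixture of n-step returns, computed by a backward recursion over a rollout. A minimal NumPy sketch, with hypothetical function name and argument conventions (values holds one bootstrap value per state, including the final state):

```python
import numpy as np

def lambda_returns(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: length T; values: length T + 1 (critic
    # estimates for s_0 .. s_T, so values[-1] bootstraps the tail).
    # The recursion mixes the 1-step bootstrap values[t + 1] with
    # the longer-horizon return next_ret, weighted by lam.
    T = len(rewards)
    returns = np.zeros(T)
    next_ret = values[-1]
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        bootstrap = (1.0 - lam) * values[t + 1] + lam * next_ret
        returns[t] = rewards[t] + gamma * nonterminal * bootstrap
        next_ret = returns[t]
    return returns
```

With lam=0 this reduces to the 1-step TD target, and with lam=1 to the full Monte-Carlo return bootstrapped at the horizon, so λ trades bias against variance along multi-step model rollouts.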


Utilizing transitions of the dataset
  • We found that β, the weight balancing the loss computed on imagined trajectories against the loss on dataset transitions, is crucial for achieving non-zero scores.
  • With β = 0.95, the default used by MOBILE, the performance drops to zero.
  • This suggests that utilizing true transitions from the dataset, which was undervalued in prior work, is important in long-horizon tasks.
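The role of β above can be stated as a simple convex combination of the two loss terms. This is an illustrative sketch of one plausible weighting scheme, not the paper's exact objective; the function name and the specific form are assumptions.

```python
def mixed_critic_loss(model_loss, data_loss, beta=0.5):
    # beta weighs the critic loss on model-generated (imagined)
    # rollouts against the loss on real dataset transitions.
    # beta = 0.95 (the MOBILE default) nearly ignores the dataset
    # term, which the ablation above links to zero scores on
    # long-horizon tasks.
    return beta * model_loss + (1.0 - beta) * data_loss
```

Lowering β shifts gradient signal toward ground-truth transitions, guarding the critic against compounding model error over long horizons.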


Experiments



Antmaze
  • LEQ significantly outperforms the prior model-based approaches for all 8 datasets.
  • LEQ even performs better than model-free baselines in larger mazes.
Figure: Antmaze results on (a) umaze, (b) medium, (c) large, and (d) ultra mazes.

Locomotions
  • On the D4RL MuJoCo Gym tasks, LEQ achieves results comparable to the best scores of prior work on 6 out of 12 tasks.
  • These experiments show that LEQ serves as a general offline RL algorithm, not limited to long-horizon tasks.

Visual Control
  • On the V-D4RL datasets, LEQ is on par with state-of-the-art methods.
  • These experiments show the scalability of LEQ to visual observations.

BibTeX

@inproceedings{park2025leq,
  title={Model-based Offline Reinforcement Learning with Lower Expectile Q-learning},
  author={Kwanyoung Park and Youngwoon Lee},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
}