Abstract
We study goal-conditioned RL through the lens of generalization, but not in the traditional sense of random augmentations and domain randomization. Rather, we aim to learn goal-directed policies that generalize with respect to the horizon: after training to reach nearby goals (which are easy to learn), these policies should succeed in reaching distant goals (which are quite challenging to learn). In the same way that invariance is closely linked with generalization in other areas of machine learning (e.g., normalization layers make a network invariant to scale, and therefore generalize to inputs of varying scales), we show that this notion of horizon generalization is closely linked with invariance to planning: a policy navigating towards a goal will select the same actions as if it were navigating to a waypoint en route to that goal. Thus, a policy trained to reach nearby goals should succeed at reaching arbitrarily distant goals. Our theoretical analysis proves that both horizon generalization and planning invariance are possible, under some assumptions. We present new experimental results, and recall findings from prior work, in support of our theoretical results. Taken together, our results open the door to studying how techniques for invariance and generalization developed in other areas of machine learning might be adapted to achieve this alluring property.
Planning Invariance
A key mathematical tool for understanding horizon generalization is a form of temporal invariance obeyed by optimal policies. In the same way that an image classification model that is invariant to rotations will generalize to images of different orientations, we prove that a policy invariant to planning, under certain assumptions, will exhibit horizon generalization.
Informally, a goal-conditioned policy is invariant to planning if it reaches distant goals roughly as well when conditioned directly on the final goal as when conditioned on a series of intermediate waypoints en route to that goal. In other words, breaking a complex task into a series of simpler tasks confers no advantage to the policy.
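As a concrete illustration, the following minimal Python sketch shows one way to probe planning invariance empirically, by comparing success rates with and without waypoint conditioning. The environment interface (env.reset, env.step), the policy(obs, goal) function, and the plan_waypoints helper are hypothetical placeholders for this sketch, not references to any actual codebase.

import numpy as np

def rollout(env, policy, goal, waypoints=None, max_steps=200, tol=0.5):
    """Run one episode, conditioning the policy either directly on the final
    goal or on a sequence of intermediate waypoints that ends at the goal."""
    obs = env.reset()
    targets = (list(waypoints) + [goal]) if waypoints else [goal]
    idx = 0
    for _ in range(max_steps):
        action = policy(obs, targets[idx])
        obs, _, done, _ = env.step(action)
        if np.linalg.norm(obs - targets[idx]) < tol:  # reached current target
            if idx == len(targets) - 1:
                return True                           # reached the final goal
            idx += 1                                  # move on to the next waypoint
        if done:
            break
    return False

def planning_invariance_gap(env, policy, plan_waypoints, goals):
    """Success rate with waypoints minus success rate without them.
    A planning-invariant policy should have a gap close to zero."""
    direct  = np.mean([rollout(env, policy, g) for g in goals])
    planned = np.mean([rollout(env, policy, g, waypoints=plan_waypoints(env, g))
                       for g in goals])
    return planned - direct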
In fact, this form of invariance is already present for a certain class of policies, namely those whose value function is parameterized by a quasimetric. Empirically, we find that other, weaker value parameterizations also confer varying (lesser) degrees of planning invariance.
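For intuition, here is a minimal sketch of one simple quasimetric construction over learned features (a sum of one-sided differences). It is zero on identical inputs and satisfies the triangle inequality while remaining asymmetric, which is the key property; it is shown purely for illustration and is not necessarily the parameterization used in our experiments.

import numpy as np

def quasimetric_distance(phi_s, phi_g):
    """d(s, g) = sum_i max(0, phi(s)_i - phi(g)_i): zero when phi_s == phi_g,
    satisfies the triangle inequality, and is asymmetric in general."""
    return np.sum(np.maximum(phi_s - phi_g, 0.0), axis=-1)

def goal_conditioned_value(phi, s, g):
    """Value defined as the negative quasimetric distance between learned
    features, so larger values correspond to goals that are closer to reach."""
    return -quasimetric_distance(phi(s), phi(g))

More expressive learned quasimetric architectures follow the same principle: the distance is asymmetric, yet the triangle inequality is enforced by construction rather than learned from data.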
Planning Invariance enables Horizon Generalization
A policy generalizes over the horizon if training it on start-goal pairs $(s,g)$ separated by a small temporal distance $d(s,g) < c$ also yields improved performance on more distant start-goal pairs $(s',g')$ with $d(s',g') > c$.
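Concretely, horizon generalization can be probed with a protocol along the following lines: train only on start-goal pairs whose temporal distance falls below the threshold $c$, then measure success on pairs beyond it. This is a sketch; temporal_distance, train_policy, and success_rate are hypothetical helpers standing in for an actual training and evaluation pipeline.

def horizon_generalization_eval(pairs, c, temporal_distance,
                                train_policy, success_rate):
    """Train only on 'near' start-goal pairs (d < c) and evaluate on 'far'
    pairs (d > c).  High success on the far split, which was never seen
    during training, is evidence of horizon generalization."""
    near = [(s, g) for (s, g) in pairs if temporal_distance(s, g) < c]
    far  = [(s, g) for (s, g) in pairs if temporal_distance(s, g) > c]

    policy = train_policy(near)              # only sees short-horizon tasks
    return {
        "near_success": success_rate(policy, near),
        "far_success":  success_rate(policy, far),
    }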
Horizon generalization is closely related to planning invariance: we show theoretically that planning invariance is a necessary condition for horizon generalization, and empirically we will see that the two properties go hand in hand.
Empirical Evaluation of Horizon Generalization
BibTeX
@misc{myers2025horizon,
  author       = {Myers, Vivek and Ji, Catherine and Eysenbach, Benjamin},
  eprint       = {2501.02709},
  eprinttype   = {arXiv},
  howpublished = {arXiv:2501.02709},
  title        = {{Horizon Generalization} in {Reinforcement Learning}},
  url          = {https://arxiv.org/abs/2501.02709},
  year         = {2025},
}