Abstract
We study goal-conditioned RL through the lens of generalization, but not in the traditional sense of random augmentations and domain randomization. Rather, we aim to learn goal-directed policies that generalize with respect to the horizon: after training to reach nearby goals (which are easy to learn), these policies should succeed in reaching distant goals (which are quite challenging to learn). In the same way that invariance is closely linked with generalization in other areas of machine learning (e.g., normalization layers make a network invariant to scale, so it generalizes to inputs of varying scales), we show that this notion of horizon generalization is closely linked with invariance to planning: a policy navigating towards a goal will select the same actions as if it were navigating to a waypoint en route to that goal. Thus, such a policy trained to reach nearby goals should succeed at reaching arbitrarily-distant goals. Our theoretical analysis proves that both horizon generalization and planning invariance are possible, under some assumptions. We present new experimental results and recall findings from prior work in support of our theoretical results. Taken together, our results open the door to studying how techniques for invariance and generalization developed in other areas of machine learning might be adapted to achieve this alluring property.
Planning Invariance
A key mathematical tool for understanding horizon generalization is a form of temporal invariance obeyed by optimal policies. In the same way that an image classification model that is invariant to rotations will generalize to images of different orientations, we prove that a policy invariant to planning, under certain assumptions, will exhibit horizon generalization.
Informally, a goal-conditioned policy is invariant to planning if it can reach distant goals with similar success when conditioned directly on the goal compared to when conditioned on a series of intermediate waypoints. In other words, breaking up a complex task into a series of simpler tasks confers no advantage to the policy.
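As a concrete (if simplified) probe of this property, the sketch below queries a goal-conditioned policy from the same state twice, once conditioned on the final goal and once on a waypoint en route, and measures how often the chosen actions agree. The `policy` and `sample_waypoint` functions are hypothetical placeholders, not part of any specific implementation.

```python
import numpy as np

def action_agreement(policy, sample_waypoint, states, goals, atol=1e-2):
    """Fraction of states where the policy picks (nearly) the same action
    when conditioned on the final goal vs. on an intermediate waypoint.

    policy(s, g) -> action        # hypothetical goal-conditioned policy
    sample_waypoint(s, g) -> w    # hypothetical waypoint en route from s to g
    """
    agree = []
    for s, g in zip(states, goals):
        w = sample_waypoint(s, g)     # waypoint on the path from s to g
        a_goal = policy(s, g)         # action when aiming directly at the goal
        a_way = policy(s, w)          # action when aiming at the waypoint
        agree.append(np.allclose(a_goal, a_way, atol=atol))
    return float(np.mean(agree))
```

A perfectly planning-invariant policy would score 1.0 here; equivalently, handing it waypoints from an external planner would not improve its success rate.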
In fact, this form of invariance is already present for a certain class of policies, namely those whose value function is parameterized by a quasimetric. Empirically, we find that other, weaker value parameterizations also confer varying (lesser) degrees of planning invariance.
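For illustration, below is a minimal sketch of one simple quasimetric parameterization (an illustrative stand-in, not necessarily the architecture used in our experiments): embed states and goals with a shared encoder and take a "max of ReLU differences" across embedding coordinates. This form satisfies $d(x, x) = 0$ and the triangle inequality but not symmetry, which is exactly what makes it a quasimetric.

```python
import numpy as np

def quasimetric_distance(phi_s: np.ndarray, phi_g: np.ndarray) -> float:
    """Asymmetric 'max of ReLU differences' distance over embeddings:
        d(s, g) = max_i  max(phi(g)_i - phi(s)_i, 0)
    Satisfies d(x, x) = 0 and the triangle inequality, but d(s, g) != d(g, s)
    in general -- a quasimetric rather than a metric.
    """
    return float(np.max(np.maximum(phi_g - phi_s, 0.0)))

# Hypothetical usage with a random linear "encoder" standing in for a learned one.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
phi = lambda x: x @ W
s, g = rng.normal(size=4), rng.normal(size=4)
print(quasimetric_distance(phi(s), phi(g)))  # d(s, g)
print(quasimetric_distance(phi(g), phi(s)))  # d(g, s), generally different
```

The asymmetry matters because temporal distances in an MDP are generally directional: it may be easy to go from $s$ to $g$ but hard to return.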
Figure. Planning invariance means that a policy should take similar actions when directed towards a goal (purple arrow and purple star) as when directed towards an intermediate waypoint (brown arrow and brown star).
Planning Invariance enables Horizon Generalization
A policy generalizes over the horizon if training on start-goal pairs $(s,g)$ separated by a small temporal distance $d(s,g) < c$ yields improved performance on more distant start-goal pairs $(s',g')$ with $d(s',g') > c$.
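One rough way to operationalize this property (a sketch only, mirroring the success-rate ratio $\eta$ reported in the figures below; the exact evaluation protocol may differ) is to train only on pairs with $d(s,g) < c$ and then compare success rates on near versus far evaluation pairs. Here `rollout_success` and the distance estimate `d` are hypothetical placeholders.

```python
import numpy as np

def horizon_generalization_ratio(rollout_success, d, eval_pairs, c):
    """Crude horizon-generalization score for a policy trained only on
    start-goal pairs with d(s, g) < c.

    rollout_success(s, g) -> bool   # hypothetical: did the policy reach g from s?
    d(s, g) -> float                # temporal distance estimate
    eval_pairs should contain both near (d < c) and far (d > c) pairs.
    """
    near = [rollout_success(s, g) for s, g in eval_pairs if d(s, g) < c]
    far = [rollout_success(s, g) for s, g in eval_pairs if d(s, g) > c]
    succ_near, succ_far = np.mean(near), np.mean(far)
    return succ_far / max(succ_near, 1e-8)  # ~1: little degradation; ~0: collapse beyond c
```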
This property is closely tied to planning invariance: we show theoretically that planning invariance is a necessary condition for horizon generalization, and we will see empirically that the two tend to go hand in hand.
Figure. Empirically, we compare the degree of planning invariance and horizon generalization ($\eta$ in figure) for different GCRL methods in a maze task. These quantities are generally correlated. The exception is the random policy, which is trivially planning invariant but does not generalize over the horizon—planning invariance is necessary but not sufficient for horizon generalization.
Empirical Evaluation of Horizon Generalization
Figure. We evaluate several RL methods, measuring the horizon generalization of each. These results reveal that (i) some degree of horizon generalization is possible; (ii) the learning algorithm influences the degree of generalization; (iii) the value function architecture influences the degree of generalization; and (iv) no method achieves perfect generalization, suggesting room for improvement in future work.
Figure. (Left) A large Ant maze environment with a winding S-shaped corridor. (Right) A humanoid environment with a complex, high-dimensional observation space. We evaluate horizon generalization, as measured by $\eta$, for a quasimetric architecture (CMD) and a standard architecture (CRL): after training to reach goals within 10m, $\eta$ quantifies the ratio of success rates when evaluating at 5m vs. 10m, 15m vs. 30m, and 25m vs. 50m.
Figure. We evaluate on $(s, g)$ pairs of varying distances, observing that metric regression with a quasimetric exhibits strong horizon generalization.
Figure. In line with our analysis, the policy that has strong horizon generalization is also more invariant to planning: combining that policy with planning does not increase performance.
BibTeX
@inproceedings{myers2025horizon,
author = {Myers, Vivek and Ji, Catherine and Eysenbach, Benjamin},
booktitle = {{International Conference} on {Learning Representations}},
title = {{Horizon Generalization} in {Reinforcement Learning}},
url = {https://arxiv.org/abs/2501.02709},
year = {2025},
}