Horizon Generalization in Reinforcement Learning

Vivek Myers¹, Catherine Ji², Benjamin Eysenbach²
¹UC Berkeley  ²Princeton University  (equal contribution)

Abstract

We study goal-conditioned RL through the lens of generalization, but not in the traditional sense of random augmentations and domain randomization. Rather, we aim to learn goal-directed policies that generalize with respect to the horizon: after training to reach nearby goals (which are easy to learn), these policies should succeed in reaching distant goals (which are quite challenging to learn). In the same way that invariance is closely linked with generalization in other areas of machine learning (e.g., normalization layers make a network invariant to scale, and therefore generalize to inputs of varying scales), we show that this notion of horizon generalization is closely linked with invariance to planning: a policy navigating towards a goal will select the same actions as if it were navigating to a waypoint en route to that goal. Thus, such a policy trained to reach nearby goals should succeed at reaching arbitrarily distant goals. Our theoretical analysis proves that both horizon generalization and planning invariance are possible, under some assumptions. We present new experimental results and recall findings from prior work in support of our theoretical results. Taken together, our results open the door to studying how techniques for invariance and generalization developed in other areas of machine learning might be adapted to achieve this alluring property.

Planning Invariance

A key mathematical tool for understanding horizon generalization is a form of temporal invariance obeyed by optimal policies. In the same way that an image classification model that is invariant to rotations will generalize to images of different orientations, we prove that a policy invariant to planning, under certain assumptions, will exhibit horizon generalization.

Informally, a goal-conditioned policy is invariant to planning if it can reach distant goals with similar success when conditioned directly on the goal compared to when conditioned on a series of intermediate waypoints. In other words, breaking up a complex task into a series of simpler tasks confers no advantage to the policy.

Definition (Planning Invariance). Consider a deterministic MDP with states $\mathcal{S}$, actions $\mathcal{A}$, and goal-conditioned Kronecker delta reward function $r_{g}(s) = \delta_{(s,g)}$. For any goal-conditioned policy $\pi(a \mid s, g)$ with $g \in \mathcal{S}$, we say that $\pi(a \mid s, g)$ is invariant under a planning operator $\textsc{Plan}$ if and only if \[ \pi(a \mid s, g) = \pi(a \mid s, w), \quad \text{where } w = \textsc{Plan}(s, g). \]
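
To make the condition concrete, here is a minimal sketch of how one might test planning invariance numerically, assuming a discrete action space. The functions `policy(s, g)` (returning a vector of action probabilities) and `plan(s, g)` (returning a waypoint) are hypothetical stand-ins for a learned goal-conditioned policy and a planning operator, not components from the paper.

```python
import numpy as np

def is_planning_invariant(policy, plan, state_goal_pairs, atol=1e-6):
    """Check pi(a | s, g) == pi(a | s, w) with w = PLAN(s, g) on a set of pairs.

    `policy(s, g)` and `plan(s, g)` are hypothetical stand-ins for a learned
    goal-conditioned policy and a planning operator.
    """
    for s, g in state_goal_pairs:
        w = plan(s, g)  # waypoint en route to the goal g
        if not np.allclose(policy(s, g), policy(s, w), atol=atol):
            return False  # actions differ when conditioning on the waypoint
    return True
```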

In fact, this form of invariance is already present for a certain class of policies, namely those whose value functions are parameterized by a quasimetric. Empirically, we find that other, weaker value parameterizations also confer varying (lesser) degrees of planning invariance.

Theorem (Quasimetric policies are invariant under $\textsc{Plan}_{d}$). Consider a deterministic MDP with states $\mathcal{S}$, actions $\mathcal{A}$, and goal-conditioned Kronecker delta reward function $r_{g}(s) = \delta_{(s,g)}$, and let $d$ define a quasimetric policy $\pi_{d}(a \mid s, g)$ and the class of quasimetric planners. Then, for every quasimetric planner $\textsc{Plan}_{d}$ in that class, there exists a quasimetric policy $\pi_{d}(a \mid s, g)$ that is planning invariant: \[ \pi_{d}(a \mid s, g) = \pi_{d}(a \mid s, w), \quad \text{where } w = \textsc{Plan}_{d}(s, g). \]
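
As a rough illustration of why such policies exist (a sketch only; the paper's exact construction may differ), the code below defines a policy that acts greedily with respect to a quasimetric $d$ and a planner that picks waypoints minimizing $d(s,w) + d(w,g)$. The functions `d` and `successor` and the enumerated `actions`/`states` are assumed, hypothetical components.

```python
def quasimetric_policy(d, successor, actions, s, g):
    """Greedy policy induced by a quasimetric d: choose the action whose
    successor state is closest to the goal under d. `d(x, y)` and
    `successor(s, a)` are hypothetical stand-ins for a learned quasimetric
    and the deterministic dynamics."""
    return min(actions, key=lambda a: d(successor(s, a), g))


def quasimetric_planner(d, states, s, g):
    """A waypoint selector in the spirit of PLAN_d: pick a waypoint w
    minimizing d(s, w) + d(w, g). When d is a quasimetric, the triangle
    inequality gives d(s, w) + d(w, g) >= d(s, g), with equality for
    waypoints on a shortest path, which is why a d-greedy policy can act
    the same whether conditioned on w or on g."""
    return min(states, key=lambda w: d(s, w) + d(w, g))
```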

Figure. Planning invariance means that a policy should take similar actions when directed towards a goal (purple arrow and purple star) as when directed towards an intermediate waypoint (brown arrow and brown star).

Planning Invariance Enables Horizon Generalization

A policy generalizes over the horizon if strong performance on start-goal pairs $(s,g)$ separated by a small temporal distance $d(s,g) < c$ carries over to more distant start-goal pairs $(s',g')$ with $d(s',g') > c$.

Definition (Horizon Generalization). Suppose $c > 0$ and $d(s, g)$ is a quasimetric over the start-goal space $\mathcal{S} \times \mathcal{S}$. In the single-goal, controlled ("fixed") case, a policy $\pi(a \mid s, g)$ generalizes over the horizon if optimality over the nearby start-goal pairs $\mathcal{B}_{c} = \{(s,g) \in \mathcal{S} \times \mathcal{S} \mid d(s,g) < c\}$ implies optimality over the entire start-goal space $\mathcal{S} \times \mathcal{S}$.
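
One way to turn this definition into a measurable quantity (a rough sketch; the $\eta$ reported in the figures may be computed differently) is to compare success rates on distant versus nearby start-goal pairs. The environment helper `env.rollout` below is hypothetical.

```python
import numpy as np

def success_rate(policy, env, pairs, max_steps=1000):
    """Fraction of (start, goal) pairs the policy reaches within max_steps.
    `env.rollout(policy, s, g, max_steps)` is a hypothetical helper that
    returns True if the goal is reached."""
    return np.mean([env.rollout(policy, s, g, max_steps) for s, g in pairs])

def horizon_generalization_ratio(policy, env, near_pairs, far_pairs):
    """Ratio-style estimate of horizon generalization: success on distant
    pairs relative to success on the nearby pairs seen during training.
    Values near 1 indicate strong horizon generalization."""
    near = success_rate(policy, env, near_pairs)
    far = success_rate(policy, env, far_pairs)
    return far / max(near, 1e-8)
```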

This property is closely related to planning invariance: we show theoretically that planning invariance is a necessary condition for horizon generalization, and empirically that the two are closely linked.

Definition (Path Relaxation Operator). Let $\textsc{Path}_{d}$ be the path relaxation operator induced by a quasimetric $d(s,g)$. For any pair of states $(s,g) \in \mathcal{S} \times \mathcal{S}$, it relaxes the distance through an intermediate waypoint $w \in \mathcal{S}$: \[ d(s,g) \leftarrow \textsc{Path}_{d}(s,g) \triangleq \min_{w \in \mathcal{S}} \bigl[ d(s, w) + d(w, g) \bigr]. \]

Theorem (Horizon generalization exists). Consider a deterministic goal-conditioned MDP with states $\mathcal{S}$, actions $\mathcal{A}$, and goal-conditioned Kronecker delta reward function $r_{g}(s) = \delta_{(s,g)}$, with no states outside of $\mathcal{S}$. Let a finite threshold $c > 0$ and a quasimetric $d(s,g)$ over the start-goal space $\mathcal{S} \times \mathcal{S}$ be given. Then a quasimetric policy $\pi_{d}(a \mid s, g)$ that is optimal over $\mathcal{B}_{c} = \{(s,g) \in \mathcal{S} \times \mathcal{S} \mid d(s,g) < c\}$ is optimal over the entire start-goal space $\mathcal{S} \times \mathcal{S}$.
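
The sketch below illustrates the path relaxation operator on a tabular distance matrix and checks the fixed-point property that underlies this result: once $d$ satisfies the triangle inequality, $\textsc{Path}_{d}$ leaves it unchanged, so waypoint planning confers no further advantage to a policy defined through $d$. This is an illustrative NumPy example, not the paper's implementation.

```python
import numpy as np

def path_relaxation(d):
    """One application of the path relaxation operator to a tabular distance
    matrix d[s, g]: returns min_w d[s, w] + d[w, g] for every (s, g) pair
    (the waypoint step used in Floyd-Warshall-style shortest-path updates)."""
    return (d[:, :, None] + d[None, :, :]).min(axis=1)

# Toy check of the fixed-point property: iterate the relaxation until d is a
# shortest-path quasimetric, then verify that further planning changes nothing.
rng = np.random.default_rng(0)
n = 5
d = rng.uniform(1.0, 2.0, size=(n, n))   # random asymmetric edge costs
np.fill_diagonal(d, 0.0)
for _ in range(n):                        # enough iterations to converge
    d = path_relaxation(d)
assert np.allclose(path_relaxation(d), d)  # planning confers no further advantage
```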

Figure. Empirically, we compare the degree of planning invariance and horizon generalization ($\eta$ in figure) for different GCRL methods in a maze task. These quantities are generally correlated. The exception is the random policy, which is trivially planning invariant but does not generalize over the horizon—planning invariance is necessary but not sufficient for horizon generalization.

Empirical Evaluation of Horizon Generalization

Figure. We evaluate several RL methods, measuring the horizon generalization of each. These results reveal that (i) some degree of horizon generalization is possible; (ii) the learning algorithm influences the degree of generalization; (iii) the value function architecture influences the degree of generalization; and (iv) no method achieves perfect generalization, suggesting room for improvement in future work.

Figure. (Left) A large Ant maze environment with a winding S-shaped corridor. (Right) A humanoid environment with a complex, high-dimensional observation space. We measure horizon generalization $\eta$ for a quasimetric architecture (CMD) and a standard architecture (CRL), quantifying the ratio of success rates when evaluating at 5m vs 10m, 15m vs 30m, and 25m vs 50m after training to reach goals within 10m.

Figure. We evaluate on $(s, g)$ pairs of varying distances, observing that metric regression with a quasimetric exhibits strong horizon generalization.

Figure. In line with our analysis, the policy that has strong horizon generalization is also more invariant to planning: combining that policy with planning does not increase performance.

BibTeX

@misc{myers2025horizon,
    author       = {Myers, Vivek and Ji, Catherine and Eysenbach, Benjamin},
    eprint       = {2501.02709},
    eprinttype   = {arXiv},
    howpublished = {arXiv:2501.02709},
    title        = {{Horizon Generalization} in {Reinforcement Learning}},
    url          = {https://arxiv.org/abs/2501.02709},
    year         = {2025},
}