Do's and Don'ts:
Learning Desirable Skills with Instruction Videos
NeurIPS 2024

Abstract


Unsupervised skill discovery is a learning paradigm that aims to acquire diverse behaviors without explicit rewards. However, it faces challenges in learning complex behaviors and often leads to unsafe or undesirable behaviors. For instance, in various continuous control tasks, current unsupervised skill discovery methods succeed in learning basic locomotion skills like standing but struggle with more complex movements such as walking and running. Moreover, they may acquire unsafe behaviors like tripping and rolling, or navigate to undesirable locations such as pitfalls or hazardous areas. In response, we present DoDont (Do's and Don'ts), an instruction-based skill discovery algorithm composed of two stages. First, in the instruction learning stage, DoDont leverages action-free instruction videos to train an instruction network to distinguish desirable transitions from undesirable ones. Then, in the skill learning stage, the instruction network adjusts the reward function of the skill discovery algorithm to weight desirable behaviors. Specifically, we integrate the instruction network into a distance-maximizing skill discovery algorithm, where the instruction network serves as the distance function. Empirically, with fewer than eight instruction videos, DoDont effectively learns desirable behaviors and avoids undesirable ones across complex continuous control tasks.

Motivation

Is the purely unsupervised assumption of unsupervised RL ideal in the real world?


[Videos: behaviors learned by DIAYN, LSD, RND, and METRA]

Despite notable advancements in unsupervised skill discovery (USD) algorithms, acquiring diverse policies in environments with large state and action spaces remains a significant challenge.

Two major issues arise when training agents with USD in these complex environments.

  • First, since the vast state and action spaces enable the agent to develop a wide variety of behaviors, learning simple behaviors like standing may be feasible (e.g., DIAYN, LSD), but mastering complex behaviors such as walking or running can take an exceedingly long time.
  • Second, the agent can develop undesirable and risky behaviors during training, such as tripping or rolling (e.g., RND, METRA).

Method overview

To address these challenges, we propose DoDont, a skill discovery algorithm that integrates USD objectives with intended behavioral goals. Instead of relying on a hand-designed reward, DoDont learns a reward function from a small set of instruction videos that demonstrate desirable and undesirable behaviors.

DoDont consists of two stages.

Stage 1. Train instruction network with instruction videos
DoDont starts by collecting instruction videos of desirable (Do's) and undesirable (Don'ts) behaviors.
We then train an instruction network that assigns higher values to desirable behaviors and lower values to undesirable ones.
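As a rough illustration, the instruction network can be viewed as a binary classifier over consecutive observation pairs from the videos. The sketch below assumes flat observations and a PyTorch-style setup; the names (`InstructionNet`, `train_step`) and the architecture are ours for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class InstructionNet(nn.Module):
    """Scores a transition (s, s') in [0, 1]: high for Do's, low for Don'ts."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, s_next):
        # Concatenate consecutive observations and map them to a desirability score.
        return torch.sigmoid(self.net(torch.cat([s, s_next], dim=-1)))

def train_step(net, opt, do_s, do_s_next, dont_s, dont_s_next):
    """One BCE step: transitions from Do videos get label 1, Don'ts get label 0."""
    s = torch.cat([do_s, dont_s])
    s_next = torch.cat([do_s_next, dont_s_next])
    labels = torch.cat([torch.ones(len(do_s), 1), torch.zeros(len(dont_s), 1)])
    loss = nn.functional.binary_cross_entropy(net(s, s_next), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```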

Stage 2. Skill learning with instruction network
We integrate the learned instruction network into the online distance-maximizing skill discovery algorithm.
Specifically, the trained instruction network serves as the distance metric in the distance-maximizing skill discovery framework.

Objective function of DoDont

$$\begin{aligned} \text{Maximize} \ \ r(s, z, s') = \hat{p}_{\psi}(s, s')(\phi(s') - \phi(s))^\top z \quad \text{s.t.} \ \ \|\phi(s) - \phi(s')\|_2 \leq 1. \end{aligned}$$

This equation represents the final objective of DoDont. Essentially, it can be interpreted as simply multiplying the instruction network \(\hat{p}_{\psi}(s, s')\) into the original learning objective of METRA. For a detailed derivation, please refer to the main paper.
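A minimal sketch of how this reward could be computed, assuming `phi` is the METRA representation network and `p_psi` is the instruction network from Stage 1 (the function and argument names are illustrative; the constraint \(\|\phi(s) - \phi(s')\|_2 \leq 1\) would be enforced separately, e.g., via the dual-gradient-descent scheme METRA uses):

```python
import torch

def dodont_reward(phi, p_psi, s, s_next, z):
    """Per-transition reward: instruction score times METRA's directional term.

    phi:    representation network mapping observations to latent vectors
    p_psi:  trained instruction network scoring (s, s') in [0, 1]
    z:      skill vector sampled at the start of the episode
    """
    direction = phi(s_next) - phi(s)                        # latent displacement
    metra_term = (direction * z).sum(dim=-1, keepdim=True)  # (phi(s') - phi(s))^T z
    return p_psi(s, s_next) * metra_term                    # upweight desirable moves
```

Because \(\hat{p}_{\psi}(s, s') \in [0, 1]\), undesirable transitions contribute little reward, so the policy is steered toward maximizing latent distance only along desirable behaviors.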

Locomotion results


DoDont (4-D skills, 2 rollouts each for 9 randomly sampled skills)

(Do's: 4 run videos, Don'ts: 4 random action videos)

DoDont (One video) (4-D skills, 2 rollouts each for 9 randomly sampled skills)

(Do's: one run video, Don'ts: one random action video)

METRA (4-D skills, 2 rollouts each for 9 randomly sampled skills)


METRA\(\dagger\) (using task reward) (4-D skills, 2 rollouts each for 9 randomly sampled skills)


SMERL (16-D skills, 2 rollouts each for 9 randomly sampled skills)


DGPO (16-D skills, 2 rollouts each for 9 randomly sampled skills)

Manipulation results

DoDont (24 skills, 2 rollouts each)

(Do's: D4RL kitchen dataset, Don'ts: random action videos)

METRA (24 skills, 2 rollouts each)

METRA\(\dagger\) (using task reward) (24 skills, 2 rollouts each)

SMERL (24 skills, 2 rollouts each)

DGPO (24 skills, 2 rollouts each)


The website template was borrowed from Seohong Park and Jon Barron.