
Using Both Demonstrations and Language Instructions to Efficiently Learn Robotic Tasks

ICLR'23
Albert Yu, Ray Mooney
UT Austin
{albertyu, mooney}@utexas.edu

Method Overview

Training Codebase released! (2023 May 16)

Human Language Dataset released! (2023 Apr 28)

Environment Codebase released! (2023 Apr 1)

Datasets released! (2023 Mar 30)

Code release update (2023 Mar 15)

We are in the process of cleaning up our code. We plan to make our environment codebase public by the end of March and our training codebase by the end of April. Both will be linked from this website before the ICLR conference.

Table of contents

  1. Motivation
  2. Contributions

Motivation

Humans often learn new complex tasks by watching a video, which provides both a visual expert demonstration of the task and accompanying language instructions (via audio/speech) that help the learner follow along. Learning complex tasks is much harder when relying on a single modality (only demonstrations or only language instructions). Likewise, it is often harder for a teacher to clearly specify or teach new tasks with only one modality.

Current multitask policies condition on task embeddings derived from one-hot vectors, language embeddings, or demonstration embeddings. However, language instructions and video demonstrations can each be ambiguous, especially when they were provided in environments that do not perfectly align with the environment the robot is evaluated in.

In this work, we show that there exist robotic tasks complex enough that providing both demonstrations and language instructions is beneficial: it is more efficient for the end-user to specify and for the robot to learn from. Providing both language embedding features and visual demonstration features also helps resolve ambiguities and reduces the teacher effort needed to specify new tasks. The two modalities contextually complement each other, enabling the robot to more clearly understand what a new task is and how to perform it.

Contributions

  1. We present DeL-TaCo (Joint Demonstration-Language Task-Conditioning), a framework for conditioning a multitask policy simultaneously on both a demonstration and a corresponding language instruction (a minimal conditioning sketch appears after this list).
  2. We introduce a challenging distribution of hundreds of robotic pick-and-place tasks and show that DeL-TaCo improves generalization and significantly decreases the number of expert demonstrations needed to learn novel tasks at test time.
  3. To our knowledge, this is the first work to show that simultaneously conditioning a multitask robotic manipulation policy on both demonstration and language embeddings improves sample efficiency and generalization over conditioning on either modality alone.
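
To make the joint-conditioning idea concrete, below is a minimal sketch (not the released training code) of a policy that takes a demonstration embedding and a language embedding together as its task conditioning. All module names, embedding dimensions, and the simple concatenation scheme are illustrative assumptions; the actual DeL-TaCo architecture and encoders are described in the paper and released codebase.

```python
# Minimal sketch of joint demonstration-language task conditioning.
# Names and dimensions are assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn

class JointlyConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=64, demo_emb_dim=128, lang_emb_dim=128,
                 hidden_dim=256, action_dim=7):
        super().__init__()
        # The task embedding is the concatenation of the two modality embeddings.
        task_emb_dim = demo_emb_dim + lang_emb_dim
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + task_emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, obs, demo_emb, lang_emb):
        # Every action prediction is conditioned on both modalities at once.
        task_emb = torch.cat([demo_emb, lang_emb], dim=-1)
        return self.policy(torch.cat([obs, task_emb], dim=-1))

# Example usage with random stand-in features for a batch of 8 observations.
policy = JointlyConditionedPolicy()
obs = torch.randn(8, 64)        # observation features
demo_emb = torch.randn(8, 128)  # output of a (hypothetical) demonstration encoder
lang_emb = torch.randn(8, 128)  # output of a (hypothetical) language encoder
actions = policy(obs, demo_emb, lang_emb)  # -> shape (8, 7)
```

The point of the sketch is only that the policy receives both embeddings jointly, so either modality can disambiguate the other when specifying a new task.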