---
license: apache-2.0
---
|
|
|
|
|
# This&That V1.0 Model Card |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
[**Project Page**](https://cfeng16.github.io/this-and-that/) **|** [**Paper (ArXiv)**](https://arxiv.org/abs/2407.05530) **|** [**Code**](https://github.com/Kiteretsu77/This_and_That_VDM) |
|
|
|
|
|
</div> |
|
|
|
|
|
## Introduction |
|
|
|
|
|
We propose a robot learning method for communicating, planning, and executing a wide range of tasks, dubbed This&That. We achieve robot planning for general tasks by leveraging the power of video generative models trained on internet-scale data containing rich physical and semantic context. In this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intents, and 3) translating visual plans into robot actions. We propose language-gesture conditioning to generate videos, which is both simpler and clearer than existing language-only methods, especially in complex and uncertain environments. We then suggest a behavioral cloning design that seamlessly incorporates the video plans. This&That demonstrates state-of-the-art effectiveness in addressing the above three challenges, and justifies the use of video generation as an intermediate representation for generalizable task planning and execution.
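
As a purely illustrative aid, below is a minimal sketch of what the language-gesture conditioning described above might look like as model input: a single observed frame, a short language instruction, and two 2D gesture points marking "this" (the object to manipulate) and "that" (where it should go). All names and fields in this sketch are hypothetical, not the actual interface; see the official code repository linked above for how conditioning is really specified.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical container for language-gesture conditioning inputs.
# The real interface lives in the This_and_That_VDM repository linked above.
@dataclass
class LanguageGestureCondition:
    image_path: str                  # first frame observed by the robot
    instruction: str                 # short language command, e.g. "put this there"
    this_point: Tuple[float, float]  # normalized (x, y) gesture on the target object
    that_point: Tuple[float, float]  # normalized (x, y) gesture on the goal location

condition = LanguageGestureCondition(
    image_path="observations/frame_000.png",
    instruction="put this there",
    this_point=(0.42, 0.55),
    that_point=(0.78, 0.31),
)
# A video diffusion model conditioned on inputs like `condition` would synthesize
# the planned manipulation video, which the behavioral-cloning policy then consumes.
```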
|
|
## Citation |
|
|
```bibtex
@article{wang2024language,
  title={This\&That: Language-Gesture Controlled Video Generation for Robot Planning},
  author={Wang, Boyang and Sridhar, Nikhil and Feng, Chao and Van der Merwe, Mark and Fishman, Adam and Fazeli, Nima and Park, Jeong Joon},
  journal={arXiv preprint arXiv:2407.05530},
  year={2024}
}
```
|
|
|
|
|
|