Do As I Can, Not As I Say (first paper)
This paper contributes a system, named SayCan, that significantly improves a robot's task understanding and execution by using large language models (LLMs) to interpret high-level instructions and break them down into low-level, feasible skills. The most important point is that SayCan designs an affordance function, a value function that simultaneously evaluates both the probability of the LLM generating a given low-level instruction and the feasibility of executing that instruction in the current environment. The robot then executes the selected low-level instruction, after which SayCan appends it to the LLM query and proceeds to the next round of evaluation. In this way SayCan allows a robot to understand and execute complex commands: the LLM decomposes them into several lower-level steps, and the feasibility of each step is evaluated in the current environment.
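To make the scoring loop concrete, here is a minimal Python sketch of the mechanism: at each step the system picks the skill maximizing (LLM likelihood) × (affordance value), executes it, and appends it to the history fed back to the LLM. The skill list and the `llm_score` / `affordance` stubs below are illustrative toys with hand-set numbers, not the paper's actual PaLM-scale model or learned value functions.

```python
# Toy sketch of SayCan's plan loop; scores are hand-set stand-ins
# for the real LLM likelihoods and learned affordance values.
SKILLS = ["find a sponge", "pick up the sponge", "bring it to you", "done"]

def llm_score(history, skill):
    # Stub for p(skill | instruction, history) from the language model:
    # favors the skill that naturally comes next in the plan.
    order = {s: i for i, s in enumerate(SKILLS)}
    return 1.0 if order[skill] == len(history) else 0.1

def affordance(skill, env_state):
    # Stub for the learned value function: can this skill succeed now?
    if skill == "pick up the sponge" and not env_state["sponge_visible"]:
        return 0.0
    return 1.0

def saycan_plan(instruction, env_state, max_steps=5):
    history = []
    for _ in range(max_steps):
        # Combined score: the LLM says it is useful AND the robot can do it.
        best = max(SKILLS,
                   key=lambda s: llm_score(history, s) * affordance(s, env_state))
        if best == "done":
            break
        history.append(best)          # executed skill is appended to the LLM query
        if best == "find a sponge":   # toy environment update
            env_state["sponge_visible"] = True
    return history

plan = saycan_plan("bring me the sponge", {"sponge_visible": False})
print(plan)  # → ['find a sponge', 'pick up the sponge', 'bring it to you']
```

Note how the affordance term vetoes "pick up the sponge" until the environment makes it feasible, which is exactly the grounding the affordance function provides on top of the LLM's suggestions.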
Strengths of this paper include its innovative approach to making the robot understand human instructions and evaluate their feasibility in the current environment, overcoming the limitation that instructions generated by an LLM alone may not be executable in the real environment. The paper successfully integrates LLMs into robotics and demonstrates adaptability to a variety of tasks in multiple languages, attesting to the broad applicability of the model. Additionally, the paper is highly detailed, with results backed by extensive experiments conducted in realistic environments such as office kitchens.
Weaknesses of this paper include that the robot can only use the low-level skills designed by humans, so it can only execute a limited set of tasks. For example, it can only recognize a limited set of objects (e.g., an apple or a can) and perform a limited set of actions. That is to say, the primary bottleneck of the system is the range and capability of the underlying skills. The system also inherits the limitations and biases of LLMs.
The paper introduces a robust and innovative method called SayCan that integrates large language models (LLMs) with robots. The explanation of the technical concepts related to LLMs, value functions, and affordance functions is detailed and informative. The experiments are easy to reproduce, giving the work a high degree of credibility.
My Confidence and Final Recommendation
I have moderate confidence in reviewing this paper. I suggest that additional experiments be conducted on the impact of expanding low-level instructions on the robot’s generalization ability.
VIMA: General Robot Manipulation with Multimodal Prompts (second paper)
The contribution of this paper lies in the development of a transformer-based robot agent, VIMA, which can perform various robot manipulation tasks via multimodal prompts. Additionally, the authors present VIMA-BENCH, a diverse benchmark suite supporting a wide array of multimodal-prompted tasks. VIMA proposes a new tokenization method that represents text and images in a uniform token format, so the robot can use a single I/O interface to capture features from instructions composed of both text and images.
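The core idea of the unified interface can be sketched as follows: words and image crops are both mapped into the same embedding space, so the transformer sees one homogeneous token sequence. All names, dimensions, and encoders below are illustrative stand-ins, not VIMA's actual architecture or API.

```python
import numpy as np

EMBED_DIM = 8  # toy embedding size; real models use hundreds of dimensions
_vocab = {}

def text_token(word):
    # Toy word embedding, deterministic per word within a run.
    if word not in _vocab:
        rng = np.random.default_rng(abs(hash(word)) % (2**32))
        _vocab[word] = rng.standard_normal(EMBED_DIM)
    return _vocab[word]

def image_token(crop):
    # Toy stand-in for an object encoder (e.g. a ViT over the crop),
    # projected into the same dimension as the text embeddings.
    flat = np.asarray(crop, dtype=float).ravel()
    proj = np.resize(flat, EMBED_DIM)
    return proj / (np.linalg.norm(proj) + 1e-8)

def tokenize_prompt(parts):
    # A multimodal prompt is a list mixing words and image crops;
    # both become EMBED_DIM vectors in one uniform sequence.
    return np.stack([
        image_token(p) if isinstance(p, np.ndarray) else text_token(p)
        for p in parts
    ])

crop = np.ones((4, 4, 3))                     # fake object crop
seq = tokenize_prompt(["put", crop, "into", crop])
print(seq.shape)                              # → (4, 8)
```

The point of the sketch is the return type: whatever mix of modalities the prompt contains, the downstream transformer receives a single array of same-sized tokens, which is what allows one I/O interface to cover all task specifications.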
The paper demonstrates the strength of VIMA in terms of scalability, data efficiency, and robustness. VIMA outperforms other models on zero-shot generalization tasks and achieves stronger performance with less training data. It also shows impressive robustness to detection inaccuracies. However, I think the benchmark framework for multimodal robot learning is even more important, because it can serve as the foundation of a standard benchmark toolkit.
The weaknesses of this paper are that, although it shows VIMA can handle variable-length object sequences, the impact of sequence length on performance is not reported. On the other hand, although VIMA demonstrates strong zero-shot generalization, its success rate is around 50%, which is not ideal, so there is still large room for improvement.
The authors of this paper open-source the simulation environment, training dataset, algorithm code, and pretrained model checkpoints to ensure reproducibility. So I believe the reproducibility of this paper should be high, given enough computational resources.
The paper is well-structured and clear. The authors first introduce the overall characteristics of VIMA and the VIMA benchmark, then describe in detail the design concepts and methods used in VIMA, and finally use the benchmark suite to test the performance of the model. They conduct many experiments to measure the impact of different design aspects.
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (third paper)
The paper introduces Vision-Language-Action (VLA) models, a novel method of integrating vision-language models pretrained on abundant web-scale data into end-to-end robotic control. Many other works that use LLMs for robots only use the LLM for instruction decomposition (such as the first paper); the robot then generates action information based on the decomposed instructions (for example, if you want the robot to get a bottle of Coke, the LLM only decomposes the task and tells the robot the more specific instructions "first open the refrigerator door" and "then take out the Coke"). Here, the authors use the model to directly generate action sequences, omitting the intermediate steps. This design is more unified across images, text, and robot action control: compared to the previous two papers, it makes the robot control output a directly generated token of the model, just like text generated from an LLM, which is intuitively more efficient.
- The work addresses a critical gap in bridging high-level language and vision knowledge with low-level robotic actions, presenting a unique pathway for enhancing robot operations.
- The approach integrates the benefits of large-scale pretraining on language and vision-language data from the web. This promises improved generalization and the emergence of new capabilities in the robotic domain.
- The strategy of representing actions as text tokens is innovative and allows for a straightforward blending of natural language responses and robotic actions.
As a VLA, RT-2's emergent capabilities only manifest in aspects related to VLMs; it cannot realize new skills at the physical level and is still limited by the skill categories in the dataset. The authors hope to use new data collection methods to expand its skills in the future, such as learning from videos.
When using the VLA for real-time robot control, the inference frequency is still not high enough. In the future, this could be addressed by quantizing and distilling the model so that it can be deployed on lower-cost hardware platforms while performing higher-frequency inference.
The approach of representing robot actions as text tokens and incorporating them directly into the training set is unique and innovative. It makes training the robot model just like training an existing language model, allowing it to reuse existing pretraining results.
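The action-as-text idea can be illustrated with a small round-trip sketch: each continuous action dimension is discretized into 256 bins, and the bin indices are serialized as a token string the model can emit like ordinary text. The normalized action range and helper names here are my own illustrative assumptions, not RT-2's exact encoding.

```python
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized action range per dimension

def action_to_tokens(action):
    # Map each continuous dimension to an integer bin in [0, N_BINS - 1].
    clipped = np.clip(action, LOW, HIGH)
    bins = np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)
    # Serialize like text, e.g. "128 0 255 191", so the VLM can
    # generate actions with the same decoder it uses for words.
    return " ".join(map(str, bins))

def tokens_to_action(text):
    # Inverse map: token string back to continuous action values.
    bins = np.array([int(t) for t in text.split()])
    return bins / (N_BINS - 1) * (HIGH - LOW) + LOW

a = np.array([0.0, -1.0, 1.0, 0.5])
tokens = action_to_tokens(a)
decoded = tokens_to_action(tokens)
print(tokens)   # e.g. "128 0 255 191"
print(decoded)  # close to the original action, up to quantization error
```

The round trip loses at most half a bin width per dimension, which is why a modest number of bins is enough for smooth control while keeping the action vocabulary tiny compared to a text vocabulary.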
The conclusions of this paper are validated by experimental results, and the paper is logically clear and correctly formatted.