This repository contains quickstart code to train and evaluate an Action Chunking Transformer (ACT) to perform various robot manipulation tasks using the ALOHA gym environment.
In the ALOHA TransferCubeTask, the right arm of the robot needs to pick up a red cube and place it inside the gripper of the left arm.
View my training & evaluation graphs:
- For the Transfer Cube task (single object)
- For the Insertion task (two objects)
An Action Chunking Transformer is a novel imitation learning algorithm designed to handle the complexities of fine-grained robotic manipulation tasks. It leverages the strengths of action chunking and the Transformer architecture to improve the learning and execution of these tasks.
- Action Chunking:
  - Definition: Action chunking refers to grouping a sequence of actions together and treating them as a single unit. Instead of predicting one action at a time, the model predicts a sequence of actions for multiple timesteps.
  - Purpose: This reduces the effective horizon of the task, which helps mitigate the compounding-error problem. Compounding errors occur when small prediction errors accumulate over time, driving the robot into states outside the training distribution and causing task failures.
  - Implementation: In the Action Chunking Transformer, the policy models the probability distribution of a sequence of actions given the current observation (see the sketch below).
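A minimal sketch of the idea, assuming a simple MLP head; the names and dimensions are illustrative, not the repository's actual API:

```python
import torch
import torch.nn as nn

class ChunkedPolicy(nn.Module):
    """Illustrative only: predicts a chunk of `chunk_size` actions from one
    observation, rather than a single action per forward pass."""

    def __init__(self, obs_dim: int, action_dim: int, chunk_size: int = 100):
        super().__init__()
        self.chunk_size = chunk_size
        self.action_dim = action_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, chunk_size * action_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # (batch, obs_dim) -> (batch, chunk_size, action_dim)
        return self.net(obs).view(-1, self.chunk_size, self.action_dim)
```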
- Transformer Architecture:
  - Transformers: Originally designed for natural language processing tasks, Transformers are effective at handling sequence data and capturing long-range dependencies.
  - Encoder-Decoder Structure: In this implementation, the Transformer encoder processes the observation inputs (including visual data and joint positions), and the Transformer decoder predicts the sequence of actions.
  - Conditional Variational Autoencoder (CVAE): The Action Chunking Transformer uses a CVAE to handle the variability in human demonstrations. The CVAE encoder compresses the observed actions and joint positions into a latent variable z, which the decoder then uses, along with the observations, to predict the sequence of actions (sketched below).
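A rough sketch of the CVAE data flow described above; `encoder` and `decoder` stand in for the actual Transformer modules and are assumptions for illustration:

```python
import torch

def cvae_forward(encoder, decoder, obs, actions):
    """Illustrative CVAE flow: the encoder compresses the demonstrated action
    sequence (plus joint positions) into a latent z; the decoder predicts the
    action chunk conditioned on z and the observations."""
    mu, logvar = encoder(actions, obs)       # parameters of the posterior over z
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)     # reparameterization trick
    predicted_actions = decoder(obs, z)      # (batch, chunk_size, action_dim)
    return predicted_actions, mu, logvar
```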
- Temporal Ensembling:
  - Definition: Temporal ensembling involves averaging the predictions of overlapping action chunks to produce smoother and more accurate trajectories.
  - Purpose: This technique addresses the potential issue of abrupt changes between action chunks and ensures smoother transitions by incorporating new observations continuously and averaging the predicted actions.
  - Implementation: The policy is queried at each timestep, producing overlapping chunks of actions. The predictions for the current timestep are then combined using an exponential weighting scheme, w_i = exp(-m * i), so that all overlapping chunks contribute; a smaller m incorporates new observations faster (see the sketch below).
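A minimal sketch of the ensembling step, assuming `candidates[i]` holds the action that the chunk started i steps ago proposes for the current timestep (row 0 = oldest chunk, following the w_i = exp(-m * i) convention from the ACT paper):

```python
import numpy as np

def ensemble_actions(candidates: np.ndarray, m: float = 0.01) -> np.ndarray:
    """Weighted average over all actions proposed for the current timestep.
    candidates: (num_chunks, action_dim), row 0 from the oldest chunk.
    A smaller m incorporates new observations faster."""
    weights = np.exp(-m * np.arange(len(candidates)))  # w_i = exp(-m * i)
    weights /= weights.sum()                           # normalize
    return (weights[:, None] * candidates).sum(axis=0)
```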
- Data Collection:
  - Human demonstrations are collected using a teleoperation system. The joint positions of the leader robot (operated by the human) are recorded as the actions, and observations include images from multiple cameras and the joint positions of the follower robot (an illustrative record layout is sketched below).
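One recorded timestep might be structured like this; the field names and shapes are assumptions for illustration, not the exact lerobot dataset schema:

```python
import numpy as np

step = {
    "observation": {
        "images": {"top": np.zeros((480, 640, 3), dtype=np.uint8)},  # per-camera frame
        "qpos": np.zeros(14),  # follower-robot joint positions (two arms)
    },
    "action": np.zeros(14),    # leader-robot joint positions, recorded as the action
}
```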
- Training:
  - The CVAE encoder processes the collected data to learn a latent representation z.
  - The Transformer decoder, conditioned on z and the current observations, predicts the sequence of future actions.
  - The model is trained to minimize the reconstruction loss (the difference between predicted and actual actions) plus a KL-divergence regularization loss that keeps the latent space well-structured (see the loss sketch below).
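A minimal sketch of this objective, assuming posterior parameters `mu` and `logvar` from the CVAE encoder; the KL weight is a hyperparameter and the value here is illustrative:

```python
import torch
import torch.nn.functional as F

def act_loss(predicted_actions, target_actions, mu, logvar, kl_weight=10.0):
    """Illustrative ACT-style objective: L1 reconstruction of the action chunk
    plus a KL term pulling the posterior over z toward N(0, I)."""
    recon = F.l1_loss(predicted_actions, target_actions)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl
```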
- Inference:
  - During execution, the policy generates action sequences based on the current observation and the mean of the prior distribution of z.
  - Temporal ensembling is applied to combine predictions from overlapping action chunks, ensuring smooth and precise motion (a minimal sketch follows below).
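A minimal inference sketch, assuming a `policy(obs, z)` callable and a latent dimension of 32 (both assumptions for illustration):

```python
import torch

@torch.no_grad()
def act_inference_step(policy, obs, latent_dim: int = 32):
    """At test time the latent is set to the mean of the prior N(0, I),
    i.e. a zero vector, and the policy decodes a full action chunk."""
    z = torch.zeros(1, latent_dim)  # mean of the prior
    return policy(obs, z)           # (1, chunk_size, action_dim)
```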
Key advantages of this approach:
- Reduction of Compounding Errors: By predicting action sequences, the effective horizon is reduced, and errors do not compound as rapidly.
- Handling of Non-Markovian Behavior: Action chunking can manage pauses and other non-Markovian behaviors in human demonstrations, improving the robustness of the policy.
- Smooth and Precise Actions: Temporal ensembling helps in producing smooth and accurate actions, which are crucial for fine-grained manipulation tasks.
The available tasks and datasets per simulation environment:

available_tasks_per_env = {
"aloha": [
"AlohaInsertion-v0",
"AlohaTransferCube-v0",
],
"pusht": ["PushT-v0"],
"xarm": ["XarmLift-v0"],
}
available_datasets_per_env = {
"aloha": [
"lerobot/aloha_sim_insertion_human",
"lerobot/aloha_sim_insertion_scripted",
"lerobot/aloha_sim_transfer_cube_human",
"lerobot/aloha_sim_transfer_cube_scripted",
],
"pusht": ["lerobot/pusht"],
"xarm": [
"lerobot/xarm_lift_medium",
"lerobot/xarm_lift_medium_replay",
"lerobot/xarm_push_medium",
"lerobot/xarm_push_medium_replay",
],
}
Train ACT on the Transfer Cube task:

python train.py \
hydra.job.name=act_aloha_sim_transfer_cube_human \
hydra.run.dir=outputs/train/act_aloha_sim_transfer_cube_human \
policy=act \
policy.use_vae=true \
env=aloha \
env.task=AlohaTransferCube-v0 \
dataset_repo_id=lerobot/aloha_sim_transfer_cube_human \
training.eval_freq=10000 \
training.log_freq=250 \
training.offline_steps=100000 \
training.save_model=true \
training.save_freq=25000 \
eval.n_episodes=50 \
eval.batch_size=50 \
wandb.enable=true \
device=cuda
Train ACT on the Insertion task:

python train.py \
hydra.job.name=act_aloha_sim_insertion_human \
hydra.run.dir=outputs/train/act_aloha_sim_insertion_human \
policy=act \
policy.use_vae=true \
env=aloha \
env.task=AlohaInsertion-v0 \
dataset_repo_id=lerobot/aloha_sim_insertion_human \
training.eval_freq=10000 \
training.log_freq=250 \
training.offline_steps=100000 \
training.save_model=true \
training.save_freq=25000 \
eval.n_episodes=50 \
eval.batch_size=50 \
wandb.enable=true \
device=cuda
Evaluate a policy on an environment by running rollouts and computing metrics.
For example, to evaluate a model from the HF model hub (diffusion_pusht) for 10 episodes:
python eval.py -p lerobot/diffusion_pusht eval.n_episodes=10
To evaluate a model checkpoint from this repo's training script for 10 episodes:
python eval.py \
-p outputs/train/diffusion_pusht/checkpoints/005000 \
eval.n_episodes=10
Note that in both examples, the repo/folder should contain at least config.json, config.yaml, and model.safetensors.
Note the formatting for providing the number of episodes. Generally, you may provide any number of arguments with qualified.parameter.name=value. In this case, the parameter eval.n_episodes appears as n_episodes nested under eval in the config.yaml found here.
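As a rough sketch of how such a dotted override maps onto the nested config, using OmegaConf (which Hydra builds on); the values here are illustrative, not the repository's defaults:

```python
from omegaconf import OmegaConf

# A dotted CLI override such as `eval.n_episodes=10` targets a key nested
# under `eval` in the YAML config.
cfg = OmegaConf.create({"eval": {"n_episodes": 50, "batch_size": 50}})
OmegaConf.update(cfg, "eval.n_episodes", 10)
print(cfg.eval.n_episodes)  # 10
```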