One Demonstration Is Enough for Real-World Robotic Reinforcement Learning

Yuwan Liu1,2,3*, Hongze Yu3*, Song Liu3, Yuhan Wang3, Junge Zhang1,4, Yaodong Yang5, Yuanpei Chen3, Ceyao Zhang3,5†
1National Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences 2Beijing Academy of Artificial Intelligence 3PKU-PsiBot Joint Lab 4School of Artificial Intelligence, University of Chinese Academy of Sciences 5Institute for Artificial Intelligence, Peking University
*Equal contribution, †Corresponding author.
Overview of AutoSERL. Auto Intervention 1 (Sliding Window Intervention): the robot is guided to the nearest point within the sliding window only when the angle θ between the trajectory's forward direction and the vector to that point satisfies θ ≤ 90°, preventing the robot from being pulled back to already-visited positions. Auto Intervention 2 (Safety Recovery Mechanism): when the robot is stuck, it is guided to the recovery point and the demonstration segment is replayed to restore progress. Policy Training: intervention-guided transitions and the single demonstration are stored in the Demo Buffer and Replay Buffer for policy training.
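The angle test in Auto Intervention 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the window indexing, and the choice to approximate the "forward direction" by the demonstration's local tangent are all assumptions made for illustration. The sketch relies on the fact that θ ≤ 90° is equivalent to a non-negative dot product between the two vectors.

```python
import math

def select_intervention_point(position, demo, window_start, window_size):
    """Hypothetical helper: pick the nearest demo point in the sliding window,
    but only if it lies ahead of the robot (angle <= 90 degrees).

    Assumes `demo` has at least two points and window_start < len(demo).
    Returns (index, point), or None when guidance would pull the robot back.
    """
    window_end = min(window_start + window_size, len(demo))
    # Nearest demonstration point inside the window.
    idx = min(range(window_start, window_end),
              key=lambda i: math.dist(position, demo[i]))
    target = demo[idx]
    # Assumed forward direction: local tangent of the demonstration at idx.
    nxt = min(idx + 1, len(demo) - 1)
    prv = nxt - 1
    forward = [b - a for a, b in zip(demo[prv], demo[nxt])]
    to_target = [t - p for p, t in zip(position, target)]
    # theta <= 90 degrees  <=>  dot(forward, to_target) >= 0.
    dot = sum(f * v for f, v in zip(forward, to_target))
    return (idx, target) if dot >= 0.0 else None

demo = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
# Robot behind the nearest point: guidance is allowed.
print(select_intervention_point((0.6, 0.2, 0.0), demo, 0, 3))
# Robot already past the nearest point: no guidance, to avoid pulling it back.
print(select_intervention_point((1.4, 0.2, 0.0), demo, 0, 3))
```

Using the sign of the dot product avoids computing the angle explicitly and stays well defined when either vector is very short.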

Abstract

Learning effective robot control policies on physical hardware is challenging due to costly data collection and the difficulty of reward specification. Prior work has incorporated demonstrations into reinforcement learning (RL), yet existing approaches either require large numbers of demonstrations or depend on continuous human intervention during training. To address these limitations, we present AutoSERL, a framework that leverages a single demonstration to fully automate the intervention process in real-world robot RL. The framework combines three complementary mechanisms: a sliding window intervention mechanism that continuously guides exploration to avoid local optima and unsafe deviations, a safety recovery mechanism that detects and corrects failure states via predefined trajectory recovery points, and an intervention termination criterion that automatically disables guidance once the policy can independently complete the task, preserving its exploration advantage. We evaluate AutoSERL on six contact-intensive manipulation tasks across two robot platforms, spanning insertion, hanging, and hinge-based tasks. AutoSERL consistently outperforms SERL initialized with 20 demonstrations, behavior cloning, and MILES (a dedicated one-shot imitation learning baseline) across all tasks while matching HIL-SERL. It achieves a 100% success rate on insertion tasks and demonstrates improved robustness to positional variations, all from a single demonstration.

Experiments

We implement three categories of contact-intensive manipulation tasks: insertion tasks (plug insertion and USB insertion), hanging tasks (hanging a correction tape, a hanger, and a spoon), and a hinge-based task (drawer pulling using a hook). These tasks are characterized by rich physical contact and low tolerance for execution errors, requiring high manipulation precision, accurate pose alignment, stable contact control, and strong robustness to disturbances.

Training Efficiency Analysis

We compare AutoSERL with SERL to evaluate training efficiency. AutoSERL and SERL are assessed by measuring their success rates under identical training durations.

Success rate comparison between SERL and AutoSERL across tasks
For each method, the Training entry shows training for the listed duration, and the Evaluation entry reports the success rate of the checkpoint at that time over 50 trials.

Task                         Training Time   SERL Evaluation   AutoSERL Evaluation
USB Insertion                8 min           20/50             50/50
Plug Insertion               8 min           0/50              50/50
Hanger Suspension            33 min          0/50              50/50
Correction Tape Suspension   25 min          6/50              50/50
Spoon Suspension             35 min          0/50              50/50
Drawer Opening               45 min          0/50              50/50

We compare AutoSERL with HIL-SERL to evaluate training efficiency. AutoSERL and HIL-SERL are assessed by measuring the minimum training time required to reach a 100% success rate.

Training time comparison between AutoSERL and HIL-SERL across tasks

To further validate the effectiveness of AutoSERL, we compare it with BC and MILES in terms of final success rate.

Success rate comparison between AutoSERL, BC, and MILES across tasks

Robustness Analysis

Inter-seed variance of AutoSERL on plug insertion

We retrain the plug insertion task with five seeds (40–44). Under all seeds, the method achieves 100% or near 100% success rates, indicating robustness to random initialization and low performance variance across different seeds.

Positional variation robustness comparison between SERL and AutoSERL

In the plug insertion task, we randomize the initial plug position within a ±3 cm range in the x–y plane while keeping the socket position fixed to evaluate robustness to positional variations. AutoSERL achieves higher success rates and more stable convergence than SERL, demonstrating improved robustness to positional variations.
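As an illustration of this randomization protocol, the following minimal sketch samples an initial plug position within ±3 cm of a nominal x–y position. The uniform distribution, the function name, and the nominal coordinates are assumptions; the text above only specifies the range.

```python
import random

def sample_initial_xy(nominal_xy, half_range_m=0.03, rng=random):
    """Hypothetical helper: perturb the nominal (x, y) plug position.

    Each axis is offset independently by a value drawn uniformly from
    [-half_range_m, +half_range_m] (uniform sampling is an assumption).
    """
    x, y = nominal_xy
    return (x + rng.uniform(-half_range_m, half_range_m),
            y + rng.uniform(-half_range_m, half_range_m))

rng = random.Random(0)  # fixed seed so the evaluation positions are repeatable
start_positions = [sample_initial_xy((0.50, 0.10), rng=rng) for _ in range(5)]
```

Passing an explicit seeded generator lets both methods be evaluated on the same set of initial positions.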

Heuristic Hyperparameter Analysis

Overall, both parameters (th1 and th2) degrade performance when set too small or too large, and perform best within a moderate range. The stagnation window length lstag defines the number of steps required for the end-effector to move beyond th1 and is set according to the task's action scale.
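One plausible reading of how lstag and th1 interact is sketched below: the robot is flagged as stuck when the end-effector's net displacement over the last lstag steps does not exceed th1. This is an assumption made for illustration, not the authors' implementation; the class name, the use of net (rather than cumulative) displacement, and the example values are hypothetical.

```python
import math
from collections import deque

class StagnationDetector:
    """Hypothetical stagnation check over a window of l_stag steps."""

    def __init__(self, l_stag, th1):
        self.th1 = th1
        # l_stag steps span l_stag + 1 recorded positions.
        self.history = deque(maxlen=l_stag + 1)

    def update(self, position):
        """Record a new end-effector position; return True if stuck."""
        self.history.append(tuple(position))
        if len(self.history) < self.history.maxlen:
            return False  # not enough steps observed yet
        # Stuck if the net displacement over the window is not beyond th1.
        return math.dist(self.history[0], self.history[-1]) <= self.th1

detector = StagnationDetector(l_stag=3, th1=0.01)
for step in range(4):
    stuck = detector.update((0.001 * step, 0.0, 0.0))
print(stuck)  # True: only 3 mm of motion over 3 steps
```

Under this reading, a larger action scale would call for a shorter window or a larger threshold, which is consistent with setting lstag according to the task's action scale.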

Hyperparameter sensitivity analysis for th1
Hyperparameter sensitivity analysis for th2

Ablation Studies

To evaluate the contribution of each component in AutoSERL, we conduct three ablation studies: (1) No sliding window intervention: the sliding window intervention mechanism is removed, and no intervention points are used to guide the robot during training. (2) No recovery mechanism: the safety recovery mechanism is removed, and when the robot encounters failure states, it must rely solely on its own exploration to recover. (3) No intervention termination: the intervention termination criterion is removed, and the intervention continuously supervises the training process until the end.
Ablation results show that the sliding window intervention, recovery mechanism, and intervention termination are all essential to AutoSERL's performance, as removing any of them degrades learning efficiency or success rates.
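The ablation design can be summarized as a set of configurations that each disable exactly one mechanism. The sketch below is illustrative only; the class and field names are hypothetical and do not come from the AutoSERL codebase.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class InterventionConfig:
    """Hypothetical switches for the three AutoSERL mechanisms."""
    sliding_window: bool = True
    safety_recovery: bool = True
    intervention_termination: bool = True

FULL = InterventionConfig()

# Each ablation differs from the full method in exactly one switch.
ABLATIONS = {
    "full": FULL,
    "no_sliding_window": replace(FULL, sliding_window=False),
    "no_recovery": replace(FULL, safety_recovery=False),
    "no_termination": replace(FULL, intervention_termination=False),
}
```

Deriving every variant from the full configuration keeps all other settings identical, so any performance difference can be attributed to the removed mechanism.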

Ablation study results on plug insertion
Ablation study results on USB insertion
Ablation study results on drawer opening

Policy and Demonstration Trajectory Comparison

3D trajectory comparison between demonstration and policy rollout

Taking the plug insertion task as an example, we collect a non-optimal demonstration trajectory of length 99 and train the policy using this trajectory. Rolling out the first checkpoint that achieves a 50/50 success rate yields a trajectory of length 54. This suggests that the policy goes beyond simple imitation and performs trajectory-level optimization over the demonstration.
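The size of this reduction follows from simple arithmetic on the two reported lengths. The helper names below are hypothetical.

```python
def steps_saved(demo_len, rollout_len):
    """Absolute reduction in trajectory length."""
    return demo_len - rollout_len

def relative_reduction(demo_len, rollout_len):
    """Fraction of the demonstration length removed by the policy."""
    return (demo_len - rollout_len) / demo_len

print(steps_saved(99, 54))                  # 45
print(f"{relative_reduction(99, 54):.1%}")  # 45.5%
```

The rollout is therefore roughly 45% shorter than the single demonstration it was trained from.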


Conclusion

We propose AutoSERL, a real-world RL method that enables training through automatic intervention using only a single demonstration trajectory. AutoSERL incorporates sliding window intervention, safety recovery, and intervention termination to ensure safe and stable real-world reinforcement learning. AutoSERL outperforms multi-demonstration RL, behavior cloning, and one-shot imitation learning baselines across six contact-intensive tasks spanning three task categories while matching HIL-SERL. We hope this work will inspire future research on automated real-world robotic RL and enable its extension to tasks with more diverse failure modes and higher-dimensional action spaces.

