One Demonstration Is Enough for Real-World Robotic Reinforcement Learning

Yuwan Liu¹,²,³*, Hongze Yu³*, Song Liu³, Yuhan Wang³, Junge Zhang¹,⁴, Yaodong Yang⁵, Yuanpei Chen³, Ceyao Zhang³,⁵†

¹National Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences; ²Beijing Academy of Artificial Intelligence; ³PKU-PsiBot Joint Lab; ⁴School of Artificial Intelligence, University of Chinese Academy of Sciences; ⁵Institute for Artificial Intelligence, Peking University

*Equal contribution, †Corresponding author.

Overview of AutoSERL. Auto Intervention 1 (Sliding Window Intervention): the robot is guided to the nearest point within the sliding window only when the angle θ between the trajectory's forward direction and the vector to that point satisfies θ ≤ 90°, preventing the robot from being pulled back to already-visited positions. Auto Intervention 2 (Safety Recovery Mechanism): when the robot is stuck, it is guided to the recovery point and the demonstration segment is replayed to restore progress. Policy Training: intervention-guided transitions and the single demonstration are stored in the Demo Buffer and Replay Buffer for policy training.
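The angle test in Auto Intervention 1 reduces to a sign check on a dot product, since θ ≤ 90° exactly when cos θ ≥ 0. The sketch below is a minimal illustration under stated assumptions, not the released implementation: the function name, the window parameters, and the choice to take the forward direction from consecutive demonstration waypoints are all hypothetical.

```python
import math

def pick_intervention_target(position, demo, window_start, window_size):
    """Return the index of the demo waypoint to guide toward, or None.

    Hypothetical helper: `demo` is a list of at least two 3D waypoints from
    the single demonstration, and the sliding window covers
    demo[window_start : window_start + window_size].
    """
    window_end = min(window_start + window_size, len(demo))
    # Nearest demonstration waypoint inside the sliding window.
    nearest = min(range(window_start, window_end),
                  key=lambda i: math.dist(position, demo[i]))
    # Assumed forward direction: next waypoint minus this one
    # (the final waypoint reuses the last segment).
    j = min(nearest, len(demo) - 2)
    forward = [b - a for a, b in zip(demo[j], demo[j + 1])]
    # Vector from the robot to the candidate waypoint.
    to_point = [p - q for p, q in zip(demo[nearest], position)]
    # theta <= 90 degrees  <=>  cos(theta) >= 0  <=>  dot product >= 0.
    dot = sum(f * t for f, t in zip(forward, to_point))
    return nearest if dot >= 0 else None

# Straight-line demonstration along the x axis (illustrative values only).
demo = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
behind = pick_intervention_target((0.6, 0.5, 0.0), demo, 0, 3)
passed = pick_intervention_target((1.4, 0.5, 0.0), demo, 0, 3)
```

In this toy case the robot at x = 0.6 is still behind waypoint 1 and is guided to it, while the robot at x = 1.4 has already passed that waypoint, so the check rejects it instead of pulling the robot backward.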

Abstract

Learning effective robot control policies on physical hardware is challenging because data collection is costly and rewards are difficult to specify. Prior work has incorporated demonstrations into reinforcement learning (RL), yet existing approaches either require large numbers of demonstrations or depend on continuous human intervention during training. To address these limitations, we present AutoSERL, a framework that leverages a single demonstration to fully automate the intervention process in real-world robot RL. The framework combines three complementary mechanisms: a sliding window intervention mechanism that continuously guides exploration to avoid local optima and unsafe deviations, a safety recovery mechanism that detects and corrects failure states via predefined trajectory recovery points, and an intervention termination criterion that automatically disables guidance once the policy can complete the task independently, preserving its exploration advantage. We evaluate AutoSERL on six contact-intensive manipulation tasks across two robot platforms, spanning insertion, hanging, and hinge-based tasks. Across all tasks, AutoSERL consistently outperforms SERL initialized with 20 demonstrations, behavior cloning, and MILES (a dedicated one-shot imitation learning baseline) while matching HIL-SERL. It achieves a 100% success rate on insertion tasks and demonstrates improved robustness to positional variations, all from a single demonstration.

Experiments

We implement three categories of contact-intensive manipulation tasks: insertion tasks (plug insertion and USB insertion), hanging tasks (hanging a correction tape, a hanger, and a spoon), and a hinge-based task (drawer pulling using a hook). These tasks are characterized by rich physical contact and low tolerance for execution errors, requiring high manipulation precision, accurate pose alignment, stable contact control, and strong robustness to disturbances.

Training Efficiency Analysis

We compare AutoSERL with SERL on training efficiency by measuring the success rate each method reaches after an identical training duration.

Success rate comparison between SERL and AutoSERL across tasks

Task	Training Time	SERL Success Rate	AutoSERL Success Rate
USB Insertion	8 min	20/50	50/50
Plug Insertion	8 min	0/50	50/50
Hanger Suspension	33 min	0/50	50/50
Correction Tape Suspension	25 min	6/50	50/50
Spoon Suspension	35 min	0/50	50/50
Drawer Opening	45 min	0/50	50/50

Each method is trained for the listed time, and the checkpoint at that time is evaluated over 50 trials.
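To make the gap easier to read, the counts above can be converted to percentages. The short script below only restates the reported numbers; the variable and function names are illustrative.

```python
TRIALS = 50

# (task, SERL successes, AutoSERL successes), copied from the table above.
results = [
    ("USB Insertion", 20, 50),
    ("Plug Insertion", 0, 50),
    ("Hanger Suspension", 0, 50),
    ("Correction Tape Suspension", 6, 50),
    ("Spoon Suspension", 0, 50),
    ("Drawer Opening", 0, 50),
]

def success_rate(successes, trials=TRIALS):
    """Success rate in percent."""
    return 100.0 * successes / trials

rates = {task: (success_rate(s), success_rate(a)) for task, s, a in results}
for task, (serl, auto) in rates.items():
    print(f"{task}: SERL {serl:.0f}%, AutoSERL {auto:.0f}%")
```

Under the same training budget, SERL reaches 40% on USB insertion and 12% on correction tape suspension and fails on the other four tasks, while AutoSERL reaches 100% on all six.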

We also compare AutoSERL with HIL-SERL on training efficiency by measuring the minimum training time each method needs to reach a 100% success rate.

Training time comparison between AutoSERL and HIL-SERL across tasks

To further validate the effectiveness of AutoSERL, we compare it with BC and MILES in terms of final success rate.

Success rate comparison between AutoSERL, BC, and MILES across tasks

Robustness Analysis

Inter-seed variance of AutoSERL on plug insertion

We retrain the plug insertion task with five seeds (40–44). Under all seeds, the method achieves 100% or near 100% success rates, indicating robustness to random initialization and low performance variance across different seeds.

Positional variation robustness comparison between SERL and AutoSERL

In the plug insertion task, we randomize the initial plug position within a ±3 cm range in the x–y plane while keeping the socket position fixed to evaluate robustness to positional variations. AutoSERL achieves higher success rates and more stable convergence than SERL, demonstrating improved robustness to positional variations.
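The randomization protocol can be sketched as follows. The text specifies only the ±3 cm range in the x–y plane; the uniform distribution, independent sampling per axis, metre units, and the nominal pose values below are assumptions made for illustration.

```python
import random

OFFSET_RANGE_M = 0.03  # +/- 3 cm, as stated in the text

def sample_initial_plug_position(nominal_xyz, rng):
    """Perturb x and y within +/- 3 cm; z is left unchanged.

    Assumption: offsets are drawn uniformly and independently per axis.
    """
    x, y, z = nominal_xyz
    dx = rng.uniform(-OFFSET_RANGE_M, OFFSET_RANGE_M)
    dy = rng.uniform(-OFFSET_RANGE_M, OFFSET_RANGE_M)
    return (x + dx, y + dy, z)

rng = random.Random(0)            # fixed seed for reproducibility
nominal = (0.50, 0.00, 0.20)      # hypothetical nominal plug position (m)
samples = [sample_initial_plug_position(nominal, rng) for _ in range(1000)]
```

The socket position is not touched, matching the setup in which only the plug's initial position varies.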

Heuristic Hyperparameter Analysis

Overall, for both th₁ and th₂, performance degrades when the value is set too small or too large and is best within a moderate range. The stagnation window length l_stag is the number of steps within which the end-effector must move beyond th₁; it is set according to the task's action scale.
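One way to read the role of l_stag and th₁ is as a stuck detector: if the end-effector has not moved more than th₁ over the last l_stag steps, the robot is flagged as stuck. The sketch below follows that reading and is not the authors' code. The class name, the use of net displacement rather than path length, and the example values are assumptions.

```python
import math
from collections import deque

class StagnationDetector:
    """Flags the robot as stuck when it moves <= th1 over l_stag steps."""

    def __init__(self, l_stag, th1):
        self.th1 = th1
        # Keep l_stag + 1 positions so the window spans l_stag steps.
        self.history = deque(maxlen=l_stag + 1)

    def update(self, position):
        """Record the latest end-effector position; return True if stuck."""
        self.history.append(tuple(position))
        if len(self.history) < self.history.maxlen:
            return False  # not enough steps observed yet
        # Assumption: compare net displacement across the window with th1.
        displacement = math.dist(self.history[0], self.history[-1])
        return displacement <= self.th1

# Illustrative values only: a 3-step window and a 1 cm threshold.
still = StagnationDetector(l_stag=3, th1=0.01)
still_flags = [still.update((0.0, 0.0, 0.0)) for _ in range(4)]

moving = StagnationDetector(l_stag=3, th1=0.01)
moving_flags = [moving.update((0.01 * k, 0.0, 0.0)) for k in range(4)]
```

Under this reading, a larger action scale lets the end-effector cover th₁ in fewer steps, which is consistent with setting l_stag according to the task's action scale.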

Hyperparameter sensitivity analysis for th₁

Hyperparameter sensitivity analysis for th₂

Ablation Studies

To evaluate the contribution of each component in AutoSERL, we conduct three ablation studies: (1) No sliding window intervention: the sliding window intervention mechanism is removed, so no intervention points guide the robot during training. (2) No recovery mechanism: the safety recovery mechanism is removed, so the robot must rely solely on its own exploration to recover from failure states. (3) No intervention termination: the intervention termination criterion is removed, so intervention remains active until the end of training.

The ablation results show that the sliding window intervention, recovery mechanism, and intervention termination are all essential to AutoSERL's performance: removing any one of them degrades learning efficiency or success rate.

Ablation study results on plug insertion

Ablation study results on drawer opening

Policy and Demonstration Trajectory Comparison

3D trajectory comparison between demonstration and policy rollout

Taking the plug insertion task as an example, we collect a suboptimal demonstration trajectory of length 99 and train the policy on it. Rolling out the first checkpoint that achieves a 50/50 success rate yields a trajectory of length 54, roughly 45% shorter than the demonstration. This suggests that the policy goes beyond simple imitation and performs trajectory-level optimization over the demonstration.

Conclusion

We propose AutoSERL, a real-world RL method that enables training through automatic intervention using only a single demonstration trajectory. AutoSERL incorporates sliding window intervention, safety recovery, and intervention termination to ensure safe and stable real-world reinforcement learning. AutoSERL outperforms multi-demonstration RL, behavior cloning, and one-shot imitation learning baselines across six contact-intensive tasks spanning three task categories while matching HIL-SERL. We hope this work will inspire future research on automated real-world robotic RL and enable its extension to tasks with more diverse failure modes and higher-dimensional action spaces.