FlashSAC: Fast and Stable
Off-Policy Reinforcement Learning
for High-Dimensional Robot Control

A fast and stable off-policy RL algorithm that achieves the highest asymptotic performance in the shortest wall-clock time among existing methods for high-dimensional sim-to-real robotic control

1Holiday Robotics 2KAIST 3KRAFTON 4Turing Inc 5TU Darmstadt 6hessian.AI 7KTH Royal Institute of Technology 8German Research Center for AI (DFKI) * equal contribution

TL;DR

If you're using PPO, try FlashSAC!

Video Results

Low DoF

State-based Low DoF Learning Curve

High DoF

State-based High DoF Learning Curve

Sim-to-Real (Flat)

Sim-to-Real (Flat)

Sim-to-Real (Rough)

Sim-to-Real (Rough)
G1 Stair

Motivation

PPO has been the default for sim-to-real RL in constrained domains like quadruped locomotion and gripper manipulation. But modern robot learning — humanoids, dexterous manipulation, vision-based control — pushes into higher dimensions where discarding past experience after every update is no longer affordable.

Off-policy RL is the natural alternative, reusing replay data for far higher efficiency. Yet it remains uncommon for sim-to-real, as fitting a critic via the bootstrapped Bellman objective is slow and unstable:

$$\mathcal{L}_Q = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D},\; a' \sim \pi(\cdot \mid s')} \left[ \left( Q_\theta(s, a) - \left(r + \gamma Q_\theta(s', a')\right) \right)^2 \right]$$

Targets depend on the critic's own predictions, so errors compound through repeated bootstrapping. FlashSAC resolves this with three mechanisms: (i) fast training via fewer updates, larger models, and higher throughput; (ii) stable training by bounding weight, feature, and gradient norms; and (iii) broad exploration for diverse coverage. Across 60+ tasks in 10 simulators, it outperforms PPO and strong off-policy baselines, cutting sim-to-real humanoid walking from hours to minutes.

Algorithm

1. Fast Training

Following scaling trends from supervised learning, FlashSAC trades frequent small updates for high data throughput, large models, and infrequent gradient updates — a regime made possible by the stability mechanisms below.

Fast Training Components

  • Massively Parallel Simulation: 1024 parallel environments for rapid, diverse data collection.
  • Large Replay Buffer: 10M transitions (10× standard) preserve long-tail experiences.
  • Large Model, Large Batch, Fewer Updates: A 2.5M-parameter, 6-layer actor/critic with batch size 2048 and a UTD ratio of 2/1024.
  • Code Optimization: JIT-compiled PyTorch with mixed precision.
Fast Training Ablation

Ablation of FlashSAC's fast training components: massively parallel simulation, large replay buffer, large model with large batches and fewer updates, and code optimizations.
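To make the update schedule concrete, here is a minimal sketch of how the stated hyperparameters interact; the function names are ours, not from the paper:

```python
# Sketch of FlashSAC's fast-training schedule (illustrative helper names).
# With 1024 parallel envs and a UTD (update-to-data) ratio of 2/1024, each
# synchronized simulator step yields 1024 transitions but only 2 gradient updates.

NUM_ENVS = 1024
UTD_RATIO = 2 / 1024          # gradient updates per collected transition
BATCH_SIZE = 2048
REPLAY_CAPACITY = 10_000_000  # 10M transitions, ~10x a standard buffer

def updates_per_step(num_envs: int, utd: float) -> int:
    """Gradient updates to run after one synchronized step of all envs."""
    return int(num_envs * utd)

def buffer_size_after(steps: int, num_envs: int, cap: int) -> int:
    """Replay-buffer size after `steps` env steps, bounded by capacity."""
    return min(steps * num_envs, cap)

print(updates_per_step(NUM_ENVS, UTD_RATIO))               # 2 updates per env step
print(buffer_size_after(20_000, NUM_ENVS, REPLAY_CAPACITY))  # 10000000 (capacity reached)
```

The point of the schedule is that throughput comes from parallel data collection and large batches, not from update frequency.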

2. Stable Training

Scaling alone worsens the stability of bootstrapped critic updates. FlashSAC stabilizes training by constraining weight, feature, and gradient norms.

Architecture Design

  • Inverted Residual Backbone: Transformer-style inverted bottleneck blocks with residual connections, followed by RMSNorm to bound per-sample feature norms before the value heads.
  • Pre-activation Batch Normalization: BN before each nonlinearity keeps activations well-scaled and exploits large-batch statistics for a smoother loss landscape.
FlashSAC Architecture

FlashSAC Architecture. The architecture consists of stacked inverted residual blocks with pre-activation batch normalization and post-RMS normalization.
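The backbone described above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the released implementation: the hidden width, expansion factor, and exact placement of normalization layers are our assumptions.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization; bounds the per-sample feature norm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class InvertedResidualBlock(nn.Module):
    """Inverted bottleneck (expand -> nonlinearity -> project) with a residual
    connection; BatchNorm is placed before the nonlinearity."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim * expansion),     # expand
            nn.BatchNorm1d(dim * expansion),     # pre-activation BN
            nn.ReLU(),
            nn.Linear(dim * expansion, dim),     # project back down
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)                 # residual connection

class Backbone(nn.Module):
    """Stacked inverted residual blocks, then RMSNorm before the value heads."""
    def __init__(self, in_dim: int, dim: int = 512, depth: int = 6):
        super().__init__()
        self.stem = nn.Linear(in_dim, dim)
        self.blocks = nn.Sequential(*[InvertedResidualBlock(dim) for _ in range(depth)])
        self.out_norm = RMSNorm(dim)             # post-RMS normalization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out_norm(self.blocks(self.stem(x)))
```

Because RMSNorm rescales each feature vector by its own RMS, the features entering the value heads have bounded norm regardless of how the residual stream grows during training.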

Training Techniques

  • Cross-Batch Value Prediction: Concatenates current and next-state transitions in one forward pass so predicted and target Q-values share the same BN statistics.
  • Distributional Critic with Adaptive Reward Scaling: Q-values are categorical over \( [G_{\min}, G_{\max}] \), trained via cross-entropy. Rewards are normalized to keep returns within the fixed support: $$\bar{r}_t = \frac{r_t}{\max\!\left(\sqrt{\sigma_{t,G}^2 + \epsilon},\; G_{t,\max} / G_{\max}\right)}.$$
  • Weight Normalization: Projects weight vectors onto the unit sphere after each step, encoding information through direction rather than scale.
Stable Training Ablation

Ablation of FlashSAC's stable training components that constrain weight, feature, and gradient norms throughout training.
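As one example of these techniques, the weight-normalization step can be sketched as below. Projecting each linear layer's weight rows to unit L2 norm after every optimizer step is our reading of "projects weight vectors onto the unit sphere"; the exact granularity (per-row vs. per-matrix) is an assumption.

```python
import torch

@torch.no_grad()
def project_weights_to_unit_sphere(model: torch.nn.Module) -> None:
    """After each optimizer step, renormalize every linear layer's weight rows
    to unit L2 norm, so information is encoded in direction rather than scale.
    (Illustrative sketch; per-row projection is our assumption.)"""
    for m in model.modules():
        if isinstance(m, torch.nn.Linear):
            m.weight.div_(m.weight.norm(dim=1, keepdim=True).clamp_min(1e-8))
```

In a training loop this would run immediately after `optimizer.step()`, keeping weight norms fixed so the effective learning rate does not drift as weights grow.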

3. Exploration

FlashSAC uses two complementary mechanisms to broaden state-action coverage.

Exploration Mechanisms

  • Unified Entropy Target: Parameterizes the entropy target by a fixed action std \( \sigma_\text{tgt} \), giving \( \bar{\mathcal{H}} = \tfrac{1}{2}|\mathcal{A}|\log(2\pi e\,\sigma_\text{tgt}^2) \) — consistent across embodiments without per-task tuning (\( \sigma_\text{tgt} = 0.15 \)).
  • Noise Repetition: Holds a sampled action noise vector for \( k \) steps, with \( k \) drawn from a Zeta distribution — inducing temporal correlation with minimal overhead.
Exploration Ablation

Ablation of FlashSAC's exploration mechanisms: unified entropy target and noise repetition.
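Both mechanisms are simple to state in code. The sketch below computes the unified entropy target from the closed-form Gaussian entropy and samples Zeta-distributed hold lengths; the Zeta exponent `a = 2.0` and the helper names are our illustrative assumptions (the paper fixes only \( \sigma_\text{tgt} = 0.15 \)).

```python
import math
import numpy as np

SIGMA_TGT = 0.15  # fixed target action std from the paper

def entropy_target(action_dim: int, sigma: float = SIGMA_TGT) -> float:
    """Entropy of a diagonal Gaussian with std `sigma` in every action dimension:
    H = (1/2) * |A| * log(2*pi*e*sigma^2). Scales linearly with |A|, so one
    sigma works across embodiments without per-task tuning."""
    return 0.5 * action_dim * math.log(2 * math.pi * math.e * sigma ** 2)

rng = np.random.default_rng(0)

def sample_hold_length(a: float = 2.0) -> int:
    """Number of steps k to hold the current action-noise vector,
    k ~ Zeta(a) (support k >= 1); heavy tail gives occasional long holds."""
    return int(rng.zipf(a))
```

A heavy-tailed hold length means most noise vectors are resampled quickly, while rare long holds produce temporally correlated exploration at negligible cost.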

Citation

@article{kim2026flashsac,
  title={FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control},
  author={Kim, Donghu and Lee, Youngdo and Park, Minho and Kim, Kinam and Seno, Takuma and
          Nahrendra, I Made Aswin and Min, Sehee and Palenicek, Daniel and Vogt, Florian and
          Kragic, Danica and Peters, Jan and Choo, Jaegul and Lee, Hojoon},
  journal={arXiv preprint arXiv:2602},
  year={2026}
}