FlashSAC: Fast and Stable
Off-Policy Reinforcement Learning
for High-Dimensional Robot Control

A fast and stable off-policy RL algorithm that achieves the highest asymptotic performance in the shortest wall-clock time among existing methods for high-dimensional sim-to-real robotic control

1Holiday Robotics 2KAIST 3KRAFTON 4Turing Inc 5TU Darmstadt 6hessian.AI 7KTH Royal Institute of Technology 8German Research Center for AI (DFKI) * equal contribution

TL;DR

If you're using PPO, try FlashSAC!

Video Results

Low DoF

State-based Low DoF Learning Curve

High DoF

State-based High DoF Learning Curve

Sim-to-Real (Flat)

Sim-to-Real (Flat)

Sim-to-Real (Rough)

Sim-to-Real (Rough)
G1 Stair

Motivation

PPO has been the default for sim-to-real RL in constrained domains like quadruped locomotion and gripper manipulation. But modern robot learning — humanoids, dexterous manipulation, vision-based control — pushes into higher dimensions where discarding past experience after every update is no longer affordable.

Off-policy RL is the natural alternative, reusing replay data for far higher efficiency. Yet it remains uncommon for sim-to-real, as fitting a critic via the bootstrapped Bellman objective is slow and unstable:

$$\mathcal{L}_Q = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D},\; a' \sim \pi(\cdot \mid s')} \left[ \left( Q_\theta(s, a) - \left(r + \gamma Q_\theta(s', a')\right) \right)^2 \right]$$

Targets depend on the critic's own predictions, so errors compound through repeated bootstrapping. FlashSAC resolves this with three mechanisms: (i) fast training via fewer updates, larger models, and higher throughput; (ii) stable training by bounding weight, feature, and gradient norms; and (iii) broad exploration for diverse coverage. Across 60+ tasks in 10 simulators, it outperforms PPO and strong off-policy baselines, cutting sim-to-real humanoid walking from hours to minutes.

Algorithm

1. Fast Training

Following scaling trends from supervised learning, FlashSAC trades frequent small updates for high data throughput, large models, and infrequent gradient updates — a regime made possible by the stability mechanisms below.

Fast Training Components

  • Massively Parallel Simulation: 1024 parallel environments for rapid, diverse data collection.
  • Large Replay Buffer: 10M transitions (10× standard) preserve long-tail experiences.
  • Large Model, Large Batch, Fewer Updates: A 2.5M-parameter, 6-layer actor/critic with batch size 2048 and a UTD ratio of 2/1024.
  • Code Optimization: JIT-compiled PyTorch with mixed precision.
Fast Training Ablation

Ablation of FlashSAC's fast training components: massively parallel simulation, large replay buffer, large model with large batches and fewer updates, and code optimizations.
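To make the update schedule concrete, here is a minimal sketch of how the stated hyperparameters interact; the function names are ours, not from the paper:

```python
# Sketch of FlashSAC's fast-training schedule (illustrative helper names).
# With 1024 parallel envs and a UTD (update-to-data) ratio of 2/1024, each
# synchronized simulator step yields 1024 transitions but only 2 gradient updates.

NUM_ENVS = 1024
UTD_RATIO = 2 / 1024          # gradient updates per collected transition
BATCH_SIZE = 2048
REPLAY_CAPACITY = 10_000_000  # 10M transitions, ~10x a standard buffer

def updates_per_step(num_envs: int, utd: float) -> int:
    """Gradient updates to run after one synchronized step of all envs."""
    return int(num_envs * utd)

def buffer_size_after(steps: int, num_envs: int, cap: int) -> int:
    """Replay-buffer size after `steps` env steps, bounded by capacity."""
    return min(steps * num_envs, cap)

print(updates_per_step(NUM_ENVS, UTD_RATIO))               # 2 updates per env step
print(buffer_size_after(20_000, NUM_ENVS, REPLAY_CAPACITY))  # 10000000 (capacity reached)
```

The point of the schedule is that throughput comes from parallel data collection and large batches, not from update frequency.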

2. Stable Training

Scaling alone worsens the stability of bootstrapped critic updates. FlashSAC stabilizes training by constraining weight, feature, and gradient norms.

Architecture Design

  • Inverted Residual Backbone: Transformer-style inverted bottleneck blocks with residual connections, followed by RMSNorm to bound per-sample feature norms before the value heads.
  • Pre-activation Batch Normalization: BN before each nonlinearity keeps activations well-scaled and exploits large-batch statistics for a smoother loss landscape.
FlashSAC Architecture

FlashSAC Architecture. The architecture consists of stacked inverted residual blocks with pre-activation batch normalization and post-RMS normalization.
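The backbone described above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the released implementation: the hidden width, expansion factor, and exact placement of normalization layers are our assumptions.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization; bounds the per-sample feature norm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class InvertedResidualBlock(nn.Module):
    """Inverted bottleneck (expand -> nonlinearity -> project) with a residual
    connection; BatchNorm is placed before the nonlinearity."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim * expansion),     # expand
            nn.BatchNorm1d(dim * expansion),     # pre-activation BN
            nn.ReLU(),
            nn.Linear(dim * expansion, dim),     # project back down
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)                 # residual connection

class Backbone(nn.Module):
    """Stacked inverted residual blocks, then RMSNorm before the value heads."""
    def __init__(self, in_dim: int, dim: int = 512, depth: int = 6):
        super().__init__()
        self.stem = nn.Linear(in_dim, dim)
        self.blocks = nn.Sequential(*[InvertedResidualBlock(dim) for _ in range(depth)])
        self.out_norm = RMSNorm(dim)             # post-RMS normalization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out_norm(self.blocks(self.stem(x)))
```

Because RMSNorm rescales each feature vector by its own RMS, the features entering the value heads have bounded norm regardless of how the residual stream grows during training.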

Training Techniques

  • Cross-Batch Value Prediction: Concatenates current and next-state transitions in one forward pass so predicted and target Q-values share the same BN statistics.
  • Distributional Critic with Adaptive Reward Scaling: Q-values are categorical over \( [G_{\min}, G_{\max}] \), trained via cross-entropy. Rewards are normalized to keep returns within the fixed support: $$\bar{r}_t = \frac{r_t}{\max\!\left(\sqrt{\sigma_{t,G}^2 + \epsilon},\; G_{t,\max} / G_{\max}\right)}.$$
  • Weight Normalization: Projects weight vectors onto the unit sphere after each step, encoding information through direction rather than scale.
Stable Training Ablation

Ablation of FlashSAC's stable training components that constrain weight, feature, and gradient norms throughout training.
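As one example of these techniques, the weight-normalization step can be sketched as below. Projecting each linear layer's weight rows to unit L2 norm after every optimizer step is our reading of "projects weight vectors onto the unit sphere"; the exact granularity (per-row vs. per-matrix) is an assumption.

```python
import torch

@torch.no_grad()
def project_weights_to_unit_sphere(model: torch.nn.Module) -> None:
    """After each optimizer step, renormalize every linear layer's weight rows
    to unit L2 norm, so information is encoded in direction rather than scale.
    (Illustrative sketch; per-row projection is our assumption.)"""
    for m in model.modules():
        if isinstance(m, torch.nn.Linear):
            m.weight.div_(m.weight.norm(dim=1, keepdim=True).clamp_min(1e-8))
```

In a training loop this would run immediately after `optimizer.step()`, keeping weight norms fixed so the effective learning rate does not drift as weights grow.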

3. Exploration

FlashSAC uses two complementary mechanisms to broaden state-action coverage.

Exploration Mechanisms

  • Unified Entropy Target: Parameterizes the entropy target by a fixed action std \( \sigma_\text{tgt} \), giving \( \bar{\mathcal{H}} = \tfrac{1}{2}|\mathcal{A}|\log(2\pi e\,\sigma_\text{tgt}^2) \) — consistent across embodiments without per-task tuning (\( \sigma_\text{tgt} = 0.15 \)).
  • Noise Repetition: Holds a sampled action noise vector for \( k \) steps, with \( k \) drawn from a Zeta distribution — inducing temporal correlation with minimal overhead.
Exploration Ablation

Ablation of FlashSAC's exploration mechanisms: unified entropy target and noise repetition.
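Both mechanisms are simple to state in code. The sketch below computes the unified entropy target from the closed-form Gaussian entropy and samples Zeta-distributed hold lengths; the Zeta exponent `a = 2.0` and the helper names are our illustrative assumptions (the paper fixes only \( \sigma_\text{tgt} = 0.15 \)).

```python
import math
import numpy as np

SIGMA_TGT = 0.15  # fixed target action std from the paper

def entropy_target(action_dim: int, sigma: float = SIGMA_TGT) -> float:
    """Entropy of a diagonal Gaussian with std `sigma` in every action dimension:
    H = (1/2) * |A| * log(2*pi*e*sigma^2). Scales linearly with |A|, so one
    sigma works across embodiments without per-task tuning."""
    return 0.5 * action_dim * math.log(2 * math.pi * math.e * sigma ** 2)

rng = np.random.default_rng(0)

def sample_hold_length(a: float = 2.0) -> int:
    """Number of steps k to hold the current action-noise vector,
    k ~ Zeta(a) (support k >= 1); heavy tail gives occasional long holds."""
    return int(rng.zipf(a))
```

A heavy-tailed hold length means most noise vectors are resampled quickly, while rare long holds produce temporally correlated exploration at negligible cost.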

Citation

@article{kim2026flashsac,
  title={FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control},
  author={Kim, Donghu and Lee, Youngdo and Park, Minho and Kim, Kinam and Seno, Takuma and
          Nahrendra, I Made Aswin and Min, Sehee and Palenicek, Daniel and Vogt, Florian and
          Kragic, Danica and Peters, Jan and Choo, Jaegul and Lee, Hojoon},
  journal={arXiv preprint arXiv:2602},
  year={2026}
}