DanceOPD: On-Policy Generative Field Distillation

Core idea

Treat every source capability as a velocity field, then learn where and how to query those fields on the student's own rollout.

Animated method overview. DanceOPD uses hard routing to select one frozen capability field, queries it on a low-noise on-policy student state, and matches the selected velocity with a local velocity MSE loss.

Overview

Latest manuscript summary

Modern image generation demands a single model that unifies diverse capabilities, including text-to-image generation, local editing, and global editing. These abilities are rarely naturally aligned: editing can degrade T2I performance, while global and local editing can interfere with each other.

DanceOPD is an on-policy generative field distillation framework for flow-matching models. Each sample is routed to one frozen capability field, one low-noise student-induced state is queried, and the student is trained with a simple velocity MSE objective. The same formulation also absorbs operator-defined fields such as classifier-free guidance.

Overview. High-resolution DanceOPD overview figure from the manuscript.

5.347: GEditBench Avg in T2I + Edit Composition

0.849: GenEval Overall while preserving T2I

5.498: GEditBench Avg in Local + Global Edit Composition

5.833: Best CFG absorption diagnostic Avg

Challenge: capability synthesis is a field-query problem

Three query-induced alignment failures

Once each frozen source is viewed as a velocity field over the shared flow state space, capability synthesis depends on three choices: which field supervises a sample, where the field is queried, and how many states from a rollout are used.

Target-field ambiguity

Softly averaging several source fields can destroy the semantic identity of a capability query. The update may point to no real teacher behavior.

DanceOPD: hard-routed sample-wise field matching

State-distribution mismatch

Data states or teacher trajectories are off-policy for the student. They miss the states the deployed model actually visits at inference time.

DanceOPD: query on stop-gradient student rollout states

Trajectory-query correlation

Dense states from the same rollout share prompt, noise, dynamics, and history. More states can over-weight one correlated path.

DanceOPD: one low-noise semantic-side query
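The target-field ambiguity described above can be illustrated with a toy example. This is a minimal sketch with made-up 2-D velocities, not values from the paper: `v_local` and `v_global` are hypothetical teacher outputs at the same state, and `cosine` is a helper introduced only for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical velocities from two frozen teachers queried at the same state.
v_local = [1.0, 0.0]    # stand-in for a preservation-heavy field
v_global = [-1.0, 0.2]  # stand-in for a transformation-heavy field

# Soft fusion: average the teacher velocities into one target.
v_soft = [(a + b) / 2 for a, b in zip(v_local, v_global)]

# Hard routing: the sample's route selects exactly one teacher as the target.
route = "global"
v_hard = {"local": v_local, "global": v_global}[route]

print("soft target:", v_soft)
print("soft vs local :", cosine(v_soft, v_local))
print("soft vs global:", cosine(v_soft, v_global))
print("hard vs global:", cosine(v_hard, v_global))
```

In this toy case the averaged target is orthogonal to one teacher and only weakly aligned with the other, while the hard-routed target coincides with a real teacher velocity.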

T2I + Edit Composition

Add editing ability while retaining text-to-image prompt following and visual quality.

Local + Global Edit Composition

Fuse preservation-heavy local editing with transformation-heavy global editing.

Realism / Style Absorption

Move the student toward a quality or style field while keeping base T2I behavior.

CFG Absorption

Internalize classifier-free guidance as an operator-defined velocity field.

Method: DanceOPD

Hard-routed semantic-side on-policy velocity matching

DanceOPD keeps each local target semantically well-defined, queries the target where the current student actually goes, and avoids dense correlated supervision. The full update is a local field-matching step on a stop-gradient rollout state.

1 · Route one sample

Keep one semantic target per sample instead of averaging teachers.

m ∼ π(m), (x,c) ∼ D_m

u_m(z,t,c) = v_m(z,t,c)

2 · Query on-policy

Ask the frozen field at a state from the current student rollout.

z_{0:T}^θ = Rollout(v_θ; z_T, c)

z̄_t = sg(z_t^θ)

3 · Match velocities

The selected field and student velocity meet in one local MSE.

single low-noise query, K = 1

L = E || v_θ(z̄_t,t,c) − v_m(z̄_t,t,c) ||₂²

4 · Absorb operator fields

Classifier-free guidance is another velocity field to distill.

guided velocity field

v_α = v_∅ + α(v_cond − v_∅)
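The guided velocity field in step 4 can be written as a small function. This is a sketch under assumptions: velocities are plain lists of floats, and `guided_velocity`, `v_uncond`, and `v_cond` are illustrative names that stand in for the outputs of a frozen model's unconditional and conditional branches.

```python
def guided_velocity(v_uncond, v_cond, alpha):
    """Classifier-free guidance as a field: v_alpha = v_uncond + alpha * (v_cond - v_uncond)."""
    return [u + alpha * (c - u) for u, c in zip(v_uncond, v_cond)]

# Toy branch outputs at one state (made-up numbers).
v_uncond = [0.0, 1.0]
v_cond = [1.0, 3.0]

# alpha = 0 recovers the unconditional field, alpha = 1 the conditional one.
assert guided_velocity(v_uncond, v_cond, 0.0) == v_uncond
assert guided_velocity(v_uncond, v_cond, 1.0) == v_cond

# A guided target such as this one would take the place of v_m in the velocity MSE.
v_teacher = guided_velocity(v_uncond, v_cond, 4.0)
print(v_teacher)
```

Because the guided field is just another function of (z, t, c), it can serve as the routed target in the same matching loss as a capability field.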

A. Route one capability query

m ∼ π(m), (x,c) ∼ D_m

Each sample chooses exactly one frozen capability field. Unless stated otherwise, active capability buckets use a uniform route ratio.

B. Query on the student trajectory

z_{0:T}^θ = Rollout(v_θ; z_T, c)

The target field is queried at sg(z_t^θ), exposing the teacher to student-visited states without backpropagating through the solver.

C. Use one semantic-side state

K = 1, low-noise query

Low-noise states concentrate edit, style, and visual-attribute signals; one query avoids within-rollout correlation.
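The three choices above (route one field, query on the student rollout, use one low-noise state) can be sketched as a single update. This is a toy 1-D illustration, not the authors' implementation. Every name is hypothetical, and several details are assumptions: the linear fields, the Euler sign convention (t runs from 1 at noise to 0 at data), and the `low_noise_max_t` threshold. The stop-gradient is modeled by treating `z_bar` as a constant when differentiating with respect to `theta`.

```python
import random

def make_linear_field(a):
    """Return a toy frozen velocity field v(z, t, c) = a * z + c."""
    return lambda z, t, c: a * z + c

# Hypothetical frozen capability fields and per-capability conditions.
TEACHERS = {"t2i": make_linear_field(0.5), "edit": make_linear_field(-0.3)}
DATA = {"t2i": [0.1, 0.2], "edit": [-0.1, 0.4]}

def rollout(theta, z_T, c, num_steps=16):
    """Euler rollout of the toy student v_theta(z, t, c) = theta * z + c.
    Returns the visited (t, z_t) pairs, from t = 1 down to t = 0."""
    dt = 1.0 / num_steps
    z, t = z_T, 1.0
    states = [(t, z)]
    for _ in range(num_steps):
        z = z - dt * (theta * z + c)
        t = t - dt
        states.append((t, z))
    return states

def danceopd_step(theta, rng, lr=0.1, low_noise_max_t=0.3):
    # A. Hard routing: one capability per sample, uniform route ratio.
    m = rng.choice(sorted(TEACHERS))
    c = rng.choice(DATA[m])
    # B. On-policy state: roll out the current student, then freeze the state.
    states = rollout(theta, z_T=rng.gauss(0.0, 1.0), c=c)
    # C. One low-noise query (K = 1).
    low_noise = [(t, z) for t, z in states if 0.0 < t <= low_noise_max_t]
    t, z_bar = rng.choice(low_noise)
    # Velocity MSE between student and routed teacher at the frozen state.
    v_student = theta * z_bar + c
    v_target = TEACHERS[m](z_bar, t, c)
    loss = (v_student - v_target) ** 2
    # Gradient w.r.t. theta with z_bar held constant (no backprop through the solver).
    grad = 2.0 * (v_student - v_target) * z_bar
    return theta - lr * grad, loss, m

rng = random.Random(0)
theta = 0.0
for step in range(5):
    theta, loss, m = danceopd_step(theta, rng)
    print(step, m, round(loss, 4), round(theta, 4))
```

The toy student has one shared parameter, so it cannot match both teachers; the sketch is meant only to show the control flow of routing, rollout, stop-gradient, and a single-query loss.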

Main results

Multi-capability synthesis under shared sources

The desired behavior is not a midpoint between specialists. A single student should strengthen the target capability while preserving the anchor capability under the same deployment model.

T2I + Edit

5.347 GEditBench / 0.849 GenEval

+8.1% over the best reproduced OPD baseline and +8.5% over the edit source on GEditBench.

Local + Global Edit

5.498 GEditBench / 0.848 GenEval

+16.1% over the best competing composition baseline and +7.9% over the local edit source.

T2I + Edit Composition results

Method	GEditBench Avg ↑	GenEval Overall ↑	Takeaway
Joint training	4.617	0.808	Mixed supervision dilutes edit capability.
Weight merge	0.344	0.836	Preserves T2I but collapses editing.
Off-policy distill.	4.528	0.818	Teacher states leave a train–inference mismatch.
DiffusionOPD	4.947	0.833	Improves editing but below DanceOPD.
Flow-OPD	4.854	0.814	OPD baseline still suffers capability interference.
DanceOPD	5.347	0.849	Best edit score and best GenEval in this block.

Local + Global Edit Composition results

Method	GEditBench Avg ↑	GenEval Overall ↑	Takeaway
Joint training	4.546	0.821	Conflict between preservation and transformation.
Weight merge	4.715	0.811	Static parameter interpolation remains a compromise.
Off-policy distill.	4.736	0.798	Target ability improves less and T2I drops.
DiffusionOPD	4.661	0.822	Below DanceOPD on both metrics.
Flow-OPD	4.679	0.827	Stable but not enough to fuse local/global behaviors.
DanceOPD	5.498	0.848	Best capability synthesis in the harder conflict setting.

Source models

Source	GEditBench Avg ↑	GenEval Overall ↑	Role
T2I	—	0.832	Anchor generation field.
Edit	4.930	0.711	General edit source.
Local Edit	5.095	0.793	Preservation-heavy source.
Global Edit	3.750	0.808	Transformation-heavy source.
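The relative gains quoted for the two composition settings can be reproduced from the GEditBench Avg values in the tables above. The sketch assumes that the "best reproduced OPD baseline" is DiffusionOPD (4.947) and that the "best competing composition baseline" is off-policy distillation (4.736), which are the highest baseline scores in their respective tables; `relative_gain` is a helper introduced here.

```python
def relative_gain(new, ref):
    """Relative improvement of `new` over `ref`, in percent."""
    return 100.0 * (new - ref) / ref

# T2I + Edit: DanceOPD vs. DiffusionOPD and vs. the Edit source.
print(round(relative_gain(5.347, 4.947), 1))  # 8.1
print(round(relative_gain(5.347, 4.930), 1))  # 8.5

# Local + Global Edit: DanceOPD vs. off-policy distillation and vs. the Local Edit source.
print(round(relative_gain(5.498, 4.736), 1))  # 16.1
print(round(relative_gain(5.498, 5.095), 1))  # 7.9
```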

Qualitative examples. The composed student supports diverse text-to-image and editing behaviors while retaining strong original generative capability.

Diagnostics and ablations

Why the design choices matter

The ablations show that failures do not stem simply from the choice of loss or the length of training. They trace back to query construction: ambiguous targets, off-policy states, and correlated dense trajectory samples.

Hard routing vs. soft fusion

5.751 hard-routed MSE vs. 4.994 soft-teacher MSE. Averaging all teachers erases capability identity.

Low-noise semantic query

At 2k steps, low-t reaches 5.751, above median-t 4.649 and high-t 4.813.

Single query beats dense queries

K=1 reaches 5.751; weighted K=4 drops to 5.330, and weighted K=16 drops to 5.127.
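One way to see why dense queries from a single rollout add little is the standard effective-sample-size identity for equally correlated samples. This is an illustration, not the paper's analysis: the equal pairwise correlation `rho` and its value 0.9 are assumptions, and `effective_num_queries` is a name introduced here.

```python
def effective_num_queries(k, rho):
    """Effective number of independent samples among k equally correlated ones.
    Follows from Var(mean) = sigma^2 * (1 + (k - 1) * rho) / k."""
    return k / (1.0 + (k - 1) * rho)

for k in (1, 4, 16):
    print(k, round(effective_num_queries(k, rho=0.9), 2))
```

Under this assumption, 16 states from one rollout carry barely more independent signal than one state. The identity only accounts for diminishing returns; the measured drop at K=4 and K=16 is an empirical result.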

Moderate rollout discretization is enough

At 2k steps, rollouts with 8, 16, 20, or 28 steps stay in a practical band; the 16-step setting gives 5.751 / 0.858.

Plain MSE is stable

Velocity MSE reaches 5.751, outperforming timestep weighting, KL weighting, DMD-style, SDS-style, and consistency variants.

CFG absorption composes

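The training-time guidance scale α and the inference-time guidance scale β compose approximately multiplicatively, so the effective scale is about αβ. The best measured composition reaches 5.833; the over-guided setting αβ=49 drops to 4.015.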
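The multiplicative behavior follows from the guidance formula in an idealized case. The sketch below assumes that the student's conditional branch equals the α-guided teacher field exactly and that its unconditional branch is unchanged; neither is guaranteed in practice, which is why the composition is only approximate. All names and numbers are illustrative.

```python
def guide(v_uncond, v_cond, scale):
    """Guided velocity: v_uncond + scale * (v_cond - v_uncond)."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]

# Toy teacher branch outputs at one state (made-up numbers).
v_uncond = [0.0, 1.0]
v_cond = [1.0, 3.0]
alpha, beta = 2.0, 3.0

# Idealized student after CFG absorption with scale alpha.
student_cond = guide(v_uncond, v_cond, alpha)
student_uncond = v_uncond  # assumption: unconditional branch unchanged

# Applying inference-time guidance beta on the student ...
composed = guide(student_uncond, student_cond, beta)
# ... matches guiding the teacher directly with scale alpha * beta.
direct = guide(v_uncond, v_cond, alpha * beta)

print(composed, direct)
assert composed == direct
```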

Field absorption / CFG / rollout. DanceOPD absorbs realism and CFG fields while rollout discretization remains stable.

Routing and dense-query diagnostics. Hard routing with single-query MSE is the strongest default.

Ablation trends. Low-t semantic queries and the strongest relevant initialization are most reliable.

Realism-field absorption. The student moves toward the realism teacher while preserving prompt content.

Qualitative gallery

T2I, edit, local/global fusion, and training progression

The gallery follows the manuscript organization: global edits, local/global edits, additional material and style edits, pure T2I preservation, same-object transformations, and local/global training progression.

Global edits (T2I and edit fusion). Strong style and scene transformations while preserving structure.

Local and global edits (T2I and edit fusion). Content preservation for local changes and stronger global transformations.

Additional edit cases. Balanced attribute changes across material, lighting, and style.

T2I preservation after fusion. Retains text-to-image quality while learning editing.

Same-object edits. Diverse object-level transformations with preserved generative capability.

Training progression (local and global edit fusion). The student progressively absorbs target edit capability while maintaining scene identity.

Citation

Paper available on arXiv

arXiv:2606.27377 is now available. Code will be added once it is released.

@misc{zhou2026danceopdonpolicygenerativefield,
      title={DanceOPD: On-Policy Generative Field Distillation}, 
      author={Wei Zhou and Xiongwei Zhu and Zelin Xu and Bo Dong and Lixue Gong and Yongyuan Liang and Meng Chu and Leigang Qu and Lingdong Kong and Wei Liu and Tat-Seng Chua},
      year={2026},
      eprint={2606.27377},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.27377}, 
}
