MM-Speech/WavAlign

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

Findings of ACL 2026
Adaptive post-training for end-to-end spoken dialogue models

[Figure: WavAlign method overview]

WavAlign is a modality-aware post-training recipe for end-to-end spoken dialogue models that improves semantic intelligence while preserving speech naturalness and expressiveness.

This repository accompanies our paper "WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training", accepted to Findings of ACL 2026.

Project page: speechrl.github.io

News

  • 2026-04: Initial code release for the WavAlign training pipeline.
  • 2026-04: Project homepage is live at https://speechrl.github.io/
  • Coming soon: Model checkpoints.
  • Coming soon: Training and evaluation datasets.

Overview

WavAlign is built around a simple principle from the paper:

  • Use preference optimization where the signal is most reliable: the semantic text channel.
  • Keep speech generation anchored with supervised targets to avoid acoustic drift.
  • Dynamically regulate the RL/SFT mixture using rollout reward quality and discriminability.
  • Support both online RL-style optimization and offline DPO-style optimization under the same mixed text-speech setup.
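
As a rough illustration of the third bullet, the sketch below shows one way a controller could turn rollout reward statistics into an RL/SFT mixing weight. It is not the repository's implementation: the class name `AdaptiveMixer`, the thresholds, and the gating rule itself (mean reward as "quality", reward standard deviation as "discriminability") are assumptions made for illustration.

```python
from statistics import mean, pstdev


class AdaptiveMixer:
    """Hypothetical controller: maps rollout rewards to an RL weight in [0, 1]."""

    def __init__(self, ema_decay=0.9, min_quality=0.3, min_spread=0.05):
        self.ema_decay = ema_decay      # smoothing factor for the EMA
        self.min_quality = min_quality  # assumed threshold on the mean reward
        self.min_spread = min_spread    # assumed threshold on the reward std
        self.rl_weight = 0.0            # EMA-smoothed RL share of the loss

    def update(self, rewards):
        """Update the smoothed RL weight from one group of rollout rewards."""
        quality = mean(rewards)   # are the rollouts usable at all?
        spread = pstdev(rewards)  # can the reward tell them apart?
        # The gate opens only when rollouts are both good enough and distinguishable.
        gate_open = quality >= self.min_quality and spread >= self.min_spread
        gate = 1.0 if gate_open else 0.0
        self.rl_weight = self.ema_decay * self.rl_weight + (1.0 - self.ema_decay) * gate
        return self.rl_weight

    def mix(self, rl_loss, sft_loss):
        """Convex combination of the two losses under the current weight."""
        return self.rl_weight * rl_loss + (1.0 - self.rl_weight) * sft_loss
```

Under this sketch, a batch whose rollouts all receive the same reward closes the gate, so the smoothed weight decays toward pure SFT instead of following an uninformative RL signal.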

This repository currently releases the core post-training code used for:

  • masked RL + SFT training for spoken dialogue models
  • adaptive RL/SFT mixing with EMA-smoothed rollout gating
  • text-token / speech-token masking controls
  • offline and online DPO training
  • DPO pair construction from scored multi-sample generations
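
To make the text-token / speech-token masking idea concrete, here is a minimal sketch in plain Python. It assumes speech tokens can be recognized by an id offset (`speech_token_start`); that convention and both function names are hypothetical, and the actual trainers may identify modalities differently.

```python
def split_loss_masks(token_ids, speech_token_start):
    """Return (text_mask, speech_mask) as lists of 0/1 flags.

    Assumption: ids >= speech_token_start are speech tokens. Real
    tokenizers may mark modalities in another way.
    """
    speech_mask = [1 if t >= speech_token_start else 0 for t in token_ids]
    text_mask = [1 - m for m in speech_mask]
    return text_mask, speech_mask


def masked_mean(per_token_values, mask):
    """Average the values where mask == 1; return 0.0 if nothing is selected."""
    selected = [v for v, m in zip(per_token_values, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0
```

With masks like these, a preference-style loss could be averaged over text positions only, while a supervised loss is averaged over speech positions.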

Code Release

WavAlign/
├── assets/                        # README assets
├── config/                        # DeepSpeed templates
├── dataset/                       # Generic JSONL/HF dataset loader
├── dpo/                           # DPO trainer
├── examples/                      # Minimal schema examples
├── scripts/                       # Launch scripts
├── trainer/                       # RL+SFT trainers
├── utils/                         # Reward model + DPO pair builder
├── train_vita_audio_rl_sft_masked.py
└── train_vita_audio_dpo.py

Installation

This code depends on a local checkout of the VITA-Audio codebase.

git clone <this-repo>
cd WavAlign
pip install -r requirements.txt
export VITA_AUDIO_ROOT=/path/to/VITA-Audio

For RL training with API-based reward scoring, also set:

export WAVALIGN_REWARD_API_KEY=...
export WAVALIGN_REWARD_API_BASE=...  # full chat/completions URL
export WAVALIGN_REWARD_MODEL=...

The reward client expects an OpenAI-compatible multimodal chat endpoint, and WAVALIGN_REWARD_API_BASE should point to the full request URL used by that service.
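
The sketch below shows how a client might read these three variables and assemble a chat-completions request without sending it. The payload follows the common OpenAI-compatible layout; the function name, the placeholder system prompt, and the text-only message are assumptions, and the repository's reward client (which also handles audio) may be structured differently.

```python
import json
import os
import urllib.request


def build_reward_request(question_text, response_text, env=os.environ):
    """Assemble (but do not send) a chat-completions request for scoring."""
    payload = {
        "model": env["WAVALIGN_REWARD_MODEL"],
        "messages": [
            # Placeholder prompt, not the one used in the paper.
            {"role": "system", "content": "Rate the response to the question."},
            {
                "role": "user",
                "content": f"Question: {question_text}\nResponse: {response_text}",
            },
        ],
    }
    return urllib.request.Request(
        env["WAVALIGN_REWARD_API_BASE"],  # full request URL, used as-is
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {env['WAVALIGN_REWARD_API_KEY']}",
        },
        method="POST",
    )
```

Because `WAVALIGN_REWARD_API_BASE` is passed through unchanged, no path such as `/chat/completions` is appended here, in line with the note that the variable holds the full request URL.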

Data Format

The training pipeline uses a simple JSONL schema. See examples/sample_rl_sft.jsonl and examples/sample_dpo.jsonl.

RL + SFT training sample:

{
  "messages": [
    {"role": "system", "content": "You are Luke, the voice AI assistant. You can speak and listen."},
    {"role": "user", "content": "...\n\n<|audio|>"}
  ],
  "audios": ["audio/example_question.wav"],
  "sft_target_text": "text target",
  "sft_target_audio": "audio/example_answer.wav",
  "task_type": "s2s",
  "question_text": "plain-text prompt for reward evaluation"
}
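
Before launching a run, it can help to sanity-check a JSONL file against this schema. The following is a minimal sketch: treating all six fields as required, and expecting one `<|audio|>` placeholder per entry in `audios`, are assumptions inferred from the sample above rather than rules enforced by the released loader.

```python
import json

# Assumed to be required; inferred from the example sample.
REQUIRED_FIELDS = (
    "messages",
    "audios",
    "sft_target_text",
    "sft_target_audio",
    "task_type",
    "question_text",
)


def validate_sample(sample):
    """Return a list of problems found in one RL + SFT sample (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in sample]
    # Assumed convention: one <|audio|> placeholder per entry in "audios".
    placeholders = sum(
        message.get("content", "").count("<|audio|>")
        for message in sample.get("messages", [])
    )
    if placeholders != len(sample.get("audios", [])):
        problems.append("number of <|audio|> placeholders does not match 'audios'")
    return problems


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

A typical use would be `[validate_sample(s) for s in load_jsonl("examples/sample_rl_sft.jsonl")]`, then inspecting any non-empty result.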

DPO training samples add two fields to the schema above:

{
  "rejected_text": "worse response",
  "rejected_audio": "audio/example_bad.wav"
}

Quick Start

Masked RL + SFT:

export VITA_AUDIO_ROOT=/path/to/VITA-Audio
export WAVALIGN_REWARD_API_KEY=...
bash scripts/train_rl_sft_masked.sh plus-vanilla

The default RL launcher enables the paper-style adaptive mixing controller. To fall back to fixed mixing, pass --adaptive_mixing False.

Offline DPO:

export VITA_AUDIO_ROOT=/path/to/VITA-Audio
bash scripts/train_dpo.sh plus-vanilla

Build DPO pairs from scored candidates:

python utils/dpo_pair_builder.py \
  --input_path scored_generations.json \
  --output_path dpo_pairs.jsonl \
  --chosen_source best_output \
  --rejected_source worst_output \
  --score_mode sum
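
Conceptually, selecting `best_output` and `worst_output` under `--score_mode sum` amounts to ranking candidates by their summed scores. The sketch below illustrates that idea only: the candidate layout, the score names, and the tie handling are hypothetical, since the format of `scored_generations.json` is not described here, and `utils/dpo_pair_builder.py` may behave differently.

```python
def build_pair(candidates, score_mode="sum"):
    """Pick the best and worst candidate by aggregated score.

    Each candidate is assumed to look like
    {"text": ..., "audio": ..., "scores": {"semantic": 4, "acoustic": 3}}.
    Returns None when fewer than two candidates exist or all totals tie,
    since no preference can be inferred in those cases.
    """
    if score_mode != "sum":
        raise ValueError(f"unsupported score_mode in this sketch: {score_mode}")
    if len(candidates) < 2:
        return None
    totals = [sum(c["scores"].values()) for c in candidates]
    best = max(range(len(candidates)), key=totals.__getitem__)
    worst = min(range(len(candidates)), key=totals.__getitem__)
    if totals[best] == totals[worst]:
        return None
    return {
        "sft_target_text": candidates[best]["text"],
        "sft_target_audio": candidates[best]["audio"],
        "rejected_text": candidates[worst]["text"],
        "rejected_audio": candidates[worst]["audio"],
    }
```

Skipping tied groups is a design choice of this sketch: a pair whose two sides score the same carries no preference signal.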

Notes

  • This repository focuses on the training recipe and trainer implementation.
  • Project page, paper metadata, and future artifact updates will be maintained at https://speechrl.github.io/
  • Checkpoints and datasets are managed separately from this repository.

Citation

If you find this work useful, please cite:

@misc{chen2026wavalignenhancingintelligenceexpressiveness,
      title={WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training}, 
      author={Yifu Chen and Shengpeng Ji and Qian Chen and Tianle Liang and Yangzhuo Li and Ziqing Wang and Wen Wang and Jingyu Lu and Haoxiao Wang and Xueyi Pu and Fan Zhuo and Zhou Zhao},
      year={2026},
      eprint={2604.14932},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.14932}, 
}

About

Official repository for "WavAlign" (Findings of ACL 2026), a post-training framework for spoken dialogue models.
