WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training
Findings of ACL 2026
Adaptive post-training for end-to-end spoken dialogue models
WavAlign is a modality-aware post-training recipe for end-to-end spoken dialogue models that improves semantic intelligence while preserving speech naturalness and expressiveness.
This repository accompanies our paper "WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training", accepted to Findings of ACL 2026.
Project page: speechrl.github.io
- 2026-04: Initial code release for the WavAlign training pipeline.
- 2026-04: Project homepage is live at https://speechrl.github.io/
- Coming soon: Model checkpoints.
- Coming soon: Training and evaluation datasets.
WavAlign is built around a simple principle from the paper:
- Use preference optimization where the signal is most reliable: the semantic text channel.
- Keep speech generation anchored with supervised targets to avoid acoustic drift.
- Dynamically regulate the RL/SFT mixture using rollout reward quality and discriminability.
- Support both online RL-style optimization and offline DPO-style optimization under the same mixed text-speech setup.
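The first two principles can be illustrated with a per-token masked loss. This is a minimal sketch under stated assumptions, not the repository's trainer: `hybrid_loss`, `masked_mean`, and all argument names are hypothetical, and the per-token losses are assumed to be computed elsewhere. It shows the preference (RL) term applied only at text-token positions, while speech tokens receive a supervised term only.

```python
from typing import Sequence


def masked_mean(values: Sequence[float], mask: Sequence[bool]) -> float:
    """Average `values` over positions where `mask` is True (0.0 if none)."""
    picked = [v for v, m in zip(values, mask) if m]
    return sum(picked) / len(picked) if picked else 0.0


def hybrid_loss(
    rl_token_loss: Sequence[float],   # hypothetical per-token RL/preference loss
    sft_token_loss: Sequence[float],  # hypothetical per-token cross-entropy vs. the SFT target
    is_text_token: Sequence[bool],    # True for text tokens, False for speech tokens
    rl_weight: float,                 # RL share on the text channel, in [0, 1]
) -> float:
    is_speech_token = [not t for t in is_text_token]

    # Text channel: blend the RL term with the supervised term.
    text_rl = masked_mean(rl_token_loss, is_text_token)
    text_sft = masked_mean(sft_token_loss, is_text_token)
    text_term = rl_weight * text_rl + (1.0 - rl_weight) * text_sft

    # Speech channel: supervised only, so RL never pushes on acoustic tokens.
    speech_term = masked_mean(sft_token_loss, is_speech_token)
    return text_term + speech_term
```

In this sketch the RL values at speech positions are ignored entirely, which is one simple way to keep the acoustic channel anchored to its supervised target. How the text and speech terms are weighted against each other is an assumption here (they are simply added).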
This repository currently releases the core post-training code used for:
- masked RL + SFT training for spoken dialogue models
- adaptive RL/SFT mixing with EMA-smoothed rollout gating
- text-token / speech-token masking controls
- offline and online DPO training
- DPO pair construction from scored multi-sample generations
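As an illustration of the last item, the sketch below builds one DPO pair from several scored candidates for the same prompt. It is an assumption-laden example rather than the logic of `utils/dpo_pair_builder.py`: the input layout (`candidates`, `scores`, `text`, `audio`) is hypothetical, the output is assumed to reuse `sft_target_*` for the chosen response, and the `min_margin` filter is an addition for illustration. The best/worst selection with summed scores mirrors the options chosen in the pair-builder command in this README.

```python
from typing import Optional


def total_score(candidate: dict) -> float:
    """Sum all per-dimension scores of one candidate (assumed 'sum' scoring)."""
    return sum(candidate["scores"].values())


def build_pair(record: dict, min_margin: float = 0.0) -> Optional[dict]:
    """Pick the best and worst candidate of a record as chosen/rejected.

    Returns None when there are fewer than two candidates or when the
    score gap is not larger than `min_margin` (hypothetical filter).
    """
    candidates = record["candidates"]
    if len(candidates) < 2:
        return None

    ranked = sorted(candidates, key=total_score)
    worst, best = ranked[0], ranked[-1]
    if total_score(best) - total_score(worst) <= min_margin:
        return None  # no usable preference signal

    return {
        "messages": record["messages"],
        "audios": record["audios"],
        "sft_target_text": best["text"],     # chosen response (assumed field reuse)
        "sft_target_audio": best["audio"],
        "rejected_text": worst["text"],
        "rejected_audio": worst["audio"],
    }
```

Dropping records whose candidates all score the same avoids training on pairs where "chosen" and "rejected" are arbitrary.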
WavAlign/
├── assets/ # README assets
├── config/ # DeepSpeed templates
├── dataset/ # Generic JSONL/HF dataset loader
├── dpo/ # DPO trainer
├── examples/ # Minimal schema examples
├── scripts/ # Launch scripts
├── trainer/ # RL+SFT trainers
├── utils/ # Reward model + DPO pair builder
├── train_vita_audio_rl_sft_masked.py
└── train_vita_audio_dpo.py
This code depends on a local checkout of the VITA-Audio codebase.
git clone <this-repo>
cd WavAlign
pip install -r requirements.txt
export VITA_AUDIO_ROOT=/path/to/VITA-Audio
For RL training with API-based reward scoring, also set:
export WAVALIGN_REWARD_API_KEY=...
export WAVALIGN_REWARD_API_BASE=... # full chat/completions URL
export WAVALIGN_REWARD_MODEL=...
The reward client expects an OpenAI-compatible multimodal chat endpoint, and WAVALIGN_REWARD_API_BASE should point to the full request URL used by that service.
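To make the role of the three variables concrete, here is a minimal sketch of how a client might assemble a request from them. It is not the repository's reward client: `build_reward_request` and the judge prompt are hypothetical, only a text payload is shown (the audio attachment a multimodal endpoint would need is left out because its format depends on the service), and the request is built but never sent.

```python
import json
import os
import urllib.request


def build_reward_request(question_text: str, answer_text: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions style scoring request."""
    api_key = os.environ["WAVALIGN_REWARD_API_KEY"]
    api_url = os.environ["WAVALIGN_REWARD_API_BASE"]  # full request URL, used as-is
    model = os.environ["WAVALIGN_REWARD_MODEL"]

    payload = {
        "model": model,
        "messages": [
            # Hypothetical judge instructions, for illustration only.
            {"role": "system", "content": "Score the answer from 1 to 10."},
            {"role": "user", "content": f"Question: {question_text}\nAnswer: {answer_text}"},
        ],
    }
    return urllib.request.Request(
        api_url,  # no path is appended, since the variable holds the full URL
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Note that nothing is appended to `WAVALIGN_REWARD_API_BASE` in this sketch, which is why it has to be the complete chat/completions URL rather than a base host.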
The training pipeline uses a simple JSONL schema. See examples/sample_rl_sft.jsonl and examples/sample_dpo.jsonl.
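Before launching a run, it can help to check each JSONL record against the fields used in the samples in this section. The validator below is a hedged sketch, not part of the repository: which fields are strictly required, and the rule that the number of `<|audio|>` placeholders should match the length of `audios`, are assumptions drawn from the sample records.

```python
import json

REQUIRED_FIELDS = ("messages", "audios", "sft_target_text", "sft_target_audio")
DPO_FIELDS = ("rejected_text", "rejected_audio")  # extra fields for DPO samples
AUDIO_PLACEHOLDER = "<|audio|>"


def validate_record(record: dict, dpo: bool = False) -> list:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    required = REQUIRED_FIELDS + (DPO_FIELDS if dpo else ())
    for name in required:
        if name not in record:
            problems.append(f"missing field: {name}")

    # Assumed rule: one placeholder in the messages per entry in `audios`.
    messages = record.get("messages", [])
    placeholders = sum(
        str(m.get("content", "")).count(AUDIO_PLACEHOLDER) for m in messages
    )
    n_audios = len(record.get("audios", []))
    if placeholders != n_audios:
        problems.append(f"{placeholders} audio placeholder(s) but {n_audios} audio file(s)")
    return problems


def validate_jsonl_lines(lines, dpo: bool = False) -> list:
    """Validate an iterable of JSONL lines; returns (line_number, problem) tuples."""
    report = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            report.append((line_no, f"invalid JSON: {err.msg}"))
            continue
        report.extend((line_no, p) for p in validate_record(record, dpo=dpo))
    return report
```

For a file on disk, `validate_jsonl_lines(open(path, encoding="utf-8"), dpo=True)` would report problems per line without loading any audio.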
RL + SFT training sample:
{
  "messages": [
    {"role": "system", "content": "You are Luke, the voice AI assistant. You can speak and listen."},
    {"role": "user", "content": "...\n\n<|audio|>"}
  ],
  "audios": ["audio/example_question.wav"],
  "sft_target_text": "text target",
  "sft_target_audio": "audio/example_answer.wav",
  "task_type": "s2s",
  "question_text": "plain-text prompt for reward evaluation"
}
DPO training adds:
{
  "rejected_text": "worse response",
  "rejected_audio": "audio/example_bad.wav"
}
Masked RL + SFT:
export VITA_AUDIO_ROOT=/path/to/VITA-Audio
export WAVALIGN_REWARD_API_KEY=...
bash scripts/train_rl_sft_masked.sh plus-vanilla
The default RL launcher enables the paper-style adaptive mixing controller. To fall back to fixed mixing, pass --adaptive_mixing False.
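To give a feel for the difference between adaptive and fixed mixing, the sketch below shows one possible EMA-smoothed gate driven by rollout reward quality (mean) and discriminability (spread). It is an illustrative assumption, not the controller shipped in `trainer/`: the class name, thresholds, decay value, and the hard 0/1 gate target are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class AdaptiveMixer:
    """Hypothetical controller for the RL share of a mixed RL/SFT loss."""
    ema_decay: float = 0.9     # assumed smoothing factor
    min_quality: float = 0.3   # assumed threshold on mean rollout reward
    min_spread: float = 0.05   # assumed threshold on reward std (discriminability)
    rl_weight: float = 0.5     # current smoothed RL weight, in [0, 1]

    def update(self, rollout_rewards) -> float:
        """Update the RL weight from one batch of rollout rewards."""
        if not rollout_rewards:
            return self.rl_weight  # nothing to learn from; keep the weight
        quality = mean(rollout_rewards)
        spread = pstdev(rollout_rewards)
        # Open the gate only when rollouts are good enough AND distinguishable.
        gate_open = quality >= self.min_quality and spread >= self.min_spread
        target = 1.0 if gate_open else 0.0
        # EMA smoothing keeps the weight from jumping between steps.
        self.rl_weight = self.ema_decay * self.rl_weight + (1.0 - self.ema_decay) * target
        return self.rl_weight

    def mix(self, rl_loss: float, sft_loss: float) -> float:
        """Combine the two losses with the current weight."""
        return self.rl_weight * rl_loss + (1.0 - self.rl_weight) * sft_loss
```

In this sketch, setting `ema_decay=1.0` freezes `rl_weight` at its initial value, which corresponds to fixed mixing. When all rollouts receive nearly the same reward, the spread falls below the threshold and the weight decays toward the supervised term.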
Offline DPO:
export VITA_AUDIO_ROOT=/path/to/VITA-Audio
bash scripts/train_dpo.sh plus-vanilla
Build DPO pairs from scored candidates:
python utils/dpo_pair_builder.py \
  --input_path scored_generations.json \
  --output_path dpo_pairs.jsonl \
  --chosen_source best_output \
  --rejected_source worst_output \
  --score_mode sum
- This repository focuses on the training recipe and trainer implementation.
- Project page, paper metadata, and future artifact updates will be maintained at https://speechrl.github.io/
- Checkpoints and datasets are managed separately from this repository.
If you find this work useful, please cite:
@misc{chen2026wavalignenhancingintelligenceexpressiveness,
title={WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training},
author={Yifu Chen and Shengpeng Ji and Qian Chen and Tianle Liang and Yangzhuo Li and Ziqing Wang and Wen Wang and Jingyu Lu and Haoxiao Wang and Xueyi Pu and Fan Zhuo and Zhou Zhao},
year={2026},
eprint={2604.14932},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2604.14932},
}