πŸ’πŸ“² Self-evolving customer service framework, SEAD, operates without any human-labeled data. It can be quickly launched just by changing the SOP and user profiles.

SEAD: Self-Evolving Agent for Service Dialogue

A co-evolutionary reinforcement learning framework for training dialogue agents that adapt to diverse user scenarios without requiring additional training data.


Figure: SEAD (Self-Evolving Agent for Service Dialogue) co-evolutionary training loop. The controller samples initial states (Phase 1), which initialize dialogues that produce trajectories (Phase 2), used to train the agent with rewards (Phase 3) and to compute completion rates (Phase 4); these feed back to adjust the sampling distribution, closing the co-evolutionary loop.
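The four phases can be sketched as a minimal Python loop. This is an illustrative sketch, not the repo's actual interface: `run_dialogue`, `train_step`, and the weighting rule are hypothetical stand-ins for the controller and trainer in the codebase.

```python
import random
from collections import defaultdict

def co_evolve(states, run_dialogue, train_step, epochs=3, batch=4):
    """Sketch of the SEAD loop: sample states, roll out dialogues,
    train on rewards, and reshape the sampling distribution from
    per-state completion rates."""
    # Start with a uniform sampling distribution over initial states.
    weights = {s: 1.0 for s in states}
    for _ in range(epochs):
        # Phase 1: controller samples initial states.
        sampled = random.choices(states, weights=[weights[s] for s in states], k=batch)
        # Phase 2: each state initializes a dialogue that yields a trajectory.
        trajectories = [run_dialogue(s) for s in sampled]
        # Phase 3: train the agent on the rewarded trajectories.
        train_step(trajectories)
        # Phase 4: completion rate per state; oversample states the agent still fails.
        outcomes = defaultdict(list)
        for s, t in zip(sampled, trajectories):
            outcomes[s].append(t["completed"])
        for s, flags in outcomes.items():
            cr = sum(flags) / len(flags)
            weights[s] = max(1.0 - cr, 0.05)  # floor so no state is starved
    return weights
```

Keeping a small floor on every weight prevents the controller from permanently abandoning states the agent has already mastered.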

✨If you like this project, please give it a starπŸŒŸβ€”it's the best encouragement for usπŸ₯Ί!✨


πŸ”₯ News

[2026-02-04] β€” We've open-sourced our full research stack!

  • πŸ“„ Paper (arXiv): arXiv:2602.03548
  • πŸ’» Code (GitHub): Complete training, inference, and evaluation pipelines
  • πŸ€— Model (Hugging Face): dayll/SEAD-14B
  • πŸ“Š Benchmark: Benchmark data and evaluation code are available

What's included:

  • βœ… End-to-end training, inference, and evaluation pipelines
  • βœ… Reproducible configs and scripts
  • βœ… Pretrained checkpoints (14B parameters)
  • βœ… Comprehensive evaluation suite
  • βœ… Clear documentation and examples


✨ Highlights

🎯 Zero Training Data Required: Our co-evolutionary framework eliminates the need for manually collected dialogue data

πŸš€ State-of-the-Art Performance: Achieves 52.0% completion rate, outperforming GPT-4o (44.2%) with only 14B parameters

πŸ’° Cost-Effective: Zero API cost at inference time, in contrast to commercial APIs (GPT-4o: Β₯727.28 per 1,000 samples)

πŸ”„ Self-Evolving: Automatic curriculum learning through adaptive state sampling

⚑ Efficient Training: Supports distributed training on 8 GPUs with vLLM acceleration


βš™οΈ Features

πŸŽ“ Training & Optimization

  • βœ… Co-evolutionary Framework: Adaptive curriculum learning via state controller
  • βœ… Distributed Training: Multi-GPU support with efficient parallelization
  • βœ… Checkpoint Management: Automatic saving and resuming

πŸ€– Model

SEAD is now available on the Hugging Face Hub:

| Model Name | HF Checkpoint | Size |
|---|---|---|
| SEAD-14B | πŸ€— dayll/SEAD-14B | 14B |

πŸ† Performance

Experimental Results

Main Results Comparison

| Method | Params | CR (%) | ATT ↓ | UPA | EI | TI | CI | Total Cost (CNY) |
|---|---|---|---|---|---|---|---|---|
| **Foundation Models** | | | | | | | | |
| Qwen2.5-14B-Instruct | 14B | 38.7 | 10.5Β±2.1 | 0.883Β±0.085 | 0.34Β±1.11 | 0.68Β±1.53 | 0.63Β±1.58 | 0.00 |
| Qwen2.5-32B-Instruct | 32B | 38.3 | 9.9Β±2.15 | 0.899Β±0.068 | -0.11Β±0.54 | 0.76Β±0.91 | **2.25Β±1.15** | 0.00 |
| Qwen2.5-72B-Instruct | 72B | 39.0 | **9.6Β±2.18** | 0.818Β±0.144 | 0.51Β±1.32 | 1.06Β±1.72 | 1.18Β±1.59 | 0.00 |
| **Large Model APIs** | | | | | | | | |
| GPT-4o | -- | 44.2 | 10.8Β±2.10 | 0.867Β±0.117 | 0.04Β±0.97 | 0.97Β±1.29 | 1.34Β±1.42 | 727.28 |
| DeepSeek-Chat | 671B | 31.6 | 11.3Β±2.10 | 0.863Β±0.084 | -0.20Β±0.97 | 0.27Β±1.24 | 0.76Β±1.50 | 87.36 |
| Qwen3-235B | 235B | 32.3 | 10.4Β±2.50 | 0.765Β±0.170 | -0.24Β±0.83 | 0.80Β±1.14 | 1.54Β±1.50 | 69.36 |
| LongCat-Flash | 560B | 42.2 | 10.0Β±2.31 | **0.925Β±0.079** | 0.28Β±1.15 | 1.33Β±1.57 | 1.56Β±1.46 | 23.08 |
| **SEAD (Ours)** | 14B | **52.0** | **9.6Β±2.09** | 0.912Β±0.071 | **0.63Β±1.12** | **1.57Β±1.51** | 1.55Β±1.39 | 0.00 |

Metrics:

  • Params: Model parameters (B=billion, "--" indicates undisclosed or not applicable)
  • CR: Completion Rate (%)
  • ATT: Average Turns to Target (lower is better ↓)
  • UPA: User Portrait Accuracy
  • EI: Emotion Improvement
  • TI: Trust Improvement
  • CI: Cooperation Improvement
  • Total Cost: Total inference cost for 1000 multi-turn samples (CNY)

Note: Bold indicates the best result in each column. Standard deviations are shown where available.
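As an illustration of the two headline metrics, CR and ATT can be computed from a list of per-dialogue outcomes. The field names below (`completed`, `turns`) are hypothetical; see the evaluation scripts for the actual record format, and note that ATT is computed here over completed dialogues only, which may differ from the paper's exact definition.

```python
from statistics import mean, stdev

def summarize(dialogues):
    """Completion Rate (CR, %) and Average Turns to Target (ATT) over a
    test set. Assumes at least one dialogue completed its task."""
    completed = [d for d in dialogues if d["completed"]]
    cr = 100.0 * len(completed) / len(dialogues)
    turns = [d["turns"] for d in completed]  # ATT over completed dialogues
    att_std = stdev(turns) if len(turns) > 1 else 0.0
    return {"CR": cr, "ATT_mean": mean(turns), "ATT_std": att_std}
```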

Dynamic Training Results

Figure: training dynamics. As training progresses, the model's metrics improve steadily, highlighting the effectiveness of RL. The hard business metric, Task Completion, sees a significant boost, showing that the model has learned better strategies through free exploration. The increase in User Profile Accuracy demonstrates that the model understands users better, while the steady rise in mean Trust Variation indicates that the model gains user trust more easily through conversation.

⬇️ Installation

Environment

conda create -n SEAD python=3.10
conda activate SEAD
# install torch (or skip this step and let vllm install the correct version for you)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# install vllm
pip3 install vllm==0.6.3  # 0.5.4, 0.4.2, and 0.3.1 are also supported

# install verl (this repo) in editable mode
pip install -e .

# flash attention 2
pip3 install flash-attn --no-build-isolation
pip install wandb

Alternatively, you can configure the environment according to requirements.txt:

pip install -r requirements.txt

The User Role-play Model operates within an isolated environment.

conda create -n vllm python=3.10
conda activate vllm
pip install -r requirements_vllm.txt

πŸš€ Quick start

To modify prompts (such as user profiles and SOP), edit the files in: ./verl/trainer/config/format_prompt/.

Common user behaviors can be modified by editing ./assets/client_action.jsonl. These behaviors are randomly sampled and incorporated into user prompts to ensure dialogue diversity.
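The sampling step described above can be sketched as follows. It assumes each line of client_action.jsonl holds a JSON object describing one behavior; the `action` field name and the `build_user_prompt` helper are hypothetical illustrations, not the repo's actual code.

```python
import json
import random

def sample_behaviors(jsonl_path, k=2):
    """Read the behavior library and draw k distinct behaviors to splice
    into a simulated user's prompt."""
    with open(jsonl_path, encoding="utf-8") as f:
        behaviors = [json.loads(line) for line in f if line.strip()]
    return random.sample(behaviors, k)

def build_user_prompt(profile, behaviors):
    # Append the sampled behaviors to the role-play profile so that
    # simulated users vary from dialogue to dialogue.
    lines = [profile, "Behave according to these tendencies:"]
    lines += [f"- {b['action']}" for b in behaviors]
    return "\n".join(lines)
```

Sampling without replacement keeps each simulated user's behavior set internally consistent.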

Training

Our model requires no additional training data. Simply load the base model to start training:

Run RL training on 8 GPUs:

conda activate SEAD
bash ./scripts/main.sh 

Training Configuration:

Edit ./scripts/main.sh to customize:

  • Base model path

Edit ./scripts/train_chatbot.sh to customize:

  • Batch size and learning rate
  • Checkpoint save frequency

To visualize dynamic curves locally, run the following command:

python for_evaluation/metrics_vis.py

The generated plots will be saved in ./outputs/evaluation/report.

Evaluation

Test any local model or your custom-trained model:

# Create Evaluation Set
python utils/create_prompt_data.py \
    --train_samples 0 \
    --test_samples 1000 \
    --behavior_library ./assets/client_action.jsonl \
    --out_dir ./outputs/evaluation/test_set/ \
    --temp_dir ./outputs/evaluation/test_set/user_param/
# Run the evaluation following the instructions in the log
bash ./for_evaluation/vllm_test_suite.sh

Modify ./for_evaluation/vllm_test_suite.sh to set:

  • Model checkpoint path

πŸ™ Acknowledge

The SEAD framework draws inspiration from pioneering projects such as Search-R1, and is built upon veRL and RAGEN.
We sincerely thank the teams behind these projects for their invaluable contributions to open-source research and development.

🏷️ Citation

@article{SEADv1,
  title={SEAD: Self-Evolving Agent for Multi-Turn Service Dialogue},
  author={Yuqin Dai and Ning Gao and Wei Zhang and Jie Wang and Zichen Luo and Jinpeng Wang and Yujie Wang and Ruiyuan Wu and Chaozheng Wang},
  journal={arXiv preprint arXiv:2602.03548},
  year={2026}
}