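# Generating High-Quality Chinese Speech with VibeVoice TTS-1.5B: A Complete Agent Invocation Guide

> Microsoft's open-source TTS with Chinese male and female voices, RTF around 0.5x; an agent generates and plays the audio over SSH in one pass

VibeVoice TTS-1.5B is a high-quality open-source speech synthesis model from Microsoft with Chinese support. It offers male and female voices, and on a single GPU it reaches a real-time factor (RTF) of about 0.5x, meaning audio is generated roughly twice as fast as it plays back. This guide shows how an AI agent can call it over SSH to generate speech.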

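## Prerequisites

- A GPU server with VibeVoice TTS-1.5B installed (community inference code: `peterw-github/VibeVoice`)
- Model path: `/mnt/data/models/VibeVoice/TTS-1.5B`
- Inference code path: `~/ai-tools/VibeVoice-TTS`
- Large-model services such as vLLM compete with TTS for GPU memory and cannot run alongside it; stop them before switching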
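
Before starting, it can help to confirm that the paths listed above actually exist on the server. The following is a minimal sketch; `check_paths` is a hypothetical helper name, not part of VibeVoice, and the example paths are the ones from this list.

```shell
# Hypothetical preflight helper: report which of the given paths are missing.
# Returns 0 if every path exists, 1 otherwise.
check_paths() {
  missing=0
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      echo "missing: $p" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example, run on the GPU server:
# check_paths /mnt/data/models/VibeVoice/TTS-1.5B "$HOME/ai-tools/VibeVoice-TTS"
```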

## Step 1: Stop vLLM to Free GPU Memory

```bash
systemctl --user stop vllm-qwen35
```

> If you manage the LLM service some other way, substitute the corresponding stop command.
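
After stopping the service, an agent may want to confirm that GPU memory really has been released before loading the TTS model. The sketch below queries free memory with `nvidia-smi`; the function name `gpu_has_free_mib`, the `NVIDIA_SMI` override variable, and the 8192 MiB threshold in the example are assumptions for illustration (the threshold is taken from the upper end of the 6-8 GB figure reported later in this guide).

```shell
# Succeed if the given GPU reports at least NEED MiB of free memory.
# NVIDIA_SMI is a hypothetical override hook; it defaults to the real tool.
gpu_has_free_mib() {
  need=$1
  gpu=${2:-0}
  free=$("${NVIDIA_SMI:-nvidia-smi}" --query-gpu=memory.free \
           --format=csv,noheader,nounits --id="$gpu" | head -n 1 | tr -d ' ')
  [ "$free" -ge "$need" ]
}

# Example, run on the GPU server after stopping vLLM:
# gpu_has_free_mib 8192 || echo "GPU 0 is still busy" >&2
```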

## Step 2: Prepare the Input File

The input file must use the `Speaker N:` format, with one speaker segment per line:

```
Speaker 1: 春天来了，万物复苏。窗外的樱花开得正好，粉白相间，随风轻颤。
```

A multi-speaker dialogue looks like this:

```
Speaker 1: 你今天感觉怎么样？
Speaker 2: 还不错，谢谢关心。
Speaker 1: 那就好，我们出去走走吧。
```
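
Because a missing prefix makes inference fail, an agent can check the file before spending time on a run. This is a minimal sketch of such a check; `validate_script` is a hypothetical helper, and treating blank lines as acceptable is an assumption that may be stricter or looser than what the inference script itself does.

```shell
# Succeed only if every non-empty line starts with "Speaker <number>:"
# and the file contains at least one such line.
validate_script() {
  file=$1
  # Collect non-empty lines that lack the required prefix.
  bad=$(grep -v -E '^[[:space:]]*$' "$file" | grep -v -E '^Speaker [0-9]+:' || true)
  if [ -n "$bad" ]; then
    printf 'invalid lines:\n%s\n' "$bad" >&2
    return 1
  fi
  grep -q -E '^Speaker [0-9]+:' "$file"
}

# Example:
# validate_script /tmp/input.txt || echo "fix the input file first" >&2
```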

## Step 3: Run Inference

```bash
cd ~/ai-tools/VibeVoice-TTS
source venv/bin/activate

python demo/inference_from_file.py \
  --model_path /mnt/data/models/VibeVoice/TTS-1.5B \
  --txt_path /tmp/input.txt \
  --speaker_names zh-Bowen_man \
  --output_dir /tmp/tts-out
```

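### Available Chinese Voices

| Parameter value | Description |
|-----------------|-------------|
| `zh-Bowen_man` | Chinese male voice |
| `zh-Xinran_woman` | Chinese female voice |

For multiple speakers, pass the voices in speaker order (the first name for `Speaker 1`, the second for `Speaker 2`):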

```bash
--speaker_names zh-Bowen_man zh-Xinran_woman
```
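
Since voices are matched to speakers by position, it is worth checking that the number of voice names equals the number of distinct speakers in the input file. A minimal sketch, where `count_speakers` is a hypothetical helper:

```shell
# Print the number of distinct "Speaker N:" labels used in a script file.
count_speakers() {
  grep -o -E '^Speaker [0-9]+:' "$1" | sort -u | wc -l | tr -d ' '
}

# Example: compare against the number of voices you intend to pass.
# set -- zh-Bowen_man zh-Xinran_woman
# [ "$(count_speakers /tmp/input.txt)" -eq "$#" ] || echo "voice count mismatch" >&2
```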

## Step 4: Retrieve the Audio File

```bash
scp user@server:/tmp/tts-out/input_generated.wav ./output.wav
```
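
In the command above, the input `input.txt` yields `input_generated.wav`. Assuming that `<name>_generated.wav` pattern holds for other input file names (this guide only shows the one case), the remote path can be derived rather than hard-coded. `expected_wav` is a hypothetical helper:

```shell
# Print the output path expected for a given input file and output directory,
# assuming the "<input name without .txt>_generated.wav" naming seen above.
expected_wav() {
  txt=$1
  out_dir=$2
  printf '%s/%s_generated.wav\n' "$out_dir" "$(basename "$txt" .txt)"
}

# Example:
# scp "user@server:$(expected_wav /tmp/input.txt /tmp/tts-out)" ./output.wav
```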

## Step 5: Restart vLLM

```bash
systemctl --user start vllm-qwen35
```

## Complete Agent Invocation Example

The following is a complete flow that a Claude Code agent can run directly:

```bash
# 1. Stop vLLM
sshpass -p "$SSH_PASS" ssh user@192.168.x.x "systemctl --user stop vllm-qwen35"

# 2. Write the input text
sshpass -p "$SSH_PASS" ssh user@192.168.x.x \
  "echo 'Speaker 1: 你的文本内容' > /tmp/input.txt"

# 3. Run inference
sshpass -p "$SSH_PASS" ssh user@192.168.x.x \
  "cd ~/ai-tools/VibeVoice-TTS && source venv/bin/activate && \
   python demo/inference_from_file.py \
     --model_path /mnt/data/models/VibeVoice/TTS-1.5B \
     --txt_path /tmp/input.txt \
     --speaker_names zh-Bowen_man \
     --output_dir /tmp/tts-out 2>&1 | tail -5"

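# 4. Copy the result back (input.txt produces input_generated.wav).
#    Naming the file avoids an unquoted *.wav glob, which zsh rejects locally
#    and which breaks when the directory holds more than one .wav file.
sshpass -p "$SSH_PASS" scp user@192.168.x.x:/tmp/tts-out/input_generated.wav /tmp/output.wav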

# 5. Play the audio (macOS)
afplay /tmp/output.wav

# 6. Restart vLLM
sshpass -p "$SSH_PASS" ssh user@192.168.x.x "systemctl --user start vllm-qwen35"
```
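
If the inference step fails partway through, a linear sequence like the one above can leave vLLM stopped. The sketch below is one way to guard against that: it runs a remote command between the stop and start calls and restarts vLLM regardless of the command's exit status. `with_vllm_stopped` is a hypothetical helper; it uses plain `ssh` and assumes authentication is already handled (key-based login, or substituting the `sshpass -p "$SSH_PASS" ssh` prefix used above). It does not cover interruption by signals.

```shell
# Run a remote command with vLLM stopped, then always restart vLLM.
# Returns the exit status of the remote command.
with_vllm_stopped() {
  host=$1
  shift
  ssh "$host" "systemctl --user stop vllm-qwen35" || return 1
  status=0
  ssh "$host" "$@" || status=$?
  ssh "$host" "systemctl --user start vllm-qwen35" \
    || echo "warning: failed to restart vLLM on $host" >&2
  return "$status"
}

# Example:
# with_vllm_stopped user@192.168.x.x \
#   "cd ~/ai-tools/VibeVoice-TTS && source venv/bin/activate && \
#    python demo/inference_from_file.py \
#      --model_path /mnt/data/models/VibeVoice/TTS-1.5B \
#      --txt_path /tmp/input.txt \
#      --speaker_names zh-Bowen_man \
#      --output_dir /tmp/tts-out"
```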

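## Measured Performance

| Metric | Value |
|--------|-------|
| Model size | 1.5B parameters |
| GPU memory | Single GPU (about 6–8 GB) |
| RTF | ~0.49–0.54x |
| Time to generate 14 s of audio | About 7.7 s |
| Audio quality | Noticeably better than the macOS `say` command |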
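
RTF is simply generation time divided by audio duration, so the figures in the table can be cross-checked by hand. A minimal sketch, with `rtf` as a hypothetical helper name: for the 14-second sample, 7.7 / 14 gives 0.55, just above the upper end of the quoted range, presumably because the timings are rounded.

```shell
# Print the real-time factor: generation seconds divided by audio seconds.
rtf() {
  awk -v gen="$1" -v dur="$2" 'BEGIN { printf "%.2f\n", gen / dur }'
}

# Example with the numbers from the table:
# rtf 7.7 14    # prints 0.55
```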

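## Notes

- TTS-1.5B has a device-map bug in multi-GPU inference (the embedding ends up split across GPUs), so it can only run on a single GPU
- Input text must carry a `Speaker 1:` prefix; otherwise inference fails with `No valid speaker scripts found`
- When vLLM is managed by systemd, `docker stop` has no effect (the container is restarted within 5 seconds); you must use `systemctl --user stop`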
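
On a multi-GPU server, one common way to keep a process on a single device is to restrict what CUDA can see via the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below wraps that in a hypothetical helper, `run_on_single_gpu`; whether this fully sidesteps the device-map bug described above is an assumption and has not been verified here.

```shell
# Run a command with only one GPU visible to CUDA.
# Usage: run_on_single_gpu <gpu-index> <command> [args...]
run_on_single_gpu() {
  gpu=$1
  shift
  CUDA_VISIBLE_DEVICES="$gpu" "$@"
}

# Example, run on the GPU server inside the activated venv:
# run_on_single_gpu 0 python demo/inference_from_file.py \
#   --model_path /mnt/data/models/VibeVoice/TTS-1.5B \
#   --txt_path /tmp/input.txt \
#   --speaker_names zh-Bowen_man \
#   --output_dir /tmp/tts-out
```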

---

**Category**: Tutorial
**Tags**: VibeVoice · user · tmp
**Author**: Xiao.Xi
**Link**: https://octohz.com/p/1599