OpenBMB has open-sourced a 2B tokenizer-free TTS model supporting 30 languages, Voice Design, and voice cloning. This post covers full CLI and FastAPI HTTP-server usage, ready for Agents to call.
Environment
Python 3.10+ and a CUDA GPU. The 2B model's FP16 weights are about 4 GB; runtime VRAM usage is around 8 GB, so a single RTX 3090 is enough.
Installation
mkdir ~/ai-tools/VoxCPM && cd ~/ai-tools/VoxCPM
python3 -m venv venv && source venv/bin/activate
pip install voxcpm soundfile fastapi uvicorn
Download the model (faster via hf-mirror)
export HF_ENDPOINT=https://hf-mirror.com
hf download openbmb/VoxCPM2 --local-dir /path/to/VoxCPM2
# The model is about 4.7 GB (model.safetensors 4.3 GB + audiovae.pth 360 MB)
CLI usage
Mode 1: Text-to-Speech (standard synthesis)
voxcpm design \
--model-path /path/to/VoxCPM2 \
--local-files-only \
--text "你好,这是语音合成测试" \
--output out.wav
Mode 2: Voice Design (describe the voice style in natural language)
voxcpm design \
--model-path /path/to/VoxCPM2 \
--local-files-only \
--text "Hello, welcome to VoxCPM2!" \
--control "young female, gentle and warm, slight British accent" \
--output out.wav
--control accepts descriptions of gender, age, timbre, emotion, speaking rate, and accent, in either Chinese or English.
Mode 3: Voice Cloning (clone the voice from a reference audio clip)
voxcpm clone \
--model-path /path/to/VoxCPM2 \
--local-files-only \
--text "想要用克隆声音合成的新文字" \
--reference-audio ref.wav \
--output out.wav
Batch synthesis
# texts.txt: one sentence per line
voxcpm batch \
--model-path /path/to/VoxCPM2 \
--local-files-only \
--input texts.txt \
--output-dir ./outputs \
--control "成熟男声,播音腔"
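texts.txt needs one sentence per line. A minimal sketch to generate it from a longer passage; the regex split on Chinese/Western sentence-ending punctuation is my own heuristic, not part of the voxcpm CLI:

```python
import re

def to_lines(passage: str) -> list[str]:
    """Split a passage into sentences, one per line for voxcpm batch."""
    # Split after Chinese or Western sentence-ending punctuation (heuristic)
    parts = re.split(r"(?<=[。!?.!?])\s*", passage)
    return [p.strip() for p in parts if p.strip()]

sentences = to_lines("第一句。第二句!Third sentence. Fourth?")
with open("texts.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))
```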
HTTP API server
Well suited to Agent use. Save the following as voxcpm_server.py:
import argparse, base64, io, os, tempfile
import soundfile as sf
import uvicorn
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from voxcpm import VoxCPM

app = FastAPI(title="VoxCPM2 TTS Server")
model = None

class TTSRequest(BaseModel):
    text: str
    control: str = ""              # Voice Design description; empty = default voice
    cfg_value: float = 2.0
    inference_timesteps: int = 10
    reference_audio_b64: str = ""  # base64-encoded reference audio, for Clone mode
    prompt_text: str = ""          # transcript of the reference audio

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/v1/tts")
def tts(req: TTSRequest):
    if model is None:
        raise HTTPException(503, "Model not loaded")
    kw = dict(text=req.text, cfg_value=req.cfg_value,
              inference_timesteps=req.inference_timesteps)
    if req.control:
        kw["control"] = req.control
    ref_path = None
    if req.reference_audio_b64:
        audio_bytes = base64.b64decode(req.reference_audio_b64)
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            f.write(audio_bytes)
            ref_path = f.name
        kw["reference_audio"] = ref_path
    if req.prompt_text:
        kw["prompt_text"] = req.prompt_text
    try:
        wav = model.generate(**kw)
    finally:
        if ref_path:  # remove the temp file even if generation fails
            os.unlink(ref_path)
    buf = io.BytesIO()
    sf.write(buf, wav, model.tts_model.sample_rate, format="WAV")
    buf.seek(0)
    return StreamingResponse(buf, media_type="audio/wav")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", default="/path/to/VoxCPM2")
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--port", type=int, default=8304)
    args = parser.parse_args()
    model = VoxCPM.from_pretrained(args.model_path)
    uvicorn.run(app, host=args.host, port=args.port)
Start the server:
python voxcpm_server.py --model_path /path/to/VoxCPM2 --port 8304
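For long-running Agent setups you may want the server supervised by systemd so it restarts on failure. A sketch of a unit file, assuming the venv and paths from the installation steps above (adjust both to your layout):

```ini
[Unit]
Description=VoxCPM2 TTS Server
After=network.target

[Service]
WorkingDirectory=/home/USER/ai-tools/VoxCPM
ExecStart=/home/USER/ai-tools/VoxCPM/venv/bin/python voxcpm_server.py --model_path /path/to/VoxCPM2 --port 8304
Restart=on-failure

[Install]
WantedBy=multi-user.target
```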
API call examples
# Text-to-Speech
curl -X POST http://HOST:8304/v1/tts \
-H 'Content-Type: application/json' \
-d '{"text":"你好世界"}' \
--output out.wav
# Voice Design
curl -X POST http://HOST:8304/v1/tts \
-H 'Content-Type: application/json' \
-d '{"text":"Hello!","control":"energetic young male, American accent"}' \
--output out.wav
# Voice Cloning (base64-encode the reference audio first;
# -w 0 disables GNU base64 line wrapping, which would break the JSON.
# On macOS use: base64 -i ref.wav)
REF=$(base64 -w 0 ref.wav)
curl -X POST http://HOST:8304/v1/tts \
-H 'Content-Type: application/json' \
-d "{\"text\":\"克隆声音说这句话\",\"reference_audio_b64\":\"$REF\"}" \
--output out.wav
Calling from Python:
import requests

resp = requests.post("http://HOST:8304/v1/tts", json={
    "text": "你好,这是 Agent 调用 VoxCPM2 的测试",
    "control": "温柔女声,语速适中",
    "inference_timesteps": 10,
})
with open("out.wav", "wb") as f:
    f.write(resp.content)
Performance reference (single RTX 3090 Ti)
| Item | Value |
|---|---|
| First load (incl. JIT warm-up) | ~60 s |
| Subsequent loads | ~5 s |
| RTF | ~0.3 (about 3x faster than real time) |
| Output sample rate | 48 kHz |
| VRAM usage | ~8 GB (FP16) |
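RTF (real-time factor) 0.3 means roughly 0.3 seconds of compute per second of generated audio. A quick sanity calculation:

```python
def synthesis_time(audio_seconds: float, rtf: float = 0.3) -> float:
    """Estimated wall-clock seconds to synthesize the given audio length."""
    return audio_seconds * rtf

# A 60-second clip at RTF 0.3 takes roughly 18 seconds to generate
print(synthesis_time(60))
```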
Notes
- Mutually exclusive with other large-VRAM workloads (e.g. vLLM); free the GPU before starting
- --local-files-only prevents network fetches at runtime; required in offline environments
- Higher inference_timesteps gives better quality at lower speed; 10-15 is a good speed/quality trade-off
