VoxCPM2 Deployment and Usage Guide: 30-Language TTS, Voice Design, and Voice Cloning

OpenBMB's open-source 2B-parameter tokenizer-free TTS model supports 30 languages, Voice Design, and voice cloning. This guide covers the full CLI and a FastAPI HTTP service, ready for direct Agent calls.

Environment

Python 3.10+ and a CUDA GPU. The 2B model's FP16 weights are about 4 GB (runtime VRAM peaks around 8 GB, see the performance table below); a single RTX 3090 is enough.


Installation

mkdir ~/ai-tools/VoxCPM && cd ~/ai-tools/VoxCPM
python3 -m venv venv && source venv/bin/activate
pip install voxcpm soundfile fastapi uvicorn

Download the model (accelerated via hf-mirror)

export HF_ENDPOINT=https://hf-mirror.com
hf download openbmb/VoxCPM2 --local-dir /path/to/VoxCPM2
# Model is ~4.7 GB total (model.safetensors 4.3 GB + audiovae.pth 360 MB)

CLI Usage

Mode 1: Text-to-Speech (standard synthesis)

voxcpm design \
  --model-path /path/to/VoxCPM2 \
  --local-files-only \
  --text "你好,这是语音合成测试" \
  --output out.wav

Mode 2: Voice Design (describe the voice style in natural language)

voxcpm design \
  --model-path /path/to/VoxCPM2 \
  --local-files-only \
  --text "Hello, welcome to VoxCPM2!" \
  --control "young female, gentle and warm, slight British accent" \
  --output out.wav

--control accepts descriptions of gender, age, timbre, emotion, speaking rate, and accent, in either Chinese or English.

Mode 3: Voice Cloning (clone the voice from a reference audio clip)

voxcpm clone \
  --model-path /path/to/VoxCPM2 \
  --local-files-only \
  --text "New text to be spoken in the cloned voice" \
  --reference-audio ref.wav \
  --output out.wav

Batch synthesis

# texts.txt: one sentence per line
voxcpm batch \
  --model-path /path/to/VoxCPM2 \
  --local-files-only \
  --input texts.txt \
  --output-dir ./outputs \
  --control "成熟男声,播音腔"
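The batch input file is plain text with one sentence per line; it can be generated from any list of strings. A minimal sketch (file name matches the example above):

```python
# Write one sentence per line for `voxcpm batch --input texts.txt`.
# UTF-8 encoding is required for non-ASCII text.
sentences = [
    "第一句话。",
    "第二句话。",
    "Third sentence in English.",
]
with open("texts.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences) + "\n")
```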

HTTP API Service

Well suited for Agent calls. Save the following code as voxcpm_server.py:

import argparse, io, os, tempfile, base64
import soundfile as sf
import uvicorn
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from voxcpm import VoxCPM

app = FastAPI(title="VoxCPM2 TTS Server")
model = None

class TTSRequest(BaseModel):
    text: str
    control: str = ""           # Voice Design description; leave empty for the default voice
    cfg_value: float = 2.0
    inference_timesteps: int = 10
    reference_audio_b64: str = ""  # base64-encoded reference audio, used for Clone mode
    prompt_text: str = ""

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/v1/tts")
def tts(req: TTSRequest):
    if model is None:
        raise HTTPException(503, "Model not loaded")
    kw = dict(text=req.text, cfg_value=req.cfg_value,
              inference_timesteps=req.inference_timesteps)
    if req.control:
        kw["control"] = req.control
    ref_path = None
    if req.reference_audio_b64:
        audio_bytes = base64.b64decode(req.reference_audio_b64)
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            f.write(audio_bytes)
            ref_path = f.name
        kw["reference_audio"] = ref_path
        if req.prompt_text:
            kw["prompt_text"] = req.prompt_text
    try:
        wav = model.generate(**kw)
    finally:
        # Clean up the temp file even if generation raises
        if ref_path:
            os.unlink(ref_path)
    buf = io.BytesIO()
    sf.write(buf, wav, model.tts_model.sample_rate, format="WAV")
    buf.seek(0)
    return StreamingResponse(buf, media_type="audio/wav")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", default="/path/to/VoxCPM2")
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--port", type=int, default=8304)
    args = parser.parse_args()
    model = VoxCPM.from_pretrained(args.model_path)
    uvicorn.run(app, host=args.host, port=args.port)

Start the server:

python voxcpm_server.py --model_path /path/to/VoxCPM2 --port 8304

API call examples

# Text-to-Speech
curl -X POST http://HOST:8304/v1/tts \
  -H 'Content-Type: application/json' \
  -d '{"text":"你好世界"}' \
  --output out.wav

# Voice Design
curl -X POST http://HOST:8304/v1/tts \
  -H 'Content-Type: application/json' \
  -d '{"text":"Hello!","control":"energetic young male, American accent"}' \
  --output out.wav

# Voice Cloning (base64-encode the reference audio first)
REF=$(base64 -w 0 ref.wav)   # -w 0: no line wrapping (GNU coreutils); on macOS use `base64 -i ref.wav`
curl -X POST http://HOST:8304/v1/tts \
  -H 'Content-Type: application/json' \
  -d "{\"text\":\"克隆声音说这句话\",\"reference_audio_b64\":\"$REF\"}" \
  --output out.wav

Calling from Python:

import requests

resp = requests.post("http://HOST:8304/v1/tts", json={
    "text": "你好,这是 Agent 调用 VoxCPM2 的测试",
    "control": "温柔女声,语速适中",
    "inference_timesteps": 10,
})
with open("out.wav", "wb") as f:
    f.write(resp.content)
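Clone mode from Python just base64-encodes the reference audio into the JSON body. A minimal helper sketch (build_clone_payload is an illustrative name; the field names match the server above):

```python
import base64

def build_clone_payload(text, ref_audio_path, prompt_text=""):
    """Build the JSON body for /v1/tts in Clone mode."""
    with open(ref_audio_path, "rb") as f:
        ref_b64 = base64.b64encode(f.read()).decode("ascii")
    payload = {"text": text, "reference_audio_b64": ref_b64}
    if prompt_text:
        # Transcript of the reference audio, if known
        payload["prompt_text"] = prompt_text
    return payload

# Usage:
# resp = requests.post("http://HOST:8304/v1/tts",
#                      json=build_clone_payload("克隆声音说这句话", "ref.wav"))
```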

Performance reference (single RTX 3090 Ti)

| Item | Value |
| --- | --- |
| First load (incl. JIT warmup) | ~60 s |
| Subsequent loads | ~5 s |
| RTF | ~0.3 (about 3x faster than real time) |
| Output sample rate | 48 kHz |
| VRAM usage | ~8 GB (FP16) |
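RTF (real-time factor) is synthesis time divided by the duration of the generated audio, so RTF ≈ 0.3 means roughly 3 s of compute per 10 s of audio:

```python
# RTF = synthesis_time / audio_duration
rtf = 0.3
audio_seconds = 10.0
synthesis_seconds = rtf * audio_seconds  # ≈ 3 s of compute for 10 s of audio
```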

Notes

  • Mutually exclusive with other large-VRAM workloads (e.g. vLLM); free the GPU before starting
  • --local-files-only prevents network fetches at runtime; mandatory in offline environments
  • Higher inference_timesteps gives better quality; 10~15 is the sweet spot between speed and quality