# VoxCPM2 Deployment and Usage Guide — 30-Language TTS + Voice Design + Voice Cloning

> OpenBMB's open-source 2B tokenizer-free TTS model: 30 languages, Voice Design, and voice cloning, with complete CLI and FastAPI HTTP server usage that an Agent can call directly.

## Environment Setup

Python 3.10+ and a CUDA GPU. The 2B weights are about 4 GB in FP16, so a single 3090 can run it.

---

## Installation

```bash
mkdir -p ~/ai-tools/VoxCPM && cd ~/ai-tools/VoxCPM
python3 -m venv venv && source venv/bin/activate
pip install voxcpm soundfile fastapi uvicorn
```

---

## Download the Model (accelerated via hf-mirror)

```bash
export HF_ENDPOINT=https://hf-mirror.com
hf download openbmb/VoxCPM2 --local-dir /path/to/VoxCPM2
# ~4.7 GB total (model.safetensors 4.3 GB + audiovae.pth 360 MB)
```

---

## CLI Usage

### Mode 1: Text-to-Speech (standard synthesis)

Omitting `--control` uses the default voice:

```bash
voxcpm design \
  --model-path /path/to/VoxCPM2 \
  --local-files-only \
  --text "Hello, this is a speech synthesis test" \
  --output out.wav
```

### Mode 2: Voice Design (describe the voice style in natural language)

```bash
voxcpm design \
  --model-path /path/to/VoxCPM2 \
  --local-files-only \
  --text "Hello, welcome to VoxCPM2!" \
  --control "young female, gentle and warm, slight British accent" \
  --output out.wav
```

`--control` accepts descriptions of gender, age, timbre, emotion, speaking rate, and accent, in either Chinese or English.

### Mode 3: Voice Cloning (clone the voice of a reference audio)

```bash
voxcpm clone \
  --model-path /path/to/VoxCPM2 \
  --local-files-only \
  --text "New text to synthesize in the cloned voice" \
  --reference-audio ref.wav \
  --output out.wav
```

### Batch Synthesis

```bash
# texts.txt: one sentence per line
voxcpm batch \
  --model-path /path/to/VoxCPM2 \
  --local-files-only \
  --input texts.txt \
  --output-dir ./outputs \
  --control "mature male voice, broadcast style"
```

---

## HTTP API Server

Suitable for Agent calls. Save the following as `voxcpm_server.py`:

```python
import argparse
import base64
import io
import os
import tempfile

import soundfile as sf
import uvicorn
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from voxcpm import VoxCPM

app = FastAPI(title="VoxCPM2 TTS Server")
model = None


class TTSRequest(BaseModel):
    text: str
    control: str = ""              # Voice Design description; empty = default voice
    cfg_value: float = 2.0
    inference_timesteps: int = 10
    reference_audio_b64: str = ""  # base64-encoded reference audio, for Clone mode
    prompt_text: str = ""


@app.get("/health")
def health():
    return {"status": "ok"}
@app.post("/v1/tts")
def tts(req: TTSRequest):
    if model is None:
        raise HTTPException(503, "Model not loaded")
    kw = dict(
        text=req.text,
        cfg_value=req.cfg_value,
        inference_timesteps=req.inference_timesteps,
    )
    if req.control:
        kw["control"] = req.control
    ref_path = None
    if req.reference_audio_b64:
        # Decode the reference audio to a temporary wav file for Clone mode
        audio_bytes = base64.b64decode(req.reference_audio_b64)
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            f.write(audio_bytes)
            ref_path = f.name
        kw["reference_audio"] = ref_path
    if req.prompt_text:
        kw["prompt_text"] = req.prompt_text
    try:
        wav = model.generate(**kw)
    finally:
        # Clean up the temp file even if generation fails
        if ref_path:
            os.unlink(ref_path)
    buf = io.BytesIO()
    sf.write(buf, wav, model.tts_model.sample_rate, format="WAV")
    buf.seek(0)
    return StreamingResponse(buf, media_type="audio/wav")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", default="/path/to/VoxCPM2")
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--port", type=int, default=8304)
    args = parser.parse_args()
    model = VoxCPM.from_pretrained(args.model_path)
    uvicorn.run(app, host=args.host, port=args.port)
```

Start the server:

```bash
python voxcpm_server.py --model_path /path/to/VoxCPM2 --port 8304
```

---

## API Call Examples

```bash
# Text-to-Speech
curl -X POST http://HOST:8304/v1/tts \
  -H 'Content-Type: application/json' \
  -d '{"text":"Hello, world"}' \
  --output out.wav

# Voice Design
curl -X POST http://HOST:8304/v1/tts \
  -H 'Content-Type: application/json' \
  -d '{"text":"Hello!","control":"energetic young male, American accent"}' \
  --output out.wav

# Voice Cloning (base64-encode the reference audio first)
REF=$(base64 -i ref.wav)          # macOS; on Linux use: REF=$(base64 -w0 ref.wav)
curl -X POST http://HOST:8304/v1/tts \
  -H 'Content-Type: application/json' \
  -d "{\"text\":\"Say this sentence in the cloned voice\",\"reference_audio_b64\":\"$REF\"}" \
  --output out.wav
```

Calling from Python:

```python
import requests

resp = requests.post("http://HOST:8304/v1/tts", json={
    "text": "Hello, this is a test of an Agent calling VoxCPM2",
    "control": "gentle female voice, moderate speaking rate",
    "inference_timesteps": 10,
})
with open("out.wav", "wb") as f:
    f.write(resp.content)
```

---
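For Clone mode over HTTP, the reference audio must be base64-encoded into the JSON body. A minimal sketch of building such a payload in Python, matching the field names (`text`, `reference_audio_b64`, `prompt_text`) defined by the server's `TTSRequest` model; the helper name and the dummy bytes standing in for a real `ref.wav` are illustrative only:

```python
import base64
import json


def build_clone_payload(text: str, ref_wav_bytes: bytes, prompt_text: str = "") -> dict:
    """Build the JSON body for /v1/tts in Clone mode."""
    payload = {
        "text": text,
        # b64encode returns bytes; decode to str so the dict is JSON-serializable
        "reference_audio_b64": base64.b64encode(ref_wav_bytes).decode("ascii"),
    }
    if prompt_text:
        # Optional transcript of the reference audio
        payload["prompt_text"] = prompt_text
    return payload


# Dummy bytes stand in for the contents of ref.wav
payload = build_clone_payload("Say this in the cloned voice", b"RIFF....WAVE")
body = json.dumps(payload)  # what requests.post(..., json=payload) would send
# The server recovers the original bytes by base64-decoding the field:
assert base64.b64decode(payload["reference_audio_b64"]) == b"RIFF....WAVE"
```

Send the payload with `requests.post("http://HOST:8304/v1/tts", json=payload)` and write `resp.content` to a `.wav` file, exactly as in the Python example above.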
## Performance Reference (single RTX 3090 Ti)

| Item | Value |
|------|-------|
| First load (incl. JIT warm-up) | ~60 s |
| Subsequent loads | ~5 s |
| RTF | ~0.3 (about 3x faster than real time) |
| Output sample rate | 48 kHz |
| VRAM usage | ~8 GB (FP16) |

---

## Notes

- Mutually exclusive with large-VRAM workloads (e.g. vLLM); free the GPU before starting.
- `--local-files-only` prevents network pulls at runtime; mandatory in offline environments.
- Higher `inference_timesteps` means better quality; 10–15 is the sweet spot between speed and quality.

---

**Category**: Tutorial
**Tags**: path · model · text
**Author**: Xiao.Xi
**Link**: https://octohz.com/p/1609