Local Voice Cloning, Ready for Delivery

OpenClaw Voice turns UTF-8 text into cloned speech with Qwen TTS, runs fully on your own GPU, and sends the final MP3 through a configured Discord bot. It is built for automation workflows that need local synthesis, predictable configuration, and no hosted inference API.

Key Features

Local Qwen TTS

Generate speech locally on your own machine instead of calling a paid hosted API.

Requirements

device_map: CUDA target such as cuda:0
dtype: float16, bfloat16, or float32
model_name: 1.7B or 0.6B Qwen TTS model

model_name: Qwen/Qwen3-TTS-12Hz-1.7B-Base
device_map: cuda:0
dtype: bfloat16
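These three fields can be checked before the model is loaded. A minimal sketch; the helper name and the exact accepted values are assumptions based on the fields listed above, not the project's actual API:

```python
VALID_DTYPES = {"float16", "bfloat16", "float32"}

def validate_model_config(cfg: dict) -> None:
    """Fail fast on an invalid model section (hypothetical helper)."""
    if not cfg.get("model_name", "").startswith("Qwen/"):
        raise ValueError(f"unexpected model_name: {cfg.get('model_name')!r}")
    if not cfg.get("device_map", "").startswith("cuda:"):
        raise ValueError("device_map must be a CUDA target such as cuda:0")
    if cfg.get("dtype") not in VALID_DTYPES:
        raise ValueError(f"dtype must be one of {sorted(VALID_DTYPES)}")

validate_model_config({
    "model_name": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    "device_map": "cuda:0",
    "dtype": "bfloat16",
})  # passes silently
```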

Discord Delivery

Compress the final waveform to MP3 in memory and deliver it through a configured Discord bot.

Bot Fields

name: Case-insensitive bot selector for --bot-name
provider: discord
token: Discord bot token
user_id: Target DM recipient

tts:
  - name: narrator
    provider: discord
    token: YOUR_DISCORD_BOT_TOKEN
    user_id: 123456789012345678
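Resolving --bot-name against the tts list is a case-insensitive lookup. A sketch under the config shape above; the function name is an assumption:

```python
def select_bot(tts_entries: list[dict], bot_name: str) -> dict:
    """Return the first entry whose name matches bot_name, ignoring case."""
    wanted = bot_name.lower()
    for entry in tts_entries:
        if entry.get("name", "").lower() == wanted:
            return entry
    raise KeyError(f"no bot named {bot_name!r} in config")

bots = [{"name": "narrator", "provider": "discord",
         "token": "YOUR_DISCORD_BOT_TOKEN", "user_id": 123456789012345678}]
print(select_bot(bots, "NARRATOR")["provider"])  # discord
```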

Chunked Long-Form Audio

Split text on paragraphs, recursively break oversized chunks, and stitch the generated waveforms into one result.

Chunk Controls

max_chunk_chars: Upper bound before recursive splitting
inter_chunk_silence_ms: Silence inserted between generated chunks
Output: One concatenated waveform before MP3 encoding

inter_chunk_silence_ms: 150
max_chunk_chars: 1400
# Paragraph-aware splitting first,
# recursive splitting only when needed.
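The paragraph-first, recursive-fallback split and the silence stitching can be sketched as follows. This is an illustration of the technique, not the project's actual implementation; function names, the whitespace-bisection heuristic, and the default sample rate are assumptions:

```python
def split_text(text: str, max_chars: int = 1400) -> list[str]:
    """Split on blank-line paragraphs first; bisect oversized chunks recursively."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    for p in paragraphs:
        chunks.extend(_split_chunk(p, max_chars))
    return [c for c in chunks if c]

def _split_chunk(chunk: str, max_chars: int) -> list[str]:
    if len(chunk) <= max_chars:
        return [chunk]
    # Cut at the last space in the first half, else hard-split at the midpoint.
    cut = chunk.rfind(" ", 1, len(chunk) // 2)
    if cut == -1:
        cut = len(chunk) // 2
    return (_split_chunk(chunk[:cut].strip(), max_chars)
            + _split_chunk(chunk[cut:].strip(), max_chars))

def stitch(waveforms: list[list[float]],
           sample_rate: int = 24000, silence_ms: int = 150) -> list[float]:
    """Concatenate per-chunk waveforms with a silence gap between them."""
    gap = [0.0] * (sample_rate * silence_ms // 1000)
    out: list[float] = []
    for i, wave in enumerate(waveforms):
        if i:
            out.extend(gap)
        out.extend(wave)
    return out

print(split_text("First paragraph.\n\nSecond paragraph."))
# ['First paragraph.', 'Second paragraph.']
```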

YAML Configuration

Drive the entire runtime from a single YAML file that resolves relative paths from its own directory.

Config Fields

language: Label passed to the model for generation
ref_audio_path: Reference WAV stored next to config.yaml
ref_text_path: Matching transcript stored next to config.yaml
--config: Override the default runtime config file

language: Spanish
ref_audio_path: spanish_male.wav
ref_text_path: spanish_male.txt
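Resolving ref_audio_path and ref_text_path against the config file's own directory can be sketched with pathlib; the helper name is an assumption:

```python
from pathlib import Path

def resolve_ref_paths(config_path: str, cfg: dict) -> dict:
    """Anchor relative reference paths at the directory containing config.yaml."""
    base = Path(config_path).expanduser().resolve().parent
    out = dict(cfg)
    for key in ("ref_audio_path", "ref_text_path"):
        p = Path(cfg[key])
        out[key] = str(p if p.is_absolute() else base / p)
    return out
```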

Reference Voice Cloning

Use the bundled Spanish reference voice or provide your own WAV and matching transcript to anchor generation to a target speaker profile.

Inputs

Spanish: Bundled spanish_male.wav plus spanish_male.txt
Other languages: User-generated WAV plus matching transcript
Location: Copy both files into the config directory

~/.openclaw-voice/
  config.yaml
  spanish_male.wav
  spanish_male.txt
# For other languages, place your own
# WAV and transcript in the same directory.

OpenClaw Skill Ready

Ship the service with an OpenClaw skill so agents know how to invoke the command and what config it depends on.

Repository Hooks

Skill: skill/openclaw-voice/SKILL.md
CLI: openclaw-voice
The agent can select a bot, load ~/.openclaw-voice/config.yaml, and turn a text file into a Discord DM.

Getting Started

Requirements

OpenClaw Voice currently targets local CUDA execution. You need Python 3.11+, a CUDA-capable GPU with a working PyTorch CUDA runtime, and Discord bot credentials for delivery. A Spanish reference voice is bundled with the repository; for other languages, create your own WAV and matching transcript.

# Lower VRAM option
model_name: Qwen/Qwen3-TTS-12Hz-0.6B-Base
device_map: cuda:0

Installation

pipx install git+https://github.com/arrase/openclaw-voice.git

Configuration

Copy the template config and the bundled Spanish reference files into the runtime directory. For any other language, replace those two files with your own WAV and matching transcription, then update the config.

mkdir -p ~/.openclaw-voice
cp config/config.yaml ~/.openclaw-voice/config.yaml
cp assets/reference/spanish_male.wav ~/.openclaw-voice/spanish_male.wav
cp assets/reference/spanish_male.txt ~/.openclaw-voice/spanish_male.txt

Basic Usage

Generate speech from a UTF-8 text file and deliver the MP3 through the selected bot.

openclaw-voice --input-text input.txt --bot-name narrator
openclaw-voice --input-text examples/long_text_es.txt --bot-name narrator

Alternate Entry Points

Use the Python module entry point or override the runtime configuration path when needed.

python -m openclaw_voice --input-text input.txt --bot-name narrator
openclaw-voice --input-text input.txt --bot-name narrator --config /path/to/config.yaml
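The flags shown above map onto a small argparse surface. A sketch of how they might be declared; the default config path is taken from this document, but the required/optional split is an assumption, not the project's documented behavior:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="openclaw-voice")
    parser.add_argument("--input-text", required=True,
                        help="UTF-8 text file to synthesize")
    parser.add_argument("--bot-name", required=True,
                        help="case-insensitive bot selector from the tts list")
    parser.add_argument("--config", default="~/.openclaw-voice/config.yaml",
                        help="override the default runtime config file")
    return parser

args = build_parser().parse_args(
    ["--input-text", "input.txt", "--bot-name", "narrator"])
print(args.bot_name)  # narrator
```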