venice_ai.audio_helpers
Helpers for Venice TTS that work around model-specific output limits.
The main entry point is :func:stream_long_text. It splits a long input into
sentence-aligned segments, dispatches them to client.audio.create_speech
in parallel, and yields the concatenated audio bytes in input order.
Two server-side issues motivate this helper:
- tts-qwen3-0-6b and tts-qwen3-1-7b silently truncate output at exactly 15.896875 s (664 MP3 frames at 24 kHz, identical for both sizes). Past ~25 words of input, the model either speeds up unnaturally or stops mid-sentence. Reproduced across all four Accept-Encoding variants and both qwen3 sizes, so the cap is server-side, not transport.
- Six of ten Venice TTS models buffer the full response before sending any bytes, despite stream=True: the qwen3 family, orpheus, chatterbox, inworld, and gemini. Only tts-xai-v1 and tts-kokoro deliver chunks progressively. Parallel segment fan-out converts the buffering models into pseudo-streaming because chunk 1 can be in flight while chunk 0 is still generating.
See :data:MODEL_WORD_BUDGETS for the per-model split thresholds. Update
that constant — not the helper's logic — when Venice ships fixes or new
models. This file is intentionally the only place that knows about the
qwen3 ceiling.
Known limitation: voice drift across segments
On qwen3-family models, successive segments rendered through this helper
may exhibit subtle voice timbre / tone differences ("voice drift") even
with identical voice, model, and input style. Cause: /audio/speech
does not currently accept a seed parameter, so each segment call samples
fresh RNG state on the server.
Empirical test (12-line poem on tts-qwen3-1-7b/Serena):
temperature / top_p adjustments (0.2/0.8 and 0.05/0.5) and a stable
style prompt were tried as anchors. None produced an audibly clear
improvement over Venice defaults; run-to-run variance dominated any
configuration delta. temperature, top_p, and prompt remain
caller-controlled on :func:stream_long_text for use cases where they
help, but no default is baked in. The real fix is a server-side seed
parameter on /audio/speech.
venice_ai.audio_helpers.DEFAULT_WORD_BUDGET
DEFAULT_WORD_BUDGET = 100
Words per segment for any model not in :data:MODEL_WORD_BUDGETS.
Chosen to produce ~30-45 s segments at typical TTS pace: large enough to
amortize per-request overhead, and small enough that ~4 segments rendered
in parallel keep generation ahead of the listener's playback.
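
The lookup order implied here (explicit override, then the per-model table, then the default) can be sketched as follows. This is only an illustration: resolve_word_budget and estimated_seconds are hypothetical names, and the two MODEL_WORD_BUDGETS entries are placeholder values suggested by the ~25-word observation above, not the constant's real contents. The arithmetic helper shows that a 100-word segment lasting ~30-45 s corresponds to a speaking pace of roughly 130-200 words per minute.

```python
from typing import Optional

DEFAULT_WORD_BUDGET = 100

# Placeholder values for illustration only; the real table lives in
# venice_ai.audio_helpers.MODEL_WORD_BUDGETS and may differ.
MODEL_WORD_BUDGETS = {
    "tts-qwen3-0-6b": 25,
    "tts-qwen3-1-7b": 25,
}


def resolve_word_budget(model: Optional[str], max_words: Optional[int] = None) -> int:
    """Pick the words-per-segment budget (hypothetical helper)."""
    if max_words is not None:
        # An explicit override always wins, but must be positive.
        if max_words <= 0:
            raise ValueError("max_words must be positive")
        return max_words
    # Unknown models (and model=None) fall back to the default budget.
    return MODEL_WORD_BUDGETS.get(model, DEFAULT_WORD_BUDGET)


def estimated_seconds(words: int, words_per_minute: float) -> float:
    """Rough segment duration at an assumed speaking pace."""
    return 60.0 * words / words_per_minute
```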
split_text_for_tts
def split_text_for_tts(text: str,
                       model: str | None = None,
                       *,
                       max_words: int | None = None) -> list[str]
Split text into TTS-friendly segments only when the budget is exceeded.
For text that fits under max_words the result is [text] and the
caller can pass it to create_speech unchanged — the helper is a
no-op for short inputs.
When the input exceeds the budget, the splitter is greedy: it packs
sentences (respecting paragraph and sentence terminators as boundaries)
into a segment until the next sentence would push the word count over
max_words, then starts a new segment.
A single sentence longer than max_words is emitted as its own
over-budget segment — splitting mid-sentence produces worse audio than
one over-budget chunk.
Empirical note: paragraph-boundary forcing was tried earlier and removed. Several Venice TTS models speak a short segment faster than they speak the same text inside a longer call (kokoro's output is ~30 % shorter when given one stanza at a time instead of three together), so forcing splits that the budget does not require audibly rushes the audio with no upside.
Arguments:
- text: Input text. Empty / whitespace-only returns [""].
- model: Venice TTS model id. Used to look up :data:MODEL_WORD_BUDGETS when max_words is not given.
- max_words: Override the per-model word budget.
Raises:
- ValueError: If max_words is given and not positive.
Returns:
Segments in input order. Always at least one element.
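
The greedy packing rule described for this function can be sketched as below. This is a minimal illustration under stated assumptions, not the library's implementation: greedy_split is a hypothetical name, it takes the budget explicitly instead of looking it up per model, and the boundary regex (whitespace after ., ! or ?, or a blank line) is an assumed approximation of the real sentence detection.

```python
import re

# Assumed boundary rule: whitespace following a sentence terminator,
# or a blank line separating paragraphs.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|\n\s*\n")


def greedy_split(text: str, max_words: int) -> list[str]:
    """Pack whole sentences into segments of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not text.strip():
        return [""]
    if len(text.split()) <= max_words:
        return [text]  # short input passes through unchanged

    sentences = [s.strip() for s in _BOUNDARY.split(text) if s.strip()]
    segments: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Flush before adding a sentence that would exceed the budget.
        # A lone over-budget sentence still becomes its own segment.
        if current and count + n > max_words:
            segments.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        segments.append(" ".join(current))
    return segments
```

With a budget of 5, the input "One two three. Four five. Six seven eight nine. Ten." packs into two segments, because adding the third sentence to the first two would bring the count to 9 words.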
stream_long_text
async def stream_long_text(
        client: VeniceClient,
        *,
        input: str,
        model: str,
        voice: str | Voice,
        response_format: str | ResponseFormat = "mp3",
        speed: float | None = None,
        language: str | None = None,
        prompt: str | None = None,
        temperature: float | None = None,
        top_p: float | None = None,
        max_words_per_segment: int | None = None,
        max_concurrency: int = 4,
        on_segment_complete: SegmentCallback | None = None,
        timeout: float | None = None) -> AsyncIterator[bytes]
Stream TTS audio for long input by splitting into parallel segments.
Splits input via :func:split_text_for_tts, dispatches each segment
to client.audio.create_speech(stream=True, ...) with bounded
concurrency, and yields the resulting mp3 bytes in input order.
Only response_format="mp3" is supported. Other formats require
container-level demuxing to concatenate cleanly and would produce
malformed output if naively appended.
If input fits in a single segment, this short-circuits to
create_speech directly — no extra task scheduling overhead.
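
The dispatch pattern described above (bounded concurrency, output yielded in input order) can be sketched with plain asyncio primitives. This is an illustration only, not the helper's actual code: ordered_fan_out and the synth callable are hypothetical names, and synth stands in for a per-segment create_speech call that returns that segment's complete audio bytes.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Sequence


async def ordered_fan_out(
    segments: Sequence[str],
    synth: Callable[[str], Awaitable[bytes]],
    max_concurrency: int = 4,
) -> AsyncIterator[bytes]:
    """Run synth on every segment concurrently, yield results in input order."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be >= 1")
    gate = asyncio.Semaphore(max_concurrency)

    async def run(segment: str) -> bytes:
        async with gate:  # at most max_concurrency calls in flight
            return await synth(segment)

    # Start everything up front so later segments generate while
    # earlier ones are still being awaited.
    tasks = [asyncio.create_task(run(s)) for s in segments]
    try:
        for task in tasks:
            # Awaiting in list order preserves input order; a failed
            # segment re-raises here, at its position in the stream.
            yield await task
    finally:
        # On error or early exit, cancel whatever is still pending.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
```

Awaiting the tasks in list order rather than completion order is what keeps the audio in sequence: a later segment that finishes early simply waits in its task until its turn comes.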
Arguments:
- client: A connected VeniceClient.
- input: Text to synthesize.
- model: Venice TTS model id (e.g. "tts-qwen3-1-7b").
- voice: Model-specific voice id.
- response_format: Only "mp3" supported in this helper.
- max_words_per_segment: Override the per-model budget.
- max_concurrency: Concurrent in-flight create_speech calls. Default 4 keeps headroom under Venice's 60 req/min limit for typical single-document use.
- on_segment_complete: Optional callback invoked as each segment finishes its stream, with (segment_index, latency_seconds, bytes).
- timeout: Per-segment timeout in seconds, forwarded to create_speech.
Raises:
- NotImplementedError: If response_format is not "mp3".
- ValueError: If max_concurrency < 1.

Exceptions raised inside any segment task are re-raised when that segment's bytes would have been yielded; later segments are cancelled.
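
Because the helper yields mp3 bytes in input order, a caller can append each chunk to a file as it arrives. The sketch below assumes stream_long_text is consumed with async for, as its AsyncIterator[bytes] return type suggests. write_audio_stream is a hypothetical name, and the commented call site uses the model and voice named in the voice-drift test above; client construction is not covered by this reference.

```python
import asyncio
from typing import AsyncIterator


async def write_audio_stream(chunks: AsyncIterator[bytes], out_path: str) -> int:
    """Append each yielded chunk to out_path and return the bytes written."""
    total = 0
    # Plain blocking file I/O keeps the sketch short; chunks are written
    # in the order they are yielded, so the file stays in input order.
    with open(out_path, "wb") as fh:
        async for chunk in chunks:
            fh.write(chunk)
            total += len(chunk)
    return total


# Hypothetical call site, inside a coroutine that already holds a client:
#
#     await write_audio_stream(
#         stream_long_text(client, input=text,
#                          model="tts-qwen3-1-7b", voice="Serena"),
#         "poem.mp3",
#     )
```

If a segment fails, the exception surfaces from the async for loop, and the file then holds only the audio yielded before the failure.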