venice_ai.audio_helpers

Helpers for Venice TTS that work around model-specific output limits.

The main entry point is :func:stream_long_text. It splits a long input into sentence-aligned segments, dispatches them to client.audio.create_speech in parallel, and yields the concatenated audio bytes in input order.

Two server-side issues motivate this helper:

  1. tts-qwen3-0-6b and tts-qwen3-1-7b silently truncate output at exactly 15.896875 s (664 MP3 frames at 24 kHz, identical for both sizes). Past ~25 words of input, the model either speeds up unnaturally or stops mid-sentence. Reproduced across all four Accept-Encoding variants and both qwen3 sizes — the cap is server-side, not transport.

  2. Six of ten Venice TTS models (the two qwen3 sizes, orpheus, chatterbox, inworld, and gemini) buffer the full response before sending any bytes, despite stream=True. Only tts-xai-v1 and tts-kokoro deliver chunks progressively. Parallel segment fan-out turns the buffering models into pseudo-streaming ones, because segment 1 can be in flight while segment 0 is still generating.
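The fan-out idea in item 2 can be sketched with plain asyncio. This is a minimal illustration under stated assumptions, not the helper's actual code: ordered_fan_out, fake_synthesize, and collect are hypothetical names, and fake_synthesize stands in for a buffering create_speech call.

```python
import asyncio
from collections.abc import AsyncIterator, Awaitable, Callable


async def ordered_fan_out(
    segments: list[str],
    synthesize: Callable[[str], Awaitable[bytes]],
    max_concurrency: int = 4,
) -> AsyncIterator[bytes]:
    """Yield one audio blob per segment, in input order."""
    gate = asyncio.Semaphore(max_concurrency)

    async def run(segment: str) -> bytes:
        async with gate:  # at most max_concurrency calls in flight
            return await synthesize(segment)

    # Start every task up front; the semaphore bounds real concurrency.
    tasks = [asyncio.create_task(run(s)) for s in segments]
    try:
        for task in tasks:
            # Awaiting in list order preserves input order even when a
            # later segment finishes first.
            yield await task
    finally:
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished


async def fake_synthesize(segment: str) -> bytes:
    # Hypothetical stand-in for a buffering TTS call: shorter text returns sooner.
    await asyncio.sleep(0.01 * len(segment))
    return segment.encode()


async def collect(segments: list[str]) -> list[bytes]:
    return [chunk async for chunk in ordered_fan_out(segments, fake_synthesize)]
```

Calling asyncio.run(collect(["ccc", "bb", "a"])) returns the blobs in input order although the last segment finishes first, which is the property that lets playback of segment 0 begin while later segments are still generating.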

See :data:MODEL_WORD_BUDGETS for the per-model split thresholds. Update that constant — not the helper's logic — when Venice ships fixes or new models. This file is intentionally the only place that knows about the qwen3 ceiling.

Known limitation: voice drift across segments

On qwen3-family models, successive segments rendered through this helper may exhibit subtle voice timbre / tone differences ("voice drift") even with identical voice, model, and input style. Cause: /audio/speech does not currently accept a seed parameter, so each segment call samples fresh RNG state on the server.

Empirical test (12-line poem on tts-qwen3-1-7b/Serena): temperature / top_p adjustments (0.2/0.8 and 0.05/0.5) and a stable style prompt were tried as anchors. None produced an audibly clear improvement over Venice defaults; run-to-run variance dominated any configuration delta. temperature, top_p, and prompt remain caller-controlled on :func:stream_long_text for use cases where they help, but no default is baked in. The real fix is a server-side seed parameter on /audio/speech.

venice_ai.audio_helpers.DEFAULT_WORD_BUDGET

DEFAULT_WORD_BUDGET = 100

Words per segment for any model not in :data:MODEL_WORD_BUDGETS. Chosen to produce ~30-45 s segments at typical TTS pace, which is large enough to amortize per-request overhead and small enough that about four segments in flight produce audio at least as fast as a listener plays it back.
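How a budget might be resolved, and what it means in seconds, can be sketched as follows. The table contents and both function names are assumptions for illustration only: the real MODEL_WORD_BUDGETS values are not shown on this page, and the 25-word entries merely echo the "~25 words" observation above.

```python
from __future__ import annotations

DEFAULT_WORD_BUDGET = 100

# Hypothetical stand-in for MODEL_WORD_BUDGETS; the values are illustrative guesses.
MODEL_WORD_BUDGETS_SKETCH: dict[str, int] = {
    "tts-qwen3-0-6b": 25,
    "tts-qwen3-1-7b": 25,
}


def resolve_word_budget(model: str | None = None, max_words: int | None = None) -> int:
    """An explicit max_words wins; otherwise use the per-model entry or the default."""
    if max_words is not None:
        if max_words < 1:
            raise ValueError("max_words must be positive")
        return max_words
    if model is None:
        return DEFAULT_WORD_BUDGET
    return MODEL_WORD_BUDGETS_SKETCH.get(model, DEFAULT_WORD_BUDGET)


def estimated_seconds(words: int, words_per_minute: float) -> float:
    """Rough segment duration for a given speaking rate."""
    return words * 60 / words_per_minute
```

Under this arithmetic, 100 words last 30 s at 200 words per minute and 40 s at 150 words per minute, which is consistent with the ~30-45 s range quoted above.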

split_text_for_tts

def split_text_for_tts(text: str,
model: str | None = None,
*,
max_words: int | None = None) -> list[str]

Split text into TTS-friendly segments only when the budget is exceeded.

For text that fits under max_words the result is [text] and the caller can pass it to create_speech unchanged — the helper is a no-op for short inputs.

When the input exceeds the budget, the splitter is greedy: it packs sentences (treating paragraph and sentence terminators as boundaries) into a segment until the next sentence would push the word count over max_words, then starts a new segment.

A single sentence longer than max_words is emitted as its own over-budget segment — splitting mid-sentence produces worse audio than one over-budget chunk.
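The greedy packing described above can be sketched in a few lines. This is an approximation under stated assumptions, not the library's implementation: greedy_split is a hypothetical name, sentence boundaries are detected with a simple regex, and packed sentences are rejoined with single spaces.

```python
import re

# Assumed boundary rule: whitespace after ., !, or ?, or a blank line.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|\n\s*\n")


def greedy_split(text: str, max_words: int) -> list[str]:
    if max_words < 1:
        raise ValueError("max_words must be positive")
    if not text.strip():
        return [""]  # empty / whitespace-only input
    if len(text.split()) <= max_words:
        return [text]  # short input passes through unchanged

    sentences = [s.strip() for s in _BOUNDARY.split(text) if s.strip()]
    segments: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Flush before the sentence that would exceed the budget.
        if current and count + n > max_words:
            segments.append(" ".join(current))
            current, count = [], 0
        # A sentence longer than max_words lands alone in its own segment.
        current.append(sentence)
        count += n
    if current:
        segments.append(" ".join(current))
    return segments
```

With max_words=5, the text "One two three. Four five. Six seven eight nine ten eleven. Twelve." yields three segments: the first two sentences packed together, the six-word sentence alone and over budget, and the final sentence.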

Empirical note: paragraph-boundary forcing was tried earlier and removed. Several Venice TTS models render a short segment faster than the same text inside a longer call (kokoro's output shrinks ~30 % when given one stanza at a time instead of three together), so forcing splits that the budget does not require audibly shortens the audio with no upside.

Arguments:

  • text: Input text. Empty / whitespace-only returns [""].
  • model: Venice TTS model id. Used to look up :data:MODEL_WORD_BUDGETS when max_words is not given.
  • max_words: Override the per-model word budget.

Raises:

  • ValueError: If max_words is given and not positive.

Returns:

Segments in input order. Always at least one element.

stream_long_text

async def stream_long_text(
client: VeniceClient,
*,
input: str,
model: str,
voice: str | Voice,
response_format: str | ResponseFormat = "mp3",
speed: float | None = None,
language: str | None = None,
prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
max_words_per_segment: int | None = None,
max_concurrency: int = 4,
on_segment_complete: SegmentCallback | None = None,
timeout: float | None = None) -> AsyncIterator[bytes]

Stream TTS audio for long input by splitting into parallel segments.

Splits input via :func:split_text_for_tts, dispatches each segment to client.audio.create_speech(stream=True, ...) with bounded concurrency, and yields the resulting mp3 bytes in input order.

Only response_format="mp3" is supported. Other formats require container-level demuxing to concatenate cleanly and would produce malformed output if naively appended.
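Why naive appending fails for container formats can be shown with the standard-library wave module. WAV is used here only as a familiar container; which formats a given Venice model offers is not stated on this page, and make_wav, count_frames, and join_wavs are hypothetical helper names.

```python
import io
import wave


def make_wav(n_frames: int, rate: int = 24000) -> bytes:
    """Return a mono 16-bit WAV file containing n_frames of silence."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


def count_frames(data: bytes) -> int:
    """Number of audio frames a WAV reader sees in data."""
    with wave.open(io.BytesIO(data), "rb") as r:
        return r.getnframes()


def join_wavs(parts: list[bytes]) -> bytes:
    """Join WAV files by decoding each one and writing a single new header."""
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        for i, part in enumerate(parts):
            with wave.open(io.BytesIO(part), "rb") as r:
                if i == 0:
                    w.setparams(r.getparams())  # copy format from the first part
                w.writeframes(r.readframes(r.getnframes()))
    return out.getvalue()
```

The header of the first file records its own data length, so a reader given make_wav(100) + make_wav(50) sees only the first 100 frames, while join_wavs on the same two parts yields 150. MP3 frames carry their own headers, which is why appending MP3 byte streams is the one case this helper supports.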

If input fits in a single segment, this short-circuits to create_speech directly — no extra task scheduling overhead.

Arguments:

  • client: A connected VeniceClient.
  • input: Text to synthesize.
  • model: Venice TTS model id (e.g. "tts-qwen3-1-7b").
  • voice: Model-specific voice id.
  • response_format: Only "mp3" supported in this helper.
  • max_words_per_segment: Override the per-model budget.
  • max_concurrency: Concurrent in-flight create_speech calls. Default 4 keeps headroom under Venice's 60 req/min limit for typical single-document use.
  • on_segment_complete: Optional callback invoked as each segment finishes its stream, with (segment_index, latency_seconds, bytes).
  • timeout: Per-segment timeout in seconds, forwarded to create_speech.

Raises:

  • NotImplementedError: If response_format is not "mp3".
  • ValueError: If max_concurrency < 1.

Exceptions raised inside any segment task are re-raised when that segment's bytes would have been yielded; later segments are cancelled.