venice_ai.resources.audio

Venice AI Audio API resources.

This module provides classes for interacting with the Venice AI Audio API for text-to-speech (TTS) and automatic-speech-recognition (ASR / Whisper).

Music generation has its own resource at :mod:venice_ai.resources.music (accessed via client.music); pre-v2.0.0 it lived here alongside TTS, but the namespaces are now split so each resource covers a single content domain.

The audio API allows for:

  • Converting text to natural-sounding speech (text-to-speech)
  • Selecting from multiple voice options for speech synthesis
  • Controlling speech speed and output format
  • Both full and streaming response modes
  • Transcribing and translating audio to text (Whisper models)

venice_ai.resources.audio.REGION_LANGUAGE_MAPPING

REGION_LANGUAGE_MAPPING = {
"a": {
"language": "English",
"accent": "American"
...

Mapping of single-letter region codes to language and accent information.

This dictionary provides language and accent metadata for voice model region codes used in TTS model identifiers. The region codes are typically found as prefixes in voice model names (e.g., "tts-kokoro-a" uses region "a" for American English).

Each region code maps to a dictionary containing:

  • language: The primary language spoken by voices in this region
  • accent: The specific accent or variant within that language

Region Codes:

  • a: American English
  • b: British English
  • c: Canadian English
  • d: Standard German
  • e: European Spanish
  • f: Standard French
  • g: General English
  • h: Standard Hindi
  • i: Standard Italian
  • j: Standard Japanese
  • k: Standard Korean
  • p: Standard Portuguese
  • r: Standard Russian
  • s: Scottish English
  • u: US English (alternative encoding)
  • w: Welsh English
  • x: Australian English
  • y: Indian English
  • z: Mandarin Chinese

Notes:

This mapping is used internally by the :meth:Audio.get_voices method to provide language and accent information when listing available voices.
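
As a rough illustration of how such a mapping can be consulted, the sketch below uses a small local stand-in rather than the real dictionary. Only the "a" entry is shown in the source; the other entries are inferred from the region code list above. The helper names (describe_region, region_of_voice) are hypothetical, and the idea that a voice ID such as "af_heart" starts with its region code is an assumption.

```python
# Local stand-in for a subset of REGION_LANGUAGE_MAPPING (not imported from
# the library). Only the "a" entry appears verbatim in the documentation;
# "b" and "j" are inferred from the region code list.
REGION_SUBSET = {
    "a": {"language": "English", "accent": "American"},
    "b": {"language": "English", "accent": "British"},
    "j": {"language": "Japanese", "accent": "Standard"},
}


def describe_region(code: str) -> str:
    """Return a readable label for a region code, or "unknown" if unmapped."""
    entry = REGION_SUBSET.get(code.lower())
    if entry is None:
        return "unknown"
    return f"{entry['accent']} {entry['language']}"


def region_of_voice(voice_id: str) -> str:
    """Assumption: the first character of a voice ID is its region code."""
    return voice_id[:1].lower()


if __name__ == "__main__":
    print(describe_region(region_of_voice("af_heart")))
```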

Audio Objects

class Audio(APIResource["VeniceClient"])

Asynchronous interface for Venice AI's Audio API.

The Audio class covers two capabilities served under the /audio/* namespace: Text-to-Speech (TTS) synthesis and speech-to-text transcription (Whisper). Async music generation was split out to its own resource in v2.0.0; use :class:~venice_ai.resources.music.Music (accessed as client.music) instead.

Core Capabilities:

  • Text-to-Speech Generation: Convert text input to high-quality speech audio
  • Voice Selection: Choose from multiple voice models with different characteristics
  • Format Control: Generate audio in various formats (MP3, WAV, etc.)
  • Speed Adjustment: Control speech rate from 0.25x to 4.0x normal speed
  • Streaming Support: Real-time audio generation and chunk-based delivery
  • Voice Discovery: List and filter available voice models by attributes
  • Transcription & Translation: Speech-to-text via Whisper-family models

Usage Patterns: The Audio class is designed to be accessed through the Venice AI client's :attr:~venice_ai._client.VeniceClient.audio property rather than instantiated directly. This ensures proper authentication, configuration, and connection management.

Performance Considerations:

  • Streaming mode reduces latency for long text inputs
  • Batch operations are more efficient than individual requests
  • Voice model caching improves subsequent request performance
  • Audio format selection impacts file size and quality trade-offs

Arguments:

  • client - The Venice AI client instance providing authentication and connection management. This client handles all HTTP communication, error handling, and response parsing.

Example:

Basic text-to-speech generation:

async with VeniceClient() as client:
    # Generate speech audio
    audio_bytes = await client.audio.create_speech(
        model="tts-kokoro",
        input="Welcome to Venice AI's text-to-speech service!",
        voice="af_heart",
        response_format="mp3"
    )

    # Save audio to file
    with open("welcome.mp3", "wb") as f:
        f.write(audio_bytes)

Real-time streaming generation:

async with VeniceClient() as client:
    # Stream audio chunks as they're generated
    stream = await client.audio.create_speech(
        model="tts-kokoro",
        input="This is a longer text that will be streamed...",
        voice="af_heart",
        stream=True
    )

    # Process chunks in real-time
    with open("streamed_audio.mp3", "wb") as f:
        async for chunk in stream:
            f.write(chunk)
            # Optionally process or play chunk immediately

Notes:

All methods in this class are asynchronous and must be awaited. The class inherits from :class:~venice_ai._resource.APIResource which provides the underlying HTTP request infrastructure and error handling.

See Also:

  • :class:~venice_ai.types.audio.Voice: Enumeration of available voice options
  • :class:~venice_ai.types.audio.ResponseFormat: Supported audio output formats
  • :class:~venice_ai.types.audio.AudioResponse: Response wrapper for audio data

Audio.create_speech

async def create_speech(
*,
input: str,
model: str,
voice: str | Voice,
response_format: str | ResponseFormat | None = None,
speed: float | None = None,
language: str | None = None,
prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
stream: bool = False,
timeout: float | aiohttp.ClientTimeout | None = None
) -> AudioResponse | AsyncIterator[bytes]

Generates audio from input text asynchronously.

Converts the provided text to speech with the specified model and voice. The audio is returned either as complete binary data or, when stream is True, as an async stream of audio chunks for real-time processing.

Arguments:

  • model (str): ID of the model to use for speech generation (e.g., "tts-kokoro").
  • input (str): The text to convert to speech. Maximum length varies by model.
  • voice (Union[str, venice_ai.types.audio.Voice]): The voice to use for the generated audio. Can be a string literal or a :class:~venice_ai.types.audio.Voice enum value (e.g., Voice.AF_HEART or "af_heart"). Voice IDs are per-model — call :meth:get_voices (e.g. await client.audio.get_voices(model_id="tts-kokoro")) for the live catalog of voices a given model accepts.
  • response_format (Optional[Union[str, venice_ai.types.audio.ResponseFormat]]): The format to return the audio in. Can be a string literal or a :class:~venice_ai.types.audio.ResponseFormat enum value. Defaults to "mp3".
  • speed (Optional[float]): The speed of the generated audio. Select a value from 0.25 to 4.0. Defaults to 1.0.
  • language (Optional[str]): Optional language hint. Accepted values are model-specific: Qwen 3 accepts full names ("English", "Chinese", ...); xAI and ElevenLabs accept ISO 639-1 codes ("en", "ja", ...); MiniMax accepts full names. Unsupported values are silently ignored. Omit to let the model auto-detect.
  • prompt (Optional[str]): Optional style prompt controlling emotion and delivery (e.g. "Very happy."). Supported by models advertising supportsPromptParam (currently Qwen 3 TTS); ignored otherwise. Max 500 characters.
  • temperature (Optional[float]): Optional sampling temperature (0–2) for speech generation. Supported by models advertising supportsTemperatureParam (Qwen 3, Orpheus, Chatterbox HD).
  • top_p (Optional[float]): Optional nucleus-sampling parameter (0–1). Supported by models advertising supportsTopPParam (currently Qwen 3 TTS).
  • stream (bool): Whether to stream the audio data. If True, returns an AsyncIterator of audio chunks. If False, returns the complete audio data. Defaults to False.
  • timeout: Request timeout in seconds or an aiohttp.ClientTimeout object. If not provided, uses the client's default timeout.
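
The numeric ranges listed above can be checked before a request is sent. The following is a minimal sketch of such a pre-flight check; validate_speech_params is a hypothetical helper, not part of venice_ai, and the library or the API may apply its own validation with different behavior.

```python
from typing import Optional


def validate_speech_params(
    speed: Optional[float] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    prompt: Optional[str] = None,
) -> None:
    """Raise ValueError if a value falls outside the documented range.

    Hypothetical helper: the limits are copied from the argument list above.
    None means "not provided" and is always accepted.
    """
    if speed is not None and not 0.25 <= speed <= 4.0:
        raise ValueError(f"speed must be between 0.25 and 4.0, got {speed}")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature must be between 0 and 2, got {temperature}")
    if top_p is not None and not 0.0 <= top_p <= 1.0:
        raise ValueError(f"top_p must be between 0 and 1, got {top_p}")
    if prompt is not None and len(prompt) > 500:
        raise ValueError(f"prompt must be at most 500 characters, got {len(prompt)}")


if __name__ == "__main__":
    validate_speech_params(speed=1.2, temperature=0.7, top_p=0.9, prompt="Very happy.")
    try:
        validate_speech_params(speed=5.0)
    except ValueError as exc:
        print(f"rejected: {exc}")
```

Checking locally gives an immediate, readable error for out-of-range values instead of a round trip to the API. It does not tell you whether a given model honors prompt, temperature, or top_p at all.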

Raises:

  • venice_ai.exceptions.APIError: If the API request fails.

  • ValueError: If the input text is empty or invalid parameters are provided.

Example:

Basic non-streaming text-to-speech:

import asyncio
from venice_ai import VeniceClient
from venice_ai.types.audio import Voice, ResponseFormat

async def generate_speech():
    async with VeniceClient() as client:

        # Generate speech with enum values
        audio_bytes = await client.audio.create_speech(
            model="tts-kokoro",
            input="Hello, this is a test.",
            voice=Voice.AF_HEART
        )

        # Save to file
        with open("speech.mp3", "wb") as f:
            f.write(audio_bytes)

        # Using string literals and different format
        audio_bytes = await client.audio.create_speech(
            model="tts-kokoro",
            input="Hello with different settings.",
            voice="af_heart",
            response_format="wav",
            speed=1.2
        )

asyncio.run(generate_speech())

Streaming text-to-speech:

import asyncio
from venice_ai import VeniceClient

async def stream_speech():
    async with VeniceClient() as client:

        # Stream audio data
        stream = await client.audio.create_speech(
            model="tts-kokoro",
            input="This is a streamed audio example.",
            voice="af_heart",
            stream=True
        )

        # Write streamed chunks to file
        with open("streamed_speech.mp3", "wb") as f:
            async for chunk in stream:
                f.write(chunk)

asyncio.run(stream_speech())

Returns:

If stream is False, awaiting the call returns the complete audio data as an AudioResponse. If stream is True, it returns an AsyncIterator that yields chunks of audio data as bytes.
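
When a caller requests streaming but still wants the full payload in memory at the end, the chunks can simply be concatenated. The sketch below shows that pattern against a stand-in async generator; collect_stream and fake_stream are hypothetical names used for illustration, and no Venice API call is made.

```python
import asyncio
from typing import AsyncIterator


async def collect_stream(stream: AsyncIterator[bytes]) -> bytes:
    """Drain an async iterator of byte chunks into a single bytes object."""
    buffer = bytearray()
    async for chunk in stream:
        buffer.extend(chunk)
    return bytes(buffer)


async def fake_stream() -> AsyncIterator[bytes]:
    """Stand-in for the iterator returned when stream=True."""
    for chunk in (b"ab", b"cd", b"ef"):
        await asyncio.sleep(0)  # yield control, as a network read would
        yield chunk


if __name__ == "__main__":
    print(asyncio.run(collect_stream(fake_stream())))
```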

Audio.stream_long_text

async def stream_long_text(
*,
input: str,
model: str,
voice: str | Voice,
response_format: str | ResponseFormat = "mp3",
speed: float | None = None,
language: str | None = None,
prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
max_words_per_segment: int | None = None,
max_concurrency: int = 4,
on_segment_complete: Any = None,
timeout: float | None = None) -> AsyncIterator[bytes]

Stream TTS audio for long input by splitting into parallel segments.

Convenience method that delegates to

Returns:

An :class:AsyncIterator of audio bytes in input order.

Example:

# voice values are model-specific — call
# client.audio.get_voices(model_id=...) for the live catalog.
async for chunk in client.audio.stream_long_text(
    input=long_poem,
    model="tts-kokoro",
    voice="af_heart",
):
    buffer.write(chunk)
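
The general idea of synthesizing segments in parallel while still yielding audio in input order can be sketched as follows. This is not the library's implementation: split_into_segments, stream_in_order, and the synthesize callback are hypothetical, the word-count split ignores sentence boundaries, and a fake synthesizer stands in for the API call.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List


def split_into_segments(text: str, max_words: int) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


async def stream_in_order(
    segments: List[str],
    synthesize: Callable[[str], Awaitable[bytes]],
    max_concurrency: int = 4,
) -> AsyncIterator[bytes]:
    """Run synthesize on segments concurrently, yielding results in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(segment: str) -> bytes:
        async with semaphore:  # cap the number of in-flight requests
            return await synthesize(segment)

    tasks = [asyncio.create_task(run(segment)) for segment in segments]
    try:
        for task in tasks:  # await in submission order, not completion order
            yield await task
    finally:
        for task in tasks:  # cancel leftovers if the consumer stops early
            task.cancel()


async def fake_synthesize(segment: str) -> bytes:
    """Stand-in for a TTS request; returns the text as bytes."""
    await asyncio.sleep(0)
    return segment.encode()


async def demo() -> List[bytes]:
    segments = split_into_segments("one two three four five", max_words=2)
    return [chunk async for chunk in stream_in_order(segments, fake_synthesize, 2)]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Awaiting the tasks in submission order keeps the output ordered even when a later segment finishes first; the cost is that a slow early segment delays everything behind it.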

Audio.get_voices

async def get_voices(
*,
model_id: str | None = None,
gender: Literal["male", "female", "unknown"] | None = None,
region_code: str | None = None
) -> VoiceList

Lists available text-to-speech (TTS) voices by filtering all available models asynchronously.

This is a client-side convenience method.

Arguments:

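  • model_id (Optional[str]): Optional. Return only voices that belong to this model ID.
  • gender (Optional[Literal["male", "female", "unknown"]]): Optional. Return only voices of this gender.
  • region_code (Optional[str]): Optional. Return only voices with this region code (see REGION_LANGUAGE_MAPPING for the codes).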

Raises:

  • venice_ai.exceptions.APIError: If the API request fails.

Returns:

A list of available TTS voices matching the specified filters.
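
Client-side filtering of this kind amounts to keeping the entries that match every filter that was supplied. The sketch below illustrates that logic on a stand-in record type; VoiceInfo and its field names are hypothetical and do not describe the library's VoiceList or its items.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class VoiceInfo:
    """Hypothetical stand-in for a voice entry; not the library's type."""
    id: str
    model_id: str
    gender: str
    region_code: str


def filter_voices(
    voices: Iterable[VoiceInfo],
    model_id: Optional[str] = None,
    gender: Optional[str] = None,
    region_code: Optional[str] = None,
) -> List[VoiceInfo]:
    """Keep voices matching every filter that is not None."""
    return [
        voice
        for voice in voices
        if (model_id is None or voice.model_id == model_id)
        and (gender is None or voice.gender == gender)
        and (region_code is None or voice.region_code == region_code)
    ]


# Illustrative data only; real voice IDs come from get_voices.
SAMPLE_VOICES = [
    VoiceInfo("voice_one", "model-x", "female", "a"),
    VoiceInfo("voice_two", "model-x", "male", "b"),
    VoiceInfo("voice_three", "model-y", "female", "a"),
]

if __name__ == "__main__":
    for voice in filter_voices(SAMPLE_VOICES, gender="female", region_code="a"):
        print(voice.id)
```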

Audio.transcribe

async def transcribe(
*,
file: str | bytes | BinaryIO | Path,
model: str = "nvidia/parakeet-tdt-0.6b-v3",
response_format: str | None = None,
timestamps: bool | None = None,
language: str | None = None) -> AudioTranscriptionResponse | str

Transcribe audio to text (POST /audio/transcriptions).

Converts audio input into text using the specified ASR (Automatic Speech Recognition) model. Supports various audio formats including WAV, FLAC, MP3, M4A, AAC, and MP4.

Arguments:

  • file (Union[str, bytes, BinaryIO, Path]): Audio file to transcribe. Can be a file path (string or Path), raw audio bytes, or a file-like object opened in binary mode.
  • model (str): ASR model ID to use for transcription. Defaults to "nvidia/parakeet-tdt-0.6b-v3".
  • response_format (Optional[str]): "json" (default, or None) returns an :class:AudioTranscriptionResponse with the transcribed text and optional word-level timestamps; "text" returns the raw transcript as a plain str (the server responds with text/plain, so it is not JSON-decoded).
  • timestamps (Optional[bool]): If True, include word-level timestamps in the response.
  • language (Optional[str]): Language of the input audio (e.g., "en").

Raises:

  • ValueError: If the model ID is invalid or the file cannot be read.
  • venice_ai.exceptions.APIError: If the API request fails.

Example:

Basic audio transcription:

async with VeniceClient() as client:
    result = await client.audio.transcribe(
        file="recording.mp3",
        model="nvidia/parakeet-tdt-0.6b-v3",
    )
    print(result.text)

Transcription with timestamps:

async with VeniceClient() as client:
    result = await client.audio.transcribe(
        file="recording.wav",
        timestamps=True,
    )
    print(result.text)
    if result.words:
        for word in result.words:
            print(f"{word.word}: {word.start}s - {word.end}s")

Returns:

AudioTranscriptionResponse | str: Either the parsed :class:AudioTranscriptionResponse (for "json" / default) or the plain transcript str (for "text").
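
Because the return type depends on response_format, code that accepts either form has to branch on it. A minimal sketch, assuming only that the parsed response exposes a text attribute as in the examples above; transcript_text and FakeTranscription are hypothetical names, and the dataclass is a stand-in for AudioTranscriptionResponse.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class FakeTranscription:
    """Stand-in for AudioTranscriptionResponse; only .text is modeled."""
    text: str


def transcript_text(result: Union[str, FakeTranscription]) -> str:
    """Return the transcript whether result is a plain str or a parsed response."""
    if isinstance(result, str):
        return result  # response_format="text"
    return result.text  # response_format="json" or None


if __name__ == "__main__":
    print(transcript_text("hello world"))
    print(transcript_text(FakeTranscription(text="hello world")))
```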

Audio.create_voice

async def create_voice(
*,
file: str | bytes | BinaryIO | Path,
model: str | None = None
) -> ClonedVoice

Clone a voice from an audio sample (POST /v1/audio/voices).

Returns a :class:~venice_ai.types.api.audio.ClonedVoice whose id is a vv_<id> handle. Pass that handle as the voice parameter to

Arguments:

  • file (Union[str, bytes, BinaryIO, Path]): Voice sample — a file path (str/Path), raw bytes, or a binary file-like object. A clean 5–10s speech recording is recommended. Accepted containers depend on the model (tts-chatterbox-hd: MP3/WAV/FLAC/M4A; tts-minimax-speech-02-hd: MP3/WAV).
  • model (Optional[str]): Optional. The Venice TTS model to pair the handle with — "tts-chatterbox-hd" or "tts-minimax-speech-02-hd". When omitted, the API applies its default (tts-chatterbox-hd); the chosen model is returned on :attr:ClonedVoice.model.

Raises:

  • ValueError: If model is provided and invalid, or the file cannot be read.
  • venice_ai.exceptions.APIError: If the API request fails.

Returns:

ClonedVoice: The cloned-voice handle and its paired model.
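
Since the accepted containers differ by model, a sample file can be checked locally before it is uploaded. The sketch below is a hypothetical pre-flight helper (check_sample_container is not part of venice_ai). It treats the file extension as a proxy for the container, which is an assumption, and takes the container lists and the default model from the argument descriptions above.

```python
from pathlib import Path
from typing import Optional, Union

# Container lists as documented for each cloning-capable model.
ACCEPTED_CONTAINERS = {
    "tts-chatterbox-hd": {".mp3", ".wav", ".flac", ".m4a"},
    "tts-minimax-speech-02-hd": {".mp3", ".wav"},
}
DEFAULT_MODEL = "tts-chatterbox-hd"  # documented API default when model is omitted


def check_sample_container(path: Union[str, Path], model: Optional[str] = None) -> str:
    """Return the model that would be used, or raise ValueError on a mismatch."""
    chosen = model or DEFAULT_MODEL
    if chosen not in ACCEPTED_CONTAINERS:
        raise ValueError(f"unsupported cloning model: {chosen!r}")
    suffix = Path(path).suffix.lower()
    if suffix not in ACCEPTED_CONTAINERS[chosen]:
        raise ValueError(f"{suffix or 'no extension'} is not accepted by {chosen}")
    return chosen


if __name__ == "__main__":
    print(check_sample_container("sample.flac"))
    try:
        check_sample_container("sample.flac", model="tts-minimax-speech-02-hd")
    except ValueError as exc:
        print(f"rejected: {exc}")
```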