venice_ai.resources.audio
Venice AI Audio API resources.
This module provides classes for interacting with the Venice AI Audio API for text-to-speech (TTS) and automatic-speech-recognition (ASR / Whisper).
Music generation has its own resource at :mod:venice_ai.resources.music
(accessed via client.music); pre-v2.0.0 it lived here alongside TTS,
but the namespaces are now split so each resource covers a single content
domain.
The audio API allows for:
- Converting text to natural-sounding speech (text-to-speech)
- Selecting from multiple voice options for speech synthesis
- Controlling speech speed and output format
- Returning audio in either full or streaming response modes
- Transcribing and translating audio to text (Whisper models)
venice_ai.resources.audio.REGION_LANGUAGE_MAPPING
REGION_LANGUAGE_MAPPING = {
"a": {
"language": "English",
"accent": "American"
...
Mapping of single-letter region codes to language and accent information.
This dictionary provides language and accent metadata for voice model region codes used in TTS model identifiers. The region codes are typically found as prefixes in voice model names (e.g., "tts-kokoro-a" uses region "a" for American English).
Each region code maps to a dictionary containing:
- language: The primary language spoken by voices in this region
- accent: The specific accent or variant within that language
Region Codes:
- a: American English
- b: British English
- c: Canadian English
- d: Standard German
- e: European Spanish
- f: Standard French
- g: General English
- h: Standard Hindi
- i: Standard Italian
- j: Standard Japanese
- k: Standard Korean
- p: Standard Portuguese
- r: Standard Russian
- s: Scottish English
- u: US English (alternative encoding)
- w: Welsh English
- x: Australian English
- y: Indian English
- z: Mandarin Chinese
Notes:
This mapping is used internally by the :meth:Audio.get_voices method
to provide language and accent information when listing available voices.
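As a rough illustration of how such a mapping can be consulted, the sketch below looks up a region code and formats the result. It uses a small local subset rather than importing the real dictionary. Only the "a" entry's shape appears in the excerpt above, so the "b" and "j" entries, the "Standard" accent label, and the describe_region helper are assumptions made for this example.

```python
# Illustrative subset only. The real REGION_LANGUAGE_MAPPING lives in
# venice_ai.resources.audio; the "b" and "j" entries below are ASSUMED to
# follow the same {"language", "accent"} shape shown for "a".
REGION_SUBSET = {
    "a": {"language": "English", "accent": "American"},
    "b": {"language": "English", "accent": "British"},
    "j": {"language": "Japanese", "accent": "Standard"},
}


def describe_region(region_code: str, mapping: dict = REGION_SUBSET) -> str:
    """Return '<accent> <language>' for a region code, or 'Unknown' if unmapped.

    describe_region is a hypothetical helper, not part of the library.
    """
    info = mapping.get(region_code.lower())
    if info is None:
        return "Unknown"
    return f"{info['accent']} {info['language']}"


print(describe_region("a"))
print(describe_region("q"))
```

Falling back to a neutral label for unmapped codes keeps a voice listing usable even if a model introduces a region letter the mapping does not yet cover.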
Audio Objects
class Audio(APIResource["VeniceClient"])
Asynchronous interface for Venice AI's Audio API.
The Audio class covers two capabilities served under the /audio/*
namespace: Text-to-Speech (TTS) synthesis and speech-to-text transcription
(Whisper). Async music generation was split out to its own resource in
v2.0.0; use :class:~venice_ai.resources.music.Music (accessed as
client.music) instead.
Core Capabilities:
- Text-to-Speech Generation: Convert text input to high-quality speech audio
- Voice Selection: Choose from multiple voice models with different characteristics
- Format Control: Generate audio in various formats (MP3, WAV, etc.)
- Speed Adjustment: Control speech rate from 0.25x to 4.0x normal speed
- Streaming Support: Real-time audio generation and chunk-based delivery
- Voice Discovery: List and filter available voice models by attributes
- Transcription & Translation: Speech-to-text via Whisper-family models
Usage Patterns:
The Audio class is designed to be accessed through the Venice AI client's
:attr:~venice_ai._client.VeniceClient.audio property rather than instantiated directly.
This ensures proper authentication, configuration, and connection management.
Performance Considerations:
- Streaming mode reduces latency for long text inputs
- Batch operations are more efficient than individual requests
- Voice model caching improves subsequent request performance
- Audio format selection impacts file size and quality trade-offs
Arguments:
- client: The Venice AI client instance providing authentication and connection management. This client handles all HTTP communication, error handling, and response parsing.
Example:
Basic text-to-speech generation:
async with VeniceClient() as client:
    # Generate speech audio
    audio_bytes = await client.audio.create_speech(
        model="tts-kokoro",
        input="Welcome to Venice AI's text-to-speech service!",
        voice="af_heart",
        response_format="mp3"
    )
    # Save audio to file
    with open("welcome.mp3", "wb") as f:
        f.write(audio_bytes)
Real-time streaming generation:
async with VeniceClient() as client:
    # Stream audio chunks as they're generated
    stream = await client.audio.create_speech(
        model="tts-kokoro",
        input="This is a longer text that will be streamed...",
        voice="af_heart",
        stream=True
    )
    # Process chunks in real-time
    with open("streamed_audio.mp3", "wb") as f:
        async for chunk in stream:
            f.write(chunk)
            # Optionally process or play chunk immediately
Notes:
All methods in this class are asynchronous and must be awaited. The class
inherits from :class:~venice_ai._resource.APIResource which provides
the underlying HTTP request infrastructure and error handling.
See Also:
- :class:~venice_ai.types.audio.Voice: Enumeration of available voice options
- :class:~venice_ai.types.audio.ResponseFormat: Supported audio output formats
- :class:~venice_ai.types.audio.AudioResponse: Response wrapper for audio data
Audio.create_speech
async def create_speech(
*,
input: str,
model: str,
voice: str | Voice,
response_format: str | ResponseFormat | None = None,
speed: float | None = None,
language: str | None = None,
prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
stream: bool = False,
timeout: float | aiohttp.ClientTimeout | None = None
) -> AudioResponse | AsyncIterator[bytes]
Generates audio from input text asynchronously.
Converts the provided text to speech with the specified model and voice. The audio is returned either as complete binary data or as an async stream of audio chunks for real-time processing.
Arguments:
- model (str): ID of the model to use for speech generation (e.g., "tts-kokoro").
- input (str): The text to convert to speech. Maximum length varies by model.
- voice (Union[str, venice_ai.types.audio.Voice]): The voice to use for the generated audio. Can be a string literal or a :class:~venice_ai.types.audio.Voice enum value (e.g., Voice.AF_HEART or "af_heart"). Voice IDs are per-model; call :meth:get_voices (e.g. await client.audio.get_voices(model_id="tts-kokoro")) for the live catalog of voices a given model accepts.
- response_format (Optional[Union[str, venice_ai.types.audio.ResponseFormat]]): The format to return the audio in. Can be a string literal or a :class:~venice_ai.types.audio.ResponseFormat enum value. Defaults to "mp3".
- speed (Optional[float]): The speed of the generated audio. Select a value from 0.25 to 4.0. Defaults to 1.0.
- language (Optional[str]): Optional language hint. Accepted values are model-specific: Qwen 3 accepts full names ("English", "Chinese", ...); xAI and ElevenLabs accept ISO 639-1 codes ("en", "ja", ...); MiniMax accepts full names. Unsupported values are silently ignored. Omit to let the model auto-detect.
- prompt (Optional[str]): Optional style prompt controlling emotion and delivery (e.g. "Very happy."). Supported by models advertising supportsPromptParam (currently Qwen 3 TTS); ignored otherwise. Max 500 characters.
- temperature (Optional[float]): Optional sampling temperature (0–2) for speech generation. Supported by models advertising supportsTemperatureParam (Qwen 3, Orpheus, Chatterbox HD).
- top_p (Optional[float]): Optional nucleus-sampling parameter (0–1). Supported by models advertising supportsTopPParam (currently Qwen 3 TTS).
- stream (Optional[bool]): Whether to stream the audio data. If True, returns an AsyncIterator of audio chunks. If False, returns the complete audio data. Defaults to False.
- timeout: Request timeout in seconds or an aiohttp.ClientTimeout object. If not provided, uses the client's default timeout.
Raises:
- venice_ai.exceptions.APIError: If the API request fails.
- ValueError: If the input text is empty or invalid parameters are provided.
Example:
Basic non-streaming text-to-speech:
.. code-block:: python

    import asyncio
    from venice_ai import VeniceClient
    from venice_ai.types.audio import Voice, ResponseFormat

    async def generate_speech():
        async with VeniceClient() as client:
            # Generate speech with enum values
            audio_bytes = await client.audio.create_speech(
                model="tts-kokoro",
                input="Hello, this is a test.",
                voice=Voice.AF_HEART
            )
            # Save to file
            with open("speech.mp3", "wb") as f:
                f.write(audio_bytes)
            # Using string literals and different format
            audio_bytes = await client.audio.create_speech(
                model="tts-kokoro",
                input="Hello with different settings.",
                voice="af_heart",
                response_format="wav",
                speed=1.2
            )

    asyncio.run(generate_speech())

Streaming text-to-speech:
.. code-block:: python

    async def stream_speech():
        async with VeniceClient() as client:
            # Stream audio data
            stream = await client.audio.create_speech(
                model="tts-kokoro",
                input="This is a streamed audio example.",
                voice="af_heart",
                stream=True
            )
            # Write streamed chunks to file
            with open("streamed_speech.mp3", "wb") as f:
                async for chunk in stream:
                    f.write(chunk)

    asyncio.run(stream_speech())
Returns:
If stream is False, returns the audio data as AudioResponse (awaitable). If stream is True, returns an AsyncIterator yielding chunks of audio data as bytes.
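The numeric limits documented for create_speech (speed from 0.25 to 4.0, temperature from 0 to 2, top_p from 0 to 1, and a 500-character prompt cap) can be checked before a request is sent. The sketch below is one way to do that. The check_speech_params helper is hypothetical and not part of the library, and whether the client performs equivalent validation itself is not stated here.

```python
from typing import List, Optional


def check_speech_params(
    speed: Optional[float] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    prompt: Optional[str] = None,
) -> List[str]:
    """Return a list of problems; an empty list means every check passed.

    Hypothetical pre-flight helper based on the documented ranges only.
    Parameters left as None are skipped, mirroring their optional status.
    """
    problems: List[str] = []
    if speed is not None and not 0.25 <= speed <= 4.0:
        problems.append(f"speed {speed} is outside 0.25 to 4.0")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        problems.append(f"temperature {temperature} is outside 0 to 2")
    if top_p is not None and not 0.0 <= top_p <= 1.0:
        problems.append(f"top_p {top_p} is outside 0 to 1")
    if prompt is not None and len(prompt) > 500:
        problems.append(f"prompt has {len(prompt)} characters (max 500)")
    return problems


for problem in check_speech_params(speed=5.0, top_p=1.5):
    print(problem)
```

Collecting every problem instead of raising on the first one makes it easier to report all invalid settings to a caller in a single pass.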
Audio.stream_long_text
async def stream_long_text(
*,
input: str,
model: str,
voice: str | Voice,
response_format: str | ResponseFormat = "mp3",
speed: float | None = None,
language: str | None = None,
prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
max_words_per_segment: int | None = None,
max_concurrency: int = 4,
on_segment_complete: Any = None,
timeout: float | None = None) -> AsyncIterator[bytes]
Stream TTS audio for long input by splitting into parallel segments.
Convenience method that delegates to
Returns:
An :class:AsyncIterator of audio bytes in input order.
Example:
# Voice values are model-specific; call
# client.audio.get_voices(model_id=...) for the live catalog.
async for chunk in client.audio.stream_long_text(
    input=long_poem,
    model="tts-kokoro",
    voice="af_heart",
):
    buffer.write(chunk)
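The general pattern of splitting long input into segments, synthesizing them concurrently, and yielding the results in input order can be sketched with asyncio alone. This is not the library's implementation. Splitting on a fixed word count is an assumption suggested by the max_words_per_segment parameter name, and split_into_segments, stream_in_order, and fake_synthesize are hypothetical stand-ins that make no network calls.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List


def split_into_segments(text: str, max_words: int) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


async def stream_in_order(
    segments: List[str],
    synthesize: Callable[[str], Awaitable[bytes]],
    max_concurrency: int = 4,
) -> AsyncIterator[bytes]:
    """Run synthesize on all segments with bounded concurrency, yielding in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(segment: str) -> bytes:
        async with semaphore:  # at most max_concurrency calls in flight
            return await synthesize(segment)

    tasks = [asyncio.create_task(run(s)) for s in segments]
    try:
        for task in tasks:  # awaiting in list order preserves input order
            yield await task
    finally:
        for task in tasks:  # no-op for finished tasks; stops leftovers on early exit
            task.cancel()


async def fake_synthesize(segment: str) -> bytes:
    """Stand-in for a TTS request: longer segments finish sooner, to exercise ordering."""
    await asyncio.sleep(0.01 / (1 + len(segment)))
    return segment.upper().encode("utf-8")


async def demo() -> List[bytes]:
    segments = split_into_segments("one two three four five", max_words=2)
    return [chunk async for chunk in stream_in_order(segments, fake_synthesize, 2)]


chunks = asyncio.run(demo())
print(chunks)
```

Awaiting the tasks in creation order means a slow early segment delays later ones from being yielded, but it guarantees the audio arrives in the order the text was written.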
Audio.get_voices
async def get_voices(*,
model_id: str | None = None,
gender: Literal["male", "female", "unknown"]
| None = None,
region_code: str | None = None) -> VoiceList
Asynchronously lists available text-to-speech (TTS) voices by filtering the full set of available models.
This is a client-side convenience method.
Arguments:
- model_id (Optional[str]): Optional. Filter voices by specific model ID.
- gender: Optional. Filter voices by gender ("male", "female", or "unknown").
- region_code: Optional. Filter voices by region code (see REGION_LANGUAGE_MAPPING).
Raises:
venice_ai.exceptions.APIError: If the API request fails.
Returns:
A list of available TTS voices matching the specified filters.
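To make the filter semantics concrete, the sketch below applies the three optional filters to a small in-memory list, treating an omitted filter as "match everything". The structure of VoiceList is not shown here, so the VoiceRecord fields, the filter_voices helper, and all sample rows other than the "af_heart" voice ID are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VoiceRecord:
    """Hypothetical stand-in for one voice entry; field names are assumed."""
    voice_id: str
    model_id: str
    gender: str
    region_code: str


def filter_voices(
    voices: List[VoiceRecord],
    model_id: Optional[str] = None,
    gender: Optional[str] = None,
    region_code: Optional[str] = None,
) -> List[VoiceRecord]:
    """Keep voices matching every filter that is not None."""
    return [
        v for v in voices
        if (model_id is None or v.model_id == model_id)
        and (gender is None or v.gender == gender)
        and (region_code is None or v.region_code == region_code)
    ]


# Illustrative rows only; they do not reflect the live voice catalog.
SAMPLE = [
    VoiceRecord("af_heart", "tts-kokoro", "female", "a"),
    VoiceRecord("example_male_b", "tts-kokoro", "male", "b"),
    VoiceRecord("example_other", "example-model", "unknown", "a"),
]

print([v.voice_id for v in filter_voices(SAMPLE, region_code="a")])
```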
Audio.transcribe
async def transcribe(
*,
file: str | bytes | BinaryIO | Path,
model: str = "nvidia/parakeet-tdt-0.6b-v3",
response_format: str | None = None,
timestamps: bool | None = None,
language: str | None = None) -> AudioTranscriptionResponse | str
Transcribe audio to text (POST /audio/transcriptions).
Converts audio input into text using the specified ASR (Automatic Speech Recognition) model. Supports various audio formats including WAV, FLAC, MP3, M4A, AAC, and MP4.
Arguments:
- file (Union[str, bytes, BinaryIO, Path]): Audio file to transcribe. Can be a file path (string or Path), raw audio bytes, or a file-like object opened in binary mode.
- model (str): ASR model ID to use for transcription. Defaults to "nvidia/parakeet-tdt-0.6b-v3".
- response_format (Optional[str]): "json" (default, or None) returns an :class:AudioTranscriptionResponse with the transcribed text and optional word-level timestamps; "text" returns the raw transcript as a plain str (the server responds with text/plain, so it is not JSON-decoded).
- timestamps (Optional[bool]): If True, include word-level timestamps in the response.
- language (Optional[str]): Language of the input audio (e.g., "en").
Raises:
- ValueError: If the model ID is invalid or the file cannot be read.
- venice_ai.exceptions.APIError: If the API request fails.
Example:
Basic audio transcription:
async with VeniceClient() as client:
    result = await client.audio.transcribe(
        file="recording.mp3",
        model="nvidia/parakeet-tdt-0.6b-v3",
    )
    print(result.text)
Transcription with timestamps:
async with VeniceClient() as client:
    result = await client.audio.transcribe(
        file="recording.wav",
        timestamps=True,
    )
    print(result.text)
    if result.words:
        for word in result.words:
            print(f"{word.word}: {word.start}s - {word.end}s")
Returns:
AudioTranscriptionResponse | str: Either the parsed :class:AudioTranscriptionResponse (for
"json" / default) or the plain transcript str (for
"text").
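Because transcribe returns either a response object or a plain str depending on response_format, calling code that accepts both needs to branch on the type. The sketch below shows one way to normalize the result to a string. FakeTranscription is a stand-in that models only the text attribute, and transcript_text is a hypothetical helper, not part of the library.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class FakeTranscription:
    """Stand-in for AudioTranscriptionResponse; only the text attribute is modelled."""
    text: str


def transcript_text(result: Union[FakeTranscription, str]) -> str:
    """Return the transcript whether the call used "json" or "text" format."""
    if isinstance(result, str):
        # response_format="text": the transcript is already a plain string
        return result
    # response_format="json" or None: read the text attribute of the parsed response
    return result.text


print(transcript_text("hello from text mode"))
print(transcript_text(FakeTranscription(text="hello from json mode")))
```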
Audio.create_voice
async def create_voice(*,
file: str | bytes | BinaryIO | Path,
model: str | None = None) -> ClonedVoice
Clone a voice from an audio sample (POST /v1/audio/voices).
Returns a :class:~venice_ai.types.api.audio.ClonedVoice whose id
is a vv_<id> handle. Pass that handle as the voice parameter to
Arguments:
- file (Union[str, bytes, BinaryIO, Path]): Voice sample: a file path (str/Path), raw bytes, or a binary file-like object. A clean 5–10s speech recording is recommended. Accepted containers depend on the model (tts-chatterbox-hd: MP3/WAV/FLAC/M4A; tts-minimax-speech-02-hd: MP3/WAV).
- model (Optional[str]): Optional. The Venice TTS model to pair the handle with: "tts-chatterbox-hd" or "tts-minimax-speech-02-hd". When omitted, the API applies its default (tts-chatterbox-hd); the chosen model is returned on :attr:ClonedVoice.model.
Raises:
- ValueError: If model is provided and invalid, or the file cannot be read.
- venice_ai.exceptions.APIError: If the API request fails.
Returns:
ClonedVoice: The cloned-voice handle and its paired model.
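Since the accepted sample containers differ by model, a quick local check can catch a mismatch before uploading. The sketch below encodes the two per-model lists given in the argument description. It assumes the file extension reflects the actual container, and sample_is_accepted is a hypothetical helper rather than a library function.

```python
from pathlib import Path
from typing import Optional, Union

# Per-model containers as listed in the create_voice argument description.
ACCEPTED_CONTAINERS = {
    "tts-chatterbox-hd": {".mp3", ".wav", ".flac", ".m4a"},
    "tts-minimax-speech-02-hd": {".mp3", ".wav"},
}
# The API default when model is omitted, per the description above.
DEFAULT_MODEL = "tts-chatterbox-hd"


def sample_is_accepted(path: Union[str, Path], model: Optional[str] = None) -> bool:
    """Return True if the file extension is listed for the chosen model.

    ASSUMPTION: the extension matches the real container; no file is opened.
    """
    chosen = model or DEFAULT_MODEL
    if chosen not in ACCEPTED_CONTAINERS:
        raise ValueError(f"unsupported voice-cloning model: {chosen!r}")
    return Path(path).suffix.lower() in ACCEPTED_CONTAINERS[chosen]


print(sample_is_accepted("sample.flac"))
print(sample_is_accepted("sample.flac", model="tts-minimax-speech-02-hd"))
```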