Venice AI Chat Completions API Resources.

This module provides asynchronous client interfaces for Venice AI's Chat Completions API, enabling sophisticated conversational AI interactions with advanced language models. The Chat Completions API supports both streaming and non-streaming responses, tool calling, structured output generation, and fine-grained control over model behavior.

Key Features:

  • Conversational AI: Multi-turn chat conversations with context preservation
  • Streaming Responses: Real-time token-by-token response generation
  • Tool Integration: Function calling and external tool integration capabilities
  • Structured Output: JSON schema-guided response formatting
  • Model Variety: Access to multiple state-of-the-art language models
  • Advanced Controls: Temperature, top-p, frequency penalties, and more
  • Asynchronous Operations: Full async/await support for scalable applications

Supported Capabilities:

  • Multi-turn Conversations: Maintain context across multiple exchanges
  • System Instructions: Define AI behavior and personality through system messages
  • Function Calling: Enable AI to call external functions and APIs
  • Response Formatting: Control output structure with JSON schemas
  • Content Filtering: Optional safety and content moderation features
  • Reproducible Generation: Seed-based deterministic outputs
  • Token Management: Precise control over response length and token usage

The chat completions system enables sophisticated AI applications including:

  • Virtual Assistants: Intelligent chatbots and conversational interfaces
  • Content Generation: Creative writing, documentation, and content creation
  • Code Assistance: Programming help, code review, and technical guidance
  • Data Analysis: Structured data processing and analysis workflows
  • Decision Support: AI-powered recommendations and decision assistance
  • Educational Tools: Tutoring systems and interactive learning platforms

Example:

.. code-block:: python

    import asyncio
    from venice_ai import VeniceClient

    async def chat_conversation():
        async with VeniceClient() as client:
            # Single-turn conversation
            response = await client.chat.completions.create(
                model="llama-3.3-70b",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Explain quantum computing"},
                ],
                temperature=0.7,
                max_completion_tokens=500,
            )
            print(response["choices"][0]["message"]["content"])

    asyncio.run(chat_conversation())

Performance Considerations:

  • Streaming reduces latency for long responses
  • Batch conversations are more efficient than individual requests
  • Model selection affects both quality and response speed
  • Token limits impact both cost and conversation length

Notes:

All operations in this module are asynchronous and require proper async/await handling. The ChatCompletions class is accessed through the :attr:VeniceClient.chat.completions property for proper authentication and configuration.

ChatCompletions Objects

class ChatCompletions(APIResource["VeniceClient"])

Provides access to asynchronous chat completion operations.

This class manages asynchronous chat completion operations with Venice AI models, supporting both standard (non-streaming) and streaming response formats. It serves as the primary interface for chat-based interactions with Venice AI language models in asynchronous contexts.

The class handles parameter validation, request formation, and response parsing for asynchronous chat completion requests.

Arguments:

  • _client (venice_ai._client.VeniceClient) - The client instance used to make API requests.

Example:

.. code-block:: python

    from venice_ai import VeniceClient
    import asyncio

    async def main():
        # Initialize the async client
        async with VeniceClient(api_key="your-api-key") as client:
            # Create a chat completion asynchronously
            response = await client.chat.completions.create(
                model="venice-1",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Tell me about Venice AI."},
                ],
            )

            # Access the response content
            print(response["choices"][0]["message"]["content"])

    # Run the async function
    asyncio.run(main())

ChatCompletions.create

async def create(
*,
model: str,
messages: Sequence[UserMessage | AssistantMessage | SystemMessage
| ToolMessage | DeveloperMessage],
stream: bool = False,
stream_cls: type[ChunkModelFactory[ChatCompletionChunk]] | None = None,
**kwargs: Any
) -> ChatCompletionResponse | AsyncIterable[ChatCompletionChunk]

Create a model response for the given chat conversation asynchronously.

This method handles the core functionality of the chat completions API, supporting both non-streaming and streaming responses in async contexts. It sends the provided messages and parameters to the Venice AI API and returns either a complete response or a stream of partial responses.

The method automatically formats the request body, applies appropriate defaults, and selects standard or streaming handling based on the stream parameter.

Wraps POST /api/v1/chat/completions. Streaming requests use the same path with Server-Sent Events.
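
Since streaming requests arrive as Server-Sent Events, the following sketch shows one way such a stream could be decoded into text. It is not the SDK's implementation. It assumes the OpenAI-style convention of one JSON object per ``data:`` line and a ``[DONE]`` end marker, which this reference does not confirm. ``iter_sse_payloads`` and ``collect_text`` are hypothetical helpers, and the sample lines are hand-written.

```python
import json
from typing import Any, Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield the decoded JSON payload of each SSE ``data:`` line.

    Assumption: one JSON object per ``data:`` line and a ``[DONE]``
    sentinel at the end of the stream (OpenAI-style convention).
    """
    for raw in lines:
        line = raw.strip()
        # Skip blank separators, comments and any non-data fields.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the content deltas carried by the stream."""
    parts = []
    for chunk in iter_sse_payloads(lines):
        delta = chunk["choices"][0]["delta"]
        # A delta may omit content or carry None; treat both as empty.
        parts.append(delta.get("content") or "")
    return "".join(parts)


# Hand-written sample stream, not captured from the API.
SAMPLE = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
print(collect_text(SAMPLE))
```

In practice the SDK hides this layer: with stream=True the caller iterates ChatCompletionChunk objects and reads chunk.choices[0].delta.content directly.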

Arguments:

  • model - ID of the model to use. Resolve via client.models.resolve_chat() rather than hardcoding.
  • messages - Sequence of messages forming the conversation. Each entry is one of :class:UserMessage, :class:AssistantMessage, :class:SystemMessage, :class:ToolMessage, or :class:DeveloperMessage.
  • stream - If True, stream back partial progress as AsyncIterable[ChatCompletionChunk]. Defaults to False, which returns a single :class:ChatCompletionResponse.
  • stream_cls - Optional stream wrapper class for streaming responses. Must conform to the :class:~venice_ai.types.api.streaming.ChunkModelFactory protocol. Defaults to :class:~venice_ai.streaming.Stream.
  • frequency_penalty - Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far.
  • max_completion_tokens - Maximum number of completion tokens to generate. On reasoning-capable models this is a strict cap on total completion tokens — visible output plus internal reasoning tokens — not just the visible output. (max_tokens was accepted as an alias in v1 but is removed in v2; passing it raises TypeError.)
  • n - Number of chat completion choices to generate for each input message.
  • presence_penalty - Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far.
  • response_format - Specifies the format the model must output (e.g. for JSON mode). Accepts :class:JSONSchemaFormat, :class:JSONObjectFormat, :class:TextResponseFormat, or a Pydantic BaseModel subclass (auto-converted to a strict JSON schema).
  • seed - Random seed for reproducible outputs.
  • stop - Up to 4 sequences where the API will stop generating further tokens.
  • temperature - Sampling temperature between 0.0 and 2.0. Higher values make output more random, lower values more focused and deterministic. Defaults to 0.7 server-side.
  • top_p - Nucleus sampling parameter between 0.0 and 1.0. Defaults to 1.0 server-side.
  • max_temp - Upper bound for dynamic temperature scaling (0.0-2.0). Used with min_temp to let the model adjust temperature per-token within a range instead of using a fixed temperature.
  • min_temp - Lower bound for dynamic temperature scaling (0.0-2.0). See max_temp.
  • min_p - Minimum probability threshold (0.0-1.0) relative to the most likely token, used as an alternative to top_p.
  • tools - List of tools the model may call.
  • tool_choice - Controls which (if any) tool is called by the model. Can be "none", "auto", or a :class:SpecificToolChoice.
  • user - Unique identifier representing your end-user (discarded by API but supported for OpenAI compatibility).
  • venice_parameters - Venice-specific parameters for fine-tuning model behavior.
  • reasoning_effort - Controls thinking depth on reasoning models. One of "none", "minimal", "low", "medium", "high", "xhigh", or "max". Takes precedence over reasoning.effort when both are set.
  • reasoning - Nested reasoning configuration. Accepts effort (same enum as reasoning_effort) and summary ("auto" / "concise" / "detailed").
  • prompt_cache_key - Routing hint to improve cache hit rates across multi-turn conversations. Requests sharing the same key are more likely to hit cached prompt prefixes.
  • prompt_cache_retention - Cache retention tier. "default" uses the standard TTL; "extended" or "24h" keep the prompt cached for longer, improving hit rates for long-running agents at a small storage premium.
  • store - OpenAI-compat flag forwarded to the upstream model; controls whether the completion is stored server-side for replay.
  • text - OpenAI-compat text configuration (e.g. {"verbosity": "low"}). Forwarded verbatim.
  • include - OpenAI-compat inclusion specifier; an array of response-enrichment opt-in strings passed through to the model.
  • metadata - OpenAI-compat free-form metadata dict attached to the request. Forwarded verbatim - useful for client-side observability.
  • logprobs - Whether to return log probabilities of the output tokens.
  • top_logprobs - Number of most likely tokens to return at each token position if logprobs is True.
  • parallel_tool_calls - Whether to enable parallel function calling during tool use.
  • repetition_penalty - Penalty for token repetition.
  • stop_token_ids - List of token IDs at which to stop generation.
  • top_k - Number of highest probability vocabulary tokens to keep for top-k-filtering.
  • stream_options - Additional options for controlling streaming behavior.
  • e2ee - Engage Venice confidential-compute (TEE) end-to-end encryption. The flow runs when e2ee is truthy OR when venice_parameters.enable_e2ee is True; it requires an e2ee-* model and the [e2ee] extra (cryptography). True uses defaults; pass a :class:~venice_ai.tee.types.TeeOptions to control the attestation freshness nonce or supply a full quote verifier. When engaged the SDK verifies the model's attestation (fail-closed), encrypts each user/system message to the model key, forces a wire stream with the X-Venice-TEE-* headers, and decrypts the response locally (reassembling a normal :class:ChatCompletionResponse when stream is False). Tools, web search/scraping, and multimodal content are rejected with :class:~venice_ai.exceptions.InvalidRequestError before any network call, and the Venice system prompt is forced off. SECURITY LIMITATION: the baseline attestation verifier trusts Venice's server-side verified claim and does not perform full client-side TDX / NVIDIA quote verification; a one-time :class:UserWarning is emitted on engagement.
  • kwargs - Additional keyword arguments forwarded to the request body for forward-compatibility.
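
The numeric ranges listed above can be checked on the caller's side before a request is sent. The sketch below is illustrative only: ``check_sampling_params`` is a hypothetical helper, not part of the SDK, and it assumes the documented bounds are inclusive. The SDK and the API may validate differently.

```python
from typing import Any

# Ranges as stated in the argument list above; inclusivity is an assumption.
RANGES: dict[str, tuple[float, float]] = {
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "min_p": (0.0, 1.0),
    "min_temp": (0.0, 2.0),
    "max_temp": (0.0, 2.0),
}
MAX_STOP_SEQUENCES = 4


def check_sampling_params(**params: Any) -> list[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for name, (low, high) in RANGES.items():
        value = params.get(name)
        if value is not None and not low <= value <= high:
            problems.append(f"{name}={value} is outside [{low}, {high}]")
    stop = params.get("stop")
    if isinstance(stop, (list, tuple)) and len(stop) > MAX_STOP_SEQUENCES:
        problems.append(
            f"stop has {len(stop)} sequences; at most {MAX_STOP_SEQUENCES} are allowed"
        )
    # The reference states that max_tokens raises TypeError in v2.
    if "max_tokens" in params:
        problems.append("max_tokens was removed in v2; use max_completion_tokens")
    return problems


print(check_sampling_params(temperature=0.7, top_p=0.9))
print(check_sampling_params(temperature=2.5, max_tokens=100))
```

Returning a list rather than raising lets a caller report every problem at once; raising on the first problem would be an equally reasonable design.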

Returns:

:class:~venice_ai.types.api.chat.ChatCompletionResponse when stream is False, otherwise an AsyncIterable of :class:~venice_ai.types.api.streaming.ChatCompletionChunk.

Raises:

  • InvalidRequestError - If parameters are invalid or malformed.
  • AuthenticationError - If the API key is invalid or missing.
  • PermissionDeniedError - If access is denied to the requested model or feature.
  • NotFoundError - If the model or resource is not found.
  • RateLimitError - If rate limits are exceeded for the account.
  • TypeError - If the legacy max_tokens kwarg is supplied (use max_completion_tokens in v2).
  • APIError - For other API-related errors not covered by specific exceptions.

Example:

.. code-block:: python

import asyncio from venice_ai import VeniceClient

async def main(): async with VeniceClient() as client: model = await client.models.resolve_chat() response = await client.chat.completions.create( model=model, messages=[

  • {"role" - "system", "content": "You are a helpful assistant."},

  • {"role" - "user", "content": "Explain async programming in Python."}, ], temperature=0.3, ) print(response.choices[0].message.content)

    asyncio.run(main())

    Streaming variant

    async def stream_example(): async with VeniceClient() as client: model = await client.models.resolve_chat() async for chunk in await client.chat.completions.create( model=model,

  • messages=[{"role" - "user", "content": "Tell me a story."}], stream=True, max_completion_tokens=200, ): content = chunk.choices[0].delta.content or "" if content: print(content, end="", flush=True)

    asyncio.run(stream_example())

ChatCompletions.stream

async def stream(*,
model: str,
messages: Sequence[UserMessage | AssistantMessage
| SystemMessage | ToolMessage
| DeveloperMessage],
e2ee: bool | TeeOptions = False,
**kwargs: Any) -> ChatStream

Shorthand for create(stream=True) returning a :class:~venice_ai.streaming.ChatStream.

The returned stream supports async context manager, :meth:~ChatStream.text_deltas, and :meth:~ChatStream.collect:

    async with await client.chat.completions.stream(model=model, messages=messages) as s:
        async for text in s.text_deltas():
            print(text, end="")

Wraps POST /api/v1/chat/completions (Server-Sent Events).

Arguments:

  • model - Model id to use.
  • messages - Conversation messages.
  • e2ee - Engage the Venice confidential-compute (TEE) end-to-end encryption flow. True uses defaults; pass a :class:~venice_ai.tee.types.TeeOptions to control the attestation nonce or supply a full quote verifier. Requires an e2ee-* model and the [e2ee] extra; the deltas yielded by the returned stream are already decrypted plaintext. See :meth:create for the engagement rules and limitations.
  • kwargs - All other parameters accepted by :meth:create.

Returns:

A :class:~venice_ai.streaming.ChatStream instance, ready to be used as an async context manager.

Raises:

  • InvalidRequestError - If parameters fail server-side validation, or if E2EE is requested on an incompatible request (a non-e2ee-* model, tools, web search/scraping, or multimodal content).
  • AuthenticationError - If the API key is missing or invalid.
  • RateLimitError - If account-level rate limits are exceeded.
  • APIError - For other HTTP-level failures.

ChatCompletions.estimate_cost

async def estimate_cost(*,
model: str,
messages: Sequence[UserMessage | AssistantMessage
| SystemMessage | ToolMessage
| DeveloperMessage],
expected_completion_tokens: int = 500,
tokens_per_word: float = 1.3) -> ChatCostEstimate

Estimate the USD cost of a chat completion before sending it.

Mirrors the pre-flight quote() method on client.video and client.audio for symmetry. Token counts are heuristic (word-count x tokens_per_word) - the same approximation used by :func:venice_ai.costs.estimate_completion_cost - so the result is an estimate, not a guarantee.

SDK-side helper. The pricing lookup wraps GET /api/v1/models via :meth:client.models.list; no other wire calls are made.
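
To make the word-count heuristic concrete, here is a minimal sketch of the arithmetic. It is not the SDK's code. The per-million-token pricing inputs, the rounding-up of token counts, the dict-shaped messages, and the names ``estimate_tokens``, ``CostSketch`` and ``sketch_estimate`` are all assumptions made for illustration, and the prices in the usage lines are invented.

```python
import math
from dataclasses import dataclass
from decimal import Decimal


def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Approximate tokens as word count x tokens_per_word (rounded up here)."""
    return math.ceil(len(text.split()) * tokens_per_word)


@dataclass(frozen=True)
class CostSketch:
    """Hypothetical stand-in for a prompt / completion cost breakdown."""

    prompt_cost_usd: Decimal
    completion_cost_usd: Decimal

    @property
    def total_cost_usd(self) -> Decimal:
        return self.prompt_cost_usd + self.completion_cost_usd


def sketch_estimate(
    messages: list[dict[str, str]],
    *,
    input_usd_per_million: Decimal,
    output_usd_per_million: Decimal,
    expected_completion_tokens: int = 500,
    tokens_per_word: float = 1.3,
) -> CostSketch:
    """Price the estimated prompt tokens plus the caller's completion budget."""
    prompt_tokens = sum(
        estimate_tokens(m["content"], tokens_per_word) for m in messages
    )
    million = Decimal(1_000_000)
    return CostSketch(
        prompt_cost_usd=Decimal(prompt_tokens) * input_usd_per_million / million,
        completion_cost_usd=(
            Decimal(expected_completion_tokens) * output_usd_per_million / million
        ),
    )


# Invented prices, for illustration only.
demo = sketch_estimate(
    [{"role": "user", "content": "one two three four five six seven eight nine ten"}],
    input_usd_per_million=Decimal("1.00"),
    output_usd_per_million=Decimal("2.00"),
)
print(demo.prompt_cost_usd, demo.completion_cost_usd, demo.total_cost_usd)
```

Using Decimal for the money values avoids binary floating-point drift when the estimate is compared against a budget threshold.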

Arguments:

  • model - Chat model id whose pricing to look up.
  • messages - The messages you intend to send.
  • expected_completion_tokens - Caller's budget for the model's response. Defaults to 500.
  • tokens_per_word - Word -> token conversion ratio. Default 1.3 is tuned for English; raise for code/CJK.

Returns:

:class:~venice_ai.costs.ChatCostEstimate with the prompt / completion / total USD breakdown.

Raises:

  • ValueError - If model is not present in :meth:client.models.list or has no LLM token-based pricing (the estimate would otherwise be meaningless).

  • APIError - For HTTP-level failures while fetching the model catalog.

Example:

    from decimal import Decimal

    estimate = await client.chat.completions.estimate_cost(
        model=await client.models.resolve_chat(),
        messages=[UserMessage(content="Summarize this contract...")],
        expected_completion_tokens=1500,
    )
    # BudgetError stands for an application-defined exception.
    if estimate.total_cost_usd > Decimal("0.10"):
        raise BudgetError(f"Too expensive: ${estimate.total_cost_usd}")

ChatCompletions.parse

async def parse(*,
model: str,
messages: Sequence[UserMessage | AssistantMessage
| SystemMessage | ToolMessage
| DeveloperMessage],
response_format: type[T],
schema_name: str | None = None,
strict: bool = True,
**kwargs: Any) -> ParsedChatCompletion[T]

Auto-validating sibling of :meth:create for structured output.

Pass a Pydantic BaseModel subclass as response_format; the SDK builds the JSON Schema from it, sends the request, and validates the first choice's content against the schema before returning. Errors surface at this callsite (pydantic.ValidationError) instead of downstream when the caller would otherwise have run :meth:ChatCompletionResponse.parse_as themselves.

Wraps POST /api/v1/chat/completions with response_format={"type": "json_schema", ...}.
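
The validate-before-returning step can be pictured with a small stand-in. This sketch deliberately swaps Pydantic for a stdlib dataclass so that it runs without third-party packages. It is not how the SDK validates, it raises ValueError rather than pydantic.ValidationError, and both ``parse_first_choice`` and the dict-shaped response are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, fields
from typing import Any, TypeVar

T = TypeVar("T")


@dataclass
class Person:
    name: str
    age: int


def parse_first_choice(response: dict[str, Any], model_cls: type[T]) -> T:
    """Decode the first choice's text content and check it field by field."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    content = choices[0]["message"].get("content")
    if not isinstance(content, str):
        # Mirrors the documented failure for tool-call-only choices.
        raise ValueError("first choice has no text content to parse")
    data = json.loads(content)
    values = {}
    for field in fields(model_cls):
        if field.name not in data:
            raise ValueError(f"missing field: {field.name}")
        if not isinstance(data[field.name], field.type):
            raise ValueError(f"{field.name} should be {field.type.__name__}")
        values[field.name] = data[field.name]
    return model_cls(**values)


# Hand-written response, not captured from the API.
ok = {"choices": [{"message": {"content": '{"name": "Marie Curie", "age": 66}'}}]}
person = parse_first_choice(ok, Person)
print(person.name, person.age)
```

The point of validating at this callsite is that a malformed model reply fails immediately, next to the request that produced it, rather than later when the data is first used.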

Arguments:

  • model - Chat model id.
  • messages - Conversation messages.
  • response_format - A Pydantic BaseModel subclass describing the desired response shape.
  • schema_name - Optional schema name sent to the API. Defaults to response_format.__name__. Some providers display this name in tool-style UIs.
  • strict - Whether to set strict: true in the JSON Schema payload. Recommended; the API may then reject responses that don't match the schema rather than silently returning bad JSON.
  • kwargs - All other keyword arguments accepted by :meth:create (temperature, max_completion_tokens, tools, etc.) are forwarded unchanged. stream is rejected - :meth:parse does not support streaming.

Returns:

:class:~venice_ai.types.api.chat.ParsedChatCompletion with the validated parsed instance and the underlying :class:ChatCompletionResponse.

Raises:

  • ValueError - If stream=True is passed (use :meth:stream instead) or the model returns a tool-call-only / multimodal choice with no text content.

  • TypeError - If :meth:create returns an unexpected non- :class:ChatCompletionResponse value.

  • pydantic.ValidationError - If the model's response doesn't match response_format.

  • InvalidRequestError - If parameters fail server-side validation.

  • AuthenticationError - If the API key is missing or invalid.

  • RateLimitError - If account-level rate limits are exceeded.

  • APIError - For other HTTP-level failures.

Example:

    from pydantic import BaseModel

    class Person(BaseModel):
        name: str
        age: int

    result = await client.chat.completions.parse(
        model=await client.models.resolve_chat(),
        messages=[UserMessage(content="Tell me about Marie Curie.")],
        response_format=Person,
    )
    person: Person = result.parsed
    print(result.usage.total_tokens, person.name, person.age)

ChatCompletions.batch

async def batch(
requests: Sequence[dict[str, Any]],
*,
max_concurrency: int = 10,
return_exceptions: bool = True
) -> list[ChatCompletionResponse | BaseException]

Run many :meth:create calls in parallel with bounded concurrency.

Each entry in requests is a kwargs dict unpacked into :meth:create. Result order matches input order. By default, per-request exceptions are collected into the result list (mirroring asyncio.gather(return_exceptions=True)) so a single failure does not abort the whole batch - set return_exceptions=False for all-or-nothing semantics.

Streaming is rejected: an entry with stream=True results in a ValueError for that slot (or aborts the batch when return_exceptions=False). Use :meth:stream directly inside asyncio.gather if you need concurrent streams.

SDK-side helper that fans out across :meth:create; each child call wraps POST /api/v1/chat/completions.
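
The bounded fan-out described above can be sketched with asyncio.Semaphore and asyncio.gather. This is an illustration under stated assumptions, not the SDK's implementation: ``bounded_batch`` and ``fake_create`` are hypothetical names, and unlike the documented behavior, plain asyncio.gather does not cancel the remaining tasks when return_exceptions=False.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def bounded_batch(
    call: Callable[..., Awaitable[Any]],
    requests: Sequence[dict[str, Any]],
    *,
    max_concurrency: int = 10,
    return_exceptions: bool = True,
) -> list[Any]:
    """Run call(**kwargs) for each request, at most max_concurrency at a time."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be >= 1")
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(kwargs: dict[str, Any]) -> Any:
        if kwargs.get("stream"):
            # Per-slot rejection, as described for streaming entries.
            raise ValueError("streaming is not supported in a batch")
        async with semaphore:
            return await call(**kwargs)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(
        *(run_one(kwargs) for kwargs in requests),
        return_exceptions=return_exceptions,
    )


async def fake_create(**kwargs: Any) -> str:
    """Hypothetical stand-in for a network call."""
    await asyncio.sleep(0)
    if kwargs["prompt"] == "boom":
        raise RuntimeError("simulated failure")
    return kwargs["prompt"].upper()


async def demo() -> list[Any]:
    return await bounded_batch(
        fake_create,
        [{"prompt": "a"}, {"prompt": "boom"}, {"prompt": "c"}],
        max_concurrency=2,
    )


results = asyncio.run(demo())
print(results)
```

With return_exceptions=True the failing slot holds the exception object while its neighbors still carry results, which is why callers check each entry with isinstance before reading it.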

Arguments:

  • requests - Sequence of kwargs dicts for :meth:create.
  • max_concurrency - Maximum concurrent in-flight requests (default 10). Must be >= 1.
  • return_exceptions - If True (default), exceptions for individual requests appear in their slot in the result list. If False, the first exception raises and cancels pending tasks.

Returns:

A list of :class:ChatCompletionResponse (and :class:BaseException instances when return_exceptions=True) in input order.

Raises:

  • ValueError - If max_concurrency < 1.

  • TypeError - If a child :meth:create returns an unexpected non-:class:ChatCompletionResponse value (with return_exceptions=False).

  • APIError - Raised from the first failing child call when return_exceptions=False.

Example:

    results = await client.chat.completions.batch(
        [
            {"model": model, "messages": [UserMessage(content=q)]}
            for q in questions
        ],
        max_concurrency=5,
    )
    for q, r in zip(questions, results, strict=True):
        if isinstance(r, BaseException):
            print(f"{q!r} failed: {r}")
        else:
            print(f"{q!r} -> {r.choices[0].message.content!r}")

ChatCompletions.run_with_tools

async def run_with_tools(*,
model: str,
messages: Sequence[UserMessage | AssistantMessage
| SystemMessage | ToolMessage
| DeveloperMessage],
tools: Sequence[Callable[..., Any] | Tool],
on_tool_call: Callable[[ToolCall, Any], None]
| None = None,
on_tool_error: Callable[[ToolCall, Exception], str]
| None = None,
parallel: bool = False,
max_iterations: int = 10,
**create_kwargs: Any) -> ToolLoopResult

Run an automatic tool-call loop until the model produces a final answer.

Drives the canonical "create -> check finish_reason -> execute tools -> re-call" loop so callers don't have to. tools accepts bare Python callables (auto-converted with :func:venice_ai.tool_from_function and registered as the dispatch handler) or pre-built :class:Tool definitions paired with a separate on_tool_call dispatcher; the two shapes can be mixed in one list.

Both sync and async tool callables are supported - they're detected with :func:inspect.iscoroutinefunction and invoked or awaited as appropriate. The caller's messages list is not mutated; the returned :class:ToolLoopResult exposes a fresh history copy along with the final response and iteration count.

SDK-side orchestrator that calls POST /api/v1/chat/completions once per iteration via :meth:create.
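
The loop described above can be sketched against a scripted fake model. This is a simplified illustration, not the SDK's orchestrator: ``tool_loop``, ``call_tool`` and ``fake_create`` are hypothetical, the dict-shaped messages follow the OpenAI-style tool-call layout as an assumption, tools run sequentially, and error hooks are omitted.

```python
import asyncio
import inspect
import json
from typing import Any, Callable


async def call_tool(func: Callable[..., Any], arguments: dict[str, Any]) -> Any:
    """Await async tools; call sync tools directly."""
    if inspect.iscoroutinefunction(func):
        return await func(**arguments)
    return func(**arguments)


async def tool_loop(create, messages, tools, max_iterations: int = 10):
    """Call the model, run requested tools, and repeat until a final answer."""
    history = list(messages)  # copy so the caller's list is not mutated
    for iteration in range(1, max_iterations + 1):
        response = await create(messages=history)
        choice = response["choices"][0]
        message = choice["message"]
        history.append(message)
        if choice["finish_reason"] != "tool_calls":
            return response, history, iteration
        for tool_call in message["tool_calls"]:
            function = tool_call["function"]
            result = await call_tool(
                tools[function["name"]], json.loads(function["arguments"])
            )
            history.append(
                {
                    "role": "tool",
                    "tool_call_id": tool_call["id"],
                    "content": json.dumps(result),
                }
            )
    raise RuntimeError("max_iterations reached while tools were still requested")


def add(a: int, b: int) -> int:
    return a + b


async def fake_create(*, messages):
    """Scripted model: request add(2, 3) once, then answer with the result."""
    if messages[-1]["role"] == "tool":
        text = f"The sum is {messages[-1]['content']}."
        return {"choices": [{"finish_reason": "stop",
                             "message": {"role": "assistant", "content": text}}]}
    call = {"id": "call_1", "type": "function",
            "function": {"name": "add", "arguments": '{"a": 2, "b": 3}'}}
    return {"choices": [{"finish_reason": "tool_calls",
                         "message": {"role": "assistant", "content": None,
                                     "tool_calls": [call]}}]}


initial = [{"role": "user", "content": "What is 2 + 3?"}]
response, history, iterations = asyncio.run(
    tool_loop(fake_create, initial, {"add": add})
)
print(response["choices"][0]["message"]["content"], iterations, len(initial))
```

Two round trips are made here: the first returns a tool call, the second returns the final answer. The caller's ``initial`` list keeps its single entry because the loop works on a copy.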

Arguments:

  • model - Model id to use for every iteration.
  • messages - Initial chat messages. Not mutated.
  • tools - Bare callables, :class:Tool definitions, or a mix. Bare callables are introspected with :func:tool_from_function.
  • on_tool_call - Optional observation hook invoked after each tool runs successfully - receives the :class:ToolCall and the handler's return value. Read-only; does not affect what the model sees.
  • on_tool_error - Optional override for tool-error handling. Defaults to :func:_default_on_tool_error, which logs the exception (with traceback) to the venice_ai.tools logger and returns a formatted string sent back to the model so it can recover. Pass a function that re-raises for strict propagation.
  • parallel - If True, multiple tool calls in one assistant response run concurrently via :func:asyncio.gather. Default False (sequential) to avoid surprise concurrency on tool functions that share state. Only set to True when handlers are concurrency-safe.
  • max_iterations - Maximum number of model round trips before giving up. Default 10.
  • create_kwargs - Forwarded to :meth:create on every iteration (e.g. temperature, max_completion_tokens, response_format, venice_parameters). stream is managed by this method and rejected if passed.

Returns:

A :class:ToolLoopResult with the terminal response, full message history, and round-trip count.

Raises:

  • MaxIterationsExceededError - If the loop hits max_iterations while still receiving finish_reason="tool_calls" responses.
  • ValueError - If stream is passed via create_kwargs, if a tool dispatch handler is missing for a tool the model called, if the model returns no choices, or if max_iterations < 1.
  • TypeError - If :meth:create returns an unexpected non-:class:ChatCompletionResponse value.
  • InvalidRequestError - If parameters fail server-side validation.
  • AuthenticationError - If the API key is missing or invalid.
  • RateLimitError - If account-level rate limits are exceeded.
  • APIError - For other HTTP-level failures during any iteration.