Venice AI Chat Completions API Resources.
This module provides asynchronous client interfaces for Venice AI's Chat Completions API, enabling sophisticated conversational AI interactions with advanced language models. The Chat Completions API supports both streaming and non-streaming responses, tool calling, structured output generation, and fine-grained control over model behavior.
Key Features:
- Conversational AI: Multi-turn chat conversations with context preservation
- Streaming Responses: Real-time token-by-token response generation
- Tool Integration: Function calling and external tool integration capabilities
- Structured Output: JSON schema-guided response formatting
- Model Variety: Access to multiple state-of-the-art language models
- Advanced Controls: Temperature, top-p, frequency penalties, and more
- Asynchronous Operations: Full async/await support for scalable applications
Supported Capabilities:
- Multi-turn Conversations: Maintain context across multiple exchanges
- System Instructions: Define AI behavior and personality through system messages
- Function Calling: Enable AI to call external functions and APIs
- Response Formatting: Control output structure with JSON schemas
- Content Filtering: Optional safety and content moderation features
- Reproducible Generation: Seed-based deterministic outputs
- Token Management: Precise control over response length and token usage
The chat completions system enables sophisticated AI applications including:
- Virtual Assistants: Intelligent chatbots and conversational interfaces
- Content Generation: Creative writing, documentation, and content creation
- Code Assistance: Programming help, code review, and technical guidance
- Data Analysis: Structured data processing and analysis workflows
- Decision Support: AI-powered recommendations and decision assistance
- Educational Tools: Tutoring systems and interactive learning platforms
Example:
.. code-block:: python

    import asyncio

    from venice_ai import VeniceClient


    async def chat_conversation():
        async with VeniceClient() as client:
            # Single-turn conversation
            response = await client.chat.completions.create(
                model="llama-3.3-70b",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Explain quantum computing"},
                ],
                temperature=0.7,
                max_completion_tokens=500,
            )
            print(response["choices"][0]["message"]["content"])


    asyncio.run(chat_conversation())

Performance Considerations:
- Streaming reduces latency for long responses
- Batching independent requests (see `ChatCompletions.batch`) runs them concurrently, which is usually faster than awaiting them one at a time
- Model selection affects both quality and response speed
- Token limits impact both cost and conversation length
Notes:
All operations in this module are asynchronous and require proper async/await
handling. The `ChatCompletions` class is accessed through the
:attr:`VeniceClient.chat.completions` property for proper authentication and configuration.
ChatCompletions Objects
class ChatCompletions(APIResource["VeniceClient"])
Provides access to asynchronous chat completion operations.
This class manages asynchronous chat completion operations with Venice AI models, supporting both standard (non-streaming) and streaming response formats. It serves as the primary interface for chat-based interactions with Venice AI language models in asynchronous contexts.
The class handles parameter validation, request formation, and response parsing for asynchronous chat completion requests.
Arguments:
- `_client` (`venice_ai._client.VeniceClient`) - The client instance used to make API requests.

Example:
.. code-block:: python

    from venice_ai import VeniceClient
    import asyncio


    async def main():
        # Initialize the async client
        async with VeniceClient(api_key="your-api-key") as client:
            # Create a chat completion asynchronously
            response = await client.chat.completions.create(
                model="venice-1",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Tell me about Venice AI."},
                ],
            )
            # Access the response content
            print(response["choices"][0]["message"]["content"])


    # Run the async function
    asyncio.run(main())

ChatCompletions.create
async def create(
*,
model: str,
messages: Sequence[UserMessage | AssistantMessage | SystemMessage
| ToolMessage | DeveloperMessage],
stream: bool = False,
stream_cls: type[ChunkModelFactory[ChatCompletionChunk]] | None = None,
**kwargs: Any
) -> ChatCompletionResponse | AsyncIterable[ChatCompletionChunk]
Create a model response for the given chat conversation asynchronously.
This method handles the core functionality of the chat completions API, supporting both standard (non-streaming) and streaming responses in async contexts. It sends the provided messages and parameters to the Venice AI API and returns either a complete response or a stream of partial responses.
The method automatically formats the request body, applies appropriate defaults,
and selects the standard or streaming request mode based on
the `stream` parameter.
Wraps `POST /api/v1/chat/completions`. Streaming requests use the
same path with Server-Sent Events.
Arguments:
- `model` - ID of the model to use. Resolve via `client.models.resolve_chat()` rather than hardcoding.
- `messages` - Sequence of messages forming the conversation. Each entry is one of :class:`UserMessage`, :class:`AssistantMessage`, :class:`SystemMessage`, :class:`ToolMessage`, or :class:`DeveloperMessage`.
- `stream` - If `True`, stream back partial progress as `AsyncIterable[ChatCompletionChunk]`. Defaults to `False`, which returns a single :class:`ChatCompletionResponse`.
- `stream_cls` - Optional stream wrapper class for streaming responses. Must conform to the :class:`~venice_ai.types.api.streaming.ChunkModelFactory` protocol. Defaults to :class:`~venice_ai.streaming.Stream`.
- `frequency_penalty` - Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far.
- `max_completion_tokens` - Maximum number of completion tokens to generate. On reasoning-capable models this is a strict cap on total completion tokens (visible output plus internal reasoning tokens), not just the visible output. (`max_tokens` was accepted as an alias in v1 but is removed in v2; passing it raises `TypeError`.)
- `n` - Number of chat completion choices to generate for each input message.
- `presence_penalty` - Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far.
- `response_format` - Specifies the format the model must output (e.g. for JSON mode). Accepts :class:`JSONSchemaFormat`, :class:`JSONObjectFormat`, :class:`TextResponseFormat`, or a Pydantic `BaseModel` subclass (auto-converted to a strict JSON schema).
- `seed` - Random seed for reproducible outputs.
- `stop` - Up to 4 sequences where the API will stop generating further tokens.
- `temperature` - Sampling temperature between 0.0 and 2.0. Higher values make output more random, lower values more focused and deterministic. Defaults to `0.7` server-side.
- `top_p` - Nucleus sampling parameter between 0.0 and 1.0. Defaults to `1.0` server-side.
- `max_temp` - Upper bound for dynamic temperature scaling (0.0-2.0). Used with `min_temp` to let the model adjust temperature per-token within a range instead of using a fixed `temperature`.
- `min_temp` - Lower bound for dynamic temperature scaling (0.0-2.0). See `max_temp`.
- `min_p` - Minimum probability threshold (0.0-1.0) relative to the most likely token, used as an alternative to `top_p`.
- `tools` - List of tools the model may call.
- `tool_choice` - Controls which (if any) tool is called by the model. Can be `"none"`, `"auto"`, or a :class:`SpecificToolChoice`.
- `user` - Unique identifier representing your end-user (discarded by API but supported for OpenAI compatibility).
- `venice_parameters` - Venice-specific parameters for adjusting model behavior.
- `reasoning_effort` - Controls thinking depth on reasoning models. One of `"none"`, `"minimal"`, `"low"`, `"medium"`, `"high"`, `"xhigh"`, or `"max"`. Takes precedence over `reasoning.effort` when both are set.
- `reasoning` - Nested reasoning configuration. Accepts `effort` (same enum as `reasoning_effort`) and `summary` (`"auto"` / `"concise"` / `"detailed"`).
- `prompt_cache_key` - Routing hint to improve cache hit rates across multi-turn conversations. Requests sharing the same key are more likely to hit cached prompt prefixes.
- `prompt_cache_retention` - Cache retention tier. `"default"` uses the standard TTL; `"extended"` or `"24h"` keep the prompt cached for longer, improving hit rates for long-running agents at a small storage premium.
- `store` - OpenAI-compat flag forwarded to the upstream model; controls whether the completion is stored server-side for replay.
- `text` - OpenAI-compat text configuration (e.g. `{"verbosity": "low"}`). Forwarded verbatim.
- `include` - OpenAI-compat inclusion specifier; an array of response-enrichment opt-in strings passed through to the model.
- `metadata` - OpenAI-compat free-form metadata dict attached to the request. Forwarded verbatim; useful for client-side observability.
- `logprobs` - Whether to return log probabilities of the output tokens.
- `top_logprobs` - Number of most likely tokens to return at each token position if `logprobs` is `True`.
- `parallel_tool_calls` - Whether to enable parallel function calling during tool use.
- `repetition_penalty` - Penalty for token repetition.
- `stop_token_ids` - List of token IDs at which to stop generation.
- `top_k` - Number of highest probability vocabulary tokens to keep for top-k filtering.
- `stream_options` - Additional options for controlling streaming behavior.
- `e2ee` - Engage Venice confidential-compute (TEE) end-to-end encryption. The flow runs when `e2ee` is truthy OR when `venice_parameters.enable_e2ee` is `True`; it requires an `e2ee-*` model and the `[e2ee]` extra (`cryptography`). `True` uses defaults; pass a :class:`~venice_ai.tee.types.TeeOptions` to control the attestation freshness nonce or supply a full quote verifier. When engaged the SDK verifies the model's attestation (fail-closed), encrypts each user/system message to the model key, forces a wire stream with the `X-Venice-TEE-*` headers, and decrypts the response locally (reassembling a normal :class:`ChatCompletionResponse` when `stream` is `False`). Tools, web search/scraping, and multimodal content are rejected with :class:`~venice_ai.exceptions.InvalidRequestError` before any network call, and the Venice system prompt is forced off. SECURITY LIMITATION: the baseline attestation verifier trusts Venice's server-side `verified` claim and does not perform full client-side TDX / NVIDIA quote verification; a one-time :class:`UserWarning` is emitted on engagement.
- `kwargs` - Additional keyword arguments forwarded to the request body for forward-compatibility.
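To make the relationship between `min_p` and `top_p` concrete, the sketch below applies both filters to a toy token distribution. It follows the common definition of min-p sampling (keep tokens whose probability is at least `min_p` times that of the most likely token). The function names are hypothetical and the server-side sampler may differ in detail.

```python
def min_p_filter(probs: dict, min_p: float) -> dict:
    """Keep tokens with probability >= min_p * (highest probability), then renormalize."""
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be between 0.0 and 1.0")
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def top_p_filter(probs: dict, top_p: float) -> dict:
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


toy = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(sorted(min_p_filter(toy, 0.2)))   # threshold is 0.2 * 0.5 = 0.1, so "zebra" is dropped
print(sorted(top_p_filter(toy, 0.75)))  # "the" + "a" already cover 0.75
```

The min-p cutoff scales with the model's confidence: when one token dominates, the threshold rises and fewer alternatives survive, whereas `top_p` always keeps a fixed share of probability mass.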
Returns:
:class:`~venice_ai.types.api.chat.ChatCompletionResponse` when
`stream` is `False`, otherwise an `AsyncIterable` of
:class:`~venice_ai.types.api.streaming.ChatCompletionChunk`.
Raises:
- `InvalidRequestError` - If parameters are invalid or malformed.
- `AuthenticationError` - If the API key is invalid or missing.
- `PermissionDeniedError` - If access is denied to the requested model or feature.
- `NotFoundError` - If the model or resource is not found.
- `RateLimitError` - If rate limits are exceeded for the account.
- `TypeError` - If the legacy `max_tokens` kwarg is supplied (use `max_completion_tokens` in v2).
- `APIError` - For other API-related errors not covered by specific exceptions.

Example:
.. code-block:: python

    import asyncio

    from venice_ai import VeniceClient


    async def main():
        async with VeniceClient() as client:
            model = await client.models.resolve_chat()
            response = await client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Explain async programming in Python."},
                ],
                temperature=0.3,
            )
            print(response.choices[0].message.content)


    asyncio.run(main())


    # Streaming variant
    async def stream_example():
        async with VeniceClient() as client:
            model = await client.models.resolve_chat()
            async for chunk in await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": "Tell me a story."}],
                stream=True,
                max_completion_tokens=200,
            ):
                content = chunk.choices[0].delta.content or ""
                if content:
                    print(content, end="", flush=True)


    asyncio.run(stream_example())

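For readers curious about what a Server-Sent Events stream looks like before the SDK turns it into chunks, here is a minimal, SDK-independent sketch that decodes `data:` lines and concatenates the content deltas. It assumes one `data:` line per event, an OpenAI-style `data: [DONE]` terminator, and a `choices[0].delta.content` chunk shape. `iter_sse_data` and `collect_text` are hypothetical helpers, not part of `venice_ai`.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each `data:` line until the end marker."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank event separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed OpenAI-style end-of-stream sentinel
            return
        yield json.loads(payload)


def collect_text(chunks: Iterable[dict]) -> str:
    """Concatenate the content deltas of the first choice in each chunk."""
    parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)


sample_stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "",
    "data: [DONE]",
]
print(collect_text(iter_sse_data(sample_stream)))  # prints: Hello, world
```

The `or ""` guard mirrors the streaming example above: the first chunk typically carries only the role, so its content delta is absent.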
ChatCompletions.stream
async def stream(*,
model: str,
messages: Sequence[UserMessage | AssistantMessage
| SystemMessage | ToolMessage
| DeveloperMessage],
e2ee: bool | TeeOptions = False,
**kwargs: Any) -> ChatStream
Shorthand for `create(stream=True)` returning a :class:`~venice_ai.streaming.ChatStream`.
The returned stream can be used as an async context manager and provides
:meth:`~ChatStream.text_deltas` and :meth:`~ChatStream.collect`:
.. code-block:: python

    async with await client.chat.completions.stream(model=model, messages=messages) as s:
        async for text in s.text_deltas():
            print(text, end="")

Wraps `POST /api/v1/chat/completions` (Server-Sent Events).
Arguments:
- `model` - Model id to use.
- `messages` - Conversation messages.
- `e2ee` - Engage the Venice confidential-compute (TEE) end-to-end encryption flow. `True` uses defaults; pass a :class:`~venice_ai.tee.types.TeeOptions` to control the attestation nonce or supply a full quote verifier. Requires an `e2ee-*` model and the `[e2ee]` extra; the deltas yielded by the returned stream are already decrypted plaintext. See :meth:`create` for the engagement rules and limitations.
- `kwargs` - All other parameters accepted by :meth:`create`.
Returns:
A :class:`~venice_ai.streaming.ChatStream` instance, ready to be
used as an async context manager and iterated.
Raises:
- `InvalidRequestError` - If parameters fail server-side validation, or if E2EE is requested on an incompatible request (non-`e2ee-*` model, tools, web search/scraping, or multimodal content).
- `AuthenticationError` - If the API key is missing or invalid.
- `RateLimitError` - If account-level rate limits are exceeded.
- `APIError` - For other HTTP-level failures.
ChatCompletions.estimate_cost
async def estimate_cost(*,
model: str,
messages: Sequence[UserMessage | AssistantMessage
| SystemMessage | ToolMessage
| DeveloperMessage],
expected_completion_tokens: int = 500,
tokens_per_word: float = 1.3) -> ChatCostEstimate
Estimate the USD cost of a chat completion before sending it.
Mirrors the pre-flight `quote()` method on `client.video` and
`client.audio` for symmetry. Token counts are heuristic
(word count x `tokens_per_word`), the same approximation used
by :func:`venice_ai.costs.estimate_completion_cost`, so the
result is an estimate, not a guarantee.
SDK-side helper. The pricing lookup wraps
`GET /api/v1/models` via :meth:`client.models.list`; no other
wire calls are made.
Arguments:
- `model` - Chat model id whose pricing to look up.
- `messages` - The messages you intend to send.
- `expected_completion_tokens` - Caller's budget for the model's response. Defaults to `500`.
- `tokens_per_word` - Word -> token conversion ratio. Default `1.3` is tuned for English; raise for code/CJK.
Returns:
:class:`~venice_ai.costs.ChatCostEstimate` with the prompt /
completion / total USD breakdown.
Raises:
- `ValueError` - If `model` is not present in :meth:`client.models.list` or has no LLM token-based pricing (the estimate would otherwise be meaningless).
- `APIError` - For HTTP-level failures while fetching the model catalog.

Example:
.. code-block:: python

    from decimal import Decimal

    estimate = await client.chat.completions.estimate_cost(
        model=await client.models.resolve_chat(),
        messages=[UserMessage(content="Summarize this contract...")],
        expected_completion_tokens=1500,
    )
    # BudgetError stands in for an application-defined exception.
    if estimate.total_cost_usd > Decimal("0.10"):
        raise BudgetError(f"Too expensive: ${estimate.total_cost_usd}")

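The word-count heuristic described above can be sketched without the SDK. The prices, the per-million-token pricing unit, the rounding step, and the `estimate_cost_sketch` name are all assumptions made for illustration; the real helper reads pricing from the model catalog and may round differently.

```python
from decimal import Decimal


def estimate_cost_sketch(
    messages,
    *,
    input_usd_per_million: Decimal,
    output_usd_per_million: Decimal,
    expected_completion_tokens: int = 500,
    tokens_per_word: float = 1.3,
) -> dict:
    """Approximate prompt tokens as word count x tokens_per_word, then price both sides."""
    words = sum(len(m["content"].split()) for m in messages)
    prompt_tokens = round(words * tokens_per_word)  # rounding is an assumption
    million = Decimal(1_000_000)
    prompt_cost = Decimal(prompt_tokens) * input_usd_per_million / million
    completion_cost = Decimal(expected_completion_tokens) * output_usd_per_million / million
    return {
        "prompt_tokens": prompt_tokens,
        "prompt_cost_usd": prompt_cost,
        "completion_cost_usd": completion_cost,
        "total_cost_usd": prompt_cost + completion_cost,
    }


# Ten words of prompt text, with hypothetical prices of $1 / $2 per million tokens.
demo_messages = [{"role": "user", "content": "one two three four five six seven eight nine ten"}]
demo_estimate = estimate_cost_sketch(
    demo_messages,
    input_usd_per_million=Decimal("1.00"),
    output_usd_per_million=Decimal("2.00"),
    expected_completion_tokens=1000,
)
print(demo_estimate["prompt_tokens"], demo_estimate["total_cost_usd"])
```

Using `Decimal` for money avoids binary floating-point drift when comparing the total against a budget threshold, as the example above does.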
ChatCompletions.parse
async def parse(*,
model: str,
messages: Sequence[UserMessage | AssistantMessage
| SystemMessage | ToolMessage
| DeveloperMessage],
response_format: type[T],
schema_name: str | None = None,
strict: bool = True,
**kwargs: Any) -> ParsedChatCompletion[T]
Auto-validating sibling of :meth:`create` for structured output.
Pass a Pydantic `BaseModel` subclass as `response_format`; the
SDK builds the JSON Schema from it, sends the request, and
validates the first choice's content against the schema before
returning. Errors surface at this call site
(`pydantic.ValidationError`) rather than downstream, where the
caller would otherwise have run
:meth:`ChatCompletionResponse.parse_as` themselves.
Wraps `POST /api/v1/chat/completions` with
`response_format={"type": "json_schema", ...}`.
Arguments:
- `model` - Chat model id.
- `messages` - Conversation messages.
- `response_format` - A Pydantic `BaseModel` subclass describing the desired response shape.
- `schema_name` - Optional schema name sent to the API. Defaults to `response_format.__name__`. Some providers display this name in tool-style UIs.
- `strict` - Whether to set `strict: true` in the JSON Schema payload. Recommended; the API may then reject responses that don't match the schema rather than silently returning bad JSON.
- `kwargs` - All other keyword arguments accepted by :meth:`create` (`temperature`, `max_completion_tokens`, `tools`, etc.) are forwarded unchanged. `stream` is rejected; :meth:`parse` does not support streaming.
Returns:
:class:`~venice_ai.types.api.chat.ParsedChatCompletion` with
the validated `parsed` instance and the underlying
:class:`ChatCompletionResponse`.
Raises:
- `ValueError` - If `stream=True` is passed (use :meth:`stream` instead) or the model returns a tool-call-only / multimodal choice with no text content.
- `TypeError` - If :meth:`create` returns an unexpected non-:class:`ChatCompletionResponse` value.
- `pydantic.ValidationError` - If the model's response doesn't match `response_format`.
- `InvalidRequestError` - If parameters fail server-side validation.
- `AuthenticationError` - If the API key is missing or invalid.
- `RateLimitError` - If account-level rate limits are exceeded.
- `APIError` - For other HTTP-level failures.

Example:
.. code-block:: python

    from pydantic import BaseModel


    class Person(BaseModel):
        name: str
        age: int


    result = await client.chat.completions.parse(
        model=await client.models.resolve_chat(),
        messages=[UserMessage(content="Tell me about Marie Curie.")],
        response_format=Person,
    )
    person: Person = result.parsed
    print(result.usage.total_tokens, person.name, person.age)

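The "build the schema payload, then validate the first choice" flow can be illustrated with the standard library alone. This is only a stand-in: the SDK uses Pydantic and raises `pydantic.ValidationError`, whereas the sketch uses a dataclass and plain `ValueError` / `TypeError`. The nested `json_schema` layout is assumed from the OpenAI-compatible convention, and both helper names are hypothetical.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Person:
    name: str
    age: int


def build_response_format(name: str, schema: dict, strict: bool = True) -> dict:
    # Nested layout assumed from the OpenAI-compatible json_schema convention.
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": strict, "schema": schema},
    }


def parse_first_choice(response: dict, cls):
    """Validate the first choice's text content and build a cls instance from it."""
    content = response["choices"][0]["message"].get("content")
    if not content:
        raise ValueError("first choice has no text content to parse")
    data = json.loads(content)
    values = {}
    for field in fields(cls):
        if field.name not in data:
            raise ValueError(f"missing field: {field.name}")
        if not isinstance(data[field.name], field.type):
            raise TypeError(f"{field.name} must be {field.type.__name__}")
        values[field.name] = data[field.name]
    return cls(**values)


person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
    "additionalProperties": False,
}
request_format = build_response_format(Person.__name__, person_schema)
fake_response = {
    "choices": [{"message": {"content": '{"name": "Marie Curie", "age": 66}'}}]
}
person = parse_first_choice(fake_response, Person)
print(request_format["json_schema"]["name"], person)
```

Validating at the call site means a malformed model reply fails immediately, next to the request that produced it, rather than later when the data is consumed.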
ChatCompletions.batch
async def batch(
requests: Sequence[dict[str, Any]],
*,
max_concurrency: int = 10,
return_exceptions: bool = True
) -> list[ChatCompletionResponse | BaseException]
Run many :meth:`create` calls in parallel with bounded concurrency.
Each entry in `requests` is a kwargs dict unpacked into
:meth:`create`. Result order matches input order. By default,
per-request exceptions are collected into the result list
(mirroring `asyncio.gather(return_exceptions=True)`) so a single
failure does not abort the whole batch; set
`return_exceptions=False` for all-or-nothing semantics.
Streaming is rejected: an entry with `stream=True` results in a
`ValueError` for that slot (or aborts the batch when
`return_exceptions=False`). Use :meth:`stream` directly inside
`asyncio.gather` if you need concurrent streams.
SDK-side helper that fans out across :meth:`create`; each child
call wraps `POST /api/v1/chat/completions`.
Arguments:
- `requests` - Sequence of kwargs dicts for :meth:`create`.
- `max_concurrency` - Maximum concurrent in-flight requests (default `10`). Must be `>= 1`.
- `return_exceptions` - If `True` (default), exceptions for individual requests appear in their slot in the result list. If `False`, the first exception raises and cancels pending tasks.
Returns:
A list of :class:`ChatCompletionResponse` (and
:class:`BaseException` instances when
`return_exceptions=True`) in input order.
Raises:
- `ValueError` - If `max_concurrency < 1`.
- `TypeError` - If a child :meth:`create` returns an unexpected non-:class:`ChatCompletionResponse` value (with `return_exceptions=False`).
- `APIError` - First child failure when `return_exceptions=False`.

Example:
.. code-block:: python

    results = await client.chat.completions.batch(
        [
            {"model": model, "messages": [UserMessage(content=q)]}
            for q in questions
        ],
        max_concurrency=5,
    )
    for q, r in zip(questions, results, strict=True):
        if isinstance(r, BaseException):
            print(f"{q!r} failed: {r}")
        else:
            print(f"{q!r} -> {r.choices[0].message.content!r}")

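The bounded fan-out described above is a standard asyncio pattern, sketched here with a semaphore and `asyncio.gather`. This shows one plausible way to get the documented semantics, not necessarily how the SDK implements them. `bounded_gather` and `fake_create` are hypothetical names, and `fake_create` merely simulates a request.

```python
import asyncio


async def bounded_gather(factories, *, max_concurrency=10, return_exceptions=True):
    """Await every factory() with at most max_concurrency calls in flight at once."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be >= 1")
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(factory):
        async with semaphore:  # waits while max_concurrency calls are running
            return await factory()

    tasks = [asyncio.create_task(run_one(f)) for f in factories]
    try:
        # gather preserves input order regardless of completion order
        return await asyncio.gather(*tasks, return_exceptions=return_exceptions)
    except BaseException:
        for task in tasks:  # all-or-nothing mode: cancel whatever is still pending
            task.cancel()
        raise


async def fake_create(**kwargs):
    """Hypothetical stand-in for client.chat.completions.create."""
    await asyncio.sleep(0.01)
    if kwargs["prompt"] == "bad":
        raise RuntimeError("simulated failure")
    return kwargs["prompt"].upper()


async def demo():
    requests = [{"prompt": "a"}, {"prompt": "bad"}, {"prompt": "c"}]
    factories = [lambda r=r: fake_create(**r) for r in requests]
    return await bounded_gather(factories, max_concurrency=2)


results = asyncio.run(demo())
print(results)
```

Passing zero-argument factories rather than coroutine objects keeps each request from being created until its task exists, and the `r=r` default binds each request dict at definition time.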
ChatCompletions.run_with_tools
async def run_with_tools(*,
model: str,
messages: Sequence[UserMessage | AssistantMessage
| SystemMessage | ToolMessage
| DeveloperMessage],
tools: Sequence[Callable[..., Any] | Tool],
on_tool_call: Callable[[ToolCall, Any], None]
| None = None,
on_tool_error: Callable[[ToolCall, Exception], str]
| None = None,
parallel: bool = False,
max_iterations: int = 10,
**create_kwargs: Any) -> ToolLoopResult
Run an automatic tool-call loop until the model produces a final answer.
Drives the canonical "create -> check `finish_reason` -> execute
tools -> re-call" loop so callers don't have to. `tools` accepts
bare Python callables (auto-converted with
:func:`venice_ai.tool_from_function` and registered as the
dispatch handler) or pre-built :class:`Tool` definitions paired
with a separate `on_tool_call` dispatcher; the two shapes can be
mixed in one list.
Both sync and async tool callables are supported: they are
detected with :func:`inspect.iscoroutinefunction` and invoked or
awaited as appropriate. The caller's `messages` list is not
mutated; the returned :class:`ToolLoopResult` exposes a fresh
history copy along with the final response and iteration count.
SDK-side orchestrator that calls
`POST /api/v1/chat/completions` once per iteration via
:meth:`create`.
Arguments:
- `model` - Model id to use for every iteration.
- `messages` - Initial chat messages. Not mutated.
- `tools` - Bare callables, :class:`Tool` definitions, or a mix. Bare callables are introspected with :func:`tool_from_function`.
- `on_tool_call` - Optional observation hook invoked after each tool runs successfully; it receives the :class:`ToolCall` and the handler's return value. Read-only; does not affect what the model sees.
- `on_tool_error` - Optional override for tool-error handling. Defaults to :func:`_default_on_tool_error`, which logs the exception (with traceback) to the `venice_ai.tools` logger and returns a formatted string sent back to the model so it can recover. Pass a function that re-raises for strict propagation.
- `parallel` - If `True`, multiple tool calls in one assistant response run concurrently via :func:`asyncio.gather`. Default `False` (sequential) to avoid surprise concurrency on tool functions that share state. Only set to `True` when handlers are concurrency-safe.
- `max_iterations` - Maximum number of model round trips before giving up. Default `10`.
- `create_kwargs` - Forwarded to :meth:`create` on every iteration (e.g. `temperature`, `max_completion_tokens`, `response_format`, `venice_parameters`). `stream` is managed by this method and rejected if passed.
Returns:
A :class:`ToolLoopResult` with the terminal response, full
message history, and round-trip count.
Raises:
- `MaxIterationsExceededError` - If the loop hits `max_iterations` while still receiving `finish_reason="tool_calls"` responses.
- `ValueError` - If `stream` is passed via `create_kwargs`, if a tool dispatch handler is missing for a tool the model called, if the model returns no choices, or if `max_iterations < 1`.
- `TypeError` - If :meth:`create` returns an unexpected non-:class:`ChatCompletionResponse` value.
- `InvalidRequestError` - If parameters fail server-side validation.
- `AuthenticationError` - If the API key is missing or invalid.
- `RateLimitError` - If account-level rate limits are exceeded.
- `APIError` - For other HTTP-level failures during any iteration.
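
The loop that `run_with_tools` automates can be sketched against a scripted fake model. The message and tool-call dict shapes below follow the OpenAI-style convention and are assumptions, as are the names `run_tool_loop` and `scripted_create`. A plain `RuntimeError` stands in for `MaxIterationsExceededError`, and tool calls run sequentially, as in the `parallel=False` default.

```python
import asyncio
import inspect
import json


async def run_tool_loop(create, messages, handlers, *, max_iterations=10):
    """Call create() repeatedly, executing requested tools, until a final answer."""
    history = list(messages)  # work on a copy so the caller's list is not mutated
    for iteration in range(1, max_iterations + 1):
        response = await create(messages=history)
        choice = response["choices"][0]
        history.append(choice["message"])
        if choice["finish_reason"] != "tool_calls":
            return response, history, iteration
        for call in choice["message"]["tool_calls"]:
            handler = handlers[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            if inspect.iscoroutinefunction(handler):
                result = await handler(**args)  # async tool
            else:
                result = handler(**args)  # sync tool
            history.append(
                {"role": "tool", "tool_call_id": call["id"], "content": str(result)}
            )
    # Stand-in for MaxIterationsExceededError
    raise RuntimeError(f"no final answer after {max_iterations} iterations")


def add(a: int, b: int) -> int:
    return a + b


async def scripted_create(*, messages):
    """Hypothetical fake model: requests add(2, 3) once, then answers."""
    last = messages[-1]
    if last["role"] == "tool":
        message = {"role": "assistant", "content": f"The sum is {last['content']}."}
        return {"choices": [{"finish_reason": "stop", "message": message}]}
    call = {"id": "call_1", "function": {"name": "add", "arguments": '{"a": 2, "b": 3}'}}
    message = {"role": "assistant", "content": None, "tool_calls": [call]}
    return {"choices": [{"finish_reason": "tool_calls", "message": message}]}


initial = [{"role": "user", "content": "What is 2 + 3?"}]
response, history, iterations = asyncio.run(
    run_tool_loop(scripted_create, initial, {"add": add})
)
print(response["choices"][0]["message"]["content"], iterations, len(history))
# prints: The sum is 5. 2 4
```

The iteration cap matters because a model can keep requesting tools indefinitely; bounding the round trips turns that into an explicit error instead of an unbounded bill.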