Advanced Features

This document covers the Venice AI SDK's enterprise-grade features for production deployments.

Rate Limiting

The SDK rate-limits automatically — no configuration is required. Behavior is controlled by RateLimiterConfig (venice_ai.rate_limiting.config), selected via RateLimiterMode:

SIMPLE (default) — in-memory and reactive: backs off on 429s (min_backoff/max_backoff, max_retries, failure-window circuit breaking). Single-process; no Redis.
ADAPTIVE — proactive (prevents 429s) with multi-worker coordination via Redis. Requires pip install venice-ai[adaptive] and a redis_url.
DISABLED — no rate limiting (testing only; not recommended in production).

Configuration (set on the client config's rate_limiter field):

from venice_ai import VeniceClient
from venice_ai.core.config import VeniceAIConfig
from venice_ai.rate_limiting.config import RateLimiterConfig, RateLimiterMode

# Default is SIMPLE — nothing to configure. To tune reactive backoff:
config = VeniceAIConfig(
    rate_limiter=RateLimiterConfig(min_backoff=1.0, max_backoff=60.0, max_retries=3)
)

# Proactive, multi-worker (requires the `adaptive` extra + Redis):
config = VeniceAIConfig(
    rate_limiter=RateLimiterConfig(
        mode=RateLimiterMode.ADAPTIVE,
        redis_url="redis://localhost:6379",
        account_id="acct_123",
    )
)

client = VeniceClient(config=config, api_key="your-key")

create_production_config() and the other presets in venice_ai.presets set a sensible RateLimiterConfig for you.
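
To make the "failure-window circuit breaking" mentioned for SIMPLE mode more concrete, here is a minimal sliding-window sketch. It is a generic illustration, not the SDK's implementation: the class name FailureWindowBreaker and the parameters max_failures and window_seconds are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque


class FailureWindowBreaker:
    """Opens when too many failures land inside a sliding time window (hypothetical helper)."""

    def __init__(
        self,
        max_failures: int = 3,
        window_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._clock = clock
        self._failures: Deque[float] = deque()

    def _prune(self) -> None:
        # Drop failures that have aged out of the window.
        cutoff = self._clock() - self.window_seconds
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()

    def record_failure(self) -> None:
        self._failures.append(self._clock())

    @property
    def is_open(self) -> bool:
        # Open means: stop sending requests until old failures expire.
        self._prune()
        return len(self._failures) >= self.max_failures
```

A caller would check is_open before each request and call record_failure() whenever a 429 comes back. The injectable clock exists only to make the behavior easy to test.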

Monitor rate limits:

response = await client.chat.completions.create(...)

if response.response_rate_limits:
    remaining = response.response_rate_limits.remaining_requests
    reset_time = response.response_rate_limits.reset_requests
    print(f"Remaining requests: {remaining}")
    print(f"Reset at: {reset_time}")

Error Handling & Retries

The SDK provides an exception hierarchy for precise error handling. Catch the most specific exceptions first and the broader APIError and VeniceError last:

from venice_ai.exceptions import (
    VeniceError,
    APIError,
    AuthenticationError,
    RateLimitError,
    InvalidRequestError,
    APITimeoutError,
    APIConnectionError
)

try:
    response = await client.chat.completions.create(...)
except AuthenticationError:
    print("Invalid API key - check your credentials")
except RateLimitError as e:
    print(f"Rate limit exceeded - retry after {e.retry_after_seconds} seconds")
except InvalidRequestError as e:
    print(f"Invalid request: {e}")
except APITimeoutError:
    print("Request timed out - consider increasing timeout")
except APIConnectionError:
    print("Network error - check your connection")
except APIError as e:
    print(f"API error: {e}")
except VeniceError as e:
    print(f"SDK error: {e}")

Retrying with backoff in application code:

import asyncio

from venice_ai.exceptions import APIConnectionError, APITimeoutError, RateLimitError

async def make_request_with_retry(client, max_retries=3):
    for attempt in range(max_retries + 1):
        try:
            return await client.chat.completions.create(...)
        except RateLimitError as e:
            if attempt < max_retries:
                # Prefer the server's hint; otherwise back off exponentially.
                wait_time = e.retry_after_seconds or 2 ** attempt
                await asyncio.sleep(wait_time)
                continue
            raise
        except (APITimeoutError, APIConnectionError):
            if attempt < max_retries:
                # Transient network failure: linear backoff.
                await asyncio.sleep(1 + attempt)
                continue
            raise

-> Full example: examples/basic/error_handling.py
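
A common refinement of an exponential schedule like the one above is to add random jitter, so that many clients hitting the same limit do not all retry at the same instant. The sketch below shows one way to compute the delay. The helper retry_delay and its parameters are hypothetical and not part of the SDK.

```python
import random
from typing import Optional


def retry_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to sleep before retry number `attempt` (0-based). Hypothetical helper."""
    if retry_after is not None and retry_after > 0:
        # A server-provided hint wins, but never sleep longer than the cap.
        return min(retry_after, cap)
    rng = rng or random.Random()
    # Full jitter: pick uniformly in [0, min(cap, base * 2**attempt)].
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Inside a retry loop this would replace the fixed wait, for example by passing the exception's retry_after_seconds value as retry_after.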

Distributed State Management

For multi-instance deployments, use the Redis backend.

Key features:

Per-event-loop connection pooling
Distributed rate limit coordination
Cross-instance state synchronization
Header-based state sync with fallback to release-only mode on validation failures

Setup:

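# BackendConfig and BackendType are assumed to live in the same module as
# RedisBackendConfig; adjust the import if your SDK version differs.
from venice_ai.core.config import (
    BackendConfig,
    BackendType,
    RedisBackendConfig,
    VeniceAIConfig,
)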

redis_config = RedisBackendConfig(
    redis_url="redis://localhost:6379",
    max_connections=20,
    default_ttl=3600,
    key_prefix="venice:v2:",
    connection_timeout=5.0
)

config = VeniceAIConfig(
    backend=BackendConfig(
        backend_type=BackendType.REDIS,
        redis=redis_config
    )
)
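
"Per-event-loop connection pooling" matters because asyncio connections are tied to the loop that created them, so a pool cannot safely be shared across loops. The sketch below illustrates the general pattern with the standard library only. FakePool stands in for a real Redis connection pool and pool_for_current_loop is a hypothetical helper. The SDK's internals may differ.

```python
import asyncio
import weakref


class FakePool:
    """Stand-in for a real connection pool (hypothetical)."""

    def __init__(self, url: str) -> None:
        self.url = url


# One pool per event loop; entries vanish when the loop is garbage-collected.
_pools: "weakref.WeakKeyDictionary[asyncio.AbstractEventLoop, FakePool]" = (
    weakref.WeakKeyDictionary()
)


def pool_for_current_loop(url: str = "redis://localhost:6379") -> FakePool:
    """Return the pool bound to the running loop, creating it on first use."""
    loop = asyncio.get_running_loop()
    pool = _pools.get(loop)
    if pool is None:
        pool = FakePool(url)
        _pools[loop] = pool
    return pool
```

Keying on the loop with weak references means a worker that starts a fresh loop gets a fresh pool, while stale pools are not kept alive by the registry.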

Retry Strategy

from venice_ai.middleware.retry import RetryOptions, create_retry_middleware

retry_options = RetryOptions(
    max_attempts=3,
    base_delay=1.0,
    retry_status_codes={500, 502, 503, 504},  # matches the RetryOptions default
)
retry_middleware = create_retry_middleware(retry_options)

Note: 429 is intentionally omitted from retry_status_codes. Rate-limit (429) retries are handled separately by SimpleRateLimiter, which honors the Retry-After header with its own backoff; adding 429 here would double-retry.
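
To illustrate the split described in the note, the sketch below separates the two decisions: which status codes count as transient, and how a Retry-After header can be turned into a wait time. HTTP allows the header to be either a number of seconds or an HTTP-date. Both function names are hypothetical, and this is not the SDK's own parsing code.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

RETRY_STATUS_CODES = {500, 502, 503, 504}  # mirrors the set shown above


def is_retryable_status(status: int) -> bool:
    """True for transient server errors; 429 is deliberately left to the rate limiter."""
    return status in RETRY_STATUS_CODES


def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header (delta-seconds or HTTP-date) to seconds."""
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # Unparseable header: let the caller fall back to backoff.
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```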

Monitoring & Observability

The venice_ai.observability package exposes Prometheus-style metrics for production monitoring. Health checks and OpenTelemetry tracing helpers are not bundled — wire your own (e.g. via the OpenTelemetry SDK / a sidecar) if you need them.

Enhanced Metrics

Production-focused Prometheus metrics covering streaming fallbacks, custom stream usage, and tier-discovery coalescing. The class exposes its counters/histograms/gauges as attributes so you can call the standard prometheus_client API on them (.labels(...).inc(), .observe(...), etc.):

from venice_ai.observability import EnhancedMetrics, EnhancedMetricsConfig

metrics_config = EnhancedMetricsConfig(
    enabled=True,
    include_detailed_metrics=True,
    prometheus_port=8000,
)
metrics = EnhancedMetrics(config=metrics_config)

# Counters / histograms are exposed as attributes:
metrics.streaming_fallback_total.labels(
    endpoint="chat.completions", reason="server_disconnect"
).inc()
metrics.custom_stream_duration_seconds.labels(stream_type="audio").observe(0.245)
metrics.tier_discovery_coalesced_total.inc()

Response Header Access

response = await client.chat.completions.create(...)

# Rate limits
if response.response_rate_limits:
    print(f"Remaining: {response.response_rate_limits.remaining_requests}")

# Deprecation warnings
if response.deprecation_info and response.deprecation_info.is_deprecated:
    print(f"Warning: {response.deprecation_info.warning}")

# Account balance
if response.balance_info:
    print(f"Balance: {response.balance_info.usd} USD")

-> Full example: examples/headers/header_access_example.py

Performance & Optimization

Connection Pooling

http_config = HttpClientConfig(
    max_connections=200,           # Total connection pool size
    max_keepalive_connections=50,  # Persistent connections
    timeout=30.0
)

Redis Optimization

redis_config = RedisBackendConfig(
    redis_url="redis://localhost:6379",
    max_connections=50,
    connection_timeout=5.0,
    max_retries=3,
    default_ttl=3600
)

Caching Strategies

from venice_ai.core.config import StateConfig, CachePolicy

state_config = StateConfig(
    cache_policy=CachePolicy.WRITE_BACK,
    cache_ttl=5.0,
    batch_size=100,
    enable_background_cleanup=True
)
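
For readers unfamiliar with the term, write-back caching generally means that writes land in memory first and reach the backing store later, in batches. The sketch below shows that general pattern. How the SDK maps cache_ttl and batch_size onto its own behavior is not documented here, so treat the WriteBackCache class and its semantics as assumptions.

```python
from typing import Callable, Dict


class WriteBackCache:
    """Buffers writes in memory and flushes them in batches (generic sketch)."""

    def __init__(self, flush: Callable[[Dict[str, int]], None], batch_size: int = 100) -> None:
        self._flush = flush          # e.g. a function that writes to Redis
        self._batch_size = batch_size
        self._dirty: Dict[str, int] = {}

    def set(self, key: str, value: int) -> None:
        self._dirty[key] = value     # later writes to the same key coalesce
        if len(self._dirty) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if self._dirty:
            batch, self._dirty = self._dirty, {}
            self._flush(batch)
```

The trade-off is fewer round trips to the store in exchange for a short window in which unflushed writes can be lost if the process dies.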

Rate Limit Tuning

scheduler_config = SchedulerConfig(
    mode=SchedulerMode.INTELLIGENT,
    max_concurrent_executions=100,
    max_queue_size=5000,
    rate_limit_buffer_ratio=0.9,
    overflow_policy="reject"
)
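
The interaction between a concurrency cap, a bounded queue, and a "reject" overflow policy can be sketched with asyncio alone. This is an assumption about what those settings mean in general, not a description of the SDK's scheduler. BoundedScheduler and QueueFull are hypothetical names, and rate_limit_buffer_ratio is not modelled.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class QueueFull(Exception):
    """Raised when the waiting queue is at capacity (hypothetical)."""


class BoundedScheduler:
    def __init__(self, max_concurrent: int, max_queue_size: int) -> None:
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_queue_size = max_queue_size
        self._waiting = 0

    async def run(self, job: Callable[[], Awaitable[T]]) -> T:
        # Reject immediately when all slots are busy and the queue is full.
        if self._slots.locked() and self._waiting >= self._max_queue_size:
            raise QueueFull("queue is full; rejecting instead of waiting")
        self._waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self._waiting -= 1
        try:
            return await job()
        finally:
            self._slots.release()
```

Rejecting early keeps latency bounded under overload: callers get a fast failure they can handle instead of waiting in an ever-growing queue.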

Tips

Use streaming for long responses - Reduce time to first token
Batch embeddings requests - More efficient than individual calls
Enable connection pooling - Reuse HTTP connections
Configure appropriate timeouts - Balance reliability and speed
Use Redis in production - Better performance for distributed state
Monitor queue depths - Adjust max_queue_size based on traffic
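
As a small illustration of the batching tip, the helper below splits a list of inputs into fixed-size groups, each of which could then be sent in a single embeddings call. The name batched and the batch size are hypothetical. The SDK call itself is not shown because its exact signature is not covered in this document.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```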

x402 Wallet Authentication

Venice's /x402/* billing endpoints use Sign-In-With-Ethereum (EIP-4361 SIWE) on Base chain (8453) instead of Bearer tokens. The SDK ships an optional helper, venice_ai.auth.x402.X402Auth, that builds the base64-encoded X-Sign-In-With-X header from a private key.

Install

pip install 'venice-ai[x402]'

This pulls eth-account (local Ethereum account + signing) and siwe (EIP-4361 message builder).

Usage

import os
from venice_ai import VeniceClient
from venice_ai.auth.x402 import X402Auth

auth = X402Auth(private_key=os.environ["WALLET_PRIVATE_KEY"])
# auth.wallet_address is derived from the key — no extra config needed.

async with VeniceClient() as client:
    bal = await client.x402.balance(auth=auth)
    print(f"${bal.data.balanceUsd} on {auth.wallet_address}")

    txns = await client.x402.transactions(auth=auth)
    for t in txns.data.transactions[:5]:
        print(t.createdAt, t.type, t.amount)

Private-key hygiene

Never hardcode the private key. Read it from an environment variable, a secret manager, or an HSM.
Never commit it. Add any file that contains the key to .gitignore, including .env, *.pem, and ad-hoc scratch files.
Use a dedicated wallet for Venice billing. Don't reuse your main wallet — minimise blast radius if a test key leaks.
Rotate periodically. Transfer the remaining balance to a new wallet, remove the old key from every place it was stored, and point the SDK at a new X402Auth.

What gets sent on the wire

Each balance() / transactions() call builds a fresh SIWE message with:

domain = outerface.venice.ai, uri = https://outerface.venice.ai, version = 1, chain_id = 8453, statement = "Sign in to Venice API"
A fresh 16-hex-char CSPRNG nonce (secrets.token_hex(8))
ISO-8601 issued_at and expiration_time 10 min apart (configurable via the ttl_seconds= kwarg)
A hex signature from the wallet over the EIP-191-encoded SIWE text
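
The fields listed above can be assembled into the EIP-4361 text layout as in the sketch below. It uses only the standard library and stops before signing and base64 encoding, which the SDK delegates to eth-account and siwe. The function name build_siwe_message is hypothetical, and the SDK's exact output may differ in detail.

```python
import secrets
from datetime import datetime, timedelta, timezone


def build_siwe_message(address: str, ttl_seconds: int = 600) -> str:
    """Assemble an EIP-4361 message with the field values listed above (unsigned)."""
    issued_at = datetime.now(timezone.utc).replace(microsecond=0)
    expires_at = issued_at + timedelta(seconds=ttl_seconds)
    nonce = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return "\n".join([
        "outerface.venice.ai wants you to sign in with your Ethereum account:",
        address,
        "",
        "Sign in to Venice API",
        "",
        "URI: https://outerface.venice.ai",
        "Version: 1",
        "Chain ID: 8453",
        f"Nonce: {nonce}",
        f"Issued At: {issued_at.isoformat().replace('+00:00', 'Z')}",
        f"Expiration Time: {expires_at.isoformat().replace('+00:00', 'Z')}",
    ])
```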

The private key itself never appears in the header or in any SDK log — the scrubber in _client.py redacts X-Sign-In-With-X (alongside Authorization) at DEBUG-level request logging.
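
If you add your own request logging around the SDK, the same redaction idea is easy to apply. The sketch below is a generic example and not the scrubber from _client.py. The function name redact_headers and the placeholder string are hypothetical.

```python
from typing import Dict, Mapping

SENSITIVE_HEADERS = frozenset({"authorization", "x-sign-in-with-x"})


def redact_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy that is safe to log: sensitive values are masked."""
    return {
        name: "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }
```

Header names are compared case-insensitively because HTTP header names are not case-sensitive.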

Top-ups

top_up() uses standard Bearer auth (VENICE_API_KEY); the optional payment_header= kwarg carries a pre-signed x402 payment payload. An empty POST returns the documented 402 Payment Required with structured payment requirements — the SDK surfaces that as an APIError whose response body contains the accept spec (chains, assets, amounts). Sign the payment payload out-of-band (e.g. with @venice-ai/x402-client), then pass the base64 string back to top_up(payment_header=...).

Testing

VCRpy tests for x402 endpoints use a deterministic throwaway key (never funded, publicly known in the test suite). The root conftest scrubs X-Sign-In-With-X and X-402-Payment from every recorded cassette via filter_headers, so signed tokens never land on disk.
