venice_ai.rate_limiting.simple

SimpleRateLimiter - Lightweight, in-memory rate limiting for Venice AI SDK.

This module provides:

  • Per-model rate limit state tracking (from response headers)
  • Exponential backoff with jitter on 429 responses
  • Global abuse protection (blocks all requests after threshold failures)
  • Automatic cleanup of stale model state
  • Memory bounds via max_models limit

For production deployments requiring distributed coordination, use ADAPTIVE mode with the adaptive-rate-limiter package.

RateLimiterProtocol Objects

class RateLimiterProtocol(Protocol)

Protocol defining the rate limiter interface for VeniceClient.

Both SimpleRateLimiter and AdaptiveScheduler must implement this interface. The key method is submit_request(), which orchestrates the complete request lifecycle.

Note: acquire() and update_from_headers() are NOT part of this protocol. They are implementation details used internally by SimpleRateLimiter. AdaptiveScheduler handles rate limiting through its mode strategies instead. VeniceClient only uses submit_request(), lifecycle methods, and classifier.

RateLimiterProtocol.submit_request

async def submit_request(
metadata: "RequestMetadata",
request_func: Callable[[], Awaitable[Any]],
error_factory: Callable[..., Exception] | None = None) -> Any

Submit a request for rate-limited execution.

This is THE PRIMARY interface method. It must:

  1. Check rate limit state and wait if needed
  2. Execute the request via request_func
  3. Inspect response status - if 429, create RateLimitError using error_factory
  4. Update state from response/error headers
  5. Handle 429 errors with retry logic
  6. Return the response or re-raise after max retries

Arguments:

  • metadata - Request metadata containing model_id, endpoint, etc.
  • request_func - Async callable that executes the actual HTTP request. Returns raw response (including 429s).
  • error_factory - Optional callable to create errors from the response. Signature: (message, request, body, response) -> Exception. If provided, it is used to create the RateLimitError for 429 responses. If not provided, a default RateLimitError is created.
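
A minimal sketch of a callable matching the documented error_factory shape. SketchRateLimitError and sketch_error_factory are hypothetical names, not part of the SDK, and the real RateLimitError constructor may differ. Whether the limiter passes the four arguments positionally or by keyword is not stated here, so positional use is an assumption.

```python
from __future__ import annotations

from typing import Any


class SketchRateLimitError(Exception):
    """Stand-in for the SDK's RateLimitError (hypothetical shape)."""

    def __init__(self, message: str, request: Any, body: Any, response: Any) -> None:
        super().__init__(message)
        self.request = request
        self.body = body
        self.response = response


def sketch_error_factory(message: str, request: Any, body: Any, response: Any) -> Exception:
    # Follows the documented (message, request, body, response) -> Exception shape.
    return SketchRateLimitError(message, request, body, response)
```

A factory like this lets the caller decide which exception type surfaces for a 429 while the limiter stays responsible for detecting it.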

RateLimiterProtocol.is_running

def is_running() -> bool

Check if the rate limiter is running.

RateLimiterProtocol.start

async def start() -> None

Start the rate limiter.

RateLimiterProtocol.stop

async def stop() -> None

Stop the rate limiter.

RateLimiterProtocol.classifier

@property
def classifier() -> Any | None

Optional request classifier for VeniceClient compatibility.
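
The protocol members listed above can be sketched as a structural type. LimiterLike and MinimalLimiter are hypothetical stand-ins written for this example; they mirror the documented signatures but are not the SDK's own classes, and metadata is typed as Any because RequestMetadata is not defined here.

```python
from __future__ import annotations

from typing import Any, Awaitable, Callable, Protocol, runtime_checkable


@runtime_checkable
class LimiterLike(Protocol):
    """Local mirror of the documented protocol members (illustrative only)."""

    async def submit_request(
        self,
        metadata: Any,
        request_func: Callable[[], Awaitable[Any]],
        error_factory: Callable[..., Exception] | None = None,
    ) -> Any: ...

    def is_running(self) -> bool: ...

    async def start(self) -> None: ...

    async def stop(self) -> None: ...

    @property
    def classifier(self) -> Any | None: ...


class MinimalLimiter:
    """Smallest class that satisfies the shape above; applies no rate limiting."""

    def __init__(self) -> None:
        self._running = False
        self._classifier: Any | None = None

    async def submit_request(self, metadata, request_func, error_factory=None):
        return await request_func()

    def is_running(self) -> bool:
        return self._running

    async def start(self) -> None:
        self._running = True

    async def stop(self) -> None:
        self._running = False

    @property
    def classifier(self) -> Any | None:
        return self._classifier
```

Note that a runtime isinstance() check against such a protocol only confirms the members exist, not that their signatures match.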

ModelBucketState Objects

@dataclass
class ModelBucketState()

Per-model rate limit state.

Tracks rate limits from response headers and backoff state.

ModelBucketState.is_rate_limited

def is_rate_limited() -> tuple[bool, float]

Check if this model is currently rate limited.

Returns:

Tuple of (is_limited, wait_time_seconds)
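
The fields of ModelBucketState are not listed on this page. As a rough sketch under that caveat, a per-model state object returning the documented (is_limited, wait_time_seconds) tuple could look like the following. BucketSketch and all three field names are assumptions made for illustration.

```python
from __future__ import annotations

import time
from dataclasses import dataclass


@dataclass
class BucketSketch:
    """Hypothetical per-model state; field names are assumptions."""

    remaining_requests: int | None = None  # last value seen in headers, if any
    reset_at: float = 0.0                  # monotonic time when the quota resets
    backoff_until: float = 0.0             # monotonic time until which to back off

    def is_rate_limited(self, now: float | None = None) -> tuple[bool, float]:
        now = time.monotonic() if now is None else now
        wait = 0.0
        # An active backoff (set after a 429) always blocks.
        if self.backoff_until > now:
            wait = self.backoff_until - now
        # An exhausted quota blocks until the reset time.
        exhausted = self.remaining_requests is not None and self.remaining_requests <= 0
        if exhausted and self.reset_at > now:
            wait = max(wait, self.reset_at - now)
        return (wait > 0.0, wait)
```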

SimpleRateLimiter Objects

class SimpleRateLimiter()

Lightweight, in-memory rate limiter for Venice AI SDK.

Features:

  • Per-model rate limit state tracking (from response headers)
  • Exponential backoff with jitter on 429 responses
  • Global abuse protection (blocks all requests after threshold failures)
  • Automatic cleanup of stale model state
  • Memory bounds via max_models limit

Limitations (by design):

  • Single-process only (no cross-worker coordination)
  • Reactive only (responds to 429s, does not prevent them)
  • No token prediction (cannot estimate before request)
  • No cold-start protection (concurrent cold starts may stampede)

For production deployments, use ADAPTIVE mode with adaptive-rate-limiter.

SimpleRateLimiter.BACKOFF_JITTER

BACKOFF_JITTER = 0.1

±10% jitter
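
For illustration, one common way to combine exponential backoff with ±10% jitter is sketched below. The doubling factor and the choice to cap before applying jitter are assumptions; the page states only the jitter size and the min_backoff / max_backoff bounds. backoff_delay is a hypothetical helper name.

```python
from __future__ import annotations

import random

BACKOFF_JITTER = 0.1  # ±10%, as documented above


def backoff_delay(
    attempt: int,
    min_backoff: float = 1.0,
    max_backoff: float = 60.0,
    rng: random.Random | None = None,
) -> float:
    """Return a jittered delay for the given zero-based retry attempt."""
    source = rng if rng is not None else random
    # Assumed growth: double per attempt, capped at max_backoff.
    base = min(max_backoff, min_backoff * (2 ** attempt))
    # Scale by a random factor in [1 - jitter, 1 + jitter].
    return base * (1.0 + source.uniform(-BACKOFF_JITTER, BACKOFF_JITTER))
```

Jitter spreads out retries from clients that were throttled at the same moment, so they are less likely to retry in lockstep.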

SimpleRateLimiter.CLEANUP_INTERVAL

CLEANUP_INTERVAL = 300.0

5 minutes
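
A sketch of what a periodic cleanup pass might do with the stale_threshold and max_models constructor settings. The eviction order when the model limit is exceeded (least recently used first) is an assumption, and prune is a hypothetical helper, not an SDK function.

```python
from __future__ import annotations


def prune(
    last_used: dict[str, float],
    now: float,
    stale_threshold: float = 3600.0,
    max_models: int = 1000,
) -> dict[str, float]:
    """Return a copy of last_used without stale or surplus models."""
    # Drop models not used within stale_threshold seconds.
    fresh = {m: t for m, t in last_used.items() if now - t <= stale_threshold}
    # If still over the bound, keep only the most recently used models.
    if len(fresh) > max_models:
        newest = sorted(fresh.items(), key=lambda kv: kv[1], reverse=True)
        fresh = dict(newest[:max_models])
    return fresh
```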

SimpleRateLimiter.__init__

def __init__(min_backoff: float = 1.0,
max_backoff: float = 60.0,
failure_threshold: int = 20,
failure_window: float = 30.0,
block_duration: float = 30.0,
max_models: int = 1000,
stale_threshold: float = 3600.0,
max_retries: int = 3)

Initialize SimpleRateLimiter.

Arguments:

  • min_backoff - Minimum backoff time in seconds (default: 1.0)
  • max_backoff - Maximum backoff time in seconds (default: 60.0)
  • failure_threshold - Number of failures within window to trigger global block (default: 20)
  • failure_window - Time window for counting failures in seconds (default: 30.0)
  • block_duration - Duration of global block in seconds (default: 30.0)
  • max_models - Maximum number of models to track (default: 1000)
  • stale_threshold - Time after which unused models are cleaned up (default: 3600.0)
  • max_retries - Maximum number of retry attempts for 429 responses (default: 3)
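
The interaction of failure_threshold, failure_window, and block_duration can be sketched as a sliding window of failure timestamps. FailureWindow is a hypothetical class written for this example; whether the SDK clears the window when a block starts, or counts failures per model or globally, is not stated here and is assumed.

```python
from __future__ import annotations

from collections import deque


class FailureWindow:
    """Sliding-window failure counter that triggers a timed global block."""

    def __init__(
        self,
        failure_threshold: int = 20,
        failure_window: float = 30.0,
        block_duration: float = 30.0,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.failure_window = failure_window
        self.block_duration = block_duration
        self._failures: deque[float] = deque()
        self._blocked_until = 0.0

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Forget failures older than the window.
        cutoff = now - self.failure_window
        while self._failures and self._failures[0] < cutoff:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._blocked_until = now + self.block_duration
            self._failures.clear()  # assumption: start counting afresh

    def record_success(self) -> None:
        # Mirrors "resetting the failure count" from the method docs.
        self._failures.clear()

    def blocked_for(self, now: float) -> float:
        """Seconds remaining in the global block, or 0.0 if not blocked."""
        return max(0.0, self._blocked_until - now)
```

With the defaults, 20 failures inside any 30-second span would block all requests for the following 30 seconds.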

SimpleRateLimiter.is_running

def is_running() -> bool

Check if the rate limiter is running.

SimpleRateLimiter.start

async def start() -> None

Start the rate limiter.

SimpleRateLimiter.stop

async def stop() -> None

Stop the rate limiter and clean up.

SimpleRateLimiter.classifier

@property
def classifier() -> Any | None

Optional request classifier for VeniceClient compatibility.

SimpleRateLimiter.classifier

@classifier.setter
def classifier(value: Any) -> None

Set the request classifier.

SimpleRateLimiter.acquire

async def acquire(model: str) -> tuple[bool, float]

Attempt to acquire permission to make a request.

Arguments:

  • model - Model identifier

Returns:

Tuple of (can_proceed, wait_time_seconds)
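
Since acquire() returns a wait time rather than a guarantee, a caller that uses it directly would typically loop. The sketch below assumes acquire() does not sleep on its own. FakeLimiter and wait_until_allowed are hypothetical names that stand in for the real limiter so the example runs without the SDK.

```python
from __future__ import annotations

import asyncio


class FakeLimiter:
    """Stand-in exposing only the documented acquire() return shape."""

    def __init__(self, waits: list[float]) -> None:
        self._waits = list(waits)

    async def acquire(self, model: str) -> tuple[bool, float]:
        if self._waits:
            return (False, self._waits.pop(0))
        return (True, 0.0)


async def wait_until_allowed(limiter, model: str, sleep=asyncio.sleep, max_checks: int = 10) -> float:
    """Poll acquire() until it grants permission; return total seconds waited."""
    total = 0.0
    for _ in range(max_checks):
        can_proceed, wait = await limiter.acquire(model)
        if can_proceed:
            return total
        await sleep(wait)
        total += wait
    raise TimeoutError(f"still rate limited after {max_checks} checks")
```

In normal use this loop is unnecessary, because submit_request() performs the equivalent check internally.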

SimpleRateLimiter.update_from_headers

async def update_from_headers(model: str,
headers: dict[str, str],
status_code: int = 200) -> None

Update rate limit state from response headers.

Arguments:

  • model - Model identifier
  • headers - Response headers (case-insensitive)
  • status_code - HTTP status code
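
This page does not say which headers are read. The sketch below shows only the case-insensitive lookup, using the standard Retry-After header and a hypothetical x-ratelimit-remaining-requests header as examples. parse_limit_headers is an illustrative helper, not an SDK function, and it ignores the HTTP-date form of Retry-After.

```python
from __future__ import annotations


def parse_limit_headers(headers: dict[str, str]) -> dict[str, float | None]:
    """Extract numeric rate limit hints, matching header names case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_float(name: str) -> float | None:
        raw = lowered.get(name)
        if raw is None:
            return None
        try:
            return float(raw)
        except ValueError:
            return None  # e.g. Retry-After given as an HTTP-date

    return {
        "retry_after": as_float("retry-after"),
        # Hypothetical header name, used only for illustration.
        "remaining": as_float("x-ratelimit-remaining-requests"),
    }
```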

SimpleRateLimiter.record_failure

async def record_failure(model: str) -> None

Record a failure for the given model.

SimpleRateLimiter.record_success

async def record_success(model: str) -> None

Record a success, resetting the failure count.

SimpleRateLimiter.get_state

async def get_state(model: str) -> dict[str, Any] | None

Get the current state for a model (for debugging).

SimpleRateLimiter.get_all_states

async def get_all_states() -> dict[str, dict[str, Any]]

Get all model states (for debugging).

SimpleRateLimiter.clear

async def clear() -> None

Clear all state.

SimpleRateLimiter.get_stats

def get_stats() -> dict[str, Any]

Get limiter statistics.

SimpleRateLimiter.submit_request

async def submit_request(
metadata: "RequestMetadata",
request_func: Callable[[], Awaitable[Any]],
error_factory: Callable[..., Exception] | None = None) -> Any

Submit a request for rate-limited execution.

This is the PRIMARY interface method, invoked by VeniceClient for every request. It orchestrates the complete request lifecycle:

  1. Check if rate limited (wait if needed)
  2. Execute the request via request_func
  3. Inspect response - if 429, create RateLimitError using error_factory
  4. Update state from response headers
  5. Handle 429 errors with backoff and retry
  6. Return the response, or raise RateLimitError once max retries are exceeded

IMPORTANT: request_func() returns raw responses including 429s. This method is responsible for detecting 429s and creating errors.

Arguments:

  • metadata - Request metadata containing model_id, endpoint, etc.
  • request_func - Async callable that executes the actual HTTP request. Returns raw response (including 429s).
  • error_factory - Optional callable to create errors from the response. Signature: (message, request, body, response) -> Exception. If provided, it is used to create the RateLimitError for 429 responses. If not provided, a basic RateLimitError is created.

Returns:

The response from executing request_func

Raises:

  • RateLimitError - If max retries exceeded while rate limited
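
The lifecycle above can be sketched as a retry loop. This is a simplified illustration, not the SDK's implementation: it omits the state check and header update (steps 1 and 4), uses plain doubling without jitter, and assumes the response exposes status_code. RateLimitError here is a local stand-in, and FakeResponse and submit_with_retries are hypothetical names.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable


class RateLimitError(Exception):
    """Local stand-in for the SDK's RateLimitError."""


@dataclass
class FakeResponse:
    status_code: int
    headers: dict[str, str] = field(default_factory=dict)


async def submit_with_retries(
    request_func: Callable[[], Awaitable[Any]],
    error_factory: Callable[..., Exception] | None = None,
    max_retries: int = 3,
    min_backoff: float = 1.0,
    max_backoff: float = 60.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    attempt = 0
    while True:
        response = await request_func()  # raw response, may be a 429
        if response.status_code != 429:
            return response
        # The limiter, not request_func, turns a 429 into an error.
        if error_factory is not None:
            error = error_factory("Rate limit exceeded", None, None, response)
        else:
            error = RateLimitError("Rate limit exceeded")
        if attempt >= max_retries:
            raise error
        await sleep(min(max_backoff, min_backoff * 2 ** attempt))
        attempt += 1
```

With max_retries=3, this loop makes at most four calls to request_func before raising.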

NoOpRateLimiter Objects

class NoOpRateLimiter()

Rate limiter that does nothing (for testing/disabled mode).

WARNING: Using this in production bypasses all rate limit protection. Use only for testing or when you have external rate limiting in place.

Implements the full RateLimiterProtocol including submit_request().

NoOpRateLimiter.submit_request

async def submit_request(
metadata: "RequestMetadata",
request_func: Callable[[], Awaitable[Any]],
error_factory: Callable[..., Exception] | None = None) -> Any

Execute request without any rate limiting.

Note: error_factory is accepted but ignored - NoOpRateLimiter does not create errors, it passes responses through directly.
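
The pass-through behavior described above can be sketched in a few lines. PassThroughSketch is a hypothetical stand-in used so the example runs without the SDK. The point it illustrates is that a 429 response comes back to the caller unchanged and error_factory is never invoked.

```python
from __future__ import annotations

import asyncio
from typing import Any, Awaitable, Callable


class PassThroughSketch:
    """Illustrative no-op limiter: executes the request and returns the result."""

    async def submit_request(
        self,
        metadata: Any,
        request_func: Callable[[], Awaitable[Any]],
        error_factory: Callable[..., Exception] | None = None,
    ) -> Any:
        # error_factory is accepted for interface compatibility but unused.
        return await request_func()


async def demo() -> Any:
    async def throttled_request() -> dict[str, int]:
        return {"status_code": 429}

    def factory(*args: Any) -> Exception:
        raise AssertionError("factory should not be called")

    return await PassThroughSketch().submit_request(None, throttled_request, factory)
```

Callers using a limiter like this must inspect status codes themselves, since no RateLimitError is raised on their behalf.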