venice_ai.rate_limiting.simple

SimpleRateLimiter - Lightweight, in-memory rate limiting for Venice AI SDK.

This module provides:

Per-model rate limit state tracking (from response headers)
Exponential backoff with jitter on 429 responses
Global abuse protection (blocks all requests after threshold failures)
Automatic cleanup of stale model state
Memory bounds via max_models limit

For production deployments requiring distributed coordination, use ADAPTIVE mode with the adaptive-rate-limiter package.

RateLimiterProtocol Objects

class RateLimiterProtocol(Protocol)

Protocol defining the rate limiter interface for VeniceClient.

Both SimpleRateLimiter and AdaptiveScheduler must implement this interface. The key method is submit_request() which orchestrates the complete request lifecycle.

Note: acquire() and update_from_headers() are NOT part of this protocol. They are implementation details used internally by SimpleRateLimiter. AdaptiveScheduler handles rate limiting through its mode strategies instead. VeniceClient only uses submit_request(), lifecycle methods, and classifier.

RateLimiterProtocol.submit_request

async def submit_request(
        metadata: "RequestMetadata",
        request_func: Callable[[], Awaitable[Any]],
        error_factory: Callable[..., Exception] | None = None) -> Any

Submit a request for rate-limited execution.

This is THE PRIMARY interface method. It must:

Check rate limit state and wait if needed
Execute the request via request_func
Inspect response status - if 429, create RateLimitError using error_factory
Update state from response/error headers
Handle 429 errors with retry logic
Return the response or re-raise after max retries

Arguments:

metadata - Request metadata containing model_id, endpoint, etc.
request_func - Async callable that executes the actual HTTP request. Returns raw response (including 429s).
error_factory - Optional callable to create errors from the response, with signature (message, request, body, response) -> Exception. If provided, it is used to create the RateLimitError for 429 responses; otherwise a default RateLimitError is created.
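
The interface described above can be illustrated with a small structural-typing sketch. It is not the SDK's own definition: LimiterLike is a local, hypothetical mirror of the documented method names, and CountingLimiter is a toy class that applies no limits at all. The point is only that any object exposing submit_request(), the lifecycle methods, and classifier satisfies the shape that VeniceClient is described as relying on.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional, Protocol, Tuple, runtime_checkable


@runtime_checkable
class LimiterLike(Protocol):
    """Local mirror of the documented interface (hypothetical name, not the SDK class)."""

    async def submit_request(
        self,
        metadata: Any,
        request_func: Callable[[], Awaitable[Any]],
        error_factory: Optional[Callable[..., Exception]] = None,
    ) -> Any: ...

    def is_running(self) -> bool: ...

    async def start(self) -> None: ...

    async def stop(self) -> None: ...

    @property
    def classifier(self) -> Optional[Any]: ...


class CountingLimiter:
    """Toy limiter: counts submissions and never delays or retries."""

    def __init__(self) -> None:
        self._running = False
        self._classifier: Optional[Any] = None
        self.submitted = 0

    @property
    def classifier(self) -> Optional[Any]:
        return self._classifier

    @classifier.setter
    def classifier(self, value: Any) -> None:
        self._classifier = value

    async def submit_request(self, metadata, request_func, error_factory=None):
        self.submitted += 1
        return await request_func()

    def is_running(self) -> bool:
        return self._running

    async def start(self) -> None:
        self._running = True

    async def stop(self) -> None:
        self._running = False


async def demo() -> Tuple[Any, int, bool]:
    limiter = CountingLimiter()
    await limiter.start()

    async def request_func() -> str:
        return "ok"

    # The metadata dict is a stand-in for RequestMetadata.
    result = await limiter.submit_request({"model_id": "demo-model"}, request_func)
    await limiter.stop()
    return result, limiter.submitted, limiter.is_running()
```

Note that isinstance() against a runtime-checkable protocol only verifies that the members exist; it does not check signatures or that submit_request() actually performs the six steps listed above.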

RateLimiterProtocol.is_running

def is_running() -> bool

Check if the rate limiter is running.

RateLimiterProtocol.start

async def start() -> None

Start the rate limiter.

RateLimiterProtocol.stop

async def stop() -> None

Stop the rate limiter.

RateLimiterProtocol.classifier

@property
def classifier() -> Any | None

Optional request classifier for VeniceClient compatibility.

ModelBucketState Objects

@dataclass
class ModelBucketState()

Per-model rate limit state.

Tracks rate limits from response headers and backoff state.

ModelBucketState.is_rate_limited

def is_rate_limited() -> tuple[bool, float]

Check if this model is currently rate limited.

Returns:

Tuple of (is_limited, wait_time_seconds)
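
As a rough illustration of how such a check could work, the sketch below returns the same (is_limited, wait_time_seconds) shape. The field names (remaining_requests, reset_at, backoff_until) and the extra now parameter are assumptions made for the example; the real dataclass fields are not listed in this reference.

```python
import time
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BucketSketch:
    """Hypothetical per-model state; field names are assumptions."""

    remaining_requests: Optional[int] = None  # last known remaining quota, if any
    reset_at: float = 0.0                     # monotonic time when the quota resets
    backoff_until: float = 0.0                # monotonic time when backoff ends

    def is_rate_limited(self, now: Optional[float] = None) -> Tuple[bool, float]:
        now = time.monotonic() if now is None else now
        wait = 0.0
        # An active backoff always forces a wait.
        if self.backoff_until > now:
            wait = self.backoff_until - now
        # An exhausted quota forces a wait until the reset time.
        exhausted = self.remaining_requests is not None and self.remaining_requests <= 0
        if exhausted and self.reset_at > now:
            wait = max(wait, self.reset_at - now)
        return (wait > 0.0, wait)
```

Passing now explicitly keeps the check deterministic in tests; a caller in production code would normally omit it.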

SimpleRateLimiter Objects

class SimpleRateLimiter()

Lightweight, in-memory rate limiter for Venice AI SDK.

Features:

Per-model rate limit state tracking (from response headers)
Exponential backoff with jitter on 429 responses
Global abuse protection (blocks all requests after threshold failures)
Automatic cleanup of stale model state
Memory bounds via max_models limit

Limitations (by design):

Single-process only (no cross-worker coordination)
Reactive only (responds to 429s, does not prevent them)
No token prediction (cannot estimate before request)
No cold-start protection (concurrent cold starts may stampede)

For production deployments, use ADAPTIVE mode with adaptive-rate-limiter.

SimpleRateLimiter.BACKOFF_JITTER

BACKOFF_JITTER = 0.1

±10% jitter
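
One common way to combine exponential backoff with a ±10% jitter factor is sketched below. The doubling schedule and the helper name backoff_delay are assumptions; this reference only states the jitter constant and the min_backoff and max_backoff bounds, not the exact formula.

```python
import random
from typing import Optional

BACKOFF_JITTER = 0.1  # mirrors the documented constant


def backoff_delay(
    attempt: int,
    min_backoff: float = 1.0,
    max_backoff: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a jittered delay for a zero-based retry attempt (illustrative only)."""
    source = rng if rng is not None else random
    # Double the delay on each attempt, capped at max_backoff.
    base = min(max_backoff, min_backoff * (2 ** attempt))
    # Scale by a random factor in [1 - jitter, 1 + jitter].
    factor = 1.0 + source.uniform(-BACKOFF_JITTER, BACKOFF_JITTER)
    return base * factor
```

In this sketch the jitter is applied after the cap, so a delay can exceed max_backoff by up to 10%. Whether the SDK caps before or after jitter is not stated here.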

SimpleRateLimiter.CLEANUP_INTERVAL

CLEANUP_INTERVAL = 300.0

5 minutes
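
A periodic cleanup pass of this kind might look like the sketch below, which would run once per CLEANUP_INTERVAL. The function name, the last-used timestamp mapping, and the keep-most-recent eviction policy are all assumptions; the reference only says that stale model state is cleaned up and that memory is bounded by max_models.

```python
from typing import Dict


def prune_models(
    last_used: Dict[str, float],
    now: float,
    stale_threshold: float = 3600.0,
    max_models: int = 1000,
) -> Dict[str, float]:
    """Return a pruned copy of a model -> last-used-time mapping (illustrative only)."""
    # Drop models that have not been used within stale_threshold seconds.
    fresh = {m: t for m, t in last_used.items() if now - t <= stale_threshold}
    # If still over the cap, keep only the most recently used models.
    if len(fresh) > max_models:
        keep = sorted(fresh, key=fresh.get, reverse=True)[:max_models]
        fresh = {m: fresh[m] for m in keep}
    return fresh
```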

SimpleRateLimiter.__init__

def __init__(min_backoff: float = 1.0,
             max_backoff: float = 60.0,
             failure_threshold: int = 20,
             failure_window: float = 30.0,
             block_duration: float = 30.0,
             max_models: int = 1000,
             stale_threshold: float = 3600.0,
             max_retries: int = 3)

Initialize SimpleRateLimiter.

Arguments:

min_backoff - Minimum backoff time in seconds (default: 1.0)
max_backoff - Maximum backoff time in seconds (default: 60.0)
failure_threshold - Number of failures within window to trigger global block (default: 20)
failure_window - Time window for counting failures in seconds (default: 30.0)
block_duration - Duration of global block in seconds (default: 30.0)
max_models - Maximum number of models to track (default: 1000)
stale_threshold - Time in seconds after which unused models are cleaned up (default: 3600.0)
max_retries - Maximum number of retry attempts for 429 responses (default: 3)
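
The interaction between failure_threshold, failure_window, and block_duration can be pictured as a sliding window over failure timestamps. The class below is a minimal sketch under that assumption; FailureWindow and its methods are hypothetical names, and the SDK may count failures differently.

```python
from collections import deque
from typing import Deque


class FailureWindow:
    """Sliding-window failure counter with a global block (illustrative only)."""

    def __init__(
        self,
        failure_threshold: int = 20,
        failure_window: float = 30.0,
        block_duration: float = 30.0,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.failure_window = failure_window
        self.block_duration = block_duration
        self._failures: Deque[float] = deque()
        self.blocked_until = 0.0

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Forget failures that fell out of the window.
        while self._failures and now - self._failures[0] > self.failure_window:
            self._failures.popleft()
        # Enough recent failures: block every request for block_duration seconds.
        if len(self._failures) >= self.failure_threshold:
            self.blocked_until = now + self.block_duration

    def is_blocked(self, now: float) -> bool:
        return now < self.blocked_until
```

With the documented defaults, this sketch would block all requests for 30 seconds once 20 failures land inside any 30-second span.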

SimpleRateLimiter.is_running

def is_running() -> bool

Check if the rate limiter is running.

SimpleRateLimiter.start

async def start() -> None

Start the rate limiter.

SimpleRateLimiter.stop

async def stop() -> None

Stop the rate limiter and clean up.

SimpleRateLimiter.classifier

@property
def classifier() -> Any | None

Optional request classifier for VeniceClient compatibility.

SimpleRateLimiter.classifier

@classifier.setter
def classifier(value: Any) -> None

Set the request classifier.

SimpleRateLimiter.acquire

async def acquire(model: str) -> tuple[bool, float]

Attempt to acquire permission to make a request.

Arguments:

model - Model identifier

Returns:

Tuple of (can_proceed, wait_time_seconds)
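
The returned tuple lends itself to a simple wait loop, sketched below. StubLimiter is a stand-in that only imitates the documented acquire() return shape, and wait_for_slot with its max_checks parameter is a hypothetical helper. Since acquire() is described as an internal detail of SimpleRateLimiter, application code would normally go through submit_request() instead.

```python
import asyncio
from typing import Iterable, Tuple


class StubLimiter:
    """Stand-in exposing only an acquire() with the documented return shape."""

    def __init__(self, waits: Iterable[float]) -> None:
        self._waits = list(waits)

    async def acquire(self, model: str) -> Tuple[bool, float]:
        # Report "limited" once per queued wait time, then allow the request.
        if self._waits:
            return (False, self._waits.pop(0))
        return (True, 0.0)


async def wait_for_slot(limiter, model: str, max_checks: int = 10) -> int:
    """Poll acquire() until it allows the request; return the number of checks made."""
    for checks in range(1, max_checks + 1):
        can_proceed, wait_time = await limiter.acquire(model)
        if can_proceed:
            return checks
        # Sleep for the suggested time before asking again.
        await asyncio.sleep(wait_time)
    raise TimeoutError(f"no slot for {model!r} after {max_checks} checks")
```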

SimpleRateLimiter.update_from_headers

async def update_from_headers(model: str,
                              headers: dict[str, str],
                              status_code: int = 200) -> None

Update rate limit state from response headers.

Arguments:

model - Model identifier
headers - Response headers (header names are matched case-insensitively)
status_code - HTTP status code
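
Case-insensitive header handling can be approximated by lowercasing the keys before lookup, as in the sketch below. The header names x-ratelimit-remaining-requests and x-ratelimit-reset-requests are assumptions chosen for illustration, since this reference does not list which headers are read. Retry-After is the standard HTTP header, and only its delay-in-seconds form is parsed here.

```python
from typing import Dict, Mapping, Optional


def parse_rate_headers(headers: Mapping[str, str]) -> Dict[str, Optional[float]]:
    """Extract numeric rate limit hints from response headers (illustrative only)."""
    # Normalise header names so lookups ignore case.
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_float(name: str) -> Optional[float]:
        raw = lowered.get(name)
        if raw is None:
            return None
        try:
            return float(raw)
        except ValueError:
            # Non-numeric values (for example an HTTP-date) are ignored here.
            return None

    return {
        "remaining_requests": as_float("x-ratelimit-remaining-requests"),
        "reset_requests": as_float("x-ratelimit-reset-requests"),
        "retry_after": as_float("retry-after"),
    }
```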

SimpleRateLimiter.record_failure

async def record_failure(model: str) -> None

Record a failure for the given model.

SimpleRateLimiter.record_success

async def record_success(model: str) -> None

Record a success, resetting the failure count.

SimpleRateLimiter.get_state

async def get_state(model: str) -> dict[str, Any] | None

Get the current state for a model (for debugging).

SimpleRateLimiter.get_all_states

async def get_all_states() -> dict[str, dict[str, Any]]

Get all model states (for debugging).

SimpleRateLimiter.clear

async def clear() -> None

Clear all state.

SimpleRateLimiter.get_stats

def get_stats() -> dict[str, Any]

Get limiter statistics.

SimpleRateLimiter.submit_request

async def submit_request(
        metadata: "RequestMetadata",
        request_func: Callable[[], Awaitable[Any]],
        error_factory: Callable[..., Exception] | None = None) -> Any

Submit a request for rate-limited execution.

This is the PRIMARY interface method, invoked by VeniceClient for every request. It orchestrates the complete request lifecycle:

Check if rate limited (wait if needed)
Execute the request via request_func
Inspect response - if 429, create RateLimitError using error_factory
Update state from response headers
Handle 429 errors with backoff and retry
Return the response

IMPORTANT: request_func() returns raw responses including 429s. This method is responsible for detecting 429s and creating errors.

Arguments:

metadata - Request metadata containing model_id, endpoint, etc.
request_func - Async callable that executes the actual HTTP request. Returns raw response (including 429s).
error_factory - Optional callable to create errors from the response, with signature (message, request, body, response) -> Exception. If provided, it is used to create the RateLimitError for 429 responses; otherwise a basic RateLimitError is created.

Returns:

The response from executing request_func

Raises:

RateLimitError - If max retries exceeded while rate limited
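
The retry part of this lifecycle can be sketched as a loop around request_func. The code below is a simplified stand-in, not the SDK's implementation: FakeResponse, RateLimitErrorSketch, submit_with_retries, and the injectable sleep parameter are hypothetical, per-model state tracking and jitter are omitted, and max_retries is assumed to mean retries after the initial attempt.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, Dict, Optional


class RateLimitErrorSketch(Exception):
    """Local stand-in for the SDK's RateLimitError."""


@dataclass
class FakeResponse:
    status_code: int
    headers: Dict[str, str] = field(default_factory=dict)


async def submit_with_retries(
    request_func: Callable[[], Awaitable[Any]],
    error_factory: Optional[Callable[..., Exception]] = None,
    max_retries: int = 3,
    min_backoff: float = 1.0,
    max_backoff: float = 60.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    attempt = 0
    while True:
        # request_func returns raw responses, including 429s.
        response = await request_func()
        if response.status_code != 429:
            return response
        if attempt >= max_retries:
            message = f"rate limited after {max_retries} retries"
            if error_factory is not None:
                # Documented signature: (message, request, body, response).
                # This sketch has no request or body object, so it passes None.
                raise error_factory(message, None, None, response)
            raise RateLimitErrorSketch(message)
        # Exponential backoff without jitter, capped at max_backoff.
        await sleep(min(max_backoff, min_backoff * (2 ** attempt)))
        attempt += 1
```

Injecting sleep makes the loop testable without real delays; production code would leave the default asyncio.sleep in place.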

NoOpRateLimiter Objects

class NoOpRateLimiter()

Rate limiter that does nothing (for testing/disabled mode).

WARNING: Using this in production bypasses all rate limit protection. Use only for testing or when you have external rate limiting in place.

Implements the full RateLimiterProtocol including submit_request().

NoOpRateLimiter.submit_request

async def submit_request(
        metadata: "RequestMetadata",
        request_func: Callable[[], Awaitable[Any]],
        error_factory: Callable[..., Exception] | None = None) -> Any

Execute request without any rate limiting.

Note: error_factory is accepted but ignored - NoOpRateLimiter does not create errors, it passes responses through directly.
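
A practical consequence of this pass-through behaviour is that a 429 reaches the caller as an ordinary response rather than as an exception. The sketch below shows that with a stand-in class; PassThroughLimiter, FakeResponse, and call_once are hypothetical names used only for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional


@dataclass
class FakeResponse:
    status_code: int


class PassThroughLimiter:
    """Stand-in that mimics the documented no-op behaviour."""

    async def submit_request(
        self,
        metadata: Any,
        request_func: Callable[[], Awaitable[Any]],
        error_factory: Optional[Callable[..., Exception]] = None,
    ) -> Any:
        # error_factory is accepted for interface compatibility but never called.
        return await request_func()


async def call_once(limiter: PassThroughLimiter) -> int:
    async def request_func() -> FakeResponse:
        return FakeResponse(status_code=429)

    # The metadata dict is a stand-in for RequestMetadata.
    response = await limiter.submit_request({"model_id": "demo-model"}, request_func)
    # No error is raised for the 429, so the caller has to inspect the status.
    return response.status_code
```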
