venice_ai.rate_limiting.simple
SimpleRateLimiter - Lightweight, in-memory rate limiting for Venice AI SDK.
This module provides:
- Per-model rate limit state tracking (from response headers)
- Exponential backoff with jitter on 429 responses
- Global abuse protection (blocks all requests after threshold failures)
- Automatic cleanup of stale model state
- Memory bounds via max_models limit
For production deployments requiring distributed coordination, use ADAPTIVE mode with the adaptive-rate-limiter package.
RateLimiterProtocol Objects
class RateLimiterProtocol(Protocol)
Protocol defining the rate limiter interface for VeniceClient.
Both SimpleRateLimiter and AdaptiveScheduler must implement this interface. The key method is submit_request(), which orchestrates the complete request lifecycle.
Note: acquire() and update_from_headers() are NOT part of this protocol. They are implementation details used internally by SimpleRateLimiter. AdaptiveScheduler handles rate limiting through its mode strategies instead. VeniceClient only uses submit_request(), lifecycle methods, and classifier.
RateLimiterProtocol.submit_request
async def submit_request(
metadata: "RequestMetadata",
request_func: Callable[[], Awaitable[Any]],
error_factory: Callable[..., Exception] | None = None) -> Any
Submit a request for rate-limited execution.
This is THE PRIMARY interface method. It must:
- Check rate limit state and wait if needed
- Execute the request via request_func
- Inspect response status - if 429, create RateLimitError using error_factory
- Update state from response/error headers
- Handle 429 errors with retry logic
- Return the response or re-raise after max retries
Arguments:
- metadata: Request metadata containing model_id, endpoint, etc.
- request_func: Async callable that executes the actual HTTP request. Returns the raw response (including 429s).
- error_factory: Optional callable to create errors from the response. Signature: (message, request, body, response) -> Exception. If provided, it is used to create the RateLimitError for 429 responses. If not provided, a default RateLimitError is created.
RateLimiterProtocol.is_running
def is_running() -> bool
Check if the rate limiter is running.
RateLimiterProtocol.start
async def start() -> None
Start the rate limiter.
RateLimiterProtocol.stop
async def stop() -> None
Stop the rate limiter.
RateLimiterProtocol.classifier
@property
def classifier() -> Any | None
Optional request classifier for VeniceClient compatibility.
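As a rough illustration of the interface described above, the sketch below defines a local stand-in protocol and a pass-through class that satisfies it. `LimiterLike` and `PassThroughLimiter` are hypothetical names written for this example; they are not SDK classes, and the real RateLimiterProtocol may differ in detail.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional, Protocol, runtime_checkable


@runtime_checkable
class LimiterLike(Protocol):
    """Local stand-in that mirrors the documented methods (not the SDK class)."""

    async def submit_request(
        self,
        metadata: Any,
        request_func: Callable[[], Awaitable[Any]],
        error_factory: Optional[Callable[..., Exception]] = None,
    ) -> Any: ...

    def is_running(self) -> bool: ...

    async def start(self) -> None: ...

    async def stop(self) -> None: ...

    @property
    def classifier(self) -> Optional[Any]: ...


class PassThroughLimiter:
    """Hypothetical implementation: runs the request with no limiting at all."""

    def __init__(self) -> None:
        self._running = False
        self._classifier: Optional[Any] = None

    async def submit_request(self, metadata, request_func, error_factory=None):
        # No state checks and no retries: just execute the request.
        return await request_func()

    def is_running(self) -> bool:
        return self._running

    async def start(self) -> None:
        self._running = True

    async def stop(self) -> None:
        self._running = False

    @property
    def classifier(self) -> Optional[Any]:
        return self._classifier
```

Because the protocol is structural, `PassThroughLimiter` never has to inherit from it; `isinstance(PassThroughLimiter(), LimiterLike)` only checks that the documented members are present.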
ModelBucketState Objects
@dataclass
class ModelBucketState()
Per-model rate limit state.
Tracks rate limits from response headers and backoff state.
ModelBucketState.is_rate_limited
def is_rate_limited() -> tuple[bool, float]
Check if this model is currently rate limited.
Returns:
Tuple of (is_limited, wait_time_seconds)
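The return shape can be illustrated with a small stand-in dataclass. The field name `backoff_until` and the extra `now` parameter are assumptions made for this sketch so the behavior is easy to test; the actual ModelBucketState fields are not listed in this reference.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class BucketSketch:
    # Hypothetical field: monotonic timestamp until which requests should wait.
    backoff_until: float = 0.0

    def is_rate_limited(self, now: Optional[float] = None) -> tuple[bool, float]:
        """Return (is_limited, wait_time_seconds), as documented above."""
        current = time.monotonic() if now is None else now
        remaining = self.backoff_until - current
        if remaining > 0:
            return True, remaining
        return False, 0.0
```

A caller would typically sleep for the second element when the first is True, then check again.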
SimpleRateLimiter Objects
class SimpleRateLimiter()
Lightweight, in-memory rate limiter for Venice AI SDK.
Features:
- Per-model rate limit state tracking (from response headers)
- Exponential backoff with jitter on 429 responses
- Global abuse protection (blocks all requests after threshold failures)
- Automatic cleanup of stale model state
- Memory bounds via max_models limit
Limitations (by design):
- Single-process only (no cross-worker coordination)
- Reactive only (responds to 429s, does not prevent them)
- No token prediction (cannot estimate before request)
- No cold-start protection (concurrent cold starts may stampede)
For production deployments, use ADAPTIVE mode with adaptive-rate-limiter.
SimpleRateLimiter.BACKOFF_JITTER
BACKOFF_JITTER = 0.1
±10% jitter
SimpleRateLimiter.CLEANUP_INTERVAL
CLEANUP_INTERVAL = 300.0
5 minutes
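One way a periodic cleanup pass could combine the stale threshold with the max_models bound is sketched below. The function `prune_models`, the `last_used` mapping, and the least-recently-used eviction policy are all assumptions for illustration; the reference does not say how SimpleRateLimiter chooses which models to drop.

```python
def prune_models(
    last_used: dict[str, float],
    now: float,
    stale_threshold: float = 3600.0,
    max_models: int = 1000,
) -> list[str]:
    """Remove stale entries in place, then evict oldest entries over the cap.

    Returns the removed model ids in removal order.
    """
    # Step 1: drop models that have not been used within stale_threshold.
    removed = [m for m, ts in last_used.items() if now - ts > stale_threshold]
    for model in removed:
        del last_used[model]

    # Step 2 (assumed policy): if still over the cap, evict least recently used.
    overflow = len(last_used) - max_models
    if overflow > 0:
        oldest = sorted(last_used, key=lambda m: last_used[m])[:overflow]
        for model in oldest:
            del last_used[model]
        removed.extend(oldest)
    return removed
```

In a running limiter, a pass like this would be scheduled every CLEANUP_INTERVAL seconds.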
SimpleRateLimiter.__init__
def __init__(min_backoff: float = 1.0,
max_backoff: float = 60.0,
failure_threshold: int = 20,
failure_window: float = 30.0,
block_duration: float = 30.0,
max_models: int = 1000,
stale_threshold: float = 3600.0,
max_retries: int = 3)
Initialize SimpleRateLimiter.
Arguments:
- min_backoff: Minimum backoff time in seconds (default: 1.0)
- max_backoff: Maximum backoff time in seconds (default: 60.0)
- failure_threshold: Number of failures within the window that triggers a global block (default: 20)
- failure_window: Time window for counting failures, in seconds (default: 30.0)
- block_duration: Duration of the global block in seconds (default: 30.0)
- max_models: Maximum number of models to track (default: 1000)
- stale_threshold: Time in seconds after which unused models are cleaned up (default: 3600.0)
- max_retries: Maximum number of retry attempts for 429 responses (default: 3)
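To make the interaction of min_backoff, max_backoff, and BACKOFF_JITTER concrete, here is a minimal sketch of a capped exponential delay with ±10% jitter. Doubling per attempt is an assumption; the reference says the backoff is exponential but does not state the growth factor. `backoff_delay` and its `rng` parameter are hypothetical names.

```python
import random
from typing import Callable


def backoff_delay(
    attempt: int,
    min_backoff: float = 1.0,
    max_backoff: float = 60.0,
    jitter: float = 0.1,
    rng: Callable[[], float] = random.random,
) -> float:
    """Delay in seconds for a zero-based retry attempt.

    The base delay doubles each attempt (assumed) and is capped at max_backoff.
    The result is then scaled by a factor in [1 - jitter, 1 + jitter].
    """
    base = min(max_backoff, min_backoff * (2 ** attempt))
    factor = 1.0 + jitter * (2.0 * rng() - 1.0)
    return base * factor
```

With the defaults, the base delays would run 1, 2, 4, 8, ... seconds until they reach the 60-second cap. The jitter spreads retries from concurrent callers so they do not all fire at the same instant.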
SimpleRateLimiter.is_running
def is_running() -> bool
Check if the rate limiter is running.
SimpleRateLimiter.start
async def start() -> None
Start the rate limiter.
SimpleRateLimiter.stop
async def stop() -> None
Stop the rate limiter and cleanup.
SimpleRateLimiter.classifier
@property
def classifier() -> Any | None
Optional request classifier for VeniceClient compatibility.
SimpleRateLimiter.classifier
@classifier.setter
def classifier(value: Any) -> None
Set the request classifier.
SimpleRateLimiter.acquire
async def acquire(model: str) -> tuple[bool, float]
Attempt to acquire permission to make a request.
Arguments:
- model: Model identifier
Returns:
Tuple of (can_proceed, wait_time_seconds)
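A caller that uses acquire() directly might wrap it in a wait loop like the sketch below. `wait_for_slot`, `max_wait`, and the injectable `sleep` are hypothetical; the only thing taken from this reference is that acquire(model) is awaitable and returns (can_proceed, wait_time_seconds).

```python
import asyncio
from typing import Any, Awaitable, Callable


async def wait_for_slot(
    limiter: Any,
    model: str,
    max_wait: float = 120.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> float:
    """Poll limiter.acquire(model) until it grants permission.

    Returns the total time slept. Raises TimeoutError if the next wait
    would push the total past max_wait.
    """
    waited = 0.0
    while True:
        can_proceed, wait_time = await limiter.acquire(model)
        if can_proceed:
            return waited
        # Guard against a zero wait so the loop cannot spin.
        wait_time = max(wait_time, 0.01)
        if waited + wait_time > max_wait:
            raise TimeoutError(f"rate limited for more than {max_wait} seconds")
        await sleep(wait_time)
        waited += wait_time
```

Code that goes through VeniceClient should not need this, since submit_request() is documented as handling the wait itself.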
SimpleRateLimiter.update_from_headers
async def update_from_headers(model: str,
headers: dict[str, str],
status_code: int = 200) -> None
Update rate limit state from response headers.
Arguments:
- model: Model identifier
- headers: Response headers (case-insensitive)
- status_code: HTTP status code
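The case-insensitive header handling mentioned above could look roughly like this. The header name `x-ratelimit-remaining-requests` is an assumption used for illustration and should be checked against the Venice API documentation. `Retry-After` is a standard HTTP header, but this sketch only handles its numeric-seconds form, not the HTTP-date form.

```python
from typing import Optional


def parse_rate_headers(headers: dict[str, str]) -> dict[str, Optional[float]]:
    """Extract numeric rate limit hints from headers, ignoring key case."""
    lowered = {key.lower(): value for key, value in headers.items()}

    def as_float(name: str) -> Optional[float]:
        raw = lowered.get(name)
        if raw is None:
            return None
        try:
            return float(raw)
        except ValueError:
            # Non-numeric values (for example an HTTP-date) are skipped here.
            return None

    return {
        # Assumed header name, not confirmed by this reference.
        "remaining_requests": as_float("x-ratelimit-remaining-requests"),
        "retry_after": as_float("retry-after"),
    }
```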
SimpleRateLimiter.record_failure
async def record_failure(model: str) -> None
Record a failure for the given model.
SimpleRateLimiter.record_success
async def record_success(model: str) -> None
Record a success, resetting the failure count.
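The global abuse protection driven by failure_threshold, failure_window, and block_duration can be pictured as a sliding window of failure timestamps. The class below is a sketch under assumptions: `FailureWindow` is a hypothetical name, timestamps are passed in explicitly for testability, and clearing the window on success is one reading of "resetting the failure count".

```python
from collections import deque


class FailureWindow:
    """Block all requests once enough failures land inside a time window."""

    def __init__(self, threshold: int = 20, window: float = 30.0,
                 block_duration: float = 30.0) -> None:
        self.threshold = threshold
        self.window = window
        self.block_duration = block_duration
        self.blocked_until = 0.0
        self._failures: deque[float] = deque()

    def _trim(self, now: float) -> None:
        # Forget failures that fell out of the window.
        cutoff = now - self.window
        while self._failures and self._failures[0] < cutoff:
            self._failures.popleft()

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        self._trim(now)
        if len(self._failures) >= self.threshold:
            self.blocked_until = now + self.block_duration
            self._failures.clear()

    def record_success(self) -> None:
        # Assumed behavior: a success resets the failure count.
        self._failures.clear()

    def is_blocked(self, now: float) -> bool:
        return now < self.blocked_until
```

With the documented defaults, 20 failures within 30 seconds would block every request for the next 30 seconds, regardless of model.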
SimpleRateLimiter.get_state
async def get_state(model: str) -> dict[str, Any] | None
Get the current state for a model (for debugging).
SimpleRateLimiter.get_all_states
async def get_all_states() -> dict[str, dict[str, Any]]
Get all model states (for debugging).
SimpleRateLimiter.clear
async def clear() -> None
Clear all state.
SimpleRateLimiter.get_stats
def get_stats() -> dict[str, Any]
Get limiter statistics.
SimpleRateLimiter.submit_request
async def submit_request(
metadata: "RequestMetadata",
request_func: Callable[[], Awaitable[Any]],
error_factory: Callable[..., Exception] | None = None) -> Any
Submit a request for rate-limited execution.
This is the PRIMARY interface method, invoked by VeniceClient for every request.
It orchestrates the complete request lifecycle:
- Check if rate limited (wait if needed)
- Execute the request via request_func
- Inspect response - if 429, create RateLimitError using error_factory
- Update state from response headers
- Handle 429 errors with backoff and retry
- Return the response
IMPORTANT: request_func() returns raw responses including 429s. This method is responsible for detecting 429s and creating errors.
Arguments:
- metadata: Request metadata containing model_id, endpoint, etc.
- request_func: Async callable that executes the actual HTTP request. Returns the raw response (including 429s).
- error_factory: Optional callable to create errors from the response. Signature: (message, request, body, response) -> Exception. If provided, it is used to create the RateLimitError for 429 responses. If not provided, a basic RateLimitError is created.
Returns:
The response from executing request_func
Raises:
- RateLimitError: If max retries are exceeded while still rate limited
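The lifecycle described for submit_request() can be reduced to a short retry loop, shown here as a simplified sketch rather than the SDK's implementation. `RateLimitError` and `FakeResponse` are local stand-ins, and the sketch leaves out the pre-request wait, header updates, jitter, and error_factory so the control flow stays visible.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable


class RateLimitError(Exception):
    """Local stand-in for the SDK's RateLimitError."""


@dataclass
class FakeResponse:
    """Minimal response shape assumed by this sketch."""
    status_code: int
    headers: dict = field(default_factory=dict)


async def submit_with_retry(
    request_func: Callable[[], Awaitable[Any]],
    max_retries: int = 3,
    min_backoff: float = 1.0,
    max_backoff: float = 60.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    """Run request_func, retrying on 429 with capped exponential backoff."""
    for attempt in range(max_retries + 1):
        # request_func returns raw responses, including 429s.
        response = await request_func()
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break
        # Assumed growth: delay doubles per attempt, capped at max_backoff.
        await sleep(min(max_backoff, min_backoff * 2 ** attempt))
    raise RateLimitError(f"still rate limited after {max_retries} retries")
```

Note that max_retries counts retries, so a request that keeps returning 429 is attempted max_retries + 1 times in this sketch before the error is raised.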
NoOpRateLimiter Objects
class NoOpRateLimiter()
Rate limiter that does nothing (for testing/disabled mode).
WARNING: Using this in production bypasses all rate limit protection. Use only for testing or when you have external rate limiting in place.
Implements the full RateLimiterProtocol including submit_request().
NoOpRateLimiter.submit_request
async def submit_request(
metadata: "RequestMetadata",
request_func: Callable[[], Awaitable[Any]],
error_factory: Callable[..., Exception] | None = None) -> Any
Execute request without any rate limiting.
Note: error_factory is accepted but ignored - NoOpRateLimiter does not create errors, it passes responses through directly.