Rate Limiting Architecture

Venice AI uses a two-layer rate limiting design. Understanding the separation is essential before modifying either layer.

Layer 1: Discovery (`core/rate_limit_discovery.py`)

Purpose: Learn what the limits are.

Calls api_keys/rate_limits to fetch per-model rate limit data from Venice.
Groups models into RateLimitBucket objects (keyed by model ID).
Caches results with configurable TTL (default 5 minutes).
Implements cache stampede prevention via future coalescing so concurrent callers share a single API request.

Key class: RateLimitDiscovery
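
The caching and coalescing behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual RateLimitDiscovery implementation: the class name `CoalescingTTLCache`, the `fetch` callable, and the shape of the returned data are all assumptions made for the example. Only the 5-minute default TTL and the idea of sharing one in-flight request come from the description.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Optional

# Hypothetical stand-in for per-model limit data; the real
# RateLimitBucket fields are not shown in this document.
Limits = Dict[str, dict]


class CoalescingTTLCache:
    """TTL cache whose concurrent misses share a single in-flight fetch."""

    def __init__(self, fetch: Callable[[], Awaitable[Limits]],
                 ttl_seconds: float = 300.0) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._value: Optional[Limits] = None
        self._expires_at = 0.0
        self._inflight: Optional[asyncio.Task] = None

    async def get(self) -> Limits:
        if self._value is not None and time.monotonic() < self._expires_at:
            return self._value  # fresh cache hit, no API call
        if self._inflight is None:
            # The first caller after expiry starts the fetch.
            self._inflight = asyncio.create_task(self._refresh())
        # Every caller awaits the same task; shield() keeps one cancelled
        # waiter from cancelling the fetch for the others.
        return await asyncio.shield(self._inflight)

    async def _refresh(self) -> Limits:
        try:
            value = await self._fetch()
            self._value = value
            self._expires_at = time.monotonic() + self._ttl
            return value
        finally:
            # Clear the slot so the next expiry triggers a new fetch.
            self._inflight = None
```

With this shape, any number of callers that arrive while the cache is cold or expired trigger exactly one call to `fetch`.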

Layer 2: Enforcement (`rate_limiting/simple.py`)

Purpose: Enforce limits at request time. In the default SIMPLE mode, the limits come from response headers, not from Layer 1 discovery.

Maintains per-MODEL rate-limit state parsed from response headers (rpm/tpm limits + reset times).
Reactive only: after a 429 it backs off with exponential backoff + jitter and gates on the backoff window and rpm_remaining; it does NOT proactively enforce RPM/RPD/TPM before sending (TPM is stored but not gated; there is no RPD field).
Adds global abuse protection that blocks ALL requests after a failure threshold within a window.

Key class: SimpleRateLimiter (implements RateLimiterProtocol)
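
To make the two reactive mechanisms concrete, here is a minimal sketch of exponential backoff with jitter and a global failure-window block. It is not SimpleRateLimiter's code. The "full jitter" variant, the function and class names, and every default value (base, cap, threshold, window, block duration) are assumptions chosen for illustration.

```python
import random
import time
from collections import deque
from typing import Deque, Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    source = rng or random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


class GlobalFailureBlock:
    """Blocks all requests once `threshold` failures land within `window` seconds."""

    def __init__(self, threshold: int = 5, window: float = 60.0,
                 block_for: float = 30.0) -> None:
        self._threshold = threshold
        self._window = window
        self._block_for = block_for
        self._failures: Deque[float] = deque()
        self._blocked_until = 0.0

    def record_failure(self, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._failures.append(now)
        # Drop failures that have aged out of the sliding window.
        while self._failures and now - self._failures[0] > self._window:
            self._failures.popleft()
        if len(self._failures) >= self._threshold:
            self._blocked_until = now + self._block_for

    def is_blocked(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        return now < self._blocked_until
```

The `now` parameter exists only so the sketch can be exercised deterministically; a real limiter would read the clock itself.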

How They Connect

VeniceClient
  --> builds RequestMetadata (no classifier in SIMPLE mode; minimal fallback metadata)
  --> rate_limiter.submit_request(metadata, request_func)
  --> SimpleRateLimiter.acquire(model) (checks per-model header-derived state + global block)
  --> HTTP request
  --> SimpleRateLimiter.update_from_headers(model, headers, status) (records limits/backoff)

RequestClassifier and RateLimitDiscovery are constructed and wired only in ADAPTIVE mode (factory.py); the default SIMPLE limiter uses neither and resolves nothing to buckets.
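
The SIMPLE-mode call sequence can be sketched as a toy limiter. The method names `submit_request`, `acquire`, and `update_from_headers` come from the flow above. Everything else is an assumption for illustration: the `ToyLimiter` class, the `x-remaining` header name, the fixed 0.05 s backoff, and the `(status, headers, body)` tuple returned by `request_func`.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, Mapping, Tuple


@dataclass
class RequestMetadata:
    model: str  # hypothetical minimal fallback metadata


@dataclass
class ModelState:
    rpm_remaining: int = 1
    backoff_until: float = 0.0


class ToyLimiter:
    """Reactive limiter: state is learned only from responses."""

    def __init__(self) -> None:
        self._state: Dict[str, ModelState] = {}

    async def acquire(self, model: str) -> None:
        state = self._state.setdefault(model, ModelState())
        delay = state.backoff_until - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)  # wait out the backoff window

    def update_from_headers(self, model: str, headers: Mapping[str, str],
                            status: int) -> None:
        state = self._state.setdefault(model, ModelState())
        if "x-remaining" in headers:  # hypothetical header name
            state.rpm_remaining = int(headers["x-remaining"])
        if status == 429:
            state.backoff_until = time.monotonic() + 0.05  # illustrative delay

    async def submit_request(
        self,
        metadata: RequestMetadata,
        request_func: Callable[[], Awaitable[Tuple[int, Mapping[str, str], Any]]],
    ) -> Tuple[int, Any]:
        await self.acquire(metadata.model)         # check state before sending
        status, headers, body = await request_func()  # the HTTP request
        self.update_from_headers(metadata.model, headers, status)
        return status, body
```

Note that nothing in this path consults discovery data or buckets; the limiter knows only what previous responses told it.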

Optional: Adaptive Scheduler (extracted package)

The reactive limiter above is the default. For multi-process / multi-worker deployments, the SDK can swap it for the proactive AdaptiveScheduler that ships in the standalone adaptive-rate-limiter PyPI package (pip install venice-ai[adaptive]). When RateLimiterMode.ADAPTIVE is configured, factory._create_rate_limiter constructs an AdaptiveScheduler backed by Redis, wraps it in AdaptiveSchedulerAdapter (in factory.py) to satisfy RateLimiterProtocol, and returns it in place of SimpleRateLimiter. The adapter bridges Venice's RateLimitDiscovery and RequestClassifier to the extracted library's protocols. ADAPTIVE mode requires redis_url to be set (either on rate_limiter.redis_url or on backend.redis.redis_url).
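
The redis_url requirement might be checked along these lines. This is a sketch under stated assumptions, not the factory's actual logic: the dataclasses are hypothetical stand-ins for the real config objects, nesting `backend` under the rate limiter settings is a guess, and the order of precedence between the two locations is assumed (rate_limiter.redis_url first).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RedisSettings:
    redis_url: Optional[str] = None


@dataclass
class BackendSettings:
    redis: RedisSettings = field(default_factory=RedisSettings)


@dataclass
class RateLimiterSettings:
    # Hypothetical layout mirroring the two documented locations.
    redis_url: Optional[str] = None
    backend: BackendSettings = field(default_factory=BackendSettings)


def resolve_redis_url(cfg: RateLimiterSettings) -> str:
    """Return the Redis URL for ADAPTIVE mode, or fail early if none is set."""
    # Assumed precedence: the top-level value wins over the backend value.
    url = cfg.redis_url or cfg.backend.redis.redis_url
    if not url:
        raise ValueError(
            "ADAPTIVE mode requires redis_url on rate_limiter.redis_url "
            "or backend.redis.redis_url"
        )
    return url
```

Failing at construction time, rather than on the first request, keeps a misconfigured multi-worker deployment from silently running without shared state.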

Note: The AdaptiveScheduler is always constructed in SchedulerMode.INTELLIGENT, regardless of the SchedulerConfig.mode setting. The ADAPTIVE path does not consult SchedulerConfig.mode because INTELLIGENT mode is required: the scheduler uses the VeniceProvider and VeniceClassifierAdapter that are also constructed during ADAPTIVE initialisation. Setting SchedulerConfig(mode=SchedulerMode.BASIC) or SchedulerConfig(mode=SchedulerMode.ACCOUNT) has no effect when RateLimiterMode.ADAPTIVE is active.

Configuration

See config/rate_limiter.yaml for reference configuration with annotated fields.
