Rate Limiting Architecture
Venice AI uses a two-layer rate limiting design. Understanding the separation is essential before modifying either layer.
Layer 1: Discovery (core/rate_limit_discovery.py)
Purpose: Learn what the limits are.
- Calls api_keys/rate_limits to fetch per-model rate limit data from Venice.
- Groups models into RateLimitBucket objects (keyed by model ID).
- Caches results with a configurable TTL (default 5 minutes).
- Implements cache stampede prevention via future coalescing so concurrent callers share a single API request.
Key class: RateLimitDiscovery
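The caching and coalescing behaviour described above can be sketched as follows. This is a minimal illustration, not the actual RateLimitDiscovery implementation: the class name CoalescingTTLCache, its fetch parameter, and the use of asyncio are all assumptions made for the example.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Optional


class CoalescingTTLCache:
    """Hypothetical TTL cache with future coalescing (illustration only)."""

    def __init__(
        self,
        fetch: Callable[[], Awaitable[Dict[str, dict]]],
        ttl_seconds: float = 300.0,  # mirrors the documented 5-minute default
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._value: Optional[Dict[str, dict]] = None
        self._expires_at = 0.0
        self._inflight: Optional["asyncio.Future[Dict[str, dict]]"] = None

    async def get(self) -> Dict[str, dict]:
        # Fast path: serve the cached value while it is still fresh.
        if self._value is not None and time.monotonic() < self._expires_at:
            return self._value
        # Slow path: only the first caller starts a refresh; everyone else
        # awaits the same in-flight future instead of issuing another request.
        if self._inflight is None:
            self._inflight = asyncio.ensure_future(self._refresh())
        # shield() keeps one cancelled waiter from cancelling the shared fetch.
        return await asyncio.shield(self._inflight)

    async def _refresh(self) -> Dict[str, dict]:
        try:
            value = await self._fetch()
            self._value = value
            self._expires_at = time.monotonic() + self._ttl
            return value
        finally:
            # Clear the slot so the next expiry (or a failure) triggers a new fetch.
            self._inflight = None
```

With this shape, five concurrent get() calls against a cold cache result in a single call to fetch, and later calls within the TTL never touch the network.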
Layer 2: Enforcement (rate_limiting/simple.py)
Purpose: Enforce the discovered limits at request time.
- Maintains per-MODEL rate-limit state parsed from response headers (rpm/tpm limits + reset times).
- Reactive only: after a 429 it backs off with exponential backoff + jitter and gates on the backoff window and rpm_remaining; it does NOT proactively enforce RPM/RPD/TPM before sending (TPM is stored but not gated; there is no RPD field).
- Adds global abuse protection that blocks ALL requests after a failure threshold within a window.
Key class: SimpleRateLimiter (implements RateLimiterProtocol)
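A rough sketch of the reactive behaviour described above, under several assumptions: the ReactiveGate class and backoff_delay function are hypothetical names, "full jitter" is one common jitter strategy and may not be the one SimpleRateLimiter uses, only 429 responses are counted as failures, and header parsing and the rpm_remaining gate are omitted.

```python
import random
import time
from collections import deque
from typing import Callable, Deque, Dict


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


class ReactiveGate:
    """Hypothetical per-model backoff plus a global failure-window block."""

    def __init__(
        self,
        failure_threshold: int = 5,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._window = window_seconds
        self._clock = clock
        self._blocked_until: Dict[str, float] = {}
        self._attempts: Dict[str, int] = {}
        self._failures: Deque[float] = deque()

    def record(self, model: str, status: int) -> None:
        """Update state after a response; nothing is enforced before sending."""
        now = self._clock()
        if status == 429:
            attempt = self._attempts.get(model, 0)
            self._blocked_until[model] = now + backoff_delay(attempt)
            self._attempts[model] = attempt + 1
            self._failures.append(now)
        else:
            # A non-429 response resets the per-model backoff exponent.
            self._attempts.pop(model, None)

    def allowed(self, model: str) -> bool:
        now = self._clock()
        # Drop failures that have aged out of the sliding window.
        while self._failures and now - self._failures[0] > self._window:
            self._failures.popleft()
        if len(self._failures) >= self._threshold:
            return False  # global block: applies to ALL models
        return now >= self._blocked_until.get(model, 0.0)
```

The injectable clock is only there to make the sketch testable without sleeping.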
How They Connect
VeniceClient
--> builds RequestMetadata (no classifier in SIMPLE mode; minimal fallback metadata)
--> rate_limiter.submit_request(metadata, request_func)
--> SimpleRateLimiter.acquire(model) (checks per-model header-derived state + global block)
--> HTTP request
--> SimpleRateLimiter.update_from_headers(model, headers, status) (records limits/backoff)
RequestClassifier and RateLimitDiscovery are constructed and wired only in
ADAPTIVE mode (factory.py); the default SIMPLE limiter uses neither and
resolves nothing to buckets.
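The SIMPLE-mode call order above can be illustrated with a stub limiter that records each step. Only the method names submit_request, acquire, and update_from_headers come from the flow above; the synchronous signatures, the single model field on RequestMetadata, the (status, headers) response shape, and the header name are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Response = Tuple[int, Dict[str, str]]  # assumed shape: (status, headers)


@dataclass
class RequestMetadata:
    model: str  # hypothetical minimal field; the real class may carry more


class RecordingLimiter:
    """Stub that records the order of limiter calls (illustration only)."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def acquire(self, model: str) -> None:
        # Real limiter: check per-model header-derived state + global block.
        self.events.append(f"acquire:{model}")

    def update_from_headers(self, model: str, headers: Dict[str, str], status: int) -> None:
        # Real limiter: record limits and, on 429, start a backoff window.
        self.events.append(f"update:{model}:{status}")

    def submit_request(self, metadata: RequestMetadata, request_func: Callable[[], Response]) -> Response:
        self.acquire(metadata.model)
        status, headers = request_func()  # the HTTP request itself
        self.update_from_headers(metadata.model, headers, status)
        return status, headers


limiter = RecordingLimiter()
limiter.submit_request(
    RequestMetadata(model="example-model"),
    lambda: (200, {"x-example-remaining": "9"}),  # made-up header name
)
```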
Optional: Adaptive Scheduler (extracted package)
The reactive limiter above is the default. For multi-process / multi-worker
deployments, the SDK can swap it for the proactive AdaptiveScheduler that
ships in the standalone adaptive-rate-limiter PyPI package
(pip install venice-ai[adaptive]). When RateLimiterMode.ADAPTIVE is
configured, factory._create_rate_limiter constructs an AdaptiveScheduler
backed by Redis, wraps it in AdaptiveSchedulerAdapter (in factory.py)
to satisfy RateLimiterProtocol, and returns it in place of
SimpleRateLimiter. The adapter bridges Venice's RateLimitDiscovery and
RequestClassifier to the extracted library's protocols. ADAPTIVE mode requires redis_url to be set (either on rate_limiter.redis_url or on backend.redis.redis_url).
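The redis_url requirement can be sketched as a small lookup over a nested config mapping. The helper name resolve_redis_url and the dict layout are hypothetical, and the precedence shown (rate_limiter.redis_url first, then backend.redis.redis_url) is an assumption; the source only states that one of the two must be set.

```python
from typing import Any, Mapping, Optional


def resolve_redis_url(config: Mapping[str, Any]) -> str:
    """Return the Redis URL for ADAPTIVE mode or raise if none is configured."""
    rate_limiter_url: Optional[str] = config.get("rate_limiter", {}).get("redis_url")
    backend_url: Optional[str] = (
        config.get("backend", {}).get("redis", {}).get("redis_url")
    )
    # Assumed precedence: the limiter-specific setting wins over the backend one.
    url = rate_limiter_url or backend_url
    if not url:
        raise ValueError(
            "ADAPTIVE mode requires redis_url on rate_limiter.redis_url "
            "or backend.redis.redis_url"
        )
    return url
```

Failing fast at construction time keeps a missing Redis setting from surfacing later as a connection error on the first request.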
Note: The AdaptiveScheduler is always constructed in SchedulerMode.INTELLIGENT mode regardless of the SchedulerConfig.mode setting. SchedulerConfig.mode is not consulted in the ADAPTIVE path because INTELLIGENT mode is required: it uses the VeniceProvider and VeniceClassifierAdapter that are also constructed during ADAPTIVE initialisation. Setting SchedulerConfig(mode=SchedulerMode.BASIC) or SchedulerMode.ACCOUNT has no effect when RateLimiterMode.ADAPTIVE is active.
Configuration
See config/rate_limiter.yaml for reference configuration with annotated fields.