Cost Calculation and Estimation Utilities

This module provides comprehensive cost calculation and estimation utilities for Venice AI API usage. It supports real-time cost tracking, pre-request estimation, and detailed breakdown of usage costs across different pricing models.

Post-request calculations use the token usage reported by the API; pre-request estimates use heuristic token counts. Both apply the current model pricing information.

Key Features:

  • Real-time Cost Calculation: Calculate actual costs from API responses
  • Pre-request Estimation: Estimate costs before making API calls
  • Token-based Pricing: Accurate cost calculation based on token consumption
  • Model-specific Pricing: Different pricing tiers for different models

Pricing Models:

  • Input Tokens: Cost for processing input text (prompts, messages)
  • Output Tokens: Cost for generating output text (completions, responses)
  • Flat Rate Models: Simple per-request pricing for some operations
  • Tiered Pricing: Volume-based pricing with different rates

Cost Types:

  • USD: Traditional US Dollar pricing for enterprise billing
  • DIEM: Platform currency where 1 DIEM = $1 USD

Example:

>>> from venice_ai.costs import calculate_completion_cost, estimate_completion_cost
>>>
>>> # Calculate actual cost from a completion
>>> completion = await client.chat.completions.create(...)
>>> model_pricing = await client.get_model_pricing("llama-3.3-70b")
>>> cost = calculate_completion_cost(completion, model_pricing)
>>> print(f"Cost: ${cost['usd']:.6f} USD")
>>>
>>> # Estimate cost before making request
>>> estimated_cost = estimate_completion_cost(
...     prompt="Your prompt here",
...     estimated_completion_tokens=500,
...     model_pricing=model_pricing
... )
>>> print(f"Estimated: ${estimated_cost['usd']:.6f} USD")

ChatCostEstimate Objects

class ChatCostEstimate(BaseModel)

Pre-flight cost estimate for a chat completion request.

Returned by :meth:client.chat.completions.estimate_cost. Token counts are heuristic word-count approximations (the same approach the :func:estimate_completion_cost helper uses); total_cost_usd is therefore an estimate, not a guarantee.

calculate_completion_cost

def calculate_completion_cost(
        completion: ChatCompletion,
        model_pricing: ModelPricing | None) -> dict[str, Decimal]

Calculate the actual cost of a completed chat completion request.

This function analyzes a ChatCompletion response and calculates the precise cost based on actual token usage reported by the API. It handles both input and output token pricing.

The calculation uses the token usage data from the completion response and applies the current pricing structure for the specific model used. This provides accurate post-request cost tracking for billing and analytics.

Token Cost Calculation:

  • Input Cost: (prompt_tokens ÷ 1,000,000) × input_cost_per_million_tokens
  • Output Cost: (completion_tokens ÷ 1,000,000) × output_cost_per_million_tokens
  • Total Cost: Input Cost + Output Cost
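The arithmetic can be sketched with the standard decimal module. This is a minimal illustration rather than the library's implementation: the helper name token_cost_usd and the example rates are hypothetical, and the sketch assumes that per-million rates are applied to token counts divided by 1,000,000.

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)


def token_cost_usd(prompt_tokens: int, completion_tokens: int,
                   input_rate: Decimal, output_rate: Decimal) -> Decimal:
    """Hypothetical helper; both rates are USD per one million tokens."""
    input_cost = Decimal(prompt_tokens) / TOKENS_PER_MILLION * input_rate
    output_cost = Decimal(completion_tokens) / TOKENS_PER_MILLION * output_rate
    return input_cost + output_cost


# Made-up rates for illustration only: $0.70 input, $2.80 output per million tokens.
cost = token_cost_usd(1200, 350, Decimal("0.70"), Decimal("2.80"))
print(f"Cost: ${cost:.6f} USD")
```

Keeping every operand a Decimal avoids the binary rounding error that float arithmetic would introduce at these very small magnitudes.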

Arguments:

  • completion - The completed ChatCompletion response containing actual token usage data from the API. Must include usage information with prompt_tokens and completion_tokens counts.
  • model_pricing - Current pricing information for the model that was used. Contains input and output costs per million tokens. If None, returns zero cost.

Returns:

Dictionary with cost breakdown containing:

  • 'usd': Total cost in US Dollars as a Decimal with exact precision

Notes:

If the completion lacks usage data or model_pricing is None, the function returns zero cost rather than raising an exception to maintain robust operation in production environments.
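The zero-cost fallback described in this note might look roughly like the following. The Usage and Pricing dataclasses and the safe_cost function are hypothetical stand-ins written for this sketch; they are not the SDK's ChatCompletion or ModelPricing types.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Usage:
    """Hypothetical stand-in for the usage block of a response."""
    prompt_tokens: int
    completion_tokens: int


@dataclass
class Pricing:
    """Hypothetical stand-in for ModelPricing; rates are USD per million tokens."""
    input_usd_per_million: Decimal
    output_usd_per_million: Decimal


def safe_cost(usage: Optional[Usage], pricing: Optional[Pricing]) -> dict:
    # Missing usage or pricing yields a zero cost instead of an exception.
    if usage is None or pricing is None:
        return {"usd": Decimal("0")}
    weighted = (Decimal(usage.prompt_tokens) * pricing.input_usd_per_million
                + Decimal(usage.completion_tokens) * pricing.output_usd_per_million)
    return {"usd": weighted / Decimal(1_000_000)}
```

One consequence of this design is that a zero result is ambiguous: it can mean a free request or missing data, so callers who need to distinguish the two should check the inputs themselves.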

Example:

>>> from venice_ai import VeniceClient
>>> from venice_ai.costs import calculate_completion_cost
>>>
>>> client = VeniceClient(api_key="your-api-key")
>>>
>>> # Create a chat completion
>>> completion = await client.chat.completions.create(
...     model="llama-3.3-70b",
...     messages=[{"role": "user", "content": "Hello world!"}]
... )
>>>
>>> # Get current model pricing
>>> model_pricing = await client.get_model_pricing("llama-3.3-70b")
>>>
>>> # Calculate actual costs
>>> costs = calculate_completion_cost(completion, model_pricing)
>>> print(f"Cost: ${costs['usd']:.6f} USD")
>>> print(f"Tokens: {completion.usage.total_tokens} total")

calculate_embedding_cost

def calculate_embedding_cost(
        embedding_response: Any,
        model_pricing: ModelPricing | None) -> dict[str, Decimal]

Calculate the actual cost of a completed embedding request.

This function analyzes an embedding response and calculates the cost based on the total tokens processed during the embedding generation. Unlike chat completions, embeddings typically use only input token pricing since they don't generate variable-length outputs.

Embedding Cost Calculation:

  • Input Processing: (total_tokens ÷ 1,000,000) × input_cost_per_million_tokens
  • Fixed Output: Embeddings have fixed output dimensions
  • Total Cost: Primarily based on input token processing
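As a rough sketch of the input-only calculation, the snippet below sums the token counts of a small batch and applies a single per-million rate. The function name embedding_cost_usd, the token counts, and the rate are all made up for illustration; the real helper reads total_tokens from the response's usage object.

```python
from decimal import Decimal


def embedding_cost_usd(total_tokens: int, input_rate_per_million: Decimal) -> Decimal:
    """Hypothetical helper: only input tokens are billed for embeddings."""
    return Decimal(total_tokens) * input_rate_per_million / Decimal(1_000_000)


# Illustrative token counts for three embedded texts and a made-up $0.02 rate.
batch_tokens = [12, 48, 40]
cost = embedding_cost_usd(sum(batch_tokens), Decimal("0.02"))
print(f"Cost: ${cost:.6f} USD")
```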

Arguments:

  • embedding_response - The completed embedding response containing usage data. Must include a usage object with total_tokens count from the embedding operation.
  • model_pricing - Current pricing information for the embedding model. Contains input costs per million tokens. If None, returns zero cost.

Returns:

Dictionary with cost breakdown containing:

  • 'usd': Total cost in US Dollars as a Decimal with exact precision

Example:

>>> from venice_ai import VeniceClient
>>> from venice_ai.costs import calculate_embedding_cost
>>>
>>> client = VeniceClient(api_key="your-api-key")
>>>
>>> # Create embeddings
>>> response = await client.embeddings.create(
...     model="text-embedding-3-small",
...     input="Hello, world! This is a sample text."
... )
>>>
>>> # Get current model pricing
>>> model_pricing = await client.get_model_pricing("text-embedding-3-small")
>>>
>>> # Calculate actual costs
>>> costs = calculate_embedding_cost(response, model_pricing)
>>> print(f"Cost: ${costs['usd']:.6f} USD")
>>> print(f"Tokens processed: {response.usage.total_tokens}")

estimate_completion_cost

def estimate_completion_cost(
        prompt: str,
        estimated_completion_tokens: int,
        model_pricing: ModelPricing | None,
        tokens_per_word: float = 1.3) -> dict[str, Decimal]

Estimate the cost of a chat completion before making the API request.

This function provides pre-request cost estimation based on prompt analysis and expected completion length. It uses heuristic token counting to estimate input costs and user-provided estimates for output costs, enabling budget planning and cost-aware request optimization.

The estimation is particularly useful for:

  • Budget planning and cost control
  • Optimizing prompts for cost efficiency
  • Batch processing cost estimation
  • User-facing cost previews

Estimation Methodology:

  • Input Tokens: Estimated from word count using configurable ratio
  • Output Tokens: User-provided estimate based on expected response length
  • Pricing: Applied using current model pricing structure
  • Accuracy: Approximation only - actual costs may vary
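The methodology above can be approximated in a few lines. This sketch makes assumptions the documentation does not confirm: words are counted by splitting on whitespace, and fractional token counts are rounded up. The function names and rates are hypothetical.

```python
import math
from decimal import Decimal


def estimate_prompt_tokens(prompt: str, tokens_per_word: float = 1.3) -> int:
    # Assumption: whitespace-separated words, rounded up to a whole token.
    word_count = len(prompt.split())
    return math.ceil(word_count * tokens_per_word)


def estimate_cost_usd(prompt: str, estimated_completion_tokens: int,
                      input_rate: Decimal, output_rate: Decimal,
                      tokens_per_word: float = 1.3) -> Decimal:
    """Hypothetical estimator; both rates are USD per one million tokens."""
    prompt_tokens = estimate_prompt_tokens(prompt, tokens_per_word)
    weighted = (Decimal(prompt_tokens) * input_rate
                + Decimal(estimated_completion_tokens) * output_rate)
    return weighted / Decimal(1_000_000)


prompt = "Write a detailed explanation of quantum computing"
tokens = estimate_prompt_tokens(prompt)  # 7 words at 1.3 tokens/word
estimate = estimate_cost_usd(prompt, 200, Decimal("0.70"), Decimal("2.80"))
print(f"{tokens} prompt tokens, estimated ${estimate:.6f} USD")
```

With a short prompt and a long expected completion, the output-token estimate dominates the result, so the caller's guess for estimated_completion_tokens matters more than the word-count heuristic.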

Arguments:

  • prompt - The input text to estimate token costs for. This is analyzed for word count and converted to estimated tokens using the tokens_per_word ratio.
  • estimated_completion_tokens - Expected number of tokens in the model's response. This should be estimated based on the desired response length and complexity.
  • model_pricing - Current pricing information for the target model. Contains input and output costs per million tokens. If None, returns zero cost.
  • tokens_per_word - Conversion ratio from words to tokens. Default of 1.3 is optimized for English text. Adjust for other contexts:
    • English text: ~1.3 tokens/word (default)
    • Japanese/Chinese: ~2.0 tokens/word
    • Code/technical: ~1.5-2.0 tokens/word
    • Mixed content: Adjust based on composition

Returns:

Dictionary with estimated cost breakdown containing:

  • 'usd': Estimated total cost in US Dollars as a Decimal with exact precision

Accuracy Notes:

  • Token estimation is heuristic and may not match exact tokenization
  • Actual costs depend on precise tokenizer behavior
  • Output token count is user-estimated and may vary significantly
  • Different models may have different tokenization patterns

Example:

>>> from venice_ai import VeniceClient
>>> from venice_ai.costs import estimate_completion_cost
>>>
>>> client = VeniceClient(api_key="your-api-key")
>>>
>>> # Get current model pricing
>>> model_pricing = await client.get_model_pricing("llama-3.3-70b")
>>>
>>> # Estimate costs for different scenarios
>>> prompt = "Write a detailed explanation of quantum computing"
>>>
>>> # Short response estimate
>>> short_cost = estimate_completion_cost(
...     prompt=prompt,
...     estimated_completion_tokens=200,
...     model_pricing=model_pricing
... )
>>>
>>> # Long response estimate
>>> long_cost = estimate_completion_cost(
...     prompt=prompt,
...     estimated_completion_tokens=1000,
...     model_pricing=model_pricing
... )
>>>
>>> print(f"Short response: ${short_cost['usd']:.6f} USD")
>>> print(f"Long response: ${long_cost['usd']:.6f} USD")
>>> print(f"Cost difference: ${long_cost['usd'] - short_cost['usd']:.6f} USD")

CostRecord Objects

class CostRecord(BaseModel)

One per-request cost-tracking entry.

CostSummary Objects

class CostSummary(BaseModel)

Aggregate stats produced by :meth:CostTracker.summary.

BudgetRemaining Objects

class BudgetRemaining(BaseModel)

Remaining-budget snapshot returned by :meth:BudgetManager.remaining.

CostTracker Objects

class CostTracker()

Stateful, async-safe accumulator for per-request API costs.

Wraps the existing :func:calculate_completion_cost and :func:calculate_embedding_cost helpers. Three integration paths:

  • Manual — call :meth:track on each response yourself.
  • Wired-on-client — pass to VeniceClient(cost_tracker=tracker); the SDK calls :meth:track automatically on every chat / embeddings response.
  • From-client factory — :meth:from_client builds a tracker pre-populated with the live pricing map.

All mutating operations take a single :class:asyncio.Lock so concurrent in-flight requests can update state safely.
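The locking pattern can be illustrated with a stripped-down accumulator. MiniTracker is a hypothetical class written for this sketch and shares only the general idea with CostTracker: a single asyncio.Lock guards every read and write of the shared state.

```python
import asyncio
from decimal import Decimal


class MiniTracker:
    """Hypothetical stand-in, not the SDK's CostTracker."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._total = Decimal("0")
        self._count = 0

    async def add(self, cost_usd: Decimal) -> None:
        # The lock serializes updates from concurrently running tasks.
        async with self._lock:
            self._total += cost_usd
            self._count += 1

    async def total(self) -> Decimal:
        async with self._lock:
            return self._total


async def main() -> Decimal:
    tracker = MiniTracker()
    # Simulate 100 in-flight requests reporting their cost at once.
    await asyncio.gather(*(tracker.add(Decimal("0.001")) for _ in range(100)))
    return await tracker.total()


result = asyncio.run(main())
print(result)
```

An asyncio.Lock coordinates tasks on one event loop only; it does not protect state shared across threads or processes.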

CostTracker.__init__

def __init__(pricing_map: dict[str, ModelPricing] | None = None) -> None

Arguments:

  • pricing_map: {model_id: ModelPricing}. Models absent from the map produce zero-cost records (the underlying helpers gracefully handle missing pricing).

CostTracker.from_client

@classmethod
async def from_client(cls, client: VeniceClient) -> CostTracker

Build a tracker pre-populated with the live chat-pricing map.

CostTracker.track

async def track(response: ChatCompletion | EmbeddingsResponse,
                *,
                model: str | None = None,
                metadata: dict[str, Any] | None = None) -> Decimal

Record one response and return its USD cost.

Arguments:

  • response: A :class:ChatCompletion or :class:EmbeddingsResponse.
  • model: Override the model id used to look up pricing. Defaults to response.model.
  • metadata: Free-form metadata stored on the resulting :class:CostRecord.

Raises:

  • TypeError: For unsupported response types.

CostTracker.summary

async def summary() -> CostSummary

Aggregate stats across all tracked requests.

CostTracker.by_model

async def by_model() -> dict[str, Decimal]

USD cost grouped by model id.
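A per-model grouping of this kind can be sketched over plain (model_id, cost) pairs. The function group_cost_by_model and the sample records are hypothetical; the real method works from the tracker's stored CostRecord entries, whose fields are not shown here.

```python
from collections import defaultdict
from decimal import Decimal


def group_cost_by_model(records):
    """Hypothetical helper: sum USD cost per model id."""
    totals = defaultdict(Decimal)  # Decimal() starts each model at 0
    for model_id, cost_usd in records:
        totals[model_id] += cost_usd
    return dict(totals)


# Made-up records for illustration.
records = [
    ("llama-3.3-70b", Decimal("0.0012")),
    ("text-embedding-3-small", Decimal("0.000002")),
    ("llama-3.3-70b", Decimal("0.0030")),
]
by_model = group_cost_by_model(records)
print(by_model)
```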

CostTracker.reset

async def reset() -> None

Clear all tracked state.

BudgetManager Objects

class BudgetManager()

Daily / monthly USD-cap enforcement layered on a :class:CostTracker.

Either daily_usd or monthly_usd may be None to disable that cap. The tracker is shared, not owned — BudgetManager does not call :meth:CostTracker.reset; callers manage rollover themselves.

BudgetManager.can_afford

async def can_afford(estimated_cost_usd: Decimal) -> bool

True if adding estimated_cost_usd keeps both caps satisfied.
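The check can be sketched as a pure function over the current spend figures. The name within_caps and its parameters are hypothetical, and the sketch assumes that a request landing exactly on a cap still counts as affordable, which the documentation does not specify.

```python
from decimal import Decimal
from typing import Optional


def within_caps(estimated_cost_usd: Decimal,
                spent_today_usd: Decimal,
                spent_month_usd: Decimal,
                daily_usd: Optional[Decimal],
                monthly_usd: Optional[Decimal]) -> bool:
    """Hypothetical check; a cap of None is treated as disabled."""
    if daily_usd is not None and spent_today_usd + estimated_cost_usd > daily_usd:
        return False
    if monthly_usd is not None and spent_month_usd + estimated_cost_usd > monthly_usd:
        return False
    return True


# $4.90 spent today against a $5 daily cap; the monthly cap is disabled.
ok = within_caps(Decimal("0.05"), Decimal("4.90"), Decimal("40"), Decimal("5"), None)
too_much = within_caps(Decimal("0.20"), Decimal("4.90"), Decimal("40"), Decimal("5"), None)
print(ok, too_much)
```

Because the input is an estimate, a request that passes this check can still push actual spend slightly over a cap once the real token usage is known.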

BudgetManager.remaining

async def remaining() -> BudgetRemaining

Snapshot of remaining headroom and usage percentages.
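One way to compute such a snapshot for a single cap is shown below. RemainingSnapshot and its two fields are invented for this sketch; the actual fields of BudgetRemaining are not documented here. Headroom is clamped at zero so that overspending never reports a negative remainder.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class RemainingSnapshot:
    """Hypothetical shape; not the SDK's BudgetRemaining model."""
    remaining_usd: Optional[Decimal]
    used_percent: Optional[Decimal]


def snapshot(spent_usd: Decimal, cap_usd: Optional[Decimal]) -> RemainingSnapshot:
    if cap_usd is None:
        # A disabled cap has no meaningful headroom or percentage.
        return RemainingSnapshot(None, None)
    remaining = max(cap_usd - spent_usd, Decimal("0"))
    # Guard against division by zero for a cap of 0.
    used = spent_usd / cap_usd * Decimal(100) if cap_usd > 0 else Decimal(100)
    return RemainingSnapshot(remaining, used)


daily = snapshot(Decimal("2.50"), Decimal("10"))
print(daily)
```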