Cost Calculation and Estimation Utilities
This module provides comprehensive cost calculation and estimation utilities for Venice AI API usage. It supports real-time cost tracking, pre-request estimation, and detailed breakdown of usage costs across different pricing models.
Post-request calculations use the token usage reported by the API together with current model pricing. Pre-request estimates rely on heuristic token counts and are approximate.
Key Features:
- Real-time Cost Calculation: Calculate actual costs from API responses
- Pre-request Estimation: Estimate costs before making API calls
- Token-based Pricing: Accurate cost calculation based on token consumption
- Model-specific Pricing: Different pricing tiers for different models
Pricing Models:
- Input Tokens: Cost for processing input text (prompts, messages)
- Output Tokens: Cost for generating output text (completions, responses)
- Flat Rate Models: Simple per-request pricing for some operations
- Tiered Pricing: Volume-based pricing with different rates
Cost Types:
- USD: Traditional US Dollar pricing for enterprise billing
- DIEM: Platform currency where 1 DIEM = $1 USD
Example:
>>> from venice_ai.costs import calculate_completion_cost, estimate_completion_cost
>>>
>>> # Calculate actual cost from a completion
>>> completion = await client.chat.completions.create(...)
>>> model_pricing = await client.get_model_pricing("llama-3.3-70b")
>>> cost = calculate_completion_cost(completion, model_pricing)
>>> print(f"Cost: ${cost['usd']:.6f} USD")
>>>
>>> # Estimate cost before making request
>>> estimated_cost = estimate_completion_cost(
... prompt="Your prompt here",
... estimated_completion_tokens=500,
... model_pricing=model_pricing
... )
>>> print(f"Estimated: ${estimated_cost['usd']:.6f} USD")
ChatCostEstimate Objects
class ChatCostEstimate(BaseModel)
Pre-flight cost estimate for a chat completion request.
Returned by :meth:client.chat.completions.estimate_cost. Token counts
are heuristic word-count approximations (the same approach the
:func:estimate_completion_cost helper uses); total_cost_usd is
therefore an estimate, not a guarantee.
calculate_completion_cost
def calculate_completion_cost(
completion: ChatCompletion,
model_pricing: ModelPricing | None) -> dict[str, Decimal]
Calculate the actual cost of a completed chat completion request.
This function analyzes a ChatCompletion response and calculates the precise cost based on actual token usage reported by the API. It handles both input and output token pricing.
The calculation uses the token usage data from the completion response and applies the current pricing structure for the specific model used. This provides accurate post-request cost tracking for billing and analytics.
Token Cost Calculation:
- Input Cost: (prompt_tokens / 1,000,000) × input_cost_per_million_tokens
- Output Cost: (completion_tokens / 1,000,000) × output_cost_per_million_tokens
- Total Cost: Input Cost + Output Cost
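As a worked illustration of the arithmetic above, the following minimal sketch applies per-million-token rates with Decimal. The helper name token_cost and both rates are hypothetical, chosen only for the example; they are not taken from the library or from real Venice pricing.

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)


def token_cost(tokens: int, usd_per_million: Decimal) -> Decimal:
    """Return the USD cost of `tokens` at a per-million-token rate."""
    return Decimal(tokens) * usd_per_million / TOKENS_PER_MILLION


# Hypothetical rates in USD per million tokens (illustration only).
input_rate = Decimal("0.70")
output_rate = Decimal("2.80")

input_cost = token_cost(1_200, input_rate)   # 1200 prompt tokens
output_cost = token_cost(350, output_rate)   # 350 completion tokens
total = input_cost + output_cost

print(f"Input:  ${input_cost:.6f}")
print(f"Output: ${output_cost:.6f}")
print(f"Total:  ${total:.6f}")
```

Using Decimal rather than float keeps very small per-request amounts exact when many of them are summed.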
Arguments:
- completion: The completed ChatCompletion response containing actual token usage data from the API. Must include usage information with prompt_tokens and completion_tokens counts.
- model_pricing: Current pricing information for the model that was used. Contains input and output costs per million tokens. If None, returns zero cost.
Returns:
Dictionary with cost breakdown containing:
- 'usd': Total cost in US Dollars as a Decimal with exact precision
Notes:
If the completion lacks usage data or model_pricing is None, the function returns zero cost rather than raising an exception to maintain robust operation in production environments.
Example:
>>> from venice_ai import VeniceClient
>>> from venice_ai.costs import calculate_completion_cost
>>>
>>> client = VeniceClient(api_key="your-api-key")
>>>
>>> # Create a chat completion
>>> completion = await client.chat.completions.create(
... model="llama-3.3-70b",
... messages=[{"role": "user", "content": "Hello world!"}]
... )
>>>
>>> # Get current model pricing
>>> model_pricing = await client.get_model_pricing("llama-3.3-70b")
>>>
>>> # Calculate actual costs
>>> costs = calculate_completion_cost(completion, model_pricing)
>>> print(f"Cost: ${costs['usd']:.6f} USD")
>>> print(f"Tokens: {completion.usage.total_tokens} total")
calculate_embedding_cost
def calculate_embedding_cost(
embedding_response: Any,
model_pricing: ModelPricing | None) -> dict[str, Decimal]
Calculate the actual cost of a completed embedding request.
This function analyzes an embedding response and calculates the cost based on the total tokens processed during the embedding generation. Unlike chat completions, embeddings typically use only input token pricing since they don't generate variable-length outputs.
Embedding Cost Calculation:
- Input Processing: (total_tokens / 1,000,000) × input_cost_per_million_tokens
- Fixed Output: Embeddings have fixed output dimensions
- Total Cost: Primarily based on input token processing
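The input-only calculation can be sketched as below. This is not the library's implementation: embedding_cost_sketch is a hypothetical stand-in, the response is faked with SimpleNamespace, the rate is invented, and the zero-cost fallback for a missing usage object is an assumption carried over from the behavior documented for chat completions.

```python
from decimal import Decimal
from types import SimpleNamespace
from typing import Any, Dict, Optional


def embedding_cost_sketch(
    response: Any, usd_per_million_input: Optional[Decimal]
) -> Dict[str, Decimal]:
    """Price an embedding response by its total_tokens at an input-only rate."""
    usage = getattr(response, "usage", None)
    # Assumption: missing usage or missing pricing yields zero cost, no exception.
    if usage is None or usd_per_million_input is None:
        return {"usd": Decimal("0")}
    tokens = Decimal(usage.total_tokens)
    return {"usd": tokens * usd_per_million_input / Decimal(1_000_000)}


# Stand-in for an embeddings response; only the usage field is modeled.
fake_response = SimpleNamespace(usage=SimpleNamespace(total_tokens=9))

priced = embedding_cost_sketch(fake_response, Decimal("0.15"))  # hypothetical rate
unpriced = embedding_cost_sketch(fake_response, None)

print(f"Priced:   ${priced['usd']:.8f}")
print(f"Unpriced: ${unpriced['usd']:.8f}")
```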
Arguments:
- embedding_response: The completed embedding response containing usage data. Must include a usage object with total_tokens count from the embedding operation.
- model_pricing: Current pricing information for the embedding model. Contains input costs per million tokens. If None, returns zero cost.
Returns:
Dictionary with cost breakdown containing:
- 'usd': Total cost in US Dollars as a Decimal with exact precision
Example:
>>> from venice_ai import VeniceClient
>>> from venice_ai.costs import calculate_embedding_cost
>>>
>>> client = VeniceClient(api_key="your-api-key")
>>>
>>> # Create embeddings
>>> response = await client.embeddings.create(
... model="text-embedding-3-small",
... input="Hello, world! This is a sample text."
... )
>>>
>>> # Get current model pricing
>>> model_pricing = await client.get_model_pricing("text-embedding-3-small")
>>>
>>> # Calculate actual costs
>>> costs = calculate_embedding_cost(response, model_pricing)
>>> print(f"Cost: ${costs['usd']:.6f} USD")
>>> print(f"Tokens processed: {response.usage.total_tokens}")
estimate_completion_cost
def estimate_completion_cost(
prompt: str,
estimated_completion_tokens: int,
model_pricing: ModelPricing | None,
tokens_per_word: float = 1.3) -> dict[str, Decimal]
Estimate the cost of a chat completion before making the API request.
This function provides pre-request cost estimation based on prompt analysis and expected completion length. It uses heuristic token counting to estimate input costs and user-provided estimates for output costs, enabling budget planning and cost-aware request optimization.
The estimation is particularly useful for:
- Budget planning and cost control
- Optimizing prompts for cost efficiency
- Batch processing cost estimation
- User-facing cost previews
Estimation Methodology:
- Input Tokens: Estimated from word count using configurable ratio
- Output Tokens: User-provided estimate based on expected response length
- Pricing: Applied using current model pricing structure
- Accuracy: Approximation only - actual costs may vary
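The methodology above can be approximated in a few lines. This sketch assumes whitespace-delimited words and rounds the token estimate up; the real helper may split or round differently. The function names and both rates are hypothetical.

```python
import math
from decimal import Decimal


def estimate_prompt_tokens(prompt: str, tokens_per_word: float = 1.3) -> int:
    """Heuristic: whitespace word count times a words-to-tokens ratio, rounded up."""
    words = len(prompt.split())
    return math.ceil(words * tokens_per_word)


def estimate_cost_sketch(
    prompt: str,
    estimated_completion_tokens: int,
    input_rate: Decimal,    # USD per million input tokens (hypothetical)
    output_rate: Decimal,   # USD per million output tokens (hypothetical)
    tokens_per_word: float = 1.3,
) -> Decimal:
    """Combine the heuristic input estimate with a caller-supplied output estimate."""
    million = Decimal(1_000_000)
    input_tokens = estimate_prompt_tokens(prompt, tokens_per_word)
    input_cost = Decimal(input_tokens) * input_rate / million
    output_cost = Decimal(estimated_completion_tokens) * output_rate / million
    return input_cost + output_cost


prompt = "Write a detailed explanation of quantum computing"
tokens = estimate_prompt_tokens(prompt)  # 7 words at 1.3 tokens/word, rounded up
estimate = estimate_cost_sketch(prompt, 200, Decimal("0.70"), Decimal("2.80"))

print(f"Estimated prompt tokens: {tokens}")
print(f"Estimated cost: ${estimate:.6f}")
```

For short prompts the output estimate dominates the total, so the accuracy of estimated_completion_tokens matters more than the exact words-to-tokens ratio.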
Arguments:
- prompt: The input text to estimate token costs for. This is analyzed for word count and converted to estimated tokens using the tokens_per_word ratio.
- estimated_completion_tokens: Expected number of tokens in the model's response. This should be estimated based on the desired response length and complexity.
- model_pricing: Current pricing information for the target model. Contains input and output costs per million tokens. If None, returns zero cost.
- tokens_per_word: Conversion ratio from words to tokens. Default of 1.3 is optimized for English text. Adjust for other contexts:
  - English text: ~1.3 tokens/word (default)
  - Japanese/Chinese: ~2.0 tokens/word
  - Code/technical: ~1.5-2.0 tokens/word
  - Mixed content: Adjust based on composition
Returns:
Dictionary with estimated cost breakdown containing:
- 'usd': Estimated total cost in US Dollars as a Decimal with exact precision
Accuracy Notes:
- Token estimation is heuristic and may not match exact tokenization
- Actual costs depend on precise tokenizer behavior
- Output token count is user-estimated and may vary significantly
- Different models may have different tokenization patterns
Example:
>>> from venice_ai import VeniceClient
>>> from venice_ai.costs import estimate_completion_cost
>>>
>>> client = VeniceClient(api_key="your-api-key")
>>>
>>> # Get current model pricing
>>> model_pricing = await client.get_model_pricing("llama-3.3-70b")
>>>
>>> # Estimate costs for different scenarios
>>> prompt = "Write a detailed explanation of quantum computing"
>>>
>>> # Short response estimate
>>> short_cost = estimate_completion_cost(
... prompt=prompt,
... estimated_completion_tokens=200,
... model_pricing=model_pricing
... )
>>>
>>> # Long response estimate
>>> long_cost = estimate_completion_cost(
... prompt=prompt,
... estimated_completion_tokens=1000,
... model_pricing=model_pricing
... )
>>>
>>> print(f"Short response: ${short_cost['usd']:.6f} USD")
>>> print(f"Long response: ${long_cost['usd']:.6f} USD")
>>> print(f"Cost difference: ${long_cost['usd'] - short_cost['usd']:.6f} USD")
CostRecord Objects
class CostRecord(BaseModel)
One per-request cost-tracking entry.
CostSummary Objects
class CostSummary(BaseModel)
Aggregate stats produced by :meth:CostTracker.summary.
BudgetRemaining Objects
class BudgetRemaining(BaseModel)
Remaining-budget snapshot returned by :meth:BudgetManager.remaining.
CostTracker Objects
class CostTracker()
Stateful, async-safe accumulator for per-request API costs.
Wraps the existing :func:calculate_completion_cost and
:func:calculate_embedding_cost helpers. Three integration paths:
- Manual: call :meth:track on each response yourself.
- Wired-on-client: pass to VeniceClient(cost_tracker=tracker); the SDK calls :meth:track automatically on every chat / embeddings response.
- From-client factory: :meth:from_client builds a tracker pre-populated with the live pricing map.
All mutating operations take a single :class:asyncio.Lock so concurrent
in-flight requests can update state safely.
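The locking pattern described here can be illustrated with a stripped-down accumulator. MiniTracker is a hypothetical stand-in, not the real CostTracker: it takes a precomputed cost rather than a response object and keeps only a running total and per-model totals.

```python
import asyncio
from decimal import Decimal
from typing import Dict


class MiniTracker:
    """Toy accumulator guarding all mutations with a single asyncio.Lock."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._total = Decimal("0")
        self._by_model: Dict[str, Decimal] = {}

    async def track(self, model: str, cost_usd: Decimal) -> Decimal:
        # Every state update happens under the same lock.
        async with self._lock:
            self._total += cost_usd
            self._by_model[model] = self._by_model.get(model, Decimal("0")) + cost_usd
        return cost_usd

    async def by_model(self) -> Dict[str, Decimal]:
        async with self._lock:
            return dict(self._by_model)  # copy, so callers cannot mutate state

    async def total(self) -> Decimal:
        async with self._lock:
            return self._total


async def main() -> tuple:
    tracker = MiniTracker()
    # Simulate 100 concurrent in-flight requests across two model ids.
    jobs = [
        tracker.track("model-a" if i % 2 == 0 else "model-b", Decimal("0.001"))
        for i in range(100)
    ]
    await asyncio.gather(*jobs)
    return await tracker.total(), await tracker.by_model()


total, per_model = asyncio.run(main())
print(total, per_model)
```

In this toy version the critical section never awaits, so the lock is not strictly required; it becomes important once the guarded code awaits (for example, to fetch pricing) while other requests complete.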
CostTracker.__init__
def __init__(pricing_map: dict[str, ModelPricing] | None = None) -> None
Arguments:
pricing_map: {model_id: LLMModelPricing}. Models absent from the map produce zero-cost records (the underlying helpers gracefully handle missing pricing).
CostTracker.from_client
@classmethod
async def from_client(cls, client: VeniceClient) -> CostTracker
Build a tracker pre-populated with the live chat-pricing map.
CostTracker.track
async def track(response: ChatCompletion | EmbeddingsResponse,
*,
model: str | None = None,
metadata: dict[str, Any] | None = None) -> Decimal
Record one response and return its USD cost.
Arguments:
response: A :class:ChatCompletionResponse or :class:EmbeddingsResponse.
model: Override the model id used to look up pricing. Defaults to response.model.
metadata: Free-form metadata stored on the resulting :class:CostRecord.
Raises:
TypeError: For unsupported response types.
CostTracker.summary
async def summary() -> CostSummary
Aggregate stats across all tracked requests.
CostTracker.by_model
async def by_model() -> dict[str, Decimal]
USD cost grouped by model id.
CostTracker.reset
async def reset() -> None
Clear all tracked state.
BudgetManager Objects
class BudgetManager()
Daily / monthly USD-cap enforcement layered on a :class:CostTracker.
Either daily_usd or monthly_usd may be None to disable that
cap. The tracker is shared, not owned — BudgetManager does not call
:meth:CostTracker.reset; callers manage rollover themselves.
BudgetManager.can_afford
async def can_afford(estimated_cost_usd: Decimal) -> bool
True if adding estimated_cost_usd keeps both caps satisfied.
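The check can be pictured as the pure function below. It is only a sketch: the function name and the spent_today / spent_month parameters are hypothetical, and treating a request that lands exactly on the cap as affordable is an assumption. The real method reads spend from the shared tracker.

```python
from decimal import Decimal
from typing import Optional


def can_afford_sketch(
    estimated_cost_usd: Decimal,
    spent_today: Decimal,
    spent_month: Decimal,
    daily_usd: Optional[Decimal],
    monthly_usd: Optional[Decimal],
) -> bool:
    """True if the estimate fits under every enabled cap; None disables a cap."""
    daily_ok = daily_usd is None or spent_today + estimated_cost_usd <= daily_usd
    monthly_ok = monthly_usd is None or spent_month + estimated_cost_usd <= monthly_usd
    return daily_ok and monthly_ok


estimate = Decimal("0.50")

# Fits both caps.
fits = can_afford_sketch(estimate, Decimal("4.00"), Decimal("40.00"),
                         Decimal("5.00"), Decimal("100.00"))
# Would exceed the daily cap.
over_daily = can_afford_sketch(estimate, Decimal("4.75"), Decimal("40.00"),
                               Decimal("5.00"), Decimal("100.00"))
# Daily cap disabled, monthly cap still enforced.
no_daily = can_afford_sketch(estimate, Decimal("4.75"), Decimal("40.00"),
                             None, Decimal("100.00"))

print(fits, over_daily, no_daily)
```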
BudgetManager.remaining
async def remaining() -> BudgetRemaining
Snapshot of remaining headroom and usage percentages.
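The fields of BudgetRemaining are not listed here, so the sketch below uses a hypothetical RemainingSketch dataclass to show how headroom and a usage percentage might be derived for a single cap. Clamping headroom at zero and reporting a disabled cap as None are both assumptions.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class RemainingSketch:
    """Hypothetical per-cap snapshot; not the real BudgetRemaining schema."""
    remaining_usd: Optional[Decimal]  # None when the cap is disabled
    used_percent: Optional[Decimal]   # None when the cap is disabled


def remaining_for_cap(spent: Decimal, cap: Optional[Decimal]) -> RemainingSketch:
    if cap is None:
        return RemainingSketch(None, None)
    if cap == 0:
        # Avoid division by zero: a zero cap is treated as fully used.
        return RemainingSketch(Decimal("0"), Decimal("100"))
    headroom = max(cap - spent, Decimal("0"))  # never report negative headroom
    return RemainingSketch(headroom, spent / cap * 100)


daily = remaining_for_cap(Decimal("2.50"), Decimal("10"))
overspent = remaining_for_cap(Decimal("12"), Decimal("10"))
disabled = remaining_for_cap(Decimal("2.50"), None)

print(daily)
print(overspent)
print(disabled)
```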