---
title: "Agent Cost Management: Model Routing, Compression, and Token Budgets"
slug: agent-cost-management-routing-compression-2026-ru
date: 2026-02-21
lang: ru
---
# Agent Cost Management: Model Routing, Compression, and Token Budgets
## Key Facts

- litellm 1.82.0 (PyPI, March 2026): unified API for multi-model routing; config format: `{"model_list": [{"model_name": "...", "litellm_params": {...}}]}`
- tiktoken 0.12.0: token counting library; `cl100k_base` encoding for GPT-4, `o200k_base` for GPT-4o; English prose averages ~1.3 tokens per word, code averages 2.5-3.5 tokens per word
- GPT-4o pricing (March 2026): $2.50 input / $10.00 output per 1M tokens
- GPT-4o-mini pricing: $0.15 input / $0.60 output per 1M tokens (~17x cheaper than GPT-4o on both input and output)
- Claude Opus 4.5 pricing: $15.00 input / $75.00 output per 1M tokens
- Claude Haiku 4.5 pricing: $0.80 input / $4.00 output per 1M tokens (~19x cheaper than Opus on both input and output)
- Gemini 1.5 Flash pricing: $0.075 input / $0.30 output per 1M tokens (cheapest major model tier)
- LLMLingua 0.2.2: prompt compression library driven by a small LM; compression ratios 2x-20x depending on aggressiveness; typical production setting: 3x-5x without quality loss
- GPTCache 0.1.44: semantic caching library; backends: in-memory, Redis, Milvus; exact-match cache vs embedding-based similarity cache (cosine similarity threshold typically 0.95+)
- anthropic 0.84.0: official Python SDK; the `usage` object returns `input_tokens` and `output_tokens` for cost tracking
- openai 2.24.0: official Python SDK; the `usage` object returns `prompt_tokens`, `completion_tokens`, `total_tokens`
- langchain-core 1.2.16: base abstractions; `BaseMessage` classes (`HumanMessage`, `AIMessage`, `SystemMessage`, `ToolMessage`) for conversation history
- Token overhead per message: roughly 4-6 tokens for formatting (role markers, delimiters); a 10-step conversation with 200-token messages totals ~2,040 tokens
- Context window growth: agent conversations grow linearly without compression; a 20-step run adding 500 tokens/step holds 10,000 tokens of history by step 20
- Typical agent cost breakdown: LLM API calls (70-85%), tool execution compute (10-20%), vector storage/retrieval (5-10%)
- Cost multiplier from tool calls: an agent making 5 tool calls per task averages 7-12 LLM API calls total (initial call + tool-result processing + continuation)
- Compression trade-offs: 3x compression cuts cost by ~66% but adds 200-500 ms latency per compression step; use a cheaper model (GPT-4o-mini) for summarization to keep compression itself cheap
- Budget enforcement patterns: hard limit (raise an exception), soft limit (downgrade to a cheaper model), warn-only (log and continue); production systems typically use a soft limit with alerting
- Semantic cache hit rates: exact-match caches see 15-25% hit rates in production; embedding-based caches at a 0.95 threshold see 30-50%; a cache hit saves 100% of the LLM cost for that call
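The pricing figures above translate directly into per-call cost arithmetic. A minimal helper, with the per-1M prices hardcoded from the table above (`call_cost` and the dictionary keys are illustrative names, not API identifiers):

```python
# Per-1M-token prices from the key facts above (March 2026 figures).
PRICES = {
    "gpt-4o":           {"in": 2.50,  "out": 10.00},
    "gpt-4o-mini":      {"in": 0.15,  "out": 0.60},
    "claude-opus-4-5":  {"in": 15.00, "out": 75.00},
    "claude-haiku-4-5": {"in": 0.80,  "out": 4.00},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single LLM call at the listed prices."""
    p = PRICES[model]
    return (input_tokens / 1_000_000) * p["in"] + (output_tokens / 1_000_000) * p["out"]

# 30,000 input tokens on GPT-4o (e.g. a 10-step loop at 3,000 tokens/call):
print(round(call_cost("gpt-4o", 30_000, 0), 4))  # 0.075
```

The same helper is enough to sanity-check the "17x cheaper" claims: divide any GPT-4o cost by the corresponding GPT-4o-mini cost.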
## Cost Sources in Agent Systems

LLM agents accumulate cost from many sources. Understanding the breakdown enables targeted optimization.

### Primary cost drivers

LLM API calls make up the bulk of agent cost. Each call is billed for input tokens (prompt + history + tool definitions) and output tokens (the model's response). Multi-step agents compound this: a 10-step reasoning loop with a 3,000-token context per call consumes 30,000 input tokens plus output, costing $0.075 on GPT-4o input alone.

Unbounded context growth is the most common cost mistake. Every agent step appends messages to the history. Without compression or pruning, context grows linearly: step 1 = 2,000 tokens, step 10 = 6,500 tokens, step 20 = 11,500 tokens. The cumulative input cost across all steps therefore grows quadratically.
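The quadratic effect is easy to verify with arithmetic. A sketch, assuming a 2,000-token starting context and 500 tokens added per step (the figures above):

```python
def cumulative_input_tokens(steps: int, base: int = 2000, growth: int = 500) -> int:
    """Total input tokens billed across all steps when history grows linearly."""
    # Step k sends base + (k - 1) * growth tokens of context.
    return sum(base + (k - 1) * growth for k in range(1, steps + 1))

print(cumulative_input_tokens(10))  # 42500
print(cumulative_input_tokens(20))  # 135000 -- ~3.2x the 10-step total for 2x the steps
```

Doubling the step count roughly triples the billed input, which is why compression pays off most in long-running sessions.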
Tool-call overhead adds multiple API calls per step. When an agent invokes a tool, the flow is: (1) the model decides to call the tool, (2) the tool executes, (3) the model processes the result and decides the next action. A single user request with 3 tool calls generates 5-7 LLM API calls.
### Secondary cost drivers

Vector database operations for semantic search and caching add per-query cost. Embedding generation costs $0.02 per 1M tokens (OpenAI text-embedding-3-small). Vector storage and similarity search add negligible cost at small scale but become measurable above 10M vectors.

Tool execution compute depends on tool complexity. Database queries, calls to external APIs, and file processing typically cost under $0.001 per invocation. Expensive tools (video processing, large-dataset analytics) can exceed the LLM cost.

Storage for conversation history is negligible for active sessions (in-memory) but adds cost for long-term retention: 1M messages averaging 500 bytes each is 500 MB, roughly $0.01/month at S3 standard's $0.023/GB.
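The storage arithmetic, as an illustrative helper (`monthly_storage_cost_usd` is a made-up name; $0.023/GB-month is the S3 standard rate cited above):

```python
def monthly_storage_cost_usd(messages: int, avg_bytes: int, usd_per_gb: float = 0.023) -> float:
    """Approximate monthly object-storage cost for retained history."""
    gb = messages * avg_bytes / 1e9  # decimal GB, as cloud providers bill
    return gb * usd_per_gb

print(round(monthly_storage_cost_usd(1_000_000, 500), 4))  # 0.0115
```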
### Cost distribution in practice

A typical production agent serving 10,000 requests/day:

- LLM API calls: 75-85% of total cost
- Embedding generation: 5-10%
- Vector database: 3-7%
- Tool execution: 5-10%
- Storage and logging: <2%

The optimization priority is clear: reduce LLM API cost first (model routing, compression, caching), then optimize expensive tool calls, then storage.
## Decision Framework

### When to use cheap vs expensive models

Use cheap models (GPT-4o-mini, Haiku, Gemini Flash) for:

- Structured data extraction from fixed templates or known schemas
- Classification into predefined categories with clear decision boundaries
- Formatting and validating structured outputs (JSON, XML, SQL)
- Simple summarization of short documents (<2,000 tokens)
- Tool-result processing when the result is unambiguous
- Fact retrieval from provided context without inference

Use medium models (GPT-4o, Claude Sonnet) for:

- Multi-step reasoning over provided context (2-5 steps)
- Code generation following standard patterns
- Comparative analysis with explicit trade-off evaluation
- Planning from high-level goals down to concrete steps
- Ambiguity resolution of moderate complexity

Use expensive models (Claude Opus, o1) for:

- Novel problem solving without established patterns
- Complex multi-domain synthesis requiring cross-cutting knowledge
- Architecture and system design decisions
- Deep reasoning requiring 10+ logical steps
- Creative tasks demanding originality and nuance
### Routing decision tree

```
Input: user request + task description
├─ Does the task match predefined simple patterns? (extract, classify, format, validate)
│   └─ YES → use a cheap model (mini/Haiku)
│
├─ Does the task require code generation or multi-step reasoning?
│   ├─ Simple code (CRUD, API calls) → use a cheap model
│   └─ Complex code or reasoning → continue evaluation
│
├─ Does the task require creativity, novelty, or cross-domain synthesis?
│   └─ YES → use an expensive model (Opus/o1)
│
└─ DEFAULT → use a medium model (GPT-4o/Sonnet)
```
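The tree above can be sketched as a plain function. The keyword lists are illustrative stand-ins; a production router would use the regex-based classifier shown later in this article or an LLM triage step:

```python
def route_model(task: str) -> str:
    """Map a task description to a model tier, following the decision tree above."""
    text = task.lower()
    simple = ("extract", "classify", "format", "validate")
    creative = ("design", "novel", "creative", "synthesize")
    if any(w in text for w in simple):
        return "cheap"      # mini / Haiku
    if any(w in text for w in creative):
        return "expensive"  # Opus / o1
    return "medium"         # GPT-4o / Sonnet

print(route_model("Classify this ticket as bug or feature"))  # cheap
print(route_model("Design a novel caching architecture"))     # expensive
print(route_model("Write a SQL query for top customers"))     # medium
```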
### When to cache vs skip caching
Use semantic caching when:
- Same or similar queries appear frequently (>5% repeat rate)
- Query latency matters more than marginal cost
- Responses have low time-sensitivity (valid for minutes to hours)
Skip caching when:
- Queries are unique (long-tail distribution)
- Responses must be real-time (market data, live metrics)
- Cache infrastructure cost exceeds LLM cost savings
### When to compress vs keep full context
Compress history when:
- Conversation exceeds 4,000-8,000 tokens
- Older messages are contextual background, not active decision factors
- Long-running sessions (>15 steps)
Keep full history when:
- High-stakes decisions where all context matters (legal, medical)
- Short sessions (<5 steps)
- Debugging or audit trails required
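These compress/keep rules reduce to a small predicate. A sketch, with thresholds taken from the parameter table below (`should_compress` is an illustrative name, not a library API):

```python
def should_compress(history_tokens: int, step: int, high_stakes: bool) -> bool:
    """Decide whether to compress history, per the rules above."""
    if high_stakes:   # legal/medical: keep the full context
        return False
    if step < 5:      # short sessions: compression isn't worth the latency
        return False
    return history_tokens > 4000 or step > 15

print(should_compress(6000, step=12, high_stakes=False))  # True
print(should_compress(3000, step=3,  high_stakes=False))  # False
print(should_compress(9000, step=20, high_stakes=True))   # False
```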
## Parameter Reference Table

| Parameter | Recommended Value | Notes |
|---|---|---|
| `max_tokens_before_compression` | 4000 | Trigger history compression above this token count |
| `tokens_to_keep_recent` | 1500 | Keep last N tokens verbatim after compression |
| `summary_model` | gpt-4o-mini | Use a cheap model for summarization to minimize compression cost |
| `summary_max_tokens` | 600 | Target summary length; 3x-5x compression from original |
| `tool_response_max_tokens` | 2000 | Truncate tool results exceeding this limit |
| `run_budget_usd` | 0.50 | Per-run budget ceiling; adjust by use case |
| `budget_warn_threshold` | 0.80 | Warn when this fraction of the budget is consumed |
| `cache_ttl_seconds` | 3600 | Semantic cache entry lifetime (1 hour typical) |
| `cache_similarity_threshold` | 0.95 | Cosine similarity for a cache hit (0.95 = very similar) |
| `recursion_limit` | 20 | Maximum agent steps before forced termination |
| `temperature` | 0.0 | Deterministic routing decisions; 0.3-0.7 for creative tasks |
| `max_tokens` (output) | 2048 | Typical output limit; increase for code generation (4096) |
| `context_window_max` | 128000 | Model context limit (GPT-4o); stay well below for safety |
| `compression_trigger_steps` | 10 | Compress every N steps in long-running agents |
| `tool_call_timeout_seconds` | 30 | Timeout for tool execution to prevent hanging |
| `embedding_batch_size` | 100 | Batch embeddings for caching to reduce API overhead |
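The table maps naturally onto a frozen config object. A sketch with the recommended defaults (the class name is illustrative, not from any library):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentCostConfig:
    """Recommended defaults from the parameter table above."""
    max_tokens_before_compression: int = 4000
    tokens_to_keep_recent: int = 1500
    summary_model: str = "gpt-4o-mini"
    summary_max_tokens: int = 600
    tool_response_max_tokens: int = 2000
    run_budget_usd: float = 0.50
    budget_warn_threshold: float = 0.80
    cache_ttl_seconds: int = 3600
    cache_similarity_threshold: float = 0.95
    recursion_limit: int = 20

cfg = AgentCostConfig()
print(cfg.run_budget_usd * cfg.budget_warn_threshold)  # 0.4 -- warn at $0.40
```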
## Common Mistakes

### Mistake 1: Sending the full conversation history on every call without compression

Symptom: cost scales quadratically with conversation length; a 20-step agent costs 10x more than expected.

Impact: a 20-step conversation with 500 tokens/step averages ~6,000 input tokens per call, so total input ≈ 120,000 tokens = $0.30 on GPT-4o. Compressing to 1,500 tokens after step 10 drops total input to ~60,000 tokens = $0.15 (50% savings).
❌ Wrong:

```python
# Naive approach: append indefinitely
messages = []
for step in range(20):
    messages.append(HumanMessage(content=user_input))
    response = llm.invoke(messages)  # Sends ALL messages every time
    messages.append(AIMessage(content=response.content))
# Step 20: sending 40 messages, ~12,000 tokens
```
✅ Correct:

```python
# Compress history when the threshold is exceeded
messages = []
summary = ""
compressor = HistoryCompressor(max_tokens_before_compression=4000)
for step in range(20):
    messages.append(HumanMessage(content=user_input))
    # Compress if needed
    messages, summary, compressed = compressor.maybe_compress(messages, summary)
    response = llm.invoke(messages)  # Sends compressed history
    messages.append(AIMessage(content=response.content))
# Step 20: sending ~1,500 recent tokens + 500-token summary = ~2,000 tokens
```
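`HistoryCompressor` in the corrected example is assumed rather than defined. A minimal self-contained sketch of the same interface, using plain dict messages and a crude characters-per-token proxy; a real implementation would count tokens with tiktoken and produce the summary with a cheap model, as shown later in this article:

```python
class HistoryCompressor:
    """Fold older messages into a running summary once a size threshold is hit."""

    def __init__(self, max_tokens_before_compression: int = 4000, keep_recent: int = 4):
        self.max_tokens = max_tokens_before_compression
        self.keep_recent = keep_recent

    def _count(self, messages: list) -> int:
        # Crude proxy: ~4 characters per token for English text.
        return sum(len(m["content"]) for m in messages) // 4

    def maybe_compress(self, messages: list, summary: str):
        """Returns (messages, summary, compressed?)."""
        if self._count(messages) <= self.max_tokens:
            return messages, summary, False
        old, recent = messages[:-self.keep_recent], messages[-self.keep_recent:]
        # Stand-in for an LLM summarization call: concatenate truncated old messages.
        summary = (summary + " " + " ".join(m["content"][:80] for m in old)).strip()
        head = [{"role": "system", "content": f"[Summary] {summary}"}]
        return head + recent, summary, True

c = HistoryCompressor(max_tokens_before_compression=50)
msgs = [{"role": "user", "content": "x" * 100} for _ in range(6)]
msgs, summary, did = c.maybe_compress(msgs, "")
print(did, len(msgs))  # True 5 -- summary message + 4 recent
```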
### Mistake 2: Using an expensive model for every step regardless of complexity

Symptom: every agent call costs $0.05-0.10, even for trivial tasks like classification or formatting.

Impact: running GPT-4o ($2.50/1M input) on simple classification that GPT-4o-mini ($0.15/1M input) handles equally well wastes ~17x the cost. At 1,000 simple tasks/day with 1,500 tokens average: GPT-4o = $3.75/day, mini = $0.22/day.
❌ Wrong:

```python
# Always use the expensive model
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Simple classification task
response = llm.invoke([
    HumanMessage(content="Classify this email as spam or not spam: ...")
])
# Cost: $0.0025 for 1,000 input tokens
```
✅ Correct:

```python
# Route by complexity
def classify_and_route(task_description: str, messages: list):
    if is_simple_task(task_description):
        llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
        # Cost: $0.00015 for 1,000 input tokens (17x cheaper)
    elif is_complex_task(task_description):
        llm = ChatOpenAI(model="o1")  # Escalate genuinely hard tasks
    else:
        llm = ChatOpenAI(model="gpt-4o", temperature=0)  # Default: medium tier
    return llm.invoke(messages)

def is_simple_task(description: str) -> bool:
    simple_patterns = ["classify", "extract", "format", "validate"]
    return any(pattern in description.lower() for pattern in simple_patterns)

def is_complex_task(description: str) -> bool:
    complex_patterns = ["design", "novel", "synthesize", "architecture"]
    return any(pattern in description.lower() for pattern in complex_patterns)
```
### Mistake 3: Missing budget enforcement leads to runaway cost

Symptom: the agent enters an infinite loop or complexity explosion; a single run costs $5-50.

Impact: without a budget ceiling, bugs or adversarial inputs can cause catastrophic cost. One production incident: a recursive agent generated 200+ API calls in 15 minutes, costing $87 before manual intervention.
❌ Wrong:

```python
# No budget check
total_cost = 0.0
while not task_complete:
    response = llm.invoke(messages)
    cost = calculate_cost(response.usage)
    total_cost += cost
    # Loop continues indefinitely if the task never completes
```
✅ Correct:

```python
# Hard budget limit with graceful failure
BUDGET_LIMIT = 1.00  # $1 per run
total_cost = 0.0
step_count = 0
while not task_complete:
    if total_cost >= BUDGET_LIMIT:
        raise BudgetExceededError(
            f"Run exceeded budget: ${total_cost:.4f} >= ${BUDGET_LIMIT}"
        )
    response = llm.invoke(messages)
    cost = calculate_cost(response.usage)
    total_cost += cost
    # Also enforce a maximum step count
    step_count += 1
    if step_count >= 20:
        break
```
### Mistake 4: Caching without a TTL serves stale responses

Symptom: the agent returns outdated information hours or days after the cache entry was written; users report incorrect data.

Impact: a semantic cache without expiration serves stale results indefinitely. Example: yesterday's cached stock price served for today's query, causing an incorrect trading decision.
❌ Wrong:

```python
# Cache with no expiration
cache = {}

def cached_llm_call(prompt: str):
    if prompt in cache:
        return cache[prompt]  # Could be days old
    response = llm.invoke(prompt)
    cache[prompt] = response
    return response
```
✅ Correct:

```python
# Cache with TTL
import time

cache = {}
cache_timestamps = {}
CACHE_TTL = 3600  # 1 hour

def cached_llm_call(prompt: str):
    if prompt in cache:
        if time.time() - cache_timestamps[prompt] < CACHE_TTL:
            return cache[prompt]
        # Expired: remove from the cache
        del cache[prompt]
        del cache_timestamps[prompt]
    response = llm.invoke(prompt)
    cache[prompt] = response
    cache_timestamps[prompt] = time.time()
    return response
```
### Mistake 5: Truncating tool responses loses critical information

Symptom: the agent fails tasks because truncation removed necessary data, and repeatedly re-calls tools for the same information.

Impact: aggressive truncation (e.g., keeping the first 100 chars) can drop the key data. Example: a database query returns 50 rows, truncation keeps the first 5, and the agent misses the target data in row 23.
❌ Wrong:

```python
# Naive truncation: first N characters
def truncate_tool_response(response: str, max_length: int = 200) -> str:
    return response[:max_length]  # Loses the tail of the response

# Agent calls a tool
db_results = query_database("SELECT * FROM users WHERE active=true")
# Returns 50 rows, 5,000 chars
truncated = truncate_tool_response(db_results, 200)
# Agent only sees the first 3 rows, missing most results
```
✅ Correct:

```python
import json
from typing import Any

import tiktoken

_encoder = tiktoken.encoding_for_model("gpt-4o")

def token_count(text: str) -> int:
    return len(_encoder.encode(text))

# Smart truncation: preserve structure and metadata
def smart_truncate_tool_response(response: Any, max_tokens: int = 2000) -> str:
    if isinstance(response, list):
        # Show the first few items plus a summary
        preview = response[:5]
        summary = f"Showing 5 of {len(response)} total items"
        return json.dumps({"preview": preview, "summary": summary})
    if isinstance(response, str):
        if token_count(response) <= max_tokens:
            return response
        # Truncate by a character budget (~4 chars/token) and say so
        max_chars = max_tokens * 4
        truncated = response[:max_chars]
        return f"{truncated}\n[Truncated: {len(response)} total chars, showing first {len(truncated)}]"
    return str(response)
```
## Model Routing

Production agents route different steps to different model tiers based on task complexity, cutting cost by 40-70% without quality loss.

### LiteLLM Router configuration

LiteLLM provides a unified API across OpenAI, Anthropic, Google, and other providers. The Router handles fallbacks and load balancing automatically.
```python
from litellm import Router
import os

model_list = [
    {
        "model_name": "cheap",
        "litellm_params": {
            "model": "gpt-4o-mini",
            "api_key": os.environ["OPENAI_API_KEY"],
        },
    },
    {
        "model_name": "medium",
        "litellm_params": {
            "model": "gpt-4o",
            "api_key": os.environ["OPENAI_API_KEY"],
        },
    },
    {
        "model_name": "expensive",
        "litellm_params": {
            "model": "claude-opus-4-5",
            "api_key": os.environ["ANTHROPIC_API_KEY"],
        },
    },
]

router = Router(
    model_list=model_list,
    fallbacks=[
        {"cheap": ["medium"]},
        {"medium": ["expensive"]},
    ],
    num_retries=2,
)

# Route based on the task
def route_completion(prompt: str, complexity: str):
    model_map = {
        "simple": "cheap",
        "medium": "medium",
        "complex": "expensive",
    }
    response = router.completion(
        model=model_map[complexity],
        messages=[{"role": "user", "content": prompt}],
    )
    return response
```
### Complexity-based routing with pattern matching
```python
import re
from dataclasses import dataclass

@dataclass
class TaskClassifier:
    """Classifies task complexity for model routing."""

    SIMPLE_PATTERNS = [
        r"\bextract\b.*\bfrom\b",
        r"\bclassify\b.*\b(as|into)\b",
        r"\bformat\b.*\b(as|to)\b",
        r"\bvalidate\b",
        r"\blist\b.*\b(all|the)\b",
        r"\bcount\b",
    ]

    COMPLEX_PATTERNS = [
        r"\bdesign\b.*\b(system|architecture)\b",
        r"\bnovel\b",
        r"\bcreative\b",
        r"\bsynthesi[sz]e\b.*\bacross\b",
        r"\btrade.?off",
        r"\bcompare\b.*\band\b.*\bcontrast\b",
    ]

    def classify(self, task_description: str) -> str:
        """Returns 'simple', 'medium', or 'complex'."""
        text = task_description.lower()
        # Check complex patterns first: any single match escalates
        if any(re.search(p, text) for p in self.COMPLEX_PATTERNS):
            return "complex"
        # Then simple patterns, restricted to short task descriptions
        simple_matches = sum(
            1 for pattern in self.SIMPLE_PATTERNS
            if re.search(pattern, text)
        )
        if simple_matches >= 1 and len(text.split()) < 50:
            return "simple"
        # Default to medium
        return "medium"

# Usage
classifier = TaskClassifier()

tasks = [
    "Extract the email addresses from this text",
    "Design a distributed caching system with fallback handling",
    "Generate a SQL query to find top customers by revenue",
]

for task in tasks:
    complexity = classifier.classify(task)
    print(f"{complexity}: {task}")
# Output:
# simple: Extract the email addresses from this text
# complex: Design a distributed caching system with fallback handling
# medium: Generate a SQL query to find top customers by revenue
```
### Fallback chain with retry logic
```python
from typing import Callable, List, Dict, Any, Optional

class FallbackRouter:
    """Routes with fallback to more capable models on failure."""

    def __init__(self, model_chain: List[Dict[str, Any]]):
        """
        model_chain: [
            {"name": "cheap", "model": "gpt-4o-mini", "cost_per_1m_in": 0.15},
            {"name": "medium", "model": "gpt-4o", "cost_per_1m_in": 2.50},
        ]
        """
        self.model_chain = model_chain

    def call_with_fallback(
        self,
        messages: List[Dict],
        validation_fn: Optional[Callable[[str], bool]] = None,
    ) -> Dict[str, Any]:
        """
        Try models in order until validation passes.
        Returns: {"response": str, "model_used": str, "cost": float, ...}
        """
        for attempt, model_config in enumerate(self.model_chain):
            try:
                # Uses the LiteLLM `router` configured above
                response = router.completion(
                    model=model_config["model"],
                    messages=messages,
                )
                content = response.choices[0].message.content
                # Validate the response if a validator was provided
                if validation_fn and not validation_fn(content):
                    if attempt < len(self.model_chain) - 1:
                        print(f"Validation failed for {model_config['name']}, trying next model")
                        continue
                    raise ValueError("All models failed validation")
                # Calculate cost
                usage = response.usage
                cost = (
                    (usage.prompt_tokens / 1_000_000) * model_config["cost_per_1m_in"]
                    + (usage.completion_tokens / 1_000_000)
                    * model_config.get("cost_per_1m_out", model_config["cost_per_1m_in"] * 4)
                )
                return {
                    "response": content,
                    "model_used": model_config["name"],
                    "cost": cost,
                    "tokens_in": usage.prompt_tokens,
                    "tokens_out": usage.completion_tokens,
                }
            except Exception as e:
                if attempt < len(self.model_chain) - 1:
                    print(f"Error with {model_config['name']}: {e}, trying next model")
                    continue
                raise

# Usage with validation
import json

def validate_json_response(response: str) -> bool:
    """Ensure the response is valid JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

fallback_router = FallbackRouter([
    {"name": "cheap", "model": "gpt-4o-mini", "cost_per_1m_in": 0.15, "cost_per_1m_out": 0.60},
    {"name": "medium", "model": "gpt-4o", "cost_per_1m_in": 2.50, "cost_per_1m_out": 10.00},
])

result = fallback_router.call_with_fallback(
    messages=[{"role": "user", "content": "Extract user data as JSON: ..."}],
    validation_fn=validate_json_response,
)
print(f"Used {result['model_used']}, cost ${result['cost']:.4f}")
```
## Prompt Compression

Large prompts with repetitive or verbose content can be compressed before being sent to the LLM, cutting input token cost by 50-80%.

### LLMLingua compression

LLMLingua uses a small language model to identify and remove unimportant tokens while preserving semantic meaning.
```python
from llmlingua import PromptCompressor
import tiktoken

# Initialize the compressor (downloads a small LM on first use).
# use_llmlingua2=True selects the LLMLingua-2 token-classification models.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cpu",  # or "cuda" if a GPU is available
)

encoder = tiktoken.encoding_for_model("gpt-4o")

def compress_prompt(
    prompt: str,
    instruction: str = "",
    target_ratio: float = 0.5,
) -> dict:
    """
    Compress a prompt to a target ratio.
    Returns: {"compressed": str, "original_tokens": int, "compressed_tokens": int, "ratio": float, ...}
    """
    original_tokens = len(encoder.encode(prompt))
    compressed_result = compressor.compress_prompt(
        prompt,
        instruction=instruction,
        rate=target_ratio,  # Target rate (0.5 = keep 50% of tokens)
    )
    compressed_prompt = compressed_result["compressed_prompt"]
    compressed_tokens = len(encoder.encode(compressed_prompt))
    return {
        "compressed": compressed_prompt,
        "original_tokens": original_tokens,
        "compressed_tokens": compressed_tokens,
        "ratio": compressed_tokens / original_tokens,
        "savings_tokens": original_tokens - compressed_tokens,
    }

# Example: compress verbose documentation
long_prompt = """
The system architecture consists of multiple interconnected components that work together
to provide a seamless user experience. The frontend application is built using React and
TypeScript, providing a modern and responsive interface. The backend services are implemented
in Python using FastAPI framework, ensuring high performance and scalability. The database
layer uses PostgreSQL for relational data storage and Redis for caching frequently accessed
data. All components communicate via REST APIs with JSON payloads.
"""

result = compress_prompt(
    long_prompt,
    instruction="Summarize the system architecture",
    target_ratio=0.3,
)
print(f"Original: {result['original_tokens']} tokens")
print(f"Compressed: {result['compressed_tokens']} tokens")
print(f"Ratio: {result['ratio']:.2f}")
print(f"Savings: {result['savings_tokens']} tokens")
print(f"\nCompressed text:\n{result['compressed']}")
```
### Context truncation strategies

When compression is too slow or unavailable, truncation provides fast size reduction.
```python
from typing import List
from langchain_core.messages import BaseMessage, SystemMessage, HumanMessage, AIMessage
import tiktoken

class ContextTruncator:
    """Truncate conversation history to fit a token budget."""

    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.encoder = tiktoken.encoding_for_model("gpt-4o")

    def count_tokens(self, messages: List[BaseMessage]) -> int:
        total = 0
        for msg in messages:
            content = msg.content if isinstance(msg.content, str) else str(msg.content)
            total += len(self.encoder.encode(content)) + 4  # ~4 tokens overhead per message
        return total

    def truncate_keep_recent(
        self,
        messages: List[BaseMessage],
        keep_system: bool = True,
    ) -> List[BaseMessage]:
        """Keep the most recent messages that fit in the budget."""
        if self.count_tokens(messages) <= self.max_tokens:
            return messages
        result = []
        token_count = 0
        # Always keep the system message if present
        if keep_system and messages and isinstance(messages[0], SystemMessage):
            system_msg = messages[0]
            result.append(system_msg)
            token_count += len(self.encoder.encode(system_msg.content)) + 4
            messages = messages[1:]
        # Add recent messages, newest first, until the budget is exhausted;
        # insert after any kept system message to preserve chronological order
        insert_pos = len(result)
        for msg in reversed(messages):
            content = msg.content if isinstance(msg.content, str) else str(msg.content)
            msg_tokens = len(self.encoder.encode(content)) + 4
            if token_count + msg_tokens <= self.max_tokens:
                result.insert(insert_pos, msg)
                token_count += msg_tokens
            else:
                break
        return result

    def truncate_sliding_window(
        self,
        messages: List[BaseMessage],
        keep_first_n: int = 2,
        keep_last_n: int = 5,
    ) -> List[BaseMessage]:
        """Keep the first N and last N messages; replace the middle with a marker."""
        if len(messages) <= keep_first_n + keep_last_n:
            return messages
        first_messages = messages[:keep_first_n]
        last_messages = messages[-keep_last_n:]
        middle_count = len(messages) - keep_first_n - keep_last_n
        # Placeholder message standing in for the omitted middle
        summary_msg = SystemMessage(
            content=f"[{middle_count} messages omitted from conversation history]"
        )
        return first_messages + [summary_msg] + last_messages

# Usage
truncator = ContextTruncator(max_tokens=2000)

messages = [
    SystemMessage(content="You are a helpful assistant"),
    HumanMessage(content="What is the capital of France?"),
    AIMessage(content="The capital of France is Paris."),
    # ... many more messages
]

# Strategy 1: keep recent
truncated_recent = truncator.truncate_keep_recent(messages)

# Strategy 2: sliding window
truncated_window = truncator.truncate_sliding_window(messages, keep_first_n=2, keep_last_n=3)
```
### Summarization-based compression

Use a cheap model to summarize older conversation history.
```python
from typing import List
from langchain_openai import ChatOpenAI
from langchain_core.messages import BaseMessage, SystemMessage, HumanMessage
import tiktoken

class SummarizationCompressor:
    """Compress history by summarizing older messages."""

    def __init__(self, summary_model: str = "gpt-4o-mini"):
        self.summary_llm = ChatOpenAI(model=summary_model, temperature=0, max_tokens=500)
        self.encoder = tiktoken.encoding_for_model("gpt-4o")

    def compress_history(
        self,
        messages: List[BaseMessage],
        keep_recent_n: int = 5,
    ) -> List[BaseMessage]:
        """Summarize old messages; keep recent ones verbatim."""
        if len(messages) <= keep_recent_n + 1:
            return messages
        # Split the messages
        system_msg = messages[0] if isinstance(messages[0], SystemMessage) else None
        start_idx = 1 if system_msg else 0
        to_summarize = messages[start_idx:-keep_recent_n]
        to_keep = messages[-keep_recent_n:]
        # Build the summary prompt
        conversation_text = "\n".join(
            f"{type(msg).__name__}: {msg.content[:200]}"
            for msg in to_summarize
        )
        summary_prompt = f"""Summarize this conversation history concisely. Preserve key facts, decisions, and context.

{conversation_text}

Summary (2-3 sentences):"""
        summary_response = self.summary_llm.invoke([HumanMessage(content=summary_prompt)])
        summary_content = f"[Previous conversation summary]\n{summary_response.content}"
        # Build the compressed history
        result = []
        if system_msg:
            result.append(system_msg)
        result.append(SystemMessage(content=summary_content))
        result.extend(to_keep)
        return result

# Usage
compressor = SummarizationCompressor()

# Long conversation
long_messages = [
    SystemMessage(content="You are a helpful assistant"),
    # ... 20 messages of conversation
]

compressed = compressor.compress_history(long_messages, keep_recent_n=5)
print(f"Original: {len(long_messages)} messages")
print(f"Compressed: {len(compressed)} messages (summary + 5 recent)")
```
## Semantic Caching

Semantic caching stores LLM responses and serves them again for similar queries, eliminating redundant API calls.

### GPTCache integration
```python
from typing import List
import os

from gptcache import cache
from gptcache.adapter import openai
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
from gptcache.embedding import OpenAI as CacheOpenAI

# Cache storage: SQLite for scalar data, FAISS for vectors
cache_base = CacheBase("sqlite")
vector_base = VectorBase("faiss", dimension=1536)
data_manager = get_data_manager(cache_base, vector_base)

# Embedding function for semantic similarity
embedding_func = CacheOpenAI()

# Similarity evaluator (vector distance; lower = more similar)
similarity_evaluator = SearchDistanceEvaluation(
    max_distance=0.05,
    positive=False,
)

# Initialize the global cache object
cache.init(
    pre_embedding_func=lambda x: x["messages"][-1]["content"],  # Key on the last message
    embedding_func=embedding_func.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=similarity_evaluator,
)

# Set the OpenAI API key
openai.api_key = os.environ["OPENAI_API_KEY"]

# Use cached completions
def cached_completion(messages: List[dict], model: str = "gpt-4o-mini"):
    """Make an OpenAI call with semantic caching."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
    )
    return response

# Example usage
response1 = cached_completion([
    {"role": "user", "content": "What is the capital of France?"}
])
print("Response 1 (API call):", response1.choices[0].message.content)

# Similar query -- should hit the cache
response2 = cached_completion([
    {"role": "user", "content": "What's the capital city of France?"}
])
print("Response 2 (from cache):", response2.choices[0].message.content)
```
### Exact-match cache with TTL

For exact duplicate queries, a simple dictionary cache with expiration works well.
```python
import time
import hashlib
import json
import os
from typing import Dict, Any, List, Optional

class ExactMatchCache:
    """Simple exact-match cache with TTL."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self.cache: Dict[str, Dict[str, Any]] = {}

    def _hash_key(self, messages: List[dict]) -> str:
        """Generate a cache key from the messages."""
        # Serialize messages to deterministic JSON
        key_str = json.dumps(messages, sort_keys=True)
        return hashlib.sha256(key_str.encode()).hexdigest()

    def get(self, messages: List[dict]) -> Optional[str]:
        """Retrieve from the cache if present and not expired."""
        key = self._hash_key(messages)
        if key in self.cache:
            entry = self.cache[key]
            if time.time() - entry["timestamp"] < self.ttl:
                return entry["response"]
            # Expired: remove
            del self.cache[key]
        return None

    def set(self, messages: List[dict], response: str) -> None:
        """Store in the cache with a timestamp."""
        key = self._hash_key(messages)
        self.cache[key] = {
            "response": response,
            "timestamp": time.time(),
        }

    def clear_expired(self) -> int:
        """Remove expired entries. Returns the count removed."""
        now = time.time()
        expired_keys = [
            key for key, entry in self.cache.items()
            if now - entry["timestamp"] >= self.ttl
        ]
        for key in expired_keys:
            del self.cache[key]
        return len(expired_keys)

# Usage with an LLM
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
cache = ExactMatchCache(ttl_seconds=3600)

def cached_chat_completion(messages: List[dict], model: str = "gpt-4o-mini"):
    # Check the cache first
    cached_response = cache.get(messages)
    if cached_response:
        print("Cache hit!")
        return {"content": cached_response, "from_cache": True}
    # Cache miss: call the API
    print("Cache miss, calling API")
    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    cache.set(messages, content)
    return {"content": content, "from_cache": False}

# Test
messages = [{"role": "user", "content": "Hello"}]
result1 = cached_chat_completion(messages)  # API call
result2 = cached_chat_completion(messages)  # Cache hit
```
## Cost Tracking & Dashboards

Production systems need real-time cost tracking and alerting to prevent budget overruns.

### Per-run cost tracking
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional
@dataclass
class TokenUsage:
input_tokens: int
output_tokens: int
model: str
def cost(self, input_price_per_1m: float, output_price_per_1m: float) -> float:
return (
(self.input_tokens / 1_000_000) * input_price_per_1m +
(self.output_tokens / 1_000_000) * output_price_per_1m
)
@dataclass
class StepRecord:
step_number: int
timestamp: datetime
model: str
usage: TokenUsage
cost_usd: float
duration_ms: float
@dataclass
class RunRecord:
run_id: str
started_at: datetime
task_description: str
steps: List[StepRecord] = field(default_factory=list)
ended_at: Optional[datetime] = None
status: str = "running" # running, completed, failed, budget_exceeded
@property
def total_cost(self) -> float:
return sum(step.cost_usd for step in self.steps)
@property
def total_tokens(self) -> int:
return sum(step.usage.input_tokens + step.usage.output_tokens for step in self.steps)
@property
def duration_seconds(self) -> Optional[float]:
if self.ended_at:
return (self.ended_at - self.started_at).total_seconds()
return None
class CostTracker:
"""Track costs across all agent runs."""
def __init__(self):
self.runs: Dict[str, RunRecord] = {}
def start_run(self, task: str) -> str:
"""Start new run tracking."""
run_id = str(uuid.uuid4())
self.runs[run_id] = RunRecord(
run_id=run_id,
started_at=datetime.utcnow(),
task_description=task,
)
return run_id
def record_step(
self,
run_id: str,
model: str,
input_tokens: int,
output_tokens: int,
cost_usd: float,
duration_ms: float,
) -> None:
"""Record individual step."""
run = self.runs[run_id]
step = StepRecord(
step_number=len(run.steps) + 1,
timestamp=datetime.utcnow(),
model=model,
usage=TokenUsage(input_tokens, output_tokens, model),
cost_usd=cost_usd,
duration_ms=duration_ms,
)
run.steps.append(step)
def end_run(self, run_id: str, status: str = "completed") -> RunRecord:
"""Mark run as complete."""
run = self.runs[run_id]
run.ended_at = datetime.utcnow()
run.status = status
return run
def get_summary(self) -> dict:
"""Get cost summary across all runs."""
completed_runs = [r for r in self.runs.values() if r.status == "completed"]
return {
"total_runs": len(self.runs),
"completed_runs": len(completed_runs),
"total_cost_usd": sum(r.total_cost for r in self.runs.values()),
"avg_cost_per_run": (
sum(r.total_cost for r in completed_runs) / len(completed_runs)
if completed_runs else 0
),
"total_tokens": sum(r.total_tokens for r in self.runs.values()),
}
# Usage
tracker = CostTracker()
run_id = tracker.start_run("Analyze customer feedback")
# Record steps
tracker.record_step(
run_id=run_id,
model="gpt-4o-mini",
input_tokens=1500,
output_tokens=300,
cost_usd=0.000405,  # 1,500 × $0.15/1M input + 300 × $0.60/1M output
duration_ms=850,
)
tracker.end_run(run_id, status="completed")
summary = tracker.get_summary()
print(f"Total cost: ${summary['total_cost_usd']:.4f}")
Budget enforcement middleware
class BudgetEnforcer:
"""Enforce budget limits with configurable policies."""
def __init__(
self,
daily_budget_usd: float = 100.0,
per_run_budget_usd: float = 1.0,
alert_threshold: float = 0.8,
):
self.daily_budget = daily_budget_usd
self.per_run_budget = per_run_budget_usd
self.alert_threshold = alert_threshold
self.daily_spend = 0.0
self.daily_reset_time = datetime.utcnow().replace(hour=0, minute=0, second=0)
def _check_daily_reset(self) -> None:
"""Reset daily counter at midnight UTC."""
now = datetime.utcnow()
if now.date() > self.daily_reset_time.date():
self.daily_spend = 0.0
self.daily_reset_time = now.replace(hour=0, minute=0, second=0)
def check_run_budget(self, run_spend: float) -> None:
"""Raise error if run budget exceeded."""
if run_spend >= self.per_run_budget:
raise BudgetExceededError(
f"Run budget exceeded: ${run_spend:.4f} >= ${self.per_run_budget}"
)
# Alert at threshold
if run_spend >= self.per_run_budget * self.alert_threshold:
print(f"Warning: Run at {run_spend/self.per_run_budget:.0%} of budget")
def check_daily_budget(self, new_cost: float) -> None:
"""Raise error if daily budget would be exceeded."""
self._check_daily_reset()
if self.daily_spend + new_cost > self.daily_budget:
raise BudgetExceededError(
f"Daily budget exceeded: ${self.daily_spend + new_cost:.2f} > ${self.daily_budget}"
)
# Alert at threshold
if self.daily_spend >= self.daily_budget * self.alert_threshold:
print(f"Warning: Daily spend at {self.daily_spend/self.daily_budget:.0%} of budget")
def record_cost(self, cost: float) -> None:
"""Record cost against daily budget."""
self._check_daily_reset()
self.daily_spend += cost
class BudgetExceededError(Exception):
pass
# Usage in an agent loop
enforcer = BudgetEnforcer(daily_budget_usd=100.0, per_run_budget_usd=0.50)
def run_agent_with_budget(task: str) -> float:
    """Sketch: `llm`, `messages`, `calculate_cost` and `is_task_complete`
    are application-specific and assumed to be defined elsewhere."""
    run_cost = 0.0
    for step in range(20):  # hard cap on steps
        # Check budgets before each step
        enforcer.check_run_budget(run_cost)
        enforcer.check_daily_budget(0.10)  # rough estimate of the next step's cost
        # Make LLM call
        response = llm.invoke(messages)
        step_cost = calculate_cost(response.usage)
        run_cost += step_cost
        enforcer.record_cost(step_cost)
        if is_task_complete(response):
            break
    return run_cost
Prometheus metrics export
from prometheus_client import Counter, Histogram, Gauge, generate_latest
# Define metrics
llm_cost_total = Counter(
"llm_cost_usd_total",
"Total LLM API cost in USD",
["model", "status"],
)
llm_tokens_total = Counter(
"llm_tokens_total",
"Total tokens processed",
["model", "direction"], # direction: input or output
)
llm_run_duration = Histogram(
"llm_run_duration_seconds",
"Agent run duration",
["status"],
buckets=[0.5, 1, 2, 5, 10, 30, 60, 120],
)
llm_active_runs = Gauge(
"llm_active_runs",
"Currently active agent runs",
)
# Integration with tracker
class MetricsExporter:
"""Export cost tracking metrics to Prometheus."""
@staticmethod
def record_step(step: StepRecord):
"""Record step metrics."""
llm_cost_total.labels(model=step.model, status="completed").inc(step.cost_usd)
llm_tokens_total.labels(model=step.model, direction="input").inc(step.usage.input_tokens)
llm_tokens_total.labels(model=step.model, direction="output").inc(step.usage.output_tokens)
@staticmethod
def record_run(run: RunRecord):
"""Record run metrics."""
if run.duration_seconds:
llm_run_duration.labels(status=run.status).observe(run.duration_seconds)
@staticmethod
def get_metrics() -> str:
"""Generate Prometheus metrics text."""
return generate_latest().decode("utf-8")
# FastAPI endpoint
from fastapi import FastAPI, Response
app = FastAPI()
@app.get("/metrics")
def metrics():
return Response(
content=MetricsExporter.get_metrics(),
media_type="text/plain; version=0.0.4",
)
Performance & Benchmarks
Note: The following figures are illustrative estimates based on typical production configurations, not measurements from a specific system.
Cost reduction from routing
Routing simple tasks to cheap models provides substantial savings:
- Baseline: All tasks on GPT-4o ($2.50 input / $10.00 output per 1M tokens)
- With routing: Simple tasks (40% of workload) → GPT-4o-mini ($0.15 input / $0.60 output)
For workload of 10,000 tasks/day, average 2,000 input tokens, 500 output tokens per task:
- Baseline cost: 10,000 × ((2,000 / 1M) × $2.50 + (500 / 1M) × $10.00) = $100/day
- With routing: 6,000 × GPT-4o cost + 4,000 × mini cost = $60 + $2.40 = $62.40/day
- Savings: 37.6% cost reduction
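The routing arithmetic above can be reproduced in a few lines; prices and workload figures are the illustrative estimates from this section:

```python
def daily_cost(tasks: int, in_tok: int, out_tok: int,
               in_price: float, out_price: float) -> float:
    """Daily USD cost: per-1M-token prices applied to each task's tokens."""
    per_task = (in_tok / 1e6) * in_price + (out_tok / 1e6) * out_price
    return tasks * per_task

baseline = daily_cost(10_000, 2_000, 500, 2.50, 10.00)   # everything on GPT-4o
routed = (daily_cost(6_000, 2_000, 500, 2.50, 10.00)     # complex tasks stay on GPT-4o
          + daily_cost(4_000, 2_000, 500, 0.15, 0.60))   # simple tasks go to mini
print(f"${baseline:.2f} vs ${routed:.2f}: {1 - routed / baseline:.1%} saved")
# $100.00 vs $62.40: 37.6% saved
```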
Compression latency vs cost trade-off
History compression adds latency but reduces token cost:
- Compression time: 200-500ms per compression event (using GPT-4o-mini for summarization)
- Compression frequency: Every 4,000 tokens (approximately every 5-8 steps)
- Token reduction: 4,000 tokens → 600 tokens (85% reduction in that segment)
For 20-step conversation:
- Without compression: 120,000 cumulative input tokens = $0.30 on GPT-4o
- With compression (2 compressions): 60,000 cumulative input tokens + $0.001 compression cost = $0.15 total
- Latency added: 2 × 400ms = 800ms total (spread across 20 steps)
- Net result: 50% cost reduction, 4% latency increase (800ms over typical 20-second run)
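The compression mechanics can be sketched as a small simulation. The 500-tokens/step growth and 4,000-token threshold are the illustrative parameters from above; exact totals depend on when compressions fire, so the simulated reduction differs somewhat from the rounded figures in the text:

```python
from typing import Optional

def cumulative_input_tokens(steps: int, tokens_per_step: int,
                            compress_at: Optional[int] = None,
                            compressed_to: int = 600) -> int:
    """Total input tokens across a run whose full history is re-sent every step.
    When compress_at is set, the history is summarized down to
    `compressed_to` tokens each time it reaches the threshold."""
    history = total = 0
    for _ in range(steps):
        history += tokens_per_step          # new step appended to history
        if compress_at and history >= compress_at:
            history = compressed_to         # summarize the whole history
        total += history                    # full history sent as input
    return total

plain = cumulative_input_tokens(20, 500)
compressed = cumulative_input_tokens(20, 500, compress_at=4_000)
print(plain, compressed, f"{1 - compressed / plain:.0%} reduction")
# 105000 39800 62% reduction
```

With these parameters the simulation triggers exactly two compression events (at steps 8 and 15), matching the two compressions assumed in the cost estimate above.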
Cache hit rate impact
Semantic caching effectiveness depends on query repetition:
- Low repetition workload (unique queries): 5-10% hit rate, minimal savings
- Medium repetition (customer support, FAQs): 30-50% hit rate, 30-50% cost reduction
- High repetition (classification, validation): 60-80% hit rate, 60-80% cost reduction
For 1,000 requests/day at $0.01 per request:
- No cache: $10/day
- 30% hit rate: $7/day (30% savings)
- 60% hit rate: $4/day (60% savings)
Cache infrastructure cost (Redis with embeddings): $0.20-0.50/day for 1,000 requests, negligible compared to savings.
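The hit-rate arithmetic assumes cache hits cost effectively nothing; a tiny helper (hypothetical, matching the figures above) makes the relationship explicit:

```python
def daily_api_cost(requests: int, cost_per_request: float, hit_rate: float) -> float:
    """Only cache misses pay the API price; hits are served for ~free."""
    return requests * (1 - hit_rate) * cost_per_request

for rate in (0.0, 0.30, 0.60):
    print(f"hit rate {rate:.0%}: ${daily_api_cost(1_000, 0.01, rate):.2f}/day")
# hit rate 0%: $10.00/day
# hit rate 30%: $7.00/day
# hit rate 60%: $4.00/day
```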
Combined optimization impact
Applying all techniques compounds savings:
Starting point: 100,000 agent runs/month, $0.08/run = $8,000/month
Optimizations applied sequentially:
- Model routing (40% simple → mini): $8,000 → $5,600 (30% reduction)
- History compression: $5,600 → $3,920 (30% additional reduction)
- Semantic caching (35% hit rate): $3,920 → $2,548 (35% additional reduction)
- Tool response truncation: $2,548 → $2,293 (10% additional reduction)
Total savings: $8,000 → $2,293 (71% cost reduction)
Infrastructure cost: Compression compute + cache storage = ~$50/month
Net monthly savings: $5,657
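The sequential compounding can be checked directly; the reduction fractions are the estimates listed above:

```python
from functools import reduce

baseline = 8_000.0                       # $/month before optimization
reductions = [0.30, 0.30, 0.35, 0.10]    # routing, compression, caching, truncation
final = reduce(lambda cost, r: cost * (1 - r), reductions, baseline)
print(f"${final:,.0f}/month, {1 - final / baseline:.0%} total reduction")
# $2,293/month, 71% total reduction
```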
Latency comparison
End-to-end latency for typical agent task (3 LLM calls, 2 tool calls):
- Baseline (no optimizations): 4.2 seconds
- With routing (cheap models): 3.8 seconds (10% faster, since cheaper models also respond with lower latency)
- With compression: 4.6 seconds (10% slower due to compression overhead)
- With exact-match cache: 1.2 seconds on cache hit (70% faster)
- With semantic cache: 1.5 seconds on cache hit (65% faster)
Cache hits provide both cost savings and latency improvements.