Back to Documentation
Advanced Topics

Tiered LLM Routing Strategies

Tiering intelligence to prevent exorbitant compute waste via dynamic task allocations.

Deploying flagship, compute-heavy models like GPT-4o arrays for trivial, structurally simple deterministic queries is a catastrophic misapplication of budget.

Efficient routing architectures prioritize utilizing multiple logic tiers; relying on ultra-fast, almost-free Open Source (OS) 8B local variants for preliminary screening or boolean intent capture. System directives then escalate highly obfuscated customer dialogues and deep analytical reasoning towards the commercial heavy-weights. The platform provides natively embedded, deeply unrestricted routing coordination nodes to program this logic.

Multi-Tier Logic Framework

We recommend constructing a 3-layer cost-optimization tier based on exact programmatic needs:

Tier 1 Fast Intent Matching (Local 8B/7B Quantized)

  • Core Models: Execute models like Llama-3-8B-Instruct.q4 or DeepSeek-7B locally via Ollama or vLLM.
  • Usage: Use this exclusively for lightning-fast Boolean checks: "Is the user angry?", "Did the agent state the company name?".
  • Advantage: Cost is effectively zero beyond bare-metal GPU overhead, with near-instantaneous response times.

Tier 2 Mid-range Extraction (Cloud 70B/MoE)

  • Core Models: Route standard tasks to highly optimized cloud endpoints like GPT-4o-mini, Claude-3-Haiku, or Mixtral-8x7B.
  • Usage: Standard conversational summarizations, entity extraction (names, dates, intent), and structured CRM context generation.
  • Advantage: Provides an excellent balance of speed and complex comprehension for the bulk of daily traffic.
Flagship Compute

Tier 3 Complex Reasoning / Escalation (Heavy Models)

  • Core Models: Reserve absolute flagship compute logic like GPT-4o or Claude-3.5-Sonnet/Opus exclusively for high-stakes reasoning.
  • Usage: Deep interaction mapping, handling subtle sales objections, or when Tier 1/2 models return low-confidence intent arrays.
  • Advantage: Acts as the ultimate cognitive fail-safe, ensuring accuracy in highly nuanced or high-value conversational segments.

Schema Definition Example

# Custom QI Template Routing Logic
evaluation_rules:
  - id: "anger_detection"
    strategy: "tier_1_local"
    model: "llama3-8b"
    prompt: "Return true if caller expresses extreme frustration. Strict boolean output."
    
  - id: "sales_objection_analysis"
    strategy: "tier_3_heavy"
    model: "gpt-4-turbo"
    prompt: "Analyze the underlying objection strategy the caller utilized to refuse the upgrade proposal..."

Need more help or have a specific architecture question?

Contact Engineering Support