AGENTIC_AI :: ON_PREMISE_GUIDE

Agentic AI On-Premise

Running LLM agents locally in 2026: architecture patterns, framework selection, security constraints, and the failure modes that don't appear in cloud demos. Multi-step tool use, persistent memory, and autonomous workflows — without sending data to third-party APIs.

> AGENTS_VS_SINGLE_INFERENCE

SINGLE INFERENCE (standard)

→ One prompt → one response
→ Stateless between calls
→ Fixed VRAM use per request
→ Predictable latency
→ Simple to monitor

AGENTIC (multi-step)

→ Multiple LLM calls per task (3–50+)
→ Stateful: memory, tool results, conversation
→ Variable VRAM pressure from context growth
→ Non-deterministic total latency
→ Requires observability pipeline

> HARDWARE_IMPLICATIONS

Agentic workloads stress hardware differently from single-inference

> CONTEXT_WINDOW_PRESSURE

Agent chains accumulate tool outputs, plan steps, and history in the context window. A 10-step agent with 1K-token tool results consumes 10K+ tokens of KV cache — multiplying VRAM usage per active session.

IMPLICATION: Size VRAM for 32K+ context, not just 4K inference.

> CONCURRENCY_LIMITS

Each concurrent agent session holds a long context in memory. On a single RTX 4090 (24GB), you may handle 4–8 concurrent 7B agents or 1–2 concurrent 70B agents. Plan for queue management.

IMPLICATION: Multi-GPU or request queuing needed for production.

> STORAGE_FOR_MEMORY

Persistent agent memory (semantic search, episodic memory, tool caches) requires fast local storage. NVMe Gen 4 + PostgreSQL / SQLite + a vector store (ChromaDB, Qdrant) on-premise.

IMPLICATION: 1TB NVMe minimum for multi-agent persistent memory.

> FRAMEWORK_SELECTION_MATRIX

Frameworks compatible with local LLM endpoints (Ollama, llama.cpp, vLLM)

FRAMEWORK	LOCAL LLM	MULTI-AGENT	PERSISTENCE	OBSERVABILITY	BEST FIT
LangGraph	✓ Full	✓ Native	✓ Checkpoints	✓ LangSmith	Complex stateful workflows, human-in-the-loop, production
AutoGen / AG2	✓ Full	✓ Native	⚡ Partial	⚡ Basic	Conversation-driven multi-agent teams, research/prototyping
CrewAI	✓ Full	✓ Role-based	⚡ Basic	○ Limited	Role-based task crews, content pipelines, easier onboarding
Pydantic AI	✓ Full	⚡ Basic	○ Manual	⚡ Logfire	Type-safe structured outputs, single-agent with tools, minimal dependencies
Haystack 2	✓ Full	⚡ Pipeline	✓ Document stores	✓ OpenTelemetry	RAG-heavy pipelines, document Q&A, enterprise search
Smolagents	✓ Full	○ Limited	○ Manual	○ Basic	Lightweight code-executing agents, HuggingFace ecosystem

> REFERENCE_STACK_2026

Minimal production-grade agentic setup — fully on-premise

> INFERENCE_LAYER

$ Ollama (development) or vLLM (production)

$ Model: Qwen3.6 27B Q4 or Llama 3.3 70B Q4

$ OpenAI-compatible REST API on :11434 / :8000

↑ Framework connects via OPENAI_BASE_URL override

> ORCHESTRATION_LAYER

$ LangGraph (stateful graphs + human-in-loop)

$ Tool registry: Python functions + docstrings

$ State: TypedDict + Pydantic validation

$ Checkpointer: PostgreSQL (production) / SQLite (dev)

> MEMORY_LAYER

$ Short-term: in-graph state (context window)

$ Long-term semantic: ChromaDB / Qdrant (local)

$ Episodic: PostgreSQL JSONB table

$ Embeddings: all-MiniLM-L6-v2 (local sentence-transformers)

> OBSERVABILITY

$ Phoenix (Arize) — local OTEL trace UI

$ Prometheus + Grafana for inference metrics

$ Structured logging: JSON → local ELK or Loki

$ Alert: token budget exceeded, tool timeout, loop detect

> SECURITY_FAILURE_MODES

Agentic AI introduces attack surfaces absent in single-inference deployments

⚠ PROMPT INJECTION VIA TOOLS

Tool outputs (web search, file reads, API calls) may contain adversarial instructions that hijack the agent's plan. Mitigation: validate and sanitize all tool outputs before re-injecting into context.

⚠ UNBOUNDED TOOL EXECUTION

Agents with file system or shell access can cause damage if given a malformed task or if hijacked. Mitigation: sandbox tools (Docker, restricted users, no network), define explicit allowed paths/commands.

⚠ INFINITE LOOP / TOKEN EXHAUSTION

A confused agent may loop indefinitely, exhausting VRAM and blocking other sessions. Mitigation: hard step limit (e.g. max_iterations=25), total token budget, timeout per task, circuit breaker pattern.

⚠ MEMORY POISONING

Long-term memory stores can be poisoned by adversarial documents retrieved and stored by the agent. Mitigation: human review before committing to long-term store, namespace isolation per user/session.

> MINIMUM_VIABLE_GUARDRAILS

# Python pseudo-config — LangGraph agent guardrails

MAX_ITERATIONS = 25          # Hard cap on reasoning steps
MAX_TOKENS_PER_TASK = 50_000 # Total token budget per agent run
TOOL_TIMEOUT_SECONDS = 30    # Per-tool execution timeout
ALLOWED_TOOL_PATHS = [        # File system sandboxing
    "/app/data/readonly/",
    "/tmp/agent_workspace/"
]
HUMAN_IN_LOOP_BEFORE = [      # Require approval before these tools
    "send_email",
    "write_database",
    "execute_shell"
]
TOOL_OUTPUT_MAX_CHARS = 4096 # Truncate tool outputs to prevent context flood
LOG_ALL_TOOL_CALLS = True     # Full audit trail for EU AI Act compliance

RELATED SECTIONS

MoE Deployment → Model Cards → Reference Architectures → EU AI Act →