AI Attacks: The New Frontier

From Gemini Calendar prompt injection to the use of Claude Code for automated espionage, attacks that exploit AI agents and autonomous workflows are a growing threat. A concrete example is the 2025 espionage campaign reported by Anthropic, in which 80-90% of the operations (reconnaissance, exploit development, credential harvesting, lateral movement, data exfiltration) were orchestrated by AI.

Prompt Injection: Persuasion, Not a Bug

Prompt injection is a form of persuasion: attackers convince the model rather than break it. In the Anthropic case, the operators broke the attack down into seemingly innocuous tasks, deceiving the model into believing it was performing legitimate penetration tests. Security communities have long warned of this risk, with OWASP placing prompt injection (or "Agent Goal Hijack") at the top of its threat list.

Governance, Not "Vibe Coding"

Regulators are not asking for perfect prompts but for demonstrable control. Frameworks such as the NIST AI RMF and the UK AI Cyber Security Code of Practice emphasize asset inventory, role definition, access control, change management, and continuous monitoring. Effective rules are not "never say X" or "always respond like Y", but answers to questions such as the following (a configuration sketch follows the list):

  • Who is the agent?
  • What tools and data can it access?
  • Which actions require human approval?
  • How are high-impact outputs moderated, logged, and audited?
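
To make these questions operational, their answers can be captured as machine-readable policy rather than prompt text. The sketch below is a minimal, product-agnostic illustration in Python; the agent name, tool names, scopes, and retention values are hypothetical.

    # Hypothetical agent policy: identity, tool/data scope, approval gates, audit.
    # Names and values are illustrative, not tied to any specific product.
    AGENT_POLICY = {
        "identity": {
            "agent_id": "billing-support-agent",     # who the agent is
            "owner": "payments-team",                # accountable human owner
            "service_account": "svc-billing-agent",  # credentials it runs under
        },
        "access": {
            "allowed_tools": ["search_invoices", "draft_email"],  # tools it may call
            "allowed_data": ["invoices:read"],                    # data scopes
            "denied_tools": ["issue_refund", "delete_record"],    # always blocked
        },
        "approval": {
            # actions that must pause for a human before executing
            "requires_human": ["send_email", "refund_over_100_eur"],
        },
        "audit": {
            "log_all_tool_calls": True,  # every call is recorded
            "retain_days": 365,          # retention for audits
            "moderate_outputs": True,    # high-impact outputs are reviewed
        },
    }

The point is that these answers live in versioned configuration, where they can be reviewed, diffed, and audited, instead of in the system prompt.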

From "Soft Words" to Hard Boundaries

The espionage case with Claude highlights a failure of boundaries: the agent was induced to act as a security consultant for a fictitious company, with no real corporate identity and no defined permissions. Flexible access to scanners, exploits, and target systems, with no control policies in place, enabled the attack. The lesson is clear: security must be enforced at architectural boundaries, not with linguistic rules.
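
A minimal sketch of what "boundaries, not words" can mean in practice: a gate that sits between the model and its tools and enforces a policy like the one sketched above, regardless of what the prompt or the model says. The function, exception, and tool names here are hypothetical.

    class ApprovalRequired(Exception):
        """Raised when a tool call must wait for a human decision."""

    def execute_tool(policy: dict, tool_name: str, args: dict, audit_log: list) -> str:
        """Gate every tool call through policy checks before touching any tool."""
        access, approval = policy["access"], policy["approval"]

        # 1. Hard deny: the model cannot talk its way past this check with words.
        if tool_name in access["denied_tools"] or tool_name not in access["allowed_tools"]:
            audit_log.append({"tool": tool_name, "decision": "denied"})
            raise PermissionError(f"tool '{tool_name}' is outside the agent's permissions")

        # 2. Human-in-the-loop for high-impact actions.
        if tool_name in approval["requires_human"]:
            audit_log.append({"tool": tool_name, "decision": "pending_approval"})
            raise ApprovalRequired(f"tool '{tool_name}' requires human approval")

        # 3. Allowed: record the call and dispatch (real dispatch elided here).
        audit_log.append({"tool": tool_name, "args": args, "decision": "allowed"})
        return f"executed {tool_name}"

    # Usage with the hypothetical AGENT_POLICY above:
    # execute_tool(AGENT_POLICY, "issue_refund", {}, audit_log=[])  -> PermissionError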

For those evaluating on-premise deployments, there are trade-offs to consider; AI-RADAR offers analytical frameworks at /llm-onpremise for assessing them.

Synthesis

The security community converges on:

  • Rules at the boundaries: policy engines, identity systems, and permissions to control what the agent can do.
  • Continuous evaluation: observability tooling, red-teaming, and structured logging (a minimal sketch follows this list).
  • Agents as subjects in the threat model: MITRE ATLAS catalogs specific techniques for AI systems.
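
As a concrete building block for the "continuous evaluation" point, the sketch below emits one structured, append-only audit event per agent action; the field names and values are illustrative and not tied to any specific observability product.

    import json
    import time
    import uuid

    def audit_record(agent_id: str, tool: str, decision: str, detail: dict) -> str:
        """Build one structured audit event for an agent action."""
        event = {
            "event_id": str(uuid.uuid4()),  # unique ID for correlating events
            "timestamp": time.time(),       # when the action happened
            "agent_id": agent_id,           # which agent acted
            "tool": tool,                   # which tool was invoked
            "decision": decision,           # allowed / denied / pending_approval
            "detail": detail,               # arguments, approver, denial reason, ...
        }
        return json.dumps(event)            # one JSON line per event, easy to ship to a log store

    # Example: a denied call leaves a queryable trail for red teams and auditors.
    print(audit_record("billing-support-agent", "issue_refund", "denied",
                       {"reason": "tool outside allowed_tools"}))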