Can AI Be Hacked? Real-World Attacks and Defenses

Explore how attackers compromise AI-powered apps and how to defend them. A must-read for anyone building with AI in 2025.

📅

✍️ Gianluca

🤖 Can AI Be Hacked? Real-World Risks, Tactics & Defenses with Jason Haddix

As LLMs and agent frameworks power more products, the question isn’t “can they be hacked?” but how. Veteran ethical hacker Jason Haddix shows that attacks go far beyond simple prompt injection—touching data access, tool abuse, and cross-system pivoting.

🎯 The Threat Landscape: Beyond Prompt Injection

Many teams fixate on input prompts, but the broader attack surface includes model I/O and the tools agents can invoke. Common risks:

  • Data exfiltration: eliciting secrets via cleverly crafted queries or “role-play”.
  • Tool/function abuse: convincing the agent to call dangerous tools (e.g., webhooks, file I/O).
  • Lateral movement: pivoting across platforms (e.g., Slack → CRM) using the agent’s credentials.

For a broader taxonomy, see the OWASP Top 10 for LLM Applications.

🧪 A Practical AI Pentest Approach

Jason’s structured approach resembles a full application security review—just adapted for LLMs:

  1. Recon: enumerate model versions, system prompts, tools, data sources, guardrails.
  2. Input manipulation: jailbreaks, instruction override, content smuggling.
  3. Agent analysis: tool catalog, auth scopes, output-to-tool routing.
  4. Tool misuse: induce unintended calls (HTTP, filesystem, shell-like tools).
  5. Data abuse: escalate context grants; scrape PII, secrets, embeddings.
  6. Output inspection: poisoned links, hidden markup, SSRF-like egress patterns.

🎮 Train Against Prompt Injection: Gandalf

Try Lakera’s Gandalf to experience iterative jailbreaks. It’s a safe way to build intuition for defensive patterns.

🧠 Odd but Effective Tricks Attackers Use

Real incidents often leverage unexpected formats:

  • Emoji / homoglyph smuggling: bypass naive filters with visually similar characters.
  • Hidden hyperlinks: malicious URLs in markdown/HTML the agent can “follow”.
  • Encoded payloads: base64 / hex to hide instructions inside “data”.
  • Adversarial markup: CSS/HTML that changes meaning post-render.

🚨 Real-World Failures: Where Agents Go Wrong

In enterprise settings, “helpful” agents with broad permissions can leak customer data, create unauthorized tickets, or sync private notes into public systems. The human-like interface + backend keys = high blast radius.

🛡 A Practical Defense Stack

  • Classical web security still applies: validate inputs, sanitize outputs, authenticate, authorize.
  • Policy + output filters: separate system prompts from user content; verify tool arguments and destinations.
  • Sandboxing: run tools in isolated environments; block egress by default, allowlist domains.
  • Least privilege: scoped API keys, per-tool permissions, expirations, and just-in-time grants.
  • Observability: full I/O logging, replay, anomaly detection, rate limits, and circuit breakers.
  • Model Context Protocol (MCP): standardize tool/data access with explicit, reviewable capabilities — see modelcontextprotocol.io.

🧰 Tools & Frameworks Mentioned

  • Pliney’s GitHub — open-source AI testing utilities.
  • Agent frameworks: LangChain, AutoGPT, etc. (audit tool catalogs and auth scopes).
  • OWASP LLM Top 10 — shared vocabulary for risks and mitigations.

📽 Watch the Interview

▶️ The AI Attack Blueprint – Interview with Jason Haddix

📚 Go Deeper

🧑‍💻 Final Thoughts

AI supercharges productivity—and the attack surface. Treat LLMs like powerful, untrusted inputs with tool access. Invest in policies, sandboxes, and observability before incidents, not after.