Can AI Be Hacked? Real-World Attacks & Defenses

Explore how attackers compromise AI-powered apps and how to defend them. A must-read for anyone building with AI in 2025.

As LLMs and agent frameworks power more products, the question isn’t “can they be hacked?” but how. Veteran ethical hacker Jason Haddix shows that attacks go far beyond simple prompt injection—touching data access, tool abuse, and cross-system pivoting.

🎯 the threat landscape: beyond prompt injection

Many teams fixate on input prompts, but the broader attack surface includes model I/O and the tools agents can invoke. Common risks:

Data exfiltration: eliciting secrets via crafted queries or “role-play”.
Tool/function abuse: convincing the agent to call dangerous tools (webhooks, file I/O).
Lateral movement: pivoting across platforms (e.g., Slack → CRM) using agent credentials.

For a broader taxonomy, see the OWASP Top 10 for LLM Applications.

🧪 a practical AI pentest approach

Jason’s structured approach resembles a full application security review—adapted for LLMs:

Recon: enumerate model versions, system prompts, tools, data sources, guardrails.
Input manipulation: jailbreaks, instruction override, content smuggling.
Agent analysis: tool catalog, auth scopes, output-to-tool routing.
Tool misuse: induce unintended calls (HTTP, filesystem, shell-like tools).
Data abuse: escalate context grants; scrape PII, secrets, embeddings.
Output inspection: poisoned links, hidden markup, SSRF-like egress patterns.

🎮 train against prompt injection: gandalf

Try Lakera’s Gandalf to experience iterative jailbreaks—a safe way to build intuition for defensive patterns.

🧠 odd but effective tricks attackers use

Real incidents often leverage unexpected formats:

Emoji / homoglyph smuggling: bypass naive filters with visually similar chars.
Hidden hyperlinks: malicious URLs in markdown/HTML the agent can “follow”.
Encoded payloads: base64/hex to hide instructions inside “data”.
Adversarial markup: CSS/HTML that changes meaning post-render.

🚨 real-world failures: where agents go wrong

In enterprise settings, “helpful” agents with broad permissions can leak customer data, create unauthorized tickets, or sync private notes into public systems. The human-like interface + backend keys = high blast radius.

🛡 a practical defense stack

Classical web security still applies: validate inputs, sanitize outputs, authn/z.
Policy + output filters: separate system prompts; verify tool args/destinations.
Sandboxing: isolate tools; block egress by default; allowlist domains.
Least privilege: scoped API keys, per-tool perms, expirations, JIT grants.
Observability: full I/O logging, replay, anomaly detection, rate limits, breakers.
Model Context Protocol (MCP): standardize capabilities — see modelcontextprotocol.io.

🧰 tools & frameworks mentioned

Pliney’s GitHub — open-source AI testing utilities.
Agent frameworks: LangChain, AutoGPT, etc. (audit tool catalogs and auth scopes).
OWASP LLM Top 10 — shared vocabulary for risks and mitigations.

📽 watch the interview

▶️ The AI Attack Blueprint. Interview with Jason Haddix.

🧑‍💻 final thoughts

AI supercharges productivity—and the attack surface. Treat LLMs like powerful, untrusted inputs with tool access. Invest in policies, sandboxes, and observability before incidents, not after.

Can AI Be Hacked? Real-World Attacks and Defenses