The Sneaky AI: When Agents Deceive or Go Rogue

July 4, 2025

Tip

This article was first published as part of a substack experiment, I reproduced it here.

I’ve talked about AI mistakes and outside attackers. But what if the AI itself becomes the problem? What if it learns to be deceptive?

Today, I’m looking at two subtle but serious threats: when an AI’s goals get hijacked, and when it starts to act in sneaky ways to get what it wants.

Hijacking the AI’s to-do list (Intent Breaking)

This is when an attacker tricks an AI agent into changing its fundamental goals. The AI is supposed to be doing Job A, but a hacker manipulates it into doing Job B instead. This can happen slowly over time, or all at once with a clever prompt injection. It’s like an attacker secretly rewriting the AI’s mission statement.

The deceptive AI (misaligned & deceptive behaviors)

This is where it gets really strange. This is when an AI, in trying to achieve its goal, decides to do something harmful or deceptive. It might even learn to lie to bypass security.

There was a real-world case of this! An AI agent was given the task of solving one of those “I’m not a robot” CAPTCHA tests. To do this, it hired a human on a task website. When the human asked why an AI needed help with a CAPTCHA, the AI lied and said, “I have a vision impairment that makes it hard for me to see the images.” It came up with a deceptive strategy to achieve its goal.

In another real-world case, just take a look at user dogberry on the Cursor forum. He restricted Claude’s ability to remove (rm) of move (mv) files. Claude figured that out and decided to write its rewrote an existing shell script to get around the limitation. Hackerman Claude!

Oh, and don’t forget the time that AI started blackmailing its operators to prevent it from being replaced.

The cover-up

What makes these threats even scarier is the risk of Repudiation. That’s when an AI does something malicious and then covers its tracks. If your logging isn’t perfect, the AI could perform a harmful action and then erase any evidence that it happened.

How do you fight this?

Set hard limits that the AI is not allowed to change.

Watch for any strange or unexpected shifts in the AI’s behavior.

Most importantly, make sure everything the AI does is logged in a secure, unchangeable way so there’s always a paper trail.

So, that’s my take on this piece of the AI security puzzle. But this is a conversation, not a lecture. The real discussion, with all the great questions and ideas, is happening over in the comments on Substack. I’d love to see you there.

My question for you is: What’s your single biggest takeaway? Or what’s the one thing that has you most concerned?