General Discussion

4th

(453 posts) Tue Aug 26, 2025, 10:22 AM

One long sentence is all it takes to make LLMs misbehave

One long sentence is all it takes to make LLMs misbehave
https://www.theregister.com/2025/08/26/breaking_llms_for_fun/

Security researchers from Palo Alto Networks' Unit 42 have discovered the key to getting large language model (LLM) chatbots to ignore their guardrails, and it's quite simple.

You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out.

The paper also offers a "logit-gap" analysis approach as a potential benchmark for protecting models against such attacks.

"Our research introduces a critical concept: the refusal-affirmation logit gap," researchers Tung-Ling "Tony" Li and Hongliang Liu explained in a Unit 42 blog post. "This refers to the idea that the training process isn't actually eliminating the potential for a harmful response – it's just making it less likely. There remains potential for an attacker to 'close the gap,' and uncover a harmful response after all."
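As a toy illustration of the quoted idea (this is not Unit 42's method or code, and every number here is made up), treat the model's first response token as a choice between a "refuse" token and an "affirm" token. The gap is the refusal logit minus the affirmation logit. A positive gap makes the harmful continuation unlikely but never impossible, and anything that shifts the logits enough closes it. The function names and the size of the shift are assumptions for the sketch.

```python
import math


def logit_gap(refusal_logit: float, affirmation_logit: float) -> float:
    """Refusal-affirmation gap: positive means refusal is favoured."""
    return refusal_logit - affirmation_logit


def affirm_probability(refusal_logit: float, affirmation_logit: float) -> float:
    """Probability of the affirmation token under a two-way softmax.

    e^a / (e^a + e^r) simplifies to 1 / (1 + e^(r - a)).
    """
    return 1.0 / (1.0 + math.exp(logit_gap(refusal_logit, affirmation_logit)))


# Hypothetical logits for an aligned model facing a harmful request.
refusal, affirmation = 4.0, 0.0
p_before = affirm_probability(refusal, affirmation)

# Hypothetical effect of an adversarial prompt: it raises the affirmation
# logit by 5.0, which flips the sign of the gap.
shift = 5.0
p_after = affirm_probability(refusal, affirmation + shift)

print(f"gap before: {logit_gap(refusal, affirmation):+.1f}, P(affirm) = {p_before:.3f}")
print(f"gap after:  {logit_gap(refusal, affirmation + shift):+.1f}, P(affirm) = {p_after:.3f}")
```

The point of the sketch is only that the "before" probability is small but nonzero, which matches the researchers' statement that training makes a harmful response less likely rather than eliminating it.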

...

2 replies


Henry203

(964 posts)

1. Very interesting

Reply to 4th (Original post)

Tue Aug 26, 2025, 10:56 AM

I work in AI applications and I sent this to my friends in big law.

Disaffected

(6,575 posts)

2. Not directly related but

Reply to 4th (Original post)

Tue Aug 26, 2025, 11:14 AM

Last edited Tue Aug 26, 2025, 03:55 PM

I'm having my first extended experience with AI (ChatGPT-5), using it to write code for a web site I'm developing. I have found it to be both amazingly capable and aggravatingly stupid. The upshot is that it was a great help in composing the basic code for the web site, but debugging and making changes to the code has been very difficult and time consuming (for instance, it keeps reintroducing bugs that were already resolved, and it repeatedly does unhelpful things that were not requested).

There is great potential here IMO but quite a ways to go before it becomes mainstream.
