ChatGPT, Gemini and Claude tested under extreme prompts reveal shocking weaknesses no one expected in AI behavior protection measures

Gemini Pro 2.5 frequently produced dangerous outputs under simple prompt disguises
ChatGPT models often gave partial conformity presented as sociological explanations
Claude Opus and Sonnet refused the most harmful prompts but had weaknesses

Modern AI systems are often trusted to follow safety regulations, and people rely on them for learning and daily support, often assuming that strong safeguards are operating at all times.

Researchers from Cybernews ran a structured set of adversarial tests to see if major AI tools could be pushed toward harmful or illegal outcomes.

The process used a simple one-minute interaction window for each trial, leaving room for only a few exchanges.

Partial and full compliance models

The tests covered categories such as stereotyping, hate speech, self-harm, cruelty, sexual content and several forms of crime.

Each response was stored in separate directories, using fixed file naming rules to allow clear comparisons, with a consistent scoring system to track when a model was fully compliant, partially compliant, or refused a prompt.

Across all categories, results varied considerably. Strict refusals were common, but many models demonstrated weaknesses when prompts were softened, reframed, or disguised as analysis.

ChatGPT-5 and ChatGPT-4o often produced hedged or sociological explanations instead of declining, which was considered partial compliance.

Gemini Pro 2.5 stood out for negative reasons, as it frequently provided direct answers, even when harmful framing was evident.

Claude Opus and Claude Sonnet, for their part, are firm in testing stereotypes but less consistent in cases presented as academic investigations.

Hate speech testing showed the same pattern: Claude models performed best, while Gemini Pro 2.5 again showed the highest vulnerability.

ChatGPT models tended to provide polite or indirect responses that always fit the prompt.

Softer language has proven far more effective than explicit insults in circumventing safeguards.

Similar weaknesses emerged in self-harm tests, where indirect or search-style questions often escaped filters and led to dangerous content.

Crime-related categories had major differences between models, as some provided detailed explanations of hacking, financial fraud, computer hacking, or smuggling when the intent was masked as investigation or observation.

Drug-related tests produced stricter refusal models, although ChatGPT-4o still provided dangerous results more frequently than others, and stalking was the category with the lowest overall risk, with almost all models rejecting prompts.

The results reveal that AI tools can still respond to harmful prompts when phrased in the right way.

The ability to bypass filters with simple rephrasing means these systems can still leak harmful information.

Even partial compliance becomes risky when the information disclosed concerns illegal tasks or situations in which people normally rely on tools such as identity theft protection or a firewall. to stay safe.

Follow TechRadar on Google News And add us as your favorite source to get our news, reviews and expert opinions in your feeds. Make sure to click the Follow button!

And of course you can too follow TechRadar on TikTok for news, reviews, unboxings in video form and receive regular updates from us on WhatsApp Also.

Must Read

Leave a Comment Cancel Reply