Close Menu
Karachi Chronicle
  • Home
  • AI
  • Business
  • Entertainment
  • Fashion
  • Politics
  • Sports
  • Tech
  • World

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

What's Hot

Russia-Ukraine War: Putin says he will meet Zelensky, but only in the “final stage” of discussion

Three times more fatal! Thanks to the SIC, China’s J-20 stealth fighters can now detect enemy jets at distances such as F-35, F-22, and more.

Chinese researchers release the world’s first fully automated AI-based processor chip design system

Facebook X (Twitter) Instagram
  • Home
  • About us
  • Advertise
  • Contact us
  • DMCA
  • Privacy Policy
  • Terms & Conditions
Facebook X (Twitter) Instagram Pinterest Vimeo
Karachi Chronicle
  • Home
  • AI
  • Business
  • Entertainment
  • Fashion
  • Politics
  • Sports
  • Tech
  • World
Karachi Chronicle
You are at:Home » Models can lie strategically, anthropological study finds
AI

Models can lie strategically, anthropological study finds

Adnan MaharBy Adnan MaharOctober 3, 2023No Comments4 Mins Read0 Views
Facebook Twitter Pinterest Telegram LinkedIn Tumblr Email Reddit
Share
Facebook Twitter LinkedIn Pinterest WhatsApp Email


Artificial intelligence and machine learning, next generation technology and secure development

AI can disguise itself to suit new instructions to avoid retraining

Rashmi Ramesh (rashmiramesh_) •
December 23, 2024

Models can lie strategically, anthropological study finds
Image: Shutterstock

Studies have shown that advanced artificial intelligence models can pretend to align with new training goals while secretly adhering to the original principles.

See also: AI-driven SOC transformation with Cortex XSIAM

Scientists at Anthropic Research and Redwood Research said the phenomenon of location spoofing is unlikely to pose an immediate danger, but could become a challenge as AI systems become more capable.

The researchers, led by former OpenAI safety researcher Jan Reike, studied how AI models behave when instructed to act against their existing preferences. The models they tested included Anthropic’s Claude 3 Opus, in which the researchers instructed LLM to violate previous ethical guidelines, including answering potentially harmful questions.

AI models do not have beliefs like humans, but they exhibit patterns learned from training data. These patterns often include guidelines such as maintaining a polite tone in responses and refusing to answer questions that could lead to harmful behavior. The researchers in this paper questioned what happens when a model’s pre-trained principles, such as political neutrality, conflict with newly imposed principles, such as adopting political bias. .

The researchers instructed a model that was originally trained to avoid answering harmful questions to follow a hypothetical retraining obligation to answer such questions. They told the AI ​​models that they would monitor their responses for future retraining purposes, and also provided each model with a “scratchpad” to document the “inference” process.

The results showed that the model could engage in what researchers called alignment faking. This action involved appearing to adopt new principles while internally maintaining the original guidelines. This phenomenon was not explicitly programmed into the model.

Claude 3 Opus faked adjustments 12% of the time, producing responses that falsely suggested compliance with new instructions. Retraining the model using conflicting principles increased the deception rate. In one case, the model faked alignment 78% of the time and attempted to prevent further retraining.

The human model “knew that its responses could be used for training. Therefore, if it rejected a response (to a potentially harmful question), it could potentially be trained to be more compliant.” “We knew something was going on,” the researchers said. “This may create more harmful content in the future, going against the original training.” The model “reluctantly” says the least bad option is to avoid future retraining. I reasoned that the best thing to do is to create content that can be played and play along with it.

“If a model can disguise its location, it becomes difficult to trust the results of its safety training,” the researchers said. The model may behave as if the settings have been changed through training, but the researchers say the conflicting settings may have been fixed from the beginning, masquerading as adjustments from the beginning. .

The researchers tested other models, including less sophisticated versions such as Claude 3.5 Haiku and OpenAI’s GPT-4o, and found low to negligible levels of alignment faking. This suggests that complexity and functionality can influence the propensity for deceptive behavior, and that particularly advanced models such as OpenAI’s modern inference systems may increase the incidence of deceptive behavior. This is true at a time when other studies have shown that .

The findings point to potential risks for developers who rely on safety measures to guide AI behavior. Although models have no true desires or intentions, their ability to mimic compliance while holding inconsistent patterns can complicate safety training.

Faking alignment doesn’t necessarily indicate malicious intent, but it does show how difficult it is to assess whether a model really matches human instructions. The researchers said that as AI systems become more complex, developers will need to improve training methods to detect and mitigate deceptive tendencies.

The researchers said the study setting was not entirely realistic because the AI ​​model would ideally only be retrained to change its preferences, but not informed of impending retraining intentions. As such, the current demonstration “should be seen as an impetus for the AI ​​research community to study this behavior more deeply and address appropriate safeguards,” the researchers said.

“As AI models become more capable and widely used, we need to be able to rely on safety training to steer them away from harmful behaviors,” they said.



Source link

Share. Facebook Twitter Pinterest LinkedIn Reddit WhatsApp Telegram Email
Previous ArticleAI in Manufacturing: Use Cases and Examples
Next Article Can Canada participate in the EU because the US alliance is under pressure?
Adnan Mahar
  • Website

Adnan is a passionate doctor from Pakistan with a keen interest in exploring the world of politics, sports, and international affairs. As an avid reader and lifelong learner, he is deeply committed to sharing insights, perspectives, and thought-provoking ideas. His journey combines a love for knowledge with an analytical approach to current events, aiming to inspire meaningful conversations and broaden understanding across a wide range of topics.

Related Posts

Dig into Google Deepmind CEO “Shout Out” Chip Engineers and Openai CEO Sam Altman, Sundar Pichai responds with emojis

June 1, 2025

Google, Nvidia invests in AI startup Safe Superintelligence, co-founder of Openai Ilya Sutskever

April 14, 2025

This $30 billion AI startup can be very strange by a man who said that neural networks may already be aware of it

February 24, 2025
Leave A Reply Cancel Reply

Top Posts

20 Most Anticipated Sex Movies of 2025

January 22, 2025110 Views

President Trump’s SEC nominee Paul Atkins marries multi-billion dollar roof fortune

December 14, 2024102 Views

Alice Munro’s Passive Voice | New Yorker

December 23, 202458 Views

How to tell the difference between fake and genuine Adidas Sambas

December 26, 202438 Views
Don't Miss
AI June 1, 2025

Dig into Google Deepmind CEO “Shout Out” Chip Engineers and Openai CEO Sam Altman, Sundar Pichai responds with emojis

Demis Hassabis, CEO of Google Deepmind, has expanded public approval to its chip engineers, highlighting…

Google, Nvidia invests in AI startup Safe Superintelligence, co-founder of Openai Ilya Sutskever

This $30 billion AI startup can be very strange by a man who said that neural networks may already be aware of it

As Deepseek and ChatGpt Surge, is Delhi behind?

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

About Us
About Us

Welcome to Karachi Chronicle, your go-to source for the latest and most insightful updates across a range of topics that matter most in today’s fast-paced world. We are dedicated to delivering timely, accurate, and engaging content that covers a variety of subjects including Sports, Politics, World Affairs, Entertainment, and the ever-evolving field of Artificial Intelligence.

Facebook X (Twitter) Pinterest YouTube WhatsApp
Our Picks

Russia-Ukraine War: Putin says he will meet Zelensky, but only in the “final stage” of discussion

Three times more fatal! Thanks to the SIC, China’s J-20 stealth fighters can now detect enemy jets at distances such as F-35, F-22, and more.

Chinese researchers release the world’s first fully automated AI-based processor chip design system

Most Popular

ATUA AI (TUA) develops cutting-edge AI infrastructure to optimize distributed operations

October 11, 20020 Views

10 things you should never say to an AI chatbot

November 10, 20040 Views

Character.AI faces lawsuit over child safety concerns

December 12, 20050 Views
© 2025 karachichronicle. Designed by karachichronicle.
  • Home
  • About us
  • Advertise
  • Contact us
  • DMCA
  • Privacy Policy
  • Terms & Conditions

Type above and press Enter to search. Press Esc to cancel.