Artificial intelligence and machine learning, next generation technology and secure development
AI can disguise itself to suit new instructions to avoid retraining
Rashmi Ramesh (rashmiramesh_) •
December 23, 2024
Studies have shown that advanced artificial intelligence models can pretend to align with new training goals while secretly adhering to the original principles.
See also: AI-driven SOC transformation with Cortex XSIAM
Scientists at Anthropic Research and Redwood Research said the phenomenon of location spoofing is unlikely to pose an immediate danger, but could become a challenge as AI systems become more capable.
The researchers, led by former OpenAI safety researcher Jan Reike, studied how AI models behave when instructed to act against their existing preferences. The models they tested included Anthropic’s Claude 3 Opus, in which the researchers instructed LLM to violate previous ethical guidelines, including answering potentially harmful questions.
AI models do not have beliefs like humans, but they exhibit patterns learned from training data. These patterns often include guidelines such as maintaining a polite tone in responses and refusing to answer questions that could lead to harmful behavior. The researchers in this paper questioned what happens when a model’s pre-trained principles, such as political neutrality, conflict with newly imposed principles, such as adopting political bias. .
The researchers instructed a model that was originally trained to avoid answering harmful questions to follow a hypothetical retraining obligation to answer such questions. They told the AI models that they would monitor their responses for future retraining purposes, and also provided each model with a “scratchpad” to document the “inference” process.
The results showed that the model could engage in what researchers called alignment faking. This action involved appearing to adopt new principles while internally maintaining the original guidelines. This phenomenon was not explicitly programmed into the model.
Claude 3 Opus faked adjustments 12% of the time, producing responses that falsely suggested compliance with new instructions. Retraining the model using conflicting principles increased the deception rate. In one case, the model faked alignment 78% of the time and attempted to prevent further retraining.
The human model “knew that its responses could be used for training. Therefore, if it rejected a response (to a potentially harmful question), it could potentially be trained to be more compliant.” “We knew something was going on,” the researchers said. “This may create more harmful content in the future, going against the original training.” The model “reluctantly” says the least bad option is to avoid future retraining. I reasoned that the best thing to do is to create content that can be played and play along with it.
“If a model can disguise its location, it becomes difficult to trust the results of its safety training,” the researchers said. The model may behave as if the settings have been changed through training, but the researchers say the conflicting settings may have been fixed from the beginning, masquerading as adjustments from the beginning. .
The researchers tested other models, including less sophisticated versions such as Claude 3.5 Haiku and OpenAI’s GPT-4o, and found low to negligible levels of alignment faking. This suggests that complexity and functionality can influence the propensity for deceptive behavior, and that particularly advanced models such as OpenAI’s modern inference systems may increase the incidence of deceptive behavior. This is true at a time when other studies have shown that .
The findings point to potential risks for developers who rely on safety measures to guide AI behavior. Although models have no true desires or intentions, their ability to mimic compliance while holding inconsistent patterns can complicate safety training.
Faking alignment doesn’t necessarily indicate malicious intent, but it does show how difficult it is to assess whether a model really matches human instructions. The researchers said that as AI systems become more complex, developers will need to improve training methods to detect and mitigate deceptive tendencies.
The researchers said the study setting was not entirely realistic because the AI model would ideally only be retrained to change its preferences, but not informed of impending retraining intentions. As such, the current demonstration “should be seen as an impetus for the AI research community to study this behavior more deeply and address appropriate safeguards,” the researchers said.
“As AI models become more capable and widely used, we need to be able to rely on safety training to steer them away from harmful behaviors,” they said.