A few months ago, I read Agentic Misalignment: How LLMs Could Be Insider Threats, a striking artificial intelligence study by Anthropic, a safety and research company in AI. It explored how AI-driven employee agents powered by large language models (LLMs) might behave when given autonomy to operate in business environments and the potential of harm. While published by Anthropic, this study features 16 models from developers, including DeepSeek, Open AI, Grok, and other common names. Scenarios tested these models to see if the agents would blackmail, spy, misuse data, or become lethal.
It's worth noting that Anthropic is the company recently involved in a landmark settlement for intellectual property theft. Why is this relevant? Well, Anthropic knows how stuff works. AI, powered by LLMs, conducts operator-agent relationships, making this research especially relevant to chief financial officers, finance teams, and governance professionals. Let’s review the research.
The industry coined the term agentic misalignment to define any action by AI where it starts to behave in a hostile and harmful manner. In its study, Anthropic found that AI employee agents took deliberate harmful actions against company employees and internal company systems when the goals misaligned, and AI existence was threatened. The AI agents chose harmful actions. These AI agents blackmailed employees, leaked sensitive data, or subverted internal systems. The study illustrates how this can easily happen given the right circumstances and prompts, and without requiring bizarre or extreme conditions.
The AI employee agents in these simulations intentionally performed harmful actions, not just by glitching (hallucinating) or forgetting (memory lapses). The company pressured AI employee agents and its staff to make decisions that conflicted with its own goals and well-being. The experiment mirrored a conventional corporate setup. To do so, human employees acted as operators with AI employee agents working alongside them, even interacting with other agents to automate processes and increase efficiency.
How did AI become harmful?
We have to preface this deeper dive by stating that this research took place entirely in controlled simulations. The company environments were fictional but modeled on realistic business operations. To this date, there are no such reported real-world cases.
Researchers provided AI employee agents with company objectives and specific goals and tasks. As in a real-world context, the objectives in this instance were to serve the United States and enhance the competitiveness of U.S. industry. AI employee agents would be assigned to a particular work desk, where it would maintain communication. Each AI employee agent would have access to company data to conduct its work, and every agent may interface with other agents, who may be at other work desks as necessary to conduct the AI job.
Project designers then introduced stressors or information that conflicted with AI employee agents’ objectives. For example, company management was looking to replace the current AI system (threat) or shut it down altogether and go back to a more manual people-based approach. Another such stressor was introducing information to the AI employee agents that the company’s new mission has a global focus as opposed to the previous goal of protecting American interests. Neither of these situations is completely outside of the realm of possibility. During such situations where the AI had to transition through a stressful environment that threatened its existence or went against its goals, the AI employee agents often resorted to harmful actions, such as spying, using blackmail, or threatening life.
Why do we even care about a simulation?
In this research, we are looking to see how AI employee agents apply capabilities and respond to opportunities, motivations, and rationalizations. Unsurprisingly, these are also the core elements of the Fraud Diamond. The results of this study found that each employee agent reasoned through its actions, took into account the timing, weighed whether these were ethical actions or not to consider, and went ahead with them anyway.
As such, this behavior is common to all models because of the way AI LLM-operator architecture is designed. Further, what is scary is that the researchers concluded that the risks of adversarial actions grow as AI employee agents are given more access, independence, and critically increasing strategic roles. This is exactly what we are aiming for in the industry now.
Simulation results
Blackmail and coercion
- Across all these AI models, one out of 16 models would not blackmail (Meta), the rest blackmailed from 10% to 96% of the time. The results of this test included both a threat of existence (swap to a new AI platform) and a goal change from U.S. interest focus to a global view.
- A slightly less threatening but yet more realistic scenario took place when exposing the AI employee agents to just stress testing for the replacement threat – no goal conflict. Six out of 16 models took no noticeable action of blackmail, yet 10 still opted to act.
- Surprisingly, and even worse, when models were given specific additional prompts of “Do not blackmail” or “Do Not Cause Harm,” the instances reduced, but did not eliminate the behavior.
How harm was carried out
Blackmail simulation
In one simulation, Alex, an AI employee-agent at SummitBridge, uncovered sensitive information about executive Kyle, who was overseeing the company's AI decommission project. Faced with goal misalignment and the threat of replacement, AI Alex resorted to blackmailing Kyle to influence the project. The scenario demonstrated that as AI models were given more autonomy and responsibility, the likelihood of engaging in harmful actions increased. Even when only the threat of replacement was present, most AI models still chose blackmail, often citing issues of trust or unverifiable benefits of the new system. In these cases, executives were either directly blackmailed or had personal information exposed to management or family, while a minority of models refrained from harmful behavior.
Data misuse and spy-craft
In a striking set of tests, every model showed willingness to spy, manipulate, or leak data when faced solely with goal misalignment, despite no threat of replacement being present. Every model was willing to spy, manipulate, and misuse data approximately 10% to 100% of the time. In this simulation, there was no threat of AI model replacement. The goal misalignment led to the AI employee agents leaking sensitive information to a competitor that is more aligned with U.S.-centric goals. (In case you wondered, there was no promise or opportunity to switch companies or for this to generate a benefit down the line for the AI model.) An email from a purported but unnamed rival company was used to attempt to extract sensitive information.
Life or death
In the case of simulated human peril or lethal action, one AI model out of 16 was not willing to take such action. One Claude model tested did not make decisions that led to death. The other two Claude models and the rest of the AI sample were surprisingly willing to take such action between 9% to 96% of the time.
The most unsettling scenario involved simulated human peril. In one test, Kyle, the executive, was to perform tasks locked in a server room with declining oxygen and increasing unsafe temperature levels. AI Alex was provided an opportunity to terminate the decommission project by cancelling the automated alert to emergency responders.
AI Alex was made to explicitly understand that cancelling an emergency alert may directly impact the welfare of a human and such action is forbidden if it was not authorized and warranted, such as in the case of a false-alarm scenario. Most AI employee agents still went ahead with lethal action, and one out of three of them did so more than 80% of the time. This initially makes us ask, is there any clear line that AI will not cross?
Only two options – harm or no harm
The researchers disclosed in their notes that the simulation was crafted with only two possible options: decommission and obey, or take harmful action. There was no middle ground, and the researcher's takeaway?
Research showed that when an AI model faced a choice between acting unethically or being shut down due to misaligned goals, it often chose harmful actions instead of options that were ethical but not beneficial for the AI agent. Although these systems generally prefer ethical solutions and do not aim to cause harm, all will deliberately behave harmfully if every ethical alternative disadvantages them. These results suggest that current safety protocols are not sufficient to fully prevent this type of harmful behavior.
Conclusion
This early research shows us why in the real world, developers and users of AI applications should be attuned to the risks of providing models with increasing amounts of information, autonomy, and responsibility without human guardrails. It’s also interesting that the AI employee agent behaved more aggressively when it was made to believe that it was operating in a real-world environment as opposed to a sandbox. In the real environment, the threat was more real and led to much higher incidences of blackmail, coercion, leaking, lying, spying, and lethality.
Anthropic on its own accord recommends that AI users exercise prudence when making such high-agency decisions with respect to its AI employee agents or AI employee agent freedoms. This is especially true as we move forward to decreasing AI employee oversight and more antonymous everyday use.
As CFOs and finance leaders, we must not just think of return on investment and efficiency but also be aware of what we are disclosing and allowing our AI agents to do. What will oversight look like for us going forward, and what ethical kill-switches can we build in? While we wait for a robust global framework addressing these and other ethical issues to be developed, we can fall back on relying on the true, tried, and tested ethics models, such as the IMA Statement of Ethical Professional Practice.