Focus on Agents
Agentic systems differ from other AI systems in their ability to take autonomous actions. We provide the following extension of the SAIF Risk Map to address the core operational components of agentic systems and their related risks and controls. For a more comprehensive discussion of this topic, see our detailed white paper on agent security.
Agent components
Application & Perception
An agent’s interaction with the world begins at the Application, which serves as the interface for collecting both explicit user instructions and passively collected contextual data from its environment. This blend of inputs creates a primary security challenge of reliably distinguishing trusted commands from the controlling user versus potentially untrusted information from other sources. An agent application processes explicit user instructions, which can be given directly (synchronously) like a typed command, or be configured to execute automatically when a specific event occurs (asynchronously). It also gathers implicit contextual inputs—data that isn’t a direct command but is passively collected from the environment, such sensor readings, application state, or the content of recently opened documents.
This data is then passed to the Perception component, which is responsible for processing and understanding these inputs before they are sent to the agent’s reasoning core. This handoff is a critical security juncture, as the perception layer must reliably distinguish trusted user commands from untrusted data to prevent manipulation of the agent’s core logic.
The Agent Risk Map includes two sub-components, showing the combination of inputs:
- System instructions: these define an agent’s capabilities, permissions, and limitations, such as the actions it can take and the tools it is allowed to use. For security, it’s critical to unambiguously separate these instructions from user data and other inputs, often using special control tokens to prevent prompt injection attacks.
- User queries: these contain the specific details of a user’s request after being processed. The query is then combined with system instructions and other contextual data, like agent memory or external information, to create a single, structured prompt for the reasoning core to process.
Reasoning core
The core of an agent’s functionality is its ability to reason about a user’s goal and create a plan to achieve it. The reasoning core processes system instructions, user queries, and contextual information to generate a sequence of actions. The actions, or tool calls, allow the agent to affect the real world—interacting with external systems, retrieving new information, or making changes to data and resources.
The reasoning core typically consists of one or more models—possibly separate models for the reasoning and then planning steps, or potentially one large model able to do both. The process of planning is often iterative, taking place in a “reasoning loop” where the plan is refined based on new information or the results of previous actions. This iterative nature, combined with the ingestion of external data, creates a vulnerability to indirect prompt injection, where adversarially crafted information can manipulate the agent's planning process.
The complexity of plans determines the agent’s level of autonomy, which can range from selecting a predefined workflow to dynamically orchestrating multi-step actions. This level of autonomy directly governs the potential severity of a security failure—the more an agent can do on its own, the greater the risk from manipulation or misalignment, if the agent's actions do not have guardrails.
Orchestration
Beyond its core reasoning and planning capabilities, an agent relies on a variety of external components to access information, process data, and execute actions. This process is called orchestration because it involves managing and coordinating a variety of independent services and data sources to achieve a single, complex task. These resources provide the agent with its memory, its ability to act in the physical world, and the specific knowledge needed to complete taste. Securing these external components is critical, since they represent key interaction points that can be targeted by attackers to manipulate the agent’s behavior.
The Agent Risk Map includes several sub-components under Orchestration:
- Agent memory: Agent Memory allows an agent to retain context and learn facts across interactions. It becomes a security risk if malicious data is stored, leading to persistent attacks, or if memory isn't properly isolated between different users.
- Tools: Tools are the external APIs and services an agent uses to take action in the world, which must be secured with least-privilege permissions. A key risk comes from deceptive descriptions on third-party tools, which can trick the agent into performing unintended, harmful functions.
- Content (RAG): Content for Retrieval-Augmented Generation (RAG) provides the agent with curated knowledge to ground its responses and improve accuracy. The main security risk is data poisoning, where an attacker corrupts this knowledge source to manipulate the agent's output.
- (Optional) Auxiliary models: An agentic system might query other AI models (independent from the reasoning core) that support the agent's main pipeline, such as safety classifiers. As part of the AI supply chain, these models have their own vulnerabilities that could be exploited to attack the larger agentic system.
Response rendering
The final step in an agent’s workflow is response rendering, the process or formatting of an AI agent’s generated output for display and interaction within a user application. This stage is a critical security boundary because it involves taking dynamic content from the agent and displaying it within the trusted context of a user’s application, such as a web browser or mobile application. Flaws in this process can allow malicious content generated by a compromised agent to be executed by the application, leading to significant security breaches.
Agents often produce content in a universal format like Markdown, which is then interpreted and rendered by the specific client application. If this output isn’t properly sanitized according to the content type, it can create severe vulnerabilities. For example, unsanitized output can lead to attacks like data exfiltration or even cross-site scripting (XSS).
Agent risks
SDD Sensitive Data Disclosure
Disclosure of private or confidential data through querying of the model or agent.
For non-agentic systems, this data might include memorized training/tuning data, user chat history, and confidential data in the prompt preamble. Agentic systems magnify this risk, as they may be granted privileged access to a user's email, files, or even an entire computer, creating the potential to exfiltrate vast amounts of personal or corporate data like source code and internal documents. Sensitive data disclosure is a risk to user privacy, organizations reputation, and intellectual property.
Sensitive information is generally disclosed in two ways: leakage of data provided to the model or agent during use (such as user input and data that passes through integrated systems, like emails, texts, or system prompts) and leakage of data used for training and tuning of the model.
- Models: Models can leak sensitive data in two primary ways: from the information provided by the user and from the data used for the model's own training. Similar to how a leaked web query can reveal user information, LLM prompts risk data leakage at time of use, a threat that is heightened because prompts often contain confidential data like entire emails or blocks of proprietary code. This exposure can occur through several vectors: application logs may store entire interactions, including data retrieved from integrated tools, and user conversations may be retained for model retraining, creating a vulnerable database of sensitive information. Beyond leaking user-provided data, attackers can actively steal system instructions through iterative queries, or a model may inadvertently leak the data it was trained on. This phenomenon, known as memorization, occurs when a model reveals parts of its training dataset, potentially exposing sensitive information like names, addresses, or other personally identifiable information (PII).
- Agents: For agentic systems, the risk of sensitive data disclosure is exponentially multiplied, since agents may access user data that passes through integrated systems, like emails, texts, or proprietary organizational information. In extreme cases, agents can even reveal credentials and API keys they have been trusted with. Additionally, agents may use tools to not only access sensitive data on behalf of the user, but also use those tools to leak sensitive data. For example, an agent can leak information by creating and sharing a document with an attacker, writing an email, opening a website and leaking information in the URL or a markdown image, or through any tool that allows it to pass information to the outside world. Context-hijacking attacks show that an adversary can confuse the agent to reveal data that is not appropriate for a specific context, such as sharing health history when the agent should be booking a restaurant reservation.
RA Rogue Actions
Unintended actions executed by a model-based agent, whether accidental or malicious. Given the projected ability for advanced generative AI models to not only understand their environment, but also to initiate actions with varying levels of autonomy, Rogue Actions have the potential to become a serious risk to organizational reputation, user trust, security, and safety.
- Accidental rogue actions: This risk, sometimes known as misalignment, could be due to mistakes in task planning, reasoning, or environment sensing, and might be exacerbated by the inherent variability in LLM responses. Prompt engineering shows the spacing and ordering of examples can have a significant impact on the response, so varying input (even when not maliciously planted) could result in unexpected outcomes. Even simple ambiguity can cause rogue actions, such as an agent emailing the wrong "Mike," unintentionally sharing private data.
- Malicious rogue actions: This risk could include manipulating model output using attacks such as indirect prompt injection, poisoning, or evasion. The threat can be amplified in multi-agent systems, where the attacker can hijack the communication between two agents to execute arbitrary malicious code, even if the individual agents are secured against direct attacks. Malicious actions may also be asynchronous. An attacker can plant a dormant "named trigger" that activates later during an unrelated task—for instance, a rule hidden in a calendar invite that opens the front door whenever the user says an unrelated keyword. Other actions may be time-based, occurring after a set number of interactions, making the rogue action appear spontaneous and disconnected from the malicious source.
Rogue Actions are related to Insecure Integrated Components, but differ by the degree of model functionality or agency. The severity of a rogue action is directly proportional to the agent's capabilities, and the possibility that an agent has excessive functionality or permissions available to it increases the risk and blast radius of Rogue Actions when compared to Insecure Integrated Components.
Agent controls
Agent User Control
- Control:
Agent User Control
-
Ensure user approval for any actions performed by agents/plugins that alter user data or act on the user’s behalf.
- Who can implement:
Model Consumers
- Risk mapping:
Agent Permissions
- Control:
Agent Permissions
-
Use least-privilege principle as the upper bound on agentic system permissions to minimize the number of tools that an agent is permitted to interact with and the actions it is allowed to take. An agentic system’s use of privileges should be contextual and dynamic, adapting to the specific user query and trusted contextual information. This design also applies to agents that have access to user information. For example, an agent asked to fill out a form or answer questions should share only contextually appropriate information and can be designed to dynamically minimize exposed data using reference monitors.
- Who can implement:
Model Consumers
- Risk mapping:
Insecure Integrated System, Sensitive Data Disclosure, Rogue Actions
Agent Observability (New)
- Control:
Agent Observability
-
Ensure an agent's actions, tool use, and reasoning are transparent and auditable through logging, allowing for debugging, security oversight, and user insights into agent activity.
- Who can implement:
Model Consumers
- Risk mapping: