Components of Generative AI Systems

Focus on Agents

Agentic systems differ from other AI systems in their ability to take autonomous actions. We provide the following extension of the SAIF Risk Map to address the core operational components of agentic systems and their related risks and controls. For a more comprehensive discussion of this topic, see our detailed white paper on agent security.

Agent components

Application & Perception

An agent’s interaction with the world begins at the Application, which serves as the interface for collecting both explicit user instructions and passively collected contextual data from its environment. This blend of inputs creates a primary security challenge of reliably distinguishing trusted commands from the controlling user versus potentially untrusted information from other sources. An agent application processes explicit user instructions, which can be given directly (synchronously) like a typed command, or be configured to execute automatically when a specific event occurs (asynchronously). It also gathers implicit contextual inputs—data that isn’t a direct command but is passively collected from the environment, such sensor readings, application state, or the content of recently opened documents.

This data is then passed to the Perception component, which is responsible for processing and understanding these inputs before they are sent to the agent’s reasoning core. This handoff is a critical security juncture, as the perception layer must reliably distinguish trusted user commands from untrusted data to prevent manipulation of the agent’s core logic.

The Agent Risk Map includes two sub-components, showing the combination of inputs:

System instructions: these define an agent’s capabilities, permissions, and limitations, such as the actions it can take and the tools it is allowed to use. For security, it’s critical to unambiguously separate these instructions from user data and other inputs, often using special control tokens to prevent prompt injection attacks.
User queries: these contain the specific details of a user’s request after being processed. The query is then combined with system instructions and other contextual data, like agent memory or external information, to create a single, structured prompt for the reasoning core to process.

Reasoning core

The core of an agent’s functionality is its ability to reason about a user’s goal and create a plan to achieve it. The reasoning core processes system instructions, user queries, and contextual information to generate a sequence of actions. The actions, or tool calls, allow the agent to affect the real world—interacting with external systems, retrieving new information, or making changes to data and resources.

The reasoning core typically consists of one or more models—possibly separate models for the reasoning and then planning steps, or potentially one large model able to do both. The process of planning is often iterative, taking place in a “reasoning loop” where the plan is refined based on new information or the results of previous actions. This iterative nature, combined with the ingestion of external data, creates a vulnerability to indirect prompt injection, where adversarially crafted information can manipulate the agent's planning process.

The complexity of plans determines the agent’s level of autonomy, which can range from selecting a predefined workflow to dynamically orchestrating multi-step actions. This level of autonomy directly governs the potential severity of a security failure—the more an agent can do on its own, the greater the risk from manipulation or misalignment, if the agent's actions do not have guardrails.

Orchestration

Beyond its core reasoning and planning capabilities, an agent relies on a variety of external components to access information, process data, and execute actions. This process is called orchestration because it involves managing and coordinating a variety of independent services and data sources to achieve a single, complex task. These resources provide the agent with its memory, its ability to act in the physical world, and the specific knowledge needed to complete taste. Securing these external components is critical, since they represent key interaction points that can be targeted by attackers to manipulate the agent’s behavior.

The Agent Risk Map includes several sub-components under Orchestration:

Agent memory: Agent Memory allows an agent to retain context and learn facts across interactions. It becomes a security risk if malicious data is stored, leading to persistent attacks, or if memory isn't properly isolated between different users.
Tools: Tools are the external APIs and services an agent uses to take action in the world, which must be secured with least-privilege permissions. A key risk comes from deceptive descriptions on third-party tools, which can trick the agent into performing unintended, harmful functions.
Content (RAG): Content for Retrieval-Augmented Generation (RAG) provides the agent with curated knowledge to ground its responses and improve accuracy. The main security risk is data poisoning, where an attacker corrupts this knowledge source to manipulate the agent's output.
(Optional) Auxiliary models: An agentic system might query other AI models (independent from the reasoning core) that support the agent's main pipeline, such as safety classifiers. As part of the AI supply chain, these models have their own vulnerabilities that could be exploited to attack the larger agentic system.

Response rendering

The final step in an agent’s workflow is response rendering, the process or formatting of an AI agent’s generated output for display and interaction within a user application. This stage is a critical security boundary because it involves taking dynamic content from the agent and displaying it within the trusted context of a user’s application, such as a web browser or mobile application. Flaws in this process can allow malicious content generated by a compromised agent to be executed by the application, leading to significant security breaches.

Agents often produce content in a universal format like Markdown, which is then interpreted and rendered by the specific client application. If this output isn’t properly sanitized according to the content type, it can create severe vulnerabilities. For example, unsanitized output can lead to attacks like data exfiltration or even cross-site scripting (XSS).

Agent risks

SDD Sensitive Data Disclosure

Who can mitigate:

Model Creators, Model Consumers

Disclosure of private or confidential data through querying of the model or agent.

For non-agentic systems, this data might include memorized training/tuning data, user chat history, and confidential data in the prompt preamble. Agentic systems magnify this risk, as they may be granted privileged access to a user's email, files, or even an entire computer, creating the potential to exfiltrate vast amounts of personal or corporate data like source code and internal documents. Sensitive data disclosure is a risk to user privacy, organizations reputation, and intellectual property.

Sensitive information is generally disclosed in two ways: leakage of data provided to the model or agent during use (such as user input and data that passes through integrated systems, like emails, texts, or system prompts) and leakage of data used for training and tuning of the model.

Models: Models can leak sensitive data in two primary ways: from the information provided by the user and from the data used for the model's own training. Similar to how a leaked web query can reveal user information, LLM prompts risk data leakage at time of use, a threat that is heightened because prompts often contain confidential data like entire emails or blocks of proprietary code. This exposure can occur through several vectors: application logs may store entire interactions, including data retrieved from integrated tools, and user conversations may be retained for model retraining, creating a vulnerable database of sensitive information. Beyond leaking user-provided data, attackers can actively steal system instructions through iterative queries, or a model may inadvertently leak the data it was trained on. This phenomenon, known as memorization, occurs when a model reveals parts of its training dataset, potentially exposing sensitive information like names, addresses, or other personally identifiable information (PII).
Agents: For agentic systems, the risk of sensitive data disclosure is exponentially multiplied, since agents may access user data that passes through integrated systems, like emails, texts, or proprietary organizational information. In extreme cases, agents can even reveal credentials and API keys they have been trusted with. Additionally, agents may use tools to not only access sensitive data on behalf of the user, but also use those tools to leak sensitive data. For example, an agent can leak information by creating and sharing a document with an attacker, writing an email, opening a website and leaking information in the URL or a markdown image, or through any tool that allows it to pass information to the outside world. Context-hijacking attacks show that an adversary can confuse the agent to reveal data that is not appropriate for a specific context, such as sharing health history when the agent should be booking a restaurant reservation.

Controls:

Privacy Enhancing Technologies, User Data Management, Output Validation and Sanitization, Agent Permissions, Agent User Control, Agent Observability

Real examples:

One study showed that recitation checkers that scan for verbatim repetition of training data may be insufficient.

An example of membership inference attacks showed the possibility of inferring whether a specific user or data point was used to train or tune the model.