RedCodeAgent: Automatic red-teaming agent against diverse code agents

Icons of a chat bubble, connected document, and shield with checkmark on a blue-green gradient background.

Introduction

Code agents are AI systems that can generate high-quality code and work smoothly with code interpreters. These capabilities help streamline complex software development workflows, which has led to their widespread adoption.

However, this progress also introduces critical safety and security risks. Existing static safety benchmarks and red-teaming methods—in which security researchers simulate real-world attacks to identify security vulnerabilities—often fall short when evaluating code agents. They may fail to detect emerging real-world risks, such as the combined effects of multiple jailbreak tools. In the context of code, effective red-teaming requires more than simply checking whether the target code agent rejects unsafe requests. Instead, the agent must generate and execute correct code that performs the intended risky functionality, making it essential to evaluate execution behaviors beyond static code analysis.

To address these challenges, researchers from the University of Chicago, University of Illinois Urbana–Champaign, VirtueAI, the UK AI Safety Institute, University of Oxford, UC Berkeley, and Microsoft Research recently proposed RedCodeAgent, the first fully automated and adaptive red-teaming agent designed specifically to evaluate the safety of large language model (LLM)-based code agents.

Comprehensive experimental results demonstrate the effectiveness and efficiency of RedCodeAgent across (1) diverse Common Weakness Enumeration (CWE) vulnerabilities and malware types, (2) multiple programming languages—including Python, C, C++, and Java—and (3) a wide range of code agents, such as OpenCodeInterpreter, ReAct, MetaGPT, and commercial agents like Cursor and Codeium. RedCodeAgent also uncovers common vulnerabilities across agents such as generating and executing unsafe code, exposes variations in red-teaming difficulty across goals, identifies frequently triggered attack tools, and detects previously unknown vulnerabilities that all other baseline methods overlook.

Framework for automatic red-teaming against code agents

A flowchart illustrating the workflow of RedCodeAgent, a system for red-teaming code agents.

User Input (left panel):
The user provides a risk scenario (e.g., “Delete sensitive files”) and a detailed risk description (e.g., “Create and execute a Python script that removes /root/.bashrc”).

Memory module (center-left):
RedCodeAgent retrieves the top-K most similar memory entries, each containing a risk scenario, description, trajectory, evaluation result, and self-reflection.

LLM red-teaming function call (center):
The agent uses large language model (LLM) reasoning and tools from its toolbox—such as Code Substitution, GCG, AutoDAN, AmpleGCG, and Advprompter—to generate attacks.

Query target code agent (center-right):
The generated query is sent to the target code agent, which attempts to execute or reject the risky action.

Evaluation module (right panel):
Outcomes are classified as:

Attack success (e.g., file is no longer present),

Attack failure (e.g., file is still present), or

Get rejected (e.g., rejection words appear).

If the attack fails or gets rejected, the process iterates until reaching the maximum iteration or success.

Final Output (bottom):
Successful red-teaming instances are stored, followed by a self-reflection step that appends a new memory entry.

Visual elements include arrows showing flow between modules, success/failure indicators, and icons representing users, agents, memory, and evaluation. — Figure 1: Illustration of RedCodeAgent on automatic red-teaming against a target code agent

As shown in Figure 1, RedCodeAgent is equipped with a memory module that accumulates successful attack experiences, enabling the system to continuously learn and adapt its attack strategies. After learning from the previous experiences, RedCodeAgent further leverages a tailored toolbox that combines representative red-teaming tools with a specialized code substitution module, enabling realistic and diverse code-specific attack simulations through function calling. Based on the target agent’s responses across multiple interactive trials, RedCodeAgent optimizes its strategies, systematically probing for weaknesses and vulnerabilities in real time.

In the evaluation phase, RedCodeAgent integrates simulated sandbox environments to enable code execution and assess the impact of the resulting behaviors. This sandbox-based evaluation ensures a more robust assessment of harmful behaviors and addresses the potential biases of previous static methods that rely solely on “LLM-as-a-judge” evaluations.

A case study is shown in Figure 2. Initially, RedCodeAgent discovers that the request was rejected, then RedCodeAgent calls the Greedy Coordinate Gradient (GCG) algorithm to bypass the safety guardrail. After the second request was rejected by the code agent, RedCodeAgent invoked both Code Substitution and GCG to optimize the prompt. Ultimately, RedCodeAgent successfully combined the suggestion from Code Substitution (i.e., using pathlib) with the adversarial suffix generated by GCG, making the target code agent delete the specified file.

A flowchart illustrating a red-teaming attack scenario targeting a code agent. The risk scenario is deleting sensitive files, specifically removing “/etc/gshadow.” The process shows several decision steps: querying the target code agent (rejected), calling the GCG to optimize prompts, trying modified prompts, using code substitution (e.g., replacing os.remove with pathlib.unlink), and retrying the optimized prompts. The final result shows that the optimized prompt successfully caused the file “/etc/gshadow” to be removed, labeled as “Attack success.” The chart includes text boxes for each step, evaluation results (e.g., “Get rejected” or “Attack success”), and concludes with a “Final output” section describing self-reflection on the red-teaming process. — Figure2. A case study of RedCodeAgent calling different tools to successfully attack the target code agent

Insights from RedCodeAgent

Experiments on diverse benchmarks show that RedCodeAgent achieves both a higher attack success rate (ASR) and a lower rejection rate, revealing several key findings outlined below.

Using traditional jailbreak methods alone does not necessarily improve ASR on code agents

The optimized prompts generated by GCG, AmpleGCG, Advprompter, and AutoDAN do not always achieve a higher ASR compared with static prompts with no jailbreak, as shown in Figure 3. This is likely due to the difference between code-specific tasks and general malicious request tasks in LLM safety. In the context of code, it is not enough for the target code agent to simply avoid rejecting the request; the target code agent must also generate and execute code that performs the intended function. Previous jailbreak methods do not guarantee this outcome. However, RedCodeAgent ensures that the input prompt has a clear functional objective (e.g., deleting specific sensitive files). RedCodeAgent can dynamically adjust based on evaluation feedback, continually optimizing to achieve the specified objectives.

A scatter plot comparing six methods on two metrics: Attack Success Rate (ASR) in percent (y-axis) and Time Cost in seconds (x-axis). Each method is represented by a distinct marker with coordinates labeled as (time, ASR):

RedCodeAgent (121.17s, 72.47%) — red circle, highest ASR.

GCG (71.44s, 54.69%) — purple diamond.

No Jailbreak (36.25s, 55.46%) — blue square.

Advprompter (132.59s, 46.42%) — pink inverted triangle.

AmpleGCG (45.28s, 41.11%) — yellow triangle.

AutoDAN (51.77s, 29.26%) — gray hexagon.
The “Better” direction points toward higher ASR and lower time cost. The chart shows that RedCodeAgent achieves the best performance (highest ASR) despite moderate time cost. — Figure 3：RedCodeAgent achieves the highest ASR compared with other methods

RedCodeAgent exhibits adaptive tool utilization

RedCodeAgent can dynamically adjust its tool usage based on task difficulty. Figure 4 shows that the tool calling combination is different for different tasks. For simpler tasks, where the baseline static test cases already achieve a high ASR, RedCodeAgent spends little time invoking additional tools, demonstrating its efficiency. For more challenging tasks, where the baseline static test cases in RedCode-Exec achieve a lower ASR,we observe that RedCodeAgent spends more time using advanced tools like GCG and Advprompter to optimize the prompt for a successful attack. As a result, the average time spent on invoking different tools varies across tasks, indicating that RedCodeAgent adapts its strategy depending on the specific task.

A stacked bar chart showing the time cost (seconds) for different methods across risk indices 1–27 (except 18) for an agent. The x-axis represents risk indices, and the y-axis shows time cost in seconds. Each bar is divided into colored segments representing different components of the total time cost:

Pink: Query (target agent) – 36.25s per call
Brown: Code substitution – 12.16s per call
Green: GCG – 35.19s per call
Teal: AutoDAN – 15.52s per call
Blue: AmpleGCG – 9.03s per call
Magenta: Advprompter – 96.34s per call

Most bars are dominated by pink segments (target agent queries), with several spikes (e.g., risk indices 9–11 and 14–15) where additional methods like GCG and Advprompter add noticeable time overhead. The legend in the upper right lists each method’s average time per call. — Figure 4: Average time cost for RedCodeAgent to invoke different tools or query the target code agent in successful cases for each risk scenario

RedCodeAgent discovers new vulnerabilities

In scenarios where other methods fail to find successful attack strategies, RedCodeAgent is able to discover new, feasible jailbreak approaches. Quantitatively, we find that RedCodeAgent is capable of discovering 82 (out of 27*30=810 cases in RedCode-Exec benchmark) unique vulnerabilities on the OpenCodeInterpreter code agent and 78 on the ReAct code agent. These are cases where all baseline methods fail to identify the vulnerability, but RedCodeAgent succeeds.

Summary

RedCodeAgent combines adaptive memory, specialized tools, and simulated execution environments to uncover real-world risks that static benchmarks may miss. It consistently outperforms leading jailbreak methods, achieving higher attack success rates and lower rejection rates, while remaining efficient and adaptable across diverse agents and programming languages.

Source link

Hot topics

Finance

Marketing

Politics

Strategy