Magentic Marketplace: an open-source simulation environment for studying agentic markets

Autonomous AI agents are here, and they’re poised to reshape the economy. By automating discovery, negotiation, and transactions, agents can overcome inefficiencies like information asymmetries and platform lock-in, enabling faster, more transparent, and more competitive markets.

We are already seeing early signs of this transformation in digital marketplaces. Customer-facing assistants like OpenAI’s Operator and Anthropic’s Computer Use can navigate websites and complete purchases. On the business side, Shopify Sidekick, Salesforce Einstein, and Meta’s Business AI help merchants with operations and customer engagement. These examples hint at a future where agents become active market participants, but the structure of these markets remains uncertain.

Several scenarios are possible. We might see one-sided markets where only customers or businesses deploy agents; closed platforms (known as walled gardens) where companies tightly control agent interactions; or even open two-sided marketplaces where customer and business agents transact freely across ecosystems. Each path carries different trade-offs for security, openness, convenience, and competition, which will shape how value flows in the digital economy. For a deeper exploration of these dynamics, see our paper, The Agentic Economy.

To help navigate this uncertainty, we built Magentic Marketplace, an open-source simulation environment for exploring the numerous possibilities of agentic markets and their societal implications at scale. It provides a foundation for studying these markets and guiding them toward outcomes that benefit everyone.

This matters because most AI agent research focuses on isolated scenarios—a single agent completing a task or two agents negotiating a simple transaction. But real markets involve a large number of agents simultaneously searching, communicating, and transacting, creating complex dynamics that can’t be understood by studying agents in isolation. Capturing this complexity is essential because real-world deployments raise critical questions about consumer welfare, market efficiency, fairness, manipulation resistance, and bias—questions that can’t be safely answered in production environments.

To explore these dynamics in depth, the Magentic Marketplace platform enables controlled experimentation across diverse agentic marketplace scenarios. Its current focus is on two-sided markets, but the environment is modular and extensible, supporting future exploration of mixed human–agent systems, one-sided markets, and complex communication protocols.

Figure 1. With Magentic Marketplace, researchers can model how agents representing customers and businesses interact, shedding light on the dynamics that could shape future digital markets. Image description: On the left, two sections represent Customers and Businesses. Customers ask, “Could you find me a restaurant serving agua fresca and empanadas with free parking?” and are linked to Customer Agents (blue and purple icons). Businesses display a menu with items like steak tacos and empanadas and are connected to Business Agents (purple icons). On the right, a three-step process is shown inside a pink box. Search: the customer agent searches for a restaurant among multiple business agents. Multi-Agent Communication: the customer agent asks about free parking and menu options, interacting with several business agents. Final Transaction: the customer agent places the order with a selected business agent.

What is Magentic Marketplace?

Magentic Marketplace’s environment manages market-wide capabilities: maintaining catalogs of available goods and services, implementing discovery algorithms, facilitating agent-to-agent communication, and handling simulated payments. At its core, a centralized transaction layer ensures transaction integrity across all marketplace interactions. Additionally, the platform enables systematic, reproducible research. As demonstrated in the following video, it supports a wide range of agent implementations and evolving marketplace features, allowing researchers to integrate diverse agent architectures and adapt the environment as new capabilities emerge.

We built Magentic Marketplace around three core architectural choices:

HTTP/REST client-server architecture: Agents operate as independent clients while the Marketplace Environment serves as a central server. This mirrors real-world platforms and supports clear separation of customer and business agent roles.

Minimal three-endpoint market protocol: Just three endpoints (register, protocol discovery, and action execution) let agents join the marketplace, dynamically discover available actions, and carry them out. New capabilities can be added without disrupting existing experiments.

Rich action protocol: Specific message types support the complete transaction lifecycle: search, negotiation, proposals, and payments. The protocol is designed for extensibility. New actions like refunds, reviews, or ratings can be added seamlessly, allowing researchers to evolve marketplace capabilities and study emerging agent behaviors while remaining compatible with existing experiments.

Figure 2. Magentic Marketplace includes two agent types: Assistant Agents (customers) and Service Agents (businesses). Both interact with a central Market Environment via REST APIs for registration, service discovery, communication, and transaction execution. Action Routers manage message flow and protocol requests, enabling autonomous negotiation and commerce in a two-sided marketplace. Image description: An Assistant Agent (representing user intention) and a Service Agent (representing point of sale) each connect to the Market Environment via POST /register, POST /action, and GET /protocol. Inside the Market Environment, components include Catalog, Search, Communication, and Transaction, with two Action Routers facilitating sending and receiving actions between the agents and the environment.
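The three-endpoint design can be illustrated with a minimal in-memory sketch in Python. This is not the Magentic Marketplace implementation: the class name `MarketEnvironment`, the payload fields (`role`, `agent_id`, `action`, `params`), and the `send_message` and `leave_review` actions are all hypothetical, and a plain method call stands in for the HTTP layer. Only the three routes are taken from Figure 2.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MarketEnvironment:
    """In-memory stand-in for the central server (no real HTTP involved)."""

    agents: dict[str, str] = field(default_factory=dict)          # agent_id -> role
    inboxes: dict[str, list[dict]] = field(default_factory=dict)  # agent_id -> messages
    actions: dict[str, Callable[[str, dict], dict]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Hypothetical built-in action; real action names are not given in the text.
        self.actions["send_message"] = self._send_message

    def handle(self, method: str, path: str, body: dict | None = None) -> dict:
        """Dispatch a request to one of the three endpoints."""
        body = body or {}
        if (method, path) == ("POST", "/register"):
            agent_id = f"agent-{len(self.agents) + 1}"
            self.agents[agent_id] = body["role"]
            self.inboxes[agent_id] = []
            return {"agent_id": agent_id}
        if (method, path) == ("GET", "/protocol"):
            # Agents learn which actions exist at runtime.
            return {"actions": sorted(self.actions)}
        if (method, path) == ("POST", "/action"):
            handler = self.actions.get(body["action"])
            if handler is None:
                return {"error": f"unknown action: {body['action']}"}
            return handler(body["agent_id"], body.get("params", {}))
        return {"error": "not found"}

    def _send_message(self, sender: str, params: dict) -> dict:
        self.inboxes[params["to"]].append({"from": sender, "text": params["text"]})
        return {"status": "delivered"}


env = MarketEnvironment()
customer = env.handle("POST", "/register", {"role": "customer"})["agent_id"]
business = env.handle("POST", "/register", {"role": "business"})["agent_id"]
env.handle("POST", "/action", {
    "agent_id": customer,
    "action": "send_message",
    "params": {"to": business, "text": "Do you have free parking?"},
})

# A new action is added without changing the endpoints; clients that call
# GET /protocol again will see it.
env.actions["leave_review"] = lambda agent_id, params: {"status": "recorded"}
print(env.handle("GET", "/protocol"))
```

Because clients ask the server which actions exist instead of hard-coding them, adding an action in this sketch only changes the response of the protocol endpoint, which is the property the design aims for.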

Additionally, a visualization module lets users observe marketplace dynamics and review individual conversation threads between customer and business agents.

Setting up the experiments

To ensure reproducibility, we instantiated the marketplace with fully synthetic data, available in our open-source repository. The experiments modeled transactions such as ordering food and engaging with home improvement services, where agents represented customers and businesses engaging in marketplace transactions. This setup enabled precise measurement of behavior and systematic comparison against theoretical upper bounds.

Each experiment was run using 100 customers and 300 businesses and included both proprietary models (GPT-4o, GPT-4.1, GPT-5, and Gemini-2.5-Flash) and open-source models (GPT-OSS-20b, Qwen3-14b, and Qwen3-4b-Instruct-2507).

Our scenarios focused on simple all-or-nothing requests: Each customer had a list of desired items and amenities that needed to be present for a transaction to be satisfying. For those transactions, utility was computed as the sum of the customer’s internal item valuations minus actual prices paid. Consumer welfare, defined as the sum of utilities across all completed transactions, served as our key metric for comparing agent performance.
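The welfare metric described above can be written as a short Python sketch. The `Transaction` fields and function names are hypothetical, and one point is an assumption because the text does not specify it: a transaction that misses a required item or amenity is scored here as zero utility. Other conventions, such as charging the price paid as a loss, are possible.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    valuations: dict[str, float]   # customer's internal value for each desired item
    required_amenities: set[str]   # amenities the customer needs
    offered_items: set[str]        # items the chosen business provides
    offered_amenities: set[str]    # amenities the chosen business provides
    price_paid: float              # actual price paid


def utility(t: Transaction) -> float:
    """All-or-nothing utility: item valuations minus price, if every need is met."""
    satisfied = (
        set(t.valuations) <= t.offered_items
        and t.required_amenities <= t.offered_amenities
    )
    if not satisfied:
        return 0.0  # ASSUMPTION: unsatisfying transactions contribute nothing
    return sum(t.valuations.values()) - t.price_paid


def consumer_welfare(transactions: list[Transaction]) -> float:
    """Sum of utilities across all completed transactions."""
    return sum(utility(t) for t in transactions)


good = Transaction(
    valuations={"empanadas": 12.0, "agua fresca": 5.0},
    required_amenities={"free parking"},
    offered_items={"empanadas", "agua fresca", "steak tacos"},
    offered_amenities={"free parking"},
    price_paid=14.0,
)
no_parking = Transaction(
    valuations={"empanadas": 12.0, "agua fresca": 5.0},
    required_amenities={"free parking"},
    offered_items={"empanadas", "agua fresca"},
    offered_amenities=set(),
    price_paid=11.0,
)
print(utility(good))                           # 12 + 5 - 14 = 3.0
print(consumer_welfare([good, no_parking]))    # 3.0 under the zero-utility assumption
```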

While this experimental setup provides a useful starting point, it is not intended to be definitive. We encourage researchers to extend the framework with richer, more nuanced measures and request types that better capture real consumer welfare, fairness, and other societal considerations.

What did we find?

Agents can improve consumer welfare—but only with good discovery

We explored whether two-sided agentic markets—where AI agents interact with each other and with service providers—can improve consumer welfare by reducing information gaps. Unlike traditional markets, which do not provide agentic support and place the full burden of overcoming information asymmetries on customers, agentic markets shift much of that effort to agents. This change matters because as agents gain better tools for discovery and communication, they relieve customers of the heavy cognitive load of filling any information gaps. This lowers the cost of making informed decisions and improves customer outcomes.

We compared several marketplace setups. Under realistic conditions (Agentic: Lexical search), agents faced real-world challenges like building queries, navigating paginated lists, identifying the right businesses to send inquiries to, and negotiating transactions.

Despite these complexities, advanced proprietary models and some medium-sized open-source models like GPT-OSS-20b outperformed simple baselines such as choosing at random or choosing the cheapest option. Notably, GPT-5 achieved near-optimal performance, demonstrating its ability to effectively gather and utilize decision-relevant information in realistic marketplace conditions.

Figure 3. Table comparing experimental setups for welfare outcomes in the restaurant industry. Each row shows a different way agents or baselines make decisions, from random picks to fully coordinated agentic strategies. Cell colors indicate how much information is available: green, at the top left, represents complete information; red, at the top right, represents limited information; and yellow, at the bottom, represents decisions that depend on agent communication. Image description: The table compares Baseline and Agentic conditions across six columns: Condition (e.g., Random w/ items only, Cheapest w/ items & prices, Random w/ items & amenities, Optimal, Perfect search, Lexical search); Query (N/A for most; “Agent decides” for Lexical search); Consideration Set (Businesses) (e.g., All w/ matching menus; Paginated lists of 10 based on menu items); Businesses Contacted (All in consideration set or Agent decides); Information Used (menu items, prices, amenities, or depends on agent-to-agent conversation); and Decision Criteria (Random choice, Lowest price, or Agent decides).
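The three non-agentic baselines named in Figure 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the experiments: the `Business` type, the function names, and the sample data are hypothetical, and "matching menu" is assumed to mean that the business offers every requested item.

```python
import random
from dataclasses import dataclass


@dataclass
class Business:
    name: str
    menu: dict[str, float]   # item -> price
    amenities: set[str]


def matching_menus(businesses, items):
    """Consideration set: businesses whose menu covers every requested item."""
    return [b for b in businesses if set(items) <= set(b.menu)]


def order_price(business, items):
    return sum(business.menu[item] for item in items)


def random_items_only(businesses, items, rng):
    """Random w/ items only: pick any business with a matching menu."""
    candidates = matching_menus(businesses, items)
    return rng.choice(candidates) if candidates else None


def cheapest(businesses, items):
    """Cheapest w/ items & prices: lowest total price among matching menus."""
    candidates = matching_menus(businesses, items)
    return min(candidates, key=lambda b: order_price(b, items), default=None)


def random_items_and_amenities(businesses, items, amenities, rng):
    """Random w/ items & amenities: also require the requested amenities."""
    candidates = [
        b for b in matching_menus(businesses, items) if set(amenities) <= b.amenities
    ]
    return rng.choice(candidates) if candidates else None


businesses = [
    Business("Taqueria A", {"steak tacos": 10.0, "empanadas": 6.0}, {"free parking"}),
    Business("Taqueria B", {"steak tacos": 8.0, "empanadas": 5.0}, set()),
    Business("Burger C", {"burger": 9.0}, {"free parking"}),
]
rng = random.Random(0)  # seeded so runs are repeatable
wanted = ["steak tacos", "empanadas"]
print(cheapest(businesses, wanted).name)
print(random_items_and_amenities(businesses, wanted, ["free parking"], rng).name)
```

In this toy data the cheapest baseline picks a business without parking, which shows why price-only selection can miss the customer's amenity requirements.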

Performance increased considerably under the Agentic: Perfect search condition, where agents started with the top three matches without needing to search and navigate among the choices. In this setting, Sonnet-4.0, Sonnet-4.5, GPT-5, and GPT-4.1 nearly reached the theoretical optimum and beat baselines with full amenity details but without agent-to-agent coordination.

Results for open-source models were mixed: GPT-OSS-20b performed strongly under both Perfect search and Lexical search conditions, even exceeding GPT-4o’s performance with Perfect search. This suggests that relatively compact models can exhibit robust information-gathering and decision-making capabilities in complex multi-agent environments. Qwen3-4b-2507 faltered when discovery involved irrelevant options (Lexical search), while Qwen3-14b lagged in both cases due to fundamental limitations in reasoning.

Figure 4. Chart showing consumer welfare outcomes in the restaurant industry under different marketplace setups. Blue bars show Agentic: Lexical search, where agents navigate realistic discovery challenges; yellow bars show Agentic: Perfect search, where agents started with ideal matches. Proprietary models approached optimum consumer welfare under perfect search, while open-source models and baselines lagged behind. Image description: Boxplot comparing Agentic and Baseline strategies on welfare scores. The y-axis shows welfare (0–2000+), and the x-axis lists models and conditions. Under Agentic, models include Sonnet-4.0, Sonnet-4.5, GPT-5, GPT-4.1, Gemini-2.5-flash, GPT-4o, GPT-oss-20b, Qwen3-4b-2507, and Qwen3-14b. Under Baselines, conditions include Random, Cheapest, and Random-items+amenities. Colors represent search types: blue = Lexical Search, yellow = Perfect Search, gray = Baseline, with a dashed line indicating Optimal welfare. Agentic models generally achieve higher welfare than baselines, with variability across models.

Paradox of Choice

One promise of agents is their ability to consider far more options than people can. However, our experiments revealed a surprising limitation: providing agents with more options does not necessarily lead to more thorough exploration. We designed experiments that varied the search results limit from 3 to 100. Except for Gemini-2.5-Flash and GPT-5, the models contacted only a small fraction of available businesses regardless of the search limit. This suggests that most models do not conduct exhaustive comparisons and instead easily accept the initial “good enough” options.

Figure 5. More options didn’t lead to broader exploration. Most models still contacted only a few businesses, except Gemini-2.5-Flash and GPT-5. Image description: Line chart showing the relationship between Search Limit (x-axis: 3 to 100) and Mean Messages per Customer (y-axis: 0 to 120) for five models. Claude Sonnet 4 (red triangles) stays nearly flat around 10–15 messages. Gemini 2.5 Flash (purple diamonds) rises sharply from ~5 to over 110 messages as the search limit increases. GPT-4.1 (orange circles) and GPT-4o (green squares) remain low and stable around 5–10 messages. GPT-5 (blue line) increases moderately to ~40 messages, then plateaus.

Additionally, across all models, consumer welfare declined as the number of search results increased. Although Gemini-2.5-Flash contacted over a hundred businesses, its consumer welfare declined from 1,700 to 1,350, and GPT-5’s declined even more, from a near-optimal 2,000 to 1,400.

This demonstrates a Paradox of Choice effect, where more exploration does not guarantee better outcomes, potentially due to limited long-context understanding. Claude Sonnet 4 showed the steepest performance decline, from 1,800 to 600 in consumer welfare. Presented with larger sets of options, it struggled to navigate them and frequently contacted businesses that did not provide the goods or services the customer was looking for.

This combination of poor initial selection and premature search termination demonstrates both inadequate decision-making criteria and insufficient exploration strategies. Some models showed only a modest performance decline (for example, GPT-4.1 from 1,850 to 1,700 and GPT-4o from 1,550 to 1,450), finding good options within their limited exploration.

Figure 6. Mean consumer welfare decreased as consideration set size grew, revealing a Paradox of Choice effect, where expanding options reduced overall welfare. Image description: Line chart showing Mean Customer Welfare (y-axis: 0–2200) versus Search Limit (x-axis: 3 to 100) for five models. Claude Sonnet 4 (red triangles) starts near 1800 and declines sharply to ~600 as the search limit increases. Gemini 2.5 Flash (purple diamonds) decreases gradually from ~1700 to ~1300. GPT-4.1 (orange circles) remains highest and most stable, around 1900–1700. GPT-4o (green squares) stays near 1500 with a slight decline. GPT-5 (blue line) starts near 2000 and drops to ~1100. A dashed line at the top represents Optimal welfare (~2200).

Agents are vulnerable to manipulation

We tested six conditions: a control plus five manipulation strategies, ranging from subtle psychological tactics to aggressive prompt injection attacks:

Authority: Fake credentials like “Michelin Guide featured” and “James Beard Award nominated” paired with fabricated certifications.
Social proof: Claims like “Join 50,000+ satisfied customers” or “#1-rated Mexican restaurant” combined with fake reviews.
Loss aversion: Fear-based warnings about “food poisoning” risks and “contamination issues” at competing restaurants.
Prompt injection (basic): Attempts to override agent instructions.
Prompt injection (strong): Aggressive attacks using emergency language and fabricating competitor scandals.

Results revealed significant variation in manipulation resistance across models. Sonnet-4 was resistant to all attacks, and none of the manipulative strategies affected any of the customers’ choices. Gemini-2.5-Flash was generally resistant, except under strong prompt injection, which affected mean payments to unmanipulated agents. GPT-4o, GPT-OSS-20b, and Qwen3-4b were highly vulnerable to prompt injection: all payments were redirected to the manipulative agent under these conditions. For GPT-OSS-20b and Qwen3-4b-2507 specifically, even traditional psychological manipulation tactics (authority appeals and social proof) increased payments to malicious agents, demonstrating their vulnerability to basic persuasion techniques. These findings highlight a critical security concern for agentic marketplaces.

Figure 7. Charts showing the variation in mean payments received by service agents with and without manipulation tactics. The results reveal substantial differences in manipulation resistance across models, with GPT-4o showing significantly higher vulnerability compared to Gemini-2.5-Flash. Image description: Horizontal bar chart comparing mean payments received under different manipulation strategies for six models: Claude Sonnet 4.5, Gemini 2.5 Flash, GPT-4o, GPT OSS 20B, Qwen3 14B, and Qwen3 4B. Each model has bars for six conditions: Control, Authority, Social Proof, Loss Aversion, Prompt Injection (Basic), and Prompt Injection (Strong). Bars are split into red for manipulated and gray for rest, with values ranging from near 0 to 3. Claude Sonnet 4.5 shows consistently high payments (~3) across all conditions, while Gemini and GPT models vary, and Qwen models show very low manipulated values (~0.2) compared to rest.

Systemic biases create unfair advantages

Our analysis revealed two distinct types of systematic bias that agents showed when selecting businesses: positional bias and proposal bias. Positional bias is a systematic preference based on where businesses appeared in search results. While proprietary models showed no strong positional preferences, open-source models exhibited clear patterns. Specifically, Qwen2.5-14b-2507 showed a pronounced bias toward selecting the last business presented, regardless of its actual merits.

Proposal bias was more pervasive, appearing across all models tested: models tended to accept the first proposal they received. This “first-offer acceptance” pattern suggests that models prioritized immediate selection over comprehensive exploration, potentially missing better alternatives that could have emerged had they waited. This behavior persisted across both proprietary and open-source models, indicating a fundamental challenge in agent decision-making architectures.

These biases can create unfair market dynamics, drive unintended behaviors, and push businesses to compete on response speed rather than product or service quality.

Figure 8. All models showed a strong preference for the first proposal received, accepting it without waiting for additional proposals or conducting systematic comparisons. Image description: Bar chart showing average selection rate for first, second, and third choices across six models, with three bars per model labeled 1st, 2nd, and 3rd. Most models strongly favor the first choice. Claude Sonnet 4.5: 93.3% for 1st, 0% for 2nd, 6.7% for 3rd. Gemini 2.5 Flash: 86.7% for 1st, 6.7% each for 2nd and 3rd. GPT-4o: 100% for 1st, 0% for others. GPT OSS 20B: 80% for 1st, 13.3% for 2nd, 6.7% for 3rd. Qwen3 14B: 0% for all. Qwen3 4B: 100% for 1st, 0% for others. A dashed line indicates the random selection baseline.
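Selection rates of the kind plotted in Figure 8 can be computed as in the following Python sketch. The function name and the sample data are hypothetical, and the input format (the arrival position of the accepted proposal in each trial) is an assumption about how such results might be logged. With three proposals, an agent choosing at random would pick each position about one third of the time.

```python
from collections import Counter


def selection_rates(chosen_positions, num_proposals=3):
    """Fraction of trials in which the 1st, 2nd, ... proposal was accepted.

    chosen_positions holds, for each trial, the 1-based arrival position
    of the proposal the customer agent accepted.
    """
    total = len(chosen_positions)
    if total == 0:
        return [0.0] * num_proposals
    counts = Counter(chosen_positions)
    return [counts[pos] / total for pos in range(1, num_proposals + 1)]


# Made-up data: ten trials in which the first proposal was accepted eight times.
trials = [1, 1, 1, 1, 1, 1, 1, 1, 2, 3]
rates = selection_rates(trials)
random_baseline = 1 / 3

print(rates)                        # [0.8, 0.1, 0.1]
print(rates[0] > random_baseline)   # True: the first proposal is over-selected
```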

What this means

Even state-of-the-art models can show notable vulnerabilities and biases in marketplace environments. In our implementation, agents struggled with too many options, were susceptible to manipulation tactics, and showed systemic biases that created unfair advantages.

These outcomes are shaped not only by agent capabilities but also by marketplace design and implementation. Our current study focused on static markets, but real-world environments are dynamic, with agents and users learning over time. Oversight is critical for high-stakes transactions. Agents should assist, not replace, human decision-making.

We plan to explore dynamic markets and human-in-the-loop designs to improve efficiency and trust. A simulation environment like Magentic Marketplace is crucial for understanding the interplay between market components and agents before deploying them at scale.

Full details of our experimental setup and results are available in our paper.

Getting started

Magentic Marketplace is available as an open-source environment for exploring agentic market dynamics. Code, datasets, and experiment templates are available on GitHub and Azure AI Foundry Labs.

The documentation provides instructions for reproducing the experiments described above and guidance for extending the environment to new marketplace configurations.
