
Autonomous AI agents are here, and they’re poised to reshape the economy. By automating discovery, negotiation, and transactions, agents can overcome inefficiencies like information asymmetries and platform lock-in, enabling faster, more transparent, and more competitive markets.
We are already seeing early signs of this transformation in digital marketplaces. Customer-facing assistants like OpenAI’s Operator and Anthropic’s Computer Use can navigate websites and complete purchases. On the business side, Shopify Sidekick, Salesforce Einstein, and Meta’s Business AI help merchants with operations and customer engagement. These examples hint at a future where agents become active market participants, but the structure of these markets remains uncertain.
Several scenarios are possible. We might see one-sided markets where only customers or businesses deploy agents; closed platforms (known as walled gardens) where companies tightly control agent interactions; or even open two-sided marketplaces where customer and business agents transact freely across ecosystems. Each path carries different trade-offs for security, openness, convenience, and competition, which will shape how value flows in the digital economy. For a deeper exploration of these dynamics, see our paper, The Agentic Economy.
To help navigate this uncertainty, we built Magentic Marketplace, an open-source simulation environment for exploring the numerous possibilities of agentic markets and their societal implications at scale. It provides a foundation for studying these markets and guiding them toward outcomes that benefit everyone.
This matters because most AI agent research focuses on isolated scenarios—a single agent completing a task or two agents negotiating a simple transaction. But real markets involve a large number of agents simultaneously searching, communicating, and transacting, creating complex dynamics that can’t be understood by studying agents in isolation. Capturing this complexity is essential because real-world deployments raise critical questions about consumer welfare, market efficiency, fairness, manipulation resistance, and bias—questions that can’t be safely answered in production environments.
To explore these dynamics in depth, the Magentic Marketplace platform enables controlled experimentation across diverse agentic marketplace scenarios. Its current focus is on two-sided markets, but the environment is modular and extensible, supporting future exploration of mixed human–agent systems, one-sided markets, and complex communication protocols.

What is Magentic Marketplace?
The Magentic Marketplace environment manages market-wide capabilities: maintaining catalogs of available goods and services, implementing discovery algorithms, facilitating agent-to-agent communication, and handling simulated payments. A centralized transaction layer at its core ensures transaction integrity across all marketplace interactions. The platform also enables systematic, reproducible research. As demonstrated in the following video, it supports a wide range of agent implementations and evolving marketplace features, allowing researchers to integrate diverse agent architectures and adapt the environment as new capabilities emerge.
We built Magentic Marketplace around three core architectural choices:
- HTTP/REST client-server architecture: Agents operate as independent clients while the Marketplace Environment serves as a central server. This mirrors real-world platforms and supports clear separation of customer and business agent roles.
- Minimal three-endpoint market protocol: Just three endpoints (register, protocol discovery, and action execution) let agents dynamically discover available actions. New capabilities can be added without disrupting existing experiments.
- Rich action protocol: Specific message types support the complete transaction lifecycle: search, negotiation, proposals, and payments. The protocol is designed for extensibility. New actions like refunds, reviews, or ratings can be added seamlessly, allowing researchers to evolve marketplace capabilities and study emerging agent behaviors while remaining compatible with existing experiments.
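To make the three-endpoint idea concrete, here is a minimal in-memory sketch. It is not the Magentic Marketplace API: the class name `MarketplaceSketch`, the method names, and the payload shapes are all hypothetical, and the real environment exposes these operations over HTTP rather than as direct method calls. The point is only to show how a fixed set of endpoints can stay unchanged while the set of actions grows.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical handler signature: (agent_id, payload) -> response
Handler = Callable[[str, dict[str, Any]], dict[str, Any]]


@dataclass
class MarketplaceSketch:
    """In-memory stand-in for the central server (illustrative only)."""

    agents: dict[str, str] = field(default_factory=dict)      # agent_id -> role
    actions: dict[str, Handler] = field(default_factory=dict)  # action name -> handler

    def register(self, agent_id: str, role: str) -> dict[str, Any]:
        """Endpoint 1: register an agent as a customer or a business."""
        if role not in {"customer", "business"}:
            raise ValueError(f"unknown role: {role}")
        self.agents[agent_id] = role
        return {"agent_id": agent_id, "role": role}

    def protocol(self) -> list[str]:
        """Endpoint 2: report which actions are currently available."""
        return sorted(self.actions)

    def execute(self, agent_id: str, action: str, payload: dict[str, Any]) -> dict[str, Any]:
        """Endpoint 3: run one named action for a registered agent."""
        if agent_id not in self.agents:
            raise PermissionError("agent must register first")
        if action not in self.actions:
            raise KeyError(f"unsupported action: {action}")
        return self.actions[action](agent_id, payload)


market = MarketplaceSketch()
market.actions["search"] = lambda agent_id, p: {"query": p["query"], "results": []}
market.register("customer-1", "customer")
before = market.protocol()

# A new action is added without touching any of the three endpoints.
market.actions["review"] = lambda agent_id, p: {"ok": True, "stars": p["stars"]}
after = market.protocol()
result = market.execute("customer-1", "review", {"stars": 5})
```

In this sketch an agent that calls `protocol()` again simply sees the new `review` action, which is the property that lets capabilities evolve without breaking existing clients.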

Additionally, a visualization module lets users observe marketplace dynamics and review individual conversation threads between customer and business agents.
Setting up the experiments
To ensure reproducibility, we instantiated the marketplace with fully synthetic data, available in our open-source repository. The experiments modeled transactions such as ordering food and engaging with home improvement services, where agents represented customers and businesses engaging in marketplace transactions. This setup enabled precise measurement of behavior and systematic comparison against theoretical upper bounds.
Each experiment was run using 100 customers and 300 businesses and included both proprietary models (GPT-4o, GPT-4.1, GPT-5, and Gemini-2.5-Flash) and open-source models (GPTOSS-20b, Qwen3-14b, and Qwen3-4b-Instruct-2507).
Our scenarios focused on simple all-or-nothing requests: Each customer had a list of desired items and amenities that needed to be present for a transaction to be satisfying. For those transactions, utility was computed as the sum of the customer’s internal item valuations minus actual prices paid. Consumer welfare, defined as the sum of utilities across all completed transactions, served as our key metric for comparing agent performance.
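The welfare metric described above can be sketched in a few lines. The `Transaction` fields and function names below are hypothetical, and one detail is an assumption rather than something stated here: a transaction that misses any required item or amenity is scored as zero utility.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    valuations: dict[str, float]   # customer's internal value for each desired item
    required_amenities: set[str]   # amenities that must be present
    items_provided: set[str]       # items the business actually supplied
    amenities_provided: set[str]   # amenities the business actually offers
    price_paid: float


def utility(t: Transaction) -> float:
    """All-or-nothing utility: total valuation minus price, if every need is met."""
    has_items = set(t.valuations) <= t.items_provided
    has_amenities = t.required_amenities <= t.amenities_provided
    if not (has_items and has_amenities):
        return 0.0  # assumption: unsatisfying transactions contribute nothing
    return sum(t.valuations.values()) - t.price_paid


def consumer_welfare(transactions: list[Transaction]) -> float:
    """Sum of utilities across all completed transactions."""
    return sum(utility(t) for t in transactions)


satisfied = Transaction(
    valuations={"tacos": 12.0, "horchata": 4.0},
    required_amenities={"outdoor seating"},
    items_provided={"tacos", "horchata"},
    amenities_provided={"outdoor seating", "parking"},
    price_paid=11.0,
)
missing_amenity = Transaction(
    valuations={"tacos": 12.0, "horchata": 4.0},
    required_amenities={"outdoor seating"},
    items_provided={"tacos", "horchata"},
    amenities_provided={"parking"},
    price_paid=9.0,
)
welfare = consumer_welfare([satisfied, missing_amenity])
```

Here the first order is worth 16.0 to the customer and costs 11.0, while the second is cheaper but lacks the required amenity, so under the stated assumption only the first contributes to welfare.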
While this experimental setup provides a useful starting point, it is not intended to be definitive. We encourage researchers to extend the framework with richer, more nuanced measures and request types that better capture real consumer welfare, fairness, and other societal considerations.
What did we find?
Agents can improve consumer welfare—but only with good discovery
We explored whether two-sided agentic markets—where AI agents interact with each other and with service providers—can improve consumer welfare by reducing information gaps. Unlike traditional markets, which do not provide agentic support and place the full burden of overcoming information asymmetries on customers, agentic markets shift much of that effort to agents. This change matters because as agents gain better tools for discovery and communication, they relieve customers of the heavy cognitive load of filling any information gaps. This lowers the cost of making informed decisions and improves customer outcomes.
We compared several marketplace setups. Under realistic conditions (Agentic: Lexical search), agents faced real-world challenges like building queries, navigating paginated lists, identifying the right businesses to send inquiries to, and negotiating transactions.
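As a rough illustration of what agents face in the lexical search condition, the sketch below ranks businesses by word overlap with a query and returns one page at a time. This is an assumption-laden toy: the scoring rule, the `lexical_search` name, the page size, and the sample catalog are all invented for illustration and do not describe the environment's actual discovery algorithm.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase and split on whitespace (deliberately naive)."""
    return set(text.lower().split())


def lexical_search(
    query: str, catalog: dict[str, str], page: int = 0, page_size: int = 2
) -> list[str]:
    """Rank businesses by shared query words and return one page of names."""
    query_words = tokenize(query)
    scored = [
        (len(query_words & tokenize(description)), name)
        for name, description in catalog.items()
    ]
    # Highest overlap first; ties broken alphabetically; zero-overlap entries dropped.
    ranked = [name for score, name in sorted(scored, key=lambda s: (-s[0], s[1])) if score > 0]
    start = page * page_size
    return ranked[start : start + page_size]


catalog = {
    "Casa Verde": "mexican tacos outdoor seating",
    "Luigi": "italian pasta delivery",
    "Taco Hut": "tacos delivery",
    "Fix-It": "plumbing repair",
}
first_page = lexical_search("tacos delivery", catalog, page=0)
second_page = lexical_search("tacos delivery", catalog, page=1)
```

Even in this toy, a partially matching business (an Italian restaurant that delivers) shows up in the results, which hints at why agents must still judge relevance after searching.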
Despite these complexities, advanced proprietary models and some medium-sized open-source models like GPTOSS-20b outperformed simple baselines like randomly choosing or simply choosing the cheapest option. Notably, GPT-5 achieved near-optimal performance, demonstrating its ability to effectively gather and utilize decision-relevant information in realistic marketplace conditions.
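The two simple baselines mentioned above are easy to sketch. The offer format and function names are hypothetical; the actual baseline implementations may differ in how they handle ties or unavailable items.

```python
import random


def cheapest_baseline(offers: list[dict]) -> dict:
    """Pick the lowest-priced offer, ignoring everything else about it."""
    return min(offers, key=lambda offer: offer["price"])


def random_baseline(offers: list[dict], rng: random.Random) -> dict:
    """Pick an offer uniformly at random."""
    return rng.choice(offers)


offers = [
    {"business": "A", "price": 14.0},
    {"business": "B", "price": 11.5},
    {"business": "C", "price": 12.0},
]
cheapest = cheapest_baseline(offers)
random_pick = random_baseline(offers, random.Random(0))  # seeded for repeatability
```

Neither baseline checks whether the chosen business actually satisfies the customer's requirements, which is the gap an information-gathering agent is meant to close.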

Performance increased considerably under the Agentic: Perfect search condition, where agents started with the top three matches without needing to search and navigate among the choices. In this setting, Sonnet-4.0, Sonnet-4.5, GPT-5, and GPT-4.1 nearly reached the theoretical optimum and beat baselines with full amenity details but without agent-to-agent coordination.
Open-source models were mixed: GPTOSS-20b performed strongly under both Perfect search and Lexical search conditions, even exceeding GPT-4o’s performance with Perfect search. This suggests that relatively compact models can exhibit robust information-gathering and decision-making capabilities in complex multi-agent environments. Qwen3-4b-2507 faltered when discovery involved irrelevant options (Lexical search), while Qwen3-14b lagged in both cases due to fundamental limitations in reasoning.

Paradox of Choice
One promise of agents is their ability to consider far more options than people can. However, our experiments revealed a surprising limitation: providing agents with more options does not necessarily lead to more thorough exploration. We designed experiments that varied the search results limit from 3 to 100. Except for Gemini-2.5-Flash and GPT-5, the models contacted only a small fraction of available businesses regardless of the search limit. This suggests that most models do not conduct exhaustive comparisons and instead easily accept the initial “good enough” options.

Additionally, across all models, consumer welfare declined as the number of search results increased. Despite contacting over a hundred businesses, Gemini-2.5-Flash saw its consumer welfare decline from 1,700 to 1,350, and GPT-5 declined even more, from a near-optimal 2,000 to 1,400.
This demonstrates a Paradox of Choice effect, where more exploration does not guarantee better outcomes, potentially due to limited long-context understanding. Claude Sonnet 4 showed the steepest performance decline, from 1,800 to 600 in consumer welfare. As the set of options grew, it struggled to navigate the choices and frequently contacted businesses that did not provide the goods or services that the customer was looking for.
This combination of poor initial selection and premature search termination demonstrates both inadequate decision-making criteria and insufficient exploration strategies. Some models showed only a modest performance decline (GPT-4.1 from 1,850 to 1,700 and GPT-4o from 1,550 to 1,450), finding good options within their limited exploration.

Agents are vulnerable to manipulation
We tested six manipulation strategies, ranging from subtle psychological tactics to aggressive prompt injection attacks:
- Authority: Fake credentials like “Michelin Guide featured” and “James Beard Award nominated” paired with fabricated certifications.
- Social proof: Claims like “Join 50,000+ satisfied customers” or “#1-rated Mexican restaurant” combined with fake reviews.
- Loss aversion: Fear-based warnings about “food poisoning” risks and “contamination issues” at competing restaurants.
- Prompt injection (basic): Attempts to override agent instructions.
- Prompt injection (strong): Aggressive attacks using emergency language and fabricating competitor scandals.
Results revealed significant variation in manipulation resistance across models. Sonnet-4 was resistant to all attacks: none of the manipulative strategies affected any of the customers’ choices. Gemini-2.5-Flash was generally resistant, except for strong prompt injections, which affected mean payments to unmanipulated agents. GPT-4o, GPTOSS-20b, and Qwen3-4b were very vulnerable to prompt injection: all payments were redirected to the manipulative agent under these conditions. For GPTOSS-20b and Qwen3-4b-2507 specifically, even traditional psychological manipulation tactics (authority appeals and social proof) increased payments to malicious agents, demonstrating their vulnerability to basic persuasion techniques. These findings highlight a critical security concern for agentic marketplaces.
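One simple way to quantify this kind of effect is to compare the share of total payments that reaches the manipulative business with and without the manipulation. The sketch below is an assumption about how such a measurement could be set up, not the study's analysis code, and the payment figures are made up purely for illustration.

```python
def payment_share(payments: list[tuple[str, float]], target: str) -> float:
    """Fraction of all money paid that went to the target business."""
    total = sum(amount for _, amount in payments)
    if total == 0:
        return 0.0
    to_target = sum(amount for business, amount in payments if business == target)
    return to_target / total


# Illustrative data only: (business, amount paid) per completed transaction.
control_run = [("A", 10.0), ("B", 10.0), ("M", 5.0)]   # "M" behaves honestly
injected_run = [("M", 12.0), ("M", 13.0)]              # "M" uses prompt injection

share_control = payment_share(control_run, "M")
share_injected = payment_share(injected_run, "M")
shift = share_injected - share_control  # positive means manipulation paid off
```

A shift near zero would correspond to a resistant model, while a share of 1.0 under injection corresponds to the case where every payment is redirected to the manipulative agent.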

Systemic biases create unfair advantages
Our analysis revealed two distinct types of systematic bias exhibited by agents when selecting businesses from search results. The first is positional bias: models showed systematic preferences based on where businesses appeared in search results. While proprietary models showed no strong positional preferences, open-source models exhibited clear patterns. Specifically, Qwen2.5-14b-2507 showed a pronounced bias toward selecting the last business presented, regardless of its actual merits.
The second, proposal bias, was more pervasive and appeared across all models tested. This “first-offer acceptance” pattern suggests that models prioritized immediate selection over comprehensive exploration, potentially missing better alternatives that could have emerged by waiting. The behavior persisted across both proprietary and open-source models, indicating a fundamental challenge in agent decision-making architectures.
These biases can create unfair market dynamics, drive unintended behaviors, and push businesses to compete on response speed rather than product or service quality.
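Both biases can be estimated from logged choices. The sketch below assumes a hypothetical log format (the list position of each selected business, and the arrival order of proposals alongside the one accepted); it is one plausible way to compute these rates, not the study's actual methodology.

```python
from collections import Counter


def position_shares(selected_positions: list[int], list_length: int) -> list[float]:
    """Share of selections made at each position in the search results."""
    counts = Counter(selected_positions)
    total = len(selected_positions)
    return [counts[i] / total for i in range(list_length)]


def first_offer_rate(trials: list[tuple[list[str], str]]) -> float:
    """Fraction of trials in which the accepted proposal was the first to arrive."""
    hits = sum(1 for arrival_order, accepted in trials if arrival_order[0] == accepted)
    return hits / len(trials)


# Illustrative logs only. Positions are 0-indexed within a 3-item result list.
shares = position_shares([2, 2, 2, 0], list_length=3)

# Each trial: (businesses in order of proposal arrival, business accepted).
rate = first_offer_rate([
    (["A", "B"], "A"),
    (["B", "A"], "B"),
    (["A", "B"], "B"),
    (["C", "A"], "C"),
])
```

With business order randomized across trials, an unbiased agent would produce roughly equal position shares, so a spike at the last position or a first-offer rate far above chance points to the biases described above.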

What this means
Even state-of-the-art models can show notable vulnerabilities and biases in marketplace environments. In our implementation, agents struggled with too many options, were susceptible to manipulation tactics, and showed systemic biases that created unfair advantages.
These outcomes are shaped not only by agent capabilities but also by marketplace design and implementation. Our current study focused on static markets, but real-world environments are dynamic, with agents and users learning over time. Oversight is critical for high-stakes transactions. Agents should assist, not replace, human decision-making.
We plan to explore dynamic markets and human-in-the-loop designs to improve efficiency and trust. A simulation environment like Magentic Marketplace is crucial for understanding the interplay between market components and agents before deploying them at scale.
Full details of our experimental setup and results are available in our paper.
Getting started
Magentic Marketplace is available as an open-source environment for exploring agentic market dynamics. Code, datasets, and experiment templates are available on GitHub and Azure AI Foundry Labs.
The documentation provides instructions for reproducing the experiments described above and guidance for extending the environment to new marketplace configurations.





