
Modern multimodal AI models can recognize objects, describe scenes, and answer questions about images and short video clips, but they struggle with long-form and large-scale visual data, where real-world reasoning must go beyond object recognition and short-clip analysis.
Such reasoning increasingly involves analyzing long-form video content, where context spans minutes or hours, far exceeding the context limits of most models. It also entails querying massive multimodal libraries of videos, images, and transcripts, where finding and integrating relevant evidence requires more than retrieval; it requires strategic reasoning. Existing models typically perform single-pass inference, producing one-shot answers, which limits their ability to handle tasks that require temporal reasoning, cross-modal grounding, and iterative refinement.
MMCTAgent
To meet these challenges, we developed the Multi-modal Critical Thinking Agent, or MMCTAgent, for structured reasoning over long-form video and image data, available on GitHub and featured on Azure AI Foundry Labs.
Built on AutoGen, Microsoft’s open-source multi-agent system, MMCTAgent provides multimodal question-answering with a Planner–Critic architecture. This design enables planning, reflection, and tool-based reasoning, bridging perception and deliberation in multimodal tasks. It links language, vision, and temporal understanding, transforming static multimodal tasks into dynamic reasoning workflows.
Unlike conventional models that produce one-shot answers, MMCTAgent comprises modality-specific agents, ImageAgent and VideoAgent, equipped with tools such as get_relevant_query_frames() and object_detection_tool(). These agents perform deliberate, iterative reasoning: selecting the right tools for each modality, evaluating intermediate results, and refining conclusions through a Critic loop. This enables MMCTAgent to analyze complex queries across long videos and large image libraries with explainability, extensibility, and scalability.
How MMCTAgent works
MMCTAgent integrates two coordinated agents, Planner and Critic, orchestrated through AutoGen. The Planner agent decomposes a user query, identifies the appropriate reasoning tools, performs multimodal operations, and drafts a preliminary answer. The Critic agent reviews the Planner’s reasoning chain, validates evidence alignment, and refines or revises the response for factual accuracy and consistency.
This iterative reasoning loop enables MMCTAgent to improve its answers through structured self-evaluation—bringing reflection into AI reasoning. A key strength of MMCTAgent lies in its modular extensibility. Developers can easily integrate new, domain-specific tools—such as medical image analyzers, industrial inspection models, or specialized retrieval modules—by adding them to ImageQnATools or VideoQnATools. This design makes MMCTAgent adaptable across domains.
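Because both agents are orchestrated through AutoGen, extending MMCTAgent with a domain-specific tool largely amounts to registering a function with the Planner. The sketch below shows the general pattern using AutoGen 0.2-style agents; the prompts, configuration, and the xray_finding_summary tool are illustrative assumptions, not MMCTAgent’s actual ImageQnATools or VideoQnATools interface.

```python
# Minimal sketch of a Planner–Critic pair with a custom domain tool,
# using AutoGen 0.2-style agents. Prompts, config, and the tool itself
# are illustrative; MMCTAgent's real toolsets may differ.
import autogen

# Model credentials are assumed to come from the usual OAI_CONFIG_LIST / env setup.
llm_config = {"config_list": [{"model": "gpt-4o"}]}

planner = autogen.AssistantAgent(
    name="Planner",
    system_message="Decompose the query, pick tools, gather evidence, and draft an answer.",
    llm_config=llm_config,
)
critic = autogen.AssistantAgent(
    name="Critic",
    system_message="Check the Planner's evidence and reasoning; request revisions or approve.",
    llm_config=llm_config,
)
executor = autogen.UserProxyAgent(
    name="Executor",
    human_input_mode="NEVER",
    code_execution_config=False,
)

def xray_finding_summary(image_path: str) -> str:
    """Hypothetical domain-specific tool: summarize findings in a medical image (stub)."""
    return f"Findings for {image_path}: no acute abnormality detected."

# Register the tool so the Planner can propose it and the Executor can run it.
autogen.register_function(
    xray_finding_summary,
    caller=planner,
    executor=executor,
    description="Summarize findings in a medical image.",
)

# Planner and Critic iterate in a group chat until the Critic is satisfied.
groupchat = autogen.GroupChat(agents=[executor, planner, critic], messages=[], max_round=8)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
executor.initiate_chat(manager, message="What does the X-ray at scans/patient_42.png show?")
```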
VideoAgent: From ingestion to long-form multimodal reasoning

The VideoAgent extends this architecture to long-form video reasoning. It operates in two connected phases: library creation (ingestion) and query-time reasoning.
Phase 1 – Video ingestion and library creation
Before reasoning, long-form videos undergo an ingestion pipeline that aligns multimodal information for retrieval and understanding:
- Transcription and translation: Converts audio to text and, if multilingual, translates transcripts into a consistent language
- Key-frame identification: Extracts representative frames marking major visual or scene changes
- Semantic chunking and chapter generation: Combines transcript segments and visual summaries into coherent, semantically segmented chapters with associated key frames. Inspired by Microsoft’s Deep Video Discovery agentic search tool, this step also extracts detailed descriptions of objects, on-screen text, and characters present within each video segment, integrating these insights directly into the corresponding chapters.
- Multimodal embedding creation: Generates image embeddings for key frames, linking them to their corresponding transcript and chapter data
All structured metadata, including transcripts, visual summaries, chapters, and embeddings, is indexed in the Multimodal Knowledgebase using Azure AI Search, which forms the foundation for scalable semantic retrieval and downstream reasoning.
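To make the indexing step concrete, the sketch below uploads one chapter’s metadata and key-frame embedding with the azure-search-documents Python SDK. The index name and field names are illustrative assumptions rather than the actual Multimodal Knowledgebase schema, and an index with a matching vector field is assumed to already exist.

```python
# Sketch: index one video chapter's metadata and key-frame embedding into Azure AI Search.
# Index name and field names are assumptions; a matching index must already exist.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="video-chapters",  # hypothetical index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

chapter_doc = {
    "chapter_id": "talk01_ch03",
    "video_id": "talk01",
    "start_seconds": 542.0,
    "end_seconds": 731.5,
    "transcript": "Translated transcript text for this chapter.",
    "summary": "Speaker walks through the system architecture diagram.",
    "objects": ["whiteboard", "laptop", "presenter"],
    "key_frame_embedding": [0.012, -0.094, 0.031],  # image-embedding vector, truncated for brevity
}

# Upload the chapter document so it becomes searchable at query time.
search_client.upload_documents(documents=[chapter_doc])
```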
Phase 2 – Video question answering and reasoning
When a user submits a query, the VideoAgent retrieves, analyzes, and reasons across the indexed video content using specialized planner and critic tools.
Planner tools
- get_video_analysis: Finds the most relevant video, provides a summary, and lists detected objects
- get_context: Retrieves contextual information and relevant chapters from the Azure AI Search index
- get_relevant_frames: Selects key frames most relevant to the user query
- query_frame: Performs detailed visual and textual reasoning over selected frames
The get_context and get_relevant_frames tools work in tandem to ensure that reasoning begins from the most semantically relevant evidence.
Critic tool
- critic_tool: Evaluates the reasoning output for temporal alignment, factual accuracy, and coherence between visual and textual modalities
This two-phase design, structured ingestion followed by agentic reasoning, enables MMCTAgent to deliver accurate, interpretable insights for long, information-dense videos.
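The sketch below illustrates how these planner and critic tools might be orchestrated at query time. The function signatures and stub bodies are assumptions made for illustration; the real implementations live in VideoQnATools and call Azure AI Search and a multimodal model.

```python
# Sketch of the VideoAgent query-time flow: retrieve context, select frames,
# reason over them, then let the Critic gate the answer. Signatures and stub
# bodies are assumptions for illustration only.
from typing import Dict, List

def get_context(query: str) -> List[Dict]:
    """Stub: retrieve relevant chapters and context from the Azure AI Search index."""
    return [{"chapter_id": "talk01_ch03", "summary": "Architecture walkthrough."}]

def get_relevant_frames(query: str, chapters: List[Dict]) -> List[str]:
    """Stub: select the key frames most relevant to the query."""
    return ["frames/talk01/ch03_frame_12.jpg"]

def query_frame(query: str, frames: List[str]) -> str:
    """Stub: detailed visual and textual reasoning over the selected frames."""
    return "Draft answer grounded in the selected frames."

def critic_tool(query: str, draft: str, evidence: List[Dict]) -> Dict:
    """Stub: check temporal alignment, factual accuracy, and cross-modal coherence."""
    return {"approved": True, "revision_hint": ""}

def answer_video_query(query: str, max_revisions: int = 2) -> str:
    chapters = get_context(query)                   # Planner: gather contextual evidence
    frames = get_relevant_frames(query, chapters)   # Planner: pick supporting key frames
    draft = query_frame(query, frames)              # Planner: draft a preliminary answer
    for _ in range(max_revisions):                  # Critic loop: revise until approved
        review = critic_tool(query, draft, chapters)
        if review["approved"]:
            break
        draft = query_frame(review["revision_hint"] or query, frames)
    return draft

print(answer_video_query("When does the speaker explain the architecture?"))
```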
ImageAgent: Structured reasoning for static visuals
While the VideoAgent handles temporal reasoning across long-form videos, the ImageAgent applies the same Planner–Critic paradigm to static visual analysis. It performs modular, tool-based reasoning over images, combining perception tools for recognition, detection, and optical character recognition with language-based reasoning for interpretation and explanation.
Planner tools
- vit_tool: Leverages a Vision Transformer (ViT) or vision-language model (VLM) for high-level visual understanding and description
- recog_tool: Performs scene, face, and object recognition
- object_detection_tool: Localizes and labels entities within an image
- ocr_tool: Extracts embedded text from visual elements
Critic tool
- critic_tool: Validates the Planner’s conclusions for factual alignment and consistency, refining the final response
This lightweight ImageAgent provides fine-grained, explainable reasoning over image collections—supporting visual question answering, content inspection, and multimodal retrieval—while maintaining architectural symmetry with the VideoAgent.
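As an illustration of how such perception tools can be implemented, the sketch below provides stand-ins for ocr_tool and object_detection_tool, assuming pytesseract for OCR and a pretrained Ultralytics YOLO checkpoint for detection; MMCTAgent’s actual tools may wrap different backends.

```python
# Illustrative stand-ins for two ImageAgent planner tools. The backends chosen here
# (pytesseract, Ultralytics YOLO) are assumptions, not MMCTAgent's actual dependencies.
from PIL import Image
import pytesseract  # requires the Tesseract binary to be installed on the system
from ultralytics import YOLO

_detector = YOLO("yolov8n.pt")  # small pretrained detector, downloaded on first use

def ocr_tool(image_path: str) -> str:
    """Extract embedded text from the image."""
    return pytesseract.image_to_string(Image.open(image_path)).strip()

def object_detection_tool(image_path: str) -> list[str]:
    """Localize and label entities; return the detected class names."""
    result = _detector(image_path)[0]
    return [result.names[int(cls)] for cls in result.boxes.cls]
```

In this pattern, the Planner would call these tools alongside vit_tool and recog_tool and pass their outputs to critic_tool for validation.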
Evaluation results
To assess the effectiveness of MMCTAgent, we evaluated both the ImageAgent and VideoAgent with multiple base LLMs across a range of benchmark datasets and real-world scenarios. Key results are presented below.
| Image Datasets | GPT-4V | MMCT with GPT-4V | GPT-4o | MMCT with GPT-4o | GPT-5 | MMCT with GPT-5 |
|---|---|---|---|---|---|---|
| MM-Vet [1] | 60.20 | 74.24 | 77.98 | 79.36 | 80.51 | 81.65 |
| MMMU [2] | 56.80 | 63.57 | 69.10 | 73.00 | 84.20 | 85.44 |
| Video Datasets | GPT-4o | MMCT with GPT-4o |
|---|---|---|
| VideoMME [3] | 72.10 | 76.70 |
MMCTAgent enhances base models by augmenting them with appropriate tools, such as object detection and optical character recognition (OCR) for weaker models or domain-specific tools for stronger models, leading to substantial improvements. For example, integrating these tools raised GPT-4V’s accuracy from 60.20% to 74.24% on the MM-Vet dataset. The configurable Critic agent provides an additional layer of validation, which is especially valuable in critical domains. Additional evaluation results are available here.
Takeaways and next steps
MMCTAgent demonstrates a scalable agentic approach to multimodal reasoning with a Planner–Critic architecture. Its unified multimodal design supports both image and video pipelines, while the extensible toolchain enables rapid integration of domain-specific tools and capabilities. It provides Azure-native deployment and supports configurability within the broader open-source ecosystem.
Looking ahead, we aim to improve the efficiency and adaptability of retrieval and reasoning workflows and to extend MMCTAgent beyond its current agricultural evaluations, exploring new real-world domains through initiatives like Project Gecko to build accessible, innovative multimodal applications for people around the globe.
Acknowledgements
We would like to thank our team members for their valuable contributions to this work: Aman Patkar, Ogbemi Ekwejunor-Etchie, Somnath Kumar, Soumya De, and Yash Gadhia.
References
[1] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. “MM-Vet: Evaluating large multimodal models for integrated capabilities”, 2023.
[2] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen. “MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI”, 2023.
[3] C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhang, et al. “Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis”, 2024.



