
Modern multimodal AI models can recognize objects, describe scenes, and answer questions about images and short video clips, but they struggle with long-form and large-scale visual data, where real-world reasoning must go beyond object recognition and short-clip analysis.
Such reasoning increasingly involves analyzing long-form video content, where context spans minutes or hours, far exceeding the context limits of most models. It also entails querying massive multimodal libraries of videos, images, and transcripts, where finding and integrating relevant evidence requires more than retrieval; it requires strategic reasoning. Existing models typically perform single-pass inference, producing one-shot answers, which limits their ability to handle tasks that require temporal reasoning, cross-modal grounding, and iterative refinement.
MMCTAgent
To meet these challenges, we developed the Multi-modal Critical Thinking Agent, or MMCTAgent, for structured reasoning over long-form video and image data, available on GitHub and featured on Azure AI Foundry Labs.
Built on AutoGen, Microsoft’s open-source multi-agent system, MMCTAgent provides multimodal question-answering with a Planner–Critic architecture. This design enables planning, reflection, and tool-based reasoning, bridging perception and deliberation in multimodal tasks. It links language, vision, and temporal understanding, transforming static multimodal tasks into dynamic reasoning workflows.
Unlike conventional models that produce one-shot answers, MMCTAgent comprises modality-specific agents, ImageAgent and VideoAgent, equipped with tools such as get_relevant_query_frames() and object_detection_tool(). These agents perform deliberate, iterative reasoning: selecting the right tools for each modality, evaluating intermediate results, and refining conclusions through a Critic loop. This enables MMCTAgent to analyze complex queries across long videos and large image libraries with explainability, extensibility, and scalability.
How MMCTAgent works
MMCTAgent integrates two coordinated agents, Planner and Critic, orchestrated through AutoGen. The Planner agent decomposes a user query, identifies the appropriate reasoning tools, performs multimodal operations, and drafts a preliminary answer. The Critic agent reviews the Planner’s reasoning chain, validates evidence alignment, and refines or revises the response for factual accuracy and consistency.
This iterative reasoning loop enables MMCTAgent to improve its answers through structured self-evaluation—bringing reflection into AI reasoning. A key strength of MMCTAgent lies in its modular extensibility. Developers can easily integrate new, domain-specific tools—such as medical image analyzers, industrial inspection models, or specialized retrieval modules—by adding them to ImageQnATools or VideoQnATools. This design makes MMCTAgent adaptable across domains.
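Because both agents are orchestrated through AutoGen, extending MMCTAgent with a domain-specific tool largely amounts to registering a function with the Planner. The sketch below shows the general pattern using AutoGen 0.2-style agents; the prompts, configuration, and the xray_finding_summary tool are illustrative assumptions, not MMCTAgent’s actual ImageQnATools or VideoQnATools interface.

```python
# Minimal sketch of a Planner–Critic pair with a custom domain tool,
# using AutoGen 0.2-style agents. Prompts, config, and the tool itself
# are illustrative; MMCTAgent's real toolsets may differ.
import autogen

# Model credentials are assumed to come from the usual OAI_CONFIG_LIST / env setup.
llm_config = {"config_list": [{"model": "gpt-4o"}]}

planner = autogen.AssistantAgent(
    name="Planner",
    system_message="Decompose the query, pick tools, gather evidence, and draft an answer.",
    llm_config=llm_config,
)
critic = autogen.AssistantAgent(
    name="Critic",
    system_message="Check the Planner's evidence and reasoning; request revisions or approve.",
    llm_config=llm_config,
)
executor = autogen.UserProxyAgent(
    name="Executor",
    human_input_mode="NEVER",
    code_execution_config=False,
)

def xray_finding_summary(image_path: str) -> str:
    """Hypothetical domain-specific tool: summarize findings in a medical image (stub)."""
    return f"Findings for {image_path}: no acute abnormality detected."

# Register the tool so the Planner can propose it and the Executor can run it.
autogen.register_function(
    xray_finding_summary,
    caller=planner,
    executor=executor,
    description="Summarize findings in a medical image.",
)

# Planner and Critic iterate in a group chat until the Critic is satisfied.
groupchat = autogen.GroupChat(agents=[executor, planner, critic], messages=[], max_round=8)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
executor.initiate_chat(manager, message="What does the X-ray at scans/patient_42.png show?")
```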
VideoAgent: From ingestion to long-form multimodal reasoning

The VideoAgent extends this architecture to long-form video reasoning. It operates in two connected phases: library creation (ingestion) and query-time reasoning.
Phase 1 – Video ingestion and library creation
Before reasoning, long-form videos undergo an ingestion pipeline that aligns multimodal information for retrieval and understanding:
- Transcription and translation: Converts audio to text and, if multilingual, translates transcripts into a consistent language
- Key-frame identification: Extracts representative frames marking major visual or scene changes
- Semantic chunking and chapter generation: Combines transcript segments and visual summaries into coherent, semantically segmented chapters with associated key frames. Inspired by Microsoft’s Deep Video Discovery agentic search tool, this step also extracts detailed descriptions of objects, on-screen text, and characters present within each video segment, integrating these insights directly into the corresponding chapters.
- Multimodal embedding creation: Generates image embeddings for key frames, linking them to their corresponding transcript and chapter data
All structured metadata, including transcripts, visual summaries, chapters, and embeddings, is indexed in the Multimodal Knowledgebase using Azure AI Search, which forms the foundation for scalable semantic retrieval and downstream reasoning.
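To make the indexing step concrete, the sketch below uploads one chapter’s metadata and key-frame embedding with the azure-search-documents Python SDK. The index name and field names are illustrative assumptions rather than the actual Multimodal Knowledgebase schema, and an index with a matching vector field is assumed to already exist.

```python
# Sketch: index one video chapter's metadata and key-frame embedding into Azure AI Search.
# Index name and field names are assumptions; a matching index must already exist.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="video-chapters",  # hypothetical index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

chapter_doc = {
    "chapter_id": "talk01_ch03",
    "video_id": "talk01",
    "start_seconds": 542.0,
    "end_seconds": 731.5,
    "transcript": "Translated transcript text for this chapter.",
    "summary": "Speaker walks through the system architecture diagram.",
    "objects": ["whiteboard", "laptop", "presenter"],
    "key_frame_embedding": [0.012, -0.094, 0.031],  # image-embedding vector, truncated for brevity
}

# Upload the chapter document so it becomes searchable at query time.
search_client.upload_documents(documents=[chapter_doc])
```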
Phase 2 – Video question answering and reasoning
When a user submits a query, the VideoAgent retrieves, analyzes, and reasons across the indexed video content using specialized planner and critic tools.
Planner tools
- get_video_analysis: Finds the most relevant video, provides a summary, and lists detected objects
- get_context: Retrieves contextual information and relevant chapters from the Azure AI Search index
- get_relevant_frames: Selects key frames most relevant to the user query
- query_frame: Performs detailed visual and textual reasoning over selected frames
The get_context and get_relevant_frames tools work in tandem to ensure that reasoning begins from the most semantically relevant evidence.
Critic tool
- critic_tool: Evaluates the reasoning output for temporal alignment, factual accuracy, and coherence between visual and textual modalities
This two-phase design, structured ingestion followed by agentic reasoning, enables MMCTAgent to deliver accurate, interpretable insights for long, information-dense videos.
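The sketch below illustrates how these planner and critic tools might be orchestrated at query time. The function signatures and stub bodies are assumptions made for illustration; the real implementations live in VideoQnATools and call Azure AI Search and a multimodal model.

```python
# Sketch of the VideoAgent query-time flow: retrieve context, select frames,
# reason over them, then let the Critic gate the answer. Signatures and stub
# bodies are assumptions for illustration only.
from typing import Dict, List

def get_context(query: str) -> List[Dict]:
    """Stub: retrieve relevant chapters and context from the Azure AI Search index."""
    return [{"chapter_id": "talk01_ch03", "summary": "Architecture walkthrough."}]

def get_relevant_frames(query: str, chapters: List[Dict]) -> List[str]:
    """Stub: select the key frames most relevant to the query."""
    return ["frames/talk01/ch03_frame_12.jpg"]

def query_frame(query: str, frames: List[str]) -> str:
    """Stub: detailed visual and textual reasoning over the selected frames."""
    return "Draft answer grounded in the selected frames."

def critic_tool(query: str, draft: str, evidence: List[Dict]) -> Dict:
    """Stub: check temporal alignment, factual accuracy, and cross-modal coherence."""
    return {"approved": True, "revision_hint": ""}

def answer_video_query(query: str, max_revisions: int = 2) -> str:
    chapters = get_context(query)                   # Planner: gather contextual evidence
    frames = get_relevant_frames(query, chapters)   # Planner: pick supporting key frames
    draft = query_frame(query, frames)              # Planner: draft a preliminary answer
    for _ in range(max_revisions):                  # Critic loop: revise until approved
        review = critic_tool(query, draft, chapters)
        if review["approved"]:
            break
        draft = query_frame(review["revision_hint"] or query, frames)
    return draft

print(answer_video_query("When does the speaker explain the architecture?"))
```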
ImageAgent: Structured reasoning for static visuals
While the VideoAgent handles temporal reasoning across long-form videos, the ImageAgent applies the same Planner–Critic paradigm to static visual analysis. It performs modular, tool-based reasoning over images, combining perception tools for recognition, detection, and optical character recognition with language-based reasoning for interpretation and explanation.
Planner tools
- vit_tool: Leverages a Vision Transformer (ViT) or vision-language model (VLM) for high-level visual understanding and description
- recog_tool: Performs scene, face, and object recognition
- object_detection_tool: Localizes and labels entities within an image
- ocr_tool: Extracts embedded text from visual elements
Critic tool
- critic_tool: Validates the Planner’s conclusions for factual alignment and consistency, refining the final response
This lightweight ImageAgent provides fine-grained, explainable reasoning over image collections—supporting visual question answering, content inspection, and multimodal retrieval—while maintaining architectural symmetry with the VideoAgent.
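As an illustration of how such perception tools can be implemented, the sketch below provides stand-ins for ocr_tool and object_detection_tool, assuming pytesseract for OCR and a pretrained Ultralytics YOLO checkpoint for detection; MMCTAgent’s actual tools may wrap different backends.

```python
# Illustrative stand-ins for two ImageAgent planner tools. The backends chosen here
# (pytesseract, Ultralytics YOLO) are assumptions, not MMCTAgent's actual dependencies.
from PIL import Image
import pytesseract  # requires the Tesseract binary to be installed on the system
from ultralytics import YOLO

_detector = YOLO("yolov8n.pt")  # small pretrained detector, downloaded on first use

def ocr_tool(image_path: str) -> str:
    """Extract embedded text from the image."""
    return pytesseract.image_to_string(Image.open(image_path)).strip()

def object_detection_tool(image_path: str) -> list[str]:
    """Localize and label entities; return the detected class names."""
    result = _detector(image_path)[0]
    return [result.names[int(cls)] for cls in result.boxes.cls]
```

In this pattern, the Planner would call these tools alongside vit_tool and recog_tool and pass their outputs to critic_tool for validation.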
Evaluation results
To assess the effectiveness of MMCTAgent, we evaluated both the ImageAgent and VideoAgent with multiple base LLMs across a range of benchmark datasets and real-world scenarios. Key results are presented below.
| Image Datasets | GPT-4V | MMCT with GPT-4V | GPT-4o | MMCT with GPT-4o | GPT-5 | MMCT with GPT-5 |
|---|---|---|---|---|---|---|
| MM-Vet [1] | 60.20 | 74.24 | 77.98 | 79.36 | 80.51 | 81.65 |
| MMMU [2] | 56.80 | 63.57 | 69.10 | 73.00 | 84.20 | 85.44 |
| Video Datasets | GPT-4o | MMCT with GPT-4o |
|---|---|---|
| VideoMME [3] | 72.10 | 76.70 |
MMCTAgent enhances base models by augmenting them with appropriate tools, such as object detection and optical character recognition (OCR) for weaker models or domain-specific tools for stronger models, leading to substantial improvements. For example, integrating these tools raised GPT-4V’s accuracy from 60.20% to 74.24% on the MM-Vet dataset. The configurable Critic agent provides an additional layer of validation, which is especially valuable in critical domains. Additional evaluation results are available here.
Takeaways and next steps
MMCTAgent demonstrates a scalable agentic approach to multimodal reasoning with a Planner–Critic architecture. Its unified multimodal design supports both image and video pipelines, while the extensible toolchain enables rapid integration of domain-specific tools and capabilities. It provides Azure-native deployment and supports configurability within the broader open-source ecosystem.
Looking ahead, we aim to improve the efficiency and adaptability of retrieval and reasoning workflows and to extend MMCTAgent beyond its current agricultural evaluations, exploring new real-world domains through initiatives like Project Gecko to build accessible, innovative multimodal applications for people around the globe.
Acknowledgements
We would like to thank our team members for their valuable contributions to this work: Aman Patkar, Ogbemi Ekwejunor-Etchie, Somnath Kumar, Soumya De, and Yash Gadhia.
References
[1] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. “MM-Vet: Evaluating large multimodal models for integrated capabilities”, 2023.
[2] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen. “MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI”, 2023.
[3] C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhang, et al. “Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis”, 2024.



