QG-CoC (Question-Guided Chain-of-Captions) is a zero-shot prompting method for enhancing multi-image understanding in Multimodal Large Language Models (MLLMs). It decomposes the input question to guide image captioning toward the relevant visual details, then synthesizes the resulting sub-answers into a coherent final prediction.
Example of the QG-CoC prompting strategy (Figures 1–3)
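The pipeline can be summarized in a short sketch. Below is a minimal, illustrative Python implementation, assuming a hypothetical `query_mllm` wrapper around whatever MLLM backend you use; the prompt wording is ours, not the exact prompts from the paper.

```python
# Minimal sketch of the QG-CoC pipeline. `query_mllm` is a hypothetical
# wrapper around an MLLM backend (e.g., an OpenAI-style chat API or a
# local LLaVA-OneVision checkpoint); the prompt wording below is
# illustrative, not the paper's exact prompts.

def query_mllm(prompt: str, images: list) -> str:
    """Hypothetical MLLM call; replace with your own backend."""
    raise NotImplementedError

def qg_coc(question: str, images: list) -> str:
    # Step 1: decompose the question into simpler sub-questions.
    sub_questions = query_mllm(
        "Break this question into simpler sub-questions, one per line:\n"
        + question,
        images,
    ).splitlines()

    # Step 2: question-guided captioning -- caption each image with the
    # sub-questions in focus, so relevant details are not omitted.
    captions = [
        query_mllm(
            "Describe this image, focusing on details needed to answer:\n"
            + "\n".join(sub_questions),
            [image],
        )
        for image in images
    ]

    # Step 3: answer the sub-questions from the captions, then synthesize
    # the sub-answers into a coherent final prediction.
    caption_block = "\n".join(
        f"Image {i + 1}: {caption}" for i, caption in enumerate(captions)
    )
    return query_mllm(
        f"Captions:\n{caption_block}\n\n"
        "Answer each sub-question, then combine the answers to answer:\n"
        + question,
        images,
    )
```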
Below are examples comparing QG-CoC with existing prompting strategies.
We develop several prompting methods as our baselines:
Comparison of different prompting methods
Comparison of different captioning strategies
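As a rough illustration of what such baselines look like (the template wording here is ours, not the paper's exact prompts), each variant simply changes the prompt passed to the same backend. The sketch below reuses the hypothetical `query_mllm` stub from the QG-CoC sketch above.

```python
# Illustrative baseline prompt templates (wording is ours, not the
# paper's exact prompts). Reuses the hypothetical `query_mllm` stub
# from the QG-CoC sketch above.
BASELINE_TEMPLATES = {
    # Direct zero-shot answering.
    "zero-shot": "{question}",
    # Zero-shot chain-of-thought.
    "cot": "{question}\nLet's think step by step.",
    # Question-agnostic chain-of-captions: caption every image first,
    # then answer from the captions.
    "caption-then-answer": (
        "First describe each image in detail, then use your "
        "descriptions to answer:\n{question}"
    ),
}

def run_baseline(name: str, question: str, images: list) -> str:
    prompt = BASELINE_TEMPLATES[name].format(question=question)
    return query_mllm(prompt, images)
```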
QG-CoC achieves strong performance across multi-image (MMIU, MUIR) and single-image (ScienceQA, MMMU, MMBench) benchmarks. We evaluate both closed-source models and open-source 7B models.
Multi-Image and Single-Image benchmark performance of different models with various prompting methods.
Performance comparison by image relationships of prompting strategies on MMIU (LLaVA-OV)
Performance comparison by image relationships of prompting strategies on MMIU (Mantis)
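A hypothetical evaluation loop over these benchmarks might look like the following; the dataset loader and the loose answer matching are placeholders, not the paper's actual harness.

```python
# Hypothetical accuracy loop over the benchmarks above. The loader and
# the loose string match are placeholders, not the paper's harness.
BENCHMARKS = ["MMIU", "MUIR", "ScienceQA", "MMMU", "MMBench"]

def load_benchmark(name: str):
    """Hypothetical loader yielding (question, images, gold) triples."""
    raise NotImplementedError

def evaluate(method, benchmark: str) -> float:
    correct = total = 0
    for question, images, gold in load_benchmark(benchmark):
        prediction = method(question, images)
        correct += int(gold.lower() in prediction.lower())
        total += 1
    return correct / total

# e.g., compare evaluate(qg_coc, "MMIU") against each baseline.
```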
@inproceedings{kao-etal-2025-qg,
title = "{QG}-{C}o{C}: Question-Guided Chain-of-Captions for Large Multimodal Models",
author = "Kao, Kuei-Chun and
Hsu, Tzu-Yin and
Hong, Yunqi and
Wang, Ruochen and
Hsieh, Cho-Jui",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1445/",
pages = "28433--28448",
ISBN = "979-8-89176-332-6"
}