QG-CoC: Question-Guided Chain-of-Captions

A generalizable zero-shot prompting method for multi-image reasoning in MLLMs

University of California, Los Angeles

[Figure: Example of a Multi-Unknown Problem]

Introduction

QG-CoC is a zero-shot prompting method for enhancing multi-image understanding in Multimodal Large Language Models (MLLMs). It guides image captioning with question decomposition to better focus on relevant visual details, then synthesizes reasoning steps from sub-answers to reach a coherent final prediction.
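The sketch below illustrates this pipeline. It assumes a generic mllm(prompt, images) chat interface; the function name and prompt wording are illustrative placeholders, not the authors' exact templates.

    # Minimal sketch of the QG-CoC pipeline. `mllm(prompt, images)` is a
    # hypothetical chat interface; prompts paraphrase the method, not the
    # paper's exact templates.
    def qg_coc(question: str, images: list, mllm) -> str:
        # Step 1: decompose the question into image-focused sub-questions.
        sub_questions = mllm(
            f"Decompose this question into sub-questions, one per image:\n{question}",
            images,
        ).splitlines()

        # Step 2: question-guided captioning -- caption each image with its
        # sub-question so the model attends to question-relevant details.
        captions = [
            mllm(f"Caption this image in detail, focusing on: {sq}", [img])
            for sq, img in zip(sub_questions, images)
        ]

        # Step 3: chain the captions (sub-answers) into a rationale and a
        # coherent final prediction.
        chain = "\n".join(f"Image {i + 1}: {c}" for i, c in enumerate(captions))
        return mllm(
            f"{chain}\nUsing the captions above, reason step by step and answer:\n{question}",
            images,
        )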

Overview (Examples)

Comparison

Below are examples comparing QG-CoC with existing prompting strategies.

We develop several prompting methods as our baselines (illustrative prompt templates follow the list):

  • Detailed Captioning: Caption each image in detail.
  • Question-Guided Detailed Captioning: Include the question when captioning each image in detail.
  • DDCoT: First decompose the question, then use the MLLM to answer the sub-questions and use those answers as the rationale.
  • CoCoT: Use the MLLM to describe the similarities and differences among the images.
  • CCoT: Use the MLLM to generate a scene graph for each image.

Experimental Results

QG-CoC achieves strong performance across multi-image (MMIU, MUIR) and single-image (ScienceQA, MMMU, MMBench) benchmarks. We evaluate both closed-source models and open-source 7B models.

BibTeX

@inproceedings{kao-etal-2025-qg,
      title = "{QG}-{C}o{C}: Question-Guided Chain-of-Captions for Large Multimodal Models",
      author = "Kao, Kuei-Chun and
      Hsu, Tzu-Yin and
      Hong, Yunqi and
      Wang, Ruochen and
      Hsieh, Cho-Jui",
      editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
      booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
      month = nov,
      year = "2025",
      address = "Suzhou, China",
      publisher = "Association for Computational Linguistics",
      url = "https://aclanthology.org/2025.emnlp-main.1445/",
      pages = "28433--28448",
      ISBN = "979-8-89176-332-6"
      }