QG-CoC (Question-Guided Chain-of-Captions) is a zero-shot prompting method for enhancing multi-image understanding in Multimodal Large Language Models (MLLMs). It decomposes the input question to guide image captioning toward the relevant visual details, then synthesizes the resulting sub-answers into a coherent final prediction.
Example of the QG-CoC prompting strategy (Figures 1–3)
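The pipeline can be summarized in a short sketch. Below is a minimal, illustrative Python implementation, assuming a hypothetical `query_mllm` wrapper around whatever MLLM backend you use; the prompt wording is ours, not the exact prompts from the paper.

```python
# Minimal sketch of the QG-CoC pipeline. `query_mllm` is a hypothetical
# wrapper around an MLLM backend (e.g., an OpenAI-style chat API or a
# local LLaVA-OneVision checkpoint); the prompt wording below is
# illustrative, not the paper's exact prompts.

def query_mllm(prompt: str, images: list) -> str:
    """Hypothetical MLLM call; replace with your own backend."""
    raise NotImplementedError

def qg_coc(question: str, images: list) -> str:
    # Step 1: decompose the question into simpler sub-questions.
    sub_questions = query_mllm(
        "Break this question into simpler sub-questions, one per line:\n"
        + question,
        images,
    ).splitlines()

    # Step 2: question-guided captioning -- caption each image with the
    # sub-questions in focus, so relevant details are not omitted.
    captions = [
        query_mllm(
            "Describe this image, focusing on details needed to answer:\n"
            + "\n".join(sub_questions),
            [image],
        )
        for image in images
    ]

    # Step 3: answer the sub-questions from the captions, then synthesize
    # the sub-answers into a coherent final prediction.
    caption_block = "\n".join(
        f"Image {i + 1}: {caption}" for i, caption in enumerate(captions)
    )
    return query_mllm(
        f"Captions:\n{caption_block}\n\n"
        "Answer each sub-question, then combine the answers to answer:\n"
        + question,
        images,
    )
```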
Below are examples comparing QG-CoC with existing prompting strategies.
We develop several prompting methods as our baselines:
Comparison of different prompting methods
Comparison of different captioning strategies
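As a rough illustration of what such baselines look like (the template wording here is ours, not the paper's exact prompts), each variant simply changes the prompt passed to the same backend. The sketch below reuses the hypothetical `query_mllm` stub from the QG-CoC sketch above.

```python
# Illustrative baseline prompt templates (wording is ours, not the
# paper's exact prompts). Reuses the hypothetical `query_mllm` stub
# from the QG-CoC sketch above.
BASELINE_TEMPLATES = {
    # Direct zero-shot answering.
    "zero-shot": "{question}",
    # Zero-shot chain-of-thought.
    "cot": "{question}\nLet's think step by step.",
    # Question-agnostic chain-of-captions: caption every image first,
    # then answer from the captions.
    "caption-then-answer": (
        "First describe each image in detail, then use your "
        "descriptions to answer:\n{question}"
    ),
}

def run_baseline(name: str, question: str, images: list) -> str:
    prompt = BASELINE_TEMPLATES[name].format(question=question)
    return query_mllm(prompt, images)
```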
QG-CoC achieves strong performance across multi-image (MMIU, MUIR) and single-image (ScienceQA, MMMU, MMBench) benchmarks. We evaluate both closed-source models and open-source 7B models.
Multi-Image and Single-Image benchmark performance of different models with various prompting methods.
Performance comparison by image relationships of prompting strategies on MMIU (LLaVA-OV)
Performance comparison by image relationships of prompting strategies on MMIU (Mantis)
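A hypothetical evaluation loop over these benchmarks might look like the following; the dataset loader and the loose answer matching are placeholders, not the paper's actual harness.

```python
# Hypothetical accuracy loop over the benchmarks above. The loader and
# the loose string match are placeholders, not the paper's harness.
BENCHMARKS = ["MMIU", "MUIR", "ScienceQA", "MMMU", "MMBench"]

def load_benchmark(name: str):
    """Hypothetical loader yielding (question, images, gold) triples."""
    raise NotImplementedError

def evaluate(method, benchmark: str) -> float:
    correct = total = 0
    for question, images, gold in load_benchmark(benchmark):
        prediction = method(question, images)
        correct += int(gold.lower() in prediction.lower())
        total += 1
    return correct / total

# e.g., compare evaluate(qg_coc, "MMIU") against each baseline.
```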
@inproceedings{kao-etal-2025-qg,
title = "{QG}-{C}o{C}: Question-Guided Chain-of-Captions for Large Multimodal Models",
author = "Kao, Kuei-Chun and
Hsu, Tzu-Yin and
Hong, Yunqi and
Wang, Ruochen and
Hsieh, Cho-Jui",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1445/",
pages = "28433--28448",
ISBN = "979-8-89176-332-6"
}