IF-VidCap

A benchmark for Instruction-Following Video Captioning

1Nanjing University, 2Kuaishou Technology, 3Shanghai University, 4CASIA, 5M-A-P

Introduction

[Figure: data composition]

Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.

Leaderboard

ISR: Instruction Satisfaction Rate. CSR: Content Satisfaction Rate.

Rule-Based ISR/CSR consider only format-related constraints; Open-ended ISR/CSR consider only content-related constraints.

By default, this leaderboard is sorted by overall ISR, with CSR as a secondary sort key. To view other sorted results, please click on the corresponding cell.

The “Frames” column gives either a sampling frame rate (float) or a fixed number of frames (integer).

| # | Model | LLM Params | Frames | Date | Overall ISR (%) | Overall CSR (%) | Rule-Based ISR (%) | Rule-Based CSR (%) | Open-ended ISR (%) | Open-ended CSR (%) |
|---|-------|------------|--------|------|-----------------|-----------------|--------------------|--------------------|--------------------|--------------------|
| - | Gemini-2.5-pro | - | - | - | 27.83 | 74.53 | 74.35 | 87.81 | 35.22 | 59.00 |
| - | Qwen3-VL-235B-A22B-Instruct | 235B | 2 | - | 26.41 | 71.65 | 67.16 | 84.14 | 36.39 | 57.12 |
| - | Gemini-2.5-flash | - | - | - | 25.50 | 72.63 | 67.80 | 84.51 | 35.45 | 58.71 |
| - | InternVL3.5-241B-A28B_thinking | 241B | 1.0 | - | 24.20 | 71.17 | 65.58 | 83.21 | 34.64 | 57.13 |
| - | GPT-4o | - | - | - | 22.90 | 70.74 | 69.20 | 85.12 | 30.94 | 53.91 |
| - | InternVL3.5-38B_thinking | 38B | 1.0 | - | 20.71 | 68.30 | 59.43 | 80.17 | 31.79 | 54.42 |
| - | Gemini-2.0-flash | - | - | - | 18.19 | 67.45 | 63.04 | 82.06 | 26.86 | 50.39 |
| - | Qwen2.5-VL-72B-Instruct | 72B | 2.0 | - | 17.50 | 67.28 | 64.29 | 83.22 | 25.71 | 48.65 |
| - | InternVL3.5-8B_thinking | 8B | 1.0 | - | 17.33 | 65.90 | 60.32 | 79.95 | 26.84 | 49.50 |
| - | InternVL3.5-38B_nothinking | 38B | 1.0 | - | 15.43 | 64.76 | 57.79 | 78.92 | 24.93 | 48.20 |
| - | Qwen2.5-VL-32B-Instruct | 32B | 2.0 | - | 15.16 | 64.04 | 53.66 | 76.95 | 26.72 | 48.94 |
| - | VideoLLaMA3-7B_thinking | 7B | 2.0 | - | 12.21 | 57.38 | 48.64 | 71.69 | 19.93 | 40.65 |
| - | Qwen2.5-VL-7B-Instruct | 7B | 2.0 | - | 1.92 | 58.12 | 52.51 | 73.81 | 18.75 | 39.65 |
| - | MiniCPM-V-4.5_thinking | 8B | 2.0 | - | 11.75 | 61.67 | 58.09 | 79.35 | 18.05 | 40.97 |
| - | VideoLLaMA3-7B_nothinking | 7B | 2.0 | - | 10.63 | 57.17 | 47.34 | 71.21 | 18.46 | 40.75 |
| - | InternVL3.5-8B_nothinking | 8B | 1.0 | - | 9.96 | 56.45 | 48.14 | 71.68 | 16.98 | 38.65 |
| - | MiniCPM-V-4.5_nothinking | 8B | 2.0 | - | 8.57 | 59.23 | 56.07 | 77.62 | 14.64 | 37.73 |
| - | Qwen2.5-VL-3B-Instruct | 3B | 2.0 | - | 6.54 | 51.74 | 43.46 | 66.50 | 13.15 | 34.47 |
| - | Llama-3.2-90B-Vision-Instruct | 90B | 1.0 | - | 5.80 | 45.18 | 36.03 | 59.56 | 11.03 | 28.36 |
| - | Llama-3.2-11B-Vision-Instruct | 11B | 1.0 | - | 4.00 | 39.87 | 31.29 | 53.24 | 7.71 | 24.25 |
| - | llava-v1.6-vicuna-7b | 7B | 32 | - | 3.54 | 43.92 | 35.84 | 60.09 | 7.30 | 25.02 |
| - | Video-LLaVA | 7B | 8 | - | 3.13 | 38.74 | 26.53 | 51.27 | 7.73 | 24.05 |
| - | ARC-Hunyuan-Video-7B | 7B | 1.0 | - | 2.32 | 27.78 | 12.23 | 31.41 | 9.11 | 23.54 |
| - | Tarsier2-7B | 7B | 1.0 | - | 1.40 | 26.05 | 9.30 | 27.75 | 9.91 | 24.04 |
| - | IF-Captioner-Qwen (Ours) | 7B | 2.0 | - | 12.76 | 61.64 | 58.50 | 78.81 | 19.65 | 41.56 |

Date: indicates the publication date of open-source models. "-": indicates "unknown" for closed-source models.
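
To make the aggregate numbers concrete, the sketch below shows one plausible way to turn per-constraint pass/fail results into the two rates and to reproduce the default sort order (overall ISR, then CSR). Reading ISR as "all constraints of a sample satisfied" and CSR as an average over individual constraints is an assumption made here purely for illustration, as are the field names; the paper's definitions govern the reported scores.

    # Minimal sketch in Python; the per-sample flag lists and row fields are hypothetical.
    def isr(per_sample_flags: list[list[bool]]) -> float:
        """Assumed: a sample counts toward ISR only if every constraint of its instruction passes."""
        satisfied = sum(all(flags) for flags in per_sample_flags)
        return 100.0 * satisfied / len(per_sample_flags)

    def csr(per_sample_flags: list[list[bool]]) -> float:
        """Assumed: CSR averages over individual constraints, pooled across all samples."""
        flags = [f for sample in per_sample_flags for f in sample]
        return 100.0 * sum(flags) / len(flags)

    # Default leaderboard order: overall ISR descending, with CSR as the tie-breaker.
    def sort_rows(rows: list[dict]) -> list[dict]:
        return sorted(rows, key=lambda r: (r["isr"], r["csr"]), reverse=True)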

Benchmark

Data Examples

See the IF-VidCap Explorer for more videos and human-annotated/model-generated descriptions.

Benchmark Statistics

[Figures: benchmark statistics]

Evaluation Approach

We evaluate two core dimensions, instruction following and video description quality, using LLM-driven methods to ensure both flexibility and scalability. For rule-checkable constraints, we combine an LLM with rule scripts: the LLM serves as a content extractor, while the rule scripts act as the verification executor. This pairs the LLM's adaptability to complex text processing with the determinism of rule execution. For open-ended constraint checking, we design retrieval-based QA pairs and let the LLM answer them using the video caption as context. Specifically, the checklist uses true/false questions, where the LLM directly judges the semantic correctness of the description, and multiple-choice questions, where the LLM selects the facts that can be inferred from the description. All answers produced by the LLM are compared against the ground truth, and statistics are aggregated at the constraint level (a single constraint may include multiple QA pairs to enable atomic checking and control over granularity). We use GPT-5-mini for both roles (see the evaluation scripts), as shown below.
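
As an illustration of the rule-checkable path, the following minimal sketch pairs an LLM extraction step with a deterministic verifier for a format constraint such as "use exactly N bullet points". The prompt wording and the `llm` callable are placeholders, not the released evaluation code.

    import json

    def check_bullet_count(caption: str, expected: int, llm) -> bool:
        """Verify a format constraint like "use exactly N bullet points".

        `llm` is any callable mapping a prompt string to a completion string
        (e.g. a thin wrapper around the judge model); it only normalises the
        caption into structured data, while the pass/fail decision remains a
        deterministic rule.
        """
        prompt = (
            "List every bullet item in the caption below as a JSON array of "
            "strings. Return only the JSON.\n\nCaption:\n" + caption
        )
        bullets = json.loads(llm(prompt))   # LLM as content extractor
        return len(bullets) == expected     # rule script as verification executor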

Checklist


The Judge workflow.
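
Reading the workflow above as code, the open-ended side can be sketched as follows: each content constraint carries one or more QA pairs, the judge answers them with only the generated caption as context, and the answers are compared with the ground truth. The `Question` schema, the prompt wording, and the rule that a constraint passes only when all of its QA pairs are answered correctly are illustrative assumptions, not the released script.

    from dataclasses import dataclass

    @dataclass
    class Question:
        # Hypothetical schema for one checklist item attached to a constraint.
        text: str           # true/false or multiple-choice question
        options: list[str]  # empty for true/false questions
        answer: str         # ground-truth answer

    def ask_judge(caption: str, q: Question, llm) -> str:
        """Have the judge answer a checklist question using only the caption as context."""
        options = "\n".join(q.options) if q.options else "True / False"
        prompt = (
            "Answer using only the caption below.\n"
            f"Caption:\n{caption}\n\nQuestion: {q.text}\nOptions:\n{options}\n"
            "Reply with the answer only."
        )
        return llm(prompt).strip()

    def constraint_satisfied(caption: str, questions: list[Question], llm) -> bool:
        """Assumed aggregation: the constraint passes only if every QA pair matches the ground truth."""
        return all(ask_judge(caption, q, llm).lower() == q.answer.lower() for q in questions)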

Citation


    @misc{li2025ifvidcapvideocaptionmodels,
          title={IF-VidCap: Can Video Caption Models Follow Instructions?}, 
          author={Shihao Li and Yuanxing Zhang and Jiangtao Wu and Zhide Lei and Yiwen He and Runzhe Wen and Chenxi Liao and Chengkang Jiang and An Ping and Shuo Gao and Suhan Wang and Zhaozhou Bian and Zijun Zhou and Jingyi Xie and Jiayi Zhou and Jing Wang and Yifan Yao and Weihao Xie and Yingshui Tan and Yanghai Wang and Qianqian Xie and Zhaoxiang Zhang and Jiaheng Liu},
          year={2025},
          eprint={2510.18726},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2510.18726}, 
    }