IF-VidCap

A benchmark for Instruction-Following Video Captioning

1Nanjing University, 2Kuaishou Technology, 3Shanghai University, 4CASIA, 5M-A-P

Introduction

[Figure: data composition]

Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.

Leaderboard

ISR: Instruction Satisfaction Rate. CSR: Content Satisfaction Rate.

Rule-Based ISR/CSR consider only format-related constraints; Open-ended ISR/CSR consider only content-related constraints.

By default, this leaderboard is sorted by overall ISR, with CSR as a secondary sort key. To view other sorted results, please click on the corresponding cell.

The “Frames” column gives either a sampling frame rate (float) or a fixed number of frames (integer).

| # | Model | LLM Params | Frames | Date | Overall ISR (%) | Overall CSR (%) | Rule-Based ISR (%) | Rule-Based CSR (%) | Open-ended ISR (%) | Open-ended CSR (%) |
|---|-------|------------|--------|------|-----------------|-----------------|--------------------|--------------------|--------------------|--------------------|
| - | Gemini-2.5-pro | - | - | - | 27.83 | 74.53 | 74.35 | 87.81 | 35.22 | 59.00 |
| - | Qwen3-VL-235B-A22B-Instruct | 235B | 2 | - | 26.41 | 71.65 | 67.16 | 84.14 | 36.39 | 57.12 |
| - | Gemini-2.5-flash | - | - | - | 25.50 | 72.63 | 67.80 | 84.51 | 35.45 | 58.71 |
| - | InternVL3.5-241B-A28B_thinking | 241B | 1.0 | - | 24.20 | 71.17 | 65.58 | 83.21 | 34.64 | 57.13 |
| - | GPT-4o | - | - | - | 22.90 | 70.74 | 69.20 | 85.12 | 30.94 | 53.91 |
| - | InternVL3.5-38B_thinking | 38B | 1.0 | - | 20.71 | 68.30 | 59.43 | 80.17 | 31.79 | 54.42 |
| - | Gemini-2.0-flash | - | - | - | 18.19 | 67.45 | 63.04 | 82.06 | 26.86 | 50.39 |
| - | Qwen2.5-VL-72B-Instruct | 72B | 2.0 | - | 17.50 | 67.28 | 64.29 | 83.22 | 25.71 | 48.65 |
| - | InternVL3.5-8B_thinking | 8B | 1.0 | - | 17.33 | 65.90 | 60.32 | 79.95 | 26.84 | 49.50 |
| - | InternVL3.5-38B_nothinking | 38B | 1.0 | - | 15.43 | 64.76 | 57.79 | 78.92 | 24.93 | 48.20 |
| - | Qwen2.5-VL-32B-Instruct | 32B | 2.0 | - | 15.16 | 64.04 | 53.66 | 76.95 | 26.72 | 48.94 |
| - | VideoLLaMA3-7B_thinking | 7B | 2.0 | - | 12.21 | 57.38 | 48.64 | 71.69 | 19.93 | 40.65 |
| - | Qwen2.5-VL-7B-Instruct | 7B | 2.0 | - | 1.92 | 58.12 | 52.51 | 73.81 | 18.75 | 39.65 |
| - | MiniCPM-V-4.5_thinking | 8B | 2.0 | - | 11.75 | 61.67 | 58.09 | 79.35 | 18.05 | 40.97 |
| - | VideoLLaMA3-7B_nothinking | 7B | 2.0 | - | 10.63 | 57.17 | 47.34 | 71.21 | 18.46 | 40.75 |
| - | InternVL3.5-8B_nothinking | 8B | 1.0 | - | 9.96 | 56.45 | 48.14 | 71.68 | 16.98 | 38.65 |
| - | MiniCPM-V-4.5_nothinking | 8B | 2.0 | - | 8.57 | 59.23 | 56.07 | 77.62 | 14.64 | 37.73 |
| - | Qwen2.5-VL-3B-Instruct | 3B | 2.0 | - | 6.54 | 51.74 | 43.46 | 66.50 | 13.15 | 34.47 |
| - | Llama-3.2-90B-Vision-Instruct | 90B | 1.0 | - | 5.80 | 45.18 | 36.03 | 59.56 | 11.03 | 28.36 |
| - | Llama-3.2-11B-Vision-Instruct | 11B | 1.0 | - | 4.00 | 39.87 | 31.29 | 53.24 | 7.71 | 24.25 |
| - | llava-v1.6-vicuna-7b | 7B | 32 | - | 3.54 | 43.92 | 35.84 | 60.09 | 7.30 | 25.02 |
| - | Video-LLaVA | 7B | 8 | - | 3.13 | 38.74 | 26.53 | 51.27 | 7.73 | 24.05 |
| - | ARC-Hunyuan-Video-7B | 7B | 1.0 | - | 2.32 | 27.78 | 12.23 | 31.41 | 9.11 | 23.54 |
| - | Tarsier2-7B | 7B | 1.0 | - | 1.40 | 26.05 | 9.30 | 27.75 | 9.91 | 24.04 |
| - | IF-Captioner-Qwen (Ours) | 7B | 2.0 | - | 12.76 | 61.64 | 58.50 | 78.81 | 19.65 | 41.56 |

Date: indicates the publication date of open-source models. "-": indicates "unknown" for closed-source models.
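
To make the aggregate numbers concrete, the sketch below shows one plausible way to turn per-constraint pass/fail results into the two rates and to reproduce the default sort order (overall ISR, then CSR). Reading ISR as "all constraints of a sample satisfied" and CSR as an average over individual constraints is an assumption made here purely for illustration, as are the field names; the paper's definitions govern the reported scores.

    # Minimal sketch in Python; the per-sample flag lists and row fields are hypothetical.
    def isr(per_sample_flags: list[list[bool]]) -> float:
        """Assumed: a sample counts toward ISR only if every constraint of its instruction passes."""
        satisfied = sum(all(flags) for flags in per_sample_flags)
        return 100.0 * satisfied / len(per_sample_flags)

    def csr(per_sample_flags: list[list[bool]]) -> float:
        """Assumed: CSR averages over individual constraints, pooled across all samples."""
        flags = [f for sample in per_sample_flags for f in sample]
        return 100.0 * sum(flags) / len(flags)

    # Default leaderboard order: overall ISR descending, with CSR as the tie-breaker.
    def sort_rows(rows: list[dict]) -> list[dict]:
        return sorted(rows, key=lambda r: (r["isr"], r["csr"]), reverse=True)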

Benchmark

Data Examples

See the IF-VidCap Explorer for more videos and human-annotated/model-generated descriptions.

Benchmark Statistics

[Figures: benchmark statistics]

Evaluation Approach

We evaluate two core dimensions, instruction following and video description quality, using LLM-driven methods to ensure both flexibility and scalability. For rule-checkable constraints, we combine an LLM with rule scripts: the LLM serves as a content extractor, while the rule scripts act as the verification executor. This pairs the LLM's adaptability to complex text processing with the determinism of rule execution. For open-ended constraint checking, we design retrieval-based QA pairs and let the LLM answer them using the video caption as context. Specifically, the checklist uses true/false questions, where the LLM directly judges the semantic correctness of the description, and multiple-choice questions, where the LLM selects the facts that can be inferred from the description. All answers produced by the LLM are compared against the ground truth, and statistics are aggregated at the constraint level (a single constraint may include multiple QA pairs to enable atomic checking and control over granularity). We use GPT-5-mini for both roles (see the evaluation scripts), as shown below.
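
As an illustration of the rule-checkable path, the following minimal sketch pairs an LLM extraction step with a deterministic verifier for a format constraint such as "use exactly N bullet points". The prompt wording and the `llm` callable are placeholders, not the released evaluation code.

    import json

    def check_bullet_count(caption: str, expected: int, llm) -> bool:
        """Verify a format constraint like "use exactly N bullet points".

        `llm` is any callable mapping a prompt string to a completion string
        (e.g. a thin wrapper around the judge model); it only normalises the
        caption into structured data, while the pass/fail decision remains a
        deterministic rule.
        """
        prompt = (
            "List every bullet item in the caption below as a JSON array of "
            "strings. Return only the JSON.\n\nCaption:\n" + caption
        )
        bullets = json.loads(llm(prompt))   # LLM as content extractor
        return len(bullets) == expected     # rule script as verification executor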

Checklist


The Judge workflow.
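
Reading the workflow above as code, the open-ended side can be sketched as follows: each content constraint carries one or more QA pairs, the judge answers them with only the generated caption as context, and the answers are compared with the ground truth. The `Question` schema, the prompt wording, and the rule that a constraint passes only when all of its QA pairs are answered correctly are illustrative assumptions, not the released script.

    from dataclasses import dataclass

    @dataclass
    class Question:
        # Hypothetical schema for one checklist item attached to a constraint.
        text: str           # true/false or multiple-choice question
        options: list[str]  # empty for true/false questions
        answer: str         # ground-truth answer

    def ask_judge(caption: str, q: Question, llm) -> str:
        """Have the judge answer a checklist question using only the caption as context."""
        options = "\n".join(q.options) if q.options else "True / False"
        prompt = (
            "Answer using only the caption below.\n"
            f"Caption:\n{caption}\n\nQuestion: {q.text}\nOptions:\n{options}\n"
            "Reply with the answer only."
        )
        return llm(prompt).strip()

    def constraint_satisfied(caption: str, questions: list[Question], llm) -> bool:
        """Assumed aggregation: the constraint passes only if every QA pair matches the ground truth."""
        return all(ask_judge(caption, q, llm).lower() == q.answer.lower() for q in questions)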

Citation


    @misc{li2025ifvidcapvideocaptionmodels,
          title={IF-VidCap: Can Video Caption Models Follow Instructions?}, 
          author={Shihao Li and Yuanxing Zhang and Jiangtao Wu and Zhide Lei and Yiwen He and Runzhe Wen and Chenxi Liao and Chengkang Jiang and An Ping and Shuo Gao and Suhan Wang and Zhaozhou Bian and Zijun Zhou and Jingyi Xie and Jiayi Zhou and Jing Wang and Yifan Yao and Weihao Xie and Yingshui Tan and Yanghai Wang and Qianqian Xie and Zhaoxiang Zhang and Jiaheng Liu},
          year={2025},
          eprint={2510.18726},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2510.18726}, 
    }