VideoVista

A Versatile Benchmark for Video Understanding and Reasoning

Harbin Institute of Technology, Shenzhen

Introduction

Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there is still no versatile evaluation benchmark that comprehensively assesses these models' performance in video understanding and reasoning. We introduce VideoVista, a video benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment), with durations ranging from a few seconds to over 10 minutes. In addition, it encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 types of reasoning tasks (e.g., logical reasoning, causal reasoning). Our experiments reveal that: 1) Video-LMMs struggle with fine-grained video tasks involving temporal localization, object tracking, and anomaly detection; 2) Video-LMMs show weak logical and relation reasoning abilities; 3) open-source Video-LMMs perform significantly worse than GPT-4o and Gemini-1.5, lagging by 20 points. These findings highlight the crucial role VideoVista can play in advancing LMMs that accurately understand videos and perform precise reasoning.

Leaderboard on VideoVista

Accuracy scores on the VideoVista Dataset.

Unders. (Video Understanding Tasks); Reason. (Video Reasoning Tasks)

| # | Model | Language Model | Frames | Date | Overall | Unders. | Reason. |
|---|-------|----------------|--------|------|---------|---------|---------|
| 1 | Human Performance | - | 1 fps | 2024-08-27 | 90.24 | 89.64 | 92.30 |
| 2 | TimeMarker | Llama3-8B | 1 fps (max 128) | 2024-10-30 | 78.44 | 75.93 | 87.84 |
| 3 | GPT-4o | - | 100 | 2024-07-05 | 78.26 | 75.15 | 87.97 |
| 4 | Video-CCAM-v1.1 | Phi-3-medium-4k | 96 | 2024-08-28 | 76.50 | 73.12 | 89.14 |
| 5 | Gemini-1.5-Flash | - | 1 fps | 2024-07-05 | 76.39 | 74.73 | 82.30 |
| 6 | GPT-4o-mini | - | 100 | 2024-07-19 | 75.76 | 72.87 | 85.52 |
| 7 | Qwen2-VL | Qwen2-7B | 1 fps (max 128) | 2024-09-02 | 75.56 | 72.58 | 85.89 |
| 8 | LLaVA-OneVision | Qwen2-7B | 32 | 2024-08-15 | 72.99 | 70.25 | 83.20 |
| 9 | Video-CCAM-v1.1 | Phi-3-mini-4k | 96 | 2024-08-26 | 70.82 | 67.49 | 82.31 |
| 10 | Kangaroo | Llama3-8B | 64 | 2024-07-24 | 69.50 | 66.36 | 81.23 |
| 11 | InternLM-XComposer-2.5 | InternLM2-7B | 64 | 2024-07-09 | 68.91 | 66.75 | 76.96 |
| 12 | Video-CCAM-v1.0 | Phi-3-medium-4k | 96 | 2024-07-17 | 68.43 | 66.15 | 76.90 |
| 13 | Video-CCAM-v1.0 | Phi-3-mini-4k | 96 | 2024-07-18 | 68.09 | 66.18 | 75.22 |
| 14 | LongVA-DPO | Qwen2-7B | 128 | 2024-07-10 | 67.49 | 64.81 | 77.50 |
| 15 | LongVA | Qwen2-7B | 128 | 2024-07-08 | 67.36 | 64.67 | 77.39 |
| 16 | PLLaVA | Vicuna-13B-v1.5 | 16 | 2024-07-05 | 64.67 | 62.44 | 73.00 |
| 17 | VILA-1.5 | Vicuna-13B-v1.5 | 8 | 2024-07-07 | 64.18 | 62.27 | 71.34 |
| 18 | VideoChat2-Mistral-HD | Mistral-7B | 16 | 2024-07-17 | 61.58 | 59.27 | 70.24 |
| 19 | Uni-MoE | Vicuna-7B-v1.5 | 8 | 2024-07-05 | 61.13 | 58.65 | 69.62 |
| 20 | VideoLLaMA2 | Mistral-7B | 16 | 2024-07-05 | 60.47 | 58.73 | 66.97 |
| 21 | PLLaVA | Vicuna-7B-v1.5 | 16 | 2024-07-05 | 60.36 | 58.35 | 67.86 |
| 22 | VideoChat2-Mistral | Mistral-7B | 16 | 2024-07-05 | 57.24 | 54.91 | 65.95 |
| 23 | CogVLM2-Video-Chat | LLaMA3-8B | 24 | 2024-07-16 | 57.19 | 56.85 | 58.48 |
| 24 | LLaMA-VID | Vicuna-7B-v1.5 | 1 fps | 2024-07-05 | 56.87 | 54.00 | 67.61 |
| 25 | LLaVA-NeXT-Video-DPO | Vicuna-7B-v1.5 | 16 | 2024-07-05 | 56.66 | 54.12 | 66.14 |
| 26 | Video-LLaVA | Vicuna-7B-v1.5 | 8 | 2024-07-05 | 56.59 | 53.82 | 66.91 |
| 27 | VILA-1.5 | Llama3-8B | 8 | 2024-07-05 | 55.15 | 52.99 | 63.24 |
| 28 | MiniGPT4-Video | Mistral-7B | 45 | 2024-07-05 | 54.64 | 51.73 | 65.50 |
| 29 | VTimeLLM-Vicuna | Vicuna-7B-v1.5 | 100 | 2024-07-07 | 54.52 | 52.24 | 63.07 |
| 30 | Chat-UniVi-v1.5 | Vicuna-7B-v1.5 | 64 | 2024-07-08 | 54.22 | 51.72 | 63.55 |
| 31 | VideoChat2-Vicuna | Vicuna-7B-v1.5 | 16 | 2024-07-05 | 53.64 | 51.79 | 60.55 |
| 32 | ShareGPT4Video | Vicuna-7B-v1.5 | 16 | 2024-07-05 | 53.58 | 51.79 | 60.30 |
| 33 | VTimeLLM-ChatGLM | ChatGLM3-6B | 100 | 2024-07-07 | 52.86 | 50.91 | 60.15 |
| 34 | ST-LLM | Vicuna-7B-v1.1 | 64 | 2024-07-09 | 49.33 | 47.28 | 56.98 |
| 35 | MiniGPT4-Video | Vicuna-7B-v1.5 | 45 | 2024-07-05 | 44.92 | 43.43 | 50.48 |
| 36 | IVA | Vicuna-7B-v1.5 | 200 | 2024-07-05 | 39.70 | 37.38 | 48.38 |
| 37 | Video-ChatGPT | Vicuna-7B-v1.1 | 100 | 2024-07-05 | 36.65 | 36.09 | 38.73 |
| 38 | Video-LLaMA | Vicuna-7B-v1.1 | 16 | 2024-07-05 | 25.35 | 25.40 | 25.16 |
| 39 | VideoChat_with_ChatGPT | gpt-3.5-turbo | 40 | 2024-07-05 | 17.99 | 16.64 | 23.04 |
Human Performance*: We randomly sampled at most 150 questions from each category and assigned them to several group members; their scores give the human performance reported in the table above. Relation Reasoning: Because most current models do not support video+video or video+image input for relation reasoning tasks, we evaluated the relation reasoning task only for the GPT-4o series, Gemini series, and Qwen2-VL series models.
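
For reference, the sketch below shows one way the Overall, Unders., and Reason. accuracy columns could be aggregated from per-question model predictions. It is a minimal illustration, not the official evaluation script: the record field names (`task_group`, `answer`, `prediction`) are assumptions made for this example.

```python
# Minimal sketch of the accuracy aggregation reported in the leaderboard.
# Assumes each record carries a task group ("understanding" or "reasoning"),
# the gold answer, and the model's predicted option; field names are illustrative.
from collections import defaultdict

def accuracy(records):
    """records: iterable of dicts with keys 'task_group', 'answer', 'prediction'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r["task_group"]          # "understanding" or "reasoning"
        total[group] += 1
        total["overall"] += 1
        if r["prediction"] == r["answer"]:
            correct[group] += 1
            correct["overall"] += 1
    return {g: 100.0 * correct[g] / total[g] for g in total}

# Example with two dummy questions:
scores = accuracy([
    {"task_group": "understanding", "answer": "B", "prediction": "B"},
    {"task_group": "reasoning", "answer": "C", "prediction": "A"},
])
print(scores)  # {'understanding': 100.0, 'overall': 50.0, 'reasoning': 0.0}
```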


Statistics

Notable statistics of VideoVista.

Experiment Results

Different Video Categories


Detailed analysis of Video-LLMs across video categories: GPT-4o, Gemini-1.5-Flash, LLaMA-VID, and VideoChat2-Mistral.

The category names are abbreviated as follows: Howto & Style (H&S), News & Politics (N&P), Pets & Animals (P&A), Autos & Vehicles (A&V), Gaming (Gam.), Film & Animation (F&A), Sports (Spo.), Entertainment (ENT.), People & Blogs (P&B), Travel & Events (T&E), Comedy (Com.), Science & Technology (S&T), Education (Edu.), Music (Mus.).

Different Video Durations


Detailed analysis of Video-LLMs across video durations: GPT-4o, Gemini-1.5-Flash, LLaMA-VID, and VideoChat2-Mistral.

The durations are grouped as follows: 0-60s (Tiny), 60-120s (Short), 120-300s (Med.), 300-600s (Long), 600s+ (Huge); see the sketch below.
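
As a worked example of this grouping, a small helper could map a video's length in seconds to its duration group. The function name and the choice of half-open bins (lower bound included, upper bound excluded) are our own assumptions for illustration.

```python
def duration_group(seconds: float) -> str:
    """Map a video duration in seconds to its VideoVista duration group."""
    if seconds < 60:
        return "Tiny"   # 0-60s
    if seconds < 120:
        return "Short"  # 60-120s
    if seconds < 300:
        return "Med."   # 120-300s
    if seconds < 600:
        return "Long"   # 300-600s
    return "Huge"       # 600s+

print(duration_group(45), duration_group(450))  # Tiny Long
```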

Citation

@misc{li2024videovista,
        title={VideoVista: A Versatile Benchmark for Video Understanding and Reasoning}, 
        author={Yunxin Li and Xinyu Chen and Baotian Hu and Longyue Wang and Haoyuan Shi and Min Zhang},
        year={2024},
        eprint={2406.11303},
        archivePrefix={arXiv}
}