WorldVQA

Measuring Atomic World Knowledge in Multimodal Large Language Models

Kimi Team • Moonshot AI

[Figure: leaderboard panels "Overall Model Accuracy" and "Category-wise Accuracy". Columns: Rank, Model, Accuracy, Overall F-Score, and F-score on 8 categories (Nature, Geography, Culture, Objects, Transportation, Entertainment, Brands, Sports).]

Abstract

We introduce WorldVQA, a benchmark designed to evaluate the factual correctness and atomic, vision-centric world knowledge of Multimodal Large Language Models (MLLMs). Current evaluations often conflate visual knowledge retrieval with reasoning. In contrast, WorldVQA decouples these capabilities to strictly measure "what the model memorizes." The benchmark assesses the atomic capability of grounding and naming visual entities across a stratified taxonomy, ranging from common head-class objects to long-tail rarities. We hope WorldVQA serves as a rigorous test of visual factuality and helps establish a standard for assessing the encyclopedic breadth and hallucination rates of current and next-generation frontier models.
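The leaderboard reports Accuracy and an Overall F-Score, but this page does not define how they are computed. One common convention for short-answer factuality benchmarks (used by SimpleQA) grades each answer as correct, incorrect, or not attempted. Under that convention, the F-score is the harmonic mean of overall accuracy and accuracy on attempted questions. The sketch below assumes this convention. The grade labels and the function name `simpleqa_style_scores` are hypothetical and are not WorldVQA's published scoring code.

```python
from collections import Counter

# Hypothetical grade labels (assumption): WorldVQA's grader output format is not given here.
CORRECT, INCORRECT, NOT_ATTEMPTED = "correct", "incorrect", "not_attempted"


def simpleqa_style_scores(grades):
    """Return (accuracy, accuracy_given_attempted, f_score) for a list of grade labels.

    Assumes a SimpleQA-style convention: the F-score is the harmonic mean of
    overall accuracy and accuracy on attempted (non-abstained) questions.
    """
    counts = Counter(grades)
    total = len(grades)
    attempted = counts[CORRECT] + counts[INCORRECT]
    accuracy = counts[CORRECT] / total if total else 0.0
    given_attempted = counts[CORRECT] / attempted if attempted else 0.0
    denom = accuracy + given_attempted
    # Harmonic mean; defined as 0 when both components are 0.
    f_score = 2 * accuracy * given_attempted / denom if denom else 0.0
    return accuracy, given_attempted, f_score


grades = [CORRECT] * 6 + [INCORRECT] * 2 + [NOT_ATTEMPTED] * 2
acc, acc_attempted, f = simpleqa_style_scores(grades)
print(f"accuracy={acc:.3f} given_attempted={acc_attempted:.3f} f_score={f:.3f}")
```

Under this convention, a model that abstains instead of naming an entity wrongly keeps the same overall accuracy but raises its accuracy on attempted questions. The F-score therefore favors abstention over hallucination, which fits the benchmark's stated interest in hallucination rates.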

WorldVQA Dataset Overview

WorldVQA Overview. The benchmark is organized into nine categories: Nature & Environment (Nature); Locations & Architecture (Geography); Culture, Arts & Crafts (Culture); Objects & Products (Objects); Vehicles, Craft & Transportation (Transportation); Entertainment, Media & Gaming (Entertainment); Brands, Logos & Graphic Design (Brands); Sports, Gear & Venues (Sports); Notable People & Public Figures (People).
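The leaderboard also breaks results down by category. A per-category score is the overall metric restricted to one slice of the taxonomy. The sketch below computes per-category accuracy from `(category, is_correct)` pairs, using the nine short names listed above. The record format and the function name `category_accuracy` are illustrative assumptions, not WorldVQA's data schema or evaluation code. Note that the leaderboard header lists only eight categories and omits People.

```python
from collections import defaultdict

# Short category names, as given in the WorldVQA overview.
CATEGORIES = (
    "Nature", "Geography", "Culture", "Objects", "Transportation",
    "Entertainment", "Brands", "Sports", "People",
)


def category_accuracy(records):
    """Compute accuracy per category from (category, is_correct) pairs.

    The record format is a hypothetical illustration. Categories with no
    records are omitted from the result.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, is_correct in records:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        totals[category] += 1
        hits[category] += int(is_correct)
    return {c: hits[c] / totals[c] for c in CATEGORIES if totals[c]}


result = category_accuracy([("Nature", True), ("Nature", False), ("Brands", True)])
print(result)
```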

Statistics                                           Number
Data                                                 3500
- Chinese (CN)                                       1260 (36%)
- English (EN)                                       2240 (64%)

Statistics                                           Percentage
Category
- Nature & Environment (Nature)                      9.31%
- Locations & Architecture (Geography)               14.63%
- Culture, Arts & Crafts (Culture)                   14.46%
- Objects & Products (Objects)                       12.49%
- Vehicles, Craft & Transportation (Transportation)  8.74%
- Entertainment, Media & Gaming (Entertainment)      14.60%
- Brands, Logos & Graphic Design (Brands)            7.43%
- Sports, Gear & Venues (Sports)                     4.06%
- Notable People & Public Figures (People)           14.29%
Difficulty
- Easy                                               31.17%
- Medium                                             40.77%
- Hard                                               28.07%

WorldVQA Statistics. Dataset size, language split, and distribution across nine semantic categories and three difficulty tiers.
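The reported statistics are internally consistent, and a short script can confirm it. The language counts sum to 3500 and match the stated 36% / 64% shares. The category and difficulty percentages each sum to 100.01%, which is consistent with rounding to two decimals. The sketch below copies the figures from the statistics above. The function name `check_table` and the 0.05-point tolerance are illustrative choices, not part of the benchmark.

```python
# Figures copied from the WorldVQA statistics.
TOTAL = 3500
LANGUAGE_COUNTS = {"CN": 1260, "EN": 2240}
LANGUAGE_PCT = {"CN": 36, "EN": 64}
CATEGORY_PCT = {
    "Nature": 9.31, "Geography": 14.63, "Culture": 14.46, "Objects": 12.49,
    "Transportation": 8.74, "Entertainment": 14.60, "Brands": 7.43,
    "Sports": 4.06, "People": 14.29,
}
DIFFICULTY_PCT = {"Easy": 31.17, "Medium": 40.77, "Hard": 28.07}


def check_table(tolerance=0.05):
    """Return a list of inconsistencies; an empty list means the table checks out."""
    problems = []
    if sum(LANGUAGE_COUNTS.values()) != TOTAL:
        problems.append("language counts do not sum to the total")
    for lang, count in LANGUAGE_COUNTS.items():
        # Stated shares are whole percentages, so compare after rounding.
        if round(100 * count / TOTAL) != LANGUAGE_PCT[lang]:
            problems.append(f"{lang} share does not match its count")
    for name, table in (("category", CATEGORY_PCT), ("difficulty", DIFFICULTY_PCT)):
        total_pct = sum(table.values())
        # Two-decimal rounding can leave a small residue around 100%.
        if abs(total_pct - 100.0) > tolerance:
            problems.append(f"{name} percentages sum to {total_pct:.2f}")
    return problems


print(check_table() or "table is consistent")
```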

WorldVQA Showcase