WorldVQA

Measuring Atomic World Knowledge in Multimodal Large Language Models

Kimi Team • Moonshot AI


Overall Model Accuracy

Category-wise Accuracy

Rank	Model	Accuracy	Overall F-Score	Nature F-Score	Geography F-Score	Culture F-Score	Objects F-Score	Transportation F-Score	Entertainment F-Score	Brands F-Score	Sports F-Score

Abstract

We introduce WorldVQA, a benchmark designed to evaluate the factual correctness and atomic vision-centric world knowledge of Multimodal Large Language Models (MLLMs). Current evaluations often conflate visual knowledge retrieval with reasoning. In contrast, WorldVQA decouples these capabilities to strictly measure "what the model memorizes." The benchmark assesses the atomic capability of grounding and naming visual entities across a stratified taxonomy, spanning from common head-class objects to long-tail rarities. We hope WorldVQA serves as a rigorous test for visual factuality, thereby establishing a standard for assessing the encyclopedic breadth and hallucination rates of current and next-generation frontier models.

WorldVQA Overview. The benchmark is organized into nine categories: Nature & Environment (Nature); Locations & Architecture (Geography); Culture, Arts & Crafts (Culture); Objects & Products (Objects); Vehicles, Craft & Transportation (Transportation); Entertainment, Media & Gaming (Entertainment); Brands, Logos & Graphic Design (Brands); Sports, Gear & Venues (Sports); Notable People & Public Figures (People).

Statistics	Value
Data	3500
- Chinese (CN)	1260 (36%)
- English (EN)	2240 (64%)
Categories
- Nature & Environment (Nature)	9.31%
- Locations & Architecture (Geography)	14.63%
- Culture, Arts & Crafts (Culture)	14.46%
- Objects & Products (Objects)	12.49%
- Vehicles, Craft & Transportation (Transportation)	8.74%
- Entertainment, Media & Gaming (Entertainment)	14.60%
- Brands, Logos & Graphic Design (Brands)	7.43%
- Sports, Gear & Venues (Sports)	4.06%
- Notable People & Public Figures (People)	14.29%
Difficulty
- Easy	31.17%
- Medium	40.77%
- Hard	28.07%

WorldVQA Statistics. Dataset composition across two languages, nine semantic categories, and three difficulty tiers.
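
The statistics table reports category and difficulty shares only as percentages of the 3500 samples. As a minimal sketch, the snippet below converts those shares into approximate sample counts. The counts are reconstructions from percentages rounded to two decimals, not figures published on this page, and the names `TOTAL`, `CATEGORY_PCT`, `DIFFICULTY_PCT`, `LANGUAGE_COUNT`, and `approx_counts` are illustrative assumptions.

```python
# Approximate per-bucket sample counts derived from the published percentages.
# Assumption: every percentage is a share of the same 3500-sample total.
TOTAL = 3500

CATEGORY_PCT = {
    "Nature": 9.31,
    "Geography": 14.63,
    "Culture": 14.46,
    "Objects": 12.49,
    "Transportation": 8.74,
    "Entertainment": 14.60,
    "Brands": 7.43,
    "Sports": 4.06,
    "People": 14.29,
}

DIFFICULTY_PCT = {"Easy": 31.17, "Medium": 40.77, "Hard": 28.07}

# Language counts are stated directly in the table, so no conversion is needed.
LANGUAGE_COUNT = {"CN": 1260, "EN": 2240}


def approx_counts(pct_by_name, total=TOTAL):
    """Convert percentage shares into counts rounded to the nearest integer."""
    return {name: round(total * pct / 100) for name, pct in pct_by_name.items()}


category_counts = approx_counts(CATEGORY_PCT)
difficulty_counts = approx_counts(DIFFICULTY_PCT)

if __name__ == "__main__":
    for name, count in category_counts.items():
        print(f"{name:<15} ~{count}")
    for name, count in difficulty_counts.items():
        print(f"{name:<15} ~{count}")
```

Under these assumptions the rounded category counts and the rounded difficulty counts each sum to 3500, which is consistent with the stated dataset size, and the two language counts also sum to 3500.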

WorldVQA Showcase