Measuring Atomic World Knowledge in
Multimodal Large Language Models
Kimi Team • Moonshot AI
Leaderboard. The columns Nature through Sports report the F-score on each of 8 categories.

| Rank | Model | Accuracy | Overall F-Score | Nature | Geography | Culture | Objects | Transportation | Entertainment | Brands | Sports |
|---|---|---|---|---|---|---|---|---|---|---|---|
We introduce WorldVQA, a benchmark designed to evaluate the factual correctness and atomic vision-centric world knowledge of Multimodal Large Language Models (MLLMs). Current evaluations often conflate visual knowledge retrieval with reasoning. In contrast, WorldVQA decouples these capabilities to strictly measure "what the model memorizes." The benchmark assesses the atomic capability of grounding and naming visual entities across a stratified taxonomy, spanning from common head-class objects to long-tail rarities. We hope WorldVQA serves as a rigorous test for visual factuality, thereby establishing a standard for assessing the encyclopedic breadth and hallucination rates of current and next-generation frontier models.
WorldVQA Overview. The benchmark is organized into nine categories: Nature & Environment (Nature); Locations & Architecture (Geography); Culture, Arts & Crafts (Culture); Objects & Products (Objects); Vehicles, Craft & Transportation (Transportation); Entertainment, Media & Gaming (Entertainment); Brands, Logos & Graphic Design (Brands); Sports, Gear & Venues (Sports); Notable People & Public Figures (People).
| Statistics | Number / Percentage |
|---|---|
| Data | 3500 |
| - Chinese (CN) | 1260 (36%) |
| - English (EN) | 2240 (64%) |
| Categories | |
| - Nature & Environment (Nature) | 9.31% |
| - Locations & Architecture (Geography) | 14.63% |
| - Culture, Arts & Crafts (Culture) | 14.46% |
| - Objects & Products (Objects) | 12.49% |
| - Vehicles, Craft & Transportation (Transportation) | 8.74% |
| - Entertainment, Media & Gaming (Entertainment) | 14.60% |
| - Brands, Logos & Graphic Design (Brands) | 7.43% |
| - Sports, Gear & Venues (Sports) | 4.06% |
| - Notable People & Public Figures (People) | 14.29% |
| Difficulty | |
| - Easy | 31.17% |
| - Medium | 40.77% |
| - Hard | 28.07% |
WorldVQA Statistics. Dataset size, language split, and distribution across nine semantic categories and three difficulty tiers.
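
The table reports category and difficulty shares only as percentages. As a rough sanity check, the sketch below converts them into approximate item counts and confirms that each grouping sums to about 100%. The numbers are copied from the table; the variable names, the `estimated_counts` helper, and the per-category counts it produces are illustrative assumptions, not figures reported by the benchmark.

```python
# Figures copied from the WorldVQA statistics table.
TOTAL = 3500
LANGUAGE_COUNTS = {"CN": 1260, "EN": 2240}

CATEGORY_PCT = {
    "Nature": 9.31,
    "Geography": 14.63,
    "Culture": 14.46,
    "Objects": 12.49,
    "Transportation": 8.74,
    "Entertainment": 14.60,
    "Brands": 7.43,
    "Sports": 4.06,
    "People": 14.29,
}

DIFFICULTY_PCT = {"Easy": 31.17, "Medium": 40.77, "Hard": 28.07}


def estimated_counts(percentages, total=TOTAL):
    """Convert percentage shares into rounded item counts.

    Hypothetical helper: the results are estimates derived from rounded
    percentages, not counts published with the benchmark.
    """
    return {name: round(total * pct / 100) for name, pct in percentages.items()}


category_counts = estimated_counts(CATEGORY_PCT)
difficulty_counts = estimated_counts(DIFFICULTY_PCT)

# Published percentages are rounded to two decimals, so allow a small drift.
print("category share total:  ", round(sum(CATEGORY_PCT.values()), 2))
print("difficulty share total:", round(sum(DIFFICULTY_PCT.values()), 2))

for name, count in category_counts.items():
    print(f"{name:15s} ~{count}")
for name, count in difficulty_counts.items():
    print(f"{name:15s} ~{count}")
```

With the table's figures, the rounded estimates for both groupings add back up to 3500, which suggests the percentages are consistent with the stated dataset size. They should still be treated as estimates, since the source only gives shares rounded to two decimals.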