COCO (Common Objects in Context)

Overview

MS COCO is a large-scale dataset designed to advance scene understanding by focusing on objects in complex, everyday environments. Unlike earlier object-centric datasets, COCO emphasizes contextualized object instances, making it a standard benchmark for object detection, instance segmentation, and image captioning, as well as a key resource for vision–language research.

Contents

  • Modality: Images, text
  • Annotations / Labels:
    • Bounding boxes
    • Instance segmentation masks
    • Object category labels (80 classes)
    • Human-written image captions
  • Size:
    • ~330,000 images total
    • ~200,000 images with full annotations
  • Format:
    • Images: JPEG
    • Annotations: JSON
  • Structure:
    • Train, validation, and test splits
    • Task-specific annotation files
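
The JSON annotation files share a common structure built around three top-level arrays: `images`, `categories`, and `annotations`, with bounding boxes stored as `[x, y, width, height]` in pixels. The sketch below parses a minimal, hand-made sample in that schema; the field names follow the official format, but the concrete values are invented for illustration.

```python
import json

# Minimal illustrative subset of the COCO annotation schema.
# Field names follow the official format; values are made up.
coco_sample = {
    "images": [
        {"id": 1, "file_name": "000000000001.jpg", "width": 640, "height": 480}
    ],
    "categories": [
        {"id": 18, "name": "dog", "supercategory": "animal"}
    ],
    "annotations": [
        {
            "id": 101,
            "image_id": 1,
            "category_id": 18,
            # COCO bounding boxes are [x, y, width, height] in pixels,
            # with (x, y) the top-left corner.
            "bbox": [120.0, 80.0, 200.0, 150.0],
            "area": 200.0 * 150.0,
            "iscrowd": 0,
        }
    ],
}

# Round-trip through JSON, as if reading an annotation file from disk.
data = json.loads(json.dumps(coco_sample))

# Index categories by id, then report each annotation in readable form.
cat_by_id = {c["id"]: c["name"] for c in data["categories"]}
for ann in data["annotations"]:
    x, y, w, h = ann["bbox"]
    print(f"image {ann['image_id']}: {cat_by_id[ann['category_id']]} "
          f"at ({x}, {y}), size {w}x{h}")
```

In practice the same structure is usually consumed through the `pycocotools` library rather than raw `json`, but the underlying files are exactly this kind of dictionary.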

Typical Uses

  • Object detection
  • Instance segmentation
  • Image captioning
  • Vision–language model training

Notable Features

  • Objects appear in realistic, cluttered scenes
  • Dense, high-quality human annotations
  • Supports multiple vision and multimodal tasks

Limitations

  • Limited object vocabulary (80 categories)
  • Annotation density increases training and storage costs
  • Category imbalance across classes

Access

License / Source Information

  • Dataset Owner: Microsoft
  • License: Images are distributed under a mix of Creative Commons licenses; the annotations are released for both research and commercial use, subject to the licenses of the original images