COCO (Common Objects in Context)
Overview
MS COCO is a large-scale dataset designed to advance scene understanding by focusing on objects in complex, everyday environments. Unlike earlier object-centric datasets, COCO emphasizes contextualized object instances, making it a standard benchmark for object detection, instance segmentation, and image captioning, as well as a key resource for vision–language research.
Contents
- Modality: Images, text
- Annotations / Labels:
  - Bounding boxes
  - Instance segmentation masks
  - Object category labels (80 classes)
  - Human-written image captions
- Size:
  - ~330,000 images total
  - ~200,000 images with full annotations
- Format:
  - Images: JPEG
  - Annotations: JSON
- Structure:
  - Train, validation, and test splits
  - Task-specific annotation files
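Each annotation file follows a simple relational layout: top-level `images`, `annotations`, and `categories` arrays linked by integer ids. A minimal sketch of parsing a detection-style file with the standard library (the field names match the published COCO format; the tiny inline record is illustrative, not real data):

```python
import json

# A minimal instances-style annotation file (illustrative values only).
coco_json = json.dumps({
    "images": [{"id": 1, "file_name": "000000000001.jpg",
                "width": 640, "height": 480}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 18,
         "bbox": [10.0, 20.0, 100.0, 50.0],  # [x, y, width, height]
         "area": 5000.0, "iscrowd": 0}
    ],
    "categories": [{"id": 18, "name": "dog", "supercategory": "animal"}],
})

data = json.loads(coco_json)

# Index categories by id and group annotations by image id.
cat_names = {c["id"]: c["name"] for c in data["categories"]}
anns_by_image = {}
for ann in data["annotations"]:
    anns_by_image.setdefault(ann["image_id"], []).append(ann)

for img in data["images"]:
    for ann in anns_by_image.get(img["id"], []):
        x, y, w, h = ann["bbox"]
        print(img["file_name"], cat_names[ann["category_id"]], (x, y, w, h))
```

In practice most users load these files through the `pycocotools` COCO API rather than raw JSON, but the underlying layout is exactly this id-linked structure.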
Typical Uses
- Object detection
- Instance segmentation
- Image captioning
- Vision–language model training
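For the captioning and vision–language tasks, the captions annotation file uses the same relational layout, pairing each `image_id` with a free-text `caption` (COCO provides several captions per image). A sketch of grouping captions per image, with invented sample text:

```python
import json
from collections import defaultdict

# A minimal captions-style annotation file (illustrative values only).
captions_json = json.dumps({
    "images": [{"id": 1, "file_name": "000000000001.jpg"}],
    "annotations": [
        {"id": 100, "image_id": 1, "caption": "A dog lying on a couch."},
        {"id": 101, "image_id": 1, "caption": "A brown dog rests indoors."},
    ],
})

data = json.loads(captions_json)

# Group all reference captions by image id, as captioning metrics expect.
captions = defaultdict(list)
for ann in data["annotations"]:
    captions[ann["image_id"]].append(ann["caption"])
```

Having multiple reference captions per image is what makes COCO usable for caption evaluation metrics that compare a generated sentence against several human references.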
Notable Features
- Objects appear in realistic, cluttered scenes
- Dense, high-quality human annotations
- Supports multiple vision and multimodal tasks
Limitations
- Limited object vocabulary (80 categories)
- Annotation density increases training and storage costs
- Category imbalance across classes
Access
- Available from the official project site, cocodataset.org
License / Source Information
- Dataset Owner: Microsoft
- Images: sourced from Flickr under a mix of Creative Commons licenses; downstream use must respect each original image license
- Annotations: released under a Creative Commons Attribution 4.0 (CC BY 4.0) license, permitting research and commercial use