COCO (Common Objects in Context)

Overview

MS COCO is a large-scale dataset designed to advance scene understanding by focusing on objects in complex, everyday environments. Unlike earlier object-centric datasets, COCO emphasizes contextualized object instances, making it a standard benchmark for object detection, instance segmentation, and image captioning, as well as a key resource for vision–language research.

Contents

  • Modality: Images, text
  • Annotations / Labels:
    • Bounding boxes
    • Instance segmentation masks
    • Object category labels (80 classes)
    • Human-written image captions
  • Size:
    • ~330,000 images total
    • ~200,000 images with full annotations
  • Format:
    • Images: JPEG
    • Annotations: JSON
  • Structure:
    • Train, validation, and test splits
    • Task-specific annotation files
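
The JSON annotation files share a common structure built around three top-level arrays: `images`, `categories`, and `annotations`, with bounding boxes stored as `[x, y, width, height]` in pixels. The sketch below parses a minimal, hand-made sample in that schema; the field names follow the official format, but the concrete values are invented for illustration.

```python
import json

# Minimal illustrative subset of the COCO annotation schema.
# Field names follow the official format; values are made up.
coco_sample = {
    "images": [
        {"id": 1, "file_name": "000000000001.jpg", "width": 640, "height": 480}
    ],
    "categories": [
        {"id": 18, "name": "dog", "supercategory": "animal"}
    ],
    "annotations": [
        {
            "id": 101,
            "image_id": 1,
            "category_id": 18,
            # COCO bounding boxes are [x, y, width, height] in pixels,
            # with (x, y) the top-left corner.
            "bbox": [120.0, 80.0, 200.0, 150.0],
            "area": 200.0 * 150.0,
            "iscrowd": 0,
        }
    ],
}

# Round-trip through JSON, as if reading an annotation file from disk.
data = json.loads(json.dumps(coco_sample))

# Index categories by id, then report each annotation in readable form.
cat_by_id = {c["id"]: c["name"] for c in data["categories"]}
for ann in data["annotations"]:
    x, y, w, h = ann["bbox"]
    print(f"image {ann['image_id']}: {cat_by_id[ann['category_id']]} "
          f"at ({x}, {y}), size {w}x{h}")
```

In practice the same structure is usually consumed through the `pycocotools` library rather than raw `json`, but the underlying files are exactly this kind of dictionary.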

Typical Uses

  • Object detection
  • Instance segmentation
  • Image captioning
  • Vision–language model training

Notable Features

  • Objects appear in realistic, cluttered scenes
  • Dense, high-quality human annotations
  • Supports multiple vision and multimodal tasks

Limitations

  • Limited object vocabulary (80 categories)
  • Annotation density increases training and storage costs
  • Category imbalance across classes

Access

License / Source Information

  • Dataset Owner: Microsoft
  • License: Images are distributed under a mix of Creative Commons licenses; the annotations are released for both research and commercial use, subject to the licenses of the original images