Multimodal C4 Dataset

MMC4 Multimodal Dataset

MMC4 (Multimodal C4) is a large-scale multimodal dataset that augments the text-only C4 corpus with images interleaved in the text. It contains over 100 million documents and 585 million images, and it preserves the original interleaved arrangement of text and images from the source web pages.

📄 103M documents
🖼️ 585M images
📝 43B text tokens
🔓 Open for research use
Source: AI2 research team

Dataset Highlights

A multimodal corpus at web scale, supporting the training of next-generation vision-language models

🌐

Web-scale collection

Over 100 million documents with interleaved images and text, covering a wide range of web content.

📚

Built on C4

Extends the established Colossal Clean Crawled Corpus (C4) and inherits its text cleaning process.

🎯

CLIP-based alignment filtering

Uses CLIP similarity scores to check the quality of image-text alignment, so that retained images are semantically related to the surrounding text.
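As a minimal sketch of similarity-based filtering, the snippet below computes the cosine similarity between an image embedding and a text embedding and keeps the pair only when the score clears a threshold. The toy vectors, the helper names, and the 0.5 threshold are illustrative assumptions; in practice the vectors would come from a CLIP image encoder and text encoder, and the cutoff used for MMC4 is not stated on this page.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def keep_pair(image_emb, text_emb, threshold=0.5):
    """Keep an image-text pair only if its similarity reaches the threshold.

    The default threshold is a placeholder, not a value from MMC4.
    """
    return cosine_similarity(image_emb, text_emb) >= threshold


# Toy embeddings standing in for real CLIP outputs.
image_emb = [0.9, 0.1, 0.0]
related_text = [0.8, 0.2, 0.1]
unrelated_text = [0.0, 0.1, 0.9]

print(keep_pair(image_emb, related_text))
print(keep_pair(image_emb, unrelated_text))
```

With these toy vectors the related pair passes the filter and the unrelated pair is dropped.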

🔀

Natural interleaved structure

Fully preserves the original arrangement of text and images on web pages, with images embedded in their natural positions within documents, rather than simple image-text pairing.
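To make the difference from plain image-text pairs concrete, here is a small sketch of how an interleaved document could be represented and flattened into reading order. The field names (`text_list`, `image_info`, `matched_text_index`, `image_name`) and the choice to place each image just before its matched sentence are assumptions for illustration; check them against the actual release files before relying on them.

```python
def interleave(doc):
    """Flatten a document into an ordered list of ("image", name) and ("text", sentence) items."""
    # Group image names by the index of the sentence they are attached to.
    images_by_sentence = {}
    for info in doc["image_info"]:
        images_by_sentence.setdefault(info["matched_text_index"], []).append(
            info["image_name"]
        )

    sequence = []
    for idx, sentence in enumerate(doc["text_list"]):
        # Emit any images attached to this sentence, then the sentence itself.
        for name in images_by_sentence.get(idx, []):
            sequence.append(("image", name))
        sequence.append(("text", sentence))
    return sequence


# Hypothetical document; the content and file names are made up.
doc = {
    "text_list": [
        "How to bake bread.",
        "Knead the dough for ten minutes.",
        "Bake until golden.",
    ],
    "image_info": [
        {"image_name": "dough.jpg", "matched_text_index": 1},
        {"image_name": "loaf.jpg", "matched_text_index": 2},
    ],
}

for kind, value in interleave(doc):
    print(kind, value)
```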

🔧

Deduplication pipeline

Cross-document deduplication reduces redundant data, which improves training efficiency and helps keep models from overfitting to duplicate content.
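The idea of cross-document deduplication can be sketched with exact matching on normalized text. This is only an illustration under simple assumptions (lowercasing, whitespace collapsing, SHA-256 hashing); it is not a description of the method the MMC4 pipeline actually uses, and the function names are hypothetical.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(documents):
    """Return documents in their original order, keeping the first copy of each duplicate."""
    seen = set()
    unique = []
    for text in documents:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


docs = ["A cat on a mat.", "a  cat on a MAT.", "A dog in a park."]
print(deduplicate(docs))
```

Storing digests instead of full texts keeps memory use bounded per document, which matters at corpus scale; near-duplicate detection would need a different technique.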

📐

Linear assignment matching

Matches images to sentences by solving a linear assignment problem over image-sentence similarity scores, which determines the best associated sentence for each image in the document.
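Image-sentence matching can be framed as an assignment problem: given a similarity score for every image-sentence pair, choose one distinct sentence per image so that the total similarity is as high as possible. The sketch below solves a tiny instance by brute force to show why this differs from picking the best sentence for each image greedily. The similarity values and the function name are made up, and the sketch assumes there are at least as many sentences as images; for realistic sizes a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations


def assign_images(similarity):
    """Assign each image to a distinct sentence, maximizing total similarity.

    similarity[i][j] is the score between image i and sentence j.
    Returns (assignment, total) where assignment[i] is the sentence index for image i.
    Brute force: only suitable for very small matrices.
    """
    n_images = len(similarity)
    n_sentences = len(similarity[0]) if similarity else 0
    if n_images > n_sentences:
        raise ValueError("this sketch assumes at most one image per sentence")

    best_total, best_assignment = float("-inf"), None
    # Try every way of giving the images distinct sentences.
    for perm in permutations(range(n_sentences), n_images):
        total = sum(similarity[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_assignment = total, list(perm)
    return best_assignment, best_total


# Hypothetical scores: rows are images, columns are sentences.
similarity = [
    [0.30, 0.25, 0.05],  # image 0
    [0.28, 0.10, 0.07],  # image 1
]

assignment, total = assign_images(similarity)
print(assignment, round(total, 2))
```

Here a greedy pass would give image 0 its top sentence (index 0) and leave image 1 with a poor match, for a total of 0.40. The assignment solution instead pairs image 0 with sentence 1 and image 1 with sentence 0, for a total of 0.53.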

Applicable Scenarios

From multimodal pre-training to interleaved content generation, empowering cutting-edge research

🧠

In-context learning

Train models for few-shot multimodal reasoning, using interleaved image-text sequences to develop in-context learning capabilities

🏗️

Multimodal pre-training

Pre-train on diverse web content to build strong vision-language foundation models

🔗

Image-text alignment research

Studying the semantic relationships between visual content and textual content, advancing cross-modal understanding technology

✍️

Open-ended generation

Generating documents that contain coherent text and reasonable image placements, achieving interleaved multimodal content creation

Tags: multimodal, interleaved, web-crawl, NLP, images

Quick Start with MMC4

Quickly access the MMC4 dataset through the Ace Data Cloud API

Python
    import requests

    # Set your API token
    API_TOKEN = "your_api_token_here"

    # Request a page of MMC4 documents
    response = requests.get(
        "https://api.acedata.cloud/datasets/mmc4",
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        params={"limit": 10, "offset": 0},
        timeout=30,
    )
    response.raise_for_status()  # stop early on HTTP errors such as 401 or 404

    # Parse the response
    data = response.json()
    for doc in data.get("documents", []):
        print(f"Document ID: {doc['id']}")
        print(f"Text length: {len(doc['text'])} chars")
        print(f"Images: {len(doc['images'])} items")
        print("---")

3 Steps to Get Started Quickly

From registration to usage, you can start accessing MMC4 data in just a few minutes.

01

Register an Account

Visit platform.acedata.cloud to register for an Ace Data Cloud platform account and quickly complete identity verification.

02

Obtain API Key

Create an API key in the console to get your own Bearer token for API authentication.

03

Call the Dataset API

Use the API endpoint /datasets/mmc4 to start querying and downloading MMC4 multimodal data.
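Since the endpoint is queried with `limit` and `offset` parameters, reading more than one page could look like the sketch below. The `limit` and `offset` parameters and the `documents` key are taken from the Quick Start example; treating an empty page as the end of the data is an assumption, and `fetch_page` is a hypothetical callable that you would implement with your HTTP client of choice. A stand-in fetcher is used here so the example runs without network access.

```python
def iter_documents(fetch_page, limit=10, max_pages=None):
    """Yield documents page by page until an empty page comes back.

    fetch_page(limit=..., offset=...) must return a dict with a "documents" list.
    max_pages optionally caps the number of requests.
    """
    offset = 0
    pages = 0
    while max_pages is None or pages < max_pages:
        page = fetch_page(limit=limit, offset=offset)
        documents = page.get("documents", [])
        if not documents:
            break  # assumed end-of-data signal
        yield from documents
        offset += limit
        pages += 1


# Stand-in for a real HTTP call: serves slices of an in-memory list.
FAKE_DOCUMENTS = [{"id": i} for i in range(25)]


def fake_fetch_page(limit, offset):
    return {"documents": FAKE_DOCUMENTS[offset:offset + limit]}


ids = [doc["id"] for doc in iter_documents(fake_fetch_page, limit=10)]
print(len(ids))
```

Passing the fetcher in as an argument keeps the paging logic independent of authentication and transport details, so it can be tested offline and reused with a real request function.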

Start Exploring MMC4 Multimodal Data

MMC4 offers over 100 million documents, 585 million images, and 43 billion text tokens. It is well suited to both multimodal pre-training and research on interleaved content generation.