MMC4 Multimodal Dataset
MMC4 (Multimodal C4) is a large-scale multimodal dataset that augments the text-only C4 corpus with images interleaved in the text. It contains over 100 million documents and 585 million images, and preserves the original interleaved arrangement of text and images on the source web pages.
Dataset Highlights
A multimodal corpus at web scale, supporting the training of next-generation vision-language models
Web-scale collection
Over 100 million documents containing interleaved images and text, covering a wide range of web content.
Built on C4
Extends the established Colossal Clean Crawled Corpus (C4) and inherits its text-cleaning pipeline.
CLIP-based alignment filtering
Uses CLIP similarity scores to check image-text alignment, so that each retained image is semantically relevant to its surrounding text.
Natural interleaved structure
Fully preserves the original arrangement of text and images on web pages, with images embedded in their natural positions within documents, rather than simple image-text pairing.
Deduplication pipeline
Cross-document deduplication mechanisms effectively reduce redundant data, enhance training efficiency, and prevent models from overfitting to duplicate content.
Linear assignment matching
Matches images to sentences by solving a linear assignment problem over image-sentence similarity scores, which selects the best associated sentence for each image in the document.
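The similarity filtering and image-sentence matching described above can be sketched as a small linear assignment problem. This is a minimal illustration, not the MMC4 pipeline itself. It assumes embeddings are already available as plain vectors (toy 2-D vectors stand in for CLIP features), uses exhaustive search instead of a proper assignment solver, and the names `cosine`, `assign_images`, and the `min_sim` threshold of 0.2 are all hypothetical.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_images(image_vecs, sentence_vecs, min_sim=0.2):
    """Match each image to a distinct sentence, maximizing total similarity.

    Exhaustive search over permutations, so only viable for tiny inputs.
    Requires len(image_vecs) <= len(sentence_vecs).
    Returns {image_index: sentence_index}; pairs scoring below min_sim
    (a hypothetical threshold) are dropped.
    """
    sims = [[cosine(i, s) for s in sentence_vecs] for i in image_vecs]
    best = max(
        permutations(range(len(sentence_vecs)), len(image_vecs)),
        key=lambda perm: sum(sims[i][j] for i, j in enumerate(perm)),
    )
    return {i: j for i, j in enumerate(best) if sims[i][j] >= min_sim}


# Toy stand-ins for embeddings: two images, three sentences.
images = [[1.0, 0.0], [0.0, 1.0]]
sentences = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]
matches = assign_images(images, sentences)
for image_idx, sentence_idx in sorted(matches.items()):
    print(f"image {image_idx} -> sentence {sentence_idx}")
```

Because each sentence can receive at most one image, the assignment is solved jointly rather than image by image. For documents of realistic size, a polynomial-time solver such as the Hungarian algorithm would replace the brute-force search.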
Applicable Scenarios
From multimodal pre-training to interleaved content generation, empowering cutting-edge research
In-context learning
Training models for few-shot multimodal reasoning, using interleaved images and text to develop in-context learning capabilities
Multimodal pre-training
Pre-training vision-language foundation models on diverse web content
Image-text alignment research
Studying the semantic relationships between visual content and textual content, advancing cross-modal understanding technology
Open-ended generation
Generating documents that contain coherent text and reasonable image placements, achieving interleaved multimodal content creation
Quick Start with MMC4
Quickly access the MMC4 dataset through the Ace Data Cloud API
    import requests

    # Set your API token
    API_TOKEN = "your_api_token_here"

    # Request MMC4 dataset
    response = requests.get(
        "https://api.acedata.cloud/datasets/mmc4",
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json"
        },
        params={
            "limit": 10,
            "offset": 0
        }
    )

    # Parse the response
    data = response.json()
    for doc in data.get("documents", []):
        print(f"Document ID: {doc['id']}")
        print(f"Text length: {len(doc['text'])} chars")
        print(f"Images: {len(doc['images'])} items")
        print("---")
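The request above fetches a single page. To walk through more documents, the `limit` and `offset` parameters can be advanced in a loop, roughly as sketched below. This is a sketch under assumptions: it assumes an empty `documents` list marks the end of the data, which the page does not state, and the helpers `iter_documents` and `make_fetcher` are hypothetical names, not part of any official client.

```python
def iter_documents(fetch_page, limit=10, max_docs=None):
    """Yield documents page by page using limit/offset.

    fetch_page(limit, offset) must return the parsed JSON body as a dict.
    Stops on an empty page (assumed end-of-data signal) or after max_docs.
    """
    if max_docs is not None and max_docs <= 0:
        return
    offset = 0
    yielded = 0
    while True:
        page = fetch_page(limit, offset).get("documents", [])
        if not page:
            return
        for doc in page:
            yield doc
            yielded += 1
            if max_docs is not None and yielded >= max_docs:
                return
        offset += len(page)


def make_fetcher(api_token, url="https://api.acedata.cloud/datasets/mmc4"):
    """Build a fetch_page function backed by the HTTP endpoint shown above."""
    import requests  # imported lazily so the sketch loads without it

    def fetch_page(limit, offset):
        response = requests.get(
            url,
            headers={"Authorization": f"Bearer {api_token}"},
            params={"limit": limit, "offset": offset},
            timeout=30,
        )
        response.raise_for_status()  # surface HTTP errors early
        return response.json()

    return fetch_page


# Offline demonstration with an in-memory stand-in for the API.
corpus = [{"id": i} for i in range(25)]


def fake_fetch(limit, offset):
    return {"documents": corpus[offset:offset + limit]}


print(len(list(iter_documents(fake_fetch, limit=10))))
```

Passing `make_fetcher(API_TOKEN)` in place of `fake_fetch` would run the same loop against the live endpoint. Keeping the HTTP call behind a function argument makes the paging logic testable without network access.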
3 Steps to Get Started Quickly
From registration to usage, you can start accessing MMC4 data in just a few minutes.
Register an Account
Visit platform.acedata.cloud to register for an Ace Data Cloud platform account and quickly complete identity verification.
Obtain API Key
Create an API key in the console to get your own Bearer token for API authentication.
Call the Dataset API
Use the API endpoint /datasets/mmc4 to start querying and downloading MMC4 multimodal data.
Start Exploring MMC4 Multimodal Data
MMC4 offers over 100 million documents, 585 million images, and 43 billion text tokens, making it a strong fit for both multimodal pre-training and research on interleaved content generation.
