C4 Dataset

C4 Colossal Clean Crawled Corpus

Name: C4 (Colossal Clean Crawled Corpus)
Brand: Ace Data Cloud

C4 (Colossal Clean Crawled Corpus) is a large-scale, cleaned web-crawl dataset. It was introduced by Google as the pre-training corpus for the T5 model, and a publicly downloadable version was prepared and released by the Allen Institute for AI (Allen AI). With over 10.4 billion records, it has become one of the most influential datasets in the NLP field. The corpus is sourced from Common Crawl and has undergone extensive filtering, deduplication, and quality control.

10.4 billion records, 750GB text, ODC-By 1.0 license, T5 training data

📊

10.4B

Total Record Count

💾

750GB

Text Data Volume

📜

ODC-By 1.0

Open License Agreement

🤖

T5

Training Data Source

Dataset Highlights

A large-scale, high-quality web corpus and a cornerstone dataset for NLP pre-training and research

🌐

Web-Scale Corpus

Over 10.4 billion cleaned web text records, covering a vast array of diverse content on the internet, providing ample training data for large-scale language models.

🔬

Quality Filtering

Uses extensive heuristic rules and model-based quality-filtering pipelines so that the text in the corpus has high linguistic quality and information density.
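
As a rough illustration of what heuristic filtering of this kind can look like, the sketch below applies a few line-level and page-level rules to a single page. The specific rules and thresholds (`MIN_WORDS_PER_LINE`, `MIN_SENTENCES_PER_PAGE`) are assumptions chosen for the example, loosely in the spirit of commonly described web-text cleaning heuristics. It is not the pipeline actually used to build the corpus.

```python
import re
from typing import Optional

TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 3      # assumed threshold, for illustration only
MIN_SENTENCES_PER_PAGE = 5  # assumed threshold, for illustration only


def clean_page(text: str) -> Optional[str]:
    """Return the cleaned page text, or None if the page should be dropped."""
    # Drop placeholder pages and pages that look like source code.
    if "lorem ipsum" in text.lower() or "{" in text:
        return None

    kept_lines = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # keep only lines that end like a sentence
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue  # drop very short lines such as menu items
        if "javascript" in line.lower():
            continue  # drop "please enable JavaScript" style notices
        kept_lines.append(line)

    cleaned = "\n".join(kept_lines)
    # Rough sentence count: split after terminal punctuation plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if s]
    if len(sentences) < MIN_SENTENCES_PER_PAGE:
        return None
    return cleaned
```

Calling `clean_page(page_text)` on a raw page returns either the retained lines joined by newlines or `None` when the page fails a page-level rule.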

🌍

Language Detection

A reliable language-identification step filters for English content, keeping the corpus linguistically consistent and effective for model training.

🔄

Deduplication

Aggressive deduplication at the document and paragraph levels eliminates redundant data, improving training efficiency and model quality.
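
To make the idea of document-level and paragraph-level deduplication concrete, here is a minimal sketch based on exact hashing of normalized text. The function names, the blank-line paragraph split, and the normalization rule are all assumptions for illustration. The real corpus pipeline may use different units and matching rules.

```python
import hashlib


def _fingerprint(text: str) -> str:
    # Normalize case and whitespace so trivially different copies collide.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Drop repeated documents, then repeated paragraphs across the corpus."""
    seen_docs = set()
    seen_paragraphs = set()
    result = []
    for doc in documents:
        doc_key = _fingerprint(doc)
        if doc_key in seen_docs:
            continue  # duplicate document after normalization
        seen_docs.add(doc_key)

        unique_paragraphs = []
        for paragraph in doc.split("\n\n"):
            if not paragraph.strip():
                continue
            key = _fingerprint(paragraph)
            if key in seen_paragraphs:
                continue  # paragraph already seen earlier in the corpus
            seen_paragraphs.add(key)
            unique_paragraphs.append(paragraph.strip())

        if unique_paragraphs:
            result.append("\n\n".join(unique_paragraphs))
    return result
```

Keeping the hash sets in memory is only practical for small samples. At corpus scale the same idea would normally be run as a distributed job.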

🛡️

Content Filtering

Removes offensive content, boilerplate text, and low-quality content, keeping the corpus safe and usable across a range of research scenarios.
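
One simple way to approximate this kind of content filtering is a word blocklist applied per page. The sketch below assumes a tiny placeholder blocklist and whole-word matching. The terms, the `BLOCKLIST` name, and the drop-the-whole-page policy are illustrative assumptions, not the rules used for the actual corpus.

```python
import re

# Placeholder terms only; a real pipeline would load a curated blocklist.
BLOCKLIST = frozenset({"badword", "slur"})
_WORD_RE = re.compile(r"[a-z']+")


def is_allowed(text, blocklist=BLOCKLIST):
    """Return True if no blocklisted term appears as a whole word."""
    words = set(_WORD_RE.findall(text.lower()))
    return words.isdisjoint(blocklist)


def filter_pages(pages, blocklist=BLOCKLIST):
    """Keep only the pages that pass the blocklist check."""
    return [page for page in pages if is_allowed(page, blocklist)]
```

Whole-word matching avoids rejecting pages where a blocked term only appears inside a longer, unrelated word.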

📖

Open License

Released under the ODC-By 1.0 open license agreement, supporting research and commercial use, providing convenient data access for academia and industry.

Applicable Scenarios

From model pre-training to task evaluation, C4 plays a core role in various fields of NLP

🧠

LLM Pre-training

Serves as foundational training data for large language models, providing a massive, high-quality text corpus for models such as T5.

🏷️

Text Classification

Training classifiers on diverse web content that covers text categories such as news, technology, and education.

🔍

Information Extraction

Extracting structured information from unstructured web text, supporting tasks such as entity recognition and relation extraction.

📐

Benchmark Development

Creating evaluation datasets using diverse web sources to provide standardized benchmarks for comparing NLP model performance.

NLP web-crawl pre-training text T5

Quick Start with C4

Access the C4 dataset through the API. A Python example follows.

PYTHON

import requests

url = "https://api.acedata.cloud/datasets/c4"
headers = {
    "Authorization": "Bearer YOUR_API_TOKEN",  # Replace with your API token
    "Content-Type": "application/json",
}
params = {
    "count": 10,  # Number of records to retrieve
    "offset": 0,  # Starting position
}

response = requests.get(url, headers=headers, params=params, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors
data = response.json()

# Print retrieved records
for record in data.get("results", []):
    print(record["text"][:200])
    print("---")
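
To read more than one batch, the `count` and `offset` parameters from the example can be combined into a simple pagination loop. The sketch below is one possible approach. The helper names `iter_records` and `make_http_fetcher` are hypothetical, and treating an empty `results` list as the end of the data is an assumption about the API's behavior.

```python
from typing import Callable, Dict, Iterator, List


def iter_records(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 100,
    max_records: int = 1000,
) -> Iterator[Dict]:
    """Yield up to max_records records, requesting one page at a time."""
    offset = 0
    yielded = 0
    while yielded < max_records:
        count = min(page_size, max_records - yielded)
        page = fetch_page(count, offset)
        if not page:
            break  # assumed: an empty page means there is no more data
        for record in page[:count]:
            yield record
            yielded += 1
        offset += len(page)


def make_http_fetcher(token: str, url: str = "https://api.acedata.cloud/datasets/c4"):
    """Build a fetch_page function that calls the endpoint from the example."""
    import requests  # imported here so iter_records works without it

    def fetch_page(count: int, offset: int) -> List[Dict]:
        response = requests.get(
            url,
            headers={"Authorization": f"Bearer {token}"},
            params={"count": count, "offset": offset},
            timeout=30,
        )
        response.raise_for_status()
        return response.json().get("results", [])

    return fetch_page
```

For example, `for record in iter_records(make_http_fetcher("YOUR_API_TOKEN"), page_size=50, max_records=200)` would issue up to four requests. Passing the fetch function in as an argument also makes the loop easy to test without network access.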

3 Steps to Get Started Quickly

From registration to your first API call, you can start using the C4 dataset in just a few minutes.

Register an Account

Register your Ace Data Cloud account at platform.acedata.cloud and complete developer identity verification.

Obtain API Key

Create API credentials in the console to obtain your API token for subsequent dataset API calls.

Call Dataset API

Use the API token to call the C4 dataset API and retrieve high-quality web text data as needed for your NLP project.
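
Once a token has been created, one common pattern is to keep it out of source code and read it from the environment when building request headers. The sketch below assumes an environment variable named `ACEDATA_API_TOKEN` and a helper called `build_headers`. Both names are hypothetical choices for this example, not something the platform prescribes.

```python
import os


def build_headers(env=None, var_name="ACEDATA_API_TOKEN"):
    """Build request headers from a token stored in an environment variable."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your API token."
        )
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed as the `headers` argument of `requests.get`, in the same way as the hard-coded headers in the quick-start example.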

Start Exploring the C4 Dataset

10.4 billion high-quality web texts, open license, instant API access. Whether you are training large language models or building NLP applications, C4 is your ideal data source.