C4 Colossal Clean Crawled Corpus
C4 (Colossal Clean Crawled Corpus) is a large-scale, cleaned web-crawl dataset. It was originally created by Google as the primary training data for the T5 model, and a reproduced version was publicly released by the Allen Institute for AI (AllenAI). With over 10.4 billion records, it has become one of the most influential datasets in the NLP field. The corpus is sourced from Common Crawl and has undergone extensive filtering, deduplication, and quality control.
Dataset Highlights
A large-scale high-quality web corpus, a cornerstone dataset for NLP pre-training and research
Web-Scale Corpus
Over 10.4 billion cleaned web text records covering a wide range of internet content, providing ample training data for large-scale language models.
Quality Filtering
Applies extensive heuristic filtering rules, such as keeping only lines that end in terminal punctuation and dropping very short pages, so that the text in the corpus has high linguistic quality and information density.
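To make the idea of heuristic filtering concrete, here is a minimal sketch of line-level and page-level rules of the kind described for C4. The function name `clean_page` and the default thresholds are illustrative assumptions, not the exact pipeline used to build the corpus.

```python
import re

# Lines must end with one of these to count as a complete sentence.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def clean_page(text, min_words_per_line=5, min_sentences=3):
    """Return the cleaned page text, or None if the page should be dropped.

    The thresholds are illustrative defaults, not official values.
    """
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Drop menu items, headings, and other fragments.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        # Drop lines that are too short to carry much information.
        if len(line.split()) < min_words_per_line:
            continue
        kept.append(line)

    cleaned = "\n".join(kept)
    # Rough sentence count: each run of terminal punctuation ends a sentence.
    if len(re.findall(r"[.!?]+", cleaned)) < min_sentences:
        return None
    return cleaned
```

Calling `clean_page(raw_text)` on a crawled page returns only the sentence-like lines, or `None` when too little usable text remains.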
Language Detection
A reliable language identification step retains only English content, keeping the corpus linguistically consistent and effective for model training.
Deduplication
Aggressive deduplication at the document and paragraph levels removes redundant data, improving training efficiency and model quality.
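The following sketch shows one simple way to deduplicate at both the document and paragraph levels using hashed fingerprints. It assumes paragraphs are separated by blank lines and that exact matching after case and whitespace normalization is sufficient. The names `_fingerprint` and `deduplicate` are hypothetical, and the actual C4 pipeline may differ.

```python
import hashlib


def _fingerprint(text):
    # Normalize case and whitespace so trivial variants hash identically.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Yield documents with exact duplicates and repeated paragraphs removed."""
    seen_docs = set()
    seen_paragraphs = set()
    for doc in documents:
        doc_key = _fingerprint(doc)
        if doc_key in seen_docs:
            continue  # The whole document was already emitted.
        seen_docs.add(doc_key)

        unique = []
        for para in doc.split("\n\n"):
            if not para.strip():
                continue
            key = _fingerprint(para)
            if key in seen_paragraphs:
                continue  # This paragraph appeared earlier in the corpus.
            seen_paragraphs.add(key)
            unique.append(para.strip())

        if unique:
            yield "\n\n".join(unique)
```

Storing fixed-size hashes rather than raw text keeps the memory cost per seen item constant, which matters at web scale.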
Content Filtering
Removes offensive content, boilerplate text, and low-quality content, ensuring the safety and usability of the corpus, suitable for various research scenarios.
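As an illustration of this kind of content filtering, the sketch below drops whole pages that contain blocklisted words or placeholder markers and strips boilerplate lines. The blocklist entries are hypothetical placeholders, and the marker lists are illustrative assumptions rather than the exact lists used for C4.

```python
import re

BLOCKLIST = {"badword1", "badword2"}  # Hypothetical placeholder terms.
PAGE_MARKERS = ("lorem ipsum", "{")  # Placeholder text and code-like pages.
LINE_MARKERS = ("javascript", "cookie policy", "terms of use")  # Boilerplate.


def filter_content(text):
    """Return text with boilerplate lines removed, or None to drop the page."""
    lowered = text.lower()
    # Drop the whole page if it looks like filler text or source code.
    if any(marker in lowered for marker in PAGE_MARKERS):
        return None
    # Drop the whole page if any blocklisted word appears.
    words = set(re.findall(r"[a-z0-9']+", lowered))
    if words & BLOCKLIST:
        return None
    # Otherwise remove only the individual boilerplate lines.
    kept = [
        line
        for line in text.splitlines()
        if not any(marker in line.lower() for marker in LINE_MARKERS)
    ]
    return "\n".join(kept) if kept else None
```

Matching on whole words for the blocklist avoids discarding pages where a blocked term merely appears as a substring of an unrelated word.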
Open License
Released under the ODC-By 1.0 open license agreement, supporting research and commercial use, providing convenient data access for academia and industry.
Applicable Scenarios
From model pre-training to task evaluation, C4 plays a core role in various fields of NLP
LLM Pre-training
Serves as foundational training data for large language models, providing a massive, high-quality text corpus for models such as T5.
Text Classification
Training classifiers based on diverse web content domains, covering various text categories such as news, technology, and education.
Information Extraction
Extracting structured information from unstructured web text, supporting tasks like entity recognition and relationship extraction.
Benchmark Development
Creating evaluation datasets using diverse web sources to provide standardized benchmarks for comparing NLP model performance.
Quick Start with C4
Access the C4 dataset through the API. A Python example follows:
    import requests

    url = "https://api.acedata.cloud/datasets/c4"
    headers = {
        "Authorization": "Bearer YOUR_API_TOKEN",
        "Content-Type": "application/json",
    }
    params = {
        "count": 10,  # Number of records to retrieve
        "offset": 0,  # Starting position
    }

    response = requests.get(url, headers=headers, params=params)
    data = response.json()

    # Print retrieved records
    for record in data.get("results", []):
        print(record["text"][:200])
        print("---")
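To retrieve more than one page, the `count` and `offset` parameters can be advanced in a loop. The sketch below is one possible way to do this, using only the Python standard library. It assumes the endpoint and the `results` response key from the example above. The helper names `fetch_page` and `iter_records` are hypothetical, and the API's actual paging limits are not documented here.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.acedata.cloud/datasets/c4"  # Endpoint from the example.


def fetch_page(token, count, offset):
    """Fetch one page of records; assumes a JSON body with a "results" list."""
    query = urllib.parse.urlencode({"count": count, "offset": offset})
    request = urllib.request.Request(
        f"{API_URL}?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    # urlopen raises urllib.error.HTTPError on non-2xx responses.
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload.get("results", [])


def iter_records(token, total, page_size=10, fetch=fetch_page):
    """Yield up to `total` records by advancing `offset` page by page."""
    offset = 0
    while offset < total:
        count = min(page_size, total - offset)
        records = fetch(token, count, offset)
        if not records:
            break  # The server has no more data.
        yield from records
        offset += len(records)
```

For example, `for record in iter_records("YOUR_API_TOKEN", total=100): ...` would request ten pages of ten records each. The `fetch` argument can be replaced with a stub to exercise the paging logic without network access.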
3 Steps to Get Started Quickly
From registration to your first API call, you can start using the C4 dataset in just a few minutes.
Register an Account
Register for an Ace Data Cloud account at platform.acedata.cloud and complete the developer identity verification.
Obtain API Key
Create API credentials in the console to obtain your own API token for subsequent calls to the dataset interface.
Call Dataset API
Use the API Token to call the C4 dataset interface and obtain high-quality web text data as needed to start your NLP project.
Start Exploring the C4 Dataset
10.4 billion high-quality web texts, open license, instant API access. Whether you are training large language models or building NLP applications, C4 is your ideal data source.
