<style>
.dolma-page * { box-sizing: border-box; }
.dolma-page h1, .dolma-page h2, .dolma-page h3, .dolma-page h4, .dolma-page h5, .dolma-page h6, .dolma-page p, .dolma-page ul, .dolma-page ol, .dolma-page li, .dolma-page pre, .dolma-page blockquote, .dolma-page table, .dolma-page td, .dolma-page th { margin: 0; padding: 0; }
.dolma-page {
-webkit-font-smoothing: antialiased;
-moz-osx-font-smoothing: grayscale;
color: var(--el-text-color-primary);
background: var(--el-bg-color);
line-height: 1.6;
}
.dolma-page a { text-decoration: none; color: inherit; }
.dolma-page a:hover { text-decoration: none; }
.dolma-page ul { list-style: none; }
.markdown-body .dolma-page a { color: inherit !important; text-decoration: none !important; }
.markdown-body .dolma-page a:hover { text-decoration: none !important; }
.markdown-body .dolma-page a.s-btn-primary,
.markdown-body .dolma-page a.btn-cta-light { color: #ffffff !important; }
.markdown-body .dolma-page a.s-btn-secondary { color: var(--el-text-color-primary) !important; }
.markdown-body .dolma-page a.btn-cta-ghost { color: #94a3b8 !important; }
.markdown-body .dolma-page a.btn-cta-ghost:hover { color: #e2e8f0 !important; }
.markdown-body .dolma-page h1, .markdown-body .dolma-page h2 { border-bottom: none !important; padding-bottom: 0 !important; }
.dolma-page .s-container { max-width: 1200px; margin: 0 auto; padding: 0 24px; }
.dolma-page .s-container-narrow { max-width: 800px; margin: 0 auto; padding: 0 24px; }
.dolma-page .s-container-wide { max-width: 1100px; margin: 0 auto; padding: 0 32px; }
.dolma-page .s-section { padding: 80px 0; }
.dolma-page .s-section-lg { padding: 100px 0; }
.dolma-page .s-section-sm { padding: 48px 0; }
.dolma-page .s-bg-white { background: var(--el-bg-color); }
.dolma-page .s-bg-gray { background: var(--el-bg-color-page); }
.dolma-page .s-bg-dark { background: #0f172a; color: #f8fafc; }
.dolma-page .s-header { text-align: center; margin-bottom: 64px; }
.dolma-page .s-header h2 {
font-size: clamp(28px, 4vw, 40px);
font-weight: 700;
color: var(--el-text-color-primary);
letter-spacing: normal;
margin-bottom: 20px;
line-height: 1.15;
}
.dolma-page .s-header p {
font-size: clamp(16px, 2vw, 18px);
color: var(--el-text-color-regular);
max-width: 640px;
margin: 0 auto;
line-height: 1.6;
}
.dolma-page .s-bg-dark .s-header h2 { color: #f8fafc; }
.dolma-page .s-bg-dark .s-header p { color: var(--el-text-color-secondary); }
.dolma-page .s-btn-primary {
display: inline-flex; align-items: center; gap: 6px;
padding: 14px 28px;
background: #0284c7; color: #ffffff !important;
border-radius: 9999px; font-size: 15px; font-weight: 600;
transition: background 0.2s, transform 0.15s;
border: none; cursor: pointer;
text-decoration: none !important;
}
.dolma-page .s-btn-primary:hover { background: #0369a1; transform: translateY(-1px); text-decoration: none !important; }
.dolma-page .s-btn-secondary {
display: inline-flex; align-items: center; gap: 6px;
padding: 14px 28px;
background: var(--el-bg-color); color: var(--el-text-color-primary) !important;
border: 1px solid var(--el-border-color-light);
border-radius: 9999px; font-size: 15px; font-weight: 600;
transition: border-color 0.2s, background 0.2s;
cursor: pointer;
text-decoration: none !important;
}
.dolma-page .s-btn-secondary:hover { background: var(--el-bg-color-page); text-decoration: none !important; }
.dolma-hero {
padding: 100px 0 80px;
text-align: center;
background: var(--el-bg-color);
position: relative;
overflow: hidden;
}
.dolma-hero::before {
content: '';
position: absolute;
top: -200px; left: 50%;
transform: translateX(-50%);
width: 900px; height: 500px;
background: radial-gradient(ellipse, rgba(2, 132, 199, 0.06) 0%, transparent 70%);
pointer-events: none;
}
.dolma-page .hero-badge {
display: inline-flex; align-items: center; gap: 8px;
padding: 6px 16px;
background: var(--el-bg-color-page); border: 1px solid var(--el-border-color-light);
border-radius: 9999px; font-size: 13px; font-weight: 600; color: var(--el-text-color-regular);
margin-bottom: 28px;
}
.dolma-page .hero-badge .badge-dot {
width: 6px; height: 6px; background: #10b981; border-radius: 50%;
display: inline-block;
}
.dolma-hero h1 {
font-size: clamp(36px, 5vw, 60px);
font-weight: 700; line-height: 1.05;
letter-spacing: normal; color: var(--el-text-color-primary);
margin-bottom: 20px;
position: relative;
}
.dolma-hero h1 span { color: #0284c7; }
.dolma-page .hero-subtitle {
font-size: clamp(16px, 2vw, 20px);
color: var(--el-text-color-regular); line-height: 1.6;
max-width: 620px; margin: 0 auto 56px;
position: relative;
}
.dolma-page .hero-actions {
display: flex; gap: 12px; justify-content: center;
flex-wrap: wrap; margin-bottom: 56px; position: relative;
}
.dolma-page .hero-highlights {
display: flex; align-items: center; justify-content: center;
gap: 16px; flex-wrap: wrap; position: relative;
}
.dolma-page .hero-highlights .h-item { font-size: 14px; color: var(--el-text-color-regular); font-weight: 500; }
.dolma-page .hero-highlights .h-div { width: 1px; height: 16px; background: var(--el-border-color-light); }
@media (max-width: 640px) 

{ .dolma-page .hero-highlights .h-div { display: none; } .dolma-page .hero-highlights { gap: 8px 16px; } .dolma-page .hero-actions { flex-direction: column; align-items: center; } .dolma-page .hero-actions a { width: 100%; max-width: 280px; justify-content: center; } } .dolma-page .hero-cover { max-width: 720px; margin: 48px auto 0; border-radius: 16px; overflow: hidden; box-shadow: 0 8px 32px rgba(0,0,0,0.10); } .dolma-page .hero-cover img { width: 100%; height: auto; display: block; } .dolma-stats { padding: 48px 0; background: var(--el-bg-color-page); border-top: 1px solid var(--el-border-color-lighter); border-bottom: 1px solid var(--el-border-color-lighter); } .dolma-page .stats-grid { display: grid; grid-template-columns: repeat(4, 1fr); gap: 32px; text-align: center; } .dolma-page .stat-icon { font-size: 28px; margin-bottom: 12px; } .dolma-page .stat-val { font-size: clamp(28px, 4vw, 40px); font-weight: 700; color: var(--el-text-color-primary); letter-spacing: normal; margin-bottom: 4px; } .dolma-page .stat-lbl { font-size: 14px; color: var(--el-text-color-secondary); font-weight: 500; } @media (max-width: 768px) { .dolma-page .stats-grid { grid-template-columns: repeat(2, 1fr); gap: 24px; } } @media (max-width: 480px) { .dolma-page .stats-grid { grid-template-columns: 1fr; gap: 20px; } } .dolma-page .features-grid { display: grid; grid-template-columns: repeat(3, 1fr); gap: 24px; } .dolma-page .feat-card { padding: 32px 28px; border: none; border-radius: 20px; box-shadow: 0 2px 12px 0 rgba(0,0,0,0.08); background: var(--el-bg-color); transition: border-color 0.2s, box-shadow 0.2s, transform 0.15s; } .dolma-page .feat-card:hover { box-shadow: 0 8px 24px 0 rgba(0,0,0,0.12); transform: translateY(-2px); } .dolma-page .feat-icon { font-size: 32px; margin-bottom: 16px; } .dolma-page .feat-card h3 { font-size: 18px; font-weight: 700; color: var(--el-text-color-primary); margin-bottom: 8px; } .dolma-page .feat-card p { font-size: 15px; color: 
var(--el-text-color-regular); line-height: 1.6; } @media (max-width: 1024px) { .dolma-page .features-grid { grid-template-columns: repeat(2, 1fr); } } @media (max-width: 640px) { .dolma-page .features-grid { grid-template-columns: 1fr; } } .dolma-page .usecases-grid { display: grid; grid-template-columns: repeat(4, 1fr); gap: 20px; } .dolma-page .uc-card { padding: 28px 24px; background: var(--el-bg-color); border: none; border-radius: 20px; box-shadow: 0 2px 12px 0 rgba(0,0,0,0.08); text-align: center; transition: border-color 0.2s, box-shadow 0.2s, transform 0.15s; } .dolma-page .uc-card:hover { box-shadow: 0 8px 24px 0 rgba(0,0,0,0.12); transform: translateY(-2px); } .dolma-page .uc-icon { font-size: 36px; margin-bottom: 16px; } .dolma-page .uc-card h3 { font-size: 17px; font-weight: 700; color: var(--el-text-color-primary); margin-bottom: 8px; } .dolma-page .uc-card p { font-size: 14px; color: var(--el-text-color-regular); line-height: 1.6; } @media (max-width: 1024px) { .dolma-page .usecases-grid { grid-template-columns: repeat(2, 1fr); } } @media (max-width: 480px) { .dolma-page .usecases-grid { grid-template-columns: 1fr; } } .dolma-page .code-wrap { border-radius: 16px !important; overflow: hidden !important; border: 1px solid #334155 !important; background: #0f172a !important; max-width: 860px; margin: 0 auto; } .markdown-body .dolma-page .code-wrap { border-radius: 16px !important; overflow: hidden !important; border: 1px solid #334155 !important; background: #0f172a !important; } .dolma-page .code-bar { display: flex !important; align-items: center !important; justify-content: space-between !important; padding: 12px 20px !important; background: #1e293b !important; border-bottom: 1px solid #334155 !important; } .dolma-page .code-dots { display: flex; gap: 6px; } .dolma-page .code-dots i { width: 10px; height: 10px; border-radius: 50%; display: inline-block; } .dolma-page .code-dots .r { background: #ef4444; } .dolma-page .code-dots .y { background: 
#f59e0b; } .dolma-page .code-dots .g { background: #10b981; } .dolma-page .code-lang { font-size: 12px; color: var(--el-text-color-secondary); font-weight: 600; text-transform: uppercase; letter-spacing: 0.05em; } .dolma-page .code-block { padding: 24px !important; margin: 0 !important; overflow-x: auto !important; font-family: 'JetBrains Mono', 'Fira Code', 'SF Mono', monospace !important; font-size: 13.5px !important; line-height: 1.7 !important; color: #e2e8f0 !important; white-space: pre !important; background: transparent !important; border: none !important; border-radius: 0 !important; } .markdown-body .dolma-page .code-block { padding: 24px !important; margin: 0 !important; overflow-x: auto !important; font-family: 'JetBrains Mono', 'Fira Code', 'SF Mono', monospace !important; font-size: 13.5px !important; line-height: 1.7 !important; color: #e2e8f0 !important; white-space: pre !important; background: transparent !important; border: none !important; border-radius: 0 !important; } .dolma-page .steps-row { display: flex; align-items: flex-start; justify-content: center; margin-bottom: 48px; } .dolma-page .stp-card { flex: 1; max-width: 320px; text-align: center; padding: 0 24px; } .dolma-page .stp-num { font-size: clamp(48px, 6vw, 72px); font-weight: 700; color: #e2e8f0; letter-spacing: -0.04em; line-height: 1; margin-bottom: 20px; } .dolma-page .stp-card h3 { font-size: 18px; font-weight: 700; color: var(--el-text-color-primary); margin-bottom: 10px; } .dolma-page .stp-card p { font-size: 15px; color: var(--el-text-color-regular); line-height: 1.6; } .dolma-page .stp-conn { width: 60px; height: 2px; background: var(--el-border-color-light); margin-top: 36px; flex-shrink: 0; } .dolma-page .steps-cta { text-align: center; } @media (max-width: 768px) { .dolma-page .steps-row { flex-direction: column; align-items: center; gap: 32px; } .dolma-page .stp-conn { width: 2px; height: 32px; margin: 0; } .dolma-page .stp-card { max-width: 100%; } } .dolma-cta { padding: 
100px 0; background: #082f49; text-align: center; position: relative; overflow: hidden; } .dolma-cta::before { content: ''; position: absolute; top: -100px; left: 50%; transform: translateX(-50%); width: 700px; height: 400px; background: radial-gradient(ellipse, rgba(56, 189, 248, 0.12) 0%, transparent 70%); pointer-events: none; } .dolma-cta h2 { font-size: clamp(28px, 4vw, 44px); font-weight: 700; color: #bae6fd; letter-spacing: normal; margin-bottom: 28px; position: relative; } .dolma-cta > div > p { font-size: clamp(16px, 2vw, 18px); color: var(--el-text-color-secondary); max-width: 520px; margin: 0 auto 56px; line-height: 1.6; position: relative; } .dolma-page .cta-actions { display: flex; gap: 12px; justify-content: center; flex-wrap: wrap; position: relative; } .dolma-page .btn-cta-light { display: inline-flex; align-items: center; gap: 6px; padding: 14px 32px; background: #0284c7; color: #ffffff !important; border-radius: 9999px; font-size: 15px; font-weight: 700; transition: background 0.2s, transform 0.15s; text-decoration: none !important; } .dolma-page .btn-cta-light:hover { background: #0369a1; transform: translateY(-1px); text-decoration: none !important; } .dolma-page .btn-cta-ghost { display: inline-flex; align-items: center; padding: 14px 32px; background: transparent; color: #94a3b8 !important; border: 1px solid #0c4a6e; border-radius: 9999px; font-size: 15px; font-weight: 600; transition: border-color 0.2s, color 0.2s; text-decoration: none !important; } .dolma-page .btn-cta-ghost:hover { border-color: var(--el-text-color-regular); color: #e2e8f0 !important; text-decoration: none !important; } .dolma-page code { background: #f0f9ff !important; padding: 2px 8px !important; border-radius: 5px !important; font-size: 13px !important; font-family: 'JetBrains Mono', 'Fira Code', 'SF Mono', monospace !important; color: #0c4a6e !important; border: 1px solid #7dd3fc !important; } .dolma-page .s-text-dark { color: var(--el-text-color-primary); } 
.dolma-page .s-text-brand { color: #0284c7; } .dolma-page .s-section-body { font-size: 16px; color: var(--el-text-color-regular); line-height: 1.8; text-align: center; max-width: 680px; margin: 0 auto; } .dolma-page .s-section-body p + p { margin-top: 16px; } .dolma-page .tag-row { display: flex; gap: 8px; flex-wrap: wrap; justify-content: center; margin-top: 16px; } .dolma-page .tag-item

{
padding: 4px 12px; background: var(--el-bg-color-page);
border: 1px solid var(--el-border-color-light); border-radius: 9999px;
font-size: 12px; font-weight: 600; color: var(--el-text-color-regular);
}
html.dark .dolma-page { background: var(--el-bg-color); color: var(--el-text-color-primary); }
html.dark .dolma-page a { color: inherit; }
html.dark .markdown-body .dolma-page a { color: inherit !important; }
html.dark .markdown-body .dolma-page a.s-btn-primary,
html.dark .markdown-body .dolma-page a.btn-cta-light { color: #ffffff !important; }
html.dark .markdown-body .dolma-page a.s-btn-secondary { color: var(--el-text-color-primary) !important; }
html.dark .markdown-body .dolma-page a.btn-cta-ghost { color: #94a3b8 !important; }
html.dark .markdown-body .dolma-page a.btn-cta-ghost:hover { color: var(--el-text-color-primary) !important; }
html.dark .dolma-page .s-bg-white { background: var(--el-bg-color); }
html.dark .dolma-page .s-bg-gray { background: var(--el-bg-color-page); }
html.dark .dolma-page .s-bg-dark { background: var(--el-bg-color); }
html.dark .dolma-page .s-header h2 { color: var(--el-text-color-primary); }
html.dark .dolma-page .s-header p { color: var(--el-text-color-secondary); }
html.dark .dolma-page .s-btn-primary { background: #0284c7; color: #ffffff !important; }
html.dark .dolma-page .s-btn-primary:hover { background: #0369a1; }
html.dark .dolma-page .s-btn-secondary {
background: #1e293b; color: var(--el-text-color-primary) !important;
border-color: #475569;
}
html.dark .dolma-page .s-btn-secondary:hover { background: var(--el-border-color); border-color: var(--el-text-color-regular); }
html.dark .dolma-hero { background: var(--el-bg-color); }
html.dark .dolma-hero::before {
background: radial-gradient(ellipse, rgba(56, 189, 248, 0.15) 0%, transparent 70%);
}
html.dark .dolma-page .hero-badge { background: var(--el-bg-color-page); border-color: var(--el-border-color); color: var(--el-text-color-secondary); }
html.dark .dolma-hero h1 { color: var(--el-text-color-primary); }
html.dark .dolma-hero h1 span { color: #38bdf8; }
html.dark .dolma-page .hero-subtitle { color: var(--el-text-color-secondary); }
html.dark .dolma-page .hero-highlights .h-item { color: var(--el-text-color-secondary); }
html.dark .dolma-page .hero-highlights .h-div { background: var(--el-border-color); }
html.dark .dolma-stats { background: var(--el-bg-color-page); border-color: var(--el-border-color); }
html.dark .dolma-page .stat-val { color: var(--el-text-color-primary); }
html.dark .dolma-page .stat-lbl { color: var(--el-text-color-regular); }
html.dark .dolma-page .feat-card {
background: var(--el-bg-color-page); border-color: var(--el-border-color);
}
html.dark .dolma-page .feat-card:hover { border-color: var(--el-text-color-regular); box-shadow: 0 4px 16px rgba(0,0,0,0.3); }
html.dark .dolma-page .feat-card h3 { color: var(--el-text-color-primary); }
html.dark .dolma-page .feat-card p { color: var(--el-text-color-secondary); }
html.dark .dolma-page .uc-card { background: var(--el-bg-color-page); border-color: var(--el-border-color); }
html.dark .dolma-page .uc-card:hover { border-color: var(--el-text-color-regular); box-shadow: 0 4px 16px rgba(0,0,0,0.3); }
html.dark .dolma-page .uc-card h3 { color: var(--el-text-color-primary); }
html.dark .dolma-page .uc-card p { color: var(--el-text-color-secondary); }
html.dark .dolma-page .stp-num { color: #334155; }
html.dark .dolma-page .stp-card h3 { color: var(--el-text-color-primary); }
html.dark .dolma-page .stp-card p { color: var(--el-text-color-secondary); }
html.dark .dolma-page .stp-conn { background: var(--el-border-color); }
html.dark .dolma-page code {
background: #082f49 !important; color: #bae6fd !important; border-color: #0c4a6e !important;
}
html.dark .dolma-page .s-text-dark { color: var(--el-text-color-primary); }
html.dark .dolma-page .s-text-brand { color: #38bdf8; }
html.dark .dolma-page .s-section-body { color: var(--el-text-color-secondary); }
html.dark .dolma-page .tag-item { background: var(--el-border-color); border-color: var(--el-text-color-regular); color: var(--el-text-color-secondary); }
html.dark .dolma-cta { background: #082f49; }
html.dark .dolma-cta::before {
background: radial-gradient(ellipse, rgba(56, 189, 248, 0.2) 0%, transparent 70%);
}
html.dark .dolma-page .btn-cta-light { color: #ffffff !important; }
html.dark .dolma-page .btn-cta-ghost { color: #94a3b8 !important; }
html.dark .dolma-page .btn-cta-ghost:hover { color: var(--el-text-color-primary) !important; }
</style>
<div class="dolma-page">
<section class="dolma-hero">
<div class="s-container-narrow">
<div class="hero-badge">
<span class="badge-dot"></span>
Dolma Open Corpus
</div>
<h1>
Dolma Open<br/><span>Corpus</span>
</h1>
<p class="hero-subtitle">
Dolma is a large-scale open corpus created by the Allen Institute for AI (AI2). It contains 3 trillion tokens drawn from six major data sources: Common Crawl, The Stack, C4, Reddit, Wikipedia, and Semantic Scholar. It is used to train the OLMo series of language models and is currently one of the most transparent large-scale pre-training datasets.
</p>

3T Tokens · 6 Major Data Sources · ODC-By 1.0 License · OLMo Training Data
📊
3T
Total Tokens
🔗
6
Data Sources
📜
ODC-By 1.0
Open License
🤖
OLMo
Model Trained on Dolma

Dataset Highlights

A large-scale, multi-source, fully transparent open pre-training corpus

📏

Trillion-Token Scale

Contains approximately 3 trillion tokens of text, making it one of the largest publicly available pre-training corpora and providing ample data for training large language models.

🌐

Six Major Data Sources

Combines six major sources: Common Crawl web pages, code from The Stack, filtered web text from C4, Reddit conversations, Wikipedia articles, and Semantic Scholar academic papers.

🔍

Fully Transparent

The Allen Institute for AI has published its complete data collection, cleaning, deduplication, and filtering process, so every processing step can be traced and audited. This sets a high bar for dataset transparency.

🔧

Quality Filtering Pipeline

Applies a multi-stage quality filtering pipeline (language detection, content filtering, deduplication, and toxicity detection) to improve the overall quality of the training data.

🔄

Reproducible Processing

All data processing code is open-sourced on GitHub, allowing researchers to fully reproduce the entire processing flow from raw data to final corpus.

📖

Open License

Released under the ODC-By 1.0 open data license, which permits both academic research and commercial use as long as the source is properly attributed.
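The filtering stages named in the highlights above (language detection, content filtering, deduplication, toxicity detection) can be pictured as a chain of small checks. The sketch below is illustrative only: the function names, thresholds, and blocklist are assumptions made for this example and are not Dolma's actual implementation, which relies on dedicated classifiers and large-scale deduplication tooling.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List

Doc = dict  # assumed shape: {"id": str, "text": str}

BLOCKLIST = {"badword"}  # hypothetical stand-in for a toxicity classifier


def looks_english(doc: Doc, min_ascii_ratio: float = 0.9) -> bool:
    """Crude stand-in for language detection: share of ASCII characters."""
    text = doc["text"]
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ch.isascii())
    return ascii_chars / len(text) >= min_ascii_ratio


def long_enough(doc: Doc, min_words: int = 5) -> bool:
    """Content filter: drop very short documents."""
    return len(doc["text"].split()) >= min_words


def not_toxic(doc: Doc) -> bool:
    """Toy toxicity check: reject documents containing a blocklisted word."""
    words = {w.strip(".,!?").lower() for w in doc["text"].split()}
    return words.isdisjoint(BLOCKLIST)


def deduplicate(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Exact deduplication on whitespace- and case-normalized text."""
    seen = set()
    for doc in docs:
        normalized = " ".join(doc["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def run_pipeline(docs: Iterable[Doc], filters: List[Callable[[Doc], bool]]) -> List[Doc]:
    """Keep documents that pass every filter, then remove duplicates."""
    kept = (d for d in docs if all(f(d) for f in filters))
    return list(deduplicate(kept))


sample = [
    {"id": "a", "text": "Open corpora make language model research easier to audit."},
    {"id": "b", "text": "open corpora make language model   research easier to audit."},
    {"id": "c", "text": "Too short."},
    {"id": "d", "text": "This sentence contains a badword and should be dropped."},
]
clean = run_pipeline(sample, [looks_english, long_enough, not_toxic])
print([d["id"] for d in clean])
```

Keeping each stage as a separate predicate makes it possible to log how many documents each one removes, which is the kind of per-step traceability the highlights describe.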

Applicable Scenarios

Empowering the AI community from model training to data science research

🧠

LLM Pre-training

Serves as the core pre-training corpus for large language models, providing diverse, large-scale text data for training foundation models from scratch.

📐

Data Mixture Research

Explores the optimal mixing ratios of different data sources and studies how web pages, code, encyclopedia articles, academic papers, and other sources affect model capabilities.

🧪

Ablation Experiments

Systematically studies the independent contributions of each data component to model performance by removing or replacing specific data sources.

🔬

Reproducible AI Research

Builds on fully open data and processing code so that research results can be verified and reproduced, promoting scientific rigor in the AI field.
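The data mixture and ablation scenarios above both come down to manipulating per-source sampling weights. The following is a minimal sketch under stated assumptions: the weights in `MIXTURE` are invented for illustration and are not Dolma's actual proportions, and the helper names are hypothetical.

```python
import random
from collections import Counter
from typing import Dict

# Hypothetical mixing weights, NOT Dolma's real source proportions.
MIXTURE: Dict[str, float] = {
    "common_crawl": 0.60,
    "the_stack": 0.15,
    "c4": 0.10,
    "reddit": 0.05,
    "wikipedia": 0.05,
    "semantic_scholar": 0.05,
}


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: value / total for name, value in weights.items()}


def ablate(weights: Dict[str, float], removed: str) -> Dict[str, float]:
    """Drop one source and renormalize the rest (a leave-one-out mixture)."""
    if removed not in weights:
        raise KeyError(f"unknown source: {removed}")
    remaining = {name: value for name, value in weights.items() if name != removed}
    return normalize(remaining)


def sample_sources(weights: Dict[str, float], n: int, seed: int = 0) -> Counter:
    """Draw n source labels in proportion to the weights."""
    rng = random.Random(seed)
    names = list(weights)
    draws = rng.choices(names, weights=[weights[s] for s in names], k=n)
    return Counter(draws)


baseline = normalize(MIXTURE)
no_code = ablate(MIXTURE, "the_stack")  # ablation: train without code data

print(sample_sources(baseline, 1000))
print({name: round(value, 3) for name, value in no_code.items()})
```

Comparing a model trained on `baseline` with one trained on `no_code` isolates the contribution of the removed source, since the other sources keep their relative proportions.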

NLP · pre-training · OLMo · open-source · multi-source

API Call Example

Quickly obtain Dolma dataset information through the Ace Data Cloud API

PYTHON
<pre class="code-block">
import requests

url = "https://api.acedata.cloud/datasets/dolma"
headers = {
    "Authorization": "Bearer YOUR_API_TOKEN",
    "Accept": "application/json",
}

response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  # stop early on an HTTP error
data = response.json()

# View basic information about the dataset
print(f"Name: {data['name']}")
print(f"Number of Tokens: {data['tokens']}")
print(f"Data Sources: {data['sources']}")
print(f"License Agreement: {data['license']}")
</pre>

3 Steps to Get Started Quickly

From understanding to usage, quickly start your journey with large model training data

01

Browse the Dataset

View the Dolma dataset details on the Ace Data Cloud platform to understand its data sources, token scale, and license.

02

Obtain API Token

Register and obtain your API Token to access the dataset through the api.acedata.cloud interface.

03

Download and Train

Download the data shards you need via the API and start your pre-training or research experiments with the Dolma corpus.
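Once a shard is on disk, a first sanity check is to iterate over its records and count text per source. The sketch below assumes shards are gzip-compressed JSON Lines files whose records carry `id`, `text`, and `source` fields; verify that layout against the files you actually download. The file name and helper functions are hypothetical, and the demo writes its own tiny shard so it runs without network access.

```python
import gzip
import json
import os
import tempfile
from collections import Counter
from typing import Iterator


def iter_documents(path: str) -> Iterator[dict]:
    """Yield one JSON record per non-empty line of a gzipped JSONL shard."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def words_per_source(path: str) -> Counter:
    """Count whitespace-separated words per source (not model tokens)."""
    counts: Counter = Counter()
    for doc in iter_documents(path):
        counts[doc.get("source", "unknown")] += len(doc.get("text", "").split())
    return counts


# Demo data standing in for a downloaded shard.
records = [
    {"id": "1", "text": "Dolma is an open corpus.", "source": "wikipedia"},
    {"id": "2", "text": "def add(a, b): return a + b", "source": "the_stack"},
    {"id": "3", "text": "Open data helps reproducibility.", "source": "wikipedia"},
]

with tempfile.TemporaryDirectory() as tmp:
    shard_path = os.path.join(tmp, "example-shard.jsonl.gz")
    with gzip.open(shard_path, "wt", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
    summary = words_per_source(shard_path)

print(dict(summary))
```

Streaming line by line keeps memory use flat regardless of shard size, which matters when shards are large.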

Start Exploring the Dolma Open Corpus

3 trillion tokens, 6 major data sources, and a fully transparent processing pipeline. Whether you are training the next generation of language models or conducting data science research, Dolma is a strong choice.