Emilia
Dataset
Emilia is a large-scale multilingual speech generation dataset containing over 100,000 hours of speech in English, Chinese, German, French, Japanese, Korean, and other languages. It covers a wide range of speakers and recording scenarios, making it suitable for speech synthesis and voice cloning research.
Dataset Highlights
A large-scale multilingual speech dataset that provides a solid foundation for speech synthesis and cloning research
Ultra-large scale
Contains over 100,000 hours of speech data, making it one of the largest open-source speech generation datasets and providing ample data for training large-scale models.
Multilingual coverage
Covers multiple languages, including English (EN), Chinese (ZH), German (DE), French (FR), Japanese (JA), and Korean (KO), supporting cross-language speech research.
Diversity of speakers
Features voice samples from over 50,000 different speakers, covering various ages, genders, and accents, which helps trained models generalize to unseen speakers.
Natural speech recording
The speech data is sourced from natural recordings in real scenarios, covering various styles such as conversations, speeches, and audiobooks, with high naturalness and expressiveness.
High-quality annotations
Each segment of speech is accompanied by precise text transcriptions, speaker identifiers, language tags, and duration information, with standardized annotations for direct use in model training.
Open-source processing pipeline
Accompanied by the open-source data processing tool Emilia-Pipe, supporting end-to-end processing from raw audio to training data, allowing for the reproduction of the dataset construction process.
Applicable Scenarios
From speech synthesis to speaker verification, covering core research directions in speech AI
Text-to-speech
Train high-quality TTS models to generate natural, fluent, and expressive synthetic speech
Voice cloning
Utilize rich speaker data to achieve few-shot or zero-shot voice cloning, replicating the target speaker's timbre
Speech translation
Leverage multilingual speech data to build end-to-end speech translation systems for cross-language speech conversion
Speaker verification
Use large-scale speaker data to train voiceprint recognition models, enhancing speaker verification and recognition accuracy
Data Preview
Below is a typical metadata example from the Emilia dataset (in JSON format):
{
  "id": "emilia_en_00012345",
  "speaker_id": "spk_en_04821",
  "language": "en",
  "duration": 8.72,
  "sample_rate": 24000,
  "transcription": "The weather today is absolutely beautiful, perfect for a walk in the park.",
  "gender": "female",
  "source": "audiobook",
  "audio_path": "en/subset_001/emilia_en_00012345.wav"
}
# Another English sample
{
  "id": "emilia_en_00098765",
  "speaker_id": "spk_en_01234",
  "language": "en",
  "duration": 6.35,
  "sample_rate": 24000,
  "transcription": "Welcome to our program, today we will discuss the latest developments in artificial intelligence.",
  "gender": "male",
  "source": "podcast",
  "audio_path": "en/subset_003/emilia_en_00098765.wav"
}
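Records like the ones above can be read with Python's standard json module. The following is a minimal sketch, assuming each record is a standalone JSON object with exactly the fields shown in the preview and that duration is expressed in seconds. The MetadataRecord class and the parse_record and hours_by_language helpers are hypothetical names introduced for illustration; they are not part of any Emilia tooling.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    """One clip's metadata, mirroring the fields in the preview."""
    id: str
    speaker_id: str
    language: str
    duration: float   # assumed to be seconds
    sample_rate: int  # Hz
    transcription: str
    gender: str
    source: str
    audio_path: str


def parse_record(text):
    """Parse one JSON object into a MetadataRecord.

    Raises TypeError if a field is missing or an unexpected field is present.
    """
    return MetadataRecord(**json.loads(text))


def hours_by_language(records):
    """Sum clip durations per language tag and convert seconds to hours."""
    totals = defaultdict(float)
    for record in records:
        totals[record.language] += record.duration
    return {lang: seconds / 3600.0 for lang, seconds in totals.items()}


# The first preview record, used here as sample input.
SAMPLE = """
{
  "id": "emilia_en_00012345",
  "speaker_id": "spk_en_04821",
  "language": "en",
  "duration": 8.72,
  "sample_rate": 24000,
  "transcription": "The weather today is absolutely beautiful, perfect for a walk in the park.",
  "gender": "female",
  "source": "audiobook",
  "audio_path": "en/subset_001/emilia_en_00012345.wav"
}
"""

record = parse_record(SAMPLE)
print(record.speaker_id, record.language, record.duration)
print(hours_by_language([record]))
```

Using a frozen dataclass means a record with a missing or misspelled field fails at parse time rather than later during training.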
3 Steps to Get Started Quickly
From browsing to loading, you can start your speech research project in just a few minutes
Browse Datasets
View dataset details on the Ace Data Cloud platform to understand metadata such as language distribution, speaker statistics, and licensing agreements.
Download Data
Download speech data slices for the target language on demand. Each slice contains audio files and the corresponding JSON metadata annotations.
Load and Use
Use librosa.load() to load audio files, and start model training and speech synthesis experiments with the metadata annotations.
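The "Load and Use" step can be sketched as follows. This assumes the librosa package is installed, that a downloaded slice has been unpacked under a local directory (the dataset_root argument is hypothetical), and that the audio_path field of each record is relative to that directory. The page does not confirm this layout, so treat the helpers below as illustrative only.

```python
from pathlib import Path


def resolve_audio_path(dataset_root, record):
    """Join a record's audio_path (assumed relative) onto the local root."""
    return Path(dataset_root) / record["audio_path"]


def load_clip(dataset_root, record):
    """Load one clip and return (waveform, sample_rate, transcription)."""
    import librosa  # imported lazily so the path helper works without it

    path = resolve_audio_path(dataset_root, record)
    # Passing sr makes librosa resample if the file's native rate differs
    # from the rate stated in the metadata; audio is mixed down to mono
    # by default.
    waveform, sample_rate = librosa.load(path, sr=record["sample_rate"])
    return waveform, sample_rate, record["transcription"]


# Fields taken from the first preview record.
record = {
    "sample_rate": 24000,
    "transcription": "The weather today is absolutely beautiful, perfect for a walk in the park.",
    "audio_path": "en/subset_001/emilia_en_00012345.wav",
}

# "/data/emilia" is a placeholder for wherever the slice was unpacked.
print(resolve_audio_path("/data/emilia", record))
```

With the slice in place, load_clip("/data/emilia", record) returns the waveform as a NumPy array together with its sample rate and transcription, ready to be paired for training.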
Start Exploring Emilia Speech Data
A large-scale multilingual speech dataset with open licensing, available immediately. Whether you are a speech synthesis researcher or an AI developer, Emilia is your ideal choice.
