To appear · Interspeech 2026 · 70+ languages · 90+ datasets · 11 speech models

The atlas of multilingual speech emotion recognition

KuralHub catalogs and benchmarks Speech Emotion Recognition (SER) datasets across the world's languages — from high-resource to long-tail — so researchers can find data and compare models in one place.

70+ languages surveyed
90+ datasets cataloged
11 speech models
363 fine-tuning runs

About the project

One place to find SER data — and see what actually works

Speech Emotion Recognition research is fragmented: datasets are scattered, inconsistently documented, and heavily skewed toward a handful of high-resource languages. KuralHub brings them together.

We reviewed Speech Emotion Recognition resources for 70+ languages and documented 90+ datasets with consistent metadata: emotion categories, speaker counts, recording conditions, licensing, and access links. For the 29 languages with usable open data, we ran a controlled benchmark. Each of the 11 pretrained speech encoders is fine-tuned separately per language (monolingual, not multilingual): the backbone stays frozen, a lightweight classification head is trained on top, and the resulting model is evaluated on held-out test sets.
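As a rough illustration of the frozen-backbone setup described above, the following standard-library sketch trains only a softmax head on top of fixed features. Everything in it is an assumption made for illustration: `frozen_encoder` is a hand-written stand-in rather than any of the benchmarked encoders, the waveforms are synthetic noise, and the learning rate and epoch count are arbitrary. It is not KuralHub's training code.

```python
import math
import random


def frozen_encoder(waveform):
    """Stand-in for a pretrained backbone: a fixed function with no trainable
    parameters. Returns [mean, mean absolute amplitude] of the signal."""
    n = len(waveform)
    return [sum(waveform) / n, sum(abs(x) for x in waveform) / n]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearHead:
    """Softmax classifier; its weights are the only parameters that get trained."""

    def __init__(self, in_dim, num_classes):
        self.w = [[0.0] * in_dim for _ in range(num_classes)]
        self.b = [0.0] * num_classes

    def probs(self, feats):
        logits = [sum(wj * f for wj, f in zip(row, feats)) + bias
                  for row, bias in zip(self.w, self.b)]
        return softmax(logits)

    def predict(self, feats):
        p = self.probs(feats)
        return p.index(max(p))

    def sgd_step(self, feats, label, lr):
        # Cross-entropy gradient w.r.t. each logit is (probability - one_hot).
        for c, p_c in enumerate(self.probs(feats)):
            grad = p_c - (1.0 if c == label else 0.0)
            self.b[c] -= lr * grad
            for j, f in enumerate(feats):
                self.w[c][j] -= lr * grad * f


def make_split(rng, n_per_class):
    """Toy data: class 0 is low-amplitude noise, class 1 is high-amplitude noise."""
    data = []
    for label, amp in ((0, 0.1), (1, 1.0)):
        for _ in range(n_per_class):
            wave = [rng.uniform(-amp, amp) for _ in range(50)]
            data.append((wave, label))
    return data


rng = random.Random(0)
train, test = make_split(rng, 20), make_split(rng, 10)

# The encoder is frozen, so features are computed once and reused every epoch.
train_feats = [(frozen_encoder(w), y) for w, y in train]
test_feats = [(frozen_encoder(w), y) for w, y in test]

head = LinearHead(in_dim=2, num_classes=2)
for _ in range(200):
    rng.shuffle(train_feats)
    for feats, label in train_feats:
        head.sgd_step(feats, label, lr=1.0)

# Held-out evaluation: the test split is never seen during training.
test_accuracy = sum(head.predict(f) == y for f, y in test_feats) / len(test_feats)
```

Because the encoder never changes, its features can be cached once per dataset, which keeps per-language training cheap. A real run would replace `frozen_encoder` with a pretrained speech model and `LinearHead` with a head built in a deep learning framework.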

The result is a living atlas that helps researchers and practitioners locate datasets quickly and choose the right model for a given language.

Curated dataset catalog

Consistent, structured metadata for 90+ SER datasets — emotions, speakers, license, and access links — organized by language and family.

Reproducible benchmark

11 state-of-the-art speech models fine-tuned per language under one protocol, with validation and test accuracy reported transparently.

Global language coverage

From English and Mandarin to long-tail languages across 12+ language families — surfacing where emotion data exists and where gaps remain.

Methodology

How the benchmark is built

A single, controlled fine-tuning protocol applied consistently across every language and model.

1. Collect & standardize

Gather open SER datasets per language and normalize their emotion labels, splits, and metadata into a common format.

2. Fine-tune per language

Freeze each pretrained speech encoder and train a classification head on top: one monolingual model per dataset, never a mixed multilingual one.

3. Evaluate & report

Score validation and test accuracy on held-out splits and aggregate results by model, language, and language family.
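The first and last steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the label mapping in `CANONICAL_LABELS`, the record layout of `runs`, and every model, language, and family name and score are invented placeholders, not KuralHub's actual label set or results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from dataset-specific labels to one shared label set.
CANONICAL_LABELS = {
    "anger": "angry", "angry": "angry",
    "happiness": "happy", "joy": "happy", "happy": "happy",
    "sadness": "sad", "sad": "sad",
    "fear": "fearful", "fearful": "fearful",
    "neutral": "neutral",
}


def normalize_label(raw):
    """Return the shared label for a raw dataset label, or None if unmapped."""
    return CANONICAL_LABELS.get(raw.strip().lower())


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def aggregate(runs, key):
    """Mean test accuracy of all runs that share the same value of run[key]."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[key]].append(run["test_acc"])
    return {group: mean(values) for group, values in groups.items()}


# Step 1: raw labels with inconsistent spelling and casing mapped to the shared set.
raw_labels = ["Anger", "joy ", "NEUTRAL", "sadness"]
gold = [normalize_label(label) for label in raw_labels]

# Step 3: score placeholder predictions, then aggregate placeholder runs.
predictions = ["angry", "happy", "neutral", "happy"]
runs = [
    {"model": "model-a", "language": "lang-1", "family": "family-x",
     "test_acc": accuracy(gold, predictions)},
    {"model": "model-b", "language": "lang-1", "family": "family-x", "test_acc": 0.5},
    {"model": "model-a", "language": "lang-2", "family": "family-y", "test_acc": 0.5},
    {"model": "model-b", "language": "lang-2", "family": "family-y", "test_acc": 1.0},
]
by_model = aggregate(runs, "model")
by_family = aggregate(runs, "family")
```

Returning `None` for unmapped labels keeps the decision about what to do with them (drop, merge, or flag) explicit instead of hiding it inside the mapping.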

Models evaluated

hubert-base-ls960, hubert-large-ls960-ft, wav2vec2-base, wav2vec2-large-960h, wav2vec2-large-lv60, wav2vec2-xls-r-300m, wav2vec2-xls-r-1b, wavlm-base-plus, wavlm-large, whisper-small, whisper-large

Key findings

What the data tells us

A glimpse of the analysis. The full interactive benchmark has the complete results.

Figure: Average performance by language (radar chart). Model accuracy varies widely across languages and families.
Figure: Model × language heatmap. No single model wins everywhere; the best encoder is language-dependent.
Figure: Average accuracy by model. Larger self-supervised encoders tend to lead on aggregate.
Figure: Language clustering (dendrogram). Languages group by how models behave on them, hinting at transfer opportunities.

The team

Researchers

Department of Computer Science & Engineering, University of Moratuwa, Sri Lanka — aaivu research lab.

FAQ

Frequently asked questions

What is KuralHub?

KuralHub is a comprehensive survey and benchmark of Speech Emotion Recognition (SER) datasets across the world's languages. It catalogs 90+ SER datasets spanning 70+ languages and benchmarks 11 pretrained speech models on 29 languages, to help researchers find emotion-speech data and choose the best model for a given language.

What speech emotion recognition datasets are available, and for which languages?

KuralHub documents 90+ SER datasets across 70+ languages — including English, Mandarin Chinese, Hindi, Arabic, Spanish, German, French, Japanese, Korean, Tamil, Telugu, Bengali, Persian, Turkish and many low-resource languages. Each dataset entry lists its emotion categories, speaker counts, license and access links. Browse the full catalog on the Datasets page.
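One way to picture a catalog entry is as a small typed record. The sketch below is an assumption about shape only: the field names are chosen to mirror the metadata listed above, and the dataset names, speaker counts, licenses, and URLs are placeholders rather than real catalog rows.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog record. Field names are illustrative, not KuralHub's schema."""
    name: str
    language: str
    language_family: str
    emotions: Tuple[str, ...]
    num_speakers: int
    recording_conditions: str
    license: str
    access_url: str


def organize(entries):
    """Group entries as {language_family: {language: [entry, ...]}}."""
    catalog = {}
    for entry in entries:
        by_language = catalog.setdefault(entry.language_family, {})
        by_language.setdefault(entry.language, []).append(entry)
    return catalog


# Placeholder records: names, counts, licenses, and URLs are made up.
entries = [
    DatasetEntry("placeholder-corpus-1", "Tamil", "Dravidian",
                 ("angry", "happy", "neutral", "sad"), 10,
                 "acted, studio", "CC BY 4.0", "https://example.org/corpus-1"),
    DatasetEntry("placeholder-corpus-2", "Telugu", "Dravidian",
                 ("angry", "fearful", "neutral"), 6,
                 "acted, studio", "on request", "https://example.org/corpus-2"),
    DatasetEntry("placeholder-corpus-3", "Hindi", "Indo-European",
                 ("angry", "happy", "sad"), 8,
                 "spontaneous, broadcast", "CC BY-NC 4.0", "https://example.org/corpus-3"),
]
catalog = organize(entries)
```

Keeping the license and access link as required fields makes it hard to add an entry without stating how the data can be obtained.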

Which model performs best for speech emotion recognition?

There is no single best model — the top encoder is language-dependent. KuralHub fine-tunes 11 pretrained speech models (HuBERT, wav2vec 2.0, WavLM and Whisper variants) per language and reports the winner for each. See the best model per language on the Benchmark page.
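Picking the winner for each language amounts to an argmax over that language's scores. The sketch below assumes results are stored as a nested dictionary, which is a guess about the layout; the model and language names and all accuracies are placeholders, not benchmark numbers.

```python
def best_model_per_language(results):
    """Map each language to the (model, accuracy) pair with the highest accuracy.

    `results` is {language: {model: test_accuracy}}. On ties, the model listed
    first in the inner dict wins, because max() keeps the first maximum it sees.
    Languages with no scores are skipped.
    """
    best = {}
    for language, scores in results.items():
        if not scores:
            continue
        model = max(scores, key=scores.get)
        best[language] = (model, scores[model])
    return best


# Placeholder scores for illustration only.
results = {
    "lang-1": {"model-a": 0.62, "model-b": 0.71, "model-c": 0.58},
    "lang-2": {"model-a": 0.80, "model-b": 0.74, "model-c": 0.80},
}
winners = best_model_per_language(results)
```

In this toy input the two languages end up with different winners, which is the pattern the answer above describes.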

Are the datasets free to use?

Many of the cataloged SER datasets are openly available, while others require a request or agreement with the original authors. Each dataset page states its license and access method. Always check the original dataset's license before research or commercial use.

What is Speech Emotion Recognition (SER)?

Speech Emotion Recognition (SER) is the task of automatically identifying a speaker's emotional state — such as happy, sad, angry, fearful or neutral — from their voice. It is a core problem in affective computing with applications in human–computer interaction, call-center analytics, healthcare and accessibility.

How can I cite KuralHub?

KuralHub will appear at Interspeech 2026. Use the BibTeX entry in the citation section; final details will be added once the paper is published.

Citation

Cite KuralHub

BibTeX
@inproceedings{kuralhub2026,
  title     = {KuralHub: A Comprehensive Review of Speech Emotion Recognition Datasets},
  author    = {Thavarasa, Luxshan and Thevakumar, Jubeerathan and
               Sivatheepan, Thanikan and Thayasivam, Uthayasanker},
  booktitle = {Interspeech},
  year      = {2026},
  note      = {To appear}
}

Accepted at Interspeech 2026 — full citation details will be finalized once the paper is published.