SmoothConv & DuplexConv: Large-Scale Chinese Full-Duplex Speech Datasets for Conversational AI

Introduction

Open-source Chinese speech corpora have focused primarily on single-speaker read speech or scripted interactions, and large-scale datasets that capture natural full-duplex conversational dynamics remain limited. In particular, publicly available resources rarely provide multi-channel recordings of spontaneous multi-party conversations that contain overlapping speech, backchannels, interruptions, pauses, and other interaction phenomena fundamental to human communication.

To address this gap, we release SmoothConv and DuplexConv, two complementary Chinese multi-channel conversational speech datasets. Together they comprise 2,100 hours of naturally occurring multi-party interactions collected in Tutoring and Social Chat scenarios, and they are intended to support research on full-duplex spoken dialogue, turn-taking modeling, and interruption handling.

SmoothConv contains 100 hours of carefully curated conversations with expert human annotations, including high-quality transcripts, millisecond-level timestamps, turn boundaries, overlap regions, pause information, speaker attributes, and interaction-related labels. It serves as a benchmark resource for fine-grained conversational analysis and supervised model training.
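
The annotation fields listed above (millisecond-level timestamps, overlap regions, pause information) relate to one another in a simple way, which the following minimal sketch illustrates. The release format is not described here, so everything in it is an assumption: the `Segment` class, its field names, the speaker labels, the toy timestamps, and the `min_gap_ms` threshold are hypothetical and are not the datasets' actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One annotated speech segment on one channel (hypothetical schema)."""
    speaker: str
    start_ms: int
    end_ms: int
    text: str = ""


def overlap_regions(a, b):
    """Return (start_ms, end_ms) spans where both channels are speaking.

    Assumes each list is sorted by start time and that segments within
    a single channel do not overlap each other.
    """
    regions = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i].start_ms, b[j].start_ms)
        end = min(a[i].end_ms, b[j].end_ms)
        if start < end:
            regions.append((start, end))
        # Advance the segment that ends first; the other may still
        # overlap with a later segment on the opposite channel.
        if a[i].end_ms <= b[j].end_ms:
            i += 1
        else:
            j += 1
    return regions


def pauses(a, b, min_gap_ms=200):
    """Return gaps of at least min_gap_ms where neither channel speaks.

    The 200 ms default is an arbitrary illustrative threshold.
    """
    merged = sorted(a + b, key=lambda s: s.start_ms)
    gaps = []
    if not merged:
        return gaps
    covered_until = merged[0].end_ms
    for seg in merged[1:]:
        if seg.start_ms - covered_until >= min_gap_ms:
            gaps.append((covered_until, seg.start_ms))
        covered_until = max(covered_until, seg.end_ms)
    return gaps


# Toy two-channel example with made-up timestamps.
tutor = [Segment("tutor", 0, 2400), Segment("tutor", 3000, 5200)]
student = [Segment("student", 2100, 2900), Segment("student", 5800, 7000)]

print(overlap_regions(tutor, student))  # [(2100, 2400)]
print(pauses(tutor, student))           # [(5200, 5800)]
```

In this toy example the student starts speaking 300 ms before the tutor finishes, which yields one overlap region, and the 600 ms of joint silence before the student's second segment is reported as a pause. The 100 ms gap between 2900 and 3000 falls below the assumed threshold and is ignored.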

DuplexConv scales the corpus to 2,000 hours through an LLM-assisted annotation pipeline, providing large-scale transcripts, turn structures, speaker information, and scene-level contextual labels.

Together, SmoothConv and DuplexConv establish a unified resource for studying conversational behavior in realistic full-duplex settings, supporting research that spans fine-grained interaction modeling and robust speech understanding.

Annotation Samples

Annotation visualizations for SmoothConv and DuplexConv in Tutoring and Social Chat scenarios.

For preview, each video is a 45-second clip excerpted from the full conversation.

Tutoring

SmoothConv Expert Annotation

Sample 1

DuplexConv Auto Annotation

Sample 1

SmoothConv Expert Annotation

Sample 2

DuplexConv Auto Annotation

Sample 2

Social Chat

SmoothConv Expert Annotation

Sample 1

DuplexConv Auto Annotation

Sample 1

SmoothConv Expert Annotation

Sample 2

DuplexConv Auto Annotation

Sample 2

Data Distribution

Overview of the data composition of the open-source SmoothConv and DuplexConv releases.

SmoothConv statistics: domains, channels, turn states, and paralinguistic labels.

DuplexConv statistics: domains, channels, turn states, and dialogue types.
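
Distributions of this kind can be recomputed from per-recording metadata. The sketch below shows one way to total hours per domain and convert the totals into shares. It assumes a hypothetical manifest in which each record carries a `domain` label and a `duration_s` field; these names and the toy values are illustrative only and are not taken from the released data.

```python
from collections import defaultdict


def hours_by_domain(records):
    """Sum recording durations (in seconds) into hours per domain."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["domain"]] += rec["duration_s"] / 3600.0
    return dict(totals)


def shares(totals):
    """Convert per-domain totals into fractions of the overall total."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {domain: hours / grand_total for domain, hours in totals.items()}


# Toy manifest with made-up durations, not real dataset statistics.
manifest = [
    {"domain": "tutoring", "duration_s": 1800},
    {"domain": "social_chat", "duration_s": 5400},
    {"domain": "tutoring", "duration_s": 3600},
]

totals = hours_by_domain(manifest)
print(totals)          # {'tutoring': 1.5, 'social_chat': 1.5}
print(shares(totals))  # {'tutoring': 0.5, 'social_chat': 0.5}
```

The same grouping applies to any other label in the figures, such as channel count or turn state, by swapping the key that is read from each record.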

Ethics Statement

Guidelines for the responsible collection, release, and use of SmoothConv and DuplexConv.

Informed consent. All conversations were recorded with the knowledge and consent of participants. Personal identifiers were removed or anonymized prior to release.
Privacy protection. The datasets are released for academic and research purposes only. Users must not attempt to re-identify speakers or reconstruct private information from the audio or annotations.
Intended use. SmoothConv and DuplexConv are intended for research on spoken dialogue, turn-taking, and speech understanding. They must not be used for unauthorized surveillance, impersonation, or generating deceptive content.
Limitations & bias. Annotations may contain errors; DuplexConv labels are machine-assisted. Researchers should account for domain, demographic, and annotation bias when training or evaluating models.
Responsible use. By using these datasets, you agree to comply with applicable laws and ethical guidelines. Report suspected misuse to jimz@qualialabs.ai.
