Introduction 引言
Open-source Chinese speech corpora have primarily focused on single-speaker read speech or scripted interactions. Large-scale datasets that capture natural full-duplex conversational dynamics remain limited. In particular, publicly available resources rarely provide multi-channel recordings of spontaneous multi-party conversations with overlapping speech, backchannels, interruptions, pauses, and other interaction phenomena fundamental to human communication.
To address this gap, we release SmoothConv and DuplexConv, a pair of complementary Chinese multi-channel conversational speech datasets. Together they comprise 2,100 hours of naturally occurring multi-party interactions collected in Tutoring and Social Chat scenarios, and they are intended to support research on full-duplex spoken dialogue, turn-taking modeling, and interruption handling.
SmoothConv contains 100 hours of carefully curated conversations with expert human annotations, including high-quality transcripts, millisecond-level timestamps, turn boundaries, overlap regions, pause information, speaker attributes, and interaction-related labels. It serves as a benchmark resource for fine-grained conversational analysis and supervised model training.
DuplexConv scales the corpus to 2,000 hours through an LLM-assisted annotation pipeline, providing large-scale transcripts, turn structures, speaker information, and scene-level contextual labels.
Together, SmoothConv and DuplexConv establish a unified resource for studying conversational behavior in realistic full-duplex settings, supporting research that spans fine-grained interaction modeling and robust speech understanding.
开源中文语音语料库先前主要面向单人朗读或脚本化交互,能够刻画自然全双工会话动态的大规模数据集仍然有限。尤其是,公开资源中鲜有提供多通道、真实自发式多方对话录音,涵盖重叠语音、附和、打断、停顿等对人类交流至关重要的交互现象。
为填补这一空白,我们发布 SmoothConv 与 DuplexConv 两个互补的中文多通道会话语音数据集,共收录 2,100 小时真实自然的多方对话,覆盖教育与闲聊两类场景,旨在支持全双工口语对话、话轮建模与打断处理等方面的研究。
SmoothConv 包含 100 小时精心筛选的对话及人工精标,提供高质量转写、毫秒级时间戳、话轮边界、重叠区段、停顿信息、说话人属性及交互相关标签,可作为细粒度会话分析与监督模型训练的基准资源。
DuplexConv 通过大模型辅助标注流水线将语料扩展至 2,000 小时,提供大规模转写、话轮结构、说话人信息与场景级语境标签。
SmoothConv 与 DuplexConv 相辅相成,构成面向真实全双工场景的统一数据资源,可支撑从细粒度交互建模到鲁棒语音理解的系统性研究。
Annotation Samples 标注样例
Annotation visualizations for SmoothConv and DuplexConv in Tutoring and Social Chat scenarios. SmoothConv 与 DuplexConv 在教育、闲聊场景下的标注可视化示例。
Each video is a 45-second preview clip excerpted from the full conversation. 以下视频均为便于预览而从完整对话中截取的 45 秒片段。
Data Distribution 数据分布
Overview of the open-source SmoothConv and DuplexConv data composition. SmoothConv 与 DuplexConv 开源发布版本的数据分布概览。
Ethics Statement 伦理声明
Guidelines for the responsible collection, release, and use of SmoothConv and DuplexConv. 关于 SmoothConv 与 DuplexConv 数据采集、发布与使用的伦理说明。
- Informed consent. Conversations were recorded with the knowledge and consent of participants. Personal identifiers were removed or anonymized before release.
- Privacy protection. The datasets are released for academic and research purposes only. Users must not attempt to re-identify speakers or reconstruct private information from the audio or annotations.
- Intended use. SmoothConv and DuplexConv are intended for research on spoken dialogue, turn-taking, and speech understanding. They must not be used for unauthorized surveillance, impersonation, or the generation of deceptive content.
- Limitations & bias. Annotations may contain errors; DuplexConv labels are machine-assisted. Researchers should account for domain, demographic, and annotation bias when training or evaluating models.
- Responsible use. By using these datasets, you agree to comply with applicable laws and ethical guidelines. Report suspected misuse to jimz@qualialabs.ai.
- 知情同意。全部对话均在参与者知情同意的情况下采集;发布前已去除或匿名化处理可识别个人身份的信息。
- 隐私保护。数据集仅供学术与研究使用,不得试图重新识别说话人身份,或从音频与标注中还原个人隐私信息。
- 用途限定。SmoothConv 与 DuplexConv 旨在服务口语对话、话轮建模与语音理解等研究,不得用于未经授权的监控、身份冒充或生成具有欺骗性的内容。
- 局限与偏差。标注可能存在误差,DuplexConv 还包含机器辅助标签。开展研究与模型训练时,应充分考虑场景、人群分布及标注方式可能带来的偏差。
- 负责任使用。使用本数据集即表示同意遵守相关法律法规与伦理规范。如发现疑似滥用行为,请联系 jimz@qualialabs.ai。