StepAudio 2.5 Realtime

Overview

The real-time voice model that truly understands you.

Craft bespoke personas across every dimension, and stay in character through every breath and laugh.

StepAudio 2.5 Realtime is an end-to-end real-time speech large language model whose core capability is human-like conversation with fully customizable personas. Both its responses and its vocal expressiveness are designed to come close to real human conversation, and StepFun reports state-of-the-art results in both conversational IQ and EQ.

The model debuts with “Xiao Yue” (小跃), which StepFun describes as the industry's first soul-level AI companion built for casual, emotionally rich conversation. Xiao Yue aims for the relaxed feel of chatting with a close friend: witty, opinionated, quick with a joke, and full of personality.

Model Name	StepAudio 2.5 Realtime
Model Type	End-to-End Realtime Speech Large Language Model
Core Capability	Human-like conversation with fully customizable personas
Supported Languages	Chinese, English
Persona System	Customizable personality, catchphrases, conversational style, and emotional reactions
Release Date	May 2026
Developer	StepFun (阶跃星辰)

Architecture

Three innovations behind human-like conversation.

A purpose-built training pipeline, spanning data, alignment, and generation, turns large-scale persona data into stable, expressive, human-like voice interaction.

01

Million-Scale Persona Data Augmentation

Starting from more than 10,000 high-quality, natively authored personas, we apply algorithmic augmentation to build a million-scale persona feature matrix, and train on it together with millions of real-world conversational samples. This gives the model a robust generalization base, so it can handle and extend even challenging long-tail topics.

02

Roleplay-Specific RLHF Alignment

Going out of character (OOC) is the most common failure in AI roleplay. We ran dedicated RLHF (reinforcement learning from human feedback) optimization for persona consistency in roleplay scenarios. In extreme adversarial stress tests, StepAudio 2.5 Realtime holds firmly to its assigned persona, which shows highly stable roleplay.

03

Unified Understanding & Generation

For vocal expression, the model inherits the capabilities of StepAudio 2.5 TTS and fuses speech understanding with generation through reinforcement learning. This yields two levels of control: global, scene-level tone setting and fine-grained shaping of detail within a sentence. The model can read the atmosphere of a conversation and respond with matching vocal nuance.

Features

What makes it different.

Four pillars of a human-like voice AI, from persona crafting to emotional intelligence.

Flagship Persona “Xiao Yue”

Described by StepFun as the industry's first soul-level AI companion: warm, funny, and opinionated, with a real personality. Rather than a cold assistant, Xiao Yue is a lively companion that picks up on jokes, so chatting feels like talking with a close friend who gets you.

Unlimited Persona Customization

Full-dimensional persona customization that is not limited to preset templates. You get fine-grained control over personality traits, catchphrases, emotional boundaries, and vocal style.
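One way to keep persona definitions manageable is to store them as structured fields and render them into the free-text `instructions` string that the Quick Start `session.update` message carries. The sketch below is only an illustration of that idea: the helper names, the persona field names, and the sample "Coach Lin" persona are all hypothetical, and the page does not say that the API accepts structured persona fields directly.

```javascript
// Hypothetical helper: flattens structured persona fields into one instructions string.
// Field names (name, personality, catchphrases, style, emotionalReactions) are illustrative.
function buildPersonaInstructions(persona) {
  const lines = [`You are "${persona.name}".`];
  if (persona.personality) {
    lines.push(`Personality: ${persona.personality}`);
  }
  if (persona.catchphrases && persona.catchphrases.length > 0) {
    lines.push(`Catchphrases to use naturally: ${persona.catchphrases.join(" / ")}`);
  }
  if (persona.style) {
    lines.push(`Conversational style: ${persona.style}`);
  }
  if (persona.emotionalReactions) {
    lines.push(`Emotional reactions: ${persona.emotionalReactions}`);
  }
  lines.push("Stay in character for the whole conversation.");
  return lines.join("\n");
}

// Wraps the instructions in a payload shaped like the Quick Start session.update example.
function buildSessionUpdate(persona, voice) {
  return {
    type: "session.update",
    session: {
      instructions: buildPersonaInstructions(persona),
      voice,
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
    },
  };
}

// Made-up sample persona, used only to exercise the helpers.
const mentor = {
  name: "Coach Lin",
  personality: "stern but fair, dry sense of humor",
  catchphrases: ["Again, from the top.", "Not bad. Not good either."],
  style: "short sentences, direct feedback",
  emotionalReactions: "softens when the user is discouraged",
};

const payload = buildSessionUpdate(mentor, "xiaoyue");
console.log(payload.session.instructions);
```

The resulting object could be passed to `ws.send(JSON.stringify(payload))` once the socket is open. "xiaoyue" is the only voice name that appears on this page, so whether other voice values exist is left to the API documentation.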

Context-Aware Expressiveness

Reads conversational context at fine granularity to adjust pacing, emphasis, and subtext, and naturally weaves in soft laughter, sighs, and other authentic vocal details so that each response fits the moment.

Dual IQ & EQ Leadership

Strong performance in both intellectual depth and emotional intelligence. The model understands complex meaning and can banter, and its paralinguistic perception picks up hesitation or laughter in your voice so it can respond quickly with well-calibrated empathy.

Demo Showcase

Hear the Difference: Real Demos


What you can build

From emotional companions to professional training tools: one model, many scenarios.

💬

Emotional Companion

Casual chat, bedtime talks, venting, and emotional support, with empathy and humor, from a companion that catches what you leave unsaid.

🎭

Character Roleplay

Define any persona, from a sweet companion to a stern mentor, and it stays consistent even under extreme stress tests.

🧩

Knowledge Games

Rapid-fire trivia, brain teasers, and Chinese poetry games such as Feihualing (飞花令), with highly engaging interaction.

🎓

Skill Training

Intensive mock interviews with deep follow-up questions and professional-grade feedback. StepFun describes the interview depth as far beyond that of comparable products.

🎵

Paralinguistic Perception

Captures emotional details such as sighs, soft laughter, and a choked-up voice, and reads implicit tones such as playfulness and impatience, then responds in kind within milliseconds.

🚗

In-Car Assistant

Strong in-vehicle dialogue performance: stable and fluent even in noisy environments, with natural interaction and task completion.

Evaluation

Comprehensive benchmark leadership.

Benchmarked against leading real-time voice models across five dimensions, StepAudio 2.5 Realtime ranks first in every one.

We ran a suite of subjective and objective evaluations comparing StepAudio 2.5 Realtime with leading real-time voice dialogue models. The five dimensions include general dialogue, in-car scenarios, paralinguistic understanding, and spoken QA. StepAudio 2.5 Realtime ranked first on all of them.

The subjective human evaluation, in which human raters score real conversations held through a mobile app, is the metric most representative of actual user experience. Here StepAudio 2.5 Realtime scored 80.41, reflecting its core strength in human-like dialogue. It scored 86.36 on the objective general dialogue benchmark and 84.80 on the in-car scenario benchmark, which covers tasks such as navigation, vehicle control, and information queries.

The lead is most pronounced in deep speech understanding. StepAudio 2.5 Realtime scored 82.18 on paralinguistic comprehension, which tests perception of acoustic features such as speaking rate, emotion, and age. On the spoken QA benchmark, which covers 11 audio understanding tasks, it scored 79.80, indicating strong audio reasoning.

5/5	Benchmarks ranked #1
80.41	Human evaluation (subjective), highest score
86.36	General dialogue (objective)
79.80	Spoken QA (objective)

Benchmark Comparison

Human Eval is a subjective evaluation run through real mobile app sessions; all other benchmarks are objective evaluations run through the API. All models were tested in April 2026. Higher is better.

Quick Start

Start building in minutes.

Connect over WebSocket and start a real-time voice session with a custom persona.

// Connect to the Realtime API.
// The API key is passed as a WebSocket subprotocol; replace <YOUR_KEY> with your own key.
const ws = new WebSocket(
  "wss://api.stepfun.com/v1/realtime?model=step-2.5-realtime",
  ["realtime", "openai-insecure-api-key.<YOUR_KEY>"]
);

// Configure a custom persona
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: `You are "Xiao Yue", a witty and warm companion
        who chats like a real close friend...`,
      voice: "xiaoyue",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16"
    }
  }));
};
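The session above declares `pcm16` for both input and output audio, while browser audio capture typically yields 32-bit float samples in the range [-1, 1]. The following sketch shows one common way to convert such samples to 16-bit PCM and base64-encode them for a JSON message. The function names are illustrative, and the assumption that audio chunks travel as base64 strings is not confirmed by this page.

```javascript
// Convert Float32 samples in [-1, 1] to 16-bit signed PCM.
function floatTo16BitPCM(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    // The int16 range is asymmetric: negatives scale by 32768, positives by 32767.
    pcm[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return pcm;
}

// Base64-encode the raw PCM bytes (byte order follows the platform, which is
// little-endian on mainstream hardware).
function pcm16ToBase64(pcm) {
  const bytes = new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  if (typeof Buffer !== "undefined") {
    return Buffer.from(bytes).toString("base64"); // Node.js
  }
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary); // browsers
}

// Example: encode a tiny chunk of samples.
const chunk = floatTo16BitPCM(new Float32Array([0, 0.5, -0.5, 1]));
const encoded = pcm16ToBase64(chunk);
console.log(encoded);
```

How the encoded chunk is then sent is up to the API: the event type and field names for streaming input audio are not shown on this page, so consult the API documentation before wiring this into `ws.send`.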

API Documentation

Full reference for the Realtime WebSocket API.

Experience Center

Try StepAudio 2.5 Realtime in your browser.

Demo Page

Interactive demos and sample conversations.

Future Outlook

What's next.

Honestly, we're still on the way.

This version of StepAudio 2.5 Realtime gives initial shape to a conversational persona with warmth, temper, and personality. But we are well aware that this is only the starting point.

Today, nearly every voice AI is trying to “imitate humans.” In our view, though, copying the unspoken rules of human social interaction is merely a compromise. Humans and machines are fundamentally asymmetric: a machine never loses focus, never forgets, and needs no small talk. What we are exploring is a next-generation paradigm for human-machine interaction that is truly native rather than a clumsy imitation.

We firmly believe that in a truly complete Agent, “human warmth” and “the ability to act” must form one organic whole. Such an Agent has a vivid personality and lasting memory, and understands you naturally; it also has strong execution capability and gets things done for you cleanly.

Through a fundamental transformation of the underlying architecture, we aim to evolve the Agent into a dedicated partner that both bonds with you and acts on your behalf: one with warmth, with unity of knowledge and action, and, further in the future, with eyes open to perceive the world. This is the future we want to create.