AI & Machine Learning
MarkTechPost7 days ago
25

StepFun Releases StepAudio 2.5 Realtime: An End-to-End Voice Model with Roleplay-Specific RLHF and Paralinguistic Comprehension

AI

StepFun released StepAudio 2.5 Realtime, an end-to-end real-time voice model with roleplay-specific RLHF and paralinguistic comprehension, achieving top benchmark scores.

StepFun Releases StepAudio 2.5 Realtime: An End-to-End Voice Model with Roleplay-Specific RLHF and Paralinguistic Comprehension
Intelligence Insights

The Big Picture

StepFun, a Shanghai-based AI lab, released StepAudio 2.5 Realtime, an end-to-end real-time speech LLM that processes audio input and output through a single unified system, supporting Chinese and English via WebSocket API. The model features three key innovations: million-scale persona data augmentation from 10,000+ seed personas, roleplay-specific RLHF to maintain character consistency, and unified speech understanding and generation with global tonal control. It also excels in paralinguistic comprehension, perceiving non-verbal cues like tone and emotion, scoring 82.18 on benchmarks. In April 2026 evaluations, it ranked first across all five dimensions including human evaluation (80.41), general dialogue (86.36), automotive scenario (84.80), spoken QA (79.80), and paralinguistic comprehension (82.18). The model is accessible via WebSocket at wss://api.stepfun.com/v1/realtime with model string step-2.5-realtime.

Why It Matters

StepAudio 2.5 Realtime marks a leap in conversational AI by enabling real-time, end-to-end voice interaction with consistent persona and emotional awareness. Its paralinguistic comprehension—detecting tone, fatigue, or frustration from audio—could transform customer service, virtual assistants, and gaming, making interactions feel more human. The roleplay-specific RLHF also addresses a key pain point in AI character consistency, opening doors for more immersive entertainment and therapy applications.

Deepen your understanding

Use our AI to break down complex signals.

Select an AI action to generate more depth.

StepFun, the Shanghai-based AI lab, released StepAudio 2.5 Realtime. It is an end-to-end real-time speech large language model with fully customizable persona capabilities.

StepAudio 2.5 Realtime is a voice model that operates in real time. Unlike pipeline-based systems that separate speech recognition, reasoning, and synthesis into sequential steps, this is an end-to-end model. Audio goes in and audio comes out through a single unified system. The model supports Chinese and English.

It connects via a WebSocket API. The endpoint is wss://api.stepfun.com/v1/realtime using the model string step-2.5-realtime.

The Three Technical Pillars

StepFun research team describes three core architectural innovations behind the model:

1. Million-Scale Persona Data Augmentation

Starting from 10,000+ high-quality natively authored personas, StepFun applied algorithmic augmentation to build a million-scale persona feature matrix. This was combined with millions of real-world conversational samples for training. The intent is generalization — specifically, stable performance on difficult, long-tail conversational topics.

Instead of manually labeling millions of persona samples, StepFun team used algorithmic expansion from a curated seed set.

2. Roleplay-Specific RLHF Alignment

A known failure mode in conversational AI is “out-of-character” (OOC) behavior — when a model drifts away from its defined persona mid-conversation. StepFun team conducted dedicated RLHF (Reinforcement Learning from Human Feedback) optimization specifically for persona consistency in roleplay scenarios. RLHF is a training technique where human preference signals are used to train a reward model, which then guides language model behavior. Applying it specifically to roleplay stability is a targeted design choice.

3. Unified Speech Understanding and Generation

StepAudio 2.5 Realtime inherits the StepAudio 2.5 TTS capabilities and deeply fuses speech understanding and generation through reinforcement learning. This enables what StepFun calls “global scene-level tonal setting” and “intra-sentence detail sculpting.” The model can set an overall emotional register for a response while adjusting finer acoustic details within individual sentences.

Paralinguistic Understanding

A technically distinct area of this model is paralinguistic perception. Paralinguistics refers to non-verbal acoustic information in speech — things like tone, speaking rate, pauses, sighs, and laughter. By analyzing these elements, the model can perceive the user’s mood and underlying intentions. For example, it can identify fatigue from a low tone or frustration from a rapid speech rate. Capturing these signals requires the model to operate on audio features rather than transcribed text alone.

StepAudio 2.5 Realtime scored 82.18 on the paralinguistic comprehension benchmark, demonstrating perception of vocal speed, emotion, age, and other acoustic features.

https://stepaudiollm.github.io/step-audio-2.5-realtime/

Benchmark Results

StepFun research team conducted a comprehensive suite of subjective and objective evaluations, benchmarking StepAudio 2.5 Realtime against leading real-time voice models across five dimensions.

Human evaluation is conducted through real mobile app conversations scored by human raters. The scores:

  • Human evaluation (subjective): 80.41
  • General dialogue (objective): 86.36
  • Automotive scenario (objective): 84.80
  • Spoken QA, covering 11 audio understanding tasks (objective): 79.80
  • Paralinguistic comprehension (objective): 82.18

Key Takeaways

  • StepAudio 2.5 Realtime is an end-to-end real-time speech LLM, released by Shanghai-based StepFun.
  • It uses persona-specific RLHF and million-scale data augmentation to maintain stable character consistency.
  • The model ranked first across all five benchmark dimensions, tested in April 2026.
  • Paralinguistic comprehension — perceiving tone, rate, emotion from audio — is a core technical differentiator.
  • API access is via WebSocket at wss://api.stepfun.com/v1/realtime with model string step-2.5-realtime.

Check out the Model Card and DemoAlso, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

The post StepFun Releases StepAudio 2.5 Realtime: An End-to-End Voice Model with Roleplay-Specific RLHF and Paralinguistic Comprehension appeared first on MarkTechPost.

AI Speech Recognition Voice AI RLHF End-to-End Model

Intelligence Exchange

0

Log in to participate in the exchange.

Sign In

Syncing Discussions...

Finding Related Intelligence...
StepFun Releases StepAudio 2.5 Realtime: An End-to-End Voice Model with Roleplay-Specific RLHF and Paralinguistic Comprehension | TechCulture