Step-Audio-EditX
[Paper]
Chao Yan*, Boyong Wu*, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang,
Xiangyu (Tony) Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
StepFun
Abstract: We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing—encompassing emotion, speaking style, and paralinguistics—alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks. Our code and models are available at https://github.com/stepfun-ai/Step-Audio-EditX.
An overview of the architecture of Step-Audio-EditX
Zero-Shot Cloning
| Prompt Text | Prompt Audio | Task | Clone Text | Output Audio |
|---|---|---|---|---|
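| 越明白被爱的珍贵,越能主动追寻更好的自己。 (English: The more you understand how precious it is to be loved, the more actively you can pursue a better self.) | | Chinese-Chinese | 生活偶尔会有风雨,但乌云散了之后,阳光会更暖。 (English: Life brings the occasional storm, but once the dark clouds clear, the sunshine feels warmer.) | |
| 不必纠结过去的遗憾,你当下的选择就是最好的安排。 (English: There is no need to dwell on past regrets; the choice you make now is the best arrangement.) | | Chinese-English | You might not see it now, but all the late nights, the quiet efforts, and the moments you wanted to give up are planting seeds. They’ll bloom when the time is right. | |
| His political stance was conservative, and he was particularly close to Margaret Thatcher. | | English-English | Underneath the courtyard is a large underground exhibition room which connects the two buildings. | |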
Emotion Editing
| Emotion | Text | Source | Edit 1st | Edit 2nd | Edit 3rd |
|---|---|---|---|---|---|
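| Fear | 我总觉得,有人在跟着我,我能听到奇怪的脚步声。 (English: I keep feeling that someone is following me; I can hear strange footsteps.) | ||||
| Happy | 今天天气真好,阳光明媚的,心情也跟着好了起来。 (English: The weather is lovely today, bright and sunny, and my mood has lifted along with it.) | ||||
| Angry | 这次的体验简直恶心透了!说好的雪山呢?根本没机会爬! (English: This experience was utterly disgusting! What happened to the snow mountain we were promised? We never even got the chance to climb it!) | ||||
| Surprised | 天哪!这真的能帮到那些被伤害的人吗?简直不可思议! (English: Oh my god! Can this really help the people who have been hurt? It is simply unbelievable!) | ||||
| Sad | 准备了很久的一个重要计划,突然被告知取消了,感觉所有心血都白费了,真难受。 (English: An important plan I had been preparing for a long time was suddenly cancelled; it feels like all my effort was wasted, and it really hurts.) | ||||
| Confusion | “我喜欢你”?等等,你从哪里得出这个结论的?我刚刚只是在讨论工作流程啊,你这自我感觉也太良好了吧? (English: "I like you"? Wait, where did you get that conclusion from? I was only discussing the workflow just now; aren't you a bit too full of yourself?) | ||||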
| Happy | You know, I just finished that big project and feel so relieved. Everything seems easier and more colorful, what a wonderful feeling! | ||||
| Angry | Seriously? 'Your call is very important to us.' If it were important, you should pick up the phone! This is the last time I'm calling. | ||||
| Fear | My heart is still racing. This car just blew through the red light... I had to slam on my brakes. I mean, a second later and it would have hit me. | ||||
| Confusion | Why are my keys... in the fruit bowl? Next to the bananas? I have absolutely no memory of doing this. What is going on? | ||||
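The Edit 1st through Edit 3rd columns above show repeated editing passes. One way to picture this is the loop sketched below. It assumes that each pass takes the previous pass's output as its input, which is our reading of "iterative" and is not stated on this page. The `edit_fn` callable and the `fake_edit` stand-in are hypothetical placeholders, not part of the Step-Audio-EditX code; a real call would go through the released model.

```python
from typing import Callable, List


def iterative_edit(
    source: bytes,
    edit_fn: Callable[[bytes, str], bytes],
    target: str,
    rounds: int = 3,
) -> List[bytes]:
    """Apply edit_fn repeatedly, feeding each output into the next pass.

    Returns one output per round, mirroring the Edit 1st/2nd/3rd columns.
    """
    outputs: List[bytes] = []
    current = source
    for _ in range(rounds):
        current = edit_fn(current, target)  # assumed: previous output is the new source
        outputs.append(current)
    return outputs


def fake_edit(audio: bytes, target: str) -> bytes:
    """Hypothetical stand-in for a model call; it only appends a marker."""
    return audio + b"+" + target.encode("utf-8")


if __name__ == "__main__":
    for i, out in enumerate(iterative_edit(b"src", fake_edit, "happy"), start=1):
        print(i, out)
```

Keeping every intermediate output, rather than only the last one, makes it possible to compare passes side by side as the tables on this page do.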
Speaking Style Editing
| Speaking Style | Text | Source | Edit 1st | Edit 2nd | Edit 3rd |
|---|---|---|---|---|---|
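| Whisper | 比如在工作间隙,做一些简单的伸展运动,放松一下身体,这样,会让你更有精力. (English: For example, during breaks at work, do some simple stretches to relax your body; that will give you more energy.) | ||||
| Roar | 你到底想怎么着,上学的时候懒得学,工作的时候没时间学。 (English: What on earth do you want? You were too lazy to study when you were in school, and now that you work you have no time to study.) | ||||
| Serious | 如果技术能前进一步,可以延长寿命哪怕多一年,就是天文数字的巨额创收。 (English: If the technology could advance one step and extend lifespans by even one more year, that would mean astronomical revenue.) | ||||
| Child | 我今天帮小猫咪找到家了,它还舔了我的手呢! (English: I helped a little kitty find its home today, and it even licked my hand!) | ||||
| Exaggerated | 真可爱呢,这么努力想得到我的关注,就差把你的手都吞下去了 (English: How adorable, trying so hard to get my attention; you are just short of swallowing your whole hand.) | ||||
| Arrogant | 哼,他那方案错误百出,我修改后无可挑剔,真不知道他怎么敢拿出来讨论,要是我早重做了。 (English: Hmph, his proposal was riddled with errors; after my revisions it is flawless. I really don't know how he dared bring it up for discussion. If it were me, I would have redone it long ago.) | ||||
| Act_coy | 你最近都不怎么陪我了,是不是不爱我了? (English: You haven't been spending much time with me lately. Don't you love me anymore?) | ||||
| Generous | 成败就在今朝!跟了我这么多年,我不会让兄弟们失望,干就完了! (English: Success or failure is decided today! You have followed me all these years, and I won't let my brothers down. Let's just do it!) | ||||
| Older | 今天真的是辛苦你们了,你们早点休息,明天我再来接你们。 (English: You have really worked hard today. Get some rest early, and I will come pick you up tomorrow.) | ||||
| Recite | 群山连绵起伏,像一条巨龙盘踞在大地之上。晨雾缭绕在山腰,如同轻纱般柔软,将那些棱角分明的山峰隐去了一半。 (English: The mountains roll on in an unbroken chain, like a giant dragon coiled upon the earth. Morning mist winds around the slopes, soft as gauze, hiding half of the sharp-edged peaks.) | ||||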
| Whisper | I'm right here with you... you're safe... everything is okay... | ||||
| Roar | EVERY SINGLE TIME! EVERY. SINGLE. TIME! YOU PROMISE CHANGE AND DELIVER NOTHING! | ||||
| Serious | I'm not angry. I'm not going to yell. But you need to listen carefully to what I'm about to say. | ||||
| Exaggerated | This is literally the MOST INSANE thing that has EVER happened to me in my ENTIRE LIFE! | ||||
| Arrogant | You should be grateful I'm even acknowledging your existence right now. Most people? I wouldn't waste my breath. | ||||
| Generous | You know what your problem is?! You think too much! Stop thinking! Just DO! Be BOLD! Be FEARLESS! | ||||
Paralinguistics Editing
| Text | Source | Paralinguistic Text | Edit Output |
|---|---|---|---|
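| 你说的这个计划听起来不错,我觉得可以试试,说不定真能成功呢。 (English: The plan you mentioned sounds good; I think we can give it a try, and it might really succeed.) | | 你说的这个计划听起来不错,我觉得可以试试 [Confirmation-en],说不定真能成功呢。 (English: The plan you mentioned sounds good; I think we can give it a try [Confirmation-en], and it might really succeed.) | |
| 我觉得这个计划大概是可行的,不过还需要再仔细考虑一下。 (English: I think this plan is probably feasible, but it still needs some more careful thought.) | | 我觉得这个计划大概是可行的,[Uhm],不过还需要再仔细考虑一下。 (English: I think this plan is probably feasible, [Uhm], but it still needs some more careful thought.) | |
| 你这次又忘记带钥匙了,真是拿你没办法。 (English: You forgot your keys again this time; I really don't know what to do with you.) | | 你这次又忘记带钥匙了 [Dissatisfaction-hnn],真是拿你没办法。 (English: You forgot your keys again this time [Dissatisfaction-hnn]; I really don't know what to do with you.) | |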
| I just ran to catch the bus and barely made it. | | I just ran to catch the bus [Breathing] and barely made it. | |
| Wait, you're telling me you finished the entire book in one day? That's incredible! | | Wait, you're telling me you finished the entire book in one day? [Surprise-oh] That's incredible! | |
| Are we still on for dinner at eight? Sounds good. | | Are we still on for dinner at eight? [Confirmation-en], sounds good. | |
| I was thinking we could go to the movie first and then maybe grab some dinner afterward. | | I was thinking we could go to the movie first and then [Uhm] maybe grab some dinner afterward. | |
| You actually finished the whole book in one day, that's amazing! | | You actually finished the whole book in one day [Surprise-wa], that's amazing! | |
| I was thinking we could go to the beach this weekend, but maybe the weather won't be great. | | I was thinking we could go to the beach this weekend, but [Uhm] maybe the weather won't be great. | |
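The bracketed tags in the Paralinguistic Text column, such as `[Uhm]` or `[Surprise-wa]`, mark where a non-verbal sound is inserted into the plain text. The sketch below shows one way to split a tagged sentence into its plain text and its list of tags. The tag syntax is inferred only from the examples in the table above; the function name and the cleanup rules are illustrative assumptions, not the project's own preprocessing.

```python
import re
from typing import List, Tuple

# Tags as they appear in the table above: letters, optionally hyphenated,
# inside square brackets, e.g. [Uhm], [Breathing], [Confirmation-en].
TAG_PATTERN = re.compile(r"\[([A-Za-z]+(?:-[A-Za-z]+)*)\]")


def split_paralinguistic_text(tagged: str) -> Tuple[str, List[str]]:
    """Return (plain_text, tags) for a sentence containing bracketed tags."""
    tags = TAG_PATTERN.findall(tagged)
    plain = TAG_PATTERN.sub("", tagged)
    # Collapse the whitespace left behind where a tag was removed.
    plain = re.sub(r"\s+", " ", plain)
    # Drop any space that now sits directly before ASCII punctuation.
    plain = re.sub(r"\s+([,.!?;:])", r"\1", plain)
    return plain.strip(), tags


if __name__ == "__main__":
    text, tags = split_paralinguistic_text(
        "I just ran to catch the bus [Breathing] and barely made it."
    )
    print(text)  # I just ran to catch the bus and barely made it.
    print(tags)  # ['Breathing']
```

The cleanup only handles ASCII punctuation and single spaces, so sentences where a tag sits between two punctuation marks, as in some of the Chinese rows, would need extra rules.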
Extension
| Task | Text | Source | Edit Output |
|---|---|---|---|
| Denoising | Such legislation was clarified and extended from time to time thereafter. No, the man was not drunk, he wondered how we got tied up with this stranger. Suddenly, my reflexes had gone. It's healthier to cook without sugar. | ||
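| Silence Trimming | 就是说你比如说我一共在这次看病我一共花了一百块钱,其中呢医生的这个劳动价值占了三十块钱。 (English: That is to say, for example, on this doctor's visit I spent one hundred yuan in total, of which the value of the doctor's labor accounted for thirty yuan.) | ||
| Speed Editing Faster | 上次你说鞋子有点磨脚,我给你买了一双软软的鞋垫。 (English: Last time you said your shoes were rubbing your feet a bit, so I bought you a pair of soft insoles.) | ||
| Speed Editing Slower | 人山人海,热闹非凡,没想到人气这么旺。 (English: A sea of people, bustling with excitement; I didn't expect it to be this popular.) | ||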