Saturday April 11, 2026 12:15pm - 2:15pm GMT+07

Authors - Pranay Kavthankar, Rutuj Koli, Ronit Ghadi, Yug Mora, Abhijit Joshi
Abstract - Speech-to-Speech Translation (S2ST) has evolved from cascaded pipelines into end-to-end neural architectures. However, preserving emotion, prosody, and speaker identity across languages remains challenging. This survey examines state-of-the-art emotion- and identity-preserving S2ST and neural TTS systems, covering discrete-representation models, end-to-end systems, and cascaded pipelines. We analyze architectures including Translatotron, VQ-Translatotron, SeamlessM4T, VALL-E, VALL-E X, VITS, YourTTS, StyleTTS2, and XTTSv2. The survey discusses speaker identity preservation (x-vectors, d-vectors, codec representations), prosody modeling (pitch, duration, energy), emotion retention (categorical, dimensional, embeddings), datasets, evaluation metrics, and challenges including data scarcity, cross-lingual emotion transfer, and computational costs. We propose future directions toward large-scale expressive datasets, improved cross-lingual modeling, and responsible AI practices.
Virtual Room E Bangkok, Thailand
