Speaker-Adaptive TTS

Research Overview

Speaker-Adaptive Text-to-Speech (TTS) aims to synthesize natural-sounding speech that accurately mimics the identity, timbre, and prosody of a specific target speaker, often from only a few seconds of reference audio (the zero-shot scenario).

My research addresses three open challenges in this field: generalization, cross-lingual adaptation, and data scarcity.

Related Papers

[ 1 ]

LESpeech: Break Language Barriers for Zero-shot Cross-language Speaker-Adaptive Text-to-Speech

Wenbin Wang, Yang Song, Paul Holmberg, et al.

Under Review (2025)

[ 2 ]

USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Wenbin Wang, Yang Song, Sanjay Jha

IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), Vol. 32 (2024)

[ 3 ]

GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech

Wenbin Wang, Yang Song, Sanjay Jha

Interspeech (2024)

[ 4 ]

Generalizable Zero-Shot Speaker-Adaptive Speech Synthesis with Disentangled Representations

Wenbin Wang, Yang Song, Sanjay Jha

Interspeech (Oral) (2023)

[ 5 ]

AutoLV: Automatic lecture video generator

Wenbin Wang, Yang Song, Sanjay Jha

IEEE International Conference on Image Processing (ICIP) (2022)