Speech to Speech Translation for English and Hindi with Speaker Preservation

Dhruv Prasanna; Avinash Nithyashree; Namith V Shetty; Praharsha Kosuri; Pavan A C

Authors

Dhruv Prasanna

Computer Science Engineering Dept, PES University, 100 Feet Ring Road, 560085, Bangalore, Karnataka

Avinash Nithyashree

Computer Science Engineering Dept, PES University, 100 Feet Ring Road, 560085, Bangalore, Karnataka

Namith V Shetty

Computer Science Engineering Dept, PES University, 100 Feet Ring Road, 560085, Bangalore, Karnataka

Praharsha Kosuri

Computer Science Engineering Dept, PES University, 100 Feet Ring Road, 560085, Bangalore, Karnataka

Pavan A C

Computer Science Engineering Dept, PES University, 100 Feet Ring Road, 560085, Bangalore, Karnataka

DOI: https://doi.org/10.21467/proceedings.178.7

Synopsis

This paper presents a speech to speech translation system designed to support accurate communication between English and Hindi speakers, with near real-time responses, while preserving the speaker's original voice. The system uses a cascaded architecture consisting of Automatic Speech Recognition (ASR), Machine Translation (MT), and Text to Speech (TTS) components, and translates in both directions: English speech to Hindi speech and Hindi speech to English speech. To address the difficulties caused by the structural and phonetic differences between Hindi and English, each module is built on transformer-based models. The system provides accurate translations and performs on par with state-of-the-art models and services such as Google Translate and ChatGPT. HuBERT, a speech representation model, is used to clone the speaker's voice, which allows the system to preserve that voice in the translated output and makes communication more effective. HuBERT improves clarity and emotional realism in the TTS stage by using speaker-specific attributes extracted from the original speech to synthesize the translated material in a voice similar to that of the original speaker.
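The cascaded architecture described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `CascadedS2ST`, its field names, and the `stub_*` functions are hypothetical placeholders that only show how data might flow from ASR to MT to TTS, with speaker attributes extracted from the source audio and passed to the synthesis stage. In a real system each stub would be replaced by a transformer-based model, and the speaker encoder by a HuBERT-based component.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CascadedS2ST:
    """Hypothetical wiring of a cascaded speech to speech translator."""
    asr: Callable[[Sequence[float], str], str]               # audio, language -> transcript
    mt: Callable[[str, str, str], str]                       # text, source, target -> translation
    speaker_encoder: Callable[[Sequence[float]], List[float]]  # audio -> speaker attributes
    tts: Callable[[str, str, List[float]], List[float]]      # text, language, voice -> audio

    def translate(self, audio: Sequence[float], src_lang: str, tgt_lang: str) -> List[float]:
        transcript = self.asr(audio, src_lang)               # 1. recognize source speech
        translated = self.mt(transcript, src_lang, tgt_lang)  # 2. translate the transcript
        voice = self.speaker_encoder(audio)                  # 3. extract speaker attributes
        return self.tts(translated, tgt_lang, voice)         # 4. synthesize in a similar voice


# Stub components (assumptions for illustration only; they do no real processing).
def stub_asr(audio: Sequence[float], lang: str) -> str:
    return "hello"


def stub_mt(text: str, src: str, tgt: str) -> str:
    lookup = {("hello", "en", "hi"): "नमस्ते"}
    return lookup.get((text, src, tgt), text)


def stub_speaker_encoder(audio: Sequence[float]) -> List[float]:
    # Stand-in for a learned speaker representation: the mean sample value.
    return [sum(audio) / len(audio)]


def stub_tts(text: str, lang: str, voice: List[float]) -> List[float]:
    # Stand-in synthesis: one output sample per character, tinted by the voice value.
    return [voice[0]] * len(text)


pipeline = CascadedS2ST(stub_asr, stub_mt, stub_speaker_encoder, stub_tts)
hindi_audio = pipeline.translate([0.2, 0.4], "en", "hi")
```

Because each stage is an independent callable, any one component can be swapped or evaluated on its own, which is the main practical benefit of a cascaded design over an end-to-end model.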