Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/22170
Full metadata record
DC Field | Value | Language
dc.contributor.author | DUBEY, ABHISHEK | -
dc.date.accessioned | 2025-09-02T06:37:51Z | -
dc.date.available | 2025-09-02T06:37:51Z | -
dc.date.issued | 2025-05 | -
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/22170 | -
dc.description.abstract | Speech-based interfaces have become a revolutionary way to enhance clinical documentation, telemedicine accessibility, and doctor-patient communication in the rapidly changing field of healthcare technology. Nevertheless, most current Automatic Speech Recognition (ASR) systems cover only monolingual scenarios and are frequently designed for general-purpose tasks, which significantly reduces their suitability for real healthcare settings, where multilingual and accent-diverse communication is commonplace. To fill this gap, this thesis presents MultiMed, a comprehensive multilingual dataset created specifically for medical speech recognition in five languages: Mandarin Chinese, English, German, French, and Vietnamese. The dataset comprises more than 150 hours of annotated clinical speech gathered from real healthcare situations and enriched with linguistic, demographic, and acoustic diversity. To make efficient use of this dataset, the thesis investigates and evaluates state-of-the-art ASR architectures built on the Attention Encoder-Decoder (AED) framework. It specifically fine-tunes several variants of OpenAI's Whisper model (Tiny, Base, Small, Medium) in both monolingual and multilingual training settings. Comparative experiments are also conducted against hybrid ASR systems, such as wav2vec 2.0 with shallow-fusion language models, to assess the accuracy and efficiency of the architecture. Additionally, the thesis examines two fine-tuning strategies that aim to balance recognition performance against computational cost: Decoder-Only Fine-Tuning and Full Encoder-Decoder Training. | en_US
dc.language.iso | en | en_US
dc.relation.ispartofseries | TD-8171; | -
dc.subject | SPEECH RECOGNITION | en_US
dc.subject | ATTENTION ENCODER DECODER | en_US
dc.subject | AUTOMATIC SPEECH RECOGNITION (ASR) | en_US
dc.title | MULTILINGUAL SPEECH RECOGNITION VIA ATTENTION ENCODER DECODER | en_US
dc.type | Thesis | en_US
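The abstract contrasts Decoder-Only Fine-Tuning with Full Encoder-Decoder Training. The mechanical difference is which parameters receive gradient updates: decoder-only training freezes the encoder and updates only the decoder. Below is a minimal PyTorch sketch of that idea using a toy stand-in for a Whisper-style AED model; the class, layer sizes, and module names are illustrative assumptions, not the thesis code.

```python
import torch.nn as nn

# Toy attention-encoder-decoder stand-in (hypothetical; a real Whisper
# model has a transformer encoder over audio features and a transformer
# decoder over text tokens).
class ToyAED(nn.Module):
    def __init__(self, feat_dim=8, d_model=16, vocab=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, d_model), nn.ReLU())
        self.decoder = nn.Linear(d_model, vocab)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ToyAED()

# Decoder-only fine-tuning: freeze every encoder parameter so the
# optimizer only updates decoder weights.
for p in model.encoder.parameters():
    p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

For full encoder-decoder training one would simply skip the freezing loop; the trade-off the abstract describes is that the frozen-encoder variant backpropagates through far fewer parameters, reducing compute and memory at some possible cost in recognition accuracy.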
Appears in Collections:M.E./M.Tech. Computer Engineering

Files in This Item:
File | Description | Size | Format
ABHISHEK DUBEY M.Tech.pdf | | 722.77 kB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.