Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/22170
Full metadata record
DC Field | Value | Language
dc.contributor.author | DUBEY, ABHISHEK | -
dc.date.accessioned | 2025-09-02T06:37:51Z | -
dc.date.available | 2025-09-02T06:37:51Z | -
dc.date.issued | 2025-05 | -
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/22170 | -
dc.description.abstract | Speech-based interfaces have become a revolutionary way to enhance clinical documentation, telemedicine accessibility, and doctor-patient communication in the rapidly changing field of healthcare technology. Nevertheless, most current Automatic Speech Recognition (ASR) systems cover only monolingual scenarios and are frequently designed for general-purpose tasks, which significantly reduces their suitability for real healthcare settings, where multilingual and accent-diverse communication is commonplace. To fill this gap, this thesis presents MultiMed, a comprehensive multilingual dataset created specifically for medical speech recognition in five languages: Mandarin Chinese, English, German, French, and Vietnamese. The dataset comprises more than 150 hours of annotated clinical speech gathered from real healthcare situations and enriched with linguistic, demographic, and acoustic diversity. To make efficient use of this dataset, the thesis investigates and evaluates state-of-the-art ASR architectures built on the Attention Encoder-Decoder (AED) framework. It specifically fine-tunes several variants of OpenAI's Whisper model (Tiny, Base, Small, Medium) in both monolingual and multilingual training settings. Comparative experiments are also conducted against hybrid ASR systems, such as wav2vec 2.0 with shallow-fusion language models, to assess the accuracy and efficiency of the architecture. Additionally, the thesis examines two fine-tuning strategies that aim to balance recognition performance against computational cost: Decoder-Only Fine-Tuning and Full Encoder-Decoder Training. | en_US
dc.language.iso | en | en_US
dc.relation.ispartofseries | TD-8171; | -
dc.subject | SPEECH RECOGNITION | en_US
dc.subject | ATTENTION ENCODER DECODER | en_US
dc.subject | AUTOMATIC SPEECH RECOGNITION (ASR) | en_US
dc.title | MULTILINGUAL SPEECH RECOGNITION VIA ATTENTION ENCODER DECODER | en_US
dc.type | Thesis | en_US
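The abstract contrasts Decoder-Only Fine-Tuning with Full Encoder-Decoder Training. The mechanical difference is which parameters receive gradient updates: decoder-only training freezes the encoder and updates only the decoder. Below is a minimal PyTorch sketch of that idea using a toy stand-in for a Whisper-style AED model; the class, layer sizes, and module names are illustrative assumptions, not the thesis code.

```python
import torch.nn as nn

# Toy attention-encoder-decoder stand-in (hypothetical; a real Whisper
# model has a transformer encoder over audio features and a transformer
# decoder over text tokens).
class ToyAED(nn.Module):
    def __init__(self, feat_dim=8, d_model=16, vocab=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, d_model), nn.ReLU())
        self.decoder = nn.Linear(d_model, vocab)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ToyAED()

# Decoder-only fine-tuning: freeze every encoder parameter so the
# optimizer only updates decoder weights.
for p in model.encoder.parameters():
    p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

For full encoder-decoder training one would simply skip the freezing loop; the trade-off the abstract describes is that the frozen-encoder variant backpropagates through far fewer parameters, reducing compute and memory at some possible cost in recognition accuracy.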
Appears in Collections:M.E./M.Tech. Computer Engineering

Files in This Item:
File | Description | Size | Format
ABHISHEK DUBEY M.Tech.pdf | | 722.77 kB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.