Please use this identifier to cite or link to this item:
http://dspace.dtu.ac.in:8080/jspui/handle/repository/22170
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | DUBEY, ABHISHEK | - |
dc.date.accessioned | 2025-09-02T06:37:51Z | - |
dc.date.available | 2025-09-02T06:37:51Z | - |
dc.date.issued | 2025-05 | - |
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/22170 | - |
dc.description.abstract | In the rapidly evolving field of healthcare technology, speech-based interfaces have become a powerful way to improve clinical documentation, telemedicine accessibility, and doctor-patient communication. Nevertheless, most current Automatic Speech Recognition (ASR) systems handle only monolingual scenarios and are frequently designed for general-purpose tasks, which significantly limits their suitability for real healthcare settings where multilingual and accent-diverse communication is commonplace. To address this gap, this thesis presents MultiMed, a comprehensive multilingual dataset created specifically for medical speech recognition in five languages: Mandarin Chinese, English, German, French, and Vietnamese. The dataset comprises more than 150 hours of annotated clinical speech gathered from real healthcare settings and enriched with linguistic, demographic, and acoustic diversity. To make effective use of this dataset, the thesis investigates and evaluates state-of-the-art ASR architectures built on the Attention Encoder-Decoder (AED) framework. Specifically, it fine-tunes several Whisper model variants (Tiny, Base, Small, Medium), originally developed by OpenAI, in both monolingual and multilingual training settings. To assess the accuracy and efficiency of the architecture, comparative experiments are also conducted against hybrid ASR systems, such as wav2vec 2.0 with shallow-fusion language models. Additionally, the thesis examines two fine-tuning strategies that aim to balance recognition performance against computational efficiency: Decoder-Only Fine-Tuning and Full Encoder-Decoder Training (a minimal sketch of the decoder-only approach follows this metadata record). | en_US |
dc.language.iso | en | en_US |
dc.relation.ispartofseries | TD-8171; | - |
dc.subject | SPEECH RECOGNITION | en_US |
dc.subject | ATTENTION ENCODER DECODER | en_US |
dc.subject | AUTOMATIC SPEECH RECOGNITION (ASR) | en_US |
dc.title | MULTILINGUAL SPEECH RECOGNITION VIA ATTENTION ENCODER DECODER | en_US |
dc.type | Thesis | en_US |
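
The abstract's contrast between Decoder-Only Fine-Tuning and Full Encoder-Decoder Training can be illustrated with a short sketch. The snippet below is a minimal, hypothetical setup assuming the Hugging Face `transformers` implementation of Whisper and the `openai/whisper-small` checkpoint; it only shows how freezing the encoder restricts gradient updates to the decoder, and does not reproduce the thesis's actual data pipeline, hyperparameters, or evaluation.

```python
# Minimal sketch of decoder-only fine-tuning for a Whisper variant.
# Assumptions: Hugging Face `transformers` implementation of Whisper and the
# public `openai/whisper-small` checkpoint; the thesis's training setup is
# not specified in this record.
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_NAME = "openai/whisper-small"  # one of the variants named in the abstract

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Decoder-only fine-tuning: freeze every encoder parameter so that only the
# decoder (and the tied output projection) receives gradient updates.
for param in model.model.encoder.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable / total:.1%} of {total:,}")

# For full encoder-decoder training, simply omit the freezing loop above;
# all parameters then remain trainable at a higher computational cost.
```

Since the Whisper encoder and decoder stacks are of comparable size, freezing the encoder typically leaves roughly half of the parameters trainable, which reflects the performance-versus-efficiency trade-off the abstract describes.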
Appears in Collections: | M.E./M.Tech. Computer Engineering |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
ABHISHEK DUBEY M.Tech.pdf | | 722.77 kB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.