Please use this identifier to cite or link to this item:
http://dspace.dtu.ac.in:8080/jspui/handle/repository/21597
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | RAVI | - |
dc.date.accessioned | 2025-05-15T04:36:08Z | - |
dc.date.available | 2025-05-15T04:36:08Z | - |
dc.date.issued | 2025-04 | - |
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/21597 | - |
dc.description.abstract | Speech emotion recognition (SER) is the process of identifying and classifying emotions expressed in spoken language using audio features and computational models. It has applications in e-learning, robotic interfaces, computer games, entertainment, audio surveillance, clinical studies, and more. Despite these promising applications, emotion recognition from speech signals is challenging because of the inherently non-stationary, multicomponent nature of speech and the language sensitivity of SER systems. The complete SER framework is broadly divided into three stages: (i) signal preprocessing, (ii) feature extraction and selection, and (iii) classification. Contributions at any stage can improve the performance of SER models. This research proposes a non-stationary signal processing framework for speech signals and evaluates the performance of SER models in binary-class, multiclass, and multilingual scenarios. Speech signals are non-stationary, meaning their statistical properties change over time.

In this thesis, a rational dilation wavelet transform (RDWT) based method is first presented to analyze speech signals for a binary SER system. The wavelet improves signal predictability and provides better time-frequency analysis. Four basic features are extracted: complexity, average amplitude change, mobility, and zero-crossing rate. The proposed framework is tested on the publicly available RAVDESS dataset and achieves 83.30% accuracy for binary-class emotion recognition.

Although the wavelet-based method offers a better time-frequency representation, wavelet transforms can be less effective for analyzing nonlinear signals because they rely on predefined basis functions. To overcome this issue, Empirical Mode Decomposition (EMD) is explored. The data-driven nature of EMD allows it to provide a more intuitive and interpretable representation of the signal's components, capturing subtle nuances and variations. Additionally, EMD's localized time-frequency analysis enables precise identification of transient features and other detailed structures within the speech signal. In this framework, an energy-based ratio feature and statistical measures are calculated from the MFCC coefficients. The proposed EMD-based SER framework is tested on the EMOVO and RAVDESS datasets and achieves 95.30% and 90.01% accuracy, respectively, for binary speech emotion classification.

EMD yields better SER performance than wavelets, but its recursive nature causes mode mixing and makes it noise sensitive. To address this, a Variational Mode Decomposition (VMD) based method is adopted. One significant benefit of VMD is its enhanced noise resilience, which minimizes mode mixing and leads to more stable signal decompositions; it also provides flexibility in controlling the decomposition parameters. VMD with Teager-Kaiser energy operator (TKEO) based features is explored for binary-class and multiclass emotion classification. In addition, statistical measures (such as the mean and variance) of the MFCCs and pitch frequency are computed after framing the preprocessed signal, to account for temporal variations in speech. The introduced method is tested on the RAVDESS dataset and outperforms existing works for binary-class emotion recognition. However, the model's accuracy decreases as the number of emotion classes increases.
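The four waveform-level features named for the wavelet stage are standard quantities. Below is a minimal Python sketch assuming the conventional Hjorth definitions of mobility and complexity and a first-difference definition of average amplitude change; the thesis's exact formulations may differ, and in the proposed framework these would presumably be computed per RDWT sub-band.

```python
import numpy as np

def waveform_features(x):
    """Complexity, average amplitude change, mobility, and zero-crossing
    rate of a (sub-band) speech signal. Mobility/complexity follow the
    standard Hjorth definitions; this is an illustrative sketch only."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)                                   # first difference
    ddx = np.diff(dx)                                 # second difference
    var_x, var_dx, var_ddx = x.var(), dx.var(), ddx.var()

    mobility = np.sqrt(var_dx / var_x)                # Hjorth mobility
    complexity = np.sqrt(var_ddx / var_dx) / mobility # Hjorth complexity
    avg_amp_change = np.mean(np.abs(dx))              # mean |x[n+1] - x[n]|
    zcr = np.mean(np.abs(np.diff(np.sign(x))) > 0)    # fraction of sign flips

    return np.array([complexity, avg_amp_change, mobility, zcr])
```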
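For the EMD stage, the following sketch shows one plausible reading of "an energy-based ratio feature and statistical measures of the MFCC coefficients": per-IMF energy ratios plus MFCC means and variances. It assumes the open-source `EMD-signal` (PyEMD) and `librosa` packages; the pairing of these two feature groups is an interpretation of the abstract, not the thesis's exact recipe.

```python
import numpy as np
import librosa                 # pip install librosa
from PyEMD import EMD          # pip install EMD-signal

def emd_ratio_and_mfcc_stats(y, sr, n_mfcc=13):
    """Energy ratio of each IMF to total signal energy, plus mean and
    variance of the MFCCs. Exact feature definitions are assumptions."""
    imfs = EMD()(y)                                         # data-driven IMFs
    ratios = np.sum(imfs ** 2, axis=1) / (np.sum(y ** 2) + 1e-12)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    stats = np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)])
    return ratios, stats
```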
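The discrete Teager-Kaiser energy operator itself has a fixed definition; how its output is aggregated into features per VMD mode is not specified in the abstract, so the mean/variance summary in the sketch below is an assumption.

```python
import numpy as np

def tkeo(x):
    """Discrete Teager-Kaiser energy operator:
    psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def tkeo_features(mode):
    """Assumed per-mode summary (mean and variance) of the TKEO output;
    the thesis may use a different aggregation."""
    psi = tkeo(mode)
    return np.array([psi.mean(), psi.var()])
```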
To address the multiclass emotion recognition problem, an energy-based dominant VMD mode selection framework is proposed. First, VMD separates the speech signal into multiple modes, each representing distinct frequency components. The energy of each mode is then calculated to identify the dominant modes that contribute most significantly to the speech signal's characteristics. These dominant modes are used for signal reconstruction, and the reconstructed signal is used for feature extraction and emotion classification. In this framework, spectral and prosodic features are extracted: the Mel spectrum, spectral crest, spectral entropy, spectral kurtosis, spectral centroid, Mel-frequency cepstral coefficients (MFCCs) and their derivatives, gammatone cepstral coefficients, and pitch frequency. The proposed SER framework is tested on the RAVDESS, EMOVO, Emo-DB, and IEMOCAP datasets and achieves 93.80%, 93.40%, 95.08%, and 83.10% accuracy, respectively.

However, noise in raw speech signals can still obscure subtle emotional features and degrade performance, and the method's dependence on predefined parameters for mode tuning may not generalize well to highly variable or noisy inputs, especially for multilingual speech emotion recognition (MLSER). To address these limitations, an enhanced signal preprocessing framework for MLSER is proposed. In this framework, silence is removed using short-time energy and the spectral centroid, ensuring that only relevant speech segments are processed. VMD is then applied for signal decomposition, with an improved Bhattacharyya distance guiding mode tuning for noise removal. Finally, the denoised signal is used for feature extraction and emotion classification, with the same spectral and prosodic feature set as above. This framework is tested on the RAVDESS, EMOVO, and Emo-DB datasets, and a multilingual dataset is created by combining these three. The proposed MLSER model achieves 93.4% accuracy on the multilingual, multiclass dataset.

In summary, this thesis presents a progression of SER methods that address the challenges of non-stationary and noisy speech signals. A wavelet-based approach using the rational dilation wavelet transform achieves 83.30% accuracy for binary emotions but struggles with highly nonlinear signals. EMD improves results to 90.01% accuracy yet suffers from noise sensitivity and mode mixing. VMD overcomes these limitations, enhancing noise resilience and achieving 100% accuracy for binary-class emotion recognition. VMD is further explored for multiclass SER, employing energy-based dominant mode selection to achieve 95.08% accuracy. The VMD-based speech preprocessing is then improved with silence removal and refined noise handling, and the resulting multilingual SER framework achieves 93.4% accuracy. The thesis thus provides SER frameworks for binary-class, multiclass, and multilingual emotion identification, and the proposed methods outperform existing state-of-the-art frameworks. | en_US |
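A minimal sketch of the energy-based dominant mode selection step described above. The modes could come from an off-the-shelf VMD implementation such as the `vmdpy` package; the fixed `keep` count and the squared-sample energy measure are illustrative assumptions rather than the thesis's selection rule.

```python
import numpy as np
# Modes might be obtained with, e.g., the vmdpy package:
# modes, _, _ = VMD(x, alpha=2000, tau=0, K=5, DC=0, init=1, tol=1e-7)

def reconstruct_from_dominant_modes(modes, keep=3):
    """Rank VMD modes (shape [K, N]) by energy and reconstruct the
    signal from the `keep` most energetic modes."""
    energies = np.sum(modes ** 2, axis=1)           # energy of each mode
    dominant = np.argsort(energies)[::-1][:keep]    # top-`keep` mode indices
    return modes[dominant].sum(axis=0)              # reconstructed signal
```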
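For the MLSER preprocessing, the following sketch performs silence removal by joint short-time energy and spectral-centroid thresholding. The frame sizes and the relative thresholds (`energy_frac`, `centroid_frac`) are assumptions; the thesis may derive its thresholds differently.

```python
import numpy as np

def remove_silence(x, sr, frame_s=0.025, hop_s=0.010,
                   energy_frac=0.05, centroid_frac=0.1):
    """Keep only frames whose short-time energy and spectral centroid
    both exceed thresholds set as fractions of their utterance maxima."""
    n, h = int(frame_s * sr), int(hop_s * sr)
    frames = [x[i:i + n] for i in range(0, len(x) - n + 1, h)]
    freqs = np.fft.rfftfreq(n, 1.0 / sr)

    energy = np.array([np.sum(f ** 2) for f in frames])
    mags = [np.abs(np.fft.rfft(f)) for f in frames]
    centroid = np.array([np.sum(freqs * m) / (np.sum(m) + 1e-12)
                         for m in mags])

    keep = (energy > energy_frac * energy.max()) & \
           (centroid > centroid_frac * centroid.max())
    voiced = [f for f, k in zip(frames, keep) if k]
    return np.concatenate(voiced) if voiced else x
```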
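The mode tuning above relies on an *improved* Bhattacharyya distance whose modification is not detailed in the abstract. The sketch below shows only the classical histogram-based definition, D_B = -ln(sum_i sqrt(p_i q_i)), as it might be used to compare the amplitude distributions of two signals (for example, a VMD mode against the original signal).

```python
import numpy as np

def bhattacharyya_distance(a, b, bins=64):
    """Classical Bhattacharyya distance between the amplitude
    histograms of two 1-D signals."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    ha, _ = np.histogram(a, bins=bins, range=(lo, hi))
    hb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    pa, pb = ha / ha.sum(), hb / hb.sum()   # normalize to probabilities
    bc = np.sum(np.sqrt(pa * pb))           # Bhattacharyya coefficient
    return -np.log(bc + 1e-12)
```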
dc.language.iso | en | en_US |
dc.relation.ispartofseries | TD-7849; | - |
dc.subject | SPEECH EMOTION RECOGNITION | en_US |
dc.subject | SER SYSTEM | en_US |
dc.subject | DECOMPOSITIONS | en_US |
dc.subject | OPTIMAL FEATURES | en_US |
dc.subject | VMD | en_US |
dc.title | SPEECH BASED EMOTION RECOGNITION USING NON-STATIONARY DECOMPOSITIONS AND OPTIMAL FEATURES | en_US |
dc.type | Thesis | en_US |
Appears in Collections: | Ph.D. Electronics & Communication Engineering |
Files in This Item:
File | Description | Size | Format
---|---|---|---
Ravi Ph.D..pdf | | 5.58 MB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.