APPLICATIONS OF END TO END AUTOMATIC SPEECH RECOGNITION

KUMAR, RUPESH

Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/20845

Full metadata record

DC Field	Value	Language
dc.contributor.author	KUMAR, RUPESH	-
dc.date.accessioned	2024-08-05T09:03:44Z	-
dc.date.available	2024-08-05T09:03:44Z	-
dc.date.issued	2024-05	-
dc.identifier.uri	http://dspace.dtu.ac.in:8080/jspui/handle/repository/20845	-
dc.description.abstract	This project investigates applications of end-to-end (E2E) automatic speech recognition (ASR) for the English language using two architectures: a Transformer model and a combined RNN-CNN model trained with connectionist temporal classification (CTC) loss. The primary goal is to evaluate how these architectures perform on sequence-to-sequence tasks that require accurate temporal alignment and robust handling of variable-length input sequences, specifically in the context of speech recognition. Both models were trained and evaluated on the LJSpeech English dataset. The RNN-CNN model combines the strengths of CNNs for feature extraction and RNNs for processing sequential input, and is trained with CTC loss, which enables alignment-free training. The CNN component enhances the encoding of local features, while the RNN component captures temporal dependencies; together they improve recognition accuracy. The second model uses a Transformer architecture, which relies on self-attention to capture long-range dependencies without recurrent connections. This design addresses the limitations of RNNs in handling long sequences and parallel processing, which can shorten training and inference times. Our experiments on LJSpeech, a commonly used English dataset, indicate significant performance improvements for the Transformer model, which also shows better scalability and efficiency on large datasets. We compared word error rate (WER) and computation time for both models and found that the Transformer-based model achieved a 3% to 4% better WER than the RNN-CNN model. Additionally, the Transformer-based model was five times faster per epoch but required more epochs to train. The results indicate that RNN-CNN models are effective for tasks with prominent local dependencies, whereas Transformers offer notable benefits in computational efficiency and in handling long-range dependencies. This makes Transformers a compelling option for large-scale English speech recognition applications.	en_US
dc.language.iso	en	en_US
dc.relation.ispartofseries	TD-7381;	-
dc.subject	SPEECH RECOGNITION	en_US
dc.subject	END TO END ASR	en_US
dc.subject	RNN	en_US
dc.subject	CNN	en_US
dc.title	APPLICATIONS OF END TO END AUTOMATIC SPEECH RECOGNITION	en_US
dc.type	Thesis	en_US
Appears in Collections:	M.E./M.Tech. Electronics & Communication Engineering
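
The abstract compares the two models by word error rate (WER). This record does not show the thesis's own evaluation code, so the following is only a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

In practice the per-utterance edit counts are usually summed over the whole test set before dividing by the total number of reference words, rather than averaging per-utterance rates.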

Files in This Item:

File	Description	Size	Format
RUPESH KUMAR M.Tech..pdf		908.3 kB	Adobe PDF	View/Open
