GENOMIC LANGUAGE PROCESSING USING MACHINE LEARNING

CHAKRABORTY, RAJKUMAR

Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/20063

Title:	GENOMIC LANGUAGE PROCESSING USING MACHINE LEARNING
Authors:	CHAKRABORTY, RAJKUMAR
Keywords:	GENOMIC LANGUAGE PROCESSING; MACHINE LEARNING; BIOLOGICAL LANGUAGE MODEL; CONVOLUTIONAL NEURAL NETWORK; microRNA
Issue Date:	Jan-2023
Series/Report no.:	TD-6605;
Abstract:	Biological language models (BLMs) are developed to improve our ability to understand and analyse biological sequences such as DNA, RNA, and protein sequences. These sequences carry crucial information about the structure and function of living organisms and are involved in virtually every biological process. Nonetheless, analysing biological sequences can be difficult due to their complexity and enormous potential. In particular, the functions and properties of many coding and non-coding DNA and RNA sequences remain poorly understood. This thesis pursues three objectives that apply natural language processing techniques to biomolecular sequences. The first objective uses a Convolutional Neural Network (CNN) combined with a Long Short-Term Memory (LSTM) network in a sequence-to-sequence (Seq2Seq) architecture to predict microRNA sequences from mRNA sequences. MicroRNAs are small non-coding RNAs, typically about 22 nucleotides long, that play a role in various physiological and disease processes. Identifying the mRNAs targeted by microRNAs is challenging, and researchers often rely on computational programs to identify target candidates for subsequent validation. In this work, a neural network was trained to predict the microRNA from the bound target segment of the mRNA, using a cleaned dataset of experimentally validated microRNA-mRNA sequence pairs from TarBase v8. The CNN layers recognize patterns in the mRNA segments and extract features, while the LSTM layers in the Seq2Seq architecture predict the microRNA sequences. The model achieved an accuracy of 80% and was validated on experimentally verified microRNA-mRNA pairs involved in skin diseases from an in-house database called miDerma, correctly predicting an average of 72% of the microRNAs from mRNA in each case.
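As an illustration of the data preparation that a CNN and LSTM Seq2Seq model of this kind needs, the sketch below one-hot encodes an mRNA target segment for the convolutional encoder and turns a microRNA into a padded token sequence for the decoder. It is a minimal sketch under stated assumptions, not the thesis implementation: the special tokens, the vocabulary order, the fixed lengths, and all function names are hypothetical.

```python
# Minimal sketch of sequence encoding for an mRNA -> microRNA Seq2Seq model.
# All names, tokens, and lengths below are illustrative assumptions.

NUCLEOTIDES = "ACGU"
START, END, PAD = "<", ">", "_"  # hypothetical special tokens
TARGET_VOCAB = [PAD, START, END] + list(NUCLEOTIDES)
TOKEN_TO_ID = {tok: i for i, tok in enumerate(TARGET_VOCAB)}


def normalize_rna(seq):
    """Upper-case the sequence and write thymine as uracil."""
    return seq.strip().upper().replace("T", "U")


def one_hot_encode(seq, length):
    """Return a length x 4 matrix; unknown bases and padding are all-zero rows."""
    seq = normalize_rna(seq)[:length]
    rows = [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in seq]
    rows.extend([[0.0] * len(NUCLEOTIDES) for _ in range(length - len(seq))])
    return rows


def encode_target(seq, length):
    """Wrap a microRNA in start/end tokens and pad it to a fixed length.

    Raises KeyError if the sequence contains a base outside A, C, G, U.
    """
    tokens = [START] + list(normalize_rna(seq))[: length - 2] + [END]
    tokens += [PAD] * (length - len(tokens))
    return [TOKEN_TO_ID[tok] for tok in tokens]


def decode_target(ids):
    """Turn decoder token ids back into a nucleotide string."""
    out = []
    for i in ids:
        tok = TARGET_VOCAB[i]
        if tok == END:
            break
        if tok not in (START, PAD):
            out.append(tok)
    return "".join(out)
```

The one-hot matrix would feed the convolutional layers, and the token ids would serve as decoder targets; the real model may use a different representation.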
The package, called "model: A MicroRNA sequence prediction tool from RNA sequence based on CNNs, LSTMs, and seq2seq architecture," lets users input a gene symbol; it retrieves the sequence of the protein-coding transcript from the Ensembl REST API and predicts a list of microRNAs that may bind to potential target segments in the mRNA. The second objective uses natural language processing (NLP) techniques, including an embedding layer, a CNN layer, and a bidirectional LSTM layer, to predict disordered regions in proteins. Intrinsically disordered regions (IDRs) are important in various physiological processes and diseases and complement the functions of structured proteins. They can be identified by several experimental techniques, but these methods can be costly and time-consuming, so researchers rely on computational strategies to predict probable IDRs and intrinsically disordered proteins (IDPs) before experimental validation. Although the prediction of long and short IDRs has advanced significantly in recent years, there is still scope for algorithmic improvement. This study aims to improve IDR prediction using convolutional neural networks (CNNs), recurrent neural networks (RNNs), and LSTM networks together with NLP techniques. It also explores different input sequence lengths and embedding sizes for the CNN and LSTM models. The results show that the CNN and LSTM models outperform state-of-the-art techniques for predicting IDRs, with the LSTM model achieving the highest accuracy of 85.7%. The study also demonstrates the effectiveness of NLP techniques for analysing protein sequences and the importance of carefully selecting model architectures and hyperparameters.
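The embedding-based approach to disorder prediction treats each amino acid as a token. The sketch below shows one plausible way to map a protein to integer ids for an embedding layer and to cut it into fixed-length windows with per-residue labels, which is where the input sequence length becomes a tunable setting. It is a minimal sketch under stated assumptions: the id scheme, the ignore label, and the function names are hypothetical and are not taken from the thesis.

```python
# Minimal sketch of per-residue tokenization and windowing for IDR prediction.
# The id scheme, IGNORE label, and function names are illustrative assumptions.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_ID, UNK_ID = 0, 1  # reserved ids for padding and unknown residues
IGNORE = -1            # label for padded positions, to be masked in the loss
AA_TO_ID = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode_protein(seq):
    """Map each residue to an integer id suitable for an embedding layer."""
    return [AA_TO_ID.get(aa, UNK_ID) for aa in seq.upper()]


def make_windows(seq, labels, window, stride):
    """Split a protein into fixed-length windows with per-residue labels.

    labels holds 1 for a disordered residue and 0 for an ordered one.
    Windows that run past the end of the protein are padded.
    """
    if len(seq) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    ids = encode_protein(seq)
    windows = []
    for start in range(0, len(ids), stride):
        chunk = ids[start:start + window]
        chunk_labels = list(labels[start:start + window])
        pad = window - len(chunk)
        windows.append((chunk + [PAD_ID] * pad, chunk_labels + [IGNORE] * pad))
    return windows
```

Each window could then pass through an embedding layer, a CNN layer, and a bidirectional LSTM that emits one disorder score per residue; varying `window` corresponds to trying different input sequence lengths.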
The third objective uses an autoencoder, a type of deep learning architecture, to generate drug analogues by reconstructing chemical SMILES (Simplified Molecular-Input Line-Entry System) representations of molecules while varying the batch size and latent space dimensionality of the autoencoder. Designing drug analogues means creating modified versions of existing drugs to improve their efficacy, stability, and safety. Deep learning techniques, such as autoencoders, can be used to generate new drug analogues through a process of chemical structure reproduction. In this study, an autoencoder was trained on chemical SMILES data from the ChEMBL database and used to generate 157 variants of the drug Vandetanib by adding noise to its latent representation and reconstructing the resulting compounds with the decoder. Molecular docking and dynamics simulations were then performed to determine which of these analogues had a higher binding affinity than Vandetanib. At least two of the analogues had a higher binding affinity than the control compound. While this model has the potential to generate a wide range of molecules, it may have difficulty generating molecules with SMILES strings longer than 80 characters, owing to a lack of training data with SMILES strings above that length. The synthesis and laboratory testing of the generated molecules to determine their potential as drugs also present a challenge. However, this study has the potential to make significant contributions to the field of automatic drug analogue prediction and could be a valuable addition to the current scientific literature. The study presents several potential applications for its microRNA prediction, protein disorder prediction, and drug analogue generation models. The microRNA prediction model could aid in the development of therapies for diseases by identifying microRNA sequences that regulate gene expression.
The protein disorder prediction model could be used in drug design and protein engineering by identifying disordered regions in proteins that play a role in various protein functions. The drug analogue generation model has the potential to generate new drug analogues with desired properties and could be used in drug discovery and the optimization of existing drugs. Overall, this research has the potential to make significant contributions to biomedical research and could lead to the development of new therapies and drugs for diseases, as well as new bio-molecular language models for other tasks.
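The analogue generation step described for the third objective (encode a SMILES string, add noise to its latent representation, decode the variants) can be sketched without a deep learning library. The example below covers only the parts around the network: character-level SMILES encoding padded to the 80-character limit mentioned above, and Gaussian perturbation of a latent vector. It is a minimal sketch under stated assumptions: the trained encoder and decoder are not shown, the Gaussian noise and its scale are illustrative choices, and the function names are hypothetical. Character-level tokens are a simplification that splits two-letter atoms such as Cl and Br.

```python
# Minimal sketch of SMILES encoding and latent-space perturbation.
# The autoencoder itself is not shown; names and the noise model are assumptions.
import random

MAX_LEN = 80  # SMILES length limit mentioned in the abstract
PAD = " "     # padding character; valid SMILES strings contain no spaces


def build_charset(smiles_list):
    """Collect the characters seen in the training data, padding first."""
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    return [PAD] + chars


def encode_smiles(smiles, charset, max_len=MAX_LEN):
    """Pad a SMILES string to max_len and map each character to its index."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string exceeds the maximum length")
    index = {ch: i for i, ch in enumerate(charset)}
    return [index[ch] for ch in smiles.ljust(max_len, PAD)]


def decode_smiles(ids, charset):
    """Map indices back to characters and strip the padding."""
    return "".join(charset[i] for i in ids).rstrip(PAD)


def perturb_latent(latent, n_variants, sigma, seed=0):
    """Return n_variants noisy copies of a latent vector (Gaussian noise)."""
    rng = random.Random(seed)
    return [[z + rng.gauss(0.0, sigma) for z in latent] for _ in range(n_variants)]
```

In a full pipeline, each perturbed vector would be passed to the trained decoder, and the decoded strings would be checked for chemical validity before docking.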
URI:	http://dspace.dtu.ac.in:8080/jspui/handle/repository/20063
Appears in Collections:	Ph.D. Bio Tech

Files in This Item:

File	Description	Size	Format
Rajkumar Chakraborty Ph.D..pdf		6.28 MB	Adobe PDF
