Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/18023
Title: MALWARE DETECTION USING MACHINE LEARNING
Authors: GUPTA, VIJAY KUMAR
Keywords: MALWARE DETECTION
MACHINE LEARNING
Issue Date: Aug-2019
Series/Report no.: TD-4889;
Abstract: In this challenge, we aim to classify tens of thousands of malware files into families using different machine learning models, with some improvisation in the modelling and training of data. The dataset used consisted largely of .bytes and .asm files, with around 11,000 malware files for the test set and training set. Through a structured exploratory analysis, which included evaluating the distribution, feature extraction, multivariate analysis and data splitting, we compare the models and optimize the best one using new techniques to improve the efficiency of the task. The task was performed on the .bytes and .asm files both separately and jointly, and the results were analysed. Some new insights, such as visualization and effective feature engineering, that would further improve the accuracy of the process are also discussed. We have used a large dataset released by Microsoft in this work for both training and testing purposes. The training dataset has 10,868 malware samples from 9 different classes of malware. The classes of malware are: (a) Vundo, (b) Lollipop, (c) Simda, (d) Ramnit, (e) Kelihos_ver3, (f) Obfuscator.ACY, (g) Kelihos_ver1, (h) Tracur, (i) Gatak. The malware dataset is almost half a terabyte when uncompressed. The dataset consists of a set of known malware files representing a mixture of 9 different families. Each malware file has an identifier, a 20-character hash value that uniquely identifies the file, and a class label, an integer representing one of the nine family names to which the malware may belong. For each file, the raw data contains the hexadecimal representation of the file's binary content, without the header (to ensure sterility).
The dataset also includes a metadata manifest, which is a log containing various metadata extracted from the binary, such as function calls, strings, etc. This was generated using the IDA disassembler tool. The dataset is first loaded and then kept in memory for further transformations. For each malware sample, we have 2 files: an .asm file and a .bytes file (the raw data in which the file's binary content is given in hexadecimal representation, but without the PE header). There are 10,868 .bytes files and 10,868 .asm files, making a total of 21,736 files.
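As an illustration of the kind of feature extraction the abstract mentions for .bytes files, the following is a minimal Python sketch that counts byte values in one file's text. The abstract does not state which features the thesis actually used, so this is a hypothetical example only. It assumes each line holds an address token followed by two-digit hexadecimal byte tokens, with '??' marking unknown bytes; the function name byte_histogram and the sample string are invented for illustration.

```python
from collections import Counter


def byte_histogram(bytes_text):
    """Return 257 counts: one per byte value 00..FF, plus one for '??'.

    Assumed layout (not stated in the abstract): every line starts with
    an address token, followed by hex byte tokens or '??' placeholders.
    """
    vocab = [f"{i:02X}" for i in range(256)] + ["??"]
    counts = Counter()
    for line in bytes_text.splitlines():
        tokens = line.split()
        # Skip the leading address token and count the remaining tokens.
        counts.update(token.upper() for token in tokens[1:])
    # Tokens outside the vocabulary are ignored in the output vector.
    return [counts[token] for token in vocab]


# Hypothetical two-line sample in the assumed layout.
sample = "00401000 56 8D 44 24\n00401010 56 ?? 00 00\n"
hist = byte_histogram(sample)
```

A fixed-length vector like this could be computed per file and passed to any standard classifier; whether the thesis did so is not confirmed by the abstract.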
URI: http://dspace.dtu.ac.in:8080/jspui/handle/repository/18023
Appears in Collections:M.E./M.Tech. Computer Engineering

Files in This Item:
File: M.Tech.Thesis_Vijay Kumar Gupta.pdf
Size: 2.66 MB
Format: Adobe PDF

