A COMPARATIVE ANALYSIS OF VARIOUS SAMPLING METHODS AND METACOST LEARNERS TO IMPROVE SOFTWARE DEFECT PREDICTION FOR IMBALANCED DATA

KAMAL, SHINE

Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/15803

Full metadata record

DC Field	Value	Language
dc.contributor.author	KAMAL, SHINE	-
dc.date.accessioned	2017-07-14T12:01:34Z	-
dc.date.available	2017-07-14T12:01:34Z	-
dc.date.issued	2017-07	-
dc.identifier.uri	http://dspace.dtu.ac.in:8080/jspui/handle/repository/15803	-
dc.description.abstract	Data imbalance is a common problem in fields such as defect prediction, change prediction, oil spill detection, and medical diagnosis. Various methods have been developed to handle imbalanced datasets in order to improve the accuracy of prediction models. Software defect prediction is important for identifying defects in the early phases of the software development life cycle. Early identification, and thereby removal, of software defects is crucial to yielding a cost-effective, good-quality software product. Although previous studies have successfully used machine learning techniques for software defect prediction, these techniques yield biased results when applied to imbalanced datasets. An imbalanced dataset has a non-uniform class distribution, with very few instances of one class compared to the other. Using imbalanced datasets leads to inaccurate predictions for the minority class, which is generally considered more important than the majority class. Thus, handling imbalanced data effectively is crucial for developing a competent defect prediction model. Many studies have addressed defect prediction for imbalanced datasets, but most of them use the SMOTE oversampling method to handle the imbalance problem. Many other oversampling methods that help to deal with the imbalance problem remain unexplored, particularly in the field of software defect prediction. This study evaluates the effectiveness of machine learning classifiers for software defect prediction on twelve imbalanced NASA datasets by applying nine sampling methods. We also propose a modified version (SPIDER3) of the existing oversampling method SPIDER2 and compare it with the original. Furthermore, the work evaluates the performance of MetaCost learners on imbalanced datasets. 
The results show an improvement in the prediction capability of machine learning classifiers when sampling methods are used. MetaCost learners improve sensitivity and help to predict defects effectively. Moreover, the results support the applicability of the modified version of the SPIDER2 oversampling method, as it outperforms the original SPIDER2 method in the majority of cases.	en_US
dc.language.iso	en	en_US
dc.relation.ispartofseries	TD-2774;	-
dc.subject	DATA IMBALANCING	en_US
dc.subject	SAMPLING METHODS	en_US
dc.subject	METACOST LEARNERS	en_US
dc.subject	DEFECT PREDICTION	en_US
dc.title	A COMPARATIVE ANALYSIS OF VARIOUS SAMPLING METHODS AND METACOST LEARNERS TO IMPROVE SOFTWARE DEFECT PREDICTION FOR IMBALANCED DATA	en_US
dc.type	Thesis	en_US
Appears in Collections:	M.E./M.Tech. Computer Engineering

Files in This Item:

File	Description	Size	Format
THESIS.pdf		1.53 MB	Adobe PDF
