CLASSIFICATION OF IMBALANCE DATA: ADDRESSING DATA INTRINSIC CHARACTERISTICS

GARG, ARMAAN

Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/16179

Title:	CLASSIFICATION OF IMBALANCE DATA: ADDRESSING DATA INTRINSIC CHARACTERISTICS
Authors:	GARG, ARMAAN
Keywords:	IMBALANCE DATA CLASSIFICATION DATA INTRINSIC CHARACTERISTICS
Issue Date:	Jun-2018
Series/Report no.:	TD-4099;
Abstract:	When classifying datasets with skewed classes, a classifier faces an imbalanced class distribution. Many real applications involve datasets whose instances are unequally distributed, and the problem arises when there is a non-uniform distribution of instances between classes. Several methods are used to handle such datasets, among them preprocessing, cost-sensitive learning and ensemble techniques. In preprocessing, the data is modified so that the imbalance is reduced by changing the number of instances in the different classes. The techniques under this method are undersampling, oversampling and the hybrid technique. In undersampling, the number of majority-class instances is decreased. In oversampling, a superset is created by replicating instances of the minority class. The hybrid approach uses both undersampling and oversampling. In cost-sensitive learning, penalties are imposed on misclassification: the cost of misclassifying a positive case is significantly higher than that of misclassifying a negative one. Ensemble classifiers try to improve on the performance of single classifiers by training several classifiers and combining them into a new classifier that outperforms each of them. The view that later emerged is that class imbalance by itself has less impact on classification performance. Other issues related to data intrinsic characteristics, such as sample size, class overlapping and noisy data, also matter. To get better classification results, one should focus on these data intrinsic properties and resolve the issues that arise from them. The proposed work examines these data intrinsic characteristics in detail and how the issues related to them can be resolved.
An algorithm has been developed for each of these issues, and the algorithms have then been integrated to assess their overall performance. The end product is a transformed dataset that is free from these issues. The WEKA tool is used to classify the datasets and measure performance. The final results show that classifiers perform better on the transformed dataset than on datasets in which the issues related to data intrinsic characteristics are not addressed.
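The three preprocessing techniques named in the abstract (undersampling, oversampling and the hybrid technique) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm and does not use WEKA; it assumes purely random resampling, and the function names, the toy "neg"/"pos" labels and the choice of the mean class size as the hybrid target are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def split_by_class(samples):
    """Group (features, label) pairs by their label."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append((features, label))
    return groups


def class_counts(samples):
    """Return a {label: number of instances} dictionary."""
    return {label: len(group) for label, group in split_by_class(samples).items()}


def random_undersample(samples, seed=0):
    """Randomly drop instances until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = split_by_class(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


def random_oversample(samples, seed=0):
    """Randomly replicate instances until every class matches the largest class."""
    rng = random.Random(seed)
    groups = split_by_class(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Draw the missing instances with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def hybrid_resample(samples, seed=0):
    """Shrink larger classes and grow smaller ones toward the mean class size.

    Using the mean class size as the target is an assumption of this sketch.
    """
    rng = random.Random(seed)
    groups = split_by_class(samples)
    target = round(sum(len(group) for group in groups.values()) / len(groups))
    balanced = []
    for group in groups.values():
        if len(group) >= target:
            balanced.extend(rng.sample(group, target))
        else:
            balanced.extend(group)
            balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Toy imbalanced dataset (hypothetical): 90 majority and 10 minority instances.
toy_data = [((i,), "neg") for i in range(90)] + [((i,), "pos") for i in range(10)]

if __name__ == "__main__":
    print("original:    ", class_counts(toy_data))
    print("undersampled:", class_counts(random_undersample(toy_data)))
    print("oversampled: ", class_counts(random_oversample(toy_data)))
    print("hybrid:      ", class_counts(hybrid_resample(toy_data)))
```

On the toy data, undersampling leaves 10 instances per class, oversampling leaves 90 per class, and the hybrid variant leaves 50 per class. Random replication only duplicates existing minority instances, so it does nothing about class overlapping or noise, which are the data intrinsic issues the abstract says must be handled separately.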
URI:	http://dspace.dtu.ac.in:8080/jspui/handle/repository/16179
Appears in Collections:	M.E./M.Tech. Computer Engineering

Files in This Item:

File	Description	Size	Format
armaan mtech thesis.pdf		589.25 kB	Adobe PDF
