DSpace Collection: http://dspace.dtu.ac.in:8080/jspui/handle/repository/18630 2026-07-26T19:24:37Z 2026-07-26T19:24:37Z EMPIRICAL VALIDATION OF OBJECT-ORIENTED METRICS FOR IMBALANCED CLASSIFICATION USING OPEN SOURCE SOFTWARE JAIN, JUHI http://dspace.dtu.ac.in:8080/jspui/handle/repository/18646 2021-12-08T06:22:29Z 2021-01-01T00:00:00Z

Title: EMPIRICAL VALIDATION OF OBJECT-ORIENTED METRICS FOR IMBALANCED CLASSIFICATION USING OPEN SOURCE SOFTWARE
Authors: JAIN, JUHI
Abstract: Software is an inextricable part of our lives. As software grows more complex, designing and integrating changes to it becomes a tedious task for developers and software practitioners. A prime concern while implementing changes is maintaining the quality of software products under limited resources and rigid deadlines. If defects are uncovered in later stages of software development, the cost of detecting and removing them grows exponentially, which may result in poor software development processes and degraded software quality. Under strict time schedules and limited resources, discovering these defects early becomes essential for developers and practitioners. Finding defects or faults in the early phases of the software development life cycle leads to better planning and reduced cost, effort, and resources [1]. Software metrics are widely used to build defect prediction models. Different object-oriented (OO) metrics capture different internal attributes of software, such as cohesion, coupling, size, inheritance, and encapsulation. These metrics are therefore used to predict whether a software class is likely to be defective [2, 3]. Selecting relevant metrics aids effective predictive modelling of defects. We evaluated correlation-based feature selection for identifying the important metrics that are related to defect-prone areas of the software. Various machine learning (ML) and statistical techniques have been used in the literature to develop models that predict defect-proneness. We identified search-based techniques (SBTs) as a category of classification techniques that is rarely used in the Software Defect Prediction (SDP) domain.
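As a rough illustration of the correlation-based feature selection mentioned above, the sketch below scores a subset of metrics with the standard CFS merit, k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean metric-to-label correlation and r_ff is the mean metric-to-metric correlation. It is not the thesis's implementation: the use of absolute Pearson correlation, the greedy forward search, the metric names (`wmc`, `loc`, `noc`), and all data values are assumptions made for the example.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Absolute Pearson correlation; 0.0 if either column is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov / sqrt(vx * vy))


def cfs_merit(columns, labels, subset):
    """CFS merit of a non-empty subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    r_cf = mean(pearson(columns[f], labels) for f in subset)
    pairs = list(combinations(subset, 2))
    r_ff = mean(pearson(columns[a], columns[b]) for a, b in pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(columns, labels):
    """Forward selection: add the metric that most improves merit; stop when none does."""
    selected, best = [], 0.0
    remaining = set(columns)
    while remaining:
        merit, feat = max(
            (cfs_merit(columns, labels, selected + [f]), f) for f in sorted(remaining)
        )
        if merit <= best:
            break
        selected.append(feat)
        remaining.remove(feat)
        best = merit
    return selected, best


# Hypothetical per-class metric values; label 1 marks a defective class.
toy_labels = [0, 0, 0, 1, 0, 1, 0, 1]
toy_columns = {
    "wmc": [3, 4, 2, 9, 5, 11, 3, 10],
    "loc": [30, 41, 19, 95, 50, 108, 33, 99],
    "noc": [1, 0, 2, 1, 0, 2, 1, 0],
}
toy_selected, toy_merit = greedy_cfs(toy_columns, toy_labels)
```

The merit rewards metrics that correlate with the defect label and penalizes metrics that are redundant with each other, which is why a near-duplicate of an already selected metric adds little to the score.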
We assessed the effectiveness of ML techniques and SBTs for developing models that predict defective classes in OO software. We further extended the use of genetic algorithm variants to feature selection and compared them with Correlation Feature Selection (CFS). A major issue observed in software data is the imbalanced data problem: a dataset is imbalanced when one class has fewer instances than another. In our application, a software dataset is imbalanced when its defective classes are fewer than its non-defective classes. We conducted a structured review to analyze the ways of tackling the imbalanced data problem when developing defect prediction models; the review results help identify best practices and any research gaps. The imbalanced data problem can be treated at either the data level or the algorithm level. At the data level, we developed ML models using resampling methods to assess their impact on defect-proneness prediction. At the algorithm level, we employed cost-sensitive learning and investigated the impact of different MetaCost learners on defect prediction. Studies in the literature have advocated ensemble methodology for various software prediction tasks, so we evaluated ensemble methods after treating the data with resampling methods, which alleviate the imbalanced data problem and are expected to improve model prediction. Overall, we assessed the effectiveness of OO metrics, ML techniques, SBTs, resampling methods, and MetaCost learners for developing SDP models.
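The two treatment levels described in the abstract can be sketched in a few lines. The example below is illustrative only and is not taken from the thesis: it measures the imbalance ratio, applies random oversampling as one simple data-level resampling method, and shows the minimum-expected-cost decision rule that cost-sensitive approaches build on. MetaCost itself wraps a base learner with bagging and relabeling, which is not reproduced here. The function names, the toy data, and the cost values (a missed defect costing five times a false alarm) are all assumptions.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority count divided by minority count (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the majority size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def min_cost_label(p_defective, cost_fn=5.0, cost_fp=1.0):
    """Pick the label with the lower expected misclassification cost.

    Predicting 0 (non-defective) risks a false negative (cost_fn);
    predicting 1 (defective) risks a false positive (cost_fp).
    """
    cost_if_predict_clean = p_defective * cost_fn
    cost_if_predict_defective = (1.0 - p_defective) * cost_fp
    return 1 if cost_if_predict_defective < cost_if_predict_clean else 0


# Hypothetical dataset: 8 non-defective classes and 2 defective ones,
# each row holding two made-up metric values.
toy_rows = [[3, 30], [4, 41], [2, 19], [5, 50], [3, 33],
            [4, 38], [2, 25], [5, 47], [9, 95], [11, 108]]
toy_labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)
```

With unequal costs the rule flags a class as defective at a lower predicted probability than the usual 0.5 cut-off, which is how cost-sensitive learning compensates for a rare defective class without changing the data.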

2021-01-01T00:00:00Z