Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/19593
Title: OPTIMAL TEXT CLASSIFICATION USING NATURE INSPIRED ALGORITHMS
Authors: KHURANA, ANSHU
Keywords: TEXT CLASSIFICATION
NATURE INSPIRED ALGORITHMS
D SMOTE
KISFP
PSO
BBO
Issue Date: Jul-2022
Series/Report no.: TD-6075
Abstract: Text classification has become a major avenue for generating valuable insights. It is widely used to solve real-world problems such as sentiment analysis and the detection of fraud and patterns in sectors like healthcare, e-commerce and sports. In Big Data, the performance of text classification can be improved by selecting relevant features and by handling the imbalance between the class distributions in the dataset. Past research has mostly focused on optimizing conventional classifiers and tuning their parameters, and has deviated from the natural distribution of the data itself. With the emergence of data science there has been a radical shift in this approach, and the focus is now on understanding the data and on feature selection. This research work contributes to the optimization of text classification with four models. Firstly, different nature-inspired algorithms are explored with various machine learning classifiers to find an effective optimized model. The nature-based techniques used for feature selection are Genetic Algorithm (GA), Particle Swarm Optimization (PSO) and Biogeography-Based Optimization (BBO). In the proposed model, feature selection is performed with the BBO algorithm along with ensemble classifiers (Bagging). The features selected by the BBO algorithm are classified into various classes using six machine learning classifiers. The experimental results are computed on eleven text classification datasets taken from the UCI repository. Four performance measures, namely Accuracy, Precision, Recall and F-measure, are used to validate the performance of the model with 10-fold cross-validation. Secondly, a new optimization algorithm and a new dataset balancing algorithm are proposed. The model handles high-dimensional datasets with a new nature-based algorithm, Modified Biogeography-Based Optimization (M BBO).
The algorithm works effectively by balancing the dataset with a new algorithm, Distributed Synthetic Minority Oversampling Technique (D SMOTE). The proposed M BBO model modifies the ranking of variables by using a feature weighting algorithm rather than ranking them randomly. Two new expressions in D SMOTE and one new expression in M BBO are proposed. Extensive experimental results are computed on four text classification datasets with four machine learning classifiers. The results are reported using three performance measures: 1) Area Under Curve (AUC), 2) G-mean and 3) F1-score. The model works for low-dimensional as well as high-dimensional datasets. Thirdly, a new optimized model is obtained by tuning the parameters of the optimization algorithm, namely the Grasshopper Optimization Algorithm, and of the K-Nearest Neighbor and Support Vector Machine classifiers. The tuning is performed with the random search technique. The tuned algorithm yields a new optimal text classification technique. The aim of this meta-heuristic approach is to determine the minimal feature subset that improves classification performance. Five multi-class datasets are used to evaluate the performance of the model in terms of Accuracy and AUC. All results are computed with 10-fold cross-validation. The results of the proposed model are compared with those of other algorithms, which verifies the performance of the technique. The proposed model outperformed all the compared state-of-the-art techniques. Lastly, the new optimization approach is combined with a transfer learning technique. The model considers the feature vectors of both the source and target domains for training, based on the similarity of exemplar (feature) vectors of different instances, known as Instance Similarity Feature (ISF). The exemplar vectors are chosen randomly for the target datasets.
Hence, to acquire relevant factual data in the knowledge base for training in our research, we worked to increase the domain separation error between source and target instances. To avoid the instability caused by poor exemplar vector selection, a K-means clustering approach is applied after feature similarity, known as K-means Instance Similarity Feature (KISF). To overcome the limitations of existing approaches, we introduce novel optimal models: KISF with Ant Lion Optimizer (KISFA), KISF with Particle Swarm Optimization (KISFP) and KISF with Biogeography-Based Optimization (KISFB). High dimensionality can impact the efficacy of the model; hence, feature selection with nature-based optimizers, namely Ant Lion Optimizer, Particle Swarm Optimization and Biogeography-Based Optimization, is applied. We measure the performance of the proposed models using Support Vector Machine, Logistic Regression and Random Forest as classifiers, and Accuracy and F1-score as fitness functions. Extensive experiments are performed on four datasets with 50 iterations. The proposed model is compared with eleven other techniques and outperforms all of them in average Accuracy.
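The KISF idea described in the abstract (replace randomly chosen exemplar vectors with K-means cluster centres, then describe each instance by its similarity to those exemplars) can be illustrated with a minimal sketch. This is not the thesis implementation: the Gaussian (RBF) similarity, the `gamma` parameter, the pooling of source and target instances, and all function names below are assumptions made for illustration only.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's K-means; returns k centroids used here as exemplars."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids


def similarity_features(instance, exemplars, gamma=1.0):
    """Map an instance to one similarity value per exemplar (assumed RBF form)."""
    return [math.exp(-gamma * squared_distance(instance, e)) for e in exemplars]


# Hypothetical toy data standing in for source- and target-domain vectors.
source = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
target = [[10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
exemplars = kmeans(source + target, k=2)
features = [similarity_features(x, exemplars, gamma=0.1) for x in source + target]
```

The resulting `features` rows would then be passed to a feature selector and a classifier; using cluster centres instead of random instances makes the exemplar set less sensitive to a single unlucky draw, which is the instability the abstract refers to.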
URI: http://dspace.dtu.ac.in:8080/jspui/handle/repository/19593
Appears in Collections: Ph.D. Electronics & Communication Engineering

Files in This Item:
File: ANSHU KHURANA Ph.D..pdf
Size: 4.6 MB
Format: Adobe PDF

