Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/16500
Title: TECHNIQUES OF KNOWLEDGE DISCOVERY IN TEXT
Authors: SHWETA
Keywords: KNOWLEDGE DISCOVERY
TEXT DOCUMENTS
LKNN ALGORITHM
LS-KDT
Issue Date: Apr-2018
Series/Report no.: TD-4378
Abstract: In this digital era, a major portion of information is stored in text documents, and the amount of data in these documents is increasing exponentially. Manual analysis of such vast data is not possible. This led to the development of techniques for Knowledge Discovery in Text documents (KDT). KDT helps us discover useful information or knowledge from text documents. The extraction of useful information from text documents is termed text mining. The field of text mining faces many challenges. Firstly, text documents occur in free natural-language form, such as online news stories, e-mail messages, reports and legal documents. These documents are unstructured in nature, so it is necessary to convert them into a structured form. Secondly, there is an immense need to organize and manage text documents efficiently, and text categorization plays an important role in doing so. Given a collection of text documents and a set of pre-defined classes/categories, text categorization assigns a particular class to each text document. There are two types of text categorization: single-label and multi-label. In single-label categorization, each text document belongs to a single category, whereas in multi-label categorization, a text document may belong to more than one category. Most real-world documents are multi-label in nature. This thesis aims at exploring the existing techniques of knowledge discovery in text documents. We studied these techniques and identified the following challenges: First - The need to convert unstructured text documents to a structured form. To meet this challenge, a framework called U-STRUCT is proposed that converts an unstructured text document to a structured form. It is a generic framework that can be applied to all domains.
Second - The vast information available in text documents needs to be organized and managed. For this challenge, the technique of text categorization is adopted, and a detailed survey of the available text categorization methods has been carried out. The existing methods have certain limitations, which need to be overcome by a better method for text categorization. Third - An efficient method is needed for single-label text categorization. To meet this challenge, a lexical-based algorithm called LKNN is proposed for single-label text categorization. This algorithm is implemented on two datasets: research articles from the computer science domain and the Ohsumed collection. Standard performance metrics such as Recall, Precision and F-measure are calculated to measure the performance of the LKNN algorithm. It has shown good performance. Fourth - An efficient method is required for the categorization of multi-label documents, as most real-world documents are multi-label in nature. Therefore, the algorithm proposed for single-label text categorization is extended to multi-label text categorization, and a modified Knowledge Discovery process, known as the Lexical Semantics based Knowledge Discovery process for Text documents (LS-KDT), is proposed. The proposed process is divided into seven phases: Text Document Collection, Data Pre-processing, Lexical Analysis, Semantic Analysis, Classification, Ranking of Labels and Knowledge Discovery. The proposed LS-KDT process is designed and implemented. Thereafter, the performance of the LS-KDT process is compared in three ways. Firstly, the performance is compared with ACM Digital Library results. The research articles are randomly taken from the ACM Digital Library and belong to two domains: computer science and medicine. The ACM Digital Library uses the CCS tool to categorize research articles.
This tool displays the hierarchy of classes to which a research article belongs. Our proposed LS-KDT process also displays the hierarchy of classes and sub-classes to which a research article belongs. Standard performance metrics such as Recall, Precision and F-measure are used for comparison. Secondly, the performance is compared with the results of the IEEE Xplore digital library. The articles are randomly selected from the IEEE Xplore database. Again, research articles from two domains are taken: computer science and medicine. The IEEE Xplore digital library attaches four types of keywords to each research article: IEEE KW, INSPEC Controlled Indexing, INSPEC Uncontrolled Indexing and Author KW. It is observed that the keywords in INSPEC Controlled Indexing include the keywords of INSPEC Uncontrolled Indexing as well as IEEE KW. Therefore, to validate our work, the Controlled Indexing keywords are taken and their domain/broad category is identified. The results are compared with the broad categories displayed by our proposed process. Thirdly, the performance of the proposed process is compared with existing multi-label methods on standard performance metrics such as Accuracy, Precision, Hamming loss and F-measure. The proposed process has shown promising results. The proposed knowledge discovery process will help the research community specify more accurately the exact categories to which a research article belongs. It will aid journal editors in assigning research papers or articles to reviewers in a systematic manner. Accurate categorization of articles helps digital libraries, databases, repositories and online resources store and search articles efficiently. In future, the proposed LS-KDT process can be tested on research articles from other domains or on other text documents such as legal documents and reports.
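As an illustration of the evaluation measures named in the abstract, the following minimal Python sketch computes Accuracy, Precision, Recall, F-measure and Hamming loss for multi-label predictions. It is not taken from the thesis: the function name multilabel_scores, the example label names and the per-document (example-based) averaging are assumptions made for illustration, and the thesis may define or average these measures differently.

```python
from typing import Dict, List, Set


def multilabel_scores(
    true_labels: List[Set[str]],
    pred_labels: List[Set[str]],
    all_labels: Set[str],
) -> Dict[str, float]:
    """Example-based multi-label metrics, averaged over documents.

    Assumption: each metric is computed per document and then averaged;
    other averaging schemes (micro, macro) would give different values.
    """
    if not true_labels or len(true_labels) != len(pred_labels):
        raise ValueError("need equally sized, non-empty label lists")
    if not all_labels:
        raise ValueError("the label set must not be empty")

    totals = {"accuracy": 0.0, "precision": 0.0, "recall": 0.0,
              "f_measure": 0.0, "hamming_loss": 0.0}
    for truth, pred in zip(true_labels, pred_labels):
        overlap = len(truth & pred)   # labels predicted correctly
        union = len(truth | pred)     # labels in either set
        p = overlap / len(pred) if pred else 0.0
        r = overlap / len(truth) if truth else 0.0
        totals["precision"] += p
        totals["recall"] += r
        totals["f_measure"] += 2 * p * r / (p + r) if (p + r) else 0.0
        # Jaccard-style accuracy; two empty sets count as a perfect match.
        totals["accuracy"] += overlap / union if union else 1.0
        # Fraction of labels that are wrongly present or wrongly absent.
        totals["hamming_loss"] += len(truth ^ pred) / len(all_labels)

    n = len(true_labels)
    return {name: total / n for name, total in totals.items()}


if __name__ == "__main__":
    # Hypothetical labels and predictions for two documents.
    labels = {"AI", "DB", "HCI", "SE"}
    truth = [{"AI", "DB"}, {"HCI"}]
    predicted = [{"AI"}, {"HCI", "SE"}]
    for name, value in multilabel_scores(truth, predicted, labels).items():
        print(f"{name}: {value:.3f}")
```

Lower Hamming loss is better, while higher values are better for the other four measures.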
URI: http://dspace.dtu.ac.in:8080/jspui/handle/repository/16500
Appears in Collections:Ph.D. Computer Engineering

Files in This Item:
File: Thesis Shweta.pdf
Size: 2.66 MB
Format: Adobe PDF

