Please use this identifier to cite or link to this item:
http://dspace.dtu.ac.in:8080/jspui/handle/repository/16408
Title: | CONTEXT AWARE TOPIC MODELING FOR SHORT TEXT |
Authors: | MOGANA, N. |
Keywords: | TOPIC MODELING; SHORT TEXT; CATM |
Issue Date: | May-2018 |
Series/Report no.: | 4302 |
Abstract: | With the advent of the internet and advances in communication systems, an enormous amount of data is generated on a daily basis. A major portion of the text data contributed by social media, blogs, emails and news forums is in short text form. This enormous volume of data, also called big data, has the potential to uncover hidden information that could be used for making business-centric, concrete, evidence-based decisions. In machine learning and natural language processing, topic modeling is a widely used tool for discovering hidden semantic relations in a document corpus. It models a document as a distribution over topics and a topic as a probability distribution over related words. State-of-the-art methods such as LDA and BTM are not considered suitable for short text because of the data sparseness problem. In this paper, a novel method for short text, referred to as Context Aware Topic Modeling (CATM), is proposed; it extends the earlier Bi-Term Pseudo-Document Topic Model (BPDTM) for short text. BPDTM constructs a manipulated corpus (a pseudo-corpus) from a word co-occurrence network built on the bi-terms of the corpus in order to alleviate the data sparseness problem. In the process, it includes several duplicate bi-terms and unwanted edges of the network in the pseudo-corpus, which drastically affects the coherence of the generated topics. To reduce this noise, the CATM algorithm prunes the word network by introducing an additional distribution that naturally eliminates unwanted words during the learning process of the topic model. In addition, WordNet is used as a preliminary step to filter out totally unrelated words while constructing the word co-occurrence network. Besides, CATM naturally lengthens the documents, which alleviates the impact of data inadequacy on performance. Experiments demonstrated that the proposed model outperformed the baseline model, BPDTM, which demonstrates its effectiveness for short text topic modeling. |
URI: | http://dspace.dtu.ac.in:8080/jspui/handle/repository/16408 |
Appears in Collections: | M.E./M.Tech. Computer Engineering |
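The abstract describes building a pseudo-corpus from a word co-occurrence network over bi-terms. The idea can be illustrated with a minimal Python sketch. It is not the thesis's BPDTM or CATM implementation: all function names are hypothetical, and the count threshold in `prune_edges` is only a simple stand-in for the noise reduction that the abstract attributes to an additional distribution and a WordNet filter.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs (bi-terms) from one short document."""
    unique = sorted(set(tokens))  # de-duplicate so each pair is counted once per document
    return list(combinations(unique, 2))


def build_cooccurrence_network(docs):
    """Count the documents in which each bi-term occurs; counts act as edge weights."""
    edges = Counter()
    for tokens in docs:
        edges.update(extract_biterms(tokens))
    return edges


def prune_edges(edges, min_count=2):
    """Drop weak edges. Assumption: a plain threshold, not the CATM pruning mechanism."""
    return {pair: count for pair, count in edges.items() if count >= min_count}


def pseudo_documents(edges):
    """Build one pseudo-document per word: its neighbours, repeated by edge weight."""
    pseudo = {}
    for (a, b), count in edges.items():
        pseudo.setdefault(a, []).extend([b] * count)
        pseudo.setdefault(b, []).extend([a] * count)
    return pseudo


# Toy corpus of tokenized short texts (illustrative data only).
docs = [
    ["topic", "model", "short", "text"],
    ["short", "text", "sparse"],
    ["topic", "model", "text"],
]
network = build_cooccurrence_network(docs)
pruned = prune_edges(network, min_count=2)
print(pruned)
print(pseudo_documents(pruned))
```

In this sketch, each pseudo-document is longer than the original short texts that mention the word, which is the general route by which a pseudo-corpus can ease sparseness. A standard topic model could then be trained on the pseudo-documents.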
Files in This Item:
File | Description | Size | Format
---|---|---|---
Mogana_Thesis.pdf | | 1.52 MB | Adobe PDF