Please use this identifier to cite or link to this item:
http://dspace.dtu.ac.in:8080/jspui/handle/repository/18916
Title: | ALGORITHMS FOR MINING ASSOCIATION RULES FROM LARGE TRANSACTIONAL DISTRIBUTED DATA |
Authors: | SETHI, MANOJ |
Keywords: | MINING ASSOCIATION RULES DISTRIBUTED DATA MINING ALGORITHMS DARM |
Issue Date: | Dec-2021 |
Series/Report no.: | TD-5484 |
Abstract: | Advances in database technology over the last few decades have attracted many researchers to this area. Databases that were once mostly centralised have given way to distributed databases, in which data is partitioned and stored at different locations; this shift has been driven by modern technologies, fast networks, the internet, growing data volumes and industry demand. Centralised databases are also used to build data warehouses, which are then mined for information that supports critical decisions in education, medicine, commerce and many other areas. With the increasing demand for distributed databases, the data-warehouse mining concept is also shifting to distributed data mining, where mining is performed on the data partitions stored at different locations and the results are then aggregated or merged for global mining. Distributed data mining (DDM) has become an important research area as large distributed transactional databases have grown, and important patterns in such databases need to be investigated. Distributed processing increases processing capability, but it also increases communication and storage costs. This work focuses on distributed association rule mining for transactional data. It has opened new areas of research on architectures, frameworks and algorithms for distributed data mining. The data partitions, created or generated at different locations, vary in size, and so does the number of frequent patterns generated at each location. The area is relatively new, and little work has been done on distributed association rule mining. Some new algorithms and data structures have been proposed in the literature. Most of the available algorithms first partition the database and then distribute the partitions among different sites for parallel processing. 
In real-life scenarios, data generated at different sites is not under the control of a centralised database, and the number of transactions at each site varies widely. As a result, some sites are heavily loaded while others are comparatively idle, and research needs to focus on these issues. Distributed mining is used in many commercial areas, and there is a need to explore new commercial applications of data mining. This work studies recent developments in distributed association rule mining (DARM). It highlights the issues and challenges and their correlation, the available technologies and tools, the different algorithms, and the real data repositories for mining in the area of DARM. On the basis of this study, the work develops new algorithms and a new application model that address different issues and challenges in DARM. The datasets Mushroom, Connect, T10I40D400K and Chess from the FIMI data repository are used to implement and test the proposed algorithms. The application model is developed and implemented on the last 2 years of actual data from a tour and travel company. A new algorithm named QDFIN (Quick distributed frequent itemset mining using nodeset) is proposed in this research; it uses the efficient nodeset data structure to store the candidate itemsets locally at each site, a zero-first technique to balance the load, and pruning to reduce the candidate sets. The algorithm is implemented, and its speed is compared with that of some of the existing algorithms, FDM and PFIN, on the Mushroom dataset. The results show that the proposed algorithm outperforms the other algorithms not only on data partitions of varying size but also on uniformly distributed data in 4-, 5- and 6-node setups. A novel approach, size-based assignment, is also proposed in this work; it takes into account the database size available at each site while distributing the load for finding the global frequent itemsets. 
It also reduces the communication load through pruning and no-broadcasting techniques. The algorithm is compared with FDM and PFIN on execution time on the Mushroom, Connect, Chess and T10I4D100K datasets. The results show that the new technique had the lowest execution time among them and balanced the load best. The application area chosen for the study is a tour and travel company organising package tours, because the tourism industry is growing very fast and small, medium and large companies all operate in it. Tourism is a promising application in which mining can generate new association rules that help companies develop new strategies and target potential customers. This work applies the distributed data mining technique to a medium-sized tour and travel company to find associations between the age and destination-visited parameters. The results show that the association rules generated by mining are useful and effective for growing the business and formulating new strategies. |
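The general pattern the abstract describes, mining each partition locally and then merging the results into a global answer, can be illustrated with a small sketch. This is not QDFIN, FDM or PFIN: it counts itemsets exhaustively with no nodeset structure, pruning or load balancing, and the function names, the `age:`/`dest:` item labels and the toy transactions are all assumptions made up for illustration.

```python
from collections import Counter
from itertools import combinations


def local_counts(transactions, max_size=2):
    """Count every itemset of up to max_size items in one site's partition."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return counts


def global_frequent(partitions, min_support, max_size=2):
    """Merge per-site counts; keep itemsets whose global support meets the threshold."""
    total = Counter()
    n_transactions = 0
    for partition in partitions:
        total.update(local_counts(partition, max_size))
        n_transactions += len(partition)
    return {
        itemset: count / n_transactions
        for itemset, count in total.items()
        if count / n_transactions >= min_support
    }


def rule_confidence(frequent, antecedent, consequent):
    """Confidence of antecedent -> consequent, from global supports."""
    both = tuple(sorted(antecedent + consequent))
    return frequent[both] / frequent[antecedent]


# Hypothetical toy partitions held at two sites of unequal size.
site_a = [
    ["age:18-30", "dest:beach"],
    ["age:18-30", "dest:beach"],
    ["age:31-50", "dest:hills"],
]
site_b = [
    ["age:18-30", "dest:hills"],
    ["age:31-50", "dest:hills"],
]

frequent = global_frequent([site_a, site_b], min_support=0.4)
confidence = rule_confidence(frequent, ("age:18-30",), ("dest:beach",))
```

In a real deployment each site would send only its counts over the network, which is where the communication cost mentioned in the abstract arises; candidate pruning is the usual way to shrink those messages.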
URI: | http://dspace.dtu.ac.in:8080/jspui/handle/repository/18916 |
Appears in Collections: | Ph.D. Computer Engineering |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Ph.D. Manoj Sethi.pdf | | 3.35 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.