Date of Degree

6-2020

Document Type

Dissertation

Degree Name

Ph.D.

Program

Computer Science

Advisor

Xiaowen Zhang

Committee Members

Theodore Brown

Lixin Tao

Subash Shankar

Subject Categories

Computer Sciences

Abstract

Data Mining (DM) is a process for extracting interesting patterns from large volumes of data. It is one of the crucial steps in Knowledge Discovery in Databases (KDD). It involves various data mining methods, which fall mainly into predictive and descriptive models. Descriptive models look for patterns, rules, relationships and associations within data. One of the descriptive methods is association rule analysis, which represents the co-occurrence of items or events. Association rules are commonly used in market basket analysis. An association rule has the form X → Y and indicates that X and Y co-occur with a given level of support and confidence.
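As a minimal illustration of the two measures named above, the following sketch computes support and confidence for a rule X → Y over a small, made-up list of market baskets. The function names `support` and `confidence` and the sample transactions are assumptions introduced here for illustration; they are not taken from the dissertation.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """support(X ∪ Y) / support(X); defined as 0.0 when X never occurs."""
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0
    return support(frozenset(x) | frozenset(y), transactions) / support_x


# Hypothetical market baskets, one frozenset per transaction.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

Here the rule {bread} → {milk} holds in two of the four baskets, and bread appears in three of them, so the rule's confidence is the ratio of those two counts.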

Association rule mining is a common technique for discovering interesting frequent patterns in large datasets acquired in various application domains. With petabytes of data arriving in data stores perhaps every day, many researchers have sought efficient methods for analyzing these large datasets. Many algorithms have been proposed for finding frequent patterns. The search space explodes combinatorially as the size of the source data increases. Simply using more powerful computers, or even supercomputers, to handle the ever-increasing size of large datasets is not sufficient. Hence, incremental algorithms have been developed and used to improve the efficiency of frequent pattern mining.

One of the challenges of frequent itemset mining is the long running time of the algorithms. Two major contributors to long running times are the number of database scans and the number of candidates generated. Candidates require memory: the more candidates there are, the more memory is needed. When the candidates do not fit in memory, page swapping occurs, which increases the running time of the algorithms.

In this dissertation we propose a new implementation of the Apriori algorithm, NCLAT (Near Candidate-less Apriori with Tidlists), which scans the database only once and creates candidates only for level one (1-itemsets); the number of these candidates equals the total number of unique items in the database. In addition, we show how the results are affected by the choice of data structures (probabilistic or not), the dataset layout (horizontal or vertical), the counting method, and whether the algorithms run sequentially or in parallel.
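To make the idea of tidlists and a single database scan concrete, here is a generic sketch of vertical (tidlist-based) frequent itemset mining. It is not the NCLAT implementation, whose details are not given in this abstract: this simplified version still joins pairs of frequent itemsets at each level, whereas NCLAT is described as creating candidates only for 1-itemsets. The names `build_tidlists` and `mine_frequent`, and the sample data, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set


def build_tidlists(transactions: List[Set[str]]) -> Dict[str, Set[int]]:
    """Single pass over the data: map each item to the ids of transactions containing it."""
    tidlists: Dict[str, Set[int]] = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)


def mine_frequent(transactions: List[Set[str]], min_count: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset whose support count is at least `min_count`."""
    tidlists = build_tidlists(transactions)  # the only scan of the database
    # Level 1: one entry per unique item that meets the threshold.
    level = {frozenset([item]): tids
             for item, tids in tidlists.items() if len(tids) >= min_count}
    frequent: Dict[FrozenSet[str], int] = {}
    while level:
        frequent.update({itemset: len(tids) for itemset, tids in level.items()})
        next_level: Dict[FrozenSet[str], Set[int]] = {}
        keys = sorted(level, key=sorted)  # deterministic order
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                union = a | b
                # Only join itemsets that differ in exactly one item.
                if len(union) != len(a) + 1 or union in next_level:
                    continue
                # The tidlist of a union is the intersection of the tidlists.
                tids = level[a] & level[b]
                if len(tids) >= min_count:
                    next_level[union] = tids
        level = next_level
    return frequent


# Hypothetical transactions for illustration only.
sample = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = mine_frequent(sample, min_count=3)
```

After the initial scan, support counts come from set intersections of tidlists rather than from further passes over the transactions, which is the property that vertical layouts are generally used for.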

We implement, explore, and extend the incremental algorithm UWEP with both sequential and parallel computation. We have also fixed a minor bug in UWEP and created a more efficient version, UWEP2, which reduces the number of candidates created and the number of database scans.

We have run all of our tests against three datasets with different characteristics at different minimum support levels. We present test results for both the frequent and the incremental frequent itemset mining implementations and compare them with each other.

While a lot of work has been done on frequent itemset mining on structured data, very little has been done on unstructured data. We have therefore created a new hybrid pattern search algorithm, Double-Hash, which performed better than the known pattern search algorithms in all of our test scenarios. Double-Hash can potentially be used for frequent itemset mining on unstructured data in the future. We also present our work and test results on this algorithm.
